Overview -- DawNLITE

What is DawNLITE?

DawNLITE stands for Daw Natural-Language-based Image Transmoding Engine. The project's aim is to build software that transforms a (large) image into a video that shows the relevant regions of the image and is more pleasant to watch on a mobile device than the original image. The distinguishing feature of our approach is the incorporation of natural-language analysis, so that the video's script follows a text that describes the image, such as a museum caption. The text is transformed into synthetic speech that is delivered to the user along with the video, and the video is arranged in synchronization with the audio: regions are shown at the moment the synthetic voice mentions them.

Overall system architecture

This drawing outlines the overall system's architecture:

[Figure: DawNLITE overall system architecture]

At the center, the Virtual Camera Control (VCC) generates the script for the camera. The script instructs the camera, in detail, when to display which region of the image; the VCC can thus be regarded as the master brain of the DawNLITE system. As input, it takes keyword-based annotations of regions of interest (RoIs) in the image, the text, and several constraints and preferences. The most important constraints are the display size, the maximum overall duration, and limits that ensure smooth and pleasant camera motion.
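To make these inputs and outputs concrete, the following Python sketch models them as plain data structures. All type and field names are hypothetical illustrations of what is described above, not the actual DawNLITE interfaces.

    # Hypothetical data model for the VCC; names are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class RoI:
        """A region of interest in the image, annotated with keywords."""
        x: int
        y: int
        width: int
        height: int
        keywords: list[str]        # e.g. ["clock", "tower"]

    @dataclass
    class Constraints:
        """The most important constraints and preferences."""
        display_width: int         # target display size in pixels
        display_height: int
        max_duration: float        # maximum overall duration in seconds
        max_pan_speed: float       # limits ensuring smooth,
        max_zoom_rate: float       # pleasant camera motion

    @dataclass
    class ScriptEntry:
        """One camera instruction: show `roi` during [start, end]."""
        start: float               # seconds into the presentation
        end: float
        roi: RoI

    def generate_script(rois: list[RoI], text: str,
                        constraints: Constraints) -> list[ScriptEntry]:
        """Placeholder for the VCC's core task."""
        raise NotImplementedError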

The text is passed to a text-to-speech (TTS) system for two purposes: generating the audio and obtaining timing information for each fragment of the text. The latter is crucial for synchronizing the virtual camera's behaviour with the synthesized speech.
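The shape of this timing information can be sketched as follows. Real TTS engines report exact word boundaries (for instance via SSML mark events or boundary callbacks); the constant speaking rate used here is only a crude stand-in to illustrate the data the VCC needs.

    # Sketch of per-fragment timing; a constant speaking rate stands in
    # for the exact word boundaries a real TTS engine would report.
    from dataclasses import dataclass

    @dataclass
    class Fragment:
        text: str      # one word of the caption
        start: float   # onset within the synthesized audio, in seconds
        end: float

    def fragment_timings(text: str,
                         words_per_second: float = 2.5) -> list[Fragment]:
        fragments, t = [], 0.0
        for word in text.split():
            end = t + 1.0 / words_per_second
            fragments.append(Fragment(word, t, end))
            t = end
        return fragments

    # The VCC can then schedule a RoI so that it is on screen while
    # the fragment mentioning it is spoken.
    print(fragment_timings("The clock tower was completed in 1356"))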

From the script, an SVG animation is generated, which can then be rendered to a video at the initially specified resolution. Finally, the video and the speech audio are encoded and multiplexed (optionally together with subtitles generated from the text), using codecs and container formats of the user's choice.
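As an illustration of the SVG step, the sketch below emits a single camera move as an animated viewBox over the source image, using standard SVG/SMIL animation; the function name, image dimensions, and file names are hypothetical, not taken from the DawNLITE code.

    # Hypothetical generator for one camera move as an SVG/SMIL
    # animation: the viewBox is interpolated from the full image to
    # the target RoI, producing a pan-and-zoom effect.
    def camera_move_svg(image_uri: str, img_w: int, img_h: int,
                        roi: tuple[int, int, int, int], dur: float) -> str:
        x, y, w, h = roi
        return f'''<svg xmlns="http://www.w3.org/2000/svg"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         viewBox="0 0 {img_w} {img_h}">
      <image xlink:href="{image_uri}" width="{img_w}" height="{img_h}"/>
      <animate attributeName="viewBox" begin="0s" dur="{dur}s"
               values="0 0 {img_w} {img_h}; {x} {y} {w} {h}" fill="freeze"/>
    </svg>'''

    # Zoom from the full 4000x3000 image onto a RoI over four seconds.
    print(camera_move_svg("exhibit.jpg", 4000, 3000,
                          (1200, 800, 600, 450), 4.0))

The rendered frames and the speech audio can then be encoded and multiplexed with an off-the-shelf tool such as FFmpeg, which also accepts a subtitle track.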

Notes