What is DawNLITE?
DawNLITE stands for Daw Natural-Language-based Image Transmoding Engine. The project aims to build software that transforms a (large) image into a video that shows the image's relevant regions and is more pleasant to watch on a mobile device than the original image. The distinguishing feature of our approach is the incorporation of natural language analysis, so that the video's script follows a text that describes the image, such as a museum caption. The text is converted to synthetic speech that is delivered to the user along with the video, and the video is synchronized to the audio: things are shown while the synthetic voice mentions them.
Overall system architecture
This drawing outlines the overall system's architecture:
At the center, the Virtual Camera Control (VCC) generates the script for the camera. The script instructs the camera, in detail, when it should display which region of the image. The VCC can be considered the brain of the DawNLITE system. It takes as input keyword-based annotations of regions of interest (RoIs) in the image, the text, and several constraints and preferences. The most important constraints are the display size, the maximum overall duration, and limits that ensure smooth and pleasant camera motion.
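To illustrate the idea, here is a minimal sketch of how the VCC's inputs and output could be modeled. The data structures and the matching strategy (pairing each keyword mention in the text with the first RoI annotated with that keyword) are our own simplified assumptions, not the actual DawNLITE implementation:

```python
from dataclasses import dataclass

@dataclass
class RoI:
    """A region of interest: a bounding box plus the keywords annotating it."""
    x: int
    y: int
    width: int
    height: int
    keywords: tuple

@dataclass
class ScriptEntry:
    """One camera instruction: show `roi` from `start` to `end` (seconds)."""
    start: float
    end: float
    roi: RoI

def build_script(rois, mentions, max_duration):
    """Toy VCC: map each keyword mention (keyword, start, end) from the text
    to the first RoI annotated with that keyword, clipped to max_duration."""
    script = []
    for keyword, start, end in mentions:
        if start >= max_duration:
            break
        for roi in rois:
            if keyword in roi.keywords:
                script.append(ScriptEntry(start, min(end, max_duration), roi))
                break
    return script
```

A real VCC would additionally smooth the camera path between entries and enforce the motion constraints; this sketch only shows the mapping from annotated RoIs and text timings to a camera script.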
The text is passed to a text-to-speech (TTS) system for two purposes: generating the audio and obtaining timing information for each fragment of the text. The latter is crucial for synchronizing the virtual camera's behaviour with the synthesized speech.
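The essential synchronization data is a start and end time for every text fragment on the audio timeline. A minimal sketch of how such timings could be assembled from per-fragment durations reported by a TTS system (the function and its inputs are illustrative assumptions, not a real TTS API):

```python
def fragment_timings(fragments, durations):
    """Given text fragments and the synthesized duration (seconds) of each,
    return (fragment, start, end) tuples on a cumulative audio timeline."""
    timings = []
    t = 0.0
    for fragment, duration in zip(fragments, durations):
        timings.append((fragment, t, t + duration))
        t += duration
    return timings
```

These timestamps are exactly what the VCC needs to decide when a mentioned region should appear on screen.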
From the script, an SVG animation is generated, which is then rendered to a video at the requested resolution. Finally, the video and the speech audio are encoded and multiplexed (optionally together with subtitles generated from the text), using codecs and container formats of the user's choice.
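One natural way to express the camera script as SVG is to animate the root element's `viewBox` so that it pans and zooms across the image. The following sketch generates such an SVG from a list of timed regions; the file name `painting.jpg` and the tuple layout are illustrative assumptions:

```python
def svg_pan(width, height, entries):
    """Emit a minimal SVG whose viewBox is animated across regions of the image.
    `entries` is a list of (x, y, w, h, begin, dur) tuples in image coordinates
    and seconds. The image file name is a placeholder."""
    anims = "\n".join(
        f'  <animate attributeName="viewBox" '
        f'to="{x} {y} {w} {h}" begin="{begin}s" dur="{dur}s" fill="freeze"/>'
        for x, y, w, h, begin, dur in entries
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {width} {height}">\n'
        f'  <image href="painting.jpg" width="{width}" height="{height}"/>\n'
        f'{anims}\n'
        f'</svg>'
    )
```

Rendering the animation frame by frame then yields the raw video that is encoded and multiplexed with the speech audio.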
Notes
- Support for pre-existing, natural (non-synthetic) speech recordings is a task we keep in mind for the future.
- We are aware of TTS systems on the market that produce more natural-sounding speech than what you hear in our current examples. However, replacing the currently employed TTS system is secondary: access to certain preprocessing results is crucial for DawNLITE's ability to synchronize the camera motion to the speech. We are looking for better solutions that meet the requirements for synchronization and quality, in this order of priority. Tweaking the currently used TTS system for better quality is a possibility still under consideration.
- Keep in mind that this is work in progress: at any time, the actual state of the work and the capabilities shown in the showcase may deviate from the descriptions on this website.