VASA: Microsoft AI generates “talking face” from photo and voice recording

The VASA-1 model is capable of creating a talking image from a still photo.

Developed by Microsoft Research Asia, the VASA-1 AI model turns a single static image and a voice recording into highly realistic video clips in which the face appears to speak the recorded words.

The technology improves audio-visual synchronization and captures a broad range of facial expressions and natural head movements, making the avatars seem almost lifelike.

VASA, an abbreviation for “Visual Affective Skills Avatars,” is showcased on a dedicated project page with numerous examples. These examples demonstrate the tool’s ability to generate square videos of diverse faces, each reciting texts with varying emotional undertones.

The representations, including an animated version of Leonardo da Vinci’s Mona Lisa, are all virtual, designed to emphasize the AI’s capacity for generating non-existent yet photorealistic human portraits.

Disclosure: The video contains profanity. Copyright: Microsoft.

Microsoft explains that the core innovation behind VASA-1 lies in its holistic approach to generating facial dynamics and head movements in a face latent space. According to the company, extensive experiments show that the method significantly outperforms previous approaches.

It generates high-quality 512×512-pixel video at up to 45 frames per second in offline mode and 40 frames per second online, which makes real-time interaction with avatars that emulate human conversational behavior possible.
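To put those frame rates in perspective, here is a minimal sketch that works out how much time the model has per frame and how many pixels it must produce per second. The function names are illustrative only and have nothing to do with Microsoft's code; the only inputs taken from the article are the 512×512 resolution and the 45 and 40 fps figures.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
# Helper names are hypothetical, chosen for this illustration only.

def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given rate, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of output pixels that must be generated each second."""
    return width * height * fps


if __name__ == "__main__":
    for mode, fps in (("offline", 45), ("online", 40)):
        budget = frame_budget_ms(fps)
        throughput = pixels_per_second(512, 512, fps)
        print(f"{mode}: {budget:.1f} ms per frame, {throughput:,} pixels per second")
```

At 40 frames per second, each frame has to be ready within 25 milliseconds, which is the kind of budget a live conversation with an avatar requires.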

Despite the potential for widespread application, from virtual tutors in education to avatars offering emotional support in healthcare, Microsoft is cautious. The development team acknowledges that the technology could be misused, particularly to create disinformation that sways public opinion during sensitive periods such as election seasons.

As a result, there are no immediate plans to release the tool publicly, either for developers or as a product. Microsoft says it wants to be confident that the technology will be used responsibly before any broader release.

This “responsible use” stance resembles that of Alibaba, which published its Animate Anyone research paper in December 2023 but never released the model itself, even after promising to do so in an update. While impressive, the technology is simply too easy to abuse.

The VASA-1 paper is also available on arXiv in HTML and PDF formats.