Microsoft presents VASA-1: Hyperrealistic avatars generated by AI

  • VASA-1 is a Microsoft AI model that generates hyper-realistic avatars from images and audio.
  • It allows you to create videos with synchronized lip movements and gestures, imitating human communication.
  • Developers aim to combat misinformation and improve accessibility through responsible use of technology.
  • Concerns about misuse of the tool have led Microsoft to not release a public demo.

VASA-1

VASA-1 is Microsoft's new artificial intelligence model, a technology capable of creating realistic avatars from two simple ingredients: a static image and a voice clip. If you want to know more about VASA-1 and its hyper-realistic AI-generated avatars, keep reading.

It seemed that Redmond was going to concentrate all its efforts in this field on the Copilot assistant, a tool that combines language models with Microsoft 365 applications. However, its plans appear to be more ambitious, and VASA-1 is the proof.

What is VASA-1?

VASA is the acronym for Visual Affective Skills App. The "1" makes clear that this is only the first of what is likely to be a long list of versions that will arrive in the future to surprise us even more.


What makes VASA-1 so special? What is its main innovation? There are already many applications capable of bringing photos to life with GIF-like movements. What this tool, created by a team of AI researchers at Microsoft Research Asia, introduces is something much more sophisticated: an artificial intelligence system that can make photographs sing and dance. This is not simple animation, but something else entirely.

The result is amazingly realistic. Hyperrealistic would be the most appropriate term. This model can produce lip movements perfectly synchronized with the audio, as well as capture a wide spectrum of facial nuances and natural head movements. Altogether, it presents a vivid and authentic image unlike anything seen in similar tools.

In addition to this, the model can generate 512x512 videos at 45 frames per second in offline batch mode, and at up to 40 frames per second in online streaming mode, with negligible starting latency. This paves the way for real-time interactions with realistic avatars that can even imitate human conversational behaviors.
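To put those figures in perspective, here is a minimal sketch in plain Python (it does not call VASA-1 or any Microsoft code) that converts the frame rates quoted above into the time budget a real-time system has per frame; the 45 and 40 fps values come from the article, the rest is simple arithmetic.

```python
# Rough frame-budget arithmetic for the frame rates quoted above.
# Nothing here calls VASA-1; it only shows what "real time" implies.

def frame_budget_ms(fps: float) -> float:
    """Milliseconds available to produce each frame at a given frame rate."""
    return 1000.0 / fps

for mode, fps in [("offline batch", 45), ("online streaming", 40)]:
    print(f"{mode}: {fps} fps -> {frame_budget_ms(fps):.1f} ms per 512x512 frame")

# Expected output:
# offline batch: 45 fps -> 22.2 ms per 512x512 frame
# online streaming: 40 fps -> 25.0 ms per 512x512 frame
```

In other words, the system has roughly 22 to 25 milliseconds to produce each frame, which is what makes live, conversational use of these avatars plausible.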

VASA-1: Some examples

This method can handle a wide spectrum of image and audio inputs, including artistic photographs and audio in languages other than English. In this post we have included some examples that truly leave us speechless. It is hard to believe that the faces speaking and gesturing in these videos do not belong to real people, but are avatars created from images and audio:

Any user with a reasonably powerful computer (for example, one equipped with an Nvidia RTX 4090 GPU) could use this tool to generate videos at this level of realism in just a few minutes.

It is impressive to see how effectively these animations combine images and audio, giving the talking head in front of us an unusual degree of realism. However, experts point out that there are still flaws that reveal the synthetic nature of these images: details imperceptible to most of us, but which do not escape the best-trained observers, such as subtle defects and telltale signs of AI intervention.

The dangers of a tool that is too precise

This tool is so capable and so realistic that Microsoft has not dared to release even an open demo. Concern about misuse, and the potential danger it would pose for identity theft, advises acting with great caution.

In any case, on the official website of the VASA-1 project, hosted on the Microsoft site, we find an interesting video lasting just over a minute in which we can witness the process of creating these hyper-realistic avatars:

Basically, the method consists of selecting an image (a human face) and then an audio file; the AI then "marries" the two. During the creation process, the user can adjust numerous nuances through the buttons and sliders that appear in the interface, as sketched in the example below. With a little time and creativity, striking results can be achieved.
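To make that workflow more concrete, here is a hypothetical sketch of what such an image-plus-audio pipeline with adjustable controls could look like. VASA-1 has no public API, so every name below (the `generate_talking_head` function, the `AvatarControls` fields, and their ranges) is an assumption made purely for illustration, standing in for whatever the interface's buttons and sliders actually expose.

```python
# Hypothetical sketch only: VASA-1 exposes no public API, so this function
# and its parameters are invented for illustration of the described workflow.
from dataclasses import dataclass

@dataclass
class AvatarControls:
    """Placeholder knobs standing in for the interface's buttons and sliders."""
    gaze_direction: str = "camera"   # where the avatar appears to look
    head_motion: float = 0.5         # 0.0 = still, 1.0 = very animated
    emotion: str = "neutral"         # overall expression bias

def generate_talking_head(face_image_path: str,
                          audio_path: str,
                          controls: AvatarControls = AvatarControls(),
                          output_path: str = "avatar.mp4") -> str:
    """Combine one face image and one audio clip into a talking-head video.

    A real implementation would run a generative model frame by frame;
    this stub only describes the contract of the workflow in the article.
    """
    raise NotImplementedError("Illustrative stub - no public model is available.")

# Intended usage, mirroring the steps described above:
# generate_talking_head("portrait.jpg", "speech.wav",
#                       AvatarControls(gaze_direction="left", emotion="happy"))
```

The point of the sketch is simply the shape of the interaction: one image, one audio clip, and a handful of optional controls that nudge the result, exactly as the demo video suggests.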

For now, the stated intentions of the VASA-1 developers are precisely the opposite of generating fake or phishing videos (or, at least, that is what they say): to help detect and combat deepfake videos. It may well be true, since no one knows better than they do how to fool the human eye with increasingly powerful and precise AI tools.

At the same time, the VASA-1 developers insist on highlighting the most positive aspects of their creation: improving accessibility for people with communication difficulties, offering companionship or therapeutic support to those who need it, and other benefits that derive from the responsible use of AI. The challenge is to make this possible.

