Microsoft's VALL-E 2 can convincingly recreate human voices using just a few seconds of audio, its creators claim.
Microsoft has developed a new artificial intelligence (AI) speech generator that is apparently so convincing it cannot be released to the public.
VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.
Microsoft researchers said VALL-E 2 was capable of generating "accurate, natural speech in the exact voice of the original speaker, comparable to human performance," in a paper that appeared June 17. In other words, the new AI voice generator is convincing enough to be mistaken for a real person — at least, according to its creators.
"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote in the paper. "Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."
Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.
The AI engine is capable of this given the inclusion of two key features: "Repetition Aware Sampling" and "Grouped Code Modeling."
Ethics Statement..
VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public. VALL-E 2 could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on. While VALL-E 2 can speak in a voice like the voice talent, the similarity, and naturalness depend on the length and quality of the speech prompt, the background noise, as well as other factors. It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model. If you suspect that VALL-E 2 is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.
Microsoft has developed a new artificial intelligence (AI) speech generator that is apparently so convincing it cannot be released to the public.
VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.
Microsoft researchers said VALL-E 2 was capable of generating "accurate, natural speech in the exact voice of the original speaker, comparable to human performance," in a paper that appeared June 17. In other words, the new AI voice generator is convincing enough to be mistaken for a real person — at least, according to its creators.
"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote in the paper. "Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."
Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.
The AI engine is capable of this given the inclusion of two key features: "Repetition Aware Sampling" and "Grouped Code Modeling."
Ethics Statement..
VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public. VALL-E 2 could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on. While VALL-E 2 can speak in a voice like the voice talent, the similarity, and naturalness depend on the length and quality of the speech prompt, the background noise, as well as other factors. It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model. If you suspect that VALL-E 2 is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.




, just some PR boost