Microsoft AI can now clone voices to sound perfectly 'human' in seconds
16 Jul, 2024 / 09:58 am / Microsoft

Source: http://www.webdesk.com

(Web Desk) - Microsoft has developed an artificial intelligence tool that can replicate human speech with uncanny precision.

It is so convincing that the tech giant refuses to share it with the public, citing "potential risks" of misuse.

The tool, dubbed VALL-E 2, is a text-to-speech generator that can mimic a voice based on just a few seconds of audio.

It works in a scenario called zero-shot learning: the system can imitate a speaker it never encountered during training, working only from the short audio sample it is given.

The tech giant says VALL-E 2 is the first of its kind to achieve "human parity," meaning it meets or surpasses benchmarks for human likeness.

It succeeds the original VALL-E system, which was announced in January 2023.

According to developers at Microsoft Research, VALL-E 2 can produce "accurate, natural speech in the exact voice of the original speaker, comparable to human performance."

It can synthesize complex sentences in addition to short phrases.

To do so, the tool takes advantage of two features called Repetition Aware Sampling and Grouped Code Modeling.

Repetition Aware Sampling addresses the pitfalls of repetitive tokens. Tokens are the smallest units of data a language model can process; in a text model they are words or parts of words, while in VALL-E 2 they are discrete codes that represent short slices of audio.

It prevents recurring sounds or phrases during the decoding process, helping vary the system's speech and making it sound more natural.
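The idea can be sketched in a few lines of Python. This is a minimal illustration of the description above, not Microsoft's code: the function names, the window of 10 recent tokens, and the 0.5 repetition threshold are all assumptions chosen for the example. The sampler normally picks from only the most likely tokens, but if its pick already dominates the recent history, it redraws from the full distribution to break the loop.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest set of top tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, max_ratio=0.5):
    """Pick the next token, falling back to plain random sampling on repetition.

    window and max_ratio are hypothetical values, not taken from the article.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Share of the recent window already occupied by the candidate token.
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: redraw from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]   # token 0 is heavily favored
    stuck = [0] * 10            # history that repeats token 0
    print([repetition_aware_sample(probs, stuck, rng) for _ in range(20)])
```

With an empty history the sampler in this toy setup always returns token 0, because the nucleus contains nothing else. Once the history is saturated with token 0, the fallback lets the less likely tokens appear occasionally.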

Grouped Code Modeling bundles the audio codes into groups so the model works through a shorter sequence, which produces results faster.
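A rough sketch of the grouping step, again in Python, shows why this helps. The helper names, the group size of 2, and the padding value are assumptions made for illustration; the article does not describe how VALL-E 2 implements grouping internally. Packing consecutive codes into fixed-size groups means the model takes one step per group instead of one step per code.

```python
def group_codes(codes, group_size, pad_id=0):
    """Pack a flat list of audio codes into fixed-size groups.

    pad_id is a hypothetical filler used when the length is not a multiple
    of group_size.
    """
    remainder = len(codes) % group_size
    if remainder:
        codes = codes + [pad_id] * (group_size - remainder)
    return [tuple(codes[i:i + group_size])
            for i in range(0, len(codes), group_size)]


def ungroup_codes(groups):
    """Flatten groups back into a single list of codes."""
    return [code for group in groups for code in group]


if __name__ == "__main__":
    codes = [11, 42, 42, 7, 19, 3, 88, 5]
    groups = group_codes(codes, group_size=2)
    print(len(codes), "codes ->", len(groups), "steps")
    print(groups)
```

Here eight codes become four groups, so a model that emits one group per step would need half as many steps for the same stretch of audio.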

The researchers evaluated VALL-E 2 on speech samples from LibriSpeech and VCTK, two English-language speech datasets.

They also used difficult sentences drawn from ELLA-V, another zero-shot text-to-speech system, to determine how well VALL-E 2 handled more complex tasks.

The system ultimately beat out its competitors "in speech robustness, naturalness, and speaker similarity," according to a June 17 paper summarizing the results.

Microsoft claims VALL-E 2 will not be released to the public anytime soon, deeming it "purely a research project."

"Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," the company wrote on its website.

"It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker."

The tech behemoth notes that suspected abuse of the tool can be reported using an online portal.

And Microsoft's concerns are well within reason. Just this year, cybersecurity experts have seen an explosion in the use of AI tools by malicious actors, including those that replicate speech.

"Vishing," a portmanteau of "voice" and "phishing," is a type of attack where scammers pose as friends, family, or other trusted parties on the phone.

Voice spoofing could even pose a national security risk. In January, a robocall using President Joe Biden's voice urged Democrats not to vote in the New Hampshire primary.

The man behind the plot was later indicted on charges of voter suppression and impersonation of a candidate.

Microsoft has come under increased scrutiny over its implementation of AI, on both the antitrust and data privacy fronts.

Regulators have voiced concern about the tech giant's $13 billion partnership with OpenAI and resulting control over the startup.

The company has also faced blowback from its users.

Recall, an "AI assistant" that takes screen captures of a device every few seconds, saw its release indefinitely postponed last month.

Microsoft faced a deluge of criticism from consumers, data privacy experts, and regulators such as the Information Commissioner's Office in the UK.

In a statement to The U.S. Sun, a company spokesperson said Recall would shift "from a preview experience broadly available for Copilot+ PCs...to a preview available first in the Windows Insider Program."