Speech synthesis has come a long way since the 1978 Speak & Spell toy, which once amazed people with its state-of-the-art ability to read words aloud using an electronic voice. Using deep-learning AI models, software can now not only create realistic-sounding voices but also convincingly imitate existing voices from small audio samples.
Along these lines, OpenAI this week announced Voice Engine, a text-to-speech AI model for creating synthetic voices from a 15-second segment of recorded audio. It has provided audio examples of the Voice Engine in action on its website.
Once a voice is cloned, a user can enter text into Voice Engine and get an AI-generated voice result. But OpenAI isn’t ready to release its technology widely. The company initially planned to launch a pilot program earlier this month to allow developers to sign up for the Voice Engine API. But after further consideration of the ethical implications, the company decided to scale back its ambitions for the time being.
“In line with our approach to AI safety and our voluntary commitments, we are choosing to pilot but not release this technology at scale,” the company wrote. “We hope that this preview of Voice Engine both underlines its potential and motivates the need to strengthen societal resilience against the challenges posed by increasingly compelling generative models.”
Overall, voice cloning technology is not particularly new; there have been multiple AI speech synthesis models since 2022, and the technology is active in the open source community with packages such as OpenVoice and XTTSv2. But the idea that OpenAI aims to let everyone use its particular brand of voice technology is remarkable. And in some ways, the company’s reluctance to fully release it might be the bigger story.
OpenAI says the benefits of its voice technology include providing reading assistance through natural-sounding voices, enabling global reach for creators by translating content while preserving native accents, supporting non-verbal individuals with personalized speech options, and helping patients recover their own voice after speech-impairing conditions.
But it also means that anyone with 15 seconds of someone’s recorded voice can effectively clone it, and that has obvious implications for potential misuse. Even if OpenAI never widely releases Voice Engine, the ability to clone voices has already caused problems in society through phone scams in which someone imitates the voice of a loved one and election campaign robocalls featuring cloned voices of politicians like Joe Biden.
Also, researchers and reporters have shown that voice cloning technology could be used to break into bank accounts that use voice authentication (such as Chase’s Voice ID), prompting U.S. Senator Sherrod Brown of Ohio, the chairman of the U.S. Senate Committee on Banking, Housing and Urban Affairs, to send a letter to the CEOs of several major banks in May 2023 to inquire about the security measures banks are taking to counter AI-powered risks.
OpenAI recognizes that the technology could cause problems if released widely, so it is initially trying to work around these problems with a set of rules. Since last year, the technology has been tested with a number of selected partner companies. For example, the video synthesis company HeyGen has used the model to translate a speaker’s voice into other languages while keeping the same vocal sound.