On Saturday, an Associated Press investigation revealed that OpenAI’s Whisper transcription tool creates fabricated text in medical and commercial settings despite warnings against such use. The AP interviewed more than 12 software engineers, developers, and researchers who found that the model regularly invents text that speakers never said, a phenomenon often called “confabulation” or “hallucination” in the AI field.
Upon its release in 2022, OpenAI claimed that Whisper approached “human-level robustness” in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created false text in 80 percent of the public meeting transcripts examined. Another developer, unnamed in the AP report, said he had found fabricated content in almost all of his 26,000 test transcripts.
The fabrications pose particular risks in healthcare settings. Despite OpenAI’s warnings against using Whisper in “high-risk domains,” more than 30,000 medical workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from medical technology company Nabla that is fine-tuned on medical terminology.
Nabla acknowledges that Whisper can confabulate, but it also deletes the original audio recordings “for data security reasons.” This could cause additional problems, since doctors cannot verify the transcripts against the source material. And deaf patients may be especially affected by erroneous transcriptions, since they would have no way of knowing whether a medical transcript matches the audio.
The potential problems with Whisper extend beyond medical care. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found that Whisper added nonexistent violent content and racial commentary to neutral speech. They found that 1 percent of the samples included “entire hallucinated phrases or sentences that did not exist in any form in the underlying audio” and that 38 percent of those included “explicit harms such as perpetuating violence, inventing inaccurate associations, or implying a false authority.”
In one case from the study cited by the AP, when a speaker described “two other girls and a lady,” Whisper added fabricated text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it as, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and is actively studying how to reduce fabrications and incorporating feedback into model updates.
Why Whisper confabulates
The key to Whisper’s unsuitability in high-risk domains lies in its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t sure why Whisper and similar tools produce hallucinations,” but that isn’t quite true: we know exactly why Transformer-based AI models like Whisper behave this way.
Whisper is based on technology designed to predict the next most likely token (chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.
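To make that mechanism concrete, here is a minimal sketch of the decoding loop using the Hugging Face transformers port of Whisper and the publicly available “openai/whisper-base” checkpoint. The five-second near-silent clip is an illustrative assumption on my part, not audio from the studies the AP cites, and the exact output will vary from run to run.

```python
# A sketch, not a reproduction of the AP's or researchers' experiments.
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Hypothetical input: five seconds of 16 kHz audio containing only faint noise.
audio = 0.001 * np.random.randn(16000 * 5).astype(np.float32)

# The processor converts the raw waveform into the log-mel spectrogram
# features that Whisper's encoder consumes.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# generate() runs the decoder autoregressively: at each step it predicts the
# most likely next text token given the audio features and the tokens emitted
# so far. The loop always yields some token sequence, so when the audio is
# ambiguous or nearly silent, the most probable continuation can be fluent
# text that was never actually spoken.
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

The point of the sketch is the design choice it exposes: the transcript is whatever token sequence the decoder finds most probable, which is how plausible but fabricated text can emerge.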