In March 2022, Google opened early access to its automatic dubbing program, Aloud, to a limited number of audiovisual (AV) content creators. Aloud currently translates into only four languages, though all four are among the ten most spoken languages in the world (according to Berlitz in late 2021): Hindi, Indonesian, Portuguese, and Spanish. The idea, of course, is that automatic dubbing will make monolingually produced AV content accessible to a worldwide audience, which is a worthy goal. The current state of the technology, however, leaves much to be desired when you consider what makes a high-quality dub.
Aloud certainly has impressive capabilities: it relies on various technological processes, including audio separation, machine translation, and speech synthesis, to produce a dub in a fraction of the time required for a traditional dub. The intended workflow is for creators to provide an original-language transcript for their video, but Aloud can also generate an editable transcript that the creator can review and correct. Aloud is not the only nascent automatic dubbing program out there – Deepdub, a machine dubbing startup, raised $20 million in investment just a month before Aloud announced its early access release.
A current look at dubbing
There’s certainly a lot to gain from entering the dubbing market. According to Big Think, as of 2020 dubbing is preferred (over subtitles and voice-over) in much of continental western Europe. And in an interview with IndieWire, Chris Carey, the chief revenue officer and managing director of Iyuno Media Group, suggests that, generally speaking, the mass Asian market is also moving towards dubbing foreign-language media. In the same article, IndieWire claims that “most major streaming platforms are now treating a dub version as standard deliverable in many markets.” As in many other industries, automation seems the logical next step towards increasing efficiency and decreasing cost. The nature of dubbing, however, means that automatic dubbing in its current early stage may not be suitable for wide release.
Let’s unpack that: IndieWire writes that “for obvious reasons, a bad dub will lower audience engagement.” And what is considered a bad dub? One that heightens the viewer’s awareness that the person they are hearing and the person they are watching are not the same. This can be caused by a poor performance from the voice actor, a lack of visual sync between the dubbed voice and the on-screen actors’ mouths, or a poor audio mix. An automatic dubbing workflow can control only so much of these three things.
Probably the most problematic element is the lack of visual sync between the screen actors’ mouths and the voice actors’ words. Chris Carey admits that “even the best possible dub with live action actors is going to be noticeable,” but he further explains that “the closer the words and mouths match up over a consistent stretch of time, the better chance the viewer will move past the dubbing and become involved in the story.” Achieving this visual sync takes a lot of effort, intentionality, and creativity on the part of the translator. We’ve written before about the challenges involved in translating subtitles and translating audio scripts, but the requirements around translation for dubbing are arguably a step beyond either of those. This is part of what lay behind the infamous Squid Game English subtitle controversy: a translation for a dub must take additional creative license to account for visual sync, so the closed captions for the English dub would have been a less “accurate” translation than the English subtitles (though there were complaints about those too, some of which were more easily addressed than others).
There will surely be an answer to some of these issues eventually, but as of now automatic dubbing simply doesn’t seem capable of producing the same result as the traditional dubbing process. CGI, for instance, might solve the mismatch between the audio and the actors’ mouths. But for the time being CGI remains detectable in many instances, not to mention time-consuming and expensive.
Dubbing vs. voice-over
In the language services industry, “dubbing” and “voice-over” mean two different things, and so far we haven’t discussed which types of media might be better served by automation. For us, dubbing implies that the person speaking appears on screen but the original speech is replaced by a different voice actor. This is why all the issues we’ve mentioned up to this point apply. A video where no narrator appears on screen, however, might be a great candidate for automation. We’d call that an “automatic voice-over” rather than an “automatic dub.”
Voice-over has yet another meaning, which presents an interesting potential hybrid solution. Big Think reports that much of eastern Europe prefers voice-over for consuming foreign-language media, where one or a few voice actors narrate a translated script while the original voices remain faintly audible in the background. You may have seen this technique used in documentary films or newscasts where the interviewer and the person being interviewed do not speak the same language. Lip sync isn’t an issue for this type of “dub,” though other elements such as a poor audio mix might be. The same is true of a lektoring-style voice-over, though SPG Studios calls this technique “niche.”
Ewa Zawadzka, ZOO Digital’s head of dubbing for the EMEA region, affirms in an interview with Multilingual that when her company creates a dub, “we’re not just creating a program, that program has to achieve the same success wherever it goes.” Obviously this will entail different things in different markets. As of now, we’d limit automatic dubbing’s potential for success to voice-overs rather than on-screen narration. If you have single-language videos that you’d like to present to a multilingual audience, we’d be happy to walk you through your options – give us a call or send us a message!