By acting you are better than a TTS machine: show it!

Acting is what voice talents should do even when they read a boring e-learning course. They should sound committed and interested, as if they truly believe and understand what they are saying. When they do act and stick to the text with heart and mind, voice actors have a winning edge over TTS (text-to-speech) technology, because as intelligent, trained speakers they are the ones who can add meaning and emotion to the lines they read.

TTS technology has made impressive strides in recent years thanks to more sophisticated algorithms and a continuous boost in the computing power of computers. So when you feed lines into the TTS software you can get quite human-like speech, called a synthetic voice, that can actually get the message across. So the old question comes up again… Will the machine replace the human? We are not there yet, but don’t underestimate the future possibilities of this technology.

More and more clients are asking studios to supply “machine voices” instead of “human, natural voices” for simple IVR, phone prompts, and voices on vending machines and toys, because “they sound OK” and they are getting really cheap. In fact, if you get familiar with the recent advances in this technology, you will realise the potential of TTS.


The modus operandi of TTS consists of recording hundreds of hours of random speech with an actor. The machine then decomposes the sentences into phonemes and reassembles them through a complicated process.

This is how TTS works, according to Acapela, one of the major players in this industry. Yes, some passages are laughable or even scary… so it is worth a listen:

https://www.youtube.com/watch?v=TykwDARmVIU
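To make the decompose-and-reassemble idea concrete, here is a toy sketch in Python. It is not how Acapela or any real engine is built: the tiny `LEXICON`, the `UNIT_INVENTORY` and the `synthesize` function are all invented for illustration, and real systems store recorded audio units rather than text labels.

```python
# Toy illustration only: real engines store audio units, not strings.
# LEXICON and UNIT_INVENTORY are hypothetical and tiny on purpose.
LEXICON = {
    "press": ["p", "r", "eh", "s"],
    "one": ["w", "ah", "n"],
}

# Units the actor has already recorded; "n" is deliberately left out.
UNIT_INVENTORY = {"p", "r", "eh", "s", "w", "ah"}


def synthesize(text):
    """Return (units to play in order, items the actor still has to record)."""
    sequence, missing = [], []
    for word in text.lower().split():
        phonemes = LEXICON.get(word)
        if phonemes is None:
            # The whole word is unknown to the machine.
            if word not in missing:
                missing.append(word)
            continue
        for phoneme in phonemes:
            if phoneme in UNIT_INVENTORY:
                sequence.append(phoneme)
            elif phoneme not in missing:
                missing.append(phoneme)
    return sequence, missing


sequence, missing = synthesize("Press one")
print(sequence)  # units that can be glued together
print(missing)   # gaps only a human recording can fill
```

Notice that nothing in this sketch knows what the sentence means: it only glues units together in order.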

Well, there is a missing element in TTS that makes the difference. It’s called prosody, which broadly speaking means the speech information that depends on context, namely pitch, pace, stress, duration, amplitude, and even voice gestures.

Linguists say that prosody is actually “a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All para-linguistic information contained in prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker.”


So by acting as the script requires, we voice talents shouldn’t be afraid of that intelligent monster called TTS. Prosody is also the key human ingredient that gives real life to a text, through things that are not necessarily vocal, such as hand gestures and eyebrow and facial movements.

TTS bumps into a big problem when it comes to certain sentences, namely questions. In most West European languages, the pitch of a question goes up at some point, usually on the last words. So those words are extended and the pitch rises a bit. But how about Russian? In Russian, you don’t make the crescendo in pitch at the end. A question in Russian is identified by a strong stress on one key word, not on a series of words.
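The two question patterns described above can be pictured as a list of relative pitch values, one per word. This is a deliberately crude sketch: the function name `question_contour`, the pattern labels and the numbers 1.3 and 1.4 are invented for illustration and are not measurements of any language.

```python
def question_contour(words, pattern, key_index=None):
    """Return one relative pitch value per word (1.0 = neutral pitch).

    pattern "final_rise":      pitch goes up on the last word.
    pattern "key_word_stress": pitch peaks on a single key word.
    The values 1.3 and 1.4 are arbitrary illustrative numbers.
    """
    contour = [1.0] * len(words)
    if not words:
        return contour
    if pattern == "final_rise":
        contour[-1] = 1.3
    elif pattern == "key_word_stress":
        if key_index is None:
            raise ValueError("key_index is required for key_word_stress")
        contour[key_index] = 1.4
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return contour


words = ["you", "are", "coming"]
print(question_contour(words, "final_rise"))
print(question_contour(words, "key_word_stress", key_index=1))
```

The hard part for a machine is not producing the rise but knowing which pattern applies and which word is the key one, and that depends on meaning.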

The voice-over studio PrimeVoices has tested the available TTS technologies, producing different VOs with different speech synthesis products. The results were not convincing: the tests failed to deliver articulate, clear audio for clients. After attending three editions of the Mobile World Congress in Barcelona, I also found that the available software is not making real progress. Major European carriers such as France Telecom (Orange), Telecom Italia or Telefónica have outsourced this solution, only to realise the limits of TTS. As a result, the operators are no longer investing in this technology. Only Google seems to be involved right now.

Why is that? What happened to put TTS progress on hold? Well, developers realised that to have something usable and commercially viable you need both man and machine. As with machine translation, you need a human to tweak, correct and improve. You create a speech memory which can be used automatically once there is a critical mass (hundreds of hours of recording), but in the end they have to call the voice talent regularly to supply missing words or expressions the machine can’t reproduce properly. For an operator, producer, studio, project manager, etc., the savings they could make by calling the voice talent less often (you have to call that voice anyway) will be wasted on post-production costs. Currently the cost is around 0.15 USD per word, which equals the cost of a voice talent for non-commercial reading.
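The arithmetic behind that last point is simple enough to write down. The sketch below reads the paragraph as saying that TTS, once post-production is included, costs about 0.15 USD per word, the same as a human non-commercial read. That reading, the function name and the 2,000-word script are assumptions made for illustration. Amounts are kept in cents to avoid rounding surprises.

```python
# Per-word rates in US cents, taken from the figure quoted above.
TTS_CENTS_PER_WORD = 15    # assumed to include post-production tweaking
HUMAN_CENTS_PER_WORD = 15  # non-commercial reading by a voice talent


def savings_with_tts(word_count):
    """Return the saving in USD from choosing TTS over a human read."""
    human_cost = word_count * HUMAN_CENTS_PER_WORD
    tts_cost = word_count * TTS_CENTS_PER_WORD
    return (human_cost - tts_cost) / 100


# A hypothetical 2,000-word e-learning script.
print(savings_with_tts(2000))
```

With equal per-word rates the saving is zero whatever the script length, which is why the promised economy evaporates.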

On the other hand, TTS has a big problem with products like e-learning. Despite tweaking to make the robot sound articulate, the resulting flow of words is monotonous, and this goes against the logic of keeping the audience’s attention. After a minute or two you fall asleep and stop following the training.

So no worries yet, we still have a chance. But algorithms are fast learners and practice makes perfect for a machine too, so one day you might find yourself working side by side with a TTS engine churning out speech in your own voice. But studios will still need you to feed the machine, especially with complicated words, brand names, foreign names and that feeling and emotion that only a human can give.

Meanwhile, you must dare to act as well as you can, in the right dose, because you shouldn’t overact either. By acting with natural prosody you will certainly beat the machine.

What do you think? Do you expect that the machine will take over some parts of the VO industry?

4 thoughts on “By acting you are better than a TTS machine: show it!”

  1. You are correct, but there is a huge HOWEVER!

    TTS is getting better and better and is being seen as an alternative.
    Here’s why:

    A company makes a video for their product produced in American English.
    The company localizes into 30 languages and localization of video is mandated.
    Localizing call-outs is relatively easy work.
    Localizing a narration is a nightmare because:

    *You need to book 30 narrators for the 30 languages (this means you lose your assistant during that task).
    *Multiple recording sessions over multiple recording days, perhaps even in multiple studios.
    *You have to schedule 30 narrators.
    *You have to pay 30 narrators.
    *You have to pay the studio(s).

    All of this has the potential to impact release schedules.

    With TTS, you localize the scripts and feed them to your TTS engine. If you want TTS “perfection” you must tweak the output.
    *Time to completion is manageable.
    *Cost is low (compared to narrators).
    *Consumption of company resources for setup of narrator session is eliminated.

    Many of a company’s consumers speak only their native language. If the company wants to serve that market, they need to deliver documentation and how-to videos in the target country’s tongue (this is actually a stumbling block for video use).

    If the content is important enough, a robotic voice is acceptable.
    If you doubt this, consider all of the blind people, all over the world, who use the narrator functions of computer operating systems. They are happy as can be that they can experience the Internet; the quality of the narration is a trivial concern for most of them.

    Great VOs will always have work.
    VO hacks better start looking for a solid niche or a different sort of work; in 5 years TTS will own a large share of the VO market.

  2. Thanks Constantino for furthering my education today. I learned a new word, prosody, and a new technology, TTS.

    I am hoping that people will eventually reject TTS even if it gets bigger and better. It’s tedious to listen to and boring and frustrating. You actually don’t even focus on the content after a while – you just listen to this robotic voice and cringe. Buenos dias!

    1. Constantino de Miguel

      Hi Nicole
      I fully agree. Man replaced by a machine, that’s bad. What I have heard from TTS is really disappointing, as you say. True, the attention span is shorter these days. It used to be 45 minutes, but now in the era of smartphones most people can’t listen with full attention for over 15 minutes. If you have a robotic voice talking to you in a monotonous tone you will fall asleep, so any e-learning with TTS will be pointless… but will the managers who decide budgets see it that way? They only look at how to trim costs, nothing more than that… ¡Que viva la voz humana! 😉
