C-3PO, a famous robot from the Star Wars franchise. Image via Unsplash.


Written by Kirey Ismaya on February 18, 2021

Text to speech, also called TTS, may be unfamiliar to many people. In fact, many people assume it is the same as speech recognition, a system that takes spoken commands and translates them into data that a computer can understand.

Speech recognition, however, converts the human voice into text, so it is essentially the opposite of a TTS system.

In principle, a TTS system converts text into speech. Even though technology keeps advancing, not everyone understands what a text to speech system is or what it does.

Many people are unfamiliar with it because technology changes so quickly, and only a few really understand what the system means. Yet it can offer real benefits to its users.

What is Text-to-Speech?

Reading is a window into the world. We can agree that by reading we discover and learn a great deal of new knowledge. Over time, however, our daily activities tend to become busier, and we have to adapt. This raises the question of how people can carry on with their activities as usual and still read, with a computer converting the text into speech for them. One way is to get used to multitasking with the help of text to speech.

By definition, TTS is a system capable of converting text into speech automatically by means of phonetization, that is, arranging a sequence of phonemes to form speech. Because speech is assembled from phonemes rather than from a fixed list of recorded words, the vocabulary the system can pronounce is unlimited, so you can use it to say any word.

The text to speech feature is easy to access: a computer or cellphone can read out the text on the screen for you.

The TTS feature has many potential uses. In general, though, it prioritizes comfort and convenience for people who have special conditions and need to hear the text on a screen read aloud.

As is well known, the sound produced by a TTS system is computer-generated. Over time, however, the technology has developed considerably: the voices it produces now sound much more natural and include appropriate intonation, which makes them sound more alive.

One example is the text to speech system created by Aloud, which produces a more natural sound and pronounces every word well, even in various languages. Aloud's text to speech also offers a range of voice characters and genders.

To understand text to speech better, it helps to look first at how this technology works.

How Text-to-Speech works

A text-to-speech (TTS) conversion system is capable of automatically producing a speech signal for a sentence by means of grapheme-to-phoneme transcription. What distinguishes a TTS system from an ordinary talking machine is that it can pronounce new words automatically. This allows TTS to be used in almost any kind of application.

The technique used here is diphone concatenation, and the system consists of two subsystems: the text-to-phoneme converter and the phoneme-to-speech converter.

The text-to-phoneme converter changes an input sentence, given as text in a particular language, into a series of sound codes, usually represented as phoneme codes together with their duration and pitch. This part is highly language-dependent. The phoneme-to-speech converter accepts the phoneme codes, along with the pitch and duration produced by the previous part, and from them generates a speech signal that matches the sentence to be spoken.
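
The hand-off between the two subsystems can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `PhonemeCode`, `TOY_LEXICON`, the fixed 80 ms duration and the 120 Hz pitch are all made up for the example, and the second stage emits a plain sine tone per phoneme instead of real speech.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeCode:
    symbol: str       # phoneme label, e.g. "ay"
    duration_ms: int  # how long the phoneme should last
    pitch_hz: float   # target pitch for the phoneme


# Hypothetical two-word lexicon; a real converter is language-dependent.
TOY_LEXICON = {"hi": ["h", "ay"], "you": ["y", "uw"]}


def text_to_phonemes(text: str) -> List[PhonemeCode]:
    """Stage 1: text in, phoneme codes with duration and pitch out."""
    codes = []
    for word in text.lower().split():
        for symbol in TOY_LEXICON[word]:
            # Fixed duration and pitch are placeholders for real prosody.
            codes.append(PhonemeCode(symbol, duration_ms=80, pitch_hz=120.0))
    return codes


def phonemes_to_speech(codes: List[PhonemeCode], sample_rate: int = 16000) -> List[float]:
    """Stage 2: phoneme codes in, signal out (a sine tone stands in for speech)."""
    samples = []
    for code in codes:
        count = sample_rate * code.duration_ms // 1000
        for n in range(count):
            samples.append(math.sin(2 * math.pi * code.pitch_hz * n / sample_rate))
    return samples


codes = text_to_phonemes("hi you")
signal = phonemes_to_speech(codes)
```

The point of the sketch is the interface: the second stage never sees the text, only the phoneme codes with their duration and pitch.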

Two widely used techniques are formant synthesis and diphone concatenation. A formant synthesizer works from a mathematical model that computes the desired speech signal. This type of synthesizer has long been used in various applications. Although it can produce speech that is easy to understand, it cannot produce speech with a high degree of naturalness. Research suggests that synthesizers using the diphone concatenation technique can produce speech with a high level of naturalness.
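
To make the "mathematical model" behind a formant synthesizer a little more concrete, here is a minimal sketch that is not taken from any particular product. It assumes a cascade of two-pole resonators excited by an impulse train, and the formant frequencies and bandwidths are rough illustrative values for an "ah"-like vowel, not measured data.

```python
import math
from typing import List, Tuple


def resonator(signal: List[float], freq_hz: float, bandwidth_hz: float,
              sample_rate: int) -> List[float]:
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, stable)
    theta = 2 * math.pi * freq_hz / sample_rate          # pole angle
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz: float, duration_s: float, sample_rate: int) -> List[float]:
    """Crude voiced source: one impulse per pitch period."""
    period = int(sample_rate / pitch_hz)
    count = round(duration_s * sample_rate)
    return [1.0 if n % period == 0 else 0.0 for n in range(count)]


def synthesize_vowel(formants: List[Tuple[float, float]], pitch_hz: float = 120.0,
                     duration_s: float = 0.2, sample_rate: int = 16000) -> List[float]:
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range [-1, 1]


# Assumed (frequency, bandwidth) pairs in Hz, for illustration only.
vowel = synthesize_vowel([(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

Everything here is computed from a handful of numbers, with no recorded speech involved. That helps explain both the small footprint of this approach and its limited naturalness.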

In general, a TTS system consists of two stages. Natural Language Processing (NLP), in the form of a text-to-phoneme conversion module, produces a phonetic transcription together with intonation and rhythm information (known as prosody). Digital Signal Processing (DSP), in the form of a phoneme-to-speech conversion module, converts the phonetic information it receives into a speech signal.

Natural Language Processing (NLP)

NLP modules can be implemented in several ways, usually classified as dictionary-based or rule-based. Dictionary-based solutions store as much phonological information as possible in a dictionary.

In this method, transcription is done by looking words up in a precompiled lexical database. A rule-based transcription system, by contrast, replaces the stored phonological information with a set of letter-to-sound (grapheme-to-phoneme) rules.
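
The two approaches can be contrasted with a small sketch. The lexicon, the rule table and the phoneme labels below are invented for the example, and using the rules as a fallback when a word is not in the dictionary is a design choice made here for illustration, not something the classification itself requires.

```python
from typing import Dict, List, Tuple

# Dictionary-based: pronunciations are stored per word (toy entries).
LEXICON: Dict[str, List[str]] = {
    "text": ["t", "eh", "k", "s", "t"],
    "speech": ["s", "p", "iy", "ch"],
}

# Rule-based: letter-to-sound rules, multi-letter graphemes listed first.
RULES: List[Tuple[str, List[str]]] = [
    ("sh", ["sh"]), ("ch", ["ch"]), ("ee", ["iy"]),
    ("c", ["k"]), ("e", ["eh"]), ("h", ["hh"]), ("i", ["ih"]),
    ("p", ["p"]), ("s", ["s"]), ("t", ["t"]),
]


def rule_based(word: str) -> List[str]:
    """Scan left to right, applying the first rule that matches."""
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        for grapheme, sounds in RULES:
            if word.startswith(grapheme, i):
                phonemes.extend(sounds)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this letter: skip it in this toy version
    return phonemes


def transcribe(word: str) -> List[str]:
    """Look the word up first; fall back to the rules for unknown words."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return rule_based(word)
```

For example, `transcribe("text")` is answered from the dictionary, while `transcribe("sheep")` is not in the toy lexicon and goes through the rules.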


Digital Signal Processing (DSP)

The final processing stage of the TTS system is speech signal synthesis. In general, there are three basic methods for speech signal synthesis.

  1. Articulatory synthesis, which attempts to model the human speech production system directly, using a physical, mechanical approach.
  2. Formant synthesis, which models the pole frequencies of the speech signal, that is, the transfer function of the vocal tract, based on the source-filter model.
  3. Concatenative synthesis, which joins segments of various lengths taken from recorded natural speech.

Of these, the two techniques most often used are formant synthesis and diphone concatenation.


Phonemes can also be used as the speech units in the database. Compared with other methods, concatenative synthesis has some problems, namely:

  1. Distortion occurs due to discontinuity at the point of connection, which can be reduced by using a diphone or some other method to smooth the speech signal.
  2. Memory requirements are very high, especially when long concatenation units such as syllables and words are used.
  3. Collecting data and tagging parts of the speech signal takes a long time.
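
One simple way to soften the discontinuity at a join is a linear crossfade over a short overlap. The sketch below assumes plain lists of samples and a hypothetical helper name, `crossfade_join`. It stands in for "some other method to smooth the speech signal" and is not the smoothing used by any specific synthesizer.

```python
from typing import List


def crossfade_join(left: List[float], right: List[float], overlap: int) -> List[float]:
    """Join two segments, blending the last `overlap` samples of `left`
    with the first `overlap` samples of `right`."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter segment length")
    head = left[:-overlap]
    tail = right[overlap:]
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 towards 1
        old = left[len(left) - overlap + i]
        new = right[i]
        blended.append((1.0 - weight) * old + weight * new)
    return head + blended + tail


# A hard step from 1.0 to 0.0 becomes a gradual ramp across the join.
joined = crossfade_join([1.0] * 6, [0.0] * 6, overlap=4)
```

The joined signal is shorter than the two segments placed end to end, because the overlapping samples are merged rather than repeated.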

Diphone Concatenation

The diphone concatenation technique works by joining previously recorded sound segments. Each segment is a diphone (a combination of two phonemes). This type of synthesizer can produce speech with a high level of naturalness. In principle, a speech synthesizer using the diphone concatenation method forms speech by arranging a number of suitable diphones so that the desired pronunciation is obtained.

For the synthesizer to pronounce every possible word or sentence in a language, it must be supported by a diphone database containing all the diphone combinations that occur in that language. The diphone concatenation engine, or diphone processing unit, receives as input a list of the phonemes to be spoken, each accompanied by its duration and its pitch (frequency). Based on this list, the unit determines the correct sequence of diphones. It then smooths the joins between diphones and adjusts the duration and pitch. In the end, the diphone concatenation engine generates a signal that yields the appropriate pronunciation.
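
The first steps of such an engine, turning a phoneme list into a diphone sequence and fetching the recordings from the database, might look like the sketch below. The silence marker `"_"`, the function names and the tiny `TOY_DB` with made-up sample values are all assumptions for illustration. Smoothing of the joins and the duration and pitch adjustments are deliberately left out.

```python
from typing import Dict, List, Tuple

Diphone = Tuple[str, str]
SILENCE = "_"  # assumed marker for the silence before and after an utterance


def to_diphones(phonemes: List[str]) -> List[Diphone]:
    """Pair each phoneme with its successor, padding both ends with silence."""
    padded = [SILENCE] + phonemes + [SILENCE]
    return list(zip(padded, padded[1:]))


def missing_diphones(phonemes: List[str],
                     database: Dict[Diphone, List[float]]) -> List[Diphone]:
    """Diphones the utterance needs but the database does not contain."""
    return [d for d in to_diphones(phonemes) if d not in database]


def concatenate(phonemes: List[str],
                database: Dict[Diphone, List[float]]) -> List[float]:
    """Look up each diphone and join the recordings end to end."""
    missing = missing_diphones(phonemes, database)
    if missing:
        raise KeyError(f"diphones not in database: {missing}")
    signal: List[float] = []
    for diphone in to_diphones(phonemes):
        signal.extend(database[diphone])
    return signal


# Toy database: real entries would hold recorded audio samples.
TOY_DB: Dict[Diphone, List[float]] = {
    ("_", "h"): [0.0, 0.1],
    ("h", "ay"): [0.2, 0.3, 0.4],
    ("ay", "_"): [0.1, 0.0],
}

signal = concatenate(["h", "ay"], TOY_DB)
```

The `missing_diphones` check shows why the database has to cover the language: a single missing pair is enough to make an utterance impossible to assemble.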