Text to Speech Conversion Using Hugging Face Transformers


Introduction

Text-to-speech (TTS) technology has revolutionized how we interact with devices, making it easier to access content aurally. TTS is vital in applications such as virtual assistants, audiobooks, accessibility tools for the visually impaired, and language learning platforms.

This tutorial will explore how to convert text to speech using MeloTTS, a powerful transformer model available on Hugging Face and designed for high-quality TTS tasks.

We will walk through installing the necessary libraries, creating basic examples, experimenting with different accents and languages, adjusting speech speed, and ultimately, combining these elements into a comprehensive TTS function.

Note: Check out my article on how to generate stunning images from text if you are interested in text-to-image generation.

Installing Required Libraries

To begin, we must clone the MeloTTS repository from GitHub and install the required dependencies. This can be done with the following commands:


!git clone https://github.com/myshell-ai/MeloTTS.git
%cd MeloTTS
!pip install -e .
!python -m unidic download

In the above script, the git clone command fetches the MeloTTS repository, and the %cd magic navigates into the cloned directory. The pip install -e . command installs the package in editable mode, allowing us to make changes if necessary. Finally, the unidic download command downloads the language dictionary required for text processing.

A Basic Example

Let's create a basic example of converting English text to speech using the MeloTTS model. In the following code, we import the TTS class from the melo.api module and set the speech speed to 1.0.

Notice that we pass the language and the device type to the TTS class constructor. The device parameter is set to auto, allowing the model to use the GPU if available. We define our English text and initialize the TTS model for the English language. The speaker_ids dictionary maps different accents to their respective IDs. Finally, we call the tts_to_file() method to generate the speech file with the American accent and save it as en-us.wav.

from melo.api import TTS

speed = 1.0


device = 'auto' # Will automatically use GPU if available

# English
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

# American accent
output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)
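Once tts_to_file() returns, you can sanity-check the generated file with Python's standard wave module. The helper below is a sketch: since it has to run without the model, it writes a one-second placeholder clip that stands in for en-us.wav.

```python
import wave

def wav_info(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return rate, duration

# Stand-in for the model's output: one second of silence at 44.1 kHz.
with wave.open("placeholder.wav", "wb") as wf:
    wf.setnchannels(1)          # mono
    wf.setsampwidth(2)          # 16-bit samples
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00" * 44100)

print(wav_info("placeholder.wav"))  # (44100, 1.0)
```

Point wav_info() at en-us.wav after synthesis to confirm the file was written and to see how long the clip is.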

Trying Different Accents

You can try different accents for a given language. For instance, in the previous script, we created an English model. You can list the accents available for this model via the speaker_ids dictionary.


print(speaker_ids)

Output:


{'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}

Let's try to generate speech with an Indian accent using the EN_INDIA speaker ID.


speed = 1.0

device = 'auto'

# English Indian Accent
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

output_path = 'en-in.wav'
model.tts_to_file(text, speaker_ids['EN_INDIA'], output_path, speed=speed)
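To compare every accent side by side, you can loop over the speaker_ids dictionary instead of picking one entry at a time. The following is a sketch: the dictionary is hard-coded from the output shown above, output_name() is a hypothetical helper for deriving file names, and the actual synthesis call is commented out so the loop can be checked without a loaded model.

```python
# Speaker IDs as printed by the English model above.
speaker_ids = {'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}

def output_name(speaker_id):
    # 'EN-US' -> 'en-us.wav', 'EN_INDIA' -> 'en_india.wav'
    return speaker_id.lower() + ".wav"

for sid in speaker_ids:
    path = output_name(sid)
    # With a loaded model: model.tts_to_file(text, speaker_ids[sid], path, speed=1.0)
    print(path)
```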

Trying Different Languages

We can also generate speech in different languages. All supported languages are listed on the Hugging Face MeloTTS model card. To change the language, pass the language ID to the language parameter of the TTS class constructor. For example, the following script uses the FR ID to create a French TTS model. You can then print the accents associated with a language using the speaker_ids dictionary, as shown in the script below.


speed = 1.0

device = 'auto'

# French
model = TTS(language='FR', device=device)
speaker_ids = model.hps.data.spk2id
print(speaker_ids)

Output:


{'FR': 0}

The above output shows that French has only one accent, FR. You can use this speaker ID to generate French speech.


text = "Dans cette vidéo, vous allez apprendre sur les Large Language Models. Ça va être amusant."
output_path = 'fr.wav'
model.tts_to_file(text, speaker_ids['FR'], output_path, speed=speed)

Adjusting Speech Speed

The speed of the speech can be adjusted by changing the speed parameter. Here’s an example with a faster speech speed:


speed = 5


text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)

By setting the speed to 5, the speech is generated at a much faster rate.
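If you want to compare several rates, one simple approach is to loop over a list of speeds and write one file per value. This is a sketch reusing the model and speaker_ids from the previous script; the synthesis call is commented out so the file naming can be checked on its own.

```python
text = "In this video, you will learn about Large Language Models. This is going to be fun."
speeds = [0.75, 1.0, 1.5, 5]

for s in speeds:
    path = f"en-us-speed-{s}.wav"
    # With a loaded model: model.tts_to_file(text, speaker_ids['EN-US'], path, speed=s)
    print(path)
```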

Putting it All Together

Finally, to streamline the process, we can create a function that generates speech based on the specified parameters:


def generate_speech(text, language="EN", speaker_id="EN-US", speed=1.0, audio_path="speech.wav"):
    model = TTS(language=language, device='auto')
    speaker_ids = model.hps.data.spk2id
    model.tts_to_file(text, speaker_ids[speaker_id], audio_path, speed=speed)

Using the generate_speech function, we can quickly generate speech by providing the necessary arguments:


generate_speech(text="Hello, this is test speech",
                speaker_id="EN_INDIA",
                speed=3,
                audio_path="indian_speech.wav")
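One caveat: passing a speaker ID that does not exist for the chosen language makes generate_speech fail with a bare KeyError. A small validation helper (hypothetical, not part of MeloTTS) produces a clearer error message; you could call it inside generate_speech before tts_to_file.

```python
def validate_speaker(speaker_ids, speaker_id):
    """Return the numeric ID, or raise a descriptive error for unknown speakers."""
    if speaker_id not in speaker_ids:
        raise ValueError(
            f"Unknown speaker {speaker_id!r}; available: {sorted(speaker_ids)}"
        )
    return speaker_ids[speaker_id]

# Example with the English speaker map shown earlier.
english = {'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}
print(validate_speaker(english, 'EN_INDIA'))  # 2
```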

Conclusion

This tutorial explored the basics of text-to-speech conversion using Hugging Face's MeloTTS transformer. We covered installing required libraries, creating basic examples, experimenting with different accents and languages, adjusting speech speed, and putting everything together in a reusable function. With these tools, you can create high-quality speech synthesis for various applications, enhancing user experiences and accessibility.
