Developing a speech-to-text (speech recognition) and text-to-speech model for the Amharic language

Yihalem Mandefro
8 min read · Oct 8, 2021


Creating a speech recognition dataset for any language

State-of-the-art algorithms and code are available almost immediately to anyone in the world, thanks to arXiv, GitHub and other open-source initiatives, and GPU deep learning training clusters can be spun up in minutes on AWS. So what is a company's competitive edge as AI and machine learning get adopted in every domain?

The answer is, of course, data, and in particular cleaned and annotated data. This kind of data is either difficult to get hold of, very expensive, or both, which is why many people are calling data the new gold.

So in this post I'm going to walk through how to easily create a speech recognition dataset for (almost) any language, bootstrapped from freely available material. Such a dataset can, for instance, be used to train a Baidu Deep Speech model in TensorFlow for any type of speech recognition task.

For English there are already plenty of readily available datasets, for instance the LibriSpeech ASR corpus, which contains 1000 hours of spoken English (created in a similar fashion to what is described in this post).

Mozilla also has an initiative to crowdsource this kind of data into an open dataset: https://voice.mozilla.org/data

Introduction to data preparation for Amharic

We’ll create a dataset for Amharic, but this bootstrap technique can be applied to almost any language.

We'll use free audiobooks with permissive licenses, or audiobooks in the public domain, together with the corresponding e-books, to create our dataset.

The process includes preprocessing of both the audio and the e-book text, then aligning the text sentences with the spoken sentences, a step called forced alignment. We'll use Aeneas, an awesome Python library and command-line tool, to do the forced alignment.

The last step includes some manual work to fine-tune and correct the audio samples using a very simple web UI. This postprocessing and tuning also involves transforming our final audio and text output map into the proper training format for the TensorFlow Deep Speech model.

Data collection

There are many audiobooks produced in Ethiopia, but they cannot be used for a speech recognition dataset as they are: their classical background music adds a lot of noise, which is not good for our model.

So the cleanest approach for Amharic voice data collection is to record clear audio and transcribe it manually, which requires a lot of human effort and is very time-consuming.

There is another approach, which Ethiopian Artificial Intelligence is using: transcribing phone-call speech through a deal with Ethiotelecom. This is the best approach, as it captures many accents and yields good data quality.

But for startups and students I recommend using clear audiobooks, transcribing existing recordings, or recording your own voice, as I am going to do in this article.

Many speech recognition models are trained on a dataset of many sentences and their recorded audio. For Amharic I skipped that path, because Amharic is not like other languages: every Amharic character has its own sound, whereas in English one sound can require different combinations of letters. That said, I still recommend using sentences if you have the data.

After recording the data:

To cut unwanted audio from the beginning of the file I use Audacity, which is completely free.
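If you prefer to do this trimming in code instead of in Audacity, pydub (the same library we use later for slicing) can handle it. Here is a minimal sketch, assuming a hypothetical recording.mp3 with two seconds of unwanted audio to cut from the start:

from pydub import AudioSegment

# load the raw recording (hypothetical file name)
recording = AudioSegment.from_mp3("books/recording.mp3")

# pydub slices in milliseconds: drop the first 2000 ms
trimmed = recording[2000:]

# save the cleaned file for the alignment step
trimmed.export("books/recording_clean.mp3", format="mp3")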

Preprocessing the text:

We need to transform the raw text file into the Aeneas text input format described here:

Aeneas plain text input format

We used NLTK for this, mostly because the NLTK sentence splitter is regex-based and needs no language-specific model, and the English one works fairly well. For Amharic we will have to write our own preprocessing code, since there is no NLP toolkit for Amharic; you can find the toolkit I am building in my GitHub repository. But since this article works with individual characters, I won't go deep into Amharic text preprocessing here.
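As a rough illustration of what Amharic-specific preprocessing could start with (this is not the code from my repository), Amharic sentences usually end with the Ethiopic full stop ።, so a very simple sentence splitter might look like this:

import re

def split_amharic_sentences(text):
    # the Ethiopic full stop ። marks sentence boundaries in Amharic;
    # also split on ? and ! in case the text mixes in Latin punctuation
    sentences = re.split(r'[።?!]+', text)
    return [s.strip() for s in sentences if s.strip()]

print(split_amharic_sentences("ደህና ነው። እንዴት ነህ?"))

Back to the e-book example.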

To load the first chapter of the ebook and split it up into paragraphs and sentences:

# load the NLTK sentence tokenizer
from nltk.tokenize import sent_tokenize

# load text from the 1st chapter
with open('books/18043-0.txt', 'r') as f:
    data = f.read()

We examined the e-book and found that its paragraphs are clearly delimited by two newline characters ("\n\n"). The Aeneas text input format also makes use of paragraphs, so we decided to use the e-book paragraphs as well:

paragraphs = data.split("\n\n")

Now brace yourself for some ugly code: cleaning some special characters, doing the actual sentence splitting with NLTK, and adding the sentences to the paragraph list:

import re

paragraph_sentence_list = []
for paragraph in paragraphs:
    paragraph = paragraph.replace("\n", " ")
    paragraph = paragraph.replace(" — ", "")
    paragraph = re.sub(r'[^a-zA-Z0-9_*.,?!åäöèÅÄÖÈÉçëË]', ' ', paragraph)
    paragraph_sentence_list.append(sent_tokenize(paragraph))
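One small step the snippets above skip is writing these sentences out as the plain-text input file that Aeneas expects: in the plain format, every non-empty line is treated as one fragment to align. A minimal sketch, writing one sentence per line to the file name reused in the Task configuration below:

# flatten the paragraph list and write one sentence per line,
# which is what the Aeneas "plain" input type expects
with open("books/18043-0_aeneas_data_1.txt", "w") as f:
    for sentences in paragraph_sentence_list:
        for sentence in sentences:
            f.write(sentence.strip() + "\n")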

Forced alignment using Aeneas

Aeneas can be run from the command line or as a Python module. We decided to use it as a module so we can extend it for any future automation of the task.

Install and import the Aeneas module and methods:

!pip install Aeneas
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

Create the Task object that holds all the relevant configuration.

# create Task object
config_string = "task_language=swe|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "books/18043-0/goteborgsflickor_01_stroemberg_64kb_clean.mp3"
task.text_file_path_absolute = "books/18043-0_aeneas_data_1.txt"
task.sync_map_file_path_absolute = "books/18043-0_output/syncmap.json"

First we create the config string, which is pretty straightforward: define the language ("swe" for Swedish in this example; for Amharic you would use the appropriate language code if your Aeneas setup supports it), set the input text type to plain (or mplain), and finally choose JSON as the output sync map format.

Next we define the audio file, the text file corresponding to the audio file, and what we want our output sync map to be called when it's saved, in this case just syncmap.json.

To run it:

# process Task
ExecuteTask(task).execute()

# output sync map to file
task.output_sync_map_file()

For this sample/chapter it took only a second or two to run, so it should be pretty fast.

Awesome! We should now have an audio/e-book Aeneas sync map.
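The screenshot of the sync map isn't shown here, but roughly speaking each fragment in the JSON carries a begin time, an end time and the text line it was aligned to. A quick, minimal sketch for inspecting it:

import json

with open("books/18043-0_output/syncmap.json") as f:
    syncmap = json.load(f)

# print the first few aligned fragments: start time, end time and text
for fragment in syncmap["fragments"][:3]:
    print(fragment["begin"], fragment["end"], fragment["lines"][0])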

The next step is fine-tuning and validating the sync map.

Validating and fine tuning Aeneas sync maps

There is a very simple web interface created for Aeneas that loads the sync map and the audio file and makes it easy to fine-tune the sentence start and end timestamps.

Download or clone the finetuneas repository. Open finetuneas.html in Chrome to start the finetuning.

Then load your voice data and the syncmap.json you generated with the code above.

fine tuning our data using finetuneas

In the right pane you will see the start timestamp, along with "+" and "-" controls for adjusting the start and end times. Beware that the end time of an audio clip is the start time of the next sentence, which is a bit confusing.

Click the text to play the section of interest. Adjust and fine-tune the start and end timestamps if necessary. When done, save the fine-tuned sync map using the controls in the left pane.

When the fine-tuning is done, it's time for the final post-processing that transforms the dataset into a simple format which can be used to train the Deep Speech model.

Convert to DeepSpeech training data format

The data we now have is a JSON sync map plus an audio file, but DeepSpeech doesn't take these as input, so we will convert them into DeepSpeech's input format, which is CSV.

The last step is to convert the data into a format which can be easily used. We used the same format Mozilla uses at https://voice.mozilla.org/data. Referencing media files (text, images and audio) from a CSV is also common practice when training machine learning models in general.

Basically a CSV file looking like this:

the final CSV file used to train our model

Each audio file will contain one sentence, with one row per sentence (though in this case each "sentence" is a single character). There are some other attributes that are optional and added where possible; in this case only the gender is known.

Upvotes and downvotes are metrics for when people validate a sample as good or bad.
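The screenshot of the CSV isn't reproduced here, so this is a made-up illustration of its layout (the filenames and Amharic characters are placeholders; the columns match the dataframe we create below):

filename,text,up_votes,down_votes,age,gender,accent,duration
sample-0.mp3,ሀ,0,0,0,male,,
sample-1.mp3,ለ,0,0,0,male,,
sample-2.mp3,መ,0,0,0,male,,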

We’ll use a library called pydub to do some simple slicing of the audio files, create a pandas dataframe and save it to a CSV.

from pydub import AudioSegment
import pandas as pd
import json

book = AudioSegment.from_mp3("books/18043-0/goteborgsflickor_01_stroemberg_64kb_clean.mp3")

with open("books/18043-0_output/syncmap.json") as f:
    syncmap = json.loads(f.read())

Load the audio and the syncmap you created previously.

sentences = []
for fragment in syncmap['fragments']:
    if (float(fragment['end']) * 1000) - (float(fragment['begin']) * 1000) > 400:
        sentences.append({"audio": book[float(fragment['begin']) * 1000:float(fragment['end']) * 1000],
                          "text": fragment['lines'][0]})

Loop through all the segments/fragments/sentences in the sync map and do a sanity check that each one is more than 400 milliseconds long. Pydub works in milliseconds, while the sync map defines all beginnings and ends in seconds, which is why we multiply everything by 1000.

A placeholder dataframe is created

df = pd.DataFrame(columns=['filename', 'text', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'duration'])

Each sliced pydub audio object and the text for that fragment are now stored together in a list. Next we export them and build up the dataframe:

# export each audio segment and add a row to the dataframe
for idx, sentence in enumerate(sentences):
    text = sentence['text'].lower()
    sentence['audio'].export("books/audio_output/sample-" + str(idx) + ".mp3", format="mp3")
    temp_df = pd.DataFrame([{'filename': "sample-" + str(idx) + ".mp3", 'text': text,
                             'up_votes': 0, 'down_votes': 0, 'age': 0, 'gender': "male",
                             'accent': '', 'duration': ''}],
                           columns=['filename', 'text', 'up_votes', 'down_votes', 'age',
                                    'gender', 'accent', 'duration'])
    df = df.append(temp_df)

We lowercase all the text to normalize it, then export each audio object under a new file name.

Save the new audio filename and the text to the temporary Dataframe and append it to the placeholder Dataframe.

Take a look at the dataframe to make sure it looks sane:

df.head()

Finally save it as a CSV:

df.to_csv("books/sample.csv", index=False)

That is all there is to data collection and preprocessing; the next step is training the model.

We will now download and install DeepSpeech for training:

# Prepare to use DeepSpeech
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs sox libsox-fmt-mp3
!git lfs install
# Install DeepSpeech MODULE via setup.py
!pip install git+https://github.com/mozilla/DeepSpeech.git@master
# Install GPU version of tensorflow AFTER DeepSpeech installs CPU version
!pip uninstall -y tensorflow tensorflow-gpu
!pip install 'tensorflow-gpu==1.15.2'
# Make sure GPU works before going further
import tensorflow as tf
assert tf.test.is_gpu_available()

We will use tensorflow-gpu version 1.15.2, as DeepSpeech depends on this version.
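The exact training steps are in the notebook linked below, but as a rough, hedged sketch (not the exact commands from the notebook): Mozilla DeepSpeech expects 16 kHz WAV clips and CSVs with wav_filename, wav_filesize and transcript columns, so the MP3 clips and the Common Voice style CSV above would first need converting (for example with the sox tool installed earlier), after which training can be launched from a clone of the DeepSpeech repository. The file paths below are hypothetical placeholders:

# Hypothetical sketch: clone the DeepSpeech repo and launch training
!git clone https://github.com/mozilla/DeepSpeech.git
%cd DeepSpeech
# alphabet.txt must list every character that appears in the transcripts
# (for Amharic, the fidel characters used in the dataset)
!python DeepSpeech.py \
  --train_files ../books/train.csv \
  --dev_files ../books/dev.csv \
  --test_files ../books/test.csv \
  --alphabet_config_path ../books/alphabet.txt \
  --epochs 30 \
  --checkpoint_dir ../books/checkpoints \
  --export_dir ../books/model_export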

The complete code and steps for training the model can be found here.

If you have issues doing this, contact me via Telegram, LinkedIn or email.

Thank you https://medium.com/@klintcho for your help.

Thank you!
