Working with Children: Helping Machines Understand Child Speech

Published on April 11, 2017

If you have a mobile device, tablet, smart-home system, or any other device in your home that uses automatic speech recognition, you’ve probably experienced this: the software works fine for mom and dad, but not so well for the kids. Why? Because there are several nuances to training machines to understand child speech that are not always well understood.

Part of the reason for this is that children speak very differently to adults – and not all speech recognition devices are well equipped to deal with this.

How is child speech different from adult speech?

At a surface level, we’re all familiar with the idiosyncratic ways that children speak. Ask an adult to do ‘baby talk’, and they’ll give you their best impression of a voice that is high-pitched, with poorly formed vowels, mixed-up consonants, and possibly some invented words or imaginative grammar. But at their core, these intuitive observations about how children speak reflect many of the actual problems that machines have when dealing with child speech.

High-pitched voices

From a purely biological standpoint, the vocal tract of a child is less developed than that of an adult. The vocal tract, which is shorter in adult females than in adult males, contributing to females’ higher-pitched voices, is also shorter in younger humans. The vocal folds (commonly known as vocal cords) of children are also shorter than those of both adult males and females.

The result is that the fundamental frequency of sounds generated by children averages over 300 Hz, compared to 210 Hz in adult females and 125 Hz in adult males. Speech recognition devices that are trained to tune in to voices with lower frequencies will often miss much of what a child says.

Learning to speak

The human vocal tract is complicated, and learning to use it takes time. Certain sounds require quite precise placement of articulators (active articulators such as the tongue, lips, teeth, etc. relative to passive articulators like the palate and alveolar ridge), which young children have yet to master.

This results in the mispronunciation of words like ‘helicopter’ as ‘hewwicopter’ which, while admittedly cute, can cause chaos for speech recognition software that is trained to equate a set of pronunciations with a set of words in its lexicon – it’s not going to recognise that particular substitution of sounds.

As inexperienced speakers, children also tend to stutter more, repeat themselves, or change direction mid-sentence: all things that automated speech recognition will struggle with when parsing input.

Word play

Part of learning to speak is experimenting and playing with words, and this is something that children do exceptionally well. Aside from genuine mistakes in pronouncing complex words, such as pronouncing ‘hospital’ as ‘hopspital’, children also engage in word play at the word level and the sentence level.

Young children who are still familiarising themselves with English morphological and inflectional processes might say ‘brunged’ instead of ‘brought’ for the past tense of ‘bring’, or ‘sheepses’ for the plural of ‘sheep’. They might make up words for lack of a better word, like ‘take-home’ for ‘takeaway’ that is brought home, or even invent words just for fun!

And in many cases, it is all about fun – to a child, a speech recognition device is a toy like any other, and more often than not, they will experiment and play with it just to see what it will do next.

Appen can help

As we mentioned in our previous blog post, When Speech Recognition Goes Wrong, it’s all about the data. Having the right data to ensure your systems are trained to deal with the challenges of child language is the key to developing a speech recognition device that caters to every member of the family, no matter how small. At Appen, we have experience in collecting both spontaneous and scripted child speech. We also work with transcribers who are familiar with child language, and use our knowledge of spelling standardisation to create the most accurate data possible. Contact us to talk about your needs and how we can help.
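
For readers who like to see things in code, here is a rough sketch of the lexicon problem described under ‘Learning to speak’. It is a toy illustration only, not how any particular speech recognition product works: the phoneme symbols are loose ARPAbet-style approximations, and the names `LEXICON`, `recognise`, and `expand_lexicon`, along with the single ‘L becomes W’ substitution rule, are all assumptions made up for this example.

```python
# Toy pronunciation lexicon: each phoneme sequence maps to one word.
# The phoneme symbols are approximate and purely illustrative.
LEXICON = {
    ("HH", "EH", "L", "IH", "K", "AA", "P", "T", "ER"): "helicopter",
    ("HH", "AA", "S", "P", "IH", "T", "AH", "L"): "hospital",
}


def recognise(phones, lexicon):
    """Return the word for an exact pronunciation match, or None if there is none."""
    return lexicon.get(tuple(phones))


def expand_lexicon(lexicon, substitutions):
    """Return a copy of the lexicon with one extra child-style variant per word.

    `substitutions` maps an adult phoneme to the phoneme a child might
    produce instead (a hypothetical rule set for this sketch).
    """
    expanded = dict(lexicon)
    for phones, word in lexicon.items():
        variant = tuple(substitutions.get(p, p) for p in phones)
        # Keep existing entries if a variant collides with a known pronunciation.
        expanded.setdefault(variant, word)
    return expanded


# 'hewwicopter': the child says W where an adult would say L.
child_phones = ["HH", "EH", "W", "IH", "K", "AA", "P", "T", "ER"]

print(recognise(child_phones, LEXICON))  # None: the adult-only lexicon has no match

child_aware = expand_lexicon(LEXICON, {"L": "W"})
print(recognise(child_phones, child_aware))  # helicopter
```

In practice, a hand-written rule like this only covers the substitutions somebody thought of in advance, which is why recorded and carefully transcribed child speech matters so much: it shows which variants children actually produce.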
