Introduction to Speech Synthesis Markup Language - SSML

Speech Synthesis Markup Language (SSML) is an XML-based markup language for the Web and other applications that deliver content through synthesized speech. Authors use it to mark up text for speech interaction with an application or with web content, so applications built with SSML can offer much richer spoken interaction with the user.


SSML gives authors control over many aspects of speech. Pronunciation, volume, the gender of the voice, and other properties can all be set through markup. SSML is a W3C Recommendation developed by the Voice Browser Working Group.

The purpose of SSML is to guide the synthesis process that renders an SSML document as speech. Different elements of SSML act on different stages of that process, so it helps to know what the stages are:

1. XML parse
2. Structure analysis
3. Text normalization
4. Text-to-phoneme conversion
5. Prosody analysis
6. Waveform production

All six stages are needed to convert an SSML document into voice output.

The first stage is the XML parse, in which an XML parser extracts the document tree and content from the SSML document. In the next stage, the structure of the document is analyzed from the extracted content. Structure analysis influences the resulting voice output, because the way the content is delivered depends on the structure found at this stage, such as its division into paragraphs and sentences.

The third stage is text normalization, which determines what should actually be spoken for each piece of written text. The same written form can be read in different ways depending on the language and context, so the choice has to be made at this stage. For example, if the document contains “3/4”, the voice output could be “three quarters”, “the third of April”, or “the fourth of March”. Decisions of this kind are made during text normalization.
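
As a rough illustration of how an author can steer this stage, the sketch below resolves the “3/4” ambiguity in two ways. The <sub> element is part of SSML; the interpret-as and format values on <say-as> are assumptions taken from the W3C note on say-as attribute values, and support for them varies between synthesis processors. The xml:lang value and the sentences themselves are also illustrative.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <!-- Spell out the intended reading directly -->
  <s>Add <sub alias="three quarters">3/4</sub> of a cup of sugar.</s>
  <!-- Hint that the text is a date in day/month order (assumed attribute values) -->
  <s>The meeting is on <say-as interpret-as="date" format="dm">3/4</say-as>.</s>
</speak>
```

The <sub> form is the more portable of the two, since it leaves nothing for the processor to interpret.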

In the next stage, text-to-phoneme conversion, the words decided on in the earlier stages are broken into phonemes, the basic units of pronunciation. Prosody analysis is the stage in which the pitch, timing, pausing, and emphasis of the words are determined. These properties are called prosodic features, and they are very important for natural-sounding speech output. The <emphasis>, <break>, and <prosody> elements are used in an SSML document to assist this stage of synthesis.
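
The following sketch shows how markup might influence these two stages. It assumes a processor that accepts IPA in the <phoneme> element (an element not discussed above, but defined by SSML for this purpose). The sentences, the IPA transcription, and the attribute values chosen are illustrative only.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Text-to-phoneme: supply the phonemes instead of letting the processor guess -->
  <s>You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.</s>
  <!-- Prosody: stress one word, pause, then slow down and raise the volume -->
  <s>This is <emphasis level="strong">very</emphasis> important.</s>
  <break time="500ms"/>
  <s><prosody rate="slow" volume="loud">Please listen carefully.</prosody></s>
</speak>
```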

The final stage is waveform production, in which the output is generated as audio. The information obtained in the fourth and fifth stages, text-to-phoneme conversion and prosody analysis, is used to produce the audio. The <voice> element can request a particular type of voice, such as male or female, and the <audio> element can insert recorded audio files to be played.
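
A minimal sketch of these two elements, assuming a file named royal.wav is reachable from the document's location: the text inside <audio> acts as fallback content that is spoken if the recording cannot be played.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="male">
    <!-- If royal.wav cannot be fetched or played, the enclosed text is spoken instead -->
    <audio src="royal.wav">Welcome to the Royal Club.</audio>
  </voice>
</speak>
```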

The structure of an SSML document is easiest to understand through an example:

<?xml version="1.0"?>
<speak version="1.0" xmlns="">

  <lexicon uri=""/>

  <voice gender="female">
    <p>
      <s>I speak <emphasis>French.</emphasis></s>
      <s>I also speak <emphasis>German.</emphasis></s>
    </p>
    <sub alias="International Phonetic Association">IPA</sub>
  </voice>

  <audio src="royal.wav">
    <emphasis>Welcome</emphasis> to the Royal Club.
  </audio>

</speak>

The <speak> element is the root element of an SSML document. The version of SSML is given in its version attribute. The namespace declaration and the location of the SSML schema also belong on this element, as does the xml:lang attribute, which specifies the language in which the speech is to be produced. The xml:lang attribute is also allowed on the <p> and <s> elements. The <p> element represents a paragraph and may contain sentences, each marked up with an <s> element.
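
For reference, a fully declared root element might look like the sketch below. The namespace URI and schema location are the ones defined for SSML 1.0 by the W3C; the xml:lang values and the sentences are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This sentence is read in American English.</s>
    <!-- xml:lang on a sentence overrides the document language -->
    <s xml:lang="fr-FR">Cette phrase est lue en français.</s>
  </p>
</speak>
```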

In the example above, a <p> element contains two sentences, each in an <s> element. The <emphasis> element stresses a word when it is spoken. How emphasis is realized differs between languages, dialects, and even voices.

Another important element is <lexicon>, which gives the URI of a pronunciation lexicon document. There may be cases where you would like some content to be rendered in a female voice. In that case you can use the <voice> element, which has a gender attribute: if you set it to “female”, the enclosed content is delivered in a female voice. The <voice> element also has a name attribute, which requests a specific, processor-dependent voice by name.
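
A minimal sketch of these elements together might look as follows. The lexicon URI and the voice name "Anna" are hypothetical placeholders, not values from this article, and the voices actually available depend on the synthesis processor.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Hypothetical lexicon location; declared before any spoken content -->
  <lexicon uri="http://www.example.com/lexicon.xml"/>
  <!-- Request any available female voice -->
  <voice gender="female">
    <s>This sentence is read by a female voice.</s>
  </voice>
  <!-- Request one particular voice by its (assumed) name -->
  <voice name="Anna">
    <s>This sentence is read by the voice called Anna.</s>
  </voice>
</speak>
```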

The <voice> element also has an age attribute, which gives the approximate age of the speaker. For example, if you specify <voice gender="female" age="6">, the text enclosed by this element is spoken in the voice of a six-year-old girl. Changing the age attribute in this way lets you deliver content in a child's voice or in a mature woman's voice.

Another important element is <sub>, which specifies the words to be spoken in place of the text it encloses. In our example, the <sub> element encloses “IPA”, which should be rendered as “International Phonetic Association” and not just as “IPA”. You can use the <sub> element in this way whenever an abbreviation should be read out in full.
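
The sketch below combines the age attribute with <sub>. The sentences and the ages are made up for illustration, and how closely a processor can match a requested age depends on the voices it has installed.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- A child's voice -->
  <voice gender="female" age="6">
    <s>I am six years old.</s>
  </voice>
  <!-- A mature voice; the abbreviation is read out in full -->
  <voice gender="female" age="40">
    <s>I work for the <sub alias="World Wide Web Consortium">W3C</sub>.</s>
  </voice>
</speak>
```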

SSML has many more elements that help render rich voice content for the Web or any other application. For details of all of them, see the W3C SSML specification.





Copyright - © 2004 - 2017 - All Rights Reserved.