Introduction to Speech Synthesis Markup Language (SSML)
Speech Synthesis Markup Language (SSML) is an XML-based markup language that
lets the Web and other applications deliver their content and functionality as
synthesized speech. Authors use it to describe how an application or a piece of
web content should be spoken, so applications built with SSML can interact with
the user by voice rather than by text alone.
Speech Synthesis Markup Language gives authors control over many aspects of
speech. Pronunciation, volume, the gender of the voice, and other properties
of the spoken output can all be set in the markup. SSML is a W3C
Recommendation developed by the Voice Browser Working Group.
The purpose of Speech Synthesis Markup Language is to assist the synthesis
process that renders an SSML document as speech. Different elements of SSML
assist different stages of that process, so it helps to know what the stages
are. The stages of the synthesis process are:
1. XML parse
2. Structure analysis
3. Text normalization
4. Text-to-phoneme conversion
5. Prosody analysis
6. Waveform production
Together, these six stages convert an SSML document into voice output.
The first stage of synthesis is the XML parse, during which an XML parser
extracts the document tree and content from the SSML document. In the next
stage, structure analysis, the structure of the document is determined from
the extracted content. The structure found at this stage influences how the
document is read, including the order in which the content is spoken.
The third stage of the synthesis process is text normalization. This stage
determines what should actually be spoken for each piece of written text in
the document. The same written form can be spoken in different ways depending
on the language and the context, so the choice has to be made here. For
example, if the document contains 3/4, the voice output could be "three
quarters", "the third of April", or "the fourth of March". Decisions of this
kind are made at this stage of the synthesis process.
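An author can reduce this kind of ambiguity in the markup itself. The fragment
below is a minimal sketch using the <say-as> and <sub> elements. The
interpret-as and format values shown ("date" and "md") are assumptions: SSML 1.0
defines the attributes but leaves their permitted values to the synthesis
processor, so check what your processor supports.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Hint that 3/4 is a date in month/day order (values are processor-dependent) -->
  <s>The meeting is on <say-as interpret-as="date" format="md">3/4</say-as>.</s>
  <!-- Alternatively, state the intended reading explicitly with <sub> -->
  <s>Add <sub alias="three quarters">3/4</sub> of a cup of sugar.</s>
</speak>
```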
The next stage is text-to-phoneme conversion, in which the words chosen in the
earlier stages are broken down into phonemes, the basic units of
pronunciation. Prosody analysis is the stage in which the pitch, timing,
pausing, and emphasis of the words are determined. These properties are called
prosodic features, and they are very important for natural-sounding speech
output. Elements such as <emphasis>, <break>, and <prosody> are used in the
SSML document to assist this stage of synthesis.
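As a rough illustration of how markup can guide these two stages, the sketch
below uses <phoneme> to supply a pronunciation and <emphasis>, <break>, and
<prosody> to shape the delivery. The sentences and the IPA transcription are
made up for illustration, and the attribute values are just one choice among
those SSML 1.0 allows.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Text-to-phoneme: give the pronunciation directly (illustrative IPA) -->
  <s>You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.</s>
  <!-- Prosody: stress one word, then pause for half a second -->
  <s>This is <emphasis level="strong">very</emphasis> important.</s>
  <break time="500ms"/>
  <!-- Prosody: speak the enclosed text more slowly, higher, and louder -->
  <s><prosody rate="slow" pitch="high" volume="loud">Please listen carefully.</prosody></s>
</speak>
```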
The final stage is waveform production, during which the audio output is
generated. The information produced in the fourth and fifth stages,
text-to-phoneme conversion and prosody analysis, is used here to produce the
audio. The <voice> element can request a particular type of voice, such as
male or female, and the <audio> element can insert recorded audio files to be
played.
The structure of an SSML document is easiest to understand from an example:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Pronunciation lexicon referenced by URI -->
  <lexicon uri="http://www.somelexiconfile.com/lexicon.file"/>
  <!-- Request a female voice for the enclosed content -->
  <voice gender="female">
    <p>
      <s>I speak <emphasis>French.</emphasis></s>
      <s>I also speak <emphasis>German.</emphasis></s>
    </p>
    <!-- "IPA" is spoken as the alias text -->
    <sub alias="International Phonetic Association">IPA</sub>
  </voice>
  <!-- The enclosed text is the fallback if the audio file cannot be played -->
  <audio src="royal.wav">
    <emphasis>Welcome</emphasis> to the Royal Club.
  </audio>
</speak>
The <speak> element is the root element of an SSML document. The SSML version
is given in the version attribute of the <speak> element. The SSML namespace
and the location of the SSML schema are also declared on this element. The
language in which the speech is to be produced is given in the xml:lang
attribute, which is also allowed on the <p> and <s> elements. The <p> element
represents a paragraph, which may contain sentences marked up with <s>
elements.
In the example above, a <p> element contains two sentences, each in its own
<s> element. An <emphasis> element is used to stress a word when it is spoken.
How emphasis is realized differs between languages, dialects, and even voices.
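As a small sketch of these elements, the fragment below overrides the document
language for a single sentence and sets the optional level attribute on
<emphasis>. The sentences are made up for illustration, and how a French
sentence sounds depends on the voices your synthesis processor offers.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <s>In English we say <emphasis level="strong">good morning</emphasis>.</s>
    <!-- xml:lang on <s> switches the language for this sentence only -->
    <s xml:lang="fr-FR">Bonjour.</s>
  </p>
</speak>
```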
Another important element in an SSML document is the <lexicon> element, which
gives the URI of a pronunciation lexicon document. There may also be cases
where you would like a female voice to render some content. In that scenario
you can use the <voice> element, which has a gender attribute. If you set the
gender attribute to female, the content is delivered in a female voice. The
<voice> element also has a name attribute, in which you specify the name of
the voice that should be used. The available names depend on the synthesis
processor.
The <voice> element also has an age attribute, which gives the preferred age
of the voice that speaks the text. For example, if you specify
<voice gender="female" age="6">, the text enclosed by this <voice> element is
spoken in the voice of a six-year-old girl, where the synthesis processor has
one available. Changing the age attribute in this way lets you deliver content
in a child's voice or a mature woman's voice. Another important element is the
<sub> element, which specifies the text to be spoken in place of the text it
encloses. In our example the <sub> element encloses IPA, which should be
rendered as International Phonetic Association and not just IPA. You can use
the <sub> element in this way to expand any abbreviation.
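The following sketch combines the two elements: it requests two voices of
different ages and expands an abbreviation in each. The ages, sentences, and
alias text are arbitrary choices for illustration, and the voice actually used
depends on what the synthesis processor has available.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Request a young girl's voice -->
  <voice gender="female" age="6">
    I am learning the <sub alias="alphabet">ABC</sub>.
  </voice>
  <!-- Request an adult woman's voice; age 40 is an arbitrary choice -->
  <voice gender="female" age="40">
    I work for the <sub alias="World Wide Web Consortium">W3C</sub>.
  </voice>
</speak>
```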
SSML has many more elements that help render rich voice content for the Web
and other applications. For details of all the elements, see the specification
at http://www.w3.org/TR/speech-synthesis/.