Bring your voice skills to life and make them stand out with Speech Synthesis Markup Language.
With over 40,000 skills and apps available on Alexa and Google Assistant, competition is huge. To stand out means not just careful planning and conversation crafting, but adding real polish to the conversational design.
One of the most effective ways to craft your voice experience is to use SSML to tweak the assistant’s voice: speeding up, slowing down, stressing, or destressing words and phrases to give your skill a much more natural user experience.
Speech Synthesis Markup Language is an XML-based markup language for computer-generated speech. It allows developers to change how a voice assistant delivers the lines coded into it, controlling intonation, emphasis, rate of speech, and even pronunciation. Think of it as the CSS of voice development - without it your app will still work, but may seem a little flat and lifeless.
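To make the CSS comparison concrete, here is a minimal sketch in Python of the same response with and without markup. The wording of the response and the helper name `is_well_formed` are our own illustrations, not part of either platform's tooling. The check only confirms the markup is valid XML with a `<speak>` root. It does not tell you whether a given assistant supports each tag.

```python
import xml.etree.ElementTree as ET

# Plain response with no markup: the assistant decides all pacing and stress.
plain = "<speak>Welcome back. Ready to play?</speak>"

# The same response with a short pause and a slightly slower question.
styled = (
    "<speak>"
    "Welcome back. "
    '<break time="300ms"/>'
    '<prosody rate="90%">Ready to play?</prosody>'
    "</speak>"
)


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML with a <speak> root element."""
    try:
        return ET.fromstring(ssml).tag == "speak"
    except ET.ParseError:
        return False


print(is_well_formed(plain), is_well_formed(styled))
```

A quick check like this catches unclosed tags before you paste a response into a simulator, but listening to the result is still the only real test.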
SSML is a standard language supported by both Alexa and Google Assistant. However, each assistant supports only certain elements of the language. Both offer the basics, but they differ on some of the more useful elements and also offer their own custom additions.
A full list of the SSML elements supported by Alexa is available here.
Google Assistant’s list of supported elements is available here.
Both Alexa and Google offer invaluable tools to play around with SSML and figure out how each element changes the way the assistant speaks. These sandbox environments are great ways to get to grips with SSML.
Note: for the following you will need a developer account for each platform, and you will need to have gone through the basic steps to set up a project on each.
To test SSML on Alexa:
1. Log into your Alexa developer account
2. Select your skill
3. Navigate to the ‘Test’ tab
4. Select ‘Voice and Tone’
5. Set your language at the bottom of the console.
To test SSML on Google Assistant:
1. Log into console.actions.google.com
2. Select your app
3. Select ‘Simulator’ from the left-hand navigation
4. Set your surface and language
5. Select “Audio” from the console.
Both supply some sample text to listen to and amend, but you’ll quickly want to start testing out your own responses. For our voice projects, we run all voice skill scripts through SSML simulators on both platforms to understand how the assistant will understand each word in context and pronounce it. Never take anything for granted: how it sounds when you read it out yourself may not be how the assistant understands it!
Pauses are an undervalued part of voice design. Users need to be able to digest what they are hearing, and rushed or very lengthy responses increase the chance of the user becoming overwhelmed and frustrated with your skill.
This can partly be solved in planning using the ‘one breath test’ - reading your script aloud to ensure that each sentence or idea your skill is trying to communicate can be read out in a single breath. While this solves most problems, you might find the assistants rushing some phrases, which is where the <break> tag comes in handy.
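SSML’s <break> tag gives two options for controlling pauses: the strength attribute, which takes named values such as ‘medium’ or ‘strong’, and the time attribute, which takes an exact duration in seconds or milliseconds.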
Google Assistant reads the phrase ‘Yes or no’ quite quickly. By adding in some comma length breaks, we can slow the question down a bit to help users understand what is expected of them as a response.
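As a rough sketch of the ‘Yes or no’ fix, the Python below builds the question with a comma-length break and, for comparison, a timed one. The function name `yes_or_no` and the 250ms figure are our own choices for illustration. Amazon documents `strength="medium"` as equivalent to a comma, so treat the exact pause length on Google Assistant as something to confirm by ear.

```python
import xml.etree.ElementTree as ET


def yes_or_no(strength: str = "medium") -> str:
    """Build the question with a <break> of the given strength after 'Yes'."""
    pause = f'<break strength="{strength}"/>'
    return f"<speak>Yes {pause} or no?</speak>"


# Option 1: a named strength ('medium' is roughly a comma-length pause).
ssml = yes_or_no()

# Option 2: an exact duration, useful when the named strengths feel wrong.
timed = '<speak>Yes <break time="250ms"/> or no?</speak>'

# Both strings should parse as XML before they go anywhere near a simulator.
ET.fromstring(ssml)
ET.fromstring(timed)
print(ssml)
```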
The <break> tag should only really be needed where the assistant’s natural pauses are not what you are seeking. Both Alexa and Google Assistant will interpret punctuation like commas, full stops, hyphens, etc. pretty naturally.
Note: <break> isn’t the answer to every situation. Sometimes it can be difficult to get the assistant to match the exact length of pause you are looking for. You might find more success rewording or restructuring your response to get the right natural flow, or test out if <prosody> or <phoneme> tags work as an alternative.
‘Prosody’ refers to the patterns of stress and intonation in spoken language. It’s what stops us from sounding monotone, and SSML’s <prosody> tag does the same for voice assistants. Prosody has two handy uses: adding emphasis to certain phrases, and fixing some tricky assistant mispronunciations. The <prosody> tag lets you control three things in your assistant’s responses: rate, pitch, and volume.
Rate speeds up or slows down the pace of the assistant’s responses. You have the option of ‘x-slow’ to ‘x-fast’ but our preference is ‘n%’, which gives you more control to fine tune the exact speed you want.
You can wrap specific phrases or words in rate tags to make assistant responses clearer, or wrap entire responses. Google Assistant in particular tends to speak noticeably faster than Alexa, and when reading back multiple sentences end-to-end it can overload the user.
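Here is a small sketch of wrapping one phrase in a rate tag, using the ‘n%’ form. The helper name `slow_down`, the sample sentence, and the 85% value are assumptions for illustration. The right percentage is whatever sounds clear in the simulator.

```python
from xml.sax.saxutils import escape


def slow_down(text: str, percent: int = 90) -> str:
    """Wrap text in <prosody rate="N%">, where 100 is the default speed."""
    # escape() protects characters such as & and < inside the spoken text.
    return f'<prosody rate="{percent}%">{escape(text)}</prosody>'


response = "<speak>" + slow_down("Your order number is A B 1 2 3.", 85) + "</speak>"
print(response)
```

To slow an entire response instead, pass the whole text to the same helper rather than a single phrase.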
Pitch is handy for adding some extra emphasis to words. We often find that pitch=‘+20%’ coupled with a volume boost raises the assistant voice just enough to make a word stand out to your users.
Adding a bit of extra volume to words also helps with emphasis. Both assistants have predefined ranges from ‘silent’ to ‘x-loud’, while Alexa gives an added option to fine tune with ‘+ndB’.
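The pitch and volume combination can be sketched like this. The helper name `emphasise` and the sample sentence are hypothetical. We use the predefined ‘loud’ value because both assistants accept the named volume range, while a decibel value such as ‘+3dB’ is the kind of fine tuning Alexa adds.

```python
def emphasise(word: str, pitch: str = "+20%", volume: str = "loud") -> str:
    """Raise pitch and volume on a single word so it stands out."""
    return f'<prosody pitch="{pitch}" volume="{volume}">{word}</prosody>'


ssml = f"<speak>This is your {emphasise('final')} question.</speak>"
print(ssml)
```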
Note: there is an <emphasis> tag which combines both rate and volume into a single tag, but the assistants can pronounce things wrapped in <emphasis> tags a little oddly, and we find using <prosody> tags gives you more control over fine tuning.
Something that can really elevate voice skills is to add audio. This could be using sound effects, or it might be entire responses recorded using a voice-over talent. Amazon offers a sound library for use with its skills here, while Google’s equivalent can be found here. Note that both platforms lock these sounds to experiences built for their own platforms. If you are looking to build a cross-platform experience, third party sound libraries such as SoundBible are great for finding and using sound files released under Creative Commons licences.
Adding these sound effects is a quick and effective way to add some life to your skill. Another route is to record entire responses using a human voice, and serve these up in place of the assistants’ voices, like we did with Channel 4’s Human Test. Now this does involve more work - you can record the audio yourself, or if you have the budget you can hire professional voice talent to provide the clips. If you are choosing this path, it’s important to get the planning and scripting right as changing the audio later in the project can become tricky and expensive.
If you are planning on using sound files throughout your skill, we also recommend using a short introductory music or sound file to be played when users start a new session. This indicates to the user that your skill is a richer media experience, and makes them more likely to stick around for the duration of your skill, rather than exit part-way through.
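An introductory sound can be sketched with the <audio> tag, which both assistants support. The URL below is a placeholder and the helper name `welcome` is our own. Each platform has its own rules on hosting, file format, and clip length, so check those before relying on a file.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical URL: swap in a file you host yourself.
INTRO_SOUND_URL = "https://example.com/audio/intro.mp3"


def welcome(intro_url: str, greeting: str) -> str:
    """Play a short intro sound, then speak the greeting."""
    # quoteattr() escapes the URL and adds the surrounding quotes.
    return f"<speak><audio src={quoteattr(intro_url)}/> {greeting}</speak>"


ssml = welcome(INTRO_SOUND_URL, "Welcome to the quiz.")
print(ssml)
```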
We used these techniques to great effect with our Channel 4 Human Test skill. The use of audio added a level of surprise and delight, and exceeded user expectations. See the results for yourself by asking Alexa or Google Assistant to ‘Start Human Test’, or click the link below.
While the assistants are pretty accurate at picking up the meaning of certain words from context, they can still misinterpret. You might want a string of numbers read out as individual characters, or a value to be interpreted as a street address.
You can handle cases like these using the <say-as> tag. <say-as>, coupled with the interpret-as attribute, lets you specify values as cardinal or ordinal numbers, dates, units, times, telephone numbers, addresses, fractions, individual digits or characters, and even expletives. Amazon offers an additional <w> tag which functions similarly to <say-as> by allowing you to define words as present verbs, past participles, or nouns, and also to change the meaning of homonyms. For example, the word bass has multiple meanings, which you can switch between using <w role="amazon:SENSE_1">bass</w>.
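As a rough illustration, the sketch below uses <say-as> for digits and an ordinal, plus Alexa’s <w> tag for the bass example. The helper name `say_as` and the sample sentences are ours. The <w> tag is Alexa-only, and which meaning SENSE_1 selects is worth confirming in the simulator rather than taking from this example.

```python
import xml.etree.ElementTree as ET


def say_as(value: str, interpret_as: str) -> str:
    """Tell the assistant how to interpret a value, e.g. 'digits' or 'ordinal'."""
    return f'<say-as interpret-as="{interpret_as}">{value}</say-as>'


ssml = (
    "<speak>"
    f"Your code is {say_as('4921', 'digits')}. "   # read as four separate digits
    f"You finished in {say_as('3', 'ordinal')} place. "  # read as 'third'
    # Alexa-only: pick the non-default sense of a homonym.
    'He caught a <w role="amazon:SENSE_1">bass</w>.'
    "</speak>"
)

root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
print(ssml)
```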
When testing out your voice app, you’ll find that the assistant sometimes pronounces words in an odd way, whether clipping or rushing a word or struggling to pronounce rarer or non-native words. While some of the above SSML tricks may fix the issue, Alexa has a ‘secret weapon’ to solve even the trickiest problems - the <phoneme> element.
Phonemes are the smallest units of sound in speech. You might have seen an odd collection of letters near the top of some Wikipedia articles, like the ‘skɒtlənd’ at the top of this article. These are letters from the International Phonetic Alphabet, which tell readers how the word should be pronounced. While they can seem daunting at first, once you get the hang of them they are a very handy way of having Alexa pronounce words almost exactly as you want.
We often find Alexa will destress the word ‘that’. In a sentence like “I think that we should go home”, this makes sense, whereas in “you think a dog makes that sort of noise” we want ‘that’ to be emphasised. <emphasis> and <prosody> tags can only help so much; in many cases Alexa will still destress the word, clipping it. Fixing this is easy with a phoneme tag.
The easiest part about phonemes is that you don’t have to construct words out of IPA characters yourself. You can copy and paste directly from sites like Wiktionary, which includes IPA pronunciations alongside its definitions, such as on this page for chicken.
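For the ‘that’ example, a phoneme tag might look like the sketch below. The helper name `phoneme` is our own, and ‘ðæt’ is the full, stressed IPA form of the word as you would copy it from a dictionary entry. This is Alexa-only markup, so test the result in the Alexa simulator.

```python
import xml.etree.ElementTree as ET


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in an Alexa <phoneme> tag using an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'


ssml = (
    "<speak>You think a dog makes "
    + phoneme("that", "ðæt")
    + " sort of noise?</speak>"
)

tag = ET.fromstring(ssml).find("phoneme")
print(tag.get("ph"), tag.text)
```

The text inside the tag is kept as the written word, while the `ph` attribute carries the pronunciation the assistant should use.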
If your skill is intended for multiple English-speaking regions, remember that words like leisure, data, schedule, and tomato can vary in pronunciation around the world, and account for this when using phoneme tags.
Currently Google Assistant does not support phoneme elements, but Amazon has a list of all characters supported on Alexa here.
Lastly, Alexa has a library of ‘speechcons’: humorous interjections which add some character to your skill, such as ‘great scott’, ‘oh my giddy aunt’, and ‘bravo!’. Each Alexa-supported language has some of its own unique additions; the UK version can be found here and the US list here. While it is best to use these sparingly, they are a handy way to add personality without having to figure out specific SSML tagging for each interjection.
Using the tips above is great for adding that extra polish to a skill, but while SSML can really elevate a voice experience, it doesn’t compensate for a lack of planning. Making sure you’ve read your content out loud with another person, tested it against the ‘one breath test’, and run it through the Alexa and Google Assistant simulators to check for pronunciations not only makes for a better user experience, but can reduce the amount of time needed for polishing. We believe budgeting for SSML is vital to a good voice experience, but it’s best to make the most of this time in really elevating the experience, rather than fixing issues which should have been caught in planning.
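On Alexa, speechcons are written with <say-as> and an interpret-as value of ‘interjection’. The sketch below uses ‘bravo’ from the examples above. The helper name `speechcon` and the follow-up sentence are ours, and the phrase needs to be on the speechcon list for your skill’s language to get the expressive delivery.

```python
def speechcon(phrase: str) -> str:
    """Mark a phrase as an Alexa speechcon (interjection)."""
    return f'<say-as interpret-as="interjection">{phrase}</say-as>'


ssml = f"<speak>{speechcon('bravo')}! You got all five right.</speak>"
print(ssml)
```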