Dustin Coates in Alexa

AWS re:Invent Day 4 Notes

The fourth and second-to-last day of AWS re:Invent is over. It saw an Alexa announcement that was upstaged, a look at natural language generation, and tips for using Polly effectively in Alexa skills or elsewhere.

Werner Vogels Keynote

I will be honest: I did not attend the keynote, or even watch it live. That thing was long and who has that kind of time when most of the talk won’t be about Alexa? Thankfully, the biggest Alexa announcement—perhaps of the entire week—was leaked the night before: Alexa for Business.

Alexa for Business is huge for businesses and for developers who build for them. As a skill developer in this still-small (relative to web or app building) space, that means you.

Company admins can set up devices in both shared and personal contexts. A shared context might be a conference room or a welcome center. You can add information to shared areas that is unique to that space. A conference room can hook into the associated calendar or the video conferencing details. A number of providers are supported from the outset, but can you guess which big one’s missing? If you can’t figure it out, ask your Google Home. Alexa for Business does, however, support Amazon Chime (naturally), Skype for Business, Zoom, BlueJeans, RingCentral, and WebEx.

(Random thought that doesn’t belong anywhere else: For what it’s worth, I also think it will spur more sales of Alexa devices with screens. If you are using Alexa in a shared space like a conference room to display schedules or data, suddenly a screen is more attractive. Plus, businesses are generally less price-sensitive than consumers.)

You can also build and deploy skills that are specific to your company’s business account. Skills can then be rolled out to the devices, including through groups if you have specific skills only certain people or shared spaces can access. Devices can still add public skills, but the admin can restrict that as well.

Finally, there is provisioning and user management. This works as you would expect, but it is worth noting that Amazon suggests users join with the account they use personally for Alexa. They would then have access to your private business skills on any of their devices.

I’m excited about Alexa for Business. We just moved into a new Paris office at Algolia and I’m looking forward to seeing what Alexa for Business can provide us.

Natural Language Processing Plus Natural Language Generation: The Cutting Edge of Voice Design

This session was about “natural language generation,” or creating human-like responses automatically. I didn’t take many notes on this session, as most of it concerned a single product from Automated Insights called Wordsmith. This tool is used for more than just speech generation or Alexa: news services use it to quickly create articles. Its editor lets you define slots and combinations from which the text is generated. Check out the product for more of what they do.

Amazon Polly Tips and Tricks: How to Bring Your Text-to-Speech Voices to Life

Polly is a text-to-speech service with 52 voices in 25 languages. This includes new voices in Korean (a new language), US English, Japanese, Indian English, and a German voice that is geared toward working well with “Danglish,” or German with English words sprinkled in. Polly is low-latency for real-time usage, but the speech can also be downloaded and replayed.

The text-to-speech pipeline can be thought of as having a front-end and a back-end. The front-end is text processing and the back-end is audio generation. From the moment text comes in, the pipeline performs a number of steps. First there is text normalization, such as transforming numeric figures (77) into words (seventy-seven).
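To make the normalization step concrete, here is a minimal sketch in Python. It is my own toy illustration, not how Polly does it: the function names are made up, and it only handles whole numbers from 0 to 99.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 (hypothetical helper)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Replace standalone one- or two-digit numbers with words.

    Longer numbers are left untouched. Decimals, dates, and currency are
    not handled; a real front-end needs far more rules than this.
    """
    return re.sub(
        r"\b\d{1,2}\b",
        lambda match: number_to_words(int(match.group())),
        text,
    )


print(normalize("Route 77 is 5 miles long"))
```

A production system also has to decide whether “1977” is a year or a quantity, which is where the surrounding context starts to matter.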

Next comes the grapheme-to-phoneme (G2P) transformation. A grapheme can be a letter, punctuation, or a character, such as in Chinese. It is the smallest unit of writing: it can’t be broken down any further and still stand on its own. Phonemes are the smallest units of speech. For example, the word “the” has multiple phonemes even though it has just a single syllable: “th” and “e” are distinct phonemes. The phonemes are then sent to wave generation, which produces the voice.
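As a rough picture of what G2P does, here is a dictionary-lookup sketch. The tiny lexicon and the function names are assumptions of mine for illustration, and the symbols are ARPAbet-style without stress marks. Real systems pair a much larger lexicon with a trained model for words they have never seen.

```python
# Toy pronunciation lexicon (hypothetical, ARPAbet-style symbols).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "book": ["B", "UH", "K"],
}


def g2p(word: str) -> list[str]:
    """Look up the phonemes for a single word, ignoring case."""
    key = word.lower()
    if key not in LEXICON:
        # A real pipeline would fall back to a statistical G2P model here.
        raise KeyError(f"no pronunciation known for {word!r}")
    return LEXICON[key]


def phonemize(text: str) -> list[str]:
    """Turn whitespace-separated words into one flat phoneme sequence."""
    return [phoneme for word in text.split() for phoneme in g2p(word)]


print(phonemize("The cat"))
```

Note how “the” maps to two phonemes despite being one syllable, matching the example above.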

A few challenges exist, such as:

  • Homographs, or words that are spelled the same but pronounced differently. Often the pronunciation can be figured out from the immediate context: “You can lead a horse to water, but a pencil is always lead.” There are other times when that’s not possible. “I read the book.” How do you pronounce “read”?
  • Abbreviations, acronyms, initialisms, and units. How is NASA pronounced? A simple text-to-speech system would say “N A S A.”
  • Loan words (“déjà vu”) and proper names.
  • Slang and new additions to the language.
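The acronym problem in particular is easy to sketch. Below is a toy rule of my own (the word list and function names are invented, and this is not Polly’s logic): all-caps tokens are spelled out letter by letter unless they appear in a list of acronyms that are spoken as words.

```python
# Acronyms pronounced as ordinary words (hypothetical, incomplete list).
SPOKEN_AS_WORD = {"NASA", "NATO"}


def expand_caps(token: str) -> str:
    """Decide how a single token should be read aloud."""
    is_all_caps = token.isalpha() and token.isupper() and len(token) > 1
    if not is_all_caps:
        return token
    if token in SPOKEN_AS_WORD:
        # Hand it on as a normal word for the G2P step.
        return token.lower()
    # Otherwise treat it as an initialism: one letter at a time.
    return " ".join(token)


def normalize_acronyms(text: str) -> str:
    """Apply expand_caps to each whitespace-separated token.

    Tokens with attached punctuation ("NASA.") are left alone in this sketch.
    """
    return " ".join(expand_caps(token) for token in text.split())


print(normalize_acronyms("NASA and the FBI met"))
```

Without the word list, “NASA” would fall through to the letter-by-letter branch, which is exactly the “N A S A” failure described above.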

Polly is compliant with SSML, so you can use the <lang> tag to include foreign phrases and names. The same voice is used, but with a different pronunciation. For example, the name David is pronounced differently in American English than in Spanish. There is also pronunciation support through the IPA and X-SAMPA phonetic alphabets. If Polly doesn’t pronounce words exactly how you want them, you can tweak them manually.
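Here is a small sketch of what that markup can look like, built with a few helper functions. The helpers (`lang`, `phoneme`, `speak`) and the sample sentence are my own; the tags and attributes are standard SSML, and the IPA string for “pecan” is just one plausible pronunciation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def lang(text: str, locale: str) -> str:
    """Wrap text so it is pronounced using the rules of another locale."""
    return f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>"


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Give an explicit pronunciation in IPA or X-SAMPA."""
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError("alphabet must be 'ipa' or 'x-sampa'")
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


def speak(*parts: str) -> str:
    """Join already XML-safe fragments inside a <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    "My name is ",
    lang("David", "es-ES"),
    ". I like ",
    phoneme("pecan", "pɪˈkɑːn"),
    " pie.",
)

# Parsing is a cheap check that the markup is well-formed XML.
ET.fromstring(ssml)
print(ssml)
```

If you use boto3, a string like this can be passed to the Polly client’s `synthesize_speech` call with `TextType="ssml"`. I left that out so the example runs without AWS credentials.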

The CEO of The Magic Door came up to discuss how they use Polly in their Alexa games. Their games are popular: 900,000 players, over eight million minutes of playtime, with twenty hours of stories, fifty Polly characters plus human voices, and thousands of handwritten SSML tags. They spend significant time manipulating Polly voices to get them just how they want. They adjust timbre and vocal tract length to create different characters from the same Polly voice. Phonetic spelling corrects Polly errors, which sometimes appear only after they’ve made another change.

Two features The Magic Door uses often are break tags and prosody. Break tags lengthen or shorten the pauses between words. The company found that this makes a world of difference, so much so that they went back and added break tags to all of their existing content. When The Magic Door uses prosody, it’s usually to slow down the rate of speech. Note that The Magic Door tells stories, so Polly is saying more at one time than it might for your content. The advice probably still applies, but try it out for yourself. The step-by-step process from initial script to marked-up SSML was incredible to see. By the end, there was more SSML than text.
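To give a feel for it, here is a minimal sketch that applies both ideas to a short script. The `narrate` helper, the sample lines, and the default values are my own guesses rather than anything The Magic Door showed; `<break>` and `<prosody>` are standard SSML tags.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def narrate(sentences: list[str], pause_ms: int = 400, rate: str = "slow") -> str:
    """Slow the whole passage down and put a pause between sentences.

    pause_ms and rate are starting points to tune by ear, not recommendations.
    """
    pause = f'<break time="{int(pause_ms)}ms"/>'
    body = pause.join(escape(sentence) for sentence in sentences)
    return f'<speak><prosody rate="{escape(rate)}">{body}</prosody></speak>'


ssml = narrate([
    "The door creaks open.",
    "Something moves in the dark.",
    "You step inside.",
])

# Parsing confirms the result is well-formed before sending it anywhere.
root = ET.fromstring(ssml)
print(ssml)
```

Even with three short sentences the markup is already about as long as the text, so it is easy to see how a full story ends up with more SSML than words.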