Dustin Coates in Alexa

AWS re:Invent Day 1 Notes

The first day of my first AWS re:Invent is coming to an end and, so far, I’ve learned one major thing: get where you’re going early, or else you’re waiting in line. Oh, you were expecting something more directly related to Alexa? Here are my notes from the two sessions I was able to attend on the first day.

Five AWS Services to Supercharge Your Alexa Skills

Given by Mark Bate and Memo Doring, both Solutions Architects at Amazon, this session focused on AWS services that integrate well with Alexa Skills. Both Bate and Doring made it clear that they picked five because of time constraints, not because these are the only five that work well with Alexa. True company men, through and through. (That and, of course, they have a point.)

Here are the five:

  1. Lambda
  2. S3
  3. DynamoDB
  4. Polly
  5. SES

Lambda is an obvious one, as most people will be hosting their skills there. In case you still need convincing, Lambda’s got some benefits: there are no servers to manage, it scales as the skill scales, and you’re never paying for idle machines. (I’d argue that this is underselling Lambda for Alexa Skills a bit. The use of Lambda is free for nearly all skills, with no extra SSL/TLS configuration and with an automatic security baseline.)

Lambda pro-tips:

  • Versions and aliases are like Git commits and tags
    • You can point your Alexa Skill to a specific version of a Lambda
    • The downside is that you must re-submit your skill when you want to update which Lambda version it points to
      • Bate argues that this is a good thing, because a switch without versions and without re-submission is more likely to break the skill
      • A good counter-argument is that certification takes a while and can be hit or miss
  • Environment variables persist across versions and aliases
    • They can be used for information you don’t want in the code, like API keys, but also for information that you may need to “flip,” like the current season (to display pumpkin spice content, for example)
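
The “flip” idea in the last tip can be sketched in a few lines of Python. This is only an illustration, not anything shown in the session: the variable name CURRENT_SEASON, the default value, and both function names are hypothetical.

```python
import os

# Hypothetical default, used when the variable is not configured on the function.
DEFAULT_SEASON = "summer"


def current_season(env=os.environ):
    """Return the configured season, normalized to lowercase."""
    # CURRENT_SEASON is a made-up name; it would be set in the Lambda
    # function's environment variables rather than in the code.
    return env.get("CURRENT_SEASON", DEFAULT_SEASON).strip().lower()


def seasonal_greeting(env=os.environ):
    """Pick the greeting text based on the configured season."""
    if current_season(env) == "fall":
        return "Welcome back! Pumpkin spice is on the menu."
    return "Welcome back!"
```

Changing the season then means editing one configuration value, with no code change. The env parameter exists only so the functions can be exercised with a plain dictionary.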

(Interesting side note that doesn’t fit anywhere else: Bate said that the eight second timeout where Alexa is waiting for a response isn’t always eight seconds and can sometimes extend beyond it.)

S3 can be used for hosting images for cards in the Alexa app or on Fire TV, or for the Echo Show (where I’m much more bullish). I’m of two minds about cards: do users ever look at them? But they’re usually so low-effort. S3 is also useful for hosting video and sound content, or even websites to promote an Alexa Skill.

S3 pro-tips:

DynamoDB is used to store persistent information from a skill, whether it’s about a user or otherwise. Interacting with DynamoDB from a skill on Lambda is easy, because it’s baked directly into the SDK.

DynamoDB pro-tips:

  • TTL (time to live) is supported for every record written
    • This can be used to clear out information about a user after a period of time where you deem it to be “stale”
  • DynamoDB streams can send events and do work outside of the skill based on changes inside DynamoDB
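
As a rough sketch of the TTL tip: DynamoDB’s TTL feature reads a Number attribute holding a Unix epoch timestamp in seconds, and you choose that attribute’s name when you enable TTL on the table. The attribute names, the 30-day window, and the function name below are all assumptions for illustration. The sketch only builds the item; writing it to a table is left out.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def build_user_item(user_id, last_topic, now=None, stale_after_days=30):
    """Build an item whose TTL attribute marks when it becomes stale."""
    # "now" can be injected for testing; otherwise use the current time.
    now = int(time.time()) if now is None else int(now)
    return {
        "userId": user_id,        # hypothetical partition key
        "lastTopic": last_topic,  # hypothetical user data
        # Hypothetical TTL attribute: epoch seconds after which
        # DynamoDB may delete the item.
        "expiresAt": now + stale_after_days * SECONDS_PER_DAY,
    }
```

Expired items are removed in the background rather than at the exact expiry moment, so code that reads the table may still want to check expiresAt itself.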

The last two were covered more briefly. Polly is the text-to-speech engine on AWS. One of Bate and Doring’s colleagues built a real-time translation skill using Polly, because developers can choose localized voices. (An attendee asked if developers will be able to leverage Polly directly from within SSML in the future. Doring said that it’s a request they’ve heard, but that the implementation may be unlikely.)

SES sends emails, which can be a useful additional touch-point to a skill interaction. It can allow you, the developer, to go “beyond the card” and provide even more information to the user. The challenge here is getting the email address.

Applying Alexa’s Natural Language to Your Challenges

The three presenters for this session were Paul Cutsinger and Bob Serr (Director of Alexa Skills Kit) of Amazon, and Bob Stolzberg of VoiceXP.

This session focused on building conversational skills that go beyond the “call and command” model that most skills use today, particularly through the dialog interface and features such as requiring or confirming slots. With the dialog interface, a skill delegates to Alexa to perform these actions.

There were two mental models presented that stood out. The first, from Cutsinger, was the idea of the “voice stack,” which includes NLU, speech recognition, and even microphone technology. The second concerned how the user navigates through the skill, particularly as it compares to the web or mobile apps.

On the web and mobile, users follow a graph UI. There’s a specific flow they must follow to get where they’re headed. Visit page (often homepage), tap on menu, click on link, go through form step-by-step. They can sometimes skip steps, but the flow is pretty predictable. Conversational UI (including voice) isn’t like that. Users expect that they can jump in where they want and not go through all of the steps. Developers should think of this flow as “frame” based, where there are specific modules that work together, but can also work independently. One flow is deep (graph UI) and the other is wide (frame UI).

When defining the UI for voice, most people start with the “happy path,” or the path a user takes where everything goes perfectly. This includes the user filling out all of the slots at once. That’s fine, but don’t stop there.

Slots can be marked as required and you can specify the order. Alexa will prompt for missing slots in the order that you specify. That moves the prompting logic from your code to Alexa. (This has positives and negatives. A positive is that this logic can be brittle and messy in your own code. A negative is that you have less control over how the flow works. For example, you can’t decide “well, I’ve got three of the four slots, that’s enough.”)
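
Here is a minimal sketch of what delegating to Alexa can look like in a Lambda handler written in Python, working directly on the request and response JSON. It assumes the intent has a dialog model with required slots, so that the request carries a dialogState field. The slot names size and drink and the function name are hypothetical.

```python
def handle_order_intent(event):
    """Delegate slot collection to Alexa until the dialog is complete."""
    request = event["request"]

    # While the dialog is STARTED or IN_PROGRESS, hand control back to
    # Alexa so it prompts for the next missing required slot.
    if request.get("dialogState") != "COMPLETED":
        return {
            "version": "1.0",
            "response": {
                "directives": [{"type": "Dialog.Delegate"}],
                "shouldEndSession": False,
            },
        }

    # Every required slot is filled at this point.
    slots = request["intent"]["slots"]
    size = slots["size"].get("value")    # hypothetical slot
    drink = slots["drink"].get("value")  # hypothetical slot
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "PlainText",
                "text": f"Ordering a {size} {drink}.",
            },
            "shouldEndSession": True,
        },
    }
```

The handler contains no prompting logic of its own, which is the trade-off described above: less code to maintain, but also no hook for deciding that a partially filled set of slots is good enough.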

Another useful feature is entity resolution (or synonyms). You can define synonyms inside of the Alexa configuration, so that “large” is equivalent to “huge,” “grand,” or “big.” You can even go beyond this, and define phrase synonyms, where “weighs too much to carry” resolves to the canonical value of “large.” You get both what the user said and the canonical value. You can also set a value to be a synonym for multiple canonical values and you’ll get both resolutions back.

A really great takeaway is using entity resolution/synonyms for error correction. Cutsinger mentioned that in an earlier session, Alexa kept hearing “mini” as “minute.” Adding a synonym for that and always relying on the canonical value fixed the problem.
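
To make “always relying on the canonical value” concrete, here is a small Python sketch that reads both the spoken value and the canonical values out of a slot in the request JSON. The function name and the sample slot are made up for illustration; the field names follow the entity resolution structure that Alexa includes in intent requests.

```python
def resolve_slot(slot):
    """Return (spoken value, list of canonical values) for one slot."""
    spoken = slot.get("value")
    canonical = []
    resolutions = slot.get("resolutions", {})
    for authority in resolutions.get("resolutionsPerAuthority", []):
        # Skip authorities that found no match for what the user said.
        if authority.get("status", {}).get("code") != "ER_SUCCESS_MATCH":
            continue
        # One spoken value can map to several canonical values.
        for entry in authority.get("values", []):
            canonical.append(entry["value"]["name"])
    return spoken, canonical


# Made-up slot: Alexa heard "minute", which is defined as a synonym
# of the canonical value "mini".
sample_slot = {
    "name": "size",
    "value": "minute",
    "resolutions": {
        "resolutionsPerAuthority": [
            {
                "status": {"code": "ER_SUCCESS_MATCH"},
                "values": [{"value": {"name": "mini", "id": "MINI"}}],
            }
        ]
    },
}

spoken, canonical = resolve_slot(sample_slot)
```

Branching on the canonical list rather than on the spoken value means a misheard word only needs a new synonym in the skill configuration, not a code change.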

Tomorrow is day two of AWS re:Invent, so come back and check the notes then.