ResponseBuilder: This is what makes the Alexa Skills Kit SDK for Node.js talk
Well, just like that, it happened. We’ve had our first change in code during the Dig Deep series. Amazon made a (welcome) update to the response builder just a couple of days after this post went up. I’ll create an updated post soon. In the meantime, use this guide if you’re using the existing NodeJS SDK.
Without building a response, we can't make Alexa speak at all. It's interesting, however, to note that ResponseBuilder only came about in August of 2016, just over a year ago. While the impetus was supporting long-form audio, it also alleviated a lot of the annoying manual response building that we had to do before. We'll look at it in depth today.
As a reminder, this is the Dig Deep series, where we look line-by-line at the tools and libraries we use to build voice-first experiences. This is not the place to go for tutorials, but if you want to learn interesting little nuggets about what you use every day, off we go…
The ResponseBuilder function is there primarily so we can build our responses ourselves, either as a replacement for emitting events like :tell or for playing audio. We see the code below in full (along with a few other functions).
First off, we’ve got a couple of methods up-front.
isObject does exactly what you would expect and we covered
IsOverridden in a previous post so I won’t go over it again here. But this is the Dig Deep series, where we look at code line-by-line together, so I’d be remiss in skipping over these. Otherwise, boring! Let’s get to the fun stuff.
ResponseBuilder function is a long one, clocking in at nearly 120 lines in total. The good thing for us, though, is that it’s so long only because it’s building a long object. So there won’t be a ton of complexity here. There is, meanwhile, a lot to learn in terms of how we make Alexa talk.
First we've got the setup that's common to all response types. We set the response object to be self.response, which right now is an empty object that we've seen before.
The version is set to 1.0. This isn’t much different than what’s happening when we connect to DynamoDB, but is in my opinion much cleaner than
2012-08-10 as an API version identifier.
And, finally, we start building a response object on the response object. Yeah, the naming is a bit weird, but let's roll with it. For simplicity's sake, we'll call it the response sub-object. It has a single key/value pair to start: shouldEndSession, which defaults to true. shouldEndSession ends the session, doesn't wait for a user response, turns off the light on the top of the Echo, and will save the session to DynamoDB if you've set that up.
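To make that skeleton concrete, here's a sketch of the envelope every response starts from. The field names (version, response, shouldEndSession) come from the Alexa response JSON; the function wrapper is mine, for illustration.

```javascript
// Sketch of the base envelope every response starts from. The field
// names are from the Alexa response JSON; the wrapper is illustrative.
function baseResponse() {
  return {
    version: '1.0',
    response: {
      shouldEndSession: true // default: say something, then end the session
    }
  };
}

console.log(JSON.stringify(baseResponse()));
// {"version":"1.0","response":{"shouldEndSession":true}}
```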
Next we have an IIFE that will return our response-building methods to us. It is wrapped in an IIFE so that this keeps pointing at the response object inside those methods.
speak is the simplest one. It sets what Alexa will say to whatever is passed in as speechOutput, after wrapping it in <speak></speak> tags for SSML via a helper function we'll look at at the end. Finally, this method, like all the others, returns this to make the calls chainable.
listen is not too different from speak, except it sets the reprompt (what is said if the user doesn't respond to the initial prompt from Alexa) and specifies that the session should not be ended. Because it only sets the reprompt and not the prompt, listen should never be used on its own.
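Here's a simplified stand-in for that pattern: speak and listen as chainable methods returned from an IIFE. The method names match the SDK's; the bodies are paraphrased, not the SDK's actual code.

```javascript
// Simplified stand-in for the SDK's chainable builder. The method names
// (speak, listen) match the SDK; the bodies are paraphrased.
function makeBuilder(response) {
  return (function () {
    return {
      speak: function (speechOutput) {
        response.response.outputSpeech = {
          type: 'SSML',
          ssml: '<speak> ' + speechOutput + ' </speak>'
        };
        return this; // returning `this` is what makes the calls chainable
      },
      listen: function (repromptSpeech) {
        response.response.reprompt = {
          outputSpeech: {
            type: 'SSML',
            ssml: '<speak> ' + repromptSpeech + ' </speak>'
          }
        };
        response.response.shouldEndSession = false; // keep the session open
        return this;
      }
    };
  })();
}

var res = { version: '1.0', response: { shouldEndSession: true } };
makeBuilder(res).speak('What is your name?').listen('Please tell me your name.');
console.log(res.response.shouldEndSession); // false
```

Note how listen flips shouldEndSession to false, which is exactly why the light ring stays lit and the mic stays open.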
A card is what’s displayed inside the Alexa app on a device or on the web. It is not the same as the templates displayed on the Echo Show. It displays a title and content, plus an optional image.
This method takes three arguments: a title, content, and an optional cardImage. cardImage is an object that holds one or both image URLs (a small one and a large one).
There are two card types, and which type we’ll use depends on whether we have an image or not. The
Simple card has no image, while the
Standard card does. Both types have a
title, whereas the
Simple card displays text with
content and the
Standard card displays it with
text. I can see the argument that the text of a Standard card isn't all of its content, but the split still seems to overcomplicate things.
Finally, this is all being set as the card attribute on the response sub-object.
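A sketch of that decision, assuming the standard Alexa card image fields (smallImageUrl/largeImageUrl); this is paraphrased logic, not the SDK's exact code.

```javascript
// Paraphrase of how the card gets built: no image means a Simple card
// with `content`; an image means a Standard card with `text` and `image`.
// smallImageUrl/largeImageUrl are the Alexa card image fields.
function renderCard(cardTitle, cardContent, cardImage) {
  if (cardImage && (cardImage.smallImageUrl || cardImage.largeImageUrl)) {
    return {
      type: 'Standard',
      title: cardTitle,
      text: cardContent, // Standard cards use `text`, not `content`
      image: cardImage
    };
  }
  return {
    type: 'Simple',
    title: cardTitle,
    content: cardContent
  };
}

console.log(renderCard('Hello', 'A card with no image').type); // Simple
console.log(renderCard('Hello', 'A card with an image', {
  largeImageUrl: 'https://example.com/large.png' // hypothetical URL
}).type); // Standard
```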
This creates a card for account linking, which we’ll look at in a future post.
Now we’re having fun…
Here’s another example of Amazon giving us multiple ways to do the same thing. Be sure to thank Akshat Shah next time you see him around…
The audioPlayer method takes up to six arguments. The first one is always mandatory and is the action you wish to take: play, stop, or clearQueue. If you provide anything else, you might as well provide clearQueue, because it's the fallthrough case.
As for the remaining arguments: all of them are mandatory if you're playing audio, none are needed if you're stopping audio, and only the second (behavior) is necessary if you're clearing the queue.
Because this does the same as the next three combined, we’ll just look directly at those.
The audioPlayerPlay method (or, as I like to call it, the Audio Player, Play On method) will play a stream of long-form audio. It is, as all long-form audio capabilities are, unsupported on Fire TV.
The first argument is
behavior, which accepts one of three values:
ENQUEUE: Plays the new stream after what is currently in the queue.
REPLACE_ALL: Replaces everything in the queue, including the currently playing stream, and immediately plays the new stream.
REPLACE_ENQUEUED: Replaces everything in the queue after the currently playing stream. Does not stop the current stream.
The SDK will not throw an error if you include another value, but don’t do it. Seriously.
The second argument is
url, which is the location of the audio to stream. This must point to an HTTPS URL and can be MP3, AAC, MP4, HLS, PLS, and M3U.
The third argument is
token, which represents the stream and is 1024 characters or less. This is required because of the next argument.
The fourth argument is
expectedPreviousToken. This is, essentially, the token of the stream that should come before this one. It's used in situations where the expected behavior and the behavior triggered by the user could conflict (for example, a user saying "previous track" right as the current track is ending). It is only allowed, and is required, when the behavior is ENQUEUE. The SDK won't throw an error otherwise, but the platform will.
The last argument is
offsetInMilliseconds. It's a timestamp representing where in the stream playback should start; 0, of course, starts at the beginning. A developer might use this when a user is coming back to a certain point (an individual music track, maybe not; a recording of a concert, yes).
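Putting those five arguments together, here's a hedged sketch of the AudioPlayer.Play directive they end up building. The directive field names come from the Alexa AudioPlayer interface; the URL and token values below are invented for illustration.

```javascript
// Sketch of the AudioPlayer.Play directive assembled from the five
// arguments. Field names are from the Alexa AudioPlayer interface.
function buildPlayDirective(behavior, url, token, expectedPreviousToken, offsetInMilliseconds) {
  var stream = {
    url: url,
    token: token,
    offsetInMilliseconds: offsetInMilliseconds
  };
  // expectedPreviousToken is only allowed (and required) when enqueueing
  if (behavior === 'ENQUEUE') {
    stream.expectedPreviousToken = expectedPreviousToken;
  }
  return {
    type: 'AudioPlayer.Play',
    playBehavior: behavior,
    audioItem: { stream: stream }
  };
}

var directive = buildPlayDirective(
  'REPLACE_ALL',
  'https://example.com/episode-42.mp3', // hypothetical HTTPS stream URL
  'episode-42',                         // hypothetical token
  null,
  0                                     // start from the beginning
);
console.log(directive.playBehavior); // REPLACE_ALL
```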
audioPlayerStop stops the stream. No arguments necessary.
audioPlayerClearQueue will clear the queue according to a clearBehavior. The options for clearBehavior are 'CLEAR_ENQUEUED' and 'CLEAR_ALL'. The difference between the two is that 'CLEAR_ENQUEUED' will clear everything after the currently playing stream and continue playback, while 'CLEAR_ALL' will also clear the current stream and stop playback.
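The directive behind it is tiny; here's a sketch, with field names from the Alexa AudioPlayer interface:

```javascript
// Sketch of the AudioPlayer.ClearQueue directive.
function buildClearQueueDirective(clearBehavior) {
  return {
    type: 'AudioPlayer.ClearQueue',
    clearBehavior: clearBehavior // 'CLEAR_ENQUEUED' or 'CLEAR_ALL'
  };
}

console.log(buildClearQueueDirective('CLEAR_ALL').type); // AudioPlayer.ClearQueue
```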
Finally, we've got the SSML helper. It takes the text that we want Alexa to say, wraps it in SSML <speak></speak> tags, and sets the output type as SSML. Believe it or not, in the early days of the SDK, you had to do this yourself. Life's so much easier now.
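The helper boils down to a few lines; here's a paraphrase (the function name below is my label for it, and the SDK's exact whitespace may differ):

```javascript
// Paraphrase of the SSML wrapper: plain text in, speech object out.
function createSSMLSpeechObject(message) {
  return {
    type: 'SSML',
    ssml: '<speak> ' + message + ' </speak>'
  };
}

console.log(createSSMLSpeechObject('Hello, world').ssml);
// <speak> Hello, world </speak>
```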
That’s it for
ResponseBuilder. In the next post, we’ll examine the rest of
alexa.js. Until then…