Blog February 12, 2018

The Things You Should Be Thinking About with Your Alexa Skill or NLP Application


A truism about natural language processing: the tools to create a conversation today are really good and easy to use. At this point, going to the Amazon Alexa Skill Builder to create an NLP solution is really point and click. It's not hard stuff; we could actually get a group of 12-year-olds writing their own skills in about an hour. So, what does that mean?

It means I'm not going to assemble a simple Skill to impress my clients. I would be embarrassed trying to excite people by putting an Echo on their desk and talking to it. Those people would probably ask, "Why are you showing me this? What is the value?" The truth is there is deep value in NLP solutions, but it goes beyond a simple call and response. To leverage that value, you need deeper expertise and execution.

User Experience (UX)

The first thing to consider when building an NLP application is the user experience: the interaction model. You need heuristics for how to structure the conversation, because this is still less-than-perfect technology.

As an example, think about a hypothetical app to deliver Disney World wait times. Disney might think, "This should take us only a couple of days to do as an Alexa skill, because we already have the wait times calculated and available." Well, it's not that simple. What if a customer asks, "What's the wait time for the Star Wars ride?" There isn't a single ride called Star Wars. There are three different attractions it could be: Star Wars Launch Bay, Star Tours: The Adventures Continue, or Star Wars: A Galactic Spectacular. So, when a customer says Star Wars, what do they mean? What's the heuristic around it? Should Alexa say, "We have multiple Star Wars attractions. Which one do you mean?" Probably not, because your customer may not know what those are, and it introduces another opportunity for error in the conversation.

So now you've asked a question your customer doesn't know how to answer, and they get frustrated. In this case, the appropriate response is to list all three in the order that makes the most sense, perhaps the most popular first. Now, what if there were 10 possible matches? Listing them all isn't feasible. What are the rules and heuristics around assuming an answer? When do we ask for clarification? There are hundreds of considerations like this that go into an interaction model.
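To make that concrete, here is a minimal sketch of that kind of disambiguation heuristic in Python. The attraction list, keywords, popularity scores, and the MAX_LISTED threshold are all invented for illustration; a production skill would use far richer matching and data.

```python
# Hypothetical disambiguation heuristic for an ambiguous attraction request.
# The attraction data, keywords, and threshold are invented for illustration.

ATTRACTIONS = [
    {"name": "Star Wars Launch Bay", "keywords": ["star wars", "launch bay"],
     "popularity": 0.9, "wait_minutes": 25},
    {"name": "Star Tours: The Adventures Continue", "keywords": ["star wars", "star tours"],
     "popularity": 0.8, "wait_minutes": 40},
    {"name": "Star Wars: A Galactic Spectacular", "keywords": ["star wars", "fireworks"],
     "popularity": 0.7, "wait_minutes": 0},
]

MAX_LISTED = 3  # above this, answering for every match stops being usable speech


def respond_to_wait_time_request(query: str) -> str:
    """Return a spoken response for a possibly ambiguous wait-time request."""
    q = query.lower()
    matches = [a for a in ATTRACTIONS
               if q in a["name"].lower() or any(q in k or k in q for k in a["keywords"])]

    if not matches:
        return "I couldn't find an attraction by that name."

    if len(matches) == 1:
        a = matches[0]
        return f"The wait for {a['name']} is about {a['wait_minutes']} minutes."

    if len(matches) <= MAX_LISTED:
        # Few enough matches: answer for all of them, most popular first,
        # rather than asking a question the customer may not know how to answer.
        ordered = sorted(matches, key=lambda a: a["popularity"], reverse=True)
        parts = [f"{a['name']} is a {a['wait_minutes']} minute wait" for a in ordered]
        return "Here's what I found. " + ". ".join(parts) + "."

    # Too many matches to read out: now asking for clarification is the lesser evil.
    return "There are several attractions that match. Can you be more specific?"


print(respond_to_wait_time_request("Star Wars"))
```

The important part isn't the matching logic; it's that the rules for when to answer, when to list, and when to ask live somewhere explicit and can be tuned as you learn how customers actually phrase things.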

Another UX consideration is your tone. Are you whimsical, are you snarky, are you totally buttoned up? As you structure these conversations, are you going to try to convey some level of personality? If you're a bank, you probably don't want to be cracking jokes about your customer being broke. That brand personality is built into the user experience as well.

Middle Tier Services

The second part of getting the most out of these applications is having middle tier services specific to these types of interactions. In the Star Wars case above, our client might already have the endpoints for the wait times. However, if you decide to deliver all three to the user because you aren't sure which one they're asking for, you need an endpoint that calls those individual ones, puts the results together, and orchestrates them into a single response. Even if you have all the endpoints you think you need to satisfy the data requirements of the conversation or skill you're trying to establish, you probably still need additional services to handle context and orchestration: the glue that makes the conversation actually work. It's not a heavy lift, but those requests need to be processed somewhere, and these services are often overlooked.
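Here is a minimal sketch of what such an orchestration service might look like. The URLs, payload shape, and the get_combined_wait_times function are hypothetical; a real version would add error handling, caching, and timeouts tuned to the voice platform's response limits.

```python
# Hypothetical middle-tier orchestration service: fans out to the individual
# wait-time endpoints, then folds the results into one payload the voice layer
# can turn into a single spoken response. URLs and payload shapes are invented.

import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

WAIT_TIME_ENDPOINTS = {
    "Star Wars Launch Bay": "https://api.example.com/wait-times/launch-bay",
    "Star Tours: The Adventures Continue": "https://api.example.com/wait-times/star-tours",
    "Star Wars: A Galactic Spectacular": "https://api.example.com/wait-times/galactic-spectacular",
}


def fetch_wait_time(name: str, url: str) -> dict:
    """Call one underlying endpoint; assumed to return JSON like {"minutes": 25}."""
    with urlopen(url, timeout=2) as resp:
        return {"name": name, "minutes": json.load(resp)["minutes"]}


def get_combined_wait_times() -> dict:
    """Fan out to every endpoint in parallel and compose one response."""
    with ThreadPoolExecutor(max_workers=len(WAIT_TIME_ENDPOINTS)) as pool:
        results = list(pool.map(lambda item: fetch_wait_time(*item),
                                WAIT_TIME_ENDPOINTS.items()))
    speech = ". ".join(f"{r['name']} is a {r['minutes']} minute wait" for r in results)
    return {"attractions": results, "speech": speech + "."}
```

The point is that the voice front end asks one question and gets one answer back; the fan-out, aggregation, and phrasing decisions live in this middle tier rather than in the skill code or the individual data services.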

Trailing Context

The third important consideration for a robust NLP experience is something I like to call 'Trailing Context' or conversational context.

If you ask your device, "How much was my energy bill this month?" and the device replies with a crazy high number, and you say, "Wow. Why?", your device wouldn't know what you were talking about. And because Alexa is stateless by default, that's how it is. So, if you want a "Wow, why?" follow-up to work, your code has to look at and be aware of several things:

  • What's the last thing the user was talking about?
  • What did we just tell them?
  • What's the state of this conversation?

This is something that is much more difficult than it would appear on the surface. Imagine this example:

User: 'How long is the wait time for Star Wars Launch Bay?'
Device: 'Star Wars Launch Bay is currently closed.'
User: 'Do you know when it's going to be back up?'
Device: 'When what will be up?'

If you were having that conversation with a person, the inference would be intuitive. So the end goal of getting to that level of conversation with your device is getting it to instinctively track trailing context.
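One common way to approximate trailing context is to stash the last subject and the last answer in session state, and consult it when a follow-up arrives. The sketch below is a simplified, hypothetical illustration of that idea rather than the actual Alexa Skills Kit API: the session dictionary stands in for whatever session-attribute mechanism your platform provides, and the data-layer calls are stubbed out.

```python
# Simplified sketch of trailing context: remember the last subject and the last
# answer so a follow-up like "when is it going to be back up?" can be resolved.
# The `session` dict stands in for your platform's session attributes, and the
# data-layer functions are hypothetical stubs.

def get_status(attraction: str):
    """Hypothetical data-layer call; returns 'closed' or a wait in minutes."""
    return "closed"


def get_reopening_time(attraction: str) -> str:
    """Hypothetical data-layer call; returns an expected reopening time."""
    return "2:00 PM"


def handle_wait_time(session: dict, attraction: str) -> str:
    status = get_status(attraction)
    # Record what we were just talking about and what we just told the user.
    session["last_subject"] = attraction
    session["last_answer"] = status
    if status == "closed":
        return f"{attraction} is currently closed."
    return f"The wait for {attraction} is about {status} minutes."


def handle_followup_when_open(session: dict) -> str:
    subject = session.get("last_subject")
    if subject is None:
        # No trailing context to lean on; ask rather than guess.
        return "Sorry, which attraction do you mean?"
    return f"{subject} is expected to reopen at {get_reopening_time(subject)}."


session = {}
print(handle_wait_time(session, "Star Wars Launch Bay"))  # "...is currently closed."
print(handle_followup_when_open(session))                 # resolves "it" from the context
```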

Ultimately, the real issue is that because these devices simulate human speech, people are going to speak to them as if they were other humans, and the machines aren't necessarily capable of that. Most of the time, when they do fail, you devolve into caveman speak: instead of saying, "Hey Siri, call my wife on her cell phone," you say, "Call wife cell," because you're trying to reduce the error points. That's not a very good or conversational experience.

Machine Learning

The last piece of the puzzle is machine learning. Machine learning has two aspects here, both of which serve the same function: improving conversational flow over time, based on what we learn about these conversations themselves. If I ask my device to call Steve Rowling (rhymes with bowling), it might come back saying, "Calling Steve RAO-ling." The device was smart enough to know who I meant, but not smart enough to correct the pronunciation.

Currently, there's no machine learning that recognizes that the way I pronounce a name may be different from the way the machine pronounces it. So what's the math behind that? Maybe I'm the one mispronouncing it. It's a tough question to answer. You could start with a micro level of machine learning: when the user pronounces something differently from the way it's programmed, simply match the user's speech.

That's at the user level. The second version is using machine learning more broadly. Think about looking for trends across every user, so that the 50th person who asks for Steve Rowling gets it pronounced correctly right off the bat. That way, new users don't have to train the smart assistant themselves; the device has already been trained by everybody else, and incoming users benefit from the pain the rest of us went through. In the case of the Star Wars attractions, if we asked for clarification and got the same answer 90% of the time, the machine can learn to give that response first.
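Here is a toy sketch of those two levels, under the assumption that you simply count corrections. The threshold, defaults, and data structures are invented; a real system would sit on top of actual usage data and a speech layer that supports per-word pronunciation overrides.

```python
# Toy sketch of the two levels of learning described above: a per-user override
# (micro) and a crowd-sourced pronunciation adopted once enough users have
# corrected the default (macro). The threshold and defaults are invented.

from collections import Counter, defaultdict

DEFAULT_PRONUNCIATIONS = {"Rowling": "RAO-ling"}  # what the device says today
ADOPTION_THRESHOLD = 25                           # corrections before a global switch

user_overrides = defaultdict(dict)        # micro level: user_id -> {word: pronunciation}
correction_counts = defaultdict(Counter)  # macro level: word -> Counter of pronunciations


def record_correction(user_id: str, word: str, heard_pronunciation: str) -> None:
    """Store how this user actually says the word, and count it across all users."""
    user_overrides[user_id][word] = heard_pronunciation
    correction_counts[word][heard_pronunciation] += 1


def pronunciation_for(user_id: str, word: str) -> str:
    """Prefer the user's own pronunciation, then the crowd's, then the default."""
    if word in user_overrides[user_id]:
        return user_overrides[user_id][word]
    if correction_counts[word]:
        best, count = correction_counts[word].most_common(1)[0]
        if count >= ADOPTION_THRESHOLD:
            return best  # the 50th caller benefits from everyone else's corrections
    return DEFAULT_PRONUNCIATIONS.get(word, word)
```

The same pattern applies to the clarification example: count which answer users actually pick, and once one answer dominates, lead with it instead of asking.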

Putting It All Together

When you're examining NLP systems, think about how user experience, middle tier services, trailing context, and machine learning work together to create a robust solution. It's not just about having a conversation with a device. It's about how you integrate across systems, how you think through the heuristics, and how you anticipate the user's responses.

This is where you can set yourself apart. Anyone can make something that'll parrot back what you ask for, but can they do all the underlying things? Can they make the experience seamless, enjoyable, and, most importantly, intuitive? Because that's where these solutions really come alive.