A great Alexa experience is voice-first, but not voice-only. Customers should be able to converse with your skill without a visual reference, but when a screen is available, the visual reference should assist the customer and enhance the experience. Great skills also present an experience that is relevant and useful to customers who are on the go (for example, if they’re using wearables like Echo Buds or their phone). Your conversation design should take advantage of all the modalities available on Alexa devices to present a complete experience.
Checklist for creating great multimodal experiences
▢ A customer should be able to use the core functions of the skill without needing a screen
▢ Each touch (or remote) target on screen should have a voice command analog (not every voice command needs an on-screen target)
▢ Leverage touch interactions where voice may not be ideal (such as scrolling through long lists)
▢ Differentiate the voice responses you surface to devices with screens from those for voice-only devices
▢ The core voice functions of the skill should have corresponding touch interactions
▢ Offer on-screen shortcuts or hints to expedite the experience for tasks the customer will complete most often
▢ Punctuate the most significant “moments” in the skill with useful, delightful on-screen interactivity
▢ Clearly display important, contextually relevant data (such as the score of a game, answer to a question, etc.) when the customer most needs it
▢ Offer a variety of imagery based on customer history, seasonality, or other dynamic context on static screens the customer will see often
While a multimodal voice experience can engage customers in ways voice alone cannot, it should always complement a voice experience. The Alexa screen should not be used as an app, but rather a means to display information that is easier to present visually and to accomplish tasks that are more easily done with touch. Customers may reach out and touch the screen when they’re not sure what to say, or when they want to browse more options, or when they think using touch might be faster than navigating with their voice.
Additionally, when your customer has a screen, there may be instances where you can shorten the VUI response because the customer can see the additional detail on screen. For example, if your skill allows a customer to browse a library of playable content, your VUI for devices with a screen might omit some details that are important to include in a VUI for customers who do not have a screen. That might sound like the following:
Customer with a screen:
Alexa: Welcome to Meditation Daily. You might like “Finding Inner Peace.” Want to try it?
Customer without a screen:
Alexa: Welcome to Meditation Daily. You might like “Finding Inner Peace,” a 5-minute meditation about mindfulness, from Alexis Strong. Want to try it?
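The branching behind these two prompts can be sketched in code. In the standard Alexa request JSON, display-capable devices report an `Alexa.Presentation.APL` entry under `context.System.device.supportedInterfaces`; the helper functions below are hypothetical illustrations, not part of any SDK:

```python
# Sketch: choose a short or detailed spoken prompt based on screen support.
# Helper names are hypothetical; the envelope shape follows the standard
# Alexa request JSON.

def has_screen(request_envelope: dict) -> bool:
    """Return True if the device reports a display interface (APL)."""
    interfaces = (
        request_envelope.get("context", {})
        .get("System", {})
        .get("device", {})
        .get("supportedInterfaces", {})
    )
    return "Alexa.Presentation.APL" in interfaces

def welcome_prompt(request_envelope: dict) -> str:
    """Shorten the spoken response when the screen carries the details."""
    if has_screen(request_envelope):
        # The screen already shows the duration, topic, and author.
        return ('Welcome to Meditation Daily. You might like '
                '"Finding Inner Peace." Want to try it?')
    # Voice-only: speak the details the screen would otherwise show.
    return ('Welcome to Meditation Daily. You might like "Finding Inner '
            'Peace," a 5-minute meditation about mindfulness, from '
            'Alexis Strong. Want to try it?')
```

Keeping the detailed copy in one place like this makes it easy to audit that the screen variant omits only information the visual layer actually displays.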
Consider which screens your customer will see most often and determine how the screen could either expedite those steps or create delight. Leverage touch in your experiences where it provides the most value to customers. For example, the core experience of the above hypothetical meditation skill is browsing meditations, so we might include touch targets that represent categories of meditations like “short,” “long,” “stress,” “anxiety,” and more so customers can filter a very large catalog of content (and the labels of those touch targets should also be speakable). Also consider which moments in the customer experience will be most memorable, significant, or climactic, and use the screen to enhance them. For example, if our customer just got a high score in a trivia game, we want to celebrate on-screen!
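One way to guarantee that every on-screen label is also speakable is to route both input modalities into the same filter logic. A minimal sketch, with a hypothetical catalog and category labels:

```python
# Sketch: touch targets and voice utterances share one filter handler,
# so each on-screen category label doubles as a speakable command.
# Catalog entries and tags are invented for illustration.

CATALOG = [
    {"title": "Finding Inner Peace", "minutes": 5, "tags": {"short", "mindfulness"}},
    {"title": "Deep Rest", "minutes": 30, "tags": {"long", "stress"}},
    {"title": "Calm Commute", "minutes": 10, "tags": {"short", "anxiety"}},
]

def filter_by_category(category: str) -> list:
    """Return catalog entries tagged with the given category label."""
    label = category.strip().lower()
    return [m for m in CATALOG if label in m["tags"]]

def on_touch(button_label: str) -> list:
    # Tapping the "short" button on screen...
    return filter_by_category(button_label)

def on_voice(slot_value: str) -> list:
    # ...and saying "show me short meditations" resolve identically.
    return filter_by_category(slot_value)
```

Because both paths normalize to the same handler, adding a new category to the screen automatically keeps it reachable by voice.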
Learn more about creating great visual experiences for screens in the Introduction to Multimodal Design.