Voice is fast becoming a common user interface. We see customers of all ages using their voice to play games, get the latest news, and control their smart home devices. We find that people embrace voice-first user interfaces because they are natural, conversational, and user-centric.
A great voice experience allows for the many ways people have conversations; a truly conversational experience lets people express meaning and intent, which sets it apart from any type of human-computer interaction we’ve known.
Since there are no graphical user interfaces (GUIs) with voice-first interactions, designers and developers have to reimagine the entire experience from a voice-first perspective. As you can imagine, this requires us to think differently about the design process. Developers looking to build for voice can start by embracing a set of design principles that are unique to voice-first interactions.
Over the next few weeks, we’ll introduce you to four of these principles. Today’s post will dive into the concept of adaptability and how you can build your voice-first interaction in a way that enables users to speak to Alexa in their own words.
User experience designers work hard to make things easy to learn. But at the end of the day, it is the users who must learn, and often change their habits, to use the technology. Think about how rarely smartphone users switch from one type of phone to another, or how most people use a QWERTY keyboard rather than a Dvorak keyboard. They’ve had to learn the UI and don’t want to learn another from scratch.
When you’re building experiences for mobile or the web, you present a single UI to all users. You, the designer, pick things like text, icons, controls, and layout. Users then apply their own internal pattern matching and natural-language-understanding (NLU) engine, their brain, to make sense of the UI you’ve laid out and act accordingly. A simple but common example is the “OK” button on an app or site. Before clicking the button, users scan the whole page looking for what they want to do. In that moment, they apply their own language-understanding criteria and determine that “OK” is the highest-confidence match for what they want to do. Once they click, the OK function is called.
In voice-first interactions, the OK function still exists, but it is called an intent: what the user wants to accomplish. The difference is that in voice there is no screen of options for the user to scan and no button labeled “OK” to guide them. Users lose that navigational guidance, but they also don’t have to bend to the technology.
When we speak, we likely won’t say “OK” every time, or perhaps at all; we might instead say “next,” “now what,” “yep,” “let’s do it,” “sounds good,” “got it,” “super,” “roger that,” and so on. In voice, these phrases are called utterances, and a conversational UI will accommodate all of them. Through automatic speech recognition (ASR) and natural language understanding, voice services like Alexa resolve them to an “OK” intent.
The key difference with voice is that Alexa is providing the NLU, not the user. That means the designer’s job shifts from picking the one label that works best for everyone to providing a range of utterances to train the NLU engine. The technology must bend to the user. That’s because we all have our unique style of speech, and without visual guidelines to constrain us into uniformity, our inputs will vary greatly.
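To make that concrete, here is a minimal sketch in Python. It is illustrative only: the intent name, the sample list, and the resolve_intent function are hypothetical and are not the Alexa interaction-model format or the Alexa Skills Kit SDK. The designer’s contribution is the list of sample utterances; the toy exact-match resolver merely stands in for a real NLU engine, which generalizes far beyond exact matches.

```python
# Illustrative only: a hypothetical confirmation intent with the kinds of
# sample utterances a designer might supply so they all resolve to "OK".
CONFIRM_INTENT = {
    "name": "ConfirmIntent",
    "samples": [
        "ok", "next", "now what", "yep", "let's do it",
        "sounds good", "got it", "super", "roger that",
    ],
}

def resolve_intent(utterance, intents):
    """Toy resolver: return the name of the first intent whose samples
    contain the utterance. A real NLU engine generalizes well beyond
    this kind of exact matching."""
    normalized = utterance.strip().lower()
    for intent in intents:
        if normalized in (sample.lower() for sample in intent["samples"]):
            return intent["name"]
    return None

print(resolve_intent("Roger that", [CONFIRM_INTENT]))  # -> ConfirmIntent
```

The matching logic isn’t the point; real voice services handle that for you. The designer’s side of the contract is supplying enough varied sample utterances that the NLU can meet users where they are.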
This isn’t to say that voice-first UIs can’t provide clues and guidance for the user. For example, a voice-first UI may share a list of things users can say to help them get started. However, the experience should not feel limiting or predestined; one of the strengths of a voice experience is its conversational nature and its ability to accommodate human tendencies such as emotion, personality, and variety. People should be able to talk simply as they do in everyday life; they should be able to say, “Please turn off the lights,” “Turn off the lights,” or “Good night.” This means that, unlike a single graphical user interface, a voice-first UI needs to account for the many paths a user may take to reach the same destination. While a GUI’s ambition is to let customers learn the interface once and use it many times over, a voice-first UI’s ambition is to make it so customers never have to learn the UI at all, because it lets them set the course.
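Continuing the same hypothetical sketch, those phrasings could all be listed as samples for a made-up LightsOffIntent, and every one of them would land on the same handler: many conversational paths, one destination. Again, none of these names come from the Alexa Skills Kit; they are assumptions for illustration.

```python
# Illustrative only: several different phrasings resolve to one hypothetical
# intent, and that intent maps to a single handler.
UTTERANCE_TO_INTENT = {
    "please turn off the lights": "LightsOffIntent",
    "turn off the lights": "LightsOffIntent",
    "good night": "LightsOffIntent",
}

def handle_lights_off():
    # A real skill would call a smart-home service here; this sketch
    # just returns the spoken confirmation.
    return "Okay, turning off the lights."

HANDLERS = {"LightsOffIntent": handle_lights_off}

for phrase in ("Please turn off the lights", "Turn off the lights", "Good night"):
    intent_name = UTTERANCE_TO_INTENT[phrase.lower()]
    print(f"{phrase!r} -> {intent_name} -> {HANDLERS[intent_name]()}")
```

However the user chooses to phrase the request, the destination is the same; the variety lives in the sample utterances, not in extra application logic.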
Explore the Amazon Alexa Voice Design Guide to learn more about the fundamental design principles to build rich and compelling voice experiences. If you’re a graphical UI designer or developer, download the guide 4 Essential Design Patterns for Building Engaging Voice-First User Interface. Or watch our on-demand webinar and recorded Twitch stream on how building for voice differs from building for the screen.