We’re off to see the wizard …!

With the recent proliferation of Voice User Interfaces (VUIs) in our homes such as Amazon Echo and Google Home, it seems that VUIs are fast on their way to becoming a mainstream consumer technology.

Voice is a new paradigm for us UX’ers, and as we’re interested both in voice as a new technology and in communication in general, we wanted to learn more about how to go about creating and testing voice experiences. World Usability Day, or WUD (our annual open house event where we invite clients to visit us, try new technologies and discuss all things UX), presented a great opportunity to do just that.

My colleague Gayle Whittaker and I ran a voice usability testing session with the objective of exploring how to set up and run voice usability testing sessions. We ran this throughout the day to give clients the opportunity to participate in and observe voice usability testing. After each session we had some interesting discussions with our clients about what they think of voice apps in general and how voice experiences could fit into their organisations.

So what did we do? We created ‘Amy’, a hotel voice concierge app…

Wizard of Oz discussion

Creating the test plan for Voice App testing

During WUD, one of our colleagues, Steve Fullerton, was demonstrating how to build a skill for Alexa. He created his skill based around a hotel concierge app, so we decided to follow this theme when running our voice usability test.  We built some test flows around this context and created ‘Amy’, “the helpful concierge voice app”.

To give our audience a bit of background to this research, we set up the following context.

Objectives of test – We wanted to better understand:

  • (informational search intent) – can participants find out specific hotel information such as: breakfast timings, check-out etc.?
  • (booking intent) – can participants book an extended check-in easily and quickly?
  • (recovery from error) – how does the system recover from errors?
  • What language/terminology and turn of phrase do participants use?
  • What is participants’ reaction to ‘Amy’ and how do they describe her? (We were aiming for professional and helpful).

Overall:  we wanted to understand if participants were satisfied with the information they received, and if they felt positive or negative about the voice experience they had engaged with.

The wonderful world of the Wizard of Oz

WOZ it all about?

First off, what is Wizard of Oz testing?  Unlike traditional user testing, where we use a prototype or wireframe to test concepts and ideas, in voice testing we prepare sample dialogue flows (see below) and pre-prepared audio files to test our app.  In WOZ testing, participants interact with a voice user interface such as Alexa, which they believe to be ‘live’ but which is actually being operated by an unseen researcher in another room: ‘the Wizard’.

Our voice testing process followed these steps:

  1. Understanding the main use cases/tasks and user intents (we made assumptions about these, but normally we would base them on analytics).
  2. Constructing sample dialogue flows for key tasks (see below).
  3. Deciding how we wanted ‘Amy’ to respond and creating audio files of system responses for each step, including responses for correcting conversations and error recovery. (To create the audio files, we used Amazon Polly, a text-to-speech service; I put the different audio clips together using a free soundboard app called Soundboard.)
  4. Creating a moderator script, including scenarios for testing and metrics for gathering feedback.
  5. Pairing a mobile phone with an Alexa speaker through Bluetooth. This meant that during the test, participants engaged with the Alexa speaker, but I was actually controlling the voice responses from the soundboard app on my phone (I was the ‘Wizard’).
  6. Running the WOZ session!
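The audio-generation step can even be scripted. Below is a minimal sketch, assuming the AWS SDK for Python (boto3); the prompt names and texts are invented for illustration, not the ones from our test. Conveniently, Polly happens to offer a British English voice that is literally called ‘Amy’.

```python
# Illustrative sketch: batch-generating system-response audio with Amazon Polly.
# PROMPTS is a made-up example set, not our real test prompts.

PROMPTS = {
    "welcome": "Hi, I'm Amy, the hotel concierge. How can I help?",
    "breakfast": "Breakfast is served from 7 until 10 in the main restaurant.",
    "repair": "Sorry, I didn't catch that. You can ask me about breakfast, "
              "check-out times, or booking an extended check-in.",
}

def synthesize_prompts(prompts, out_dir="audio"):
    """Render each prompt to an MP3 file via Polly (requires AWS credentials)."""
    import os
    import boto3  # imported here so the rest of the module works without AWS
    os.makedirs(out_dir, exist_ok=True)
    polly = boto3.client("polly")
    for name, text in prompts.items():
        resp = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId="Amy")
        with open(f"{out_dir}/{name}.mp3", "wb") as f:
            f.write(resp["AudioStream"].read())
```

Each resulting clip (welcome.mp3, repair.mp3, …) can then be loaded into a soundboard app for the ‘Wizard’ to trigger by hand.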

Note: normally you would have three people involved in this type of testing: the moderator, the participant and the ‘Wizard’ controlling system responses in another room.  However, for this session, Gayle and I ran the test in the same room with groups of WUD attendees, who could observe the session and then discuss how it went afterwards.

How do you build a conversation?

We started by considering key user tasks and mapped out the steps a user would take in each process. Then we created a couple of dialogues – potential pathways a user could take when undertaking that task. We didn’t know exactly how users would navigate through a task, so we mapped out the happy path first, then some alternative paths should the user decide to follow a different route.

Once we had created the flows, we listed the responses ‘Amy’ would need to make (system prompts) to guide users through each task.  It was also important to have some error or repair responses ready (see the Google paper on errors in conversation) to help users get back on track should they go off in a direction we had not anticipated.

Our dialogue flows became a decision tree outlining different potential routes users could take and helping us understand what system prompts and error responses we needed to have ready to test.
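As a concrete (and much simplified) illustration, such a decision tree could be sketched in code like this – the states, intents and prompt wording here are invented for the example, not our actual flows:

```python
# A simplified dialogue flow as a decision tree: each state has a system
# prompt plus a map from anticipated user intents to the next state. Anything
# unanticipated falls through to a repair prompt that re-signposts the options.

FLOW = {
    "welcome": {
        "prompt": "Hi, I'm Amy, the hotel concierge. You can ask me about "
                  "breakfast, check-out times, or booking an extended check-in.",
        "next": {"breakfast": "breakfast_info", "extended_checkin": "ask_room"},
    },
    "breakfast_info": {
        "prompt": "Breakfast is served from 7 until 10. Anything else?",
        "next": {"no": "goodbye"},
    },
    "ask_room": {
        "prompt": "Of course. Which room is that for?",
        "next": {"room_given": "confirm"},
    },
    "confirm": {"prompt": "All booked! Anything else?", "next": {"no": "goodbye"}},
    "goodbye": {"prompt": "Thanks for chatting. Goodbye!", "next": {}},
}

REPAIR = ("Sorry, I didn't catch that. You can ask me about breakfast, "
          "check-out times, or booking an extended check-in.")

def respond(state, intent):
    """Return (next_state, prompt); unanticipated intents trigger a repair."""
    next_state = FLOW[state]["next"].get(intent)
    if next_state is None:
        return state, REPAIR  # stay put and re-signpost what the user can ask
    return next_state, FLOW[next_state]["prompt"]
```

In our test the ‘Wizard’ played the prompts by hand from the soundboard, of course; a tree like this simply makes explicit which system prompts and repair responses need to be ready before the session.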

How did we get on? – “Pure magic”

Our intention with this session was to run through the process of setting up a voice usability test.  However, we also got some interesting feedback, not only on ‘Amy’ but on voice usability testing itself. For example:

  • People who were used to dealing with voice apps spoke to Amy differently from those who hadn’t engaged in a voice experience before.

People who were not used to voice experiences tended to speak over and interrupt Amy when she was relaying a response. However, those who had used Alexa or Google Home listened more and responded to Amy in consistent ways.  Their previous experience with VUIs had obviously ‘trained’ them somewhat in how to go about conversing with Amy.

  • The importance of understanding the words the users will use and HOW they use them.

Often, one of the key objectives of a voice usability test is to better understand not only the typical words users say, but also the way they use those words. There are so many variations in the way people ask for things, and we saw this when participants engaged with our testing. This would be important to understand if we actually wanted to build ‘Amy’: we would need to ensure she recognises and matches those words.

  • The cultural aspect of language and the importance of localisation.

We understood that participants may use different words when communicating with ‘Amy’, but one thing we didn’t anticipate was the variation in words participants used in general conversation.  For example, one of our Glaswegian participants used the word ‘magic’ as a positive response during the test (in Glasgow this means good or great). This highlighted to us that if we were creating ‘Amy’ for this audience we should make sure she recognises the word “magic” in this specific context.
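If ‘Amy’ were built as a real Alexa skill, this kind of localisation would live in the skill’s interaction model, where slot values can carry synonyms. A hypothetical fragment (the invocation name, slot type and values are invented for illustration) might look like this:

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "amy concierge",
      "types": [
        {
          "name": "AffirmationType",
          "values": [
            {
              "name": {
                "value": "yes",
                "synonyms": ["aye", "brilliant", "perfect", "magic"]
              }
            }
          ]
        }
      ]
    }
  }
}
```

Adding ‘magic’ as a synonym here would let the skill resolve our Glaswegian participant’s reply to a plain affirmative.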

  • The importance of leading the conversation and sign posting along the way.

As we know, in voice experiences there is no menu to guide the user through a task. We tried to build signposting into system responses to let users know what they could and couldn’t ask for – for example, letting them know from the get-go what they could or couldn’t do with the app.  In other tasks we built instructions into system responses to help guide the user through the flow.  This worked quite well.

  • What about ‘Amy’? Participants liked her and thought her professional and helpful (if a bit robotic).

We knew that ‘Amy’s’ voice would be an important representation of the hotel brand we were testing, so another key aspect of this testing was to get feedback on how participants felt about ‘Amy’ and her voice. We asked them to describe her after testing, and the adjectives they used were generally very positive – ‘professional’, ‘competent’, ‘serious’ etc.  They did, however, find her a bit robotic; one person described her as ‘Alexa’s cousin’.  Overall, though, they thought her voice appropriate for the context, which was a positive finding for our testing.

Final thoughts on usability testing of voice apps

It’s relatively easy to do!

As this was a new area for us, we were hesitant about how it would work, but once we had set up and run the test, we realised how easy it is to do.  Obviously, we still need to follow robust research processes and ensure we plan and run testing properly, but the process was very quick to set up. It would not be difficult to quickly test, then iterate and refine designs using this method.

Wizard of Oz works really well for voice!

Wizard of Oz is a useful technique for usability testing in general, but I believe it works particularly well for voice.  It really allowed us to test our system responses and understand whether they met user intentions – in particular, whether our error or repair messages worked.  It also gave us a better understanding not only of the words participants used, but of how the conversation flowed as a whole and where the gaps in our conversation design were.

Designing for human conversation is complicated!

While the test may be easy to set up and iterate on, what it really highlighted to us is the complexity of designing for conversations. We did lead participants in how to engage with Amy, which worked to some degree, but because we humans come pre-programmed with the power of speech, several participants went completely off script and Amy didn’t cope well.  The variability in the way people engaged with the system was really interesting, but it makes the design of the system more complicated.

 

It was certainly an interesting and fun experience and we look forward to learning more about designing for voice experiences.
