AI assistants: what's next for conversational UI?

Conversational User Interfaces aren’t actually very good at having a conversation yet. What can we do to get them there?

Movies and television have been featuring Conversational User Interfaces (CUIs) for around 50 years: HAL from 2001: A Space Odyssey (1968); Zen, the ship’s computer on Blake’s 7 (1978); WOPR, the military AI in WarGames (1983); Max, the alien spaceship in Flight of the Navigator (1986); and more recently, Samantha, the operating system in Her (2013). We’ve developed expectations of AIs that can confidently converse with their users, tricking us into anthropomorphising them more than we ever would a graphical user interface (GUI).


It’s not surprising that we currently have high expectations of CUIs. We expect them to be intelligent and insightful, with their own unique personalities; they should immediately understand what we want and help us achieve our goals. The current reality is disappointing. Even combined with machine learning, CUIs haven’t yet achieved this level of sophistication. Held up against Hollywood’s sci-fi luminaries, Google Home seems a glorified voice-activated search engine, and Amazon Echo has limited features and can still struggle to understand what we say to it.

Having worked with multiple CUI ecosystems, we have found a number of technological limitations that need to be overcome, and new design considerations that need to be fully understood, before CUIs start to deliver on the dream Hollywood has enticed us with.

Natural, Functional Conversations

Currently CUIs are mainly used for functional (rather than general) conversations where the purpose is to achieve a specific goal — imagine a conversation with someone in a call centre where they are helping you solve a problem.

What is it that makes the conversation with a human call centre worker flow more smoothly than the equivalent experience with a CUI? How can we improve CUIs to deliver the same experience?

Understanding Context

Dialogue does not exist in a vacuum. All conversations begin with a mix of background and contextual information, particularly the history and mood of the participants, including:

User history: A call centre worker can draw on a user’s history with their service to make their experience smoother and more effective. Information such as account details and previous actions allows them to make assumptions and anticipate user needs, creating more satisfying and delightful experiences. Most current CUIs, on the other hand, either don’t have access to previous interactions with the user or don’t utilise them well. This is exacerbated by CUIs (especially voice) having only a single, linear output channel: only one thing can be communicated to a user at a time, so the CUI has to choose a single assumption from a range of possibilities. Conversely, GUIs can display secondary and even tertiary information to the user, helping to signpost a variety of data and actions. This gives a GUI an advantage, as it doesn’t have to make as many assumptions and can display multiple suggested options to the user quickly.

User mindset: A lot of communication is non-verbal. Even on a call, a human adviser can pick up on the caller’s tone of voice to gauge their state of mind, level of confidence, and the urgency of their request. Current CUIs are only capable of analysing the words themselves, and cannot comprehend the intonation, stress, and rhythm with which they are delivered. Admittedly, this is a priority area of research for CUI developers, so this limitation will likely abate as the technology develops, but for now it is still hampering the adoption of CUIs.

Understanding User Intent

The concept of ‘intent’ is one that has been adopted within the vocabulary of CUIs. It refers to the key action that a user wants to make. CUIs face a couple of challenges with this:

Complex intentions: User research shows that within a functional conversation the initial intent is often just an introduction to a core intent. For example, if you are phoning your energy company to ask the value of your latest bill, chances are your core intent is to pay that bill. A GUI can handle this by simply displaying payment option buttons next to the bill value, but a CUI has to guess whether there is a further intent and decide whether to put a follow-up question to the user. Doing this naturally, without making a frustratingly wrong assumption, is a design challenge. CUIs need to find a balance between capturing information by asking the user questions and making assumptions to get the user what they want faster. If the assumption is correct, the transaction is smooth; if it is wrong, it creates a very negative experience.

Indirect questions: People aren’t always direct in how they make requests. Indirect requests for information are easily interpreted by a human listener, but a CUI needs to be designed to make the connection. For example, a person may ask a train conductor “Are there any seats available on the next train to London?” The direct answer to this question may be “Yes”, but the person’s real intent is to purchase a ticket rather than just gather information. Current CUIs struggle to spot these hidden intents, and instead force users to adapt their way of speaking to suit the interaction method by being unnaturally direct. This helps the user achieve their goals, but overall creates a very robotic and unsatisfying interaction.
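One way to picture the balance between asking and assuming is a simple confidence threshold. The sketch below is only an illustration, not how any particular assistant works: the intent names, the likelihood figures, and the next_step function are all hypothetical, and in practice the likelihoods would have to come from user research or usage data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowUp:
    intent: str        # the core intent that often follows the stated one
    likelihood: float  # assumed chance the user wants it (made-up figure)
    prompt: str        # the single follow-up question the CUI would ask


# Hypothetical table mapping a stated intent to its likely core intent.
FOLLOW_UPS = {
    "check_bill": FollowUp("pay_bill", 0.8, "Would you like to pay it now?"),
    "check_seats": FollowUp("buy_ticket", 0.9, "Shall I book you a ticket?"),
    "check_opening_hours": FollowUp("book_visit", 0.3, "Would you like to book a visit?"),
}


def next_step(stated_intent: str, offer_threshold: float = 0.6) -> Optional[str]:
    """Return a follow-up question, or None if the CUI should stay quiet."""
    follow_up = FOLLOW_UPS.get(stated_intent)
    if follow_up is None or follow_up.likelihood < offer_threshold:
        # Not confident enough: answer the question that was asked and stop.
        return None
    return follow_up.prompt


print(next_step("check_bill"))           # likely core intent, so offer it
print(next_step("check_opening_hours"))  # too uncertain, so say nothing more
```

Raising offer_threshold makes the assistant more cautious and less helpful; lowering it makes it more proactive, at the risk of the frustratingly wrong assumption described above.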

Gathering Information

Analysing, absorbing, and reacting to information comes naturally to humans — we can understand a mix of different data formats even in jumbled order. Despite the huge advances made in recent years, natural language processing still lags behind in this area:

Multi-faceted data: A big challenge, particularly for the likes of the Amazon Echo, is piecing together complex and multi-faceted information from a user in order to understand and fulfil their intent. Types of information such as places, dates, and times are all quickly recognised and understood by a human listener, but a CUI has to dissect the request and analyse each part to determine what type of information it’s being presented with — this creates multiple opportunities for errors and misunderstandings.

Requests can have multiple similar types of data.
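To make the problem concrete, here is a minimal sketch of pulling several pieces of data out of one request, including two values of the same type (two places). It is a deliberately naive, assumption-laden example: the slot names, the patterns, and the extract_slots function are hypothetical, and real natural language processing is far more involved than a handful of regular expressions.

```python
import re

# Hypothetical slot patterns for a train-booking request. The preposition in
# front of each value is what tells two similar values (two places) apart.
# Single-word, capitalised place names only, which is a big simplification.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom ([A-Z][a-z]+)"),
    "destination": re.compile(r"\bto ([A-Z][a-z]+)"),
    "day": re.compile(r"\bon the (\d{1,2}(?:st|nd|rd|th))\b"),
    "time": re.compile(r"\bat (\d{1,2}(?::\d{2})?\s?(?:am|pm))\b"),
}


def extract_slots(utterance):
    """Return (slots found, names of slots the CUI still needs to ask for)."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    missing = [name for name in SLOT_PATTERNS if name not in slots]
    return slots, missing


slots, missing = extract_slots("I need a train from Leeds to London on the 20th at 9am")
print(slots)
print(missing)
```

Every pattern is a separate opportunity for a misunderstanding: a request phrased as “London, leaving Leeds” would defeat this sketch entirely.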


Checking and amending data: Interrupting and correcting information mid-conversation can be a simple, polite, and quick interaction in a human-to-human conversation, but can be a nightmare for a CUI. For example, interrupting to say “Sorry, I meant the 21st not the 20th” is easily handled by a person, but can become lengthy and frustrating with a CUI, and often results in having to begin the interaction again. Ideally there should be a process flow for the CUI to accept the interruption, determine the error, and amend its response. This process needs careful design, and we can expect these kinds of conversations to be very stilted and formal when compared with human interaction for a good while yet.
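The accept, determine, amend flow could be sketched roughly as follows. This is an illustration under stated assumptions rather than a production approach: the slot dictionary, the single correction phrasing it recognises, and the apply_correction function are all hypothetical.

```python
import re

# Accept the interruption: recognise one correction phrasing,
# "I meant (the) X not (the) Y". Real users phrase this in many more ways.
CORRECTION = re.compile(
    r"\bI meant (?:the )?(?P<new>\w+),? not (?:the )?(?P<old>\w+)",
    re.IGNORECASE,
)


def apply_correction(slots, utterance):
    """Return (amended slots, name of the amended slot or None)."""
    match = CORRECTION.search(utterance)
    if match is None:
        return slots, None  # not a correction this sketch understands

    # Determine the error: find which captured value the user is replacing.
    candidates = [name for name, value in slots.items() if value == match["old"]]
    if len(candidates) != 1:
        return slots, None  # nothing or several things match, so ask instead

    # Amend the data without restarting the whole interaction.
    name = candidates[0]
    return {**slots, name: match["new"]}, name


booking = {"destination": "London", "day": "20th", "time": "9am"}
amended, changed = apply_correction(booking, "Sorry, I meant the 21st not the 20th")
print(amended, changed)
```

Returning the name of the amended slot lets the CUI confirm only what changed (“OK, the 21st”) rather than reading the whole booking back.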

Life Can Get in the Way

Voice assistant-backed devices such as Google Home and Amazon Echo are seeing more successful and enduring engagement than Siri has had in the past, mainly because people are more comfortable speaking to a device in the privacy of their own home than in public. However, there are still challenges with CUIs in the home:


Interruptions: Most people will have experienced being on the phone and briefly interrupting the conversation to deal with something that’s happening in front of them, whether that’s the doorbell, an upset child, or a loud siren that makes it impossible to hear. This is obviously less of a challenge for text-based chatbots, but when a voice assistant is actively listening, any interruption can cause it to misunderstand or cancel the interaction, leaving the frustrated user to restart from the beginning.

Multiple speakers: There are many amusing anecdotes online of the Amazon Echo obeying the requests of children, or even the television. Authenticating who is speaking before proceeding with a request isn’t currently a feature within any major voice assistant, which poses some challenges for security and confidence in the product, although again this is an area of active and promising research.

Delivering Information

The more complex the information a CUI needs to communicate, the more consideration a designer needs to put into making this easily digestible.

Transience of speech: It’s very difficult for people to digest and comprehend complex information by ear. Asking for all or part of the information to be repeated is easy when speaking to a person, and visual displays let us process and digest pieces of information at our own pace on a GUI. If we get distracted or forget a CUI’s response, we face a possibly lengthy process of multiple repeats before fully grasping everything it is trying to communicate.


Format of speech: It’s an interesting challenge to design and program conversational responses that not only feel natural and friendly, but also communicate information well. When a person speaks, the information they give may not be delivered in order of priority, but priority can be indicated by natural stresses and emphasis. This is difficult to replicate with CUIs, as they cannot easily modulate their voices to emphasise words and syllables. Instead, syntax needs to be used to emphasise the important parts of the response, again creating robotic-feeling interactions. This is further complicated by the frequent need to repeat parts of the request back to the user to confirm that the CUI has heard, and is responding to, the right question.

The Future of Conversational Interfaces

There are some big limitations for CUIs to overcome before they live up to our Hollywood dreams; the design challenge for us is to find ways to work around these limitations and adapt our solutions to the medium as it stands. For CUIs, even more so than on web or mobile, we must understand user mental models, behaviour, and context to create a satisfying experience.

Despite the challenges, the development of CUIs is moving fast. We’ll likely see many developments in the near future that will improve the experience they offer:

Visual displays: Amazon has already launched an Echo with a visual display. As the functionality of Alexa skills advances it is likely this will become a useful supporting feature to voice interaction.

User recognition: Voice ID technology has just been launched by HSBC on their telephone banking service. As more skills and applications are launched, more security features will be expected by users, and with multiple family members using a device, user recognition features will become necessary for better personalisation.

Natural responses and improved empathy: There is a lot of research being carried out to further our understanding of how emotion affects our voices. When this knowledge is applied to voice services and devices we can expect to see them become more responsive to our moods and un-vocalised intentions, resulting in a rise in user interest and engagement.

The future of Conversational User Interfaces is still unclear, but it’s obvious that the opportunities this technology offers should not be underestimated.

CUIs are growing in popularity and are set to become a breakout technology this year. We have been working on flagship CUIs for a while and have come to understand the nuances and challenges of creating for this new medium. If you are looking to explore the opportunities for conversational user interfaces in your business, get in touch to see how we can help.


Planning Lead
