AR promises to be different. With it, we can fully bridge the gap between the worlds of bits and atoms. Our reality becomes electronic, and with that, it becomes computable and manipulable. But current UX practices aren’t ready for this. What does a cursor look like in the real world? How do you interact with an object physically and digitally at the same time?
We are going to explore what makes AR UX different from our current standard for screens. Then, we’ll discuss ways we can make excellent UX for AR.
User eXperience (UX) is what the user sees and feels while using a product. Users won’t use boring, unintuitive, complicated products if given the choice. This is understood by pretty much every tech company. Well, almost every tech company (looking at you, Workday).
AR companies are, and should be, focused on hardware and overcoming physical constraints. There’s still lots of work to be done in batteries, displays, and heat sinks before AR glasses are comfortable, powerful, and safe. However, it’s unlikely that only one company will figure this out[^1]. At that point, we can expect multiple companies to release AR glasses within a year of each other. Then, good UX becomes table stakes: among similar options, users will choose the AR glasses with better UX.
In fact, bad UX can hurt products even with no competition. Google Glass was the only product of its kind in 2013. Yet it flopped spectacularly. Among other factors, it had horrendous UX — users had to strain their eyes upwards to even see anything. And good luck navigating through screens.
So, getting AR UX right will be crucial to users, who want easy products, and to first-party (1P) companies, who want to win the market. Additionally, good UX standards baked into the OS and hardware make it so much easier for third-party (3P) devs to build good experiences on the devices.
However, getting AR UX right will take a lot of experimentation and work. We can’t just port over our current practices because of many fundamental differences.
The real world is the screen. Physical safety is not guaranteed. Clicking a button on a phone screen will do no physical harm to you. But interacting with a physical object (say, a metal sign) could. Additionally, digital overlays may occlude real objects. Not seeing a car coming at you because a digital ad blocked it could be a matter of life and death.
The display isn’t touchable. Picture how smudgy and dirty AR glasses would get if people touched the lenses all day long. All current mobile UX assumes a touchscreen. Laptop/desktop UX doesn’t assume that, but its use cases are different from AR’s; AR will be used on the go.
Social stigma limits interaction models. Imagine seeing someone in public swinging their arms wildly and gesticulating to no one in particular. You would probably give them a wide berth or avoid them altogether. This is what people may think when they see you using hand gestures for AR glasses; social stigma limits the use of this interaction model.
Sparseness is mandatory. Imagine trying to even walk down the sidewalk if your entire field-of-view were covered with digital overlays. Facebook messages, Yelp recommendations, and TikTok videos could obstruct your vision. Alternatively, imagine you’re on a date. How annoying would it be if LinkedIn pulled up your date’s profile (or the profile of everyone in the bar) every 15 seconds or every time you glanced around? You’d hardly be able to focus, and before you know it, the date is over with no shot at a second one.
Current mobile UX assumes your phone is the main focus of your attention. There can be buttons and red notification bubbles everywhere. Every pixel is designed to catch and keep your attention. AR UX cannot make that assumption; people will use it throughout their day as a supplement to their main activities. So, information overlays must take as little space as possible and appear only when needed. Otherwise, users’ FOV will be cluttered and it’ll be harder to function in the real world.
Experiences must be predictive. This is tied closely to the above point. Information must come only when it’s needed. Mobile UX relies on users explicitly asking for information when they need it (e.g. through search). But AR has the benefit of context: all the sensors and cameras on the glasses are there to estimate context and predict what information the user wants at that time. However, getting these predictions right will be extraordinarily hard. If a user is looking at storefronts on a busy street, do they want information about the storefronts? If so, which one[^2]? Even for a particular one, do they want the menu and recommended items or do they want hours and reservations? We will need tons of contextual information (e.g. who you’re with, the time of day, your implicit preferences, etc.) to get predictive experiences right. We already struggle with this even on stationary, single-location devices such as Google Home. Solving this will be one of the greatest challenges the tech industry will ever face.
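To make this concrete, here’s a rough sketch of what a “should I show anything?” gate could look like. The signals, weights, and threshold are all made up for illustration; the point is just that an overlay appears only when enough contextual evidence lines up.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Hypothetical contextual signals the glasses might estimate."""
    gaze_dwell_s: float   # how long the user has looked at the storefront
    is_mealtime: bool     # derived from time of day
    is_with_others: bool  # e.g. detected companions
    visited_before: bool  # implicit preference signal

def relevance_score(ctx: Context) -> float:
    """Combine signals into a 0-1 confidence that the user wants info right now.
    The weights are invented; a real system would learn them."""
    score = 0.0
    score += min(ctx.gaze_dwell_s / 3.0, 1.0) * 0.5  # sustained gaze is the strongest cue
    score += 0.2 if ctx.is_mealtime else 0.0
    score += 0.1 if ctx.is_with_others else 0.0
    score += 0.2 if ctx.visited_before else 0.0
    return min(score, 1.0)

def should_show_overlay(ctx: Context, threshold: float = 0.7) -> bool:
    """Sparseness: only surface an overlay when confidence clears a high bar."""
    return relevance_score(ctx) >= threshold

# A quick glance at a restaurant you've never visited -> stay quiet.
print(should_show_overlay(Context(gaze_dwell_s=0.5, is_mealtime=True,
                                  is_with_others=False, visited_before=False)))  # False
```

A real system would learn these weights from behavior rather than hard-coding them, but the shape of the problem is the same: fuse noisy context into a confidence, and stay silent unless that confidence is high.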
So, we will need new interaction models for AR. For that, we must remind ourselves of the basics: the fundamental principles of UX and interaction.
Intuitive. A good UX should feel natural. A user should not have to think about navigating within an app/experience or about the mechanics of selecting an object. Unintuitive interactions are hard to remember, and interactions that are hard to remember don’t get repeat usage.
Efficient. Interactions shouldn’t take much energy to do. It would be tiring to have to swing your whole arm just to swipe left or right on your phone. The less energy an interaction needs, the more often you can do it. Additionally, you’re less likely to encounter social stigma with more efficient movements.
Sensical. The interaction movement must make sense for the action. Selecting an object shouldn’t require a swipe-down. This also helps counteract social stigma. Random, herky-jerky movements look crazy. Fluid points and swipes look like technology interactions.
Visual field == Interactive field. This isn’t a de facto standard, but I think it will be paramount for AR. It means that you can directly interact with what you see, using just your body. This doesn’t apply to computers: there, the visual field is the screen and the interactive field is the mouse/trackpad. They sit at roughly 90º to each other and need a cursor to map one onto the other. Imagine having to use a cursor to interact with the real world. That would break immersion pretty quickly.
There are just a few interactions that form the basis of the interaction space.
If we can nail these, then building complex interactions (such as zooms, tab switching, etc.) becomes easy, as the sketch below illustrates.
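As a toy illustration of that layering, here’s a sketch where a handful of primitive events compose into a pinch-to-zoom. The specific primitives (point, select, move, release) are just one guess at a basis set.

```python
from enum import Enum, auto

class Primitive(Enum):
    """Assumed primitive interaction events; the real basis set may differ."""
    POINT = auto()    # hover over a target
    SELECT = auto()   # grab / pinch down
    MOVE = auto()     # displacement while selected
    RELEASE = auto()  # let go

def detect_zoom(events: list[tuple[Primitive, float]]) -> float:
    """Compose primitives into a zoom: two SELECTs, then MOVEs whose values
    are the change in distance between the hands, then RELEASE.
    Returns the zoom factor (1.0 = no zoom)."""
    selected = 0
    spread = 0.0
    for primitive, value in events:
        if primitive is Primitive.SELECT:
            selected += 1
        elif primitive is Primitive.MOVE and selected == 2:
            spread += value
        elif primitive is Primitive.RELEASE:
            break
    return 1.0 + spread

# Two-hand pinch, hands spread apart, then release -> zoom in.
gesture = [(Primitive.SELECT, 0.0), (Primitive.SELECT, 0.0),
           (Primitive.MOVE, 0.25), (Primitive.MOVE, 0.15),
           (Primitive.RELEASE, 0.0)]
print(detect_zoom(gesture))  # 1.4
```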
I think there will be two dominant modes of using AR: one where it’s the focus of your attention and one where it’s complementary to another experience. We must design for both of these since they have very different requirements.
Main Focus Mode: Here, your attention is primarily on the digital experience. This will be akin to having a phone/laptop in front of your eyes. You could edit documents, watch YouTube, or scroll TikTok. You’d usually be in a private setting, so fear of social stigma doesn’t matter and more interaction models are possible.
Discreet Mode: Here, your attention is not primarily on the digital experience. You may be navigating somewhere, on a date, networking, or doing countless other things. This is where AR promises to shine, by providing information that helps you in your primary activity (e.g. helping you remember people’s names & occupations at a networking event). However, you’re likely to be in public, so social stigma matters. This means the UX has to be discreet: you don’t want to get distracted, and you don’t want others to think you’re distracted either. That limits the interaction models to those that look like natural, non-distracted movements. Additionally, the UX here has to be sparse and predictive, as mentioned before, so that you get only the information you need without being distracted.
There are two interaction models that make particular sense for AR glasses, one corresponding to each mode above. They follow the UX principles listed previously, and they’re achievable in the short term. Let’s take a look.
Hand Tracking: AR glasses (using their built-in cameras) can track the position and orientation of your hands. You can then use your hands to interact with objects in your FOV. This will be the dominant interaction model for Main Focus Mode. Here are some examples:
Gaze Tracking: Outward and inward-facing cameras can track where your eyes are looking. You can then use gaze to interact with objects. This will be the dominant interaction model for Discreet Mode. Here are some examples:
However, interactions will be limited in Discreet Mode. Most interactions will boil down to “dismiss” or “give me more information.”
Let’s delve further into these two models.
Why Hand Tracking?
Above all, it’s intuitive. Your hands are built for manipulating objects. Sci-fi media has also reinforced the notion of using your hands to manipulate digital overlays (e.g. Tony Stark discovering a new element in Iron Man 2).
Because it’s intuitive, it’s also sensical. Someone seeing your movements would understand you were doing something on your AR glasses.
It’s efficient, similar to how swipes and taps are efficient on phones. It doesn’t take much energy to move a few fingers or just your hand.
Lastly, it equates the visual and interactive field. If you can reach out and “touch” objects, that is the gold standard of interactivity. There’s no cursor needed.
How might this work?
This section is speculative and consists of hypotheses on how Hand Tracking interactions could work.
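For example, a pinch “select” could be derived from nothing more than fingertip positions. The landmark names and distance thresholds below are guesses for illustration, not any device’s actual API.

```python
import math

# Hypothetical hand-tracking output: named fingertip landmarks in meters,
# in the headset's coordinate frame.
Hand = dict[str, tuple[float, float, float]]

def distance(a: tuple[float, float, float], b: tuple[float, float, float]) -> float:
    return math.dist(a, b)

def is_pinching(hand: Hand, threshold_m: float = 0.02) -> bool:
    """A 'select' fires when thumb tip and index tip come within ~2 cm.
    The threshold is a guess and would need tuning per user and device."""
    return distance(hand["thumb_tip"], hand["index_tip"]) < threshold_m

def pick_target(pinch_point: tuple[float, float, float],
                targets: dict[str, tuple[float, float, float]],
                max_reach_m: float = 0.10) -> str | None:
    """Visual field == interactive field: select whichever virtual object
    is closest to the pinch, if it's within reach. No cursor involved."""
    best = min(targets, key=lambda name: distance(targets[name], pinch_point))
    return best if distance(targets[best], pinch_point) <= max_reach_m else None

hand = {"thumb_tip": (0.30, 0.00, 0.50), "index_tip": (0.31, 0.01, 0.50)}
overlays = {"photo": (0.32, 0.02, 0.52), "close_button": (0.60, 0.20, 0.50)}
if is_pinching(hand):
    print(pick_target(hand["thumb_tip"], overlays))  # photo
```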
Issues with Hand Tracking
Why Gaze Tracking?
Our eyes naturally look at things we’re interested in, so we can take a long gaze as a Yes signal and a short gaze (or no gaze) as a No signal. It feels intuitive.
It’s discreet as well. Even in conversation or presentations, looking around is normal. You’re not always looking the other person in the eye. So, if a digital overlay pops up on the right and you glance at it, it won’t look like you’re distracted. It will just look like normal eye movements.
It’s efficient because your eyes only have to move a few millimeters to complete interactions.
In Discreet Mode, which is where Gaze Tracking matters most, quick information consumption is the primary interaction. “Learn more” and “Dismiss” will likely be the most common choices, and gaze lets you make them quickly.
Lastly, Gaze Tracking also equates the visual and interactive field. Where you look is where you interact. There is no gap.
How might this work?
This section is speculative and consists of hypotheses on how Gaze Tracking interactions could work.
Pointing and Clicking will be the most important interactions because users will be engaged in other activities while in Discreet Mode.
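For example, dwell time could do most of the work: looking at an overlay is the “point,” holding your gaze is the “click,” and looking away dismisses it. The thresholds below are placeholders that would need real tuning.

```python
class GazeSelector:
    """Turn a stream of gaze samples into Dismiss / Learn-more decisions.

    Assumed behavior: looking away dismisses the overlay; dwelling on it
    past a threshold counts as "learn more". Thresholds are guesses.
    """

    def __init__(self, dwell_select_s: float = 1.2, look_away_s: float = 0.5):
        self.dwell_select_s = dwell_select_s
        self.look_away_s = look_away_s
        self.time_on_target = 0.0
        self.time_off_target = 0.0

    def update(self, on_target: bool, dt: float) -> str | None:
        """Feed one gaze sample; dt is seconds since the previous sample."""
        if on_target:
            self.time_on_target += dt
            self.time_off_target = 0.0
            if self.time_on_target >= self.dwell_select_s:
                return "learn_more"
        else:
            self.time_off_target += dt
            if self.time_off_target >= self.look_away_s:
                return "dismiss"
        return None

# A glance that drifts away -> the overlay quietly dismisses itself.
selector = GazeSelector()
samples = [(True, 0.2), (True, 0.2), (False, 0.3), (False, 0.3)]
for on_target, dt in samples:
    decision = selector.update(on_target, dt)
print(decision)  # dismiss
```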
Issues with Gaze Tracking
[^1]: Even if only one company figures it out, other companies will copy or “be inspired” by it.
[^2]: We can’t show information about every storefront at once. That would violate our principle of sparseness. There are UX ways around this, though, as we will see later.