Escaping Skeuomorphism

In discussions about AI capabilities, there’s a human aspect of how people access those capabilities which I think is important regardless of whether you’re an AI doomer. Interface design can seem banal, but I think it’s fundamental to how people interact with the singularity.
I.
When Apple brought the mouse to market, consumers didn’t automatically know how to use it. Apple’s hunch was that the quality ceiling of mouse-based interaction would be much higher than terminal- or keyboard-based navigation because it was more intuitive. They were right. They won again with the touchscreen; both interfaces are now ubiquitous.
But think back to the first Apple mouse: a boxy block with a single button.
In both cases, although a new kind of input had been created, it took much longer for optimal ergonomics to develop. Forty years on, I think it’s unlikely a mouse will ever be released that substantially advances on the ergonomics settled in the early 2010s—the Logitech G502 comes to mind. I think if you held an early mouse from any manufacturer, you’d feel immediately that it was not the final form.
Importantly, the barrier to making better mice wasn’t the technology—mice are conceptually pretty simple, the interfaces needed already existed, and there’s nothing special about the hardware. What took time was the development of software patterns that could take advantage of this new method of input. It took 12 years from the release of the first mouse for the standard two-buttons-and-a-wheel design to catch on.
Eric Michelman, who worked on Excel at Microsoft, tells the story of how the scroll wheel came about:

> Back in 1993, as I was watching many Excel users do their work, I noticed the difficulty they had moving around large spreadsheets. Finding and jumping to different sections was often difficult. I had the idea that perhaps a richer input device would help.
>
> My original idea was the zoom lever. This was simply a lever, presumably for your non-mouse hand (i.e. on the left side of your keyboard if you’re right-handed). When you push it away from you the spreadsheet zooms out. When you pull it towards you, it zooms back in.
>
> I prototyped this by hooking a joystick up to my computer and using DDE to connect it to Excel for zooming. Using a joystick button along with the stick, I also had it do “data zooming”, which was drilling in and out through Excel outlines.
>
> This all seemed useful, so I showed it to the Hardware division at Microsoft. They were initially cool to the idea, which I presented as a zoom lever, and it didn’t go anywhere at that point.
>
> At this point most people thought it was kind of wacky. Focusing on zooming was a very Excel-centric approach. More specifically, it was a very 2-D centric approach. That is, using an application that presents 2-dimensional data, like a spreadsheet or graphics, it’s very useful to zoom in and out. But the other main style of application is a linear flow application like Word, and there it’s not as useful. You could do zooming with Word, where zooming out shows you a multi-page view and then you click on a desired page and zoom into it, but that’s not as natural as with a spreadsheet or graphics and images.
>
> A number of people suggested adding panning and scrolling functionality. In particular I remember Chris Graham saying zooming was just too limiting and it should pan as well. In response to this feedback, I added panning to the prototype, so moving the joystick side-to-side and back-and-forth scrolled Excel in the corresponding direction.
>
> Around this time, the hardware guys came back and said that they had considered adding a wheel to the mouse, but they didn’t know what it would be used for. Document navigation answered that question, so they said that if I could get Office to support it, they would build it. This really meant Excel and Word since they were the “800 lb gorillas”—if Excel and Word supported something, then the other Office apps would follow, and if Office as a whole supported something, then everyone else would follow too (this was early 1993, when Office was the heart of most people’s computer usage).
The key takeaway here is that the hardware boffins had already conceptualised the mechanics of adding a wheel to the mouse, but it took watching users in the wild to confirm it was a good choice.
II.
Video game graphics are an example of the opposite kind of barrier to development—limitations on the power of home computers set hard ceilings on game aesthetics. Selective rendering, zoning, and other optimisations get beautiful games to run, but if we could suddenly 10x the power of home computers, we’d likely see an overnight spike in the beauty of new video games. Meaningful advances in home computational power come from hard research and the hard logistics of manufacturing new kinds of chips at scale.
If you’re a video game designer developing a game with cutting-edge graphics, you need to take bets on the fidelity you’ll be able to achieve on consumer-grade hardware several years into the future if you want to truly take advantage of the technology on offer. There’s some uncertainty here, but it’s still a pretty reasonable bet that everyone’s computer will be a Little Bit More Powerful a few years from now than it is today. Some decisions you make can be easily scaled, others can’t. Graphical effects can be enabled or disabled programmatically, but CPU load in the form of interactions and game logic can’t easily be.
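As a toy illustration of that asymmetry, here is a minimal Python sketch: rendering quality can be dialled per machine through a settings preset, while the simulation work stays fixed at design time. All of the names and numbers are invented for illustration, not taken from any real engine.

```python
# Rendering cost scales with a per-machine preset; simulation cost does not.
RENDER_PRESETS = {
    "low":    {"shadow_resolution": 512,  "max_particles": 100,  "draw_distance": 200},
    "medium": {"shadow_resolution": 1024, "max_particles": 500,  "draw_distance": 500},
    "ultra":  {"shadow_resolution": 4096, "max_particles": 5000, "draw_distance": 2000},
}

def render_frame(object_count: int, preset: str) -> None:
    # Graphical effects can be dialled up or down to match the player's GPU.
    settings = RENDER_PRESETS[preset]
    print(f"rendering {object_count} objects with {settings}")

def simulate_tick(entities: list[dict]) -> None:
    # Game logic has to run the same everywhere, so it is budgeted for the
    # weakest hardware the designers intend to support.
    for entity in entities:
        entity["x"] += entity["vx"]

entities = [{"x": 0.0, "vx": 1.0} for _ in range(1_000)]
simulate_tick(entities)
render_frame(object_count=len(entities), preset="medium")
```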
Eventually, large language models will be ubiquitous, like the mouse or touchscreen. Part of my wager is that they’re going to go further than ubiquitous and become the fabric on which society is built—like the internet. The internet is much more than a ubiquitous tool interacted with by billions—people born after its invention struggle to conceptualise a society that works without it.
While there are gains to be made in terms of capabilities which can only come from hard science and the hard logistics of scaling GPU racks and datacenters, I think there’s a wealth of short term gains in ergonomics that aren’t being focused on enough.
I think ChatGPT is like the skeuomorphism of the early 2000s. In the days when digitisation was new, a common philosophy amongst designers was to make digital technology look similar to the tools we use in the real world. The earliest way to realise “clickability” (how intuitive it is that an element is an interaction target) was simply to model the visual appearance of the object on a real-world object with the same utility. The notepad looked like a notepad, so it must function like a notepad.
It’s easy to write off current design norms for everyday activities as either obviously the best approach and already at a global maximum, or simply a subjective meta in a sprawling sea of forms—a fashion. I think both of these are normally wrong.
One type of technical accomplishment I hold in the absolute highest esteem is ergonomic advancements to things that seemed obviously solved.
The best example I have of this in recent times is word processing in the last decade. It has been a *phenomenal* decade for tools to write text on a page and share it. Notion cracked open one of the last remaining skeletons of skeuomorphism, the idea that a document is essentially a big piece of paper. Even Google took heed and have been leaning into “smart chips” and “pageless mode” for Google Docs recently.
But never forget that it wasn’t possible to create a document without a fixed page size in Google Docs until 2022.
Big idea. Always possible. Lots of ergonomic advantages. Took 30 years from the mainstream adoption of digital word processing to be realised. Evolving beyond our natural skeuomorphic tendencies takes time.
III.
A lot of people more mathematically inclined than me are going to do a lot of clever work on advancing AI models, but I think there’s underappreciated work to be done advancing new LLM interaction modalities that go beyond the skeuomorphic approach of replicating a conversational experience.
In addition to betting against LLM skeptics, I’m also betting against any system whose main interaction modality is similar to human interaction. I’ve seen a variety of people trying to bundle up LLMs as employee replacements, or build chat interfaces for “talking to code” or as “teachers”. Don’t even get me started on the long maligned (and rightly so) customer service chatbots from hell.
These are crude, primitive tools which will not be looked back on fondly. Don’t get me wrong, language models *will* replace educators and call centres, but they won’t do so as imitations. Even near-perfect replication isn’t enough to overcome the uncanny valley.
The conversation modality will continue to have relevance as a roleplaying tool for leisure, and may have niche pedagogic utility as a way to introduce fun into education, but it’s not how the future is going to look.
Smartphones became a physical extension of our bodies that we’re anxious and often unable to function without. They give us incredible superpowers. Perfect knowledge of direction and the logistics required to get to any place. Instant communication with anyone, anywhere in the world. Any kind of information, immediately.
There have been some hamfisted attempts to replicate this experience with glasses, headsets, or most recently badges, but they don’t *quite work* for reasons that are nuanced and hard to anticipate theoretically. It was widely reported that it felt rude to talk to a Google Glass user as they glanced up at their notifications. This glaring social deficit was deadly out of the gate but hard to anticipate, even if you’re a motivated and capable interaction designer. I’ll reserve judgement on Apple’s latest attempts at goggles for now.
The closest thing so far to language models becoming an extension of the body has, I think, been felt by software engineers using LLM-powered autocomplete for code. There have been many times where, like magic, the AI wrote a whole block of code I was about to bash out piece by piece. That didn’t feel like using a tool; it felt like my own capabilities were being extended. There’s no doubt in my mind that autocomplete will win.
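To make that feeling concrete: you type the comment and the signature, and the model offers up the body you were about to write anyway. The snippet below is my own invented example of the kind of completion I mean, not output from any particular tool.

```python
# You write the comment and the def line; the completion fills in the rest.

# Parse "key=value" pairs from a query string into a dict, ignoring empty pairs.
def parse_query(query: str) -> dict[str, str]:
    result: dict[str, str] = {}
    for pair in query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        result[key] = value
    return result

print(parse_query("page=2&sort=asc"))  # {'page': '2', 'sort': 'asc'}
```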
I think this capability only came about because autocomplete was already a norm from typing on smartphones. A human assistant would never dream of finishing your sentences. Such a thing isn’t just unhelpful, it’s unsettling and is considered rude by most.
I have no doubt that there’s a bounty of interaction styles left undiscovered. Most of the obvious skeuomorphic approaches are already being tried, but there’s no shortage of novel additions to bring. I expect that much like word processing, many of the coolest and most groundbreaking approaches won’t be tried successfully for another decade or three. Maybe the language models can help design their own interfaces, but I doubt it given the deeply human nature of optimising technology ergonomics.
IV.
It’s easy to dance around and either make big sweeping gestures at everything so vaguely that no meaning can be derived, or poke tiny holes in something so small that nothing can be generalised. This is my attempt at a time capsule. If I didn’t make any specific, pointed guesses about the future, it wouldn’t lock me into an early-2020s conception of futurism tightly enough to be interesting to revisit later.
So in the spirit of boldness, here’s where I venture into unknown territory and offer some of my own ideas. This is what I think comes next for interaction modes.
I think voice commands are coming back big time, but not as conversational agents. Voice controls were always a sensible approach, since talking is faster than typing, and voice comprehension tech has come a long way, especially as it relates to proper nouns. Alexa and Google Home were on the right track but couldn’t provide utility beyond playing music and so were hard to monetise. There are still big limitations here regarding how LLMs interact with systems built for humans, but I expect to see a new field of interaction design emerge. Currently we have HTML, CSS and JavaScript for humans, and JSON, XML and GraphQL for systems. I think a third class of tool, one that helps AI agents comprehend and interact with new systems, is on the way.
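To sketch what I mean by that third class of tool, here is a minimal, entirely hypothetical Python example: a service publishing a machine-readable manifest of the actions it exposes, rendered into text an agent can read before deciding what to call. None of the field names, actions, or endpoints are an existing standard; they are placeholders for the kind of layer I expect to emerge.

```python
import json

# A hypothetical "action manifest": the agent-facing counterpart to the HTML a
# human sees. Every field name, action, and endpoint here is invented.
ACTION_MANIFEST = {
    "service": "example-music-player",
    "actions": [
        {
            "name": "play_music",
            "description": "Start playback of a track, album, or playlist.",
            "parameters": {
                "query": {"type": "string", "description": "Artist, track, or playlist name."},
                "shuffle": {"type": "boolean", "default": False},
            },
            "endpoint": "POST /api/playback",
        },
    ],
}

def describe_for_agent(manifest: dict) -> str:
    """Render the manifest as plain text a language model can read before acting."""
    lines = [f"Service: {manifest['service']}"]
    for action in manifest["actions"]:
        params = ", ".join(f"{name}: {spec['type']}" for name, spec in action["parameters"].items())
        lines.append(f"- {action['name']}({params}): {action['description']}")
    return "\n".join(lines)

print(describe_for_agent(ACTION_MANIFEST))
print(json.dumps(ACTION_MANIFEST, indent=2))
```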
The main shortcoming of shorthand has always been that it is hard to read and lacks the emotional colour of prose. Large language models are exceptionally good at synthesising content and mood into prose, and I wonder how that extends to our ability to signal mood via the written word. Emojis are now the de facto way to fill the gap left by the impersonal nature of written communication, but I don’t think emoji keyboards are the final form of efficient emotional expression.
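As a sketch of what I am imagining: the writer types a terse note plus a mood marker, and a model expands it into prose that carries that tone. The `call_llm` function below is a hypothetical stand-in for whatever model API you would actually use; only the shape of the interaction is the point.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this up to a real language model client.
    raise NotImplementedError

def expand_shorthand(note: str, mood: str) -> str:
    # Shorthand in, prose with the intended emotional colour out.
    prompt = (
        "Rewrite the following shorthand as one or two natural sentences.\n"
        f"Match this mood: {mood}.\n"
        f"Shorthand: {note}\n"
    )
    return call_llm(prompt)

# Example (needs a real call_llm to run):
# expand_shorthand("mtg moved to 3pm, sorry short notice", mood="warm, apologetic")
```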
The current golden age of word processing technology is not over yet. Notion brought block-based content styling and the elimination of pages to the scene. Autocomplete is the first step, but isn’t mature yet. I’m absolutely sure there’s more to come here.
There’s a crucial tipping point of trust that I’m sure will occur at some point in the next 20 years, where people let AI agents make meaningful decisions on their behalf. Some people think of the agent in this case as essentially a robotic executive assistant or therapist or consultant, but that’s a skeuomorphic way of thinking, modelled around the idea of a persona, which I think is unnecessary. A much more realistic alternative is essentially fine-tuning a language model on somebody’s preferences to such a degree that it’s like a second brain for them, not an assistant. I have conversations with myself in my brain all the time, but these are fundamentally different kinds of conversations than I would have with another person. When I try to figure out what I want to do in a certain situation, I’m not *advising* myself, or asking myself for help; I’m trying to work out *what I already think*. I’m nowhere close to conceptualising what this would look like in reality, but I think I’ll know it when I see it.
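I can at least sketch the data side of the idea. Below is a minimal, hypothetical Python example that turns a log of past personal decisions into chat-style fine-tuning examples, framed as “work out what I already think” rather than advice. The record fields and file name are invented; the JSONL messages format just mirrors the style commonly used for fine-tuning chat models.

```python
import json

# A hypothetical log of my own past decisions, phrased as situation -> what I did.
past_decisions = [
    {
        "situation": "Invited to a conference the same week as a product launch.",
        "what_i_did": "Skipped the conference; shipping matters more to me than networking.",
    },
    {
        "situation": "Choosing between a cheap flight with two stopovers and a direct one.",
        "what_i_did": "Paid extra for the direct flight; I guard my time aggressively.",
    },
]

def to_training_example(record: dict) -> dict:
    # Frame each example as recovering my own view, not as an assistant giving advice.
    return {
        "messages": [
            {"role": "system", "content": "You are a model of this specific person's preferences."},
            {"role": "user", "content": f"Situation: {record['situation']} What do I want to do?"},
            {"role": "assistant", "content": record["what_i_did"]},
        ]
    }

with open("preference_finetune.jsonl", "w") as handle:
    for record in past_decisions:
        handle.write(json.dumps(to_training_example(record)) + "\n")
```

Whether anything like this is how it actually gets built, I don’t know; it’s just the smallest concrete version of the idea I can write down today.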