Multimodal Agentic Commerce — Amazon Alexa / Rufus
A design system that works where there are no screens. The foundation for agentic commerce.
Challenge
Amazon's voice-first ecosystem. Alexa needed a design system that worked with AND without screens. Rufus (Amazon's AI shopping assistant) needed high-fidelity prototypes. Design systems assume screens. Voice doesn't have screens. How do you build a system that serves both?
Solution
Architected a voice-first design system for the Alexa ecosystem. Engineered high-fidelity prototypes for Rufus AI. Built multimodal interaction patterns (voice + screen + headless). Reduced global integration timelines by 6 months.
Architecture
Monorepo with shared tokens, multimodal component runtime, and cross-device state orchestration.
Technologies
- Voice UI
- Multimodal Design
- Rufus AI
- Headless
- High-fidelity Prototyping
The Numbers
100% adoption across the Alexa ecosystem. Global integration timelines reduced by six months. A design system that serves voice-only devices, screen devices, and everything in between. The same token architecture powering Echo, Echo Show, Fire TV, and the Alexa mobile app.
Challenge
Every design system textbook starts with a grid, a colour ramp, and a typographic scale. Alexa had none of those. When your product is a voice, there is nothing to line up, nothing to tint, nothing to set. The first question I had to answer was what a design token even means in a world where you cannot see anything.
The problem went deeper than voice. The Alexa ecosystem spans dozens of device types: speakers with no screen, smart displays with small screens, TVs with big screens, phones, tablets, and third-party hardware with unpredictable form factors. A user starts a conversation on an Echo in the kitchen, continues on the Echo Show in the living room, finishes on the phone in the car. The design system had to work across all of them. Not “responsive web” across all of them. Fundamentally different modalities across all of them.
Existing teams had built components for individual surfaces. The Echo Show team had their components. The Fire TV team had theirs. The mobile app had its own library. There was no shared architecture underneath. The result was the usual fragmentation: inconsistent behaviour across devices, duplicated effort across teams, and integration timelines that got longer every quarter.
Redefining what a token means
The answer was to treat the system as a set of rules for behaviour instead of a library of visual components. How long should a confirmation pause last? When should the assistant restate what it heard? How should a multi-turn flow recover from an interruption? How does an error response differ from a disambiguation response in cadence and word count?
These rules were the tokens. Not colours. Not spacing. Behaviours.
A design system for a screenless product is a contract with the user about what the product promises to do, and how it will behave when it cannot. Everything else (visuals, typography, motion) is a secondary concern that only applies when a screen happens to be present.
For voice-only devices, the token set defined timing: the duration of a pause before the system speaks again, the rhythm of a list readout (three items, then a checkpoint: “do you want to hear more?”), the threshold at which the system gives up waiting for input and offers a fallback. These timing tokens were as precise and as non-negotiable as a colour hex value in a visual system. When every Alexa skill respected the same timing tokens, the experience felt coherent across skills. When they did not, the assistant felt broken.
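A minimal sketch of what such behavioural timing tokens might look like. The token names, values, and the readout helper are illustrative assumptions, not the actual Alexa token set.

```typescript
// Hypothetical behavioural timing tokens for a voice-only surface.
// Names and values are illustrative; the real token set is not public.
const voiceTiming = {
  confirmationPauseMs: 650, // silence before the system speaks again
  listChunkSize: 3,         // items read aloud before a checkpoint
  checkpointPrompt: "Do you want to hear more?",
  inputTimeoutMs: 8000,     // give up waiting for input, offer a fallback
} as const;

// Plan a list readout according to the tokens: after every
// `listChunkSize` items, insert the checkpoint prompt.
function planReadout(items: string[]): string[] {
  const plan: string[] = [];
  items.forEach((item, i) => {
    plan.push(item);
    const atChunkBoundary = (i + 1) % voiceTiming.listChunkSize === 0;
    if (atChunkBoundary && i + 1 < items.length) {
      plan.push(voiceTiming.checkpointPrompt);
    }
  });
  return plan;
}
```

Because the pacing lives in tokens rather than in each skill's code, changing the chunk size or the prompt is a token release, not a rewrite.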
For devices with screens, the token set extended into the visual domain. But the visual tokens were always secondary to the behavioural tokens. A component on the Echo Show had to work first as a voice interaction, then optionally render a visual complement. If you removed the screen, the interaction still had to make sense. This constraint shaped every design decision. We never started from a screen layout and added voice. We started from a voice flow and added a screen when it helped.
The multimodal runtime
The technical architecture was a monorepo with three layers.
Shared tokens at the base. Behavioural tokens (timing, conversational flow, error recovery) consumed by every device. Visual tokens (colour, typography, spacing, motion) consumed only by devices with screens. Both sets lived in the same pipeline and were versioned together.
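The split between the two token sets, versioned as one release, could be sketched roughly like this. The interfaces and field names are assumptions for illustration only.

```typescript
// Illustrative shape of the shared token pipeline: behavioural tokens
// consumed by every device, visual tokens only by devices with screens.
interface BehaviouralTokens {
  confirmationPauseMs: number;
  inputTimeoutMs: number;
  errorRetryLimit: number;
}

interface VisualTokens {
  colourPrimary: string;
  spacingUnitPx: number;
  typeScaleBase: number;
}

interface TokenRelease {
  version: string;               // both sets versioned together
  behavioural: BehaviouralTokens; // mandatory everywhere
  visual?: VisualTokens;          // absent on voice-only builds
}

function tokensFor(release: TokenRelease, hasScreen: boolean) {
  // A voice-only device receives only the behavioural layer.
  return hasScreen
    ? release
    : { version: release.version, behavioural: release.behavioural };
}
```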
A multimodal component runtime in the middle. Each component was defined as a set of behaviours, not a set of visual states. A “product card” component, for example, was defined as: “present product name, price, and primary image. On voice, read name and price. On screen, show card. On screen with voice, show card and read price.” The runtime resolved which modality to activate based on the device context.
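The product-card contract described above can be sketched as a resolver that maps device context to output channels. Everything here is a hypothetical reconstruction of the idea, not the actual runtime API.

```typescript
// Hypothetical modality resolution for a "product card" component:
// the component is a behaviour contract, and the runtime decides
// which channels to activate for the current device.
type DeviceContext = { hasScreen: boolean; hasVoice: boolean };
type Product = { name: string; price: string; imageUrl: string };

interface RenderPlan {
  speak?: string;   // utterance for the voice channel
  render?: "card";  // visual output, if a screen is present
}

function productCard(product: Product, ctx: DeviceContext): RenderPlan {
  const plan: RenderPlan = {};
  if (ctx.hasVoice) {
    // On voice, read name and price.
    plan.speak = `${product.name}, ${product.price}`;
  }
  if (ctx.hasScreen) {
    // On screen, show the card; with voice also active, both fire.
    plan.render = "card";
  }
  return plan;
}
```

The same component definition serves an Echo, an Echo Show, and the mobile app; only the context changes.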
Cross-device state orchestration at the top. When a user started a shopping session on one device and continued on another, the state followed. Not just the data (cart contents, search query) but the conversational state: where they were in the flow, what the assistant had already said, what the next logical step was. This was the hardest layer. Synchronizing data across devices is a solved problem. Synchronizing conversational context is not.
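The distinction between syncing data and syncing context can be made concrete with a sketch of the session state that follows the user. Field names and the resume logic are illustrative assumptions.

```typescript
// Sketch of cross-device session state: the data fields are the
// easy part; the conversational fields are what let a second device
// resume mid-flow instead of restarting the conversation.
interface ShoppingSessionState {
  sessionId: string;
  // Data sync: a solved problem.
  cart: string[];
  query: string;
  // Conversational context: the hard part.
  flowStep: "browsing" | "comparing" | "confirming";
  assistantSaid: string[]; // what has already been spoken aloud
  nextPrompt: string;      // the next logical step in the flow
}

// A device picking up the session continues from the flow step
// rather than greeting the user from scratch.
function resumeUtterance(state: ShoppingSessionState): string {
  return state.flowStep === "confirming"
    ? "Ready to check out when you are."
    : state.nextPrompt;
}
```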
How the system was built
The process started with an audit of every existing component across every device surface. Not as thorough as a 146-component inventory (the Alexa ecosystem was younger and smaller), but the same principle: you cannot rationalize what you cannot see.
What emerged was a familiar pattern. Teams had built surface-specific components because the system had never given them a reason not to. The Echo Show team needed a product card, so they built one. The mobile team needed a product card, so they built another one. Same concept, same data, different implementation, different behaviour. Nobody’s fault. Just the natural result of building without shared architecture.
The consolidation followed the same logic as any design system rationalization: find the generators, unify the token layer, let the components collapse. But with a twist. In a visual system, the generator produces visual variants. In a multimodal system, the generator produces modality variants. A “product card generator” does not produce “large card” and “small card.” It produces “voice card,” “screen card,” “voice-plus-screen card,” and “headless card.” Same data, same interaction contract, different output channels.
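The modality generator described above could be sketched as one interaction contract fanned out into four output channels. The variant shapes are hypothetical, chosen to show the principle of "same data, different channels".

```typescript
// Hypothetical generator: one product-card contract, four modality
// variants. In a visual system the generator would emit size variants;
// here it emits output-channel variants instead.
type Modality = "voice" | "screen" | "voice+screen" | "headless";

interface ProductCardContract {
  name: string;
  price: string;
}

function generateVariant(contract: ProductCardContract, mode: Modality) {
  switch (mode) {
    case "voice":
      return { speak: `${contract.name}, ${contract.price}` };
    case "screen":
      return { render: contract };
    case "voice+screen":
      // Screen carries the detail; voice carries only the price.
      return { speak: contract.price, render: contract };
    case "headless":
      // Structured output for agents and APIs, no human-facing surface.
      return { data: contract };
  }
}
```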
The moment we stopped thinking about devices and started thinking about modalities, the component count dropped by itself. A product card is not a different component on Echo Show and Fire TV. It is the same component in different modes.
Rufus: Prototyping agentic commerce
When Rufus came along, the same runtime was what let us prototype its interactions in days instead of months. Rufus is Amazon's AI shopping assistant. It is not a set of screens you sketch. It is a set of behaviours you test against real catalogue data and real queries.
An AI shopping assistant needs to do things that no traditional UI anticipates. It needs to compare products conversationally. It needs to ask follow-up questions when the query is ambiguous. It needs to present results differently depending on whether the user is looking at a screen or driving. It needs to handle interruptions gracefully (“actually, never mind, show me something cheaper”).
Having a system that already understood multimodality meant we could wire up a working prototype and put it in front of buyers before anyone had drawn a single pixel. The voice tokens defined how Rufus paced its responses. The multimodal runtime decided whether to show a product carousel or read product names aloud. The state orchestration layer let a user start a Rufus session in the kitchen and finish on the phone.
The prototypes were high-fidelity in the truest sense: not pixel-perfect mockups, but functional interactions with real data, real conversational flows, and real multimodal switching. This was possible because the design system was already built for it. The components already knew how to behave across modalities. We just had to compose them into Rufus-specific flows.
Telescope
The prototyping approach we developed for Rufus eventually became a standalone tool: Telescope. The core idea was that high-fidelity prototyping for AI products cannot happen in traditional design tools. You cannot prototype a conversation in Figma. You cannot prototype a multimodal interaction in a static wireframe. You need a tool that runs real interactions with real data.
Telescope let product teams build interactive prototypes that exercised the full design system: voice responses, screen rendering, state transitions, error recovery. Teams could test a Rufus shopping flow end to end, with real catalogue data, and measure where the experience broke down. This feedback loop was critical. Without it, the team would have been designing conversations on paper and discovering problems only after engineering had built the real thing.
Key Results
100% adoption across the Alexa ecosystem. Every device surface consuming the same token architecture.
Six months removed from global integration timelines. New device integrations that previously required building a component library from scratch now inherited the full system.
A design system that makes no assumptions about screens. Voice-only, screen-only, voice-plus-screen, and headless all served by the same generators.
Rufus prototyped in days, not months. High-fidelity interactive prototypes with real data and real conversational flows, built on top of the existing multimodal runtime.
Telescope emerged as a standalone prototyping tool for agentic and multimodal products.
What I Learned
On modality: A design system that assumes screens is a design system that cannot scale to where products are going. Voice, AR, ambient computing, AI agents. The future has fewer screens, not more. Building for voice-first taught me to separate behaviour from presentation in a way that visual-only work never did.
On tokens: Tokens are not just colours and spacing. They are any decision that should be consistent across contexts. Timing, pacing, error thresholds, conversational checkpoints. If it is a decision that affects user experience and should be the same everywhere, it is a token.
On prototyping AI products: You cannot design an AI interaction on paper. The only way to know if a conversational flow works is to run it. High-fidelity prototyping for AI means functional code with real data, not wireframes with placeholder text.
On cross-device state: Synchronizing data is easy. Synchronizing context is hard. The difference between a product that “works on multiple devices” and one that “moves with the user” is the conversational state layer.