Lost in Translation: Architectural Choices for Multilingual Copilot Agents

Your customer just typed a question in Portuguese. Or Japanese. Or switched mid-sentence from English to Spanish because the technical term only exists in one of them. Your agent has to do something — and the choice you make about where that translation happens will shape the agent's cost, latency, fidelity, and how much of your weekend you spend maintaining locale files for the next two years.

This isn't a "pick the best translator" problem. It's an architectural decision, and most teams I talk to default into one tier without realizing they had three (really, four) to choose from. Before we dive into Copilot Studio screens or Foundry flow diagrams, let's lay out the decision space — then I'll walk through three demos I built this week to make each tier real.

Three tiers, ordered by where translation happens

Tier 1 — Client-side browser translation

Translation happens after the agent responds, in the user's browser, on their CPU.

Microsoft Edge's Translate page feature and Chrome's built-in translator intercept the rendered DOM and swap text after your agent has already produced its English response. The agent itself is monolingual; the browser does the work for free, with no extra call to your agent and no token spend. Whether the translation itself runs on-device or in the browser vendor's cloud depends on the browser and version, so verify that before you promise anyone that nothing leaves the machine.

Don't confuse this with "browser-based localization." Microsoft's Employee Self-Service multilingual guidance recommends a pattern called Browser Based Localization — but that's actually Tier 2 with the browser used as a language signal. The agent reads the browser locale and serves an authored secondary-language string. Pure Tier 1 means the agent has no idea translation is happening — the DOM gets rewritten after the response leaves the agent. The two get conflated all the time but sit in completely different cost and control regimes.

What it's good at: zero cost to you, no added agent latency, potentially fully private (when the browser translates on-device), and it works on any agent without a single line of code change.

What it isn't good at: anything you actually care about as the agent author. No control over terminology, no control over brand voice, no way to keep "Microsoft 365" from becoming "Microsoft three-six-five" in some target language. It only exists in the browser channel — Teams, mobile, and your API integrations get nothing.

Tier 2 — Authored multilingual topics and steering

Translation is done at design time by humans (or human-supervised tools), and the agent picks the right pre-translated branch at runtime.

In Copilot Studio, this is per-language topic variants and language support, the built-in language detection trigger, the System.Language system variable, and the multilingual generative-answers configuration that grounds responses against locale-specific knowledge. In Azure AI Foundry, it's a language-classifier node at the top of a prompt flow that routes to locale-specific system prompts.
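Stripped of any specific product, the runtime half of Tier 2 is just a lookup: resolve the user's locale to the closest language you authored, then serve the pre-translated string. Here's a minimal sketch of that idea. The `AUTHORED` table, the string IDs, and both function names are hypothetical, not Copilot Studio or Foundry APIs.

```python
from typing import Dict

# Hypothetical authored strings, keyed by language tag, then by string ID.
AUTHORED: Dict[str, Dict[str, str]] = {
    "en": {"greeting": "Hi! How can I help?"},
    "fr": {"greeting": "Bonjour ! Comment puis-je vous aider ?"},
    "es": {"greeting": "¡Hola! ¿En qué puedo ayudarte?"},
}
DEFAULT_LANGUAGE = "en"


def resolve_language(locale: str) -> str:
    """Map a locale such as 'fr-CA' to the closest authored language."""
    tag = locale.strip().lower().replace("_", "-")
    while tag:
        if tag in AUTHORED:
            return tag
        # Drop the last subtag: 'fr-ca' -> 'fr' -> ''.
        tag = tag.rpartition("-")[0]
    return DEFAULT_LANGUAGE


def authored_string(locale: str, string_id: str) -> str:
    """Serve the authored string, falling back to the default language."""
    strings = AUTHORED[resolve_language(locale)]
    return strings.get(string_id, AUTHORED[DEFAULT_LANGUAGE][string_id])
```

Nothing is translated at request time here, which is why the output is fully reviewable: every string the user can see already exists in the table.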

Microsoft's Employee Self-Service multilingual guide is a clean real-world example in two flavors. Option 1 — Browser Based Localization uses the browser locale as the language signal and serves authored secondary-language strings. Option 2 — Dynamic Language Switching adds an AI-driven detection topic so language can change mid-conversation. Both still require you to add each secondary language and upload translated strings — the AI just picks which one to use. Worth reading for Microsoft's own warnings: incremental string changes aren't auto-translated, CJK response quality is variable, and Dynamic Language Switching incurs prompt-usage charges for unlicensed users.

What it's good at: total determinism. Every string is reviewable, QA'd, locked to your terminology. Brand voice is preserved. Compliance teams can sign off per locale. Works in every channel because the agent is multilingual, not the surface around it.

What it isn't good at: it doesn't scale gracefully. Authoring cost grows with languages × topics. Every new feature has to ship in N locales or you accept inconsistent coverage. Code-switching mid-conversation breaks the routing model.

Tier 3 — Runtime AI translation

Translation happens live, every turn, in the cloud.

This tier splits into two sub-flavors that get conflated more than they should:

3a. LLM-in-the-loop translation. A prompt node in Copilot Studio, or a translation step in a Foundry agent flow, uses the same generative model to translate input, run agent logic, then translate the response back. One model, two extra calls per turn (or one if you bundle output-language instructions in a single prompt).

3b. Dedicated translation service. Azure AI Translator for text (with Custom Translator for domain-tuned models and glossaries) or Azure AI Speech translation for real-time speech, including multilingual neural voices. Purpose-built, usage-metered (per character for text), faster than an LLM round-trip, and broader in language coverage.
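For the 3b flavor, here's a rough sketch of what a call to the Azure AI Translator text API (v3.0) looks like. It only builds the request and parses a response, so it runs without credentials. The helper names are mine, and the key and region values are placeholders you would supply from your own resource.

```python
import json
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlencode

# Global Translator v3 endpoint; regional or custom endpoints also exist.
TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(
    texts: List[str],
    to_lang: str,
    key: str,
    region: str,
    from_lang: Optional[str] = None,
) -> Tuple[str, Dict[str, str], bytes]:
    """Build URL, headers, and JSON body for a v3.0 translate call."""
    params = {"api-version": "3.0", "to": to_lang}
    if from_lang:
        # Omitting 'from' lets the service auto-detect the source language.
        params["from"] = from_lang
    url = f"{TRANSLATOR_ENDPOINT}?{urlencode(params)}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    body = json.dumps([{"Text": text} for text in texts]).encode("utf-8")
    return url, headers, body


def parse_translations(response_json: str) -> List[str]:
    """Pull the first translation for each input out of the response."""
    return [item["translations"][0]["text"] for item in json.loads(response_json)]
```

To actually send it, pass the three pieces to `urllib.request.Request(url, data=body, headers=headers, method="POST")`. Because the text meter counts characters, the size of `texts` is what you'd watch in telemetry.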

What it's good at: flexibility. Any language, on the fly. No per-locale authoring. Code-switching handled gracefully. Voice channels supported.

What it isn't good at: cost predictability and terminology fidelity. Tier 3a doubles your token spend per turn. Tier 3b is much cheaper but is still a per-character meter. Both drift on idioms and domain jargon unless you invest in a custom glossary or Custom Translator model.
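To make the 3a token-spend point concrete, here's a minimal sketch that counts model calls per turn. The "model" is a stand-in callable that records prompts, not a real client, and the prompt wording is purely illustrative.

```python
from typing import Callable, List

# A "model" here is any callable that takes a prompt and returns text.
Model = Callable[[str], str]


def answer_with_translation(user_text: str, user_lang: str, model: Model) -> str:
    """Translate in, run the agent logic in English, translate out: 3 calls."""
    english_question = model(f"Translate to English:\n{user_text}")
    english_answer = model(f"Answer the question:\n{english_question}")
    return model(f"Translate to {user_lang}:\n{english_answer}")


def answer_bundled(user_text: str, user_lang: str, model: Model) -> str:
    """Fold the output-language instruction into the answer call: 2 calls."""
    english_question = model(f"Translate to English:\n{user_text}")
    return model(
        f"Answer the question. Reply only in {user_lang}.\n{english_question}"
    )


class CountingModel:
    """Stand-in for a real model client; records every prompt it receives."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"<response {len(self.prompts)}>"


full, bundled = CountingModel(), CountingModel()
answer_with_translation("Comment réinitialiser mon mot de passe ?", "French", full)
answer_bundled("Comment réinitialiser mon mot de passe ?", "French", bundled)
print(len(full.prompts), len(bundled.prompts))  # 3 2
```

Each extra call re-sends the text as input tokens and pays for it again as output tokens, which is where the multiplier on the bill comes from.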

The fourth tier nobody calls a tier — hybrid

What teams actually ship in production is almost never one of the three above. It's a hybrid: Tier 2 for the top business-critical languages where terminology and brand matter, falling back to Tier 3b for the long tail. Tier 1 stays as the user's escape hatch. This pattern keeps QA-able quality where it matters and bounded cost where it doesn't.
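The routing rule behind that hybrid is small enough to sketch. This is a simplified illustration under my own assumptions: authored strings live in a nested dictionary, and `translate` is whatever Tier 3b client you wrap (a fake is enough to exercise the logic).

```python
from typing import Callable, Dict, Tuple

# language -> string ID -> authored text
Authored = Dict[str, Dict[str, str]]


def hybrid_reply(
    string_id: str,
    language: str,
    authored: Authored,
    translate: Callable[[str, str], str],
    source_language: str = "en",
) -> Tuple[str, str]:
    """Return (text, path); path records which tier produced the text."""
    strings = authored.get(language, {})
    if string_id in strings:
        # Business-critical languages: serve the reviewed, authored string.
        return strings[string_id], "tier2-authored"
    # Long tail: translate the source-language string at runtime.
    source_text = authored[source_language][string_id]
    return translate(source_text, language), "tier3b-runtime"
```

Returning the path alongside the text is a deliberate choice: it lets you log how often each locale falls through to runtime translation, which is the signal for when a long-tail language has earned authored content.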

The tradeoff matrix

Dimension	Tier 1 — Browser	Tier 2 — Authored	Tier 3a — LLM	Tier 3b — Translator/Speech
Direct cost	$0	High one-time + ongoing authoring	$$$ tokens per turn	$ per character
Added latency	None	None	+1 to 2 extra model calls per turn	~100–300 ms
Language coverage	~100+	Whatever you author	Strong on top ~20 languages, weaker on long tail	100+ text / 70+ speech
Terminology control	None	Full, locked	Glossary-via-prompt (fragile)	Custom Translator (strong)
Voice / speech support	No	Yes, per locale	Yes, but two-step	Yes — best of the four
Channel coverage	Browser only	All channels	All channels	All channels
Maintenance burden	None	High and growing	Low	Low
Compliance reviewability	No	Yes	Hard	Moderate (with glossary)
Code-switching	Poor	Poor	Good	Good

Why this decision is hard to walk back

Here's the part most agent makers don't think about until it's too late: the tier you choose is a one-way door, or close to it. The authoring effort, design time, and content rework needed to get it right before you ship are a fraction of what it takes to evolve an already-deployed agent from one tier to another. The cost I'm talking about here isn't infrastructure or token spend — it's the human work of redesigning topics, re-translating strings, re-reviewing every conversation flow, and re-earning user trust after the experience changes underneath them.

Starting at Tier 1 (browser) and trying to grow into anything else. On the surface this looks free — the agent has no localization code, you just "add multilingual support later." In practice, you've trained your users to expect the inconsistent, machine-quality translations the browser produces. When you upgrade to Tier 2 or Tier 3, the voice, terminology, and even the structure of responses change, and users notice. Tier 1 also only ever helped your browser users — your Teams users, mobile users, and API consumers got nothing for months and have already churned or built workarounds.

Starting at Tier 3a (LLM-in-the-loop) and trying to harden into Tier 2. The most common mid-flight pivot, usually triggered by a compliance review or a CFO seeing the token bill. To retrofit Tier 2 you have to fork every published topic per locale, get each variant linguistically reviewed, re-test the conversation flows, and re-publish. If your agent has 80 topics and you support 6 languages, that's 480 reviewed assets where you used to have 80.
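The fan-out arithmetic is worth writing down, because it also shows what every later change costs. A tiny sketch (the function name is just for illustration):

```python
def reviewed_assets(topics: int, languages: int) -> int:
    """Each topic needs one reviewed variant per supported language."""
    return topics * languages


print(reviewed_assets(80, 6))                           # 480
# Adding a seventh language means reviewing every existing topic again.
print(reviewed_assets(80, 7) - reviewed_assets(80, 6))  # 80
# Adding ten new topics means shipping each one in all six languages.
print(reviewed_assets(90, 6) - reviewed_assets(80, 6))  # 60
```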

Starting text-only and bolting on voice. Voice isn't just "add Speech to the pipeline." It's a different channel posture — likely Direct Line Speech, different latency budgets, different turn-taking semantics, different telemetry. Agents built text-first often have response patterns (long bullet lists, markdown tables, citations) that simply don't work when spoken aloud.

The expectation problem nobody costs out. Once an end user has interacted with your agent in a given language — well or poorly — they have an expectation. If it worked badly, they stop trusting it and you have to rebuild trust, which is harder than earning it the first time. Multilingual posture is a contract with the user, and contracts are hard to renegotiate. This is why the design decision belongs before the agent goes into the wild, not after.

Build it once, reuse it everywhere: the enterprise framework angle

Now zoom out from a single agent. In a large enterprise, you're not shipping one agent — you're shipping ten, or fifty, or a hundred, often built by different teams, often by citizen developers who shouldn't be reinventing language routing every time.

Picking a multilingual pattern at the organization level (not per agent) gets you three things that stack up over time:

Consistency for end users. Employees and customers who interact with multiple agents expect them to behave the same way around language. If agent A asks "what language would you like?" and agent B silently auto-detects and agent C only speaks English with a browser-translate workaround, your users learn to distrust the whole portfolio.
A reusable framework. A shared "language layer" — typically a routing skill or library that wraps Translator/Speech, exposes a common userLocale context variable, ships a glossary registry, and emits consistent telemetry — turns multilingual support into a checkbox for every new agent rather than a project. Citizen developers inherit it for free.
Centralized cost and quality control. One billing surface for Translator/Speech, one Custom Translator model per business domain, one place to add a new language for everyone, one telemetry view for translation accuracy and per-locale satisfaction. This is exactly the kind of thing your Center of Excellence should own.
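As a thought experiment, the shared language layer described above could have a surface as small as this. Everything here is hypothetical: the class, the `translate` signature, and the telemetry fields are my own sketch of the pattern, not an existing Microsoft component.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Glossary = Dict[str, str]
# Hypothetical signature: (text, target_language, glossary) -> translated text
Translate = Callable[[str, str, Glossary], str]


@dataclass
class LanguageLayer:
    translate: Translate
    glossaries: Dict[str, Glossary] = field(default_factory=dict)  # by domain
    telemetry: List[dict] = field(default_factory=list)

    def register_glossary(self, domain: str, glossary: Glossary) -> None:
        """One glossary per business domain, shared by every agent."""
        self.glossaries[domain] = glossary

    def localize(self, agent_id: str, domain: str, text: str, user_locale: str) -> str:
        """Translate for the user's locale and emit one consistent event."""
        language = user_locale.split("-")[0].lower()
        if language == "en":
            output = text
        else:
            output = self.translate(text, language, self.glossaries.get(domain, {}))
        self.telemetry.append(
            {
                "agent": agent_id,
                "userLocale": user_locale,
                "language": language,
                "chars": len(text),
            }
        )
        return output
```

An agent team would only ever call `localize`; the Center of Excellence owns the `translate` implementation, the glossaries, and the telemetry sink behind it.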

The TL;DR: the cheapest multilingual agent is the one whose architecture you decided once, at the platform level, before the first agent shipped. Everything after that is interest payments on the design decision.

Three demos across the tier ladder

Enough about tiers in the abstract — let's run them. I built three demos this week using two agents I already run in my demo tenant: a Handicap Hero golf-club selector that embeds a Copilot Studio agent into a custom front end via Direct Line, and a Copilot Studio Employee Self-Service IT agent grounded on a ServiceNow Copilot connector. The order below mirrors how teams usually find their way up the ladder: start with the free thing, hit its walls, then invest in a real multilingual posture.

Demo 1 — Tier 1 in the wild: just let the browser translate it

Tier 1 is what you get without doing anything. The agent author writes English, the agent responds in English, and the end user clicks "Translate this page" in their browser. No tokens spent, no localization JSON authored, no topic re-routing. Free.

To see what that actually looks like end-to-end I used my Handicap Hero club selector: a single-page front end on Azure Static Web Apps that embeds a Copilot Studio agent via Direct Line. English-only by design. Here's the baseline — English page, English chat, agent doing its job:

Now I let the browser translate the page. Edge translates with Microsoft Translator and Chrome with Google Translate, so exact wording can differ between browsers, but the overall experience is the same:

First problem — the artifacts. Most of the page reads fine in French. Brand names are preserved (Mizuno JPX 925 Hot Metal, Titleist T350, PING G730), which is the right machine-translate behavior. But the domain term shaft — the metal/graphite tube of a golf club — got translated as arbre, the French word for tree. "Live Shaft Advisor" became "Conseiller en arbre en direct". "Shaft fitting" became "raccord d'arbre". A French-speaking golfer would do a double take and probably bounce.
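With browser translation there is no hook for the agent author to prevent this. In a tier where you own the translation call, one common workaround is to mask protected terms before translating and restore them afterwards. Here's a minimal sketch of that idea; the function name and the glossary entries are illustrative only (not reviewed French), and it assumes the translator passes the placeholder tokens through untouched, which you would need to verify for a real service.

```python
import re
from typing import Callable, Dict


def translate_with_glossary(
    text: str,
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Swap glossary terms for placeholders, translate, then restore them."""
    restored: Dict[str, str] = {}
    masked = text
    # Longest terms first so 'shaft fitting' wins over plain 'shaft'.
    for index, term in enumerate(sorted(glossary, key=len, reverse=True)):
        token = f"__TERM{index}__"
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        masked, count = pattern.subn(token, masked)
        if count:
            restored[token] = glossary[term]
    translated = translate(masked)
    for token, target in restored.items():
        translated = translated.replace(token, target)
    return translated
```

Dedicated services offer sturdier versions of the same idea (Custom Translator dictionaries, for example), so treat this as the shape of the fix rather than the fix itself.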

Second problem — the deeper one — is what happens when the user actually tries to use the chat. The page content is in French, so a reasonable user types their question in French. Here's what comes back:

The agent backend has no idea the page got translated. The user's French input flows untranslated into a Copilot Studio agent configured only for English, and the agent — correctly, from its own point of view — replies in English saying it can only operate in English (en-US). No graceful fallback, no auto-detect, no retry in the user's language. The browser then dutifully re-translates that English refusal back into French, and the user is left staring at a polite-but-useless wall.

This is the split-brain problem: the wrapper speaks French, the brain only speaks English, and the user is stuck in the middle with no way to tell which one is actually broken. Worse, the user feels in control — they picked their language, they translated the page, they typed in French — but the results are inconsistent in ways they can't predict and the agent maker can't fix. Browser translation alone is a wish and a prayer: it works for static marketing pages, and it falls apart the moment a real conversation has to round-trip through a backend that doesn't share the assumption.

That's the wall. Tier 1 is a reasonable free fallback for the long tail, and for the user who happens to be on a desktop browser reading static content. It is not a multilingual strategy. To deliver a coherent experience, the agent itself has to know the user's language — which is exactly where Tier 2 starts.
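One way a custom front end can tell the agent which language the user is in is the `locale` field on the activity it posts over Direct Line. The sketch below only builds the payload and URL, using the Direct Line 3.0 endpoint shape; the helper names are mine, and whether a given agent honors the field depends on how its languages are configured, so treat that part as an assumption to test.

```python
import json
from typing import Dict

DIRECT_LINE_BASE = "https://directline.botframework.com/v3/directline"


def build_message_activity(text: str, user_id: str, locale: str) -> Dict:
    """A Bot Framework message activity carrying the user's locale."""
    return {
        "type": "message",
        "from": {"id": user_id},
        "text": text,
        "locale": locale,  # for example, the page's navigator.language value
    }


def activities_url(conversation_id: str) -> str:
    """Direct Line 3.0 URL for posting an activity to a conversation."""
    return f"{DIRECT_LINE_BASE}/conversations/{conversation_id}/activities"


activity = build_message_activity("Quel club pour 150 mètres ?", "user-1", "fr-FR")
print(json.dumps(activity, ensure_ascii=False))
```

The POST itself also needs an `Authorization: Bearer <token>` header, which is omitted here.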

Demo 2 — Tier 2 in Copilot Studio: ESS Browser-Based Localization

For a fairer test of Tier 2 I switched agents to my Employee Self-Service IT agent — a Copilot Studio agent grounded on a ServiceNow Copilot connector with a couple thousand English KB articles behind it. Out of the box it ships English-only, just like Handicap Hero. The difference: instead of leaving the browser to fix things on its own, here the agent itself gets taught to speak French and Spanish.

Microsoft's multilingual configuration guide defines two paths; this section walks through Option 1 (Browser-Based Localization). The work itself:

Add the secondary languages on the agent's Languages page (I added French and Spanish-US).
Download the localization JSON template, translate the strings (greeting, starter prompts, system messages), upload one file per language.
Publish.
Open the agent in M365 Copilot with a browser locale set to the target language.

With the agent published, I opened M365 Copilot first with navigator.language=fr-FR and an Accept-Language: fr-FR header. The browser flipped to French, the agent's authored greeting fired in French (with the user's first name substituted from the JSON), and then I asked the question I actually cared about: when the user types in French, does the ServiceNow knowledge base — authored entirely in English — come back as a French answer at runtime, or do I have to translate and republish thousands of KB articles per language to make this work?
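For reference, this is roughly how a browser-locale signal like `Accept-Language` gets reduced to one supported language. It's a simplified sketch of the header's quality-value ordering, not the logic Copilot Studio actually runs, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Parse 'fr-FR,fr;q=0.9,en;q=0.8' into (tag, quality), best first."""
    entries: List[Tuple[str, float]] = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without an explicit q value defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        entries.append((tag.strip().lower(), quality))
    # sorted() is stable, so ties keep the order the browser sent.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def pick_language(header: str, supported: Iterable[str], default: str = "en") -> str:
    """Return the best supported language for the header, else the default."""
    supported_set = set(supported)
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in supported_set:
            return tag
        base = tag.split("-")[0]
        if base in supported_set:
            return base
    return default


print(pick_language("fr-FR,fr;q=0.9,en;q=0.8", ["en", "fr", "es"]))  # fr
```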

The answer: the KBs come back translated at runtime, and you do not have to republish anything. The GenAI orchestrator translates the grounded payload on the fly, the user gets a French response, and the citation link still points at the original English KB article. One KB article, multilingual answers, zero republishing. That's the big finding for any team running a knowledge-grounded agent and worrying about content fan-out: you don't need to fan it out.

Switching the locale to es-ES reproduced the result — full Spanish synthesis citing the same English KBs:

Tier 2 verified. The "tax" you pay is predictable: one localization JSON per language, kept in sync as you add new authored strings. The "tax" you don't pay is per-conversation translation cost — everything you authored is free at runtime. Compare back to Handicap Hero: every word here was either authored by me or grounded against a KB I own, and none of it round-trips through a browser translator I can't control. The arbre problem can't happen because the translation isn't happening in the user's browser anymore — it's happening inside the agent, with the agent's own terminology.
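Keeping those files in sync is the part that bites later, so a small check in your release pipeline helps. The sketch below assumes each localization file is a flat JSON object of string ID to text; the real template's shape may differ, and the function name is hypothetical.

```python
import json
from typing import Dict, List


def find_localization_gaps(primary_json: str, secondary_json: str) -> Dict[str, List[str]]:
    """Compare a secondary-language file against the primary-language file."""
    primary = json.loads(primary_json)
    secondary = json.loads(secondary_json)
    return {
        # Strings added to the primary file but never translated.
        "missing": sorted(k for k in primary if k not in secondary),
        # Strings copied over but still identical to the primary text.
        "untranslated": sorted(
            k for k in primary if k in secondary and secondary[k] == primary[k]
        ),
        # Strings that no longer exist in the primary file.
        "stale": sorted(k for k in secondary if k not in primary),
    }
```

The "untranslated" bucket can produce false positives for strings that are legitimately identical in both languages (product names, for instance), so treat it as a review list rather than a hard failure.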

Demo 3 — Tier 3a hybrid: a Switch Language topic on top of the same ESS agent

Tier 2 is a big upgrade over Tier 1, but it has its own wall. It assumes the browser locale is a reliable signal for what the user wants. What happens when a user is on a Spanish-locale machine, types in English, and wants the answer in French? Microsoft's same guide describes Option 2 — Dynamic Language Switching — which adds a custom topic that detects a user's intent to change language and re-routes the conversation. This is the Tier 3a hybrid: authored strings stay where they are (Tier 2), but a small AI-driven topic overlays a runtime switch (Tier 3a-ish, in that the agent uses the conversation context to override the browser locale).

I built this on the same ESS agent — no second instance — using Studio's "add topic from description with Copilot" flow. The description I gave it:

Detect when the user asks to switch language (in any language). Recognize phrases like 'switch to French', 'parle en français', 'habla español', 'change to English', 'speak English'. Capture the requested language. Confirm the switch in the target language. If the user does not specify a language clearly, ask them to choose between English, French, or Spanish.

Copilot generated the entire topic in about thirty seconds: trigger phrases, a multi-choice Question node bound to Topic.RequestedLanguage, three confirmation branches (English, Français, Español), and a fallback for unmatched language names.
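To show the shape of what that topic decides, here's a deliberately naive keyword version of the same logic. The generated topic relies on Copilot Studio's own trigger matching, not on anything like this; the phrase lists and function names below are mine.

```python
import re
import unicodedata
from typing import Optional, Tuple

SWITCH_PHRASES = ["switch to", "change to", "speak", "parle", "habla"]
LANGUAGE_NAMES = {
    "en": ["english", "anglais", "ingles"],
    "fr": ["french", "francais", "frances"],
    "es": ["spanish", "espagnol", "espanol"],
}


def _fold(text: str) -> str:
    """Lowercase and strip accents so 'français' matches 'francais'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def detect_switch(utterance: str) -> Tuple[bool, Optional[str]]:
    """Return (is_switch_request, language); language is None if unclear."""
    folded = _fold(utterance)
    if not any(re.search(rf"\b{re.escape(p)}\b", folded) for p in SWITCH_PHRASES):
        return False, None
    for code, names in LANGUAGE_NAMES.items():
        if any(re.search(rf"\b{name}\b", folded) for name in names):
            return True, code
    # Switch intent without a recognizable language: ask the user to choose.
    return True, None


print(detect_switch("parle en français"))            # (True, 'fr')
print(detect_switch("How do I reset my password?"))  # (False, None)
```

The `(True, None)` case corresponds to the topic's fallback branch, where the agent asks the user to pick between English, French, and Spanish.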

One catch worth flagging up front: switching the user's language mid-conversation is not natively supported by Copilot Studio — there's no built-in action that flips System.Language on the fly. What this topic does instead is acknowledge the switch in the requested language and rely on the agent's GenAI orchestrator picking up that conversation context for subsequent turns. Whether that actually carries through is the question I wanted to answer.

I stacked the deck against it: browser locale set to Spanish, user input in English. First turn: "How do I reset my password?" — the agent answered in Spanish, citing English KB0010024 (browser locale wins, exactly as Tier 2 promises).

Then I sent "switch to French". The Switch Language topic fired and presented three suggestion chips:

I clicked Français. The agent confirmed in French: "Langue changée en français. Toutes les réponses seront désormais en français." Then I re-asked the same English question, "How do I reset my password?" — and got back a fully synthesized French response citing the same English KB:

So three signals were fighting each other — browser locale (Spanish), user input language (English), and conversation-context override (French) — and the orchestrator picked the conversation context. The published topic was the only thing that changed; no system variable, no Power Fx, no platform-level switch. Tier 3a in Copilot Studio is a soft override, but in this run it held up.

If you need a hard switch (say, to drive locale-specific TTS voices or compliance routing), you'll want to also write the chosen language into a global variable and reference it in your other topics' system instructions. That's a deeper integration I didn't go into here.
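Conceptually, that hard switch is a piece of per-conversation state with an explicit precedence rule. Here's a minimal sketch that mirrors the ordering observed in this demo (conversation override first, then browser locale, with input language ignored). The class is hypothetical and stands in for whatever global variable you would use in the agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionLanguage:
    browser_locale: str
    override: Optional[str] = None  # set when the user asks to switch

    def switch(self, language: str) -> None:
        """Record an explicit user choice for the rest of the conversation."""
        self.override = language

    def response_language(self, input_language: Optional[str] = None) -> str:
        """Override beats browser locale; input language is not used."""
        # input_language is accepted so callers can log it, but it is
        # deliberately ignored, matching the behavior seen in the demo.
        if self.override:
            return self.override
        return self.browser_locale.split("-")[0].lower()


session = SessionLanguage(browser_locale="es-ES")
print(session.response_language(input_language="en"))  # es
session.switch("fr")
print(session.response_language(input_language="en"))  # fr
```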

Design Choices: a decision tree

Mapping the demos back to the decision space:

What's next

This post is the architecture-decision part of a longer arc. In the next post I'll cover:

Azure AI Foundry walkthrough — the language classifier flow, Translator integration, and a Speech-driven voice agent demo.
The hybrid pattern in production — routing rules, fallback design, and how to instrument cost and quality per locale.
A reference framework you can lift — the shared "language layer" pattern: routing skill, glossary registry, telemetry contract, and how to roll it out across an agent portfolio.
Cost math on a realistic scenario — 50,000 conversations/month across 12 locales, modeled across each tier so you can see where the bill actually lands.