Lost in Translation: Architectural Choices for Multilingual Copilot Agents Part 2

In Part 1 I laid out the four-tier menu for multilingual agent design: browser-local translation, topic-level steering, retrieval-aware language detection, and full LLM/AI-driven translation inside the conversation. Part 1 sat squarely in Copilot Studio territory because that's where most enterprise agent builders start — low-code, drag-and-drop, native to the Microsoft 365 surface. But the deeper you push toward Tier 3 and Tier 4, the more the weight of the work shifts away from the agent shell and onto the model.

This post is what happens when you let the model do the heavy lifting. Specifically, when you build the agent in Azure AI Foundry instead of Copilot Studio, ground it with Azure AI Search, and lean on the LLM's native multilingual ability for everything that isn't retrieval. No locale switches. No translated topic clones. One agent, one instructions block, one search index — and clean multilingual responses across English, French, Spanish, and beyond.

I'll walk through the architecture, the agent definition, the multilingual proof points pulled live from the Foundry playground, and the telemetry that explains why this pattern is so cost-efficient at runtime. At the end I'll close with a recap of when each tier from Part 1 is the right answer, so you can match an architecture to the agent in front of you instead of defaulting to the one you saw last.

Why Foundry shines for Tier 3 multilingual

The unlock with a modern frontier LLM (gpt-4o, gpt-4.1, claude-sonnet, gemini; take your pick) is that language detection and response generation are no longer your problem. The model can read French, reason over it in its own internal representation, and emit French back to the user, all without you writing a single piece of language-handling code, uploading translated resource files, or cloning topics per locale.

What you do need is one piece of careful prompt engineering: tell the model what language behavior you actually want. Otherwise you'll get inconsistent defaults (some models love to fall back to French, others to English, others will mirror whatever language your knowledge source is written in).

Pair that with a retrieval tool — Azure AI Search over a curated set of authoritative documents — and you've got a grounded multilingual assistant whose answers come from your sources but whose conversational surface adapts automatically to the user.

The architecture

Three components and no glue code:

Azure AI Foundry Agent — the orchestrator. Holds the system instructions (including the language rules), wires the tool, and manages the conversation thread.
Azure AI Search index — for this article I indexed a set of Microsoft Learn pages about Copilot Studio, but the pattern works with any text-first content store: SharePoint, Confluence, product documentation, KB articles.
A frontier LLM deployment — gpt-4o in my case, deployed as a Global Standard endpoint in the same Foundry project. Choose for cost, latency, and tool-calling fidelity.

That's it. No Copilot Studio shell. No connected-agent chaining. No language configuration in the agent shell. The agent is the model plus a tool plus a paragraph of instructions.
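
To make the "model plus a tool plus a paragraph of instructions" shape concrete, here is a minimal sketch in plain Python. It is not the Foundry SDK: the `AgentDefinition` and `SearchTool` classes, the index name `learn-copilot-studio`, and the `top_k` default are all hypothetical stand-ins I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SearchTool:
    """Hypothetical stand-in for the Azure AI Search tool binding."""
    index_name: str
    top_k: int = 5  # assumed default, not taken from the real agent


@dataclass(frozen=True)
class AgentDefinition:
    """Hypothetical container for the three components; not a Foundry SDK type."""
    model_deployment: str
    instructions: str
    tools: tuple[SearchTool, ...] = field(default_factory=tuple)

    def describe(self) -> str:
        # Summarize the definition in one line: model, bound indexes, prompt size.
        tool_names = ", ".join(t.index_name for t in self.tools) or "none"
        words = len(self.instructions.split())
        return f"{self.model_deployment} + search[{tool_names}] + {words} words of instructions"


agent = AgentDefinition(
    model_deployment="gpt-4o",
    instructions="You are a Copilot Studio documentation assistant grounded in Microsoft Learn.",
    tools=(SearchTool(index_name="learn-copilot-studio"),),
)
print(agent.describe())
```

Nothing in that structure mentions a locale, which is the point: the language behavior lives entirely inside the instructions string.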

The agent definition

Here's the live agent in the Foundry portal — version 4, gpt-4o, with the Azure AI Search tool bound to a dedicated index.

Foundry agent definition: gpt-4o, instructions block, Azure AI Search tool wired to the Microsoft Learn index.

The instructions block carries the multilingual contract. The pattern I landed on after several iterations:

You are a Copilot Studio documentation assistant grounded in Microsoft Learn.

LANGUAGE BEHAVIOR:
- Detect the language of the user's most recent message.
- Respond in THAT EXACT detected language for the entire response.
- English in → English out. French in → French out. Spanish in → Spanish out.
Japanese in → Japanese out. Portuguese in → Portuguese out.
- DEFAULT to English if the message is empty, contains only HTML/code/punctuation,
is a single ambiguous word, or you cannot determine the language with high
confidence. NEVER default to French or any non-English language.
- The retrieval index is English; do NOT mirror the language of indexed documents.
Translate retrieved content into the user's detected language for the response.

CRITICAL TOOL USE RULE:
- For EVERY user question, you MUST call the Azure AI Search tool first to retrieve
relevant Microsoft Learn pages from the indexed knowledge source.
- Never answer from your own training data. The knowledge source is the only source of truth.
- If the search returns no relevant results, say "I don't have that in the indexed
Microsoft Learn documentation" (translated into the user's language) — do not
invent or guess.

CITATIONS — STRICT:
- Cite ONLY URLs from the search tool's actual results for THIS query.
- The valid URL pattern for this knowledge source is exactly:
https://learn.microsoft.com/en-us/microsoft-copilot-studio/<slug>
- NEVER cite power-virtual-agents URLs — that domain is retired.
- Include exactly ONE citation as a markdown link at the end.

RESPONSE LENGTH:
- Be concise. 150–300 words target, 450 max.

Three things to call out in this prompt:

The user's message sets the response language, not the source content, and English is only the fallback. Without the no-mirroring guardrail, models will sometimes mirror the language of whatever they retrieved: your French user asks a question, your English Microsoft Learn page comes back from search, and the model responds in English because that's what it just read. The instruction makes the user's input language the authoritative signal.
One search call per turn, no guessing. The retrieval discipline here is half the battle. Without it, the model will happily synthesize plausible-sounding answers from training data and you'll never know it didn't actually consult your knowledge source.
Citation hygiene with a regex-style URL pattern. Modern LLMs are distressingly good at hallucinating URLs that look right but don't exist. Pinning the valid URL pattern in the prompt is a cheap, durable defense.
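
The citation rule can also be enforced outside the prompt. The sketch below is one way to do that, assuming you have the response text and the set of URLs the search tool returned for the turn. The function name `check_citations`, the slug character class, and the sample URLs in the usage lines are my own assumptions, not something Foundry provides.

```python
import re

# URL pattern pinned in the instructions block. The slug character class
# (letters, digits, hyphens) is an assumption about what a valid slug looks like.
VALID_URL = re.compile(
    r"^https://learn\.microsoft\.com/en-us/microsoft-copilot-studio/[A-Za-z0-9\-]+$"
)
# Matches markdown links of the form [text](url) and captures the URL.
MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")


def check_citations(response_text: str, retrieved_urls: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the citation rules hold."""
    cited = MARKDOWN_LINK.findall(response_text)
    problems = []
    if len(cited) != 1:
        problems.append(f"expected exactly 1 citation, found {len(cited)}")
    for url in cited:
        if not VALID_URL.match(url):
            problems.append(f"URL does not match the pinned pattern: {url}")
        elif url not in retrieved_urls:
            problems.append(f"URL was not returned by this query's search: {url}")
    return problems


# Illustrative usage with placeholder URLs (not real search results).
retrieved = {"https://learn.microsoft.com/en-us/microsoft-copilot-studio/example-page"}
good = "Short answer. [Source](https://learn.microsoft.com/en-us/microsoft-copilot-studio/example-page)"
bad = "Short answer. [Source](https://learn.microsoft.com/en-us/power-virtual-agents/example-page)"
print(check_citations(good, retrieved))
print(check_citations(bad, retrieved))
```

Checking membership in the retrieved set matters as much as the pattern: a hallucinated URL can match the pattern perfectly and still point at a page the search never returned.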

Multilingual proof, end to end

Here are three live tests from the Foundry playground — same agent, same index, same instructions. The only thing that changed is the language of the question.

English

English: "What is generative orchestration in Copilot Studio?" — 4.4 seconds, one tool call, one citation.

The metadata strip across the bottom of the response is worth reading. It tells you the model used (gpt-4o), the wall-clock latency (4.4s), the conversation thread tokens (16,309), and the tool call sequence (azure_ai_search_call → Azure AI Search → message). That's the receipt that retrieval actually happened.

French (same conversation)

French follow-up in the same thread: "Comment puis-je créer un nouveau sujet dans Copilot Studio ?" Response is in French, citation is to an English Microsoft Learn page (which is fine — the indexed source and the response language are decoupled).

Two things to notice here. First, the model correctly pivoted from English to French within the same conversation, without any session reset or language toggle. Second, the citation it produced is to an English Microsoft Learn URL — that's intentional. The grounded source can be in whatever language your indexed content is in; the response is translated for the user. The user gets to read in their language; they can click through to the canonical source if they want the original.

Spanish (fresh conversation)

Spanish: "¿Cómo agrego soporte multilingüe a mi agente de Copilot Studio?" — 7.1s, full Spanish response, citation to the multilingual configuration article.

Same pattern, third language, no changes to the agent. The model recognized Spanish, called the search tool, retrieved an English page about multilingual configuration, and rendered the answer in Spanish with the right citation.
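
If you want to turn those three playground checks into a repeatable smoke test, something like the following could work. It is a deliberately crude sketch: the stopword lists are hand-picked by me and only cover the three languages tested here, and `guess_language` is a hypothetical helper, not a real language detector. It runs outside the agent as a test aid, so the runtime path still has no language-handling code.

```python
import re

# Tiny, hand-picked function-word lists. An assumption for illustration only;
# a real evaluation would use a proper language-identification library.
STOPWORDS = {
    "en": {"what", "is", "in", "the", "how", "do", "to", "my", "and", "of"},
    "fr": {"comment", "puis", "je", "un", "une", "dans", "le", "la", "les", "est", "et", "pour"},
    "es": {"cómo", "como", "mi", "el", "la", "los", "es", "y", "para", "un", "una", "de", "se"},
}


def guess_language(text: str, default: str = "en") -> str:
    """Score each language by stopword hits; fall back to the default on no signal or a tie."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    # Mirrors the English-default rule from the instructions block.
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return default
    return best


def same_language(question: str, response: str) -> bool:
    """True when the question and the agent's response appear to share a language."""
    return guess_language(question) == guess_language(response)


# The three questions used in the playground tests above.
PLAYGROUND_QUESTIONS = {
    "en": "What is generative orchestration in Copilot Studio?",
    "fr": "Comment puis-je créer un nouveau sujet dans Copilot Studio ?",
    "es": "¿Cómo agrego soporte multilingüe a mi agente de Copilot Studio?",
}

for expected, question in PLAYGROUND_QUESTIONS.items():
    print(expected, guess_language(question))
```

In a real run you would pass each question to the agent and feed the reply into `same_language`; a mismatch is a cheap early warning that the model mirrored the index language instead of the user's.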

The telemetry that explains the cost story

Foundry's Traces tab is where this architecture really earns its keep for enterprise readers. It gives you per-turn cost, latency, token spend, and a full span waterfall — the kind of observability you'd otherwise have to bolt on with OpenTelemetry exporters and dashboards.

Traces list: completed turns at 7.5s with ~16K tokens cost $0.04 each. The failed entry shows the gpt-4o context-window overflow you'll hit if you let a single thread accumulate too many retrieval payloads.

Drilling into a single trace gives you the span breakdown:

A single 7.56s conversation: 4.74s in the Azure AI Search tool call, 2.60s in the gpt-4o chat call. Retrieval accounts for most of the latency.

And finally the user-view — the actual prompt and response inside the trace, so you can audit content alongside performance:

User view inside the trace: full Spanish prompt & response, with the citation URL preserved and the conversation/response IDs ready for incident analysis.

The two things this telemetry pane crystallizes for an enterprise buyer:

Retrieval, not generation, dominates latency. In my traces the search call routinely consumed 60–75% of total turn time. If you want faster agents, optimize the index (vector dimensions, filters, top-N tuning) before you reach for a faster model.
Per-turn cost is observable at the trace level, not just billed in aggregate. $0.04 per turn for a multilingual grounded answer is defensible unit economics for almost any productivity scenario. You can prove it before you scale.
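
The arithmetic behind both points is easy to reproduce. The span timings below come from the single trace described above. The per-million-token prices and the 16,000 input / 300 output token split are assumptions I picked for illustration, so substitute the rates and counts from your own deployment.

```python
# Span timings from the trace described above (seconds).
TOTAL_S = 7.56
SEARCH_S = 4.74
CHAT_S = 2.60

# ASSUMPTION: illustrative prices in USD per million tokens; check your own rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def share(part: float, total: float) -> float:
    """Percentage of the total turn time spent in one span, rounded to one decimal."""
    return round(100 * part / total, 1)


def turn_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one turn under the assumed prices above."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


print("search share %:", share(SEARCH_S, TOTAL_S))
print("chat share %:", share(CHAT_S, TOTAL_S))
# ASSUMPTION: roughly 16K input tokens and 300 output tokens for the turn.
print("estimated cost:", turn_cost(16_000, 300))
```

Under those assumptions the search span is 62.7% of the turn and the estimate lands near $0.043, in line with the $0.04 the Traces list reports. Input tokens carry most of that cost, so trimming retrieval payloads helps the bill as well as the latency.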

What surprised me

Three things stood out building this agent that I didn't expect coming from a Copilot Studio mindset:

Instructions matter more than configuration. In Copilot Studio you spend a lot of time on settings: enabled languages, generative answers toggles, authentication scopes, channel options. In Foundry, you spend almost all your time on the instructions block. A change to a paragraph in that prompt is more consequential than any portal toggle.
Context window discipline is real, and it's an architectural decision point. Foundry threads accumulate every retrieval payload along with the conversation history. By turn three of a chat where each turn fetches 3–5 documents, you can blow past gpt-4o's context window. This is exactly where agent design earns its money: build in conversation restart cues, summarize older turns into a compact rolling memory, set a turn-count threshold that triggers a fresh thread under the hood, or pin only the most relevant retrieved chunks per turn instead of the full payload. The model won't manage its own context budget. You have to.
Strict citation grammar pays for itself. The number of fabricated-but-plausible URLs I caught in early iterations of this agent was sobering. Pinning the valid URL pattern in the prompt and rejecting any citation that doesn't match is cheap insurance against the most embarrassing failure mode in grounded RAG.
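
For the context-window point, here is a minimal sketch of one budgeting policy: keep only the top chunks for the newest turn, strip retrieval payloads from older turns, and drop the oldest turns once an estimated token budget is hit. It assumes you control the message history yourself and that chunks arrive ordered by relevance. The `Turn` type, the four-characters-per-token estimate, and the `keep_chunks` default are all hypothetical choices of mine.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    answer: str
    chunks: list[str]  # retrieved passages, assumed ordered most relevant first


def estimate_tokens(text: str) -> int:
    # ASSUMPTION: roughly 4 characters per token; use a real tokenizer in practice.
    return max(1, len(text) // 4)


def turn_tokens(turn: Turn) -> int:
    """Estimated tokens for one turn: question, answer, and any kept chunks."""
    return sum(estimate_tokens(t) for t in (turn.user, turn.answer, *turn.chunks))


def fit_to_budget(turns: list[Turn], budget: int, keep_chunks: int = 2) -> list[Turn]:
    """Newest turn keeps its top chunks; older turns are stripped of retrieval
    payloads and dropped oldest-first once the budget is reached."""
    if not turns:
        return []
    *older, latest = turns
    latest = Turn(latest.user, latest.answer, latest.chunks[:keep_chunks])
    kept = [latest]
    used = turn_tokens(latest)
    # Walk backwards from the most recent older turn toward the oldest.
    for turn in reversed(older):
        slim = Turn(turn.user, turn.answer, [])  # history without retrieval payloads
        cost = turn_tokens(slim)
        if used + cost > budget:
            break
        kept.append(slim)
        used += cost
    return list(reversed(kept))
```

With service-managed threads you may not be able to edit history in place, in which case the same policy could decide when to start a fresh thread seeded with the trimmed turns.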

A note on delivery surfaces

A reasonable question at this point: "So how does a user actually chat with this thing? The Foundry playground is great for builders, not end users." Fair. The delivery surface choice — Teams, Microsoft 365 Copilot chat, a web app, Power Apps — depends on where your users already work.

Teams as a first-class surface

For most enterprises, the answer ends up being Teams. It's where your users already live, it handles authentication via the Microsoft identity they're already signed into, and Foundry agents can publish there directly — no custom code, no custom app shell, no developer-mode sideloading. The agent simply shows up as a chat participant; users send messages in whatever language they prefer; the agent responds, grounded and multilingual, the same way it does in the Foundry playground.

Teams gives you the multilingual end-user experience without any of the heavy lifting. The instructions block in the agent does the language detection. The Azure AI Search tool does the grounding. Teams does the delivery. Three moving parts, all loosely coupled, each replaceable when the underlying service evolves.

Microsoft 365 Copilot chat

You can also route a Foundry agent into the Microsoft 365 Copilot chat surface through Copilot Studio as a connected agent, and that pattern is fine if your organization standardizes on Copilot Studio for all agent management. But it isn't required. Foundry can publish directly to Microsoft 365 Copilot on its own — and going direct removes one extra hop from the chain when something goes sideways at runtime. Fewer wrappers, fewer places to look first when troubleshooting.

Why Agent 365 changes the calculus

The other reason I lean toward direct publish: Agent 365 is shaping up to be the unified observability and governance layer across every agent development platform: Foundry agents, Copilot Studio agents, custom-built agents, partner-platform agents. And critically, it doesn't live in a single product silo. Agent 365 spans Microsoft 365 Admin, Microsoft Entra ID, Microsoft Purview, and Microsoft Power Platform together, with identity, lifecycle, compliance, and platform telemetry stitched into one governance plane that follows the agent regardless of where it was built or where it's delivered.

Once that's the place you go to see telemetry, governance, license posture, and adoption signals across every agent your organization runs, the case for routing a Foundry agent through a Copilot Studio shell purely for delivery gets thinner. Pick the shortest path from agent to user, and let Agent 365 do the cross-surface correlation work.

FYI: when you over-orchestrate, the AI layers can fight each other. Stack a Foundry agent inside a Copilot Studio shell inside Microsoft 365 Copilot, and you now have three independent AI components — each with its own language behavior. One layer's instructions say "respond in the user's detected language." Another layer's fallback path says "I only communicate in en-US." When they disagree, the user sees a confusing mix: a Spanish question, an English answer, and a polite disclaimer apologizing for the very thing you designed the bottom layer to handle.

Recap: which tier when

Part 1 laid out four tiers for where multilingual translation can happen in an agent stack. Now that we've worked through Tier 3 in Foundry, here's the same menu again with sharper picking guidance — and the tradeoff matrix from Part 1 reprinted so you don't have to flip tabs.

Tier 1 — Client-side browser translation

Best when: You're shipping a quick internal tool for an English-speaking workforce that occasionally needs to read in another language, the agent only lives in browser channels, and you have zero appetite for maintaining localization assets. It's free, on-device, and adds nothing to your token bill.

Avoid when: The agent reaches Teams, mobile, or any channel beyond the browser. Or when you care at all about terminology, brand voice, or compliance review of translated output. Or when you're going to need to upgrade later — users habituated to browser-translate quality will notice every change you make.

Tier 2 — Authored multilingual topics and steering

Best when: Your agent is mission-critical, your terminology is regulated or trademarked, and the language list is small and stable (say, three to five languages your business actually serves). Worth the authoring investment when wrong words have legal, brand, or safety consequences.

Avoid when: You need to support more than a handful of languages, your topics churn frequently, or your users code-switch mid-conversation. The authoring cost grows with languages × topics and never gets cheaper.

Tier 3a — LLM-in-the-loop translation (this article's pattern)

Best when: You want one agent that handles any language a user might bring — including the long tail and unexpected ones — without per-locale authoring, and you have a knowledge source you can ground against. The LLM does language detection, translation, and response generation in a single turn. Add a retrieval tool and you've got a grounded multilingual assistant for the cost of a model call plus a search.

Avoid when: Token cost is the binding constraint, or when your terminology is so domain-specific that the model can't reason about it without a glossary you'd have to enforce on every turn. Tier 3a is flexible but it's also the most expensive option per turn.

Tier 3b — Dedicated translation service (Translator / Speech)

Best when: You need broad language coverage at predictable per-character cost, you have voice channels to support, or you have an internal glossary you want to lock in via Custom Translator. This is the most cost-effective path for high-volume, high-language-count scenarios once the agent itself is built and stable.

Avoid when: The translation quality bar is "near-native fluency on technical content" — LLM output (Tier 3a) tends to read better even before you tune Custom Translator, especially for shorter outputs and explanatory prose.

Tier 4 — Hybrid (what most enterprises actually ship)

Best when: You have multiple agents serving multiple audiences and you're operating at enterprise scale. Tier 2 for the top two or three business-critical languages where terminology and brand matter, Tier 3a or 3b for the long tail, Tier 1 left in place as the user escape hatch. This is also the right place to plug in a shared organizational language framework so every new agent inherits the same pattern instead of reinventing it.

The tradeoff matrix (reprinted from Part 1)

Dimension	Tier 1 — Browser	Tier 2 — Authored	Tier 3a — LLM	Tier 3b — Translator/Speech
Direct cost	$0	High one-time + ongoing authoring	$$$ tokens per turn	$ per character
Added latency	None	None	+1 extra model call per turn	~100–300 ms
Language coverage	~100+	Whatever you author	Strong on top ~20, weaker on long tail	100+ text / 70+ speech
Terminology control	None	Full, locked	Glossary-via-prompt (fragile)	Custom Translator (strong)
Voice / speech support	No	Yes, per locale	Yes, but two-step	Yes — best of the four
Channel coverage	Browser only	All channels	All channels	All channels
Maintenance burden	None	High and growing	Low	Low
Compliance reviewability	No	Yes	Hard	Moderate (with glossary)
Code-switching	Poor	Poor	Good	Good

The honest summary: there's no universal best tier. There's a best tier for your agent, your users, your languages, your channels, and the constraints that matter most to your business. The mistake I see most often isn't picking the wrong tier — it's never explicitly picking at all, and ending up wherever the platform's defaults dropped you. Pick on purpose, preferably before the first user types their first non-English message.

Where Part 3 is heading

Part 1 covered the design menu. Part 2 (this post) showed the LLM-driven, retrieval-grounded flavor built in Foundry. Part 3 will lean into the operational side: evaluation harnesses for multilingual responses (how do you know your French is actually any good?), drift detection when the knowledge source changes, and the patterns for sharing multilingual agents across multiple lines of business without each team rebuilding the wheel.

Until then — build the agent, instrument the agent, ship the agent to where your users actually are. The model can speak the language. Your job is to make sure it's grounded, observable, and reachable.