Stop hand-tuning prompts; let the optimizer do it
I've spent a lot of late nights tweaking system prompts by hand. Microsoft's new Agent Optimizer in Foundry Agent Service automates the loop, and I'm here for it.
- #foundry
- #agents
- #devex
- #observability
I have written a lot of system prompts. Probably more than is healthy. Most of them have followed the same depressing arc: write a first draft, try a handful of inputs, watch the agent fail in one of seven specific ways, tweak the prompt, try again, fix one thing, accidentally regress another, and end the evening with a prompt file that is two paragraphs longer than it needs to be and a vague feeling that it is “better.”
I have wanted a tool that automates that loop for a long time. Microsoft just shipped one.
The Agent Optimizer in Foundry Agent Service (private preview now, public preview in 30 days) takes a hosted agent, runs it against a task set with explicit pass/fail criteria, generates candidate configurations, scores them, and ranks the results so you can promote the winner. The worked example in Microsoft’s announcement post moves a customer-support agent from a 0.60 baseline score to 0.92 in one cycle. No model retraining, no code changes; you just run azd ai agent optimize against your existing hosted agent.
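To make the score-and-rank step concrete for myself, here is a minimal sketch of that loop in plain Python. It is not the Foundry API and not how the optimizer is implemented: `Task`, `score`, `rank`, and `toy_agent` are hypothetical names I made up, the "agent" is a toy function, and the candidate configurations are hand-written rather than generated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str
    passes: Callable[[str], bool]  # explicit pass/fail criterion for the reply


# Hypothetical stand-in for a hosted agent: (config, prompt) -> reply.
Agent = Callable[[str, str], str]


def score(agent: Agent, config: str, tasks: List[Task]) -> float:
    """Fraction of tasks whose reply meets its pass criterion."""
    passed = sum(1 for t in tasks if t.passes(agent(config, t.prompt)))
    return passed / len(tasks)


def rank(agent: Agent, candidates: List[str], tasks: List[Task]) -> List[Tuple[float, str]]:
    """Score every candidate config and sort best-first."""
    scored = [(score(agent, c, tasks), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_agent(config: str, prompt: str) -> str:
    # Toy behaviour: the reply only covers what the config asks for.
    reply = "Thanks for reaching out."
    if "refund" in config and "refund" in prompt:
        reply += " You are eligible for a refund within 30 days."
    if "polite" in config:
        reply += " Is there anything else I can help with?"
    return reply


tasks = [
    Task("I want a refund", lambda r: "refund" in r),
    Task("Hello", lambda r: "Thanks" in r),
    Task("Can I get a refund?", lambda r: "30 days" in r),
    Task("Bye", lambda r: "anything else" in r),
]
candidates = ["baseline", "mention refund policy", "be polite; mention refund policy"]

ranking = rank(toy_agent, candidates, tasks)
for s, c in ranking:
    print(f"{s:.2f}  {c}")
```

The point of the toy is the shape: the task set and its pass criteria are the only thing that decides which candidate wins, which is why the quality of the evals matters so much.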
The design choice I like most is what it optimizes. You point it at the system prompt, or the agent’s skills (reusable procedures), or the model deployment itself (the cost/quality trade-off). Pick the lever you actually want to move; the optimizer handles the rest. Each candidate comes back with per-task breakdowns and token costs so you can see what you’re trading. That is a real engineering tool, not a vibes machine.
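Here is roughly how I would read a ranked list that carries both a score and a token cost. This is a sketch under my own assumptions: the `Candidate` fields, the helper names, and every number below are invented for illustration and do not come from the optimizer's actual output format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float      # fraction of tasks passed
    avg_tokens: int   # mean tokens spent per task


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both score and cost."""
    front = []
    for c in cands:
        dominated = any(
            o.score >= c.score
            and o.avg_tokens <= c.avg_tokens
            and (o.score > c.score or o.avg_tokens < c.avg_tokens)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return front


def cheapest_meeting(cands: List[Candidate], min_score: float) -> Optional[Candidate]:
    """Cheapest candidate that clears the quality bar, or None if none does."""
    ok = [c for c in cands if c.score >= min_score]
    return min(ok, key=lambda c: c.avg_tokens) if ok else None


# Illustrative numbers only.
cands = [
    Candidate("baseline", 0.60, 900),
    Candidate("longer prompt", 0.92, 1400),
    Candidate("smaller model", 0.55, 400),
    Candidate("verbose prompt", 0.90, 2000),
]

print([c.name for c in pareto_front(cands)])
print(cheapest_meeting(cands, min_score=0.90))
```

In this made-up data the verbose prompt drops out because the longer prompt scores higher for fewer tokens. Whether the smaller model is worth its lower score is a product decision, not something the ranking can answer for you.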
The optimizer doesn’t land in isolation. It sits on top of the observability story Microsoft shipped at Build: GA tracing and evals, OpenTelemetry interoperability so the same tracing covers LangChain, LangGraph, OpenAI SDK, Microsoft Agent Framework, and custom agents, plus continuous monitoring through Azure Monitor. Together that is the actual loop: trace what the agent did, evaluate it against your criteria, optimize what’s broken, deploy the winner, monitor for drift.
A few things I am taking away as I plan my own next agent.
- Evals stop being a test artifact and start being a production input. The eval set is the gradient signal the optimizer follows. The richer and more scenario-coded my evals are, the better the candidates get. That means I am going to spend the time I used to spend on prompt micro-edits on writing better evals instead, which is a much better use of my brain.
- The skill is the new prompt. Reusable procedures, scoped to a task, scored independently, optimized in place. Building for that shape now is going to pay off later when one agent quietly becomes ten.
- For SI and ISV partners, “we wrote the system prompt” stops being a deliverable. The deliverable is the eval suite, the skill library, the optimization rubric, and the observability dashboards. Package those.
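On the first takeaway, this is what I mean by scenario-coded evals: tag every eval case with the scenario it exercises, then compare pass rates per scenario instead of one aggregate number. The sketch below is my own bookkeeping, not a Foundry feature; the function names, the scenario tags, and the results are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One eval result: (scenario tag, did the agent pass this case?)
Result = Tuple[str, bool]


def pass_rate_by_scenario(results: List[Result]) -> Dict[str, float]:
    """Pass rate for each scenario tag."""
    totals = defaultdict(lambda: [0, 0])  # scenario -> [passed, total]
    for scenario, passed in results:
        totals[scenario][0] += int(passed)
        totals[scenario][1] += 1
    return {s: p / n for s, (p, n) in totals.items()}


def regressions(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Scenarios whose pass rate dropped between two runs."""
    return sorted(s for s in before if after.get(s, 0.0) < before[s])


# Made-up results for two runs over the same four cases.
before_run = [("refund", False), ("refund", True), ("greeting", True), ("escalation", True)]
after_run = [("refund", True), ("refund", True), ("greeting", True), ("escalation", False)]

before = pass_rate_by_scenario(before_run)
after = pass_rate_by_scenario(after_run)

print(before)
print(after)
print(regressions(before, after))
```

Both runs pass three of four cases overall, so the aggregate score does not move, yet the escalation scenario went from passing to failing. That is exactly the "fix one thing, accidentally regress another" failure I used to discover by hand.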
The honest summary: prompt engineering as a craft is collapsing into prompt engineering as a build step, and Foundry now ships the build step. As someone who has spent too many evenings staring at prompt files, I am genuinely glad about that. I will save the late nights for problems that actually deserve them.