Can your agent reach for the right tool, with valid arguments, and actually finish the job?

Agent with Tools

Context

A real business problem: an accountant needs to price a service in GBP given a USD gross cost and UK VAT rules. A bare model fails: it makes up exchange rates and forgets VAT semantics. Your agent, call it VALERIA (VAT + Exchange Logic with Reliable Integration), has two hands: a live exchange API and a tax calculator. When thrown a query like 'What is the net GBP price of a $500 USD gross cost at 20% UK VAT?', it stops talking, calls both tools in sequence, reads the results, and finishes the job.

Mission

Build VALERIA: a named agent that, when given a real USD-to-GBP pricing query, reaches for a currency-conversion tool and a VAT-calculation tool with valid arguments, chains the outputs, and deploys to a public URL. Run the query live in front of the room; the trace must show both tool calls firing in order.

Finish Line

A named JARVIS agent deployed to a free public Space, demoing one query that visibly fires both of its tools.

Deliverables

Agent with Tools
lesson

A working agent that, handed a task it would otherwise botch, reaches for the right tool with valid arguments and finishes the job.

Team Roles

Tool Smith

Owns two tool definitions that actually solve the USD→GBP→VAT problem.
- Write two complete tool JSON schemas: (1) 'convert_currency' takes amount_usd (number) and target_currency (enum: GBP), and returns amount_gbp; (2) 'calculate_net_price' takes gross_gbp (number) and vat_rate (number, expressed as a fraction: 0.2 for 20%), and returns net_price_gbp. Every parameter must be typed (string/number/enum), and required fields must be listed explicitly.
- For each tool, write a description specific enough that the model knows WHEN to call it. NOT 'does math' but 'converts a USD amount to GBP using live exchange rates (do not guess rates)' and 'removes UK VAT from a gross (VAT-inclusive) GBP price to return the net price (do not calculate manually)'.
- Test each tool in a live Claude session that accepts tool definitions (for example the Anthropic Console Workbench or the Claude API, rather than the claude.ai chat app). Paste the tool schemas, ask Claude a USD→GBP→VAT question, and screenshot the full trace showing both calls firing with real arguments and real return values. Hand off only when both tools return correct numeric outputs.
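The two schemas described above could be sketched as follows. This assumes the name / description / input_schema layout used for Claude tool definitions, with input_schema holding a JSON Schema object. The check_schema helper is hypothetical (not part of any library) and only illustrates the "every parameter typed, required fields explicit" rule.

```python
# Sketch of the two tool definitions as plain Python dicts.
# Assumption: the target API expects name / description / input_schema.

CONVERT_CURRENCY = {
    "name": "convert_currency",
    "description": (
        "Converts a USD amount to GBP using live exchange rates. "
        "Call this whenever a query gives a USD amount; do not guess rates."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "amount_usd": {"type": "number", "description": "Amount in USD."},
            "target_currency": {"type": "string", "enum": ["GBP"]},
        },
        "required": ["amount_usd", "target_currency"],
    },
}

CALCULATE_NET_PRICE = {
    "name": "calculate_net_price",
    "description": (
        "Removes UK VAT from a gross (VAT-inclusive) GBP price and returns "
        "the net price. Call this instead of calculating manually."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "gross_gbp": {"type": "number", "description": "VAT-inclusive price in GBP."},
            "vat_rate": {"type": "number", "description": "VAT as a fraction, e.g. 0.2 for 20%."},
        },
        "required": ["gross_gbp", "vat_rate"],
    },
}

ALLOWED_TYPES = {"string", "number"}


def check_schema(tool: dict) -> list[str]:
    """Hypothetical lint: return a list of problems, empty if the schema is clean."""
    problems = []
    schema = tool["input_schema"]
    props = schema.get("properties", {})
    # Every parameter needs an explicit, supported type.
    for name, spec in props.items():
        if spec.get("type") not in ALLOWED_TYPES:
            problems.append(f"{tool['name']}.{name}: missing or unsupported type")
    # Every parameter must also appear in the required list.
    missing = set(props) - set(schema.get("required", []))
    for name in sorted(missing):
        problems.append(f"{tool['name']}.{name}: not listed as required")
    return problems
```

Running check_schema on both dicts before pasting them into a live session catches an untyped or optional-by-accident parameter early, which is cheaper than discovering it in front of the room.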

Agent Architect

Fuses VALERIA's identity, tax knowledge, and two tools into one coherent system prompt.
- Write one system prompt (a text file, saved as prompt.txt) that defines VALERIA: a tax-savvy virtual accountant. The prompt must reference the two tool schemas by name, explain when to call each ('If the query asks about USD, call convert_currency first'), and teach VALERIA to read the tool results before proceeding (no guessing exchange rates or VAT once tools are available).
- Embed real tax knowledge: 'UK VAT at 20% means net = gross / 1.2' and 'never calculate exchange rates manually—use the currency tool.' The persona is professional but direct (Central European accountant, no hedging).
- Paste the prompt into a second Claude console conversation with the same tool schemas, ask the same USD→GBP→VAT test query, and verify both tools fire and VALERIA uses their results. Save the trace as a screenshot or JSON export from the Claude API.
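The rule 'net = gross / 1.2' from the prompt can be made concrete with a small sketch of what the calculate_net_price tool might do behind its schema. This is an illustration under assumptions, not the required implementation: rounding to pence with ROUND_HALF_UP and the dict return shape are choices made here.

```python
from decimal import Decimal, ROUND_HALF_UP


def calculate_net_price(gross_gbp: float, vat_rate: float) -> dict:
    """Strip VAT from a VAT-inclusive GBP price: net = gross / (1 + vat_rate)."""
    if gross_gbp < 0 or vat_rate < 0:
        raise ValueError("gross_gbp and vat_rate must be non-negative")
    # Go through str() so 333.33 becomes Decimal('333.33'), not a binary approximation.
    gross = Decimal(str(gross_gbp))
    rate = Decimal(str(vat_rate))
    net = (gross / (Decimal(1) + rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {"net_price_gbp": float(net)}


print(calculate_net_price(333.33, 0.2))  # {'net_price_gbp': 277.78}
```

Dividing by 1.2 is not the same as multiplying by 0.8: the latter would give £266.66 for the same input, which is the kind of slip the prompt should steer VALERIA away from.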

Demo Runner

Deploys VALERIA to a public, testable endpoint and executes the forcing query live.
- Deploy VALERIA and her system prompt to a public URL using Vercel, Hugging Face Spaces, or equivalent. The URL must load and execute queries when pasted into a browser by someone NOT running your laptop. Test it yourself on a second device (phone, tablet, another laptop) before the live demo.
- Craft one query that forces both tools to fire: 'Convert $500 USD gross cost to net GBP price after 20% UK VAT.' The agent must call convert_currency first (output: ~£333.33 at an illustrative rate of ~1.5 USD per GBP; the live figure will vary with the rate), then calculate_net_price on the result (output: ~£277.78 net, i.e. £333.33 / 1.2). Do not accept a one-tool trace.
- Run the query cold on the public URL, in front of the room, showing the trace (API response or console log) so observers see both tool calls fire with real arguments and results. If a tool errors, diagnose it aloud ('Exchange API timed out—retrying') and execute the recovery query. Do not fake the trace.
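Before the live run, the expected chain and the "both tools, in order" check can be rehearsed offline. The sketch below is a dry run under stated assumptions: the exchange rate is a hard-coded stub (1.50 USD per GBP, illustrative only, not a live quote), the call order is scripted here instead of chosen by the model, and run_chain and both_tools_fired_in_order are hypothetical helper names.

```python
from decimal import Decimal, ROUND_HALF_UP

STUB_USD_PER_GBP = Decimal("1.50")  # assumption: stand-in for the live exchange API


def _pence(value: Decimal) -> float:
    """Round to two decimal places and return a float for JSON-friendly output."""
    return float(value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def convert_currency(amount_usd: float, target_currency: str) -> dict:
    if target_currency != "GBP":
        raise ValueError("only GBP is supported")
    return {"amount_gbp": _pence(Decimal(str(amount_usd)) / STUB_USD_PER_GBP)}


def calculate_net_price(gross_gbp: float, vat_rate: float) -> dict:
    net = Decimal(str(gross_gbp)) / (Decimal(1) + Decimal(str(vat_rate)))
    return {"net_price_gbp": _pence(net)}


TOOLS = {"convert_currency": convert_currency, "calculate_net_price": calculate_net_price}


def run_chain(amount_usd: float, vat_rate: float, retries: int = 1):
    """Call both tools in sequence, recording each call; retry once on timeout."""
    trace = []

    def call(name: str, args: dict) -> dict:
        for attempt in range(retries + 1):
            try:
                result = TOOLS[name](**args)
            except TimeoutError:
                if attempt == retries:
                    raise  # out of retries: surface the failure, do not fake a result
                continue
            trace.append({"tool": name, "input": args, "output": result})
            return result

    gbp = call("convert_currency", {"amount_usd": amount_usd, "target_currency": "GBP"})
    net = call("calculate_net_price", {"gross_gbp": gbp["amount_gbp"], "vat_rate": vat_rate})
    return net["net_price_gbp"], trace


def both_tools_fired_in_order(trace: list) -> bool:
    """True only if the trace is exactly convert_currency then calculate_net_price."""
    return [step["tool"] for step in trace] == ["convert_currency", "calculate_net_price"]


net, trace = run_chain(500, 0.2)
print(net, both_tools_fired_in_order(trace))  # 277.78 True
```

The same both_tools_fired_in_order check can be pointed at the trace captured from the deployed agent, so a one-tool run is rejected mechanically and not just by eye.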

Strategist

Owns narrative validation and live-demo execution.
- Validate the forcing query against the real problem: "Does this query represent a genuine accountant use case?" If not, rewrite it. The query must be hard enough that a bare model fails or hallucinates (test with Claude directly, without tools, beforehand).
- Plan the narrative: which tool fires first, what each output means, recovery plan if a tool times out. Write it down.
- Own the live demo: narrate each tool call as it happens ("VALERIA is calling the currency tool because the query mentions USD... here's the exchange rate... now it's chaining that to VAT... here's the net price"). If the room doesn't walk away understanding what each tool did and why, the narration failed.

Exemplars

Devin — the first AI software engineer
Cognition AI

Landmark deployed autonomous agent (shell + editor + browser, long-horizon planning) demoed end-to-end — the bar a JARVIS capstone showcase aims at.