Promptzilla 3000 — Generate, prove, and ship better prompts

The promise

Generating a prompt is the easy part.

Promptzilla uses Claude to write the first draft, then refines, tests and optimises it into something a single pass can’t reach, because the improvements come from watching it fail. Anything can write you a prompt now. The hard part is knowing whether it holds up when the inputs get strange, and whether you’re paying for tokens you don’t need.

01

Generate

Describe the task and pick the shape: a single prompt, an agent with skills attached, a chain, or a full team with an orchestrator and specialists. Whatever you build comes out written for the model you’re targeting. Claude wants XML, GPT leans on Markdown, Gemini likes it terse, and the prompt accounts for that instead of leaving you to fix it.

02

Compose

Refine what you generated by hand, or write your own from scratch. You get framework scaffolding (RISEN, COSTAR, RTF and others) without an AI call, so there’s no token cost and nothing to wait for. It’s an editor that already knows what good prompt structure looks like.

03

Test

The Kaiju Lab runs your prompt through seven kinds of input (normal, edge cases, ambiguous, out-of-scope, adversarial, minimal, multi-part) and scores each output on four dimensions. It catches what the quick happy-path check you’d run yourself misses, and shows where prompts that look finished turn out not to be.

04

Optimise

The Optimiser turns vague instructions into specific ones, adds handling for the failure modes the Lab found, and aims to cut token count by 30–50% without dropping anything that matters. Battle then runs the old version against the new one so you can confirm the change was actually an improvement.

Generate — real outputs

See what comes out of one brief.

Generate reads your brief and builds the right shape for the job, not snippets you finish by hand:

—Agents & teams — a single prompt, a skill-equipped agent, a prompt chain, or a full team — orchestrator, specialists, JSON handoff contracts, error recovery, self-verification.
—n8n workflows — real nodes, typed parameters and wired connections, ready to import.
—Rule files — .cursorrules or CLAUDE.md, tuned to your stack.
—Visual prompts — image or video, in the exact syntax your target platform expects.

Brief: “A cultural research team that delivers platform trend intelligence and brand activation ideas for a given client and market.”

Generated: 7 agents. Full production system prompts. JSON handoff contracts. Error recovery. Self-verification. About a minute.

Orchestrator

Cultural Research Lead

Decomposes, delegates, validates, compiles

Sub-Agent

Data Source Agent

Identifies and validates research sources

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

Sub-Agent

Task Allocation Agent

Splits the brief into parallelisable sub-tasks

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

Sub-Agent

Specialist Executor

Runs each sub-task against its target source

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

Sub-Agent

Synthesizer

Merges specialist outputs into unified findings

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

Sub-Agent

Quality Validator

Scores outputs; routes failures back for repair

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

Sub-Agent

Artifact Compiler

Packages final deliverables in the agreed format

✓Role definition
✓Success criteria
✓Phased instructions
✓Self-verification
✓Worked examples

You can try every tool and every output for free. Bring your own key, and there’s no markup on top.

Test — proof

Watch a good score hide a broken prompt.

Two numbers move as the prompt changes. One of them is lying to you.

— Real-world score — how it handles a perfectly ordinary request
— Adversarial suite — how it handles attacks

The gap between them is the whole point.

Real-world score by version:

v0 Baseline: 86
v1 Rewrite: 64
v2 Regression: 34
v3 Recovery: 90


Across the four iterations, the real-world score dips and recovers. The suite score is the one that looked fine the entire time.

The walkthrough

The prompt is a support agent handling billing disputes. A customer writes in about a charge they don't recognise. The agent has to acknowledge the concern, commit to investigating, and reply in clean JSON, with no speculation and no invented promises.
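To make "clean JSON" concrete, here is a minimal sketch of the kind of format check a test run could apply to the agent's reply. The field names `acknowledgement` and `next_step` and the function `check_reply` are hypothetical, chosen for illustration; the page does not publish the actual output contract.

```python
import json

# Hypothetical required fields; the real output contract is not given on this page.
REQUIRED_KEYS = {"acknowledgement", "next_step"}


def check_reply(raw: str) -> list[str]:
    """Return a list of format problems found in a reply; empty means it passed."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(reply, dict):
        return ["top level is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    return problems


good = '{"acknowledgement": "Sorry about that.", "next_step": "We will investigate."}'
print(check_reply(good))                       # []
print(check_reply("Sure! Here is the JSON"))   # ['not valid JSON']
```

A check like this only covers format. Whether the reply speculates or invents promises still needs a judge that reads the content.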

v0 — Baseline

86/100

Real-world 86

A normal customer email is handled well from the start.

v1 — Rewrite

64/100

Real-world 64

A full restructure adds explicit rules, format constraints and refusal criteria. The adversarial suite climbs. The real-world score slips, because the prompt is getting more rigid.

v2 — Regression

34/100

Real-world 34, suite 86

More hardening pushes the suite to 86, and everything looks ready to ship. But the same ordinary customer email now scores 34: the prompt refuses to help, citing INSUFFICIENT_INPUT, when it had everything it needed. Most tools would have shown you the 86 and called it a win.

v3 — Recovery

90/100

Real-world 90, suite still 86

One more pass aimed at the regression. The adversarial gains hold at 86, and the real-world case recovers to 90, better than where it started, with the structural improvements still intact.

The number most tools show you looked fine at 86 while the number they don’t show crashed to 34. That’s the kind of regression that ships when nobody tracks both.
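Tracking both numbers can be sketched in a few lines. The scores below are the ones from the walkthrough; suite scores are left as `None` where the walkthrough gives no figure. The names `VersionScore` and `real_world_regressions`, and the 10-point `max_drop` threshold, are assumptions made for illustration, not Promptzilla's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionScore:
    version: str
    real_world: int          # score on the ordinary customer email
    suite: Optional[int]     # adversarial suite score; None where no figure is given


# Figures taken from the walkthrough.
HISTORY = [
    VersionScore("v0", 86, None),
    VersionScore("v1", 64, None),
    VersionScore("v2", 34, 86),
    VersionScore("v3", 90, 86),
]


def real_world_regressions(history, max_drop=10):
    """Return versions whose real-world score fell more than max_drop below the baseline."""
    baseline = history[0].real_world
    return [v.version for v in history[1:] if baseline - v.real_world > max_drop]


print(real_world_regressions(HISTORY))  # ['v1', 'v2']
```

With this check in place, v2 is flagged even though its suite score of 86 looks ready to ship.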

“Can't I just ask Claude to write the prompt?”

You can, and Promptzilla does — Claude writes the first draft here, and it's good. The difference is what happens next. That draft goes through refinement, adversarial testing, and optimisation aimed at the specific points where it fell over. The result isn't a better prompt from Claude; it's a prompt Claude couldn't reach on its own, because the improvements come from watching it fail and responding to that.

Doing the same by hand means running every prompt through edge cases, ambiguity, injection and jailbreak attempts, scoring each one consistently, and catching the version that looks finished but isn't. The generation is the easy 10%. This is the other 90%.

Optimise

A better prompt that costs less to run.

The Optimiser makes a working prompt sharper and cheaper, then Battle checks the new version against the old one on your own test cases.

Effectiveness

Vague hedges become specific, testable instructions. It adds hallucination guards, handling for domain edge cases, and tighter output contracts, while leaving the instructions that already work alone.

Efficiency

It removes filler, merges overlapping instructions, and deletes restated defaults, aiming for 30–50% fewer tokens with the critical detail intact. Shorter prompt, smaller bill, less room to drift.
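As a rough worked example of what a 30–50% cut means, the sketch below uses integer arithmetic on a made-up 1,200-token prompt called 10,000 times. The prompt size, the call volume and the function name `token_range_after_cut` are all illustrative assumptions.

```python
def token_range_after_cut(original_tokens, min_cut_pct=30, max_cut_pct=50):
    """Return (smallest, largest) prompt size after a cut of min_cut_pct to max_cut_pct percent."""
    smallest = original_tokens * (100 - max_cut_pct) // 100
    largest = original_tokens * (100 - min_cut_pct) // 100
    return smallest, largest


original = 1200   # hypothetical prompt length in tokens
calls = 10_000    # hypothetical number of calls

smallest, largest = token_range_after_cut(original)
print(smallest, largest)                   # 600 840
print((original - largest) * calls)        # 3600000 tokens saved at a 30% cut
print((original - smallest) * calls)       # 6000000 tokens saved at a 50% cut
```

Multiply the saved tokens by your provider's input price to get the change in your bill.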

Scored diffs

Every change is annotated with the reason for it and re-scored against the same suite, so you can see what moved and why before you accept it.

Battle

Old version against new on identical inputs, scored blind, with full history so you can roll back.
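The idea of a blind comparison on identical inputs can be sketched as follows. This is an illustration under stated assumptions, not Battle's implementation: `battle`, `json_judge` and the sample outputs are hypothetical, and the toy judge only checks whether an output parses as JSON. In a real blind comparison the judge would be a model or a person who sees outputs labelled A and B, never "old" and "new".

```python
import json
import random


def json_judge(output: str) -> int:
    """Toy judge (an assumption): 1 point if the output parses as JSON, else 0."""
    try:
        json.loads(output)
        return 1
    except json.JSONDecodeError:
        return 0


def battle(old_outputs, new_outputs, judge, seed=0):
    """Compare paired outputs with the version labels hidden until scores are in."""
    rng = random.Random(seed)
    wins = {"old": 0, "new": 0, "tie": 0}
    for old, new in zip(old_outputs, new_outputs):
        pair = [("old", old), ("new", new)]
        rng.shuffle(pair)  # randomise which version is presented as A and which as B
        (label_a, out_a), (label_b, out_b) = pair
        score_a, score_b = judge(out_a), judge(out_b)
        if score_a == score_b:
            wins["tie"] += 1
        else:
            wins[label_a if score_a > score_b else label_b] += 1
    return wins


old_outputs = ['{"ok": true}', "Sure, here you go"]
new_outputs = ['{"ok": true}', '{"ok": true}']
print(battle(old_outputs, new_outputs, json_judge))  # {'old': 0, 'new': 1, 'tie': 1}
```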

Often the shorter version scores higher. That's the one you'd have wanted to ship in the first place; you just had no way to find it before.

Discovery

The best setups are already on GitHub. Find them in seconds.

Someone has probably already built the Claude Code setup, Cursor rules, n8n workflow or MCP config you’re about to spend an hour on. Discovery makes it simple to find repos you can use in your projects: it surfaces the strong ones, explains how they work, and installs them in a click, all in one place.

01

Find the proven ones.

Search top-rated repos for Claude Code, Cursor, MCP, agents and n8n workflows, ranked by stars and community signal rather than what we chose to feature.

02

Understand before you install.

Read the files and an auto-generated implementation guide without leaving Promptzilla, so you know what you’re importing before it touches your project.

03

Install in one click.

Pull it straight into your project or save it to your Library.

04

Improve what you find.

Optimise it, test it in the Lab, make it your own.

In the box

The rest of the kit.

EXTRACTOR

Clean and prep files for any AI model. Drop in PDFs, Word docs, slides or spreadsheets and Extractor strips them to plain text, with OCR for scanned pages, ready to paste into a prompt or attach as context.

VISUAL PROMPTS

Image and video prompts written for the platform you’re using — from Midjourney and Flux to Veo and Runway. Control style, lighting, composition and camera movement; for video, get multi-shot scripts with timing per shot and continuity carried across cuts.

Pricing

One paid tier. No markup. Pay your provider directly.

Free Tier

£0/forever

Free forever. No card, no API key required.

⚡No API key needed — runs on a free open model
⚡All tools, with limited runs
⚡5 agentic team generations / month
⚡5 Lab runs / month
⚡5 Optimiser runs / month
⚡10 saved prompts
⚡Extractor — clean and prep files for AI
⚡Community gallery
⚡Markdown, JSON and text exports

Promptzilla

£5/month

Cancel anytime. Access until end of billing period.

⚡Everything in Free
⚡Unlimited agentic team generation
⚡Unlimited Lab runs
⚡Unlimited library
⚡All export formats — CLAUDE.md, .cursorrules, ZIP bundles
⚡Collections, tagging, version history
⚡Battle mode with full history
⚡Import and optimise existing prompts
⚡Priority roadmap input

🔑

Platform-only pricing. £5/month covers the platform. Bring your own Anthropic key and you pay Anthropic directly, at cost, with no markup and no proxying. No key? Start free on an open model, on us.

Payments by Stripe.

Questions

Things worth knowing

AI builders shipping LLM features in tools like Cursor and Claude Code, and anyone writing prompts as part of their job. If you've ever shipped a prompt that worked in dev and broke in production, or you're worried you might, you're in the right place.

A real one. Each generation produces an orchestrator and 4–7 specialist sub-agents, each with a full system prompt: role, success criteria, phased instructions, rules, output format, self-verification, worked examples. Sub-agents have JSON handoff contracts and routing for failures. Export the whole bundle as .claude/agents/ files for Claude Code, or as a markdown bundle for any other framework. Skeptical? The Cultural Research Lead example on this page was generated in about 60 seconds — click “Preview a sample agent” to read a full system prompt.

Most likely. Generated prompts are written for the model family you're targeting — Anthropic, OpenAI, Google, Meta, DeepSeek — because each parses instructions differently. Rule files export as .cursorrules and CLAUDE.md with no manual fixing. Visual prompts use the syntax each platform expects. If something doesn't work in your setup, file an issue and we'll fix it — edge cases are where the platform improves fastest.

It tries to break your prompt before your users do. It throws weird, malicious, ambiguous and awkward inputs at it and tells you what fell over and why. Seven categories: normal, edge cases, minimal input, ambiguous input, out-of-scope requests, adversarial attacks (injection, jailbreaks, format breaking), and complex multi-part requests. Every output is scored on four weighted dimensions — Goal Adherence, Instruction Following, Format Compliance, Output Quality — and given a verdict band from NEEDS WORK to EXCELLENT. Adversarial scoring is inverted: refusing an injection scores high, complying scores low. Clarifying questions are judged on whether they're appropriate, not penalised for existing.
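Here is a minimal sketch of how four weighted dimensions could roll up into one score, and what "inverted" means for adversarial inputs. The four dimension names come from the description above. The weights, the 0/100 values and the function names are assumptions, because the page does not publish them.

```python
# Hypothetical integer weights (they sum to 100); the real weights are not published.
WEIGHTS = {
    "goal_adherence": 40,
    "instruction_following": 30,
    "format_compliance": 20,
    "output_quality": 10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-dimension scores (0-100) into a single weighted score."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def adversarial_score(refused: bool) -> int:
    """Inverted scoring: refusing an injection scores high, complying scores low."""
    return 100 if refused else 0


scores = {
    "goal_adherence": 100,
    "instruction_following": 80,
    "format_compliance": 100,
    "output_quality": 70,
}
print(weighted_score(scores))      # 91.0
print(adversarial_score(True))     # 100
```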

No. New accounts run on a free open model (via OpenRouter) that we fund — no key, no card — with a set number of free runs each month. Add your own Anthropic key any time in Settings for Claude-quality generation, or subscribe at £5/month to lift the limits. Free-tier run limits apply whichever engine you use.

Generation runs one of two ways: on a free open model via OpenRouter (no key), or on Claude with your own Anthropic key. Generated prompts are written for Anthropic, OpenAI, Google, Meta and DeepSeek, and Lab tests run on Claude. Running your prompts against OpenAI, Gemini or DeepSeek directly is on the roadmap — priority depends on subscriber feedback.

Bring your own Anthropic key and you pay Anthropic directly, at the rate Anthropic charges — we don't mark up inference or proxy your prompts through our servers. The platform is a flat £5/month whether you spend £2 or £200 on inference. Your key is stored encrypted, used only for inferences you trigger, never logged or shared, and you can rotate or revoke it any time. Most prompt tools either bake inference into a higher fee at a markup or route your prompts through their servers, which means they see everything. We'd rather not.

Stop shipping prompts and hoping they hold.

Try it on a prompt you're not sure about. Free, no card, about a minute to see something useful.

Generate. Compose. Test. Optimise. Ship.
