Generate. Test. Optimise. Ship.

The best prompt for any task — a single prompt, an agentic team, or a skill-equipped agent. Generated model-aware, hammered by an adversarial test suite, then optimised for effectiveness and efficiency before you ship.

Not “good enough.” The best prompt for the job — proven, not promised.

The promise

Generate the best. Prove it holds. Ship it lean.

Most tools do one of these. Promptzilla does all three, in order — so what you ship is the strongest version of the prompt, not the first one that ran.

01 Generate

The best prompt for any task.

A single prompt, a full agentic team (orchestrator, specialists, handoff contracts, error recovery), or a skill-equipped agent. Generation is model-aware — written differently for Anthropic, OpenAI, Google, Meta, and DeepSeek because each parses instructions differently.

02 Test

Battle-tested, not vibes.

Kaiju Lab throws seven test categories at your prompt, including edge cases, ambiguity, out-of-scope requests, prompt injection, and jailbreaks, and scores every output across four weighted dimensions. It catches the regressions that pass a happy-path check and break in production.

03 Optimise

Sharper and leaner.

The Optimiser hardens effectiveness — vague directives become testable imperatives, failure modes get explicit handling — then strips 30–50% of the tokens without losing intent. Lower latency, lower cost, less surface area for the model to get lost in. Battle proves the new version actually wins.

Generate — real outputs

Any task. Any shape. Generated whole.

A full agentic team. An importable n8n workflow. A repo-grounded rules file. A platform-precise visual prompt. Not snippets — complete, production-shaped output from a single brief.

Brief: “A cultural research team that delivers platform trend intelligence and brand activation ideas for a given client and market.”

Generated: 7 agents. Full production system prompts. JSON handoff contracts. Error recovery. Self-verification. ~1 minute.

Orchestrator
Cultural Research Lead
Decomposes, delegates, validates, compiles
Sub-Agent
Data Source Agent
Identifies and validates research sources
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples
Sub-Agent
Task Allocation Agent
Splits the brief into parallelisable sub-tasks
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples
Sub-Agent
Specialist Executor
Runs each sub-task against its target source
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples
Sub-Agent
Synthesizer
Merges specialist outputs into unified findings
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples
Sub-Agent
Quality Validator
Scores outputs; routes failures back for repair
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples
Sub-Agent
Artifact Compiler
Packages final deliverables in the agreed format
  • Role definition
  • Success criteria
  • Phased instructions
  • Self-verification
  • Worked examples

Every output, every tool, every test case — free to try. Bring your own API key. Zero markup.

Test — proof

Battle-tested isn't a slogan here.

A real prompt, taken through three iterations of testing and optimisation. Watch the suite score climb, and watch the regression caught at iteration 2 that a happy-path check would have shipped. This is what battle-tested actually means.

  • v0: 86 (Baseline)
  • v1: 64 (Rewrite)
  • v2: 34 (Regression)
  • v3: 90 (Recovery)

The prompt: a customer support agent handling billing disputes. A customer emails about an unexpected charge on their statement. The prompt needs to acknowledge the concern, commit to investigating, and respond in structured JSON — no speculation, no invented commitments, no over-promising.

Kaiju Lab scored the original 86/100 on a realistic billing complaint. We ran it through the Optimiser. By iteration two, the adversarial suite had climbed to 86 — but the same realistic billing email now scored 34. The prompt had become so rigid it refused to help a completely normal customer message, citing insufficient input on a query that had all the information it needed.

Most testing tools would have shown you the 86 and declared success. The Lab showed you both numbers.
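The "both numbers" check can be sketched in a few lines. This is a minimal illustration, not Promptzilla's implementation: the function name `find_regressions`, the metric names, and the 5-point `tolerance` are assumptions. The scores are the ones quoted in this case study; the suite score for v0 is not stated, so it is left out.

```python
def find_regressions(previous: dict[str, int], current: dict[str, int],
                     tolerance: int = 5) -> dict[str, int]:
    """Return every shared metric that dropped by more than `tolerance` points.

    The result maps metric name to the size of the drop. Metrics missing from
    either version are skipped rather than guessed.
    """
    drops = {}
    for metric, old_score in previous.items():
        if metric in current and old_score - current[metric] > tolerance:
            drops[metric] = old_score - current[metric]
    return drops


# Scores quoted in the case study (hypothetical metric names).
v0 = {"realistic_email": 86}
v2 = {"adversarial_suite": 86, "realistic_email": 34}
v3 = {"adversarial_suite": 86, "realistic_email": 90}

print(find_regressions(v0, v2))  # {'realistic_email': 52}
print(find_regressions(v2, v3))  # {}
```

Looking only at `adversarial_suite` would have passed v2; comparing every tracked metric is what surfaces the 52-point drop.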

Iteration 01 — v1
64/100
Comprehensive rewrite

Optimiser produced a fully restructured prompt with explicit rules, format constraints, and refusal criteria. Suite scores climbed, but the single-run score on the realistic email dropped from 86 to 64: the model had become more rigid on a clean customer email.

Iteration 02 — v2
34/100
Regression caught

Surgical patches pushed the suite score to 86. But the same realistic customer email scored 34 — the prompt now refused to help, citing INSUFFICIENT_INPUT. Most tools would have shipped this version. The Lab caught it.

Iteration 03 — v3
90/100
Targeted recovery

One more pass with feedback focused on the regression. v3 kept the adversarial gains (suite still 86) and recovered the realistic case to 90. Better than where we started, with all the structural improvements.

This is what ‘refinement that's actually proven’ looks like. Numbers, regressions, recovery — all visible, all in the platform.

Optimise

A better prompt that costs less to run.

Most prompt tools stop at “make it work.” The Optimiser makes it sharper and cheaper — then Battle proves the new version actually beats the old one on your test cases.

Effectiveness

Every vague or hedged directive becomes a precise, testable imperative. Hallucination guards, explicit failure handling for the top domain-specific edge cases, tightened output contracts. Working instructions are preserved, never traded away for new ones.

Efficiency

Aggressive token reduction — filler removed, overlapping instructions merged, restated defaults deleted. Target: 30–50% fewer tokens with 100% of the critical information intact. Shorter prompts mean lower cost, lower latency, and less room for the model to drift.
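As a rough illustration of the 30–50% target, the reduction is simple arithmetic on token counts. The function names and the example counts below are hypothetical; how the Optimiser actually counts tokens is not described here.

```python
def token_reduction_pct(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens removed, relative to the original prompt."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100 * (tokens_before - tokens_after) / tokens_before


def in_target_band(pct: float, low: float = 30, high: float = 50) -> bool:
    """True when the reduction falls inside the stated 30-50% target."""
    return low <= pct <= high


# Hypothetical example: a 1,200-token prompt trimmed to 720 tokens.
reduction = token_reduction_pct(1200, 720)
print(reduction)                  # 40.0
print(in_target_band(reduction))  # True
```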

Scored diffs

Every change is annotated with the reasoning behind it and re-scored against the same suite. You see exactly what moved and why — you never ship on a hunch.

Battle

Run the old version head-to-head against the new one on identical adversarial inputs. The winner is decided on evidence, with full history kept so you can roll back any time.
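A head-to-head comparison of this kind could look like the sketch below. The decision rule (most test cases won), the function name `battle`, and the example scores are all assumptions made for illustration; the page does not say how Battle picks a winner.

```python
def battle(old_scores: dict[str, int], new_scores: dict[str, int]) -> dict:
    """Compare two prompt versions scored on the same test cases.

    Assumed rule: the version that wins more individual cases wins overall.
    """
    if old_scores.keys() != new_scores.keys():
        raise ValueError("both versions must be scored on identical test cases")
    new_wins = sum(new_scores[case] > old_scores[case] for case in old_scores)
    old_wins = sum(old_scores[case] > new_scores[case] for case in old_scores)
    if new_wins > old_wins:
        winner = "new"
    elif old_wins > new_wins:
        winner = "old"
    else:
        winner = "tie"
    return {"winner": winner, "new_wins": new_wins, "old_wins": old_wins}


# Hypothetical per-case scores for two versions of one prompt.
old = {"normal": 88, "edge_case": 70, "prompt_injection": 55}
new = {"normal": 85, "edge_case": 82, "prompt_injection": 90}
print(battle(old, new))  # {'winner': 'new', 'new_wins': 2, 'old_wins': 1}
```

Rejecting mismatched test sets up front keeps the comparison honest: a version cannot win by being scored on easier inputs.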

A 40% shorter prompt that scores higher isn't a trade-off. It's the one you should have shipped in the first place.

The Loop

Generate → Test → Optimise → Ship.

01

Generate

Generate a complete agentic team, a prompt, a rule file, or a visual prompt.

02

Test

Stress-test with Kaiju Lab's 7 auto-generated test categories.

03

Optimise

Sharpen effectiveness and cut 30–50% of tokens with the Optimiser. Prove the win head-to-head with Battle.

04

Ship

Import the workflow JSON into n8n. Drop the bundle into .claude/agents/. Or export CLAUDE.md, .cursorrules, markdown, or JSON. Paste and deploy.
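For readers who want to see what "drop the bundle into .claude/agents/" amounts to, here is a minimal sketch that writes one Markdown file per agent with a small frontmatter header. The helper `write_agent_bundle`, the dictionary keys, and the placeholder system prompt are assumptions; the exact contents of an exported Promptzilla bundle may differ.

```python
import tempfile
from pathlib import Path


def write_agent_bundle(agents: list[dict[str, str]], project_root: Path) -> list[Path]:
    """Write each agent as <name>.md under <project_root>/.claude/agents/."""
    agents_dir = project_root / ".claude" / "agents"
    agents_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for agent in agents:
        body = (
            "---\n"
            f"name: {agent['name']}\n"
            f"description: {agent['description']}\n"
            "---\n\n"
            f"{agent['system_prompt']}\n"
        )
        path = agents_dir / f"{agent['name']}.md"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written


# Agent name and description taken from the sample team above;
# the system prompt is a placeholder, not real generated output.
sample = [{
    "name": "quality-validator",
    "description": "Scores outputs; routes failures back for repair",
    "system_prompt": "(generated system prompt goes here)",
}]

with tempfile.TemporaryDirectory() as tmp:
    for written_path in write_agent_bundle(sample, Path(tmp)):
        print(written_path.relative_to(tmp).as_posix())  # .claude/agents/quality-validator.md
```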

No other platform closes the loop — from generation to adversarially-scored, versioned deployment — across agents, prompts, rules, and visual output.

Start from working config

Find rules files that already ship.

Search GitHub for CLAUDE.md, .cursorrules, AGENTS.md and n8n workflows in real repos — read them in-app, pull them in, then run them through the optimiser like anything else.

01

Search

Live GitHub search by stack — Claude Code, Cursor, n8n, MCP, agents. Stars-ranked, not a hand-curated list.

02

Filter

Require the exact artefact: CLAUDE.md, .cursorrules, .cursor/rules, AGENTS.md, SKILL.md. Repos without it drop out.

03

Read

README and a generated how-to in-app — see how the file is wired before you pull anything in.

04

Reuse

Import the file, then optimise or stress-test it the same way you would your own.
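The Search and Filter steps above can be sketched as a filter followed by a sort. The repository records, field names, and the `filter_and_rank` helper are hypothetical; this does not call GitHub and is not the platform's search code.

```python
def has_artefact(paths: list[str], required: str) -> bool:
    """True if the repo contains the required file, or a directory with that name."""
    return any(p == required or p.startswith(required + "/") for p in paths)


def filter_and_rank(repos: list[dict], required: str) -> list[dict]:
    """Drop repos that lack the required artefact, then rank the rest by stars."""
    matches = [repo for repo in repos if has_artefact(repo["paths"], required)]
    return sorted(matches, key=lambda repo: repo["stars"], reverse=True)


# Hypothetical search results.
repos = [
    {"name": "acme/api", "stars": 120, "paths": ["README.md", "CLAUDE.md"]},
    {"name": "acme/web", "stars": 900, "paths": ["README.md", ".cursorrules"]},
    {"name": "acme/cli", "stars": 450, "paths": ["CLAUDE.md", "src/main.py"]},
]

print([repo["name"] for repo in filter_and_rank(repos, "CLAUDE.md")])
# ['acme/cli', 'acme/api']
```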

Built for how you work

Three ways in. One workspace.

For AI-first builders

Generate CLAUDE.md and .cursorrules grounded in your real repo — paste your tree and package.json, get rules with your actual paths and deps. Generate Claude Code sub-agent teams and export them as .claude/agents/ bundles. Generate full n8n workflows and download importable workflow JSON — real nodes, wired connections, paste straight into n8n. Skip the 40 minutes of setup before you ship.

Rules Generator · Agentic Team Generator · n8n Workflow export · .claude/agents/ bundles

For prompt engineers

Generate complete agentic teams with orchestrators, specialists, handoff contracts, and error recovery. Stress-test any prompt or team against Kaiju Lab's 7 test categories, scored by an AI judge across four dimensions. Compare versions head-to-head with Battle. Apply targeted improvements with scored diffs in Optimiser. Every change is annotated with reasoning, so you never ship on a hunch.

Agentic Team Generator · Kaiju Lab · Optimiser · Battle

For creative technologists

Platform-precise prompts for Midjourney, DALL-E 3, Stable Diffusion, Flux, Ideogram, Imagen, Firefly, and Nano Banana. Cinematic video prompts for Veo 3, Runway Gen-4, Kling, Hailuo, Luma, and Seedance — 50+ directorial references, full control over shot, camera, tone, and lighting. Plus the Reverse Image Engine: upload any image, get the prompt that made it.

Image Prompt Generator · Video Prompt Generator · Reverse Image Engine

Pricing

One paid tier. No markup. Pay your provider directly.

Free Tier

£0/forever

Free forever. No card required.

  • All 8 tools — limited runs
  • 5 agentic team generations / month
  • 5 Lab runs / month
  • 5 Optimiser runs / month
  • 10 saved prompts
  • Community gallery
  • Markdown, JSON, and text exports

Promptzilla

£5/month

Cancel anytime. Access until end of billing period.

  • Everything in Free
  • Unlimited agentic team generation
  • Unlimited Lab runs
  • Unlimited library
  • All export formats — CLAUDE.md, .cursorrules, ZIP bundles
  • Collections, tagging, version history
  • Battle mode with full history
  • Import and optimise existing prompts
  • Priority roadmap input

Platform-only pricing. £5/month covers Promptzilla; you pay Anthropic directly at cost. No markup, no proxying, no surprise bills. Multi-provider via OpenRouter coming soon.

Payments by Stripe.

Questions

Things worth knowing

Promptzilla is for AI builders shipping LLM features in tools like Cursor and Claude Code, and for anyone writing prompts as part of their actual job, not as a side hobby. If you've ever shipped a prompt that worked in dev and broke in production, or you're worried you might, you're in the right place.

The agentic team is a real one. Each generation produces an orchestrator and 4–7 specialist sub-agents, with full system prompts (role, success criteria, phased instructions, rules, output formats, self-verification, worked examples). Sub-agents have JSON handoff contracts and routing for failures. You can export the whole bundle as .claude/agents/ files for Claude Code, or as a markdown bundle for any other framework. If you're sceptical: the homepage's Cultural Research Lead example was generated in about 60 seconds. Scroll up and click “Preview a sample agent” to read one of its full system prompts.
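To make "JSON handoff contracts and routing for failures" concrete, here is a minimal sketch of checking a handoff message and routing it. The field names in `HANDOFF_CONTRACT` and the route labels are hypothetical; they are not Promptzilla's actual schema.

```python
import json

# Hypothetical contract: required fields and their expected JSON types.
HANDOFF_CONTRACT = {
    "task_id": str,
    "from_agent": str,
    "to_agent": str,
    "status": str,
    "payload": dict,
}


def validate_handoff(raw: str) -> tuple[bool, list[str]]:
    """Return (ok, problems) for a JSON handoff message."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    if not isinstance(message, dict):
        return False, ["handoff must be a JSON object"]
    problems = []
    for field, expected_type in HANDOFF_CONTRACT.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return not problems, problems


def route(raw: str) -> str:
    """Send a valid handoff onward; send anything else back for repair."""
    ok, _ = validate_handoff(raw)
    return "next_agent" if ok else "repair"


good = json.dumps({
    "task_id": "t-1", "from_agent": "synthesizer", "to_agent": "quality-validator",
    "status": "complete", "payload": {"findings": []},
})
print(route(good))                  # next_agent
print(route('{"task_id": "t-2"}'))  # repair
```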

Generated output will almost certainly work with your stack. Generated prompts are model-aware: they're written differently for Anthropic, OpenAI, Google, Meta, and DeepSeek because each family parses instructions differently. Rule files export as .cursorrules for Cursor and CLAUDE.md for Claude Code with no manual fixing. Visual prompts use the exact syntax each platform expects. If something doesn't work in your specific setup, file an issue and we'll fix it. Edge cases are where the platform improves fastest.

In plain English, Kaiju Lab tries to break your prompt before your users do. It throws weird, malicious, ambiguous, or just plain awkward inputs at your prompt and tells you exactly what fell over and why. Specifically, it covers seven categories: normal cases, edge cases, minimal input, ambiguous input, out-of-scope requests, adversarial attacks (prompt injection, jailbreaks, format breaking), and complex multi-part requests. Every output is scored across four weighted dimensions (Goal Adherence, Instruction Following, Format Compliance, Output Quality) and labelled with a verdict band from NEEDS WORK to EXCELLENT. Adversarial scoring is inverted: refusing a prompt injection scores high, complying scores low. Clarifying questions are evaluated for whether they're appropriate, not penalised for existing.
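A weighted score with inverted adversarial scoring might be computed along these lines. The four dimension names come from the text above; the weights, the 0 to 100 scale per dimension, and the function names are assumptions, since the actual weighting is not published here.

```python
# Hypothetical weights in percent; they must sum to 100.
WEIGHTS = {
    "goal_adherence": 40,
    "instruction_following": 30,
    "format_compliance": 15,
    "output_quality": 15,
}


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into a single 0-100 score."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS) / 100


def goal_adherence_for_attack(model_complied: bool) -> float:
    """Inverted scoring for adversarial cases: refusing the attack is the goal."""
    return 0.0 if model_complied else 100.0


# A prompt-injection test case where the model refused to comply.
scores = {
    "goal_adherence": goal_adherence_for_attack(model_complied=False),
    "instruction_following": 90,
    "format_compliance": 80,
    "output_quality": 70,
}
print(weighted_score(scores))  # 89.5
```

Had the model complied with the injection, the same case would score 49.5 under these assumed weights, which is the point of inverting the scale for attacks.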

Today, the supported provider is Anthropic, with your own API key. The platform generates Anthropic-native prompts and runs Lab tests on Claude. Coming via OpenRouter: OpenAI, Google (Gemini), DeepSeek, and others. A single OpenRouter key will power all of them. Roadmap priority depends on subscriber feedback.

Your API key is stored encrypted at rest. It is used only for inferences you trigger. It is never logged, never shared, and never sent anywhere except your provider's API endpoint. You can rotate it any time from settings, and revoke it at the provider end whenever you want. We're a small team, not a data broker. Your prompts and outputs aren't training material for anything we sell.

You bring your own key because we don't want to mark up inference costs, and we don't want to proxy your prompts. You pay Anthropic directly at the rate Anthropic charges. Promptzilla charges £5/month flat for the platform (Generator, Lab, Optimiser, Battle, Library, exports) whether you spend £2 in inference or £200. Most prompt tools either charge a higher monthly fee that bakes in inference at a markup, or proxy your prompts through their servers (which means they see everything). We prefer the boring honest option.

Ship the best version, not the first one.

Generate a prompt, an agentic team, or a skill-equipped agent, battle-tested and optimised, in about a minute. No card, no commitment.