- AI
- tooling
- MCP
- testing
- playwright
Playwright or Playwright MCP? Discovery Is Not Defense
An AI agent driving the browser and a deterministic spec suite solve different problems. Here is the line between exploring with MCP and defending with standard Playwright.
You wire up Playwright MCP on a Friday afternoon. You point the agent at the new checkout flow and say “go buy something.” It reads the page, clicks through, hits a disabled pay button that should be enabled, and tells you exactly which field validation is wrong. Two minutes, no test written. It feels like magic.
So you do the obvious thing. You decide this is your regression suite now. Every morning the agent walks the checkout. Three weeks later it is slow, the token bill is embarrassing, and worst of all it gives you a slightly different answer each run. One day it “couldn’t find the pay button.” The button was fine.
You did not pick a bad tool. You picked the right tool for the wrong job.
Two tools that look like one
Both have “Playwright” in the name and both drive a browser, so it is easy to assume they compete. They do not. They solve different problems.
Standard Playwright is a test runner. You write spec files, it runs them in parallel worker processes, the same way every time, and it is built to live in CI. (parallelism docs)
Playwright MCP is a Model Context Protocol server. It exposes browser actions as tools an AI agent calls live. The agent navigates, clicks, and reads the page through the browser’s accessibility tree, “bypassing the need for screenshots or visually-tuned models,” per the project README. No vision model, structured data, deterministic tool calls.
Read that last line carefully, because it hides the trap. The tool calls are deterministic. The agent deciding which tools to call is not. That is the whole story.
Where MCP wins
MCP shines exactly where you do not yet know what you are testing.
A fresh feature with no specs. A bug report you cannot reproduce. A “just click around and tell me what breaks” pass before release. Here the value is the agent’s judgement: it reads an unexpected error state, decides it is wrong, and explains why. You want the nondeterminism, because you are asking a question you do not already know the answer to.
This is the one-off, exploratory mode. Run it once, learn something, move on. The README puts MCP’s home turf plainly: “exploratory automation, self-healing tests, or long-running autonomous workflows where maintaining continuous browser context outweighs token cost concerns.”
Note the last clause. Outweighs token cost concerns. Microsoft is telling you, in its own docs, that MCP costs tokens and you should only pay when the continuous reasoning is worth it.
Where standard Playwright wins
The moment you want to run the same check tomorrow, and the day after, and block a merge when it fails, every property flips.
You want determinism, not judgement. A quality gate that flickers is worse than no gate, because a gate you cannot trust trains the team to ignore it. A spec file gives you the same answer every run; Playwright even labels the in-between cases for you as “passed,” “flaky,” or “failed” so flakiness is visible instead of silent. (retries docs)
You want speed and low cost. A compiled spec suite runs in parallel workers with no model in the loop. No tokens, no inference latency, no waiting on an agent to reason about a button it has clicked a thousand times.
And you want something to build on, which is the part teams underrate.
Build on the output
A deterministic run produces a deterministic artifact, and artifacts are what pipelines are made of.
Standard Playwright ships reporters: HTML, JSON, JUnit, the lot, plus a custom reporter API when you outgrow them. That JUnit file feeds your CI dashboard. The HTML report is what you open when something breaks at 2am. You can shard the suite across machines and merge the reports back into one view.
You cannot build any of that on a chat transcript. An agent’s run is a story, not a record. Stories are great for understanding; they are useless for trend lines, flake tracking, and “is this failure new or has it been red for a week.”
The line that matters
Stop asking which tool is better. Ask what you are doing.
If you are discovering, use MCP. You do not know the answer yet, you want an agent’s judgement, you will run it a handful of times. Nondeterminism is a feature here.
If you are defending, use standard Playwright. You already know the behavior you want to protect, you will run the check forever, and you need it to be fast, cheap, and identical every time.
The cleanest workflow uses both, in sequence. Let the agent explore and find the bug. Then turn what it found into a spec the pipeline runs for the next two years. The agent discovers the test; the runner defends it.
What this changes
Once you split the job in two, the false choice disappears. You stop trying to make MCP cheap and deterministic, because that was never its job. You stop trying to make your spec suite “smart,” because dumb and identical is exactly what a quality gate should be.
Discovery is not defense. Use the agent to find what matters, and a deterministic suite to keep it found. That is not a compromise between two tools. It is two tools doing what each is good at.
If your team is sorting out where AI agents belong in your testing and delivery pipeline, and where deterministic gates should hold the line instead, that is a conversation we are happy to have. It is the kind of boundary we draw with clients every week.