Autotest: A Data-Backed Solution for End-to-End Testing

February 28, 2026

Jayant Chopra, Polarity Labs, Research Division

We introduce Paragon Autotest, an agentic QA platform that reduces manual testing effort by 87% and cuts bug detection time from hours to under five minutes. By leveraging agentic parallelism, Autotest achieves a 3.2x increase in deployment frequency for our customers while reducing production bugs by 40%, demonstrating a structural solution to the industry's long-standing QA bottleneck.

The Problem

For decades, software testing has been the primary bottleneck in the development lifecycle, with testing costs consuming an average of 23-35% of overall IT spending [1]. The core challenge is structural: manual testing is fundamentally sequential. This linear process does not scale, resulting in an average test cycle time of a staggering 23 days [1]. For modern teams, a month-long feedback loop is untenable.

Traditional test automation offered a partial solution but introduced its own problems of brittleness and high maintenance costs. The result is that testing remains the primary source of delivery delays at the enterprise level [1], and bugs that escape to production are up to 100 times more expensive to fix [2].

The QA Bottleneck — By the Numbers

23-35%: IT spend on testing
23 days: average test cycle time
100x: cost multiplier for production bugs
<4%: Paragon false positive rate

Sources: Forbes [1], Functionize [2], BrowserStack [3]

Our Approach

Paragon Autotest addresses this by fundamentally changing the execution model from sequential to parallel. Instead of a single tester working through a list, Paragon dispatches a fleet of autonomous AI agents that execute tests concurrently. This is agentic parallelism.

While a manual tester validates a single payment flow, Paragon can simultaneously have other agents testing dozens of behavioral variations, edge cases, and browser viewports. This transforms test execution from a linear process into a massively parallel one, reducing test suite completion times by over 90% [3]. Unlike simple LLM wrappers, which can increase the rate of false positives [4], Paragon's agents operate within a harness that provides each agent with an isolated, deterministic environment. This keeps the false positive rate under 4%, making results reliable and actionable.
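The sequential-versus-parallel distinction can be sketched in a few lines. The snippet below is an illustrative model, not Paragon's implementation: `run_test` stands in for one agent executing one test in its own isolated environment, and the thread pool stands in for the agent fleet.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_test(name: str, duration: float = 0.05) -> tuple[str, str]:
    """Stand-in for one agent executing one test in an isolated environment."""
    time.sleep(duration)  # simulated browser interaction time
    return name, "pass"

tests = [f"checkout-variant-{i}" for i in range(40)]

# Sequential: total wall-clock time is the sum of all test durations.
start = time.perf_counter()
sequential = [run_test(t) for t in tests]
seq_time = time.perf_counter() - start

# Parallel: wall-clock time approaches the duration of a single test.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=40) as pool:
    parallel = list(pool.map(run_test, tests))
par_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
assert parallel == sequential  # same results, far less wall-clock time
```

The outcomes are identical in both modes; only the wall-clock time changes, which is the whole argument for moving the execution model rather than the tests themselves.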

How Autotest Works

Step 1: Natural Language Input. Describe your test in plain English.

Step 2: Agentic Generation. Agents generate Playwright code and spin up browser environments.

Step 3: Parallel Execution. The test runs alongside hundreds of others with automated analysis.

Step 4: PR & CI Integration. An auto-created PR with tests ready for the CI pipeline.

Time to first test running: under 2 minutes.
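The four steps above can be sketched as a minimal pipeline. Everything here is a hypothetical stand-in: the function names, the generated Playwright snippet, and the PR payload shape are illustrative, not Paragon's API or real agent output.

```python
def generate_test(description: str) -> str:
    """Step 2 stand-in: an agent turns plain English into Playwright code.
    The snippet returned here is illustrative, not real agent output."""
    return (
        "import { test, expect } from '@playwright/test';\n"
        f"test({description!r}, async ({{ page }}) => {{\n"
        "  await page.goto('/checkout');\n"
        "  // ... agent-generated steps and assertions ...\n"
        "});\n"
    )

def execute(code: str) -> dict:
    """Step 3 stand-in: the real system runs the test in an isolated
    browser environment; here we simply report a pass."""
    return {"status": "pass", "analysis": "no regressions detected"}

def open_pr(description: str, code: str, result: dict) -> dict:
    """Step 4 stand-in: package a passing test into a CI-ready pull request."""
    return {
        "title": f"test: {description}",
        "files": {"checkout.spec.ts": code},
        "ci_ready": result["status"] == "pass",
    }

description = "user can complete checkout with a saved card"  # Step 1: plain English
code = generate_test(description)
pr = open_pr(description, code, execute(code))
print(pr["title"])
```

The design point is that the human touches only the first step; generation, execution, and PR packaging are fully automated downstream.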

Results

Our customer data demonstrates a transformative impact across the entire development lifecycle. By automating the most time-consuming QA tasks, teams running Paragon ship faster and with higher quality. These outcomes are a direct result of shrinking the feedback loop. The most critical metric in modern QA is Mean Time to Detect (MTTD). By moving detection from days or hours to minutes, Paragon fundamentally changes the economics of bug fixing.

Customer Impact Metrics

87%: reduction in manual testing effort
3.2x: increase in deployment frequency
40%: reduction in production bugs
<5 min: Mean Time to Detect (MTTD)

Mean Time to Detect (MTTD)

Paragon Autotest: <5 minutes
Traditional Automation: 2-4 hours
Manual Testing: 1-3 days

Lower is better. MTTD measures time from code change to bug detection.

With a sub-5-minute MTTD even for complex, multi-step tests, Paragon allows developers to fix bugs while the context is still fresh, eliminating the expensive context-switching that plagues traditional QA cycles.

Benchmark: Evaluating Agentic E2E Test Accuracy

To evaluate the end-to-end testing capabilities of different agentic systems, we constructed a benchmark designed to measure the ability of an AI agent to correctly execute and validate complex, multi-step user journeys that are representative of real-world web applications.

Benchmark Design and Methodology

The benchmark is constructed from 100 open-source repositories with publicly accessible web applications. Repositories were selected for having complex user flows and active test suites, including projects in e-commerce, developer tooling, and content management.

For each repository, we defined a set of four critical test categories:

Test Categories

1. Credit & Payment Validation: calculating and verifying transaction amounts, interacting with mock payment provider APIs.

2. UI/UX Element Interaction: navigating between tabs, interacting with dynamic components, validating visual state changes.

3. User Onboarding Flows: multi-step registration, profile completion, and initial setup sequences.

4. Complex User Journeys: backtracking, error recovery, and state management across long sessions (e.g., cart abandonment and return).

Evaluation Protocol

Each agent was tasked with executing the full test suite for all 100 repositories. Accuracy is defined as the percentage of test outcomes (pass or fail) that correctly match a human-verified ground truth. For a test to be considered accurate, the agent must not only report the correct pass/fail status but also correctly identify the specific step or assertion that failed.

Evaluation Metric: F0.5 Score

The F0.5 score places twice as much importance on precision as on recall, reflecting the high cost of false positives in CI/CD environments.

F0.5 = (1.25 × Precision × Recall) / (0.25 × Precision + Recall)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
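The formula can be checked numerically. The counts below are illustrative; they also show why the benchmark prefers F0.5 over F1: swapping the false-positive and false-negative counts lowers the score, because false positives are penalized more heavily.

```python
def f_half(tp: int, fp: int, fn: int) -> float:
    """F0.5: the F-beta score with beta = 0.5, weighting precision over recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 1.25 * precision * recall / (0.25 * precision + recall)

# Illustrative counts: 90 true positives, 10 false positives, 20 false negatives.
print(round(f_half(tp=90, fp=10, fn=20), 4))  # 0.8824
# Swapping FP and FN lowers the score: false positives cost more under F0.5.
print(round(f_half(tp=90, fp=20, fn=10), 4))  # 0.8333
```

An F1 score would rate both scenarios identically; F0.5 deliberately does not, matching the CI/CD reality that a false alarm blocks a pipeline while a missed bug merely delays detection.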

Key stat: 63% of test failures came from Complex User Journeys.

Across the 100 repositories, the average test suite completion time was 42 minutes. The most common failure category was Complex User Journeys, which accounted for 63% of all test failures. This highlights the difficulty general-purpose agents have with maintaining state over long, interactive sessions. Within this category, backtracking from multi-step flows was the single most frequent point of failure, with a 14% failure rate across all non-Paragon agents.

Results

Accuracy in Locating Failed Tests Across Agentic QA Tools

Paragon (Ours): 89%
Cursor Cloud Agent: 81%
Claude Cloud Agent: 80%
Codex 5.3: 78%

Higher is better. Accuracy measured across payment validation, UI testing, onboarding flows, and complex user journeys.


Paragon achieved 89% accuracy, outperforming Cursor Cloud Agent (81%), Claude Cloud Agent (80%), and Codex 5.3 (78%). The results highlight a key distinction: while all modern agents can handle simple, single-page tests, accuracy on complex, multi-step journeys is what separates a purpose-built QA agent from a general-purpose coding assistant.

Paragon's performance is attributed to its sophisticated state-tracking and execution harness, which is specifically designed to manage context across long and complex user sessions. General-purpose agents, while powerful, lack this specialized architecture, leading to a drop in accuracy on the most challenging multi-step test cases.

Key Takeaways

Sequential execution is the bottleneck

The 23-day average test cycle is a direct result of a linear, sequential process [1].

Agentic parallelism is the solution

Concurrent AI agents reduce test execution times by over 90% [3].

Low false positive rate is essential

A deterministic execution harness keeps false positives under 4% for trust and adoption.

Under 2 minutes to first test

Instant time-to-value removes adoption friction for enterprise tooling.

Getting Started

The testing and monitoring features are available now on app.paragon.run. Existing Paragon users can access Autotest under the Testing tab in the dashboard. New users can sign up and connect their first repository in minutes to get started.

The testing suite and monitoring features are included in Startup and Enterprise plans, with usage-based pricing for compute resources. See our pricing page for details on PCU rates for each operation.

References

[1] Khan, A. (2024). The Cost Of Time: How Test Build Delays Impact Mid-Sized Companies. Forbes.

[2] Functionize. (2023). The Cost of Finding Bugs Later in the SDLC.

[3] BrowserStack. (2026). Parallel Test Execution vs. Sequential Testing.

[4] Ramler, R. et al. (2025). Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency. arXiv.
