Expanding on the foundations of WebVoyager
TL;DR: Web Bench is a new dataset for evaluating web browsing agents, consisting of 5,750 tasks across 452 different websites, with 2,454 of the tasks open sourced. Anthropic's Sonnet 3.7 CUA is the current SOTA; the detailed results are here.
Over the past few months, web browsing agents such as Skyvern, Browser Use, and OpenAI's Operator (CUA) have taken the world by storm. These agents have been used in production for a variety of tasks, from helping people apply to jobs and download invoices to filing IRS Form SS-4 for newly incorporated companies.

Skyvern attempting to purchase a product

Skyvern attempting to fill out the IRS form
Most agents report state-of-the-art performance, but we find that browser agents still struggle with a wide variety of tasks, particularly ones involving authentication, form filling, and file downloading.
This is because the standard benchmark today (WebVoyager) focuses on read-heavy tasks and consists of only 643 tasks across just 15 websites (out of 1.1 billion possible websites!). While a great starting point, the benchmark does not capture the internet’s adversarial nature towards browser automation or the difficulty of tasks that mutate data on a website.

Can’t access chase.com

Can’t close a pop-up dialog
As a result, we partnered with Halluminate to create a new benchmark that better quantifies these failures. Our goal was to build a consistent measurement system for AI web agents, expanding on the foundations laid by WebVoyager by:
We’re excited to announce Web Bench, a new dataset for evaluating web browsing agents that consists of 5,750 tasks across 452 different websites, with 2,454 of the tasks open sourced.



The 452 websites are distributed across 17 primary categories. We sampled the benchmark websites from the top 1,000 websites in the world, measured by web traffic. We then cleaned this set of sites by removing:

These results were somewhat surprising, so we cut the data along two dimensions to understand where agents falter:

Read-only tasks are those in which an agent visits a website and navigates its sitemap until a particular answer or state is found.
Unsurprisingly, these results tracked WebVoyager more closely, as that dataset was largely curated to have agents navigate websites and answer questions.
The two biggest sources of failure for read-heavy tasks are:

Tasks involving:
had a much lower pass rate across the board.
Digging a bit deeper into these failures, two culprits popped up:

Can’t close subscription pop-up

Unable to find and click coupon buttons
These issues manifest as agents making adverse changes while filling out forms, or optimistically assuming that clicking a “Submit” button completes the task when, in reality, a CAPTCHA has appeared that still needs to be solved.
This is very similar to the phenomenon observed in coding agents, where smarter models try to “overhelp” with code changes, either modifying unrelated parts of the codebase or repeatedly suggesting incorrect changes because they’re missing important context.
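To make the “Submit doesn’t mean done” failure concrete, here is a minimal sketch of the kind of post-click verification a write task needs before an agent declares success. It uses Playwright with hypothetical selectors and a placeholder URL, and it is an illustration only, not how Skyvern or any other agent actually implements this check.

```python
# Minimal sketch: verify that a "Submit" click actually completed the task,
# rather than assuming success. Selectors, captcha hints, and the URL are
# hypothetical placeholders.
from playwright.sync_api import sync_playwright

CAPTCHA_HINTS = [
    "iframe[src*='recaptcha']",   # Google reCAPTCHA widget
    "iframe[src*='hcaptcha']",    # hCaptcha widget
]

def submit_and_verify(page, success_text: str) -> bool:
    page.click("button[type=submit]")        # hypothetical submit button
    page.wait_for_load_state("networkidle")

    # Don't assume the click succeeded: a challenge may have appeared instead.
    for selector in CAPTCHA_HINTS:
        if page.query_selector(selector):
            return False                      # captcha must be solved first

    # Only report success if the page shows the expected confirmation text.
    return success_text.lower() in page.content().lower()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/form")     # placeholder URL
    done = submit_and_verify(page, "thank you for your submission")
    print("task complete" if done else "not done: captcha or missing confirmation")
    browser.close()
```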

Digging deeper, we grouped the overall failures into two buckets:
The four biggest categories of agent errors are:
The three biggest categories of infrastructure issues are:
These findings imply that the browser infrastructure powering an agent is just as important as the quality of the agent itself.
While accuracy is the most important characteristic of a web browsing agent, there is also clear demand for “faster” and “cheaper” agents.
Fast and cheap agents can be characterized by tracking the following metrics:
While the right pricing model for browser agents hasn’t emerged yet, this data gives an important insight into whether pricing per hour (common amongst hosted browsers and older robotic process automation) or pricing per step (common amongst computer-use APIs) is the right methodology.
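As a rough illustration of how the two pricing models diverge, consider the toy calculation below; every rate and step count in it is a made-up assumption, not a Web Bench measurement.

```python
# Toy comparison of per-hour vs. per-step pricing for a browser agent task.
# All numbers are illustrative assumptions.

def per_hour_cost(seconds_per_task: float, hourly_rate: float) -> float:
    """Cost when paying for hosted-browser time (hourly / RPA-style pricing)."""
    return (seconds_per_task / 3600.0) * hourly_rate

def per_step_cost(steps_per_task: int, price_per_step: float) -> float:
    """Cost when paying per agent step / page scan (computer-use-API-style pricing)."""
    return steps_per_task * price_per_step

# Hypothetical task profile: 25 steps at ~12 seconds per step.
steps, sec_per_step = 25, 12
print(f"per-hour pricing: ${per_hour_cost(steps * sec_per_step, hourly_rate=1.50):.3f}")
print(f"per-step pricing: ${per_step_cost(steps, price_per_step=0.02):.3f}")
```

The same task can look cheap under one model and expensive under the other, which is why tracking both step counts and wall-clock time matters.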

This metric is important for a few latency-sensitive market segments/situations:

Most web browsing agents’ costs scale with the number of steps (i.e., page scans) required to complete a specific task.
Agents may use a varying number of steps for a few reasons:
The goal of the first version of Web Bench was to start with a reasonably large number of websites + tasks.
The next few versions will expand (1) the number of websites encompassed by this benchmark, (2) the languages of the websites, and (3) the categories of websites.
The overall cost to run this benchmark was approximately $3,000 per agent for human evaluations. This cost is high enough to be prohibitive for testing every single agent on the market.
The Halluminate team plans to release an open-source automated evaluation harness so teams can self-serve testing against Web Bench. We leave this for future work.
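For intuition, a self-serve harness could look something like the sketch below. The task schema, agent interface, and success checker are all hypothetical; the actual Halluminate harness has not been released.

```python
# Hypothetical sketch of a self-serve evaluation loop over benchmark tasks.
# Task schema, agent interface, and checker are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    website: str
    instruction: str

def evaluate(tasks: list[Task],
             run_agent: Callable[[Task], str],
             check: Callable[[Task, str], bool]) -> float:
    """Run an agent on every task and return the overall pass rate."""
    passed = 0
    for task in tasks:
        result = run_agent(task)      # agent returns a final answer / state summary
        if check(task, result):       # checker decides pass/fail for this task
            passed += 1
    return passed / len(tasks)

# Example usage with stubbed components:
tasks = [Task("t1", "example.com", "Find the support contact email")]
pass_rate = evaluate(
    tasks,
    run_agent=lambda t: "support@example.com",   # stand-in for a real browser agent
    check=lambda t, out: "@" in out,             # stand-in for a real grader
)
print(f"pass rate: {pass_rate:.1%}")
```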
If you’d like to submit benchmark results for your own agent, please contact the Halluminate team.
For a comprehensive technical debrief on the dataset creation and evaluation methodology, check out the Halluminate team writeup here.
Yes.
We had examples of Skyvern chatting with GitHub’s AI Support bot

Browser Use googled how to evade Cloudflare captchas
