Expanding on the foundations of WebVoyager
TL;DR: Web Bench is a new dataset for evaluating web browsing agents, consisting of 5,750 tasks across 452 different websites, with 2,454 of the tasks open sourced. Anthropic's Sonnet 3.7 CUA is the current SOTA; the detailed results are here.
Over the past few months, web browsing agents such as Skyvern, Browser Use, and OpenAI's Operator (CUA) have taken the world by storm. These agents have been used in production for a variety of tasks, from helping people apply to jobs and download invoices to filing IRS Form SS-4 for newly incorporated companies.

Skyvern attempting to purchase a product

Skyvern attempting to fill out the IRS form
Most agents report state-of-the-art performance, but we find that browser agents still struggle with a wide variety of tasks, particularly ones involving authentication, form filling, and file downloading.
This is because the standard benchmark today (WebVoyager) focuses on read-heavy tasks and consists of only 643 tasks across just 15 websites (out of 1.1 billion possible websites!). While a great starting point, the benchmark does not capture the internet’s adversarial nature towards browser automation or the difficulty of tasks that mutate data on a website.

Can’t access chase.com

Can’t close a pop-up dialog
As a result, we partnered with Halluminate to create a new benchmark that better quantifies these failures. Our goal was to build a consistent measurement system for AI web agents, expanding on the foundations laid by WebVoyager by:
We’re excited to announce Web Bench, a new dataset for evaluating web browsing agents that consists of 5,750 tasks across 452 different websites, with 2,454 of the tasks open sourced.



The 452 websites are distributed across 17 primary categories. We sampled the benchmark websites from the top 1,000 websites in the world, measured by web traffic. We then cleaned this set of sites by removing:

These results were somewhat surprising, so we cut the data along two dimensions to understand where agents falter:

Read-only tasks are those in which an agent visits a website and navigates its sitemap until a particular answer or state is found.
Unsurprisingly, these results tracked WebVoyager more closely, as that dataset was largely curated to have agents navigate websites and answer questions.
The two biggest sources of failure for read-heavy tasks are:

Tasks involving:
had a much lower pass rate across the board.
Digging a bit deeper into these failures, two culprits popped up:

Can’t close subscription pop-up

Unable to find and click coupon buttons
These issues manifest as agents making adverse changes while filling out forms, or optimistically assuming that clicking a “Submit” button completes the task when, in reality, a CAPTCHA has appeared that still needs to be solved.
This is very similar to the phenomenon observed in coding agents, where smarter models try to “overhelp” with code changes, either modifying unrelated parts of the codebase or repeatedly suggesting incorrect changes because they’re missing important context.
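To make the “Submit doesn’t mean done” failure concrete, here is a minimal sketch of the kind of post-click verification a write task needs before an agent declares success. It uses Playwright with hypothetical selectors and a placeholder URL, and it is an illustration only, not how Skyvern or any other agent actually implements this check.

```python
# Minimal sketch: verify that a "Submit" click actually completed the task,
# rather than assuming success. Selectors, captcha hints, and the URL are
# hypothetical placeholders.
from playwright.sync_api import sync_playwright

CAPTCHA_HINTS = [
    "iframe[src*='recaptcha']",   # Google reCAPTCHA widget
    "iframe[src*='hcaptcha']",    # hCaptcha widget
]

def submit_and_verify(page, success_text: str) -> bool:
    page.click("button[type=submit]")        # hypothetical submit button
    page.wait_for_load_state("networkidle")

    # Don't assume the click succeeded: a challenge may have appeared instead.
    for selector in CAPTCHA_HINTS:
        if page.query_selector(selector):
            return False                      # captcha must be solved first

    # Only report success if the page shows the expected confirmation text.
    return success_text.lower() in page.content().lower()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/form")     # placeholder URL
    done = submit_and_verify(page, "thank you for your submission")
    print("task complete" if done else "not done: captcha or missing confirmation")
    browser.close()
```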

Digging deeper, we grouped the overall failures into two buckets:
The four biggest categories of agent errors are:
The three biggest categories of infrastructure issues are:
These findings imply that the browser infrastructure powering an agent is just as important as the quality of the agent itself.
While accuracy is the most important characteristic of a web browsing agent, there is also clear demand for “faster” and “cheaper” agents.
Fast and cheap agents can be characterized by tracking the following metrics:
While the right pricing model for browser agents hasn’t emerged yet, this data gives an important insight into whether pricing per hour (common amongst hosted browsers and older robotic process automation) or pricing per step (common amongst computer-use APIs) is the right methodology.
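As a rough illustration of how the two pricing models diverge, consider the toy calculation below; every rate and step count in it is a made-up assumption, not a Web Bench measurement.

```python
# Toy comparison of per-hour vs. per-step pricing for a browser agent task.
# All numbers are illustrative assumptions.

def per_hour_cost(seconds_per_task: float, hourly_rate: float) -> float:
    """Cost when paying for hosted-browser time (hourly / RPA-style pricing)."""
    return (seconds_per_task / 3600.0) * hourly_rate

def per_step_cost(steps_per_task: int, price_per_step: float) -> float:
    """Cost when paying per agent step / page scan (computer-use-API-style pricing)."""
    return steps_per_task * price_per_step

# Hypothetical task profile: 25 steps at ~12 seconds per step.
steps, sec_per_step = 25, 12
print(f"per-hour pricing: ${per_hour_cost(steps * sec_per_step, hourly_rate=1.50):.3f}")
print(f"per-step pricing: ${per_step_cost(steps, price_per_step=0.02):.3f}")
```

The same task can look cheap under one model and expensive under the other, which is why tracking both step counts and wall-clock time matters.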

This metric is important for a few latency-sensitive market segments/situations:

Most web browsing agents’ costs scale with the number of steps (i.e., page scans) required to complete a specific task.
Agents may use a varying number of steps for a few reasons:
The goal of the first version of Web Bench was to start with a reasonably large number of websites + tasks.
The next few versions will expand (1) the number of websites encompassed by this benchmark, (2) the languages of the websites, and (3) the categories of websites.
The overall cost to run this benchmark was approximately $3,000 per agent for human evaluations. This cost is high enough to be prohibitive for testing every single agent on the market.
The Halluminate team plans to release an open-source automated evaluation harness so teams can self-serve testing against Web Bench. We leave this for future work.
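For intuition, a self-serve harness could look something like the sketch below. The task schema, agent interface, and success checker are all hypothetical; the actual Halluminate harness has not been released.

```python
# Hypothetical sketch of a self-serve evaluation loop over benchmark tasks.
# Task schema, agent interface, and checker are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    website: str
    instruction: str

def evaluate(tasks: list[Task],
             run_agent: Callable[[Task], str],
             check: Callable[[Task, str], bool]) -> float:
    """Run an agent on every task and return the overall pass rate."""
    passed = 0
    for task in tasks:
        result = run_agent(task)      # agent returns a final answer / state summary
        if check(task, result):       # checker decides pass/fail for this task
            passed += 1
    return passed / len(tasks)

# Example usage with stubbed components:
tasks = [Task("t1", "example.com", "Find the support contact email")]
pass_rate = evaluate(
    tasks,
    run_agent=lambda t: "support@example.com",   # stand-in for a real browser agent
    check=lambda t, out: "@" in out,             # stand-in for a real grader
)
print(f"pass rate: {pass_rate:.1%}")
```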
If you’d like to submit benchmark results for your own agent, please contact the Halluminate team.
For a comprehensive technical debrief on the dataset creation and evaluation methodology, check out the Halluminate team writeup here.
Yes.
We had examples of Skyvern chatting with GitHub’s AI Support bot

Browser Use googled how to evade Cloudflare captchas
