[Browser Use](https://browser-use.com) is an open source AI browser agent capable of loading and interacting with web pages. I am exploring its use for my [[Artists' Opportunity Finder]] project, and benchmarked a typical request to determine the speed and cost tradeoffs across a mix of local and cloud models.

## Speed vs Cost

Using an AI agent to browse websites can be cheap, reliable, or fast - but not all three. Here are the benchmark results.

## Benchmark Task

Visit a web page and extract all the URLs pointing to artists' opportunities, prompted as:

> Visit https://www.nyfa.org/opportunities/?opportunity_discipline=Painting and extract each listed opportunity's details URL.

%% This specific task could have been made easier by turning off vision and including the URLs in the context provided to the model by passing `include_attributes=['href']` and `use_vision=False` to the Browser Use Agent constructor - but these settings were left out to force the models to search the DOM, increasing the challenge. %%

A minimal sketch of the Browser Use wiring appears at the end of this note.

### Cloud LLM Results

I first tested recent vision-capable LLMs with Zero Data Retention (ZDR) providers on OpenRouter at FP8 or better quantization:

| **OpenRouter LLM** | **Cost** | **Time** | **Failures** |
| --- | --- | --- | --- |
| [**Gemini 3 Flash Preview**](https://openrouter.ai/google/gemini-3-flash-preview) | 2.7 cents | 36 seconds | |
| [**Gemini 2.5 Flash**](https://openrouter.ai/google/gemini-2.5-flash) | 3.1 cents | 42 seconds | |
| [**GPT-5 Mini**](https://openrouter.ai/openai/gpt-5-mini) | 0.9 cents | 82 seconds | |
| [**Gemini 2.5 Flash Lite**](https://openrouter.ai/google/gemini-2.5-flash-lite) | 0.7 cents | 30 seconds | ⚠️ only 1/15 URLs |
| [**GLM 4.6V**](https://openrouter.ai/z-ai/glm-4.6v) | 6.9 cents | 396 seconds | ⚠️ only 3/15 URLs |
| [**GPT-5 Nano**](https://openrouter.ai/openai/gpt-5-nano) | 0.5 cents | 79 seconds | ⚠️ only 1/15 URLs |
| [**Claude Haiku 4.5**](https://openrouter.ai/anthropic/claude-haiku-4.5) | 4.4 cents | -- | ⚠️ invalid JSON - run failed |

* I would have liked to try InternVL and Grok 4.1 Fast, but they did not have a ZDR provider at FP8 or above.
* I tested some additional LLMs that failed; they are not listed.

### Local LLM Results

I then tested locally-runnable vision-capable LLMs via LM Studio on a MacBook M4, with each model pre-loaded and a 32k context length:

| **Local LLM** | **Time** | **Failures** |
| --- | --- | --- |
| [**BU-30B-A3B-Preview**](https://model.lmstudio.ai/download/mlx-community/bu-30b-a3b-preview-4bit) mlx 4bit | 314 seconds | ⚠️ returned 1 opportunity, and it was from the wrong page |
| [**Devstral-Small-2-2512**](https://lmstudio.ai/models/mistralai/devstral-small-2-2512) mlx 4bit | -- | ⚠️ requests timed out (> 120s) |
| [**Qwen3-VL-4B**](https://lmstudio.ai/models/qwen/qwen3-vl-4b) mlx 4bit | -- | ⚠️ visited a URL instead of returning |

## Result Summary

* No local LLM I could run completed the task.
* **GPT-5 Mini was the cheapest** at < 1 cent, but took 82 seconds.
* **Gemini 3 Flash Preview was the fastest** at 36 seconds, but expensive at 2.7 cents.
* Despite a higher per-token cost than Gemini 2.5 Flash, Gemini 3 Flash was still cheaper per run because it used fewer tokens.
* **Browser Use 30B A3B Preview did not accomplish the task** but I look forward to testing the next version, as this model should be the most adept with the Browser Use tools.
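## Appendix: Harness Sketch

For reference, here is a minimal sketch of how a cloud run above is wired up - Browser Use driving an OpenRouter model through OpenRouter's OpenAI-compatible endpoint. This is illustrative rather than my exact harness: the model slug and step cap are examples, and recent Browser Use releases ship their own `ChatOpenAI` client (older versions accept LangChain chat models instead).

```python
# Sketch: time one Browser Use run against an OpenRouter-hosted model.
# Assumptions: a recent browser-use release with browser_use.llm.ChatOpenAI,
# and OPENROUTER_API_KEY set in the environment.
import asyncio
import os
import time

from browser_use import Agent
from browser_use.llm import ChatOpenAI

TASK = (
    "Visit https://www.nyfa.org/opportunities/?opportunity_discipline=Painting "
    "and extract each listed opportunity's details URL."
)

async def main() -> None:
    llm = ChatOpenAI(
        model="openai/gpt-5-mini",                # any OpenRouter model slug
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    agent = Agent(
        task=TASK,
        llm=llm,
        # The settings discussed in the benchmark note would make this task
        # easier; they were deliberately left at their defaults here:
        # include_attributes=['href'],
        # use_vision=False,
    )
    start = time.monotonic()
    history = await agent.run(max_steps=25)       # step cap is illustrative
    print(f"elapsed: {time.monotonic() - start:.1f}s")
    print(history.final_result())                 # the extracted URLs, if any

asyncio.run(main())
```

OpenRouter reports per-request token usage and cost in its activity log, which is one way to fill in the cost column.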
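The local runs swap the endpoint for LM Studio's built-in server, which speaks the same OpenAI-compatible protocol (by default at `http://localhost:1234/v1`). Again a sketch, assuming the same Browser Use version as above; the model identifier must match whatever LM Studio shows for the loaded model, and the API key is a placeholder since LM Studio ignores it.

```python
# Sketch: the same harness pointed at a locally-served model in LM Studio.
import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI

llm = ChatOpenAI(
    model="qwen/qwen3-vl-4b",             # identifier of the model loaded in LM Studio
    base_url="http://localhost:1234/v1",  # LM Studio's default local server address
    api_key="lm-studio",                  # ignored by LM Studio; the client wants a value
)

agent = Agent(
    task=(
        "Visit https://www.nyfa.org/opportunities/?opportunity_discipline=Painting "
        "and extract each listed opportunity's details URL."
    ),
    llm=llm,
)

asyncio.run(agent.run(max_steps=25))
```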