OpAgent is a powerful agentic framework designed for autonomous web navigation and operation. It comes in two primary modes to suit different use cases: a full-featured Agentic Framework for state-of-the-art performance, and a streamlined Single-Model Mode for ease of use and quick deployment.
- News
- Overview
- Performance Highlights
- Getting Started
- Detailed Introduction: The Agentic Framework
- Citation
- 🎉🎉🎉 [2026/02/14] We have released our technical report. Please refer to the OpAgent Technical Report for details.
- 🔥🔥🔥 [2026/01/22] We are pleased to announce that OpAgent achieves a remarkable 71.6% resolve rate on the WebArena leaderboard.
This repository provides the code and models for OpAgent, an operator agent for web navigation. We offer two distinct modes:
- **OpAgent: Single-Model Mode** (`opagent_single_model/` directory)
  - A simplified, end-to-end approach where a single, powerful Vision-Language Model (VLM) directly performs web navigation tasks.
  - This mode is designed for accessibility and quick deployment, offering a powerful yet easy-to-use solution for web automation.
  - Perfect for developers who want to quickly integrate a web agent into their applications.
- **OpAgent: The Full Agentic Framework** (`opagent/` directory)
  - An advanced, multi-agent system composed of a Planner, Grounder, Reflector, and Summarizer.
  - This architecture enables sophisticated reasoning, robust error recovery, and self-correction, achieving top-tier performance on complex, long-horizon web tasks.
  - Ideal for researchers and users seeking maximum performance and a deep dive into agentic AI architectures.
We employ an innovative Online Agentic Reinforcement Learning (RL) pipeline to significantly improve the capability of a single VLM. Our RL-enhanced model (RL-HybridReward-Zero) achieves a 38.1% success rate (@Pass5) on WebArena, outperforming other monolithic baselines and demonstrating a 10.7% absolute improvement over the original model.
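The @Pass5 metric counts a task as solved if at least one of five independent rollouts succeeds. A minimal sketch of that computation (the function name and data layout are ours, for illustration only):

```python
def pass_at_k(rollout_results: list[list[bool]]) -> float:
    """Fraction of tasks solved in at least one of their k rollouts.

    rollout_results[i] holds the per-rollout success flags for task i.
    """
    if not rollout_results:
        return 0.0
    solved = sum(1 for attempts in rollout_results if any(attempts))
    return solved / len(rollout_results)

# Two tasks, five rollouts each: task 0 solved on its 3rd try, task 1 never.
print(pass_at_k([[False, False, True, False, False],
                 [False] * 5]))  # -> 0.5
```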

Our full agentic framework, OpAgent, achieves a state-of-the-art (SOTA) 71.6% resolve rate on the WebArena benchmark (formerly OAgent on the WebArena leaderboard), securing the #1 position on the leaderboard in January 2026.

Depending on which mode you'd like to use, please follow the instructions below.
This mode provides a ready-to-use, interactive web agent powered by a single model. It's the quickest way to see OpAgent in action.
For detailed installation and usage instructions, please refer to the README in the opagent_single_model directory:
➡️ Go to Single-Model Mode Usage Guide ⬅️
A quick preview of how to get started:
```shell
cd opagent_single_model
pip install -r requirements.txt
python main.py
```

This mode utilizes a multi-agent architecture (Planner, Grounder, etc.) to achieve the highest performance.
The core logic is implemented in the ./opagent/ directory, with evaluation scripts located in ./demo/local_agent_eval.py. This setup is primarily designed for benchmark evaluation and research.
To run the evaluation:
```shell
# Detailed setup and execution instructions are work-in-progress.
# Please refer to the code in the 'opagent' and 'demo' directories for now.
# We welcome community contributions to improve the documentation!
```

(Learn more about the Agentic Framework's architecture below)
This section details the architecture of our high-performance, multi-agent framework.
This document describes the structure of the demo WebAgent framework implemented in the ./opagent/local_agent_eval.py script. This framework aims to execute and evaluate automated tasks in real Web environments (such as the WebArena Shopping environment) via local/remote model calls.
This Agent adopts a modular Planner-Grounder-Reflector-Summary architecture. The entire system consists of a task scheduler, multi-threaded Workers, browser environment management, and core Agent logic.
The execution flow of the Agent is a closed-loop system, mainly containing the following steps:
- **Observation**: Acquire the current webpage screenshot.
- **Reflector** (Gemini3-Pro):
  - Analyzes the execution result of the previous action.
  - Checks if the task is completed (`is_task_done`).
  - Collects key information (Notes) to satisfy user requests.
  - Provides feedback signals to the Planner.
- **Planner** (Gemini3-Pro):
  - Receives feedback from the Reflector, the current screenshot, and domain expert tips (`tips`).
  - Generates the next high-level instruction (`instruction`) and action type (`action_type`).
  - Expert Strategy: Dynamically injects expert knowledge and navigation strategies for specific domains (e.g., Adobe Commerce Admin).
- **Grounder** (PostTraining-Qwen2.5-VL-72B):
  - We collected millions of data points and trained a version of the Grounder based on Qwen2.5-VL-72B through post-training (SFT and RL).
  - Receives instructions and the current screenshot from the Planner.
  - Uses a Vision Language Model (VLM) to output specific page coordinates (`coords`) or operation parameters.
- **Action Execution** (Playwright):
  - Executes specific operations (Click, Type, Scroll, Select Option, etc.) via Playwright.
- **Summary** (Gemini3-Pro):
  - Generates the final answer based on execution history and collected information at the end of the task.
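The closed loop above can be sketched as follows. This is a simplified illustration: the module call signatures, dict keys, and `run_episode` name are stand-ins of ours, not the repository's actual APIs.

```python
def run_episode(env, reflector, planner, grounder, summarizer, max_steps=30):
    """One observe -> reflect -> plan -> ground -> act iteration per step."""
    notes, history = [], []
    for _ in range(max_steps):
        screenshot = env.screenshot()                        # Observation
        feedback = reflector(screenshot, history)            # Reflector
        notes.extend(feedback.get("notes", []))              # collect Notes
        if feedback.get("is_task_done"):
            break
        plan = planner(screenshot, feedback)                 # Planner
        action = grounder(screenshot, plan["instruction"])   # Grounder -> coords
        env.execute(action)                                  # Playwright execution
        history.append((plan, action))
    return summarizer(history, notes)                        # Summary
```

Each role is passed in as a callable, mirroring how the framework swaps reasoning and grounding models independently.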
The main body of the Agent, responsible for maintaining task status, calling various model modules, and executing the main loop.
- State Maintenance: `steps` (history steps), `marked_notes` (collected info), `last_screenshot`.
- Model Calls:
  - `call_reflector`: Calls the reasoning model to judge status.
  - `call_planner`: Calls the reasoning model to generate plans.
  - `call_grounder`: Calls the visual model (usually an SFT model) to get precise coordinates.
  - `call_summary`: Generates the final answer.
- Strategy Injection: `get_domain_specific_tips` dynamically loads operation guides for different sites (Shopping/Admin/Map) based on the current URL.
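URL-keyed strategy injection can be sketched like this. The tip texts and the matching rule are illustrative stand-ins; the real guides are far longer and more prescriptive.

```python
from urllib.parse import urlparse

# Illustrative tips only; the framework's actual guides are more detailed.
DOMAIN_TIPS = {
    "shopping": "Prefer the site search box over category browsing.",
    "admin": "In Adobe Commerce Admin, apply grid filters before paginating.",
    "map": "Use the search field, then read coordinates from the sidebar.",
}

def get_domain_specific_tips(url: str) -> str:
    """Pick operation guides for the current site based on its URL."""
    parsed = urlparse(url)
    host_and_path = parsed.netloc + parsed.path
    for key, tips in DOMAIN_TIPS.items():
        if key in host_and_path:
            return tips
    return ""
```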
A unified model call interface encapsulating requests to different backend services:
- MatrixLLM / Gemini: Used for reasoning (Planner/Reflector).
- CodeBot / OpenAI SDK: Used for Grounder (Qwen-VL, etc.).
- HTTP: General HTTP calls.
- AFTS Tool Integration: Automatically handles image uploads, converting Base64 to URLs for specific models.
- BrowserActor: Encapsulates the Playwright Browser instance, supporting browser connection management across threads/processes.
- Worker: Multi-threaded workflow, where each Worker binds an ECS instance IP and a Browser Endpoint.
- Environment Refresh: Automatically handles SSH connections, Cookie injection, and ECS website status reset before tasks start.
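The multi-threaded Worker layout can be sketched as below: each Worker binds one ECS instance and one browser endpoint and drains a shared task queue. The SSH/cookie/reset logic is reduced to a placeholder comment; all names here are illustrative.

```python
import threading
from dataclasses import dataclass, field
from queue import Queue, Empty

@dataclass
class Worker:
    """Binds one ECS instance IP and one Browser Endpoint, as in the text."""
    ecs_ip: str
    browser_endpoint: str
    done: list = field(default_factory=list)

    def run(self, tasks: Queue) -> None:
        while True:
            try:
                task = tasks.get_nowait()
            except Empty:
                return
            # Placeholder: environment refresh (SSH connect, cookie
            # injection, site-state reset) would happen here per task.
            self.done.append((task, self.browser_endpoint))

tasks = Queue()
for t in ["task-1", "task-2", "task-3"]:
    tasks.put(t)
workers = [Worker("10.0.0.1", "ws://a"), Worker("10.0.0.2", "ws://b")]
threads = [threading.Thread(target=w.run, args=(tasks,)) for w in workers]
for th in threads:
    th.start()
for th in threads:
    th.join()
```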
The framework defines four core Prompt templates guiding different Agent roles:
- `REFLECTION_PROMPT`: Emphasizes "based on observed facts"; responsible for verifying task success criteria, detecting infinite loops, and collecting structured data.
- `PLANNER_PROMPT`: Responsible for generating atomic operation instructions. Includes detailed action definitions (scroll, click, type, etc.) and core principles (priority search, table pagination checks, etc.).
- `GROUNDER_PROMPT`: Concise visual instructions requiring the model to output `<tool_call>` or coordinates.
- `SUMMARY_PROMPT`: Responsible for formatting the final answer, handling sorting, counting, and specific format requirements.
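As a shape reference only, a role prompt template might be assembled like this. The wording below is our invented stand-in; the repository's actual templates are much longer and more prescriptive.

```python
# Invented stand-in for one of the four role prompts; not the real template.
GROUNDER_PROMPT = (
    "Locate the on-screen element matching the instruction and reply with a "
    "<tool_call> or bare coordinates.\n"
    "Instruction: {instruction}"
)

def build_grounder_prompt(instruction: str) -> str:
    """Fill the per-step instruction into the role template."""
    return GROUNDER_PROMPT.format(instruction=instruction)
```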
- Robustness Handling:
  - JS fallback mechanism for `select_option` (when Playwright standard selection fails).
  - Automatic retry mechanism.
- Multimodal Support: Core logic relies heavily on VLMs (Vision-Language Models) to process webpage visual elements.
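The `select_option` JS fallback can be sketched as building a small script that sets the `<select>`'s value directly and fires a `change` event. The helper below only generates the script string (so it is testable without a browser); the escaping is naive, and a real version must sanitize both arguments.

```python
def select_option_js(selector: str, value: str) -> str:
    """Build a JS fallback for when Playwright's select_option fails:
    set the <select>'s value directly and dispatch a change event so the
    page's listeners react. repr() escaping here is naive, for sketch only."""
    return (
        "() => { const el = document.querySelector(" + repr(selector) + "); "
        "el.value = " + repr(value) + "; "
        "el.dispatchEvent(new Event('change', { bubbles: true })); }"
    )

# Assumed usage with Playwright's sync API (illustrative, not the repo's code):
#   try:
#       page.select_option("#country", "US")
#   except Exception:
#       page.evaluate(select_option_js("#country", "US"))
```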
If you use OpAgent in your research or project, please cite it as follows:
```bibtex
@misc{opagent2026,
  author       = {CodeFuse-AI Team},
  title        = {OpAgent: Operator Agent for Web Navigation},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
  url          = {https://github.com/codefuse-ai/OpAgent}
}
```