RL Environments for Training and Evaluation

Gyms for GUI agents across web, mobile, and desktop.

Built-in Leading Benchmarks

Run industry-standard benchmarks instantly — no setup required.

OSWorld
AndroidWorld
WebArena

OS Control

AGI-O can perform open-ended tasks across major operating systems, including Windows, Linux, and macOS.

  • Computer Use

    Hands-free execution via GBOX controller

  • Easy to Use

    Set up in minutes

  • Out of the Box

    Run benchmarks instantly

Agent S3 w/ GPT-5 bBoN (N=10): 69.90%
GBOX Agent: 64.20%
GTA1 w/ GPT-5: 63.40%
Claude-sonnet-4-5-20250929: 62.90%
Agent S3 w/ GPT-5 bBoN (N=1): 62.60%
Agentic-Lybic-Maestro: 61.90%
CoACT-1: 60.80%

Android App Control

Measure how well agents complete end-to-end mobile journeys across flagship Android apps and custom business flows.

  • Android Emulator

    Pre-wired with flagship APKs

  • Task Validator

    Understands UI and database events

  • One-Click Run

    Spin up curated Android suites

GBOX: 86.20%
mobile-use: 84.50%
AutoGLM-Mobile: 80.20%
LX-GUIAgent: 79.30%
DroidRun: 78.40%
Finalrun: 76.70%

Browser Control

Benchmark how your agents navigate complex, multi-step browser tasks on realistic replicas of real-world websites.

  • Sandbox Ready

    WebArena replicas with telemetry

  • Edge-to-Edge

    Covers research and production use

  • Live Analytics

    Replay and debug every session

GBOX: 67.98%
DeepSky Agent: 66.90%
Narada AI: 64.20%
IBM CUGA: 61.70%
OpenAI Operator: 58.10%

Training Gyms

Train your agents with environments designed for real-world tasks like financial analysis, customer service, and enterprise workflows.

Airbnb

Real Airbnb listings and walkthrough data.

2K: Real house data
10K: Video feeds

Instagram

Instagram posts and audience analytics.

20K: Instagram posts
100K: Real users

LinkedIn

LinkedIn profiles and company records.

100K: Real profiles
30K: Companies

Expedia

Expedia flights and hotel inventory.

10K: Flights
10K: Hotels

Private Benchmarks

Evaluate your agents with diversified long-horizon tasks in controllable environments.

Airbnb

Train Travel Booking Agents.

Booking House
Family Vacation Planner
Publish House
Business Trip Planner
Scenic Drive Itinerary

Instagram

Train Social Media Agents.

Publish Photo Post
Weekly Content Scheduler
Reel Trend Optimizer
Pet KOL Daily Monitor
Smart DM Concierge

Verifier Coverage

Combine data-aware and perception-driven validation layers to confirm successful task completion even in complex, multi-step scenarios.

Database Verifier

Checks the application's database for the side effects of an action. For example, when an agent taps the 'like' button on a post, a new record is created in a database table; the verifier queries that table to determine whether the task was completed.
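
The 'like' example can be sketched as a single query against the app's database. This is a minimal illustration, not GBOX's actual verifier: it uses an in-memory SQLite database as a stand-in, and the `likes` table, its `user_id` and `post_id` columns, and the function names are all hypothetical.

```python
import sqlite3


def like_recorded(conn, user_id, post_id):
    """Return True if a row for (user_id, post_id) exists in the likes table."""
    row = conn.execute(
        "SELECT 1 FROM likes WHERE user_id = ? AND post_id = ? LIMIT 1",
        (user_id, post_id),
    ).fetchone()
    return row is not None


def demo():
    """Check the table before and after a simulated 'like' action."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; a real app's table would look different.
    conn.execute("CREATE TABLE likes (user_id INTEGER, post_id INTEGER)")
    before = like_recorded(conn, 7, 42)
    # Stand-in for the agent tapping 'like': the app writes one row.
    conn.execute("INSERT INTO likes VALUES (7, 42)")
    after = like_recorded(conn, 7, 42)
    conn.close()
    return before, after
```

Checking the table both before and after the episode helps rule out the case where the record already existed and the agent did nothing.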

UI Verifier

Determines whether a task is complete by observing changes in the UI, for example by dumping the XML layout hierarchy with Android UI Automator, or by using CUA models such as UI-TARS or Gelato.
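
The XML route can be sketched as a check on one node's attributes in a UI Automator dump (for instance, one produced by `adb shell uiautomator dump`). This is an illustrative sketch, not GBOX's verifier: the sample dump, the `com.example.app:id/like_button` resource ID, and the assumption that a liked post sets `selected="true"` are all hypothetical, and a real app may signal state differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down dump; real dumps carry many more nodes and attributes.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node index="0" text="" resource-id="com.example.app:id/post"
        class="android.widget.FrameLayout" selected="false" bounds="[0,0][1080,600]">
    <node index="0" text="" resource-id="com.example.app:id/like_button"
          class="android.widget.ImageView" content-desc="Like"
          selected="true" bounds="[40,500][140,600]" />
  </node>
</hierarchy>"""


def node_is_selected(dump_xml, resource_id):
    """Return True if the first node with this resource-id has selected="true"."""
    root = ET.fromstring(dump_xml)
    for node in root.iter("node"):
        if node.get("resource-id") == resource_id:
            return node.get("selected") == "true"
    # A missing node counts as "not completed".
    return False
```

An attribute check like this is cheap and deterministic; a model-based check is the fallback when the relevant state is only visible in pixels.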

On-Premise Deployment with Full Customization

Install on your own servers with air-gapped security. Modify benchmarks, create custom environments, and keep all data within your infrastructure.

Air-gapped runtime

Deploy the container on isolated clusters with encrypted volume mounts and zero outbound traffic.

Customizable stacks

Swap benchmark suites, inject proprietary datasets, and wire your own validators without breaking the core framework.

Enterprise governance

Integrate with SSO, audit logging, and policy engines so every experiment is compliant by design.

Accelerate your RL training with GBOX