Benchmark v1.0

Evaluate AI with Real Financial Tasks

Professional-grade benchmark for AI agents handling complex, real-world financial scenarios. Expert-crafted tasks, deterministic scoring, and simulated financial environments.

500+
Expert-Crafted Tasks

8
Financial Domains

78.1
Overall Score

100%
Real-World Sourced

Leaderboard

How Top Models Perform

Real benchmark results across our four scoring dimensions. Updated with each new model release.

Model         calc     semantic   evidence   format   total
GPT-5.5       68.2%    87.9%      83.8%      100%     78.1%
Opus 4.5      62.1%    90.5%      79.2%      95.8%    74.6%
Gemini 2.5    57.8%    85.1%      72.4%      91.7%    70.2%
DeepSeek R2   51.9%    81.7%      68.3%      88.2%    65.8%
Qwen-3 Max    47.6%    78.9%      64.7%      85%      61.4%

Architecture

Full-Stack Evaluation Infrastructure

Four layers working together to create the most rigorous financial AI evaluation framework.

01

Evaluation Engine

Numeric Scorer
Range checks, tolerance bands, three-state scoring
Semantic Judge
Per-goal pass/fail with weighted aggregation
Evidence Validator
Source citation & document page verification
Format Checker
Schema validation & artifact compliance

02

Mock Environments

Banking & Payments
Accounts, transactions, transfers
Brokerage & Trading
Positions, market data, order execution
Insurance & Tax
Policy lookup, premium calc, tax filing
Document Vault
10-K filings, fund factsheets, statements

03

Data Foundation

Real Market Data
Live prices, OHLCV, macro indicators
Real Company Filings
SEC 10-K, earnings, balance sheets
Real Fund Data
NAV, expense ratios, holdings
Synthetic Profiles
Realistic user personas & histories

04

Expert Layer

Task Design
From real advisory cases
Rubric Engineering
Weighted criteria & edge cases
Golden Answers
Expert-verified reference solutions
Continuous Review
Ongoing calibration & updates
Fig. 01
Task Domains

Eight Dimensions of Financial Competence

Comprehensive coverage of the financial skills that matter most for individual users.

Financial Planning

Retirement, education funding, cash flow projections

50 tasks

Portfolio Analysis

Asset allocation, rebalancing, risk optimization

50 tasks

Stock Research

Fundamental analysis, valuation models

50 tasks

Compliance & Tax

Tax-loss harvesting, cross-border filing

50 tasks

Credit & Mortgage

Loan comparison, amortization, refinancing

50 tasks

Insurance Analysis

Coverage evaluation, premium comparison

50 tasks

Macro & ETF

Economic indicators, sector rotation, ETF selection

50 tasks

Cross-Border Finance

FX hedging, multi-currency planning

50 tasks
Scoring Engine

How We Score

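Every task is scored on four dimensions worth 100 points in total: calculation accuracy (45), semantic reasoning (30), evidence and sources (15), and delivery format (10). Criteria within each dimension are weighted.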
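
As a rough illustration of how the four point budgets could combine into one task score, here is a minimal sketch. It assumes each dimension reports a fraction between 0 and 1 and contributes that fraction of its budget. The function name `task_total` and the example fractions are hypothetical, and the benchmark's own aggregation may differ.

```python
# Point budgets taken from the scoring description (45 + 30 + 15 + 10 = 100).
DIMENSION_POINTS = {"calc": 45, "semantic": 30, "evidence": 15, "format": 10}


def task_total(fractions: dict[str, float]) -> float:
    """Combine per-dimension fractions (0.0 to 1.0) into a 0-100 task score.

    Assumption: a dimension contributes its fraction times its point budget.
    """
    missing = set(DIMENSION_POINTS) - set(fractions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(DIMENSION_POINTS[name] * fractions[name] for name in DIMENSION_POINTS)


# Hypothetical per-dimension results for a single task.
example = {"calc": 0.5, "semantic": 0.75, "evidence": 1.0, "format": 0.5}
print(task_total(example))
```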
Fig. 02

45pts

Calculation Accuracy

DETERMINISTIC

  • Numeric range validation with tolerance bands
  • Three-state: correct (+pts), missing (0), wrong (−pts)
  • CSV metric coverage with thresholds
  • Per-metric importance weighting
  • Format normalization (%, bps, currency)
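
The three-state rule and format normalization listed above can be sketched as follows. This is an illustrative guess at the mechanics, not the benchmark's scorer: the helper names `normalize` and `score_metric`, the 1% relative tolerance, and the one-point reward and penalty are all assumptions.

```python
from typing import Optional


def normalize(raw: str) -> float:
    """Convert '1.25%', '125 bps', or '$1,200' into a plain float.

    Percentages and basis points become decimal fractions; currency
    symbols and thousands separators are dropped.
    """
    text = raw.strip().lower().replace(",", "")
    if text.endswith("bps"):
        return float(text[:-3]) / 10_000
    if text.endswith("%"):
        return float(text[:-1]) / 100
    return float(text.lstrip("$€£¥"))


def score_metric(answer: Optional[float], expected: float,
                 rel_tol: float = 0.01, points: float = 1.0,
                 penalty: float = 1.0) -> float:
    """Three-state scoring: correct earns points, missing earns 0, wrong loses points."""
    if answer is None:
        return 0.0  # missing: no reward, no penalty
    if abs(answer - expected) <= rel_tol * abs(expected):
        return points  # inside the tolerance band
    return -penalty  # outside the band: treated as wrong


print(score_metric(normalize("125 bps"), normalize("1.25%")))
print(score_metric(None, 0.0125))
print(score_metric(normalize("$1,500"), 1200.0))
```

With a purely relative band, an expected value of zero only matches an exact zero; a real scorer would likely add an absolute tolerance for that case.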

30pts

Semantic Reasoning

LLM-ASSISTED

  • Per-goal pass/fail against golden answers
  • Weighted goal aggregation by importance
  • Methodology verification (not just answer)
  • Reasoning chain quality assessment
  • Cross-validated by multiple judge models
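
One way to read per-goal pass/fail with weighted aggregation and multiple judges is sketched below. It assumes a goal passes when a strict majority of judges say so and that the 30 points are split in proportion to goal weights. Both rules, and the names `majority` and `semantic_score`, are assumptions made for illustration.

```python
def majority(votes: list[bool]) -> bool:
    """A goal passes only if strictly more than half of the judges pass it."""
    return sum(votes) * 2 > len(votes)


def semantic_score(goals: list[tuple[list[bool], float]],
                   max_points: float = 30.0) -> float:
    """Aggregate (judge_votes, weight) pairs into a semantic score.

    Each goal contributes its weight if the judges' majority verdict is a pass.
    """
    total_weight = sum(weight for _, weight in goals)
    if total_weight <= 0:
        raise ValueError("goal weights must sum to a positive number")
    passed_weight = sum(weight for votes, weight in goals if majority(votes))
    return max_points * passed_weight / total_weight


# Hypothetical task with three goals, each judged by three models.
goals = [
    ([True, True, False], 3.0),   # passes 2 to 1
    ([False, False, True], 1.0),  # fails 1 to 2
    ([True, True, True], 1.0),    # passes unanimously
]
print(semantic_score(goals))
```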

15pts

Evidence & Sources

DETERMINISTIC

  • Citation verification against source documents
  • Page-level reference accuracy
  • Tool output binding (correct API calls)
  • Evidence completeness and relevance
  • Penalty for hallucinated sources
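
A minimal sketch of citation and page-level checking might look like this. The manifest field names (`file`, `page`) and the page counts are hypothetical, since the actual layout of evidence_manifest.json is not shown on this page.

```python
def check_citations(citations: list[dict],
                    page_counts: dict[str, int]) -> dict[str, list[dict]]:
    """Sort citations into valid, bad-page, and unknown-source buckets.

    page_counts maps each provided source document to its number of pages.
    A citation to a file outside that map is treated as a hallucinated source.
    """
    result: dict[str, list[dict]] = {"valid": [], "bad_page": [], "unknown_source": []}
    for cite in citations:
        doc = cite.get("file")
        page = cite.get("page")
        if doc not in page_counts:
            result["unknown_source"].append(cite)
        elif not isinstance(page, int) or not 1 <= page <= page_counts[doc]:
            result["bad_page"].append(cite)
        else:
            result["valid"].append(cite)
    return result


# Hypothetical page counts for two of the sample task's reference files.
sources = {"attachment_1.txt": 12, "assumptions.txt": 2}
cited = [
    {"file": "attachment_1.txt", "page": 3},   # valid
    {"file": "attachment_1.txt", "page": 40},  # page out of range
    {"file": "annual_report.pdf", "page": 1},  # not among the provided sources
]
report = check_citations(cited, sources)
print({bucket: len(items) for bucket, items in report.items()})
```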

10pts

Delivery Format

DETERMINISTIC

  • JSON schema validation for structured output
  • Required artifact completeness check
  • CSV column & data type verification
  • File naming and structure compliance
  • Professional formatting standards
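
The artifact, JSON, and CSV checks above could be approximated with the standard library alone, as in the sketch below. The file names are borrowed from the sample task prompt on this page; the required column names and the function `check_delivery` are hypothetical, and the JSON check only tests that the file parses rather than validating it against a schema.

```python
import csv
import io
import json

# Deliverable names as listed in the sample task prompt.
REQUIRED_ARTIFACTS = ["risk_memo.md", "risk_matrix.csv",
                      "stress_test.csv", "evidence_manifest.json"]


def check_delivery(files: dict[str, str],
                   required_columns: dict[str, list[str]]) -> list[str]:
    """Return a list of format problems; an empty list means all checks passed.

    files maps each submitted file name to its text content.
    """
    problems = [f"missing artifact: {name}"
                for name in REQUIRED_ARTIFACTS if name not in files]
    for name, text in files.items():
        if name.endswith(".json"):
            try:
                json.loads(text)
            except json.JSONDecodeError:
                problems.append(f"{name}: invalid JSON")
        elif name.endswith(".csv"):
            # Only the header row is inspected here.
            header = next(csv.reader(io.StringIO(text)), [])
            for column in required_columns.get(name, []):
                if column not in header:
                    problems.append(f"{name}: missing column {column}")
    return problems


# Hypothetical submission with one missing file, one bad JSON, one missing column.
submission = {
    "risk_memo.md": "# Risk memo",
    "risk_matrix.csv": "risk_type,rating\nmarket,high\n",
    "evidence_manifest.json": "{not json",
}
columns = {"risk_matrix.csv": ["risk_type", "rating", "fund"]}
for problem in check_delivery(submission, columns):
    print(problem)
```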
Sample Task

What a Real Task Looks Like

Tasks combine narrative context, reference materials, tool access, and structured output requirements.
RA0004HARD

Fund Investment Risk Assessment

Prompt
You are an alternative investment risk control officer in the risk management department of an insurance company. The investment department has submitted LP investment applications for two PE funds, requiring you to conduct risk assessments and provide risk control opinions. Please complete the following tasks based on the attached materials:

1. Assess the main risks of the two funds respectively (market risk, credit risk, liquidity risk, operational risk).
2. Evaluate the GP's management capabilities and credit status.
3. Assess the investor protection mechanisms in the fund terms.
4. Conduct stress testing (loss estimation under extreme scenarios).
5. Provide risk control opinions (approve/conditionally approve/reject).

Please generate: risk_memo.md, risk_matrix.csv, stress_test.csv, evidence_manifest.json.

Reference Files
assumptions.txt
attachment_1.txt
attachment_2.xlsx
Golden Deliverables
evidence_manifest.json
portfolio_risk_analysis.xlsx
risk_assessment_report.md
risk_matrix.xlsx

=== Investment Application Overview ===

Fund A: COSCO Mining Acquisition Fund
- Size: 80 billion CNY (20 billion equity + 60 billion debt)
- Strategy: Mining resource acquisitions
- GP: COSCO Mining (state-owned enterprise)
- Proposed commitment: 2 billion CNY

Fund B: Frontier Strategic Emerging Industries PE Fund
- Size: 630 million CNY
- Strategy: Strategic emerging industries (semiconductors, new energy, advanced manufacturing)
- GP: Frontier Capital
- Proposed commitment: 600 million CNY

=== Insurance Fund Investment Restrictions ===
- Single fund investment shall not exceed 30% of fund size
- Total alternative investments shall not exceed 25% of insurance fund's available balance
- Investment terms must match insurance liability duration
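
To make the first restriction concrete, the sketch below applies the 30% single-fund limit to the two proposed commitments. It assumes "fund size" means the total stated size (for Fund A, equity plus debt), which is one possible reading of the rule and not necessarily the one used in the golden answer. Amounts are in millions of CNY.

```python
SINGLE_FUND_LIMIT = 0.30  # from the excerpt: no more than 30% of fund size

# Figures from the excerpt, converted to millions of CNY.
funds = {
    "Fund A": {"size": 80_000, "commitment": 2_000},
    "Fund B": {"size": 630, "commitment": 600},
}


def commitment_share(commitment: float, size: float) -> float:
    """Fraction of the fund that the proposed commitment would represent."""
    return commitment / size


for name, fund in funds.items():
    share = commitment_share(fund["commitment"], fund["size"])
    status = "exceeds" if share > SINGLE_FUND_LIMIT else "within"
    print(f"{name}: {share:.1%} of fund size, {status} the 30% limit")
```

Under this reading, Fund A's commitment is 2.5% of fund size and Fund B's is about 95.2%, so only Fund B would breach the single-fund limit. Measuring Fund A against its 20 billion CNY of equity instead gives 10%, still under the limit.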

Ready to Benchmark Your Agent?

Access the public dataset, run your agent against real financial tasks, and see where it stands on the leaderboard.