GDPVal Extended: AI Productivity Benchmark

Multi-domain tasks labeled by human experts, with solid rubrics and golden answers.

Tasks distribution

Tasks Distribution Overview

The benchmark comprises 3879 completed tasks in Chinese and English across industries including manufacturing, retail trade, and science and technology, with deliverables in formats such as xlsx, docx, pptx, pdf, and md.

Language Distribution

English: 68.4% (2653 tasks)
Chinese: 31.6% (1226 tasks)

Deliverable Type Distribution


xlsx (2490): 64.2%
docx (795): 20.5%
pptx (411): 10.6%
pdf (105): 2.7%
Multi-file* (54): 1.4%
markdown (16): 0.4%
*Multi-file combinations include Excel+Doc, Excel+PPT, and other combinations
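
As a quick arithmetic check, each share above is the category count divided by the 3879 tasks stated in the overview. The sketch below assumes that denominator; the names `TOTAL_TASKS`, `DELIVERABLE_COUNTS`, and `share` are illustrative and not part of GDPVal.

```python
# Counts as listed above; the denominator is the task total stated in the overview.
TOTAL_TASKS = 3879
DELIVERABLE_COUNTS = {
    "xlsx": 2490,
    "docx": 795,
    "pptx": 411,
    "pdf": 105,
    "Multi-file": 54,
    "markdown": 16,
}


def share(count: int, total: int = TOTAL_TASKS) -> float:
    """Return `count` as a percentage of `total`, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {name: share(count) for name, count in DELIVERABLE_COUNTS.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```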

Industry Distribution

Manufacturing: 22.1%
Trade and Retail: 21.7%
Professional, Scientific, and Technical Services: 17.5%
Real Estate, Rental, and Leasing: 10.6%
Government: 8.3%
Information Industry: 6%
Healthcare and Social Assistance: 5.5%
Finance and Insurance: 5.3%
Wholesale Trade: 3%

Demo cases

Sample Tasks, Real Challenges

Explore three end-to-end tasks that span the most common office deliverables: a marketing blog article, a promotional plan presentation, and a data-driven stocking spreadsheet. Each sample includes a full context prompt, reference files, constraint scoring criteria, and an ideal deliverable, giving you a complete picture of how GDPVal evaluates an AI agent's ability to handle real workplace complexity.

Outdoor Equipment - Marketing

Prompt
You are the content editor for AlpineVista Gear, an outdoor equipment brand, responsible for managing the brand's blog. The brand's flagship product is the Summit Pro hardshell jacket, which uses Gore-Tex Pro ePE (expanded polyethylene) membrane technology, priced at $599, targeting professional mountaineers, backcountry skiers, and alpine climbers. A key recent marketing focus for the brand is communicating the breakthrough advantages of ePE technology over traditional ePTFE technology to consumers, particularly the core selling point of being "PFAS-free." Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer." The article's goal is to explain ePE membrane technology to non-technical outdoor enthusiasts, help them make better purchasing decisions, and establish brand authority in the industry. The article should use a friendly, accessible tone aimed at beginners, fully showcasing the practical benefits of the new technology for athletes, consumers, and the environment.
Rubric
constraint-catalog.yaml
Reference Files
gore-tex-epe-hardshell-context.md
Golden Deliverables
gore-tex-epe-vs-eptfe-blog.docx
task_name: 0020-gm-seo-blog-outdoor
deliverable_type: docx
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gm20-h02-docx-file-opened-normally
      implementation_type: code
      score: 1
      criterion: Whether the .docx file can be opened normally without errors.
      criterion_cn: .docx 文件是否可以正常打开且无错误。
      score_1_condition: The .docx file can be opened normally without any errors.
      score_1_condition_cn: .docx 文件可以正常打开且无任何错误。
      score_0_condition: The .docx file cannot be opened normally or has format errors.
      score_0_condition_cn: .docx 文件无法正常打开或存在格式错误。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: ''
      query_reference_cn: ''
    - rubric_item_id: gm20-h03-text-contains-title-goretex
      implementation_type: code
      score: 1
      criterion: 'Whether the text contains the title "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer" (ignoring leading/trailing whitespace; case-insensitive).'
      criterion_cn: '文本是否包含标题 "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer"(忽略首尾空格;不区分大小写)。'
      score_1_condition: The document contains the title (ignoring leading/trailing whitespace, case-insensitive).
      score_1_condition_cn: 文档包含该标题(忽略首尾空格,不区分大小写)。
      score_0_condition: The document does not contain the title, or the title does not match the requirement.
      score_0_condition_cn: 文档不包含该标题,或标题与要求不匹配。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: 'Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer."'
      query_reference_cn: '撰写一篇 SEO 优化的博客文章,标题为"Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer"。'
    - rubric_item_id: gm20-h04-article-compares-breakthrough-advancements
      implementation_type: llm
      score: 1
      criterion: Whether the article compares the breakthrough advancements of ePE technology over traditional ePTFE technology.
      criterion_cn: 文章是否比较了 ePE 技术相对于传统 ePTFE 技术的突破性进展。
      score_1_condition: The article compares ePE technology with traditional ePTFE technology and conveys the breakthrough advancements of ePE.
      score_1_condition_cn: 文章将 ePE 技术与传统 ePTFE 技术进行了比较,并传达了 ePE 的突破性进展。
      score_0_condition: The article does not compare ePE technology with traditional ePTFE technology, or does not convey the breakthrough advancements of ePE.
      score_0_condition_cn: 文章未将 ePE 技术与传统 ePTFE 技术进行比较,或未传达 ePE 的突破性进展。
      related_facts: ePE (expanded polyethylene) achieves the same microporous structure as ePTFE but is made from polyethylene containing zero fluorine atoms, delivering equivalent waterproofing and breathability with no PFAS in the membrane itself, and approximately 35% lower carbon footprint.
      related_facts_cn: ePE(膨化聚乙烯)实现了与 ePTFE 相同的微孔结构,但由不含氟原子的聚乙烯制成,在薄膜本身不含 PFAS 的情况下提供同等的防水和透气性能,碳足迹降低约 35%。
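
The two rubric items marked `implementation_type: code` above (file opens normally, title present) could be approximated with the standard library alone. This is a minimal sketch under stated assumptions, not GDPVal's actual validator: it treats "opens normally" as "the package is a readable zip whose `word/document.xml` parses", and the function names `docx_paragraphs`, `score_opens`, and `score_title` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TITLE = "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return paragraph texts from .docx bytes; raises if the package is unreadable."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in p.iter(W_NS + "t"))
        for p in root.iter(W_NS + "p")
    ]


def score_opens(data: bytes) -> int:
    """Assumed proxy for 'opens normally': the main document part can be parsed."""
    try:
        docx_paragraphs(data)
        return 1
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return 0


def score_title(paragraphs: list[str], title: str = TITLE) -> int:
    """1 if any paragraph contains the title, ignoring case and outer whitespace."""
    wanted = title.strip().casefold()
    return int(any(wanted in p.strip().casefold() for p in paragraphs))
```

The third item is marked `implementation_type: llm`, so it is left out of the sketch; a comparison of ePE and ePTFE is a judgment that string matching would not capture well.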

E-Commerce - Sales Analysis

Prompt
You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website. The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry, shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel. Valentine's Day is approaching, and you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides) for submission to the Operations Director for approval. The proposal must include the following four modules:

1. Theme & Copy: Craft an attractive campaign theme and core copy for this Valentine's Day promotion.
2. Poster Design Concept: Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.
3. Promotion Timeline: Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.
4. Expected Results: Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.

Attachment descriptions:
- order-list.pdf: Contains 36 curated luxury accessories for this Valentine's Day selection (18 each from Louis Vuitton and Chanel), spanning five categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods. Each product includes an illustration, SKU, color, material, dimensions, special features, and notes.
- purchase-order.xlsx: Contains each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price), proposed purchase quantity, as well as formula-calculated gross profit and gross margin.

Please output the final presentation in PDF format.
Rubric
constraint-catalog.yaml
Reference Files
order-list.pdf
purchase-order.xlsx
Golden Deliverables
valentines-promo-plan.pdf
task_name: 0021-gs-valentines-promo-plan
deliverable_type: pdf
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gs21-h01-deliverable-pdf-format
      implementation_type: code
      score: 1
      criterion: Whether the deliverable is in PDF format.
      criterion_cn: 交付文件是否为 PDF 格式。
      score_1_condition: The deliverable file has a .pdf extension.
      score_1_condition_cn: 交付文件的扩展名为 .pdf。
      score_0_condition: The deliverable is not a PDF file.
      score_0_condition_cn: 交付文件不是 PDF 文件。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: Please output the final presentation in PDF format.
      query_reference_cn: 请以 PDF 格式输出最终演示文稿。
    - rubric_item_id: gs21-h02-output-contains-more-10
      implementation_type: code
      score: 1
      criterion: Whether the output contains no more than 10 slides (pages).
      criterion_cn: 输出是否不超过 10 张幻灯片(页)。
      score_1_condition: The PDF contains no more than 10 pages.
      score_1_condition_cn: PDF 不超过 10 页。
      score_0_condition: The PDF contains more than 10 pages.
      score_0_condition_cn: PDF 超过 10 页。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: 'you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides)'
      query_reference_cn: 情人节即将到来,你需要为运营总监创建一份促销计划提案(PowerPoint 演示文稿,不超过 10 张幻灯片)以获得批准。
    - rubric_item_id: gs21-h03-title-slide-contains-luxeselect
      implementation_type: llm
      score: 1
      criterion: |-
        Whether the title on Slide 1 contains "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
      criterion_cn: |-
        第 1 张幻灯片的标题是否包含 "LuxeSelect Boutique" 和 "Valentine's Day"(不区分大小写)。
      score_1_condition: |-
        The first page text contains both "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
      score_1_condition_cn: |-
        第一页文本包含 "LuxeSelect Boutique" 和 "Valentine's Day"(不区分大小写)。
      score_0_condition: The first page is missing one or both of the required phrases.
      score_0_condition_cn: 第一页缺少上述必需短语中的一个或两个。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: >-
        You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website.
        The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry,
        shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel.
        Valentine's Day is approaching, and you need to create a promotion plan proposal
      query_reference_cn: >-
        你是奢侈品买手电商平台 LuxeSelect Boutique 的运营人员。该平台主要从事路易威登、香奈儿等奢侈品牌的手袋、围巾/披肩、珠宝、鞋类和小皮具的个人购物和销售。
        情人节即将到来,你需要为运营总监创建一份促销计划提案
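
The hard constraints above reduce to small 0/1 checks. The sketch below is illustrative only: the function names are hypothetical, the page count and first-page text are taken as inputs (extracting them from a real PDF would need a PDF library, which is assumed here rather than shown), and the phrase check is a plain string approximation of an item the catalog actually marks as `implementation_type: llm`.

```python
from pathlib import Path

REQUIRED_PHRASES = ("LuxeSelect Boutique", "Valentine's Day")


def _normalize(text: str) -> str:
    """Case-fold and map the typographic apostrophe to the ASCII one."""
    return text.replace("\u2019", "'").casefold()


def score_pdf_extension(filename: str) -> int:
    """1 if the deliverable has a .pdf extension (the stated score_1_condition)."""
    return int(Path(filename).suffix.lower() == ".pdf")


def score_page_limit(page_count: int, limit: int = 10) -> int:
    """1 if the PDF has no more than `limit` pages."""
    return int(page_count <= limit)


def score_required_phrases(first_page_text: str, phrases=REQUIRED_PHRASES) -> int:
    """1 if every required phrase appears on the first page, case-insensitively."""
    haystack = _normalize(first_page_text)
    return int(all(_normalize(p) in haystack for p in phrases))


# Each rubric item carries `score: 1`, so a simple total is the sum of item results.
results = [
    score_pdf_extension("valentines-promo-plan.pdf"),
    score_page_limit(9),
    score_required_phrases("LuxeSelect Boutique Valentine's Day Promotion Plan"),
]
total = sum(results)
```

Summing the item results is one plausible aggregation; how GDPVal actually combines scores into `validation-result.json` is not stated in this excerpt.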

Restaurant - Supply Analysis

Prompt
You are a supply chain management specialist at "SZW Hotpot" (蜀滋味), a Chinese hotpot chain brand. SZW Hotpot is headquartered in San Francisco Chinatown, primarily serving the Bay Area Chinese community and local diners. The brand currently operates 8 locations in the Bay Area (1 of which is closed), located in San Francisco Chinatown, Sunset District, Milpitas, Cupertino, Fremont, Oakland Chinatown, Daly City, and San Mateo.

The 2026 Chinese New Year / Lunar New Year falls in February (February 17 is New Year's Eve, February 18 is the first day of the new year). It is now mid-January 2026, and you need to develop a 2026 Spring Festival ingredient stocking plan based on Q1 2025 (January through March) historical operations data.

You will receive an Excel reference file containing Q1 2025 operations data ("szw-hotpot-2025-q1-operations-data.xlsx"), which includes 5 worksheets:

(a) Store Matrix - Store Master Data
Columns: Store ID, Store Name, Location, Store Type (Flagship / Standard / Express), Seats, Status (Active / Closed), Cold Storage (sqft), Delivery Zone, Notes. 7 stores are Active, 1 (SF-007, Daly City) is Closed. The Notes column flags anomalies relevant to planning (e.g., one store has extra foot traffic from a parade during Spring Festival, one store has limited cold storage requiring frozen item prioritization, etc.).

(b) Ingredient Master - Ingredient Master Data
Columns: Item ID, Item Name, Category (5 categories), Unit, Unit Cost ($), Shelf Life (days), Lead Time (days), MOQ (per store order), Wastage Rate, Storage Type (Refrigerated / Frozen / Ambient). 31 ingredients total. Shelf life ranges from 2 days (Fresh Juice) to 180 days (Beer, Soda); lead times range from 1 to 5 days; wastage rates range from 1% (ambient items) to 15% (Fresh Juice).

(c-e) Jan 2025 / Feb 2025 / Mar 2025 - Monthly Operations Data
Each sheet contains two data regions:
- Daily Sales Metrics: Date, Day, Store ID, Tables Served, Customer Count, Revenue ($), Avg Spend/Person ($), Table Turnover Rate, Food Cost Ratio
- Daily Ingredient Consumption: Date, Store ID, Category, Item Name, Unit, Consumption Qty, Unit Cost ($), Total Cost ($), Cumulative MTD Cost ($)

Please build an Excel workbook as the deliverable containing a Spring Festival stocking plan. The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.
Rubric
constraint-catalog.yaml
Reference Files
szw-hotpot-2025-q1-operations-data.xlsx
Golden Deliverables
szw-hotpot-2026-cny-stocking-plan.xlsx
task_name: 0022-gs-hotpot-stocking-plan
deliverable_type: xlsx
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gs22-h01-deliverable-single-xlsx-excel
      implementation_type: code
      score: 1
      criterion: Whether the deliverable is a single .xlsx Excel workbook file.
      criterion_cn: 交付物是否为单一 .xlsx Excel 工作簿文件。
      score_1_condition: The deliverable file has a .xlsx extension and is a valid Excel workbook.
      score_1_condition_cn: 交付文件具有 .xlsx 扩展名且为有效的 Excel 工作簿。
      score_0_condition: The deliverable is not an .xlsx file or is not a valid Excel workbook.
      score_0_condition_cn: 交付物不是 .xlsx 文件或不是有效的 Excel 工作簿。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: Please build an Excel workbook as the deliverable
      query_reference_cn: 请构建一份 Excel 工作簿作为交付物
    - rubric_item_id: gs22-h02-workbook-contains-multiple-worksheets
      implementation_type: llm
      score: 1
      criterion: Whether the workbook contains multiple worksheets or clearly separated sections corresponding to historical analysis, stocking plan, budget summary, and other modules.
      criterion_cn: 工作簿是否包含多个工作表或明确分隔的区域,对应历史分析、备货计划、预算汇总等模块。
      score_1_condition: The workbook contains multiple worksheets or clearly separated sections for historical analysis, stocking plan, budget summary, and other modules.
      score_1_condition_cn: 工作簿包含多个工作表或明确分隔的区域,对应历史分析、备货计划、预算汇总等模块。
      score_0_condition: The workbook lacks multiple worksheets or clearly separated sections for the required modules.
      score_0_condition_cn: 工作簿缺少多个工作表或明确分隔的区域来对应所需模块。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: 'The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.'
      query_reference_cn: '工作簿应包括历史数据分析、趋势和高峰分析、2026 年 2 月 10-24 日的门店级备货计划、预算汇总和执行摘要。'
    - rubric_item_id: gs22-h03-number-formatting-consistent-dollar
      implementation_type: llm
      score: 1
      criterion: 'Whether number formatting is consistent: dollar amounts use $ with two decimal places, percentages use % format, quantities rounded to one decimal.'
      criterion_cn: 数值格式是否一致:金额使用 $ 并保留两位小数,百分比使用 % 格式,数量保留一位小数。
      score_1_condition: 'Number formatting is consistent throughout: dollar amounts use $ with two decimal places, percentages use % format, quantities are rounded to one decimal.'
      score_1_condition_cn: 全篇数值格式一致:金额使用 $ 并保留两位小数,百分比使用 % 格式,数量保留一位小数。
      score_0_condition: Number formatting is inconsistent or does not follow the required formats.
      score_0_condition_cn: 数值格式不一致或未遵循要求的格式。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: ''
      query_reference_cn: ''
    - rubric_item_id: gs22-h04-workbook-contains-historical-data
      implementation_type: llm
      score: 1
      criterion: Whether the workbook contains a historical data summary for the 2025 Spring Festival period (January 20 to February 4).
      criterion_cn: 工作簿是否包含 2025 年春节期间(1 月 20 日至 2 月 4 日)的历史数据汇总。
      score_1_condition: The workbook contains a historical data summary covering the 2025 Spring Festival period (January 20 to February 4).
      score_1_condition_cn: 工作簿包含覆盖 2025 年春节期间(1 月 20 日至 2 月 4 日)的历史数据汇总。
      score_0_condition: The workbook does not contain a historical data summary for the 2025 Spring Festival period.
      score_0_condition_cn: 工作簿不包含 2025 年春节期间的历史数据汇总。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: 'historical data analysis'
      query_reference_cn: '历史数据分析'
    - rubric_item_id: gs22-h05-workbook-contains-clearly-labeled
      implementation_type: llm
      score: 1
      criterion: Whether the workbook contains a clearly labeled 2026 Spring Festival stocking plan.
      criterion_cn: 工作簿是否包含明确标注的 2026 年春节备货计划。
      score_1_condition: The workbook contains a clearly labeled 2026 Spring Festival stocking plan.
      score_1_condition_cn: 工作簿包含明确标注的 2026 年春节备货计划。
      score_0_condition: The workbook does not contain a clearly labeled 2026 Spring Festival stocking plan.
      score_0_condition_cn: 工作簿不包含明确标注的 2026 年春节备货计划。
      related_facts: ''
      related_facts_cn: ''
      facts_reference: ''
      facts_reference_cn: ''
      query_reference: 'a store-level stocking plan for February 10–24, 2026'
      query_reference_cn: '2026 年 2 月 10-24 日的门店级备货计划'
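
To make the first and third constraints concrete, here is a standard-library sketch under stated assumptions. It treats "valid Excel workbook" as "a zip package whose `xl/workbook.xml` parses and lists at least one sheet", and it checks the dollar format on already-rendered strings. The names `sheet_names`, `score_single_xlsx`, and `is_dollar_amount` are hypothetical, and the formatting item is marked `implementation_type: llm` in the catalog, so the regular expression is only a rough pre-check.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# SpreadsheetML namespace used by elements in xl/workbook.xml.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
# "$" followed by digits (optionally comma-grouped) and exactly two decimals.
DOLLAR_RE = re.compile(r"\$(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}")


def sheet_names(data: bytes) -> list[str]:
    """Return worksheet names from .xlsx bytes; raises if the package is unreadable."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("xl/workbook.xml"))
    return [sheet.get("name", "") for sheet in root.iter(S_NS + "sheet")]


def score_single_xlsx(filename: str, data: bytes) -> int:
    """1 if the file has a .xlsx extension and lists at least one worksheet."""
    if Path(filename).suffix.lower() != ".xlsx":
        return 0
    try:
        return int(len(sheet_names(data)) > 0)
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return 0


def is_dollar_amount(text: str) -> bool:
    """True for strings such as "$1,234.50" or "$12.00"."""
    return DOLLAR_RE.fullmatch(text.strip()) is not None
```

The list returned by `sheet_names` could also feed a reviewer or an LLM judge when assessing whether the workbook has separate sheets for historical analysis, the stocking plan, and the budget summary.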