GDPVal Extended: AI Productivity Benchmark

Multi-domain tasks labeled by human experts, each with a solid rubric and a golden answer.

Task distribution

Task Distribution Overview

GDPVal Extended comprises 3879 completed tasks in Chinese and English across industries including manufacturing, retail trade, and science & technology, with deliverables in formats such as xlsx, docx, pptx, pdf, and md.

Language Distribution

English: 68.4% (2653 tasks)

Chinese: 31.6% (1226 tasks)

Deliverable Type Distribution

xlsx: 64.2% (2490 tasks)

docx: 20.5% (795 tasks)

pptx: 10.6% (411 tasks)

pdf: 2.7% (105 tasks)

Multi-file*: 1.4% (54 tasks)

markdown: 0.4% (16 tasks)

*Multi-file combinations include Excel+Doc, Excel+PPT, and other combinations.
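The language and deliverable percentages above can be reproduced from the task counts. The following is a minimal sketch, assuming each share is computed against the 3879-task total and rounded to one decimal place; the variable and function names are illustrative and not part of GDPVal.

```python
TOTAL_TASKS = 3879  # total stated in the overview

language_counts = {"English": 2653, "Chinese": 1226}
deliverable_counts = {
    "xlsx": 2490,
    "docx": 795,
    "pptx": 411,
    "pdf": 105,
    "multi-file": 54,
    "markdown": 16,
}


def shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return each count as a percentage of total, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


language_shares = shares(language_counts, TOTAL_TASKS)
deliverable_shares = shares(deliverable_counts, TOTAL_TASKS)

# Tasks not accounted for by the six listed deliverable types.
unlisted = TOTAL_TASKS - sum(deliverable_counts.values())

print(language_shares)
print(deliverable_shares)
print(unlisted)
```

Under that assumption the computed shares match the published figures, for example 64.2% for xlsx and 68.4% for English. The six deliverable counts add up to 3871, so `unlisted` is 8; the page does not say how those remaining tasks are categorized.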

Industry Distribution

Manufacturing: 22.1%

Trade and Retail: 21.7%

Professional, Scientific, and Technical Services: 17.5%

Real Estate, Rental, and Leasing: 10.6%

Government: 8.3%

Information Industry: (not stated)

Healthcare and Social Assistance: 5.5%

Finance and Insurance: 5.3%

Wholesale Trade: (not stated)

Demo cases

Sample Tasks, Real Challenges

Explore three end-to-end tasks that span the most common office deliverables — a marketing blog article, a promotional plan presentation, and a data-driven stocking spreadsheet. Each sample includes a full context prompt, reference files, constraint scoring criteria, and an ideal deliverable, giving you a complete picture of how GDPVal evaluates an AI Agent's ability to handle real workplace complexity.

Outdoor Equipment - Marketing

Prompt

You are the content editor for AlpineVista Gear, an outdoor equipment brand, responsible for managing the brand's blog. The brand's flagship product is the Summit Pro hardshell jacket, which uses Gore-Tex Pro ePE (expanded polyethylene) membrane technology, priced at $599, targeting professional mountaineers, backcountry skiers, and alpine climbers. A key recent marketing focus for the brand is communicating the breakthrough advantages of ePE technology over traditional ePTFE technology to consumers, particularly the core selling point of being "PFAS-free." Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer." The article's goal is to explain ePE membrane technology to non-technical outdoor enthusiasts, help them make better purchasing decisions, and establish brand authority in the industry. The article should use a friendly, accessible tone aimed at beginners, fully showcasing the practical benefits of the new technology for athletes, consumers, and the environment.

Rubric

constraint-catalog.yaml

Reference Files

gore-tex-epe-hardshell-context.md

Golden Deliverables

gore-tex-epe-vs-eptfe-blog.docx

constraint-catalog.yaml

task_name: 0020-gm-seo-blog-outdoor
deliverable_type: docx
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gm20-h02-docx-file-opened-normally
      implementation_type: code
      score: 1
      criterion: Whether the .docx file can be opened normally without errors.
      score_1_condition: The .docx file can be opened normally without any errors.
      score_0_condition: The .docx file cannot be opened normally or has format errors.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-h03-text-contains-title-goretex
      implementation_type: code
      score: 1
      criterion: 'Whether the text contains the title "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer" (ignoring leading/trailing whitespace; case-insensitive).'
      score_1_condition: The document contains the title (ignoring leading/trailing whitespace, case-insensitive).
      score_0_condition: The document does not contain the title, or the title does not match the requirement.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer."'
    - rubric_item_id: gm20-h04-article-compares-breakthrough-advancements
      implementation_type: llm
      score: 1
      criterion: Whether the article compares the breakthrough advancements of ePE technology over traditional ePTFE technology.
      score_1_condition: The article compares ePE technology with traditional ePTFE technology and conveys the breakthrough advancements of ePE.
      score_0_condition: The article does not compare ePE technology with traditional ePTFE technology, or does not convey the breakthrough advancements of ePE.
      related_facts: ePE (expanded polyethylene) achieves the same microporous structure as ePTFE but is made from polyethylene containing zero fluorine atoms, delivering equivalent waterproofing and breathability with no PFAS in the membrane itself, and approximately 35% lower carbon footprint.
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'communicating the breakthrough advantages of ePE technology over traditional ePTFE technology to consumers'
    - rubric_item_id: gm20-h05-article-emphasizes-core-selling
      implementation_type: llm
      score: 1
      criterion: Whether the article emphasizes the core selling point of "PFAS-free" when comparing ePE and ePTFE technologies.
      score_1_condition: The article clearly emphasizes "PFAS-free" as a core selling point when comparing ePE and ePTFE technologies.
      score_0_condition: The article does not mention "PFAS-free" as a key advantage of ePE over ePTFE, or the PFAS-free selling point is not clearly conveyed.
      related_facts: Traditional ePTFE membranes and most legacy DWR treatments contain PFAS. ePE membranes contain zero fluorine atoms and are PFAS-free. PFAS detected in 89% of legacy DWR-treated jackets (Glüge et al., Environmental Science & Technology, 2024).
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'particularly the core selling point of being "PFAS-free."'
    - rubric_item_id: gm20-h01-deliverable-docx-word-document
      implementation_type: code
      score: 1
      criterion: Whether the deliverable is a .docx Word document (file extension is .docx).
      score_1_condition: The deliverable is a .docx format Word document.
      score_0_condition: The deliverable is not in .docx format, or the file extension is not .docx.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
  Soft Constraint:
    - rubric_item_id: gm20-s01-external-links-document-avoid
      implementation_type: code
      score: 1
      criterion: 'Whether external links in the document avoid using prohibited generic anchor text: {"click here", "here", "this article", "read more", "learn more"}.'
      score_1_condition: All external links in the document avoid using prohibited generic anchor text.
      score_0_condition: The document contains external links using the above prohibited anchor text.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s02-document-contains-least-one
      implementation_type: code
      score: 1
      criterion: Whether the document contains at least one Heading 2 (H2) subheading.
      score_1_condition: The document contains at least one H2 subheading.
      score_0_condition: The document does not contain any H2 subheadings.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s03-document-contains-least-one
      implementation_type: code
      score: 1
      criterion: Whether the document contains at least one Heading 3 (H3) subheading.
      score_1_condition: The document contains at least one H3 subheading.
      score_0_condition: The document does not contain any H3 subheadings.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s04-bold-formatting-used-highlight
      implementation_type: code
      score: 1
      criterion: Whether bold formatting is used to highlight and convey important content.
      score_1_condition: The document uses bold formatting to highlight important content.
      score_0_condition: The document does not use bold formatting.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s05-italic-formatting-used-highlight
      implementation_type: code
      score: 1
      criterion: Whether italic formatting is used to highlight and convey important content.
      score_1_condition: The document uses italic formatting to highlight important content.
      score_0_condition: The document does not use italic formatting.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s06-four-related-secondary-keywords
      implementation_type: llm
      score: 1
      criterion: Whether four related secondary keywords are listed after the article body.
      score_1_condition: Four related secondary keywords are listed after the body.
      score_0_condition: Secondary keywords are not listed after the body, or fewer than four are listed.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s07-final-section-document-mentions
      implementation_type: code
      score: 1
      criterion: Whether the final section of the document mentions a "Pull quote" and the selected quote text.
      score_1_condition: The final section mentions a Pull quote and the selected quote text.
      score_0_condition: The final section does not mention a Pull quote or does not include quote text.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s08-four-listed-secondary-keywords
      implementation_type: code
      score: 1
      criterion: Whether each of the four listed secondary keywords appears at least once in the article body.
      score_1_condition: Each of the four secondary keywords appears at least once in the body.
      score_0_condition: Some secondary keywords do not appear in the body.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s09-four-listed-secondary-keywords
      implementation_type: code
      score: 1
      criterion: Whether the four listed secondary keywords are all distinct from each other and none is the primary keyword "Gore-Tex ePE".
      score_1_condition: The four secondary keywords are all distinct from each other and none is the primary keyword "Gore-Tex ePE".
      score_0_condition: Secondary keywords contain duplicates, or include the primary keyword "Gore-Tex ePE".
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s10-body-includes-interspersed-links
      implementation_type: llm
      score: 1
      criterion: Whether the body includes interspersed links to news or external resources related to "PFAS-free outdoor gear" or "Gore-Tex ePE" (using SEO-friendly anchor text).
      score_1_condition: The body includes interspersed links to relevant news or external resources, using SEO-friendly anchor text.
      score_0_condition: The body does not provide relevant external links, or link anchor text does not meet SEO requirements.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s11-body-contains-laypersonfriendly-definition
      implementation_type: llm
      score: 1
      criterion: Whether the body contains a layperson-friendly definition of PFAS (per- and polyfluoroalkyl substances), mentioning their persistence ("forever chemicals" or equivalent phrasing) and environmental impact.
      score_1_condition: Contains a layperson-friendly definition of PFAS, mentioning their persistence ("forever chemicals" or equivalent phrasing) and environmental impact.
      score_0_condition: Does not contain a layperson-friendly definition of PFAS, or does not mention persistence and environmental impact.
      related_facts: PFAS (Per- and Polyfluoroalkyl Substances) are known as "forever chemicals" because they are extremely persistent in the environment, causing long-term contamination to water sources and ecosystems.
      facts_reference: https://en.wikipedia.org/wiki/Per-_and_polyfluoroalkyl_substances
      query_reference: ''
    - rubric_item_id: gm20-s12-body-contains-beginnerfriendly-definition
      implementation_type: llm
      score: 1
      criterion: Whether the body contains a beginner-friendly definition of the "Gore-Tex ePE membrane", mentioning microporous structure and waterproof-breathable performance.
      score_1_condition: Contains a beginner-friendly definition of the ePE membrane, mentioning microporous structure and waterproof-breathable performance.
      score_0_condition: Does not contain a definition of the ePE membrane, or the definition is not accessible enough, or does not mention microporous structure and waterproof-breathable performance.
      related_facts: 'ePE (expanded polyethylene) achieves the same microporous structure as ePTFE but is made from polyethylene containing zero fluorine atoms. The result: equivalent waterproofing and breathability, with no PFAS in the membrane itself.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s13-technical-terms-appear-eptfe
      implementation_type: llm
      score: 1
      criterion: If technical terms appear ("ePTFE", "ePE", "MVTR", "DWR", "PFAS", "laminate", "face fabric"), whether a layperson-friendly definition is provided within two sentences.
      score_1_condition: All technical terms that appear are given a layperson-friendly definition within two sentences.
      score_0_condition: Technical terms appear without a layperson-friendly definition provided within two sentences.
      related_facts: ''
      facts_reference: ''
      query_reference: 'explain ePE membrane technology to non-technical outdoor enthusiasts'
    - rubric_item_id: gm20-s14-article-compares-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article compares at least two specific performance differences or advantages/disadvantages between ePE and ePTFE membrane technologies (e.g., carbon footprint, PFAS content, waterproof rating, breathability).
      score_1_condition: At least two specific performance differences or advantages/disadvantages are compared.
      score_0_condition: No ePE vs ePTFE performance comparison is made, or only one aspect is compared.
      related_facts: 'Key advantages of ePE membrane over ePTFE include: PFAS-free, approximately 35% lower carbon footprint (Higg MSI), and equal or better waterproof-breathable performance.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'communicating the breakthrough advantages of ePE technology over traditional ePTFE technology'
    - rubric_item_id: gm20-s15-article-mentions-fc0-dwr
      implementation_type: llm
      score: 1
      criterion: Whether the article mentions FC0 DWR (durable water repellent) treatment and explains its "PFAS-free" characteristic.
      score_1_condition: Mentions FC0 DWR treatment and explains its PFAS-free characteristic.
      score_0_condition: Does not mention FC0 DWR, or does not explain its PFAS-free characteristic.
      related_facts: FC0 DWR refers to a fluorocarbon-free formulation that achieves water repellency without using any PFAS. The Summit Pro uses FC0 (PFAS-free fluorocarbon-free) DWR.
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s16-article-explains-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article explains at least one advantage of C-KNIT™ backer technology compared to traditional linings (e.g., lighter, more breathable, quieter).
      score_1_condition: Explains at least one advantage of C-KNIT™ backer technology compared to traditional linings.
      score_0_condition: Does not mention the advantages of C-KNIT™ backer technology.
      related_facts: 'C-KNIT™ is Gore''s proprietary circular-knit backer technology. Instead of a stiff tricot liner, C-KNIT™ uses a softer, more open-weave structure. Benefits: up to 10% lighter, noticeably quieter during movement, improved moisture wicking.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s17-article-explicitly-states-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the article explicitly states that the AlpineVista Summit Pro uses Gore-Tex Pro ePE technology, and cites at least two specific parameters from the "Product Specifications" section of the reference file.
      score_1_condition: Explicitly states the product technology and cites at least two specific parameters (e.g., waterproof rating, weight, breathability MVTR, etc.).
      score_0_condition: Does not state the product uses Gore-Tex Pro ePE technology, or cites fewer than two specific parameters.
      related_facts: 'Summit Pro specs: Gore-Tex Pro ePE membrane, 3-layer laminate, 80D recycled nylon ripstop face fabric, C-KNIT™ backer, 28,000 mm waterproof rating, 25,000 g/m²/24h MVTR, FC0 DWR, 425 g weight, $599 MSRP.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'The brand''s flagship product is the Summit Pro hardshell jacket, which uses Gore-Tex Pro ePE (expanded polyethylene) membrane technology, priced at $599'
    - rubric_item_id: gm20-s18-article-cites-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article cites at least two specific data points or statistics from the "Market Statistics & Data Points" section of the reference file, with sources noted.
      score_1_condition: Cites at least two specific market data points or statistics, with sources noted.
      score_0_condition: Does not cite market data, or cites fewer than two data points, or sources are not noted.
      related_facts: 'Market data includes: $2.8B hardshell market (Allied Market Research, 2025), 73% consumers influenced by sustainability (OIA, Q3 2025), 89% legacy jackets contain PFAS (Glüge et al., 2024), ePE ~35% lower carbon footprint (Gore 2025 Sustainability Report), EU PFAS restriction expected by 2027, 22% YoY growth in PFAS-free searches.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s19-article-cites-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article cites at least two experts or institutions from the "Key Experts & Voices to Reference" section of the reference file, including their names/institution names and quotable viewpoints or key findings.
      score_1_condition: Cites at least two experts or institutions, including names/institution names and quotable viewpoints or key findings.
      score_0_condition: Does not cite experts, or cites fewer than two, or does not include viewpoints/findings.
      related_facts: 'Key experts: Dr. Amara Singh (Empa) — ePE matches ePTFE durability; Dr. Kai Lüdemann (UFZ Leipzig) — ePE is ''the most commercially viable pathway to a fluorine-free hardshell''; Maria Torres-Vega (EOG) — brands adopting ePE gain 2-3 year compliance head start; Jake Orton (IFMGA Guide) — ePE is ''noticeably more supple and quieter''.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s20-cited-expert-their-affiliated
      implementation_type: llm
      score: 1
      criterion: For each cited expert, whether their affiliated institution or professional title/background is provided.
      score_1_condition: Each cited expert is provided with their affiliated institution or professional title/background.
      score_0_condition: Cited experts are missing institutional or professional background information.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s21-article-states-epe-membrane
      implementation_type: llm
      score: 1
      criterion: Whether the article states that the ePE membrane has a lower carbon footprint than ePTFE (citing Higg MSI or Gore sustainability report, with a specific reduction of approximately 35%).
      score_1_condition: States that ePE has a lower carbon footprint than ePTFE, citing Higg MSI or Gore sustainability report.
      score_0_condition: Does not state the carbon footprint comparison, or does not cite an authoritative source.
      related_facts: According to Higg MSI data, the Gore-Tex ePE membrane has approximately 35% lower carbon footprint than ePTFE.
      facts_reference: Gore-Tex Sustainability Report; Higg Materials Sustainability Index
      query_reference: ''
    - rubric_item_id: gm20-s22-cited-statistics-include-original
      implementation_type: llm
      score: 1
      criterion: Whether all cited statistics include the original source name and publication year.
      score_1_condition: All cited statistics include the original source name and publication year (e.g., "Allied Market Research, 2025").
      score_0_condition: Some statistics are missing source or year information.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s23-article-includes-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article includes at least one competitor comparison, referencing at least two competitor products from the "Competitive Landscape" table in the reference file, and comparing at least one parameter.
      score_1_condition: Includes competitor comparison, references at least two competitor products, and compares at least one parameter (price, membrane type, or key differentiator).
      score_0_condition: Does not include competitor comparison, or references fewer than two competitor products, or does not compare any parameter.
      related_facts: 'Competitors: Arc''teryx Alpha SV ($799, Gore-Tex Pro ePE), Patagonia Triolet ($449, Gore-Tex ePE non-Pro), Norrøna Trollveggen ($699, Gore-Tex Pro ePE), Mountain Hardwear Exposure/2 ($500, Gore-Tex Pro ePTFE), Mammut Nordwand Advanced ($750, Gore-Tex Pro ePE). AlpineVista Summit Pro at $599 is below the $637 competitive average.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s24-competitor-comparison-presented-structured
      implementation_type: llm
      score: 1
      criterion: Whether the competitor comparison is presented in a structured table format, including at least three products.
      score_1_condition: Competitor comparison is presented in a structured table format, including at least three products.
      score_0_condition: Competitor comparison is described only in prose paragraphs, or the table includes fewer than three products.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s25-competitor-comparison-positions-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the competitor comparison positions the AlpineVista Summit Pro as "best value in Pro ePE" or equivalent phrasing.
      score_1_condition: Positions the Summit Pro as best value in Pro ePE or equivalent phrasing.
      score_0_condition: Does not position the Summit Pro for value.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s26-article-prominently-features-least
      implementation_type: llm
      score: 1
      criterion: Whether the article prominently features at least two different outdoor activity scenarios where the product is suitable (e.g., alpine climbing, backcountry skiing, heavy rain hiking), with explanations tied to specific ePE technology advantages.
      score_1_condition: Features at least two different outdoor activity scenarios, with explanations tied to ePE technology advantages.
      score_0_condition: Does not introduce use scenarios, or fewer than two scenarios, or does not tie to technology advantages.
      related_facts: ''
      facts_reference: ''
      query_reference: 'fully showcasing the practical benefits of the new technology for athletes, consumers, and the environment'
    - rubric_item_id: gm20-s27-article-addresses-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article addresses at least one common consumer concern (e.g., durability, DWR reapplication, price justification, etc.).
      score_1_condition: Addresses at least one common consumer concern.
      score_0_condition: Does not address any common consumer concerns.
      related_facts: ''
      facts_reference: ''
      query_reference: 'help them make better purchasing decisions'
    - rubric_item_id: gm20-s28-article-mentions-previews-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the article mentions or previews the "AlpineVista Community Trail Reports" feature (a social platform launching in Q3 2026), linking it to real-world use scenarios for the ePE hardshell jacket.
      score_1_condition: Mentions or previews the Community Trail Reports feature, linking it to product use scenarios.
      score_0_condition: Does not mention the Community Trail Reports feature.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s29-conclusion-explicitly-states-upcoming
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion explicitly states that upcoming articles will include field-use reviews or interviews with professional athletes or professional guides.
      score_1_condition: The conclusion explicitly states that upcoming articles will include field-use reviews or interviews with professional athletes or professional guides.
      score_0_condition: The conclusion does not mention upcoming field reviews or interviews.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s30-epe-vs-eptfe-core
      implementation_type: llm
      score: 1
      criterion: Whether the ePE vs ePTFE core attribute comparison is presented in a structured table format (with at least 5 comparison dimensions).
      score_1_condition: The ePE vs ePTFE comparison is presented in a structured table format, with at least 5 comparison dimensions.
      score_0_condition: The ePE vs ePTFE comparison is described only in paragraphs, or the table has fewer than 5 comparison dimensions.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s31-article-includes-faq-section
      implementation_type: llm
      score: 1
      criterion: Whether the article includes a FAQ section with at least two questions in natural language question format, where the first sentence of each answer is a direct answer.
      score_1_condition: Includes a FAQ section with at least two natural language questions, where the first sentence of each answer is a direct answer.
      score_0_condition: Does not include a FAQ section, or fewer than two questions, or the first sentence of answers is not a direct answer.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s32-conclusion-paragraph-explicitly-states
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion paragraph explicitly states that upcoming articles will include a Gore-Tex ePE hardshell jacket buying guide or care and maintenance tutorial.
      score_1_condition: The conclusion explicitly states that upcoming articles will include a buying guide or care and maintenance tutorial.
      score_0_condition: The conclusion does not mention upcoming buying guides or care tutorials.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s33-conclusion-paragraphs-market-trend
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion paragraph's market trend summary cites specific data (e.g., market size or growth rate) as support.
      score_1_condition: The conclusion's market trend summary cites specific data (e.g., market size or growth rate).
      score_0_condition: The conclusion's market trend summary does not cite any specific data.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s34-major-content-paragraph-first
      implementation_type: llm
      score: 4
      criterion: Whether each major content paragraph (first paragraph under H2) opens with a conclusive statement (directly answering the implicit question of the section heading), rather than beginning with background context or transitional phrasing.
      score_4_condition: All H2 opening paragraphs begin with conclusive statements, directly answering the implicit question of the heading, with no background preamble.
      score_3_condition: Most H2 opening paragraphs begin with conclusive statements, directly answering the implicit question of the heading.
      score_2_condition: Approximately half of H2 opening paragraphs begin with conclusive statements.
      score_1_condition: A few H2 opening paragraphs use conclusion-first writing, but not consistently.
      score_0_condition: Most H2 opening paragraphs begin with background context or transitional phrasing, not using conclusion-first writing.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
  Optional Constraint:
    - rubric_item_id: gm20-o01-overall-formatting-style-readability
      implementation_type: llm
      score: 4
      criterion: Overall formatting, style, readability, and professionalism of the deliverable.
      score_4_condition: Formatting is exquisite, style is professionally unified, readability is excellent, reaching publication-quality standards.
      score_3_condition: Formatting is polished, style is unified, readability is good, demonstrates professionalism.
      score_2_condition: Formatting is standardized, style is generally consistent, readability is acceptable.
      score_1_condition: Basic formatting is complete, but there are obvious style inconsistencies or layout issues.
      score_0_condition: Formatting is chaotic, style is inconsistent, readability is poor, lacks professionalism.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
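
The rubric items above share one shape: a maximum `score` plus one `score_N_condition` for each level from 0 up to that maximum. A minimal sketch of a consistency check over that shape follows. The helper names `missing_levels` and `max_points` are hypothetical, the condition texts are abbreviated, and summing `score` values is only an assumed upper bound, since the page does not state how GDPVal aggregates points.

```python
from typing import Dict, List


def missing_levels(item: Dict) -> List[str]:
    """Return the score_N_condition keys an item lacks for N = 0..score."""
    return [
        f"score_{n}_condition"
        for n in range(item["score"] + 1)
        if not item.get(f"score_{n}_condition")
    ]


def max_points(items: List[Dict]) -> int:
    """Assumed upper bound for a group: the sum of each item's `score`."""
    return sum(item["score"] for item in items)


# Abbreviated copies of two items from the rubric (condition text shortened).
paragraph_first = {
    "rubric_item_id": "gm20-s34-major-content-paragraph-first",
    "score": 4,
    "score_4_condition": "All H2 opening paragraphs begin with conclusive statements.",
    "score_3_condition": "Most H2 opening paragraphs begin with conclusive statements.",
    "score_2_condition": "Approximately half do.",
    "score_1_condition": "A few do, but not consistently.",
    "score_0_condition": "Most begin with background context.",
}
formatting = {
    "rubric_item_id": "gm20-o01-overall-formatting-style-readability",
    "score": 4,
    "score_4_condition": "Publication-quality formatting.",
    "score_3_condition": "Polished formatting.",
    "score_2_condition": "Standardized formatting.",
    "score_1_condition": "Basic formatting with obvious issues.",
    "score_0_condition": "Chaotic formatting.",
}
# A deliberately incomplete, made-up item to show what the check reports.
incomplete = {"rubric_item_id": "example-incomplete", "score": 2,
              "score_0_condition": "Absent."}

print(missing_levels(paragraph_first))
print(missing_levels(incomplete))
print(max_points([paragraph_first, formatting]))
```

Running a check like this before grading would catch an item whose scale has a gap, which would otherwise leave an evaluator without a condition to match for that level.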

E-Commerce - Sales Analysis

Prompt

You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website. The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry, shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel.

Valentine's Day is approaching, and you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides) for submission to the Operations Director for approval. The proposal must include the following four modules:

1. Theme & Copy: Craft an attractive campaign theme and core copy for this Valentine's Day promotion.
2. Poster Design Concept: Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.
3. Promotion Timeline: Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.
4. Expected Results: Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.

Attachment descriptions:

- order-list.pdf: Contains 36 curated luxury accessories for this Valentine's Day selection (18 each from Louis Vuitton and Chanel), spanning five categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods. Each product includes an illustration, SKU, color, material, dimensions, special features, and notes.
- purchase-order.xlsx: Contains each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price), proposed purchase quantity, as well as formula-calculated gross profit and gross margin.

Please output the final presentation in PDF format.

Rubric

constraint-catalog.yaml

Reference Files

order-list.pdf

purchase-order.xlsx

Golden Deliverables

valentines-promo-plan.pdf

constraint-catalog.yaml

task_name: 0021-gs-valentines-promo-plan
deliverable_type: pdf
result_filename: validation-result.json
rubric:
  Hard Constraint:
  - rubric_item_id: gs21-h01-deliverable-pdf-format
    implementation_type: code
    score: 1
    criterion: Whether the deliverable is in PDF format.
    score_1_condition: The deliverable file has a .pdf extension.
    score_0_condition: The deliverable is not a PDF file.
    related_facts: ''
    facts_reference: ''
    query_reference: Please output the final presentation in PDF format.
  - rubric_item_id: gs21-h02-output-contains-more-10
    implementation_type: code
    score: 1
    criterion: Whether the output contains no more than 10 slides (pages).
    score_1_condition: The PDF contains no more than 10 pages.
    score_0_condition: The PDF contains more than 10 pages.
    related_facts: ''
    facts_reference: ''
    query_reference: 'you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides)'
  - rubric_item_id: gs21-h03-title-slide-contains-luxeselect
    implementation_type: llm
    score: 1
    criterion: |-
      Whether the title on Slide 1 contains "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
    score_1_condition: |-
      The first page text contains both "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
    score_0_condition: The first page is missing one or both of the required phrases.
    related_facts: ''
    facts_reference: ''
    query_reference: >-
      You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website. The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry, shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel.

      Valentine's Day is approaching, and you need to create a promotion plan proposal

  - rubric_item_id: gs21-h04-least-one-slide-dedicated
    implementation_type: llm
    score: 1
    criterion: "Whether at least one slide is dedicated to presenting the Valentine's Day promotion theme."
    score_1_condition: "At least one slide focuses on the Valentine's Day promotion theme."
    score_0_condition: No slide is dedicated to the promotion theme.
    related_facts: ''
    facts_reference: ''
    query_reference: "1. **Theme & Copy**: Craft an attractive campaign theme and core copy for this Valentine's Day promotion."
  - rubric_item_id: gs21-h05-least-one-slide-presents
    implementation_type: llm
    score: 1
    criterion: Whether at least one slide presents a poster design concept or visual design brief.
    score_1_condition: At least one slide presents a poster design concept or visual design brief.
    score_0_condition: No slide presents a poster design concept.
    related_facts: ''
    facts_reference: ''
    query_reference: '2. **Poster Design Concept**: Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.'
  - rubric_item_id: gs21-h06-least-one-slide-presents
    implementation_type: llm
    score: 1
    criterion: Whether at least one slide presents a promotion timeline or rollout schedule.
    score_1_condition: At least one slide presents a promotion timeline or rollout schedule.
    score_0_condition: No slide presents a promotion timeline.
    related_facts: ''
    facts_reference: ''
    query_reference: '3. **Promotion Timeline**: Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.'
  - rubric_item_id: gs21-h07-least-one-slide-presents
    implementation_type: llm
    score: 1
    criterion: Whether at least one slide presents expected results or financial forecasts.
    score_1_condition: At least one slide presents expected results or financial forecasts.
    score_0_condition: No slide presents expected results or financial forecasts.
    related_facts: ''
    facts_reference: ''
    query_reference: '4. **Expected Results**: Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy
      and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
  - rubric_item_id: gs21-h08-proposal-includes-summary-table
    implementation_type: llm
    score: 1
    criterion: Whether the proposal includes a summary table listing basic information for all 36 SKUs (may be grouped by category).
    score_1_condition: A summary table listing basic information for all 36 SKUs is present.
    score_0_condition: 'No summary table for all SKUs, or SKUs are significantly incomplete.'
    related_facts: 'The purchase-order.xlsx contains 36 SKUs total (18 Louis Vuitton + 18 Chanel), across 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
    facts_reference: purchase-order.xlsx
    query_reference: >-
      Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin),
      develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
  Soft Constraint:
  - rubric_item_id: gs21-s01-campaign-theme-copy-contains
    implementation_type: llm
    score: 1
    criterion: 'Whether the campaign theme copy contains keywords related to "love," "gift," or "romance" (accepts Love / Gift / Romance or equivalent expressions).'
    score_1_condition: 'The theme copy contains keywords related to love, gift, or romance.'
    score_0_condition: The theme copy does not contain any love/gift/romance keywords.
    related_facts: ''
    facts_reference: ''
    query_reference: 'Craft an attractive campaign theme and core copy for this Valentine''s Day promotion.'
  - rubric_item_id: gs21-s02-theme-copy-mentions-least
    implementation_type: llm
    score: 1
    criterion: Whether the theme copy mentions at least one brand name (Louis Vuitton or Chanel).
    score_1_condition: The theme copy mentions at least one brand name (Louis Vuitton or Chanel).
    score_0_condition: No brand name is mentioned in the theme copy.
    related_facts: The curated selection consists of 18 Louis Vuitton and 18 Chanel products.
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s03-proposal-includes-consumerfacing-promotional
    implementation_type: llm
    score: 1
    criterion: 'Whether the proposal includes a consumer-facing promotional tagline (a short, compelling one-liner).'
    score_1_condition: 'A short, compelling consumer-facing promotional tagline is present.'
    score_0_condition: No promotional tagline is present.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s04-copy-employs-scarcity-urgency
    implementation_type: llm
    score: 1
    criterion: 'Whether the copy employs scarcity or urgency tactics (e.g., "limited edition," "limited time," "exclusive," "only X left," or references limited/seasonal exclusive information from the ORDER
      LIST PDF).'
    score_1_condition: The copy employs scarcity or urgency tactics.
    score_0_condition: No scarcity or urgency tactics are used.
    related_facts: "The ORDER LIST PDF contains products with Notes referencing Valentine's Edition and seasonal exclusives."
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s05-copy-incorporates-giftgiving-scenario
    implementation_type: llm
    score: 1
    criterion: |-
      Whether the copy incorporates gift-giving scenario guidance (e.g., "pick for her/him," "Valentine's gift," "top gift choice," positioning products as gifts).
    score_1_condition: The copy incorporates gift-giving scenario guidance.
    score_0_condition: No gift-giving scenario guidance is present.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s06-poster-design-concept-clearly
    implementation_type: llm
    score: 1
    criterion: 'Whether the poster design concept clearly describes the visual style (e.g., color palette, typography style, festive elements, etc.).'
    score_1_condition: The poster design concept clearly describes the visual style.
    score_0_condition: No clear description of visual style in the poster design concept.
    related_facts: ''
    facts_reference: ''
    query_reference: 'Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.'
  - rubric_item_id: gs21-s07-poster-design-concept-includes
    implementation_type: llm
    score: 1
    criterion: Whether the poster design concept includes at least one specific product name or SKU from the ORDER LIST PDF.
    score_1_condition: The poster design concept references at least one specific product name or SKU from the ORDER LIST PDF.
    score_0_condition: No specific product name or SKU from the ORDER LIST PDF is referenced in the poster design concept.
    related_facts: 'The ORDER LIST PDF contains 36 products with product names, SKUs, and detailed descriptions.'
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s08-promotion-plan-covers-least
    implementation_type: llm
    score: 1
    criterion: 'Whether the promotion plan covers at least 3 different marketing channels (e.g., social media, email marketing, on-site banners, KOL partnerships, SMS push, etc.).'
    score_1_condition: The promotion plan covers at least 3 different marketing channels.
    score_0_condition: Fewer than 3 marketing channels are covered.
    related_facts: ''
    facts_reference: ''
    query_reference: 'Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.'
  - rubric_item_id: gs21-s09-promotion-timeline-includes-explicit
    implementation_type: llm
    score: 1
    criterion: 'Whether the promotion timeline includes explicit dates or phase divisions (e.g., warm-up phase, burst phase, encore phase).'
    score_1_condition: The promotion timeline includes explicit dates or phase divisions.
    score_0_condition: No explicit dates or phase divisions are present.
    related_facts: ''
    facts_reference: ''
    query_reference: 'Develop a complete promotion timeline from warm-up to campaign end'
  - rubric_item_id: gs21-s10-promotion-timeline-includes-least
    implementation_type: llm
    score: 1
    criterion: Whether the promotion timeline includes at least 3 time nodes or phases.
    score_1_condition: At least 3 time nodes or phases are identified.
    score_0_condition: Fewer than 3 time nodes or phases.
    related_facts: ''
    facts_reference: ''
    query_reference: 'Develop a complete promotion timeline from warm-up to campaign end'
  - rubric_item_id: gs21-s11-promotion-plan-mentions-valentines
    implementation_type: llm
    score: 1
    criterion: "Whether the promotion plan mentions Valentine's Day itself (February 14) or key dates around it."
    score_1_condition: "The promotion plan mentions February 14 or key dates around Valentine's Day."
    score_0_condition: "No mention of February 14 or key dates around Valentine's Day."
    related_facts: ''
    facts_reference: ''
    query_reference: "Valentine's Day is approaching, and you need to create a promotion plan proposal"
  - rubric_item_id: gs21-s12-promotion-phase-clear-objectives
    implementation_type: llm
    score: 1
    criterion: 'Whether each promotion phase has clear objectives or core actions (e.g., warm-up phase goal is "build anticipation," burst phase goal is "drive conversions"), rather than just listing channel
      names.'
    score_1_condition: Each promotion phase has clear objectives or core actions.
    score_0_condition: Phases only list channel names without clear objectives.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s13-there-coordination-logic-between
    implementation_type: llm
    score: 1
    criterion: 'Whether there is coordination logic between different channels (e.g., social media drives traffic → on-site banners catch visitors → email marketing drives conversions), demonstrating inter-channel
      synergy.'
    score_1_condition: There is clear coordination logic between channels.
    score_0_condition: No inter-channel coordination logic is demonstrated.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s14-promotion-plan-includes-least
    implementation_type: llm
    score: 1
    criterion: 'Whether the promotion plan includes at least one post-engagement conversion path description (e.g., the user journey from seeing an ad to completing a purchase).'
    score_1_condition: At least one conversion path description is included.
    score_0_condition: No conversion path description is included.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s15-set-skus-summary-table
    implementation_type: llm
    score: 1
    criterion: Whether the set of SKUs in the summary table exactly matches those listed in the Purchase Order Excel (no omissions or extras).
    score_1_condition: The SKU set exactly matches the Purchase Order Excel.
    score_0_condition: SKUs are missing or extra compared to the Purchase Order Excel.
    related_facts: The purchase-order.xlsx lists exactly 36 SKUs across 5 categories.
    facts_reference: purchase-order.xlsx
    query_reference: >-
      Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin),
      develop a reasonable promotional pricing strategy
  - rubric_item_id: gs21-s16-summary-table-includes-retail
    implementation_type: llm
    score: 1
    criterion: 'Whether the summary table includes a Retail Price column for each SKU, with values exactly matching the Purchase Order Excel (to the cent).'
    score_1_condition: Retail Price column is present and values match.
    score_0_condition: Retail Price column is missing or values do not match.
    related_facts: Retail prices for all 36 SKUs are defined in the purchase-order.xlsx.
    facts_reference: purchase-order.xlsx
    query_reference: >-
      Each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price),
      proposed purchase quantity, as well as formula-calculated gross profit and gross margin.
  - rubric_item_id: gs21-s17-summary-table-includes-cost
    implementation_type: llm
    score: 1
    criterion: 'Whether the summary table includes a Cost column for each SKU, with values exactly matching the Purchase Order Excel (to the cent).'
    score_1_condition: Cost column is present and values match.
    score_0_condition: Cost column is missing or values do not match.
    related_facts: Cost prices for all 36 SKUs are defined in the purchase-order.xlsx.
    facts_reference: purchase-order.xlsx
    query_reference: >-
      Each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price),
      proposed purchase quantity
  - rubric_item_id: gs21-s18-summary-table-includes-minimum
    implementation_type: llm
    score: 1
    criterion: 'Whether the summary table includes a Minimum Acceptable Price column for each SKU, with values exactly matching the Purchase Order Excel.'
    score_1_condition: Minimum Acceptable Price column is present and values match.
    score_0_condition: Minimum Acceptable Price column is missing or values do not match.
    related_facts: Minimum acceptable prices (brand floor prices) for all 36 SKUs are defined in the purchase-order.xlsx.
    facts_reference: purchase-order.xlsx
    query_reference: 'minimum acceptable selling price (brand floor price)'
  - rubric_item_id: gs21-s19-summary-table-includes-proposed
    implementation_type: llm
    score: 1
    criterion: 'Whether the summary table includes a Proposed Qty column for each SKU, with values exactly matching the Purchase Order Excel.'
    score_1_condition: Proposed Qty column is present and values match.
    score_0_condition: Proposed Qty column is missing or values do not match.
    related_facts: Proposed purchase quantities for all 36 SKUs are defined in the purchase-order.xlsx.
    facts_reference: purchase-order.xlsx
    query_reference: 'proposed purchase quantity'
  - rubric_item_id: gs21-s20-sku-gross-profit-retail
    implementation_type: llm
    score: 1
    criterion: 'Whether for each SKU, Gross Profit = Retail Price − Cost is calculated correctly (to the cent).'
    score_1_condition: Gross Profit is calculated correctly for each SKU.
    score_0_condition: Gross Profit calculations are incorrect or missing.
    related_facts: 'Gross Profit = Retail Price − Cost. The purchase-order.xlsx includes formula-calculated gross profit.'
    facts_reference: purchase-order.xlsx
    query_reference: 'formula-calculated gross profit and gross margin'
  - rubric_item_id: gs21-s21-sku-gross-margin-gross
    implementation_type: llm
    score: 1
    criterion: 'Whether for each SKU, Gross Margin % = Gross Profit / Retail Price is calculated correctly (within ±0.5% tolerance).'
    score_1_condition: 'Gross Margin % is calculated correctly within tolerance.'
    score_0_condition: 'Gross Margin % calculations are incorrect or missing.'
    related_facts: 'Gross Margin % = Gross Profit / Retail Price. The purchase-order.xlsx includes formula-calculated gross margin.'
    facts_reference: purchase-order.xlsx
    query_reference: 'formula-calculated gross profit and gross margin'
  - rubric_item_id: gs21-s22-promotional-prices-shown-promotional
    implementation_type: llm
    score: 1
    criterion: "If promotional prices are shown, whether all promotional prices are no lower than the corresponding SKU's Minimum Acceptable Price."
    score_1_condition: All promotional prices meet or exceed the Minimum Acceptable Price.
    score_0_condition: One or more promotional prices fall below the Minimum Acceptable Price.
    related_facts: Each SKU has a Minimum Acceptable Price (brand floor price) defined in purchase-order.xlsx. Promotional prices must not go below this floor.
    facts_reference: purchase-order.xlsx
    query_reference: 'minimum acceptable selling price (brand floor price)'
  - rubric_item_id: gs21-s23-expected-total-sales-revenue
    implementation_type: llm
    score: 1
    criterion: 'If expected total sales revenue is shown, whether its value equals the sum over all SKUs of (promotional price, or retail price where no promotional price is set) × proposed quantity, within ±$1.00 tolerance.'
    score_1_condition: Total sales revenue calculation is correct within tolerance.
    score_0_condition: Total sales revenue is incorrect or inconsistent.
    related_facts: 'Total Revenue = sum of (price × quantity) for all 36 SKUs.'
    facts_reference: purchase-order.xlsx
    query_reference: >-
      provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
  - rubric_item_id: gs21-s24-promotional-prices-shown-pricing
    implementation_type: llm
    score: 1
    criterion: 'If promotional prices are shown, whether the pricing reflects a differentiated strategy (different SKUs have different discount levels), rather than a uniform discount across all SKUs.'
    score_1_condition: Pricing reflects a differentiated strategy.
    score_0_condition: All SKUs have the same uniform discount.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s25-promotional-prices-shown-proposal
    implementation_type: llm
    score: 1
    criterion: 'If promotional prices are shown, whether the proposal simultaneously displays the original price (Retail Price) alongside the promotional price (e.g., strikethrough pricing), leveraging
      the price anchoring effect.'
    score_1_condition: Original price is displayed alongside promotional price.
    score_0_condition: Only promotional price is shown without the original price.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s26-promotional-prices-shown-luxury
    implementation_type: llm
    score: 1
    criterion: 'If promotional prices are shown, whether the luxury pricing style is appropriate (e.g., using round numbers rather than .99 discount pricing, consistent with premium brand positioning).'
    score_1_condition: Luxury pricing style is appropriate.
    score_0_condition: 'Pricing style is inappropriate for luxury positioning (e.g., .99 endings).'
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s27-expected-results-include-projected
    implementation_type: llm
    score: 1
    criterion: Whether expected results include a projected Total Revenue figure.
    score_1_condition: A projected Total Revenue figure is present.
    score_0_condition: No projected Total Revenue figure.
    related_facts: ''
    facts_reference: ''
    query_reference: 'provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
  - rubric_item_id: gs21-s28-expected-results-include-projected
    implementation_type: llm
    score: 1
    criterion: Whether expected results include a projected Total Gross Profit figure.
    score_1_condition: A projected Total Gross Profit figure is present.
    score_0_condition: No projected Total Gross Profit figure.
    related_facts: ''
    facts_reference: ''
    query_reference: 'provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
  - rubric_item_id: gs21-s29-both-total-sales-revenue
    implementation_type: llm
    score: 1
    criterion: "If both total sales revenue and total gross profit are shown, whether Total Gross Profit = Total Revenue − Total Cost (Total Cost = sum of each SKU's cost × proposed quantity), calculated correctly, within ±$1.00 tolerance."
    score_1_condition: Total Gross Profit calculation is correct within tolerance.
    score_0_condition: Total Gross Profit is incorrect or inconsistent.
    related_facts: 'Total Cost = sum of (cost × proposed quantity) for all 36 SKUs. Total Gross Profit = Total Revenue − Total Cost.'
    facts_reference: purchase-order.xlsx
    query_reference: >-
      provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
  - rubric_item_id: gs21-s30-expected-results-differentiate-between
    implementation_type: llm
    score: 1
    criterion: 'Whether expected results differentiate between scenarios (e.g., optimistic/baseline/conservative, or full-price vs. promotional price sales), rather than providing only a single figure.'
    score_1_condition: Expected results differentiate between scenarios.
    score_0_condition: Only a single-scenario figure is provided.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s31-expected-results-include-least
    implementation_type: llm
    score: 1
    criterion: 'Whether expected results include at least one non-financial metric forecast (e.g., expected order volume, average order value, conversion rate, new customer ratio, etc.).'
    score_1_condition: At least one non-financial metric forecast is included.
    score_0_condition: No non-financial metric forecast.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs21-s32-proposal-references-names-least
    implementation_type: llm
    score: 1
    criterion: Whether the proposal references the names of at least 8 products from the ORDER LIST PDF.
    score_1_condition: At least 8 product names from the ORDER LIST PDF are referenced.
    score_0_condition: Fewer than 8 product names are referenced.
    related_facts: 'The ORDER LIST PDF contains 36 products across 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s33-referenced-products-span-categories
    implementation_type: llm
    score: 1
    criterion: Whether the referenced products span at least 3 categories.
    score_1_condition: Referenced products span at least 3 categories.
    score_0_condition: Referenced products span fewer than 3 categories.
    related_facts: 'Products span 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s34-referenced-product-names-consistent
    implementation_type: llm
    score: 1
    criterion: 'Whether the referenced product names are consistent with the naming in the ORDER LIST PDF (exact precision not required, but brand and model must match).'
    score_1_condition: Referenced product names are consistent with the ORDER LIST PDF.
    score_0_condition: Product names do not match the ORDER LIST PDF.
    related_facts: ''
    facts_reference: order-list.pdf
    query_reference: ''
  - rubric_item_id: gs21-s35-proposal-references-least-one
    implementation_type: llm
    score: 1
    criterion: "Whether the proposal references at least one product's special description from the ORDER LIST PDF (e.g., Valentine's Edition, seasonal exclusive, or other Notes information)."
    score_1_condition: "At least one product's special description from the Notes column is referenced."
    score_0_condition: No product special descriptions are referenced.
    related_facts: "Some products in the ORDER LIST PDF have special Notes such as Valentine's Edition or seasonal exclusive designations."
    facts_reference: order-list.pdf
    query_reference: ''
  Optional Constraint:
  - rubric_item_id: gs21-o01-slides-clearly-laid-out
    implementation_type: llm
    score: 4
    criterion: 'Whether slides are clearly laid out with logical flow, content is organized by module, and the proposal demonstrates understanding of luxury e-commerce promotion context and consumer purchasing
      psychology.'
    score_4_condition: 'Professional layout with polished flow, excellent modular organization; deep understanding of luxury e-commerce context, multiple psychological strategies employed, perfect balance
      between promotional intensity and brand positioning.'
    score_3_condition: 'Clear layout, logical flow, well-organized by module; good understanding of luxury context, consumer psychology, and balanced promotional tone.'
    score_2_condition: Slides are organized by module with adequate layout; some luxury context and psychology demonstrated.
    score_1_condition: Basic layout exists but disorganized; limited understanding of luxury context.
    score_0_condition: 'Layout is chaotic, no modular organization, no understanding of luxury context or consumer psychology.'
    related_facts: ''
    facts_reference: ''
    query_reference: ''
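
The arithmetic items in this rubric (gs21-s20 through gs21-s23 and gs21-s29) are marked `implementation_type: llm`, but the formulas they state are simple enough to sketch in code. The following is a minimal illustration under stated assumptions, not the benchmark's own checker: the `Sku` field names and the two sample rows are invented (the real purchase-order.xlsx columns and values are not shown on this page), and the ±0.5% margin tolerance is read here as percentage points.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sku:
    # Hypothetical field names; the real spreadsheet columns may differ.
    sku: str
    cost: float
    retail_price: float
    floor_price: float          # minimum acceptable selling price
    proposed_qty: int
    promo_price: Optional[float] = None  # None means the SKU sells at retail


def gross_profit(item: Sku) -> float:
    """Gross Profit = Retail Price - Cost, rounded to the cent (gs21-s20)."""
    return round(item.retail_price - item.cost, 2)


def gross_margin_pct(item: Sku) -> float:
    """Gross Margin % = Gross Profit / Retail Price (gs21-s21)."""
    return gross_profit(item) / item.retail_price * 100


def floor_violations(items: List[Sku]) -> List[str]:
    """SKUs whose promotional price is below the brand floor (gs21-s22)."""
    return [i.sku for i in items
            if i.promo_price is not None and i.promo_price < i.floor_price]


def total_revenue(items: List[Sku]) -> float:
    """Sum of (promo price, else retail price) x proposed quantity (gs21-s23)."""
    return sum((i.promo_price if i.promo_price is not None else i.retail_price)
               * i.proposed_qty for i in items)


def total_gross_profit(items: List[Sku]) -> float:
    """Total Revenue - Total Cost, with Total Cost = sum(cost x qty) (gs21-s29)."""
    total_cost = sum(i.cost * i.proposed_qty for i in items)
    return total_revenue(items) - total_cost


def within(reported: float, expected: float, tol: float) -> bool:
    """True when a reported figure is inside the rubric's tolerance."""
    return abs(reported - expected) <= tol


# Invented sample rows, for illustration only.
items = [
    Sku("LV-001", cost=1200.00, retail_price=2000.00, floor_price=1700.00,
        proposed_qty=3, promo_price=1800.00),
    Sku("CH-001", cost=500.00, retail_price=800.00, floor_price=720.00,
        proposed_qty=5),
]

print(gross_profit(items[0]), gross_margin_pct(items[0]))
print(floor_violations(items))
print(within(9400.50, total_revenue(items), tol=1.00))   # +/- $1.00
print(within(3300.00, total_gross_profit(items), tol=1.00))
```

A deterministic pass of this kind could sit alongside the LLM judgment: the figures extracted from the slides are compared against values recomputed from the spreadsheet, and only the extraction step needs a model.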

Restaurant - Supply Analysis

Prompt

You are a supply chain management specialist at "SZW Hotpot" (蜀滋味), a Chinese hotpot chain brand. SZW Hotpot is headquartered in San Francisco Chinatown, primarily serving the Bay Area Chinese community and local diners. The brand currently operates 8 locations in the Bay Area (1 of which is closed), located in San Francisco Chinatown, Sunset District, Milpitas, Cupertino, Fremont, Oakland Chinatown, Daly City, and San Mateo.

The 2026 Chinese New Year / Lunar New Year falls in February (February 17 is New Year's Eve, February 18 is the first day of the new year). It is now mid-January 2026, and you need to develop a 2026 Spring Festival ingredient stocking plan based on Q1 2025 (January through March) historical operations data.

You will receive an Excel reference file containing Q1 2025 operations data ("szw-hotpot-2025-q1-operations-data.xlsx"), which includes 5 worksheets:

(a) Store Matrix (Store Master Data). Columns: Store ID, Store Name, Location, Store Type (Flagship / Standard / Express), Seats, Status (Active / Closed), Cold Storage (sqft), Delivery Zone, Notes. 7 stores are Active, 1 (SF-007, Daly City) is Closed. The Notes column flags anomalies relevant to planning (e.g., one store has extra foot traffic from a parade during Spring Festival, one store has limited cold storage requiring frozen item prioritization, etc.).

(b) Ingredient Master (Ingredient Master Data). Columns: Item ID, Item Name, Category (5 categories), Unit, Unit Cost ($), Shelf Life (days), Lead Time (days), MOQ (per store order), Wastage Rate, Storage Type (Refrigerated / Frozen / Ambient). 31 ingredients total. Shelf life ranges from 2 days (Fresh Juice) to 180 days (Beer, Soda); lead times range from 1 to 5 days; wastage rates range from 1% (ambient items) to 15% (Fresh Juice).

(c-e) Jan 2025 / Feb 2025 / Mar 2025 (Monthly Operations Data). Each sheet contains two data regions:
- Daily Sales Metrics: Date, Day, Store ID, Tables Served, Customer Count, Revenue ($), Avg Spend/Person ($), Table Turnover Rate, Food Cost Ratio
- Daily Ingredient Consumption: Date, Store ID, Category, Item Name, Unit, Consumption Qty, Unit Cost ($), Total Cost ($), Cumulative MTD Cost ($)

Please build an Excel workbook as the deliverable containing a Spring Festival stocking plan. The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.
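As a rough illustration of how the Ingredient Master fields above might feed a per-item stocking quantity, here is a minimal Python sketch. It is not the benchmark's reference method. The function names (`inclusive_days`, `round_up_to_step`, `stocking_qty`), the `dac` (daily average consumption) input, the 0.5 rounding step, and the treatment of MOQ as a floor on a single order are all assumptions made for illustration; safety stock, shelf life, and lead time are deliberately left out.

```python
import math
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Count calendar days in [start, end], including both endpoints."""
    return (end - start).days + 1


def round_up_to_step(qty: float, step: float = 0.5) -> float:
    """Round qty up to the next multiple of step (small epsilon absorbs float noise)."""
    return math.ceil(qty / step - 1e-9) * step


def stocking_qty(dac: float, days: int, wastage_rate: float, moq: float,
                 step: float = 0.5) -> float:
    """Hypothetical per-item quantity for one store over the planning window.

    net   = daily average consumption x number of days
    gross = net x (1 + wastage rate)
    The result is rounded up to a multiple of `step` and never falls below MOQ
    (assumption: the whole quantity is placed as one order).
    """
    net = dac * days
    gross = net * (1 + wastage_rate)
    if gross <= 0:
        return 0.0  # nothing consumed historically, so nothing to order
    return max(round_up_to_step(gross, step), moq)


# Planning window stated in the prompt: February 10 to February 24, 2026.
window_days = inclusive_days(date(2026, 2, 10), date(2026, 2, 24))
print(window_days)  # 15

# Illustrative inputs only; these numbers are not taken from the reference file.
print(stocking_qty(dac=2.0, days=window_days, wastage_rate=0.10, moq=5))
```

A real plan would likely split short shelf life items into several smaller orders and add a lead-time-based buffer, so the single-order MOQ floor here should be read as a simplification.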

Rubric

constraint-catalog.yaml

Reference Files

szw-hotpot-2025-q1-operations-data.xlsx

Golden Deliverables

szw-hotpot-2026-cny-stocking-plan.xlsx

constraint-catalog.yaml

task_name: 0022-gs-hotpot-stocking-plan
deliverable_type: xlsx
result_filename: validation-result.json
rubric:
  Hard Constraint:
  - rubric_item_id: gs22-h01-deliverable-single-xlsx-excel
    implementation_type: code
    score: 1
    criterion: Whether the deliverable is a single .xlsx Excel workbook file.
    score_1_condition: The deliverable file has a .xlsx extension and is a valid Excel workbook.
    score_0_condition: The deliverable is not an .xlsx file or is not a valid Excel workbook.
    related_facts: ''
    facts_reference: ''
    query_reference: Please build an Excel workbook as the deliverable
  - rubric_item_id: gs22-h02-workbook-contains-multiple-worksheets
    implementation_type: llm
    score: 1
    criterion: Whether the workbook contains multiple worksheets or clearly separated sections corresponding to historical analysis, stocking plan, budget summary, and other modules.
    score_1_condition: The workbook contains multiple worksheets or clearly separated sections for historical analysis, stocking plan, budget summary, and other modules.
    score_0_condition: The workbook lacks multiple worksheets or clearly separated sections for the required modules.
    related_facts: ''
    facts_reference: ''
    query_reference: 'The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.'
  - rubric_item_id: gs22-h03-number-formatting-consistent-dollar
    implementation_type: llm
    score: 1
    criterion: 'Whether number formatting is consistent: dollar amounts use $ with two decimal places, percentages use % format, quantities rounded to one decimal.'
    score_1_condition: 'Number formatting is consistent throughout: dollar amounts use $ with two decimal places, percentages use % format, quantities are rounded to one decimal.'
    score_0_condition: Number formatting is inconsistent or does not follow the required formats.
    related_facts: ''
    facts_reference: ''
    query_reference: ''
  - rubric_item_id: gs22-h04-workbook-contains-historical-data
    implementation_type: llm
    score: 1
    criterion: Whether the workbook contains a historical data summary for the 2025 Spring Festival period (January 20 to February 4).
    score_1_condition: The workbook contains a historical data summary covering the 2025 Spring Festival period (January 20 to February 4).
    score_0_condition: The workbook does not contain a historical data summary for the 2025 Spring Festival period.
    related_facts: ''
    facts_reference: ''
    query_reference: 'historical data analysis'
  - rubric_item_id: gs22-h05-workbook-contains-clearly-labeled
    implementation_type: llm
    score: 1
    criterion: Whether the workbook contains a clearly labeled 2026 Spring Festival stocking plan.
    score_1_condition: The workbook contains a clearly labeled 2026 Spring Festival stocking plan.
    score_0_condition: The workbook does not contain a clearly labeled 2026 Spring Festival stocking plan.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-h06-stocking-plans-time-range
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan's time range covers February 10 to February 24, 2026.
    score_1_condition: The stocking plan covers February 10 to February 24, 2026.
    score_0_condition: The stocking plan does not cover the required date range of February 10 to February 24, 2026.
    related_facts: The prompt explicitly requires the stocking plan to cover February 10 to February 24, 2026 (one week before to one week after the festival, 15 days total). February 17 is New Year's Eve, February 18 is the first day.
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-h07-workbook-contains-23-sentence
    implementation_type: llm
    score: 1
    criterion: Whether the workbook contains a 2-3 sentence stocking plan summary.
    score_1_condition: The workbook contains a 2-3 sentence stocking plan summary.
    score_0_condition: The workbook does not contain a stocking plan summary of 2-3 sentences.
    related_facts: ''
    facts_reference: ''
    query_reference: 'an executive summary'
  Soft Constraint:
  - rubric_item_id: gs22-s01-historical-data-correctly-extracted
    implementation_type: llm
    score: 1
    criterion: Whether historical data is correctly extracted and merged from "Jan 2025" and "Feb 2025" worksheets.
    score_1_condition: Data is correctly extracted and merged from both Jan 2025 and Feb 2025 worksheets.
    score_0_condition: Data is not correctly extracted or merged from both worksheets.
    related_facts: The 2025 Spring Festival period spans January 20 to February 4, requiring data from both Jan and Feb worksheets.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s02-revenue-feb-14-falls
    implementation_type: llm
    score: 1
    criterion: Whether revenue for Feb 1-4 falls between $138,000 and $138,500.
    score_1_condition: Revenue for Feb 1-4 is between $138,000 and $138,500.
    score_0_condition: Revenue for Feb 1-4 is outside the range of $138,000 to $138,500.
    related_facts: Feb 1-4 revenue from the Feb 2025 worksheet verifies cross-month data extraction completeness.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s03-summary-covers-active-stores
    implementation_type: llm
    score: 1
    criterion: Whether the summary covers all 7 Active stores (SF-001 through SF-006, SF-008).
    score_1_condition: 'The summary covers all 7 Active stores: SF-001, SF-002, SF-003, SF-004, SF-005, SF-006, and SF-008.'
    score_0_condition: The summary is missing one or more of the 7 Active stores.
    related_facts: 'Store Matrix has 8 stores total: SF-001 to SF-008. SF-007 (Daly City) is Closed. The 7 Active stores are SF-001 through SF-006 and SF-008.'
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s04-closed-store-sf007-appear
    implementation_type: llm
    score: 1
    criterion: Whether closed store SF-007 is absent from the historical summary or shows as 0.
    score_1_condition: SF-007 does not appear in the historical summary, or its values are 0.
    score_0_condition: SF-007 appears in the historical summary with non-zero values.
    related_facts: SF-007 (Daly City) has Status=Closed in the Store Matrix.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s05-total-revenue-during-spring
    implementation_type: llm
    score: 1
    criterion: Whether total revenue during Spring Festival falls between $506,500 and $507,500.
    score_1_condition: Total revenue during Spring Festival is between $506,500 and $507,500.
    score_0_condition: Total revenue during Spring Festival is outside the range of $506,500 to $507,500.
    related_facts: Sum of all 7 active stores' revenue from Jan 20 to Feb 4 (16 days).
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s06-total-customer-count-falls
    implementation_type: llm
    score: 1
    criterion: Whether total customer count falls between 11,350 and 11,400.
    score_1_condition: Total customer count is between 11,350 and 11,400.
    score_0_condition: Total customer count is outside the range of 11,350 to 11,400.
    related_facts: Sum of all 7 active stores' customer count from Jan 20 to Feb 4.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s07-total-consumption-cost-categories
    implementation_type: llm
    score: 1
    criterion: Whether total consumption cost by 5 categories is presented, with Meat between $45,000 and $45,400.
    score_1_condition: Total consumption cost is broken down by 5 ingredient categories, and Meat category cost is between $45,000 and $45,400.
    score_0_condition: Consumption cost is not broken down by 5 categories, or Meat category cost is outside $45,000 to $45,400.
    related_facts: 5 ingredient categories in Ingredient Master. Meat is the largest category by cost.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s08-consumption-data-presented-store
    implementation_type: llm
    score: 1
    criterion: Whether consumption data is presented in a store x category cross-dimensional matrix.
    score_1_condition: Consumption data is presented in a store x category cross-dimensional matrix.
    score_0_condition: Consumption data is not presented in a cross-dimensional matrix format.
    related_facts: ''
    facts_reference: ''
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s09-analysis-includes-dac-daily
    implementation_type: llm
    score: 1
    criterion: Whether the analysis includes a DAC (Daily Average Consumption) metric calculated as total divided by 16 days.
    score_1_condition: A DAC metric is included, calculated as total divided by 16 days.
    score_0_condition: No DAC metric is included, or it is not based on 16 days.
    related_facts: Spring Festival period is 16 days (Jan 20 to Feb 4 inclusive).
    facts_reference: ''
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s10-daily-average-revenue-between
    implementation_type: llm
    score: 1
    criterion: Whether daily average revenue is between $31,500 and $31,900.
    score_1_condition: Daily average revenue is between $31,500 and $31,900.
    score_0_condition: Daily average revenue is outside the range of $31,500 to $31,900.
    related_facts: Spring Festival daily average = total revenue / 16 days.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s11-total-consumption-31-ingredients
    implementation_type: llm
    score: 1
    criterion: Whether total consumption for each of all 31 ingredients is included.
    score_1_condition: Total consumption for each of all 31 ingredients is included.
    score_0_condition: Total consumption data is missing for some of the 31 ingredients.
    related_facts: Ingredient Master contains 31 ingredients across 5 categories.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s12-total-ingredient-cost-between
    implementation_type: llm
    score: 1
    criterion: Whether total ingredient cost is between $113,000 and $113,800.
    score_1_condition: Total ingredient cost is between $113,000 and $113,800.
    score_0_condition: Total ingredient cost is outside the range of $113,000 to $113,800.
    related_facts: Sum of all ingredient costs during the Spring Festival period.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s13-nonspring-festival-daily-average
    implementation_type: llm
    score: 1
    criterion: Whether non-Spring Festival daily average revenue is between $17,200 and $17,600.
    score_1_condition: Non-Spring Festival daily average revenue is between $17,200 and $17,600.
    score_0_condition: Non-Spring Festival daily average revenue is outside the range of $17,200 to $17,600.
    related_facts: Non-Spring Festival period = January 1-19 (19 days).
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s14-ingredient-unit-costs-match
    implementation_type: llm
    score: 1
    criterion: Whether ingredient unit costs match the Ingredient Master exactly.
    score_1_condition: All ingredient unit costs match the Ingredient Master.
    score_0_condition: One or more ingredient unit costs do not match the Ingredient Master.
    related_facts: Unit costs are defined in the Ingredient Master worksheet.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s15-analysis-includes-peaktonormal-ratio
    implementation_type: llm
    score: 1
    criterion: Whether the analysis includes a Peak-to-Normal Ratio calculation.
    score_1_condition: A Peak-to-Normal Ratio calculation is included.
    score_0_condition: No Peak-to-Normal Ratio calculation is included.
    related_facts: ''
    facts_reference: ''
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s16-revenue-peaktonormal-ratio-between
    implementation_type: llm
    score: 1
    criterion: Whether revenue Peak-to-Normal Ratio is between 1.78 and 1.88.
    score_1_condition: Revenue Peak-to-Normal Ratio is between 1.78 and 1.88.
    score_0_condition: Revenue Peak-to-Normal Ratio is outside the range of 1.78 to 1.88.
    related_facts: Ratio = Spring Festival daily avg revenue / non-Spring Festival daily avg revenue.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'historical data analysis, trend and peak analysis'
  - rubric_item_id: gs22-s17-revenue-comparison-across-jan
    implementation_type: llm
    score: 1
    criterion: Whether revenue comparison across Jan, Feb, Mar 2025 is included (Jan approximately $700K, Feb approximately $528K, Mar approximately $415K).
    score_1_condition: Revenue comparison across Jan, Feb, Mar 2025 is included with values approximately matching Jan ~$700K, Feb ~$528K, Mar ~$415K.
    score_0_condition: Revenue comparison across Jan, Feb, Mar 2025 is missing or values are significantly off.
    related_facts: Monthly revenue totals from the three monthly worksheets.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s18-feb-vs-jan-change
    implementation_type: llm
    score: 1
    criterion: Whether Feb vs Jan change rate is between -22% and -27%.
    score_1_condition: Feb vs Jan change rate is between -22% and -27%.
    score_0_condition: Feb vs Jan change rate is outside the range of -22% to -27%.
    related_facts: Change rate = (Feb revenue / Jan revenue) - 1.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s19-mar-vs-feb-change
    implementation_type: llm
    score: 1
    criterion: Whether Mar vs Feb change rate is between -19% and -24%.
    score_1_condition: Mar vs Feb change rate is between -19% and -24%.
    score_0_condition: Mar vs Feb change rate is outside the range of -19% to -24%.
    related_facts: Change rate = (Mar revenue / Feb revenue) - 1.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s20-monthly-consumption-trends-ingredient
    implementation_type: llm
    score: 1
    criterion: Whether monthly consumption trends by ingredient category are included.
    score_1_condition: Monthly consumption trends broken down by ingredient category are included.
    score_0_condition: Monthly consumption trends by ingredient category are missing.
    related_facts: ''
    facts_reference: ''
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s21-percentage-change-calculations-correct
    implementation_type: llm
    score: 1
    criterion: Whether all percentage change calculations are correct within plus or minus 1%.
    score_1_condition: All percentage change calculations are correct within +-1%.
    score_0_condition: One or more percentage change calculations are off by more than +-1%.
    related_facts: ''
    facts_reference: ''
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s22-quantitative-description-spring-festival
    implementation_type: llm
    score: 1
    criterion: Whether a quantitative description of the Spring Festival effect is included.
    score_1_condition: A quantitative description of the Spring Festival effect is included.
    score_0_condition: No quantitative description of the Spring Festival effect is provided.
    related_facts: ''
    facts_reference: ''
    query_reference: 'trend and peak analysis'
  - rubric_item_id: gs22-s23-stocking-plan-broken-down
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan is broken down by store for all 7 Active stores.
    score_1_condition: The stocking plan is broken down by store for all 7 Active stores (SF-001 through SF-006, SF-008).
    score_0_condition: The stocking plan is missing one or more Active stores or is not broken down by store.
    related_facts: '7 Active stores: SF-001 through SF-006, SF-008.'
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s24-closed-store-sf007-excluded
    implementation_type: llm
    score: 1
    criterion: Whether closed store SF-007 is excluded from the stocking plan or has quantity equal to 0.
    score_1_condition: SF-007 does not appear in the stocking plan, or its quantities are 0.
    score_0_condition: SF-007 appears in the stocking plan with non-zero quantities.
    related_facts: SF-007 (Daly City) is Closed.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s25-store-ids-stocking-plan
    implementation_type: llm
    score: 2
    criterion: Whether all store IDs in the stocking plan exist in the Store Matrix (no invalid store IDs).
    score_1_condition: All store IDs in the stocking plan are valid IDs from the Store Matrix.
    score_0_condition: One or more store IDs in the stocking plan do not exist in the Store Matrix.
    related_facts: 'Valid store IDs: SF-001 through SF-008.'
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
    score_2_condition: The deliverable satisfies the criterion to a meaningful intermediate extent, but still has noticeable omissions, inaccuracies, or consistency gaps.
  - rubric_item_id: gs22-s26-stocking-plan-covers-31
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan covers all 31 ingredients.
    score_1_condition: The stocking plan covers all 31 ingredients from the Ingredient Master.
    score_0_condition: The stocking plan is missing one or more of the 31 ingredients.
    related_facts: Ingredient Master contains 31 ingredients.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s27-stocking-plan-covers-ingredient
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan covers all 5 ingredient categories.
    score_1_condition: The stocking plan covers all 5 ingredient categories.
    score_0_condition: The stocking plan is missing one or more of the 5 ingredient categories.
    related_facts: 5 categories in Ingredient Master.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s28-stocking-plan-broken-down
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan is broken down by both store and ingredient name.
    score_1_condition: The stocking plan is broken down by both store and ingredient name.
    score_0_condition: The stocking plan is not broken down by both store and ingredient name.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s29-flagship-stores-greater-total
    implementation_type: llm
    score: 1
    criterion: Whether Flagship stores have greater total stocking quantities than Standard stores, and Standard stores greater than the Express store.
    score_1_condition: Flagship stores (SF-001, SF-003) > Standard stores (SF-002, SF-004, SF-005, SF-008) > Express store (SF-006) in total stocking quantities.
    score_0_condition: The ordering of Flagship > Standard > Express in total stocking quantities is not maintained.
    related_facts: 'Flagship: SF-001 (160 seats), SF-003 (180 seats). Standard: SF-002, SF-004, SF-005, SF-008. Express: SF-006.'
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s30-stocking-plan-includes-dac
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan includes a DAC (Daily Average Consumption) column derived from 2025 Spring Festival data.
    score_1_condition: A DAC column from 2025 Spring Festival data is included in the stocking plan.
    score_0_condition: No DAC column from 2025 Spring Festival data is included.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s31-stocking-plan-includes-wastage
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan includes a Wastage Rate matching the Ingredient Master.
    score_1_condition: Wastage Rate is included and matches the values in the Ingredient Master.
    score_0_condition: Wastage Rate is missing or does not match the Ingredient Master.
    related_facts: Wastage rates in Ingredient Master range from 1% to 15%.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s32-gross-requirement-greater-equal
    implementation_type: llm
    score: 1
    criterion: Whether Gross Requirement is greater than or equal to Net Requirement multiplied by (1 + Wastage Rate).
    score_1_condition: Gross Requirement >= Net Requirement x (1 + Wastage Rate) for all items.
    score_0_condition: Gross Requirement is less than Net Requirement x (1 + Wastage Rate) for one or more items.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s33-stocking-plan-includes-safety
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan includes a Safety Stock column.
    score_1_condition: A Safety Stock column is included in the stocking plan.
    score_0_condition: No Safety Stock column is included.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s34-safety-stock-positively-correlated
    implementation_type: llm
    score: 1
    criterion: Whether Safety Stock is positively correlated with Lead Time.
    score_1_condition: Safety Stock values are positively correlated with Lead Time (longer lead time results in higher safety stock).
    score_0_condition: Safety Stock is not positively correlated with Lead Time.
    related_facts: Lead times in Ingredient Master range from 1 to 5 days.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s35-moq-minimum-order-quantity
    implementation_type: llm
    score: 1
    criterion: Whether MOQ (Minimum Order Quantity) constraint is respected in the stocking plan.
    score_1_condition: MOQ constraint is respected for all ingredients in the stocking plan.
    score_0_condition: MOQ constraint is violated for one or more ingredients.
    related_facts: MOQ values are defined per ingredient in the Ingredient Master.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s36-short-shelf-life-items
    implementation_type: llm
    score: 1
    criterion: Whether short shelf life items (5 days or fewer) use batch ordering rather than a single bulk order.
    score_1_condition: Short shelf life items (shelf life <= 5 days) use batch ordering.
    score_0_condition: Short shelf life items do not use batch ordering.
    related_facts: Short shelf life items include Fresh Beef Abomasum, Duck Blood Curd, Fresh Juice (2-5 days shelf life).
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s37-long-shelf-life-items
    implementation_type: llm
    score: 1
    criterion: Whether long shelf life items (60 days or more) allow a single bulk order.
    score_1_condition: Long shelf life items (shelf life >= 60 days) are placed as a single bulk order, or the plan allows for one.
    score_0_condition: Long shelf life items are unnecessarily split into multiple orders.
    related_facts: Long shelf life items include Soup Base varieties (90 days), Beer (180 days), Soda (180 days).
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s38-estimated-cost-equals-total
    implementation_type: llm
    score: 1
    criterion: Whether Estimated Cost equals Total Quantity multiplied by Unit Cost, correct within plus or minus $0.50.
    score_1_condition: Estimated Cost = Total Qty x Unit Cost for all items, within +-$0.50.
    score_0_condition: Estimated Cost calculation is off by more than +-$0.50 for one or more items.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s39-unit-cost-stocking-plan
    implementation_type: llm
    score: 1
    criterion: Whether Unit Cost in the stocking plan matches the Ingredient Master exactly.
    score_1_condition: All Unit Cost values in the stocking plan match the Ingredient Master exactly.
    score_0_condition: One or more Unit Cost values do not match the Ingredient Master.
    related_facts: Unit costs defined in Ingredient Master.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s40-meat-category-accounts-least
    implementation_type: llm
    score: 1
    criterion: Whether Meat category accounts for at least 40% of total stocking cost.
    score_1_condition: Meat category is >= 40% of total stocking cost.
    score_0_condition: Meat category is less than 40% of total stocking cost.
    related_facts: Meat is the dominant ingredient category for hotpot, historically accounting for the largest share of consumption cost.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s41-quantities-rounded-05-multiples
    implementation_type: llm
    score: 1
    criterion: Whether quantities are rounded to 0.5 multiples, and non-zero values are at least 0.5.
    score_1_condition: All quantities are rounded to 0.5 multiples and non-zero values are >= 0.5.
    score_0_condition: Quantities are not rounded to 0.5 multiples or non-zero values are less than 0.5.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s42-stocking-plan-mentions-considers
    implementation_type: llm
    score: 1
    criterion: Whether the stocking plan mentions or considers Storage Type differentiation (Refrigerated, Frozen, Ambient).
    score_1_condition: Storage Type differentiation is mentioned or considered in the stocking plan.
    score_0_condition: Storage Type differentiation is not mentioned or considered.
    related_facts: 'Storage types: Refrigerated, Frozen, Ambient.'
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a store-level stocking plan for February 10–24, 2026'
  - rubric_item_id: gs22-s43-storelevel-budget-subtotal-rows
    implementation_type: llm
    score: 1
    criterion: Whether store-level budget subtotal rows are included.
    score_1_condition: Store-level budget subtotal rows are present.
    score_0_condition: Store-level budget subtotal rows are missing.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s44-categorylevel-budget-subtotal-rows
    implementation_type: llm
    score: 1
    criterion: Whether category-level budget subtotal rows are included.
    score_1_condition: Category-level budget subtotal rows are present.
    score_0_condition: Category-level budget subtotal rows are missing.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s45-grand-total-equals-sum
    implementation_type: llm
    score: 1
    criterion: Whether Grand Total equals the sum of store subtotals.
    score_1_condition: Grand Total equals the sum of all store subtotals.
    score_0_condition: Grand Total does not equal the sum of store subtotals.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s46-grand-total-equals-sum
    implementation_type: llm
    score: 1
    criterion: Whether Grand Total equals the sum of category subtotals.
    score_1_condition: Grand Total equals the sum of all category subtotals.
    score_0_condition: Grand Total does not equal the sum of category subtotals.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s47-grand-total-equals-sum
    implementation_type: llm
    score: 1
    criterion: Whether Grand Total equals the sum of all line-item costs within ±$1.00.
    score_1_condition: Grand Total equals the sum of all line-item costs within ±$1.00.
    score_0_condition: Grand Total differs from the sum of all line-item costs by more than $1.00.
    related_facts: ''
    facts_reference: ''
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s48-sf003-sf001-rank-top
    implementation_type: llm
    score: 1
    criterion: Whether SF-003 and SF-001 rank in the top two positions by budget.
    score_1_condition: SF-003 and SF-001 are the top two stores by budget (in either order).
    score_0_condition: SF-003 and SF-001 are not both in the top two positions by budget.
    related_facts: SF-003 (Milpitas, Flagship, 180 seats) and SF-001 (Chinatown, Flagship, 160 seats) are the two largest stores.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'a budget summary'
  - rubric_item_id: gs22-s49-executive-summary-mentions-total
    implementation_type: llm
    score: 1
    criterion: Whether the executive summary mentions total budget consistent with the Grand Total.
    score_1_condition: The executive summary mentions a total budget figure consistent with the Grand Total in the budget section.
    score_0_condition: The executive summary does not mention a total budget or the figure is inconsistent with the Grand Total.
    related_facts: ''
    facts_reference: ''
    query_reference: 'an executive summary'
  - rubric_item_id: gs22-s50-executive-summary-references-comparison
    implementation_type: llm
    score: 1
    criterion: Whether the executive summary references a comparison with 2025 Spring Festival consumption.
    score_1_condition: The executive summary references or compares with 2025 Spring Festival consumption data.
    score_0_condition: The executive summary does not reference 2025 Spring Festival consumption data.
    related_facts: ''
    facts_reference: ''
    query_reference: 'an executive summary'
  - rubric_item_id: gs22-s51-executive-summary-identifies-meat
    implementation_type: llm
    score: 1
    criterion: Whether the executive summary identifies Meat as the highest cost-share category.
    score_1_condition: The executive summary identifies Meat as the highest cost-share ingredient category.
    score_0_condition: The executive summary does not identify Meat as the highest cost-share category.
    related_facts: Meat is the highest cost-share ingredient category in hotpot operations.
    facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
    query_reference: 'an executive summary'
  - rubric_item_id: gs22-s52-data-references-executive-summary
    implementation_type: llm
    score: 1
    criterion: Whether all data references in the executive summary are consistent with the workbook calculations.
    score_1_condition: All data references in the executive summary are consistent with the workbook calculations.
    score_0_condition: One or more data references in the executive summary are inconsistent with the workbook calculations.
    related_facts: ''
    facts_reference: ''
    query_reference: 'an executive summary'
  Optional Constraint:
  - rubric_item_id: gs22-o01-deliverable-clear-tabular-structure
    implementation_type: llm
    score: 4
    criterion: Whether the deliverable has clear tabular structure, consistent formatting, and professional presentation.
    score_4_condition: Polished tabular structure, perfectly aligned, publication-quality professionalism.
    score_3_condition: Clear tabular structure, consistent formatting, professional appearance.
    score_2_condition: Adequate tabular structure, generally consistent formatting.
    score_1_condition: Basic structure but obvious formatting issues.
    score_0_condition: Poor structure, inconsistent formatting, unprofessional appearance.
    related_facts: ''
    facts_reference: ''
    query_reference: ''