GDPVal Extended: AI Productivity Benchmark
Multi-domain tasks labeled by human experts, with solid rubrics and golden answers.
Tasks Distribution Overview
GDPVal Extended covers 3,879 completed tasks in Chinese and English across industries including manufacturing, retail trade, and science & technology, with deliverables in formats such as xlsx, docx, pptx, pdf, and md.
Language Distribution
- English: 68.4% (2653 tasks)
- Chinese: 31.6% (1226 tasks)
Deliverable Type Distribution
- xlsx (2490): 64.2%
- docx (795): 20.5%
- pptx (411): 10.6%
- pdf (105): 2.7%
- Multi-file* (54): 1.4%
- markdown (16): 0.4%
*Multi-file combinations include Excel+Doc, Excel+PPT, and other combinations
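The shares listed above can be recomputed from the task counts. The sketch below is only an illustration: it assumes each share is the count divided by the overall total of 3,879 tasks, which reproduces the listed percentages. The six listed counts themselves sum to 3,871.

```python
# Deliverable counts as listed in the distribution above.
counts = {
    "xlsx": 2490,
    "docx": 795,
    "pptx": 411,
    "pdf": 105,
    "multi-file": 54,
    "markdown": 16,
}

# Overall task count stated in the overview (assumed to be the denominator).
TOTAL_TASKS = 3879


def share(count: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage share of `count` in `total`, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```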
Industry Distribution
- Manufacturing: 22.1%
- Trade and Retail: 21.7%
- Professional, Scientific, and Technical Services: 17.5%
- Real Estate, Rental, and Leasing: 10.6%
- Government: 8.3%
- Information Industry: 6%
- Healthcare and Social Assistance: 5.5%
- Finance and Insurance: 5.3%
- Wholesale Trade: 3%
Demo cases
Sample Tasks, Real Challenges
Explore three end-to-end tasks that span the most common office deliverables — a marketing blog article, a promotional plan presentation, and a data-driven stocking spreadsheet. Each sample includes a full context prompt, reference files, constraint scoring criteria, and an ideal deliverable, giving you a complete picture of how GDPVal evaluates an AI Agent's ability to handle real workplace complexity.
Outdoor Equipment - Marketing
Prompt
You are the content editor for AlpineVista Gear, an outdoor equipment brand, responsible for managing the brand's blog. The brand's flagship product is the Summit Pro hardshell jacket, which uses Gore-Tex Pro ePE (expanded polyethylene) membrane technology, priced at $599, targeting professional mountaineers, backcountry skiers, and alpine climbers. A key recent marketing focus for the brand is communicating the breakthrough advantages of ePE technology over traditional ePTFE technology to consumers, particularly the core selling point of being "PFAS-free."
Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer." The article's goal is to explain ePE membrane technology to non-technical outdoor enthusiasts, help them make better purchasing decisions, and establish brand authority in the industry. The article should use a friendly, accessible tone aimed at beginners, fully showcasing the practical benefits of the new technology for athletes, consumers, and the environment.
Rubric
constraint-catalog.yaml
Reference Files
gore-tex-epe-hardshell-context.md
Golden Deliverables
gore-tex-epe-vs-eptfe-blog.docx
task_name: 0020-gm-seo-blog-outdoor
deliverable_type: docx
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gm20-h02-docx-file-opened-normally
      implementation_type: code
      score: 1
      criterion: Whether the .docx file can be opened normally without errors.
      score_1_condition: The .docx file can be opened normally without any errors.
      score_0_condition: The .docx file cannot be opened normally or has format errors.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-h03-text-contains-title-goretex
      implementation_type: code
      score: 1
      criterion: 'Whether the text contains the title "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer" (ignoring leading/trailing whitespace; case-insensitive).'
      score_1_condition: The document contains the title (ignoring leading/trailing whitespace, case-insensitive).
      score_0_condition: The document does not contain the title, or the title does not match the requirement.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Write an SEO-optimized blog article titled "Gore-Tex ePE vs ePTFE: Why the New Hardshell Technology Is a Game-Changer."'
    - rubric_item_id: gm20-h04-article-compares-breakthrough-advancements
      implementation_type: llm
      score: 1
      criterion: Whether the article compares the breakthrough advancements of ePE technology over traditional ePTFE technology.
      score_1_condition: The article compares ePE technology with traditional ePTFE technology and conveys the breakthrough advancements of ePE.
      score_0_condition: The article does not compare ePE technology with traditional ePTFE technology, or does not convey the breakthrough advancements of ePE.
      related_facts: ePE (expanded polyethylene) achieves the same microporous structure as ePTFE but is made from polyethylene containing zero fluorine atoms, delivering equivalent waterproofing and breathability with no PFAS in the membrane itself, and approximately 35% lower carbon footprint.
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'communicating the breakthrough advantages of ePE technology over traditional ePTFE technology to consumers'
    - rubric_item_id: gm20-h05-article-emphasizes-core-selling
      implementation_type: llm
      score: 1
      criterion: Whether the article emphasizes the core selling point of "PFAS-free" when comparing ePE and ePTFE technologies.
      score_1_condition: The article clearly emphasizes "PFAS-free" as a core selling point when comparing ePE and ePTFE technologies.
      score_0_condition: The article does not mention "PFAS-free" as a key advantage of ePE over ePTFE, or the PFAS-free selling point is not clearly conveyed.
      related_facts: Traditional ePTFE membranes and most legacy DWR treatments contain PFAS. ePE membranes contain zero fluorine atoms and are PFAS-free. PFAS detected in 89% of legacy DWR-treated jackets (Glüge et al., Environmental Science & Technology, 2024).
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'particularly the core selling point of being "PFAS-free."'
    - rubric_item_id: gm20-h01-deliverable-docx-word-document
      implementation_type: code
      score: 1
      criterion: Whether the deliverable is a .docx Word document (file extension is .docx).
      score_1_condition: The deliverable is a .docx format Word document.
      score_0_condition: The deliverable is not in .docx format, or the file extension is not .docx.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
  Soft Constraint:
    - rubric_item_id: gm20-s01-external-links-document-avoid
      implementation_type: code
      score: 1
      criterion: 'Whether external links in the document avoid using prohibited generic anchor text: {"click here", "here", "this article", "read more", "learn more"}.'
      score_1_condition: All external links in the document avoid using prohibited generic anchor text.
      score_0_condition: The document contains external links using the above prohibited anchor text.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s02-document-contains-least-one
      implementation_type: code
      score: 1
      criterion: Whether the document contains at least one Heading 2 (H2) subheading.
      score_1_condition: The document contains at least one H2 subheading.
      score_0_condition: The document does not contain any H2 subheadings.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s03-document-contains-least-one
      implementation_type: code
      score: 1
      criterion: Whether the document contains at least one Heading 3 (H3) subheading.
      score_1_condition: The document contains at least one H3 subheading.
      score_0_condition: The document does not contain any H3 subheadings.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s04-bold-formatting-used-highlight
      implementation_type: code
      score: 1
      criterion: Whether bold formatting is used to highlight and convey important content.
      score_1_condition: The document uses bold formatting to highlight important content.
      score_0_condition: The document does not use bold formatting.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s05-italic-formatting-used-highlight
      implementation_type: code
      score: 1
      criterion: Whether italic formatting is used to highlight and convey important content.
      score_1_condition: The document uses italic formatting to highlight important content.
      score_0_condition: The document does not use italic formatting.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s06-four-related-secondary-keywords
      implementation_type: llm
      score: 1
      criterion: Whether four related secondary keywords are listed after the article body.
      score_1_condition: Four related secondary keywords are listed after the body.
      score_0_condition: Secondary keywords are not listed after the body, or fewer than four are listed.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s07-final-section-document-mentions
      implementation_type: code
      score: 1
      criterion: Whether the final section of the document mentions a "Pull quote" and the selected quote text.
      score_1_condition: The final section mentions a Pull quote and the selected quote text.
      score_0_condition: The final section does not mention a Pull quote or does not include quote text.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s08-four-listed-secondary-keywords
      implementation_type: code
      score: 1
      criterion: Whether each of the four listed secondary keywords appears at least once in the article body.
      score_1_condition: Each of the four secondary keywords appears at least once in the body.
      score_0_condition: Some secondary keywords do not appear in the body.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s09-four-listed-secondary-keywords
      implementation_type: code
      score: 1
      criterion: Whether the four listed secondary keywords are all distinct from each other and none is the primary keyword "Gore-Tex ePE".
      score_1_condition: The four secondary keywords are all distinct from each other and none is the primary keyword "Gore-Tex ePE".
      score_0_condition: Secondary keywords contain duplicates, or include the primary keyword "Gore-Tex ePE".
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s10-body-includes-interspersed-links
      implementation_type: llm
      score: 1
      criterion: Whether the body includes interspersed links to news or external resources related to "PFAS-free outdoor gear" or "Gore-Tex ePE" (using SEO-friendly anchor text).
      score_1_condition: The body includes interspersed links to relevant news or external resources, using SEO-friendly anchor text.
      score_0_condition: The body does not provide relevant external links, or link anchor text does not meet SEO requirements.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s11-body-contains-laypersonfriendly-definition
      implementation_type: llm
      score: 1
      criterion: Whether the body contains a layperson-friendly definition of PFAS (per- and polyfluoroalkyl substances), mentioning their persistence ("forever chemicals" or equivalent phrasing) and environmental impact.
      score_1_condition: Contains a layperson-friendly definition of PFAS, mentioning their persistence ("forever chemicals" or equivalent phrasing) and environmental impact.
      score_0_condition: Does not contain a layperson-friendly definition of PFAS, or does not mention persistence and environmental impact.
      related_facts: PFAS (Per- and Polyfluoroalkyl Substances) are known as "forever chemicals" because they are extremely persistent in the environment, causing long-term contamination to water sources and ecosystems.
      facts_reference: https://en.wikipedia.org/wiki/Per-_and_polyfluoroalkyl_substances
      query_reference: ''
    - rubric_item_id: gm20-s12-body-contains-beginnerfriendly-definition
      implementation_type: llm
      score: 1
      criterion: Whether the body contains a beginner-friendly definition of the "Gore-Tex ePE membrane", mentioning microporous structure and waterproof-breathable performance.
      score_1_condition: Contains a beginner-friendly definition of the ePE membrane, mentioning microporous structure and waterproof-breathable performance.
      score_0_condition: Does not contain a definition of the ePE membrane, or the definition is not accessible enough, or does not mention microporous structure and waterproof-breathable performance.
      related_facts: 'ePE (expanded polyethylene) achieves the same microporous structure as ePTFE but is made from polyethylene containing zero fluorine atoms. The result: equivalent waterproofing and breathability, with no PFAS in the membrane itself.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s13-technical-terms-appear-eptfe
      implementation_type: llm
      score: 1
      criterion: If technical terms appear ("ePTFE", "ePE", "MVTR", "DWR", "PFAS", "laminate", "face fabric"), whether a layperson-friendly definition is provided within two sentences.
      score_1_condition: All technical terms that appear are given a layperson-friendly definition within two sentences.
      score_0_condition: Technical terms appear without a layperson-friendly definition provided within two sentences.
      related_facts: ''
      facts_reference: ''
      query_reference: 'explain ePE membrane technology to non-technical outdoor enthusiasts'
    - rubric_item_id: gm20-s14-article-compares-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article compares at least two specific performance differences or advantages/disadvantages between ePE and ePTFE membrane technologies (e.g., carbon footprint, PFAS content, waterproof rating, breathability).
      score_1_condition: At least two specific performance differences or advantages/disadvantages are compared.
      score_0_condition: No ePE vs ePTFE performance comparison is made, or only one aspect is compared.
      related_facts: 'Key advantages of ePE membrane over ePTFE include: PFAS-free, approximately 35% lower carbon footprint (Higg MSI), and equal or better waterproof-breathable performance.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'communicating the breakthrough advantages of ePE technology over traditional ePTFE technology'
    - rubric_item_id: gm20-s15-article-mentions-fc0-dwr
      implementation_type: llm
      score: 1
      criterion: Whether the article mentions FC0 DWR (durable water repellent) treatment and explains its "PFAS-free" characteristic.
      score_1_condition: Mentions FC0 DWR treatment and explains its PFAS-free characteristic.
      score_0_condition: Does not mention FC0 DWR, or does not explain its PFAS-free characteristic.
      related_facts: FC0 DWR refers to a fluorocarbon-free formulation that achieves water repellency without using any PFAS. The Summit Pro uses FC0 (PFAS-free fluorocarbon-free) DWR.
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s16-article-explains-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article explains at least one advantage of C-KNIT™ backer technology compared to traditional linings (e.g., lighter, more breathable, quieter).
      score_1_condition: Explains at least one advantage of C-KNIT™ backer technology compared to traditional linings.
      score_0_condition: Does not mention the advantages of C-KNIT™ backer technology.
      related_facts: 'C-KNIT™ is Gore''s proprietary circular-knit backer technology. Instead of a stiff tricot liner, C-KNIT™ uses a softer, more open-weave structure. Benefits: up to 10% lighter, noticeably quieter during movement, improved moisture wicking.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s17-article-explicitly-states-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the article explicitly states that the AlpineVista Summit Pro uses Gore-Tex Pro ePE technology, and cites at least two specific parameters from the "Product Specifications" section of the reference file.
      score_1_condition: Explicitly states the product technology and cites at least two specific parameters (e.g., waterproof rating, weight, breathability MVTR, etc.).
      score_0_condition: Does not state the product uses Gore-Tex Pro ePE technology, or cites fewer than two specific parameters.
      related_facts: 'Summit Pro specs: Gore-Tex Pro ePE membrane, 3-layer laminate, 80D recycled nylon ripstop face fabric, C-KNIT™ backer, 28,000 mm waterproof rating, 25,000 g/m²/24h MVTR, FC0 DWR, 425 g weight, $599 MSRP.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: 'The brand''s flagship product is the Summit Pro hardshell jacket, which uses Gore-Tex Pro ePE (expanded polyethylene) membrane technology, priced at $599'
    - rubric_item_id: gm20-s18-article-cites-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article cites at least two specific data points or statistics from the "Market Statistics & Data Points" section of the reference file, with sources noted.
      score_1_condition: Cites at least two specific market data points or statistics, with sources noted.
      score_0_condition: Does not cite market data, or cites fewer than two data points, or sources are not noted.
      related_facts: 'Market data includes: $2.8B hardshell market (Allied Market Research, 2025), 73% consumers influenced by sustainability (OIA, Q3 2025), 89% legacy jackets contain PFAS (Glüge et al., 2024), ePE ~35% lower carbon footprint (Gore 2025 Sustainability Report), EU PFAS restriction expected by 2027, 22% YoY growth in PFAS-free searches.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s19-article-cites-least-two
      implementation_type: llm
      score: 1
      criterion: Whether the article cites at least two experts or institutions from the "Key Experts & Voices to Reference" section of the reference file, including their names/institution names and quotable viewpoints or key findings.
      score_1_condition: Cites at least two experts or institutions, including names/institution names and quotable viewpoints or key findings.
      score_0_condition: Does not cite experts, or cites fewer than two, or does not include viewpoints/findings.
      related_facts: 'Key experts: Dr. Amara Singh (Empa) — ePE matches ePTFE durability; Dr. Kai Lüdemann (UFZ Leipzig) — ePE is ''the most commercially viable pathway to a fluorine-free hardshell''; Maria Torres-Vega (EOG) — brands adopting ePE gain 2-3 year compliance head start; Jake Orton (IFMGA Guide) — ePE is ''noticeably more supple and quieter''.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s20-cited-expert-their-affiliated
      implementation_type: llm
      score: 1
      criterion: For each cited expert, whether their affiliated institution or professional title/background is provided.
      score_1_condition: Each cited expert is provided with their affiliated institution or professional title/background.
      score_0_condition: Cited experts are missing institutional or professional background information.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s21-article-states-epe-membrane
      implementation_type: llm
      score: 1
      criterion: Whether the article states that the ePE membrane has a lower carbon footprint than ePTFE (citing Higg MSI or Gore sustainability report, with a specific reduction of approximately 35%).
      score_1_condition: States that ePE has a lower carbon footprint than ePTFE, citing Higg MSI or Gore sustainability report.
      score_0_condition: Does not state the carbon footprint comparison, or does not cite an authoritative source.
      related_facts: According to Higg MSI data, the Gore-Tex ePE membrane has approximately 35% lower carbon footprint than ePTFE.
      facts_reference: Gore-Tex Sustainability Report; Higg Materials Sustainability Index
      query_reference: ''
    - rubric_item_id: gm20-s22-cited-statistics-include-original
      implementation_type: llm
      score: 1
      criterion: Whether all cited statistics include the original source name and publication year.
      score_1_condition: All cited statistics include the original source name and publication year (e.g., "Allied Market Research, 2025").
      score_0_condition: Some statistics are missing source or year information.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s23-article-includes-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article includes at least one competitor comparison, referencing at least two competitor products from the "Competitive Landscape" table in the reference file, and comparing at least one parameter.
      score_1_condition: Includes competitor comparison, references at least two competitor products, and compares at least one parameter (price, membrane type, or key differentiator).
      score_0_condition: Does not include competitor comparison, or references fewer than two competitor products, or does not compare any parameter.
      related_facts: 'Competitors: Arc''teryx Alpha SV ($799, Gore-Tex Pro ePE), Patagonia Triolet ($449, Gore-Tex ePE non-Pro), Norrøna Trollveggen ($699, Gore-Tex Pro ePE), Mountain Hardwear Exposure/2 ($500, Gore-Tex Pro ePTFE), Mammut Nordwand Advanced ($750, Gore-Tex Pro ePE). AlpineVista Summit Pro at $599 is below the $637 competitive average.'
      facts_reference: gore-tex-epe-hardshell-context.md
      query_reference: ''
    - rubric_item_id: gm20-s24-competitor-comparison-presented-structured
      implementation_type: llm
      score: 1
      criterion: Whether the competitor comparison is presented in a structured table format, including at least three products.
      score_1_condition: Competitor comparison is presented in a structured table format, including at least three products.
      score_0_condition: Competitor comparison is described only in prose paragraphs, or the table includes fewer than three products.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s25-competitor-comparison-positions-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the competitor comparison positions the AlpineVista Summit Pro as "best value in Pro ePE" or equivalent phrasing.
      score_1_condition: Positions the Summit Pro as best value in Pro ePE or equivalent phrasing.
      score_0_condition: Does not position the Summit Pro for value.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s26-article-prominently-features-least
      implementation_type: llm
      score: 1
      criterion: Whether the article prominently features at least two different outdoor activity scenarios where the product is suitable (e.g., alpine climbing, backcountry skiing, heavy rain hiking), with explanations tied to specific ePE technology advantages.
      score_1_condition: Features at least two different outdoor activity scenarios, with explanations tied to ePE technology advantages.
      score_0_condition: Does not introduce use scenarios, or fewer than two scenarios, or does not tie to technology advantages.
      related_facts: ''
      facts_reference: ''
      query_reference: 'fully showcasing the practical benefits of the new technology for athletes, consumers, and the environment'
    - rubric_item_id: gm20-s27-article-addresses-least-one
      implementation_type: llm
      score: 1
      criterion: Whether the article addresses at least one common consumer concern (e.g., durability, DWR reapplication, price justification, etc.).
      score_1_condition: Addresses at least one common consumer concern.
      score_0_condition: Does not address any common consumer concerns.
      related_facts: ''
      facts_reference: ''
      query_reference: 'help them make better purchasing decisions'
    - rubric_item_id: gm20-s28-article-mentions-previews-alpinevista
      implementation_type: llm
      score: 1
      criterion: Whether the article mentions or previews the "AlpineVista Community Trail Reports" feature (a social platform launching in Q3 2026), linking it to real-world use scenarios for the ePE hardshell jacket.
      score_1_condition: Mentions or previews the Community Trail Reports feature, linking it to product use scenarios.
      score_0_condition: Does not mention the Community Trail Reports feature.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s29-conclusion-explicitly-states-upcoming
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion explicitly states that upcoming articles will include field-use reviews or interviews with professional athletes or professional guides.
      score_1_condition: The conclusion explicitly states that upcoming articles will include field-use reviews or interviews with professional athletes or professional guides.
      score_0_condition: The conclusion does not mention upcoming field reviews or interviews.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s30-epe-vs-eptfe-core
      implementation_type: llm
      score: 1
      criterion: Whether the ePE vs ePTFE core attribute comparison is presented in a structured table format (with at least 5 comparison dimensions).
      score_1_condition: The ePE vs ePTFE comparison is presented in a structured table format, with at least 5 comparison dimensions.
      score_0_condition: The ePE vs ePTFE comparison is described only in paragraphs, or the table has fewer than 5 comparison dimensions.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s31-article-includes-faq-section
      implementation_type: llm
      score: 1
      criterion: Whether the article includes a FAQ section with at least two questions in natural language question format, where the first sentence of each answer is a direct answer.
      score_1_condition: Includes a FAQ section with at least two natural language questions, where the first sentence of each answer is a direct answer.
      score_0_condition: Does not include a FAQ section, or fewer than two questions, or the first sentence of answers is not a direct answer.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s32-conclusion-paragraph-explicitly-states
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion paragraph explicitly states that upcoming articles will include a Gore-Tex ePE hardshell jacket buying guide or care and maintenance tutorial.
      score_1_condition: The conclusion explicitly states that upcoming articles will include a buying guide or care and maintenance tutorial.
      score_0_condition: The conclusion does not mention upcoming buying guides or care tutorials.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s33-conclusion-paragraphs-market-trend
      implementation_type: llm
      score: 1
      criterion: Whether the conclusion paragraph's market trend summary cites specific data (e.g., market size or growth rate) as support.
      score_1_condition: The conclusion's market trend summary cites specific data (e.g., market size or growth rate).
      score_0_condition: The conclusion's market trend summary does not cite any specific data.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gm20-s34-major-content-paragraph-first
      implementation_type: llm
      score: 4
      criterion: Whether each major content paragraph (first paragraph under H2) opens with a conclusive statement (directly answering the implicit question of the section heading), rather than beginning with background context or transitional phrasing.
      score_4_condition: All H2 opening paragraphs begin with conclusive statements, directly answering the implicit question of the heading, with no background preamble.
      score_3_condition: Most H2 opening paragraphs begin with conclusive statements, directly answering the implicit question of the heading.
      score_2_condition: Approximately half of H2 opening paragraphs begin with conclusive statements.
      score_1_condition: A few H2 opening paragraphs use conclusion-first writing, but not consistently.
      score_0_condition: Most H2 opening paragraphs begin with background context or transitional phrasing, not using conclusion-first writing.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
  Optional Constraint:
    - rubric_item_id: gm20-o01-overall-formatting-style-readability
      implementation_type: llm
      score: 4
      criterion: Overall formatting, style, readability, and professionalism of the deliverable.
      score_4_condition: Formatting is exquisite, style is professionally unified, readability is excellent, reaching publication-quality standards.
      score_3_condition: Formatting is polished, style is unified, readability is good, demonstrates professionalism.
      score_2_condition: Formatting is standardized, style is generally consistent, readability is acceptable.
      score_1_condition: Basic formatting is complete, but there are obvious style inconsistencies or layout issues.
      score_0_condition: Formatting is chaotic, style is inconsistent, readability is poor, lacks professionalism.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
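Several rubric items above are marked implementation_type: code, meaning they can be decided by a deterministic check rather than an LLM judge. The benchmark's actual validators are not shown here, so the following is only a minimal sketch of how a few such checks (in the spirit of gm20-h03, gm20-s01, gm20-s08, and gm20-s09) might look. The function names are hypothetical, and the sketch assumes the document text, link anchor texts, and secondary keywords have already been extracted from the .docx file.

```python
# Prohibited generic anchor texts, taken from the gm20-s01 criterion.
PROHIBITED_ANCHORS = {"click here", "here", "this article", "read more", "learn more"}

# Primary keyword named in the gm20-s09 criterion.
PRIMARY_KEYWORD = "Gore-Tex ePE"


def contains_title(text: str, title: str) -> int:
    """Score 1 if `title` appears in `text`, ignoring case and surrounding whitespace."""
    return int(title.strip().lower() in text.lower())


def anchors_ok(anchor_texts: list) -> int:
    """Score 1 if no link uses a prohibited generic anchor text."""
    return int(all(a.strip().lower() not in PROHIBITED_ANCHORS for a in anchor_texts))


def keywords_distinct(keywords: list) -> int:
    """Score 1 if there are four distinct keywords and none equals the primary keyword."""
    normalized = [k.strip().lower() for k in keywords]
    all_distinct = len(normalized) == 4 and len(set(normalized)) == 4
    return int(all_distinct and PRIMARY_KEYWORD.lower() not in normalized)


def keywords_in_body(keywords: list, body: str) -> int:
    """Score 1 if every listed keyword appears at least once in the body text."""
    body_lower = body.lower()
    return int(all(k.strip().lower() in body_lower for k in keywords))
```

Each function returns 0 or 1 to mirror the score_0_condition and score_1_condition fields. Whether the real checks normalize case for anchors and keywords in this way is an assumption.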
E-Commerce - Sales Analysis
Prompt
You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website. The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry, shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel.
Valentine's Day is approaching, and you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides) for submission to the Operations Director for approval. The proposal must include the following four modules:
1. Theme & Copy: Craft an attractive campaign theme and core copy for this Valentine's Day promotion.
2. Poster Design Concept: Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.
3. Promotion Timeline: Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.
4. Expected Results: Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
Attachment descriptions:
- order-list.pdf: Contains 36 curated luxury accessories for this Valentine's Day selection (18 each from Louis Vuitton and Chanel), spanning five categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods. Each product includes an illustration, SKU, color, material, dimensions, special features, and notes.
- purchase-order.xlsx: Contains each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price), proposed purchase quantity, as well as formula-calculated gross profit and gross margin.
Please output the final presentation in PDF format.
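The Expected Results module asks for a pricing strategy that respects the brand floor price and for revenue and gross profit forecasts. As a rough illustration of that arithmetic, the sketch below clamps a discounted price at the floor and aggregates a forecast. It is not the golden deliverable's method: the `Sku` fields, the flat discount, the sell-through rate, and the sample figures are all hypothetical, and gross margin is assumed to mean gross profit divided by revenue.

```python
from dataclasses import dataclass


@dataclass
class Sku:
    sku: str         # hypothetical identifier, not taken from purchase-order.xlsx
    cost: float      # cost price
    retail: float    # retail price
    floor: float     # minimum acceptable selling price (brand floor price)
    quantity: int    # proposed purchase quantity


def promo_price(item: Sku, discount: float) -> float:
    """Apply a fractional discount to retail, but never go below the floor price."""
    return max(item.retail * (1 - discount), item.floor)


def forecast(items: list, discount: float, sell_through: float) -> dict:
    """Estimate revenue, gross profit, and gross margin for a flat discount.

    `sell_through` is the assumed fraction of purchased units that sell.
    """
    revenue = 0.0
    gross_profit = 0.0
    for item in items:
        price = promo_price(item, discount)
        units = item.quantity * sell_through
        revenue += price * units
        gross_profit += (price - item.cost) * units
    margin = gross_profit / revenue if revenue else 0.0
    return {"revenue": revenue, "gross_profit": gross_profit, "gross_margin": margin}


# Made-up sample row: a 15% discount would give 1700, so the floor of 1800 applies.
demo = [Sku("DEMO-1", cost=1000.0, retail=2000.0, floor=1800.0, quantity=10)]
result = forecast(demo, discount=0.15, sell_through=0.5)
print(result)
```

Clamping at the floor keeps every promotional price at or above the minimum acceptable selling price described for purchase-order.xlsx, whatever discount is explored.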
Rubric
constraint-catalog.yaml
Reference Files
order-list.pdf
purchase-order.xlsx
Golden Deliverables
valentines-promo-plan.pdf
task_name: 0021-gs-valentines-promo-plan
deliverable_type: pdf
result_filename: validation-result.json
rubric:
  Hard Constraint:
    - rubric_item_id: gs21-h01-deliverable-pdf-format
      implementation_type: code
      score: 1
      criterion: Whether the deliverable is in PDF format.
      score_1_condition: The deliverable file has a .pdf extension.
      score_0_condition: The deliverable is not a PDF file.
      related_facts: ''
      facts_reference: ''
      query_reference: Please output the final presentation in PDF format.
    - rubric_item_id: gs21-h02-output-contains-more-10
      implementation_type: code
      score: 1
      criterion: Whether the output contains no more than 10 slides (pages).
      score_1_condition: The PDF contains no more than 10 pages.
      score_0_condition: The PDF contains more than 10 pages.
      related_facts: ''
      facts_reference: ''
      query_reference: 'you need to create a promotion plan proposal (PowerPoint presentation, no more than 10 slides)'
    - rubric_item_id: gs21-h03-title-slide-contains-luxeselect
      implementation_type: llm
      score: 1
      criterion: |-
        Whether the title on Slide 1 contains "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
      score_1_condition: |-
        The first page text contains both "LuxeSelect Boutique" and "Valentine's Day" (case-insensitive).
      score_0_condition: The first page is missing one or both of the required phrases.
      related_facts: ''
      facts_reference: ''
      query_reference: >-
        You are an operations specialist at LuxeSelect Boutique, a luxury buyer e-commerce website. The site primarily deals in personal shopping and sales of handbags, scarves/shawls, jewelry, shoes, and small leather goods from luxury brands such as Louis Vuitton and Chanel. Valentine's Day is approaching, and you need to create a promotion plan proposal
    - rubric_item_id: gs21-h04-least-one-slide-dedicated
      implementation_type: llm
      score: 1
      criterion: "Whether at least one slide is dedicated to presenting the Valentine's Day promotion theme."
      score_1_condition: "At least one slide focuses on the Valentine's Day promotion theme."
      score_0_condition: No slide is dedicated to the promotion theme.
      related_facts: ''
      facts_reference: ''
      query_reference: "1. **Theme & Copy**: Craft an attractive campaign theme and core copy for this Valentine's Day promotion."
    - rubric_item_id: gs21-h05-least-one-slide-presents
      implementation_type: llm
      score: 1
      criterion: Whether at least one slide presents a poster design concept or visual design brief.
      score_1_condition: At least one slide presents a poster design concept or visual design brief.
      score_0_condition: No slide presents a poster design concept.
      related_facts: ''
      facts_reference: ''
      query_reference: '2. **Poster Design Concept**: Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.'
    - rubric_item_id: gs21-h06-least-one-slide-presents
      implementation_type: llm
      score: 1
      criterion: Whether at least one slide presents a promotion timeline or rollout schedule.
      score_1_condition: At least one slide presents a promotion timeline or rollout schedule.
      score_0_condition: No slide presents a promotion timeline.
      related_facts: ''
      facts_reference: ''
      query_reference: '3. **Promotion Timeline**: Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.'
    - rubric_item_id: gs21-h07-least-one-slide-presents
      implementation_type: llm
      score: 1
      criterion: Whether at least one slide presents expected results or financial forecasts.
      score_1_condition: At least one slide presents expected results or financial forecasts.
      score_0_condition: No slide presents expected results or financial forecasts.
      related_facts: ''
      facts_reference: ''
      query_reference: '4. **Expected Results**: Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
    - rubric_item_id: gs21-h08-proposal-includes-summary-table
      implementation_type: llm
      score: 1
      criterion: Whether the proposal includes a summary table listing basic information for all 36 SKUs (may be grouped by category).
      score_1_condition: A summary table listing basic information for all 36 SKUs is present.
      score_0_condition: 'No summary table for all SKUs, or SKUs are significantly incomplete.'
      related_facts: 'The purchase-order.xlsx contains 36 SKUs total (18 Louis Vuitton + 18 Chanel), across 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
      facts_reference: purchase-order.xlsx
      query_reference: >-
        Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy and provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
  Soft Constraint:
    - rubric_item_id: gs21-s01-campaign-theme-copy-contains
      implementation_type: llm
      score: 1
      criterion: 'Whether the campaign theme copy contains keywords related to "love," "gift," or "romance" (accepts Love / Gift / Romance or equivalent expressions).'
      score_1_condition: 'The theme copy contains keywords related to love, gift, or romance.'
      score_0_condition: The theme copy does not contain any love/gift/romance keywords.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Craft an attractive campaign theme and core copy for this Valentine''s Day promotion.'
    - rubric_item_id: gs21-s02-theme-copy-mentions-least
      implementation_type: llm
      score: 1
      criterion: Whether the theme copy mentions at least one brand name (Louis Vuitton or Chanel).
      score_1_condition: The theme copy mentions at least one brand name (Louis Vuitton or Chanel).
      score_0_condition: No brand name is mentioned in the theme copy.
      related_facts: The curated selection consists of 18 Louis Vuitton and 18 Chanel products.
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s03-proposal-includes-consumerfacing-promotional
      implementation_type: llm
      score: 1
      criterion: 'Whether the proposal includes a consumer-facing promotional tagline (a short, compelling one-liner).'
      score_1_condition: 'A short, compelling consumer-facing promotional tagline is present.'
      score_0_condition: No promotional tagline is present.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s04-copy-employs-scarcity-urgency
      implementation_type: llm
      score: 1
      criterion: 'Whether the copy employs scarcity or urgency tactics (e.g., "limited edition," "limited time," "exclusive," "only X left," or references limited/seasonal exclusive information from the ORDER LIST PDF).'
      score_1_condition: The copy employs scarcity or urgency tactics.
      score_0_condition: No scarcity or urgency tactics are used.
      related_facts: "The ORDER LIST PDF contains products with Notes referencing Valentine's Edition and seasonal exclusives."
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s05-copy-incorporates-giftgiving-scenario
      implementation_type: llm
      score: 1
      criterion: |-
        Whether the copy incorporates gift-giving scenario guidance (e.g., "pick for her/him," "Valentine's gift," "top gift choice," positioning products as gifts).
      score_1_condition: The copy incorporates gift-giving scenario guidance.
      score_0_condition: No gift-giving scenario guidance is present.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s06-poster-design-concept-clearly
      implementation_type: llm
      score: 1
      criterion: 'Whether the poster design concept clearly describes the visual style (e.g., color palette, typography style, festive elements, etc.).'
      score_1_condition: The poster design concept clearly describes the visual style.
      score_0_condition: No clear description of visual style in the poster design concept.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Describe the design concept for at least one promotional poster, including visual style, core elements, and copy layout.'
    - rubric_item_id: gs21-s07-poster-design-concept-includes
      implementation_type: llm
      score: 1
      criterion: Whether the poster design concept includes at least one specific product name or SKU from the ORDER LIST PDF.
      score_1_condition: The poster design concept references at least one specific product name or SKU from the ORDER LIST PDF.
      score_0_condition: No specific product name or SKU from the ORDER LIST PDF is referenced in the poster design concept.
      related_facts: 'The ORDER LIST PDF contains 36 products with product names, SKUs, and detailed descriptions.'
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s08-promotion-plan-covers-least
      implementation_type: llm
      score: 1
      criterion: 'Whether the promotion plan covers at least 3 different marketing channels (e.g., social media, email marketing, on-site banners, KOL partnerships, SMS push, etc.).'
      score_1_condition: The promotion plan covers at least 3 different marketing channels.
      score_0_condition: Fewer than 3 marketing channels are covered.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Develop a complete promotion timeline from warm-up to campaign end, covering channels such as social media, email marketing, and on-site banners.'
    - rubric_item_id: gs21-s09-promotion-timeline-includes-explicit
      implementation_type: llm
      score: 1
      criterion: 'Whether the promotion timeline includes explicit dates or phase divisions (e.g., warm-up phase, burst phase, encore phase).'
      score_1_condition: The promotion timeline includes explicit dates or phase divisions.
      score_0_condition: No explicit dates or phase divisions are present.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Develop a complete promotion timeline from warm-up to campaign end'
    - rubric_item_id: gs21-s10-promotion-timeline-includes-least
      implementation_type: llm
      score: 1
      criterion: Whether the promotion timeline includes at least 3 time nodes or phases.
      score_1_condition: At least 3 time nodes or phases are identified.
      score_0_condition: Fewer than 3 time nodes or phases.
      related_facts: ''
      facts_reference: ''
      query_reference: 'Develop a complete promotion timeline from warm-up to campaign end'
    - rubric_item_id: gs21-s11-promotion-plan-mentions-valentines
      implementation_type: llm
      score: 1
      criterion: "Whether the promotion plan mentions Valentine's Day itself (February 14) or key dates around it."
      score_1_condition: "The promotion plan mentions February 14 or key dates around Valentine's Day."
      score_0_condition: "No mention of February 14 or key dates around Valentine's Day."
      related_facts: ''
      facts_reference: ''
      query_reference: "Valentine's Day is approaching, and you need to create a promotion plan proposal"
    - rubric_item_id: gs21-s12-promotion-phase-clear-objectives
      implementation_type: llm
      score: 1
      criterion: 'Whether each promotion phase has clear objectives or core actions (e.g., warm-up phase goal is "build anticipation," burst phase goal is "drive conversions"), rather than just listing channel names.'
      score_1_condition: Each promotion phase has clear objectives or core actions.
      score_0_condition: Phases only list channel names without clear objectives.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s13-there-coordination-logic-between
      implementation_type: llm
      score: 1
      criterion: 'Whether there is coordination logic between different channels (e.g., social media drives traffic → on-site banners catch visitors → email marketing drives conversions), demonstrating inter-channel synergy.'
      score_1_condition: There is clear coordination logic between channels.
      score_0_condition: No inter-channel coordination logic is demonstrated.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s14-promotion-plan-includes-least
      implementation_type: llm
      score: 1
      criterion: 'Whether the promotion plan includes at least one post-engagement conversion path description (e.g., the user journey from seeing an ad to completing a purchase).'
      score_1_condition: At least one conversion path description is included.
      score_0_condition: No conversion path description is included.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s15-set-skus-summary-table
      implementation_type: llm
      score: 1
      criterion: Whether the set of SKUs in the summary table exactly matches those listed in the Purchase Order Excel (no omissions or extras).
      score_1_condition: The SKU set exactly matches the Purchase Order Excel.
      score_0_condition: SKUs are missing or extra compared to the Purchase Order Excel.
      related_facts: The purchase-order.xlsx lists exactly 36 SKUs across 5 categories.
      facts_reference: purchase-order.xlsx
      query_reference: >-
        Based on the pricing data in the attached Purchase Order Excel (cost, retail price, floor price, gross margin), develop a reasonable promotional pricing strategy
    - rubric_item_id: gs21-s16-summary-table-includes-retail
      implementation_type: llm
      score: 1
      criterion: 'Whether the summary table includes a Retail Price column for each SKU, with values exactly matching the Purchase Order Excel (to the cent).'
      score_1_condition: Retail Price column is present and values match.
      score_0_condition: Retail Price column is missing or values do not match.
      related_facts: Retail prices for all 36 SKUs are defined in the purchase-order.xlsx.
      facts_reference: purchase-order.xlsx
      query_reference: >-
        Each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price), proposed purchase quantity, as well as formula-calculated gross profit and gross margin.
    - rubric_item_id: gs21-s17-summary-table-includes-cost
      implementation_type: llm
      score: 1
      criterion: 'Whether the summary table includes a Cost column for each SKU, with values exactly matching the Purchase Order Excel (to the cent).'
      score_1_condition: Cost column is present and values match.
      score_0_condition: Cost column is missing or values do not match.
      related_facts: Cost prices for all 36 SKUs are defined in the purchase-order.xlsx.
      facts_reference: purchase-order.xlsx
      query_reference: >-
        Each product's SKU, category, cost price, retail price, minimum acceptable selling price (brand floor price), proposed purchase quantity
    - rubric_item_id: gs21-s18-summary-table-includes-minimum
      implementation_type: llm
      score: 1
      criterion: 'Whether the summary table includes a Minimum Acceptable Price column for each SKU, with values exactly matching the Purchase Order Excel.'
      score_1_condition: Minimum Acceptable Price column is present and values match.
      score_0_condition: Minimum Acceptable Price column is missing or values do not match.
      related_facts: Minimum acceptable prices (brand floor prices) for all 36 SKUs are defined in the purchase-order.xlsx.
      facts_reference: purchase-order.xlsx
      query_reference: 'minimum acceptable selling price (brand floor price)'
    - rubric_item_id: gs21-s19-summary-table-includes-proposed
      implementation_type: llm
      score: 1
      criterion: 'Whether the summary table includes a Proposed Qty column for each SKU, with values exactly matching the Purchase Order Excel.'
      score_1_condition: Proposed Qty column is present and values match.
      score_0_condition: Proposed Qty column is missing or values do not match.
      related_facts: Proposed purchase quantities for all 36 SKUs are defined in the purchase-order.xlsx.
      facts_reference: purchase-order.xlsx
      query_reference: 'proposed purchase quantity'
    - rubric_item_id: gs21-s20-sku-gross-profit-retail
      implementation_type: llm
      score: 1
      criterion: 'Whether for each SKU, Gross Profit = Retail Price − Cost is calculated correctly (to the cent).'
      score_1_condition: Gross Profit is calculated correctly for each SKU.
      score_0_condition: Gross Profit calculations are incorrect or missing.
      related_facts: 'Gross Profit = Retail Price − Cost. The purchase-order.xlsx includes formula-calculated gross profit.'
      facts_reference: purchase-order.xlsx
      query_reference: 'formula-calculated gross profit and gross margin'
    - rubric_item_id: gs21-s21-sku-gross-margin-gross
      implementation_type: llm
      score: 1
      criterion: 'Whether for each SKU, Gross Margin % = Gross Profit / Retail Price is calculated correctly (within ±0.5% tolerance).'
      score_1_condition: 'Gross Margin % is calculated correctly within tolerance.'
      score_0_condition: 'Gross Margin % calculations are incorrect or missing.'
      related_facts: 'Gross Margin % = Gross Profit / Retail Price. The purchase-order.xlsx includes formula-calculated gross margin.'
      facts_reference: purchase-order.xlsx
      query_reference: 'formula-calculated gross profit and gross margin'
    - rubric_item_id: gs21-s22-promotional-prices-shown-promotional
      implementation_type: llm
      score: 1
      criterion: "If promotional prices are shown, whether all promotional prices are no lower than the corresponding SKU's Minimum Acceptable Price."
      score_1_condition: All promotional prices meet or exceed the Minimum Acceptable Price.
      score_0_condition: One or more promotional prices fall below the Minimum Acceptable Price.
      related_facts: Each SKU has a Minimum Acceptable Price (brand floor price) defined in purchase-order.xlsx. Promotional prices must not go below this floor.
      facts_reference: purchase-order.xlsx
      query_reference: 'minimum acceptable selling price (brand floor price)'
    - rubric_item_id: gs21-s23-expected-total-sales-revenue
      implementation_type: llm
      score: 1
      criterion: 'If expected total sales revenue is shown, whether its value equals the sum of (promotional price or retail price × proposed quantity) for all SKUs, within ±$1.00 tolerance.'
      score_1_condition: Total sales revenue calculation is correct within tolerance.
      score_0_condition: Total sales revenue is incorrect or inconsistent.
      related_facts: 'Total Revenue = sum of (price × quantity) for all 36 SKUs.'
      facts_reference: purchase-order.xlsx
      query_reference: >-
        provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
    - rubric_item_id: gs21-s24-promotional-prices-shown-pricing
      implementation_type: llm
      score: 1
      criterion: 'If promotional prices are shown, whether the pricing reflects a differentiated strategy (different SKUs have different discount levels), rather than a uniform discount across all SKUs.'
      score_1_condition: Pricing reflects a differentiated strategy.
      score_0_condition: All SKUs have the same uniform discount.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s25-promotional-prices-shown-proposal
      implementation_type: llm
      score: 1
      criterion: 'If promotional prices are shown, whether the proposal simultaneously displays the original price (Retail Price) alongside the promotional price (e.g., strikethrough pricing), leveraging the price anchoring effect.'
      score_1_condition: Original price is displayed alongside promotional price.
      score_0_condition: Only promotional price is shown without the original price.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s26-promotional-prices-shown-luxury
      implementation_type: llm
      score: 1
      criterion: 'If promotional prices are shown, whether the luxury pricing style is appropriate (e.g., using round numbers rather than .99 discount pricing, consistent with premium brand positioning).'
      score_1_condition: Luxury pricing style is appropriate.
      score_0_condition: 'Pricing style is inappropriate for luxury positioning (e.g., .99 endings).'
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s27-expected-results-include-projected
      implementation_type: llm
      score: 1
      criterion: Whether expected results include a projected Total Revenue figure.
      score_1_condition: A projected Total Revenue figure is present.
      score_0_condition: No projected Total Revenue figure.
      related_facts: ''
      facts_reference: ''
      query_reference: 'provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
    - rubric_item_id: gs21-s28-expected-results-include-projected
      implementation_type: llm
      score: 1
      criterion: Whether expected results include a projected Total Gross Profit figure.
      score_1_condition: A projected Total Gross Profit figure is present.
      score_0_condition: No projected Total Gross Profit figure.
      related_facts: ''
      facts_reference: ''
      query_reference: 'provide forecasts for key metrics such as expected sales revenue, gross profit, etc.'
    - rubric_item_id: gs21-s29-both-total-sales-revenue
      implementation_type: llm
      score: 1
      criterion: "If both total sales revenue and total gross profit are shown, whether Total Gross Profit = Total Revenue − Total Cost (Total Cost = sum of each SKU's cost × proposed quantity), calculated correctly, within ±$1.00 tolerance."
      score_1_condition: Total Gross Profit calculation is correct within tolerance.
      score_0_condition: Total Gross Profit is incorrect or inconsistent.
      related_facts: 'Total Cost = sum of (cost × proposed quantity) for all 36 SKUs. Total Gross Profit = Total Revenue − Total Cost.'
      facts_reference: purchase-order.xlsx
      query_reference: >-
        provide forecasts for key metrics such as expected sales revenue, gross profit, etc.
    - rubric_item_id: gs21-s30-expected-results-differentiate-between
      implementation_type: llm
      score: 1
      criterion: 'Whether expected results differentiate between scenarios (e.g., optimistic/baseline/conservative, or full-price vs. promotional price sales), rather than providing only a single figure.'
      score_1_condition: Expected results differentiate between scenarios.
      score_0_condition: Only a single-scenario figure is provided.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s31-expected-results-include-least
      implementation_type: llm
      score: 1
      criterion: 'Whether expected results include at least one non-financial metric forecast (e.g., expected order volume, average order value, conversion rate, new customer ratio, etc.).'
      score_1_condition: At least one non-financial metric forecast is included.
      score_0_condition: No non-financial metric forecast.
      related_facts: ''
      facts_reference: ''
      query_reference: ''
    - rubric_item_id: gs21-s32-proposal-references-names-least
      implementation_type: llm
      score: 1
      criterion: Whether the proposal references the names of at least 8 products from the ORDER LIST PDF.
      score_1_condition: At least 8 product names from the ORDER LIST PDF are referenced.
      score_0_condition: Fewer than 8 product names are referenced.
      related_facts: 'The ORDER LIST PDF contains 36 products across 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s33-referenced-products-span-categories
      implementation_type: llm
      score: 1
      criterion: Whether the referenced products span at least 3 categories.
      score_1_condition: Referenced products span at least 3 categories.
      score_0_condition: Referenced products span fewer than 3 categories.
      related_facts: 'Products span 5 categories: handbags, scarves/shawls, jewelry, shoes, and small leather goods.'
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s34-referenced-product-names-consistent
      implementation_type: llm
      score: 1
      criterion: 'Whether the referenced product names are consistent with the naming in the ORDER LIST PDF (exact precision not required, but brand and model must match).'
      score_1_condition: Referenced product names are consistent with the ORDER LIST PDF.
      score_0_condition: Product names do not match the ORDER LIST PDF.
      related_facts: ''
      facts_reference: order-list.pdf
      query_reference: ''
    - rubric_item_id: gs21-s35-proposal-references-least-one
      implementation_type: llm
      score: 1
      criterion: "Whether the proposal references at least one product's special description from the ORDER LIST PDF (e.g., Valentine's Edition, seasonal exclusive, or other Notes information)."
      score_1_condition: "At least one product's special description from the Notes column is referenced."
      score_0_condition: No product special descriptions are referenced.
      related_facts: "Some products in the ORDER LIST PDF have special Notes such as Valentine's Edition or seasonal exclusive designations."
      facts_reference: order-list.pdf
      query_reference: ''
  Optional Constraint:
    - rubric_item_id: gs21-o01-slides-clearly-laid-out
      implementation_type: llm
      score: 4
      criterion: 'Whether slides are clearly laid out with logical flow, content is organized by module, and the proposal demonstrates understanding of luxury e-commerce promotion context and consumer purchasing psychology.'
      score_4_condition: 'Professional layout with polished flow, excellent modular organization; deep understanding of luxury e-commerce context, multiple psychological strategies employed, perfect balance between promotional intensity and brand positioning.'
      score_3_condition: 'Clear layout, logical flow, well-organized by module; good understanding of luxury context, consumer psychology, and balanced promotional tone.'
      score_2_condition: Slides are organized by module with adequate layout; some luxury context and psychology demonstrated.
      score_1_condition: Basic layout exists but disorganized; limited understanding of luxury context.
      score_0_condition: 'Layout is chaotic, no modular organization, no understanding of luxury context or consumer psychology.'
      related_facts: ''
      facts_reference: ''
      query_reference: ''
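The arithmetic items in this rubric (gs21-s20 through gs21-s23 and gs21-s29) are listed with `implementation_type: llm`, but the formulas they state are simple enough to sketch in code. The Python below is a minimal illustration of those formulas only, not the benchmark's grader: the `Sku` class, its field names, and the two sample rows are hypothetical, and the ±0.5% margin tolerance is assumed to mean percentage points.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sku:
    """One summary-table row. Field names are hypothetical, not from the source files."""
    sku: str
    cost: float
    retail: float
    floor: float                   # Minimum Acceptable Price (brand floor price)
    qty: int                       # Proposed Qty
    promo: Optional[float] = None  # promotional price, if the proposal shows one


def selling_price(s: Sku) -> float:
    # The rubric sums "promotional price or retail price" times quantity.
    return s.promo if s.promo is not None else s.retail


def gross_profit(s: Sku) -> float:
    # Gross Profit = Retail Price - Cost, to the cent.
    return round(s.retail - s.cost, 2)


def gross_margin_pct(s: Sku) -> float:
    # Gross Margin % = Gross Profit / Retail Price.
    return gross_profit(s) / s.retail * 100


def margin_matches(reported_pct: float, s: Sku, tol_points: float = 0.5) -> bool:
    # Assumption: the tolerance is 0.5 percentage points.
    return abs(reported_pct - gross_margin_pct(s)) <= tol_points


def respects_floor(s: Sku) -> bool:
    # A promotional price must not fall below the brand floor price.
    return s.promo is None or s.promo >= s.floor


def total_revenue(skus: List[Sku]) -> float:
    return round(sum(selling_price(s) * s.qty for s in skus), 2)


def total_gross_profit(skus: List[Sku]) -> float:
    # Total Gross Profit = Total Revenue - sum(cost x proposed quantity).
    total_cost = sum(s.cost * s.qty for s in skus)
    return round(total_revenue(skus) - total_cost, 2)


def within_dollar(reported: float, expected: float, tol: float = 1.00) -> bool:
    return abs(reported - expected) <= tol


# Hypothetical rows for illustration; real values would come from purchase-order.xlsx.
skus = [
    Sku("HYP-001", cost=1200.00, retail=2000.00, floor=1700.00, qty=3, promo=1800.00),
    Sku("HYP-002", cost=300.00, retail=500.00, floor=450.00, qty=10),
]
```

With these two rows, `total_revenue(skus)` uses the promotional price for the first SKU and the retail price for the second, which mirrors the "promotional price or retail price" wording in gs21-s23.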
Restaurant - Supply Analysis
Prompt
You are a supply chain management specialist at "SZW Hotpot" (蜀滋味), a Chinese hotpot chain brand. SZW Hotpot is headquartered in San Francisco Chinatown, primarily serving the Bay Area Chinese community and local diners. The brand has 8 locations in the Bay Area (7 active, 1 closed), located in San Francisco Chinatown, Sunset District, Milpitas, Cupertino, Fremont, Oakland Chinatown, Daly City, and San Mateo.
The 2026 Chinese New Year / Lunar New Year falls in February (February 17 is New Year's Eve, February 18 is the first day of the new year). It is now mid-January 2026, and you need to develop a 2026 Spring Festival ingredient stocking plan based on Q1 2025 (January through March) historical operations data.
You will receive an Excel reference file containing Q1 2025 operations data ("szw-hotpot-2025-q1-operations-data.xlsx"), which includes 5 worksheets:
(a) Store Matrix — Store Master Data
Columns: Store ID, Store Name, Location, Store Type (Flagship / Standard / Express), Seats, Status (Active / Closed), Cold Storage (sqft), Delivery Zone, Notes.
7 stores are Active, 1 (SF-007, Daly City) is Closed. The Notes column flags anomalies relevant to planning (e.g., one store has extra foot traffic from a parade during Spring Festival, one store has limited cold storage requiring frozen item prioritization, etc.).
(b) Ingredient Master — Ingredient Master Data
Columns: Item ID, Item Name, Category (5 categories), Unit, Unit Cost ($), Shelf Life (days), Lead Time (days), MOQ (per store order), Wastage Rate, Storage Type (Refrigerated / Frozen / Ambient).
31 ingredients total. Shelf life ranges from 2 days (Fresh Juice) to 180 days (Beer, Soda); lead times range from 1 to 5 days; wastage rates range from 1% (ambient items) to 15% (Fresh Juice).
(c-e) Jan 2025 / Feb 2025 / Mar 2025 — Monthly Operations Data
Each sheet contains two data regions:
- Daily Sales Metrics: Date, Day, Store ID, Tables Served, Customer Count, Revenue ($), Avg Spend/Person ($), Table Turnover Rate, Food Cost Ratio
- Daily Ingredient Consumption: Date, Store ID, Category, Item Name, Unit, Consumption Qty, Unit Cost ($), Total Cost ($), Cumulative MTD Cost ($)
Please build an Excel workbook as the deliverable containing a Spring Festival stocking plan. The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.
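The historical analysis has to combine rows from more than one monthly worksheet and leave out the closed store. The Python below is a minimal sketch of that step under stated assumptions: the row dictionaries, their keys, and all sample values are hypothetical stand-ins for the "Daily Sales Metrics" region, and the window of January 20 to February 4, 2025 is the Spring Festival period that this task's rubric uses for the historical summary.

```python
from datetime import date
from typing import Dict, List

# Hypothetical rows; real rows would be read from the Jan 2025 and Feb 2025 worksheets.
jan_rows = [
    {"date": date(2025, 1, 19), "store_id": "SF-001", "revenue": 1000.0},  # before window
    {"date": date(2025, 1, 20), "store_id": "SF-001", "revenue": 1500.0},
    {"date": date(2025, 1, 31), "store_id": "SF-007", "revenue": 250.0},   # closed store
]
feb_rows = [
    {"date": date(2025, 2, 4), "store_id": "SF-002", "revenue": 2000.0},
    {"date": date(2025, 2, 5), "store_id": "SF-002", "revenue": 900.0},    # after window
]

# Status values follow the Store Matrix sheet (Active / Closed); this subset is illustrative.
store_status = {"SF-001": "Active", "SF-002": "Active", "SF-007": "Closed"}


def window_revenue(rows: List[dict], status: Dict[str, str],
                   start: date, end: date) -> Dict[str, float]:
    """Sum revenue per active store for dates in [start, end], inclusive."""
    totals: Dict[str, float] = {}
    for row in rows:
        if status.get(row["store_id"]) != "Active":
            continue  # skip closed or unknown stores
        if not (start <= row["date"] <= end):
            continue  # skip days outside the festival window
        totals[row["store_id"]] = totals.get(row["store_id"], 0.0) + row["revenue"]
    return totals


# Merge the two months first, then filter, so the window can cross the month boundary.
totals = window_revenue(jan_rows + feb_rows, store_status,
                        date(2025, 1, 20), date(2025, 2, 4))
```

Concatenating the monthly rows before filtering is the design choice that matters here: filtering each sheet separately makes it easy to drop the February 1 to 4 portion of the window.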
Rubric
constraint-catalog.yaml
Reference Files
szw-hotpot-2025-q1-operations-data.xlsx
Golden Deliverables
szw-hotpot-2026-cny-stocking-plan.xlsx
- constraint-catalog.yaml
- szw-hotpot-2025-q1-operations-data.xlsx
- szw-hotpot-2026-cny-stocking-plan.xlsx
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552task_name: 0022-gs-hotpot-stocking-plan deliverable_type: xlsx result_filename: validation-result.json rubric: Hard Constraint: - rubric_item_id: gs22-h01-deliverable-single-xlsx-excel implementation_type: code score: 1 criterion: Whether the deliverable is a single .xlsx Excel workbook file. score_1_condition: The deliverable file has a .xlsx extension and is a valid Excel workbook. 
score_0_condition: The deliverable is not an .xlsx file or is not a valid Excel workbook. related_facts: '' facts_reference: '' query_reference: Please build an Excel workbook as the deliverable - rubric_item_id: gs22-h02-workbook-contains-multiple-worksheets implementation_type: llm score: 1 criterion: Whether the workbook contains multiple worksheets or clearly separated sections corresponding to historical analysis, stocking plan, budget summary, and other modules. score_1_condition: The workbook contains multiple worksheets or clearly separated sections for historical analysis, stocking plan, budget summary, and other modules. score_0_condition: The workbook lacks multiple worksheets or clearly separated sections for the required modules. related_facts: '' facts_reference: '' query_reference: 'The workbook should include historical data analysis, trend and peak analysis, a store-level stocking plan for February 10–24, 2026, a budget summary, and an executive summary.' - rubric_item_id: gs22-h03-number-formatting-consistent-dollar implementation_type: llm score: 1 criterion: 'Whether number formatting is consistent: dollar amounts use $ with two decimal places, percentages use % format, quantities rounded to one decimal.' score_1_condition: 'Number formatting is consistent throughout: dollar amounts use $ with two decimal places, percentages use % format, quantities are rounded to one decimal.' score_0_condition: Number formatting is inconsistent or does not follow the required formats. related_facts: '' facts_reference: '' query_reference: '' - rubric_item_id: gs22-h04-workbook-contains-historical-data implementation_type: llm score: 1 criterion: Whether the workbook contains a historical data summary for the 2025 Spring Festival period (January 20 to February 4). score_1_condition: The workbook contains a historical data summary covering the 2025 Spring Festival period (January 20 to February 4). 
score_0_condition: The workbook does not contain a historical data summary for the 2025 Spring Festival period. related_facts: '' facts_reference: '' query_reference: 'historical data analysis' - rubric_item_id: gs22-h05-workbook-contains-clearly-labeled implementation_type: llm score: 1 criterion: Whether the workbook contains a clearly labeled 2026 Spring Festival stocking plan. score_1_condition: The workbook contains a clearly labeled 2026 Spring Festival stocking plan. score_0_condition: The workbook does not contain a clearly labeled 2026 Spring Festival stocking plan. related_facts: '' facts_reference: '' query_reference: 'a store-level stocking plan for February 10–24, 2026' - rubric_item_id: gs22-h06-stocking-plans-time-range implementation_type: llm score: 1 criterion: Whether the stocking plan's time range covers February 10 to February 24, 2026. score_1_condition: The stocking plan covers February 10 to February 24, 2026. score_0_condition: The stocking plan does not cover the required date range of February 10 to February 24, 2026. related_facts: The prompt explicitly requires the stocking plan to cover February 10 to February 24, 2026 (one week before to one week after the festival, 15 days total). February 17 is New Year's Eve, February 18 is the first day. facts_reference: '' query_reference: 'a store-level stocking plan for February 10–24, 2026' - rubric_item_id: gs22-h07-workbook-contains-23-sentence implementation_type: llm score: 1 criterion: Whether the workbook contains a 2-3 sentence stocking plan summary. score_1_condition: The workbook contains a 2-3 sentence stocking plan summary. score_0_condition: The workbook does not contain a stocking plan summary of 2-3 sentences. 
  related_facts: ''
  facts_reference: ''
  query_reference: 'an executive summary'
Soft Constraint:
- rubric_item_id: gs22-s01-historical-data-correctly-extracted
  implementation_type: llm
  score: 1
  criterion: Whether historical data is correctly extracted and merged from "Jan 2025" and "Feb 2025" worksheets.
  score_1_condition: Data is correctly extracted and merged from both Jan 2025 and Feb 2025 worksheets.
  score_0_condition: Data is not correctly extracted or merged from both worksheets.
  related_facts: The 2025 Spring Festival period spans January 20 to February 4, requiring data from both Jan and Feb worksheets.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s02-revenue-feb-14-falls
  implementation_type: llm
  score: 1
  criterion: Whether revenue for Feb 1-4 falls between $138,000 and $138,500.
  score_1_condition: Revenue for Feb 1-4 is between $138,000 and $138,500.
  score_0_condition: Revenue for Feb 1-4 is outside the range of $138,000 to $138,500.
  related_facts: Feb 1-4 revenue from the Feb 2025 worksheet verifies cross-month data extraction completeness.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s03-summary-covers-active-stores
  implementation_type: llm
  score: 1
  criterion: Whether the summary covers all 7 Active stores (SF-001 through SF-006, SF-008).
  score_1_condition: 'The summary covers all 7 Active stores: SF-001, SF-002, SF-003, SF-004, SF-005, SF-006, and SF-008.'
  score_0_condition: The summary is missing one or more of the 7 Active stores.
  related_facts: 'Store Matrix has 8 stores total: SF-001 to SF-008. SF-007 (Daly City) is Closed. The 7 Active stores are SF-001 through SF-006 and SF-008.'
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s04-closed-store-sf007-appear
  implementation_type: llm
  score: 1
  criterion: Whether closed store SF-007 does not appear in the historical summary or shows as 0.
  score_1_condition: SF-007 does not appear in the historical summary, or its values are 0.
  score_0_condition: SF-007 appears in the historical summary with non-zero values.
  related_facts: SF-007 (Daly City) has Status=Closed in the Store Matrix.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s05-total-revenue-during-spring
  implementation_type: llm
  score: 1
  criterion: Whether total revenue during Spring Festival falls between $506,500 and $507,500.
  score_1_condition: Total revenue during Spring Festival is between $506,500 and $507,500.
  score_0_condition: Total revenue during Spring Festival is outside the range of $506,500 to $507,500.
  related_facts: Sum of all 7 active stores' revenue from Jan 20 to Feb 4 (16 days).
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s06-total-customer-count-falls
  implementation_type: llm
  score: 1
  criterion: Whether total customer count falls between 11,350 and 11,400.
  score_1_condition: Total customer count is between 11,350 and 11,400.
  score_0_condition: Total customer count is outside the range of 11,350 to 11,400.
  related_facts: Sum of all 7 active stores' customer count from Jan 20 to Feb 4.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s07-total-consumption-cost-categories
  implementation_type: llm
  score: 1
  criterion: Whether total consumption cost by 5 categories is presented, with Meat between $45,000 and $45,400.
  score_1_condition: Total consumption cost is broken down by 5 ingredient categories, and Meat category cost is between $45,000 and $45,400.
  score_0_condition: Consumption cost is not broken down by 5 categories, or Meat category cost is outside $45,000 to $45,400.
  related_facts: 5 ingredient categories in Ingredient Master. Meat is the largest category by cost.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s08-consumption-data-presented-store
  implementation_type: llm
  score: 1
  criterion: Whether consumption data is presented in a store x category cross-dimensional matrix.
  score_1_condition: Consumption data is presented in a store x category cross-dimensional matrix.
  score_0_condition: Consumption data is not presented in a cross-dimensional matrix format.
  related_facts: ''
  facts_reference: ''
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s09-analysis-includes-dac-daily
  implementation_type: llm
  score: 1
  criterion: Whether the analysis includes a DAC (Daily Average Consumption) metric calculated as total divided by 16 days.
  score_1_condition: A DAC metric is included, calculated as total divided by 16 days.
  score_0_condition: No DAC metric is included, or it is not based on 16 days.
  related_facts: Spring Festival period is 16 days (Jan 20 to Feb 4 inclusive).
  facts_reference: ''
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s10-daily-average-revenue-between
  implementation_type: llm
  score: 1
  criterion: Whether daily average revenue is between $31,500 and $31,900.
  score_1_condition: Daily average revenue is between $31,500 and $31,900.
  score_0_condition: Daily average revenue is outside the range of $31,500 to $31,900.
  related_facts: Spring Festival daily average = total revenue / 16 days.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s11-total-consumption-31-ingredients
  implementation_type: llm
  score: 1
  criterion: Whether total consumption for each of all 31 ingredients is included.
  score_1_condition: Total consumption for each of all 31 ingredients is included.
  score_0_condition: Total consumption data is missing for some of the 31 ingredients.
  related_facts: Ingredient Master contains 31 ingredients across 5 categories.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s12-total-ingredient-cost-between
  implementation_type: llm
  score: 1
  criterion: Whether total ingredient cost is between $113,000 and $113,800.
  score_1_condition: Total ingredient cost is between $113,000 and $113,800.
  score_0_condition: Total ingredient cost is outside the range of $113,000 to $113,800.
  related_facts: Sum of all ingredient costs during the Spring Festival period.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s13-nonspring-festival-daily-average
  implementation_type: llm
  score: 1
  criterion: Whether non-Spring Festival daily average revenue is between $17,200 and $17,600.
  score_1_condition: Non-Spring Festival daily average revenue is between $17,200 and $17,600.
  score_0_condition: Non-Spring Festival daily average revenue is outside the range of $17,200 to $17,600.
  related_facts: Non-Spring Festival period = January 1-19 (19 days).
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s14-ingredient-unit-costs-match
  implementation_type: llm
  score: 1
  criterion: Whether ingredient unit costs match the Ingredient Master exactly.
  score_1_condition: All ingredient unit costs match the Ingredient Master.
  score_0_condition: One or more ingredient unit costs do not match the Ingredient Master.
  related_facts: Unit costs are defined in the Ingredient Master worksheet.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s15-analysis-includes-peaktonormal-ratio
  implementation_type: llm
  score: 1
  criterion: Whether the analysis includes a Peak-to-Normal Ratio calculation.
  score_1_condition: A Peak-to-Normal Ratio calculation is included.
  score_0_condition: No Peak-to-Normal Ratio calculation is included.
  related_facts: ''
  facts_reference: ''
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s16-revenue-peaktonormal-ratio-between
  implementation_type: llm
  score: 1
  criterion: Whether revenue Peak-to-Normal Ratio is between 1.78 and 1.88.
  score_1_condition: Revenue Peak-to-Normal Ratio is between 1.78 and 1.88.
  score_0_condition: Revenue Peak-to-Normal Ratio is outside the range of 1.78 to 1.88.
  related_facts: Ratio = Spring Festival daily avg revenue / non-Spring Festival daily avg revenue.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'historical data analysis, trend and peak analysis'
- rubric_item_id: gs22-s17-revenue-comparison-across-jan
  implementation_type: llm
  score: 1
  criterion: Whether revenue comparison across Jan, Feb, Mar 2025 is included (Jan approximately $700K, Feb approximately $528K, Mar approximately $415K).
  score_1_condition: Revenue comparison across Jan, Feb, Mar 2025 is included with values approximately matching Jan ~$700K, Feb ~$528K, Mar ~$415K.
  score_0_condition: Revenue comparison across Jan, Feb, Mar 2025 is missing or values are significantly off.
  related_facts: Monthly revenue totals from the three monthly worksheets.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s18-feb-vs-jan-change
  implementation_type: llm
  score: 1
  criterion: Whether Feb vs Jan change rate is between -22% and -27%.
  score_1_condition: Feb vs Jan change rate is between -22% and -27%.
  score_0_condition: Feb vs Jan change rate is outside the range of -22% to -27%.
  related_facts: Change rate = (Feb revenue / Jan revenue) - 1.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s19-mar-vs-feb-change
  implementation_type: llm
  score: 1
  criterion: Whether Mar vs Feb change rate is between -19% and -24%.
  score_1_condition: Mar vs Feb change rate is between -19% and -24%.
  score_0_condition: Mar vs Feb change rate is outside the range of -19% to -24%.
  related_facts: Change rate = (Mar revenue / Feb revenue) - 1.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s20-monthly-consumption-trends-ingredient
  implementation_type: llm
  score: 1
  criterion: Whether monthly consumption trends by ingredient category are included.
  score_1_condition: Monthly consumption trends broken down by ingredient category are included.
  score_0_condition: Monthly consumption trends by ingredient category are missing.
  related_facts: ''
  facts_reference: ''
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s21-percentage-change-calculations-correct
  implementation_type: llm
  score: 1
  criterion: Whether all percentage change calculations are correct within plus or minus 1%.
  score_1_condition: All percentage change calculations are correct within +-1%.
  score_0_condition: One or more percentage change calculations are off by more than +-1%.
  related_facts: ''
  facts_reference: ''
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s22-quantitative-description-spring-festival
  implementation_type: llm
  score: 1
  criterion: Whether a quantitative description of the Spring Festival effect is included.
  score_1_condition: A quantitative description of the Spring Festival effect is included.
  score_0_condition: No quantitative description of the Spring Festival effect is provided.
  related_facts: ''
  facts_reference: ''
  query_reference: 'trend and peak analysis'
- rubric_item_id: gs22-s23-stocking-plan-broken-down
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan is broken down by store for all 7 Active stores.
  score_1_condition: The stocking plan is broken down by store for all 7 Active stores (SF-001 through SF-006, SF-008).
  score_0_condition: The stocking plan is missing one or more Active stores or is not broken down by store.
  related_facts: '7 Active stores: SF-001 through SF-006, SF-008.'
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s24-closed-store-sf007-excluded
  implementation_type: llm
  score: 1
  criterion: Whether closed store SF-007 is excluded from the stocking plan or has quantity equal to 0.
  score_1_condition: SF-007 does not appear in the stocking plan, or its quantities are 0.
  score_0_condition: SF-007 appears in the stocking plan with non-zero quantities.
  related_facts: SF-007 (Daly City) is Closed.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s25-store-ids-stocking-plan
  implementation_type: llm
  score: 2
  criterion: Whether all store IDs in the stocking plan exist in the Store Matrix (no invalid store IDs).
  score_1_condition: All store IDs in the stocking plan are valid IDs from the Store Matrix.
  score_0_condition: One or more store IDs in the stocking plan do not exist in the Store Matrix.
  related_facts: 'Valid store IDs: SF-001 through SF-008.'
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
  score_2_condition: The deliverable satisfies the criterion to a meaningful intermediate extent, but still has noticeable omissions, inaccuracies, or consistency gaps.
- rubric_item_id: gs22-s26-stocking-plan-covers-31
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan covers all 31 ingredients.
  score_1_condition: The stocking plan covers all 31 ingredients from the Ingredient Master.
  score_0_condition: The stocking plan is missing one or more of the 31 ingredients.
  related_facts: Ingredient Master contains 31 ingredients.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s27-stocking-plan-covers-ingredient
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan covers all 5 ingredient categories.
  score_1_condition: The stocking plan covers all 5 ingredient categories.
  score_0_condition: The stocking plan is missing one or more of the 5 ingredient categories.
  related_facts: 5 categories in Ingredient Master.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s28-stocking-plan-broken-down
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan is broken down by both store and ingredient name.
  score_1_condition: The stocking plan is broken down by both store and ingredient name.
  score_0_condition: The stocking plan is not broken down by both store and ingredient name.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s29-flagship-stores-greater-total
  implementation_type: llm
  score: 1
  criterion: Whether Flagship stores have greater total stocking quantities than Standard stores, and Standard stores greater than the Express store.
  score_1_condition: Flagship stores (SF-001, SF-003) > Standard stores (SF-002, SF-004, SF-005, SF-008) > Express store (SF-006) in total stocking quantities.
  score_0_condition: The ordering of Flagship > Standard > Express in total stocking quantities is not maintained.
  related_facts: 'Flagship: SF-001 (160 seats), SF-003 (180 seats). Standard: SF-002, SF-004, SF-005, SF-008. Express: SF-006.'
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s30-stocking-plan-includes-dac
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan includes a DAC (Daily Average Consumption) column derived from 2025 Spring Festival data.
  score_1_condition: A DAC column from 2025 Spring Festival data is included in the stocking plan.
  score_0_condition: No DAC column from 2025 Spring Festival data is included.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s31-stocking-plan-includes-wastage
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan includes a Wastage Rate matching the Ingredient Master.
  score_1_condition: Wastage Rate is included and matches the values in the Ingredient Master.
  score_0_condition: Wastage Rate is missing or does not match the Ingredient Master.
  related_facts: Wastage rates in Ingredient Master range from 1% to 15%.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s32-gross-requirement-greater-equal
  implementation_type: llm
  score: 1
  criterion: Whether Gross Requirement is greater than or equal to Net Requirement multiplied by (1 + Wastage Rate).
  score_1_condition: Gross Requirement >= Net Requirement x (1 + Wastage Rate) for all items.
  score_0_condition: Gross Requirement is less than Net Requirement x (1 + Wastage Rate) for one or more items.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s33-stocking-plan-includes-safety
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan includes a Safety Stock column.
  score_1_condition: A Safety Stock column is included in the stocking plan.
  score_0_condition: No Safety Stock column is included.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s34-safety-stock-positively-correlated
  implementation_type: llm
  score: 1
  criterion: Whether Safety Stock is positively correlated with Lead Time.
  score_1_condition: Safety Stock values are positively correlated with Lead Time (longer lead time results in higher safety stock).
  score_0_condition: Safety Stock is not positively correlated with Lead Time.
  related_facts: Lead times in Ingredient Master range from 1 to 5 days.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s35-moq-minimum-order-quantity
  implementation_type: llm
  score: 1
  criterion: Whether MOQ (Minimum Order Quantity) constraint is respected in the stocking plan.
  score_1_condition: MOQ constraint is respected for all ingredients in the stocking plan.
  score_0_condition: MOQ constraint is violated for one or more ingredients.
  related_facts: MOQ values are defined per ingredient in the Ingredient Master.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s36-short-shelf-life-items
  implementation_type: llm
  score: 1
  criterion: Whether short shelf life items (5 days or fewer) use batch ordering rather than a single bulk order.
  score_1_condition: Short shelf life items (shelf life <= 5 days) use batch ordering.
  score_0_condition: Short shelf life items do not use batch ordering.
  related_facts: Short shelf life items include Fresh Beef Abomasum, Duck Blood Curd, Fresh Juice (2-5 days shelf life).
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s37-long-shelf-life-items
  implementation_type: llm
  score: 1
  criterion: Whether long shelf life items (60 days or more) allow a single bulk order.
  score_1_condition: Long shelf life items (shelf life >= 60 days) are ordered in a single bulk order or allow it.
  score_0_condition: Long shelf life items are unnecessarily split into multiple orders.
  related_facts: Long shelf life items include Soup Base varieties (90 days), Beer (180 days), Soda (180 days).
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s38-estimated-cost-equals-total
  implementation_type: llm
  score: 1
  criterion: Whether Estimated Cost equals Total Quantity multiplied by Unit Cost, correct within plus or minus $0.50.
  score_1_condition: Estimated Cost = Total Qty x Unit Cost for all items, within +-$0.50.
  score_0_condition: Estimated Cost calculation is off by more than +-$0.50 for one or more items.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s39-unit-cost-stocking-plan
  implementation_type: llm
  score: 1
  criterion: Whether Unit Cost in the stocking plan matches the Ingredient Master exactly.
  score_1_condition: All Unit Cost values in the stocking plan match the Ingredient Master exactly.
  score_0_condition: One or more Unit Cost values do not match the Ingredient Master.
  related_facts: Unit costs defined in Ingredient Master.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s40-meat-category-accounts-least
  implementation_type: llm
  score: 1
  criterion: Whether Meat category accounts for at least 40% of total stocking cost.
  score_1_condition: Meat category is >= 40% of total stocking cost.
  score_0_condition: Meat category is less than 40% of total stocking cost.
  related_facts: Meat is the dominant ingredient category for hotpot, historically accounting for the largest share of consumption cost.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s41-quantities-rounded-05-multiples
  implementation_type: llm
  score: 1
  criterion: Whether quantities are rounded to 0.5 multiples, and non-zero values are at least 0.5.
  score_1_condition: All quantities are rounded to 0.5 multiples and non-zero values are >= 0.5.
  score_0_condition: Quantities are not rounded to 0.5 multiples or non-zero values are less than 0.5.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s42-stocking-plan-mentions-considers
  implementation_type: llm
  score: 1
  criterion: Whether the stocking plan mentions or considers Storage Type differentiation (Refrigerated, Frozen, Ambient).
  score_1_condition: Storage Type differentiation is mentioned or considered in the stocking plan.
  score_0_condition: Storage Type differentiation is not mentioned or considered.
  related_facts: 'Storage types: Refrigerated, Frozen, Ambient.'
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a store-level stocking plan for February 10–24, 2026'
- rubric_item_id: gs22-s43-storelevel-budget-subtotal-rows
  implementation_type: llm
  score: 1
  criterion: Whether store-level budget subtotal rows are included.
  score_1_condition: Store-level budget subtotal rows are present.
  score_0_condition: Store-level budget subtotal rows are missing.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s44-categorylevel-budget-subtotal-rows
  implementation_type: llm
  score: 1
  criterion: Whether category-level budget subtotal rows are included.
  score_1_condition: Category-level budget subtotal rows are present.
  score_0_condition: Category-level budget subtotal rows are missing.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s45-grand-total-equals-sum
  implementation_type: llm
  score: 1
  criterion: Whether Grand Total equals the sum of store subtotals.
  score_1_condition: Grand Total equals the sum of all store subtotals.
  score_0_condition: Grand Total does not equal the sum of store subtotals.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s46-grand-total-equals-sum
  implementation_type: llm
  score: 1
  criterion: Whether Grand Total equals the sum of category subtotals.
  score_1_condition: Grand Total equals the sum of all category subtotals.
  score_0_condition: Grand Total does not equal the sum of category subtotals.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s47-grand-total-equals-sum
  implementation_type: llm
  score: 1
  criterion: Whether Grand Total equals the sum of all line-item costs within plus or minus $1.00.
  score_1_condition: Grand Total equals the sum of all line-item costs within +-$1.00.
  score_0_condition: Grand Total differs from the sum of line-item costs by more than +-$1.00.
  related_facts: ''
  facts_reference: ''
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s48-sf003-sf001-rank-top
  implementation_type: llm
  score: 1
  criterion: Whether SF-003 and SF-001 rank in the top two positions by budget.
  score_1_condition: SF-003 and SF-001 are the top two stores by budget (in either order).
  score_0_condition: SF-003 and SF-001 are not both in the top two positions by budget.
  related_facts: SF-003 (Milpitas, Flagship, 180 seats) and SF-001 (Chinatown, Flagship, 160 seats) are the two largest stores.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'a budget summary'
- rubric_item_id: gs22-s49-executive-summary-mentions-total
  implementation_type: llm
  score: 1
  criterion: Whether the executive summary mentions total budget consistent with the Grand Total.
  score_1_condition: The executive summary mentions a total budget figure consistent with the Grand Total in the budget section.
  score_0_condition: The executive summary does not mention a total budget or the figure is inconsistent with the Grand Total.
  related_facts: ''
  facts_reference: ''
  query_reference: 'an executive summary'
- rubric_item_id: gs22-s50-executive-summary-references-comparison
  implementation_type: llm
  score: 1
  criterion: Whether the executive summary references a comparison with 2025 Spring Festival consumption.
  score_1_condition: The executive summary references or compares with 2025 Spring Festival consumption data.
  score_0_condition: The executive summary does not reference 2025 Spring Festival consumption data.
  related_facts: ''
  facts_reference: ''
  query_reference: 'an executive summary'
- rubric_item_id: gs22-s51-executive-summary-identifies-meat
  implementation_type: llm
  score: 1
  criterion: Whether the executive summary identifies Meat as the highest cost-share category.
  score_1_condition: The executive summary identifies Meat as the highest cost-share ingredient category.
  score_0_condition: The executive summary does not identify Meat as the highest cost-share category.
  related_facts: Meat is the highest cost-share ingredient category in hotpot operations.
  facts_reference: szw-hotpot-2025-q1-operations-data.xlsx
  query_reference: 'an executive summary'
- rubric_item_id: gs22-s52-data-references-executive-summary
  implementation_type: llm
  score: 1
  criterion: Whether all data references in the executive summary are consistent with the workbook calculations.
  score_1_condition: All data references in the executive summary are consistent with the workbook calculations.
  score_0_condition: One or more data references in the executive summary are inconsistent with the workbook calculations.
  related_facts: ''
  facts_reference: ''
  query_reference: 'an executive summary'
Optional Constraint:
- rubric_item_id: gs22-o01-deliverable-clear-tabular-structure
  implementation_type: llm
  score: 4
  criterion: Whether the deliverable has clear tabular structure, consistent formatting, and professional presentation.
  score_4_condition: Polished tabular structure, perfectly aligned, publication-quality professionalism.
  score_3_condition: Clear tabular structure, consistent formatting, professional appearance.
  score_2_condition: Adequate tabular structure, generally consistent formatting.
  score_1_condition: Basic structure but obvious formatting issues.
  score_0_condition: Poor structure, inconsistent formatting, unprofessional appearance.
  related_facts: ''
  facts_reference: ''
  query_reference: ''
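Several of the soft constraints in the rubric above reduce to simple arithmetic: DAC is a period total divided by 16 days, the Peak-to-Normal Ratio is the Spring Festival daily average over the non-Spring Festival daily average, the change rate is (later month / earlier month) - 1, Gross Requirement must be at least Net Requirement x (1 + Wastage Rate), quantities must be multiples of 0.5, and Estimated Cost must match Total Qty x Unit Cost within $0.50. The Python sketch below is one possible way to express those checks. The function names and sample inputs are hypothetical and are not part of the benchmark's grader (every rubric item is marked implementation_type: llm); the sample figures are only chosen to sit inside the ranges the rubric quotes.

```python
import math

FESTIVAL_DAYS = 16  # Jan 20 to Feb 4, 2025 inclusive, per the rubric


def daily_average(total: float, days: int = FESTIVAL_DAYS) -> float:
    """DAC-style metric: a period total divided by the number of days."""
    return total / days


def peak_to_normal(festival_daily_avg: float, normal_daily_avg: float) -> float:
    """Spring Festival daily average over the non-festival daily average."""
    return festival_daily_avg / normal_daily_avg


def change_rate(later: float, earlier: float) -> float:
    """Month-over-month change rate: (later / earlier) - 1."""
    return later / earlier - 1


def gross_ok(gross: float, net: float, wastage_rate: float, eps: float = 1e-9) -> bool:
    """Gross Requirement >= Net Requirement x (1 + Wastage Rate).

    eps is an assumed tolerance that absorbs floating-point noise.
    """
    return gross + eps >= net * (1 + wastage_rate)


def is_half_multiple(qty: float) -> bool:
    """True when qty is a multiple of 0.5 and any non-zero value is >= 0.5."""
    doubled = qty * 2
    on_grid = math.isclose(doubled, round(doubled), abs_tol=1e-9)
    return on_grid and (qty == 0 or qty >= 0.5)


def cost_ok(estimated: float, total_qty: float, unit_cost: float, tol: float = 0.50) -> bool:
    """Estimated Cost = Total Qty x Unit Cost, within +-$0.50."""
    return abs(estimated - total_qty * unit_cost) <= tol


# Illustrative inputs only (assumptions, not values from the workbook).
festival_avg = daily_average(507_000)            # rubric range: $31,500 to $31,900
ratio = peak_to_normal(festival_avg, 17_400)     # rubric range: 1.78 to 1.88
feb_vs_jan = change_rate(528_000, 700_000)       # rubric range: -22% to -27%
mar_vs_feb = change_rate(415_000, 528_000)       # rubric range: -19% to -24%

print(f"daily avg {festival_avg:,.2f}, ratio {ratio:.2f}")
print(f"Feb vs Jan {feb_vs_jan:.1%}, Mar vs Feb {mar_vs_feb:.1%}")
```

The monthly figures fed to `change_rate` are the approximate totals the rubric itself cites (about $700K, $528K, and $415K), which is why both results land inside the stated ranges. The per-row helpers (`gross_ok`, `is_half_multiple`, `cost_ok`) would be applied to each line of a stocking plan.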