
cookbook7/historical-culinary-substitutions

Hugging Face · Updated 2026-04-14 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cookbook7/historical-culinary-substitutions
Description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- food
- cooking
- culinary
- historical
- ingredient-substitution
- recipes
- nlp
- text-classification
- feature-extraction
pretty_name: Historical Culinary Substitutions (1600–1929)
size_categories:
- 10K<n<100K
task_categories:
- feature-extraction
- text-classification
- question-answering
task_ids:
- semantic-similarity-classification
- multi-class-classification
- open-domain-qa
annotations_creators:
- machine-generated
- expert-validated
language_creators:
- machine-extracted
source_datasets:
- original
---

# Historical Culinary Substitutions Dataset

**25,251 ingredient substitution pairs** extracted from 154 public-domain cookbooks published between 1653 and 1936. Every record links directly to the source book and year, with original text as context.

The dataset captures **pre-refrigeration, pre-industrial culinary knowledge**: how cooks adapted ingredients when items were scarce, seasonal, or unavailable.

## Why This Dataset Matters

| Existing resources | What they lack |
|---|---|
| USDA FoodData Central | Nutrient data, no substitution logic |
| AllRecipes / crowd-sourced lists | No citations, no historical context, modern-only |
| Wikipedia food substitution tables | Sparse, unstructured, no provenance |
| Recipe NLP datasets (Recipe1M+) | No substitution extraction |

This dataset fills a gap: **grounded, cited, cross-referenced substitution intelligence spanning nearly three centuries of culinary history (1653–1936).**

The `consensus_score` field is what sets the dataset apart: it measures cross-book agreement. `butter → lard` scores ~0.99 because nearly every book in the corpus agrees on that substitution. `suet → butter` scores lower because opinions diverge.

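
As a rough illustration of how `consensus_score` might be used, the sketch below ranks the substitutes recorded for one ingredient by their score. It assumes each line of `substitutions.jsonl` is a JSON object carrying the `ingredient`, `substitute`, and `consensus_score` fields described in this card. The helper names `load_jsonl` and `rank_substitutes` are hypothetical, and only the `butter → lard` score comes from the card's sample record; the other values are invented for the example.

```python
import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line (the usual JSONL layout)."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def rank_substitutes(records: list[dict], ingredient: str,
                     min_score: float = 0.0) -> list[tuple[str, float]]:
    """Return (substitute, best score) pairs for one ingredient, highest first."""
    best: dict[str, float] = {}
    for rec in records:
        # Case-insensitive match on the ingredient name
        if rec["ingredient"].lower() != ingredient.lower():
            continue
        score = float(rec.get("consensus_score", 0.0))
        if score >= min_score:
            name = rec["substitute"]
            # Keep the highest score if a pair appears in several records
            best[name] = max(best.get(name, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# The first record uses the score from the sample record in this card;
# the other two are HYPOTHETICAL values invented for illustration.
SAMPLE_RECORDS = [
    {"ingredient": "butter", "substitute": "lard", "consensus_score": 0.992},
    {"ingredient": "butter", "substitute": "dripping", "consensus_score": 0.41},
    {"ingredient": "suet", "substitute": "butter", "consensus_score": 0.35},
]

# Keep only substitutes that most books agree on
ranked = rank_substitutes(SAMPLE_RECORDS, "butter", min_score=0.5)
print(ranked)
```

With the real file, `rank_substitutes(load_jsonl("substitutions.jsonl"), "butter")` would be the equivalent call, provided the file follows the layout assumed here.
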
## Dataset Summary

| Metric | Value |
|---|---|
| Substitution pairs (after dedup & cleaning) | 25,251 |
| Unique ingredients | 7,473 |
| Unique substitutes | 11,504 |
| Source books | 154 |
| Year range | 1653–1936 |
| Enriched records (with category + validation) | 36,657 |
| Cross-referenced entries | 11,775 |

## Files

| File | Format | Description |
|---|---|---|
| `substitutions.jsonl` | JSONL | Core dataset: cleaned ingredient → substitute pairs with citations |
| `book_meta.json` | JSON | 345 source books: slug → publication year mapping |
| `dataset_stats.json` | JSON | Full dataset statistics and distribution breakdowns |

### Substitution Record Format

```json
{
  "ingredient": "butter",
  "substitute": "lard",
  "context": "Where butter is mentioned in pastry, lard may always be used instead, and in some cases is preferable.",
  "notes": "",
  "diet_tags": ["historical"],
  "book": "Mrs Beeton's Book of Household Management",
  "year": 1861,
  "consensus_score": 0.992
}
```

### Field Descriptions

| Field | Type | Description |
|---|---|---|
| `ingredient` | string | The ingredient being substituted |
| `substitute` | string | The recommended replacement |
| `context` | string | Original text from the source book (verbatim) |
| `notes` | string | Extraction notes or additional guidance |
| `diet_tags` | list[str] | Dietary/contextual tags |
| `book` | string | Source cookbook title |
| `year` | int | Publication year of the source book |
| `consensus_score` | float | 0–1 cross-book agreement score |

## Coverage Highlights

**Top ingredients:** `butter`, `milk`, `cream`, `water`, `meat`, `chicken`, `sugar`, `flour`, `lard`, `wine`, `egg`, `coffee`, `beef`, `vanilla`, `brandy`

**Century distribution:**

| Century | Records |
|---|---|
| 1600s | Small (Compleat Cook 1658, A Book of Fruits and Flowers 1653) |
| 1700s | Growing (American/English colonial cookery) |
| 1800s | Largest cohort (Victorian-era cookbooks dominate) |
| 1900s–1929 | Strong (early 20th century domestic science) |

**Category distribution** (from enrichment): dairy, fat, protein, grain, sweetener, spice, liquid, leavening, vegetable, fruit, fish, other

## Source Books

All 345 books are pre-1929 public-domain works from Project Gutenberg and the Internet Archive. 154 of these yielded substitution pairs that passed quality filtering. Notable titles include:

- *The Compleat Cook* (1658)
- *The Cook's Oracle* (1817)
- *The Virginia Housewife* (1824)
- *Directions for Cookery* (1837)
- *Mrs Beeton's Book of Household Management* (1861)
- *Boston Cooking-School Cook Book* (1896)
- *Cookery and Dining in Imperial Rome* (1936 translation of Apicius)

Full book metadata with publication years is in `book_meta.json`.

## Data Quality

The enrichment pipeline (Qwen 3.5 35B + Gemini 2.5 Flash validation) adds:

- **Category labels** (12-class: dairy, fat, protein, grain, sweetener, spice, liquid, leavening, vegetable, fruit, fish, other)
- **Validity flags**: mark self-substitutions, non-food items (sulphur, paint pigments), and nonsense extractions
- **Consensus scores**: cross-book agreement for each substitution pair

| Quality metric | Count |
|---|---|
| Total enriched records | 36,657 |
| Clean records (valid=true) | 32,074 |
| Self-substitutions flagged | 190 |
| Nonsense ingredients flagged | 3 |
| Nonsense substitutes flagged | 4 |
| Missing context | 15 |
| Missing year (flagged) | 4,396 |

## Use Cases

- **Food NLP / recipe understanding**: ingredient substitution modelling, recipe adaptation
- **Allergen/dietary applications**: find alternatives grounded in historical precedent
- **Culinary history research**: track how substitutions changed between 1653 and 1936
- **RAG for cooking assistants**: grounded, cited answers to "what can I use instead of X?"
- **Data augmentation**: enrich modern recipe datasets with historical substitutions

## Companion API

A REST API with semantic search, LLM synthesis, and batch endpoints is available:

- **Search:** Natural language queries → grounded answers with citations
- **Substitute:** Ingredient → ranked substitutes with consensus scores
- **Modernise:** Rewrite old recipes in modern terms
- **Convert:** Historical measurement → modern equivalent
- **Explain:** Define archaic cooking terms

Source code: [github.com/noelpope7/old-recipe-intelligence](https://github.com/noelpope7/old-recipe-intelligence)

## Licence

**CC BY-NC 4.0**: free for non-commercial use with attribution. Commercial use requires a separate license.

Cite as:

> Pope, N. (2026). *Historical Culinary Substitutions Dataset*. Old Recipe Intelligence. https://huggingface.co/datasets/cookbook7/historical-culinary-substitutions

### Commercial Licensing

**Non-commercial use** (research, education, personal projects) is free under CC BY-NC 4.0.

**Commercial use** (products, services, SaaS, data resale) requires a paid license:

| License | Price | Use case |
|---|---|---|
| Startup | $499 one-time | Small companies, <10 employees |
| Business | $1,500 one-time | Mid-size companies, unlimited internal use |
| Enterprise | $3,000+ one-time | Large orgs, redistribution, white-label |

All licenses include perpetual access to the current dataset version. Updates and API access are available separately. To license, open a GitHub issue or email via the repository contact.

## Acknowledgements

- **Source texts:** Project Gutenberg, Internet Archive
- **Enrichment models:** Qwen 3.5 35B (local), Gemini 2.5 Flash (Vertex AI)
- **Embeddings:** `jonny9f/food_embeddings2` (food-domain sentence transformer), `all-MiniLM-L6-v2`
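
## Example: Recomputing the Century Distribution

The century breakdown under Coverage Highlights is given only qualitatively. A minimal sketch of how exact counts could be derived from the `year` field follows. It assumes records with a missing year carry `null` or omit the field, which this card does not specify. The helper names and sample records are hypothetical; only the years 1658, 1861, and 1896 are taken from titles listed in the card.

```python
from collections import Counter
from typing import Iterable, Optional


def century_label(year: Optional[int]) -> str:
    """Map a publication year to a label such as '1800s'; missing years become 'unknown'."""
    if year is None:
        return "unknown"
    return f"{(year // 100) * 100}s"


def century_counts(records: Iterable[dict]) -> Counter:
    """Count records per century label, using the `year` field of each record."""
    return Counter(century_label(rec.get("year")) for rec in records)


# HYPOTHETICAL records: the years match books named in the card, but the
# rows themselves are invented to keep the example self-contained.
SAMPLE_YEARS = [
    {"book": "The Compleat Cook", "year": 1658},
    {"book": "Mrs Beeton's Book of Household Management", "year": 1861},
    {"book": "Boston Cooking-School Cook Book", "year": 1896},
    {"book": "(year missing)", "year": None},
]

counts = century_counts(SAMPLE_YEARS)
for label, n in sorted(counts.items()):
    print(label, n)
```

Grouping missing years under a separate label keeps the 4,396 flagged records visible instead of silently dropping them.
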
Provider:
cookbook7