cookbook7/historical-culinary-substitutions
Source: Hugging Face · Updated 2026-04-14 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cookbook7/historical-culinary-substitutions
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- food
- cooking
- culinary
- historical
- ingredient-substitution
- recipes
- nlp
- text-classification
- feature-extraction
pretty_name: Historical Culinary Substitutions (1600–1929)
size_categories:
- 10K<n<100K
task_categories:
- feature-extraction
- text-classification
- question-answering
task_ids:
- semantic-similarity-classification
- multi-class-classification
- open-domain-qa
annotations_creators:
- machine-generated
- expert-validated
language_creators:
- machine-extracted
source_datasets:
- original
---
# Historical Culinary Substitutions Dataset
**25,251 ingredient substitution pairs** extracted from 154 public-domain cookbooks published between 1653 and 1936.
Every record links directly to the source book and year, with original text as context. The dataset captures **pre-refrigeration, pre-industrial culinary knowledge** — how cooks adapted ingredients when items were scarce, seasonal, or unavailable.
## Why This Dataset Matters
| Existing resources | What they lack |
|---|---|
| USDA FoodData Central | Nutrient data, no substitution logic |
| AllRecipes / crowd-sourced lists | No citations, no historical context, modern-only |
| Wikipedia food substitution tables | Sparse, unstructured, no provenance |
| Recipe NLP datasets (Recipe1M+) | No substitution extraction |
This dataset fills a gap: **grounded, cited, cross-referenced substitution intelligence spanning nearly three centuries of culinary history (1653–1936).**
The `consensus_score` field is unique — it measures cross-book agreement. `butter → lard` scores ~0.99 because nearly every book in the corpus agrees on that substitution. `suet → butter` scores lower because opinions diverge.
## Dataset Summary
| Metric | Value |
|---|---|
| Substitution pairs (after dedup & cleaning) | 25,251 |
| Unique ingredients | 7,473 |
| Unique substitutes | 11,504 |
| Source books | 154 |
| Year range | 1653–1936 |
| Enriched records (with category + validation) | 36,657 |
| Cross-referenced entries | 11,775 |
## Files
| File | Format | Description |
|---|---|---|
| `substitutions.jsonl` | JSONL | Core dataset — cleaned ingredient → substitute pairs with citations |
| `book_meta.json` | JSON | 345 source books: slug → publication year mapping |
| `dataset_stats.json` | JSON | Full dataset statistics and distribution breakdowns |
### Substitution Record Format
```json
{
  "ingredient": "butter",
  "substitute": "lard",
  "context": "Where butter is mentioned in pastry, lard may always be used instead, and in some cases is preferable.",
  "notes": "",
  "diet_tags": ["historical"],
  "book": "Mrs Beeton's Book of Household Management",
  "year": 1861,
  "consensus_score": 0.992
}
```
### Field Descriptions
| Field | Type | Description |
|---|---|---|
| `ingredient` | string | The ingredient being substituted |
| `substitute` | string | The recommended replacement |
| `context` | string | Original text from the source book (verbatim) |
| `notes` | string | Extraction notes or additional guidance |
| `diet_tags` | list[str] | Dietary/contextual tags |
| `book` | string | Source cookbook title |
| `year` | int | Publication year of the source book |
| `consensus_score` | float | 0–1 cross-book agreement score |
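The field table above can be turned into a small loader and schema check. The following is a minimal sketch, assuming `substitutions.jsonl` holds one JSON object per line with exactly the fields listed; the helper names `parse_records` and `schema_problems` are illustrative and not part of the dataset, and the handling of a null `year` is an assumption about how missing years are stored.

```python
import json

# Field names and types as listed in the field table above.
EXPECTED_FIELDS = {
    "ingredient": str,
    "substitute": str,
    "context": str,
    "notes": str,
    "diet_tags": list,
    "book": str,
    "year": int,
    "consensus_score": float,
}


def parse_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def schema_problems(record):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: a missing year may be stored as null.
        if name == "year" and value is None:
            continue
        # JSON may encode a whole-number score (e.g. 1) as an int.
        if expected is float and isinstance(value, int):
            continue
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Typical use (assumes substitutions.jsonl has been downloaded locally):
#   with open("substitutions.jsonl", encoding="utf-8") as f:
#       records = list(parse_records(f))
#   bad = [r for r in records if schema_problems(r)]
```

Reading line by line keeps memory use flat and makes it easy to skip or log individual malformed records instead of failing on the whole file.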
## Coverage Highlights
**Top ingredients:** `butter`, `milk`, `cream`, `water`, `meat`, `chicken`, `sugar`, `flour`, `lard`, `wine`, `egg`, `coffee`, `beef`, `vanilla`, `brandy`
**Century distribution:**
| Century | Relative volume of records |
|---|---|
| 1600s | Small (*The Compleat Cook*, 1658; *A Book of Fruits and Flowers*, 1653) |
| 1700s | Growing (American/English colonial cookery) |
| 1800s | Largest cohort (Victorian-era cookbooks dominate) |
| 1900–1936 | Strong (early 20th-century domestic science) |
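The table gives only a qualitative picture. Exact per-century counts can be derived from the `year` field; the sketch below assumes records are already loaded as dicts and that records without a usable integer year should simply be skipped. The function names are illustrative.

```python
from collections import Counter


def century_label(year):
    """Map a year such as 1861 to a label such as '1800s'."""
    return f"{year // 100 * 100}s"


def records_per_century(records):
    """Count records per century, ignoring records without an integer year."""
    counts = Counter()
    for record in records:
        year = record.get("year")
        if isinstance(year, int):
            counts[century_label(year)] += 1
    # Sort by label so the result reads chronologically.
    return dict(sorted(counts.items()))
```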
**Category distribution** (from enrichment): dairy, fat, protein, grain, sweetener, spice, liquid, leavening, vegetable, fruit, fish, other
## Source Books
All 345 books are public-domain works from Project Gutenberg and the Internet Archive. 154 of these, published between 1653 and 1936, yielded substitution pairs that passed quality filtering. Notable titles include:
- *The Compleat Cook* (1658)
- *The Cook's Oracle* (1817)
- *The Virginia Housewife* (1824)
- *Directions for Cookery* (1837)
- *Mrs Beeton's Book of Household Management* (1861)
- *Boston Cooking-School Cook Book* (1896)
- *Cookery and Dining in Imperial Rome* (1936 translation of Apicius)
Full book metadata, including publication years, is in `book_meta.json`.
## Data Quality
The enrichment pipeline (Qwen 3.5 35B + Gemini 2.5 Flash validation) adds:
- **Category labels** (12-class: dairy, fat, protein, grain, sweetener, spice, liquid, leavening, vegetable, fruit, fish, other)
- **Validity flags** — mark self-substitutions, non-food items (sulphur, paint pigments), and nonsense extractions
- **Consensus scores** — cross-book agreement for each substitution pair
| Quality metric | Count |
|---|---|
| Total enriched records | 36,657 |
| Clean records (valid=true) | 32,074 |
| Self-substitutions flagged | 190 |
| Nonsense ingredients flagged | 3 |
| Nonsense substitutes flagged | 4 |
| Missing context | 15 |
| Missing year (flagged) | 4,396 |
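Three of the checks in the table (self-substitution, missing context, missing year) can be approximated with plain rules, without an LLM. The sketch below is not the enrichment pipeline itself: the flag names are hypothetical, and it assumes a missing year is stored as null or absent.

```python
def normalise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def quality_flags(record):
    """Return rule-based quality flags for one record (empty list if clean)."""
    flags = []
    ingredient = normalise(record.get("ingredient") or "")
    substitute = normalise(record.get("substitute") or "")
    # A pair that maps an ingredient to itself carries no information.
    if ingredient and ingredient == substitute:
        flags.append("self_substitution")
    if not (record.get("context") or "").strip():
        flags.append("missing_context")
    # Assumption: a missing year is null or the key is absent.
    if record.get("year") is None:
        flags.append("missing_year")
    return flags
```

Detecting non-food items or nonsense extractions needs domain knowledge that simple string rules do not capture, which is presumably why the pipeline uses models for those flags.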
## Use Cases
- **Food NLP / recipe understanding** — ingredient substitution modelling, recipe adaptation
- **Allergen/dietary applications** — find alternatives grounded in historical precedent
- **Culinary history research** — track how substitutions changed between 1653 and 1936
- **RAG for cooking assistants** — grounded, cited answers to "what can I use instead of X?"
- **Data augmentation** — enrich modern recipe datasets with historical substitutions
## Companion API
A REST API with semantic search, LLM synthesis, and batch endpoints is available:
- **Search:** Natural language queries → grounded answers with citations
- **Substitute:** Ingredient → ranked substitutes with consensus scores
- **Modernise:** Rewrite old recipes in modern terms
- **Convert:** Historical measurement → modern equivalent
- **Explain:** Define archaic cooking terms
Source code: [github.com/noelpope7/old-recipe-intelligence](https://github.com/noelpope7/old-recipe-intelligence)
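The "Substitute" behaviour described above (ingredient in, ranked substitutes out) can be approximated offline from the JSONL records. This is a minimal local sketch, not the API's implementation and not a client for it: `rank_substitutes` and its `min_score` parameter are illustrative names, and matching is a plain case-insensitive string comparison rather than semantic search.

```python
def rank_substitutes(records, ingredient, min_score=0.0):
    """Return the best-scoring record per substitute, highest consensus first."""
    wanted = ingredient.strip().lower()
    best = {}
    for record in records:
        if record["ingredient"].strip().lower() != wanted:
            continue
        score = record["consensus_score"]
        if score < min_score:
            continue
        substitute = record["substitute"]
        # Keep only the highest-scoring citation for each substitute.
        if substitute not in best or score > best[substitute]["consensus_score"]:
            best[substitute] = record
    return sorted(best.values(), key=lambda r: r["consensus_score"], reverse=True)
```

Each returned record still carries `book`, `year`, and `context`, so an answer built from it can cite its source.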
## Licence
**CC BY-NC 4.0** — free for non-commercial use with attribution. Commercial use requires a separate license.
Cite as:
> Pope, N. (2026). *Historical Culinary Substitutions Dataset*. Old Recipe Intelligence. https://huggingface.co/datasets/cookbook7/historical-culinary-substitutions
### Commercial Licensing
**Non-commercial use** (research, education, personal projects) is free under CC BY-NC 4.0.
**Commercial use** (products, services, SaaS, data resale) requires a paid license:
| License | Price | Use case |
|---|---|---|
| Startup | $499 one-time | Small companies, <10 employees |
| Business | $1,500 one-time | Mid-size companies, unlimited internal use |
| Enterprise | $3,000+ one-time | Large orgs, redistribution, white-label |
All licenses include perpetual access to the current dataset version. Updates and API access are available separately.
To license, open a GitHub issue or email via the repository contact.
## Acknowledgements
- **Source texts:** Project Gutenberg, Internet Archive
- **Enrichment models:** Qwen 3.5 35B (local), Gemini 2.5 Flash (Vertex AI)
- **Embeddings:** `jonny9f/food_embeddings2` (food-domain sentence transformer), `all-MiniLM-L6-v2`
Provider:
cookbook7



