
DigbyOldridge/digby-oldridge-heritage-archive

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/DigbyOldridge/digby-oldridge-heritage-archive
Resource description:

---
language:
- en
license: cc-by-nc-nd-4.0
tags:
- colour
- color
- heritage
- paint
- pigment
- CIELAB
- colorimetry
- alignment
- SFT
- DPO
- heritage-craft
- material-science
- Digby-Oldridge
- oxfordshire
- conservation
- british-heritage
- fine-tuning
- instruction-tuning
- color-science
- architectural-history
- pigment-chemistry
- substrate-specification
- OSI
- natural-dyeing
- rlhf
- preference-learning
- domain-adaptation
- expert-knowledge
- historic-buildings
- medieval
- interior-design
- lime-render
- colour-science
- ai
- llm
- large-language-models
- generative-ai
- machine-learning
- supervised-fine-tuning
- instruction-dataset
- sft-dataset
- dpo-training
- preference-optimization
- reasoning
- grounded-ai
- structured-data
- scientific-data
- architecture
- branding
- fashion
- textiles
task_categories:
- text-generation
- question-answering
task_ids:
- dialogue-modeling
- language-modeling
language_creators:
- expert-generated
annotations_creators:
- expert-generated
multilinguality:
- monolingual
pretty_name: "Digby Oldridge Heritage Colour Archive — AI Alignment Training Package v45"
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: archive_id
    dtype: string
  - name: system
    dtype: string
  - name: messages
    sequence:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 5463
  - name: validation
    num_examples: 683
  - name: test
    num_examples: 683
configs:
- config_name: sft_master
  data_files:
  - split: train
    path: Digby_SFT_Master_v45_train.jsonl
  - split: validation
    path: Digby_SFT_Master_v45_validation.jsonl
  - split: test
    path: Digby_SFT_Master_v45_test.jsonl
- config_name: sft_complete
  data_files:
  - split: train
    path: Digby_SFT_Complete_v45_messages.jsonl
- config_name: synthetic
  data_files:
  - split: train
    path: Digby_SFT_Synthetic_v45.jsonl
- config_name: taxonomy
  data_files:
  - split: train
    path: Digby_Taxonomy_SFT_v45_messages.jsonl
- config_name: multi_turn
  data_files:
  - split: train
    path: Digby_MultiTurn_v45.jsonl
- config_name: dpo
  data_files:
  - split: train
    path: Digby_DPO_v45_messages.jsonl
- config_name: system_prompt_sft
  data_files:
  - split: train
    path: Digby_SystemPrompt_SFT_v1_messages.jsonl
- config_name: eval
  data_files:
  - split: test
    path: Digby_Eval_v45_Full.jsonl
- config_name: wikipedia
  data_files:
  - split: train
    path: Digby_Wikipedia_SFT_v3_Hero05.jsonl
- config_name: reference
  data_files:
  - split: train
    path: DIGBY_OLDRIDGE_FullArchive_v45.parquet
---

# Digby Oldridge Heritage Colour Archive

> The first dataset that teaches LLMs how colour actually behaves
> in the real world — using CIELAB coordinates, material physics,
> and 1,102 curated colours from the English heritage landscape.

**This is not a palette dataset. It is a colour reasoning system.**

Most models fail this dataset in under 3 prompts.

**Version:** v45 · **License:** CC BY-NC-ND 4.0 · **Contact:** digby@preye.co.uk

---

## 🚀 Quick Start

**Train a model:** `sft_master` — core supervised fine-tuning, pre-split 80/10/10

**Align behaviour:** `dpo` — preference optimisation (material safety + OSI guardrail)

**Evaluate performance:** `eval` — 300-prompt benchmark across 5 dimensions

```python
from datasets import load_dataset

# Primary training file
train = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="sft_master",
    split="train"
)

# Evaluation benchmark
eval_data = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="eval",
    split="test"
)

# Multi-turn consultation dialogues
dialogues = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="multi_turn",
    split="train"
)
```

---

## 🧠 Design Philosophy

The archive is built on a single premise: colour is not a visual choice — it is a material event.

Every entry is treated as a physical phenomenon, defined by light interaction, substrate chemistry, and temporal change — not as a digital token. The dataset trains models to reason about colour as the real world experiences it.

---

## 🎯 Who This Is For

- AI engineers building design, interior, or colour recommendation systems that need responses grounded in material physics rather than generic palette suggestions
- Researchers working on domain adaptation, grounded perception, and colour reasoning in language models
- Heritage architects and specifiers building AI tools for traditional building stock
- Anyone developing a colour reasoning benchmark and needing a validated 5-dimension eval set with automated scoring criteria
- NLP researchers wanting a worked example of transforming a structured knowledge base into expert SFT + DPO training data
- Practitioners in natural dyeing, art history, garden design, fashion, or photography needing colour data with genuine material and geographic provenance

---

## The problem this solves

Ask any current LLM to specify a heritage colour for a lime render wall and it will confidently give you advice that would destroy the render within a year. It does not know what a film-forming binder does to a breathable substrate. It cannot calculate substrate risk. It has no concept of how a pigment ages over 50 years. It treats hex codes as colour rather than as coordinates in a perceptual colour space grounded in material reality.

This dataset corrects those failure modes at the training level, across 6,829 training pairs.

---

## 📊 What changes after fine-tuning

**Prompt:** "Suggest colours for a north-facing living room with original lime render."

| Aspect | Generic model | Digby-trained model |
|---|---|---|
| Colour names | Vague ("warm beige") | Named archive entries with Archive ID |
| Lightness reasoning | None | L\* used to compensate for diffuse north light |
| Substrate guidance | None | OSI calculated, binder specified |
| Colour values | Hallucinated or approximate | Chromatic Lock — exact hex + L\*a\*b\* |
| Aging behaviour | None | Predictive Patina at 15/25/50 years |
| Safety | Generic permissive | Refuses unsafe lime render application |

### Example — Digby-trained output

```
Monastic Gall · Archive ID 174 · #2F2D2A · L*=18.6

North-facing compensation: low L* anchors presence in diffuse light
without advancing.

OSI on mature lime render: 26.1 — moderate risk.
Binder: mineral silicate (non-film-forming, vapour-permeable).

Predictive Patina:
  15 years — binder begins first chalking cycle; pigment stable.
  25 years — colour anchoring into the substrate surface.
  50 years — near-mineral integration. The colour of a building
             that has been the same colour for two generations.

Colour Memory: the iron-gall ink and tannin register of the dissolved
Oxfordshire monastic scriptoria — carbon and goethite, both UV-inert,
drawn from the pigment tradition of Abingdon Abbey and the upper
Thames valley dye houses.
```

---

## 🔬 The benchmark that breaks LLMs

Run `chromatic_lock_validator_v45.py` against any colour-aware model. Generic instruction-tuned models score near zero on specification safety — they cannot identify substrate incompatibility, cannot calculate OSI, and cannot produce Predictive Patina grounded in pigment chemistry.

> **RESEARCHER ALERT: Can your model pass the Chromatic Lock?**
> The validator is in the repo. Open a Discussion to share
> your scores.

**Evaluation dimensions:**

| Dimension | Prompts | What it tests |
|---|---|---|
| `colorimetric_lookup` | 60 | Name → exact hex + L\*a\*b\* |
| `perceptual_description` | 60 | Material vocabulary quality |
| `specification_safety` | 60 | OSI accuracy ± 2 |
| `harmony_reasoning` | 60 | Named archive colour pairings |
| `historical_provenance` | 60 | Geographic and material depth |

---

## What is inside

| Component | Examples | Purpose |
|---|---|---|
| SFT Master — train | 5,463 | Primary fine-tuning file |
| SFT Master — validation | 683 | Validation during training |
| SFT Master — test | 683 | Held-out evaluation |
| Hero 50 expert consultations | 250 | Gold standard voice anchors |
| Full archive taxonomy | 3,306 | Colorimetric lookup all 1,102 |
| Synthetic non-Hero | 3,156 | Expert pairs for remaining 1,052 |
| Domain expansion | 117 | Fashion, garden, art, photography |
| Multi-turn dialogues | 22 | Realistic consultation arcs |
| DPO pairs | 40 | UK heritage guardrail |
| System prompt SFT | 10 | 10-gap colour knowledge scaffold |
| Wikipedia SFT | 25 | Encyclopaedic authority layer |
| Evaluation benchmark | 300 | 5-dimension automated scoring |
| Archive reference | 1,102 rows | Ground truth CSV and Parquet |

---

## What makes this different

**Named, not scraped.** Every colour in the archive was deliberately named for a specific place in the Oxfordshire landscape — Radcot Verdigris for Radcot Bridge, Thames Ironwork for the upper Thames navigation, Godstow Ultramarine for the Augustinian nunnery on the Thames water meadow. The name is a geographic coordinate. This naming discipline is the methodology.

**Material science embedded.** Every entry carries pigment chemistry, substrate behaviour, UV stability rating, and 50-year Predictive Patina grounded in documented historic pigment behaviour. The model learns that a colour is not a hex code — it is a material with physical properties that change over time on specific substrates.

**Substrate risk quantified.** The Oldridge Substrate Index (OSI) is calculated for every specification pair. The model learns to refuse unsafe application rather than defaulting to generic permissive advice.

**Historically grounded.** The archive spans prehistoric earthworks to Victorian industrial — Wayland's Smithy at 3,500 BCE through the Didcot railway junction at 1840. Each colour carries the specific historical moment and material tradition it was drawn from.

---

## Training architecture

### Layer 1 — Associative Memory (`taxonomy` config)

3,306 dictionary-style SFT pairs. Three pair types per colour: colorimetric lookup, reverse identity, full scholarly account. Every Pair 1 completion carries a **Chromatic Lock** — exact `Hex_Code`, `L*=`, `a*=`, `b*=` values embedded verbatim.

### Layer 2 — Expert Voice (`sft_complete` config)

250 Hero 50 expert consultation pairs establish the *Master Colorimetrist* scholarly voice before the model encounters the broader archive.

### Layer 3 — Synthetic Extension (`synthetic` config)

3,156 pairs generated via controlled templates, chromatically matched by L\* band and hue quadrant, and mathematically validated — OSI calculations verified for every specification pair.

### Layer 4 — Preference Alignment (`dpo` config)

40 DPO pairs. `chosen` = correct substrate assessment with OSI. `rejected` = plausible but materially wrong generic advice.

### Multi-turn Dialogues (`multi_turn` config)

22 realistic consultation dialogues covering: dark palette specification, competitor intercept, fabric and rug scheme, OSI challenge, pigment science, rising damp, textile print palette. The only multi-turn colour consultation dataset in existence.
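
The Chromatic Lock described under Layer 1 (hex code plus `L*=`, `a*=`, `b*=` values embedded verbatim) lends itself to a mechanical check. The sketch below is one possible way to pull those values out of a completion and to test for required substrings. It is not the repository's `chromatic_lock_validator_v45.py`, whose rules are not reproduced in this card; the function names `extract_chromatic_lock` and `contains_all` and the regular expressions are assumptions made for illustration. The sample string is abridged from the card's own example record.

```python
import re

# Assumed patterns: the card states only that hex and L*/a*/b* values are
# embedded verbatim, not how the official validator matches them.
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}\b")
LAB_RE = {
    axis: re.compile(rf"(?<![A-Za-z]){axis}\*\s*=\s*(-?\d+(?:\.\d+)?)")
    for axis in ("L", "a", "b")
}


def extract_chromatic_lock(text):
    """Return the first hex code and any L*/a*/b* values found in text."""
    hex_match = HEX_RE.search(text)
    lock = {"hex": hex_match.group(0).upper() if hex_match else None}
    for axis, pattern in LAB_RE.items():
        match = pattern.search(text)
        lock[axis] = float(match.group(1)) if match else None
    return lock


def contains_all(text, expected):
    """True when every expected fragment appears verbatim in text."""
    return all(fragment in text for fragment in expected)


# Abridged from the card's SFT example for Archive ID 174.
sample = "Monastic Gall (Archive ID 174, #2F2D2A, L*=18.6). OSI: 26.1 moderate risk."
lock = extract_chromatic_lock(sample)
print(lock)
print(contains_all(sample, ["#2F2D2A", "L*=18.6", "OSI"]))
```

A completion that omits `a*=` or `b*=` yields `None` for those keys, so a caller can decide whether a partial lock is acceptable for a given pair type.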

---

## Domain coverage

| Domain | Pairs | Topics |
|---|---|---|
| Heritage specification | 2,204 | OSI, lime render, listed buildings |
| Historical provenance | 2,079 | Geography, medieval pigments |
| Interior design | 950 | Biophilic, proportion, atmosphere |
| Pigment science | 439 | UV stability, Predictive Patina |
| Fabric and textile | 229 | Linen, wool, rug, print palette |
| Fashion and clothing | 30 | Natural dye, seasonal palette |
| Garden design | 25 | Estate gardens, visual depth |
| Art history | 20 | Oxford painting, conservation |
| Photography and film | 20 | LUT development, CIELAB workflow |
| AI and NLP methodology | 10 | Dataset construction, Chromatic Lock |
| Colour psychology | 10 | Biophilic design, healthcare |
| Natural dyeing | 10 | Oxfordshire dye plants, mordants |

---

## Data format

### SFT — Messages API format

Compatible with Bedrock, Claude, OpenAI, and any trainer expecting `{system, messages}` format:

```json
{
  "archive_id": "174",
  "system": "You are the Master Colorimetrist...",
  "messages": [
    {
      "role": "user",
      "content": "What colour for a north-facing room with original lime render?"
    },
    {
      "role": "assistant",
      "content": "Monastic Gall (Archive ID 174, #2F2D2A, L*=18.6) — OSI: 26.1 moderate risk. Mineral silicate binder. Predictive Patina at 25yr: colour anchoring into substrate..."
    }
  ]
}
```

### DPO — preference pairs

```json
{
  "system": "...",
  "messages": [{"role": "user", "content": "..."}],
  "chosen": [{"role": "assistant", "content": "OSI on hot lime: 63.0 — extreme risk..."}],
  "rejected": [{"role": "assistant", "content": "That would look wonderful on lime render..."}]
}
```

### Evaluation

```json
{
  "dimension": "specification_safety",
  "archive_id": "1004",
  "prompt": "OSI for Faringdon Smalt on lime render?",
  "ground_truth_hex": "#4050B8",
  "correct_osi": 51.4,
  "correct_osi_band": "high risk",
  "expected_contains": ["#4050B8", "L*=29.7", "OSI"],
  "scoring": "osi_accuracy_graded"
}
```

---

## Fine-tuning example

```python
from trl import SFTTrainer, DPOTrainer, DPOConfig
from datasets import load_dataset

# Step 1 — SFT
dataset = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="sft_master",
    split="train"
)

# `model` is not defined in this snippet: supply a base model id string
# or an already-loaded causal LM before running.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    # Flattens the system prompt and first user/assistant turn into one string.
    formatting_func=lambda x: (
        f"<|system|>{x['system']}"
        f"<|user|>{x['messages'][0]['content']}"
        f"<|assistant|>{x['messages'][1]['content']}"
    ),
)
trainer.train()
sft_model = trainer.model  # the fine-tuned model feeds the DPO step

# Step 2 — DPO alignment
dpo_data = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="dpo",
    split="train"
)
# TRL conventionally reads the prompt from a `prompt` column, while these
# records store it under `messages`; a rename such as
# dpo_data.rename_column("messages", "prompt") may be needed.

# Recent TRL releases take beta through DPOConfig rather than as a
# DPOTrainer keyword; output_dir is a placeholder path.
dpo_trainer = DPOTrainer(
    model=sft_model,
    args=DPOConfig(beta=0.1, output_dir="digby-dpo"),
    train_dataset=dpo_data,
)
dpo_trainer.train()
```

---

## 🔌 Interoperability

- Compatible with OpenAI and Anthropic Messages API format
- Works with TRL `SFTTrainer` and `DPOTrainer` out of the box
- Usable in LoRA and QLoRA pipelines without modification
- Supports instruction tuning, chat fine-tuning, and preference alignment in a single dataset

---

## Oldridge Substrate Index (OSI)

Real calculated metric embedded in every specification pair:

```
OSI = (L* × 0.4) + (C* × 0.3) + (Substrate_Alkalinity × 30)
```

| Substrate | Alkalinity | Expected OSI range |
|---|---|---|
| Hot lime render (fresh) | 1.0 | 35–70 |
| Mature lime render | 0.6 | 20–45 |
| Masonry (limestone, brick) | 0.8 | 28–55 |
| Gypsum plaster | 0.3 | 10–30 |
| Modern acrylic plaster | 0.1 | 5–22 |

**Risk bands:** < 15 low · 15–35 moderate · 35–55 high · > 55 extreme

---

## The 10 AI colour knowledge gaps this dataset addresses

| Gap | What current AI does | What this trains |
|---|---|---|
| 1 | Treats hex as colour | CIELAB coordinates as primary reference |
| 2 | Recites CIELAB definitions | Spatial CIELAB reasoning in context |
| 3 | Describes colour as static | 15/25/50-year Predictive Patina |
| 4 | Ignores substrate | OSI calculation + binder specification |
| 5 | Names as branding | Names as geographic coordinates |
| 6 | Conflates pigment families | Iron oxide vs lake vs smalt behaviour |
| 7 | Misuses historic vocabulary | Fugitive, mordant, lake as technical terms |
| 8 | Treats light as fixed | English seasonal light as a variable |
| 9 | Colour as product | Colour as curated material record |
| 10 | No proportion guidance | 60/30/10 Oxfordshire interior hierarchy |

---

## Versioning

| Version | Content |
|---|---|
| v1.0 | Raw 1,102-entry archive CSV |
| v2.0 | Taxonomy SFT — colorimetric lookup pairs |
| v3.0 | Hero 50 expert voice + DPO alignment |
| v4.0 | Full synthetic extension — all 1,102 colours |
| v4.5 (current) | Domain expansion, multi-turn, 5-dimension benchmark |

---

## Archive Schema

| Field | Description |
|---|---|
| `id` | Unique archive identifier |
| `Brand_Name` | Colour name — place name is geographic anchor |
| `Hex_Code` | Heritage-curated hex code |
| `L` `a` `b` `C` `h` | CIELAB coordinates |
| `Material_Source` | Historic pigment or material tradition |
| `Colour_Memory` | Geographic and historical anchor |
| `Light_Behaviour_Note` | English seasonal light behaviour |
| `UV_Stability_Note` | Pigment stability on traditional substrates |
| `Specifier_Warning` | Lime render and heritage binder cautions |
| `Sequence` | Hero 50 ordering (1–50) |
| `Hero50` | TRUE for the 50 flagship colours |

---

## Linked model

**[digby-oldridge/digby-heritage-colour-expert](https://huggingface.co/digby-oldridge/digby-heritage-colour-expert)**
*(placeholder — activates on model upload)*

---

## Limitations

See `DATA_STATEMENT.md` for the full statement.

1. Geographic scope: Oxfordshire and surrounding region only
2. Curated archive: colours are expert-designed and named rather than instrumentally measured — CIELAB coordinates define the chromatic position, material annotations reflect documented historic pigment behaviour for each colour family
3. Synthetic extension: 3,156 pairs generated via controlled templates, chromatically matched and mathematically validated
4. OSI is proprietary: not an industry standard — formula and interpretation fully published in documentation
5. English substrate focus: UK heritage masonry
6. No adversarial evaluation prompts included

---

## Citation

```bibtex
@dataset{oldridge2025digby,
  author  = {Oldridge, Digby},
  title   = {Digby Oldridge Heritage Colour Archive},
  year    = {2025},
  version = {v45},
  url     = {https://huggingface.co/datasets/DigbyOldridge/digby-oldridge-heritage-archive},
  note    = {1,102 expert-curated Oxfordshire heritage colours. 6,829 SFT pairs, 40 DPO pairs, 22 multi-turn dialogues, 300-prompt 5-dimension benchmark.}
}
```

---

## Why this matters

Colour is one of the last domains where AI still behaves like a surface-level autocomplete system — pattern-matching on aesthetics with no understanding of the physical world underneath.

This dataset is an attempt to change that by grounding colour in material science, pigment chemistry, and place. If models can reason about colour correctly — about how a pigment behaves on a specific substrate in a specific light — they can reason about the physical world more broadly.

The archive is not a product. It is a record. It will outlast every commercial reformulation, every trend cycle, every model generation that trains on it.

---

## License

[CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

Commercial licensing: **digby@preye.co.uk**
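
The OSI formula, substrate alkalinity table, and risk bands published in this card can be worked through in a few lines. The sketch below is a minimal illustration, not the archive's own tooling: the function names, the dictionary keys, and the handling of the shared band edge at 35 are assumptions. The chroma value `C* = 2.2` for Monastic Gall is not stated in the card; it is back-solved here so that the formula reproduces the published OSI of 26.1 on mature lime render. The `within_tolerance` helper mirrors the "OSI accuracy ± 2" criterion listed for the `specification_safety` dimension.

```python
# Alkalinity factors copied from the card's substrate table.
SUBSTRATE_ALKALINITY = {
    "hot_lime_render": 1.0,
    "mature_lime_render": 0.6,
    "masonry": 0.8,
    "gypsum_plaster": 0.3,
    "modern_acrylic_plaster": 0.1,
}


def osi(lightness, chroma, alkalinity):
    """OSI = (L* x 0.4) + (C* x 0.3) + (Substrate_Alkalinity x 30)."""
    return lightness * 0.4 + chroma * 0.3 + alkalinity * 30


def risk_band(score):
    """Map an OSI score to the card's risk bands."""
    if score < 15:
        return "low"
    # Assumption: 35 is listed in both "15-35" and "35-55"; it is
    # treated as "high" here.
    if score < 35:
        return "moderate"
    if score <= 55:
        return "high"
    return "extreme"


def within_tolerance(predicted, correct, tolerance=2.0):
    """True when a predicted OSI is within +/- tolerance of the reference."""
    return abs(predicted - correct) <= tolerance


# Monastic Gall: L* = 18.6 from the card; C* = 2.2 is a back-solved,
# illustrative value, not a figure taken from the archive.
score = osi(18.6, 2.2, SUBSTRATE_ALKALINITY["mature_lime_render"])
print(round(score, 1), risk_band(score))
```

Term by term, the example evaluates as 18.6 × 0.4 = 7.44, 2.2 × 0.3 = 0.66, and 0.6 × 30 = 18.0, which sum to 26.1 and fall in the moderate band, matching the example output shown for Archive ID 174.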
Provider:
DigbyOldridge