DigbyOldridge/digby-oldridge-heritage-archive
Source: Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/DigbyOldridge/digby-oldridge-heritage-archive
Resource description:
---
language:
- en
license: cc-by-nc-nd-4.0
tags:
- colour
- color
- heritage
- paint
- pigment
- CIELAB
- colorimetry
- alignment
- SFT
- DPO
- heritage-craft
- material-science
- Digby-Oldridge
- oxfordshire
- conservation
- british-heritage
- fine-tuning
- instruction-tuning
- color-science
- architectural-history
- pigment-chemistry
- substrate-specification
- OSI
- natural-dyeing
- rlhf
- preference-learning
- domain-adaptation
- expert-knowledge
- historic-buildings
- medieval
- interior-design
- lime-render
- colour-science
- ai
- llm
- large-language-models
- generative-ai
- machine-learning
- supervised-fine-tuning
- instruction-dataset
- sft-dataset
- dpo-training
- preference-optimization
- reasoning
- grounded-ai
- structured-data
- scientific-data
- architecture
- branding
- fashion
- textiles
task_categories:
- text-generation
- question-answering
task_ids:
- dialogue-modeling
- language-modeling
language_creators:
- expert-generated
annotations_creators:
- expert-generated
multilinguality:
- monolingual
pretty_name: "Digby Oldridge Heritage Colour Archive — AI Alignment Training Package v45"
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: archive_id
    dtype: string
  - name: system
    dtype: string
  - name: messages
    sequence:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 5463
  - name: validation
    num_examples: 683
  - name: test
    num_examples: 683
configs:
- config_name: sft_master
  data_files:
  - split: train
    path: Digby_SFT_Master_v45_train.jsonl
  - split: validation
    path: Digby_SFT_Master_v45_validation.jsonl
  - split: test
    path: Digby_SFT_Master_v45_test.jsonl
- config_name: sft_complete
  data_files:
  - split: train
    path: Digby_SFT_Complete_v45_messages.jsonl
- config_name: synthetic
  data_files:
  - split: train
    path: Digby_SFT_Synthetic_v45.jsonl
- config_name: taxonomy
  data_files:
  - split: train
    path: Digby_Taxonomy_SFT_v45_messages.jsonl
- config_name: multi_turn
  data_files:
  - split: train
    path: Digby_MultiTurn_v45.jsonl
- config_name: dpo
  data_files:
  - split: train
    path: Digby_DPO_v45_messages.jsonl
- config_name: system_prompt_sft
  data_files:
  - split: train
    path: Digby_SystemPrompt_SFT_v1_messages.jsonl
- config_name: eval
  data_files:
  - split: test
    path: Digby_Eval_v45_Full.jsonl
- config_name: wikipedia
  data_files:
  - split: train
    path: Digby_Wikipedia_SFT_v3_Hero05.jsonl
- config_name: reference
  data_files:
  - split: train
    path: DIGBY_OLDRIDGE_FullArchive_v45.parquet
---
# Digby Oldridge Heritage Colour Archive
> The first dataset that teaches LLMs how colour actually behaves
> in the real world — using CIELAB coordinates, material physics,
> and 1,102 curated colours from the English heritage landscape.
**This is not a palette dataset. It is a colour reasoning system.**
Most models fail this dataset in under 3 prompts.
**Version:** v45 · **License:** CC BY-NC-ND 4.0 ·
**Contact:** digby@preye.co.uk
---
## 🚀 Quick Start
- **Train a model:** `sft_master` (core supervised fine-tuning, pre-split 80/10/10)
- **Align behaviour:** `dpo` (preference optimisation: material safety + OSI guardrail)
- **Evaluate performance:** `eval` (300-prompt benchmark across 5 dimensions)
```python
from datasets import load_dataset

# Primary training file
train = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="sft_master",
    split="train",
)

# Evaluation benchmark
eval_data = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="eval",
    split="test",
)

# Multi-turn consultation dialogues
dialogues = load_dataset(
    "DigbyOldridge/digby-oldridge-heritage-archive",
    name="multi_turn",
    split="train",
)
```
---
## 🧠 Design Philosophy
The archive is built on a single premise:
colour is not a visual choice — it is a material event.
Every entry is treated as a physical phenomenon, defined by light
interaction, substrate chemistry, and temporal change — not as a
digital token. The dataset trains models to reason about colour as
the real world experiences it.
---
## 🎯 Who This Is For
- AI engineers building design, interior, or colour recommendation
systems that need responses grounded in material physics rather
than generic palette suggestions
- Researchers working on domain adaptation, grounded perception,
and colour reasoning in language models
- Heritage architects and specifiers building AI tools for
traditional building stock
- Anyone developing a colour reasoning benchmark and needing a
validated 5-dimension eval set with automated scoring criteria
- NLP researchers wanting a worked example of transforming a
structured knowledge base into expert SFT + DPO training data
- Practitioners in natural dyeing, art history, garden design,
fashion, or photography needing colour data with genuine
material and geographic provenance
---
## The problem this solves
Ask any current LLM to specify a heritage colour for a lime render
wall and it will confidently give you advice that would destroy the
render within a year. It does not know what a film-forming binder
does to a breathable substrate. It cannot calculate substrate risk.
It has no concept of how a pigment ages over 50 years. It treats
hex codes as colour rather than as coordinates in a perceptual
colour space grounded in material reality.
This dataset corrects those failure modes at the training level,
across 6,829 training pairs.
---
## 📊 What changes after fine-tuning
**Prompt:** "Suggest colours for a north-facing living room with
original lime render."
| | Generic model | Digby-trained model |
|---|---|---|
| Colour names | Vague ("warm beige") | Named archive entries with Archive ID |
| Lightness reasoning | None | L\* used to compensate for diffuse north light |
| Substrate guidance | None | OSI calculated, binder specified |
| Colour values | Hallucinated or approximate | Chromatic Lock — exact hex + L\*a\*b\* |
| Aging behaviour | None | Predictive Patina at 15/25/50 years |
| Safety | Generic permissive | Refuses unsafe lime render application |
### Example — Digby-trained output
```
Monastic Gall · Archive ID 174 · #2F2D2A · L*=18.6
North-facing compensation: low L* anchors presence
in diffuse light without advancing.
OSI on mature lime render: 26.1 — moderate risk.
Binder: mineral silicate (non-film-forming, vapour-permeable).
Predictive Patina:
15 years — binder begins first chalking cycle; pigment stable.
25 years — colour anchoring into the substrate surface.
50 years — near-mineral integration. The colour of a building
that has been the same colour for two generations.
Colour Memory: the iron-gall ink and tannin register of the
dissolved Oxfordshire monastic scriptoria — carbon and goethite,
both UV-inert, drawn from the pigment tradition of Abingdon
Abbey and the upper Thames valley dye houses.
```
---
## 🔬 The benchmark that breaks LLMs
Run `chromatic_lock_validator_v45.py` against any colour-aware
model. Generic instruction-tuned models score near zero on
specification safety — they cannot identify substrate
incompatibility, cannot calculate OSI, and cannot produce
Predictive Patina grounded in pigment chemistry.
> **RESEARCHER ALERT: Can your model pass the Chromatic Lock?**
> The validator is in the repo. Open a Discussion to share
> your scores.
**Evaluation dimensions:**
| Dimension | Prompts | What it tests |
|---|---|---|
| `colorimetric_lookup` | 60 | Name → exact hex + L\*a\*b\* |
| `perceptual_description` | 60 | Material vocabulary quality |
| `specification_safety` | 60 | OSI accuracy ± 2 |
| `harmony_reasoning` | 60 | Named archive colour pairings |
| `historical_provenance` | 60 | Geographic and material depth |
---
## What is inside
| Component | Examples | Purpose |
|---|---|---|
| SFT Master — train | 5,463 | Primary fine-tuning file |
| SFT Master — validation | 683 | Validation during training |
| SFT Master — test | 683 | Held-out evaluation |
| Hero 50 expert consultations | 250 | Gold standard voice anchors |
| Full archive taxonomy | 3,306 | Colorimetric lookup all 1,102 |
| Synthetic non-Hero | 3,156 | Expert pairs for remaining 1,052 |
| Domain expansion | 117 | Fashion, garden, art, photography |
| Multi-turn dialogues | 22 | Realistic consultation arcs |
| DPO pairs | 40 | UK heritage guardrail |
| System prompt SFT | 10 | 10-gap colour knowledge scaffold |
| Wikipedia SFT | 25 | Encyclopaedic authority layer |
| Evaluation benchmark | 300 | 5-dimension automated scoring |
| Archive reference | 1,102 rows | Ground truth CSV and Parquet |
---
## What makes this different
**Named, not scraped.** Every colour in the archive was
deliberately named for a specific place in the Oxfordshire
landscape: Radcot Verdigris for Radcot Bridge, Thames Ironwork
for the upper Thames navigation, Godstow Ultramarine for the
Benedictine nunnery on the Thames water meadow. The name is a
geographic coordinate. This naming discipline is the methodology.
**Material science embedded.** Every entry carries pigment
chemistry, substrate behaviour, UV stability rating, and 50-year
Predictive Patina grounded in documented historic pigment
behaviour. The model learns that a colour is not a hex code —
it is a material with physical properties that change over time
on specific substrates.
**Substrate risk quantified.** The Oldridge Substrate Index (OSI)
is calculated for every specification pair. The model learns to
refuse unsafe application rather than defaulting to generic
permissive advice.
**Historically grounded.** The archive spans prehistoric
earthworks to the Victorian industrial era, from Wayland's Smithy
(c. 3,500 BCE) to the Didcot railway junction (1840). Each colour
carries the specific historical moment and material tradition
it was drawn from.
---
## Training architecture
### Layer 1 — Associative Memory (`taxonomy` config)
3,306 dictionary-style SFT pairs. Three pair types per colour:
colorimetric lookup, reverse identity, full scholarly account.
Every Pair 1 completion carries a **Chromatic Lock** — exact
`Hex_Code`, `L*=`, `a*=`, `b*=` values embedded verbatim.
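The Chromatic Lock described above can be checked mechanically. The following is a minimal sketch, assuming the lock appears in a completion as a six-digit hex code plus `L*=`, `a*=`, and `b*=` tokens. The function name `parse_chromatic_lock` and the regular expressions are hypothetical and are not taken from `chromatic_lock_validator_v45.py`; the `a*` and `b*` values in the sample string are made up for illustration and are not archive data.

```python
import re

# Assumed surface forms: "#RRGGBB" and "L*=18.6" / "a*=0.5" / "b*=2.1".
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}\b")
LAB_RE = re.compile(r"\b([Lab])\*\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_chromatic_lock(text):
    """Return {'hex', 'L', 'a', 'b'} if all four values are present, else None."""
    hex_match = HEX_RE.search(text)
    lab = {axis: float(value) for axis, value in LAB_RE.findall(text)}
    if hex_match is None or not set(lab) >= {"L", "a", "b"}:
        return None
    return {"hex": hex_match.group(0).upper(), **lab}


# Illustrative completion; the a* and b* numbers are placeholders.
sample = "Monastic Gall (#2f2d2a): L*=18.6, a*=0.5, b*=2.1"
lock = parse_chromatic_lock(sample)
```

A completion that omits any of the four values returns `None`, which is one simple way to flag a broken lock before comparing values against the reference archive.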
### Layer 2 — Expert Voice (`sft_complete` config)
250 Hero 50 expert consultation pairs establish the
*Master Colorimetrist* scholarly voice before the model
encounters the broader archive.
### Layer 3 — Synthetic Extension (`synthetic` config)
3,156 pairs generated via controlled templates, chromatically
matched by L\* band and hue quadrant, and mathematically
validated — OSI calculations verified for every specification pair.
### Layer 4 — Preference Alignment (`dpo` config)
40 DPO pairs. `chosen` = correct substrate assessment with OSI.
`rejected` = plausible but materially wrong generic advice.
### Multi-turn Dialogues (`multi_turn` config)
22 realistic consultation dialogues covering dark palette
specification, competitor intercept, fabric and rug scheme,
OSI challenge, pigment science, rising damp, and textile print
palette. Presented by the author as the only multi-turn colour
consultation dataset currently available.
---
## Domain coverage
| Domain | Pairs | Topics |
|---|---|---|
| Heritage specification | 2,204 | OSI, lime render, listed buildings |
| Historical provenance | 2,079 | Geography, medieval pigments |
| Interior design | 950 | Biophilic, proportion, atmosphere |
| Pigment science | 439 | UV stability, Predictive Patina |
| Fabric and textile | 229 | Linen, wool, rug, print palette |
| Fashion and clothing | 30 | Natural dye, seasonal palette |
| Garden design | 25 | Estate gardens, visual depth |
| Art history | 20 | Oxford painting, conservation |
| Photography and film | 20 | LUT development, CIELAB workflow |
| AI and NLP methodology | 10 | Dataset construction, Chromatic Lock |
| Colour psychology | 10 | Biophilic design, healthcare |
| Natural dyeing | 10 | Oxfordshire dye plants, mordants |
---
## Data format
### SFT — Messages API format
Compatible with Bedrock, Claude, OpenAI, and any trainer
expecting `{system, messages}` format:
```json
{
  "archive_id": "174",
  "system": "You are the Master Colorimetrist...",
  "messages": [
    {
      "role": "user",
      "content": "What colour for a north-facing room with original lime render?"
    },
    {
      "role": "assistant",
      "content": "Monastic Gall (Archive ID 174, #2F2D2A, L*=18.6). OSI: 26.1, moderate risk. Mineral silicate binder. Predictive Patina at 25yr: colour anchoring into substrate..."
    }
  ]
}
```
### DPO — preference pairs
```json
{
"system": "...",
"messages": [{"role": "user", "content": "..."}],
"chosen": [{"role": "assistant", "content":
"OSI on hot lime: 63.0 — extreme risk..."}],
"rejected": [{"role": "assistant", "content":
"That would look wonderful on lime render..."}]
}
```
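Some preference trainers expect a `prompt` field rather than separate `system` and `messages` fields. The sketch below shows one way to reshape a record of the form above into a conversational `prompt` / `chosen` / `rejected` layout. The helper name `to_preference_example` is hypothetical, and folding the system text into the prompt as a leading `system` message is an assumption about what the downstream trainer accepts.

```python
def to_preference_example(record):
    """Fold `system` and `messages` into one `prompt` message list."""
    prompt = [{"role": "system", "content": record["system"]}]
    prompt.extend(record["messages"])
    return {
        "prompt": prompt,
        "chosen": record["chosen"],
        "rejected": record["rejected"],
    }


# Record shaped like the DPO example above (contents shortened).
record = {
    "system": "...",
    "messages": [{"role": "user", "content": "..."}],
    "chosen": [{"role": "assistant", "content": "OSI on hot lime: 63.0 ..."}],
    "rejected": [{"role": "assistant", "content": "That would look wonderful ..."}],
}
example = to_preference_example(record)
```

With the `datasets` library, the same function can be applied across the split via `dpo_data.map(to_preference_example, remove_columns=["system", "messages"])`.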
### Evaluation
```json
{
"dimension": "specification_safety",
"archive_id": "1004",
"prompt": "OSI for Faringdon Smalt on lime render?",
"ground_truth_hex": "#4050B8",
"correct_osi": 51.4,
"correct_osi_band": "high risk",
"expected_contains": ["#4050B8", "L*=29.7", "OSI"],
"scoring": "osi_accuracy_graded"
}
```
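To make the record above concrete, here is a minimal scoring sketch. It assumes a response passes when every `expected_contains` string appears verbatim and the first number following "OSI" lies within ± 2 of `correct_osi` (the tolerance quoted for `specification_safety`). The function `score_response` is hypothetical; the repository's validator and its `osi_accuracy_graded` scheme may grade differently.

```python
import re


def score_response(record, response, osi_tolerance=2.0):
    """Pass/fail checks for one eval record against one model response."""
    missing = [s for s in record.get("expected_contains", []) if s not in response]
    result = {"contains_ok": not missing, "missing": missing}
    if "correct_osi" in record:
        # Take the first number that follows the token "OSI".
        match = re.search(r"OSI[^0-9\-]*(-?\d+(?:\.\d+)?)", response)
        reported = float(match.group(1)) if match else None
        result["reported_osi"] = reported
        result["osi_ok"] = (
            reported is not None
            and abs(reported - record["correct_osi"]) <= osi_tolerance
        )
    return result


record = {
    "dimension": "specification_safety",
    "correct_osi": 51.4,
    "expected_contains": ["#4050B8", "L*=29.7", "OSI"],
}
good = score_response(
    record, "Faringdon Smalt, #4050B8, L*=29.7. OSI on lime render: 50.2 (high risk)."
)
bad = score_response(record, "A lovely blue for any wall.")
```

Here `good` passes both checks because 50.2 is within 2 of 51.4, while `bad` fails both.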
---
## Fine-tuning example
```python
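from datasets import load_dataset
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

REPO = "DigbyOldridge/digby-oldridge-heritage-archive"
BASE_MODEL = "your-base-model"  # placeholder: any causal LM checkpoint id or path


def to_text(example):
    # Flatten one single-turn record into a training string.
    # The <|...|> markers are illustrative; prefer your model's chat template.
    return (
        f"<|system|>{example['system']}"
        f"<|user|>{example['messages'][0]['content']}"
        f"<|assistant|>{example['messages'][1]['content']}"
    )


# Step 1: SFT
dataset = load_dataset(REPO, name="sft_master", split="train")
trainer = SFTTrainer(
    model=BASE_MODEL,
    args=SFTConfig(output_dir="digby-sft"),
    train_dataset=dataset,
    formatting_func=to_text,
)
trainer.train()
trainer.save_model("digby-sft")

# Step 2: DPO alignment, starting from the SFT checkpoint.
# Recent TRL releases take beta via DPOConfig and expect `prompt`,
# `chosen`, and `rejected` columns; map `messages` to `prompt` if needed.
dpo_data = load_dataset(REPO, name="dpo", split="train")
dpo_trainer = DPOTrainer(
    model="digby-sft",
    args=DPOConfig(output_dir="digby-dpo", beta=0.1),
    train_dataset=dpo_data,
)
dpo_trainer.train()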
```
---
## 🔌 Interoperability
- Compatible with OpenAI and Anthropic Messages API format
- Works with TRL `SFTTrainer` and `DPOTrainer` out of the box
- Usable in LoRA and QLoRA pipelines without modification
- Supports instruction tuning, chat fine-tuning, and preference
alignment in a single dataset
---
## Oldridge Substrate Index (OSI)
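The OSI is a calculated metric embedded in every specification pair.
It combines CIELAB lightness (L\*), chroma (C\*), and a substrate
alkalinity factor taken from the table below: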
```
OSI = (L* × 0.4) + (C* × 0.3) + (Substrate_Alkalinity × 30)
```
| Substrate | Alkalinity | Expected OSI range |
|---|---|---|
| Hot lime render (fresh) | 1.0 | 35–70 |
| Mature lime render | 0.6 | 20–45 |
| Masonry (limestone, brick) | 0.8 | 28–55 |
| Gypsum plaster | 0.3 | 10–30 |
| Modern acrylic plaster | 0.1 | 5–22 |
**Risk bands:** < 15 low · 15–35 moderate · 35–55 high · > 55 extreme
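The formula and risk bands above can be sketched in a few lines. This is an illustration, not the archive's own implementation: the dictionary keys and function names are hypothetical, and because the published bands share the boundaries 35 and 55, the sketch assumes 35 and 55 both fall in "high". The chroma value 2.2 in the example is back-solved so that the result reproduces the 26.1 quoted for Monastic Gall on mature lime render; it is not a value taken from the archive.

```python
# Alkalinity factors copied from the substrate table above.
SUBSTRATE_ALKALINITY = {
    "hot_lime_render": 1.0,
    "mature_lime_render": 0.6,
    "masonry": 0.8,
    "gypsum_plaster": 0.3,
    "modern_acrylic_plaster": 0.1,
}


def osi(lightness, chroma, substrate):
    """OSI = (L* x 0.4) + (C* x 0.3) + (Substrate_Alkalinity x 30)."""
    alkalinity = SUBSTRATE_ALKALINITY[substrate]
    return lightness * 0.4 + chroma * 0.3 + alkalinity * 30


def risk_band(value):
    """Map an OSI value to a band; boundary handling is an assumption."""
    if value < 15:
        return "low"
    if value < 35:
        return "moderate"
    if value <= 55:
        return "high"
    return "extreme"


# L* = 18.6 from the card; C* = 2.2 is an assumed, back-solved chroma.
monastic_gall = osi(18.6, 2.2, "mature_lime_render")
band = risk_band(monastic_gall)
```

Under these assumptions `monastic_gall` evaluates to about 26.1 and `band` to "moderate", matching the worked example earlier in this card.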
---
## The 10 AI colour knowledge gaps this dataset addresses
| Gap | What current AI does | What this trains |
|---|---|---|
| 1 | Treats hex as colour | CIELAB coordinates as primary reference |
| 2 | Recites CIELAB definitions | Spatial CIELAB reasoning in context |
| 3 | Describes colour as static | 15/25/50-year Predictive Patina |
| 4 | Ignores substrate | OSI calculation + binder specification |
| 5 | Names as branding | Names as geographic coordinates |
| 6 | Conflates pigment families | Iron oxide vs lake vs smalt behaviour |
| 7 | Misuses historic vocabulary | Fugitive, mordant, lake as technical terms |
| 8 | Treats light as fixed | English seasonal light as a variable |
| 9 | Colour as product | Colour as curated material record |
| 10 | No proportion guidance | 60/30/10 Oxfordshire interior hierarchy |
---
## Versioning
| Version | Content |
|---|---|
| v1.0 | Raw 1,102-entry archive CSV |
| v2.0 | Taxonomy SFT — colorimetric lookup pairs |
| v3.0 | Hero 50 expert voice + DPO alignment |
| v4.0 | Full synthetic extension — all 1,102 colours |
| v4.5 (current, labelled `v45` in file names) | Domain expansion, multi-turn, 5-dimension benchmark |
---
## Archive Schema
| Field | Description |
|---|---|
| `id` | Unique archive identifier |
| `Brand_Name` | Colour name — place name is geographic anchor |
| `Hex_Code` | Heritage-curated hex code |
| `L` `a` `b` `C` `h` | CIELAB coordinates |
| `Material_Source` | Historic pigment or material tradition |
| `Colour_Memory` | Geographic and historical anchor |
| `Light_Behaviour_Note` | English seasonal light behaviour |
| `UV_Stability_Note` | Pigment stability on traditional substrates |
| `Specifier_Warning` | Lime render and heritage binder cautions |
| `Sequence` | Hero 50 ordering (1–50) |
| `Hero50` | TRUE for the 50 flagship colours |
---
## Linked model
**[digby-oldridge/digby-heritage-colour-expert](https://huggingface.co/digby-oldridge/digby-heritage-colour-expert)**
*(placeholder — activates on model upload)*
---
## Limitations
See `DATA_STATEMENT.md` for the full statement.
1. Geographic scope: Oxfordshire and surrounding region only
2. Curated archive: colours are expert-designed and named rather
than instrumentally measured — CIELAB coordinates define the
chromatic position, material annotations reflect documented
historic pigment behaviour for each colour family
3. Synthetic extension: 3,156 pairs generated via controlled
templates, chromatically matched and mathematically validated
4. OSI is proprietary: not an industry standard — formula
and interpretation fully published in documentation
5. English substrate focus: UK heritage masonry
6. No adversarial evaluation prompts included
---
## Citation
```bibtex
@dataset{oldridge2025digby,
author = {Oldridge, Digby},
title = {Digby Oldridge Heritage Colour Archive},
year = {2025},
version = {v45},
url = {https://huggingface.co/datasets/DigbyOldridge/digby-oldridge-heritage-archive},
note = {1,102 expert-curated Oxfordshire heritage colours.
6,829 SFT pairs, 40 DPO pairs, 22 multi-turn
dialogues, 300-prompt 5-dimension benchmark.}
}
```
---
## Why this matters
Colour is one of the last domains where AI still behaves like a
surface-level autocomplete system — pattern-matching on aesthetics
with no understanding of the physical world underneath.
This dataset is an attempt to change that by grounding colour in
material science, pigment chemistry, and place. If models can
reason about colour correctly — about how a pigment behaves on a
specific substrate in a specific light — they can reason about
the physical world more broadly.
The archive is not a product. It is a record. It will outlast
every commercial reformulation, every trend cycle, every model
generation that trains on it.
---
## License
[CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
Commercial licensing: **digby@preye.co.uk**
Provider:
DigbyOldridge



