machina-sports/ayrton-1-qa-v2
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/machina-sports/ayrton-1-qa-v2
---
license: cc-by-nc-4.0
language:
- en
task_categories:
- question-answering
- text-generation
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- formula-1
- f1
- motorsport
- question-answering
- fine-tuning
- mlx
pretty_name: Ayrton-1 F1 QA v2
---
# Ayrton-1 QA v2
A question-answering dataset for supervised fine-tuning of language models on Formula 1 knowledge, 1950–2025.
This is the training corpus behind [`machina-sports/ayrton-1`](https://huggingface.co/machina-sports/ayrton-1).
## What's in it
Template-generated question/answer pairs covering:
- **Race results, championship standings, driver/constructor history** (1950–2025, backed by Jolpica-F1).
- **Session-level telemetry and strategy** — lap times, pit stops, stints, compound usage, top speeds (2018–2025, backed by FastF1).
- **Explicit coverage-boundary refusals** — pre-2018 FastF1-style questions paired with scope-limited responses, teaching the model to refuse out-of-window questions instead of hallucinating.
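The coverage-boundary idea can be illustrated with a minimal sketch. The 2018–2025 window comes from the list above; the function names, the refusal wording, and the `"source": "refusal"` value are hypothetical and are not taken from the dataset itself.

```python
# Coverage window for session-level (FastF1-backed) data, as stated in this card.
FASTF1_FIRST_SEASON = 2018
FASTF1_LAST_SEASON = 2025


def in_fastf1_window(season: int) -> bool:
    """Return True if session-level data is in scope for this season."""
    return FASTF1_FIRST_SEASON <= season <= FASTF1_LAST_SEASON


def make_refusal_row(question: str, season: int, template: str) -> dict:
    """Build a hypothetical coverage-boundary row for an out-of-window season."""
    if in_fastf1_window(season):
        raise ValueError("season is inside the coverage window; no refusal needed")
    # Assumed wording: the real scope-limited responses may be phrased differently.
    answer = (
        f"Session-level data is only available for {FASTF1_FIRST_SEASON}-"
        f"{FASTF1_LAST_SEASON}, so I can't answer that for {season}."
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        # "refusal" as a source label is an assumption for illustration only.
        "meta": {"template": template, "season": season, "source": "refusal"},
    }
```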
Each row exists in two parallel formats:
- **Messages format** (`*_messages.jsonl`) — OpenAI-style `{role, content}` turns.
- **MLX format** (`*_mlx.jsonl`) — pre-flattened prompt/response pairs consumable by `mlx_lm.lora`.
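As a rough illustration of how the two formats relate, the sketch below flattens a single-turn messages row into a prompt/response pair. The output key names `prompt` and `completion` are assumptions; this card does not state the exact schema of the `*_mlx.jsonl` files, so check a real row before relying on them.

```python
import json


def flatten_messages(row: dict) -> dict:
    """Collapse a single-turn messages row into a prompt/completion pair."""
    user = [m["content"] for m in row["messages"] if m["role"] == "user"]
    assistant = [m["content"] for m in row["messages"] if m["role"] == "assistant"]
    if len(user) != 1 or len(assistant) != 1:
        raise ValueError("expected exactly one user turn and one assistant turn")
    # Key names are assumed, not confirmed by the dataset card.
    return {"prompt": user[0], "completion": assistant[0]}


def convert_lines(lines):
    """Convert an iterable of JSONL strings, skipping blank lines."""
    return [
        json.dumps(flatten_messages(json.loads(line)))
        for line in lines
        if line.strip()
    ]
```

For example, `convert_lines(open("train_messages.jsonl", encoding="utf-8"))` would yield one flattened JSON string per row.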
## Splits
| Split | File | Purpose |
|---|---|---|
| Train | `train_messages.jsonl` / `train_mlx.jsonl` | SFT |
| Valid | `valid_messages.jsonl` / `valid_mlx.jsonl` | 2024 season, held out during training |
| Test | `test_messages.jsonl` / `test_mlx.jsonl` | 2025 season, final evaluation |
**Leakage-safe temporal split:** validation = 2024, test = 2025 — no season overlap between splits. This prevents the trivial leakage pattern of a model memorizing a race result from train and answering the same question in test.
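A minimal sketch of a season-based split and an overlap check, assuming each row carries `meta.season` as in the row example in this card. Routing every season other than 2024 and 2025 to train is an assumption, since the card does not list the train seasons explicitly; the function names are illustrative.

```python
VALID_SEASON = 2024
TEST_SEASON = 2025


def split_name(season: int) -> str:
    """Map a season to a split; everything else goes to train (assumed)."""
    if season == TEST_SEASON:
        return "test"
    if season == VALID_SEASON:
        return "valid"
    return "train"


def temporal_split(rows):
    """Group rows into train/valid/test by their meta.season value."""
    splits = {"train": [], "valid": [], "test": []}
    for row in rows:
        splits[split_name(row["meta"]["season"])].append(row)
    return splits


def seasons_overlap(splits) -> bool:
    """Return True if any season appears in more than one split."""
    seen = {name: {r["meta"]["season"] for r in rows} for name, rows in splits.items()}
    names = list(seen)
    return any(
        seen[a] & seen[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
```

Running `seasons_overlap` on the published files is a cheap sanity check before training.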
## Fields
Each row is a chat completion example:
```json
{
  "messages": [
    {"role": "user", "content": "Who won the 1988 Monaco Grand Prix?"},
    {"role": "assistant", "content": "Alain Prost won the 1988 Monaco Grand Prix."}
  ],
  "meta": {
    "template": "race_winner",
    "season": 1988,
    "source": "jolpica"
  }
}
```
Template metadata is preserved so that downstream evaluation can report accuracy per template and check each template against a minimum-accuracy floor.
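One way to use the template metadata is sketched below: group predictions by `meta.template` and compare each group's accuracy to a floor. The exact-match comparator, the floor values, and the function names are illustrative assumptions, not the project's actual evaluation code.

```python
from collections import defaultdict


def exact_match(prediction: str, reference: str) -> bool:
    """Simplest possible scorer; a real evaluation may be more lenient."""
    return prediction.strip().lower() == reference.strip().lower()


def per_template_accuracy(rows, predictions, is_correct=exact_match):
    """Return {template: accuracy}, using the last message as the reference."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row, prediction in zip(rows, predictions):
        template = row["meta"]["template"]
        reference = row["messages"][-1]["content"]
        totals[template] += 1
        hits[template] += int(is_correct(prediction, reference))
    return {template: hits[template] / totals[template] for template in totals}


def below_floor(accuracy, floors):
    """List templates whose accuracy is under their floor (default floor 0.0)."""
    return sorted(t for t, acc in accuracy.items() if acc < floors.get(t, 0.0))
```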
## How it was built
1. **Ingest** upstream APIs (Jolpica-F1, FastF1) into `data/raw/`.
2. **Normalize** into parquet tables (`data/normalized/`).
3. **Template expansion** — ~40 factual templates generate QA pairs with controlled paraphrase variants.
4. **Teacher distillation** — Gemini 3 Flash generates style-improved answers on train split, filtered by:
   - Value-level factual overlap (per-template thresholds, 0.45–0.55).
   - Strict numeric matching for lap-count/position/speed templates.
   - Hedge rejection on atomic-fact templates.
5. **Mix** — distilled rows blended into base train at ~30%.
6. **Strategy boost** — final train mix enforces 60% strategy / 40% other, with a small refusal quota (~0.8%) of coverage-boundary rows.
7. **Split** temporally (2024 valid / 2025 test).
Full pipeline: [`scripts/`](https://github.com/machinasports/ayrton-1/tree/main/scripts) in the source repo.
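The three distillation filters in step 4 could look roughly like the sketch below. This is not the code from `scripts/`. The overlap metric (the fraction of reference values found verbatim in the candidate), the hedge word list, and all function names are assumptions made for illustration; only the 0.45–0.55 threshold range comes from this card.

```python
import re

# Hypothetical hedge markers; the real filter's list is not documented here.
HEDGE_PATTERN = re.compile(r"\b(probably|possibly|perhaps|might|i think)\b")


def numbers(text: str) -> list:
    """Extract integer and decimal tokens from text."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def value_overlap(reference_values, candidate: str) -> float:
    """Fraction of reference values that appear (case-insensitively) in candidate."""
    if not reference_values:
        return 0.0
    text = candidate.lower()
    found = sum(1 for value in reference_values if value.lower() in text)
    return found / len(reference_values)


def keep_distilled(candidate, reference, reference_values,
                   threshold=0.5, strict_numeric=False, atomic=False) -> bool:
    """Apply the overlap, numeric, and hedge checks to one teacher answer."""
    if value_overlap(reference_values, candidate) < threshold:
        return False
    # Strict numeric matching: same multiset of numbers as the template answer.
    if strict_numeric and sorted(numbers(candidate)) != sorted(numbers(reference)):
        return False
    # Hedge rejection: atomic facts should be stated without hedging.
    if atomic and HEDGE_PATTERN.search(candidate.lower()):
        return False
    return True
```

A teacher answer that rephrases the template answer but keeps every key value passes; one that changes a lap count or hedges on an atomic fact is dropped.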
## Intended use
- Fine-tuning LLMs for F1 domain QA.
- Evaluating F1 knowledge in existing LLMs via the held-out splits.
- Research on temporal knowledge splits and distillation-from-proprietary-teacher recipes.
## Out of scope / limitations
- **English only.**
- **No live data.** Nothing after the 2025 season is covered.
- **FastF1-dependent rows cover 2018–2025 only.** Pre-2018 coverage is race-level only.
- **Teacher-distilled rows carry Gemini-3-Flash stylistic bias** by construction. The factual filter is strict, but tone may skew toward the teacher.
- **Template-generated questions are not fully natural** — they read like bench queries, not conversational prompts.
- **Known data gap:** Jolpica returns an empty payload for 1954 `constructorStandings`; not patched.
## Data sources & attribution
| Source | Used for | License |
|---|---|---|
| [Jolpica-F1](https://github.com/jolpica/jolpica-f1) | 1950–2025 race/results/standings | See upstream |
| [FastF1](https://github.com/theOehrly/Fast-F1) | 2018–2025 telemetry/strategy | MIT |
| OpenF1 | *Not used in this release* | — |
## License
**CC BY-NC 4.0**: free for non-commercial use with attribution. Commercial use requires a separate agreement for this dataset and must also comply with the licenses of the upstream sources.
The teacher-distilled rows are subject to Gemini API output terms; redistribution is considered fair use within this non-commercial research release.
## Citation
```bibtex
@misc{ayrton1_qa_v2,
  title = {Ayrton-1 QA v2: A Formula 1 Question-Answering Dataset (1950-2025)},
  author = {Machina Sports},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/machina-sports/ayrton-1-qa-v2}}
}
```
Provider:
machina-sports



