machina-sports/ayrton-1-qa-v2

Name: machina-sports/ayrton-1-qa-v2
Creator: machina-sports
Published: 2026-04-17 14:33:12
License: CC BY-NC 4.0 (per the dataset card)

Source: Hugging Face | Updated: 2026-04-17 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/machina-sports/ayrton-1-qa-v2

Resource description:

---
license: cc-by-nc-4.0
language:
- en
task_categories:
- question-answering
- text-generation
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- formula-1
- f1
- motorsport
- question-answering
- fine-tuning
- mlx
pretty_name: Ayrton-1 F1 QA v2
---

# Ayrton-1 QA v2

A question-answering dataset for supervised fine-tuning of language models on Formula 1 knowledge, 1950–2025. This is the training corpus behind [`machina-sports/ayrton-1`](https://huggingface.co/machina-sports/ayrton-1).

## What's in it

Template-generated question/answer pairs covering:

- **Race results, championship standings, driver/constructor history** (1950–2025, backed by Jolpica-F1).
- **Session-level telemetry and strategy**: lap times, pit stops, stints, compound usage, top speeds (2018–2025, backed by FastF1).
- **Explicit coverage-boundary refusals**: pre-2018 FastF1-style questions paired with scope-limited responses, teaching the model to refuse out-of-window questions instead of hallucinating.

Each row exists in two parallel formats:

- **Messages format** (`*_messages.jsonl`): OpenAI-style `{role, content}` turns.
- **MLX format** (`*_mlx.jsonl`): pre-flattened prompt/response pairs consumable by `mlx_lm.lora`.

## Splits

| Split | File | Purpose |
|---|---|---|
| Train | `train_messages.jsonl` / `train_mlx.jsonl` | SFT |
| Valid | `valid_messages.jsonl` / `valid_mlx.jsonl` | 2024 season, held out during training |
| Test | `test_messages.jsonl` / `test_mlx.jsonl` | 2025 season, final evaluation |

**Leakage-safe temporal split:** validation = 2024, test = 2025, with no season overlap between splits. This prevents the trivial leakage pattern of a model memorizing a race result from train and answering the same question in test.
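The no-overlap claim for the temporal split can be checked with a few lines of Python. The sketch below is illustrative only and not taken from the source repo: it assumes every row in the `*_messages.jsonl` files carries a `meta.season` value like the row example on this card, and the helper names (`seasons_in`, `season_overlaps`, `make_row`) are hypothetical. It runs on small in-memory stand-in rows rather than the real files.

```python
import json
from itertools import combinations


def seasons_in(jsonl_lines):
    """Return the set of meta.season values found in an iterable of JSONL lines."""
    seasons = set()
    for line in jsonl_lines:
        if line.strip():  # skip blank lines
            seasons.add(json.loads(line)["meta"]["season"])
    return seasons


def season_overlaps(splits):
    """Map each pair of split names to the seasons they share; pairs with none are omitted."""
    seen = {name: seasons_in(lines) for name, lines in splits.items()}
    return {
        (a, b): seen[a] & seen[b]
        for a, b in combinations(sorted(seen), 2)
        if seen[a] & seen[b]
    }


def make_row(season):
    """Build a tiny stand-in row (assumption: real rows have this shape)."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": f"Placeholder question about {season}?"},
            {"role": "assistant", "content": "Placeholder answer."},
        ],
        "meta": {"template": "race_winner", "season": season, "source": "jolpica"},
    })


demo_splits = {
    "train": [make_row(1988), make_row(2023)],
    "valid": [make_row(2024)],
    "test": [make_row(2025)],
}
print(season_overlaps(demo_splits))  # an empty dict means no shared seasons
```

To run the same check on downloaded files, an open file handle can be passed in place of each list, for example `{"train": open("train_messages.jsonl", encoding="utf-8"), ...}`, since file objects iterate line by line.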
## Fields

Each row is a chat completion example:

```json
{
  "messages": [
    {"role": "user", "content": "Who won the 1988 Monaco Grand Prix?"},
    {"role": "assistant", "content": "Alain Prost won the 1988 Monaco Grand Prix."}
  ],
  "meta": {
    "template": "race_winner",
    "season": 1988,
    "source": "jolpica"
  }
}
```

Template metadata is preserved so downstream eval can score per-template accuracy floors.

## How it was built

1. **Ingest** upstream APIs (Jolpica-F1, FastF1) into `data/raw/`.
2. **Normalize** into parquet tables (`data/normalized/`).
3. **Template expansion**: ~40 factual templates generate QA pairs with controlled paraphrase variants.
4. **Teacher distillation**: Gemini 3 Flash generates style-improved answers on the train split, filtered by:
   - Value-level factual overlap (per-template thresholds, 0.45–0.55).
   - Strict numeric matching for lap-count/position/speed templates.
   - Hedge rejection on atomic-fact templates.
5. **Mix**: distilled rows blended into base train at ~30%.
6. **Strategy boost**: final train mix enforces 60% strategy / 40% other, with a small refusal quota (~0.8%) of coverage-boundary rows.
7. **Split** temporally (2024 valid / 2025 test).

Full pipeline: [`scripts/`](https://github.com/machinasports/ayrton-1/tree/main/scripts) in the source repo.

## Intended use

- Fine-tuning LLMs for F1 domain QA.
- Evaluating F1 knowledge in existing LLMs via the held-out splits.
- Research on temporal knowledge splits and distillation-from-proprietary-teacher recipes.

## Out of scope / limitations

- **English only.**
- **No live data.** Nothing past the 2025 season is covered.
- **FastF1-dependent rows skew toward 2018+.** Pre-2018 coverage is race-level only.
- **Teacher-distilled rows carry Gemini-3-Flash stylistic bias** by construction. The factual filter is strict, but tone may skew toward the teacher.
- **Template-generated questions are not fully natural**: they read like bench queries, not conversational prompts.
- **Known data gap:** Jolpica returns an empty payload for 1954 `constructorStandings`; not patched.

## Data sources & attribution

| Source | Used for | License |
|---|---|---|
| [Jolpica-F1](https://github.com/jolpica/jolpica-f1) | 1950–2025 race/results/standings | See upstream |
| [FastF1](https://github.com/theOehrly/Fast-F1) | 2018–2025 telemetry/strategy | MIT |
| OpenF1 | *Not used in this release* | n/a |

## License

**CC BY-NC 4.0**: free for non-commercial use with attribution. Commercial use requires agreement with the upstream source licenses in addition to this dataset's license. The teacher-distilled rows are subject to Gemini API output terms; redistribution is considered fair use within this non-commercial research release.

## Citation

```bibtex
@misc{ayrton1_qa_v2,
  title = {Ayrton-1 QA v2: A Formula 1 Question-Answering Dataset (1950-2025)},
  author = {Machina Sports},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/machina-sports/ayrton-1-qa-v2}}
}
```
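The per-template scoring mentioned under Fields might be computed along the following lines. This is a minimal sketch under stated assumptions, not the project's evaluation code: the card does not say how an answer is judged correct, so normalized exact match is used here purely as a stand-in, and `pit_stop_count` is a hypothetical template name. Only `race_winner` and the row layout come from the card's example.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """Assumed scoring rule: normalized string equality (the card does not specify one)."""
    return normalize(prediction) == normalize(reference)


def per_template_accuracy(rows, predictions, is_correct=exact_match):
    """Group rows by meta.template and return the fraction judged correct per template."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row, prediction in zip(rows, predictions):
        template = row["meta"]["template"]
        reference = row["messages"][-1]["content"]  # last turn is the assistant answer
        totals[template] += 1
        hits[template] += int(is_correct(prediction, reference))
    return {template: hits[template] / totals[template] for template in totals}


# The row example from this card, plus one placeholder row for a hypothetical template.
winner_row = {
    "messages": [
        {"role": "user", "content": "Who won the 1988 Monaco Grand Prix?"},
        {"role": "assistant", "content": "Alain Prost won the 1988 Monaco Grand Prix."},
    ],
    "meta": {"template": "race_winner", "season": 1988, "source": "jolpica"},
}
placeholder_row = {
    "messages": [
        {"role": "user", "content": "Placeholder question?"},
        {"role": "assistant", "content": "Placeholder answer."},
    ],
    "meta": {"template": "pit_stop_count", "season": 2024, "source": "fastf1"},
}

demo_rows = [winner_row, winner_row, placeholder_row]
demo_predictions = [
    "alain prost won the 1988 monaco grand prix.",   # matches after normalization
    "Ayrton Senna won the 1988 Monaco Grand Prix.",  # wrong answer
    "Placeholder  answer.",                          # matches after whitespace collapse
]
scores = per_template_accuracy(demo_rows, demo_predictions)
print(scores)
```

Because answers are full sentences, exact match is likely too strict in practice; a value-level comparison can be swapped in through the `is_correct` parameter without changing the grouping logic.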
