
round-bird/georgia-high-school-sports

Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/round-bird/georgia-high-school-sports
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- dpo
- preference
- sports
- georgia
- high-school
- fine-tuning
size_categories:
- 1K<n<10K
---

# Georgia High School Sports: DPO Preference Dataset

A preference dataset for **Direct Preference Optimization (DPO)** fine-tuning, focused on Georgia high school sports. Each row contains a question, a "chosen" (better) response, and a "rejected" (worse) response, each rated by a language model judge.

This dataset was generated entirely on local hardware (Apple M4) using open-source models via [Ollama](https://ollama.com), with no cloud APIs required.

---

## What is DPO?

Direct Preference Optimization is a technique for fine-tuning language models using human (or AI) preferences. Instead of training on a single "correct" answer, you give the model pairs of answers, one better and one worse, and it learns to prefer the better style of response. To learn more, see the [original DPO paper](https://arxiv.org/abs/2305.18290).

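To make the idea concrete, here is a minimal sketch of the per-pair loss defined in the DPO paper, written with the standard library only. It assumes you already have the summed log-probabilities of each response under the model being trained (the policy) and under a frozen reference model; the function name, argument names, and the `beta=0.1` default are illustrative choices, not something this dataset prescribes.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    All four inputs are summed log-probabilities of a full response.
    `beta` scales how strongly the policy is pushed away from the reference;
    0.1 is an assumed default, not a value taken from this dataset.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy has shifted toward the chosen answer.
print(dpo_loss(-42.0, -55.0, -45.0, -50.0))
```

The loss shrinks as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. In practice a training library computes this over batches of token-level log-probabilities.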
---

## Quick Start

### Install the `datasets` library

```bash
pip install datasets
```

### Load the dataset in Python

```python
from datasets import load_dataset

dataset = load_dataset("round-bird/georgia-sports-ollama")

# Look at the first row
print(dataset["train"][0])
```

### See what columns are available

```python
print(dataset["train"].column_names)
# ['instruction', 'system_prompt', 'topic', 'date', 'citation',
#  'chosen', 'rejected', 'chosen_rating', 'rejected_rating',
#  'model_chosen', 'model_rejected']
```

### Filter to only high-confidence preference pairs

```python
# Keep only rows where the rating gap is 2 or more
strong_pairs = dataset["train"].filter(
    lambda row: row["chosen_rating"] - row["rejected_rating"] >= 2
)
print(f"Strong preference pairs: {len(strong_pairs)}")
```

---

## Dataset Overview

| Stat | Value |
|------|-------|
| Total rows | 4,078 |
| Source articles | 1,396 |
| Topic | Georgia high school sports (GHSA) |
| Article date range | April 2022 – September 2025 |
| Article source | GPB Sports Blog |
| Avg chosen response length | 525 characters |
| Avg rejected response length | 467 characters |

---

## Column Descriptions

Each row in the dataset has the following columns:

### `instruction`

**Type:** string

The question being asked. These are standalone questions about Georgia high school sports, the kind a fan or analyst might ask. They were generated from real GPB Sports articles by an LLM.

**Example:**

> "By what margin did Brooks County win its regional championship game on November 6?"

---

### `system_prompt`

**Type:** string

Context provided to the models when they generated their answers. This contains the article text that the question was based on (up to 3,000 characters), along with the article title, date, and citation. Both the chosen and rejected models received the same system prompt.

**Example (abbreviated):**

> "You are a knowledgeable Georgia high school sports analyst. Use the following article to answer the user's question accurately and in detail.
>
> Article: "Brooks Coach Freeman Dutifully Deals With Sorrow And Joy"
> Date: November 11, 2020
> Source: Jon Nelson. GPB Sports Blog...
>
> [article text follows]"

---

### `topic`

**Type:** string

Always `"Georgia high school sports"` for this dataset. Useful if you combine this with other DPO datasets and want to filter by domain.

---

### `date`

**Type:** string

The publication date of the source article (e.g., `"November 11, 2020"`).

---

### `citation`

**Type:** string

A formatted citation for the source article, including author, title, publication, date, and URL.

**Example:**

> Jon Nelson. "Brooks Coach Freeman Dutifully Deals With Sorrow And Joy." GPB Sports Blog, November 11, 2020. https://www.gpb.org/blogs/gpb-sports-blog/...

---

### `chosen`

**Type:** list of message objects

The **better** response, formatted as a conversation. This is the response that received the higher rating from the judge model. The list contains 2–3 messages:

1. `{"role": "system", "content": "..."}`: the system prompt (same as the `system_prompt` column)
2. `{"role": "user", "content": "..."}`: the question (same as `instruction`)
3. `{"role": "assistant", "content": "..."}`: the model's answer (the good one)

To get just the answer text:

```python
answer = row["chosen"][-1]["content"]
```

---

### `rejected`

**Type:** list of message objects

The **worse** response, in the same format as `chosen`. Same system prompt and question, but the assistant's answer received a lower rating.

To get just the answer text:

```python
answer = row["rejected"][-1]["content"]
```

---

### `chosen_rating`

**Type:** integer (1–5)

The judge model's rating for the chosen (better) response.


| Rating | Meaning |
|--------|---------|
| 5 | Excellent: accurate, detailed, well-sourced |
| 4 | Very good: mostly accurate with minor issues |
| 3 | Good: generally correct but lacks detail |
| 2 | Moderate: partially correct or vague |
| 1 | Poor: incorrect, irrelevant, or empty |

**Distribution in this dataset:**

| Rating | Count | Percent |
|--------|-------|---------|
| 5 | 1,203 | 29.5% |
| 4 | 2,759 | 67.7% |
| 3 | 109 | 2.7% |
| 2 | 7 | 0.2% |

---

### `rejected_rating`

**Type:** integer (1–5)

The judge model's rating for the rejected (worse) response. Uses the same 1–5 scale.

**Distribution in this dataset:**

| Rating | Count | Percent |
|--------|-------|---------|
| 4 | 1,079 | 26.5% |
| 3 | 1,420 | 34.8% |
| 2 | 1,560 | 38.2% |
| 1 | 19 | 0.5% |

---

### `model_chosen`

**Type:** string

Which model generated the chosen (better) response. One of:

- `"llama3.1:8b"`
- `"mistral:7b-instruct-v0.2-q4_K_M"`

**Win rates:**

| Model | Times Chosen (won) | Percent |
|-------|-------------------|---------|
| llama3.1:8b | 2,349 | 57.6% |
| mistral:7b-instruct | 1,729 | 42.4% |

---

### `model_rejected`

**Type:** string

Which model generated the rejected (worse) response. Same possible values as `model_chosen`.

---

## How the Rating Gap Works

Every row has a gap between `chosen_rating` and `rejected_rating`. Larger gaps mean the judge was more confident about which answer was better.

| Gap | Count | Percent | Meaning |
|-----|-------|---------|---------|
| 1 | 2,574 | 63.1% | Slight preference |
| 2 | 1,410 | 34.6% | Clear preference |
| 3 | 93 | 2.3% | Strong preference |
| 4 | 1 | 0.0% | Very strong preference |

Rows where both responses received the same rating (ties) were excluded from the dataset.

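A gap table like the one above can be recomputed from the two rating columns. The following is a minimal sketch that works on any iterable of row dictionaries; the helper name `rating_gap_counts` and the sample rows are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def rating_gap_counts(rows):
    """Count rows by the difference chosen_rating - rejected_rating."""
    return Counter(row["chosen_rating"] - row["rejected_rating"] for row in rows)


# Hypothetical rows carrying only the two rating columns.
sample_rows = [
    {"chosen_rating": 5, "rejected_rating": 4},
    {"chosen_rating": 4, "rejected_rating": 2},
    {"chosen_rating": 4, "rejected_rating": 3},
]

gap_counts = rating_gap_counts(sample_rows)
for gap in sorted(gap_counts):
    share = 100 * gap_counts[gap] / len(sample_rows)
    print(f"gap {gap}: {gap_counts[gap]} rows ({share:.1f}%)")
```

With the real data, passing `dataset["train"]` in place of `sample_rows` should work the same way, since iterating over the split yields one dictionary per row.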

---

## How This Dataset Was Made

The dataset was built in four phases, all running locally on an Apple M4 Mac via Ollama:

### Phase 1: Generate Questions

- **Input:** 1,396 scraped GPB Sports articles
- **Model:** `llama3.1:8b`
- **Output:** 3 standalone questions per article → 4,148 questions

### Phase 2: Generate Responses

- **Input:** Each question + article context as system prompt
- **Model A:** `llama3.1:8b`
- **Model B:** `mistral:7b-instruct-v0.2-q4_K_M`
- **Output:** Two responses per question → 4,148 response pairs

### Phase 3: Judge Responses

- **Input:** Each question + both responses (article context NOT shown to judge)
- **Model:** `llama3.1:8b` (temperature 0.1 for consistency)
- **Output:** 1–5 rating for each response → 4,148 judgments

### Phase 4: Format

- Pairs where ratings were tied were dropped (70 rows)
- Higher-rated response → `chosen`, lower-rated → `rejected`
- **Final output:** 4,078 DPO preference pairs

---

## Intended Use

This dataset is designed for DPO fine-tuning of language models to improve response quality on Georgia high school sports questions. It can also be used for:

- Studying preference learning and reward modeling
- Benchmarking small open-source models on domain-specific QA
- Teaching DPO concepts with a real, reproducible dataset

---

## Limitations

- **Judge bias:** The judge model (`llama3.1:8b`) may have systematic biases in its ratings. It is also relatively small for a judge.
- **Self-play:** `llama3.1:8b` serves as both a response generator and the judge, which can create circular preferences.
- **Domain scope:** The data covers Georgia high school sports exclusively, so it should not be expected to generalize to other domains.
- **Article grounding:** Responses were generated with article context in the system prompt. The model is being trained to answer questions *given article context*, not from memory.
- **Small model outputs:** 7–8B parameter models produce shorter, less detailed answers than larger models.

---

## License

MIT
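
The Phase 4 formatting step described under "How This Dataset Was Made" can be sketched as follows. This is not the author's pipeline code: the input field names (`response_a`, `rating_a`, `model_a`, and so on) are hypothetical, and for brevity the sketch stores plain answer strings, whereas the published `chosen` and `rejected` columns hold full message lists.

```python
def build_preference_pairs(judgments):
    """Drop tied judgments and label the higher-rated response as chosen."""
    pairs = []
    for j in judgments:
        if j["rating_a"] == j["rating_b"]:
            continue  # ties carry no preference signal, so they are skipped
        win, lose = ("a", "b") if j["rating_a"] > j["rating_b"] else ("b", "a")
        pairs.append({
            "instruction": j["instruction"],
            "chosen": j[f"response_{win}"],
            "rejected": j[f"response_{lose}"],
            "chosen_rating": j[f"rating_{win}"],
            "rejected_rating": j[f"rating_{lose}"],
            "model_chosen": j[f"model_{win}"],
            "model_rejected": j[f"model_{lose}"],
        })
    return pairs


# Hypothetical judgments: one clear preference and one tie.
sample_judgments = [
    {"instruction": "Q1", "response_a": "short answer", "response_b": "detailed answer",
     "rating_a": 3, "rating_b": 5,
     "model_a": "llama3.1:8b", "model_b": "mistral:7b-instruct-v0.2-q4_K_M"},
    {"instruction": "Q2", "response_a": "answer", "response_b": "another answer",
     "rating_a": 4, "rating_b": 4,
     "model_a": "llama3.1:8b", "model_b": "mistral:7b-instruct-v0.2-q4_K_M"},
]

pairs = build_preference_pairs(sample_judgments)
print(len(pairs), pairs[0]["model_chosen"])
```

The same drop-ties rule accounts for the counts reported above: 4,148 judgments minus 70 ties leaves 4,078 pairs.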

Provider:
round-bird