markdown-table-qa
Markdown Table QA Dataset Overview
Basic information
- Dataset name: Markdown Table QA Dataset
- Hosted at: https://huggingface.co/datasets/cetusian/markdown-table-qa
- Size: 11,000 samples
- Splits:
  - Training set: 10,000 samples
  - Validation set: 1,000 samples
- Total size: 20,646,922 bytes
- Download size: 7,825,977 bytes
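Data structure and features
The dataset contains the following fields:
- id: sample identifier (string)
- instruction: natural-language question about the table (string)
- input: the table in Markdown format (string)
- response: conversational answer containing a <think>...</think> reasoning chain (string)
- domain: domain the table belongs to (string)
- question_type: type of question (string)
- n_rows: number of table rows (integer)
- n_cols: number of table columns (integer)
- numeric_cols: names of the numeric columns (list of strings)
- categorical_cols: names of the categorical columns (list of strings)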
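A record with these fields can be sanity-checked before use. The sketch below is a minimal illustration, not part of the dataset tooling: `EXPECTED_TYPES`, `check_record`, and the sample values are hypothetical, and the Python types are assumed from the field descriptions above.

```python
# Expected Python type for each field, assumed from the field list above.
EXPECTED_TYPES = {
    "id": str,
    "instruction": str,
    "input": str,
    "response": str,
    "domain": str,
    "question_type": str,
    "n_rows": int,
    "n_cols": int,
    "numeric_cols": list,
    "categorical_cols": list,
}


def check_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            got = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
        elif expected is list and not all(isinstance(v, str) for v in record[name]):
            problems.append(f"{name}: expected a list of strings")
    return problems


# Hypothetical record; the values are made up for illustration only.
sample = {
    "id": "sample-0",
    "instruction": "How many appointments were on Wednesday?",
    "input": "| Patient | Day |\n|---|---|\n| Alice Martin | Wednesday |",
    "response": "<think>One row has Day = Wednesday.</think> There was 1 appointment.",
    "domain": "Healthcare appointments",
    "question_type": "filtered_count",
    "n_rows": 1,
    "n_cols": 2,
    "numeric_cols": [],
    "categorical_cols": ["Patient", "Day"],
}

print(check_record(sample))
```

Returning a list of problems instead of raising on the first mismatch makes it easier to audit many records in one pass.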
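Data content and example
Each sample contains a Markdown table, a natural-language question, and a conversational answer. The answer includes explicit reasoning steps.
Example:
- instruction: How many appointments were on Wednesday and how many were no-shows?
- input: a Markdown table with the columns Patient, Doctor, Day, Status, Duration (min).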
- response:
<think> Looking at rows where Day = Wednesday: Alice Martin (Attended) and Bob Chen (No-show). That is 2 appointments, 1 no-show. </think> There were 2 appointments on Wednesday. One was attended and one was a no-show — Bob Chen with Dr. Patel.
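Because the reasoning chain and the final answer share one `response` string, downstream code will usually want to separate them. The following is a minimal sketch, assuming each response has at most one `<think>...</think>` block; `split_response` is a hypothetical helper and the sample string is a shortened version of the response above.

```python
import re

# Non-greedy match so only the content of the first <think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response):
    """Split a response into (reasoning, answer); reasoning is '' if no block is found."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the final answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


sample_response = (
    "<think> Looking at rows where Day = Wednesday: Alice Martin (Attended) "
    "and Bob Chen (No-show). That is 2 appointments, 1 no-show. </think> "
    "There were 2 appointments on Wednesday."
)

reasoning, answer = split_response(sample_response)
print(answer)
```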
Domain coverage
The dataset covers 15 real-world domains:
- Healthcare appointments
- Social media campaigns
- Employee HR & performance
- E-commerce products
- Student grades
- Project tracking
- Retail store performance
- Financial transactions
- Sports team stats
- Inventory management
- Customer support tickets
- Marketing leads
- Event registrations
- Restaurant menus
- Flight operations
Question type distribution
The dataset contains 12 question types, distributed as follows:
| Type | Training samples | Validation samples | Example |
|---|---|---|---|
| comparison | 859 | 84 | "Which team had the better win rate, Lions or Eagles?" |
| compound | 858 | 84 | "How many no-shows on Wednesday and which doctor had the most?" |
| filtered_count | 859 | 83 | "How many campaigns ran on Instagram?" |
| filtered_sum | 859 | 83 | "What is the total sales for the North region?" |
| lookup | 858 | 84 | "What was Alice's performance score?" |
| max_row | 835 | 83 | "Which product had the highest unit price?" |
| mean | 848 | 83 | "What is the average delivery time?" |
| min_row | 770 | 83 | "Which employee had the fewest absences?" |
| percentage | 851 | 83 | "What percentage of orders were returned?" |
| rank_top3 | 800 | 83 | "What are the top 3 agents by CSAT score?" |
| sum | 745 | 83 | "What is the total prep time across all menu items?" |
| summarization | 858 | 84 | "Summarize the data in this table." |
| Total | 10,000 | 1,000 | |
Answers to the computational question types (sum, mean, filtered_sum, filtered_count, max_row, min_row, percentage, rank_top3) were verified mathematically with pandas before the reasoning chains were generated.
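The kind of check described here can be sketched as follows. The dataset authors used pandas; this example deliberately swaps in a standard-library stand-in so it runs without dependencies. The table contents, `parse_markdown_table`, `count_where`, and `column_sum` are all hypothetical, and the parser assumes simple tables with no escaped pipes inside cells.

```python
# Hypothetical table using the column names from the example above; rows are made up.
TABLE = """
| Patient | Doctor | Day | Status | Duration (min) |
|---|---|---|---|---|
| Alice Martin | Dr. Lee | Wednesday | Attended | 30 |
| Bob Chen | Dr. Patel | Wednesday | No-show | 45 |
| Cara Diaz | Dr. Patel | Friday | Attended | 20 |
"""


def parse_markdown_table(text):
    """Parse a simple Markdown table into a list of {column: cell} dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]

    def cells(line):
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(line))) for line in lines[2:]]


def count_where(rows, conditions):
    """filtered_count: number of rows whose cells equal every given condition."""
    return sum(all(r[col] == val for col, val in conditions.items()) for r in rows)


def column_sum(rows, column):
    """sum: total of a numeric column, converting cell text to float."""
    return sum(float(r[column]) for r in rows)


rows = parse_markdown_table(TABLE)
print(count_where(rows, {"Day": "Wednesday"}))
print(count_where(rows, {"Day": "Wednesday", "Status": "No-show"}))
print(column_sum(rows, "Duration (min)"))
```

Computing the reference value from the table itself, and only then asking a model to explain it, keeps arithmetic errors out of the stored answers.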
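Generation method
- Table generation: tables are synthesized with randomized schemas, row counts (5–20), and column counts (3–6).
- Descriptive QA generation: a 120B model generates both the questions and the conversational answers (covering the comparison, lookup, compound, summarization, and filtered_count types).
- Computational QA generation: pandas computes the verified answers; the 120B model generates only the <think> reasoning chain (covering the sum, mean, max_row, min_row, percentage, rank_top3, and filtered_sum types).
- Quality assurance: deduplication, answer grounding checks, and type balancing were applied.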
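The table-generation step might look roughly like the sketch below. Only the row range (5–20) and column range (3–6) come from the description above; the column pools, value ranges, and the function names `generate_table` and `to_markdown` are hypothetical, since the source does not describe the actual schemas.

```python
import random

# Hypothetical column pools; the real generator's schemas are not documented here.
CATEGORICAL = {
    "Day": ["Monday", "Tuesday", "Wednesday"],
    "Status": ["Attended", "No-show", "Cancelled"],
    "Doctor": ["Dr. Patel", "Dr. Lee"],
}
NUMERIC = {"Duration (min)": (15, 90), "Fee": (20, 200), "Age": (18, 90)}


def generate_table(rng):
    """Draw a random schema and fill it with random cell values."""
    n_rows = rng.randint(5, 20)  # row range stated in the description
    n_cols = rng.randint(3, 6)   # column range stated in the description
    names = rng.sample(list(CATEGORICAL) + list(NUMERIC), n_cols)
    rows = []
    for _ in range(n_rows):
        row = []
        for name in names:
            if name in CATEGORICAL:
                row.append(rng.choice(CATEGORICAL[name]))
            else:
                low, high = NUMERIC[name]
                row.append(str(rng.randint(low, high)))
        rows.append(row)
    return names, rows


def to_markdown(names, rows):
    """Render a header, a separator row, and one line per data row."""
    lines = ["| " + " | ".join(names) + " |"]
    lines.append("|" + "|".join(["---"] * len(names)) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


rng = random.Random(0)  # seeded so repeated runs produce the same table
names, rows = generate_table(rng)
markdown = to_markdown(names, rows)
print(markdown)
```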
Usage
```python
from datasets import load_dataset

ds = load_dataset("cetusian/markdown-table-qa")
```
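Once loaded, each record's `input` and `instruction` fields typically need to be combined into a single prompt. The sketch below shows one possible way to do that; the template wording and the `build_prompt` helper are assumptions for illustration, not a format prescribed by the dataset, and the sample record is made up.

```python
def build_prompt(example):
    """Join the table and the question into one prompt string (hypothetical template)."""
    return (
        "Answer the question using the table below.\n\n"
        f"{example['input']}\n\n"
        f"Question: {example['instruction']}"
    )


# Made-up record with the same field names as the dataset.
example = {
    "instruction": "How many appointments were on Wednesday?",
    "input": "| Patient | Day |\n|---|---|\n| Alice Martin | Wednesday |",
}

prompt = build_prompt(example)
print(prompt)
```

If the splits are exposed under a name such as `train` (an assumption based on the split list above), a call like `ds["train"].filter(lambda ex: ex["question_type"] == "lookup")` would select a single question type before building prompts.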
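Background and purpose
The dataset was created for the Open Source Hack Day: Surogate / Invergent AI hackathon (April 4, 2025). Its purpose is to compare supervised fine-tuning (SFT) with reinforcement learning (GRPO) when fine-tuning small models (such as Qwen3-0.6B / Qwen2.5-0.8B) on Markdown table understanding, and to measure how much reinforcement learning improves over the supervised baseline.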




