cetusian/markdown-table-qa-20
Source: Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/cetusian/markdown-table-qa-20
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: response
    dtype: string
  - name: domain
    dtype: string
  - name: question_type
    dtype: string
  - name: n_rows
    dtype: int64
  - name: n_cols
    dtype: int64
  - name: numeric_cols
    list: string
  - name: categorical_cols
    list: string
  splits:
  - name: train
    num_examples: 2000
  - name: validation
    num_examples: 200
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# Markdown Table QA Dataset — Part 20/20
Part **20** of a 20-dataset collection for training and evaluating language models on structured table understanding and computational reasoning. Each part contains **2,200 samples** (2,000 train + 200 validation) with step-by-step reasoning traces.
See the full collection: [cetusian/markdown-table-qa-01](https://huggingface.co/datasets/cetusian/markdown-table-qa-01) through [cetusian/markdown-table-qa-20](https://huggingface.co/datasets/cetusian/markdown-table-qa-20)
Parent dataset: [cetusian/markdown-table-qa](https://huggingface.co/datasets/cetusian/markdown-table-qa) (11,000 samples)
---
## What's in it
Each sample contains a markdown table paired with a natural language question and a detailed answer with step-by-step reasoning:
| Field | Description |
|---|---|
| `instruction` | Natural language question about the table |
| `input` | The markdown table |
| `response` | Answer with `<think>...</think>` reasoning trace followed by a final answer |
| `domain` | Table domain (e.g. `healthcare_appointments`, `wildlife_survey`) |
| `question_type` | One of 12 types — equally balanced (~167 train + ~17 val per type) |
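For orientation, the record layout can be sketched as a typed dictionary. The field names and types follow the card metadata; the `build_prompt` helper, its prompt layout, and the example record are illustrative assumptions and are not taken from the dataset.

```python
from typing import List, TypedDict


class TableQARecord(TypedDict):
    """One dataset row, mirroring the features listed in the card metadata."""
    id: str
    instruction: str
    input: str
    response: str
    domain: str
    question_type: str
    n_rows: int
    n_cols: int
    numeric_cols: List[str]
    categorical_cols: List[str]


def build_prompt(record: TableQARecord) -> str:
    """Join table and question into one prompt (hypothetical layout)."""
    return f"{record['input']}\n\nQuestion: {record['instruction']}"


# Made-up record for illustration only; it does not come from the dataset.
example: TableQARecord = {
    "id": "example-0",
    "instruction": "What is the total of the Count column?",
    "input": "| Species | Count |\n|---|---|\n| Elk | 12 |\n| Fox | 30 |",
    "response": "<think>12 + 30 = 42</think>\nThe total is 42.",
    "domain": "wildlife_survey",
    "question_type": "sum",
    "n_rows": 2,
    "n_cols": 2,
    "numeric_cols": ["Count"],
    "categorical_cols": ["Species"],
}

prompt = build_prompt(example)
print(prompt)
```

The actual prompt template used for training is not specified in this card, so adapt `build_prompt` to whatever chat format your model expects.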
### Reasoning format
Every response includes a detailed `<think>` block that:
- Quotes **exact cell values** from the table
- Shows **all arithmetic step by step** (`a + b = c; c + d = e`)
- Enumerates rows explicitly by name for counting tasks
- Never skips to final results
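A response in this format can be split into its reasoning trace and final answer with a small regular expression. The sketch below is a minimal illustration, not the authors' filtering code: `split_response` and `bad_addition_steps` are hypothetical helper names, the sample response is made up, and the arithmetic check only understands plain additions such as `a + b = c` with no thousands separators.

```python
import re
from typing import List, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
NUM = r"-?\d+(?:\.\d+)?"
# One or more "number +" terms, a last number, then "= result".
ADD_STEP_RE = re.compile(rf"((?:{NUM}\s*\+\s*)+{NUM})\s*=\s*({NUM})")


def split_response(response: str) -> Tuple[str, str]:
    """Return (reasoning trace, final answer); the trace is empty if no <think> block exists."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()


def bad_addition_steps(trace: str, tol: float = 1e-6) -> List[str]:
    """Collect addition steps whose stated result differs from the recomputed sum."""
    bad = []
    for m in ADD_STEP_RE.finditer(trace):
        terms = [float(t) for t in m.group(1).split("+")]
        if abs(sum(terms) - float(m.group(2))) > tol:
            bad.append(m.group(0))
    return bad


# Made-up response for illustration; the second step is deliberately wrong.
sample = "<think>12 + 30 = 42; 42 + 8 = 51</think>\nThe total is 51."
trace, answer = split_response(sample)
errors = bad_addition_steps(trace)
print(errors)
```

A check like this only catches arithmetic slips inside the trace; it does not confirm that the quoted values actually appear in the table.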
---
## Question types (equally balanced)
| Type | Description |
|---|---|
| `sum` | Sum a numeric column |
| `mean` | Average of a numeric column |
| `max_row` | Row with highest value |
| `min_row` | Row with lowest value |
| `filtered_sum` | Sum with a filter condition |
| `filtered_count` | Count with a filter condition |
| `percentage` | Percentage of rows matching a condition |
| `rank_top3` | Top 3 rows by a numeric column |
| `comparison` | Compare values between two rows |
| `lookup` | Look up a specific cell value |
| `compound` | Multi-part question combining lookups |
| `summarization` | Summarize the entire table |
Computational types have **mathematically verified answers** computed with pandas.
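To spot-check a computational answer yourself, parse the markdown table in `input` and recompute the value. The card states that the reference answers were computed with pandas; the sketch below uses only the standard library instead, assumes a simple pipe-delimited table with no escaped `|` characters, and uses a made-up table and hypothetical helper names.

```python
from typing import Dict, List

Row = Dict[str, str]


def parse_markdown_table(table: str) -> List[Row]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""

    def cells(line: str) -> List[str]:
        line = line.strip()
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        return [cell.strip() for cell in line.split("|")]

    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(ln))) for ln in lines[2:]]


def column_sum(rows: List[Row], col: str) -> float:
    """`sum`: total of a numeric column."""
    return sum(float(r[col]) for r in rows)


def max_row(rows: List[Row], col: str) -> Row:
    """`max_row`: the row with the highest value in a numeric column."""
    return max(rows, key=lambda r: float(r[col]))


def filtered_sum(rows: List[Row], col: str, where_col: str, where_value: str) -> float:
    """`filtered_sum`: total of a column over rows matching an equality filter."""
    return sum(float(r[col]) for r in rows if r[where_col] == where_value)


def percentage(rows: List[Row], where_col: str, where_value: str) -> float:
    """`percentage`: share of rows matching an equality filter, in percent."""
    matches = sum(1 for r in rows if r[where_col] == where_value)
    return 100.0 * matches / len(rows)


# Made-up table for illustration only; it does not come from the dataset.
table = "\n".join([
    "| Species | Region | Count |",
    "|---|---|---|",
    "| Elk | North | 12 |",
    "| Fox | South | 30 |",
    "| Owl | North | 8 |",
])
rows = parse_markdown_table(table)
total = column_sum(rows, "Count")
top = max_row(rows, "Count")
north_total = filtered_sum(rows, "Count", "Region", "North")
north_pct = percentage(rows, "Region", "North")
```

The same parsed rows can be handed to `pandas.DataFrame(rows)` if you prefer to mirror the pandas-based verification described above.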
---
## Domains
The tables span 35 real-world domains with diverse structures, including healthcare, finance, sports, e-commerce, energy, wildlife, logistics, and more.
---
## How to use
```python
from datasets import concatenate_datasets, load_dataset

# Load this part (train and validation splits)
ds = load_dataset("cetusian/markdown-table-qa-20")

# Load the train split of all 20 parts and concatenate them
all_train = concatenate_datasets([
    load_dataset(f"cetusian/markdown-table-qa-{i:02d}", split="train")
    for i in range(1, 21)
])
# -> 40,000 training samples
```
---
## Generation
The samples were generated by a pipeline built on **[vLLM](https://github.com/vllm-project/vllm)** running **OpenAI gpt-oss-120b** (4 GPUs, tensor parallelism). They were then quality-filtered for proper reasoning traces, answer grounding, and balanced type distribution.
---
## About Surogate
**[Surogate](https://surogate.ai)** is a full-stack AgentOps platform for developing, deploying, evaluating, and monitoring reliable AI agents — built by [Invergent AI](https://github.com/invergent-ai/surogate).
---
Provider: cetusian



