cetusian/markdown-table-qa-20

Name: cetusian/markdown-table-qa-20
Creator: cetusian
Published: 2026-04-04 09:55:11
License: No description available

Hugging Face, updated 2026-04-04, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/cetusian/markdown-table-qa-20

Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: response
    dtype: string
  - name: domain
    dtype: string
  - name: question_type
    dtype: string
  - name: n_rows
    dtype: int64
  - name: n_cols
    dtype: int64
  - name: numeric_cols
    list: string
  - name: categorical_cols
    list: string
  splits:
  - name: train
    num_examples: 2000
  - name: validation
    num_examples: 200
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# Markdown Table QA Dataset: Part 20/20

Part **20** of a 20-dataset collection for training and evaluating language models on structured table understanding and computational reasoning. Each part contains **2,200 samples** (2,000 train + 200 validation) with step-by-step reasoning traces.

See the full collection: [cetusian/markdown-table-qa-01](https://huggingface.co/datasets/cetusian/markdown-table-qa-01) through [cetusian/markdown-table-qa-20](https://huggingface.co/datasets/cetusian/markdown-table-qa-20)

Parent dataset: [cetusian/markdown-table-qa](https://huggingface.co/datasets/cetusian/markdown-table-qa) (11,000 samples)

---

## What's in it

Each sample contains a markdown table paired with a natural language question and a detailed answer with step-by-step reasoning:

| Field | Description |
|---|---|
| `instruction` | Natural language question about the table |
| `input` | The markdown table |
| `response` | Answer with `<think>...</think>` reasoning trace followed by a final answer |
| `domain` | Table domain (e.g. `healthcare_appointments`, `wildlife_survey`) |
| `question_type` | One of 12 types, equally balanced (~167 train + ~17 val per type) |

### Reasoning format

Every response includes a detailed `<think>` block that:

- Quotes **exact cell values** from the table
- Shows **all arithmetic step by step** (`a + b = c; c + d = e`)
- Enumerates rows explicitly by name for counting tasks
- Never skips to final results

---

## Question types (equally balanced)

| Type | Description |
|---|---|
| `sum` | Sum a numeric column |
| `mean` | Average of a numeric column |
| `max_row` | Row with highest value |
| `min_row` | Row with lowest value |
| `filtered_sum` | Sum with a filter condition |
| `filtered_count` | Count with a filter condition |
| `percentage` | Percentage of rows matching a condition |
| `rank_top3` | Top 3 rows by a numeric column |
| `comparison` | Compare values between two rows |
| `lookup` | Look up a specific cell value |
| `compound` | Multi-part question combining lookups |
| `summarization` | Summarize the entire table |

Computational types have **mathematically verified answers** computed with pandas.

---

## Domains

35 real-world domains covering diverse table structures including healthcare, finance, sports, e-commerce, energy, wildlife, logistics, and more.

---

## How to use

```python
from datasets import load_dataset

ds = load_dataset("cetusian/markdown-table-qa-20")

# Load all 20 parts
from datasets import concatenate_datasets

all_train = concatenate_datasets([
    load_dataset(f"cetusian/markdown-table-qa-{i:02d}", split="train")
    for i in range(1, 21)
])
# -> 40,000 training samples
```

---

## Generation

Generated using a pipeline built on **[vLLM](https://github.com/vllm-project/vllm)** with **OpenAI gpt-oss-120b** (4 GPUs, tensor parallelism). Quality-filtered for proper reasoning traces, answer grounding, and balanced type distribution.

---

## About Surogate

**[Surogate](https://surogate.ai)** is a full-stack AgentOps platform for developing, deploying, evaluating, and monitoring reliable AI agents, built by [Invergent AI](https://github.com/invergent-ai/surogate).
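
The card describes each sample as a markdown table in `input` and a `response` made of a `<think>...</think>` trace followed by a final answer. The following is a minimal sketch of how a consumer might split such a response and recompute a `sum` answer from the table. The helper names (`split_response`, `parse_markdown_table`, `column_sum`) and the sample row are hypothetical, not part of the dataset or its pipeline. The card says answers were verified with pandas; this sketch uses only the standard library to stay dependency-free, and it assumes simple pipe-delimited tables without escaped pipes.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SEPARATOR_RE = re.compile(r":?-+:?")


def split_response(response):
    """Return (reasoning, final_answer) from a response string."""
    match = THINK_RE.search(response)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()


def parse_markdown_table(table):
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        inner = line.removeprefix("|").removesuffix("|")
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |---|---|
        if all(SEPARATOR_RE.fullmatch(cell) for cell in cells):
            continue
        rows.append(cells)
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def column_sum(rows, column):
    """Sum one column, assuming every cell in it parses as a number."""
    return sum(float(row[column]) for row in rows)


# Hypothetical sample in the shape the card describes (not real dataset content).
sample = {
    "input": "| Species | Count |\n|---|---|\n| Elk | 12 |\n| Fox | 30 |\n| Owl | 8 |",
    "response": "<think>12 + 30 = 42; 42 + 8 = 50</think>\nThe total Count is 50.",
}

reasoning, answer = split_response(sample["response"])
rows = parse_markdown_table(sample["input"])
total = column_sum(rows, "Count")
print(reasoning)
print(answer)
print(total)
```

With real data, `sample` would be a row such as `ds["train"][0]`, and the column to sum would have to be inferred from the question or from `numeric_cols`.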

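
The reasoning format calls for arithmetic written out as `a + b = c; c + d = e`. A rough sketch of how one might spot-check those steps in a `<think>` trace is shown here. The function name `check_addition_steps` is hypothetical, and the pattern only handles two-operand additions in exactly that form; chained sums such as `a + b + c = d`, other operators, and unusual number formats are outside its scope and could be misread.

```python
import math
import re

NUM = r"(-?\d+(?:\.\d+)?)"
ADD_STEP_RE = re.compile(rf"{NUM}\s*\+\s*{NUM}\s*=\s*{NUM}")
# Assumption: a comma between a digit and three more digits is a thousands separator.
THOUSANDS_RE = re.compile(r"(?<=\d),(?=\d{3})")


def check_addition_steps(reasoning, tol=1e-6):
    """Return (a, b, stated_result, is_correct) for each 'a + b = c' step found."""
    text = THOUSANDS_RE.sub("", reasoning)
    results = []
    for a, b, c in ADD_STEP_RE.findall(text):
        a, b, c = float(a), float(b), float(c)
        results.append((a, b, c, math.isclose(a + b, c, abs_tol=tol)))
    return results


good_steps = check_addition_steps("12 + 30 = 42; 42 + 8 = 50")
bad_steps = check_addition_steps("1,200 + 300 = 1,400")

for a, b, stated, ok in good_steps + bad_steps:
    print(f"{a} + {b} = {stated}: {'ok' if ok else 'MISMATCH'}")
```

A check like this only confirms internal consistency of the written steps; it does not confirm that the quoted values actually appear in the table.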

Provider:

cetusian
