Tim-Pinecone/sec-10k-qa
Source: Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/Tim-Pinecone/sec-10k-qa
Description:
---
license: mit
task_categories:
- question-answering
- text-retrieval
language:
- en
tags:
- sec
- 10-k
- rag
- chunking
- mtcb
- finance
pretty_name: SEC 10-K QA (MTCB)
size_categories:
- 1K<n<10K
configs:
- config_name: corpus
  data_files:
  - split: train
    path: data/corpus/train-00000-of-00001.parquet
- config_name: questions
  data_files:
  - split: train
    path: data/questions/train-00000-of-00001.parquet
---
# SEC 10-K QA Dataset
A retrieval QA dataset built from SEC 10-K annual filings, designed for benchmarking
RAG chunking strategies with [MTCB](https://github.com/chonkie-inc/mtcb).
## Contents
| Config | Rows | Description |
|-------|------|-------------|
| `corpus` | 95 | Cleaned 10-K filing text (20 companies × 5 years) |
| `questions` | 950 | QA pairs generated from corpus chunks |
## Companies
AAPL, MSFT, GOOGL, AMZN, TSLA, JPM, JNJ, UNH, V, PG,
NVDA, META, BRK, XOM, WMT, BAC, PFE, DIS, NFLX, AMD
## Schema
**corpus**
- `document_id` — filing identifier (ticker + accession number)
- `text` — cleaned filing text
**questions**
- `question` — question about a passage in the filing
- `answer` — answer to the question
- `chunk_must_contain` — verbatim excerpt from the source chunk (ground truth for retrieval)
- `document_id` — links back to corpus
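As a sanity check on the schema above, the sketch below joins question rows to corpus rows through `document_id` and reports any question whose `chunk_must_contain` excerpt is not found verbatim in the linked filing text. The function name `find_unmatched_questions` and the toy rows are hypothetical and only mimic the documented fields; whether every excerpt in the real dataset matches character for character is an assumption worth verifying rather than a guarantee.

```python
from typing import Dict, List


def find_unmatched_questions(corpus_rows: List[dict], question_rows: List[dict]) -> List[int]:
    """Return indices of questions whose excerpt is not found verbatim in the linked filing."""
    # Map each document_id to its filing text for constant-time lookup.
    text_by_id: Dict[str, str] = {row["document_id"]: row["text"] for row in corpus_rows}
    unmatched = []
    for i, q in enumerate(question_rows):
        text = text_by_id.get(q["document_id"])
        # Unmatched if the document is missing or the excerpt is absent from its text.
        if text is None or q["chunk_must_contain"] not in text:
            unmatched.append(i)
    return unmatched


# Hypothetical toy rows that mimic the schema; real rows come from the dataset.
toy_corpus = [{"document_id": "DOC-1", "text": "Net sales increased 8% year over year."}]
toy_questions = [
    {"question": "How much did net sales grow?", "answer": "8%",
     "chunk_must_contain": "Net sales increased 8%", "document_id": "DOC-1"},
    {"question": "Orphan question", "answer": "n/a",
     "chunk_must_contain": "not present", "document_id": "DOC-2"},
]
unmatched = find_unmatched_questions(toy_corpus, toy_questions)
print(unmatched)
```

With the toy rows, only the second question is reported, because its `document_id` has no corpus entry.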
## Usage with MTCB
```python
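from datasets import load_dataset
from mtcb import SimpleEvaluator

# The dataset defines two configs, each with a single "train" split,
# so the config name has to be passed explicitly.
corpus_ds = load_dataset("Tim-Pinecone/sec-10k-qa", "corpus", split="train")
questions_ds = load_dataset("Tim-Pinecone/sec-10k-qa", "questions", split="train")

corpus = [row["text"] for row in corpus_ds]
questions = [row["question"] for row in questions_ds]
# Verbatim excerpts that a correctly retrieved chunk must contain.
passages = [row["chunk_must_contain"] for row in questions_ds]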
```
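To illustrate how `chunk_must_contain` can act as retrieval ground truth, here is a minimal sketch that does not use MTCB and is not its scoring method. It assumes a plain character-window chunker and a verbatim substring check; `fixed_size_chunks`, `hit_at_k`, and the sample text are hypothetical. In a real run the chunk list would be ranked by a retriever, whereas here the chunks are simply taken in document order.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into character windows of chunk_size, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def hit_at_k(retrieved_chunks: List[str], excerpt: str, k: int) -> bool:
    """True if any of the first k retrieved chunks contains the excerpt verbatim."""
    return any(excerpt in chunk for chunk in retrieved_chunks[:k])


# Hypothetical filing text and ground-truth excerpt.
text = "Revenue grew 12% in fiscal 2023. Operating margin was 30%."
excerpt = "Operating margin was 30%"

# Chunks shorter than the excerpt can never contain it.
small = fixed_size_chunks(text, chunk_size=20)
# Larger overlapping windows keep the excerpt intact in one chunk.
large = fixed_size_chunks(text, chunk_size=40, overlap=20)

print(hit_at_k(small, excerpt, k=len(small)))
print(hit_at_k(large, excerpt, k=len(large)))
```

The first check fails and the second succeeds, which is the kind of difference between chunking strategies that a benchmark built on verbatim excerpts is meant to expose.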
Provider: Tim-Pinecone



