
Tim-Pinecone/sec-10k-qa

Hugging Face | Updated 2026-04-05 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Tim-Pinecone/sec-10k-qa
Resource description:
---
license: mit
task_categories:
- question-answering
- text-retrieval
language:
- en
tags:
- sec
- 10-k
- rag
- chunking
- mtcb
- finance
pretty_name: SEC 10-K QA (MTCB)
size_categories:
- 1K<n<10K
configs:
- config_name: corpus
  data_files:
  - split: train
    path: data/corpus/train-00000-of-00001.parquet
- config_name: questions
  data_files:
  - split: train
    path: data/questions/train-00000-of-00001.parquet
---

# SEC 10-K QA Dataset

A retrieval QA dataset built from SEC 10-K annual filings, designed for benchmarking RAG chunking strategies with [MTCB](https://github.com/chonkie-inc/mtcb).

## Contents

| Config | Rows | Description |
|--------|------|-------------|
| `corpus` | 95 | Cleaned 10-K filing text (20 companies × 5 years) |
| `questions` | 950 | QA pairs generated from corpus chunks |

Each config contains a single `train` split.

## Companies

AAPL, MSFT, GOOGL, AMZN, TSLA, JPM, JNJ, UNH, V, PG, NVDA, META, BRK, XOM, WMT, BAC, PFE, DIS, NFLX, AMD

## Schema

**corpus**

- `document_id`: filing identifier (ticker + accession number)
- `text`: cleaned filing text

**questions**

- `question`: question about a passage in the filing
- `answer`: answer to the question
- `chunk_must_contain`: verbatim excerpt from the source chunk (ground truth for retrieval)
- `document_id`: links back to the matching `corpus` row

## Usage with MTCB

```python
from datasets import load_dataset
from mtcb import SimpleEvaluator

# The repo defines two configs ("corpus" and "questions"), each with a single
# "train" split, so the config name has to be passed explicitly.
corpus_ds = load_dataset("Tim-Pinecone/sec-10k-qa", "corpus", split="train")
questions_ds = load_dataset("Tim-Pinecone/sec-10k-qa", "questions", split="train")

corpus = [row["text"] for row in corpus_ds]
questions = [row["question"] for row in questions_ds]
passages = [row["chunk_must_contain"] for row in questions_ds]
```
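
The schema describes `chunk_must_contain` as the retrieval ground truth: a chunk counts as correct when it contains the excerpt verbatim. The sketch below illustrates that check on toy data, assuming a naive fixed-size character chunker. The helpers `chunk_text`, `excerpt_is_recoverable`, and `recoverable_fraction`, along with the sample rows, are hypothetical names made up for illustration. They are not part of MTCB's API and do not reproduce how MTCB scores a chunker.

```python
def chunk_text(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (hypothetical helper)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def excerpt_is_recoverable(chunks: list[str], excerpt: str) -> bool:
    """Return True if at least one chunk contains the excerpt verbatim."""
    return any(excerpt in chunk for chunk in chunks)


# Toy rows shaped like the dataset schema (illustrative, not real filing data).
corpus = {
    "DOC-1": "Revenue grew during the year. Operating costs were flat. Net income rose.",
}
questions = [
    {"document_id": "DOC-1", "chunk_must_contain": "Operating costs were flat."},
    {"document_id": "DOC-1", "chunk_must_contain": "Net income rose."},
]


def recoverable_fraction(size: int, overlap: int = 0) -> float:
    """Fraction of questions whose excerpt survives intact in some chunk."""
    hits = 0
    for q in questions:
        # document_id links each question back to its source document.
        chunks = chunk_text(corpus[q["document_id"]], size, overlap)
        hits += excerpt_is_recoverable(chunks, q["chunk_must_contain"])
    return hits / len(questions)


if __name__ == "__main__":
    for size in (10, 40, 1000):
        print(size, recoverable_fraction(size))
```

An excerpt that straddles a chunk boundary cannot match any single chunk, so this fraction gives a rough upper bound on what a retriever could achieve with a given chunking. On the real data, the toy `corpus` and `questions` would be replaced by the rows of the two configs.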
Provider:
Tim-Pinecone