superdoc-dev/docx-corpus

Name: superdoc-dev/docx-corpus
Creator: superdoc-dev
Published: 2026-03-09 19:46:50
License: odc-by

Source: Hugging Face. Updated 2026-03-09. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/superdoc-dev/docx-corpus


Description:

---
license: odc-by
task_categories:
- text-classification
language:
- en
- ru
- cs
- pl
- es
- zh
- lt
- sk
- fr
- pt
- de
- it
- sv
- nl
- bg
- uk
- tr
- ja
- hu
- ko
size_categories:
- 100K<n<1M
tags:
- docx
- word-documents
- document-classification
- ooxml
pretty_name: docx-corpus
---

# docx-corpus

The largest classified corpus of Word documents. 736K+ `.docx` files from the public web, classified into 10 document types and 9 topics across 76 languages.

## Dataset Description

This dataset contains metadata for publicly available `.docx` files collected from the web. Each document has been classified by document type and topic using a two-stage pipeline: LLM labeling (Claude) of a stratified sample, followed by fine-tuned XLM-RoBERTa classifiers applied at scale.

### Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | SHA-256 hash of the file (unique identifier) |
| `filename` | string | Original filename from the source URL |
| `type` | string | Document type (10 classes) |
| `topic` | string | Document topic (9 classes) |
| `language` | string | Detected language (ISO 639-1 code) |
| `word_count` | int | Number of words in the document |
| `confidence` | float | Classification confidence (min of type and topic) |
| `url` | string | Direct download URL for the `.docx` file |

### Document Types

legal, forms, reports, policies, educational, correspondence, technical, administrative, creative, reference

### Topics

government, education, healthcare, finance, legal_judicial, technology, environment, nonprofit, general

## Download Files

Each row includes a `url` column pointing to the `.docx` file on our CDN. You can download files directly:

```python
import os

import requests
from datasets import load_dataset

ds = load_dataset("superdoc-dev/docx-corpus", split="train")

# Keep only English legal documents
legal_en = ds.filter(lambda x: x["type"] == "legal" and x["language"] == "en")

# Make sure the output directory exists before writing into it
os.makedirs("corpus", exist_ok=True)

for row in legal_en:
    resp = requests.get(row["url"], timeout=30)
    resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(f"corpus/{row['id']}.docx", "wb") as f:
        f.write(resp.content)
```

Or use the manifest API for bulk downloads:

```bash
curl "https://api.docxcorp.us/manifest?type=legal&lang=en" -o manifest.txt
wget -i manifest.txt -P ./corpus/
```

## Links

- **Website**: [docxcorp.us](https://docxcorp.us)
- **GitHub**: [superdoc-dev/docx-corpus](https://github.com/superdoc-dev/docx-corpus)
- **Built by**: [🦋 SuperDoc](https://superdoc.dev)
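
The `confidence` and `word_count` columns in the schema can be used to narrow a selection before any file is fetched. The following is a minimal sketch rather than part of the dataset's tooling: the helper names `select_rows` and `local_path`, the `0.8` confidence threshold, and the sample rows are all assumptions made for illustration. It works on any iterable of row dictionaries that follow the schema, which should include rows iterated from the loaded dataset.

```python
from pathlib import Path


def select_rows(rows, doc_type, language, min_confidence=0.8, min_words=0):
    """Yield rows of the given type and language that pass both thresholds.

    The default threshold of 0.8 is an arbitrary illustrative value,
    not a recommendation from the dataset authors.
    """
    for row in rows:
        if (
            row["type"] == doc_type
            and row["language"] == language
            and row["confidence"] >= min_confidence
            and row["word_count"] >= min_words
        ):
            yield row


def local_path(row, out_dir="corpus"):
    """Build the output path from the row's `id` (the file's SHA-256 hash)."""
    return Path(out_dir) / f"{row['id']}.docx"


# Hypothetical rows shaped like the schema; values are made up for the example.
sample = [
    {"id": "aaa", "type": "legal", "language": "en",
     "word_count": 1200, "confidence": 0.93, "url": "https://example.invalid/a"},
    {"id": "bbb", "type": "legal", "language": "en",
     "word_count": 40, "confidence": 0.41, "url": "https://example.invalid/b"},
    {"id": "ccc", "type": "forms", "language": "en",
     "word_count": 300, "confidence": 0.97, "url": "https://example.invalid/c"},
]

selected = list(select_rows(sample, "legal", "en"))
paths = [local_path(row) for row in selected]
```

Because `confidence` is the minimum of the type and topic scores, a single threshold on it filters on both classifications at once. Naming files by `id` keeps the local copy free of collisions, since the identifier is a hash of the file contents.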

