superdoc-dev/docx-corpus
Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/superdoc-dev/docx-corpus
Resource description:
---
license: odc-by
task_categories:
- text-classification
language:
- en
- ru
- cs
- pl
- es
- zh
- lt
- sk
- fr
- pt
- de
- it
- sv
- nl
- bg
- uk
- tr
- ja
- hu
- ko
size_categories:
- 100K<n<1M
tags:
- docx
- word-documents
- document-classification
- ooxml
pretty_name: docx-corpus
---
# docx-corpus
The largest classified corpus of Word documents. 736K+ `.docx` files from the public web, classified into 10 document types and 9 topics across 76 languages.
## Dataset Description
This dataset contains metadata for publicly available `.docx` files collected from the web. Each document has been classified by document type and topic using a two-stage pipeline: LLM labeling (Claude) of a stratified sample, followed by fine-tuned XLM-RoBERTa classifiers applied at scale.
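To make the first stage more concrete, the sketch below draws a stratified sample from a list of metadata rows, taking up to a fixed number of rows per stratum. Using `language` as the stratum key, the per-stratum quota, and the helper name `stratified_sample` are all assumptions for illustration; this card does not say how the actual sample was stratified.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key="language", per_stratum=2, seed=0):
    """Return up to `per_stratum` rows from each group of `rows[key]`.

    Hypothetical helper: the real pipeline's strata and quotas are not
    documented in this card.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for stratum in sorted(groups):  # sorted for a deterministic order
        members = groups[stratum]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy rows: many English documents, few Lithuanian ones.
rows = [{"id": f"en-{i}", "language": "en"} for i in range(100)]
rows += [{"id": f"lt-{i}", "language": "lt"} for i in range(3)]

picked = stratified_sample(rows, per_stratum=2)
```

Sampling per stratum keeps rare groups represented in the labeled set even when one group dominates the raw crawl.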
### Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | SHA-256 hash of the file (unique identifier) |
| `filename` | string | Original filename from the source URL |
| `type` | string | Document type (10 classes) |
| `topic` | string | Document topic (9 classes) |
| `language` | string | Detected language (ISO 639-1 code) |
| `word_count` | int | Number of words in the document |
| `confidence` | float | Classification confidence (min of type and topic) |
| `url` | string | Direct download URL for the `.docx` file |
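Because `confidence` is the minimum of the type and topic confidences, a single threshold on it bounds both labels at once. The following is a minimal sketch on plain dictionaries shaped like the schema above; the threshold `0.8`, the sample rows, and the helper names are illustrative assumptions, not values taken from the dataset.

```python
def combined_confidence(type_conf: float, topic_conf: float) -> float:
    """Combine two classifier confidences the way the schema describes."""
    return min(type_conf, topic_conf)


def keep_confident(rows, threshold=0.8):
    """Keep rows whose combined confidence is at least `threshold`."""
    return [row for row in rows if row["confidence"] >= threshold]


# Made-up rows for illustration only.
rows = [
    {"id": "a", "type": "legal", "topic": "government",
     "confidence": combined_confidence(0.97, 0.91)},
    {"id": "b", "type": "forms", "topic": "general",
     "confidence": combined_confidence(0.99, 0.42)},
]

confident = keep_confident(rows)
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`.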
### Document Types
legal, forms, reports, policies, educational, correspondence, technical, administrative, creative, reference
### Topics
government, education, healthcare, finance, legal_judicial, technology, environment, nonprofit, general
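The two label sets can be written down as constants and used to validate rows or tally label combinations. This sketch assumes the `type` and `topic` columns hold exactly the strings listed above; `count_labels` and the sample rows are hypothetical.

```python
from collections import Counter

DOCUMENT_TYPES = {
    "legal", "forms", "reports", "policies", "educational",
    "correspondence", "technical", "administrative", "creative", "reference",
}
TOPICS = {
    "government", "education", "healthcare", "finance", "legal_judicial",
    "technology", "environment", "nonprofit", "general",
}


def count_labels(rows):
    """Count (type, topic) pairs, rejecting labels outside the listed sets."""
    counts = Counter()
    for row in rows:
        if row["type"] not in DOCUMENT_TYPES:
            raise ValueError(f"unknown type: {row['type']!r}")
        if row["topic"] not in TOPICS:
            raise ValueError(f"unknown topic: {row['topic']!r}")
        counts[(row["type"], row["topic"])] += 1
    return counts


# Made-up rows for illustration only.
rows = [
    {"type": "legal", "topic": "government"},
    {"type": "legal", "topic": "government"},
    {"type": "reports", "topic": "finance"},
]
counts = count_labels(rows)
```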
## Download Files
Each row includes a `url` column pointing to the `.docx` file on our CDN. You can download files directly:
```python
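import os

import requests
from datasets import load_dataset

ds = load_dataset("superdoc-dev/docx-corpus", split="train")

# Keep only English legal documents
legal_en = ds.filter(lambda x: x["type"] == "legal" and x["language"] == "en")

# Download each file, named by its SHA-256 id
os.makedirs("corpus", exist_ok=True)
for row in legal_en:
    resp = requests.get(row["url"], timeout=30)
    resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(f"corpus/{row['id']}.docx", "wb") as f:
        f.write(resp.content)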
```
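Since `id` is described as the SHA-256 hash of the file, a downloaded file can be checked against it. The sketch below assumes `id` is the hexadecimal digest of the raw file bytes, which this card does not state explicitly; `matches_id` is a hypothetical helper name.

```python
import hashlib


def matches_id(content: bytes, expected_id: str) -> bool:
    """Return True if the SHA-256 hex digest of `content` equals `expected_id`."""
    digest = hashlib.sha256(content).hexdigest()
    return digest == expected_id.lower()


# Stand-in bytes; a real check would use `resp.content` from the download loop.
payload = b"PK\x03\x04 not a real docx"
payload_id = hashlib.sha256(payload).hexdigest()
```

Calling `matches_id(resp.content, row["id"])` before writing to disk would flag truncated or altered downloads, provided the hex-digest assumption holds.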
Or use the manifest API for bulk downloads:
```bash
curl "https://api.docxcorp.us/manifest?type=legal&lang=en" -o manifest.txt
wget -i manifest.txt -P ./corpus/
```
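The `wget -i` call implies that the manifest is a plain-text list with one URL per line. Under that assumption (the response format is not documented here), the manifest can also be parsed in Python, for example to derive a local filename for each entry before downloading. The example URLs and the helper `parse_manifest` are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_manifest(text: str):
    """Yield (url, filename) pairs from a newline-delimited URL list."""
    for line in text.splitlines():
        url = line.strip()
        if not url:  # skip blank lines
            continue
        filename = PurePosixPath(urlparse(url).path).name
        yield url, filename


# Made-up manifest content; real URLs come from the manifest API.
sample = """https://cdn.example.com/files/abc123.docx

https://cdn.example.com/files/def456.docx
"""

entries = list(parse_manifest(sample))
```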
## Links
- **Website**: [docxcorp.us](https://docxcorp.us)
- **GitHub**: [superdoc-dev/docx-corpus](https://github.com/superdoc-dev/docx-corpus)
- **Built by**: [🦋 SuperDoc](https://superdoc.dev)
Provider:
superdoc-dev



