
superdoc-dev/docx-corpus

Hugging Face, updated 2026-03-09, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/superdoc-dev/docx-corpus
Resource description:
---
license: odc-by
task_categories:
- text-classification
language:
- en
- ru
- cs
- pl
- es
- zh
- lt
- sk
- fr
- pt
- de
- it
- sv
- nl
- bg
- uk
- tr
- ja
- hu
- ko
size_categories:
- 100K<n<1M
tags:
- docx
- word-documents
- document-classification
- ooxml
pretty_name: docx-corpus
---

# docx-corpus

The largest classified corpus of Word documents. 736K+ `.docx` files from the public web, classified into 10 document types and 9 topics across 76 languages.

## Dataset Description

This dataset contains metadata for publicly available `.docx` files collected from the web. Each document has been classified by document type and topic using a two-stage pipeline: LLM labeling (Claude) of a stratified sample, followed by fine-tuned XLM-RoBERTa classifiers applied at scale.

### Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | SHA-256 hash of the file (unique identifier) |
| `filename` | string | Original filename from the source URL |
| `type` | string | Document type (10 classes) |
| `topic` | string | Document topic (9 classes) |
| `language` | string | Detected language (ISO 639-1 code) |
| `word_count` | int | Number of words in the document |
| `confidence` | float | Classification confidence (min of type and topic) |
| `url` | string | Direct download URL for the `.docx` file |

### Document Types

legal, forms, reports, policies, educational, correspondence, technical, administrative, creative, reference

### Topics

government, education, healthcare, finance, legal_judicial, technology, environment, nonprofit, general

## Download Files

Each row includes a `url` column pointing to the `.docx` file on our CDN. You can download files directly:

```python
import os

from datasets import load_dataset
import requests

ds = load_dataset("superdoc-dev/docx-corpus", split="train")

# Make sure the output directory exists before writing into it
os.makedirs("corpus", exist_ok=True)

# Filter and download
legal_en = ds.filter(lambda x: x["type"] == "legal" and x["language"] == "en")
for row in legal_en:
    resp = requests.get(row["url"], timeout=30)
    resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(f"corpus/{row['id']}.docx", "wb") as f:
        f.write(resp.content)
```

Or use the manifest API for bulk downloads:

```bash
curl "https://api.docxcorp.us/manifest?type=legal&lang=en" -o manifest.txt
wget -i manifest.txt -P ./corpus/
```

## Links

- **Website**: [docxcorp.us](https://docxcorp.us)
- **GitHub**: [superdoc-dev/docx-corpus](https://github.com/superdoc-dev/docx-corpus)
- **Built by**: [🦋 SuperDoc](https://superdoc.dev)
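For the download workflow described in this card, the schema's `confidence` and `word_count` columns can narrow a selection before any file is fetched. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `select_rows`, `local_path`, and `download`, the default thresholds, and the use of the standard-library `urllib.request` in place of `requests` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Iterator
import urllib.request


def select_rows(rows: Iterable[dict], doc_type: str, language: str,
                min_confidence: float = 0.8, min_words: int = 0) -> Iterator[dict]:
    """Yield rows of the given type and language above the given floors.

    The 0.8 default is an arbitrary illustrative threshold, not a value
    recommended by the dataset card.
    """
    for row in rows:
        if (row["type"] == doc_type
                and row["language"] == language
                and row["confidence"] >= min_confidence
                and row["word_count"] >= min_words):
            yield row


def local_path(row: dict, out_dir: str = "corpus") -> Path:
    """Build the destination path from the row's SHA-256 `id`."""
    return Path(out_dir) / f"{row['id']}.docx"


def download(rows: Iterable[dict], out_dir: str = "corpus",
             timeout: float = 30.0) -> int:
    """Fetch each row's `url`, skipping files already on disk.

    Returns the number of files fetched in this call.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    fetched = 0
    for row in rows:
        dest = local_path(row, out_dir)
        if dest.exists():
            continue  # already downloaded in an earlier run
        with urllib.request.urlopen(row["url"], timeout=timeout) as resp:
            dest.write_bytes(resp.read())
        fetched += 1
    return fetched
```

With the dataset loaded as `ds`, a call such as `download(select_rows(ds, "legal", "en", min_confidence=0.9))` would fetch only the higher-confidence English legal documents. Because `confidence` is the minimum of the type and topic scores, a single floor constrains both labels at once. Naming files by `id` makes the download resumable, since a rerun skips anything already on disk.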

Provider:
superdoc-dev