
chuckreynolds/wikimedia-enterprise-structured-contents-enwiki

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/chuckreynolds/wikimedia-enterprise-structured-contents-enwiki
Description:
---
pretty_name: Wikimedia Enterprise Structured Contents — enwiki_namespace_0
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
- text-retrieval
tags:
- wikipedia
- wikimedia-enterprise
- structured-contents
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
---

# enwiki_namespace_0

Structured Contents snapshot of `enwiki_namespace_0` from the [Wikimedia Enterprise API](https://enterprise.wikimedia.com/docs/snapshot/), converted to Parquet.

## Source

- Upstream: [Wikimedia Enterprise Structured Contents API](https://enterprise.wikimedia.com/docs/snapshot/#structured-contents-snapshot-download-beta)
- Snapshot identifier: `enwiki_namespace_0`
- Format at source: `.tar.gz` containing sharded `.ndjson`
- Shards in this release: 3

## Processing

1. Downloaded the snapshot tarball from the Wikimedia Enterprise API.
2. Streamed each `.ndjson` shard through a normalization pass:
   - **JSON-encoded fields**: `sections`, `infoboxes`, `tables`, and `references[].metadata` are stored as JSON-encoded strings. These fields either have recursive nesting (depth > 50) that exceeds Apache Arrow's C Data Interface limit, or are open-dict structures whose keys vary across articles. Decode with `json.loads` on read.
   - Canonicalised struct field ordering (alphabetic, recursive) so schemas match byte-for-byte across shards.
3. Wrote one Parquet file per shard (zstd level 9 compression, row_group_size tuned for HF streaming).
4. Unified per-shard schemas with `pa.unify_schemas`; pinned result to `schema.json`; re-cast every shard so embedded schemas are identical.

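The normalization pass described in step 2 could look roughly like the sketch below. This is not the pipeline actually used for this release; it is a minimal, standard-library-only illustration under stated assumptions. The helper names (`sort_keys_recursive`, `normalize_record`, `normalize_ndjson_line`) are hypothetical, and the sketch operates on plain Python dicts rather than on Arrow structs.

```python
import json

# Field names come from the dataset card; everything else here is illustrative.
JSON_ENCODED_FIELDS = ("sections", "infoboxes", "tables")


def sort_keys_recursive(value):
    """Return a copy of value with dict keys in alphabetical order at every depth."""
    if isinstance(value, dict):
        return {key: sort_keys_recursive(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_keys_recursive(item) for item in value]
    return value


def normalize_record(record):
    """JSON-encode the deeply nested fields, then canonicalise key order."""
    out = dict(record)
    # Top-level fields that are stored as JSON strings.
    for field in JSON_ENCODED_FIELDS:
        if out.get(field) is not None:
            out[field] = json.dumps(out[field], ensure_ascii=False)
    # references[].metadata is encoded per reference; the list itself stays native.
    if out.get("references") is not None:
        encoded_refs = []
        for ref in out["references"]:
            ref = dict(ref)
            if ref.get("metadata") is not None:
                ref["metadata"] = json.dumps(ref["metadata"], ensure_ascii=False)
            encoded_refs.append(ref)
        out["references"] = encoded_refs
    return sort_keys_recursive(out)


def normalize_ndjson_line(line):
    """Parse one NDJSON line and normalize the resulting record."""
    return normalize_record(json.loads(line))
```

Encoding before sorting means the already-serialized strings are left untouched by the key-ordering step, so only the fields that remain native structs are reordered.
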
## Loading

```python
from datasets import load_dataset
import json

ds = load_dataset(
    "chuckreynolds/wikimedia-enterprise-structured-contents-enwiki",
    split="train",
    streaming=True,
)
row = next(iter(ds))
print(row["name"], row["url"])

# JSON-encoded columns
sections = json.loads(row["sections"])
infoboxes = json.loads(row["infoboxes"])
```

## Notes

- Fields stored as JSON strings: `sections`, `infoboxes`, `tables`, and `references[].metadata`. All other fields (`references[]`, `license[]`, `version`, `event`, etc.) retain their native Arrow struct/list types and are queryable without decoding.
- License passes through the upstream license for article text.

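Because four fields are stored as JSON strings, it can be convenient to decode all of them in one place. The helper below is a hypothetical sketch, not part of the dataset or of the `datasets` library. It assumes a row arrives as a plain dict and that `references` is a list of dicts, and it leaves `None` or already-decoded values untouched.

```python
import json

# Column names come from the dataset card; the helper names are illustrative.
JSON_STRING_COLUMNS = ("sections", "infoboxes", "tables")


def _loads_if_str(value):
    """Decode a JSON string; pass through None or already-decoded values."""
    return json.loads(value) if isinstance(value, str) else value


def decode_row(row):
    """Return a copy of row with every JSON-encoded field decoded."""
    decoded = dict(row)
    for column in JSON_STRING_COLUMNS:
        if column in decoded:
            decoded[column] = _loads_if_str(decoded[column])
    # references[] is a native list; only each entry's metadata is a JSON string.
    if decoded.get("references") is not None:
        refs = []
        for ref in decoded["references"]:
            ref = dict(ref)
            if "metadata" in ref:
                ref["metadata"] = _loads_if_str(ref["metadata"])
            refs.append(ref)
        decoded["references"] = refs
    return decoded
```

With a streaming dataset, this could be applied lazily, for example `decoded = decode_row(next(iter(ds)))`.
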
Provider:
chuckreynolds