chuckreynolds/wikimedia-enterprise-structured-contents-enwiki

Name: chuckreynolds/wikimedia-enterprise-structured-contents-enwiki
Creator: chuckreynolds
Published: 2026-04-21 18:30:14
License: No description available

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/chuckreynolds/wikimedia-enterprise-structured-contents-enwiki

Resource overview:

---
pretty_name: Wikimedia Enterprise Structured Contents — enwiki_namespace_0
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
- text-retrieval
tags:
- wikipedia
- wikimedia-enterprise
- structured-contents
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
---

# enwiki_namespace_0

Structured Contents snapshot of `enwiki_namespace_0` from the [Wikimedia Enterprise API](https://enterprise.wikimedia.com/docs/snapshot/), converted to Parquet.

## Source

- Upstream: [Wikimedia Enterprise Structured Contents API](https://enterprise.wikimedia.com/docs/snapshot/#structured-contents-snapshot-download-beta)
- Snapshot identifier: `enwiki_namespace_0`
- Format at source: `.tar.gz` containing sharded `.ndjson`
- Shards in this release: 3

## Processing

1. Downloaded the snapshot tarball from the Wikimedia Enterprise API.
2. Streamed each `.ndjson` shard through a normalization pass:
   - **JSON-encoded fields**: `sections`, `infoboxes`, `tables`, and `references[].metadata` are stored as JSON-encoded strings. These fields either have recursive nesting (depth > 50) that exceeds Apache Arrow's C Data Interface limit, or are open-dict structures whose keys vary across articles. Decode with `json.loads` on read.
   - Canonicalised struct field ordering (alphabetic, recursive) so schemas match byte-for-byte across shards.
3. Wrote one Parquet file per shard (zstd level 9 compression, row_group_size tuned for HF streaming).
4. Unified per-shard schemas with `pa.unify_schemas`; pinned result to `schema.json`; re-cast every shard so embedded schemas are identical.
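The normalization pass in step 2 can be illustrated with a minimal sketch in plain Python. This is not the pipeline's actual code, which the card does not show: the names `JSON_FIELDS`, `canonicalise`, and `normalize_record` are hypothetical, and the sketch works on plain dicts rather than Arrow structs. It assumes each `.ndjson` line has already been parsed into a dict.

```python
import json

# Top-level fields the card says are stored as JSON-encoded strings.
JSON_FIELDS = ("sections", "infoboxes", "tables")


def canonicalise(value):
    """Recursively sort dict keys alphabetically; lists keep their order."""
    if isinstance(value, dict):
        return {key: canonicalise(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalise(item) for item in value]
    return value


def normalize_record(record):
    """Return a normalized copy of one parsed ndjson record (hypothetical helper)."""
    out = dict(record)

    # Encode the deeply nested / open-dict fields as JSON strings.
    for field in JSON_FIELDS:
        if out.get(field) is not None:
            out[field] = json.dumps(out[field], ensure_ascii=False)

    # references[].metadata is encoded per reference; the rest stays structured.
    refs = out.get("references")
    if refs is not None:
        encoded_refs = []
        for ref in refs:
            ref = dict(ref)
            if ref.get("metadata") is not None:
                ref["metadata"] = json.dumps(ref["metadata"], ensure_ascii=False)
            encoded_refs.append(ref)
        out["references"] = encoded_refs

    # Alphabetic, recursive key ordering so every shard sees the same layout.
    return canonicalise(out)
```

Sorting keys before writing means two shards that list the same fields in a different order still yield the same struct layout, which is what makes the later schema unification straightforward.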
## Loading

```python
from datasets import load_dataset
import json

ds = load_dataset(
    "chuckreynolds/wikimedia-enterprise-structured-contents-enwiki",
    split="train",
    streaming=True,
)

row = next(iter(ds))
print(row["name"], row["url"])

# JSON-encoded columns
sections = json.loads(row["sections"])
infoboxes = json.loads(row["infoboxes"])
```

## Notes

- Fields stored as JSON strings: `sections`, `infoboxes`, `tables`, and `references[].metadata`. All other fields (`references[]`, `license[]`, `version`, `event`, etc.) retain their native Arrow struct/list types and are queryable without decoding.
- The license is passed through unchanged from the upstream license for article text.
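Since four fields need `json.loads` on read, a small helper can decode them all at once. The sketch below is an assumption-laden illustration: `decode_row` and `JSON_COLUMNS` are hypothetical names, and the handling of missing or empty values (passed through untouched) is a guess, as the card does not say how absent fields are represented.

```python
import json

# Top-level columns the card lists as JSON-encoded strings.
JSON_COLUMNS = ("sections", "infoboxes", "tables")


def _loads_if_str(value):
    """Decode non-empty strings; pass None and other values through (assumption)."""
    if isinstance(value, str) and value:
        return json.loads(value)
    return value


def decode_row(row):
    """Return a copy of a dataset row with its JSON-encoded fields decoded."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        if column in decoded:
            decoded[column] = _loads_if_str(decoded[column])

    # references[] is a native list of structs; only its metadata is a string.
    refs = decoded.get("references")
    if refs:
        decoded_refs = []
        for ref in refs:
            ref = dict(ref)
            if "metadata" in ref:
                ref["metadata"] = _loads_if_str(ref["metadata"])
            decoded_refs.append(ref)
        decoded["references"] = decoded_refs
    return decoded
```

With a streaming dataset, a helper like this could be applied per row, for example `decode_row(next(iter(ds)))`, so the remaining native columns are left untouched.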

Provider:

chuckreynolds
