chuckreynolds/wikimedia-enterprise-structured-contents-enwiki
Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/chuckreynolds/wikimedia-enterprise-structured-contents-enwiki
Resource description:
---
pretty_name: Wikimedia Enterprise Structured Contents — enwiki_namespace_0
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
- text-retrieval
tags:
- wikipedia
- wikimedia-enterprise
- structured-contents
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
---
# enwiki_namespace_0
Structured Contents snapshot of `enwiki_namespace_0` from the
[Wikimedia Enterprise API](https://enterprise.wikimedia.com/docs/snapshot/), converted to Parquet.
## Source
- Upstream: [Wikimedia Enterprise Structured Contents API](https://enterprise.wikimedia.com/docs/snapshot/#structured-contents-snapshot-download-beta)
- Snapshot identifier: `enwiki_namespace_0`
- Format at source: `.tar.gz` containing sharded `.ndjson`
- Shards in this release: 3
## Processing
1. Downloaded the snapshot tarball from the Wikimedia Enterprise API.
2. Streamed each `.ndjson` shard through a normalization pass:
   - **JSON-encoded fields**: `sections`, `infoboxes`, `tables`, and
     `references[].metadata` are stored as JSON-encoded strings.
     These fields either have recursive nesting (depth > 50) that
     exceeds Apache Arrow's C Data Interface limit, or are open-dict
     structures whose keys vary across articles. Decode with
     `json.loads` on read.
   - **Canonical field order**: struct fields are ordered
     alphabetically (recursively) so that schemas match
     byte-for-byte across shards.
3. Wrote one Parquet file per shard (zstd level 9 compression,
row_group_size tuned for HF streaming).
4. Unified per-shard schemas with `pa.unify_schemas`; pinned result to
`schema.json`; re-cast every shard so embedded schemas are identical.
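
The normalization pass in step 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the helper names `canonicalise`, `normalise_record`, and `normalise_shard` are hypothetical, and only the field names come from the list above.

```python
import json

# Top-level fields the card says are stored as JSON-encoded strings.
JSON_FIELDS = ("sections", "infoboxes", "tables")


def canonicalise(value):
    """Recursively sort dict keys alphabetically; lists keep their order."""
    if isinstance(value, dict):
        return {key: canonicalise(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalise(item) for item in value]
    return value


def normalise_record(record):
    """Return a copy with JSON fields encoded and keys in canonical order."""
    out = dict(record)
    for field in JSON_FIELDS:
        if out.get(field) is not None:
            out[field] = json.dumps(out[field], ensure_ascii=False)
    # references[].metadata is encoded per reference; other keys stay native.
    refs = out.get("references")
    if refs is not None:
        out["references"] = [
            {**ref, "metadata": json.dumps(ref["metadata"], ensure_ascii=False)}
            if ref.get("metadata") is not None
            else dict(ref)
            for ref in refs
        ]
    return canonicalise(out)


def normalise_shard(lines):
    """Lazily normalise an iterable of NDJSON lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield normalise_record(json.loads(line))
```

Because `normalise_shard` is a generator, a shard can be streamed from an open file handle line by line without loading it into memory.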
## Loading
```python
from datasets import load_dataset
import json
ds = load_dataset("chuckreynolds/wikimedia-enterprise-structured-contents-enwiki", split="train", streaming=True)
row = next(iter(ds))
print(row["name"], row["url"])
# JSON-encoded columns
sections = json.loads(row["sections"])
infoboxes = json.loads(row["infoboxes"])
```
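
To decode every JSON-encoded field at once, including the per-reference `metadata`, a small helper along these lines may be convenient. It is a sketch only: `decode_row` is a hypothetical name, and it assumes `references` arrives as a list of dicts and that missing values arrive as `None`.

```python
import json

# Top-level columns stored as JSON-encoded strings.
JSON_COLUMNS = ("sections", "infoboxes", "tables")


def decode_row(row):
    """Return a copy of `row` with the JSON-encoded fields parsed."""
    out = dict(row)
    for column in JSON_COLUMNS:
        if isinstance(out.get(column), str):
            out[column] = json.loads(out[column])
    refs = out.get("references")
    if refs is not None:
        decoded_refs = []
        for ref in refs:
            ref = dict(ref)  # copy so the input row is left untouched
            if isinstance(ref.get("metadata"), str):
                ref["metadata"] = json.loads(ref["metadata"])
            decoded_refs.append(ref)
        out["references"] = decoded_refs
    return out
```

With the streaming dataset above, this could be applied lazily as `decoded = (decode_row(r) for r in ds)`.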
## Notes
- Fields stored as JSON strings: `sections`, `infoboxes`, `tables`,
and `references[].metadata`. All other fields (`references[]`,
`license[]`, `version`, `event`, etc.) retain their native Arrow
struct/list types and are queryable without decoding.
- The dataset license (`cc-by-sa-4.0`) passes through the upstream license for article text.
Provider:
chuckreynolds



