SaulLu/wikipedia_html_enterprise
Source: Hugging Face | Updated: 2023-03-03 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/SaulLu/wikipedia_html_enterprise
Description:
# This is a helper script to load an Enterprise HTML dataset into a `datasets` object
## How to use
1. Download an NS0 dump from https://dumps.wikimedia.org/other/enterprise_html/runs/20230220/
2. Untar it.
For example, with:
```
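# `-I pigz` requires pigz to be installed.
# The archive name is an example; substitute the file you downloaded.
mkdir enwiki-NS6-20230220-ENTERPRISE-HTML
tar -I pigz -vxf enwiki-NS6-20230220-ENTERPRISE-HTML.json.tar.gz -C enwiki-NS6-20230220-ENTERPRISE-HTML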
```
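
If `pigz` is not available, the same extraction step can be sketched with Python's standard `tarfile` module. This is a minimal sketch, not part of the original script: the helper name `extract_dump` is hypothetical, and it assumes the archive is an ordinary gzip-compressed tar file, as the `.tar.gz` suffix suggests.

```python
import tarfile
from pathlib import Path


def extract_dump(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted file paths.

    Hypothetical helper: mirrors the `mkdir` + `tar -xf ... -C ...` commands above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same role as `mkdir`
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Use the safer "data" extraction filter where this Python supports it.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For instance, `extract_dump("enwiki-NS6-20230220-ENTERPRISE-HTML.json.tar.gz", "enwiki-NS6-20230220-ENTERPRISE-HTML")` would play the role of the two shell commands. Single-threaded gzip decompression is likely to be slower than `pigz` on large dumps.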
3. Load it:
```python
from datasets import load_dataset
local_path = ...  # Path to the directory where you extracted the NS0 dump
shard_id = ...
ds = load_dataset(
    "SaulLu/wikipedia_html_enterprise",
    shard=shard_id,
    data_dir=local_path,
)
```
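
Before calling `load_dataset`, it can help to confirm that `local_path` points at the extracted files. The sketch below is an illustration only: the helper name `peek_first_records` is hypothetical, and it assumes the extracted files are newline-delimited JSON with a `.ndjson` extension, which this README does not state. Adjust `pattern` if your extracted files are named differently.

```python
import json
from pathlib import Path


def peek_first_records(data_dir, pattern="*.ndjson"):
    """Map each matching file name to the sorted top-level keys of its first record.

    Assumption: every file holds one JSON object per line (NDJSON).
    """
    summary = {}
    for path in sorted(Path(data_dir).rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            first_line = handle.readline()  # read only one line, not the whole file
        if first_line.strip():
            summary[path.name] = sorted(json.loads(first_line))
    return summary
```

An empty result suggests that `data_dir` is wrong or that the files use a different extension.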
Provider:
SaulLu
Summary of original information
Dataset overview
Dataset name
- Name: wikipedia_html_enterprise
- Author: SaulLu
Obtaining the dataset
- Download URL: https://dumps.wikimedia.org/other/enterprise_html/runs/20230220/
- File format: tar.gz
- Extraction commands: `mkdir enwiki-NS6-20230220-ENTERPRISE-HTML`, then `tar -I pigz -vxf enwiki-NS6-20230220-ENTERPRISE-HTML.json.tar.gz -C enwiki-NS6-20230220-ENTERPRISE-HTML`
Loading the dataset
- Loading method: after `from datasets import load_dataset`, call `load_dataset("SaulLu/wikipedia_html_enterprise", shard=shard_id, data_dir=local_path)`, where `local_path` is the path to the directory where you extracted the NS0 dump and `shard_id` is the shard to load.



