flyingfishinwater/wikipedia_20231001
Hugging Face, updated 2023-11-01, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/flyingfishinwater/wikipedia_20231001
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- biology
- legal
- music
- art
- medical
size_categories:
- 10B<n<100B
---
This is the English content from the 2023-10-01 snapshot on the Wikipedia dump site.
The format is similar to "[datasets/wikipedia](https://huggingface.co/datasets/wikipedia?row=0)", and the text was cleaned with the same method.
However, I omitted the 'url' field because it always follows the same format: "https://en.wikipedia.org/wiki/[title]".
Another change is the title. I merged each "REDIRECTED" title with its original title, using a comma as the separator.
For example, the title "An American in Paris, AnAmericanInParis" means that "An American in Paris" and "AnAmericanInParis" point to the same content.
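
As a rough illustration of the two changes above, the sketch below splits a merged title back into its individual titles and rebuilds a URL for each one. The helper names `split_titles` and `article_url` are hypothetical and not part of the dataset. Two assumptions are made: the separator is a comma followed by a space, as in the example, and spaces in a title become underscores in the URL, following Wikipedia's usual convention. Titles that themselves contain a comma cannot be told apart from merged titles by this naive split.

```python
from urllib.parse import quote


def split_titles(merged_title, sep=", "):
    """Naively split a merged title field into its candidate titles.

    Assumption: redirect titles are joined with ", ". A title that itself
    contains ", " would be split incorrectly.
    """
    return [part for part in merged_title.split(sep) if part]


def article_url(title):
    """Rebuild an article URL from a single title.

    Assumption: spaces map to underscores and other special characters
    are percent-encoded.
    """
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))


titles = split_titles("An American in Paris, AnAmericanInParis")
urls = [article_url(t) for t in titles]
for title, url in zip(titles, urls):
    print(title, "->", url)
```

If exact URLs matter, it is safer to check them against the original "datasets/wikipedia" records than to rely on this reconstruction.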
Provider:
flyingfishinwater
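Summary of original information
Dataset overview
License
- Apache 2.0
Task categories
- Text generation
Language
- English
Tags
- Chemistry
- Biology
- Legal
- Music
- Art
- Medical
Dataset size
- 10B<n<100B
Data source
- The 2023-10-01 Wikipedia dump
Data format
- Similar to "datasets/wikipedia"
- Text cleaned with the same method
Data processing
- The url field is omitted because it always follows the format "https://en.wikipedia.org/wiki/[title]"
- Each "REDIRECTED" title is merged with its original title, using a comma as the separator
- For example, the title "An American in Paris, AnAmericanInParis" means that "An American in Paris" and "AnAmericanInParis" point to the same content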



