Ti-Ma/wikipedia_2015
Source: Hugging Face. Updated 2024-04-26; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Ti-Ma/wikipedia_2015
Resource description:
---
license: cc-by-sa-3.0
---
# Dataset Card for Ti-Ma/wikipedia_2015
This dataset is a snapshot of English Wikipedia as it stood on 31 December 2015.
## Dataset Details
### Dataset Description
Wikimedia routinely publishes dumps of Wikipedia, each containing the revision history of its articles. Before extracting article content, we first identify the relevant revision of each page: the most recent revision as of December 31st of the target year. Consequently, some revisions in our datasets predate the target date by several years, because the corresponding pages had not been edited since. Although including older revisions might initially appear problematic, these are the versions of the pages that were live on Wikipedia at the cutoff date, and their content was considered current enough at that time. This approach ensures that our training datasets reflect the most up-to-date information available on Wikipedia at each year's end, providing a realistic snapshot of knowledge at that point in time.
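The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the record layout (a dict with `id` and `timestamp` keys), the function name `latest_revision_as_of`, and the choice of an inclusive end-of-day UTC cutoff are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Assumed cutoff: the last second of 31 December 2015, in UTC.
CUTOFF = datetime(2015, 12, 31, 23, 59, 59, tzinfo=timezone.utc)


def latest_revision_as_of(revisions, cutoff=CUTOFF):
    """Return the most recent revision at or before `cutoff`, or None.

    `revisions` is an iterable of dicts with a timezone-aware
    `timestamp` key (a hypothetical layout chosen for this sketch).
    """
    eligible = [rev for rev in revisions if rev["timestamp"] <= cutoff]
    if not eligible:
        # The page did not exist yet at the cutoff date.
        return None
    return max(eligible, key=lambda rev: rev["timestamp"])
```

A page last edited in 2012 is therefore represented by its 2012 revision, while any edit made after the cutoff is ignored.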
Once each revision has been identified, we clean the page using the code from [wiki-dump-reader](https://github.com/CyberZHG/wiki-dump-reader/tree/master), which parses the page and outputs clean text. During cleaning, a number of unwanted features and attributes are removed: file links, emphasis markup, comments, indents, HTML, references, etc.
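To give a feel for what this kind of cleaning involves, here is a rough standard-library sketch. It is not the wiki-dump-reader implementation and covers far fewer cases; the function name `clean_wikitext` and the regular expressions are illustrative assumptions only.

```python
import re


def clean_wikitext(text: str) -> str:
    """Strip a few kinds of wiki markup (simplified stand-in, not wiki-dump-reader)."""
    # HTML comments: <!-- ... -->
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # References: self-closing <ref ... /> first, then paired <ref>...</ref>
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # File links such as [[File:Name.jpg|thumb]] (nested links are not handled)
    text = re.sub(r"\[\[(?:File|Image):[^\[\]]*\]\]", "", text)
    # Any remaining HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Emphasis markers: ''italic'', '''bold''', '''''both'''''
    text = re.sub(r"'{2,5}", "", text)
    # Leading colon indents at the start of a line
    text = re.sub(r"^:+", "", text, flags=re.MULTILINE)
    return text.strip()
```

For real dumps the linked library is the better choice, since it also handles templates, tables, and internal links, which this sketch leaves untouched.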
- **Language(s):** English
- **License:** cc-by-sa-3.0
## Uses
Diachronic studies of Wikipedia, historical LLM pre-training, and any task that requires strict temporal partitioning of data.
## Dataset Structure
The dataset is saved in a format suited to fast loading of large files and is compatible with the Hugging Face `datasets` library.
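A minimal loading sketch with the `datasets` library follows. The repository ID is taken from this card; the split name `"train"` and the helper name `load_snapshot` are assumptions, so check the repository for the actual splits and fields.

```python
REPO_ID = "Ti-Ma/wikipedia_2015"


def load_snapshot(repo_id: str = REPO_ID, split: str = "train", streaming: bool = True):
    """Load the snapshot from the Hugging Face Hub.

    The split name "train" is an assumption; streaming avoids
    downloading the whole dataset before iteration starts.
    """
    # Imported lazily so that defining this helper does not require
    # the `datasets` package to be installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=streaming)
```

Calling `next(iter(load_snapshot()))` returns the first record, which is a quick way to inspect the available fields.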
## Bias, Risks, and Limitations
This dataset includes all Wikipedia articles, some of which may not be useful to the end user. Downstream tasks may need to filter for relevant articles.
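One simple way to do such filtering is sketched below. The field names `title` and `text`, the helper name `filter_articles`, and the default length threshold are assumptions for illustration; adapt them to the dataset's actual schema and to the task at hand.

```python
def filter_articles(records, min_chars=500, title_keywords=()):
    """Yield records that are long enough and, optionally, match a title keyword.

    `records` is an iterable of dicts with assumed `title` and `text` keys.
    """
    keywords = [kw.lower() for kw in title_keywords]
    for record in records:
        text = record.get("text", "")
        title = record.get("title", "").lower()
        if len(text) < min_chars:
            # Skip stubs and near-empty pages.
            continue
        if keywords and not any(kw in title for kw in keywords):
            # Skip pages whose title matches none of the requested keywords.
            continue
        yield record
```

Because the function is a generator, it can be applied to a streamed dataset without holding every article in memory.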
## Dataset Card Contact
felix.drinkall@eng.ox.ac.uk
## Acknowledgments
We are grateful to Graphcore and their team for their support in providing us with compute for this project. The first author was funded by the Economic and Social Research Council of the UK via the Grand Union DTP. This work was supported in part by a grant from the Engineering and Physical Sciences Research Council (EP/T023333/1). We are also grateful to the Oxford-Man Institute of Quantitative Finance and the Oxford e-Research Centre for their support.
## Citation
**BibTeX:**
@inproceedings{drinkall-tima-2024,
title = "Time Machine GPT",
author = "Drinkall, Felix and Zohren, Stefan and Pierrehumbert, Janet",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
month = june,
year = "2024",
publisher = "Association for Computational Linguistics" }
Provider:
Ti-Ma
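Summary of original information
Dataset overview
Name: Ti-Ma/wikipedia_2015
Description: A Wikipedia dataset current as of 31 December 2015. For each page it contains the most recent revision as of December 31st, so that the dataset reflects the latest information available on Wikipedia at the end of the year.
Language: English
License: cc-by-sa-3.0
Data processing
The dataset is cleaned with the wiki-dump-reader tool, which removes unwanted content such as file links, emphasis markup, comments, indents, HTML, and references, and outputs clean text.
Dataset structure
The dataset is saved in a format suited to fast loading of large files and is compatible with the Hugging Face datasets framework.
Use cases
- Diachronic studies of Wikipedia
- Historical language model pre-training
- Tasks that require strict temporal partitioning of data
Limitations
The dataset includes all Wikipedia articles, some of which may not be useful to the end user; relevant articles may need to be filtered for downstream tasks.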



