
Ti-Ma/wikipedia_2022

Source: Hugging Face. Updated 2024-04-26. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Ti-Ma/wikipedia_2022
Resource description:
---
license: cc-by-sa-3.0
---

# Dataset Card for Dataset Name

This is a Wikipedia dataset current as of 31 December 2022.

## Dataset Details

### Dataset Description

Wikimedia routinely publishes dumps of Wikipedia, each containing the revision history of articles. Before extracting article information, we first define the relevant revision for each page: the most recent revision as of December 31st of the target year. Consequently, some revisions in our datasets date back several years before the target date, because those pages had not been edited since. While including older revisions might initially appear problematic, these are the versions of the Wikipedia pages that existed at the cutoff date, and their content was considered current enough at that time. This approach ensures that our training datasets reflect the most up-to-date information available on Wikipedia at each year's end, providing a realistic snapshot of knowledge for that point in time.

Once each revision has been identified, we clean the page using the code from wiki-dump-reader (https://github.com/CyberZHG/wiki-dump-reader/tree/master), which parses the page and outputs clean text. During cleaning, a number of unwanted features and attributes are removed: file links, emphasis markup, comments, indents, HTML, references, etc.

- **Language(s):** English
- **License:** cc-by-sa-3.0

## Uses

Diachronic studies of Wikipedia, historical LLM pre-training, and any task that requires strict temporal partitioning of data.

## Dataset Structure

The dataset is saved in a format that is suitable for fast loading of large files and is compatible with the Hugging Face datasets framework.

## Bias, Risks, and Limitations

This dataset includes all Wikipedia articles, some of which might not be useful to the end user. Filtering for relevant articles may be necessary for downstream tasks.
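The filtering mentioned under limitations could look roughly like the sketch below. It is only an illustration: the card does not document the record schema, so the `text` field name, the `min_chars` threshold, and the keyword match are all assumptions to adapt once the actual columns are known.

```python
def keep_article(record, min_chars=500, keywords=()):
    """Decide whether an article record is worth keeping for a downstream task.

    Assumptions (not stated in the dataset card): each record is a dict and
    the article body lives under a "text" key.
    """
    text = record.get("text", "")
    # Drop stubs and near-empty pages.
    if len(text) < min_chars:
        return False
    # With no keywords given, length is the only criterion.
    if not keywords:
        return True
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Hypothetical records, used only to exercise the predicate.
articles = [
    {"text": "Short stub."},
    {"text": "Finance " * 100},
    {"text": "Botany " * 100},
]
kept = [a for a in articles if keep_article(a, keywords=("finance",))]
```

Because the predicate takes a single record and returns a boolean, it can also be passed to the `filter` method of a Hugging Face `datasets` object, provided the field name matches the real schema.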
## Dataset Card Contact

felix.drinkall@eng.ox.ac.uk

## Acknowledgments

We are grateful to Graphcore and their team for their support in providing us with compute for this project. The first author was funded by the Economic and Social Research Council of the UK via the Grand Union DTP. This work was supported in part by a grant from the Engineering and Physical Sciences Research Council (EP/T023333/1). We are also grateful to the Oxford-Man Institute of Quantitative Finance and the Oxford e-Research Centre for their support.

## Citation

**BibTeX:**

    @inproceedings{drinkall-tima-2024,
        title = "Time Machine GPT",
        author = "Drinkall, Felix and Zohren, Stefan and Pierrehumbert, Janet",
        booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
        month = june,
        year = "2024",
        publisher = "Association for Computational Linguistics"
    }
Provider:
Ti-Ma
Summary of original information

Dataset overview

Dataset name

Ti-Ma/wikipedia_2022

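Dataset description

This is a Wikipedia dataset current as of December 31, 2022. For each article it contains the most recent revision made on or before that date. Some of these revisions are several years old, but because the pages were not edited again before the cutoff, they reflect the state of Wikipedia on that date. Pages are cleaned with the wiki-dump-reader tool, which removes file links, emphasis markup, comments, indents, HTML, references, and other unwanted content.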
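The revision-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the revision records, their `id`, `timestamp`, and `text` keys, and the function name `latest_revision_as_of` are all hypothetical.

```python
from datetime import datetime, timezone


def latest_revision_as_of(revisions, cutoff):
    """Return the newest revision with a timestamp at or before the cutoff.

    Returns None when the page had no revision by the cutoff, for example
    because it was created later.
    """
    eligible = [r for r in revisions if r["timestamp"] <= cutoff]
    return max(eligible, key=lambda r: r["timestamp"], default=None)


# Hypothetical revision history for a single page (timestamps in UTC).
revisions = [
    {"id": 1, "timestamp": datetime(2019, 5, 2, tzinfo=timezone.utc), "text": "old"},
    {"id": 2, "timestamp": datetime(2022, 11, 30, tzinfo=timezone.utc), "text": "snapshot"},
    {"id": 3, "timestamp": datetime(2023, 1, 15, tzinfo=timezone.utc), "text": "too new"},
]
cutoff = datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
selected = latest_revision_as_of(revisions, cutoff)
```

Note that a page whose only revision dates from 2019 would still be selected with that 2019 text, which is the behavior the description defends: the old revision is what a reader would have seen on the cutoff date.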

Language

English

License

cc-by-sa-3.0

Uses

  • Diachronic studies of Wikipedia
  • Historical language model pre-training
  • Tasks that require strict temporal partitioning of data

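Dataset structure

The dataset is stored in a format suited to fast loading of large files and is compatible with the Hugging Face datasets framework.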
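Since the dataset is compatible with the Hugging Face `datasets` library, loading it might look like the sketch below. The repository id comes from the page title; the `train` split name and the choice of streaming are assumptions, and the helper names `load_snapshot` and `peek` are hypothetical. The column names are not documented, so the sketch inspects records rather than assuming a schema.

```python
from itertools import islice

REPO_ID = "Ti-Ma/wikipedia_2022"


def load_snapshot(split="train", streaming=True):
    """Open the dataset from the Hugging Face Hub.

    The "train" split name is an assumption. Streaming avoids downloading the
    full dump before the first record can be read.
    """
    # Imported lazily so the rest of the module works without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


def peek(records, n=3):
    """Return the first n records of any iterable without consuming the rest."""
    return list(islice(records, n))


# Example (requires network access and the `datasets` package):
#     for record in peek(load_snapshot()):
#         print(sorted(record.keys()))
```

Printing the keys of the first few records is a cheap way to discover the actual field names before writing any filtering or training code against them.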

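Bias, risks, and limitations

The dataset includes all Wikipedia articles, some of which may not be useful to the end user. Filtering for relevant articles may be necessary for downstream tasks.

Dataset card contact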

felix.drinkall@eng.ox.ac.uk
