
ucrelnlp/wikipedia-ga-fa-ids

Hugging Face | Updated 2026-04-16 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ucrelnlp/wikipedia-ga-fa-ids
Description:
---
language:
- en
- ko
- nl
- pt
- es
- da
- it
- fi
- zh
license:
- cc-by-sa-4.0
- gfdl
multilinguality: multilingual
size_categories:
- 10K<n<100K
pretty_name: Wikipedia Good and Featured Articles
configs:
- config_name: en
  data_files:
  - split: train
    path: data/en.jsonl
- config_name: ko
  data_files:
  - split: train
    path: data/ko.jsonl
- config_name: nl
  data_files:
  - split: train
    path: data/nl.jsonl
- config_name: pt
  data_files:
  - split: train
    path: data/pt.jsonl
- config_name: es
  data_files:
  - split: train
    path: data/es.jsonl
- config_name: da
  data_files:
  - split: train
    path: data/da.jsonl
- config_name: it
  data_files:
  - split: train
    path: data/it.jsonl
- config_name: fi
  data_files:
  - split: train
    path: data/fi.jsonl
- config_name: zh
  data_files:
  - split: train
    path: data/zh.jsonl
viewer: true
---

# Wikipedia Good and Featured Articles

This dataset contains the Wikipedia article IDs and page titles of Good and Featured Articles for a given timestamped data dump. The data was extracted from [Wikipedia/Wikimedia SQL table dumps](https://dumps.wikimedia.org/). The dataset covers 9 language editions of Wikipedia. For more information on how the dataset was generated, see [https://github.com/UCREL/wikipedia-ga-fa-extraction](https://github.com/UCREL/wikipedia-ga-fa-extraction).

The data is specific to a given data dump timestamp. The main tag of the repository relates to the most up-to-date timestamp that has been tagged (versioned). Each timestamp of data that is uploaded is also uploaded to its own timestamp-specific Git tag (version).
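As a rough illustration of how the per-language configs and the versioning scheme fit together, the sketch below builds the download URL of one language file. It assumes the standard Hugging Face `resolve/<revision>/<path>` URL layout; the helper name `data_file_url` is hypothetical, and the format of the timestamp tag names is not stated in this card, so any revision other than `main` has to be looked up in the repository.

```python
from urllib.parse import quote

REPO_ID = "ucrelnlp/wikipedia-ga-fa-ids"
# Language codes taken from the `configs` section of the card.
LANGUAGES = ("en", "ko", "nl", "pt", "es", "da", "it", "fi", "zh")


def data_file_url(language: str, revision: str = "main") -> str:
    """Build the URL of data/<language>.jsonl at a given revision.

    Assumption: files are served under the usual Hugging Face
    `resolve/<revision>/<path>` layout. `revision` may be a branch or a
    Git tag; the tag naming scheme is not documented here.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return (
        f"https://huggingface.co/datasets/{REPO_ID}"
        f"/resolve/{quote(revision, safe='')}/data/{language}.jsonl"
    )


if __name__ == "__main__":
    print(data_file_url("da"))
```

Pinning `revision` to a specific tag rather than `main` keeps the ID list aligned with the Wikipedia dump it was extracted from.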
This upload relates to timestamp: 2025-12-01

- **Curated by:** [University Centre for Computer Corpus Research on Language (UCREL) group](https://ucrel.lancs.ac.uk/) at [Lancaster University](https://www.lancaster.ac.uk/)
- **Multi-lingual**
- **Repository:** [https://github.com/UCREL/wikipedia-ga-fa-extraction](https://github.com/UCREL/wikipedia-ga-fa-extraction)

## Uses

It can be used to filter a Wikipedia dataset so that it contains only Good or Featured Articles for a given language.

## Dataset Structure

Each JSONL line describes one article. Every line is unique, a `page_id` value occurs only once in a file, and each article has either `ga` or `fa` set to True. The fields are:

* `page_id` - The article ID, unique within a Wikipedia language site. (INT).
* `page_title` - The article title, stored as a text string of at most 255 bytes. The title can be truncated if the original was longer than 255 bytes; a single character can take up to 4 bytes. (STRING).
* `ga` - True if the article is a Good Article (GA), otherwise False. (BOOL).
* `fa` - True if the article is a Featured Article (FA), otherwise False. (BOOL).
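The record layout and the filtering use described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the authors' tooling: the function names are hypothetical, and the `id` key of the articles being filtered is an assumption about the Wikipedia dataset you pair this with (adjust `id_key` to match it).

```python
import json
from typing import Iterable, Iterator


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and check the invariants stated in the card."""
    seen_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        page_id = record["page_id"]
        if page_id in seen_ids:
            raise ValueError(f"duplicate page_id: {page_id}")
        if not (record["ga"] or record["fa"]):
            raise ValueError(f"page_id {page_id} is neither GA nor FA")
        seen_ids.add(page_id)
        yield record


def quality_ids(records: Iterable[dict], want_ga: bool = True,
                want_fa: bool = True) -> set:
    """Collect page IDs of Good and/or Featured Articles."""
    return {
        r["page_id"]
        for r in records
        if (want_ga and r["ga"]) or (want_fa and r["fa"])
    }


def filter_articles(articles: Iterable[dict], keep_ids: set,
                    id_key: str = "id") -> Iterator[dict]:
    """Yield only articles whose ID is in keep_ids.

    `id_key` is an assumption; IDs are cast to int because some
    Wikipedia datasets store them as strings.
    """
    for article in articles:
        if int(article[id_key]) in keep_ids:
            yield article


if __name__ == "__main__":
    # In practice, pass an open file handle for one language's JSONL file.
    with_ids = ['{"page_id": 130, "page_title": "Norge", "ga": true, "fa": false}']
    keep = quality_ids(parse_records(with_ids))
    # Made-up articles, for illustration only.
    articles = [{"id": "130", "text": "..."}, {"id": "999", "text": "..."}]
    print(list(filter_articles(articles, keep)))
```

Because `page_id` is only unique within one language site, the ID set built from one language file should only be applied to articles from the same language edition and, ideally, the same dump timestamp.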
Example of a JSONL file:

```json
{"page_id": 130, "page_title": "Norge", "ga": true, "fa": false}
{"page_id": 167, "page_title": "Sverige", "ga": true, "fa": false}
```

## Dataset Statistics

The table below shows, per language, the number of entries/articles that are either Good or Featured (Total), Good (GA), or Featured (FA):

| Language | Code | Total | GA | FA |
| --- | --- | --- | --- | --- |
| Chinese | zh | 4,367 | 3,339 | 1,028 |
| Danish | da | 187 | 170 | 17 |
| Dutch | nl | 380 | 0 | 380 |
| English | en | 49,845 | 43,023 | 6,822 |
| Finnish | fi | 869 | 516 | 353 |
| Italian | it | 1,162 | 573 | 589 |
| Korean | ko | 384 | 240 | 144 |
| Portuguese | pt | 3,488 | 1,955 | 1,533 |
| Spanish | es | 4,710 | 3,368 | 1,342 |

## License

This dataset contains text from Wikipedia, licensed under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) (CC BY-SA 4.0) and also available under [GFDL](https://www.gnu.org/licenses/fdl-1.3.html). See Wikipedia’s licensing and Terms of Use: [https://dumps.wikimedia.org/legal.html](https://dumps.wikimedia.org/legal.html)

We release this data under the same licenses: [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) (CC BY-SA 4.0) and also available under [GFDL](https://www.gnu.org/licenses/fdl-1.3.html).

## Dataset Card Authors

* UCREL (ucrel@lancaster.ac.uk)
* Andrew Moore / apmoore1 (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* Paul Rayson (p.rayson@lancaster.ac.uk)

## Dataset Card Contact

* UCREL (ucrel@lancaster.ac.uk)
* Andrew Moore / apmoore1 (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* Paul Rayson (p.rayson@lancaster.ac.uk)
Provided by:
ucrelnlp