recursal/SuperWiki-1.5

Hugging Face · Updated 2025-12-28 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/recursal/SuperWiki-1.5
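Resource summary:
SuperWIKI-1.5 is a multilingual dataset of Wikipedia articles containing roughly 18.23B tokens (counted with llama-2-7b-chat-tokenizer) or 15.17B tokens (counted with the RWKV Tokenizer). It was manually constructed from Wikipedia HTML dumps and is intended mainly for training large language models and for other NLP tasks. It covers Wikipedia articles in multiple languages, and each example holds the content of one complete Wikipedia article. The dataset was curated by KaraKaraWitch, funded by Recursal.ai, and created under time pressure, so selection bias may be present. The improved version, SuperWikipedia-NEXT, is recommended instead.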
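Provider:
recursal
Original Information Summary

Dataset Details

Dataset Description

SuperWIKI-1.5 is a multilingual dataset of approximately 18.23B tokens (counted with llama-2-7b-chat-tokenizer) or 15.17B tokens (counted with the RWKV Tokenizer), curated from Wikipedia HTML dumps. It is intended mainly for training large language models and for other natural language processing tasks.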
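The two token counts describe the same text measured with different tokenizers. As a small worked example, the sketch below computes how much lower the RWKV count is relative to the llama-2 count, using only the two figures quoted above; the function name `relative_reduction` is made up for illustration.

```python
# Token counts quoted in the description above, in billions.
LLAMA2_TOKENS_B = 18.23  # llama-2-7b-chat-tokenizer
RWKV_TOKENS_B = 15.17    # RWKV Tokenizer


def relative_reduction(baseline: float, other: float) -> float:
    """Return the fraction by which `other` is smaller than `baseline`."""
    return (baseline - other) / baseline


reduction = relative_reduction(LLAMA2_TOKENS_B, RWKV_TOKENS_B)
print(f"RWKV count is {reduction:.1%} lower than the llama-2 count")
```

The difference works out to roughly 17%, which matters mostly when budgeting training steps for a given tokenizer.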

  • Dataset name: SuperWIKI-1.5
  • Languages: Multilingual (see Supported Languages)
  • License: cc-by-sa-4.0
  • Curated by: KaraKaraWitch
  • Funded by: Recursal.ai
  • Shared by: KaraKaraWitch

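Dataset Summary

The dataset contains the cleaned content of Wikipedia articles for all the languages it covers. It was manually constructed from Wikipedia HTML dumps, with one split per language. Each example contains the content of one complete Wikipedia article.

Supported Tasks and Leaderboards

Primarily used for language modeling.

Supported Languages

The dataset includes Wikipedia articles from the following language editions: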

  • ar.wikipedia.org
  • de.wikipedia.org
  • en.wikipedia.org
  • es.wikipedia.org
  • fa.wikipedia.org
  • fr.wikipedia.org
  • he.wikipedia.org
  • hi.wikipedia.org
  • id.wikipedia.org
  • it.wikipedia.org
  • ja.wikipedia.org
  • ko.wikipedia.org
  • nl.wikipedia.org
  • pl.wikipedia.org
  • pt.wikipedia.org
  • ru.wikipedia.org
  • simple.wikipedia.org
  • sv.wikipedia.org
  • tr.wikipedia.org
  • uk.wikipedia.org
  • vi.wikipedia.org
  • zh.wikipedia.org
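
The list above can serve as an allow-list when checking where an article URL comes from. The following is a minimal sketch, not part of the dataset's own tooling: the names `SUPPORTED_WIKIS`, `wiki_host`, and `is_supported` are made up for illustration, and only the hostnames are taken from the list.

```python
from urllib.parse import urlparse

# Hostnames copied from the list of supported language editions above.
SUPPORTED_WIKIS = frozenset(
    f"{code}.wikipedia.org"
    for code in (
        "ar de en es fa fr he hi id it ja ko nl pl pt ru "
        "simple sv tr uk vi zh"
    ).split()
)


def wiki_host(url: str) -> str:
    """Return the lower-cased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_supported(url: str) -> bool:
    """Return True if the URL points at one of the wikis listed above."""
    return wiki_host(url) in SUPPORTED_WIKIS
```

For instance, `is_supported("https://en.wikipedia.org/wiki/Tharman_Shanmugaratnam")` is true, while a URL from an edition that is not in the list is rejected.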

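Selection Bias

Unlike SuperWikipedia-NEXT, the languages in SuperWIKI-1.5 were selected manually, so the selection may be biased toward certain languages (for example, CJK and European languages).

Filtering

The filtering process is documented in the code, but the code is not well organized. Reading the code directly is recommended for details.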

Data Instances

The following is an example of a data instance:

{
  "id": 4024053,
  "title": "Tharman Shanmugaratnam",
  "url": "https://en.wikipedia.org/wiki/Tharman_Shanmugaratnam",
  "stub": false,
  "template": [
    "Efn",
    "C-SPAN",
    "S-aft",
    "S-new",
    "Reflist",
    "Cite news",
    "S-par",
    "Cite journal",
    "Short description",
    "EngvarB"
  ],
  "category": [
    "Finance ministers of Singapore",
    "Singaporean Hindus",
    "Alumni of Wolfson College, Cambridge",
    "Deputy Prime Ministers of Singapore",
    "Ministers for Manpower of Singapore",
    "Presidents of Singapore",
    "Singaporean people of Sri Lankan descent",
    "Singaporean people of Tamil descent",
    "Articles with WorldCat Entities identifiers",
    "Articles with GND identifiers",
    "Articles with VIAF identifiers"
  ],
  "license": [
    "Creative Commons Attribution Share Alike 3.0 Unported"
  ],
  "wikitext": "<...TRUNCATED SAMPLE...> Tharman Shanmugaratnam{{efn|{{lang-ta|தர்மன் சண்முகரத்தினம்}}}} (born 25 February 1957), also known [[mononymously]] as Tharman, is a Singaporean politician and economist who has been serving as the ninth [[president of Singapore]] since 2023 after winning the [[2023 Singaporean presidential election|2023 presidential election]].

Prior to his presidency, Tharman served as [[Senior Minister of Singapore]] between 2019 and 2023, [[Coordinating Minister for Social Policies (Singapore)|Coordinating Minister for Social Policies]] between 2015 and 2023, and Chairman of the [[Monetary Authority of Singapore]] between 2011 and 2023.<ref name="Parliament Profile"/>

Tharman is an economist in roles principally related to economic and social policies. He has also led various international councils and panels simultaneously. Tharman chairs the Board of Trustees of the [[Group of Thirty]], a global council of economic and financial leaders from the public and private sectors and academia. He also co-chairs the Global Commission on the Economics of Water with [[Ngozi Okonjo-Iweala|Ngozi Owonjo-Iweala]], [[Mariana Mazzucato]] and [[Johan Rockström]]. Its initial recommendations helped shape the outcomes of the UN Water Conference in March 2023. Tharman has also been co-chair of the [[G20]] High Level Independent Panel on Global Financing for Pandemic Preparedness and Response since 2021. In 2017, Tharman was appointed to chair the G20 Eminent Persons Group on Global Financial Governance.

A former member of the governing [[Peoples Action Party]] (PAP), he was the... <...TRUNCATED SAMPLE...>",
  "lang": "en",
  "abstract": "Tharman Shanmugaratnam, also known mononymously as Tharman, is a Singaporean politician and economist who has been serving as the ninth president of Singapore since 2023. Prior to his presidency, Tharman served as Senior Minister of Singapore between 2019 and 2023, Coordinating Minister for Social Policies between 2015 and 2023, and Chairman of the Monetary Authority of Singapore between 2011 and 2023. Tharman is an economist in roles principally related to economic and social policies. He has also led various international councils and panels simultaneously. Tharman chairs the Board of Trustees of the Group of Thirty, a global council of economic and financial leaders from the public and private sectors and academia. He also co-chairs the Global Commission on the Economics of Water with Ngozi Owonjo-Iweala, Mariana Mazzucato and Johan Rockström. Its initial recommendations helped shape the outcomes of the UN Water Conference in March 2023. Tharman has also been co-chair of the G20 High Level Independent Panel on Global Financing for Pandemic Preparedness and Response since 2021. In 2017, Tharman was appointed to chair the G20 Eminent Persons Group on Global Financial Governance. <...TRUNCATED SAMPLE...>",
  "boxes_filters": [],
  "infobox_html": [
    "<...TRUNCATED SAMPLE...>"
  ],
  "figures_dict": [
    {
      "file_url": "./File:Mr_Tharman_at_Bloomberg_New_Economy_Forum.jpg",
      "caption": ""
    }
  ],
  "text": "9th President of Singapore

Tharman Shanmugaratnam (born 25 February 1957), also known mononymously as Tharman, is a Singaporean politician and economist who has been serving as the ninth president of Singapore since 2023. Prior to his presidency, Tharman served as Senior Minister of Singapore between 2019 and 2023, Coordinating Minister for Social Policies between 2015 and 2023, and Chairman of the Monetary Authority of Singapore between 2011 and 2023.

Tharman is an economist in roles principally related to economic and social policies. He has also led various international councils and panels simultaneously. <...TRUNCATED SAMPLE...>"
}
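
To make the shape of a record more concrete, here is a minimal sketch that parses one record with Python's standard `json` module and pulls out a few fields. The record is a hypothetical, heavily trimmed version of the instance above (most list entries and the long strings are cut), and the helper name `summarize` is made up for illustration. How records are actually stored on disk is not stated here, so the sketch starts from a JSON string.

```python
import json

# Hypothetical, heavily trimmed record in the same shape as the sample above.
RAW_RECORD = """
{
  "id": 4024053,
  "title": "Tharman Shanmugaratnam",
  "url": "https://en.wikipedia.org/wiki/Tharman_Shanmugaratnam",
  "stub": false,
  "template": ["Efn", "Reflist"],
  "category": ["Presidents of Singapore"],
  "lang": "en",
  "figures_dict": [
    {"file_url": "./File:Mr_Tharman_at_Bloomberg_New_Economy_Forum.jpg", "caption": ""}
  ],
  "text": "9th President of Singapore\\n\\nTharman Shanmugaratnam (born 25 February 1957) ..."
}
"""


def summarize(record: dict) -> dict:
    """Collect a few headline facts about one record."""
    return {
        "id": record["id"],
        "title": record["title"],
        "lang": record["lang"],
        "is_stub": record["stub"],
        "n_templates": len(record.get("template", [])),
        "n_categories": len(record.get("category", [])),
        "figure_files": [f["file_url"] for f in record.get("figures_dict", [])],
        # The Markdown text starts with a short description line.
        "first_line": record["text"].split("\n", 1)[0],
    }


record = json.loads(RAW_RECORD)
summary = summarize(record)
print(summary["title"], summary["lang"], summary["first_line"])
```

Optional list fields are read with `.get(..., [])` so that a record missing one of them does not raise; whether real records can omit these fields is an assumption, not something the card states.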

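Data Fields

  • id: The ID of the article
  • title: The title of the Wikipedia article
  • url: The URL of the article
  • stub: A boolean (true/false) indicating whether the article is a stub
  • template: A list of templates found in the article
  • text: The post-processed HTML text, converted to Markdown with links removed and formatting (bold, italics) kept
  • license: The license of the article
  • wikitext: The wikitext. Unused, but available as a reference
  • lang: The language. It should match the wiki (for simplewiki, it should be en)
  • boxes_filters: Also known as rituals, as found in the original SuperWIKI. These are extracted from the CSS selector .ombox.ambox
  • infobox_html: A list of side infoboxes extracted from the text
  • figures_dict: A list of figures used in the article. Again, extracted from the text
  • text: The Markdown text. This is what you would probably use for LLM training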
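
The field descriptions above suggest a simple way to pick out training text: take `text`, and check that `lang` agrees with the wiki in `url` (with simplewiki mapping to en). The sketch below is one possible reading of that, not the dataset author's pipeline. The names `expected_lang` and `training_texts` are made up for illustration, and skipping stubs is an assumption of this sketch that can be switched off.

```python
from typing import Iterable, Iterator
from urllib.parse import urlparse


def expected_lang(url: str) -> str:
    """Language code implied by the wiki hostname; simplewiki maps to 'en'."""
    host = urlparse(url).hostname or ""
    subdomain = host.split(".", 1)[0]
    return "en" if subdomain == "simple" else subdomain


def training_texts(
    records: Iterable[dict], drop_stubs: bool = True
) -> Iterator[str]:
    """Yield the Markdown `text` of each record that passes the checks."""
    for rec in records:
        # Assumption of this sketch: stubs are not wanted for training.
        if drop_stubs and rec.get("stub"):
            continue
        # Skip records whose `lang` disagrees with the wiki in `url`.
        if rec.get("lang") != expected_lang(rec.get("url", "")):
            continue
        text = rec.get("text", "").strip()
        if text:
            yield text
```

Passing `drop_stubs=False` keeps stub articles, which may be preferable for low-resource languages where many articles are short.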