
community-datasets/wiki_snippets

Source: Hugging Face · Updated: 2024-06-26 · Indexed: 2024-06-15
Download link:
https://hf-mirror.com/datasets/community-datasets/wiki_snippets
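Summary:
The WikiSnippets dataset consists of text passages extracted from Wikipedia and Wiki-40B for dense semantic indexing. It has two configurations, wiki40b_en_100_0 and wikipedia_en_100_0, built from the English versions of Wiki-40B and Wikipedia respectively. Each configuration contains fields such as the article title and the passage text, and the dataset sizes and download information are listed in detail.
Provider:
community-datasets
Summary of original information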

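Dataset overview

Basic information

  • Dataset name: WikiSnippets
  • Language: English
  • License: unknown
  • Multilinguality: multilingual
  • Dataset size: 10M < n < 100M samples
  • Source datasets: extended from wiki40b and wikipedia
  • Task categories: text generation, other
  • Task ID: language modeling
  • Tags: text search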

Dataset structure

Configurations

  • Configuration name: wiki40b_en_100_0

    • Features:
      • _id: string
      • datasets_id: int32
      • wiki_id: string
      • start_paragraph: int32
      • start_character: int32
      • end_paragraph: int32
      • end_character: int32
      • article_title: string
      • section_title: string
      • passage_text: string
    • Splits:
      • train: 17553713 samples, 12.94 GB
  • Configuration name: wikipedia_en_100_0

    • Features:
      • _id: string
      • datasets_id: int32
      • wiki_id: string
      • start_paragraph: int32
      • start_character: int32
      • end_paragraph: int32
      • end_character: int32
      • article_title: string
      • section_title: string
      • passage_text: string
    • Splits:
      • train: 33849898 samples, 26.41 GB
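
Both configurations share one schema, so a loader and a record check can be written once and parameterised by configuration name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository loads under the name shown in the download link. `EXPECTED_FEATURES`, `check_record`, and `stream_snippets` are illustrative names, not part of the dataset, and depending on the library version the `load_dataset` call may need extra arguments.

```python
from itertools import islice
from typing import Any, Dict, Iterator

# Field names and Python types mirroring the feature list above
# (int32 features are assumed to surface as plain Python ints).
EXPECTED_FEATURES = {
    "_id": str,
    "datasets_id": int,
    "wiki_id": str,
    "start_paragraph": int,
    "start_character": int,
    "end_paragraph": int,
    "end_character": int,
    "article_title": str,
    "section_title": str,
    "passage_text": str,
}

CONFIGS = ("wiki40b_en_100_0", "wikipedia_en_100_0")


def check_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(EXPECTED_FEATURES):
        return False
    return all(isinstance(record[k], t) for k, t in EXPECTED_FEATURES.items())


def stream_snippets(config: str, limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield the first `limit` training records of one configuration.

    Streaming is used here so the full split is not written to disk first.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config}")
    # Imported lazily so the schema helpers work without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "community-datasets/wiki_snippets", config, split="train", streaming=True
    )
    yield from islice(dataset, limit)
```

Iterating over `stream_snippets("wiki40b_en_100_0")` and passing each record to `check_record` gives a quick sanity check before committing to a full download.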

Data instances

wiki40b_en_100_0

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 12.94 GB
  • Total amount of disk used: 12.94 GB
  • Training sample example (JSON):

    {
      "_id": "{\"datasets_id\": 0, \"wiki_id\": \"Q1294448\", \"sp\": 2, \"sc\": 0, \"ep\": 6, \"ec\": 610}",
      "datasets_id": 0,
      "wiki_id": "Q1294448",
      "start_paragraph": 2,
      "start_character": 0,
      "end_paragraph": 6,
      "end_character": 610,
      "article_title": "Ági Szalóki",
      "section_title": "Life",
      "passage_text": "Ági Szalóki Life She started singing as a toddler, considering Márta Sebestyén a role model. Her musical background is traditional folk music; she first won recognition for singing with Ökrös in a traditional folk style, and Besh o droM, a Balkan gypsy brass band. With these ensembles she toured around the world from the Montreal Jazz Festival, through Glastonbury Festival to the Théatre de la Ville in Paris, from New York to Beijing. Since 2005, she began to pursue her solo career and explore various genres, such as jazz, thirties ballads, or childrens songs. Until now, three of her six released albums"
    }
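
In the sample, the `_id` field is itself a JSON-encoded string that repeats the span coordinates under short keys (`sp`, `sc`, `ep`, `ec`). The sketch below decodes it and cross-checks it against the top-level fields. The key mapping is inferred from this single sample rather than from a published specification, and `decode_id` and `id_matches_record` are hypothetical helper names.

```python
import json
from typing import Any, Dict

# Short keys inside `_id` and the top-level fields they appear to mirror
# (an assumption inferred from the sample record).
ID_KEY_TO_FIELD = {
    "datasets_id": "datasets_id",
    "wiki_id": "wiki_id",
    "sp": "start_paragraph",
    "sc": "start_character",
    "ep": "end_paragraph",
    "ec": "end_character",
}


def decode_id(raw_id: str) -> Dict[str, Any]:
    """Parse the JSON string stored in the `_id` field."""
    return json.loads(raw_id)


def id_matches_record(record: Dict[str, Any]) -> bool:
    """Check that every key in `_id` agrees with its top-level counterpart."""
    decoded = decode_id(record["_id"])
    return all(decoded.get(key) == record[field] for key, field in ID_KEY_TO_FIELD.items())


# The sample record from above, with passage_text omitted for brevity.
record = {
    "_id": json.dumps(
        {"datasets_id": 0, "wiki_id": "Q1294448", "sp": 2, "sc": 0, "ep": 6, "ec": 610}
    ),
    "datasets_id": 0,
    "wiki_id": "Q1294448",
    "start_paragraph": 2,
    "start_character": 0,
    "end_paragraph": 6,
    "end_character": 610,
    "article_title": "Ági Szalóki",
    "section_title": "Life",
}

print(id_matches_record(record))  # True
```

A check like this can flag records whose identifier and span fields have drifted apart after filtering or re-chunking.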

wikipedia_en_100_0

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 26.41 GB
  • Total amount of disk used: 26.41 GB
  • Training sample example (JSON):

    {
      "_id": "{\"datasets_id\": 0, \"wiki_id\": \"Anarchism\", \"sp\": 0, \"sc\": 0, \"ep\": 2, \"ec\": 129}",
      "datasets_id": 0,
      "wiki_id": "Anarchism",
      "start_paragraph": 0,
      "start_character": 0,
      "end_paragraph": 2,
      "end_character": 129,
      "article_title": "Anarchism",
      "section_title": "Start",
      "passage_text": "Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism. Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the"
    }
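
The snippets are described as input for dense semantic indexing, which normally relies on a learned embedding model. As a stand-in that runs without any model, the sketch below ranks passages by cosine similarity over bag-of-words counts. This is a sparse lexical substitute, not a dense index, and is meant only to illustrate the index-then-query flow. All function names are illustrative, and the passage texts are shortened from the two samples.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real index would use a neural encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(passages: Dict[str, str]) -> Dict[str, Counter]:
    """Map each passage id to its vector."""
    return {pid: embed(text) for pid, text in passages.items()}


def search(index: Dict[str, Counter], query: str, top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k (passage id, score) pairs, best first."""
    q = embed(query)
    scored = [(pid, cosine(q, vec)) for pid, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Shortened passage_text values from the two sample records, keyed by wiki_id.
passages = {
    "Q1294448": "Ági Szalóki Life She started singing as a toddler, "
    "considering Márta Sebestyén a role model.",
    "Anarchism": "Anarchism is a political philosophy and movement that is sceptical "
    "of authority and rejects all involuntary, coercive forms of hierarchy.",
}

index = build_index(passages)
best_id, score = search(index, "political philosophy sceptical of authority")[0]
print(best_id)
```

Replacing `embed` with a sentence encoder and the dictionary with a vector store would turn the same flow into an actual dense index.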

Citation information

Wiki-40B

    @inproceedings{49029,
      title = {Wiki-40B: Multilingual Language Model Dataset},
      author = {Mandy Guo and Zihang Dai and Denny Vrandecic and Rami Al-Rfou},
      year = {2020},
      booktitle = {LREC 2020}
    }

Wikipedia

    @ONLINE{wikidump,
      author = "Wikimedia Foundation",
      title = "Wikimedia Downloads",
      url = "https://dumps.wikimedia.org"
    }
