WikiDes
WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs
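Dataset Overview
WikiDes is a dataset for generating Wikidata descriptions from Wikipedia paragraphs. It is suited to the following NLP problems:
- Title generation
- Text classification by instance (topic)
- Extracting instances (topics) from text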
Dataset Contents
The dataset contains more than 80,000 samples, stored in the file collected_data.json.
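Data Fields
- wikidata_id: the identifier of the Wikidata item.
- label: the label of the Wikidata item, or the title of the Wikipedia article.
- description: the description of the Wikidata item, used as the gold description.
- instances: the list of instances (P31) of the Wikidata item, which can serve as baseline descriptions.
- subclasses: the list of subclasses (P279) of the Wikidata item.
- aliases: alternative names for the Wikidata item's label.
- first_paragraph: the first paragraph of the Wikipedia article linked to the Wikidata item.
- first_sentence: the first sentence of the first paragraph.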
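A minimal sketch of loading the file and checking records against the fields listed above. Whether collected_data.json holds a single JSON array or one JSON object per line is not stated here, so the hypothetical helper `load_records` accepts either layout; `EXPECTED_FIELDS` and `missing_fields` are likewise illustrative names, not part of the dataset.

```python
import json

# Field names taken from the field list above.
EXPECTED_FIELDS = {
    "wikidata_id", "label", "description", "instances",
    "subclasses", "aliases", "first_paragraph", "first_sentence",
}

def load_records(path):
    """Load records from a JSON array file or a JSON Lines file.

    Assumption: the on-disk layout of collected_data.json is not
    documented here, so both common layouts are tried.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)
        # A single top-level object is treated as one record.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]

def missing_fields(record):
    """Return the expected field names absent from a record, sorted."""
    return sorted(EXPECTED_FIELDS - set(record))
```

Calling `missing_fields` on each loaded record is a cheap way to confirm that a local copy of the file matches the schema before training on it.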
Example Data
{
  "wikidata_id": "Q65293712",
  "label": "Lepisma saccharina",
  "description": "small, wingless insect in the order Thysanura",
  "instances": [["Q16521", "taxon"]],
  "subclasses": [["Q219174", "pest"]],
  "aliases": ["Lepisma saccharina", "fishmoth", "Silverfish"],
  "first_paragraph": "The silverfish (Lepisma saccharinum) is a species of small, primitive, wingless insect in the order Zygentoma (formerly Thysanura). Its common name derives from the insects silvery light grey colour, combined with the fish-like appearance of its movements. The scientific name (L. saccharinum) indicates that the silverfishs diet consists of carbohydrates such as sugar or starches. While the common name silverfish is used throughout the global literature to refer to various species of Zygentoma, the Entomological Society of America restricts use of the term solely for Lepisma saccharinum.",
  "first_sentence": "The silverfish (Lepisma saccharinum) is a species of small, primitive, wingless insect in the order Zygentoma (formerly Thysanura)."
}
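Training Process
Training proceeds in two phases: description generation and candidate ranking.
Phase 1: Description Generation
The dataset is split in two ways:
- topic-exclusive split (diff): the training, validation, and test sets contain disjoint topics, with 65,772/7,820/7,827 samples.
- topic-independent split (random): topics are distributed randomly across all sets, with 68,296/8,540/8,542 samples.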
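The difference between the two splits can be sketched as follows. This is an illustration, not the procedure used to build the released splits: treating the label of a record's first `instances` entry as its topic, the 80/10/10 ratios (which roughly match the sample counts above), and the helper names are all assumptions.

```python
import random
from collections import defaultdict

def topic_of(record):
    # Assumption: a record's topic is the label of its first P31 instance.
    instances = record.get("instances") or []
    return instances[0][1] if instances else "unknown"

def random_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Topic-independent split: shuffle records, then cut by ratio."""
    items = list(records)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def topic_exclusive_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Topic-exclusive split: every topic lands in exactly one subset."""
    by_topic = defaultdict(list)
    for rec in records:
        by_topic[topic_of(rec)].append(rec)
    topics = sorted(by_topic)
    random.Random(seed).shuffle(topics)
    total = len(records)
    train, val, test = [], [], []
    for topic in topics:
        group = by_topic[topic]
        # Fill train first, then validation; remaining topics go to test.
        # Whole topics are assigned, so subset sizes are only approximate.
        if len(train) < ratios[0] * total:
            train.extend(group)
        elif len(val) < ratios[1] * total:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

With the topic-exclusive variant, a model is evaluated on topics it never saw during training, which is the harder of the two settings; the random variant lets the same topic appear in all three subsets.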
Example Data
{
  "wikidata_id": "Q55135146",
  "label": "Xyleborus intrusus",
  "source": "Xyleborus intrusus is a species of typical bark beetle in the family Curculionidae. It is found in North America.",
  "target": "species of insect",
  "baseline_candidates": ["taxon"]
}
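Comparing this record with the collected_data.json example suggests how the Phase 1 fields relate to the raw ones. The mapping below is an inference from the two examples, not a documented procedure: `source` is assumed to come from `first_paragraph`, `target` from `description`, and `baseline_candidates` from the labels in `instances`. The function name `to_generation_example` is hypothetical, and the raw record is reconstructed from the Phase 1 example for illustration.

```python
def to_generation_example(record):
    """Map a raw record to the Phase 1 layout (assumed field mapping)."""
    return {
        "wikidata_id": record["wikidata_id"],
        "label": record["label"],
        "source": record["first_paragraph"],
        "target": record["description"],
        # Each instance is a [QID, label] pair; only the label is kept.
        "baseline_candidates": [label for _qid, label in record.get("instances", [])],
    }

# Raw record reconstructed for illustration; not copied from the dataset file.
raw = {
    "wikidata_id": "Q55135146",
    "label": "Xyleborus intrusus",
    "description": "species of insect",
    "instances": [["Q16521", "taxon"]],
    "first_paragraph": (
        "Xyleborus intrusus is a species of typical bark beetle in the "
        "family Curculionidae. It is found in North America."
    ),
}

example = to_generation_example(raw)
print(example["baseline_candidates"])  # ['taxon']
```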
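Phase 2: Candidate Ranking
The dataset is split in two ways:
- different topic splitting: the training, validation, and test sets contain disjoint topics, with 6,000/1,000/1,000 samples.
- random topic splitting: topics are distributed randomly across all sets, with 6,000/1,000/1,000 samples.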
Example Data
{
  "source": "Knuthenborg Safaripark is a safari park on the island of Lolland in the southeast of Denmark. It is located 7 km (on Rte 289) to the north of Maribo, near Bandholm. It is one of Lollands major tourist attractions with over 250,000 visitors annually, and is the largest safari park in northern Europe. It is also the largest natural playground for both children and adults in Denmark. Among others, it houses an arboretum, aviaries, a drive-through safari park, a monkey forest (with baboons, tamarins and lemurs) and a tiger enclosure. Knuthenborg covers a total of 660 hectares (1,600 acres), including the 400-hectare (990-acre) Safaripark. The park is viewable on Google Street View.",
  "candidate": ["park in Lolland, Denmark", "safari park"],
  "target": "Safari park in Denmark"
}
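To make the Phase 2 record layout concrete, the sketch below orders the `candidate` list by token overlap with `target`. Token-level F1 is a stand-in chosen for illustration; this section does not specify the ranking model or scoring function, and the names `token_f1` and `rank_candidates` are hypothetical.

```python
from collections import Counter

def token_f1(candidate, target):
    """Token-level F1 between two lowercased, whitespace-split strings."""
    cand, gold = candidate.lower().split(), target.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def rank_candidates(example):
    """Return candidates sorted from most to least similar to the target."""
    return sorted(
        example["candidate"],
        key=lambda c: token_f1(c, example["target"]),
        reverse=True,
    )

example = {
    "candidate": ["park in Lolland, Denmark", "safari park"],
    "target": "Safari park in Denmark",
}
print(rank_candidates(example))
# ['park in Lolland, Denmark', 'safari park']
```

Because it needs the gold target, a score like this can only produce training labels or evaluate a ranker; it cannot rank candidates at inference time.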




