chentong00/propositionizer-wiki-data
Dataset Overview
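This dataset is the training data for the Propositionizer-wiki model and is used to explore the idea of using propositions as retrieval units. It was built by prompting GPT-4 to decompose Wikipedia passages into lists of propositions. A proposition is defined as follows:
- Each proposition should correspond to a distinct piece of meaning in the text, and together the propositions should represent the semantics of the whole text.
- A proposition should be minimal, i.e. it cannot be split further into separate propositions.
- A proposition should be contextualized and self-contained, i.e. it should include all the context from the text needed to interpret its meaning (for example, resolved coreferences).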
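Dataset Structure
The dataset is structured as follows:
sources is a Wikipedia passage in the format "Title: {title}. Section: {section}. {content}". The title is never empty, but the section may be. targets is a JSON-formatted string that encodes a list of propositions.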
Example:

    {
      "sources": "Title: Leaning Tower of Pisa. Section: . Prior to restoration work performed between 1990 and 2001, the tower leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center.",
      "targets": "[\"Prior to restoration work performed between 1990 and 2001, the Leaning Tower of Pisa leaned at an angle of 5.5 degrees.\", \"The Leaning Tower of Pisa now leans at about 3.99 degrees.\", \"The top of the Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center.\"]"
    }
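Because targets is a JSON string rather than a native list, and sources packs the title, section, and content into a single string, a record usually needs a small parsing step before use. The following is a minimal sketch of that step, assuming records arrive as plain dictionaries with the two fields described above. The names `SOURCE_PATTERN` and `parse_record` are hypothetical helpers introduced for illustration and are not part of the dataset or its tooling.

```python
import json
import re

# Hypothetical pattern for the documented layout:
# "Title: {title}. Section: {section}. {content}"
# Assumption: the first ". Section: " ends the title and the first ". "
# after it ends the section, so a section containing ". " would be cut short.
SOURCE_PATTERN = re.compile(
    r"^Title: (?P<title>.*?)\. Section: (?P<section>.*?)\. (?P<content>.*)$",
    re.DOTALL,
)


def parse_record(record):
    """Split `sources` into its parts and decode `targets` into a list."""
    match = SOURCE_PATTERN.match(record["sources"])
    if match is None:
        raise ValueError("sources does not follow the documented format")
    return {
        "title": match["title"],
        "section": match["section"].strip(),  # may be an empty string
        "content": match["content"],
        "propositions": json.loads(record["targets"]),  # JSON string -> list
    }


# The example record from this card; json.dumps builds the targets string
# so the inner quotes are escaped correctly.
example = {
    "sources": (
        "Title: Leaning Tower of Pisa. Section: . Prior to restoration work "
        "performed between 1990 and 2001, the tower leaned at an angle of 5.5 "
        "degrees, but the tower now leans at about 3.99 degrees. This means "
        "the top of the Leaning Tower of Pisa is displaced horizontally 3.9 "
        "meters (12 ft 10 in) from the center."
    ),
    "targets": json.dumps([
        "Prior to restoration work performed between 1990 and 2001, the "
        "Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
        "The Leaning Tower of Pisa now leans at about 3.99 degrees.",
        "The top of the Leaning Tower of Pisa is displaced horizontally 3.9 "
        "meters (12 ft 10 in) from the center.",
    ]),
}

parsed = parse_record(example)
print(parsed["title"], len(parsed["propositions"]))
```

The regular expression is the fragile part of this sketch: it relies on the separators appearing exactly as documented, so records that fail to match raise an error instead of being silently misparsed.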




