
Teddy487/WikiAssertions

Source: Hugging Face. Updated 2024-05-30. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Teddy487/WikiAssertions
Resource description:
---
dataset_info:
  features:
  - name: articleId
    dtype: int64
  - name: lineId
    dtype: int64
  - name: factId
    dtype: int64
  - name: text
    dtype: string
  - name: subj
    dtype: string
  - name: pred
    dtype: string
  - name: auxi
    sequence: string
  - name: prep1
    dtype: string
  - name: obj1
    dtype: string
  - name: prep2
    dtype: string
  - name: obj2
    dtype: string
  - name: prep3
    dtype: string
  - name: obj3
    dtype: string
  - name: prep4
    dtype: string
  - name: obj4
    dtype: string
  - name: prep5
    dtype: string
  - name: obj5
    dtype: string
  splits:
  - name: train
    num_bytes: 86532750060
    num_examples: 439305160
  download_size: 31248975339
  dataset_size: 86532750060
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# WikiAssertions: A Dataset of Assertions from Wikipedia

WikiAssertions contains all the Assertions (a.k.a. Atomic Facts) in Wikipedia. It was created by running a [strong multi-valent open IE system](https://github.com/Teddy-Li/MulVOIEL/) on the sentencized [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus. The exact model checkpoint that we used to parse the corpus can be downloaded at [Teddy487/LLaMA3-8b-for-OpenIE](https://huggingface.co/Teddy487/LLaMA3-8b-for-OpenIE).

Assertions are multi-valent relation tuples representing factoid information at the atomic level.

For example, given the following sentence:

    Earlier this year , President Bush made a final `` take - it - or - leave it '' offer on the minimum wage

The following assertion can be extracted:

<<span style="color:#2471A3">President Bush</span>, <span style="color:#A93226">made</span>, <span style="color:#138D75">a final "take-it-or-leave-it" offer</span>, <span style="color:#B7950B ">on the minimum wage</span>, <span style="color:#B9770E">earlier this year</span>>

We introduce the data format below, and refer users to our [GitHub Repository](https://github.com/Teddy-Li/MulVOIEL/) and our Model Cards ([Teddy487/LLaMA2-7b-for-OpenIE](https://huggingface.co/Teddy487/LLaMA2-7b-for-OpenIE), [**Teddy487/LLaMA3-8b-for-OpenIE**](https://huggingface.co/Teddy487/LLaMA3-8b-for-OpenIE)) for more information.

## Data Format

Each entry in this dataset is an assertion in Wikipedia. An assertion is a multi-valent relation tuple of the format:

`<subj> ,, (<auxi> ###) <predicate> ,, (<prep1> ###) <obj1>, (<prep2> ###) <obj2>, ...`

An assertion includes a subject, a predicate (essentially a verb), an optional auxiliary (negation, modal verb, etc.), and a number of objects (0, 1, 2, etc.). Each object may come with an optional preposition (e.g. on, with, for).

The dataset follows the format laid out below:

### MetaData Columns:

1. articleId: the ID of the document, as in the id column of the [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus.
2. lineId: the sentence ID within a document, as sentencized using the [spaCy sentencizer](https://spacy.io/api/sentencizer).
3. factId: the assertion ID within the sentence.

### Content Columns:

1. text: the assertion presented in the form of natural language text.
2. subj: subject of the assertion.
3. pred: predicate (main verb) of the assertion.
4. (optional) auxi: auxiliary element of the assertion (negation, modal verbs, etc.).
5. (optional) prep1: preposition for object 1 (can be empty).
6. (optional) obj1: object 1 (typically the direct object for transitive verbs; can be empty for intransitive verbs).
7. (optional) prep2: preposition for object 2 (can be empty).
8. (optional) obj2: object 2.
9. (optional) prep3: preposition for object 3 (can be empty).
10. (optional) obj3: object 3.
11. (optional) prep4: preposition for object 4 (can be empty).
12. (optional) obj4: object 4.
13. (optional) prep5: preposition for object 5 (can be empty).
14. (optional) obj5: object 5.

Note that we keep a maximum of 5 object slots per assertion. When an assertion involves more arguments, the overflowing arguments are ignored.

When the predicate is a [light verb](https://en.wikipedia.org/wiki/Light_verb), the light verb itself does not bear sufficient meaning to disambiguate the eventualities. Therefore, in that case, we merge the first object (`obj1`) into the `pred` field. We only do this when the first object does not come with a preposition (i.e. `prep1` is empty); otherwise, we treat it as an anomaly and discard that assertion.

## Mapping with the original text in Wikipedia

The original text for each assertion in the dataset can be found in the [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus on Hugging Face. We use the `20220301.en` version of Wikipedia, which can be loaded using the following Python command:

```python
from datasets import load_dataset

wiki_corpus = load_dataset("wikipedia", "20220301.en", split='train')
```

The `articleId` field in the dataset corresponds to the document `id` in the loaded dataset.

If you wish to locate the exact sentence from which an assertion was extracted, you can use the following Python commands:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("sentencizer")
# Keep only the rule-based sentencizer active; the other components are disabled.
nlp.select_pipes(enable=["sentencizer"])

# articleId and lineId are taken from a WikiAssertions row.
# Integer indexing selects a row of wiki_corpus by position.
doc = wiki_corpus[articleId]['text']
doc = nlp(doc)
sents = list(doc.sents)

this_sent = sents[lineId].text
print(this_sent)
```

Note that you first need to run `python -m spacy download en_core_web_sm` before running the above script.
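Since `articleId` is described as matching the `id` column, while integer indexing on the loaded corpus selects by row position, it may be safer to look an article up by its `id` value. The following is a minimal sketch under stated assumptions: the helper names `build_id_index` and `article_text` are hypothetical, the corpus is treated as any indexable sequence of dicts with `id` and `text` keys, and ids are compared as strings in case the `id` column is stored as text.

```python
def build_id_index(corpus):
    """Map each article's `id` (normalised to a string) to its row position."""
    return {str(row["id"]): pos for pos, row in enumerate(corpus)}


def article_text(corpus, id_index, article_id):
    """Return the text of the article whose `id` equals article_id."""
    return corpus[id_index[str(article_id)]]["text"]


# Tiny stand-in corpus for illustration; in practice this would be wiki_corpus.
toy_corpus = [
    {"id": "12", "text": "Anarchism is a political philosophy."},
    {"id": "25", "text": "Autism is a neurodevelopmental condition."},
]
toy_index = build_id_index(toy_corpus)
print(article_text(toy_corpus, toy_index, 25))
```

On the full corpus, building the index from the `id` column alone would avoid reading every article's text while enumerating.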
Provider:
Teddy487
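Summary of original information

WikiAssertions dataset overview

Dataset information

Feature columns

  • articleId: document ID, type int64
  • lineId: sentence ID, type int64
  • factId: assertion ID, type int64
  • text: natural-language text form of the assertion, type string
  • subj: subject of the assertion, type string
  • pred: predicate (main verb) of the assertion, type string
  • auxi: auxiliary element of the assertion (negation, modal verbs, etc.), type sequence of string
  • prep1: preposition for object 1, type string
  • obj1: object 1, type string
  • prep2: preposition for object 2, type string
  • obj2: object 2, type string
  • prep3: preposition for object 3, type string
  • obj3: object 3, type string
  • prep4: preposition for object 4, type string
  • obj4: object 4, type string
  • prep5: preposition for object 5, type string
  • obj5: object 5, type string

Data splits

  • train: training split with 439,305,160 examples, 86,532,750,060 bytes in total

Dataset size

  • Download size: 31,248,975,339 bytes
  • Dataset size: 86,532,750,060 bytes

Configuration

  • default: the default configuration; the training data files are at data/train-*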
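To put these figures in perspective, the short calculation below derives the average size of one assertion and the ratio of download size to dataset size. It only uses the numbers listed above; the variable names are illustrative.

```python
# Figures copied from the dataset metadata.
num_examples = 439_305_160
dataset_bytes = 86_532_750_060
download_bytes = 31_248_975_339

# Average stored size of a single assertion row.
avg_bytes_per_example = dataset_bytes / num_examples
# Fraction of the dataset size that has to be downloaded.
download_ratio = download_bytes / dataset_bytes

print(f"average size per assertion: {avg_bytes_per_example:.1f} bytes")
print(f"download size / dataset size: {download_ratio:.2f}")
print(f"dataset size: {dataset_bytes / 1e9:.1f} GB")
```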

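Data format

Each entry is an assertion from Wikipedia. An assertion is a multi-valent relation tuple of the following format:

<subj> ,, (<auxi> ###) <predicate> ,, (<prep1> ###) <obj1>, (<prep2> ###) <obj2>, ...

An assertion consists of a subject, a predicate (essentially a verb), an optional auxiliary element (negation, modal verb, etc.), and any number of objects (0, 1, 2, and so on). Each object may carry an optional preposition (e.g. on, with, for).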
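The tuple notation above can be illustrated by serialising one row of the dataset. This is a sketch, not the authors' code: `format_assertion` is a hypothetical helper, joining multiple `auxi` items with a space is an assumption, and the way the President Bush example is split into fields is a guess based on the example in the dataset card.

```python
def format_assertion(row):
    """Serialise a row into `<subj> ,, (<auxi> ###) <pred> ,, (<prep> ###) <obj>, ...`."""
    auxi = " ".join(row.get("auxi") or [])  # assumption: auxiliaries joined by spaces
    pred = f"{auxi} ### {row['pred']}" if auxi else row["pred"]

    args = []
    for i in range(1, 6):  # the dataset keeps at most 5 object slots
        obj = row.get(f"obj{i}") or ""
        if not obj:
            continue
        prep = row.get(f"prep{i}") or ""
        args.append(f"{prep} ### {obj}" if prep else obj)

    parts = [row["subj"], pred]
    if args:
        parts.append(", ".join(args))
    return " ,, ".join(parts)


# Assumed field split for the example sentence from the dataset card.
example_row = {
    "subj": "President Bush",
    "pred": "made",
    "auxi": [],
    "prep1": "", "obj1": 'a final "take-it-or-leave-it" offer',
    "prep2": "on", "obj2": "the minimum wage",
    "prep3": "", "obj3": "earlier this year",
}
print(format_assertion(example_row))
```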

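Metadata columns

  1. articleId: document ID, corresponding to the id column of the Wikipedia corpus.
  2. lineId: sentence ID within the document, with sentences split by the spaCy sentencizer.
  3. factId: assertion ID within the sentence.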
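Together, the three metadata columns identify an assertion, and the first two identify its source sentence. As an illustration, the sketch below groups rows by `(articleId, lineId)` so that all assertions extracted from the same sentence end up together. The helper name `group_by_sentence` and the toy rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_by_sentence(rows):
    """Group assertion rows by (articleId, lineId), ordered by factId within a group."""
    ordered = sorted(rows, key=itemgetter("articleId", "lineId", "factId"))
    sentence_key = itemgetter("articleId", "lineId")
    return {key: list(group) for key, group in groupby(ordered, key=sentence_key)}


# Toy rows carrying only the metadata columns and the text column.
toy_rows = [
    {"articleId": 12, "lineId": 3, "factId": 0, "text": "c"},
    {"articleId": 12, "lineId": 0, "factId": 1, "text": "b"},
    {"articleId": 25, "lineId": 1, "factId": 0, "text": "d"},
    {"articleId": 12, "lineId": 0, "factId": 0, "text": "a"},
]
groups = group_by_sentence(toy_rows)
for key, facts in groups.items():
    print(key, [fact["text"] for fact in facts])
```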

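Content columns

  1. text: natural-language text form of the assertion.
  2. subj: subject of the assertion.
  3. pred: predicate (main verb) of the assertion.
  4. (optional) auxi: auxiliary element of the assertion (negation, modal verbs, etc.).
  5. (optional) prep1: preposition for object 1 (can be empty).
  6. (optional) obj1: object 1 (typically the direct object of a transitive verb; can be empty for intransitive verbs).
  7. (optional) prep2: preposition for object 2 (can be empty).
  8. (optional) obj2: object 2.
  9. (optional) prep3: preposition for object 3 (can be empty).
  10. (optional) obj3: object 3.
  11. (optional) prep4: preposition for object 4 (can be empty).
  12. (optional) obj4: object 4.
  13. (optional) prep5: preposition for object 5 (can be empty).
  14. (optional) obj5: object 5.

Note that each assertion keeps at most 5 object slots. If an assertion involves more arguments, the overflowing arguments are ignored.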
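The five-slot limit can be sketched as follows. This is an illustrative helper, not the authors' pipeline: `fill_slots` is a hypothetical name, arguments are assumed to arrive as ordered `(prep, obj)` pairs, and representing unused slots as empty strings is an assumption about the stored data.

```python
MAX_SLOTS = 5  # maximum number of object slots kept per assertion


def fill_slots(arguments):
    """Place ordered (prep, obj) pairs into prep1..obj5, ignoring any overflow."""
    slots = {}
    for i in range(1, MAX_SLOTS + 1):
        # Unused slots are left as empty strings (an assumption).
        prep, obj = arguments[i - 1] if i <= len(arguments) else ("", "")
        slots[f"prep{i}"] = prep
        slots[f"obj{i}"] = obj
    return slots


# Six arguments: the sixth does not fit and is dropped.
six_args = [("", "a"), ("on", "b"), ("with", "c"), ("for", "d"), ("in", "e"), ("at", "f")]
slots = fill_slots(six_args)
print(slots["prep5"], slots["obj5"])
print("f" in slots.values())
```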

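When the predicate is a light verb, the light verb alone does not carry enough meaning to disambiguate the eventuality. In that case, we merge the first object (obj1) into the pred field. We only do this when the first object has no preposition (i.e. prep1 is empty); otherwise, we treat it as an anomaly and discard the assertion.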
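The light-verb rule can be sketched as below. Several details are assumptions and not confirmed by the source: the `LIGHT_VERBS` set is a hypothetical placeholder, predicates are matched by exact string, the remaining objects are left in their original slots after the merge, and a light verb with no first object is passed through unchanged.

```python
LIGHT_VERBS = {"make", "take", "have", "give", "do"}  # hypothetical list


def merge_light_verb(row, light_verbs=LIGHT_VERBS):
    """Merge obj1 into pred for light verbs; return None for discarded rows."""
    if row["pred"] not in light_verbs:
        return row  # ordinary predicate: nothing to do
    if row.get("prep1"):
        return None  # first object has a preposition: treated as an anomaly
    if not row.get("obj1"):
        return row  # case not covered by the source; left unchanged here
    merged = dict(row)
    merged["pred"] = f"{row['pred']} {row['obj1']}"
    merged["obj1"] = ""  # assumption: other object slots are not shifted
    return merged


light_row = {"subj": "the committee", "pred": "make", "prep1": "",
             "obj1": "a decision", "prep2": "on", "obj2": "the budget"}
merged_row = merge_light_verb(light_row)
print(merged_row["pred"])
```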

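Mapping to the original Wikipedia text

The original text of every assertion in the dataset can be found in the Wikipedia corpus on Hugging Face. We use the 20220301.en version of Wikipedia, which can be loaded with the following Python commands:

    from datasets import load_dataset

    wiki_corpus = load_dataset("wikipedia", "20220301.en", split='train')

The articleId field in the dataset corresponds to the document id in the loaded dataset.

To locate the exact sentence an assertion was extracted from, you can use the following Python commands:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe("sentencizer")
    nlp.select_pipes(enable=["sentencizer"])

    doc = wiki_corpus[articleId]['text']
    doc = nlp(doc)
    sents = list(doc.sents)

    this_sent = sents[lineId].text

    print(this_sent)

Note that you need to run python -m spacy download en_core_web_sm before running the script above.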
