Teddy487/WikiAssertions

Name: Teddy487/WikiAssertions
Creator: Teddy487
Published: 2024-05-30 20:12:13
License: No description available

Hugging Face, updated 2024-05-30, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/Teddy487/WikiAssertions

Resource description:

---
dataset_info:
  features:
  - name: articleId
    dtype: int64
  - name: lineId
    dtype: int64
  - name: factId
    dtype: int64
  - name: text
    dtype: string
  - name: subj
    dtype: string
  - name: pred
    dtype: string
  - name: auxi
    sequence: string
  - name: prep1
    dtype: string
  - name: obj1
    dtype: string
  - name: prep2
    dtype: string
  - name: obj2
    dtype: string
  - name: prep3
    dtype: string
  - name: obj3
    dtype: string
  - name: prep4
    dtype: string
  - name: obj4
    dtype: string
  - name: prep5
    dtype: string
  - name: obj5
    dtype: string
  splits:
  - name: train
    num_bytes: 86532750060
    num_examples: 439305160
  download_size: 31248975339
  dataset_size: 86532750060
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# WikiAssertions: A Dataset of Assertions from Wikipedia

WikiAssertions contains all the Assertions (a.k.a. Atomic Facts) in Wikipedia. It was created by running a [strong multi-valent open IE system](https://github.com/Teddy-Li/MulVOIEL/) on the sentencized [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus. The exact model checkpoint that we used to parse the corpus can be downloaded at [Teddy487/LLaMA3-8b-for-OpenIE](https://huggingface.co/Teddy487/LLaMA3-8b-for-OpenIE).

Assertions are multi-valent relation tuples representing factoid information at the atomic level.

For example, given the following sentence:

`Earlier this year , President Bush made a final `` take - it - or - leave it '' offer on the minimum wage`

The following assertion can be extracted:

<President Bush, made, a final "take-it-or-leave-it" offer, on the minimum wage, earlier this year>

We introduce the data format below, and refer users to our [Github Repository](https://github.com/Teddy-Li/MulVOIEL/) and our Model Cards ([Teddy487/LLaMA2-7b-for-OpenIE](https://huggingface.co/Teddy487/LLaMA2-7b-for-OpenIE), [**Teddy487/LLaMA3-8b-for-OpenIE**](https://huggingface.co/Teddy487/LLaMA3-8b-for-OpenIE)) for more information.

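
Since the metadata above lists a download size of roughly 31 GB, it may be convenient to inspect a few rows without fetching the whole split. The following is a minimal sketch, assuming the `streaming=True` mode of `datasets.load_dataset` is suitable for this repository; the helper names `stream_assertions` and `peek` are hypothetical and not part of the dataset's tooling.

```python
from itertools import islice

DATASET_ID = "Teddy487/WikiAssertions"


def stream_assertions():
    """Open the train split lazily instead of downloading it in full."""
    # Imported here so the rest of the sketch works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=True)


def peek(rows, n=3):
    """Return the first n items of any iterable of rows."""
    return list(islice(rows, n))


# Usage (requires network access, so it is left commented out):
# for row in peek(stream_assertions()):
#     print(row["articleId"], row["lineId"], row["factId"], row["text"])
```

`peek` works on any iterable, so it can be tried on a plain list of dictionaries before touching the real data.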
## Data Format

Each entry in this dataset is an assertion in Wikipedia. An assertion is a multi-valent relation tuple of the format:

`<subj> ,, (<auxi> ###) <predicate> ,, (<prep1> ###) <obj1>, (<prep2> ###) <obj2>, ...`

An assertion includes a subject, a predicate (essentially a verb), an optional auxiliary (negation / modal verb / etc.), and a number of objects (0, 1, 2, etc.). Each object may come with an optional preposition (e.g. on, with, for, etc.).

The dataset follows the format laid out below:

### MetaData Columns:

1. articleId: the ID of the document as in the id column in the [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus.
2. lineId: the sentence ID within a document as sentencized using the [spaCy sentencizer](https://spacy.io/api/sentencizer).
3. factId: the assertion ID within the sentence.

### Content Columns:

1. text: the assertion presented in the form of natural language text.
2. subj: subject of the assertion.
3. pred: predicate (main verb) of the assertion.
4. (optional) auxi: auxiliary element of the assertion (negation, modal verbs, etc.)
5. (optional) prep1: preposition for object 1 (can be empty);
6. (optional) obj1: object 1 (typically the direct object for transitive verbs, could be empty for intransitive verbs).
7. (optional) prep2: preposition for object 2 (can be empty);
8. (optional) obj2: object 2
9. (optional) prep3: preposition for object 3 (can be empty);
10. (optional) obj3: object 3
11. (optional) prep4: preposition for object 4 (can be empty);
12. (optional) obj4: object 4
13. (optional) prep5: preposition for object 5 (can be empty);
14. (optional) obj5: object 5

Note that we keep a maximum number of 5 object slots per assertion. When an assertion involves more arguments, the overflowing arguments are ignored.

When the predicate is a [light verb](https://en.wikipedia.org/wiki/Light_verb), the light verb itself does not bear sufficient meaning to disambiguate the eventualities.
Therefore, in that case, we merge the first object (`obj1`) into the `pred` field. We only do this when the first object does not come with a preposition (i.e. `prep1` is empty); otherwise, we treat it as an anomaly and discard that assertion.

## Mapping with the original text in Wikipedia

The original text for each assertion in the dataset can be found from the [Wikipedia](https://huggingface.co/datasets/wikipedia) corpus in Hugging Face. We use the `20220301.en` version of Wikipedia, which can be loaded using the following Python command:

```python
from datasets import load_dataset

wiki_corpus = load_dataset("wikipedia", "20220301.en", split='train')
```

The `articleId` field in the dataset corresponds to the document `id` in the loaded dataset.

If you wish to locate the exact sentence from which an assertion was extracted, you can use the following Python commands:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("sentencizer")
# Keep only the rule-based sentencizer active; all other pipes are disabled.
nlp.select_pipes(enable=["sentencizer"])

# articleId and lineId are taken from the assertion row of interest.
doc = wiki_corpus[articleId]['text']
doc = nlp(doc)
sents = list(doc.sents)
this_sent = sents[lineId].text
print(this_sent)
```

Note that you would first need to run `python -m spacy download en_core_web_sm` before running the above script.
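
The lookup above indexes `wiki_corpus` by row position. If `articleId` has to be matched against the value of the `id` column rather than the row position, one possible approach is to build a small lookup table first. This is only a sketch: the helper names `build_id_index` and `lookup_row` are hypothetical, and it assumes the corpus ids can be converted with `int()`.

```python
def build_id_index(ids):
    """Map each document id (converted to int) to its row position.

    `ids` is any iterable of id values, for example wiki_corpus["id"].
    """
    return {int(doc_id): row for row, doc_id in enumerate(ids)}


def lookup_row(id_index, article_id):
    """Return the row position for article_id, or None if it is absent."""
    return id_index.get(int(article_id))


# Toy stand-in for the corpus id column (made-up values).
toy_ids = ["12", "25", "39"]
id_index = build_id_index(toy_ids)
print(lookup_row(id_index, 25))
```

With the real corpus, the returned position would replace `articleId` in `wiki_corpus[...]`.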

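Provider:

Teddy487

Summary of original information

WikiAssertions dataset overview

Dataset information

Feature columns

articleId: document ID, type int64
lineId: sentence ID, type int64
factId: assertion ID, type int64
text: natural language text form of the assertion, type string
subj: subject of the assertion, type string
pred: predicate (main verb) of the assertion, type string
auxi: auxiliary element of the assertion (negation, modal verbs, etc.), type sequence of string
prep1: preposition for object 1, type string
obj1: object 1, type string
prep2: preposition for object 2, type string
obj2: object 2, type string
prep3: preposition for object 3, type string
obj3: object 3, type string
prep4: preposition for object 4, type string
obj4: object 4, type string
prep5: preposition for object 5, type string
obj5: object 5, type string

Data splits

train: training split with 439,305,160 examples, 86,532,750,060 bytes in total

Dataset size

Download size: 31,248,975,339 bytes
Dataset size: 86,532,750,060 bytes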
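
To put these figures in perspective, the short calculation below derives the average size per assertion and the ratio of download size to dataset size. It uses only the three numbers listed above; the constant names are illustrative.

```python
NUM_EXAMPLES = 439_305_160
DATASET_BYTES = 86_532_750_060
DOWNLOAD_BYTES = 31_248_975_339

# Average size of one assertion row in the loaded dataset.
bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that actually has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"{bytes_per_example:.1f} bytes per example")
print(f"download is {download_ratio:.1%} of the dataset size")
print(f"{DOWNLOAD_BYTES / 1e9:.1f} GB download, {DATASET_BYTES / 1e9:.1f} GB dataset")
```

This works out to roughly 197 bytes per assertion, with the download at about 36% of the dataset size.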

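Configuration

default: the default configuration; the training data files are located at data/train-*

Data format

Each entry is an assertion from Wikipedia. An assertion is a multi-valent relation tuple of the following format:

<subj> ,, (<auxi> ###) <predicate> ,, (<prep1> ###) <obj1>, (<prep2> ###) <obj2>, ...

An assertion includes a subject, a predicate (essentially a verb), an optional auxiliary element (negation, modal verb, etc.), and a number of objects (0, 1, 2, etc.). Each object may come with an optional preposition (e.g. on, with, for).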
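
As an illustration of the template above, the sketch below renders individual fields into that layout. It is not the authors' serializer: the function name `format_assertion`, the space-joining of multiple auxiliaries, and the exact whitespace are assumptions made for the example.

```python
def format_assertion(subj, pred, auxi=(), objects=()):
    """Render fields in the `<subj> ,, (<auxi> ###) <predicate> ,, ...` layout.

    `objects` is a sequence of (prep, obj) pairs; prep may be empty.
    Joining several auxiliaries with a space is an assumption.
    """
    aux = " ".join(a for a in auxi if a)
    pred_part = f"{aux} ### {pred}" if aux else pred
    obj_parts = [f"{prep} ### {obj}" if prep else obj for prep, obj in objects]
    return " ,, ".join([subj, pred_part, ", ".join(obj_parts)])


# The example sentence about President Bush, split into assumed slots.
print(
    format_assertion(
        "President Bush",
        "made",
        objects=[
            ("", 'a final "take-it-or-leave-it" offer'),
            ("on", "the minimum wage"),
            ("", "earlier this year"),
        ],
    )
)
```

Optional parts simply disappear when empty, which mirrors the parentheses in the template.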

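Metadata columns

articleId: document ID, corresponding to the id column in the Wikipedia corpus.
lineId: sentence ID within the document, as segmented by the spaCy sentencizer.
factId: assertion ID within the sentence.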
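
Because the three metadata columns nest (document, then sentence, then assertion), all assertions from the same sentence share an (articleId, lineId) pair. The sketch below groups rows on that pair; the helper name `group_by_sentence` and the toy rows are made up for illustration.

```python
from collections import defaultdict


def group_by_sentence(rows):
    """Collect assertion texts per (articleId, lineId), ordered by factId."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["articleId"], row["lineId"])
        grouped[key].append((row["factId"], row["text"]))
    # Sort each group by factId and keep only the text.
    return {key: [text for _, text in sorted(facts)] for key, facts in grouped.items()}


# Toy rows with invented values, shaped like the dataset's columns.
toy_rows = [
    {"articleId": 12, "lineId": 0, "factId": 1, "text": "second fact"},
    {"articleId": 12, "lineId": 0, "factId": 0, "text": "first fact"},
    {"articleId": 12, "lineId": 3, "factId": 0, "text": "another sentence"},
]
print(group_by_sentence(toy_rows))
```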

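Content columns

text: natural language text form of the assertion.
subj: subject of the assertion.
pred: predicate (main verb) of the assertion.
(optional) auxi: auxiliary element of the assertion (negation, modal verbs, etc.).
(optional) prep1: preposition for object 1 (can be empty).
(optional) obj1: object 1 (typically the direct object for transitive verbs; can be empty for intransitive verbs).
(optional) prep2: preposition for object 2 (can be empty).
(optional) obj2: object 2.
(optional) prep3: preposition for object 3 (can be empty).
(optional) obj3: object 3.
(optional) prep4: preposition for object 4 (can be empty).
(optional) obj4: object 4.
(optional) prep5: preposition for object 5 (can be empty).
(optional) obj5: object 5.

Note that each assertion keeps at most 5 object slots. If an assertion involves more arguments, the overflowing arguments are ignored.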
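
The five prep/obj column pairs can be folded back into a single list when working with a row. The following is a minimal sketch, assuming unused slots are stored as empty strings or `None`; the helper name `object_slots` is hypothetical.

```python
MAX_SLOTS = 5  # the dataset keeps at most five object slots per assertion


def object_slots(row):
    """Return [(prep, obj), ...] for the filled object slots of a row."""
    slots = []
    for i in range(1, MAX_SLOTS + 1):
        obj = row.get(f"obj{i}") or ""
        prep = row.get(f"prep{i}") or ""
        if obj:  # skip slots without an object
            slots.append((prep, obj))
    return slots


# Toy row with invented values; only two slots are filled.
toy_row = {
    "prep1": "", "obj1": "an offer",
    "prep2": "on", "obj2": "the minimum wage",
    "prep3": "", "obj3": "",
}
print(object_slots(toy_row))
```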

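When the predicate is a light verb, the light verb itself does not carry enough meaning to disambiguate the eventuality. In that case, we merge the first object (obj1) into the pred field. We only do this when the first object does not come with a preposition (i.e. prep1 is empty); otherwise, we treat it as an anomaly and discard that assertion.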
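
The rule can be read as a three-way decision, sketched below. This is an illustration rather than the authors' implementation: the `LIGHT_VERBS` set, the exact-match test on the lowercased predicate, the space-joined merge, and the handling of a light verb without any object are all assumptions.

```python
# Illustrative list only; the source does not say which verbs count as light.
LIGHT_VERBS = {"make", "take", "have", "give", "do"}


def merge_light_verb(pred, prep1, obj1):
    """Apply the light-verb rule to one assertion.

    Returns the (possibly merged) predicate, or None when the assertion
    would be discarded.
    """
    if pred.lower() not in LIGHT_VERBS or not obj1:
        return pred  # not a light verb, or nothing to merge: left unchanged
    if prep1:
        return None  # light verb whose first object has a preposition: discard
    return f"{pred} {obj1}"  # merge obj1 into the predicate


print(merge_light_verb("take", "", "a walk"))
print(merge_light_verb("take", "on", "a challenge"))
print(merge_light_verb("eat", "", "an apple"))
```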

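Mapping to the original Wikipedia text

The original text for each assertion in the dataset can be found in the Wikipedia corpus on Hugging Face. We use the 20220301.en version of Wikipedia, which can be loaded with the following Python command:

    from datasets import load_dataset

    wiki_corpus = load_dataset("wikipedia", "20220301.en", split="train")

The articleId field in the dataset corresponds to the document id in the loaded dataset.

If you wish to locate the exact sentence from which an assertion was extracted, you can use the following Python commands:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("sentencizer")
    nlp.select_pipes(enable=["sentencizer"])

    doc = wiki_corpus[articleId]["text"]
    doc = nlp(doc)
    sents = list(doc.sents)

    this_sent = sents[lineId].text

    print(this_sent)

Note that you need to run python -m spacy download en_core_web_sm before running the above script.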
