knowitall/ollie

Name: knowitall/ollie
Creator: knowitall
Published: 2024-01-18 11:11:13
License: no description provided

Hugging Face · updated 2024-01-18 · indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/knowitall/ollie

Description:

---
annotations_creators:
- machine-generated
language_creators:
- crowdsourced
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
- 1M<n<10M
source_datasets:
- original
task_categories: []
task_ids: []
pretty_name: Ollie
tags:
- relation-extraction
- text-to-structured
dataset_info:
- config_name: ollie_lemmagrep
  features:
  - name: arg1
    dtype: string
  - name: arg2
    dtype: string
  - name: rel
    dtype: string
  - name: search_query
    dtype: string
  - name: sentence
    dtype: string
  - name: words
    dtype: string
  - name: pos
    dtype: string
  - name: chunk
    dtype: string
  - name: sentence_cnt
    dtype: string
  splits:
  - name: train
    num_bytes: 12324648919
    num_examples: 18674630
  download_size: 1789363108
  dataset_size: 12324648919
- config_name: ollie_patterned
  features:
  - name: rel
    dtype: string
  - name: arg1
    dtype: string
  - name: arg2
    dtype: string
  - name: slot0
    dtype: string
  - name: search_query
    dtype: string
  - name: pattern
    dtype: string
  - name: sentence
    dtype: string
  - name: parse
    dtype: string
  splits:
  - name: train
    num_bytes: 2930309084
    num_examples: 3048961
  download_size: 387514061
  dataset_size: 2930309084
config_names:
- ollie_lemmagrep
- ollie_patterned
---

# Dataset Card for Ollie

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Ollie](https://knowitall.github.io/ollie/)
- **Repository:** [GitHub](https://github.com/knowitall/ollie)
- **Paper:** [ACL Anthology](https://www.aclweb.org/anthology/D12-1048/)

### Dataset Summary

The Ollie dataset provides two configurations of the data used to train the Ollie information extraction algorithm, with 18M and 3M sentences respectively. This data is for academic use only.

From the authors:

Ollie is a program that automatically identifies and extracts binary relationships from English sentences. Ollie is designed for Web-scale information extraction, where target relations are not specified in advance.

Ollie is our second-generation information extraction system. Whereas ReVerb operates on flat sequences of tokens, Ollie works with the tree-like (graph with only small cycles) representation using Stanford's compression of the dependencies. This allows Ollie to capture expressions that ReVerb misses, such as long-range relations.

Ollie also captures context that modifies a binary relation. Presently Ollie handles attribution (He said/she believes) and enabling conditions (if X then).

More information is available at the Ollie homepage: https://knowitall.github.io/ollie/

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

en

## Dataset Structure

### Data Instances

There are two configurations for the dataset: `ollie_lemmagrep`, which contains 18M sentences from web searches for a subset of the ReVerb relationships (110,000 relationships), and `ollie_patterned`, which contains 3M sentences and is a subset of `ollie_lemmagrep` derived from patterns according to the Ollie paper.

An example of an `ollie_lemmagrep` record:

```
{'arg1': 'adobe reader',
 'arg2': 'pdf',
 'chunk': 'B-NP I-NP I-NP I-NP B-PP B-NP I-NP B-VP B-PP B-NP I-NP O B-VP B-NP I-NP I-NP I-NP B-VP I-VP I-VP O',
 'pos': 'JJ NNS CC NNS IN PRP$ NN VBP IN NNP NN CC VB DT NNP NNP NNP TO VB VBN .',
 'rel': 'be require to view',
 'search_query': 'require reader pdf adobe view',
 'sentence': 'Many documents and reports on our site are in PDF format and require the Adobe Acrobat Reader to be viewed .',
 'sentence_cnt': '9',
 'words': 'many,document,and,report,on,our,site,be,in,pdf,format,and,require,the,adobe,acrobat,reader,to,be,view'}
```

An example of an `ollie_patterned` record:

```
{'arg1': 'english',
 'arg2': 'internet',
 'parse': '(in_IN_6), advmod(important_JJ_4, most_RBS_3); nsubj(language_NN_5, English_NNP_0); cop(language_NN_5, being_VBG_1); det(language_NN_5, the_DT_2); amod(language_NN_5, important_JJ_4); prep_in(language_NN_5, era_NN_9); punct(language_NN_5, ,_,_10); conj(language_NN_5, education_NN_12); det(era_NN_9, the_DT_7); nn(era_NN_9, Internet_NNP_8); amod(education_NN_12, English_JJ_11); nsubjpass(enriched_VBN_15, language_NN_5); aux(enriched_VBN_15, should_MD_13); auxpass(enriched_VBN_15, be_VB_14); punct(enriched_VBN_15, ._._16)',
 'pattern': '{arg1} <nsubj< {rel:NN} >prep_in> {slot0:NN} >nn> {arg2}',
 'rel': 'be language of',
 'search_query': 'english language internet',
 'sentence': 'English being the most important language in the Internet era , English education should be enriched .',
 'slot0': 'era'}
```

### Data Fields

For `ollie_lemmagrep`:

* rel: the relationship phrase/verb phrase. This may be empty, which represents the "be" relationship.
* arg1: the first argument in the relationship.
* arg2: the second argument in the relationship.
* chunk: a chunk tag for each token in the sentence (phrase-chunk labels such as `B-NP`, `I-NP`, `O`).
* pos: the part-of-speech tag for each token in the sentence.
* sentence: the sentence.
* sentence_cnt: the number of copies of this sentence encountered.
* search_query: a combination of rel, arg1, arg2.
* words: the lemmas of the words in the sentence, separated by commas.

For `ollie_patterned`:

* rel: the relationship phrase/verb phrase.
* arg1: the first argument in the relationship.
* arg2: the second argument in the relationship.
* slot0: the third argument in the relationship, which might be empty.
* pattern: a parse pattern for the relationship.
* parse: a dependency parse for the sentence.
* search_query: a combination of rel, arg1, arg2.
* sentence: the sentence.

### Data Splits

There are no predefined splits; each configuration is published as a single `train` split.

## Dataset Creation

### Curation Rationale

This dataset was created as part of research on open information extraction.

### Source Data

#### Initial Data Collection and Normalization

See the research paper on Ollie. The training data is extracted from web pages (ClueWeb09).

#### Who are the source language producers?

The Ollie authors at the University of Washington, plus data from ClueWeb09 and the open web.

### Annotations

#### Annotation process

The various parsers and code from the Ollie algorithm.

#### Who are the annotators?

Machine annotated.

### Personal and Sensitive Information

Unknown, but the data likely contains the names of famous individuals.

## Considerations for Using the Data

### Social Impact of Dataset

The goal of the work is to help machines learn to extract information from open domains.

### Discussion of Biases

Since the data is gathered from the web, it is likely to contain biased text and relationships.

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The authors of Ollie at the University of Washington.

### Licensing Information

The University of Washington academic license: https://raw.githubusercontent.com/knowitall/ollie/master/LICENSE

### Citation Information

```
@inproceedings{ollie-emnlp12,
  author = {Mausam and Michael Schmitz and Robert Bart and Stephen Soderland and Oren Etzioni},
  title = {Open Language Learning for Information Extraction},
  booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CONLL)},
  year = {2012}
}
```

### Contributions

Thanks to [@ontocord](https://github.com/ontocord) for adding this dataset.
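
The `sentence`, `pos`, and `chunk` fields of an `ollie_lemmagrep` record appear to be whitespace-tokenized and aligned one-to-one in the sample record shown in this card. The sketch below assumes that alignment holds in general (it is only verified here against the single sample) and groups the chunk tags into phrases. The helper name `bio_to_phrases` is hypothetical and not part of the dataset or of Ollie.

```python
# Fields copied from the sample ollie_lemmagrep record in this card.
record = {
    "sentence": "Many documents and reports on our site are in PDF format "
                "and require the Adobe Acrobat Reader to be viewed .",
    "pos": "JJ NNS CC NNS IN PRP$ NN VBP IN NNP NN CC VB DT NNP NNP NNP TO VB VBN .",
    "chunk": "B-NP I-NP I-NP I-NP B-PP B-NP I-NP B-VP B-PP B-NP I-NP O "
             "B-VP B-NP I-NP I-NP I-NP B-VP I-VP I-VP O",
}


def bio_to_phrases(rec):
    """Group tokens into (label, phrase) pairs using the B-/I-/O chunk tags.

    Assumes `sentence` and `chunk` split on whitespace into equally long lists.
    """
    tokens = rec["sentence"].split()
    tags = rec["chunk"].split()
    if len(tokens) != len(tags):
        raise ValueError("sentence and chunk are not token-aligned")

    phrases = []  # list of (label, [tokens])
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Tokens outside any chunk are kept as single-token entries.
            phrases.append(("O", [token]))
        elif tag.startswith("I-") and phrases and phrases[-1][0] == tag[2:]:
            # Continuation of the phrase that is currently open.
            phrases[-1][1].append(token)
        else:
            # "B-XX", or a stray "I-XX" that has nothing to continue.
            phrases.append((tag[2:], [token]))
    return [(label, " ".join(words)) for label, words in phrases]


for label, phrase in bio_to_phrases(record):
    print(f"{label:>2}  {phrase}")
```

Note that `words` is comma-separated rather than space-separated and, in the sample, omits the final period, so it has one fewer entry than `sentence`; it is left out of the alignment here for that reason.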

Provider:

knowitall

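Original information summary

Dataset overview

Name: Ollie
Language: English (en)
License: other
Multilinguality: monolingual
Size:
- ollie_lemmagrep: 10M<n<100M
- ollie_patterned: 1M<n<10M
Source: original
Task categories: none
Tags:
- relation-extraction
- text-to-structured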

Dataset structure

ollie_lemmagrep

Features:
- arg1: string
- arg2: string
- rel: string
- search_query: string
- sentence: string
- words: string
- pos: string
- chunk: string
- sentence_cnt: string
Splits:
- train:
  - Size in bytes: 12324648919
  - Number of examples: 18674630
  - Download size (bytes): 1789363108
  - Dataset size (bytes): 12324648919
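
As a rough sanity check before downloading, the split statistics reported in the card metadata can be turned into per-example and compression figures. The sketch below uses only the numbers listed in this card for the two configurations; the `summarize` helper and the derived metric names are illustrative assumptions, not fields of the dataset.

```python
# Split statistics as reported in the card metadata (all sizes in bytes).
CONFIGS = {
    "ollie_lemmagrep": {
        "num_bytes": 12_324_648_919,
        "num_examples": 18_674_630,
        "download_size": 1_789_363_108,
    },
    "ollie_patterned": {
        "num_bytes": 2_930_309_084,
        "num_examples": 3_048_961,
        "download_size": 387_514_061,
    },
}


def summarize(stats):
    """Derive average record size and download compression ratio."""
    return {
        "bytes_per_example": stats["num_bytes"] / stats["num_examples"],
        "compression_ratio": stats["num_bytes"] / stats["download_size"],
        "download_gb": stats["download_size"] / 1e9,
        "dataset_gb": stats["num_bytes"] / 1e9,
    }


for name, stats in CONFIGS.items():
    s = summarize(stats)
    print(
        f"{name}: {s['bytes_per_example']:.0f} B/example, "
        f"{s['download_gb']:.2f} GB download, {s['dataset_gb']:.2f} GB on disk, "
        f"ratio {s['compression_ratio']:.1f}x"
    )
```

By these figures, an `ollie_lemmagrep` record averages roughly 660 bytes and an `ollie_patterned` record roughly 960 bytes, which is consistent with the long `parse` string carried by the latter.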

ollie_patterned

Features:
- rel: string
- arg1: string
- arg2: string
- slot0: string
- search_query: string
- pattern: string
- sentence: string
- parse: string
Splits:
- train:
  - Size in bytes: 2930309084
  - Number of examples: 3048961
  - Download size (bytes): 387514061
  - Dataset size (bytes): 2930309084
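
The `parse` and `pattern` features are plain strings, so they need to be parsed before use. The sketch below is based solely on the sample `ollie_patterned` record in this card: it reads `parse` as `label(head_POS_index, dependent_POS_index)` entries separated by `; `, and it assumes that `A >label> B` in `pattern` means `label(A, B)` while `A <label< B` means `label(B, A)`. That reading of the arrows is inferred from the one sample, not taken from Ollie's documentation, and the function names are hypothetical.

```python
import re

# `parse` and `pattern` copied from the sample ollie_patterned record in this card.
parse = (
    "(in_IN_6), advmod(important_JJ_4, most_RBS_3); "
    "nsubj(language_NN_5, English_NNP_0); cop(language_NN_5, being_VBG_1); "
    "det(language_NN_5, the_DT_2); amod(language_NN_5, important_JJ_4); "
    "prep_in(language_NN_5, era_NN_9); punct(language_NN_5, ,_,_10); "
    "conj(language_NN_5, education_NN_12); det(era_NN_9, the_DT_7); "
    "nn(era_NN_9, Internet_NNP_8); amod(education_NN_12, English_JJ_11); "
    "nsubjpass(enriched_VBN_15, language_NN_5); aux(enriched_VBN_15, should_MD_13); "
    "auxpass(enriched_VBN_15, be_VB_14); punct(enriched_VBN_15, ._._16)"
)
pattern = "{arg1} <nsubj< {rel:NN} >prep_in> {slot0:NN} >nn> {arg2}"

# label(word_POS_idx, word_POS_idx); the leading "(in_IN_6)," fragment is skipped
# because it has no label in front of the parenthesis.
EDGE_RE = re.compile(r"(\w+)\((\S+?)_([^_\s]+)_(\d+), (\S+?)_([^_\s]+)_(\d+)\)")


def parse_edges(parse_str):
    """Return (label, (head_word, head_idx), (dep_word, dep_idx)) triples."""
    edges = []
    for m in EDGE_RE.finditer(parse_str):
        label, head, _hpos, hidx, dep, _dpos, didx = m.groups()
        edges.append((label, (head, int(hidx)), (dep, int(didx))))
    return edges


def match_pattern(pattern_str, edges):
    """Bind each slot of the pattern to a word by walking the dependency edges."""
    slots = re.findall(r"\{(\w+)(?::\w+)?\}", pattern_str)  # slot names, in order
    steps = re.findall(r"([<>])(\w+)\1", pattern_str)       # (direction, label)
    nodes = sorted({n for _, h, d in edges for n in (h, d)}, key=lambda n: n[1])
    paths = [[n] for n in nodes]
    for direction, label in steps:
        extended = []
        for path in paths:
            for edge_label, head, dep in edges:
                if edge_label != label:
                    continue
                # ">" walks head -> dependent, "<" walks dependent -> head (assumed).
                src, dst = (head, dep) if direction == ">" else (dep, head)
                if src == path[-1]:
                    extended.append(path + [dst])
        paths = extended
    return [dict(zip(slots, (word for word, _ in path))) for path in paths]


edges = parse_edges(parse)
print(match_pattern(pattern, edges))
```

On the sample record the walk binds `arg1`, `slot0`, and `arg2` to words that, once lowercased, agree with the record's own `arg1`, `slot0`, and `arg2` values. The `rel` slot binds only the head noun `language`, whereas the record's `rel` field holds the normalized phrase `be language of`.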

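Dataset creation

Annotation creators: machine-generated
Language creators: crowdsourced
Source data:
- Initial data collection and normalization: training data extracted from web pages (ClueWeb09)
- Source language producers: the Ollie authors at the University of Washington, plus data from ClueWeb09 and the open web
Annotations:
- Annotation process: the various parsers and code from the Ollie algorithm
- Annotators: machine-annotated

Considerations

Personal and sensitive information: unknown, but likely includes the names of famous individuals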

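Collected summary

Dataset introduction

Background and challenges

Background overview

The Ollie dataset comes in two configurations, of 18M and 3M sentences, used to train an information extraction algorithm. It supports English relation extraction, is drawn from web pages and machine-annotated, and is intended for academic research.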
