knowitall/ollie
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/knowitall/ollie
---
annotations_creators:
- machine-generated
language_creators:
- crowdsourced
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
- 1M<n<10M
source_datasets:
- original
task_categories: []
task_ids: []
pretty_name: Ollie
tags:
- relation-extraction
- text-to-structured
dataset_info:
- config_name: ollie_lemmagrep
  features:
  - name: arg1
    dtype: string
  - name: arg2
    dtype: string
  - name: rel
    dtype: string
  - name: search_query
    dtype: string
  - name: sentence
    dtype: string
  - name: words
    dtype: string
  - name: pos
    dtype: string
  - name: chunk
    dtype: string
  - name: sentence_cnt
    dtype: string
  splits:
  - name: train
    num_bytes: 12324648919
    num_examples: 18674630
  download_size: 1789363108
  dataset_size: 12324648919
- config_name: ollie_patterned
  features:
  - name: rel
    dtype: string
  - name: arg1
    dtype: string
  - name: arg2
    dtype: string
  - name: slot0
    dtype: string
  - name: search_query
    dtype: string
  - name: pattern
    dtype: string
  - name: sentence
    dtype: string
  - name: parse
    dtype: string
  splits:
  - name: train
    num_bytes: 2930309084
    num_examples: 3048961
  download_size: 387514061
  dataset_size: 2930309084
config_names:
- ollie_lemmagrep
- ollie_patterned
---
# Dataset Card for Ollie
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Ollie](https://knowitall.github.io/ollie/)
- **Repository:** [GitHub](https://github.com/knowitall/ollie)
- **Paper:** [ACL Anthology](https://www.aclweb.org/anthology/D12-1048/)
### Dataset Summary
The Ollie dataset provides two configs containing the data
used to train the Ollie information extraction algorithm:
ollie_lemmagrep (about 18M sentences) and ollie_patterned (about 3M sentences).
This data is for academic use only. From the authors:
Ollie is a program that automatically identifies and extracts binary
relationships from English sentences. Ollie is designed for Web-scale
information extraction, where target relations are not specified in
advance.
Ollie is our second-generation information extraction system. Whereas
ReVerb operates on flat sequences of tokens, Ollie works with the
tree-like (graph with only small cycles) representation using
Stanford's compression of the dependencies. This allows Ollie to
capture expression that ReVerb misses, such as long-range relations.
Ollie also captures context that modifies a binary relation. Presently
Ollie handles attribution (He said/she believes) and enabling
conditions (if X then).
More information is available at the Ollie homepage:
https://knowitall.github.io/ollie/
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English (en)
## Dataset Structure
### Data Instances
There are two configurations of the dataset. ollie_lemmagrep contains
18M sentences retrieved by web searches for a subset of the ReVerb
relationships (110,000 relationships). ollie_patterned contains 3M
sentences and is a subset of ollie_lemmagrep, derived from patterns
as described in the Ollie paper.
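As a rough sketch of how one of the two configs might be inspected without downloading it in full, the snippet below streams the first few records. It assumes the Hugging Face `datasets` library is installed and that the repository id `knowitall/ollie` can be loaded directly; whether the call succeeds depends on the installed `datasets` version and on how the repository is hosted. The helper name `first_records` is hypothetical.

```python
from itertools import islice
from typing import Any

# Config names as listed in the card's metadata.
CONFIGS = ("ollie_lemmagrep", "ollie_patterned")


def first_records(config: str, limit: int = 3) -> list[dict[str, Any]]:
    """Return the first `limit` records of one config using streaming."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    # Assumption: the repository id "knowitall/ollie" is loadable as-is.
    dataset = load_dataset("knowitall/ollie", config, split="train", streaming=True)
    return list(islice(dataset, limit))
```

Calling `first_records("ollie_patterned")` would then return up to three dictionaries shaped like the examples in this section.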
An example of an ollie_lemmagrep record:
```
{'arg1': 'adobe reader',
'arg2': 'pdf',
'chunk': 'B-NP I-NP I-NP I-NP B-PP B-NP I-NP B-VP B-PP B-NP I-NP O B-VP B-NP I-NP I-NP I-NP B-VP I-VP I-VP O',
'pos': 'JJ NNS CC NNS IN PRP$ NN VBP IN NNP NN CC VB DT NNP NNP NNP TO VB VBN .',
'rel': 'be require to view',
'search_query': 'require reader pdf adobe view',
'sentence': 'Many documents and reports on our site are in PDF format and require the Adobe Acrobat Reader to be viewed .',
'sentence_cnt': '9',
'words': 'many,document,and,report,on,our,site,be,in,pdf,format,and,require,the,adobe,acrobat,reader,to,be,view'}
```
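The `sentence`, `pos`, and `chunk` fields of the record above each hold 21 whitespace-separated items, so they can be aligned token by token. The following sketch groups the BIO-style chunk tags into phrases. It assumes that whitespace splitting recovers the original tokenization, which holds for this example but is not guaranteed by the card; `group_chunks` is a hypothetical helper name.

```python
record = {
    "sentence": "Many documents and reports on our site are in PDF format "
                "and require the Adobe Acrobat Reader to be viewed .",
    "chunk": "B-NP I-NP I-NP I-NP B-PP B-NP I-NP B-VP B-PP B-NP I-NP O "
             "B-VP B-NP I-NP I-NP I-NP B-VP I-VP I-VP O",
}


def group_chunks(record: dict[str, str]) -> list[tuple[str, str]]:
    """Group tokens into (label, phrase) pairs based on the chunk tags."""
    tokens = record["sentence"].split()
    tags = record["chunk"].split()
    if len(tokens) != len(tags):
        raise ValueError("sentence and chunk fields are not aligned")
    phrases: list[tuple[str, list[str]]] = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and phrases and phrases[-1][0] == label:
            # "I-X" continues the phrase opened by the preceding "B-X".
            phrases[-1][1].append(token)
        else:
            # "B-X" opens a new phrase; "O" tokens stand alone with label "O".
            phrases.append((label or "O", [token]))
    return [(label, " ".join(words)) for label, words in phrases]


phrases = group_chunks(record)
for label, phrase in phrases:
    print(f"{label:>2}  {phrase}")
```

Note that the `words` field of the same record holds only 20 lemmas (the final period is absent), so it should not be zipped against the other three fields without checking lengths first.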
An example of an ollie_patterned record:
```
{'arg1': 'english',
'arg2': 'internet',
'parse': '(in_IN_6), advmod(important_JJ_4, most_RBS_3); nsubj(language_NN_5, English_NNP_0); cop(language_NN_5, being_VBG_1); det(language_NN_5, the_DT_2); amod(language_NN_5, important_JJ_4); prep_in(language_NN_5, era_NN_9); punct(language_NN_5, ,_,_10); conj(language_NN_5, education_NN_12); det(era_NN_9, the_DT_7); nn(era_NN_9, Internet_NNP_8); amod(education_NN_12, English_JJ_11); nsubjpass(enriched_VBN_15, language_NN_5); aux(enriched_VBN_15, should_MD_13); auxpass(enriched_VBN_15, be_VB_14); punct(enriched_VBN_15, ._._16)',
'pattern': '{arg1} <nsubj< {rel:NN} >prep_in> {slot0:NN} >nn> {arg2}',
'rel': 'be language of',
'search_query': 'english language internet',
'sentence': 'English being the most important language in the Internet era , English education should be enriched .',
'slot0': 'era'}
```
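The `parse` value in the record above is a flat string of `label(governor, dependent)` entries, where each node appears to be written as `text_TAG_index`. A minimal sketch of turning that string into structured triples follows. The format is inferred from this single example rather than from a specification, and the names `Token`, `Dependency`, and `parse_dependencies` are hypothetical.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    tag: str
    index: int


class Dependency(NamedTuple):
    label: str
    governor: Token
    dependent: Token


# Matches label(governor, dependent); nodes contain no spaces or parentheses.
DEPENDENCY_RE = re.compile(r"([a-z_]+)\(([^()\s]+), ([^()\s]+)\)")


def parse_token(node: str) -> Token:
    # Split from the right so punctuation tokens such as ",_,_10" survive.
    text, tag, index = node.rsplit("_", 2)
    return Token(text, tag, int(index))


def parse_dependencies(parse: str) -> list[Dependency]:
    """Extract every well-formed dependency entry from a `parse` string."""
    return [
        Dependency(label, parse_token(governor), parse_token(dependent))
        for label, governor, dependent in DEPENDENCY_RE.findall(parse)
    ]


# Excerpt of the `parse` value from the ollie_patterned example record.
excerpt = (
    "nsubj(language_NN_5, English_NNP_0); prep_in(language_NN_5, era_NN_9); "
    "punct(language_NN_5, ,_,_10); nn(era_NN_9, Internet_NNP_8)"
)
dependencies = parse_dependencies(excerpt)
for dependency in dependencies:
    print(dependency.label, dependency.governor.text, "->", dependency.dependent.text)
```

Because the regular expression only accepts complete entries, a fragment such as the leading `(in_IN_6),` in the example record is skipped rather than raising an error.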
### Data Fields
For ollie_lemmagrep:
* rel: the relationship phrase/verb phrase. This may be empty, which represents the "be" relationship.
* arg1: the first argument in the relationship.
* arg2: the second argument in the relationship.
* chunk: a chunk tag for each token in the sentence (for example B-NP, I-NP, O), marking phrase boundaries.
* pos: the part-of-speech tag for each token in the sentence.
* sentence: the sentence.
* sentence_cnt: the number of copies of this sentence encountered.
* search_query: a combination of rel, arg1, and arg2.
* words: the lemmas of the words in the sentence, separated by commas.
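Two of these fields need light normalization before use: an empty `rel` stands for "be", and `sentence_cnt` is stored as a string. The sketch below converts a record into a typed tuple. It assumes `sentence_cnt` always holds a decimal integer, which the card does not state explicitly; `Extraction` and `to_extraction` are hypothetical names.

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    arg1: str
    rel: str
    arg2: str
    sentence_count: int


def to_extraction(record: dict[str, str]) -> Extraction:
    """Convert an ollie_lemmagrep record into a normalized extraction."""
    # An empty `rel` represents the plain "be" relationship.
    rel = record["rel"].strip() or "be"
    # Assumption: `sentence_cnt` is a decimal integer stored as a string.
    count = int(record["sentence_cnt"])
    return Extraction(record["arg1"], rel, record["arg2"], count)


example = {"arg1": "adobe reader", "arg2": "pdf",
           "rel": "be require to view", "sentence_cnt": "9"}
print(to_extraction(example))
```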
For ollie_patterned:
* rel: the relationship phrase/verb phrase.
* arg1: the first argument in the relationship.
* arg2: the second argument in the relationship.
* slot0: the third argument in the relationship, which might be empty.
* pattern: a parse pattern for the relationship.
* parse: a dependency parse for the sentence.
* search_query: a combination of rel, arg1, and arg2.
* sentence: the sentence.
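To illustrate how `pattern` and `parse` relate, the sketch below walks a pattern such as `{arg1} <nsubj< {rel:NN} >prep_in> {slot0:NN} >nn> {arg2}` over a handful of dependency edges. The reading of the notation is an assumption inferred from the example record, not taken from the Ollie code: `<label<` is taken to step from a dependent up to its governor, `>label>` from a governor down to a dependent, and `{name:TAG}` to require an exact part-of-speech tag. The function `match_pattern` is hypothetical.

```python
import re
from typing import Optional

Node = tuple[str, str, int]    # (text, POS tag, token index)
Edge = tuple[str, Node, Node]  # (label, governor, dependent)

SLOT_RE = re.compile(r"\{(\w+)(?::(\w+))?\}")
STEP_RE = re.compile(r"([<>])(\w+)\1")


def match_pattern(pattern: str, edges: list[Edge]) -> Optional[dict[str, Node]]:
    """Return slot bindings for the first path that satisfies the pattern."""
    parts = pattern.split()
    slots = [SLOT_RE.fullmatch(part).groups() for part in parts[0::2]]
    steps = [STEP_RE.fullmatch(part).groups() for part in parts[1::2]]

    def extend(position: int, current: Node, bound: dict[str, Node]):
        if position == len(steps):
            return bound
        direction, label = steps[position]
        name, tag = slots[position + 1]
        for edge_label, governor, dependent in edges:
            # Assumed semantics: "<" climbs to the governor, ">" descends.
            source, target = (dependent, governor) if direction == "<" else (governor, dependent)
            if edge_label == label and source == current and tag in (None, target[1]):
                result = extend(position + 1, target, {**bound, name: target})
                if result is not None:
                    return result
        return None

    first_name, first_tag = slots[0]
    nodes = sorted({n for _, gov, dep in edges for n in (gov, dep)}, key=lambda n: n[2])
    for start in nodes:
        if first_tag in (None, start[1]):
            result = extend(0, start, {first_name: start})
            if result is not None:
                return result
    return None


# A few edges taken from the example record's `parse` field.
edges: list[Edge] = [
    ("nsubj", ("language", "NN", 5), ("English", "NNP", 0)),
    ("prep_in", ("language", "NN", 5), ("era", "NN", 9)),
    ("nn", ("era", "NN", 9), ("Internet", "NNP", 8)),
    ("amod", ("education", "NN", 12), ("English", "JJ", 11)),
]
pattern = "{arg1} <nsubj< {rel:NN} >prep_in> {slot0:NN} >nn> {arg2}"
bindings = match_pattern(pattern, edges)
print({name: node[0] for name, node in bindings.items()})
```

Under these assumptions the bindings line up with the record's lowercased `arg1` ("english"), `slot0` ("era"), and `arg2` ("internet"), while the bound `rel` node "language" is the head word of the recorded relation "be language of".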
### Data Splits
The data is not divided into separate splits; each config provides a single train split.
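Since no validation or test split is provided, anyone who needs held-out data has to create it. One possible approach, which is not part of the dataset itself, is to hash each sentence into a bucket so that the assignment is deterministic and repeated copies of a sentence always land on the same side. The function name `assign_split` and the 5% default are illustrative choices.

```python
import hashlib


def assign_split(sentence: str, validation_percent: int = 5) -> str:
    """Deterministically assign a record to a split by hashing its sentence."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < validation_percent else "train"


sentence = "English being the most important language in the Internet era , English education should be enriched ."
print(assign_split(sentence))
```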
## Dataset Creation
### Curation Rationale
This dataset was created as part of research on open information extraction.
### Source Data
#### Initial Data Collection and Normalization
See the research paper on Ollie. The training data is extracted from web pages (ClueWeb09).
#### Who are the source language producers?
The Ollie authors at the University of Washington, together with data from ClueWeb09 and the open web.
### Annotations
#### Annotation process
The various parsers and code from the Ollie algorithm.
#### Who are the annotators?
Machine annotated.
### Personal and Sensitive Information
Unknown, but the data likely contains names of famous individuals.
## Considerations for Using the Data
### Social Impact of Dataset
The goal of this work is to help machines learn to extract information from open domains.
### Discussion of Biases
Since the data is gathered from the web, there is likely to be biased text and relationships.
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
The authors of Ollie at The University of Washington
### Licensing Information
The University of Washington academic license: https://raw.githubusercontent.com/knowitall/ollie/master/LICENSE
### Citation Information
```
@inproceedings{ollie-emnlp12,
author = {Mausam and Michael Schmitz and Robert Bart and Stephen Soderland and Oren Etzioni},
title = {Open Language Learning for Information Extraction},
booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CONLL)},
year = {2012}
}
```
### Contributions
Thanks to [@ontocord](https://github.com/ontocord) for adding this dataset.