
yanaiela/tne

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/yanaiela/tne
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-retrieval
task_ids:
- document-retrieval
pretty_name: Text-based NP Enrichment
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: tokens
    sequence: string
  - name: nps
    list:
    - name: text
      dtype: string
    - name: first_char
      dtype: int32
    - name: last_char
      dtype: int32
    - name: first_token
      dtype: int32
    - name: last_token
      dtype: int32
    - name: id
      dtype: string
  - name: np_relations
    list:
    - name: anchor
      dtype: string
    - name: complement
      dtype: string
    - name: preposition
      dtype:
        class_label:
          names:
            '0': about
            '1': for
            '2': with
            '3': from
            '4': among
            '5': by
            '6': 'on'
            '7': at
            '8': during
            '9': of
            '10': member(s) of
            '11': in
            '12': after
            '13': under
            '14': to
            '15': into
            '16': before
            '17': near
            '18': outside
            '19': around
            '20': between
            '21': against
            '22': over
            '23': inside
    - name: complement_coref_cluster_id
      dtype: string
  - name: coref
    list:
    - name: id
      dtype: string
    - name: members
      sequence: string
    - name: np_type
      dtype:
        class_label:
          names:
            '0': standard
            '1': time/date/measurement
            '2': idiomatic
  - name: metadata
    struct:
    - name: annotators
      struct:
      - name: coref_worker
        dtype: int32
      - name: consolidator_worker
        dtype: int32
      - name: np-relations_worker
        sequence: int32
    - name: url
      dtype: string
    - name: source
      dtype: string
  splits:
  - name: train
    num_bytes: 41308170
    num_examples: 3988
  - name: validation
    num_bytes: 5495419
    num_examples: 500
  - name: test
    num_bytes: 2203716
    num_examples: 500
  - name: test_ood
    num_bytes: 2249352
    num_examples: 509
  download_size: 14194578
  dataset_size: 51256657
---

# Dataset Card for Text-based NP Enrichment

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://yanaiela.github.io/TNE/
- **Repository:** https://github.com/yanaiela/TNE
- **Paper:** https://arxiv.org/abs/2109.12085
- **Leaderboard:** [TNE OOD](https://leaderboard.allenai.org/tne-ood/submissions/public), [TNE](https://leaderboard.allenai.org/tne/submissions/public)
- **Point of Contact:** [Yanai Elazar](mailto:yanaiela@gmail.com)

### Dataset Summary

Text-based NP Enrichment (TNE) is a natural language understanding (NLU) task that focuses on relations between noun phrases (NPs) that can be mediated via prepositions. The dataset contains 5,497 documents, annotated exhaustively with all possible links between the NPs in each document.

The main data comes from WikiNews, which is used for train/dev/test. We also collected an additional set of 509 documents to serve as out-of-distribution (OOD) data points, from the Book Corpus, IMDB reviews and Reddit.

### Supported Tasks and Leaderboards

The data contains both the main data for the TNE task and coreference resolution data.

There are two leaderboards for the TNE data, one for the standard test set and another for the OOD test set:

- [TNE Leaderboard](https://leaderboard.allenai.org/tne/submissions/public)
- [TNE OOD Leaderboard](https://leaderboard.allenai.org/tne-ood/submissions/public)

### Languages

The text in the dataset is in English, as spoken in the different domains we include. The associated BCP-47 code is `en`.

## Dataset Structure

### Data Instances

The original files are in JSONL format, with each line containing a dictionary for a single document. Each document contains a different number of labels, because the number of NPs varies from document to document. The `test` and `test_ood` splits come without the annotated labels.

### Data Fields

A document consists of:

* `id`: a unique identifier of a document, beginning with `r` and followed by a number
* `text`: the text of the document. The title and subtitles (if they exist) are separated by two newlines. The paragraphs are separated by a single newline.
* `tokens`: a list of strings, containing the tokens of the tokenized text
* `nps`: a list of dictionaries, containing the following entries:
  * `text`: the text of the NP
  * `first_char`: an integer indicating the starting character index in the text
  * `last_char`: an integer indicating the ending character index in the text
  * `first_token`: an integer indicating the first token of the NP in the token list
  * `last_token`: an integer indicating the last token of the NP in the token list
  * `id`: the id of the NP
* `np_relations`: the relation labels of the document. It is a list of dictionaries, where each dictionary contains:
  * `anchor`: the id of the anchor NP
  * `complement`: the id of the complement NP
  * `preposition`: the preposition that links the anchor and the complement. This takes one of 24 pre-defined values (23 prepositions plus `member(s) of`)
  * `complement_coref_cluster_id`: the id of the coreference cluster that the complement belongs to
* `coref`: the coreference labels. It contains a list of dictionaries, where each dictionary contains:
  * `id`: the id of the coreference cluster
  * `members`: the ids of the NPs that are members of the cluster
  * `np_type`: the type of cluster. It can be one of:
    * `standard`: regular coreference cluster
    * `time/date/measurement`: a time / date / measurement NP. These will be singletons.
    * `idiomatic`: an idiomatic expression
* `metadata`: metadata of the document. It contains the following:
  * `annotators`: a dictionary with anonymized annotator ids
    * `coref_worker`: the coreference worker id
    * `consolidator_worker`: the consolidator worker id
    * `np-relations_worker`: the NP relations worker id
  * `url`: the URL the document was taken from (not always present)
  * `source`: the original file name the document was taken from

### Data Splits

The dataset is spread across four files, one for each of the four splits: train, dev (`validation`), test and test_ood. Additional details on the data statistics can be found in the [paper](https://arxiv.org/abs/2109.12085).

## Dataset Creation

### Curation Rationale

TNE was built as a new task for language understanding, focusing on extracting relations between nouns, mediated by prepositions.

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The dataset was created by Yanai Elazar, Victoria Basmov, Yoav Goldberg, and Reut Tsarfaty, during work done at Bar-Ilan University and AI2.

### Licensing Information

The data is released under the MIT license.

### Citation Information

```bibtex
@article{tne,
    author = {Elazar, Yanai and Basmov, Victoria and Goldberg, Yoav and Tsarfaty, Reut},
    title = "{Text-based NP Enrichment}",
    journal = {Transactions of the Association for Computational Linguistics},
    year = {2022},
}
```

### Contributions

Thanks to [@yanaiela](https://github.com/yanaiela), who is also the first author of the paper, for adding this dataset.
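Provider:
yanaiela
Summary of original information

Dataset overview

Basic dataset information

  • Dataset name: Text-based NP Enrichment
  • Language: English
  • License: MIT
  • Dataset size: 5,497 documents
  • Dataset type: monolingual
  • Source datasets: original
  • Task category: text retrieval
  • Task ID: document retrieval

Dataset structure

Features

  • id: string, the unique identifier of the document
  • text: string, the text of the document
  • tokens: sequence of strings, the tokens of the tokenized text
  • nps: list, with the following fields:
    • text: string, the text of the noun phrase
    • first_char: integer, the starting character position of the noun phrase in the text
    • last_char: integer, the ending character position of the noun phrase in the text
    • first_token: integer, the starting token position of the noun phrase in the token list
    • last_token: integer, the ending token position of the noun phrase in the token list
    • id: string, the unique identifier of the noun phrase
  • np_relations: list, with the following fields:
    • anchor: string, the identifier of the anchor noun phrase
    • complement: string, the identifier of the complement noun phrase
    • preposition: class label, one of 24 predefined prepositions
    • complement_coref_cluster_id: string, the identifier of the coreference cluster the complement noun phrase belongs to
  • coref: list, with the following fields:
    • id: string, the identifier of the coreference cluster
    • members: sequence of strings, the identifiers of the noun phrases in the coreference cluster
    • np_type: class label, the type of the coreference cluster, one of three: standard, time/date/measurement, idiomatic
  • metadata: struct, with the following fields:
    • annotators: struct, with the following fields:
      • coref_worker: integer, the coreference annotator ID
      • consolidator_worker: integer, the consolidating annotator ID
      • np-relations_worker: sequence of integers, the noun phrase relation annotator IDs
    • url: string, the URL the document was taken from
    • source: string, the original file name the document was taken from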
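
The `np_relations` entries refer to noun phrases only by id, and `preposition` is stored as a class label, so reading a relation requires two lookups. The sketch below shows one way this could be done. It assumes each record is a plain dictionary with `nps` and `np_relations` as lists of dictionaries, as the schema suggests; the helper `readable_relations` and the toy document are hypothetical and are not taken from the dataset.

```python
# Class-label names for `preposition`, in the order given in the card's schema.
PREPOSITIONS = [
    "about", "for", "with", "from", "among", "by", "on", "at", "during", "of",
    "member(s) of", "in", "after", "under", "to", "into", "before", "near",
    "outside", "around", "between", "against", "over", "inside",
]


def readable_relations(doc):
    """Turn id-based relations into (anchor text, preposition, complement text)."""
    # Map each NP id to its surface text for quick lookup.
    np_text = {np["id"]: np["text"] for np in doc["nps"]}
    triples = []
    for rel in doc["np_relations"]:
        prep = rel["preposition"]
        # Assumption: the label may arrive as an integer index or as a string.
        if isinstance(prep, int):
            prep = PREPOSITIONS[prep]
        triples.append((np_text[rel["anchor"]], prep, np_text[rel["complement"]]))
    return triples


# Hypothetical toy document: invented for illustration, with only the
# fields this helper reads.
toy_doc = {
    "id": "r0",
    "text": "The mayor of Paris opened a museum.",
    "nps": [
        {"id": "np0", "text": "The mayor"},
        {"id": "np1", "text": "Paris"},
        {"id": "np2", "text": "a museum"},
    ],
    "np_relations": [
        {"anchor": "np0", "complement": "np1", "preposition": 9},
        {"anchor": "np2", "complement": "np1", "preposition": 11},
    ],
}

for anchor, prep, complement in readable_relations(toy_doc):
    print(f"{anchor} --{prep}--> {complement}")
```

Records loaded with the Hugging Face `datasets` library would presumably have this shape, but that is not exercised here. Note that the card states the test and OOD splits ship without annotated labels, so their `np_relations` may be empty.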

Data splits

  • train: 3988 examples, 41308170 bytes
  • validation: 500 examples, 5495419 bytes
  • test: 500 examples, 2203716 bytes
  • test_ood: 509 examples, 2249352 bytes
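
As a quick consistency check, the per-split figures above can be summed and compared against the totals quoted elsewhere on this page (5,497 documents and a dataset size of 51256657 bytes). The snippet is only an arithmetic sketch over the numbers listed here; the variable names are arbitrary.

```python
# (num_examples, num_bytes) per split, copied from the list above.
SPLITS = {
    "train": (3988, 41308170),
    "validation": (500, 5495419),
    "test": (500, 2203716),
    "test_ood": (509, 2249352),
}

total_examples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())

print(total_examples)  # 5497, matching the stated number of documents
print(total_bytes)     # 51256657, matching the stated dataset size

# Share of documents in each split.
for name, (n, _) in SPLITS.items():
    print(f"{name}: {n / total_examples:.1%}")
```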

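Dataset creation

  • Annotation creators: crowdsourced
  • Language creators: found
  • Dataset size category: 1K<n<10K
  • Source datasets: original

Dataset download

  • Download size: 14194578 bytes
  • Dataset size: 51256657 bytes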