
yanaiela/numeric_fused_head

Hugging Face | Updated 2024-01-18 | Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/yanaiela/numeric_fused_head
Resource description:
---
annotations_creators:
- crowdsourced
- expert-generated
- machine-generated
language_creators:
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1K<n<10K
source_datasets:
- original
task_categories:
- token-classification
task_ids: []
paperswithcode_id: numeric-fused-head
pretty_name: Numeric Fused Heads
tags:
- fused-head-identification
dataset_info:
- config_name: identification
  features:
  - name: tokens
    sequence: string
  - name: start_index
    dtype: int32
  - name: end_index
    dtype: int32
  - name: label
    dtype:
      class_label:
        names:
          '0': neg
          '1': pos
  splits:
  - name: train
    num_bytes: 22290345
    num_examples: 165606
  - name: test
    num_bytes: 68282
    num_examples: 500
  - name: validation
    num_bytes: 2474528
    num_examples: 18401
  download_size: 24407520
  dataset_size: 24833155
- config_name: resolution
  features:
  - name: tokens
    sequence: string
  - name: line_indices
    sequence: int32
  - name: head
    sequence: string
  - name: speakers
    sequence: string
  - name: anchors_indices
    sequence: int32
  splits:
  - name: train
    num_bytes: 19766437
    num_examples: 7412
  - name: test
    num_bytes: 2743071
    num_examples: 1000
  - name: validation
    num_bytes: 2633549
    num_examples: 1000
  download_size: 24923403
  dataset_size: 25143057
config_names:
- identification
- resolution
---

# Dataset Card for Numeric Fused Heads

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [The Numeric Fused-Head demo](https://nlp.biu.ac.il/~lazary/fh/)
- **Repository:** [Github Repo](https://github.com/yanaiela/num_fh)
- **Paper:** [Where’s My Head? Definition, Data Set, and Models for Numeric Fused-Head Identification and Resolution](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00280)
- **Leaderboard:** [NLP Progress](http://nlpprogress.com/english/missing_elements.html)
- **Point of Contact:** [Yanai Elazar](https://yanaiela.github.io), [Yoav Goldberg](https://www.cs.bgu.ac.il/~yoavg/uni/)

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

- Numeric Fused Head Identification
- Numeric Fused Head Resolution

### Languages

English

## Dataset Structure

### Data Instances

#### Identification

```
{
  "tokens": ["It", "’s", "a", "curious", "thing", ",", "the", "death", "of", "a", "loved", "one", "."],
  "start_index": 11,
  "end_index": 12,
  "label": 1
}
```

#### Resolution

```
{
  "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"],
  "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1],
  "head": ["AGE"],
  "speakers": ["John Doe", "John Doe", "John Doe", "John Doe", "John Doe", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs"],
  "anchors_indices": [2]
}
```

### Data Fields

#### Identification

- `tokens` - List of token strings, as tokenized with [spaCy](https://spacy.io).
- `start_index` - Start index of the anchor.
- `end_index` - End index of the anchor.
- `label` - "pos" or "neg", depending on whether this example contains a numeric fused head.

#### Resolution

- `tokens` - List of token strings, as tokenized with [spaCy](https://spacy.io).
- `line_indices` - List of indices indicating the line number (one for each token).
- `head` - Reference to the missing head. If the head exists elsewhere in the sentence, this is given as a token index.
- `speakers` - List of speaker names (one for each token).
- `anchors_indices` - Indices indicating which token is the anchor (the visible number).

### Data Splits

Train, Test, Dev

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

MIT License

### Citation Information

```
@article{doi:10.1162/tacl_a_00280,
  author = {Elazar, Yanai and Goldberg, Yoav},
  title = {Where’s My Head? Definition, Data Set, and Models for Numeric Fused-Head Identification and Resolution},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {7},
  number = {},
  pages = {519-535},
  year = {2019},
  doi = {10.1162/tacl_a_00280},
}
```

### Contributions

Thanks to [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.
Provider:
yanaiela
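Summary of Original Information

Dataset Overview

Dataset Name

  • Name: Numeric Fused Heads
  • Alias: num_fh

Basic Information

  • Language: English (en)
  • License: MIT
  • Multilinguality: monolingual
  • Size categories:
    • 1K<n<10K
    • 100K<n<1M

Dataset Creation

  • Annotation creators:
    • crowdsourced
    • expert-generated
    • machine-generated
  • Language creators: found

Task Categories

  • Task category: token classification
  • Task IDs: none
  • Papers with Code ID: numeric-fused-head

Dataset Structure

Config names
  • identification
  • resolution
Features
identification
  • tokens: sequence of strings
  • start_index: int32
  • end_index: int32
  • label: class label, one of neg or pos
resolution
  • tokens: sequence of strings
  • line_indices: sequence of int32
  • head: sequence of strings
  • speakers: sequence of strings
  • anchors_indices: sequence of int32
Data splits
identification
  • train: 165606 examples, 22290345 bytes
  • test: 500 examples, 68282 bytes
  • validation: 18401 examples, 2474528 bytes
  • Download size: 24407520 bytes
  • Dataset size: 24833155 bytes
resolution
  • train: 7412 examples, 19766437 bytes
  • test: 1000 examples, 2743071 bytes
  • validation: 1000 examples, 2633549 bytes
  • Download size: 24923403 bytes
  • Dataset size: 25143057 bytes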
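
As a quick sanity check on the split figures listed above, the following minimal Python sketch totals the examples and bytes per config and computes each split's share of the examples. The numbers are copied from the card's metadata; the names `SPLITS`, `DATASET_SIZE`, and `summarize` are illustrative and not part of the dataset or any library.

```python
# Split statistics copied from the dataset card metadata.
SPLITS = {
    "identification": {
        "train": {"num_examples": 165606, "num_bytes": 22290345},
        "test": {"num_examples": 500, "num_bytes": 68282},
        "validation": {"num_examples": 18401, "num_bytes": 2474528},
    },
    "resolution": {
        "train": {"num_examples": 7412, "num_bytes": 19766437},
        "test": {"num_examples": 1000, "num_bytes": 2743071},
        "validation": {"num_examples": 1000, "num_bytes": 2633549},
    },
}

# dataset_size values as reported in the card, in bytes.
DATASET_SIZE = {"identification": 24833155, "resolution": 25143057}


def summarize(config):
    """Return (total examples, total bytes, per-split share of examples)."""
    splits = SPLITS[config]
    total_examples = sum(s["num_examples"] for s in splits.values())
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    shares = {name: s["num_examples"] / total_examples for name, s in splits.items()}
    return total_examples, total_bytes, shares


for config in SPLITS:
    examples, size, shares = summarize(config)
    # The per-split byte counts should add up to the reported dataset_size.
    print(config, examples, size == DATASET_SIZE[config])
    for name, share in shares.items():
        print(f"  {name}: {share:.1%}")
```

In both configs the per-split byte counts add up exactly to the reported dataset size, so the listed figures are internally consistent.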

Supported Tasks and Leaderboards

  • Numeric Fused Head Identification
  • Numeric Fused Head Resolution

Label Information

  • label:
    • 0: neg
    • 1: pos

Data Instances

identification

{ "tokens": ["It", "’s", "a", "curious", "thing", ",", "the", "death", "of", "a", "loved", "one", "."], "start_index": 11, "end_index": 12, "label": 1 }
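
To make the identification fields concrete, here is a minimal Python sketch that pulls the anchor out of the instance above and maps its integer label to a name. It assumes `end_index` is exclusive, as in a Python slice; the card does not state this, but it is the reading under which the example's span (11, 12) selects the single token "one". The helper names `anchor_tokens` and `label_name` are hypothetical.

```python
# Label ids and names as listed in the dataset card.
LABEL_NAMES = {0: "neg", 1: "pos"}

# The identification instance from the dataset card.
example = {
    "tokens": ["It", "’s", "a", "curious", "thing", ",", "the",
               "death", "of", "a", "loved", "one", "."],
    "start_index": 11,
    "end_index": 12,
    "label": 1,
}


def anchor_tokens(record):
    """Return the tokens of the anchor span.

    Assumption: end_index is exclusive (slice semantics). The card does not
    say so explicitly; this is inferred from the example above.
    """
    return record["tokens"][record["start_index"]:record["end_index"]]


def label_name(record):
    """Map the integer class label to its name (neg or pos)."""
    return LABEL_NAMES[record["label"]]


print(anchor_tokens(example), label_name(example))
```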

resolution

{ "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"], "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1], "head": ["AGE"], "speakers": ["John Doe", "John Doe", "John Doe", "John Doe", "John Doe", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs"], "anchors_indices": [2] }
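
Because `line_indices` and `speakers` are given per token, a resolution instance can be regrouped into dialogue lines. The sketch below is one possible way to do that with the standard library, using the resolution instance from the dataset card; the function name `utterances` is hypothetical, and the sketch assumes that tokens of the same line are contiguous, as they are in this example.

```python
from itertools import groupby

# The resolution instance from the dataset card.
example = {
    "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"],
    "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "head": ["AGE"],
    "speakers": ["John Doe"] * 5 + ["Joe Bloggs"] * 4,
    "anchors_indices": [2],
}


def utterances(record):
    """Group tokens into (line index, speaker, tokens) triples.

    Assumption: tokens belonging to one line are contiguous, so grouping
    consecutive tokens by (line index, speaker) recovers the lines.
    """
    rows = zip(record["line_indices"], record["speakers"], record["tokens"])
    grouped = []
    for (line, speaker), group in groupby(rows, key=lambda row: (row[0], row[1])):
        grouped.append((line, speaker, [token for _, _, token in group]))
    return grouped


for line, speaker, tokens in utterances(example):
    print(f"{line} {speaker}: {' '.join(tokens)}")
```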

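Data Fields

identification
  • tokens: list of token strings
  • start_index: start index of the anchor
  • end_index: end index of the anchor
  • label: pos or neg, depending on whether the example contains a numeric fused head
resolution
  • tokens: list of token strings
  • line_indices: list of indices indicating the line number of each token
  • head: reference to the missing head
  • speakers: list of speaker names, one per token
  • anchors_indices: indices indicating which token is the anchor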
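
The resolution fields described above can be checked and interpreted with a few lines of Python. This is only a sketch under stated assumptions: the dataset card says the head is given as a token index when it appears in the text, and since `head` is a sequence of strings, the code assumes such an index arrives as a digit string (for example "7") while anything else is an implicit class such as "AGE". That encoding is a guess, not something the source confirms, and all function names are hypothetical.

```python
# The resolution instance from the dataset card.
record = {
    "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"],
    "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "head": ["AGE"],
    "speakers": ["John Doe"] * 5 + ["Joe Bloggs"] * 4,
    "anchors_indices": [2],
}


def is_aligned(rec):
    """line_indices and speakers should hold one entry per token."""
    n = len(rec["tokens"])
    return len(rec["line_indices"]) == n and len(rec["speakers"]) == n


def anchor_tokens(rec):
    """Look up the anchor token(s), i.e. the visible number."""
    return [rec["tokens"][i] for i in rec["anchors_indices"]]


def describe_head(rec):
    """Classify each head entry as a token reference or an implicit class.

    Assumption: a head that exists in the text is encoded as a digit string
    holding its token index; the source does not specify the encoding.
    """
    described = []
    for entry in rec["head"]:
        if entry.isdigit():
            described.append(("token", rec["tokens"][int(entry)]))
        else:
            described.append(("implicit", entry))
    return described


print(is_aligned(record), anchor_tokens(record), describe_head(record))
```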