yanaiela/numeric_fused_head

Name: yanaiela/numeric_fused_head
Creator: yanaiela
Published: 2024-01-18 11:10:59
License: MIT

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/yanaiela/numeric_fused_head

Resource description:

---
annotations_creators:
- crowdsourced
- expert-generated
- machine-generated
language_creators:
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1K<n<10K
source_datasets:
- original
task_categories:
- token-classification
task_ids: []
paperswithcode_id: numeric-fused-head
pretty_name: Numeric Fused Heads
tags:
- fused-head-identification
dataset_info:
- config_name: identification
  features:
  - name: tokens
    sequence: string
  - name: start_index
    dtype: int32
  - name: end_index
    dtype: int32
  - name: label
    dtype:
      class_label:
        names:
          '0': neg
          '1': pos
  splits:
  - name: train
    num_bytes: 22290345
    num_examples: 165606
  - name: test
    num_bytes: 68282
    num_examples: 500
  - name: validation
    num_bytes: 2474528
    num_examples: 18401
  download_size: 24407520
  dataset_size: 24833155
- config_name: resolution
  features:
  - name: tokens
    sequence: string
  - name: line_indices
    sequence: int32
  - name: head
    sequence: string
  - name: speakers
    sequence: string
  - name: anchors_indices
    sequence: int32
  splits:
  - name: train
    num_bytes: 19766437
    num_examples: 7412
  - name: test
    num_bytes: 2743071
    num_examples: 1000
  - name: validation
    num_bytes: 2633549
    num_examples: 1000
  download_size: 24923403
  dataset_size: 25143057
config_names:
- identification
- resolution
---

# Dataset Card for Numeric Fused Heads

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [The Numeric Fused-Head demo](https://nlp.biu.ac.il/~lazary/fh/)
- **Repository:** [Github Repo](https://github.com/yanaiela/num_fh)
- **Paper:** [Where’s My Head? Definition, Data Set, and Models for Numeric Fused-Head Identification and Resolution](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00280)
- **Leaderboard:** [NLP Progress](http://nlpprogress.com/english/missing_elements.html)
- **Point of Contact:** [Yanai Elazar](https://yanaiela.github.io), [Yoav Goldberg](https://www.cs.bgu.ac.il/~yoavg/uni/)

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

- Numeric Fused Head Identification
- Numeric Fused Head Resolution

### Languages

English

## Dataset Structure

### Data Instances

#### Identification

```
{
  "tokens": ["It", "’s", "a", "curious", "thing", ",", "the", "death", "of", "a", "loved", "one", "."],
  "start_index": 11,
  "end_index": 12,
  "label": 1
}
```

#### Resolution

```
{
  "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"],
  "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1],
  "head": ["AGE"],
  "speakers": ["John Doe", "John Doe", "John Doe", "John Doe", "John Doe", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs"],
  "anchors_indices": [2]
}
```

### Data Fields

#### Identification

- `tokens` - List of token strings as tokenized with [spaCy](https://spacy.io).
- `start_index` - Start index of the anchor.
- `end_index` - End index of the anchor.
- `label` - "pos" or "neg" depending on whether this example contains a numeric fused head.

#### Resolution

- `tokens` - List of token strings as tokenized with [spaCy](https://spacy.io).
- `line_indices` - List of indices indicating the line number (one for each token).
- `head` - Reference to the missing head. If the head exists elsewhere in the sentence, this is given as a token index.
- `speakers` - List of speaker names (one for each token).
- `anchors_indices` - Index indicating which token is the anchor (the visible number).

### Data Splits

Train, Test, Dev

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

MIT License

### Citation Information

```
@article{doi:10.1162/tacl\_a\_00280,
  author = {Elazar, Yanai and Goldberg, Yoav},
  title = {Where’s My Head? Definition, Data Set, and Models for Numeric Fused-Head Identification and Resolution},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {7},
  number = {},
  pages = {519-535},
  year = {2019},
  doi = {10.1162/tacl\_a\_00280},
}
```

### Contributions

Thanks to [@ghomasHudson](https://github.com/ghomasHudson) for adding this dataset.

Provider:

yanaiela

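Summary of original information

Dataset overview

Dataset name

Name: Numeric Fused Heads
Alias: num_fh

Basic dataset information

Language: English (en)
License: MIT
Multilinguality: monolingual
Size categories:
- 1K<n<10K
- 100K<n<1M

Dataset creation

Annotation creators:
- crowdsourced
- expert-generated
- machine-generated
Language creators: found

Task categories

Task category: token classification
Task IDs: none
Papers with Code ID: numeric-fused-head

Dataset structure

Configuration names

identification
resolution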
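
Because the dataset has two configurations, a loader has to name one of them explicitly. The sketch below assumes the Hugging Face `datasets` library is installed and that the Hub is reachable; the helper name `load_numeric_fused_head` is hypothetical, and depending on the library version extra arguments may be needed to load this repository.

```python
CONFIG_NAMES = ("identification", "resolution")


def load_numeric_fused_head(config_name: str, split: str = "train"):
    """Load one split of one configuration (hypothetical helper)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(
            f"unknown config {config_name!r}; expected one of {CONFIG_NAMES}"
        )
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    # Repository id and config names are taken from the dataset card.
    return load_dataset("yanaiela/numeric_fused_head", config_name, split=split)
```

A call such as `load_numeric_fused_head("resolution", "validation")` would then request the validation split of the resolution configuration.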

Features

identification

tokens: sequence of strings
start_index: int32
end_index: int32
label: class label, either neg or pos

resolution

tokens: sequence of strings
line_indices: sequence of int32
head: sequence of strings
speakers: sequence of strings
anchors_indices: sequence of int32
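
The feature lists above can be turned into a lightweight sanity check for records held as plain Python dictionaries. This is a minimal sketch: the names `IDENTIFICATION_FIELDS`, `RESOLUTION_FIELDS`, and `check_instance` are made up for illustration, and the check only looks at top-level Python types, not at the int32 range or at element types inside sequences.

```python
# Expected top-level Python type for each feature of each configuration.
IDENTIFICATION_FIELDS = {
    "tokens": list,
    "start_index": int,
    "end_index": int,
    "label": int,
}
RESOLUTION_FIELDS = {
    "tokens": list,
    "line_indices": list,
    "head": list,
    "speakers": list,
    "anchors_indices": list,
}


def check_instance(instance: dict, fields: dict) -> list:
    """Return a list of human-readable problems; empty means the record fits."""
    problems = []
    for name, expected in fields.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected):
            actual = type(instance[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a record in a single pass.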

Data splits

identification

train: 165606 examples, 22290345 bytes
test: 500 examples, 68282 bytes
validation: 18401 examples, 2474528 bytes
Download size: 24407520 bytes
Dataset size: 24833155 bytes

resolution

train: 7412 examples, 19766437 bytes
test: 1000 examples, 2743071 bytes
validation: 1000 examples, 2633549 bytes
Download size: 24923403 bytes
Dataset size: 25143057 bytes
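
The split figures above are internally consistent, which a few lines of arithmetic can confirm. The sketch below copies the numbers from this page into a dictionary (the names `SPLITS`, `DATASET_SIZE`, and `summarize` are illustrative) and derives totals, the training share, and the average record size per configuration.

```python
# (num_examples, num_bytes) per split, copied from the listing above.
SPLITS = {
    "identification": {
        "train": (165606, 22290345),
        "test": (500, 68282),
        "validation": (18401, 2474528),
    },
    "resolution": {
        "train": (7412, 19766437),
        "test": (1000, 2743071),
        "validation": (1000, 2633549),
    },
}
# Reported dataset_size per configuration, for cross-checking.
DATASET_SIZE = {"identification": 24833155, "resolution": 25143057}


def summarize(config: str) -> dict:
    """Aggregate example counts and byte sizes for one configuration."""
    splits = SPLITS[config]
    total_examples = sum(n for n, _ in splits.values())
    total_bytes = sum(b for _, b in splits.values())
    return {
        "total_examples": total_examples,
        "total_bytes": total_bytes,
        "matches_reported_size": total_bytes == DATASET_SIZE[config],
        "train_fraction": splits["train"][0] / total_examples,
        "bytes_per_example": total_bytes / total_examples,
    }


for config in SPLITS:
    print(config, summarize(config))
```

The per-split byte counts sum exactly to the reported dataset sizes. Identification has 184507 examples in total and resolution has 9412, and a resolution record is on average roughly 20 times larger than an identification record.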

Supported tasks and leaderboards

Numeric Fused Head Identification
Numeric Fused Head Resolution

Label information

label:
- 0: neg
- 1: pos
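
Since `label` is stored as an integer, code that reports results usually needs to map it back to its name. A minimal sketch of that mapping follows; the helper names are illustrative and not part of any library.

```python
# Position in the list is the integer id: 0 -> "neg", 1 -> "pos".
LABEL_NAMES = ["neg", "pos"]


def label_to_name(label_id: int) -> str:
    """Convert an integer label to its name, rejecting out-of-range ids."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"label id out of range: {label_id}")
    return LABEL_NAMES[label_id]


def name_to_label(name: str) -> int:
    """Convert a label name to its integer id."""
    return LABEL_NAMES.index(name)
```

The explicit range check prevents Python's negative indexing from silently mapping `-1` to `"pos"`.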

Data instances

identification

{ "tokens": ["It", "’s", "a", "curious", "thing", ",", "the", "death", "of", "a", "loved", "one", "."], "start_index": 11, "end_index": 12, "label": 1 }
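
To see which tokens the anchor covers, the two indices can be used as a slice. The sketch below assumes `end_index` is exclusive, which is what the instance above suggests (index 11 is "one" and index 12 is the final period); the source does not state this explicitly.

```python
# The identification instance shown above.
example = {
    "tokens": ["It", "’s", "a", "curious", "thing", ",", "the",
               "death", "of", "a", "loved", "one", "."],
    "start_index": 11,
    "end_index": 12,
    "label": 1,
}


def anchor_tokens(instance: dict) -> list:
    """Return the anchor span, treating end_index as exclusive (assumption)."""
    return instance["tokens"][instance["start_index"]:instance["end_index"]]


print(anchor_tokens(example))
```

For this instance the slice yields `['one']`, the word whose status as a numeric fused head is being classified.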

resolution

{ "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"], "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1], "head": ["AGE"], "speakers": ["John Doe", "John Doe", "John Doe", "John Doe", "John Doe", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs", "Joe Bloggs"], "anchors_indices": [2] }
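
The resolution instance stores the dialogue as three parallel per-token lists. The sketch below regroups them into one entry per line, which is often easier to read; the function names are illustrative, and the grouping assumes tokens of the same line are contiguous, as they are in the instance above.

```python
from itertools import groupby

# The resolution instance shown above.
example = {
    "tokens": ["I", "'m", "eighty", "tomorrow", ".", "Are", "you", "sure", "?"],
    "line_indices": [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "head": ["AGE"],
    "speakers": ["John Doe"] * 5 + ["Joe Bloggs"] * 4,
    "anchors_indices": [2],
}


def utterances(instance: dict) -> list:
    """Group tokens into (line, speaker, tokens) triples, in order."""
    rows = zip(instance["line_indices"], instance["speakers"], instance["tokens"])
    grouped = []
    for (line, speaker), group in groupby(rows, key=lambda row: (row[0], row[1])):
        grouped.append((line, speaker, [token for _, _, token in group]))
    return grouped


def anchor_words(instance: dict) -> list:
    """Look up the anchor tokens by their indices."""
    return [instance["tokens"][i] for i in instance["anchors_indices"]]


for line, speaker, tokens in utterances(example):
    print(line, speaker, " ".join(tokens))
print("anchor:", anchor_words(example), "head:", example["head"])
```

Here the anchor is "eighty" and the missing head is the implicit category `AGE`.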

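Data fields

identification

tokens: list of token strings
start_index: start index of the anchor
end_index: end index of the anchor
label: pos or neg, depending on whether the example contains a numeric fused head

resolution

tokens: list of token strings
line_indices: list of indices indicating the line number
head: reference to the missing head
speakers: list of speaker names
anchors_indices: index indicating which token is the anchor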
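
According to the dataset card, `head` is given as a token index when the missing head appears elsewhere in the text, and otherwise names an implicit category such as `AGE`. Because `head` is a sequence of strings, the sketch below assumes that a token index is encoded as a digit string; that encoding is an assumption, not something the source confirms, and the second record is a made-up illustration rather than a real dataset entry.

```python
def describe_head(instance: dict) -> tuple:
    """Classify the head as a token reference or an implicit category.

    Assumption: token indices in `head` are encoded as digit strings.
    """
    head = instance["head"]
    tokens = instance["tokens"]
    if head and all(item.isdigit() for item in head):
        return ("reference", [tokens[int(item)] for item in head])
    return ("implicit", list(head))


# Instance from the dataset card: the head is an implicit category.
implicit = {
    "tokens": ["I", "'m", "eighty", "tomorrow", "."],
    "head": ["AGE"],
    "anchors_indices": [2],
}

# Hypothetical instance, invented for illustration only.
referential = {
    "tokens": ["I", "bought", "two", "books", "and", "she", "bought", "three", "."],
    "head": ["3"],
    "anchors_indices": [7],
}

print(describe_head(implicit))
print(describe_head(referential))
```

Under these assumptions the first call reports an implicit `AGE` head and the second resolves the head of "three" to the token "books".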
