amirveyseh/acronym_identification

Name: amirveyseh/acronym_identification
Creator: amirveyseh
Published: 2024-01-09 11:39:57
License: No description available

Source: Hugging Face. Updated: 2024-01-09. Indexed: 2024-05-25.

Download link:

https://hf-mirror.com/datasets/amirveyseh/acronym_identification

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids: []
paperswithcode_id: acronym-identification
pretty_name: Acronym Identification Dataset
tags:
- acronym-identification
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: labels
    sequence:
      class_label:
        names:
          '0': B-long
          '1': B-short
          '2': I-long
          '3': I-short
          '4': O
  splits:
  - name: train
    num_bytes: 7792771
    num_examples: 14006
  - name: validation
    num_bytes: 952689
    num_examples: 1717
  - name: test
    num_bytes: 987712
    num_examples: 1750
  download_size: 2071007
  dataset_size: 9733172
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
train-eval-index:
- config: default
  task: token-classification
  task_id: entity_extraction
  splits:
    eval_split: test
  col_mapping:
    tokens: tokens
    labels: tags
---

# Dataset Card for Acronym Identification Dataset

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://sites.google.com/view/sdu-aaai21/shared-task
- **Repository:** https://github.com/amirveyseh/AAAI-21-SDU-shared-task-1-AI
- **Paper:** [What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation](https://arxiv.org/pdf/2010.14678v1.pdf)
- **Leaderboard:** https://competitions.codalab.org/competitions/26609
- **Point of Contact:** [More Information Needed]

### Dataset Summary

This dataset contains the training, validation, and test data for **Shared Task 1: Acronym Identification** of the AAAI-21 Workshop on Scientific Document Understanding.

### Supported Tasks and Leaderboards

The dataset supports an `acronym-identification` task, where the aim is to predict which tokens in a pre-tokenized sentence correspond to acronyms. The dataset was released for a shared task that supported a [leaderboard](https://competitions.codalab.org/competitions/26609).

### Languages

The sentences in the dataset are in English (`en`).

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{'id': 'TR-0',
 'labels': [4, 4, 4, 4, 0, 2, 2, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4],
 'tokens': ['What', 'is', 'here', 'called', 'controlled', 'natural', 'language', '(', 'CNL', ')', 'has', 'traditionally', 'been', 'given', 'many', 'different', 'names', '.']}
```

Please note that for test set sentences only the `id` and `tokens` fields are available; `labels` can be ignored for the test set.
Labels in the test set are all `O`.

### Data Fields

The data instances have the following fields:

- `id`: a `string` variable representing the example id, unique across the full dataset
- `tokens`: a list of `string` variables representing the word-tokenized sentence
- `labels`: a list of `categorical` variables with possible values `["B-long", "B-short", "I-long", "I-short", "O"]` corresponding to a BIO scheme. `-long` corresponds to the expanded acronym, such as *controlled natural language* here, and `-short` to the abbreviation, `CNL` here.

### Data Splits

The training, validation, and test sets contain `14,006`, `1,717`, and `1,750` sentences, respectively.

## Dataset Creation

### Curation Rationale

> First, most of the existing datasets for acronym identification (AI) are either limited in their sizes or created using simple rule-based methods.
> This is unfortunate as rules are in general not able to capture all the diverse forms to express acronyms and their long forms in text.
> Second, most of the existing datasets are in the medical domain, ignoring the challenges in other scientific domains.
> In order to address these limitations this paper introduces two new datasets for Acronym Identification.
> Notably, our datasets are annotated by human to achieve high quality and have substantially larger numbers of examples than the existing AI datasets in the non-medical domain.

### Source Data

#### Initial Data Collection and Normalization

> In order to prepare a corpus for acronym annotation, we collect a corpus of 6,786 English papers from arXiv.
> These papers consist of 2,031,592 sentences that would be used for data annotation for AI in this work.

The dataset paper does not report the exact tokenization method.

#### Who are the source language producers?

The language comes from papers hosted on the online digital archive [arXiv](https://arxiv.org/). No further information is available on the selection process or the identity of the writers.
### Annotations

#### Annotation process

> Each sentence for annotation needs to contain at least one word in which more than half of the characters in are capital letters (i.e., acronym candidates).
> Afterward, we search for a sub-sequence of words in which the concatenation of the first one, two or three characters of the words (in the order of the words in the sub-sequence could form an acronym candidate.
> We call the sub-sequence a long form candidate. If we cannot find any long form candidate, we remove the sentence.
> Using this process, we end up with 17,506 sentences to be annotated manually by the annotators from Amazon Mechanical Turk (MTurk).
> In particular, we create a HIT for each sentence and ask the workers to annotate the short forms and the long forms in the sentence.
> In case of disagreements, if two out of three workers agree on an annotation, we use majority voting to decide the correct annotation.
> Otherwise, a fourth annotator is hired to resolve the conflict.

#### Who are the annotators?

Workers were recruited through Amazon Mechanical Turk and paid $0.05 per annotation. No further demographic information is provided.

### Personal and Sensitive Information

Papers published on arXiv are unlikely to contain much personal information, although some include poorly chosen examples that reveal personal details, so the data should be used with care.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The dataset is provided for research purposes only. Please check the dataset license for additional information.

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset provided for this shared task is licensed under the CC BY-NC-SA 4.0 International license.
### Citation Information

```
@inproceedings{Veyseh2020,
  author    = {Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},
  editor    = {Donia Scott and N{\'{u}}ria Bel and Chengqing Zong},
  title     = {What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING} 2020, Barcelona, Spain (Online), December 8-13, 2020},
  pages     = {3285--3301},
  publisher = {International Committee on Computational Linguistics},
  year      = {2020},
  url       = {https://doi.org/10.18653/v1/2020.coling-main.292},
  doi       = {10.18653/v1/2020.coling-main.292}
}
```

### Contributions

Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.

Provider:

amirveyseh

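Summary of original information

Acronym Identification Dataset overview

Dataset description

Dataset summary

This dataset contains the training, validation, and test data for Shared Task 1: Acronym Identification, part of the AAAI-21 Workshop on Scientific Document Understanding.

Supported tasks and leaderboards

The dataset supports the acronym-identification task, which aims to predict which tokens in a pre-tokenized sentence correspond to acronyms. It was released for a shared task that supported a leaderboard.

Languages

The sentences in the dataset are in English (en).

Dataset structure

Data instances

A sample from the training set:

{
  "id": "TR-0",
  "labels": [4, 4, 4, 4, 0, 2, 2, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4],
  "tokens": ["What", "is", "here", "called", "controlled", "natural", "language", "(", "CNL", ")", "has", "traditionally", "been", "given", "many", "different", "names", "."]
}

Note that in the test set only the id and tokens fields are available. The labels field can be ignored there: every label in the test set is O.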
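The integer labels in the sample can be read by mapping them back to tag names. The following is a minimal sketch, assuming the label order given in the card's `class_label` mapping (ids 0 to 4); the helper name `decode_labels` is hypothetical and not part of the dataset's tooling, and the sample is hard-coded so the snippet runs without downloading anything.

```python
# Label names in the order of the card's class_label mapping (ids 0-4).
LABEL_NAMES = ["B-long", "B-short", "I-long", "I-short", "O"]

# The training sample shown in the card, copied verbatim.
sample = {
    "id": "TR-0",
    "labels": [4, 4, 4, 4, 0, 2, 2, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "tokens": ["What", "is", "here", "called", "controlled", "natural",
               "language", "(", "CNL", ")", "has", "traditionally", "been",
               "given", "many", "different", "names", "."],
}


def decode_labels(label_ids):
    """Map integer label ids to their tag names (hypothetical helper)."""
    return [LABEL_NAMES[i] for i in label_ids]


tags = decode_labels(sample["labels"])

# Print only the tokens that belong to a short or long form.
for token, tag in zip(sample["tokens"], tags):
    if tag != "O":
        print(f"{token}\t{tag}")
```

For this sample, the non-`O` tokens are `controlled`, `natural`, `language` (the long form) and `CNL` (the short form).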

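Data fields

Each data instance has the following fields:

id: a string representing the example ID, unique across the full dataset.
tokens: a list of strings representing the word-tokenized sentence.
labels: a list of categorical variables with possible values ["B-long", "B-short", "I-long", "I-short", "O"], corresponding to a BIO scheme. -long marks the expanded acronym, such as controlled natural language here, and -short marks the abbreviation, CNL here.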
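To make the BIO scheme concrete, the sketch below groups consecutive `B-`/`I-` tags into long-form and short-form spans. It is an illustration under stated assumptions, not the shared task's official scoring code: the function name `extract_spans` is hypothetical, and treating a stray `I-` tag (one with no preceding `B-` of the same kind) as the start of a new span is a choice made here for robustness.

```python
def extract_spans(tokens, tags):
    """Group BIO tags into (kind, text) spans, where kind is 'long' or 'short'."""
    spans = []
    current_kind, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")  # "B-long" -> ("B", "-", "long")
        if prefix == "I" and kind == current_kind:
            # Continuation of the span in progress.
            current_tokens.append(token)
            continue
        # Any other tag closes the span in progress, if there is one.
        if current_kind is not None:
            spans.append((current_kind, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Start a new span (assumption: a stray I- tag also starts one).
            current_kind, current_tokens = kind, [token]
        else:
            current_kind, current_tokens = None, []
    if current_kind is not None:
        spans.append((current_kind, " ".join(current_tokens)))
    return spans


tokens = ["What", "is", "here", "called", "controlled", "natural",
          "language", "(", "CNL", ")", "has", "traditionally", "been",
          "given", "many", "different", "names", "."]
tags = ["O", "O", "O", "O", "B-long", "I-long", "I-long", "O", "B-short",
        "O", "O", "O", "O", "O", "O", "O", "O", "O"]

print(extract_spans(tokens, tags))
# [('long', 'controlled natural language'), ('short', 'CNL')]
```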

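Data splits

The training, validation, and test sets contain 14,006, 1,717, and 1,750 sentences, respectively.

Dataset creation

Curation rationale

First, most existing datasets for acronym identification (AI) are either limited in size or created with simple rule-based methods. This is unfortunate, because rules generally cannot capture all the diverse ways acronyms and their long forms are expressed in text. Second, most existing datasets are in the medical domain, ignoring the challenges of other scientific domains. To address these limitations, the paper introduces two new datasets for acronym identification. Notably, the datasets are annotated by humans to achieve high quality, and they contain substantially more examples than existing AI datasets in the non-medical domain.

Source data

Initial data collection and normalization

To prepare a corpus for acronym annotation, the authors collected 6,786 English papers from arXiv. These papers contain 2,031,592 sentences to be used for data annotation in this work.

The dataset paper does not report the exact tokenization method.

Who are the source language producers?

The language comes from papers hosted on the online digital archive arXiv. No further information is available on the selection process or the identity of the writers.

Annotations

Annotation process

Each sentence selected for annotation must contain at least one word in which more than half of the characters are capital letters (an acronym candidate). The authors then search for a sub-sequence of words in which the concatenation of the first one, two, or three characters of the words (in the order of the words in the sub-sequence) could form an acronym candidate. This sub-sequence is called a long form candidate. If no long form candidate can be found, the sentence is removed. This process yields 17,506 sentences, which were annotated manually by annotators from Amazon Mechanical Turk (MTurk). Specifically, a HIT is created for each sentence, and workers are asked to annotate the short forms and long forms in the sentence. In case of disagreement, if two out of three workers agree on an annotation, majority voting decides the correct annotation. Otherwise, a fourth annotator is hired to resolve the conflict.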
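The sentence-filtering heuristic described above can be sketched as follows. This is not the authors' code: the function names are hypothetical, and three details are assumptions made here because the description leaves them open: matching is case-insensitive, the sub-sequence is taken to be a contiguous run of words, and the acronym candidate itself is excluded from its own long form.

```python
def is_acronym_candidate(word):
    """True if more than half of the characters in `word` are capital letters."""
    capitals = sum(1 for ch in word if ch.isupper())
    return 2 * capitals > len(word)


def match_from(words, start, acronym):
    """Return the exclusive end index of a run of words beginning at `start`
    whose 1-3 character prefixes concatenate to `acronym`, or None."""
    if not acronym:
        return start
    if start >= len(words):
        return None
    word = words[start].lower()  # assumption: case-insensitive matching
    for k in (1, 2, 3):
        prefix = word[:k]
        if len(prefix) == k and acronym.startswith(prefix):
            end = match_from(words, start + 1, acronym[k:])
            if end is not None:
                return end
    return None


def find_long_form_candidate(words, acronym_index):
    """Return the first run of words that spells the acronym, or None."""
    acronym = words[acronym_index].lower()
    for start in range(len(words)):
        end = match_from(words, start, acronym)
        # Assumption: the acronym word may not be part of its own long form.
        if end is not None and not (start <= acronym_index < end):
            return words[start:end]
    return None


def keep_sentence(words):
    """Keep a sentence only if some acronym candidate has a long form candidate."""
    return any(
        is_acronym_candidate(w) and find_long_form_candidate(words, i) is not None
        for i, w in enumerate(words)
    )


tokens = ["What", "is", "here", "called", "controlled", "natural",
          "language", "(", "CNL", ")", "has", "traditionally", "been",
          "given", "many", "different", "names", "."]
print(find_long_form_candidate(tokens, tokens.index("CNL")))
# ['controlled', 'natural', 'language']
```

A sentence such as `["We", "use", "GPU", "clusters"]` contains an acronym candidate but no long form candidate, so `keep_sentence` returns `False` for it under these assumptions.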

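Who are the annotators?

Workers were recruited through Amazon Mechanical Turk and paid $0.05 per annotation. No further demographic information is provided.

Personal and sensitive information

Papers published on arXiv are unlikely to contain much personal information, although some include poorly chosen examples that reveal personal details, so the data should be used with care.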
