saier/unarXive_imrad_clf

Name: saier/unarXive_imrad_clf
Creator: saier
Published: 2023-04-02 00:56:43
License: cc-by-sa-4.0

Source: Hugging Face. Updated 2023-04-02. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/saier/unarXive_imrad_clf


Resource description:

---
annotations_creators:
- machine-generated
language:
- en
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: unarXive IMRaD classification
size_categories:
- 100K<n<1M
tags:
- arXiv.org
- arXiv
- IMRaD
- publication
- paper
- preprint
- section
- physics
- mathematics
- computer science
- cs
task_categories:
- text-classification
task_ids:
- multi-class-classification
source_datasets:
- extended|10.5281/zenodo.7752615
dataset_info:
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_bytes: 451908280
    num_examples: 520053
  - name: test
    num_bytes: 4650429
    num_examples: 5000
  - name: validation
    num_bytes: 4315597
    num_examples: 5001
  download_size: 482376743
  dataset_size: 460874306
---

# Dataset Card for unarXive IMRaD classification

## Dataset Description

* **Homepage:** [https://github.com/IllDepence/unarXive](https://github.com/IllDepence/unarXive)
* **Paper:** [unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network](https://arxiv.org/abs/2303.14957)

### Dataset Summary

The unarXive IMRaD classification dataset contains about 530k paragraphs from computer science papers, each labeled with the IMRaD section it originates from. The paragraphs are derived from [unarXive](https://github.com/IllDepence/unarXive).

The dataset can be used as follows.

```
from datasets import load_dataset

imrad_data = load_dataset('saier/unarXive_imrad_clf')
imrad_data = imrad_data.class_encode_column('label')  # assign target label column
imrad_data = imrad_data.remove_columns('_id')         # remove sample ID column
```

## Dataset Structure

### Data Instances

Each data instance contains the paragraph's text and one of the labels `'i'`, `'m'`, `'r'`, `'d'`, or `'w'` (for Introduction, Methods, Results, Discussion, and Related Work, respectively). An example is shown below.

```
{'_id': '789f68e7-a1cc-4072-b07d-ecffc3e7ca38',
 'label': 'm',
 'text': 'To link the mentions encoded by BERT to the KGE entities, we define '
         'an entity linking loss as cross-entropy between self-supervised '
         'entity labels and similarities obtained from the linker in KGE '
         'space:\n'
         '\\(\\mathcal {L}_{EL}=\\sum -\\log \\dfrac{\\exp (h_m^{proj}\\cdot '
         '\\textbf {e})}{\\sum _{\\textbf {e}_j\\in \\mathcal {E}} \\exp '
         '(h_m^{proj}\\cdot \\textbf {e}_j)}\\) \n'}
```

### Data Splits

The data is split into training, development, and testing data as follows. The counts follow the `dataset_info` metadata above.

* Training (`train`): 520,053 instances
* Development (`validation`): 5,001 instances
* Testing (`test`): 5,000 instances

## Dataset Creation

### Source Data

The paragraph texts are extracted from the data set [unarXive](https://github.com/IllDepence/unarXive).

#### Who are the source language producers?

The paragraphs were written by the authors of the arXiv papers. The file `license_info.jsonl` contains author and text licensing information for all samples. An example is shown below.

```
{'authors': 'Yusuke Sekikawa, Teppei Suzuki',
 'license': 'http://creativecommons.org/licenses/by/4.0/',
 'paper_arxiv_id': '2011.09852',
 'sample_ids': ['cc375518-347c-43d0-bfb2-f88564d66df8',
                '18dc073e-a48e-488e-b34c-e5fc3cb8a4ca',
                '0c2e89b3-d863-4bc2-9e11-8f6c48d867cb',
                'd85e46cf-b11d-49b6-801b-089aa2dd037d',
                '92915cea-17ab-4a98-aad2-417f6cdd53d2',
                'e88cb422-47b7-4f69-9b0b-fbddf8140d98',
                '4f5094a4-0e6e-46ae-a34d-e15ce0b9803c',
                '59003494-096f-4a7c-ad65-342b74eed561',
                '6a99b3f5-217e-4d3d-a770-693483ef8670']}
```

### Annotations

Class labels were automatically determined ([see implementation](https://github.com/IllDepence/unarXive/blob/master/src/utility_scripts/ml_tasks_prep_data.py)).

## Considerations for Using the Data

### Discussion and Biases

Because only paragraphs unambiguously assignable to one of the IMRaD classes were used, a certain selection bias is to be expected in the data.

### Other Known Limitations

Depending on authors' writing styles as well as LaTeX processing quirks, paragraphs can vary significantly in length.

## Additional Information

### Licensing information

The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 license.

### Citation Information

```
@inproceedings{Saier2023unarXive,
  author    = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title     = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle = {Proceedings of the 23rd ACM/IEEE Joint Conference on Digital Libraries},
  year      = {2023},
  series    = {JCDL '23}
}
```

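Provided by:

saier

Summary of original information

Dataset Card for unarXive IMRaD classification

Dataset Description

Dataset Summary

The unarXive IMRaD classification dataset contains about 530k paragraphs from computer science papers, each labeled with the IMRaD section it originates from. The paragraphs are derived from unarXive.

Dataset Structure

Data Instances

Each data instance contains the paragraph's text and one of the labels i, m, r, d, or w (for Introduction, Methods, Results, Discussion, and Related Work, respectively). An example: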

```json
{
  "_id": "789f68e7-a1cc-4072-b07d-ecffc3e7ca38",
  "label": "m",
  "text": "To link the mentions encoded by BERT to the KGE entities, we define an entity linking loss as cross-entropy between self-supervised entity labels and similarities obtained from the linker in KGE space:\n\\(\\mathcal {L}_{EL}=\\sum -\\log \\dfrac{\\exp (h_m^{proj}\\cdot \\textbf {e})}{\\sum _{\\textbf {e}_j\\in \\mathcal {E}} \\exp (h_m^{proj}\\cdot \\textbf {e}_j)}\\) \n"
}
```
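To make the single-letter label codes easier to work with, a small lookup table can translate them into section names. The following is a minimal sketch rather than part of the dataset tooling: the names `LABEL_NAMES` and `describe` are hypothetical, and the example text is shortened for display.

```python
# Hypothetical helper: map the single-letter codes from the card to section names.
LABEL_NAMES = {
    "i": "Introduction",
    "m": "Methods",
    "r": "Results",
    "d": "Discussion",
    "w": "Related Work",
}


def describe(instance: dict) -> str:
    """Return a one-line summary: section name plus the start of the text."""
    section = LABEL_NAMES[instance["label"]]
    lines = instance["text"].strip().splitlines()
    preview = lines[0][:60] if lines else ""
    return f"[{section}] {preview}"


# Instance modeled on the card's example; the text is shortened here.
example = {
    "_id": "789f68e7-a1cc-4072-b07d-ecffc3e7ca38",
    "label": "m",
    "text": "To link the mentions encoded by BERT to the KGE entities, we define",
}
print(describe(example))
```

An unknown label code raises a `KeyError`, which is usually preferable to silently mislabeling a paragraph.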

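Data Splits

The data is split into training, development, and test sets as follows. The counts follow the dataset_info metadata of the dataset card.

Training set (train): 520,053 instances
Development set (validation): 5,001 instances
Test set (test): 5,000 instances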
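As a quick arithmetic check on the "530k" figure, the split sizes can be summed and turned into proportions. This sketch uses the per-split counts from the card's `dataset_info` metadata; the variable names are illustrative only.

```python
# Per-split example counts as given in the card's dataset_info metadata.
SPLIT_SIZES = {"train": 520_053, "validation": 5_001, "test": 5_000}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

print(f"total: {total:,}")  # total: 530,054
for name, share in shares.items():
    print(f"{name:>10}: {share:.2%}")
```

The held-out splits together make up just under 2% of the data, so evaluation on them is cheap relative to training.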

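Dataset Creation

Source Data

The paragraph texts are extracted from the unarXive data set.

Source Language Producers

The paragraphs were written by the authors of the arXiv papers. The file license_info.jsonl contains author and text licensing information for all samples. An example: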

```json
{
  "authors": "Yusuke Sekikawa, Teppei Suzuki",
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "paper_arxiv_id": "2011.09852",
  "sample_ids": [
    "cc375518-347c-43d0-bfb2-f88564d66df8",
    "18dc073e-a48e-488e-b34c-e5fc3cb8a4ca",
    "0c2e89b3-d863-4bc2-9e11-8f6c48d867cb",
    "d85e46cf-b11d-49b6-801b-089aa2dd037d",
    "92915cea-17ab-4a98-aad2-417f6cdd53d2",
    "e88cb422-47b7-4f69-9b0b-fbddf8140d98",
    "4f5094a4-0e6e-46ae-a34d-e15ce0b9803c",
    "59003494-096f-4a7c-ad65-342b74eed561",
    "6a99b3f5-217e-4d3d-a770-693483ef8670"
  ]
}
```
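Because each record lists several sample IDs, looking up the license of a single paragraph needs an inverted index. The sketch below assumes that `license_info.jsonl` holds one valid JSON object per line with the four fields shown above; the function name `build_license_index` is hypothetical, and the inline record is shortened to two sample IDs.

```python
import json
from typing import Dict, Iterable


def build_license_index(lines: Iterable[str]) -> Dict[str, dict]:
    """Map each sample ID to the author and license info of its source paper."""
    index: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        info = {
            "authors": record["authors"],
            "license": record["license"],
            "paper_arxiv_id": record["paper_arxiv_id"],
        }
        for sample_id in record["sample_ids"]:
            index[sample_id] = info
    return index


# Inline stand-in for one line of the file (sample_ids shortened to two entries).
sample_line = json.dumps({
    "authors": "Yusuke Sekikawa, Teppei Suzuki",
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "paper_arxiv_id": "2011.09852",
    "sample_ids": [
        "cc375518-347c-43d0-bfb2-f88564d66df8",
        "18dc073e-a48e-488e-b34c-e5fc3cb8a4ca",
    ],
})
index = build_license_index([sample_line])
print(index["cc375518-347c-43d0-bfb2-f88564d66df8"]["license"])
```

With a local copy of the file, the same function accepts an open file handle directly, since iterating over a text file yields its lines.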

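Annotations

Class labels were determined automatically (see the implementation at https://github.com/IllDepence/unarXive/blob/master/src/utility_scripts/ml_tasks_prep_data.py).

Considerations for Using the Data

Discussion and Biases

Because only paragraphs that could be unambiguously assigned to one of the IMRaD classes were used, a certain selection bias is to be expected in the data.

Other Known Limitations

Depending on authors' writing styles and LaTeX processing quirks, paragraph lengths can vary significantly.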
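Since paragraph lengths vary widely, it can help to inspect the length distribution before choosing a maximum input length for a model. The following is a rough sketch that counts whitespace-separated tokens; the helpers `length_summary` and `truncate` and the three toy paragraphs are invented for illustration and are not taken from the dataset.

```python
from statistics import median
from typing import Dict, Iterable


def length_summary(texts: Iterable[str]) -> Dict[str, float]:
    """Summarize paragraph lengths in whitespace-separated tokens."""
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        raise ValueError("no texts given")
    return {"min": lengths[0], "median": median(lengths), "max": lengths[-1]}


def truncate(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])


# Toy paragraphs (hypothetical, not from the dataset).
paragraphs = [
    "We evaluate on three benchmarks.",
    "Short.",
    "Prior work on citation analysis has largely focused on abstracts "
    "and metadata rather than full text.",
]
summary = length_summary(paragraphs)
print(summary)
print(truncate(paragraphs[2], 3))
```

Whitespace tokens are only a proxy: a model's subword tokenizer will usually produce more tokens, especially for the LaTeX fragments that appear in the texts.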

Additional Information

Licensing Information

The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 license.

Citation Information

```plaintext
@inproceedings{Saier2023unarXive,
  author    = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title     = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle = {Proceedings of the 23rd ACM/IEEE Joint Conference on Digital Libraries},
  year      = {2023},
  series    = {JCDL '23}
}
```

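Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

This is a text classification dataset of academic paper paragraphs. It contains about 530k paragraphs extracted from arXiv computer science papers, each labeled with its IMRaD section (Introduction, Methods, Results, Discussion, or Related Work). The dataset is built on unarXive and targets multi-class classification, with the goal of training models that automatically identify which section a paragraph belongs to.

The content above was collected and summarized by 遇见数据集.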
