TimSchopf/nlp_taxonomy_data

Name: TimSchopf/nlp_taxonomy_data
Creator: TimSchopf
Published: 2024-02-17 22:08:29
License: MIT

Hugging Face | Updated 2024-02-17 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/TimSchopf/nlp_taxonomy_data

Resource description:

---
license: mit
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: classification_labels
    sequence: string
  - name: numerical_classification_labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 235500446
    num_examples: 178521
  - name: test
    num_bytes: 1175810
    num_examples: 828
  download_size: 116387254
  dataset_size: 236676256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-classification
language:
- en
pretty_name: NLP Taxonomy Data
size_categories:
- 100K<n<1M
---

# NLP Taxonomy Classification Data

The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from the [NLP taxonomy](#nlp-taxonomy). Each sample is annotated with all possible lower-level concepts and their hypernyms in the [NLP taxonomy](#nlp-taxonomy). The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.

📄 Paper: [Exploring the Landscape of Natural Language Processing Research (RANLP 2023)](https://aclanthology.org/2023.ranlp-1.111)

💻 Code: [https://github.com/sebischair/Exploring-NLP-Research](https://github.com/sebischair/Exploring-NLP-Research)

🤗 Model: [https://huggingface.co/TimSchopf/nlp_taxonomy_classifier](https://huggingface.co/TimSchopf/nlp_taxonomy_classifier)

<a name="#nlp-taxonomy"></a>

## NLP Taxonomy

![NLP taxonomy](https://github.com/sebischair/Exploring-NLP-Research/blob/main/figures/NLP-Taxonomy.jpg?raw=true)

## Citation information

When citing our work in academic papers and theses, please use this BibTeX entry:

```
@inproceedings{schopf-etal-2023-exploring,
    title = "Exploring the Landscape of Natural Language Processing Research",
    author = "Schopf, Tim and Arabi, Karim and Matthes, Florian",
    editor = "Mitkov, Ruslan and Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.111",
    pages = "1034--1045",
    abstract = "As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research remains absent. Contributing to closing this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments in NLP, summarize our findings, and highlight directions for future work.",
}
```

Provided by:

TimSchopf

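Summary of original information

NLP Taxonomy Classification Data

Dataset overview

The dataset contains titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from the NLP taxonomy. Each sample is annotated with all applicable lower-level concepts in the NLP taxonomy and their hypernyms.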

Dataset features

id: string
title: string
abstract: string
classification_labels: sequence of strings
numerical_classification_labels: sequence of integers (int64)
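
The feature list above can be mirrored in a small typed record, which makes it easier to see how one sample becomes a model input and a training target. The following is a minimal sketch, not the authors' preprocessing: the helper names `to_model_input` and `to_multi_hot` are hypothetical, the example record is made up, and it assumes that `numerical_classification_labels` holds zero-based class indices into a fixed label space, which the listing does not state.

```python
from typing import List, TypedDict


class PaperSample(TypedDict):
    """One record, mirroring the five features listed above."""
    id: str
    title: str
    abstract: str
    classification_labels: List[str]
    numerical_classification_labels: List[int]


def to_model_input(sample: PaperSample, sep: str = " ") -> str:
    """Join title and abstract into a single input string."""
    return f"{sample['title'].strip()}{sep}{sample['abstract'].strip()}"


def to_multi_hot(label_ids: List[int], num_classes: int) -> List[int]:
    """Turn a list of class indices into a fixed-length 0/1 vector.

    Assumes zero-based indices (an assumption, not confirmed by the source).
    """
    vec = [0] * num_classes
    for i in label_ids:
        if not 0 <= i < num_classes:
            raise ValueError(f"label id {i} outside [0, {num_classes})")
        vec[i] = 1
    return vec


# Made-up record for illustration only; not taken from the dataset.
example: PaperSample = {
    "id": "demo-0001",
    "title": "A Toy Title ",
    "abstract": " A toy abstract.",
    "classification_labels": ["Label A", "Label B"],
    "numerical_classification_labels": [0, 3],
}
```

With the toy record, `to_model_input(example)` yields `"A Toy Title A toy abstract."` and `to_multi_hot([0, 3], 5)` yields `[1, 0, 0, 1, 0]`. The real number of classes would have to be taken from the taxonomy rather than hard-coded.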

Dataset splits

train: 178,521 samples, 235,500,446 bytes
test: 828 samples, 1,175,810 bytes

Dataset size

Download size: 116,387,254 bytes
Dataset size: 236,676,256 bytes
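
The reported figures are mutually consistent: the byte counts of the two splits add up exactly to the dataset size. The short calculation below checks this and derives a few ratios that help with planning storage and evaluation. All constants are copied from the listing; the variable names are only illustrative.

```python
# Figures as reported in the listing.
TRAIN_BYTES = 235_500_446
TEST_BYTES = 1_175_810
DATASET_BYTES = 236_676_256
DOWNLOAD_BYTES = 116_387_254
TRAIN_EXAMPLES = 178_521
TEST_EXAMPLES = 828

# The two splits together account for the full dataset size.
total_bytes = TRAIN_BYTES + TEST_BYTES
assert total_bytes == DATASET_BYTES

# Ratio of the download size to the dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

# Share of all samples that belong to the manually annotated test split.
test_share = TEST_EXAMPLES / (TRAIN_EXAMPLES + TEST_EXAMPLES)

# Average size of one training record in bytes.
avg_train_bytes = TRAIN_BYTES / TRAIN_EXAMPLES

print(f"download/dataset ratio: {download_ratio:.3f}")
print(f"test share of samples:  {test_share:.4%}")
print(f"avg bytes per train record: {avg_train_bytes:.0f}")
```

The download is a little under half the dataset size, and the test split holds well under 1% of all samples, which is one reason a held-out validation set from the training data is useful.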

Configuration

default:
- train: data/train-*
- test: data/test-*
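
The configuration above maps each split to a glob pattern over the repository's data files. The sketch below shows how such patterns route files to splits and how both splits could be loaded with the Hugging Face `datasets` library. It rests on a few assumptions: `split_for_path` and `load_splits` are hypothetical helper names, `fnmatch` only approximates how the Hub resolves patterns, and `load_splits` needs network access plus `pip install datasets`.

```python
from fnmatch import fnmatch
from typing import Optional

DATASET_ID = "TimSchopf/nlp_taxonomy_data"

# Split -> glob pattern, as listed in the default configuration.
DATA_FILES = {"train": "data/train-*", "test": "data/test-*"}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_splits():
    """Download both splits from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset(DATASET_ID)  # the default config is used implicitly
    return dataset["train"], dataset["test"]
```

For a hypothetical shard name such as `data/train-00000-of-00001.parquet`, `split_for_path` returns `"train"`; a file outside `data/` matches neither pattern and returns `None`.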

Task category

Text classification

Language

English

Dataset name

NLP Taxonomy Data

Dataset scale

100K<n<1M

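Curated summary

Dataset introduction

Construction

In NLP research, systematically constructed classification datasets are important for understanding how the discipline is organized. This dataset was built by extracting titles and abstracts from the academic literature and annotating them according to a purpose-built NLP taxonomy. The training set contains 178,521 weakly annotated samples, while the test set consists of 828 papers from the EMNLP22 conference, all manually annotated, which supports reliable evaluation.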

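Characteristics

The dataset's core feature is its multi-level annotation scheme: each sample is labeled not only with specific low-level research concepts but also with their hypernyms in the taxonomy, so the labels form a complete semantic hierarchy. The dataset covers a broad range of NLP research directions, from foundational theory to applied techniques, giving researchers multiple classification perspectives. The test set is limited to papers from a single conference, but the training set complements it and reflects the wider diversity of NLP research.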
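
The hierarchical labeling described above (each low-level concept plus all of its hypernyms) amounts to taking the ancestor closure of a label set in the taxonomy tree. The sketch below illustrates that idea only: `TOY_PARENTS` and its label names are placeholders invented for this example, not the real NLP taxonomy, and `with_hypernyms` is a hypothetical helper rather than the authors' annotation code.

```python
from typing import Dict, Iterable, List, Optional

# Toy child -> parent map. These names are placeholders, NOT the real taxonomy.
TOY_PARENTS: Dict[str, Optional[str]] = {
    "Root Field": None,
    "Mid Field": "Root Field",
    "Leaf Topic A": "Mid Field",
    "Leaf Topic B": "Root Field",
}


def with_hypernyms(
    labels: Iterable[str], parents: Dict[str, Optional[str]]
) -> List[str]:
    """Return the given labels plus all of their ancestors, sorted and deduplicated."""
    closed = set()
    for label in labels:
        node: Optional[str] = label
        # Walk upward; stop early if this node (and hence its ancestors) is already in.
        while node is not None and node not in closed:
            closed.add(node)
            node = parents.get(node)
    return sorted(closed)
```

For example, `with_hypernyms(["Leaf Topic A"], TOY_PARENTS)` returns `["Leaf Topic A", "Mid Field", "Root Field"]`. Labels that do not appear in the parent map are kept as they are.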

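Usage

In practice, the dataset can be used directly to train and evaluate models for multi-label text classification. Paper titles and abstracts serve as input features and the corresponding classification labels as prediction targets, with the model built in a deep learning framework. For a more robust evaluation, it is advisable to hold out a validation set from the training data that includes all classes, to compensate for the test set's possibly incomplete class coverage. A trained classifier has been released on Hugging Face (TimSchopf/nlp_taxonomy_classifier), which eases reproduction and extension.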
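
The advice to hold out a validation set that includes all classes can be sketched as a two-pass selection: first pick samples greedily until every label is covered, then top up with random samples to reach the desired size. This is one possible approach under stated assumptions, not a method from the source; the function name `split_with_full_coverage` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def split_with_full_coverage(
    label_sets: Sequence[Sequence[int]],
    val_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, val_indices) such that every label occurring in
    `label_sets` occurs in at least one validation sample."""
    rng = random.Random(seed)
    order = list(range(len(label_sets)))
    rng.shuffle(order)

    val: List[int] = []
    covered: Set[int] = set()
    # Pass 1: take any sample that contributes a label not yet covered.
    for idx in order:
        if not set(label_sets[idx]) <= covered:
            val.append(idx)
            covered.update(label_sets[idx])

    # Pass 2: top up with random samples until the target size is reached.
    target = max(len(val), round(val_fraction * len(label_sets)))
    chosen = set(val)
    for idx in order:
        if len(val) >= target:
            break
        if idx not in chosen:
            val.append(idx)
            chosen.add(idx)

    train = [i for i in order if i not in chosen]
    return sorted(train), sorted(val)
```

One caveat of this simple scheme: a label that occurs in only a single sample ends up in the validation set alone and is then absent from the remaining training data, so very rare classes may need separate handling.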

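Background and challenges

Background

Against the backdrop of rapid growth in natural language processing (NLP), systematically classifying the research literature and analyzing its trends has become a key need. The TimSchopf/nlp_taxonomy_data dataset was created in 2023 by a research team at the Technical University of Munich to organize the research topics of NLP papers in the ACL Anthology into a structured body of knowledge. It is based on paper titles and abstracts annotated with a multi-level label taxonomy. It maps out the NLP research landscape and provides a data foundation for later work on trend prediction and cross-field studies, making it valuable for meta-research on NLP.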

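Current challenges

The dataset addresses multi-label classification of NLP research literature and the construction of a knowledge structure for the field. Its core challenge is accurately capturing the complex relationships among rapidly evolving NLP subfields. Construction faced annotation-consistency problems: the training set is weakly annotated and may contain noise, and the test set is drawn only from EMNLP22 papers, so some rare classes are not covered. In addition, mapping unstructured academic text onto an evolving taxonomy requires balancing the granularity of the concept hierarchy against label sparsity, which places higher demands on a model's ability to generalize.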
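
The coverage gap and label sparsity mentioned above can be made concrete by counting labels per split and listing the classes that occur in the training data but never in the test data. The helpers below are a small illustrative sketch; the function names are hypothetical and the toy inputs in the usage note are made up.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def label_counts(label_lists: Iterable[Sequence[str]]) -> Counter:
    """Count in how many samples each label occurs."""
    counts: Counter = Counter()
    for labels in label_lists:
        counts.update(set(labels))  # count each label once per sample
    return counts


def labels_missing_from_test(
    train_labels: Iterable[Sequence[str]],
    test_labels: Iterable[Sequence[str]],
) -> List[str]:
    """Labels seen in train but never in test, rarest in train first."""
    train_counts = label_counts(train_labels)
    test_seen = set(label_counts(test_labels))
    missing = [label for label in train_counts if label not in test_seen]
    return sorted(missing, key=lambda label: (train_counts[label], label))
```

With toy inputs `train = [["A", "B"], ["A"], ["C"], ["B", "D"]]` and `test = [["A"], ["B"]]`, the call `labels_missing_from_test(train, test)` returns `["C", "D"]`. On the real data, the inputs would be the `classification_labels` columns of the two splits.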

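Common scenarios

Classic use case

In NLP research, systematically classifying the academic literature is key to understanding how the discipline evolves. By combining the titles and abstracts of a large number of NLP-related papers with multi-level annotations from a purpose-built NLP taxonomy, the dataset gives researchers a standardized text classification benchmark. Its classic use is training and evaluating multi-label classification models that automatically identify the fine-grained research fields a paper belongs to, which supports automated archiving of academic literature and knowledge graph construction.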
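
Evaluating a multi-label classifier usually means comparing predicted and gold label sets per paper. The sketch below computes micro- and macro-averaged F1 in plain Python as one common choice; the source does not say which metrics the authors report, and the function name `micro_macro_f1` and the zero-division convention (F1 = 0.0 when a label has no true positives, false positives, or false negatives) are assumptions made for this example.

```python
from typing import Sequence, Set, Tuple


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]]
) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    labels = set().union(*gold, *pred)
    total_tp = total_fp = total_fn = 0
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label.append(_f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

    micro = _f1(total_tp, total_fp, total_fn)
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    return micro, macro
```

For the toy case `gold = [{"A", "B"}, {"A"}, {"C"}]` and `pred = [{"A"}, {"A", "B"}, {"C"}]`, micro-F1 is 0.75 and macro-F1 is 2/3, because label `B` is always wrong while `A` and `C` are always right. Macro-F1 weights rare classes as heavily as frequent ones, which matters when some classes are sparse.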

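Derived and related work

The most prominent works derived from this dataset are the accompanying NLP taxonomy ontology (an OWL file) and the pretrained classification model, which give later research directly usable tools and standards. Building on this taxonomy, the NLP-KG (Natural Language Processing Knowledge Graph) project turns the classification data into a structured knowledge network and has led to a range of applications and studies in areas such as academic knowledge discovery and cross-field association analysis.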

Recent research on the dataset