meliascosta/wiki_academic_subjects

Name: meliascosta/wiki_academic_subjects
Creator: meliascosta
Published: 2022-12-05 20:16:02
License: No description available

Hugging Face · Updated 2022-12-05 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/meliascosta/wiki_academic_subjects


Resource description:

---
license: cc-by-3.0
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- crowdsourced
multilinguality:
- monolingual
paperswithcode_id: wikitext-2
pretty_name: Wikipedia Outline of Academic Disciplines
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- hierarchical
- academic
- tree
- dag
- topics
- subjects
task_categories:
- text-classification
task_ids:
- multi-label-classification
---

# Dataset Card for Wiki Academic Disciplines

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset was created from the [English Wikipedia](https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia) dump of January 2022. The main goal was to train a hierarchical classifier of academic subjects using [HiAGM](https://github.com/Alibaba-NLP/HiAGM).

### Supported Tasks and Leaderboards

Text classification. No leaderboard at the moment.

### Languages

English

## Dataset Structure

The dataset consists of groups of labeled text chunks (tokenized by spaces and with stopwords removed). Labels are organized in a hierarchy (a DAG with a special Root node) of academic subjects. Nodes correspond to entries in the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines) article from Wikipedia.

### Data Instances

Data is split into train/test/val, each in a separate `.jsonl` file. The label hierarchy is listed as a TAB-separated adjacency list in a `.taxonomy` file.

### Data Fields

JSONL files contain only two fields: a "token" field, which holds the text tokens, and a "label" field, which holds a list of labels for that text.

### Data Splits

80/10/10 TRAIN/TEST/VAL schema

## Dataset Creation

All texts were extracted by following the articles linked from the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines).

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Wiki Dump

#### Who are the source language producers?

Wikipedia community.

### Annotations

#### Annotation process

Texts were automatically assigned to their linked academic discipline.

#### Who are the annotators?

Wikipedia community.

### Personal and Sensitive Information

All information is public.

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Creative Commons 3.0 (see [Wikipedia:Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights))

### Citation Information

1. Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

### Contributions

Thanks to [@meliascosta](https://github.com/meliascosta) for adding this dataset.
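
The file layout described under Data Instances and Data Fields could be read along the following lines. This is only a minimal sketch, not the dataset author's loader: the function names `read_records` and `read_taxonomy` are hypothetical, the sample content is made up, and two details are assumptions the card does not confirm: that "token" may be either a list or a space-joined string, and that each taxonomy row lists a parent in the first column followed by its children.

```python
import io
import json
from collections import defaultdict


def read_records(lines):
    """Yield (tokens, labels) pairs from JSONL lines with "token" and "label" fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        tokens = record["token"]
        # Assumption: "token" may be a list or a space-joined string.
        if isinstance(tokens, str):
            tokens = tokens.split()
        yield tokens, list(record["label"])


def read_taxonomy(lines):
    """Parse a TAB-separated adjacency list into {parent: [children]}.

    Assumption: the first column of each row is the parent and the
    remaining columns are its children.
    """
    children = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # no children listed on this row
        parent, kids = parts[0], parts[1:]
        children[parent].extend(k for k in kids if k)
    return dict(children)


# Made-up sample content standing in for the real files.
sample_jsonl = io.StringIO(
    '{"token": ["group", "theory", "studies", "symmetry"], '
    '"label": ["Mathematics", "Algebra"]}\n'
)
sample_taxonomy = io.StringIO("Root\tMathematics\tPhysics\nMathematics\tAlgebra\n")

records = list(read_records(sample_jsonl))
taxonomy = read_taxonomy(sample_taxonomy)
```

With the real data, the same functions would take open file handles for the `.jsonl` and `.taxonomy` files in place of the `io.StringIO` objects; the card does not state the actual file names.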

Provider:

meliascosta

Summary of Original Information

Dataset Overview

Dataset Name

Name: Wikipedia Outline of Academic Disciplines
Alias: Wiki Academic Disciplines

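Basic Dataset Information

License: CC-BY-3.0
Language: English
Multilinguality: monolingual
Size: 10K<n<100K
Source: original data
Task category: text classification
Task ID: multi-label classification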

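Dataset Description

Summary: The dataset is derived from the January 2022 dump of English Wikipedia and is mainly intended for training a hierarchical classifier of academic subjects using HiAGM.
Supported tasks: text classification; no leaderboard at the moment.
Structure: The dataset contains labeled text chunks; labels are organized as a hierarchy of academic subjects (a DAG with a special Root node).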

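Dataset Structure

Data instances: Data is split into train/test/validation sets, each stored in a separate .jsonl file; the label hierarchy is stored in a .taxonomy file.
Data fields: JSONL files contain two fields: "token" (the text tokens) and "label" (a list of labels for the text).
Data splits: an 80/10/10 train/test/validation split.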
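
For illustration, an 80/10/10 train/test/validation split can be sketched as below. The source does not say how the split was actually produced, so the seeded random shuffle and the helper name `split_80_10_10` are assumptions made for this example only.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and return (train, test, val) in roughly 80/10/10 proportions."""
    shuffled = list(items)
    # Assumption: a seeded random shuffle; the original procedure is not documented.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_test = n // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # remainder goes to validation
    return train, test, val


train, test, val = split_80_10_10(range(100))
```

Any remainder from the integer division ends up in the validation part, so every item lands in exactly one split.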

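Dataset Creation

Source data: Extracted from the Wikipedia dump by following the links in the outline of academic disciplines article.
Annotation process: Texts were automatically assigned to their linked academic discipline.
Annotators: Wikipedia community.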

License Information

License: Creative Commons 3.0 (see Wikipedia's copyright information)

Citation Information

Citation: Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
