meliascosta/wiki_academic_subjects
Hugging Face · Updated 2022-12-05 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/meliascosta/wiki_academic_subjects
---
license: cc-by-3.0
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- crowdsourced
multilinguality:
- monolingual
paperswithcode_id: wikitext-2
pretty_name: Wikipedia Outline of Academic Disciplines
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- hierarchical
- academic
- tree
- dag
- topics
- subjects
task_categories:
- text-classification
task_ids:
- multi-label-classification
---
# Dataset Card for Wiki Academic Disciplines
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
This dataset was created from the January 2022 dump of the [English Wikipedia](https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia).
The main goal was to train a hierarchical classifier of academic subjects using [HiAGM](https://github.com/Alibaba-NLP/HiAGM).
### Supported Tasks and Leaderboards
Text classification (multi-label, over a label hierarchy). There is no leaderboard at the moment.
### Languages
English
## Dataset Structure
The dataset consists of groups of labeled text chunks (tokenized on spaces, with stopwords removed).
Labels are organized in a hierarchy of academic subjects (a DAG with a special Root node).
Nodes correspond to entries in the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines) article from Wikipedia.
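
The preprocessing described above (whitespace tokenization followed by stopword removal) can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the card does not say which stopword list was applied or whether matching was case-sensitive, so `STOPWORDS` and the lowercase comparison are assumptions.

```python
from typing import List

# Hypothetical stopword list; the card does not state which list was used.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Split on whitespace and drop stopwords (case-insensitive match, an assumption)."""
    return [tok for tok in text.split() if tok.lower() not in STOPWORDS]


chunk = "Topology is the study of properties preserved under deformation"
tokens = preprocess(chunk)
print(tokens)
```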
### Data Instances
Data is split into train/test/val sets, each in a separate `.jsonl` file. The label hierarchy is listed as a TAB-separated adjacency list in a `.taxonomy` file.
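
As an illustration of how such an adjacency list can be read, the sketch below assumes each line has the form `parent<TAB>child1<TAB>child2...`, which is the convention used by HiAGM taxonomy files; the card itself does not spell out the line layout. The sample lines and label names are made up for the example. Because the hierarchy is a DAG, a label can have more than one parent, so ancestors are collected as a set.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def parse_taxonomy(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a parent -> children mapping from TAB-separated adjacency lines."""
    children = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank lines and parents listed without children
        children[parts[0]].extend(parts[1:])
    return dict(children)


def ancestors(label: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return every node that can reach `label` by following parent -> child edges."""
    parents = defaultdict(set)
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].add(parent)
    seen: Set[str] = set()
    stack = [label]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Illustrative lines only; real label names come from the .taxonomy file.
sample = [
    "Root\tHumanities\tNatural sciences\n",
    "Natural sciences\tPhysics\tBiology\n",
    "Physics\tBiophysics\n",
    "Biology\tBiophysics\n",
]
taxonomy = parse_taxonomy(sample)
print(sorted(ancestors("Biophysics", taxonomy)))
```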
### Data Fields
JSONL files contain only two fields: a `token` field, which holds the text tokens, and a `label` field, which holds the list of labels for that text.
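
A minimal sketch of reading such records and turning the label lists into multi-hot vectors for multi-label classification is shown below. It assumes `token` is a JSON list of strings (as in HiAGM-style inputs); the sample records and label names are invented for illustration and are not taken from the dataset.

```python
import json
from typing import Dict, Iterable, List, Tuple

# Made-up records in the two-field layout described above.
sample_lines = [
    '{"token": ["quantum", "mechanics", "describes", "nature"], "label": ["Natural sciences", "Physics"]}',
    '{"token": ["renaissance", "painting", "italy"], "label": ["Humanities", "Arts"]}',
]


def read_records(lines: Iterable[str]) -> List[Tuple[List[str], List[str]]]:
    """Parse JSONL lines into (tokens, labels) pairs, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        records.append((obj["token"], obj["label"]))
    return records


def multi_hot(labels: List[str], label_index: Dict[str, int]) -> List[int]:
    """Encode a list of labels as a 0/1 vector over the full label set."""
    vec = [0] * len(label_index)
    for name in labels:
        vec[label_index[name]] = 1
    return vec


records = read_records(sample_lines)
all_labels = sorted({name for _, labels in records for name in labels})
label_index = {name: i for i, name in enumerate(all_labels)}
print(multi_hot(records[0][1], label_index))
```

With a real file, the same function can be fed an open file handle, for example `read_records(open(path, encoding="utf-8"))`, where `path` points at one of the `.jsonl` splits.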
### Data Splits
The data follows an 80/10/10 train/test/val split.
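
For readers who want to reproduce a split with the same proportions on their own data, one possible approach is a seeded shuffle followed by slicing. The card does not say how the original split was produced (random, stratified, or otherwise), so the function below is only an assumed sketch, and `split_80_10_10` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple


def split_80_10_10(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle deterministically, then slice into 80% train, 10% test, 10% val."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_test = n // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # remainder, so no item is dropped
    return train, test, val


train, test, val = split_80_10_10(range(1000))
print(len(train), len(test), len(val))
```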
## Dataset Creation
All texts were extracted by following the articles linked from the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines).
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
The English Wikipedia dump of January 2022.
#### Who are the source language producers?
Wikipedia community.
### Annotations
#### Annotation process
Texts were automatically assigned to their linked academic discipline.
#### Who are the annotators?
Wikipedia Community.
### Personal and Sensitive Information
All information is public.
## Considerations for Using the Data
### Social Impact of Dataset
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Creative Commons 3.0 (see [Wikipedia:Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights))
### Citation Information
1. Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
### Contributions
Thanks to [@meliascosta](https://github.com/meliascosta) for adding this dataset.
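Provider:
meliascosta
Summary of Original Information
Dataset Overview
Dataset Name
- Name: Wikipedia Outline of Academic Disciplines
- Alias: Wiki Academic Disciplines
Basic Dataset Information
- License: CC-BY-3.0
- Language: English
- Multilinguality: monolingual
- Size: 10K<n<100K
- Source: original data
- Task category: text classification
- Task ID: multi-label classification
Dataset Description
- Summary: The dataset is derived from the January 2022 English Wikipedia dump and is mainly intended for training a hierarchical classifier of academic subjects using HiAGM.
- Supported tasks: text classification; no leaderboard at the moment.
- Structure: The dataset contains labeled text chunks; labels are organized in a hierarchy of academic subjects (a DAG with a special Root node).
Dataset Structure
- Data instances: Data is split into train/test/validation sets, each stored in a `.jsonl` file; the label hierarchy is stored in a `.taxonomy` file.
- Data fields: JSONL files contain two fields: "token" (the text tokens) and "label" (the list of labels for the text).
- Data splits: 80/10/10 train/test/validation.
Dataset Creation
- Source data: Extracted from the Wikipedia dump by following the links in the outline of academic disciplines article.
- Annotation process: Texts are automatically assigned to their linked academic discipline.
- Annotators: Wikipedia community.
Licensing Information
- License: Creative Commons 3.0 (see Wikipedia's copyright information)
Citation Information
- Reference: Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.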



