
meliascosta/wiki_academic_subjects

Source: Hugging Face. Updated 2022-12-05. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/meliascosta/wiki_academic_subjects
Resource description:
---
license: cc-by-3.0
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- crowdsourced
multilinguality:
- monolingual
paperswithcode_id: wikitext-2
pretty_name: Wikipedia Outline of Academic Disciplines
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- hierarchical
- academic
- tree
- dag
- topics
- subjects
task_categories:
- text-classification
task_ids:
- multi-label-classification
---

# Dataset Card for Wiki Academic Disciplines

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset was created from the January 2022 dump of the [English Wikipedia](https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia). The main goal was to train a hierarchical classifier of academic subjects using [HiAGM](https://github.com/Alibaba-NLP/HiAGM).

### Supported Tasks and Leaderboards

Text classification. There is no leaderboard at the moment.

### Languages

English

## Dataset Structure

The dataset consists of groups of labeled text chunks (tokenized by spaces, with stopwords removed). Labels are organized in a hierarchy of academic subjects (a DAG with a special Root node). Nodes correspond to entries in the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines) article from Wikipedia.

### Data Instances

The data is split into train, test, and validation sets, each in a separate `.jsonl` file. The label hierarchy is listed as a TAB-separated adjacency list in a `.taxonomy` file.

### Data Fields

The JSONL files contain only two fields: a "token" field, which holds the text tokens, and a "label" field, which holds a list of labels for that text.

### Data Splits

80/10/10 train/test/validation split.

## Dataset Creation

All texts were extracted by following the articles linked from the [outline of academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines) article.

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Wikipedia dump.

#### Who are the source language producers?

The Wikipedia community.

### Annotations

#### Annotation process

Texts were automatically assigned to the academic discipline they are linked from.

#### Who are the annotators?

The Wikipedia community.

### Personal and Sensitive Information

All information is public.

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Creative Commons 3.0 (see [Wikipedia:Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights))

### Citation Information

1. Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

### Contributions

Thanks to [@meliascosta](https://github.com/meliascosta) for adding this dataset.
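
The card describes the label hierarchy as a TAB-separated adjacency list forming a DAG under a special Root node, but it does not spell out the exact line layout. The sketch below assumes each line holds a parent label followed by its child labels, which is the layout HiAGM-style taxonomy files commonly use. The sample lines and the helper names `parse_taxonomy` and `ancestors` are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def parse_taxonomy(lines):
    """Map each parent label to the list of its child labels.

    Assumes one line per parent: parent<TAB>child1<TAB>child2...
    """
    children = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank lines and lines without children
        parent, kids = parts[0], parts[1:]
        children[parent].extend(kids)
    return dict(children)


def ancestors(label, children):
    """Return every label from which `label` can be reached.

    In a DAG a node may have several parents, so the result is a set.
    """
    parents = defaultdict(set)
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].add(parent)
    seen, stack = set(), [label]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical sample in the assumed layout (not real dataset content).
sample = [
    "Root\tNatural sciences\tFormal sciences\n",
    "Natural sciences\tPhysics\tBiology\n",
    "Formal sciences\tMathematics\n",
    "Physics\tBiophysics\n",
    "Biology\tBiophysics\n",
]

taxonomy = parse_taxonomy(sample)
biophysics_ancestors = ancestors("Biophysics", taxonomy)
```

In this sample, "Biophysics" has two parents, which is why the hierarchy is modeled as a DAG rather than a tree. With a real file, the same function could be called on an open file handle, since iterating a file yields lines.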
Provider:
meliascosta
Summary of original information

Dataset overview

Dataset name

  • Name: Wikipedia Outline of Academic Disciplines
  • Alias: Wiki Academic Disciplines

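Basic dataset information

  • License: CC-BY-3.0
  • Language: English
  • Multilinguality: monolingual
  • Size: 10K<n<100K
  • Source: original data
  • Task category: text classification
  • Task ID: multi-label classification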

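Dataset description

  • Summary: The dataset is derived from the January 2022 dump of the English Wikipedia. It was built mainly to train a hierarchical classifier of academic subjects with HiAGM.
  • Supported tasks: text classification; there is currently no leaderboard.
  • Structure: The dataset contains labeled text chunks. Labels are organized in a hierarchy of academic subjects (a DAG with a special Root node).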

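Dataset structure

  • Data instances: The data is split into train, test, and validation sets, each stored in its own .jsonl file. The label hierarchy is stored in a .taxonomy file.
  • Data fields: The JSONL files contain two fields: "token" (the text tokens) and "label" (the list of labels for the text).
  • Data splits: 80/10/10 train/test/validation split.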
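
Because each text carries a list of labels, a multi-label classifier usually needs those lists turned into fixed-length indicator vectors. The following is a minimal sketch under stated assumptions: the "token" field is taken to be a list of strings, the two sample records are invented for illustration, and the helper names `read_records` and `multi_hot` are hypothetical.

```python
import json


def read_records(lines):
    """Parse JSONL lines into (tokens, labels) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        records.append((obj["token"], obj["label"]))
    return records


def multi_hot(labels, label_index):
    """Encode a list of labels as a 0/1 vector over all known labels."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Invented sample records with the two fields described above.
sample_lines = [
    '{"token": ["quantum", "mechanics", "describes", "nature"], '
    '"label": ["Natural sciences", "Physics"]}',
    '{"token": ["group", "theory", "studies", "symmetry"], '
    '"label": ["Formal sciences", "Mathematics"]}',
]

records = read_records(sample_lines)

# Build a stable label-to-column mapping from the labels that occur.
all_labels = sorted({label for _, labels in records for label in labels})
label_index = {name: i for i, name in enumerate(all_labels)}

first_vector = multi_hot(records[0][1], label_index)
```

In practice the label vocabulary would more naturally come from the taxonomy file than from the records themselves, so that labels absent from one split still get a column.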

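Dataset creation

  • Source data: Extracted from the Wikipedia dump by following the links in the outline of academic disciplines article.
  • Annotation process: Texts were automatically assigned to the academic discipline they are linked from.
  • Annotators: The Wikipedia community.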

Licensing information

  • License: Creative Commons 3.0 (see Wikipedia's copyright information)

Citation information

  • Reference: Zhou, Jie, et al. "Hierarchy-aware global model for hierarchical text classification." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.