
elenaovv/igc-labeled

Source: Hugging Face. Updated: 2024-04-22. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/elenaovv/igc-labeled
Resource description:
---
language:
- is
task_categories:
- text-classification
---

**The Icelandic Gigaword Corpus** (IGC-2022)[1] is a diverse collection of Icelandic texts from 9 individual corpora. Each is available as raw text placed in \<p\> tags and as processed text with tokenization, POS tags, and lemmatization.

**Dataset Description**

This dataset was created for Icelandic text classification from the IGC-2022 corpus by parsing its XML files. The books category, which contains the fewest total tokens, was selected as a benchmark. The news category, which has the lowest average token count per entry, was chosen as another benchmark. Entries from books, journals, and law were segmented into chunks of approximately the same number of tokens, without splitting sentences:

- [Books (unannotated)](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/316) (**Label 0**): Contains texts from books that have been published in Iceland; **56,788 entries**.
- [Journals (unannotated)](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/245) (**Label 1**): Contains texts from scientific and scholarly journals or websites publishing scientific or scholarly articles; **57,728 entries**.
- [News (unannotated)](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/236) (**Label 2**): Contains texts from news media, both online and written, as well as some from TV and radio; **58,451 entries**.
- [Law (unannotated)](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/247) (**Label 3**): Contains texts from Icelandic legislative and legal documents; **54,376 entries**.

[1]: Barkarson, Starkaður; et al., 2022, Icelandic Gigaword Corpus (IGC-2022) - unannotated version, CLARIN-IS, http://hdl.handle.net/20.500.12537/253.
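The description says the XML files were parsed and that raw text sits in \<p\> tags. A minimal sketch of that extraction step follows, using only the Python standard library. The sample document, its TEI namespace, and the helper names `local_name` and `extract_paragraphs` are assumptions made for illustration; the actual IGC-2022 schema and the author's parsing code are not shown in this description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Real IGC-2022 files are far larger, and their
# exact schema (including this namespace) is an assumption here.
SAMPLE_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Fyrsta málsgrein.</p>
    <p>Önnur málsgrein.</p>
  </body></text>
</TEI>"""


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(xml_text):
    """Return the text of every <p> element, in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        if local_name(elem.tag) == "p":
            # itertext() also collects text nested inside child elements;
            # split/join collapses runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
```

Matching on the local tag name keeps the sketch independent of whichever namespace the files actually declare.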
Provider:
elenaovv
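Summary of original information

Dataset overview

Dataset name: The Icelandic Gigaword Corpus (IGC-2022)

Dataset source: composed of 9 individual Icelandic text corpora.

Data processing: available as raw text and as processed text with tokenization, POS tags, and lemmatization.

Dataset use: Icelandic text classification.

Dataset details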

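  • Books (unannotated) (Label 0)

    • Source: books published in Iceland.
    • Number of entries: 56,788.
  • Journals (unannotated) (Label 1)

    • Source: scientific and scholarly journals or websites.
    • Number of entries: 57,728.
  • News (unannotated) (Label 2)

    • Source: news media, both online and written, plus some TV and radio content.
    • Number of entries: 58,451.
  • Law (unannotated) (Label 3)

    • Source: Icelandic legislative and legal documents.
    • Number of entries: 54,376.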
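
The four labels and their entry counts can be written down as a small lookup table, which makes the class balance easy to check. The sketch below uses only the figures listed above; the short class names and the variable names are illustrative choices, not identifiers taken from the dataset itself.

```python
# Label ids and entry counts as listed above. The short names are
# illustrative, not column values taken from the dataset.
LABELS = {0: "books", 1: "journals", 2: "news", 3: "law"}
ENTRY_COUNTS = {0: 56_788, 1: 57_728, 2: 58_451, 3: 54_376}

total = sum(ENTRY_COUNTS.values())

# Share of the dataset held by each class.
shares = {LABELS[label]: count / total for label, count in ENTRY_COUNTS.items()}

# Accuracy of a classifier that always predicts the largest class.
majority_baseline = max(ENTRY_COUNTS.values()) / total

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts sum to 227,343 entries, and the classes are close to balanced, so a majority-class baseline sits only slightly above one quarter.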

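Dataset characteristics

  • All categories are taken from the unannotated version of the corpus.
  • Entries from books, journals, and law are segmented into chunks of approximately the same number of tokens, without splitting sentences.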
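
Segmenting text into chunks of similar token count without splitting sentences can be sketched as a greedy packing of whole sentences. This is an illustration under stated assumptions, not the author's method: the regex sentence splitter, whitespace tokenization, the `target_tokens` parameter, and the function names are all hypothetical, and the description does not state the actual chunk size or tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, target_tokens):
    """Greedily pack whole sentences into chunks of about target_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # whitespace tokens (an assumption)
        # Close the current chunk if adding this sentence would overshoot.
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Eitt tvö þrjú. Fjögur fimm. Sex sjö átta níu. Tíu."
chunks = chunk_text(text, target_tokens=5)
```

A sentence longer than `target_tokens` ends up in a chunk of its own, so chunk sizes are only approximately equal, which matches the wording of the description.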