Taiga Corpus (An open-source corpus for machine learning.)
Source: OpenDataLab. Updated: 2026-05-24. Indexed: 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/Taiga_Corpus
Resource description:
Taiga is a corpus of text sources and their metadata, collected with popular machine learning tasks in mind. Each text is available as plain text and with morphological and syntactic annotation produced by UDPipe (with automatic homonymy resolution), together with metadata such as date, topic, authorship, and text difficulty; which fields are available depends on the source. The corpus currently contains approximately 5 billion words: 77% literary texts (33 literary magazines), 19% naive poetry, 2% news (4 popular websites), and 2% other material (popular science, cultural magazines, social networks, amateur poetry and prose). Documentation and a detailed breakdown are also provided.
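Since the description mentions UDPipe annotation, a short sketch of reading such annotation may help. UDPipe writes the CoNLL-U format (ten tab-separated columns per token, with blank lines between sentences). The example assumes the annotated files follow that layout; the `Token` class, the `read_conllu` helper, and the sample sentence are hypothetical illustrations and are not taken from the corpus or its documentation.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Token:
    """A subset of the ten CoNLL-U columns (illustrative, not a corpus API)."""
    id: str
    form: str
    lemma: str
    upos: str
    feats: str
    head: str
    deprel: str


def read_conllu(text: str) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U formatted text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Skip lines that do not have the ten expected columns.
            continue
        sentence.append(
            Token(cols[0], cols[1], cols[2], cols[3], cols[5], cols[6], cols[7])
        )
    if sentence:
        yield sentence


# Made-up sample sentence, built with explicit tabs so the columns survive copying.
SAMPLE_ROWS = [
    ["1", "The", "the", "DET", "_", "Definite=Def|PronType=Art", "2", "det", "_", "_"],
    ["2", "cat", "cat", "NOUN", "_", "Number=Sing", "3", "nsubj", "_", "_"],
    ["3", "sleeps", "sleep", "VERB", "_", "Number=Sing|Person=3|Tense=Pres", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = (
    "# text = The cat sleeps.\n"
    + "\n".join("\t".join(row) for row in SAMPLE_ROWS)
    + "\n\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok.form, tok.lemma, tok.upos, tok.deprel)
```

The parser keeps only the columns most often used for lemma, part-of-speech, and dependency features. For real files, an established CoNLL-U reader may be a safer choice than this hand-written loop.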
Provider:
OpenDataLab
Created:
2022-05-24
Collected summary
Dataset introduction

Background and challenges
Background overview
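Taiga Corpus is an open-source corpus for machine learning that collects text sources and their metadata, including morphological and syntactic annotation and attributes such as date and topic. It covers approximately 5 billion words, drawn mainly from literature, poetry, and news. The dataset was released in 2017 by the Institute of Linguistics of the Chinese Academy of Sciences and Saint Petersburg State University.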
The above content was collected and summarized by 遇见数据集.



