Taiga Corpus (An open-source corpus for machine learning.)

Name: Taiga Corpus (An open-source corpus for machine learning.)
Creator: OpenDataLab
Published: 2026-05-24 06:30:11
License: No description provided

OpenDataLab, updated 2026-05-24, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/Taiga_Corpus

Resource description:

Taiga is a corpus in which text sources and their metadata are collected with popular machine learning tasks in mind. Each text is stored as plain text with morphological and syntactic annotation (produced with UDPipe, with automatic homonymy resolution) and with metadata such as date, topic, authorship, and text difficulty; the available fields depend on the source. The corpus currently contains about 5 billion words: 77% literary texts (33 literary magazines), 19% naive poetry, 2% news (4 popular websites), and 2% other material (popular science, cultural magazines, social networks, and amateur poetry and prose). Documentation and a detailed breakdown are provided.
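The percentage shares quoted above can be turned into rough word counts per category. The following is a minimal sketch, assuming the rounded total of 5 billion words; the names TOTAL_WORDS, SHARES_PERCENT, and approximate_word_counts are illustrative and not part of the dataset or its tooling.

```python
# Rough per-category word counts derived from the shares in the description.
# TOTAL_WORDS is the rounded "about 5 billion" figure, so results are estimates.
TOTAL_WORDS = 5_000_000_000

# Category labels are shorthand chosen for this sketch.
SHARES_PERCENT = {
    "literary texts": 77,
    "poetry": 19,   # the 19% poetry category
    "news": 2,
    "other": 2,
}


def approximate_word_counts(total, shares_percent):
    """Return {category: estimated word count} using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in shares_percent.items()}


counts = approximate_word_counts(TOTAL_WORDS, SHARES_PERCENT)
for name, words in counts.items():
    print(f"{name}: ~{words / 1e9:.2f} billion words")
```

Because the total is itself an approximation, these figures are only order-of-magnitude guides, for example when estimating storage or sampling budgets per category.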

Provider:

OpenDataLab

Created:

2022-05-24

Collection summary

Dataset introduction