siimh/estonian_corpus_2021
Source: Hugging Face. Updated: 2024-10-25. Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/siimh/estonian_corpus_2021
Description:
This dataset contains two versions of the Estonian National Corpus 2021: morphologically tagged text (corpus_et.jsonl) and cleaned plain text (corpus_et_clean.jsonl). The dataset totals approximately 43 GB and contains about 196 million sentences, 2.4 billion words, 11.7 million documents, and 64.5 million paragraphs. It can be used for morphological analysis, natural language understanding, language model fine-tuning, and other natural language processing tasks.
Provider:
siimh
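
Because the files are in JSON Lines format and the corpus is roughly 43 GB, it is usually more practical to read it one line at a time than to load it whole. The following is a minimal sketch using only the Python standard library. It assumes the file has already been downloaded locally and that each line holds one JSON value; the field names inside each record are not stated in the description above, so the helper (hypothetical names `iter_jsonl` and `peek`) only parses records and makes no assumption about their schema.

```python
import json
from itertools import islice
from typing import Iterator, List


def iter_jsonl(path: str) -> Iterator[object]:
    """Yield one parsed JSON value per non-empty line, reading lazily."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek(path: str, n: int = 3) -> List[object]:
    """Return the first n records without reading the rest of the file."""
    return list(islice(iter_jsonl(path), n))


# Example usage (assumes corpus_et_clean.jsonl is in the working directory):
# for record in peek("corpus_et_clean.jsonl"):
#     print(record)
```

Inspecting a few records with `peek` first is a cheap way to learn the actual record layout before writing any processing code for the full corpus.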



