Russian language modeling dataset
Source: arXiv. Updated: 2020-05-06. Indexed: 2024-06-21.
Download link:
https://github.com/zeinsh/lenta_short_sentences
Description:
The Russian language modeling dataset was developed jointly by ITMO University and SPbPU. It contains 236,000 sentences randomly sampled from the Lenta news dataset, then preprocessed and filtered for quality. The dataset is intended to provide a standardized benchmark for research on Russian natural language generation, in particular for training and evaluating modern neural architectures such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Its applications include text generation, evaluation of grammatical correctness, and analysis of lexical diversity. It aims to address the shortage of high-quality datasets for Russian natural language processing.
Provided by:
Faculty of Software Engineering and Computer Systems, ITMO University, Saint Petersburg, Russia
Created:
2020-05-06



