deepvk/cultura_ru_edu
Source: Hugging Face | Updated 2025-01-27 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/deepvk/cultura_ru_edu
Description:
The `Cultura-Ru-Edu` dataset consists of Russian educational web pages filtered from the `uonlp/CulturaX` dataset. It is inspired by `HuggingFaceFW/fineweb-edu` but focuses on the Russian language. Because it is filtered on educational criteria, the dataset is both high-quality and large enough to train a Russian-focused language model for tasks that require world knowledge. To build it, a subset of documents was annotated with the `Meta-Llama-3-70B-Instruct` model, a classifier based on the `USER-base` model was trained on those annotations, and the classifier was then applied to the entire dataset to keep only high-quality samples. The result contains approximately 140 million documents, about 17.5% of the original `uonlp/CulturaX` dataset. Each sample has a single property, `text`, which holds the original text document. The dataset is intended for tasks such as text generation and is primarily in Russian. Its size is approximately 500 GB, so streaming it or downloading it via Git LFS is recommended.
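Since the description recommends streaming for a dataset of this size, the following is a minimal sketch of how that might look with the Hugging Face `datasets` library. The helper names `iter_texts` and `open_stream` are hypothetical, and the split name `"train"` is an assumption that should be checked against the dataset card.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_texts(samples: Iterable[dict], limit: int) -> Iterator[str]:
    """Yield the `text` field of the first `limit` samples and then stop."""
    for sample in islice(samples, limit):
        yield sample["text"]


def open_stream(repo_id: str = "deepvk/cultura_ru_edu", split: str = "train"):
    """Open the dataset in streaming mode, so samples are fetched during iteration.

    Assumption: the split is called "train"; adjust if the dataset card says otherwise.
    """
    # Imported lazily so that iter_texts can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

Under these assumptions, `for text in iter_texts(open_stream(), limit=3): print(text[:200])` would preview the first three documents without downloading the full 500 GB.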
Provider:
deepvk



