cultura_ru_edu
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-09.
Download link: https://modelscope.cn/datasets/deepvk/cultura_ru_edu
Resource description:
# Cultura-Ru-Edu
The `Cultura-Ru-Edu` dataset consists of Russian educational web pages filtered from the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
The dataset creation was inspired by [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), but with a focus on the Russian language.
Because it is filtered by educational criteria, `Cultura-Ru-Edu` is both high-quality and large enough to train a Russian-focused language model for tasks that require world knowledge.
## Dataset curation
To create this dataset, we annotated a subset of the source data with the `Meta-Llama-3-70B-Instruct` model, trained a classifier on those annotations, and then applied the classifier to the entire dataset, keeping only the high-quality samples.
### Annotation
See [`deepvk/cultura_ru_edu_llama3_annotations`](https://huggingface.co/datasets/deepvk/cultura_ru_edu_llama3_annotations) for details on how the annotation dataset was created.
### Training classifier
We trained a classifier based on the [`USER-base`](https://huggingface.co/deepvk/USER-base) model.
Unlike the original FineWeb-Edu pipeline, we used binary classification, where the positive class includes samples with a score of 3 and higher.
We found this approach more stable due to the high imbalance in the annotation dataset.
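The binarization described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the names `to_binary_label`, `label_distribution`, and `POSITIVE_THRESHOLD` are hypothetical, the example scores are made up, and integer annotation scores are an assumption.

```python
from collections import Counter

# Scores of 3 and higher form the positive class, as stated in the text.
POSITIVE_THRESHOLD = 3


def to_binary_label(score: int, threshold: int = POSITIVE_THRESHOLD) -> int:
    """Map an annotation score to 1 (educational) or 0 (not educational)."""
    return 1 if score >= threshold else 0


def label_distribution(scores: list[int]) -> Counter:
    """Count how many samples fall into each binary class."""
    return Counter(to_binary_label(s) for s in scores)


# Made-up scores, used only to illustrate an imbalanced annotation set.
example_scores = [0, 1, 1, 2, 0, 3, 1, 5, 2, 0]
print(label_distribution(example_scores))
```

Collapsing the scores into two classes leaves the classifier with a single decision boundary to learn, which is one plausible reason the binary setup behaves more stably when high scores are rare.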
### Dataset scoring
We converted the classifier to ONNX format and applied it to the Russian part of the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
The original dataset contained approximately 800 million documents; after filtering, about 140 million documents remained (~17.5% of the original dataset).
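The filtering step can be pictured with the sketch below. It is not the authors' pipeline: `score_fn` is a hypothetical stand-in for the ONNX classifier, and the 0.5 decision threshold is an assumption, since the source does not state the cut-off that was used.

```python
from typing import Callable, Iterable, Iterator


def filter_educational(
    docs: Iterable[dict],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> Iterator[dict]:
    """Yield only documents whose positive-class score meets the threshold."""
    for doc in docs:
        if score_fn(doc["text"]) >= threshold:
            yield doc


def retention_rate(kept: int, total: int) -> float:
    """Fraction of documents that survive filtering."""
    return kept / total


# Approximate document counts reported in the text.
print(f"{retention_rate(140_000_000, 800_000_000):.1%}")
```

Writing the filter as a generator keeps memory use flat, which matters when the input is hundreds of millions of documents.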
## Dataset information
Each sample contains a single field, `text`, which holds the original document text.
Some notes:
- This dataset is a filtered version of the larger, multilingual [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset. No other information was added or removed.
- Since the original dataset consists of parsed web pages, there may still be artifacts in the text header or footer. Future work may include detecting and removing such blocks.
## Usage
The dataset can be loaded with the Hugging Face `datasets` library.
```python
from datasets import load_dataset
cultura_ru_edu = load_dataset("deepvk/cultura_ru_edu", split="train", streaming=True)
```
Note that the dataset size is approximately 500GB, so it is better to use streaming or download it directly via Git LFS.
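When streaming, it helps to inspect a few records before committing to a full pass over the data. The following is a minimal sketch, assuming each record is a dict with a `text` field as described in the dataset information; `preview_texts` is a hypothetical helper, not part of the `datasets` API.

```python
from itertools import islice
from typing import Iterable


def preview_texts(samples: Iterable[dict], n: int = 3, max_chars: int = 200) -> list[str]:
    """Return the first `n` texts, each truncated to `max_chars` characters."""
    return [sample["text"][:max_chars] for sample in islice(samples, n)]


# In-memory stand-in records, so the sketch runs without network access.
fake_samples = [{"text": "Первый документ"}, {"text": "Второй документ"}]
for text in preview_texts(fake_samples, n=1):
    print(text)

# Usage with the streaming dataset (requires network access):
#   from datasets import load_dataset
#   ds = load_dataset("deepvk/cultura_ru_edu", split="train", streaming=True)
#   for text in preview_texts(ds):
#       print(text)
```

Because `islice` stops after `n` items, only the first few records are pulled from the stream rather than the full ~500GB.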
## Citations
```
@misc{deepvk2024cultura-ru-edu,
title={Cultura-Ru-Edu},
author={Spirin, Egor and Sokolov, Andrey},
url={https://huggingface.co/datasets/deepvk/cultura_ru_edu},
publisher={Hugging Face},
year={2024},
}
```
Provider: maas
Created: 2025-08-01



