cultura_ru_edu

Name: cultura_ru_edu
Creator: maas
Published: 2025-12-05 16:44:14
License: no description available

ModelScope community, updated 2025-12-05, indexed 2025-08-09

Download link:

https://modelscope.cn/datasets/deepvk/cultura_ru_edu

Resource overview:

# Cultura-Ru-Edu

The `Cultura-Ru-Edu` dataset consists of Russian educational web pages filtered from the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset.

The dataset creation was inspired by [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), but with a focus on the Russian language. Because it is filtered by educational criteria, `Cultura-Ru-Edu` is both high-quality and large enough to train a Russian-focused language model for tasks that require knowledge of the world.

## Dataset curation

To create this dataset, we annotated a subset with the `Meta-Llama-3-70B-Instruct` model, trained a classifier on those annotations, and then applied the classifier to the entire dataset, keeping only the high-quality samples.

### Annotation

See [`deepvk/cultura_ru_edu_llama3_annotations`](https://huggingface.co/datasets/deepvk/cultura_ru_edu_llama3_annotations) for details on how the annotation dataset was created.

### Training classifier

We trained a classifier based on the [`USER-base`](https://huggingface.co/deepvk/USER-base) model. Unlike the original FineWeb-Edu pipeline, we used binary classification, where the positive class includes samples with a score of 3 and higher. We found this approach more stable due to the high class imbalance in the annotation dataset.

### Dataset scoring

We converted the classifier to ONNX format and applied it to the Russian part of the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset. The original dataset contained approximately 800 million documents; after filtering, about 140 million documents remained (~17.5% of the original dataset).

## Dataset information

Each sample contains only one property, `text`, which holds the original text document.

Some notes:

- This dataset is a filtered version of the larger, multilingual [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset. No other information was added or removed.
- Since the original dataset consists of parsed web pages, there may still be artifacts in the text header or footer. Future work may include detecting and removing such blocks.

## Usage

The dataset can be loaded with the `datasets` API.

```python
from datasets import load_dataset

cultura_ru_edu = load_dataset("deepvk/cultura_ru_edu", split="train", streaming=True)
```

Note that the dataset size is approximately 500GB, so it is better to use streaming or download it directly via Git LFS.

## Citations

```
@misc{deepvk2024cultura-ru-edu,
    title={Cultura-Ru-Edu},
    author={Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/datasets/deepvk/cultura_ru_edu},
    publisher={Hugging Face},
    year={2024},
}
```
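
The curation logic described in the card (binarizing annotation scores at 3, then keeping only documents the classifier accepts) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the integer score scale in the toy data, the helper names, and the `is_educational` callable standing in for the ONNX classifier are all hypothetical. Only the threshold of 3 and the 140 million / 800 million figures come from the card.

```python
from typing import Callable, Iterable, Iterator

# From the card: samples scored 3 and higher form the positive class.
POSITIVE_THRESHOLD = 3


def to_binary_label(score: int, threshold: int = POSITIVE_THRESHOLD) -> int:
    """Map an annotation score to a binary label (1 = educational, 0 = not)."""
    return 1 if score >= threshold else 0


def filter_documents(
    docs: Iterable[dict], is_educational: Callable[[str], bool]
) -> Iterator[dict]:
    """Yield only the documents whose `text` the classifier accepts.

    `is_educational` is a hypothetical stand-in for the trained classifier.
    """
    for doc in docs:
        if is_educational(doc["text"]):
            yield doc


def retention_rate(kept: int, total: int) -> float:
    """Fraction of documents that survive filtering."""
    return kept / total


# Toy scores for illustration only (assumed scale, not real annotations).
labels = [to_binary_label(score) for score in [0, 1, 2, 3, 4, 5]]

# Approximate counts reported in the card: 140M kept out of 800M.
reported_rate = retention_rate(140_000_000, 800_000_000)
print(f"retained: {reported_rate:.1%}")
```

Collapsing the graded scores into two classes trades granularity for a simpler decision boundary, which matches the card's remark that the binary setup was more stable on a highly imbalanced annotation set.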

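
Because the dataset is roughly 500GB, it can help to inspect a handful of documents from the stream before committing to a full download. The sketch below is one possible way to do that. The helper names `head_texts` and `stream_cultura_ru_edu` are hypothetical, as is the truncation length; only the `load_dataset` call and the `text` field come from the card.

```python
from itertools import islice
from typing import Iterable, List


def head_texts(samples: Iterable[dict], n: int = 3, max_chars: int = 200) -> List[str]:
    """Return the text of the first `n` samples, truncated to `max_chars` characters."""
    return [sample["text"][:max_chars] for sample in islice(samples, n)]


def stream_cultura_ru_edu():
    """Open the dataset in streaming mode (requires `datasets` and network access)."""
    # Imported lazily so that `head_texts` can be used without the dependency.
    from datasets import load_dataset

    return load_dataset("deepvk/cultura_ru_edu", split="train", streaming=True)


# Offline illustration with in-memory samples shaped like the dataset rows.
toy_samples = [{"text": "Первый документ"}, {"text": "Второй документ"}]
preview = head_texts(toy_samples, n=1, max_chars=6)
```

With network access, `head_texts(stream_cultura_ru_edu())` would read only the first few records from the stream rather than materializing the whole split.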

Provider:

maas

Created:

2025-08-01
