cultura_ru_edu

Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-09.
Download link: https://modelscope.cn/datasets/deepvk/cultura_ru_edu
# Cultura-Ru-Edu

The `Cultura-Ru-Edu` dataset consists of Russian educational web pages filtered from the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset.

The dataset creation was inspired by [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), but with a focus on the Russian language. Because it is filtered on educational criteria, the `Cultura-Ru-Edu` dataset is both high-quality and large enough to train a Russian-focused language model for tasks requiring knowledge of the world.

## Dataset curation

To create this dataset, we annotated a subset with the `Meta-Llama-3-70B-Instruct` model, trained a classifier on the annotations, and then applied the classifier to the entire dataset, keeping only the high-quality samples.

### Annotation

See [`deepvk/cultura_ru_edu_llama3_annotations`](https://huggingface.co/datasets/deepvk/cultura_ru_edu_llama3_annotations) for details on how the annotation dataset was created.

### Training classifier

We trained a classifier based on the [`USER-base`](https://huggingface.co/deepvk/USER-base) model. Unlike the original FineWeb-Edu pipeline, we used binary classification, where the positive class includes samples with a score of 3 or higher. We found this approach more stable because the annotation dataset is highly imbalanced.

### Dataset scoring

We converted the classifier to ONNX format and applied it to the Russian part of the [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset. The original dataset contained approximately 800 million documents; after filtering, 140 million documents remained (~17.5% of the original dataset).

## Dataset information

Each sample contains only one property, `text`, which holds the original text document.

Some notes:

- This dataset is a filtered version of the larger, multilingual [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) dataset. No other information was added or removed.
- Since the original dataset consists of parsed web pages, there may still be artifacts in the text header or footer. Future work may include detecting and removing such blocks.

## Usage

To use this dataset, one may simply use the `datasets` API.

```python
from datasets import load_dataset

# Stream the data instead of downloading the full dataset up front.
cultura_ru_edu = load_dataset("deepvk/cultura_ru_edu", split="train", streaming=True)
```

Note that the dataset size is approximately 500GB, so it is better to use streaming or download it directly via Git LFS.

## Citations

```
@misc{deepvk2024cultura-ru-edu,
    title={Cultura-Ru-Edu},
    author={Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/datasets/deepvk/cultura_ru_edu},
    publisher={Hugging Face},
    year={2024},
}
```
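The binary labeling rule from the classifier-training step and the retention figure from the scoring step can be illustrated with a minimal sketch. This is not the authors' training code: the helper names `binarize_scores` and `retained_fraction`, the 0 to 5 score range, and the example scores are assumptions made for illustration. Only the threshold of 3 and the document counts (140 million kept out of roughly 800 million) come from the card.

```python
from typing import Iterable, List


def binarize_scores(scores: Iterable[int], threshold: int = 3) -> List[int]:
    """Map annotation scores to binary labels: 1 if score >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def retained_fraction(kept: int, total: int) -> float:
    """Return the fraction of documents that survive filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


# Hypothetical annotation scores (assumed 0-5 scale, illustrative values only).
example_scores = [0, 1, 1, 2, 3, 0, 5, 1, 4, 2]
labels = binarize_scores(example_scores)
print(labels)  # [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# Approximate document counts quoted in the dataset card.
print(f"{retained_fraction(140_000_000, 800_000_000):.1%}")  # 17.5%
```

In the toy example only 3 of 10 samples are positive, which mirrors the kind of class imbalance the card cites as the reason for preferring a single binary decision over predicting the full score.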

Provider: maas
Created: 2025-08-01