cultura_ru_edu_llama3_annotations

ModelScope community. Updated: 2025-12-05. Indexed: 2025-08-02.
Download link:
https://modelscope.cn/datasets/deepvk/cultura_ru_edu_llama3_annotations
Resource description:
# Cultura-Ru-Edu Annotations

This dataset contains the annotations used for training an educational quality classifier for Russian. The pipeline mostly follows the original [📚 FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1). We used the same prompt and the same model but utilized the Russian part of [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) as the data source. Additionally, we added good samples from [`HuggingFaceFW/fineweb-edu-llama3-annotations`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).

## Details

Briefly, the annotation process was as follows:

1. We randomly sampled 500,000 documents from the Russian part of [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX).
2. We used the `Meta-Llama-3-70B-Instruct` model to annotate each sample, scoring their educational quality on a scale from 0 to 5. We used the same prompt as the original pipeline.
3. We split the samples into train/validation/test holdouts based on class distribution. Due to high imbalance, we additionally added samples with scores of 4 and 5 from [`HuggingFaceFW/fineweb-edu-llama3-annotations`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).

### Dataset structure

Each sample contains 4 fields:

- `text`: The original document.
- `score`: A number from 0 to 5 indicating document quality, where 5 stands for highly educational content.
- `explanation`: The model's explanation for this score.
- `source`: Either `cultura-ru` or `fineweb-edu`, based on the sample source.

### Dataset statistics

| Split      | Cultura-Ru | FineWeb-Edu | **Total** |
|------------|------------|-------------|-----------|
| Train      | 490,000    | 8,079       | 498,079   |
| Validation | 5,000      | 0           | 5,000     |
| Test       | 5,000      | 0           | 5,000     |

### Prompt

```
Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and promotional material.
- Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style.
- Award a third point if the extract is appropriate for educational use and introduces key concepts relevant to school curricula. It is coherent though it may not be comprehensive or could include some extraneous information. It may resemble an introductory section of a textbook or a basic tutorial that is suitable for learning but has notable limitations like treating concepts that are too complex for grade school students.
- Grant a fourth point if the extract highly relevant and beneficial for educational purposes for a level not higher than grade school, exhibiting a clear and consistent writing style. It could be similar to a chapter from a textbook or a tutorial, offering substantial educational content, including exercises and solutions, with minimal irrelevant information, and the concepts aren't too advanced for grade school students. The content is coherent, focused, and valuable for structured learning.
- Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for teaching either at primary school or grade school. It follows detailed reasoning, the writing style is easy to follow and offers profound and thorough insights into the subject matter, devoid of any non-educational or complex content.

The extract:
{}

After examining the extract:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: "Educational score: <total points>".
```
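The prompt asks the model to end its answer with `Educational score: <total points>`, so the numeric `score` field has to be recovered from free text. A minimal sketch of that extraction is shown here. The function name `parse_score`, the regular expression, and the choice to take the last match are assumptions made for illustration, not the authors' code.

```python
import re
from typing import Optional

# Assumed pattern: the literal label from the prompt followed by an integer.
SCORE_PATTERN = re.compile(r"Educational score:\s*(\d+)")


def parse_score(response: str) -> Optional[int]:
    """Return the last 'Educational score: N' value in a response, or None."""
    matches = SCORE_PATTERN.findall(response)
    if not matches:
        return None
    score = int(matches[-1])
    # The dataset card documents scores from 0 to 5; reject anything else.
    return score if 0 <= score <= 5 else None


if __name__ == "__main__":
    reply = "The page explains fractions with worked examples. Educational score: 4"
    print(parse_score(reply))
```

Taking the last match guards against a response that quotes the format string earlier in its justification. Responses without a parseable score return `None` so they can be dropped or re-annotated.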

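Step 3 of the annotation process says the samples were split into train/validation/test holdouts "based on class distribution" but does not name the procedure. One plausible reading is a stratified split in which each holdout keeps the same score proportions as the full sample. The sketch below illustrates that reading only. The function `stratified_split`, its parameters, and the rounding rule are hypothetical and may differ from what the authors did.

```python
import random
from collections import defaultdict


def stratified_split(samples, holdout_size, seed=0):
    """Split samples into (train, validation, test), preserving score proportions.

    `samples` is a list of dicts with a "score" key. Because of rounding, each
    holdout may differ from `holdout_size` by a few items.
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for sample in samples:
        by_score[sample["score"]].append(sample)

    train, validation, test = [], [], []
    total = len(samples)
    for score in sorted(by_score):
        group = by_score[score]
        rng.shuffle(group)
        # Each holdout receives a share proportional to the class frequency.
        k = round(holdout_size * len(group) / total)
        validation.extend(group[:k])
        test.extend(group[k:2 * k])
        train.extend(group[2 * k:])
    return train, validation, test


if __name__ == "__main__":
    # Toy data: 80 documents scored 0 and 20 documents scored 1.
    toy = [{"score": 0}] * 80 + [{"score": 1}] * 20
    train, validation, test = stratified_split(toy, holdout_size=10)
    print(len(train), len(validation), len(test))
```

With the sizes in the statistics table, `holdout_size` would be 5,000 over the 500,000 Cultura-Ru documents, leaving 490,000 for training before the FineWeb-Edu samples are added.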
Provider:
maas
Created:
2025-08-01