
deepvk/cultura_ru_edu_llama3_annotations

Source: Hugging Face. Updated 2024-11-25. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/deepvk/cultura_ru_edu_llama3_annotations
Description:
---
license: apache-2.0
dataset_info:
  features:
  - name: text
    dtype: string
  - name: explanation
    dtype: string
  - name: score
    dtype: int64
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 3410489939
    num_examples: 498079
  - name: validation
    num_bytes: 34106915
    num_examples: 5000
  - name: test
    num_bytes: 34168374
    num_examples: 5000
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- text-classification
language:
- ru
size_categories:
- 100K<n<1M
---

# Cultura-Ru-Edu Annotations

This dataset contains the annotations used for training an educational quality classifier for Russian.

The pipeline mostly follows the original [📚 FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1). We used the same prompt and the same model but utilized the Russian part of [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX) as the data source. Additionally, we added good samples from [`HuggingFaceFW/fineweb-edu-llama3-annotations`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).

## Details

Briefly, the annotation process was as follows:

1. We randomly sampled 500,000 documents from the Russian part of [`uonlp/CulturaX`](https://huggingface.co/datasets/uonlp/CulturaX).
2. We used the `Meta-Llama-3-70B-Instruct` model to annotate each sample, scoring its educational quality on a scale from 0 to 5. We used the same prompt as the original pipeline.
3. We split the samples into train/validation/test holdouts based on class distribution. Because of the high class imbalance, we additionally added samples with scores of 4 and 5 from [`HuggingFaceFW/fineweb-edu-llama3-annotations`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).

### Dataset structure

Each sample contains 4 fields:

- `text`: The original document.
- `score`: A number from 0 to 5 indicating document quality, where 5 stands for highly educational content.
- `explanation`: The model's explanation for this score.
- `source`: Either `cultura-ru` or `fineweb-edu`, based on the sample source.

### Dataset statistics

| Split      | Cultura-Ru | FineWeb-Edu | **Total** |
|------------|------------|-------------|-----------|
| Train      | 490,000    | 8,079       | 498,079   |
| Validation | 5,000      | 0           | 5,000     |
| Test       | 5,000      | 0           | 5,000     |

### Prompt

```
Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and promotional material.
- Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style.
- Award a third point if the extract is appropriate for educational use and introduces key concepts relevant to school curricula. It is coherent though it may not be comprehensive or could include some extraneous information. It may resemble an introductory section of a textbook or a basic tutorial that is suitable for learning but has notable limitations like treating concepts that are too complex for grade school students.
- Grant a fourth point if the extract highly relevant and beneficial for educational purposes for a level not higher than grade school, exhibiting a clear and consistent writing style. It could be similar to a chapter from a textbook or a tutorial, offering substantial educational content, including exercises and solutions, with minimal irrelevant information, and the concepts aren't too advanced for grade school students. The content is coherent, focused, and valuable for structured learning.
- Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for teaching either at primary school or grade school. It follows detailed reasoning, the writing style is easy to follow and offers profound and thorough insights into the subject matter, devoid of any non-educational or complex content.

The extract:
{}

After examining the extract:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: \"Educational score: <total points>\".
```
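
The prompt above leaves a `{}` placeholder for the document and asks the model to finish with `Educational score: <total points>`. The card does not publish the code that filled the template or extracted the score, so the following is only a minimal sketch of how those two steps could look. The helper names `build_prompt` and `parse_score` and the regular expression are assumptions made for illustration, not part of the original pipeline.

```python
import re
from typing import Optional

# Hypothetical pattern: matches the closing line requested by the prompt.
# Only single digits 0-5 are accepted, since the scale runs from 0 to 5.
SCORE_PATTERN = re.compile(r"Educational score:\s*([0-5])\b")


def build_prompt(template: str, extract: str) -> str:
    """Insert the document text into the first `{}` placeholder of the template."""
    # str.replace is used instead of str.format so that any other braces
    # in the template are left untouched.
    return template.replace("{}", extract, 1)


def parse_score(completion: str) -> Optional[int]:
    """Return the last score stated in a model completion, or None if absent."""
    matches = SCORE_PATTERN.findall(completion)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    template = "The extract: {} After examining the extract:"
    print(build_prompt(template, "Пример текста."))
    print(parse_score("A coherent basic tutorial. Educational score: 3"))  # 3
    print(parse_score("The model did not state a score."))  # None
```

Taking the last match rather than the first is a design choice in this sketch: the justification may quote the format string before the final verdict, and the prompt asks the model to conclude with the score. Completions with no parsable score return `None`, so a caller can drop or re-annotate them instead of silently assigning a default.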
Provider:
deepvk