data-is-better-together/aya_dutch_dpo_raw

Name: data-is-better-together/aya_dutch_dpo_raw
Creator: data-is-better-together
Published: 2024-05-02 20:15:13
License: No description available

Source: Hugging Face. Updated 2024-05-02; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/data-is-better-together/aya_dutch_dpo_raw

Resource description:

---
size_categories: n<1K
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: targets
    dtype: string
  - name: language
    dtype: string
  - name: language_code
    dtype: string
  - name: annotation_type
    dtype: string
  - name: user_id
    dtype: string
  - name: generation_model
    dtype: string
  - name: generation
    dtype: string
  - name: predicted_generation_language
    dtype: string
  - name: predicted_generation_language_score
    dtype: float64
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: model_name
    dtype: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 3530439
    num_examples: 1200
  download_size: 1847668
  dataset_size: 3530439
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
- distilabel
- rlaif
---

<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset Card for aya_dutch_dpo_raw

This dataset has been created with [distilabel](https://distilabel.argilla.io/).

## Dataset Summary

This dataset contains a `pipeline.yaml` which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI:

```console
distilabel pipeline run --config "https://huggingface.co/datasets/DIBT/aya_dutch_dpo_raw/raw/main/pipeline.yaml"
```

or explore the configuration:

```console
distilabel pipeline info --config "https://huggingface.co/datasets/DIBT/aya_dutch_dpo_raw/raw/main/pipeline.yaml"
```

## Dataset structure

The examples have the following structure per configuration:

<details><summary> Configuration: default </summary><hr>

```json
{
    "annotation_type": "re-annotations",
    "generation": "De constructie van het fort, dat werd opgetrokken in de haven van Lahaina met behulp van massive koraalblokken die uit de omringende riffen waren verwijderd, markeerde een keerpunt in de politieke en economische ontwikkeling van het eiland Maui in de eerste helft van de negentiende eeuw.",
    "generation_model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "generation_models": [
        "aya",
        "meta-llama/Meta-Llama-3-70B-Instruct"
    ],
    "generations": [
        "Een meer complexe versie van de zin is natuurlijk: \"Het fort werd gebouwd in het kadegebied bij Lahaina Harbor van koraalblokken met muren bekroond met 47 kanonnen die van schepen zijn verkregen\".",
        "De constructie van het fort, dat werd opgetrokken in de haven van Lahaina met behulp van massive koraalblokken die uit de omringende riffen waren verwijderd, markeerde een keerpunt in de politieke en economische ontwikkeling van het eiland Maui in de eerste helft van de negentiende eeuw."
    ],
    "instruction": "Maak een complexere zin: Het fort werd gebouwd in de haven van Lahaina met koraalblokken.",
    "language": "Dutch",
    "language_code": "nld",
    "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
    "predicted_generation_language": "nld_Latn",
    "predicted_generation_language_score": 0.9995737671852112,
    "ratings": [
        4,
        5
    ],
    "rationales": [
        "The text accurately provides a more complex sentence as requested, and the added details are correct (e.g., \"muren bekroond met 47 kanonnen\"). The sentence is well-structured and easy to follow. However, the model could be more confident in its language, and some parts feel slightly redundant (e.g., \"kadegebied bij Lahaina Harbor\").",
        "The text provides a sophisticated and accurate sentence that not only meets the request but also adds valuable context about the significance of the fort\u0027s construction. The language is confident and precise, and the sentence is well-structured and engaging. The model demonstrates a thorough understanding of the topic and effectively conveys its knowledge without any noticeable errors or hallucinations."
    ],
    "targets": "Een meer complexe versie van de zin is natuurlijk: \"Het fort werd gebouwd in het kadegebied bij Lahaina Harbor van koraalblokken met muren bekroond met 47 kanonnen die van schepen zijn verkregen\".",
    "user_id": "ca908e583236b208e473e89dae5c7b7d3daf3662e2bbf6606f0702c718bb5c06"
}
```

This subset can be loaded as:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/aya_dutch_dpo_raw", "default")
```

Or simply as follows, since there is only one configuration and it is named `default`:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/aya_dutch_dpo_raw")
```

</details>
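Each record pairs several `generations` with per-generation `ratings`, so one plausible way to derive preference pairs from this raw data is to take the highest-rated generation as "chosen" and the lowest-rated as "rejected". The sketch below is only an illustration under that assumption, not the pipeline the dataset authors used. The function name `to_preference_pair`, the output keys, the handling of ties and missing ratings, and the shortened placeholder completions are all hypothetical.

```python
from typing import Optional


def to_preference_pair(record: dict) -> Optional[dict]:
    """Build a prompt/chosen/rejected triple from one raw record, or return None."""
    generations = record["generations"]
    ratings = record["ratings"]
    # Assumption: rows with missing or misaligned ratings are unusable.
    if len(generations) < 2 or len(ratings) != len(generations):
        return None
    if any(r is None for r in ratings):
        return None
    best = max(range(len(ratings)), key=lambda i: ratings[i])
    worst = min(range(len(ratings)), key=lambda i: ratings[i])
    # Assumption: a tie carries no preference signal, so it is dropped.
    if ratings[best] == ratings[worst]:
        return None
    return {
        "prompt": record["instruction"],
        "chosen": generations[best],
        "rejected": generations[worst],
        "chosen_model": record["generation_models"][best],
        "rejected_model": record["generation_models"][worst],
    }


# Field names follow the card's schema; the completions are shortened placeholders.
example = {
    "instruction": "Maak een complexere zin: Het fort werd gebouwd in de haven van Lahaina met koraalblokken.",
    "generations": ["(aya completion)", "(Llama 3 completion)"],
    "generation_models": ["aya", "meta-llama/Meta-Llama-3-70B-Instruct"],
    "ratings": [4, 5],
}

pair = to_preference_pair(example)
print(pair["chosen_model"])  # meta-llama/Meta-Llama-3-70B-Instruct
```

With a loaded `datasets.Dataset`, the same function could be applied row by row and the `None` results filtered out. Whether ties should be dropped or kept depends on the downstream training setup.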

