
DIBT/aya_dutch_dpo_raw

Hugging Face | Updated 2024-05-02 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/DIBT/aya_dutch_dpo_raw
Resource description:
---
size_categories: n<1K
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: targets
    dtype: string
  - name: language
    dtype: string
  - name: language_code
    dtype: string
  - name: annotation_type
    dtype: string
  - name: user_id
    dtype: string
  - name: generation_model
    dtype: string
  - name: generation
    dtype: string
  - name: predicted_generation_language
    dtype: string
  - name: predicted_generation_language_score
    dtype: float64
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: model_name
    dtype: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 3530439
    num_examples: 1200
  download_size: 1847668
  dataset_size: 3530439
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
- distilabel
- rlaif
---

<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset Card for aya_dutch_dpo_raw

This dataset has been created with [distilabel](https://distilabel.argilla.io/).

## Dataset Summary

This dataset contains a `pipeline.yaml` which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI:

```console
distilabel pipeline run --config "https://huggingface.co/datasets/DIBT/aya_dutch_dpo_raw/raw/main/pipeline.yaml"
```

or explore the configuration:

```console
distilabel pipeline info --config "https://huggingface.co/datasets/DIBT/aya_dutch_dpo_raw/raw/main/pipeline.yaml"
```

## Dataset structure

The examples have the following structure per configuration:

<details><summary> Configuration: default </summary><hr>

```json
{
    "annotation_type": "re-annotations",
    "generation": "De constructie van het fort, dat werd opgetrokken in de haven van Lahaina met behulp van massive koraalblokken die uit de omringende riffen waren verwijderd, markeerde een keerpunt in de politieke en economische ontwikkeling van het eiland Maui in de eerste helft van de negentiende eeuw.",
    "generation_model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "generation_models": [
        "aya",
        "meta-llama/Meta-Llama-3-70B-Instruct"
    ],
    "generations": [
        "Een meer complexe versie van de zin is natuurlijk: \"Het fort werd gebouwd in het kadegebied bij Lahaina Harbor van koraalblokken met muren bekroond met 47 kanonnen die van schepen zijn verkregen\".",
        "De constructie van het fort, dat werd opgetrokken in de haven van Lahaina met behulp van massive koraalblokken die uit de omringende riffen waren verwijderd, markeerde een keerpunt in de politieke en economische ontwikkeling van het eiland Maui in de eerste helft van de negentiende eeuw."
    ],
    "instruction": "Maak een complexere zin: Het fort werd gebouwd in de haven van Lahaina met koraalblokken.",
    "language": "Dutch",
    "language_code": "nld",
    "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
    "predicted_generation_language": "nld_Latn",
    "predicted_generation_language_score": 0.9995737671852112,
    "ratings": [
        4,
        5
    ],
    "rationales": [
        "The text accurately provides a more complex sentence as requested, and the added details are correct (e.g., \"muren bekroond met 47 kanonnen\"). The sentence is well-structured and easy to follow. However, the model could be more confident in its language, and some parts feel slightly redundant (e.g., \"kadegebied bij Lahaina Harbor\").",
        "The text provides a sophisticated and accurate sentence that not only meets the request but also adds valuable context about the significance of the fort\u0027s construction. The language is confident and precise, and the sentence is well-structured and engaging. The model demonstrates a thorough understanding of the topic and effectively conveys its knowledge without any noticeable errors or hallucinations."
    ],
    "targets": "Een meer complexe versie van de zin is natuurlijk: \"Het fort werd gebouwd in het kadegebied bij Lahaina Harbor van koraalblokken met muren bekroond met 47 kanonnen die van schepen zijn verkregen\".",
    "user_id": "ca908e583236b208e473e89dae5c7b7d3daf3662e2bbf6606f0702c718bb5c06"
}
```

This subset can be loaded as:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/aya_dutch_dpo_raw", "default")
```

Or simply as follows, since there is only one configuration and it is named `default`:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/aya_dutch_dpo_raw")
```

</details>
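The card does not say how the raw ratings are meant to be turned into preference data. As a minimal sketch, assuming that `generations` and `ratings` are index-aligned (as the example record suggests) and that the highest-rated generation should be preferred over the lowest-rated one, a record could be mapped to a prompt/chosen/rejected triple as below. The function name `to_preference_pair`, the output keys, and the placeholder strings are hypothetical and not part of the dataset or of distilabel.

```python
from typing import Optional


def to_preference_pair(record: dict) -> Optional[dict]:
    """Build a prompt/chosen/rejected triple from one raw record.

    Returns None when the record carries no usable preference signal:
    fewer than two generations, misaligned or missing ratings, or a tie.
    """
    generations = record["generations"]
    ratings = record["ratings"]
    if len(generations) < 2 or len(generations) != len(ratings):
        return None
    if any(rating is None for rating in ratings):
        return None
    # Index of the highest- and lowest-rated generation.
    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)
    if ratings[best] == ratings[worst]:
        return None
    return {
        "prompt": record["instruction"],
        "chosen": generations[best],
        "rejected": generations[worst],
    }


# Placeholder texts stand in for the long Dutch strings of a real record.
example = {
    "instruction": "<instruction text>",
    "generations": ["<first generation>", "<second generation>"],
    "ratings": [4, 5],
}
pair = to_preference_pair(example)
print(pair)
```

Dropping tied records is one possible design choice; a different pipeline might keep them or break ties another way.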
Provider:
DIBT
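Summary of original information

Dataset overview

Basic dataset information

  • Dataset name: aya_dutch_dpo_raw
  • Dataset size:
    • Download size: 1847668 bytes
    • Dataset size: 3530439 bytes
  • Number of examples: 1200
  • Size category: n<1K (as declared in the card metadata)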
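The figures above can be cross-checked with a little arithmetic. The sketch below derives sizes in MiB, the download-to-dataset size ratio, and the average size per example. It also tests the declared size category against the example count, reading `n<1K` as "fewer than 1,000 examples", which is an assumption. The function name `summarize_sizes` is hypothetical.

```python
DOWNLOAD_BYTES = 1_847_668   # download_size from the card metadata
DATASET_BYTES = 3_530_439    # dataset_size from the card metadata
NUM_EXAMPLES = 1_200         # num_examples of the train split


def summarize_sizes(download_bytes: int, dataset_bytes: int, num_examples: int) -> dict:
    """Derive a few convenience figures from the raw byte and example counts."""
    return {
        "download_mib": download_bytes / 2**20,
        "dataset_mib": dataset_bytes / 2**20,
        "download_to_dataset_ratio": download_bytes / dataset_bytes,
        "avg_bytes_per_example": dataset_bytes / num_examples,
        # Assumption: "n<1K" means strictly fewer than 1,000 examples.
        "fits_n_lt_1k": num_examples < 1_000,
    }


summary = summarize_sizes(DOWNLOAD_BYTES, DATASET_BYTES, NUM_EXAMPLES)
for key, value in summary.items():
    print(key, value)
```

Under that reading, the listed 1200 examples do not fit the declared `n<1K` category, so the count is better taken from `num_examples` than from the category tag.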

Dataset features

  • Feature name (data type):
    • instruction (string)
    • targets (string)
    • language (string)
    • language_code (string)
    • annotation_type (string)
    • user_id (string)
    • generation_model (string)
    • generation (string)
    • predicted_generation_language (string)
    • predicted_generation_language_score (float64)
    • generations (sequence: string)
    • generation_models (sequence: string)
    • model_name (string)
    • ratings (sequence: int64)
    • rationales (sequence: string)
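The feature list above can be expressed as a small schema check. This is a sketch using only the Python standard library, not the `datasets` library's own `Features` mechanism. The helper name `schema_problems` is hypothetical, and the check is stricter than the real data may warrant (it assumes no field is null).

```python
from typing import Any, Dict, List

# Scalar features and their Python types, taken from the list above.
SCALAR_FIELDS = {
    "instruction": str,
    "targets": str,
    "language": str,
    "language_code": str,
    "annotation_type": str,
    "user_id": str,
    "generation_model": str,
    "generation": str,
    "predicted_generation_language": str,
    "predicted_generation_language_score": float,
    "model_name": str,
}
# Sequence features and the type of their items.
SEQUENCE_FIELDS = {
    "generations": str,
    "generation_models": str,
    "ratings": int,
    "rationales": str,
}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable schema violations; an empty list means the record conforms."""
    problems = []
    for name, expected in SCALAR_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    for name, item_type in SEQUENCE_FIELDS.items():
        values = record.get(name)
        if not isinstance(values, list):
            problems.append(f"{name}: expected a list")
        elif not all(isinstance(value, item_type) for value in values):
            problems.append(f"{name}: expected items of type {item_type.__name__}")
    return problems


# A placeholder record: string fields are dummies, numeric values come from the card's example.
sample = {name: "placeholder" for name, kind in SCALAR_FIELDS.items() if kind is str}
sample["predicted_generation_language_score"] = 0.9995737671852112
sample.update(
    generations=["a", "b"],
    generation_models=["aya", "meta-llama/Meta-Llama-3-70B-Instruct"],
    ratings=[4, 5],
    rationales=["r1", "r2"],
)
print(schema_problems(sample))
```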

Dataset structure

  • Configuration name: default
  • Data files:
    • Split: train
    • Path: data/train-*

Dataset loading example

```python
from datasets import load_dataset

ds = load_dataset("DIBT/aya_dutch_dpo_raw")
```

Dataset example structure

```json
{
    "annotation_type": "re-annotations",
    "generation": "...",
    "generation_model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "generation_models": ["aya", "meta-llama/Meta-Llama-3-70B-Instruct"],
    "generations": ["...", "..."],
    "instruction": "...",
    "language": "Dutch",
    "language_code": "nld",
    "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
    "predicted_generation_language": "nld_Latn",
    "predicted_generation_language_score": 0.9995737671852112,
    "ratings": [4, 5],
    "rationales": ["...", "..."],
    "targets": "...",
    "user_id": "ca908e583236b208e473e89dae5c7b7d3daf3662e2bbf6606f0702c718bb5c06"
}
```
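In the example record, `generation_models` and `ratings` appear to line up position by position. Assuming that alignment holds across the dataset, a per-model average rating could be computed as in the sketch below. The function name `mean_rating_by_model` is hypothetical, and the single input record only reuses the values from the example above.

```python
import json
from collections import defaultdict
from statistics import mean


def mean_rating_by_model(records: list) -> dict:
    """Average the ratings per generating model, assuming index-aligned lists."""
    scores = defaultdict(list)
    for record in records:
        for model, rating in zip(record["generation_models"], record["ratings"]):
            if rating is not None:  # skip missing ratings, if any occur
                scores[model].append(rating)
    return {model: mean(values) for model, values in scores.items()}


# Only the two fields needed here, with the values from the example record.
sample_record = json.loads(
    '{"generation_models": ["aya", "meta-llama/Meta-Llama-3-70B-Instruct"], "ratings": [4, 5]}'
)
averages = mean_rating_by_model([sample_record])
print(averages)
```

With the full train split, the same function would take an iterable of records, for instance the rows of the loaded dataset.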
