ReBatch/ultrachat_400k_nl
收藏Hugging Face2024-06-07 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/ReBatch/ultrachat_400k_nl
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: prompt
dtype: string
- name: prompt_id
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: test_sft
num_bytes: 209649280.0
num_examples: 44530
- name: train_sft
num_bytes: 1853238192.0
num_examples: 400456
download_size: 1100146612
dataset_size: 2062887472.0
configs:
- config_name: default
data_files:
- split: test_sft
path: data/test_sft-*
- split: train_sft
path: data/train_sft-*
---
# Dataset Card for ultrachat_400k_nl
## Dataset Description
This dataset is a combination 2 datasets for the Dutch Language. The first is a tranlsation of [HuggingFaceH4/ultrachat_200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) using a MarianMT model.
It contains multi-turn chat conversations between a user and an assistant.
The second is [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch). This is a recreation of ultrachat_200K in Dutch with gpt-4.
## Dataset Structure
The dataset has two splits; Only the SFT splits of the original dataset were translated. There are roughly 200k samples training samples and 20k test samples from each translated dataset.
| train_sft | test_sft |
|:-------:|:-----------:|
| 400456 | 44530 |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("ReBatch/ultrachat_400k_nl")
```
## Translation
The first dataset was translated using [vvn/en-to-dutch-marianmt](https://huggingface.co/vvn/en-to-dutch-marianmt).
A fine-Tuned MarianMT translation model for translating text from English to Dutch.
The second dataset was recreated using `gpt-4-1106-preview` via Azure.
提供机构:
ReBatch
原始信息汇总
数据集概述
数据集名称
ultrachat_400k_nl
数据集描述
该数据集是两个荷兰语数据集的组合。第一个数据集是使用MarianMT模型翻译的HuggingFaceH4/ultrachat_200k。它包含用户和助手之间的多轮聊天对话。第二个数据集是BramVanroy/ultrachat_200k_dutch,这是在荷兰语中重现的ultrachat_200K,使用gpt-4。
数据集结构
数据集包含两个分割:仅原始数据集的SFT分割被翻译。每个翻译数据集大约有200k训练样本和20k测试样本。
| 分割名称 | 示例数量 |
|---|---|
| train_sft | 400456 |
| test_sft | 44530 |
特征信息
- prompt: 数据类型 - string
- prompt_id: 数据类型 - string
- messages: 列表,包含以下子特征:
- content: 数据类型 - string
- role: 数据类型 - string
数据集大小
- 下载大小: 1100146612字节
- 数据集大小: 2062887472.0字节
配置信息
- config_name: default
- data_files:
- split: test_sft, path: data/test_sft-*
- split: train_sft, path: data/train_sft-*



