Finnish-NLP/Capybara-fi-deepl-translated-sft
Source: Hugging Face. Updated: 2024-02-26. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Finnish-NLP/Capybara-fi-deepl-translated-sft
Resource description:
---
language:
- fi
license: apache-2.0
task_categories:
- text-generation
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  - name: instruction_orig
    dtype: string
  - name: response_orig
    dtype: string
  - name: response_orig_grade
    dtype: string
  - name: response_judgelm
    dtype: string
  splits:
  - name: train
    num_bytes: 3710846
    num_examples: 1416
  download_size: 2238697
  dataset_size: 3710846
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- SFT
---
# Dataset Card for Finnish-NLP/Capybara-fi-deepl-translated-sft
## Creation process
- Load the data from LDJnr/Capybara.
- Keep only samples that contain a single input/output pair.
- Run zero-shot classification with facebook/bart-large-mnli using the following prompt:
```python
preds = pipe(f'{row["input"]} is a question about:', candidate_labels=["USA related question", "Math related question", "General question", "Coding related question"])
```
- Filter out rows that score too high in any of the following categories: ["USA related question", "Math related question", "Coding related question"].
- Write the rows to a .txt file, with `***` on its own line separating instruction and response, and `END` on its own line separating samples.
- Upload the file to deepl.com for file translation, parse the samples back from the translated files, then possibly apply additional cleaning/filtering based on fastText language detection or KenLM perplexity.
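
The filtering and export steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual script: the helper names (`keep_row`, `serialize`, `parse`) and the `0.5` score cutoff are hypothetical, since the card does not say what counts as "too high". `keep_row` expects a dict with parallel `labels` and `scores` lists, which is the shape the zero-shot classification pipeline returns for a single input.

```python
from typing import Dict, List, Tuple

# Categories the card filters out when their score is "too high".
EXCLUDED_LABELS = {
    "USA related question",
    "Math related question",
    "Coding related question",
}

# Assumption: the card gives no cutoff; 0.5 is only a placeholder.
SCORE_THRESHOLD = 0.5

PAIR_SEPARATOR = "***"     # separates instruction from response
SAMPLE_TERMINATOR = "END"  # closes each sample


def keep_row(preds: Dict[str, list], threshold: float = SCORE_THRESHOLD) -> bool:
    """Return False if any excluded label scores at or above the threshold."""
    for label, score in zip(preds["labels"], preds["scores"]):
        if label in EXCLUDED_LABELS and score >= threshold:
            return False
    return True


def serialize(samples: List[Tuple[str, str]]) -> str:
    """Render samples as: instruction, ***, response, END (each marker on its own line)."""
    parts = []
    for instruction, response in samples:
        parts.append(
            f"{instruction}\n{PAIR_SEPARATOR}\n{response}\n{SAMPLE_TERMINATOR}\n"
        )
    return "".join(parts)


def parse(text: str) -> List[Tuple[str, str]]:
    """Recover (instruction, response) pairs, skipping malformed samples."""
    samples = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == SAMPLE_TERMINATOR:
            block = "\n".join(current)
            current = []
            # Split at the first separator line only.
            instruction, sep, response = block.partition(f"\n{PAIR_SEPARATOR}\n")
            if sep and instruction.strip() and response.strip():
                samples.append((instruction.strip(), response.strip()))
        else:
            current.append(line)
    return samples
```

Skipping malformed samples in `parse` is a deliberate choice here: file translation can alter or drop marker lines, so a sample whose `***` separator did not survive is discarded rather than guessed at.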
Provider:
Finnish-NLP
Original information summary
Dataset card for Finnish-NLP/Capybara-fi-deepl-translated-sft
Dataset information
- Language: Finnish
- License: Apache 2.0
- Task category: text generation
Features
- instruction: string
- response: string
- instruction_orig: string
- response_orig: string
- response_orig_grade: string
- response_judgelm: string
Data splits
- Train:
  - Bytes: 3710846
  - Examples: 1416
Data size
- Download size: 2238697 bytes
- Dataset size: 3710846 bytes
Configuration
- Default config:
  - Data files:
    - Split: train
    - Path: data/train-*
Tags
- SFT



