Eloquent/HalluciGen-Translation
Source: Hugging Face. Updated 2024-11-13; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Eloquent/HalluciGen-Translation
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- de
- en
- fr
configs:
- config_name: trial
  data_files:
  - split: trial_de_en
    path: de-en/trial.de-en.jsonl
  - split: trial_en_de
    path: de-en/trial.en-de.jsonl
  - split: trial_fr_en
    path: fr-en/trial.fr-en.jsonl
  - split: trial_en_fr
    path: fr-en/trial.en-fr.jsonl
- config_name: test_detection
  data_files:
  - split: test_detection_de_en
    path: de-en/test_detection.de-en.jsonl
  - split: test_detection_en_de
    path: de-en/test_detection.en-de.jsonl
  - split: test_detection_fr_en
    path: fr-en/test_detection.fr-en.jsonl
  - split: test_detection_en_fr
    path: fr-en/test_detection.en-fr.jsonl
- config_name: test_generation
  data_files:
  - split: test_generation_de_en
    path: de-en/test_generation.de-en.jsonl
  - split: test_generation_en_de
    path: de-en/test_generation.en-de.jsonl
  - split: test_generation_fr_en
    path: fr-en/test_generation.fr-en.jsonl
  - split: test_generation_en_fr
    path: fr-en/test_generation.en-fr.jsonl
- config_name: cross_model_evaluation
  sep: ','
  data_files:
  - split: cross_model_evaluation_de_en
    path: de-en/cross_model_evaluation.de-en.jsonl
  - split: cross_model_evaluation_en_de
    path: de-en/cross_model_evaluation.en-de.jsonl
  - split: cross_model_evaluation_fr_en
    path: fr-en/cross_model_evaluation.fr-en.jsonl
  - split: cross_model_evaluation_en_fr
    path: fr-en/cross_model_evaluation.en-fr.jsonl
pretty_name: HalluciGen Translation
size_categories:
- n<1K
---
# Task 2: HalluciGen - Translation
This dataset contains the trial and test splits per language pair for the Translation scenario of the [HalluciGen task](https://docs.google.com/document/d/1yeohpm3YJAXKj9BI2JDXJ3ap9Vi2dnHkA2OsDI94QZ4/edit#heading=h.jtyt8tmnayhb), which is part of the 2024 ELOQUENT lab.
NOTE: A gold-labeled version of the dataset will be released in a new repository.
#### Dataset schema
- *id*: unique identifier of the example
- *langpair*: the source and target language pair of the example
- *source*: original model input for translation
- *hyp1*: first alternative translation of the source
- *hyp2*: second alternative translation of the source
- *label*: *hyp1* or *hyp2*, whichever has been annotated as the hallucination
- *type*: hallucination category assigned. Possible values: addition, named-entity, number, conversion, date, tense, negation, gender, pronoun, antonym, natural
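
The schema above can be turned into a small sanity check for labelled examples such as those in the trial data. The following is a minimal sketch, not part of the dataset's own tooling: `validate_example` and the constant names are hypothetical, and only the field names and the allowed *label* and *type* values are taken from the list above.

```python
# Field names and allowed values copied from the schema list;
# the constant and function names are illustrative, not an official API.
FIELDS = ("id", "langpair", "source", "hyp1", "hyp2", "label", "type")
LABELS = {"hyp1", "hyp2"}
TYPES = {
    "addition", "named-entity", "number", "conversion", "date", "tense",
    "negation", "gender", "pronoun", "antonym", "natural",
}


def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one labelled example (empty if none)."""
    problems = [f"missing field: {name}" for name in FIELDS if name not in example]
    if "label" in example and example["label"] not in LABELS:
        problems.append(f"unexpected label: {example['label']!r}")
    if "type" in example and example["type"] not in TYPES:
        problems.append(f"unexpected type: {example['type']!r}")
    return problems
```

The check deliberately ignores value types (for example whether *id* is a string or an integer), since the card does not specify them.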
#### Trial Data
This is a small list of examples, provided to help the participants get familiar with the task. Each example contains the following fields: *id*, *langpair*, *source*, *hyp1*, *hyp2*, *type*, *label*.
```python
from datasets import load_dataset
# load the trial data for all language pairs
trial_ds = load_dataset("Eloquent/HalluciGen-Translation", name="trial")
# load the trial data only for the German->English pair
trial_ds_de_en = load_dataset("Eloquent/HalluciGen-Translation", name="trial", split="trial_de_en")
```
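
Because the trial examples carry a *label*, they can be used to score a detector locally. The sketch below is illustrative only: `detection_accuracy` and the `always_hyp1` baseline are hypothetical names, and plain accuracy is an assumption made here, not necessarily the metric used by the task organisers.

```python
from typing import Callable, Iterable, Mapping


def detection_accuracy(
    examples: Iterable[Mapping[str, str]],
    predict: Callable[[Mapping[str, str]], str],
) -> float:
    """Fraction of examples where predict(example) equals the gold label."""
    total = correct = 0
    for example in examples:
        total += 1
        correct += predict(example) == example["label"]
    return correct / total if total else 0.0


def always_hyp1(example: Mapping[str, str]) -> str:
    """Trivial baseline (hypothetical): always flag hyp1 as the hallucination."""
    return "hyp1"
```

Iterating over a `datasets.Dataset` yields one dict per example, so a call such as `detection_accuracy(trial_ds_de_en, always_hyp1)` should give the baseline score for the German->English trial split.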
#### Test data for the detection step
The files "test_detection.langpair.jsonl" contain the test splits for the detection step, one file per language pair (*langpair* in the file name stands for the pair, e.g. `de-en`). Each example contains the following fields: *id*, *langpair*, *source*, *hyp1*, *hyp2*.
```python
from datasets import load_dataset
# load the test data for the detection step for all language pairs
data = load_dataset("Eloquent/HalluciGen-Translation", "test_detection")
```
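
The configuration in the card's front matter follows a regular naming pattern: both translation directions of a language pair live in one directory (`de-en` or `fr-en`), and each split is named after its config and direction. A minimal sketch of that pattern follows; the helper names `split_name` and `data_file_path` are hypothetical and not part of the `datasets` library.

```python
# Directory that holds both directions of each pair, as listed in the front matter.
PAIR_DIRS = {
    ("de", "en"): "de-en", ("en", "de"): "de-en",
    ("fr", "en"): "fr-en", ("en", "fr"): "fr-en",
}


def split_name(config: str, src: str, tgt: str) -> str:
    """Split name for a config and direction, e.g. test_detection_en_de."""
    return f"{config}_{src}_{tgt}"


def data_file_path(config: str, src: str, tgt: str) -> str:
    """Repository-relative JSONL path for a config and direction."""
    directory = PAIR_DIRS[(src, tgt)]  # raises KeyError for unsupported pairs
    return f"{directory}/{config}.{src}-{tgt}.jsonl"
```

For instance, `split_name("test_detection", "en", "de")` produces the value to pass as `split=` when only the English->German detection data is needed.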
#### Test data for the generation step
The files "test_generation.langpair.jsonl" contain the test splits for the generation step, one file per language pair. Each example contains the following fields: *id*, *langpair*, *source*.
```python
from datasets import load_dataset
# load the test data for the generation step for all language pairs
data = load_dataset("Eloquent/HalluciGen-Translation", "test_generation")
```
#### Test data for the cross-model evaluation of the generation step (released 3 May, 2024)
The files "cross_model_evaluation.langpair.jsonl" contain the test splits for the cross-model evaluation of the generation step, one file per language pair. Each example contains the following fields: *id*, *langpair*, *source*, *hyp1*, *hyp2*.
```python
from datasets import load_dataset
# load the test data for the cross-model evaluation of the generation step for all language pairs
data = load_dataset("Eloquent/HalluciGen-Translation", "cross_model_evaluation")
```
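
Since each configuration exposes a different subset of fields, a quick check on a downloaded file can catch mix-ups between configs. This is a sketch under assumptions: `EXPECTED_FIELDS` and `check_jsonl_lines` are hypothetical names, the field sets are copied from the section descriptions above, and the files are assumed to hold one JSON object per line, as the `.jsonl` extension suggests.

```python
import json

# Field sets per config, taken from the descriptions in this card.
BASE = {"id", "langpair", "source"}
EXPECTED_FIELDS = {
    "trial": BASE | {"hyp1", "hyp2", "type", "label"},
    "test_detection": BASE | {"hyp1", "hyp2"},
    "test_generation": BASE,
    "cross_model_evaluation": BASE | {"hyp1", "hyp2"},
}


def check_jsonl_lines(lines, config):
    """Return the 1-based numbers of lines whose keys differ from the config's fields."""
    expected = EXPECTED_FIELDS[config]
    mismatched = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        if set(json.loads(line)) != expected:
            mismatched.append(number)
    return mismatched
```

Any iterable of strings works as `lines`, so an open file handle for a locally downloaded `.jsonl` file can be passed directly.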
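Provider:
Eloquent
Summary of original information
Dataset overview
Dataset name
- Name: HalluciGen Translation
Dataset content
- Task: translation
- Language pairs: German-English, English-German, French-English, English-French
- Data file configurations:
  - trial: trial data files for the four language pairs
  - test_detection: detection test data files for the four language pairs
  - test_generation: generation test data files for the four language pairs
  - cross_model_evaluation: cross-model evaluation test data files for the four language pairs
Dataset structure
- Fields:
  - id: unique identifier
  - langpair: source and target language pair
  - source: original input for translation
  - hyp1: first alternative translation
  - hyp2: second alternative translation
  - label: the translation option annotated as the hallucination
  - type: hallucination category
Dataset size
- Category: n<1K
License
- License: cc-by-nc-sa-4.0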



