MULTITuDEv2
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10013754
Description:
MULTITuDEv2 is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (using the headlines of the news articles). The creation process and the scripts for replication and extension are located in a GitHub repository. In v2, the dataset was extended with obfuscated texts produced by 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Files
The v2 of the dataset consists of multiple files. 'multitude.csv' contains the original v1 of the dataset (i.e., without the field 'generated'). The other files also contain the 'generated' field (as described below) and are compressed with gzip. In the file 'multitude_obfuscated_original.csv.gz', the 'generated' field holds a copy of the 'text' field, so that the file has the same layout as the files with the obfuscated texts (it was used as such in the experiments).
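The file layout above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loading code: it assumes the files are UTF-8 encoded, comma-delimited, and start with a header row, none of which is stated in the description. The helper name `read_rows` is hypothetical.

```python
import csv
import gzip


def read_rows(path):
    """Read one dataset file into a list of dicts keyed by column name.

    Assumption: UTF-8, comma-delimited CSV with a header row.
    Files ending in ".gz" are decompressed on the fly.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    # newline="" lets the csv module handle line breaks inside quoted texts.
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, `read_rows("multitude.csv")` would return rows without a 'generated' key, while `read_rows("multitude_obfuscated_original.csv.gz")` would return rows that include it. All values come back as strings, so 'label' and 'length' need an explicit `int(...)` conversion before numeric use.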
Fields
The dataset has the following fields:
'text' - an original (unobfuscated) text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / news medium of the given text,
'generated' - an obfuscated text sample (i.e., transformed from the original text by the obfuscator indicated by the corresponding filename).
Note: some obfuscated texts in the 'generated' field are identical to the 'text' field, which indicates that the obfuscator failed to modify the text. Obfuscated versions of human-written texts are also included; however, the labels of their originals may no longer apply to them (i.e., a human-written text obfuscated by a machine could itself be considered machine-generated), so take this into account in your research.
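The note above suggests checking how often an obfuscator left a text unchanged before interpreting results. A minimal sketch follows, assuming each row is a dict with 'text' and 'generated' keys (as `csv.DictReader` would produce) and that "the same" means exact string equality. The function names are hypothetical.

```python
def unchanged_fraction(rows):
    """Return the share of rows whose 'generated' text equals 'text'."""
    rows = list(rows)
    if not rows:
        return 0.0
    unchanged = sum(1 for row in rows if row["generated"] == row["text"])
    return unchanged / len(rows)


def drop_unchanged(rows):
    """Keep only rows that the obfuscator actually modified."""
    return [row for row in rows if row["generated"] != row["text"]]
```

Whether to drop unchanged rows or keep them as failed obfuscation attempts depends on the experiment; the description does not prescribe either choice.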
Statistics (the number of samples)
Splits:
train - 44786
test - 29295
Binary labels:
0 - 7992
1 - 66089
Multiclass labels:
gpt-3.5-turbo - 8300
gpt-4 - 8300
text-davinci-003 - 8297
alpaca-lora-30b - 8290
vicuna-13b - 8287
opt-66b - 8229
llama-65b - 8229
opt-iml-max-1.3b - 8157
human - 7992
Languages:
English (en) - 29460 (train + test)
Spanish (es) - 11586 (train + test)
Russian (ru) - 11578 (train + test)
Dutch (nl) - 2695 (test)
Catalan (ca) - 2691 (test)
Czech (cs) - 2689 (test)
German (de) - 2685 (test)
Chinese (zh) - 2683 (test)
Portuguese (pt) - 2673 (test)
Arabic (ar) - 2673 (test)
Ukrainian (uk) - 2668 (test)
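The four breakdowns above describe the same set of samples, so each should add up to the same total. The sketch below copies the counts from this section and checks that; the `tally` helper is a hypothetical addition that recomputes such a table from loaded rows, assuming each row is a dict keyed by field name.

```python
from collections import Counter

# Sample counts copied from the Statistics section above.
SPLITS = {"train": 44786, "test": 29295}
BINARY_LABELS = {0: 7992, 1: 66089}
MULTICLASS_LABELS = {
    "gpt-3.5-turbo": 8300, "gpt-4": 8300, "text-davinci-003": 8297,
    "alpaca-lora-30b": 8290, "vicuna-13b": 8287, "opt-66b": 8229,
    "llama-65b": 8229, "opt-iml-max-1.3b": 8157, "human": 7992,
}
LANGUAGES = {
    "en": 29460, "es": 11586, "ru": 11578, "nl": 2695, "ca": 2691,
    "cs": 2689, "de": 2685, "zh": 2683, "pt": 2673, "ar": 2673, "uk": 2668,
}


def table_totals():
    """Sum each breakdown so the totals can be compared with each other."""
    tables = {
        "splits": SPLITS,
        "binary": BINARY_LABELS,
        "multiclass": MULTICLASS_LABELS,
        "languages": LANGUAGES,
    }
    return {name: sum(table.values()) for name, table in tables.items()}


def tally(rows, field):
    """Count rows per value of one field, e.g. 'split' or 'language'."""
    return Counter(row[field] for row in rows)
```

With the counts listed here, every entry of `table_totals()` equals 74081, and the eight model counts sum to the 66089 machine-generated texts.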
Created:
2024-10-04



