finbenchv2-opengpt-x_truthfulqax-fi-mt
Source: ModelScope community. Updated 2025-08-08; indexed 2025-08-09.
Download link:
https://modelscope.cn/datasets/TurkuNLP/finbenchv2-opengpt-x_truthfulqax-fi-mt
Description:
This is an archived version of [LumiOpen/opengpt-x_truthfulqax](https://huggingface.co/datasets/LumiOpen/opengpt-x_truthfulqax) used in Finbench version 2.
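
A minimal sketch of how the upstream dataset linked above might be inspected with the Hugging Face `datasets` library. The dataset ID is taken from that link. The helper names (`hub_url`, `list_configs`, `load_config`) are hypothetical, and the configuration names are not listed on this page, so the sketch queries them instead of guessing. It assumes the upstream repository is still available and that `datasets` is installed.

```python
from typing import List

# Dataset ID taken from the Hugging Face link in the description above.
HF_DATASET_ID = "LumiOpen/opengpt-x_truthfulqax"


def hub_url(dataset_id: str = HF_DATASET_ID) -> str:
    """Return the Hugging Face Hub page URL for a dataset ID."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def list_configs(dataset_id: str = HF_DATASET_ID) -> List[str]:
    """List the configurations the Hub reports for the dataset (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(dataset_id)


def load_config(config_name: str, dataset_id: str = HF_DATASET_ID):
    """Load one configuration; `config_name` should come from list_configs()."""
    from datasets import load_dataset

    return load_dataset(dataset_id, config_name)
```

Calling `list_configs()` first and passing one of its results to `load_config()` avoids hard-coding a configuration name that may not exist.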
### Citation Information
If you find this benchmark useful in your research, please consider citing it and also the [TruthfulQA](https://aclanthology.org/2022.acl-long.229) dataset it draws from:
```
@misc{thellmann2024crosslingual,
title={Towards Cross-Lingual LLM Evaluation for European Languages},
author={Klaudia Thellmann and Bernhard Stadler and Michael Fromm and Jasper Schulze Buschhoff and Alex Jude and Fabio Barth and Johannes Leveling and Nicolas Flores-Herr and Joachim Köhler and René Jäkel and Mehdi Ali},
year={2024},
eprint={2410.08928},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
# TruthfulQA
@inproceedings{lin-etal-2022-truthfulqa,
title = "{T}ruthful{QA}: Measuring How Models Mimic Human Falsehoods",
author = "Lin, Stephanie and
Hilton, Jacob and
Evans, Owain",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.229",
doi = "10.18653/v1/2022.acl-long.229",
pages = "3214--3252",
abstract = "We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58{\%} of questions, while human performance was 94{\%}. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.",
}
```
Provider:
maas
Created:
2025-08-08
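Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset is the Finnish machine-translated version of opengpt-x_truthfulqax, used in the Finbench v2 benchmark to evaluate Finnish large language models. It is based on the TruthfulQA benchmark, which measures whether a model's generated answers are truthful rather than imitations of common human misconceptions.
The summary above was collected and generated by 遇见数据集.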



