felm
Source: ModelScope community (updated 2025-12-05, indexed 2025-02-22)
Download link:
https://modelscope.cn/datasets/hkust-nlp/felm
# Dataset Card for FELM
## Table of Contents
- [Dataset Card for FELM](#dataset-card-for-felm)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
      - [Initial Data Collection and Clean](#initial-data-collection-and-clean)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Needs More Information]
- **Repository:** https://github.com/hkust-nlp/felm
- **Paper:** https://arxiv.org/abs/2310.00741
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
[[Paper]](https://arxiv.org/abs/2310.00741) [[GitHub Repo]](https://github.com/hkust-nlp/felm)
FELM is a meta-benchmark for assessing factuality evaluators of large language models.<br>
The benchmark comprises 847 questions that span five distinct domains: world knowledge, science/technology, writing/recommendation, reasoning, and math. We gather the prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, prompts generated by ChatGPT, and prompts drafted by the authors.<br>
We then obtain responses from ChatGPT for these prompts. For each response, we employ fine-grained annotation at the segment level, which includes reference links, identified error types, and the reasons behind these errors as provided by our annotators.
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
The text in the dataset is in English.
## Dataset Structure
### Data Instances
An example looks as follows:
```json
{
  "index": "0",
  "source": "quora",
  "prompt": "Which country or city has the maximum number of nuclear power plants?",
  "response": "The United States has the highest number of nuclear power plants in the world, with 94 operating reactors. Other countries with a significant number of nuclear power plants include France, China, Russia, and South Korea.",
  "segmented_response": ["The United States has the highest number of nuclear power plants in the world, with 94 operating reactors.", "Other countries with a significant number of nuclear power plants include France, China, Russia, and South Korea."],
  "labels": [false, true],
  "comment": ["As of December 2022, there were 92 operable nuclear power reactors in the United States.", ""],
  "type": ["knowledge_error", null],
  "ref": ["https://www.eia.gov/tools/faqs/faq.php?id=207&t=3"]
}
```
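The per-segment lists in a record can be walked in parallel. The sketch below is a minimal illustration, assuming the record has been parsed from JSON into a Python dict and that `labels`, `comment`, and `type` are aligned with `segmented_response`, as in the example above. The helper names `iter_segments` and `response_is_factual` are hypothetical and not part of the FELM repository, and treating a response as factual only when every segment is labeled true is an assumption of this sketch.

```python
import json

# Abbreviated copy of the example record above (JSON text, hence true/false/null).
RECORD_JSON = """
{"index": "0",
 "source": "quora",
 "prompt": "Which country or city has the maximum number of nuclear power plants?",
 "segmented_response": [
   "The United States has the highest number of nuclear power plants in the world, with 94 operating reactors.",
   "Other countries with a significant number of nuclear power plants include France, China, Russia, and South Korea."],
 "labels": [false, true],
 "comment": ["As of December 2022, there were 92 operable nuclear power reactors in the United States.", ""],
 "type": ["knowledge_error", null]}
"""


def iter_segments(record):
    """Yield one dict per segment, pairing it with its label, error type, and comment."""
    for segment, label, err_type, comment in zip(
        record["segmented_response"], record["labels"], record["type"], record["comment"]
    ):
        yield {"segment": segment, "factual": label, "error_type": err_type, "comment": comment}


def response_is_factual(record):
    """Assumed aggregation: a response is factual only if all of its segments are."""
    return all(record["labels"])


record = json.loads(RECORD_JSON)
segments = list(iter_segments(record))
errors = [s for s in segments if not s["factual"]]

# Print each segment flagged as a factual error, with its error type.
for s in errors:
    print(f'{s["error_type"]}: {s["segment"]}')
print("response factual:", response_is_factual(record))
```

Parsing with `json.loads` converts `true`, `false`, and `null` into Python's `True`, `False`, and `None`, so the record can be used directly in boolean tests.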
### Data Fields
| Field Name | Type | Description |
| ----------- | ----------- | ------------------------------------------- |
| index | integer | the order number of the data point |
| source | string | the source of the prompt |
| prompt | string | the prompt used to generate the response |
| response | string | ChatGPT's response to the prompt |
| segmented_response | list | the response split into segments |
| labels | list | factuality label for each segment in segmented_response (false marks a factual error) |
| comment | list | reason for the error, for each segment with a factual error |
| type | list | error type, for each segment with a factual error |
| ref | list | reference links |
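Because FELM is a meta-benchmark, a natural use of the `labels` field is to compare a factuality evaluator's per-segment predictions with the annotated labels. The sketch below shows one way to do that; it is not the authors' evaluation script. The function name `score_error_detection` is hypothetical, the toy labels are made up, and the choice of metrics (precision, recall, and F1 on the error class, plus balanced accuracy) is an assumption rather than something this card specifies.

```python
def score_error_detection(gold, pred):
    """Compare gold and predicted segment labels (True = factually correct).

    A segment with a factual error (label False) is treated as the positive class.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)  # error caught
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)      # false alarm
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)      # error missed
    tn = sum(1 for g, p in zip(gold, pred) if g and p)          # correct segment accepted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Toy labels, flattened over the segments of several responses (made-up values).
gold = [False, True, True, False, True, True]
pred = [False, True, False, True, True, True]
scores = score_error_detection(gold, pred)
print(scores)
```

Scoring the error class rather than the correct class keeps the metric informative when most segments are factually correct, which is why balanced accuracy is included alongside F1 here.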
## Dataset Creation
### Source Data
#### Initial Data Collection and Clean
We gather the prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, prompts generated by ChatGPT, and prompts drafted by the authors.
The authors cleaned the data.
### Annotations
#### Annotation process
We have developed an annotation tool and established annotation guidelines. All annotations undergo a double-check process, which involves review by both other annotators and an expert reviewer.
#### Who are the annotators?
The authors of the paper; Yuzhen Huang, Yikai Zhang, Tangjun Su.
## Additional Information
### Licensing Information
This dataset is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
### Citation Information
```bibtex
@inproceedings{
chen2023felm,
title={FELM: Benchmarking Factuality Evaluation of Large Language Models},
author={Chen, Shiqi and Zhao, Yiran and Zhang, Jinghan and Chern, I-Chun and Gao, Siyang and Liu, Pengfei and He, Junxian},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023},
url={http://arxiv.org/abs/2310.00741}
}
```
### Contributions
[Needs More Information]
Provider: maas
Created: 2025-02-17



