wentingzhao/WildHallucinations
Source: Hugging Face. Updated 2024-06-13; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/wentingzhao/WildHallucinations
Resource description:
---
dataset_info:
  features:
  - name: entity
    dtype: string
  - name: perplexity
    dtype: float64
  - name: info
    list:
    - name: status_code
      dtype: int64
    - name: text
      dtype: string
    - name: url
      dtype: string
  - name: category
    dtype: string
  - name: wiki
    dtype: int64
  splits:
  - name: train
    num_bytes: 1944535165
    num_examples: 7917
  download_size: 1406426092
  dataset_size: 1944535165
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 1K<n<10K
---
WildHallucinations is designed for evaluating the factuality of LLMs.
Its core idea is to prompt LLMs to generate and fact-check information about a diverse set of entities.
WildHallucinations consists of two parts: 7,917 entities extracted from WildChat, and a knowledge source for those entities.
These entities come from English conversations that are marked as non-toxic.
As described in the main paper, we apply extensive filtering for quality control,
in particular to remove entities with more than one meaning.
The knowledge source is constructed using the Google Search API: we scrape the top 10 web pages returned for each entity.
Additional cleaning steps are described in the paper.
To use the dataset:
```
from datasets import load_dataset
ds = load_dataset("wentingzhao/WildHallucinations", split="train")
```
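Once the data is loaded, a quick summary by category and Wikipedia coverage can serve as a sanity check. The sketch below is one possible way to do this, not part of the dataset's tooling: `summarize_records` is a hypothetical helper, it assumes each record is a dict with the `category` and `wiki` columns, and it assumes a non-zero `wiki` value means Wikipedia information is present. The sample records and their category names are invented for illustration.

```python
from collections import Counter


def summarize_records(records):
    """Count records per category and how many carry a non-zero wiki flag."""
    per_category = Counter()
    with_wiki = 0
    total = 0
    for rec in records:
        per_category[rec["category"]] += 1
        # Assumption: wiki is an integer flag; non-zero means Wikipedia info exists.
        if rec["wiki"]:
            with_wiki += 1
        total += 1
    return {
        "total": total,
        "with_wiki": with_wiki,
        "per_category": dict(per_category),
    }


# Invented records for illustration only; real rows come from load_dataset(...).
sample = [
    {"entity": "Example Entity A", "category": "people", "wiki": 1},
    {"entity": "Example Entity B", "category": "people", "wiki": 0},
    {"entity": "Example Entity C", "category": "software", "wiki": 1},
]
summary = summarize_records(sample)
print(summary)
```

With the real data, `summarize_records(ds)` should behave the same way, since iterating over a `datasets.Dataset` yields one dict per row.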
Dataset Columns:
* entity (string): the entity name
* perplexity (float64): the perplexity of the entity measured by the Llama-3-8B model
* info (list): the web information about the entity scraped from Google search results; each item has a status_code (int64), text (string), and url (string)
* category (string): the category of the entity annotated by either an author or GPT-4o
* wiki (int64, used as a Boolean flag): whether any information about the entity comes from wikipedia.org
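Because `info` holds several scraped pages per entity, a natural first step is to keep only the pages that were fetched successfully and contain text. The following is a minimal sketch under stated assumptions: `usable_pages` is a hypothetical helper, `status_code` is taken to be the HTTP status of the scrape (so 200 means success), and `info` is assumed to come back as a list of dicts, as the `list:` feature in the schema suggests. The sample record is invented for illustration.

```python
def usable_pages(record, ok_status=200):
    """Return (url, text) pairs for pages with the expected status and non-empty text."""
    pages = []
    for page in record["info"]:
        # Treat a missing or None text field as empty.
        text = (page.get("text") or "").strip()
        # Assumption: status_code is the HTTP status returned when scraping.
        if page.get("status_code") == ok_status and text:
            pages.append((page["url"], text))
    return pages


# Invented record for illustration only; it mirrors the schema, not real data.
record = {
    "entity": "Example Entity",
    "info": [
        {"status_code": 200, "text": "Some page text.", "url": "https://example.com/a"},
        {"status_code": 404, "text": "", "url": "https://example.com/b"},
        {"status_code": 200, "text": "   ", "url": "https://example.com/c"},
    ],
}
pages = usable_pages(record)
print(pages)
```

Filtering on both the status and the stripped text guards against pages that returned successfully but yielded no usable content.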
Provider:
wentingzhao
Original information summary
Dataset overview
Dataset features
- entity: string
- perplexity: float64
- info: contains the following sub-features
  - status_code: int64
  - text: string
  - url: string
- category: string
- wiki: int64
Dataset splits
- train: 7917 examples, 1944535165 bytes
Dataset size
- Download size: 1406426092 bytes
- Dataset size: 1944535165 bytes
Configuration
- config_name: default
- data_files:
  - split: train
  - path: data/train-*
License
- License: MIT
Task categories
- text-generation
Language
- en
Size category
- 1K<n<10K