wentingzhao/WildHallucinations

Name: wentingzhao/WildHallucinations
Creator: wentingzhao
Published: 2024-06-13 04:06:52
License: MIT

Source: Hugging Face. Updated: 2024-06-13. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/wentingzhao/WildHallucinations

Description:

---
dataset_info:
  features:
  - name: entity
    dtype: string
  - name: perplexity
    dtype: float64
  - name: info
    list:
    - name: status_code
      dtype: int64
    - name: text
      dtype: string
    - name: url
      dtype: string
  - name: category
    dtype: string
  - name: wiki
    dtype: int64
  splits:
  - name: train
    num_bytes: 1944535165
    num_examples: 7917
  download_size: 1406426092
  dataset_size: 1944535165
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 1K<n<10K
---

WildHallucinations is designed for evaluating the factuality of LLMs. Its core idea is to prompt LLMs to generate information about a diverse set of entities and then fact-check that information.

WildHallucinations consists of 7917 entities extracted from WildChat, together with a knowledge source. The entities come from English conversations that are marked as non-toxic. As described in the main paper, we apply extensive filtering for quality control, especially to remove entities with more than one meaning.

The knowledge source is constructed using the Google search API: we scrape the top 10 web pages for each entity. Additional cleaning steps are described in the paper.

To use the dataset:

```
from datasets import load_dataset

# Load the single "train" split (7917 entities).
ds = load_dataset("wentingzhao/WildHallucinations", split="train")
```

Dataset Columns:

* entity (string): the entity name
* perplexity (float): the perplexity of the entity measured by the Llama-3-8B model
* info (list): the web information about the entity scraped from Google search results; each item has status_code (int64), text (string), and url (string) fields
* category (string): the category of the entity annotated by either an author or GPT-4o
* wiki (Boolean, stored as int64): whether any information about the entity comes from wikipedia.org
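Since each record carries a list of scraped pages, a common first step is to keep only the pages that look usable. The following is a minimal sketch, not part of the original dataset card. It assumes that `status_code` is the HTTP status of the scrape (so 200 means success), which the card does not state explicitly. The helper name `usable_pages` and the sample record are hypothetical, and the sample values are made up for illustration.

```python
from typing import Any


def usable_pages(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the scraped pages of one record that look usable.

    Assumption: status_code is an HTTP status, so 200 means the scrape succeeded.
    Pages with empty or whitespace-only text are dropped as well.
    """
    return [
        page
        for page in record["info"]
        if page["status_code"] == 200 and page["text"].strip()
    ]


# Hypothetical record in the documented schema (values are illustrative only).
sample = {
    "entity": "Example Entity",
    "perplexity": 12.5,
    "info": [
        {"status_code": 200, "text": "Some page text.", "url": "https://example.com/a"},
        {"status_code": 404, "text": "", "url": "https://example.com/b"},
    ],
    "category": "example",
    "wiki": 0,
}

pages = usable_pages(sample)
# wiki is stored as int64, so convert it to a Python bool before using it as a flag.
has_wikipedia = bool(sample["wiki"])
```

With a loaded split, the same helper could be applied per row, for example `usable_pages(ds[0])`.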

Provided by:

wentingzhao

Summary of original information

Dataset overview

Dataset features

entity: string
perplexity: float64
info: contains the following sub-features
- status_code: int64
- text: string
- url: string
category: string
wiki: int64
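The feature list above can be turned into a quick sanity check on individual records. This is a sketch under stated assumptions: it assumes that `string`, `float64`, and `int64` values appear as Python `str`, `float`, and `int` once a row is read, and the names `EXPECTED_TOP_LEVEL`, `EXPECTED_INFO_ITEM`, and `schema_problems` are hypothetical helpers, not part of the dataset or of any library.

```python
# Expected Python types per field, mirroring the listed features.
# Assumption: string -> str, float64 -> float, int64 -> int.
EXPECTED_TOP_LEVEL = {
    "entity": str,
    "perplexity": float,
    "info": list,
    "category": str,
    "wiki": int,
}
EXPECTED_INFO_ITEM = {"status_code": int, "text": str, "url": str}


def schema_problems(record: dict) -> list[str]:
    """Return human-readable schema problems for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TOP_LEVEL.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")

    info = record.get("info")
    if isinstance(info, list):
        for i, item in enumerate(info):
            if not isinstance(item, dict):
                problems.append(f"info[{i}]: expected dict")
                continue
            for name, expected in EXPECTED_INFO_ITEM.items():
                if not isinstance(item.get(name), expected):
                    problems.append(f"info[{i}].{name}: expected {expected.__name__}")
    return problems


# Hypothetical records (values are illustrative only).
good = {
    "entity": "Example Entity",
    "perplexity": 12.5,
    "info": [{"status_code": 200, "text": "Page text.", "url": "https://example.com"}],
    "category": "example",
    "wiki": 1,
}
bad = dict(good, wiki="yes")

good_report = schema_problems(good)
bad_report = schema_problems(bad)
```

Returning a list of messages rather than raising on the first mismatch makes it easier to see every deviation in a record at once.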