
wentingzhao/WildHallucinations

Source: Hugging Face. Updated 2024-06-13; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/wentingzhao/WildHallucinations
Resource summary:
---
dataset_info:
  features:
  - name: entity
    dtype: string
  - name: perplexity
    dtype: float64
  - name: info
    list:
    - name: status_code
      dtype: int64
    - name: text
      dtype: string
    - name: url
      dtype: string
  - name: category
    dtype: string
  - name: wiki
    dtype: int64
  splits:
  - name: train
    num_bytes: 1944535165
    num_examples: 7917
  download_size: 1406426092
  dataset_size: 1944535165
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 1K<n<10K
---

WildHallucinations is designed for evaluating the factuality of LLMs. Its core idea is to prompt LLMs to generate and fact-check information about a diverse set of entities.

WildHallucinations consists of 7,917 entities extracted from WildChat, together with a knowledge source. These entities come from English conversations that are marked as non-toxic. As described in the main paper, we apply extensive filtering for quality control, especially for removing entities with more than one meaning. The knowledge source is constructed using the Google search API: we scrape the top 10 web pages for each entity. Details of the additional cleaning process can be found in the paper.

To use the dataset:

```
from datasets import load_dataset

ds = load_dataset("wentingzhao/WildHallucinations", split="train")
```

Dataset Columns:

* entity (string): the entity name
* perplexity (float): the perplexity of the entity measured by the Llama-3-8B model
* info (list): the web information about the entity scraped from Google search results; each item has a status_code (int64), a text (string), and a url (string)
* category (string): the category of the entity annotated by either an author or GPT-4o
* wiki (int64, used as a Boolean flag): whether any information about the entity comes from wikipedia.org
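Because each `info` value is a list of scraped pages rather than a single string, a small helper can be handy for collecting the usable text for one entity. The sketch below runs on a hand-made sample record that mirrors the declared schema. The helper name `usable_pages`, the entity name, the URLs, and the texts are all invented for illustration, and treating `status_code` as an HTTP status where 200 means a successful fetch is an assumption, not something the dataset card states.

```python
from typing import Any


def usable_pages(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the scraped pages that have a 200 status and non-empty text.

    Assumption: status_code is an HTTP status code, so 200 means success.
    """
    return [
        page
        for page in record["info"]
        if page["status_code"] == 200 and page["text"].strip()
    ]


# Hypothetical record shaped like one row of the train split.
sample = {
    "entity": "Example Entity",
    "perplexity": 12.5,
    "info": [
        {"status_code": 200, "text": "First page text.", "url": "https://example.org/a"},
        {"status_code": 404, "text": "", "url": "https://example.org/b"},
        {"status_code": 200, "text": "   ", "url": "https://example.org/c"},
    ],
    "category": "example",
    "wiki": 0,
}

pages = usable_pages(sample)
# Join the surviving page texts into one reference document for the entity.
knowledge = "\n\n".join(page["text"] for page in pages)
print(len(pages), repr(knowledge))
```

With the real dataset, the same helper could presumably be applied to a row returned by `load_dataset`, for example `usable_pages(ds[0])`.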
Provided by:
wentingzhao
Original information summary

Dataset overview

Dataset features

  • entity: string
  • perplexity: floating point (float64)
  • info: list whose items contain the following sub-features
    • status_code: integer (int64)
    • text: string
    • url: string
  • category: string
  • wiki: integer (int64)
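The feature list above can be written down as Python type definitions, which makes the nesting of `info` explicit. This is a minimal sketch using only the standard library. The class names `InfoItem` and `Record`, the helper `matches_schema`, and the example row are hypothetical and are not part of the dataset or of the `datasets` library.

```python
from typing import Any, TypedDict


class InfoItem(TypedDict):
    status_code: int  # int64 in the dataset
    text: str
    url: str


class Record(TypedDict):
    entity: str
    perplexity: float  # float64 in the dataset
    info: list[InfoItem]
    category: str
    wiki: int  # int64 in the dataset


def matches_schema(row: dict[str, Any]) -> bool:
    """Check that a row has exactly the expected keys and Python types."""
    top_level = {"entity": str, "perplexity": float, "category": str, "wiki": int}
    if set(row) != set(top_level) | {"info"}:
        return False
    if not all(isinstance(row[key], typ) for key, typ in top_level.items()):
        return False
    if not isinstance(row["info"], list):
        return False
    item_types = {"status_code": int, "text": str, "url": str}
    return all(
        isinstance(item, dict)
        and set(item) == set(item_types)
        and all(isinstance(item[key], typ) for key, typ in item_types.items())
        for item in row["info"]
    )


# Invented row used only to exercise the check.
row = {
    "entity": "Example Entity",
    "perplexity": 3.2,
    "info": [{"status_code": 200, "text": "Some text.", "url": "https://example.org"}],
    "category": "example",
    "wiki": 1,
}
print(matches_schema(row))
```

A check like this is only a convenience for catching shape mistakes in downstream code; the authoritative schema is the `dataset_info` metadata itself.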

Dataset splits

  • train: 7,917 examples; 1,944,535,165 bytes

Dataset size

  • Download size: 1,406,426,092 bytes
  • Dataset size: 1,944,535,165 bytes
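The raw byte counts are easier to reason about after a little arithmetic. The sketch below converts them to GiB and derives an average size per example and a download-to-dataset ratio from the three numbers in the metadata (7917 examples, 1944535165 dataset bytes, 1406426092 download bytes). Reading the download size as the size of the files fetched and the dataset size as the size of the prepared data is an assumption about how these fields are usually used, not something this page states.

```python
NUM_EXAMPLES = 7917
DATASET_BYTES = 1_944_535_165
DOWNLOAD_BYTES = 1_406_426_092

GIB = 2**30  # bytes per gibibyte

dataset_gib = DATASET_BYTES / GIB
download_gib = DOWNLOAD_BYTES / GIB
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"dataset size:      {dataset_gib:.2f} GiB")
print(f"download size:     {download_gib:.2f} GiB")
print(f"avg per example:   {avg_bytes_per_example / 1024:.0f} KiB")
print(f"download/dataset:  {download_ratio:.3f}")
```

The average of roughly 240 KiB per example is large for a text dataset, which is consistent with each entity carrying the text of several scraped web pages.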

Configuration

  • config_name: default
  • data_files:
    • split: train
    • path: data/train-*
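The `path` value is a glob pattern, so the train split is made of every file under `data/` whose name starts with `train-`. The sketch below illustrates that matching with the standard library's `fnmatch`. The file names are hypothetical, since the actual shard names are not listed here, and `fnmatch` is only a stand-in: the pattern resolution performed when the dataset is loaded may differ in its details.

```python
from fnmatch import fnmatch

PATTERN = "data/train-*"

# Hypothetical file names; the actual shard names are not listed on this page.
candidates = [
    "data/train-00000-of-00004.parquet",
    "data/train-00001-of-00004.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

# Keep only the files that the train split pattern would select.
train_files = [name for name in candidates if fnmatch(name, PATTERN)]
print(train_files)
```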

License

  • License: MIT

Task categories

  • text-generation

Language

  • en

Size category

  • 1K<n<10K