
eve-esa/hallucination

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/eve-esa/hallucination
Description:
---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Answer
    dtype: string
  - name: Topic
    dtype: string
  - name: Justification
    dtype: string
  - name: Hallucinated Spans
    dtype: string
  splits:
  - name: train
    num_bytes: 3069466
    num_examples: 1830
  download_size: 1580254
  dataset_size: 3069466
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
- question-answering
language:
- en
tags:
- EVE
- EO
- Earth
- hallucination-detection
- span-detection
pretty_name: EVE-Hallucination
size_categories:
- 1K<n<10K
---

# Dataset Summary

EVE-Hallucination is a specialized dataset designed to evaluate language models' tendency to hallucinate (generate factually incorrect or unsupported information) in the Earth Observation (EO) domain. Unlike typical QA datasets that focus on correctness, this dataset contains deliberately hallucinated answers with detailed annotations marking which portions of the text are hallucinated.

This dataset is crucial for developing and evaluating hallucination detection systems, training models to identify unreliable content, and measuring the reliability of language models in critical EO applications where factual accuracy is paramount.

# Dataset Structure

Each example in the dataset contains:

- **Question**: A question related to Earth Observation
- **Answer**: A model-generated or synthetic answer that contains hallucinated information
- **Topic**: The EO category
- **Justification**: The reason why the specific text span is hallucinated
- **Hallucinated Spans**: A list of text spans each containing:
  - **start_char**: Starting character index of the hallucinated span
  - **end_char**: Ending character index of the hallucinated span
  - **text**: The actual text content of the hallucinated span

**Note on Span Format**: All spans are expressed as character indices (not word or token indices).
For example, if the answer is "Sentinel-2 has a resolution of 5 meters" and "5 meters" is hallucinated, the span would be `[31, 39]`, representing 0-indexed, end-exclusive character positions in the string.

## Example

```python
{
    "Question": "What is the spatial resolution of Sentinel-2's visible bands?",
    "Answer": "Sentinel-2's visible bands have a spatial resolution of 5 meters, making it the highest resolution freely available satellite.",
    "Topic": "Satellite Observation",
    "Justification": "Sentinel-2 does not have 5m resolution",
    "Hallucinated Spans": [
        {
            "start_char": 56,
            "end_char": 64,
            "text": "5 meters"
        },
        {
            "start_char": 76,
            "end_char": 125,
            "text": "the highest resolution freely available satellite"
        }
    ],
}
```

## Using the Dataset

```python
import json

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("eve-esa/hallucination", split="train")

# Access an example
example = dataset[0]
print(f"Question: {example['Question']}")
print(f"Answer: {example['Answer']}")

# The metadata declares this column as a string; parsing it as JSON is an assumption
spans = example["Hallucinated Spans"]
if isinstance(spans, str):
    spans = json.loads(spans)

print("\nHallucinated Spans:")
for label in spans:
    print(f"  - Text: '{label['text']}'")
    print(f"  - Span: [{label['start_char']}, {label['end_char']}]")
```

# Evaluation Tasks

This dataset supports multiple hallucination detection tasks with increasing granularity:

## 1. Binary Hallucination Detection

**Task**: Determine whether the answer contains any hallucinated information (yes/no).

**Metrics**:

- **Precision**: Of all answers flagged as hallucinated, how many actually contain hallucinations
- **Recall**: Of all answers with hallucinations, how many were correctly identified
- **F1 Score**: Harmonic mean of precision and recall

```python
# Example: binary detection
def has_hallucination(spans):
    """Return True if the answer has at least one annotated hallucinated span."""
    return len(spans) > 0

# `model_detect_hallucination` is a placeholder for your own detector (returns True/False)
prediction = model_detect_hallucination(example["Question"], example["Answer"])

# Assumes the spans are already a list; parse the field first if it is stored as a string
ground_truth = has_hallucination(example["Hallucinated Spans"])
```

## 2. Hard Span Detection

**Task**: Identify the exact character spans that are hallucinated (binary classification at span level).

**Metrics**:

- **Precision**: Of all predicted hallucinated spans, how many match ground truth
- **Recall**: Of all ground truth hallucinated spans, how many were identified
- **F1 Score**: Harmonic mean of precision and recall

```python
# Example: hard span detection
from datasets import load_dataset

dataset = load_dataset("eve-esa/hallucination", split="train")
example = dataset[0]

# `model_detect_spans` is a placeholder for your own model; it should return
# the character spans it considers hallucinated, e.g. [[56, 64], [76, 125]]
predicted_spans = model_detect_spans(example["Question"], example["Answer"])

# Assumes the spans are a list of dicts; parse the field first if it is stored as a string
ground_truth_spans = [
    (span["start_char"], span["end_char"]) for span in example["Hallucinated Spans"]
]

# Compute exact-match span metrics
def compute_span_metrics(pred_spans, true_spans):
    pred_set = set(map(tuple, pred_spans))
    true_set = set(map(tuple, true_spans))

    # Nothing predicted and nothing annotated counts as a perfect match
    if len(pred_set) == 0 and len(true_set) == 0:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}

    true_positives = len(pred_set & true_set)
    precision = true_positives / len(pred_set) if len(pred_set) > 0 else 0.0
    recall = true_positives / len(true_set) if len(true_set) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
```

# Use Cases

This dataset is particularly valuable for:

- **Hallucination Detection Systems**: Training and evaluating models that identify unreliable AI-generated content
- **Model Reliability Assessment**: Measuring how often and where models hallucinate in the EO domain
- **RAG System Evaluation**: Testing whether retrieval-augmented systems introduce hallucinations
- **Confidence Calibration**: Training models to accurately estimate their uncertainty
- **Safety-Critical Applications**: Ensuring AI systems don't generate misleading information in domains like climate monitoring, disaster response, and environmental analysis

# Benchmark Results

We provide baseline results for
hallucination **character-level span detection** using our [EVE-Instruct](https://huggingface.co/eve-esa/EVE-Instruct) model. Benchmarks for binary hallucination detection are reported on the [EVE-Instruct](https://huggingface.co/eve-esa/EVE-Instruct) page.

## Task

**Hard Span Detection (Character-Level)**

Given a `(Question, Answer)` pair, the model predicts character spans in the answer that contain hallucinated content.

## Metric

For each example, we calculate the **Intersection-over-Union (IoU)** at the character level. The final score is the **mean IoU across all samples**.

IoU is bounded between **0 and 1**, where:

- **1.0** indicates perfect span overlap.
- **0.0** indicates no overlap.

## Results

| Model | IoU |
|-------|-----|
| EVE-Instruct | **0.0913** |

# Implementation Notes

When working with this dataset:

1. **Character-level spans**: All span indices are based on Python string indexing (0-indexed, end-exclusive)
2. **Overlapping spans**: Some spans may overlap if different parts of the same phrase have different hallucination probabilities
3. **Empty labels**: Some answers may have empty label lists if they contain no hallucinations (though this dataset focuses on hallucinated content)

# Citation

If you use this project in academic or research settings, please cite:

```
@misc{atrio2026evedomainspecificllmframework,
      title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
      author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
      year={2026},
      eprint={2604.13071},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.13071},
}
```
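The character-level IoU described under Metric could be computed along the following lines. This is a minimal sketch, not the authors' evaluation script: the function names `spans_to_chars`, `char_iou`, and `mean_char_iou` are hypothetical, spans are assumed to be end-exclusive `(start_char, end_char)` pairs, and scoring two empty span lists as 1.0 is an assumption the card does not state. The spans in the usage lines are made up for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def spans_to_chars(spans: Iterable[Sequence[int]]) -> Set[int]:
    """Expand (start, end) spans into the set of covered character indices."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))  # end-exclusive, like Python slicing
    return covered


def char_iou(pred_spans: Iterable[Sequence[int]],
             true_spans: Iterable[Sequence[int]]) -> float:
    """Intersection-over-Union of the characters covered by two span lists."""
    pred = spans_to_chars(pred_spans)
    true = spans_to_chars(true_spans)
    if not pred and not true:
        return 1.0  # assumption: nothing predicted, nothing annotated
    return len(pred & true) / len(pred | true)


def mean_char_iou(pairs: List[Tuple[list, list]]) -> float:
    """Average per-example IoU over (predicted, ground-truth) span lists."""
    scores = [char_iou(pred, true) for pred, true in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative spans only: 5 shared characters out of 15 covered in total
example_iou = char_iou([(15, 25)], [(10, 20)])
print(f"IoU = {example_iou:.4f}")
```

Converting spans to character sets means overlapping or duplicated spans within one list are counted once, so the score depends only on which characters are covered.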
Provider:
eve-esa