eve-esa/open-ended-w-context

Name: eve-esa/open-ended-w-context
Creator: eve-esa
Published: 2026-04-16 07:56:13
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/eve-esa/open-ended-w-context

Description:

---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Answer
    dtype: string
  - name: Doc 1
    dtype: string
  - name: Doc 2
    dtype: string
  - name: Doc 3
    dtype: string
  - name: Source 1
    dtype: string
  - name: Source 2
    dtype: string
  - name: Source 3
    dtype: string
  - name: file_path_1
    dtype: string
  - name: file_path_2
    dtype: string
  - name: file_path_3
    dtype: string
  splits:
  - name: train
    num_bytes: 29521159
    num_examples: 418
  download_size: 12525331
  dataset_size: 29521159
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- EVE
- EO
- Earth
- RAG
- context-grounded
pretty_name: EVE-Open-Ended-w-Context
size_categories:
- n<1K
---

# Dataset Summary

EVE-open-ended-w-context is a collection of open-ended question-answer pairs focused on Earth Observation (EO) with accompanying context documents. Unlike the standard open-ended dataset, this version provides up to 3 relevant documents for each question that models can use to ground their responses. This makes it ideal for evaluating Retrieval-Augmented Generation (RAG) systems and testing models' ability to leverage provided context when answering questions.

The dataset covers a wide range of EO topics, including satellite imagery analysis, remote sensing techniques, environmental monitoring, LiDAR, and more. The context documents provide relevant background information, technical specifications, and domain knowledge that can help models generate more accurate and grounded responses.

**Note**: Not all samples contain all three documents. Some questions may have 1, 2, or 3 context documents depending on the availability of relevant information. Models should be designed to handle variable numbers of context documents gracefully.
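The metadata above also lists `Source 1`-`Source 3` and `file_path_1`-`file_path_3` columns. As a minimal sketch, assuming each `Source N` and `file_path_N` belongs to the same-numbered `Doc N` (the card does not state this explicitly), the non-empty documents of a sample could be gathered together with their provenance. The helper name `context_records` and the toy row are hypothetical.

```python
def context_records(sample):
    """Collect non-empty documents with their same-numbered source and file path.

    Assumption: "Source N" and "file_path_N" describe "Doc N".
    """
    records = []
    for i in (1, 2, 3):
        doc = sample.get(f"Doc {i}")
        # Skip documents that are missing, None, or whitespace-only.
        if not (isinstance(doc, str) and doc.strip()):
            continue
        records.append({
            "index": i,
            "text": doc,
            "source": sample.get(f"Source {i}"),
            "file_path": sample.get(f"file_path_{i}"),
        })
    return records


# Toy row standing in for one element of dataset["train"]; values are placeholders.
toy_sample = {
    "Question": "Placeholder question",
    "Answer": "Placeholder answer",
    "Doc 1": "First document text",
    "Doc 2": "",
    "Doc 3": None,
    "Source 1": "placeholder-source",
    "Source 2": "",
    "Source 3": None,
    "file_path_1": "placeholder/path.txt",
    "file_path_2": "",
    "file_path_3": None,
}

records = context_records(toy_sample)
print(len(records), records[0]["source"])
```

Keeping the source next to each document makes it possible to cite or log which document an answer was grounded in, without changing how the prompt itself is built.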
# Dataset Structure

Each example in the dataset contains:

- **Question**: A question related to Earth Observation
- **Answer**: A reference answer to the question
- **Doc 1**: First context document containing relevant information (may be empty)
- **Doc 2**: Second context document containing relevant information (may be empty)
- **Doc 3**: Third context document containing relevant information (may be empty)

The dataset metadata additionally lists `Source 1`-`Source 3` and `file_path_1`-`file_path_3` string columns.

The context documents are selected to be relevant to the question and may contain information that helps answer it, though models should also leverage their general knowledge where appropriate.

**Important**: Not all samples include all three documents. Some may have only Doc 1, some may have Doc 1 and Doc 2, and others may have all three. Empty or missing documents should be handled appropriately in your implementation.

## Examples

### Example 1: Sample with all three documents

```python
{
    "Question": "What is the spatial resolution of Sentinel-2's visible bands?",
    "Answer": "Sentinel-2's visible bands (B2, B3, B4) have a spatial resolution of 10 meters.",
    "Doc 1": "The Sentinel-2 mission comprises a constellation of two satellites...",
    "Doc 2": "Sentinel-2 carries the Multi-Spectral Instrument (MSI) with 13 spectral bands...",
    "Doc 3": "The visible bands of Sentinel-2 provide high-resolution imagery suitable for..."
}
```

### Example 2: Sample with only one document

```python
{
    "Question": "What are the main applications of SAR imagery?",
    "Answer": "SAR imagery is primarily used for terrain mapping, disaster monitoring, and ice surveillance.",
    "Doc 1": "Synthetic Aperture Radar (SAR) provides all-weather imaging capabilities...",
    "Doc 2": "",
    "Doc 3": ""
}
```

# Metrics

The metrics suggested to evaluate model performance on EVE-open-ended-w-context are the same as for the standard open-ended dataset:

- **BLEU (Bilingual Evaluation Understudy)**: Measures the overlap between the generated answer and the reference answer based on n-grams.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Focuses on recall by measuring the overlap of n-grams, word sequences, and word pairs between the generated answer and the reference answer.
- **Cosine Similarity**: Measures the similarity in embedding space between the reference answer and the candidate answer.
- **BERTScore**: Utilizes pre-trained BERT models to compute similarity scores between the generated and reference answers based on contextual embeddings.
- **LLM-as-Judge**: Uses a large language model to evaluate the quality of the generated answers based on criteria such as relevance, coherence, and informativeness.

## LLM-as-Judge

To use LLM-as-Judge for evaluating open-ended responses with context, you can follow these steps:

1. **Select a Pre-trained LLM**: Choose a large language model that is capable of understanding and evaluating text, such as GPT-4, Claude, or any other suitable model.
2. **Define Evaluation Criteria**: Establish clear criteria for evaluation, such as relevance to the question, coherence of the response, informativeness, and overall quality.
3. **Prompt Engineering**: Create prompts that instruct the LLM to evaluate the generated answers based on the defined criteria.
4. **Run Evaluations**: Input the generated answers and reference answers into the LLM using the designed prompts to obtain evaluation scores or qualitative feedback.

Here is the prompt used to evaluate the open-ended responses:

```
You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.

**Evaluation Rules:**

1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").

---
**Task**:

Question: "{question}"
Provided Answer: "{output}"
Reference Answer: "{reference}"

Using the rules above, assign a score of 0 if any failure condition is met.

{format_instructions}
```

# Usage Examples

## Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("eve-esa/open-ended-w-context")

# Access the data
for sample in dataset['train']:
    question = sample['Question']
    answer = sample['Answer']
    doc1 = sample['Doc 1']
    doc2 = sample['Doc 2']
    doc3 = sample['Doc 3']

    print(f"Q: {question}")

    # Check and display available context documents
    available_docs = []
    if doc1 and doc1.strip():
        available_docs.append(1)
        print(f"Context 1: {doc1[:100]}...")
    if doc2 and doc2.strip():
        available_docs.append(2)
        print(f"Context 2: {doc2[:100]}...")
    if doc3 and doc3.strip():
        available_docs.append(3)
        print(f"Context 3: {doc3[:100]}...")

    print(f"Available documents: {available_docs}")
    print(f"A: {answer}\n")
```

## Generating Answers with Context

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("eve-esa/open-ended-w-context")


def format_prompt_with_context(question, doc1, doc2, doc3):
    """
    Format a prompt that includes the context documents.
    Handles cases where some documents may be empty or None.

    Args:
        question: The question to answer
        doc1, doc2, doc3: Context documents (may be empty/None)

    Returns:
        Formatted prompt string
    """
    prompt = "Answer the following question using the provided context documents.\n\n"

    # Add only non-empty documents, numbering them consecutively
    doc_count = 1
    for doc in [doc1, doc2, doc3]:
        if doc and doc.strip():  # Check if document exists and is not empty
            prompt += f"Context Document {doc_count}:\n{doc}\n\n"
            doc_count += 1

    prompt += f"Question: {question}\n\nAnswer:"
    return prompt


# Generate answers for each example
for sample in dataset['train']:
    prompt = format_prompt_with_context(
        sample['Question'],
        sample['Doc 1'],
        sample['Doc 2'],
        sample['Doc 3']
    )

    # Use your model to generate an answer.
    # `your_model` is a placeholder: replace it with your own model wrapper.
    generated_answer = your_model.generate(prompt)

    print(f"Generated: {generated_answer}")
    print(f"Reference: {sample['Answer']}\n")
```

## Evaluation with LLM-as-Judge

```python
import json
import os

import anthropic
from datasets import load_dataset
from pydantic import BaseModel, ValidationError

# Load dataset
dataset = load_dataset("eve-esa/open-ended-w-context")


class EvaluationResult(BaseModel):
    score: int  # 0 or 1
    reasoning: str


def evaluate_with_llm_judge(question, generated_answer, reference_answer, client):
    """
    Evaluate a single answer using LLM-as-Judge.

    Args:
        question: The original question
        generated_answer: Model-generated answer
        reference_answer: Ground truth answer
        client: anthropic.Anthropic client

    Returns:
        EvaluationResult: Score and reasoning
    """
    prompt = f"""You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.

**Evaluation Rules:**

1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").

---
**Task**:

Question: "{question}"
Provided Answer: "{generated_answer}"
Reference Answer: "{reference_answer}"

Using the rules above, assign a score of 0 if any failure condition is met.

Respond in JSON format with:
- "score": 0 or 1
- "reasoning": brief explanation of your decision
"""

    # Using Anthropic Claude
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response; an unparseable or malformed reply counts as a FAIL
    response_text = message.content[0].text
    try:
        result_dict = json.loads(response_text)
        return EvaluationResult(**result_dict)
    except (json.JSONDecodeError, TypeError, ValidationError):
        return EvaluationResult(score=0, reasoning="Error parsing response")


def evaluate_dataset_with_llm(dataset, generated_answers, api_key):
    """
    Evaluate entire dataset using LLM-as-Judge.

    Args:
        dataset: HuggingFace dataset
        generated_answers: List of model-generated answers
        api_key: API key for the LLM service

    Returns:
        dict: Evaluation results
    """
    client = anthropic.Anthropic(api_key=api_key)

    results = []
    pass_count = 0

    for i, sample in enumerate(dataset['train']):
        question = sample['Question']
        reference = sample['Answer']
        generated = generated_answers[i]

        eval_result = evaluate_with_llm_judge(
            question, generated, reference, client
        )

        results.append({
            'question': question,
            'score': eval_result.score,
            'reasoning': eval_result.reasoning
        })
        pass_count += eval_result.score

        print(f"Sample {i+1}: {'PASS' if eval_result.score == 1 else 'FAIL'}")

    # Guard against an empty split to avoid dividing by zero
    accuracy = pass_count / len(results) if results else 0.0

    return {
        'accuracy': accuracy,
        'total_samples': len(results),
        'passed': pass_count,
        'failed': len(results) - pass_count,
        'detailed_results': results
    }


# Example usage
# Get your model's predictions (with context).
# `format_prompt_with_context` is defined in "Generating Answers with Context";
# `your_model` is a placeholder for your own model wrapper.
generated_answers = []
for sample in dataset['train']:
    prompt = format_prompt_with_context(
        sample['Question'],
        sample['Doc 1'],
        sample['Doc 2'],
        sample['Doc 3']
    )
    answer = your_model.generate(prompt)
    generated_answers.append(answer)

# Evaluate (reads the key from the environment, else falls back to a placeholder)
api_key = os.environ.get("ANTHROPIC_API_KEY", "your-api-key-here")
results = evaluate_dataset_with_llm(dataset, generated_answers, api_key)

print("\nLLM-as-Judge Results:")
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Passed: {results['passed']}/{results['total_samples']}")
print(f"Failed: {results['failed']}/{results['total_samples']}")
```

# Use Cases

This dataset is particularly valuable for:

- **RAG System Evaluation**: Testing how well models utilize provided context to answer questions
- **Context Grounding**: Evaluating whether models can distinguish between information in the context vs.
their pre-trained knowledge
- **Document Comprehension**: Assessing models' ability to extract and synthesize information from multiple sources
- **Factual Accuracy**: Measuring how well models avoid hallucination when relevant context is provided
- **EO Domain Expertise**: Testing models' understanding of Earth Observation concepts when given domain-specific documentation

# Comparison with Standard Open-Ended Dataset

The key differences between this dataset and the standard EVE-open-ended dataset are:

| Feature | Open-ended | Open-ended-w-context |
|---------|-----------|----------------------|
| Context Documents | None | 1-3 documents per question |
| Use Case | General QA | RAG evaluation |
| Expected Behavior | Use pre-trained knowledge | Leverage provided context |
| Evaluation Focus | Knowledge recall | Context utilization + knowledge |

# Implementation Notes

When working with this dataset:

1. **Variable Document Availability**: Not all samples contain all three documents. Always check if a document exists and is not empty before using it in your prompts.
2. **Handling Empty Documents**: Empty documents may be represented as empty strings (`""`) or None. Use appropriate checks like `if doc and doc.strip()` to filter out empty documents.
3. **Context Formatting**: When constructing prompts, only include available (non-empty) documents to avoid confusing the model with empty context sections.
4. **Evaluation Consistency**: Use the same evaluation metrics (LLM-as-Judge, BERTScore, etc.) regardless of the number of documents available, as the reference answer quality should not depend on document count.
5. **RAG System Testing**: This variable document structure makes the dataset realistic for RAG evaluation, as real-world retrieval systems may return varying numbers of relevant documents.
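Because the same metric is applied whatever the document count, it can be informative to report it per count as well. The following is a minimal sketch, not part of the official evaluation: it assumes binary judge scores (0 or 1) aligned with the samples, and the helper names `available_docs` and `pass_rate_by_doc_count`, as well as the toy rows and scores, are hypothetical.

```python
from collections import defaultdict

DOC_FIELDS = ("Doc 1", "Doc 2", "Doc 3")  # column names from the dataset card


def available_docs(sample):
    """Return the non-empty context documents of one sample, in order."""
    docs = (sample.get(field) for field in DOC_FIELDS)
    return [doc for doc in docs if doc and doc.strip()]


def pass_rate_by_doc_count(samples, scores):
    """Average the binary judge scores per number of available documents."""
    samples = list(samples)
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    buckets = defaultdict(list)
    for sample, score in zip(samples, scores):
        buckets[len(available_docs(sample))].append(score)
    return {n_docs: sum(vals) / len(vals) for n_docs, vals in sorted(buckets.items())}


# Hypothetical rows and judge scores, for illustration only.
toy_samples = [
    {"Doc 1": "a", "Doc 2": "b", "Doc 3": "c"},
    {"Doc 1": "a", "Doc 2": "", "Doc 3": None},
    {"Doc 1": "a", "Doc 2": None, "Doc 3": ""},
]
toy_scores = [1, 1, 0]

rates = pass_rate_by_doc_count(toy_samples, toy_scores)
print(rates)
```

A large gap between the buckets would suggest that a system depends heavily on how much context it receives; with only 418 examples, per-bucket rates should be read with caution.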
# Installation Requirements

```bash
# For basic dataset loading
pip install datasets

# For evaluation metrics
pip install sentence-transformers scikit-learn  # Cosine Similarity
pip install bert-score  # BERTScore

# For LLM-as-Judge (choose one)
pip install anthropic  # For Claude
pip install openai  # For OpenAI models
```

# Citation

If you use this dataset in your research, please cite:

```
@misc{atrio2026evedomainspecificllmframework,
  title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
  author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
  year={2026},
  eprint={2604.13071},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.13071},
}
```

Provided by:

eve-esa
