eve-esa/open-ended-w-context
Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/eve-esa/open-ended-w-context
Description:
---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Answer
    dtype: string
  - name: Doc 1
    dtype: string
  - name: Doc 2
    dtype: string
  - name: Doc 3
    dtype: string
  - name: Source 1
    dtype: string
  - name: Source 2
    dtype: string
  - name: Source 3
    dtype: string
  - name: file_path_1
    dtype: string
  - name: file_path_2
    dtype: string
  - name: file_path_3
    dtype: string
  splits:
  - name: train
    num_bytes: 29521159
    num_examples: 418
  download_size: 12525331
  dataset_size: 29521159
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- EVE
- EO
- Earth
- RAG
- context-grounded
pretty_name: EVE-Open-Ended-w-Context
size_categories:
- n<1K
---
# Dataset Summary
EVE-open-ended-w-context is a collection of open-ended question-answer pairs focused on Earth Observation (EO) with accompanying context documents. Unlike the standard open-ended dataset, this version provides up to 3 relevant documents for each question that models can use to ground their responses. This makes it ideal for evaluating Retrieval-Augmented Generation (RAG) systems and testing models' ability to leverage provided context when answering questions.
The dataset covers a wide range of EO topics, including satellite imagery analysis, remote sensing techniques, environmental monitoring, LiDAR, and more. The context documents provide relevant background information, technical specifications, and domain knowledge that can help models generate more accurate and grounded responses.
**Note**: Not all samples contain all three documents. Some questions may have 1, 2, or 3 context documents depending on the availability of relevant information. Models should be designed to handle variable numbers of context documents gracefully.
# Dataset Structure
Each example in the dataset contains:
- **Question**: A question related to Earth Observation
- **Answer**: A reference answer to the question
- **Doc 1**: First context document containing relevant information (may be empty)
- **Doc 2**: Second context document containing relevant information (may be empty)
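- **Doc 3**: Third context document containing relevant information (may be empty)
- **Source 1**, **Source 2**, **Source 3**: String fields declared in the dataset schema alongside each document; judging by their names, they appear to identify each document's source
- **file_path_1**, **file_path_2**, **file_path_3**: String fields declared in the dataset schema; judging by their names, they appear to hold the file path of each document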
The context documents are selected to be relevant to the question and may contain information that helps answer the question, though models should also leverage their general knowledge where appropriate.
**Important**: Not all samples include all three documents. Some may have only Doc 1, some may have Doc 1 and Doc 2, and others may have all three. Empty or missing documents should be handled appropriately in your implementation.
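As a minimal sketch of the handling described above, the helper below gathers whichever context documents a sample actually provides. The names `DOC_FIELDS` and `collect_context` are hypothetical (they are not part of the dataset or any library), and the sketch assumes a missing document shows up as an absent key, `None`, or an empty or whitespace-only string.

```python
# Hypothetical helper: the field names come from the dataset schema,
# everything else is an illustrative assumption.
DOC_FIELDS = ("Doc 1", "Doc 2", "Doc 3")


def collect_context(sample):
    """Return the non-empty context documents of a sample, in order."""
    docs = []
    for field in DOC_FIELDS:
        text = sample.get(field)  # may be absent, None, or ""
        if isinstance(text, str) and text.strip():
            docs.append(text.strip())
    return docs


example = {
    "Question": "What are the main applications of SAR imagery?",
    "Doc 1": "Synthetic Aperture Radar (SAR) provides all-weather imaging capabilities...",
    "Doc 2": "",
    "Doc 3": None,
}
print(len(collect_context(example)))  # 1
```

Returning a list rather than three separate values keeps downstream prompt code independent of how many documents were available.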
## Examples
### Example 1: Sample with all three documents
```python
{
    "Question": "What is the spatial resolution of Sentinel-2's visible bands?",
    "Answer": "Sentinel-2's visible bands (B2, B3, B4) have a spatial resolution of 10 meters.",
    "Doc 1": "The Sentinel-2 mission comprises a constellation of two satellites...",
    "Doc 2": "Sentinel-2 carries the Multi-Spectral Instrument (MSI) with 13 spectral bands...",
    "Doc 3": "The visible bands of Sentinel-2 provide high-resolution imagery suitable for..."
}
```
### Example 2: Sample with only one document
```python
{
    "Question": "What are the main applications of SAR imagery?",
    "Answer": "SAR imagery is primarily used for terrain mapping, disaster monitoring, and ice surveillance.",
    "Doc 1": "Synthetic Aperture Radar (SAR) provides all-weather imaging capabilities...",
    "Doc 2": "",
    "Doc 3": ""
}
```
# Metrics
The metrics suggested for evaluating model performance on EVE-open-ended-w-context are the same as those for the standard open-ended dataset:
- **BLEU (Bilingual Evaluation Understudy)**: Measures the overlap between the generated answer and the reference answer based on n-grams.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Focuses on recall by measuring the overlap of n-grams, word sequences, and word pairs between the generated answer and the reference answer.
- **Cosine Similarity**: Measures the similarity in embedding space between the generated answer and the reference answer.
- **BERTScore**: Utilizes pre-trained BERT models to compute similarity scores between the generated and reference answers based on contextual embeddings.
- **LLM-as-Judge**: Uses a large language model to evaluate the quality of the generated answers based on criteria such as relevance, coherence, and informativeness.
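To make the overlap-based and embedding-based metrics above concrete, here is a simplified, dependency-free sketch. It is not a full BLEU or ROUGE implementation: `ngram_overlap` (a hypothetical name) only computes clipped n-gram precision, the building block of BLEU, and n-gram recall, which corresponds to ROUGE-N, using whitespace tokenization. `cosine_similarity` operates on plain lists of numbers; in practice the vectors would come from an embedding model, which is assumed here and not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # counts clipped to the minimum
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(ngram_overlap("resolution is 10 meters", "the resolution is 10 meters"))
# (1.0, 0.8)
```

For reported results, an established implementation of each metric is preferable to this sketch, since full BLEU also applies a brevity penalty and combines several n-gram orders.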
## LLM-as-Judge
To use LLM-as-Judge for evaluating open-ended responses with context, you can follow these steps:
1. **Select a Pre-trained LLM**: Choose a large language model that is capable of understanding and evaluating text, such as GPT-4, Claude, or any other suitable model.
2. **Define Evaluation Criteria**: Establish clear criteria for evaluation, such as relevance to the question, coherence of the response, informativeness, and overall quality.
3. **Prompt Engineering**: Create prompts that instruct the LLM to evaluate the generated answers based on the defined criteria.
4. **Run Evaluations**: Input the generated answers and reference answers into the LLM using the designed prompts to obtain evaluation scores or qualitative feedback.
Here is the prompt used to evaluate the open-ended responses:
```
You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.
**Evaluation Rules:**
1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").
---
**Task**:
Question: "{question}"
Provided Answer: "{output}"
Reference Answer: "{reference}"
Using the rules above, assign a score of 0 if any failure condition is met.
{format_instructions}
```
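The template above leaves `{format_instructions}` unspecified, so the sketch below fills it with a hypothetical JSON instruction and shows one way to extract the 0/1 verdict from a free-form reply. `JUDGE_TASK`, `FORMAT_INSTRUCTIONS`, `build_judge_prompt`, and `parse_verdict` are illustrative names, and only the task portion of the prompt is reproduced to keep the example short.

```python
import json
import re

# Only the task section of the judge prompt; the rules would precede it.
JUDGE_TASK = (
    'Question: "{question}"\n'
    'Provided Answer: "{output}"\n'
    'Reference Answer: "{reference}"\n'
    "{format_instructions}"
)

# Assumption: the card does not say what {format_instructions} expands to.
FORMAT_INSTRUCTIONS = (
    'Respond with a JSON object: {"score": 0 or 1, "reasoning": "<short explanation>"}'
)


def build_judge_prompt(question, output, reference):
    """Fill the template placeholders with one evaluation triple."""
    return JUDGE_TASK.format(
        question=question,
        output=output,
        reference=reference,
        format_instructions=FORMAT_INSTRUCTIONS,
    )


def parse_verdict(reply):
    """Return 0 or 1 from a judge reply, or None if no valid verdict is found."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    if isinstance(score, bool) or score not in (0, 1):
        return None
    return int(score)


print(parse_verdict('```json\n{"score": 1, "reasoning": "matches"}\n```'))  # 1
```

Returning `None` for unparseable replies lets the caller decide whether to retry or count the sample as a failure, so formatting problems are not silently mixed with wrong answers.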
# Usage Examples
## Loading the Dataset
```python
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("eve-esa/open-ended-w-context")

# Access the data
for sample in dataset['train']:
    question = sample['Question']
    answer = sample['Answer']
    doc1 = sample['Doc 1']
    doc2 = sample['Doc 2']
    doc3 = sample['Doc 3']

    print(f"Q: {question}")

    # Check and display available context documents
    available_docs = []
    if doc1 and doc1.strip():
        available_docs.append(1)
        print(f"Context 1: {doc1[:100]}...")
    if doc2 and doc2.strip():
        available_docs.append(2)
        print(f"Context 2: {doc2[:100]}...")
    if doc3 and doc3.strip():
        available_docs.append(3)
        print(f"Context 3: {doc3[:100]}...")

    print(f"Available documents: {available_docs}")
    print(f"A: {answer}\n")
```
## Generating Answers with Context
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("eve-esa/open-ended-w-context")


def format_prompt_with_context(question, doc1, doc2, doc3):
    """
    Format a prompt that includes the context documents.
    Handles cases where some documents may be empty or None.

    Args:
        question: The question to answer
        doc1, doc2, doc3: Context documents (may be empty/None)

    Returns:
        Formatted prompt string
    """
    prompt = "Answer the following question using the provided context documents.\n\n"

    # Add only non-empty documents, numbering them consecutively
    doc_count = 1
    for doc in [doc1, doc2, doc3]:
        if doc and doc.strip():  # Check if document exists and is not empty
            prompt += f"Context Document {doc_count}:\n{doc}\n\n"
            doc_count += 1

    prompt += f"Question: {question}\n\nAnswer:"
    return prompt


# Generate answers for each example
for sample in dataset['train']:
    prompt = format_prompt_with_context(
        sample['Question'],
        sample['Doc 1'],
        sample['Doc 2'],
        sample['Doc 3']
    )

    # Use your model to generate an answer.
    # `your_model` is a placeholder: replace it with your own inference call.
    generated_answer = your_model.generate(prompt)

    print(f"Generated: {generated_answer}")
    print(f"Reference: {sample['Answer']}\n")
```
## Evaluation with LLM-as-Judge
```python
import json

import anthropic
from datasets import load_dataset
from pydantic import BaseModel

# Load dataset
dataset = load_dataset("eve-esa/open-ended-w-context")


class EvaluationResult(BaseModel):
    score: int  # 0 or 1
    reasoning: str


def evaluate_with_llm_judge(question, generated_answer, reference_answer, client):
    """
    Evaluate a single answer using LLM-as-Judge.

    Args:
        question: The original question
        generated_answer: Model-generated answer
        reference_answer: Ground truth answer
        client: Anthropic client (this function uses the Anthropic Messages API)

    Returns:
        EvaluationResult: Score and reasoning
    """
    prompt = f"""You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.
**Evaluation Rules:**
1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").
---
**Task**:
Question: "{question}"
Provided Answer: "{generated_answer}"
Reference Answer: "{reference_answer}"
Using the rules above, assign a score of 0 if any failure condition is met.
Respond in JSON format with:
- "score": 0 or 1
- "reasoning": brief explanation of your decision
"""

    # Using Anthropic Claude (substitute a model ID available to your account)
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response
    response_text = message.content[0].text
    try:
        result_dict = json.loads(response_text)
        return EvaluationResult(**result_dict)
    except (ValueError, TypeError):
        # Covers invalid JSON and pydantic validation errors (both subclass
        # ValueError) as well as a reply that is valid JSON but not an object.
        return EvaluationResult(score=0, reasoning="Error parsing response")


def evaluate_dataset_with_llm(dataset, generated_answers, api_key):
    """
    Evaluate entire dataset using LLM-as-Judge.

    Args:
        dataset: HuggingFace dataset
        generated_answers: List of model-generated answers, one per sample
            in dataset['train'], in the same order
        api_key: API key for the LLM service

    Returns:
        dict: Evaluation results
    """
    client = anthropic.Anthropic(api_key=api_key)
    results = []
    pass_count = 0

    for i, sample in enumerate(dataset['train']):
        question = sample['Question']
        reference = sample['Answer']
        generated = generated_answers[i]

        eval_result = evaluate_with_llm_judge(
            question, generated, reference, client
        )
        results.append({
            'question': question,
            'score': eval_result.score,
            'reasoning': eval_result.reasoning
        })
        pass_count += eval_result.score
        print(f"Sample {i+1}: {'PASS' if eval_result.score == 1 else 'FAIL'}")

    # Guard against an empty split to avoid division by zero
    accuracy = pass_count / len(results) if results else 0.0
    return {
        'accuracy': accuracy,
        'total_samples': len(results),
        'passed': pass_count,
        'failed': len(results) - pass_count,
        'detailed_results': results
    }


# Example usage
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions (with context).
# `format_prompt_with_context` is the function defined in
# "Generating Answers with Context"; `your_model` is a placeholder.
generated_answers = []
for sample in dataset['train']:
    prompt = format_prompt_with_context(
        sample['Question'],
        sample['Doc 1'],
        sample['Doc 2'],
        sample['Doc 3']
    )
    answer = your_model.generate(prompt)
    generated_answers.append(answer)

# Evaluate
api_key = "your-api-key-here"  # or use os.environ.get("ANTHROPIC_API_KEY")
results = evaluate_dataset_with_llm(dataset, generated_answers, api_key)

print("\nLLM-as-Judge Results:")
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Passed: {results['passed']}/{results['total_samples']}")
print(f"Failed: {results['failed']}/{results['total_samples']}")
```
# Use Cases
This dataset is particularly valuable for:
- **RAG System Evaluation**: Testing how well models utilize provided context to answer questions
- **Context Grounding**: Evaluating whether models can distinguish between information in the context vs. their pre-trained knowledge
- **Document Comprehension**: Assessing models' ability to extract and synthesize information from multiple sources
- **Factual Accuracy**: Measuring how well models avoid hallucination when relevant context is provided
- **EO Domain Expertise**: Testing models' understanding of Earth Observation concepts when given domain-specific documentation
# Comparison with Standard Open-Ended Dataset
The key differences between this dataset and the standard EVE-open-ended dataset are:
| Feature | Open-ended | Open-ended-w-context |
|---------|-----------|----------------------|
| Context Documents | None | 1-3 documents per question |
| Use Case | General QA | RAG evaluation |
| Expected Behavior | Use pre-trained knowledge | Leverage provided context |
| Evaluation Focus | Knowledge recall | Context utilization + knowledge |
# Implementation Notes
When working with this dataset:
1. **Variable Document Availability**: Not all samples contain all three documents. Always check if a document exists and is not empty before using it in your prompts.
2. **Handling Empty Documents**: Empty documents may be represented as empty strings (`""`) or None. Use appropriate checks like `if doc and doc.strip()` to filter out empty documents.
3. **Context Formatting**: When constructing prompts, only include available (non-empty) documents to avoid confusing the model with empty context sections.
4. **Evaluation Consistency**: Use the same evaluation metrics (LLM-as-Judge, BERTScore, etc.) regardless of the number of documents available, as the reference answer quality should not depend on document count.
5. **RAG System Testing**: This variable document structure makes the dataset realistic for RAG evaluation, as real-world retrieval systems may return varying numbers of relevant documents.
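One way to check whether scores depend on how much context was available (notes 1 and 4 above) is to break the pass rate down by document count. The sketch below is illustrative only: `count_docs` and `accuracy_by_doc_count` are hypothetical helpers, and `scores` is assumed to be a list of 0/1 judge verdicts aligned with the samples.

```python
from collections import defaultdict

DOC_FIELDS = ("Doc 1", "Doc 2", "Doc 3")


def count_docs(sample):
    """Count the non-empty context documents in a sample."""
    return sum(
        1
        for field in DOC_FIELDS
        if isinstance(sample.get(field), str) and sample[field].strip()
    )


def accuracy_by_doc_count(samples, scores):
    """Map number of available documents -> pass rate over those samples."""
    buckets = defaultdict(list)
    for sample, score in zip(samples, scores):
        buckets[count_docs(sample)].append(score)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


samples = [
    {"Doc 1": "a", "Doc 2": "", "Doc 3": None},
    {"Doc 1": "a", "Doc 2": "b", "Doc 3": "c"},
    {"Doc 1": "x", "Doc 2": None, "Doc 3": ""},
]
print(accuracy_by_doc_count(samples, [1, 1, 0]))  # {1: 0.5, 3: 1.0}
```

With only 418 examples in the train split, individual buckets may be small, so differences between them should be read with caution.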
# Installation Requirements
```bash
# For basic dataset loading
pip install datasets
# For evaluation metrics
pip install sentence-transformers scikit-learn # Cosine Similarity
pip install bert-score # BERTScore
# For LLM-as-Judge (choose one)
pip install anthropic # For Claude
pip install openai # For OpenAI models
```
# Citation
If you use this dataset in your research, please cite:
```
@misc{atrio2026evedomainspecificllmframework,
  title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
  author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
  year={2026},
  eprint={2604.13071},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.13071},
}
```
Provider:
eve-esa



