eve-esa/open-ended
Source: Hugging Face. Updated 2026-04-16; indexed 2026-01-03.
Download link: https://hf-mirror.com/datasets/eve-esa/open-ended
Dataset description:
---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Answer
    dtype: string
  splits:
  - name: train
    num_bytes: 1000237
    num_examples: 1257
  download_size: 553925
  dataset_size: 1000237
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- EVE
- Earth-Virtual-Expert
- Earth
- Observation
- EO
pretty_name: EVE-open-ended
size_categories:
- 1K<n<10K
---
# Dataset Summary
EVE-open-ended is a collection of open-ended question-answer pairs focused on Earth Observation (EO). The dataset covers a wide range of EO topics, including but not limited to satellite imagery analysis, remote sensing techniques, environmental monitoring, and LiDAR.
It contains 1,257 pairs in a single `train` split, each with a `Question` and an `Answer` string field, and is designed to support the development and evaluation of large language models (LLMs) that must understand and answer questions about Earth Observation.
# Metrics
Suggested metrics for evaluating model performance on EVE-open-ended:
- **BLEU (Bilingual Evaluation Understudy)**: Measures the overlap between the generated answer and the reference answer based on n-grams.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Focuses on recall by measuring the overlap of n-grams, word sequences, and word pairs between the generated answer and the reference answer.
- **Cosine Similarity**: Measures the similarity in embedding space between the generated answer and the reference answer.
- **BERTScore**: Utilizes pre-trained BERT models to compute similarity scores between the generated and reference answers based on contextual embeddings.
- **LLM-as-Judge**: Uses a large language model to evaluate the quality of the generated answers based on criteria such as relevance, coherence, and informativeness.
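To make the n-gram overlap behind BLEU and ROUGE concrete, here is a minimal pure-Python sketch. It is not a full BLEU or ROUGE implementation (no brevity penalty, no multi-order averaging, no ROUGE-L); the function names, the whitespace tokenization, and the toy strings are assumptions made for illustration. For reported results, an established implementation is the safer choice.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Clipped n-gram precision and recall between two strings.

    Precision is the BLEU-style quantity (overlap / candidate n-grams);
    recall is the ROUGE-N-style quantity (overlap / reference n-grams).
    Tokenization is a lowercase whitespace split (an assumption).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Toy strings for illustration only, not taken from the dataset
p, r = ngram_overlap(
    "Sentinel-2 has a 10 m resolution",
    "Sentinel-2 provides 10 m resolution in visible bands",
)
print(f"unigram precision={p:.2f}, recall={r:.2f}")
```

In this toy pair, four tokens are shared, so precision is 4 of 6 candidate tokens and recall is 4 of 8 reference tokens. The gap between the two shows why BLEU-style and ROUGE-style scores can disagree on the same answer.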
## LLM-as-Judge
To use LLM-as-Judge for evaluating open-ended responses, you can follow these steps:
1. **Select a Pre-trained LLM**: Choose a large language model that is capable of understanding and evaluating text, such as GPT-4, PaLM, or any other suitable model.
2. **Define Evaluation Criteria**: Establish clear criteria for evaluation, such as relevance to the question, coherence of the response, informativeness, and overall quality.
3. **Prompt Engineering**: Create prompts that instruct the LLM to evaluate the generated answers based on the defined criteria.
4. **Run Evaluations**: Pass each question, generated answer, and reference answer to the LLM using the designed prompt to obtain evaluation scores or qualitative feedback.
Here is the prompt we used to evaluate the open-ended responses:
```
You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.
**Evaluation Rules:**
1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").
---
**Task**:
Question: "{question}"
Provided Answer: "{output}"
Reference Answer: "{reference}"
Using the rules above, assign a score of 0 if any failure condition is met.
{format_instructions}
```
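The prompt above has four placeholders: `{question}`, `{output}`, `{reference}`, and `{format_instructions}`. The following sketch shows one way to fill them with plain `str.format`. The template is abbreviated to the task section, and `build_judge_prompt` and the wording of `FORMAT_INSTRUCTIONS` are hypothetical, since the card does not state which format instructions were used.

```python
# Abbreviated stand-in for the full judge prompt; only the placeholders matter here.
JUDGE_TEMPLATE = (
    'Question: "{question}"\n'
    'Provided Answer: "{output}"\n'
    'Reference Answer: "{reference}"\n'
    "Using the rules above, assign a score of 0 if any failure condition is met.\n"
    "{format_instructions}"
)

# Hypothetical format instructions; the original wording is not given in this card.
FORMAT_INSTRUCTIONS = (
    'Respond with a JSON object: {"score": 0 or 1, "reasoning": "<short text>"}'
)


def build_judge_prompt(question, output, reference):
    """Fill the four placeholders of the judge template."""
    return JUDGE_TEMPLATE.format(
        question=question,
        output=output,
        reference=reference,
        format_instructions=FORMAT_INSTRUCTIONS,
    )


# Toy inputs for illustration only
prompt = build_judge_prompt(
    question="What is SAR?",
    output="A radar imaging technique.",
    reference="Synthetic Aperture Radar.",
)
print(prompt)
```

Braces inside the substituted values (such as the JSON example in `FORMAT_INSTRUCTIONS`) are safe here because `str.format` only interprets braces in the template itself.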
# Usage Examples
## Loading the Dataset
```python
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("eve-esa/open-ended")

# Access the data (column names are capitalized, as declared in dataset_info)
for sample in dataset['train']:
    question = sample['Question']
    reference_answer = sample['Answer']
    print(f"Q: {question}")
    print(f"A: {reference_answer}\n")
```
## Evaluation Methods
### 1. Cosine Similarity
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load dataset
dataset = load_dataset("eve-esa/open-ended")

# Load a sentence transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


def evaluate_cosine_similarity(generated_answers, reference_answers):
    """
    Evaluate generated answers using cosine similarity in embedding space.

    Args:
        generated_answers: List of model-generated answers
        reference_answers: List of reference answers

    Returns:
        dict: Average cosine similarity and per-sample scores
    """
    # Generate embeddings
    generated_embeddings = model.encode(generated_answers)
    reference_embeddings = model.encode(reference_answers)

    # Calculate the cosine similarity of each (generated, reference) pair
    similarities = []
    for gen_emb, ref_emb in zip(generated_embeddings, reference_embeddings):
        sim = cosine_similarity([gen_emb], [ref_emb])[0][0]
        similarities.append(float(sim))

    return {
        'average_cosine_similarity': float(np.mean(similarities)),
        'scores': similarities
    }


# Example usage (column names are capitalized, as declared in dataset_info)
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Placeholder: replace `your_model` with your actual model inference
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate
results = evaluate_cosine_similarity(generated_answers, reference_answers)
print(f"Average Cosine Similarity: {results['average_cosine_similarity']:.4f}")
```
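The quantity that the code above obtains from scikit-learn is the dot product of two vectors divided by the product of their norms. The following dependency-free sketch computes it on toy two-dimensional vectors; the `cosine` helper and the vectors are illustrative assumptions, not part of the original evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Because the score depends only on direction, not magnitude, two answers of very different length can still score close to 1 if their embeddings point the same way.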
### 2. BERTScore
```python
from datasets import load_dataset
from bert_score import score

# Load dataset
dataset = load_dataset("eve-esa/open-ended")


def evaluate_bertscore(generated_answers, reference_answers):
    """
    Evaluate generated answers using BERTScore.

    Args:
        generated_answers: List of model-generated answers
        reference_answers: List of reference answers

    Returns:
        dict: Precision, Recall, and F1 scores
    """
    # Calculate BERTScore (candidates first, references second)
    P, R, F1 = score(
        generated_answers,
        reference_answers,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',  # High-quality model
        verbose=True
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item(),
        'precision_scores': P.tolist(),
        'recall_scores': R.tolist(),
        'f1_scores': F1.tolist()
    }


# Example usage (column names are capitalized, as declared in dataset_info)
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Placeholder: replace `your_model` with your actual model inference
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate
results = evaluate_bertscore(generated_answers, reference_answers)
print(f"BERTScore Precision: {results['precision']:.4f}")
print(f"BERTScore Recall: {results['recall']:.4f}")
print(f"BERTScore F1: {results['f1']:.4f}")
```
### 3. LLM-as-Judge
```python
from datasets import load_dataset
import anthropic  # or openai, depending on your choice
from pydantic import BaseModel, ValidationError
import json
import os

# Load dataset
dataset = load_dataset("eve-esa/open-ended")


class EvaluationResult(BaseModel):
    score: int  # 0 or 1
    reasoning: str


def evaluate_with_llm_judge(question, generated_answer, reference_answer, client):
    """
    Evaluate a single answer using LLM-as-Judge.

    Args:
        question: The original question
        generated_answer: Model-generated answer
        reference_answer: Ground truth answer
        client: Anthropic or OpenAI client

    Returns:
        EvaluationResult: Score and reasoning
    """
    prompt = f"""You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.
**Evaluation Rules:**
1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").
---
**Task**:
Question: "{question}"
Provided Answer: "{generated_answer}"
Reference Answer: "{reference_answer}"
Using the rules above, assign a score of 0 if any failure condition is met.
Respond in JSON format with:
- "score": 0 or 1
- "reasoning": brief explanation of your decision
"""
    # Using Anthropic Claude
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response
    response_text = message.content[0].text

    # Extract JSON from response
    try:
        result_dict = json.loads(response_text)
        return EvaluationResult(**result_dict)
    except (json.JSONDecodeError, TypeError, ValidationError):
        # Unparseable or malformed reply: count it as a FAIL
        return EvaluationResult(score=0, reasoning="Error parsing response")


def evaluate_dataset_with_llm(dataset, generated_answers, api_key):
    """
    Evaluate entire dataset using LLM-as-Judge.

    Args:
        dataset: HuggingFace dataset
        generated_answers: List of model-generated answers
        api_key: API key for the LLM service

    Returns:
        dict: Evaluation results
    """
    client = anthropic.Anthropic(api_key=api_key)
    results = []
    pass_count = 0

    for i, sample in enumerate(dataset['train']):
        # Column names are capitalized, as declared in dataset_info
        question = sample['Question']
        reference = sample['Answer']
        generated = generated_answers[i]

        eval_result = evaluate_with_llm_judge(
            question, generated, reference, client
        )
        results.append({
            'question': question,
            'score': eval_result.score,
            'reasoning': eval_result.reasoning
        })
        pass_count += eval_result.score
        print(f"Sample {i+1}: {'PASS' if eval_result.score == 1 else 'FAIL'}")

    accuracy = pass_count / len(results)
    return {
        'accuracy': accuracy,
        'total_samples': len(results),
        'passed': pass_count,
        'failed': len(results) - pass_count,
        'detailed_results': results
    }


# Example usage
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Placeholder: replace `your_model` with your actual model inference
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate (read the key from the environment rather than hard-coding it)
api_key = os.environ["ANTHROPIC_API_KEY"]
results = evaluate_dataset_with_llm(dataset, generated_answers, api_key)
print("\nLLM-as-Judge Results:")
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Passed: {results['passed']}/{results['total_samples']}")
print(f"Failed: {results['failed']}/{results['total_samples']}")
```
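The judge code above passes the raw reply straight to `json.loads`, so a reply that wraps the JSON in a Markdown code fence or adds a sentence around it is counted as a FAIL. The following is a sketch of a more tolerant parser, assuming the reply contains a single JSON object; `parse_judge_response` is a hypothetical helper, not part of the original evaluation code.

```python
import json
import re


def parse_judge_response(text):
    """Extract {"score", "reasoning"} from a judge reply; return None if unusable."""
    # Take everything from the first '{' to the last '}' (assumes one JSON object)
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Accept only the two scores the prompt allows
    if not isinstance(data, dict) or data.get("score") not in (0, 1):
        return None
    return {"score": int(data["score"]), "reasoning": str(data.get("reasoning", ""))}


# Example: a reply wrapped in a Markdown fence with surrounding prose
fence = "`" * 3
reply = f'Here is my verdict:\n{fence}json\n{{"score": 1, "reasoning": "Matches."}}\n{fence}'
print(parse_judge_response(reply))
print(parse_judge_response("I cannot decide."))
```

Returning `None` instead of a default score lets the caller decide whether to retry the request or to record the sample as unscored, rather than silently lowering the accuracy.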
## Installation Requirements
```bash
# For Cosine Similarity
pip install sentence-transformers scikit-learn
# For BERTScore
pip install bert-score
# For LLM-as-Judge (choose one)
pip install anthropic # For Claude
pip install openai # For OpenAI models
# For loading the dataset
pip install datasets
```
# Citation
If you use this project in academic or research settings, please cite:
```
@misc{atrio2026evedomainspecificllmframework,
title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
year={2026},
eprint={2604.13071},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.13071},
}
```
Provider:
eve-esa



