eve-esa/open-ended

Name: eve-esa/open-ended
Creator: eve-esa
Published: 2026-04-16 07:59:15
License: no description available

Hugging Face · Updated 2026-04-16 · Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/eve-esa/open-ended


Resource overview:

---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Answer
    dtype: string
  splits:
  - name: train
    num_bytes: 1000237
    num_examples: 1257
  download_size: 553925
  dataset_size: 1000237
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- EVE
- Earth-Virtual-Expert
- Earth
- Observation
- EO
pretty_name: EVE-open-ended
size_categories:
- n<1K
---

# Dataset Summary

EVE-open-ended is a collection of open-ended question-answer pairs focused on Earth Observation (EO). The dataset covers a wide range of EO topics, including, but not limited to, satellite imagery analysis, remote sensing techniques, environmental monitoring, and LiDAR.

The dataset is designed to support the development and evaluation of large language models (LLMs) that understand and generate responses about Earth Observation.

# Metrics

The following metrics are suggested for evaluating model performance on EVE-open-ended:

- **BLEU (Bilingual Evaluation Understudy)**: Measures the n-gram overlap between the generated answer and the reference answer.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Focuses on recall by measuring the overlap of n-grams, word sequences, and word pairs between the generated answer and the reference answer.
- **Cosine Similarity**: Measures the similarity in embedding space between the reference answer and the candidate answer.
- **BERTScore**: Uses pre-trained BERT models to compute similarity scores between the generated and reference answers based on contextual embeddings.
- **LLM-as-Judge**: Uses a large language model to evaluate the quality of the generated answers based on criteria such as relevance, coherence, and informativeness.

## LLM-as-Judge

To use LLM-as-Judge for evaluating open-ended responses, follow these steps:

1. **Select a Pre-trained LLM**: Choose a large language model that is capable of understanding and evaluating text, such as GPT-4, PaLM, or any other suitable model.
2. **Define Evaluation Criteria**: Establish clear criteria for evaluation, such as relevance to the question, coherence of the response, informativeness, and overall quality.
3. **Prompt Engineering**: Create prompts that instruct the LLM to evaluate the generated answers based on the defined criteria.
4. **Run Evaluations**: Input the generated answers and reference answers into the LLM using the designed prompts to obtain evaluation scores or qualitative feedback.

Here is the prompt we used to evaluate the open-ended responses:

```
You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.

**Evaluation Rules:**

1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").

---
**Task**:

Question: "{question}"
Provided Answer: "{output}"
Reference Answer: "{reference}"

Using the rules above, assign a score of 0 if any failure condition is met.
{format_instructions}
```

# Usage Examples

## Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("eve-esa/open-ended")

# Access the data. The column names are capitalized ("Question", "Answer"),
# as declared under dataset_info.features in the card metadata.
for sample in dataset['train']:
    question = sample['Question']
    reference_answer = sample['Answer']
    print(f"Q: {question}")
    print(f"A: {reference_answer}\n")
```

## Evaluation Methods

### 1. Cosine Similarity

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load dataset
dataset = load_dataset("eve-esa/open-ended")

# Load a sentence transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def evaluate_cosine_similarity(generated_answers, reference_answers):
    """
    Evaluate generated answers using cosine similarity in embedding space.

    Args:
        generated_answers: List of model-generated answers
        reference_answers: List of reference answers

    Returns:
        dict: Average cosine similarity and per-sample scores
    """
    # Generate embeddings
    generated_embeddings = model.encode(generated_answers)
    reference_embeddings = model.encode(reference_answers)

    # Calculate the cosine similarity of each (generated, reference) pair
    similarities = []
    for gen_emb, ref_emb in zip(generated_embeddings, reference_embeddings):
        sim = cosine_similarity([gen_emb], [ref_emb])[0][0]
        similarities.append(sim)

    return {
        'average_cosine_similarity': np.mean(similarities),
        'scores': similarities
    }

# Example usage
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Replace with your actual model inference (your_model is a placeholder)
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate
results = evaluate_cosine_similarity(generated_answers, reference_answers)
print(f"Average Cosine Similarity: {results['average_cosine_similarity']:.4f}")
```

### 2. BERTScore

```python
from datasets import load_dataset
from bert_score import score

# Load dataset
dataset = load_dataset("eve-esa/open-ended")

def evaluate_bertscore(generated_answers, reference_answers):
    """
    Evaluate generated answers using BERTScore.

    Args:
        generated_answers: List of model-generated answers
        reference_answers: List of reference answers

    Returns:
        dict: Precision, Recall, and F1 scores
    """
    # Calculate BERTScore
    P, R, F1 = score(
        generated_answers,
        reference_answers,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',  # High-quality model
        verbose=True
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item(),
        'precision_scores': P.tolist(),
        'recall_scores': R.tolist(),
        'f1_scores': F1.tolist()
    }

# Example usage
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Replace with your actual model inference (your_model is a placeholder)
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate
results = evaluate_bertscore(generated_answers, reference_answers)
print(f"BERTScore Precision: {results['precision']:.4f}")
print(f"BERTScore Recall: {results['recall']:.4f}")
print(f"BERTScore F1: {results['f1']:.4f}")
```

### 3. LLM-as-Judge

```python
from datasets import load_dataset
import anthropic  # or openai, depending on your choice
from pydantic import BaseModel, ValidationError
import json

# Load dataset
dataset = load_dataset("eve-esa/open-ended")

class EvaluationResult(BaseModel):
    score: int  # 0 or 1
    reasoning: str

def evaluate_with_llm_judge(question, generated_answer, reference_answer, client):
    """
    Evaluate a single answer using LLM-as-Judge.

    Args:
        question: The original question
        generated_answer: Model-generated answer
        reference_answer: Ground truth answer
        client: Anthropic or OpenAI client

    Returns:
        EvaluationResult: Score and reasoning
    """
    prompt = f"""You are a strict fact-checker for Earth Observation (EO). Your task is to give a score of 0 (FAIL) or 1 (PASS) by comparing a "Provided Answer" to a "Reference Answer," using the "Question" to define the scope.

**Evaluation Rules:**

1. **Contradiction Check (FAIL condition)**: The Provided Answer scores 0 if it contains **ANY** fact (name, number, concept) that contradicts the Reference. The Reference is the absolute source of truth.
2. **Relevance Check (FAIL condition)**: The Provided Answer scores 0 if it omits key technical facts from the Reference that are **ESSENTIAL** to correctly answering the Question.
3. **Additive Information (PASS condition)**: The Provided Answer may include additional, correct information not found in the Reference. This is acceptable and should **NOT** be penalized, as long as it does not contradict the Reference.
4. **Focus on Substance, Not Style**: Ignore the answer's length, verbosity, and tone. Tolerate minor phrasing differences (e.g., "10m" vs "10 meters").

---
**Task**:

Question: "{question}"
Provided Answer: "{generated_answer}"
Reference Answer: "{reference_answer}"

Using the rules above, assign a score of 0 if any failure condition is met.

Respond in JSON format with:
- "score": 0 or 1
- "reasoning": brief explanation of your decision
"""

    # Using Anthropic Claude
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response
    response_text = message.content[0].text

    # Extract JSON from response; an unparseable reply is counted as a FAIL
    try:
        result_dict = json.loads(response_text)
        return EvaluationResult(**result_dict)
    except (json.JSONDecodeError, ValidationError, TypeError):
        # Fallback parsing if needed
        return EvaluationResult(score=0, reasoning="Error parsing response")

def evaluate_dataset_with_llm(dataset, generated_answers, api_key):
    """
    Evaluate entire dataset using LLM-as-Judge.

    Args:
        dataset: HuggingFace dataset
        generated_answers: List of model-generated answers
        api_key: API key for the LLM service

    Returns:
        dict: Evaluation results
    """
    client = anthropic.Anthropic(api_key=api_key)

    results = []
    pass_count = 0

    for i, sample in enumerate(dataset['train']):
        question = sample['Question']
        reference = sample['Answer']
        generated = generated_answers[i]

        eval_result = evaluate_with_llm_judge(
            question, generated, reference, client
        )

        results.append({
            'question': question,
            'score': eval_result.score,
            'reasoning': eval_result.reasoning
        })

        pass_count += eval_result.score

        print(f"Sample {i+1}: {'PASS' if eval_result.score == 1 else 'FAIL'}")

    accuracy = pass_count / len(results)

    return {
        'accuracy': accuracy,
        'total_samples': len(results),
        'passed': pass_count,
        'failed': len(results) - pass_count,
        'detailed_results': results
    }

# Example usage
questions = [sample['Question'] for sample in dataset['train']]
reference_answers = [sample['Answer'] for sample in dataset['train']]

# Get your model's predictions
generated_answers = []
for question in questions:
    # Replace with your actual model inference (your_model is a placeholder)
    answer = your_model.generate(question)
    generated_answers.append(answer)

# Evaluate
api_key = "your-api-key-here"  # or use os.environ.get("ANTHROPIC_API_KEY")
results = evaluate_dataset_with_llm(dataset, generated_answers, api_key)

print("\nLLM-as-Judge Results:")
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Passed: {results['passed']}/{results['total_samples']}")
print(f"Failed: {results['failed']}/{results['total_samples']}")
```

## Installation Requirements

```bash
# For Cosine Similarity
pip install sentence-transformers scikit-learn

# For BERTScore
pip install bert-score

# For LLM-as-Judge (choose one)
pip install anthropic  # For Claude
pip install openai  # For OpenAI models

# For loading the dataset
pip install datasets
```

# Citation

If you use this project in academic or research settings, please cite:

```
@misc{atrio2026evedomainspecificllmframework,
  title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
  author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
  year={2026},
  eprint={2604.13071},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.13071},
}
```
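
# Supplementary Sketch: BLEU- and ROUGE-Style Overlap

The Metrics section lists BLEU and ROUGE, while the usage examples cover only cosine similarity, BERTScore, and LLM-as-Judge. As a rough illustration of the n-gram overlap idea behind both metrics, here is a minimal standard-library sketch. It is a simplification and not the official BLEU or ROUGE implementation: tokenization is assumed to be lowercase whitespace splitting, there is no brevity penalty, smoothing, or stemming, and the function names (`tokenize`, `ngram_counts`, `clipped_precision`, `rouge_n_recall`, `rouge_l_f1`) are hypothetical helpers introduced here. The two example strings are illustrative and are not taken from the dataset.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; real toolkits do more)."""
    return text.lower().split()


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified n-gram precision against a single reference.

    Each candidate n-gram is credited at most as often as it occurs
    in the reference ("clipping").
    """
    cand = ngram_counts(tokenize(candidate), n)
    ref = ngram_counts(tokenize(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams found in the candidate."""
    cand = ngram_counts(tokenize(candidate), n)
    ref = ngram_counts(tokenize(reference), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def rouge_l_f1(candidate, reference):
    """ROUGE-L-style F1 based on the longest common subsequence of tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    # Row-by-row dynamic programming for the LCS length
    prev = [0] * (len(ref) + 1)
    for cand_tok in cand:
        curr = [0]
        for j, ref_tok in enumerate(ref, start=1):
            if cand_tok == ref_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only (not dataset samples)
    reference = "Sentinel-2 provides 10 m resolution in the visible bands"
    candidate = "Sentinel-2 offers 10 m resolution in visible bands"
    print(f"Unigram precision: {clipped_precision(candidate, reference, 1):.4f}")
    print(f"Bigram precision:  {clipped_precision(candidate, reference, 2):.4f}")
    print(f"ROUGE-1 recall:    {rouge_n_recall(candidate, reference, 1):.4f}")
    print(f"ROUGE-L F1:        {rouge_l_f1(candidate, reference):.4f}")
```

To score a model on the dataset, these functions could be applied to each (generated, reference) pair and averaged, in the same way as the cosine-similarity example. For reportable numbers, an established BLEU or ROUGE package is likely a better choice than this sketch.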

Provided by:

eve-esa
