Name: sdeakin/GoEmotions-Projected-BIO-Emotions
Creator: sdeakin
Published: 2025-12-09 03:07:12
License: cc-by-4.0 (per the dataset card metadata; the listing page gave no description)

Download link:

https://hf-mirror.com/datasets/sdeakin/GoEmotions-Projected-BIO-Emotions


Resource overview:

---
language: en
license: cc-by-4.0
pretty_name: GoEmotions Projected BIO + Span Tags (LLM-Generated)
tags:
- goemotions
- bio-tagging
- span-extraction
- emotion-classification
- llm-generated
- synthetic
dataset_info:
  features:
  - name: src_id
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: prompt
    dtype: string
  - name: level
    dtype: string
  - name: original_llm_predictions
    sequence: string
  - name: text
    dtype: string
  - name: data
    struct:
    - name: tokens
      sequence: string
    - name: labels
      sequence: string
    - name: spans
      sequence:
        struct:
        - name: type
          dtype: string
        - name: subtype
          dtype: string
        - name: start
          dtype: int32
        - name: end
          dtype: int32
        - name: text
          dtype: string
        - name: attrs
          struct: {}
paperswithcode_id: go-emotions
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
---

# Dataset Card for **GoEmotions-Projected-BIO-Emotions**

## Dataset Summary

**GoEmotions-Projected-BIO-Emotions** contains **196,853 high-quality span annotations** generated by projecting the *ground-truth GoEmotions emotion labels* onto **BIO-tagged emotional spans** using `llama3:instruct`.

Unlike typical LLM-based annotation pipelines (where the model *predicts* emotions), this dataset feeds the **true GoEmotions label(s)** into the prompt and asks the LLM to:

* tokenize the text
* generate BIO tags (`B-EMO`, `I-EMO`, `O`)
* identify span boundaries
* produce structured span objects
* attach rich emotion attributes (valence, intensity, certainty, temporality, source, emotion_group)
* optionally include target entity + relation metadata

This produces a highly consistent, projected labeling dataset that aligns the GoEmotions taxonomy with explicit emotional spans.
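To make the BIO scheme concrete, here is a minimal sketch of decoding a label sequence into spans. The function name `bio_to_spans` is hypothetical and not part of the dataset tooling. It assumes span indices are token offsets with an inclusive `end`, which is inferred from the example record in this card (`start: 0`, `end: 7` covering eight tokens) rather than stated by the author. A stray `I-` tag with no preceding `B-` is treated as outside any span, which is one common convention among several.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Decode BIO labels into (type, start, end) tuples.

    Assumption: `end` is an inclusive token index, matching the
    example record in this card.
    """
    spans: List[Tuple[str, int, int]] = []
    current = None  # [type, start, end] of the span being built, if any

    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current is not None:
                spans.append(tuple(current))
            current = [label[2:], i, i]
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            # An I- tag of the same type extends the open span.
            current[2] = i
        else:
            # "O", or an I- tag with no matching open span, closes the span.
            if current is not None:
                spans.append(tuple(current))
            current = None

    if current is not None:
        spans.append(tuple(current))
    return spans


labels = ["B-EMO"] + ["I-EMO"] * 7 + ["O"]
print(bio_to_spans(labels))  # prints [('EMO', 0, 7)]
```

Decoding the `labels` field this way and comparing the result with the `start` / `end` values in `spans` is a quick sanity check when loading the data.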

---

## Dataset Structure

### Example Record

```json
{
  "src_id": "l2_345",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "level_2_projected",
  "level": "level2",
  "original_llm_predictions": ["gratitude"],
  "text": "Thanks for staying late to help me finish.",
  "data": {
    "tokens": ["Thanks", "for", "staying", "late", "to", "help", "me", "finish", "."],
    "labels": ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"],
    "spans": [
      {
        "type": "EMO",
        "subtype": "Gratitude",
        "start": 0,
        "end": 7,
        "text": "Thanks for staying late to help me finish",
        "attrs": {
          "valence": "pos",
          "intensity": "med",
          "certainty": "asserted",
          "temporality": "present",
          "source": "self",
          "emotion_group": "positive_affect",
          "sentence_index": 0,
          "clause_index": 0,
          "confidence": 0.97,
          "target_text": "you",
          "target_relation": "benefactor"
        }
      }
    ]
  }
}
```

---

## Data Fields

### Top-Level Fields

| Field                      | Type         | Description                                               |
| -------------------------- | ------------ | --------------------------------------------------------- |
| `src_id`                   | string       | Unique row ID (`l2_<index>`).                             |
| `model`                    | string       | LLM used (`llama3:instruct`).                             |
| `provider`                 | string       | Backend (`ollama-local`).                                 |
| `prompt`                   | string       | Prompt name used.                                         |
| `level`                    | string       | Annotation level (`level2`).                              |
| `original_llm_predictions` | list[string] | **Ground-truth GoEmotions labels provided to the model.** |
| `text`                     | string       | Original input sentence.                                  |
| `data.tokens`              | list[string] | Whitespace tokenization.                                  |
| `data.labels`              | list[string] | BIO labels.                                               |
| `data.spans`               | list[object] | Spans with attributes.                                    |

---

## Generation Process

### 1. Ground-Truth Emotion Projection

The LLM is not tasked with labeling emotions. Instead, GoEmotions labels are inserted into the prompt, and the model *projects* them onto:

* token-level BIO tags
* explicit spans
* fine-grained emotional attributes

### 2. Prompt Template

The Level-2 Projected Prompt (`prompts/level_2.txt`) instructs the LLM to:

* echo the input text
* tokenize
* produce token-aligned BIO tagging
* output span objects with attributes

### 3. Cleaning & Validation

The cleaned dataset applies strict filtering:

| Step                         | Description                                                                                                                              |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **Schema validation**        | Ensures presence of required fields (`tokens`, `labels`, `spans`).                                                                       |
| **Token/label alignment**    | Verifies BIO label count equals token count.                                                                                             |
| **Span consistency**         | Confirms `start` / `end` match the token slice and that the slice reconstructs the span text.                                            |
| **Attribute normalization**  | Maps attribute values to controlled vocabularies.                                                                                        |
| **Emotion label validation** | Confirms span `subtype` matches the official GoEmotions taxonomy (27 emotions + neutral). Rejects hallucinated or invalid emotion names. |
| **Confidence bounds**        | Ensures `confidence` ∈ `[0, 1]`.                                                                                                         |
| **Rejected sample logging**  | Any failed entry is saved to an `_incorrect.jsonl` audit file.                                                                           |

Final cleaned dataset size: **196,853 entries**

Rejected during cleaning: **3,385 entries**

---

## Intended Uses

### Span-Based Emotion Taggers

Train token-level or span-level models for emotion extraction.

### Tri-Tower / Contrastive Architectures

Use BIO spans + attributes for:

* span tower
* definition tower alignment
* context tower supervision

### Targeted Emotion Extraction

Many spans include target entities and relations.

### Attribute Prediction

Multitask learning for valence, intensity, certainty, etc.

---

## Limitations

* Emotional attributes (e.g., intensity) are subjective and may be noisy.
* BIO span boundaries reflect LLM judgments.
* Dataset inherits biases from Llama-3 and Reddit-based GoEmotions data.
* English-only, informal tone.
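
A few of the checks listed under Cleaning & Validation (schema validation, token/label alignment, span consistency, confidence bounds) can be sketched as below. This is an illustration only, not the author's cleaning script: the function name `validate_record` is hypothetical, and it assumes that `end` is an inclusive token index and that span text is rebuilt by joining tokens with single spaces, both inferred from the example record in this card.

```python
from typing import Any, Dict, List


def validate_record(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record's `data` field."""
    # Schema validation: all three required fields must be present.
    errors = [f"missing field: {key}"
              for key in ("tokens", "labels", "spans") if key not in data]
    if errors:
        return errors

    tokens, labels = data["tokens"], data["labels"]

    # Token/label alignment: one BIO label per token.
    if len(tokens) != len(labels):
        errors.append("token/label count mismatch")

    for i, span in enumerate(data["spans"]):
        start, end = span.get("start"), span.get("end")
        in_bounds = (isinstance(start, int) and isinstance(end, int)
                     and 0 <= start <= end < len(tokens))
        if not in_bounds:
            errors.append(f"span {i}: start/end out of bounds")
            continue

        # Span consistency (assumes inclusive `end`, space-joined tokens).
        if " ".join(tokens[start:end + 1]) != span.get("text"):
            errors.append(f"span {i}: text does not match token slice")

        # Confidence bounds: if present, must be a number in [0, 1].
        conf = span.get("attrs", {}).get("confidence")
        if conf is not None and not (isinstance(conf, (int, float))
                                     and 0.0 <= conf <= 1.0):
            errors.append(f"span {i}: confidence outside [0, 1]")

    return errors


example = {
    "tokens": ["Thanks", "for", "staying", "late", "to", "help", "me", "finish", "."],
    "labels": ["B-EMO"] + ["I-EMO"] * 7 + ["O"],
    "spans": [{
        "type": "EMO", "subtype": "Gratitude", "start": 0, "end": 7,
        "text": "Thanks for staying late to help me finish",
        "attrs": {"confidence": 0.97},
    }],
}
print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it straightforward to log every failure for a record, in the spirit of the `_incorrect.jsonl` audit file mentioned above.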

---

## Citation

```bibtex
@misc{goemotions_projected_bio_emotions,
  title  = {GoEmotions Projected BIO + Span Tags (LLM-Generated)},
  author = {Sheryl D. and contributors},
  year   = {2025},
  note   = {LLM-projected span annotations using llama3:instruct.}
}
```

Application scenarios: