sdeakin/GoEmotions-Projected-BIO-Emotions
Hugging Face · Updated 2025-12-09 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/sdeakin/GoEmotions-Projected-BIO-Emotions
Resource description:
---
language: en
license: cc-by-4.0
pretty_name: GoEmotions Projected BIO + Span Tags (LLM-Generated)
tags:
- goemotions
- bio-tagging
- span-extraction
- emotion-classification
- llm-generated
- synthetic
dataset_info:
  features:
  - name: src_id
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: prompt
    dtype: string
  - name: level
    dtype: string
  - name: original_llm_predictions
    sequence: string
  - name: text
    dtype: string
  - name: data
    struct:
    - name: tokens
      sequence: string
    - name: labels
      sequence: string
    - name: spans
      sequence:
        struct:
        - name: type
          dtype: string
        - name: subtype
          dtype: string
        - name: start
          dtype: int32
        - name: end
          dtype: int32
        - name: text
          dtype: string
        - name: attrs
          struct: {}
paperswithcode_id: go-emotions
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
---
# Dataset Card for **GoEmotions-Projected-BIO-Emotions**
## Dataset Summary
**GoEmotions-Projected-BIO-Emotions** contains **196,853 validated records**, each carrying BIO tags and span annotations generated by projecting the *ground-truth GoEmotions emotion labels* onto **BIO-tagged emotional spans** using `llama3:instruct`.
Unlike typical LLM-based annotation pipelines (where the model *predicts* emotions), this dataset feeds the **true GoEmotions label(s)** into the prompt and asks the LLM to:
* tokenize the text
* generate BIO tags (`B-EMO`, `I-EMO`, `O`)
* identify span boundaries
* produce structured span objects
* attach rich emotion attributes (valence, intensity, certainty, temporality, source, emotion_group)
* optionally include target entity + relation metadata
Because the emotion label is fixed in advance and only the span placement is left to the model, the result is a projected labeling dataset that aligns the GoEmotions taxonomy with explicit emotional spans.
---
## Dataset Structure
### Example Record
```json
{
"src_id": "l2_345",
"model": "llama3:instruct",
"provider": "ollama-local",
"prompt": "level_2_projected",
"level": "level2",
"original_llm_predictions": ["gratitude"],
"text": "Thanks for staying late to help me finish.",
"data": {
"tokens": ["Thanks", "for", "staying", "late", "to", "help", "me", "finish", "."],
"labels": ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"],
"spans": [
{
"type": "EMO",
"subtype": "Gratitude",
"start": 0,
"end": 7,
"text": "Thanks for staying late to help me finish",
"attrs": {
"valence": "pos",
"intensity": "med",
"certainty": "asserted",
"temporality": "present",
"source": "self",
"emotion_group": "positive_affect",
"sentence_index": 0,
"clause_index": 0,
"confidence": 0.97,
"target_text": "you",
"target_relation": "benefactor"
}
}
]
}
}
```
---
## Data Fields
### Top-Level Fields
| Field | Type | Description |
| -------------------------- | ------------ | --------------------------------------------------------- |
| `src_id` | string | Unique row ID (`l2_<index>`). |
| `model` | string | LLM used (`llama3:instruct`). |
| `provider` | string | Backend (`ollama-local`). |
| `prompt` | string | Prompt name used. |
| `level` | string | Annotation level (`level2`). |
| `original_llm_predictions` | list[string] | **Ground-truth GoEmotions labels provided to the model** (despite the field name, these are not model predictions). |
| `text` | string | Original input sentence. |
| `data.tokens` | list[string] | Tokens produced by the LLM; punctuation is split off (see `"finish", "."` in the example record). |
| `data.labels` | list[string] | BIO labels. |
| `data.spans` | list[object] | Spans with attributes. |
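
As an illustration of how these fields fit together, the sketch below reads a record shaped like the example above and recovers the tokens each span covers. It treats `end` as an inclusive token index, which is an assumption inferred from the example record (`start: 0`, `end: 7` covers eight tokens). The helper names `span_tokens`, `summarize`, and `load_rows` are hypothetical, and the `"train"` split name in `load_rows` is also an assumption.

```python
from typing import Any, Dict, List

# Record copied from the "Example Record" section, trimmed to the fields used here.
EXAMPLE: Dict[str, Any] = {
    "src_id": "l2_345",
    "original_llm_predictions": ["gratitude"],
    "text": "Thanks for staying late to help me finish.",
    "data": {
        "tokens": ["Thanks", "for", "staying", "late", "to", "help", "me", "finish", "."],
        "labels": ["B-EMO"] + ["I-EMO"] * 7 + ["O"],
        "spans": [
            {
                "type": "EMO",
                "subtype": "Gratitude",
                "start": 0,
                "end": 7,
                "text": "Thanks for staying late to help me finish",
                "attrs": {"valence": "pos", "intensity": "med"},
            }
        ],
    },
}


def span_tokens(record: Dict[str, Any], span: Dict[str, Any]) -> List[str]:
    """Tokens covered by a span. Assumption: `end` is an inclusive token index."""
    tokens = record["data"]["tokens"]
    return tokens[span["start"]: span["end"] + 1]


def summarize(record: Dict[str, Any]) -> List[str]:
    """One line per span: the emotion subtype and the text it covers."""
    lines = []
    for span in record["data"]["spans"]:
        covered = " ".join(span_tokens(record, span))
        lines.append(f'{span["subtype"]}: "{covered}"')
    return lines


def load_rows(split: str = "train"):
    """Hypothetical loader; needs the `datasets` package and network access.
    The split name is an assumption, not stated in the card."""
    from datasets import load_dataset  # imported lazily so the rest runs offline
    return load_dataset("sdeakin/GoEmotions-Projected-BIO-Emotions", split=split)


print(summarize(EXAMPLE))
```

If the inclusive-`end` reading is wrong for some rows, comparing `" ".join(span_tokens(...))` against the span's own `text` field is a quick way to detect it.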
---
## Generation Process
### 1. Ground-Truth Emotion Projection
The LLM is not tasked with labeling emotions.
Instead, GoEmotions labels are inserted into the prompt, and the model *projects* them onto:
* token-level BIO tags
* explicit spans
* fine-grained emotional attributes
### 2. Prompt Template
The Level-2 Projected Prompt (`prompts/level_2.txt`) instructs the LLM to:
* echo the input text
* tokenize
* produce token-aligned BIO tagging
* output span objects with attributes
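
To make the projection idea concrete, here is a minimal sketch of how known labels could be injected into a prompt. The actual contents of `prompts/level_2.txt` are not reproduced in this card, so the template wording and the function name `build_projection_prompt` are invented purely for illustration and should not be read as the prompt that was used.

```python
from typing import List


def build_projection_prompt(text: str, gold_labels: List[str]) -> str:
    """Hypothetical prompt builder: the gold labels are given to the model,
    which is only asked to locate them in the text, not to choose them."""
    label_list = ", ".join(gold_labels)
    return (
        f"Known emotion label(s) for this text: {label_list}\n"
        "Do not predict new emotions. For the text below:\n"
        "1. Echo the input text.\n"
        "2. Tokenize it.\n"
        "3. Give one BIO tag (B-EMO, I-EMO, O) per token.\n"
        "4. Output span objects with attributes as JSON.\n"
        f"Text: {text}\n"
    )


prompt = build_projection_prompt(
    "Thanks for staying late to help me finish.", ["gratitude"]
)
print(prompt)
```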
### 3. Cleaning & Validation
The cleaning step applies the following checks to every generated record:
| Step | Description |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **Schema validation** | Ensures presence of required fields (`tokens`, `labels`, `spans`). |
| **Token/label alignment** | Verifies BIO label count equals token count. |
| **Span consistency** | Confirms that `start` / `end` select a token slice and that the slice reconstructs the span `text`. |
| **Attribute normalization** | Maps attribute values to controlled vocabularies. |
| **Emotion label validation** | Confirms span `subtype` matches the official GoEmotions taxonomy (27 emotions + neutral). Rejects hallucinated or invalid emotion names. |
| **Confidence bounds** | Ensures `confidence` ∈ `[0, 1]`. |
| **Rejected sample logging** | Any failed entry is saved to an `_incorrect.jsonl` audit file. |

* Final cleaned dataset size: **196,853 entries**
* Rejected during cleaning: **3,385 entries**
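
The checks in the table can be approximated with a short validator. This is a sketch under stated assumptions, not the cleaning script that produced the dataset: it assumes `end` is an inclusive token index, that span text is the space-joined token slice, and that `subtype` is compared case-insensitively against the GoEmotions label names. Attribute normalization is omitted because the controlled vocabularies are not listed in this card. The name `validate` is hypothetical.

```python
from typing import Any, Dict, List

# The 27 GoEmotions emotion labels plus "neutral".
GOEMOTIONS = {
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
}


def validate(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a record's `data` object (empty if clean)."""
    # Schema validation: required fields must be present.
    missing = [k for k in ("tokens", "labels", "spans") if k not in data]
    if missing:
        return [f"missing fields: {missing}"]

    errors: List[str] = []
    tokens, labels = data["tokens"], data["labels"]

    # Token/label alignment: one BIO label per token.
    if len(tokens) != len(labels):
        errors.append("token/label count mismatch")

    for i, span in enumerate(data["spans"]):
        start, end = span["start"], span["end"]
        # Span consistency (assumption: inclusive `end`, space-joined text).
        if not (0 <= start <= end < len(tokens)):
            errors.append(f"span {i}: indices out of range")
        elif " ".join(tokens[start:end + 1]) != span["text"]:
            errors.append(f"span {i}: text does not match token slice")
        # Emotion label validation (assumption: case-insensitive match).
        if span["subtype"].lower() not in GOEMOTIONS:
            errors.append(f"span {i}: unknown emotion {span['subtype']!r}")
        # Confidence bounds, checked only when the attribute is present.
        conf = (span.get("attrs") or {}).get("confidence")
        if conf is not None and not (0.0 <= conf <= 1.0):
            errors.append(f"span {i}: confidence {conf} outside [0, 1]")
    return errors


good = {
    "tokens": ["Thanks", "for", "staying", "late", "to", "help", "me", "finish", "."],
    "labels": ["B-EMO"] + ["I-EMO"] * 7 + ["O"],
    "spans": [{"type": "EMO", "subtype": "Gratitude", "start": 0, "end": 7,
               "text": "Thanks for staying late to help me finish",
               "attrs": {"confidence": 0.97}}],
}
print(validate(good))
```

If the 196,853 kept and 3,385 rejected entries together account for every generated output (an assumption; the card does not say so), the rejection rate is 3,385 / 200,238, or roughly 1.7%.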
---
## Intended Uses
### Span-Based Emotion Taggers
Train token-level or span-level models for emotion extraction.
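
When evaluating a token-level tagger at the span level, predicted BIO sequences have to be decoded back into spans. The sketch below is one common way to do that for the `B-EMO` / `I-EMO` / `O` tag set; it returns inclusive `(start, end)` token indices to match the example record, and it leniently treats a stray `I-EMO` with no preceding `B-EMO` as the start of a span. Both choices are assumptions, and `bio_to_spans` is a hypothetical name.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Decode B-EMO / I-EMO / O tags into inclusive (start, end) token spans."""
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == "B-EMO":
            if start is not None:      # a new B- closes the previous span
                spans.append((start, i - 1))
            start = i
        elif label == "I-EMO":
            if start is None:          # stray I-: open a span here (lenient)
                start = i
        else:                          # "O" closes any open span
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:              # span running to the end of the sequence
        spans.append((start, len(labels) - 1))
    return spans


labels = ["B-EMO"] + ["I-EMO"] * 7 + ["O"]
print(bio_to_spans(labels))  # [(0, 7)], matching start/end in the example record
```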
### Tri-Tower / Contrastive Architectures
Use BIO spans + attributes for:
* span tower
* definition tower alignment
* context tower supervision
### Targeted Emotion Extraction
Spans may carry `target_text` and `target_relation` attributes, which can supervise models that link an emotion to the entity it is directed at.
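
A minimal sketch of pulling such supervision out of a record: it collects `(emotion, target_text, target_relation)` triples from spans that have a `target_text` attribute and skips spans that do not, since the target metadata is optional. The function name `emotion_targets` is hypothetical; the attribute keys are taken from the example record.

```python
from typing import Any, Dict, List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]


def emotion_targets(record: Dict[str, Any]) -> List[Triple]:
    """Collect (subtype, target_text, target_relation) for spans that name a target."""
    triples: List[Triple] = []
    for span in record["data"]["spans"]:
        attrs = span.get("attrs") or {}
        target = attrs.get("target_text")
        if target is not None:  # target metadata is optional per span
            triples.append((span["subtype"], target, attrs.get("target_relation")))
    return triples


record = {
    "data": {
        "spans": [
            {"subtype": "Gratitude",
             "attrs": {"target_text": "you", "target_relation": "benefactor"}},
            {"subtype": "Joy", "attrs": {"valence": "pos"}},  # no target: skipped
        ]
    }
}
print(emotion_targets(record))  # [('Gratitude', 'you', 'benefactor')]
```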
### Attribute Prediction
Multitask learning for valence, intensity, certainty, etc.
---
## Limitations
* Emotional attributes (e.g., intensity) are subjective and may be noisy.
* BIO span boundaries reflect LLM judgments.
* Dataset inherits biases from Llama-3 and Reddit-based GoEmotions data.
* English-only, informal tone.
---
## Citation
```bibtex
@misc{goemotions_projected_bio_emotions,
title = {GoEmotions Projected BIO + Span Tags (LLM-Generated)},
author = {Sheryl D. and contributors},
year = {2025},
note = {LLM-projected span annotations using llama3:instruct.}
}
```
Provider:
sdeakin



