sdeakin/LLM-BIO-Emotions
Hugging Face | Updated 2025-12-09 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/sdeakin/LLM-BIO-Emotions
Resource description:
---
language: en
license: cc-by-4.0
pretty_name: LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)
tags:
- goemotions
- llm-generated
- bio-tagging
- span-extraction
- emotion-classification
- synthetic
dataset_info:
  features:
  - name: src_id
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: prompt
    dtype: string
  - name: level
    dtype: string
  - name: predictions
    sequence: string
  - name: text
    dtype: string
  - name: data
    struct:
    - name: tokens
      sequence: string
    - name: labels
      sequence: string
    - name: spans
      sequence:
        struct:
        - name: type
          dtype: string
        - name: subtype
          dtype: string
        - name: start
          dtype: int32
        - name: end
          dtype: int32
        - name: text
          dtype: string
        - name: attrs
          struct: {}
paperswithcode_id: go-emotions
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
---
# Dataset Card for **LLM-BIO-Emotions**
## Dataset Summary
**LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)**
**LLM-BIO-Emotions** is a fully LLM-generated emotion labeling and BIO tagging dataset created using `llama3:instruct` with a Level-2-style prompt.
Unlike the projection-based datasets (GoEmotions-Projected-BIO, LLM-Projected-BIO), this dataset is generated without any external labels:
* the LLM **receives no ground-truth or precomputed labels**
* the LLM **predicts emotion labels entirely on its own**
* the LLM **generates BIO spans and emotional attributes entirely autonomously**
This dataset provides a **pure LLM baseline** for emotion-span extraction and serves as a comparison point for:
* Human-grounded projections
* LLM-Tagged GoEmotions → BIO projections
* Hybrid or contrastive span-tower training
All data is stored in:
**`LLM-BIO-Emotions.jsonl`**
---
## Dataset Structure
### Example Record
```json
{
  "src_id": "l2_11023",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "level_2",
  "level": "level2",
  "predictions": ["annoyance"],
  "text": "Stop asking me the same question.",
  "data": {
    "tokens": ["Stop", "asking", "me", "the", "same", "question", "."],
    "labels": ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"],
    "spans": [
      {
        "type": "EMO",
        "subtype": "Annoyance",
        "start": 0,
        "end": 5,
        "text": "Stop asking me the same question",
        "attrs": {
          "valence": "neg",
          "intensity": "med",
          "certainty": "asserted",
          "temporality": "present",
          "source": "self",
          "emotion_group": "negative_affect",
          "sentence_index": 0,
          "clause_index": 0,
          "confidence": 0.91,
          "target_text": "you",
          "target_relation": "cause"
        }
      }
    ]
  }
}
```
---
## Data Fields
### Top-Level Fields
| Field | Type | Description |
| ------------- | ------------ | ----------------------------------------- |
| `src_id` | string | Unique row identifier. |
| `model` | string | LLM used (`llama3:instruct`). |
| `provider` | string | Backend provider (`ollama-local`). |
| `prompt` | string | Prompt identifier (`level_2`, the Level-2 autonomous tagging prompt). |
| `level` | string | Always `level2`. |
| `predictions` | list[string] | Emotion labels predicted by the LLM. |
| `text` | string | Input sentence. |
| `data.tokens` | list[string] | Tokenized text. |
| `data.labels` | list[string] | BIO tags aligned to tokens. |
| `data.spans` | list[object] | Spans describing emotional segments. |
### Span Fields
| Field | Type | Description |
| --------- | ------ | --------------------------------------------------------------- |
| `type` | string | Usually `"EMO"`. |
| `subtype` | string | LLM-predicted emotion name. |
| `start` | int | Token index of the first token in the span (0-based). |
| `end` | int | Token index of the last token in the span (inclusive in the example record). |
| `text` | string | Extracted span text. |
| `attrs` | dict | valence, intensity, certainty, temporality, emotion_group, etc. |
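
The relationship between `data.labels` and `data.spans` can be illustrated with a small decoder. This is a minimal sketch, not the dataset's own tooling: the helper name `bio_to_spans` is hypothetical, and it assumes `end` is inclusive, as the example record suggests (tokens 0 through 5 are tagged and `end` is 5).

```python
def bio_to_spans(labels):
    """Decode BIO tags into (type, start, end) triples with an inclusive end."""
    spans, current = [], None
    for i, tag in enumerate(labels):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and current and current[0] == kind:
            current[2] = i  # extend the span that is currently open
            continue
        if current:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a span start
            current = [kind, i, i]
    if current:
        spans.append(tuple(current))
    return spans


labels = ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"]
print(bio_to_spans(labels))
```

Comparing the decoded triples with the stored `start` and `end` values is one way to spot records where the tags and the span list disagree.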
---
## Generation Process
### 1. Autonomous LLM Emotion Detection
The LLM receives **only the raw text** and determines:
* which emotions are present
* where the emotional trigger spans lie
* which attributes apply
Because no reference labels or candidate spans are supplied, the output reflects the LLM's own judgment alone.
### 2. Level-2 Prompt
The Level-2 prompt instructs the LLM to output:
* tokens
* BIO labels
* spans with indices
* emotional attributes
* optional target entity + relation
### 3. Cleaning & Validation
| Step | Description |
| ---------------------------- | ----------------------------------------------------------------------- |
| **Schema validation** | Checks that all required fields exist. |
| **Token/label alignment** | Ensures `labels` length matches `tokens` length. |
| **Span consistency** | Confirms span indices match token slices and span text reconstruction. |
| **Attribute normalization** | Converts attribute values to controlled vocabularies. |
| **Emotion label validation** | Ensures emotion names match allowed taxonomy (LLM-Simple + GoEmotions). |
| **Confidence checks** | Ensures `confidence ∈ [0,1]`. |
| **Rejected sample logging** | Invalid samples are saved for auditing. |
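
Several of the checks in the table can be sketched in a few lines. The code below is an illustration under stated assumptions, not the pipeline that produced the dataset: the function name `validate_record` is hypothetical, span text is assumed to be the covered tokens joined by single spaces, and `end` is assumed to be inclusive. Attribute normalization and taxonomy checks are omitted because the controlled vocabularies are not listed in this card.

```python
REQUIRED_KEYS = ("src_id", "model", "provider", "prompt",
                 "level", "predictions", "text", "data")


def validate_record(record):
    """Return a list of problems found in one record (empty means it passed)."""
    problems = [f"missing field: {k}" for k in REQUIRED_KEYS if k not in record]
    data = record.get("data") or {}
    tokens = data.get("tokens") or []
    labels = data.get("labels") or []

    # Token/label alignment
    if len(tokens) != len(labels):
        problems.append("tokens and labels differ in length")

    for i, span in enumerate(data.get("spans") or []):
        start, end = span.get("start"), span.get("end")
        # Span indices must be integers inside the token range
        if not (isinstance(start, int) and isinstance(end, int)
                and 0 <= start <= end < len(tokens)):
            problems.append(f"span {i}: indices out of range")
            continue
        # Span consistency (assumes space-joined tokens, inclusive end)
        if " ".join(tokens[start:end + 1]) != span.get("text"):
            problems.append(f"span {i}: text does not match token slice")
        # Confidence check, applied only when a confidence value is present
        conf = (span.get("attrs") or {}).get("confidence")
        if conf is not None and not (
                isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
            problems.append(f"span {i}: confidence outside [0, 1]")
    return problems
```

Returning a list of problems rather than raising makes it straightforward to log rejected samples for auditing, in the spirit of the last row of the table.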
---
## Intended Uses
### Benchmark autonomous LLM reasoning
Study how an LLM behaves with **no supervision or projection**, including:
* over/under-prediction of emotions
* span misalignment behavior
* consistency relative to LLM-Simple and GoEmotions projections
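
A first look at over- or under-prediction could start from simple counts. The sketch below is only an illustration: `prediction_stats` is a hypothetical helper, the toy records are invented for the example, and labels are lower-cased on the assumption that case differences (such as `annoyance` versus `Annoyance` in the example record) are not meaningful.

```python
from collections import Counter


def prediction_stats(records):
    """Count how often each label is predicted and how many labels each record gets."""
    label_counts = Counter()
    labels_per_record = Counter()
    for record in records:
        preds = record["predictions"]
        label_counts.update(p.lower() for p in preds)
        labels_per_record[len(preds)] += 1
    return label_counts, labels_per_record


# Toy records for illustration only, not taken from the dataset.
toy = [
    {"predictions": ["annoyance"]},
    {"predictions": ["Annoyance", "anger"]},
    {"predictions": []},
]
label_counts, labels_per_record = prediction_stats(toy)
print(label_counts.most_common())
print(sorted(labels_per_record.items()))
```

The same counts computed on a reference dataset would give a baseline against which the LLM's label frequencies can be compared.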
### Train fully synthetic span taggers
BIO-tagged emotional spans can be used to train:
* sequence taggers
* span extractors
* emotion classification models
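
For a sequence tagger that should predict the emotion type and not only the generic `EMO` tag, the span `subtype` can be folded into the tags before building a label vocabulary. This is a hedged sketch: `typed_bio` and `build_label_vocab` are hypothetical helpers, `end` is assumed to be inclusive, and subtypes are lower-cased by assumption.

```python
def typed_bio(data):
    """Build subtype-specific BIO tags (e.g. B-annoyance) from a record's `data`."""
    tags = ["O"] * len(data["tokens"])
    for span in data["spans"]:
        name = span["subtype"].lower()
        tags[span["start"]] = f"B-{name}"
        for i in range(span["start"] + 1, span["end"] + 1):
            tags[i] = f"I-{name}"
    return tags


def build_label_vocab(tag_sequences):
    """Map each tag to an integer id in order of first appearance, with O as 0."""
    vocab = {"O": 0}
    for tags in tag_sequences:
        for tag in tags:
            vocab.setdefault(tag, len(vocab))
    return vocab


data = {
    "tokens": ["Stop", "asking", "me", "the", "same", "question", "."],
    "spans": [{"subtype": "Annoyance", "start": 0, "end": 5}],
}
tags = typed_bio(data)
vocab = build_label_vocab([tags])
print([vocab[t] for t in tags])
```

The resulting integer ids are the form most token-classification training loops expect; overlapping spans, if any occur, would need a policy that this sketch does not define.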
### Build contrastive or Tri-Tower models
Use spans + attributes for span-tower or attribute-tower contrastive objectives.
### Compare supervisory sources
This dataset provides the “LLM-autonomous baseline” to compare with:
* human-grounded projections (GoEmotions-Projected-BIO)
* LLM-grounded projections (LLM-Projected-BIO)
* label-only datasets (LLM-Simple)
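
One simple way to quantify such a comparison is token-level overlap between two BIO sequences for the same text. The sketch below assumes that both sources share the same tokenization and that records can be matched one to one, neither of which this card confirms; `token_overlap` is a hypothetical helper and the reference tags are invented for illustration.

```python
def token_overlap(reference, candidate):
    """Token-level precision, recall and F1 over tokens tagged as emotional (not "O")."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    ref = {i for i, tag in enumerate(reference) if tag != "O"}
    cand = {i for i, tag in enumerate(candidate) if tag != "O"}
    hits = len(ref & cand)
    precision = hits / len(cand) if cand else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical reference tags from another source for the same seven tokens.
reference = ["O", "O", "O", "O", "B-EMO", "I-EMO", "O"]
# Tags from the example record in this card.
candidate = ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"]
print(token_overlap(reference, candidate))
```

Low precision with high recall, as in this toy pair, would indicate that the LLM marks longer spans than the reference does.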
---
## Limitations
* Entirely synthetic → includes LLM-specific biases.
* Spans may be inconsistent or subjective.
* Emotion attributes (intensity, certainty, source, etc.) vary in reliability.
* Reddit-based text → inherits domain-specific language patterns.
---
## Usage
### Load with 🤗 Datasets
```python
from datasets import load_dataset

# Load the single JSONL file as the "train" split.
ds = load_dataset(
    "json",
    data_files="LLM-BIO-Emotions.jsonl",
    split="train",
)
```
### Direct JSONL Reading
```python
import json

# Each line of the file is one JSON record.
with open("LLM-BIO-Emotions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["predictions"], record["data"]["spans"])
```
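
For analysis it is often convenient to flatten the nested records into one row per span. The following is a minimal sketch: `iter_span_rows` is a hypothetical helper, the choice of columns is arbitrary, and an in-memory string stands in for the file so the example runs on its own.

```python
import io
import json


def iter_span_rows(lines):
    """Yield one flat dict per span from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for span in record["data"]["spans"]:
            attrs = span.get("attrs") or {}
            yield {
                "src_id": record["src_id"],
                "subtype": span["subtype"],
                "text": span["text"],
                "valence": attrs.get("valence"),
                "intensity": attrs.get("intensity"),
            }


# A one-record stand-in for LLM-BIO-Emotions.jsonl, abridged from the example record.
sample = io.StringIO(json.dumps({
    "src_id": "l2_11023",
    "data": {"spans": [{
        "subtype": "Annoyance",
        "text": "Stop asking me the same question",
        "attrs": {"valence": "neg", "intensity": "med"},
    }]},
}) + "\n")

for row in iter_span_rows(sample):
    print(row)
```

Passing an open file handle instead of the `io.StringIO` object applies the same logic to the real file.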
---
## Citation
```bibtex
@inproceedings{demszky2020goemotions,
  title     = {GoEmotions: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and others},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
}
@dataset{llm_bio_emotions,
  title  = {LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)},
  author = {Sheryl D. and contributors},
  year   = {2025},
}
```
Provider:
sdeakin



