sdeakin/LLM-BIO-Emotions

Hugging Face | Updated 2025-12-09 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/sdeakin/LLM-BIO-Emotions
Resource description:
---
language: en
license: cc-by-4.0
pretty_name: LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)
tags:
- goemotions
- llm-generated
- bio-tagging
- span-extraction
- emotion-classification
- synthetic
dataset_info:
  features:
  - name: src_id
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: prompt
    dtype: string
  - name: level
    dtype: string
  - name: predictions
    sequence: string
  - name: text
    dtype: string
  - name: data
    struct:
    - name: tokens
      sequence: string
    - name: labels
      sequence: string
    - name: spans
      sequence:
        struct:
        - name: type
          dtype: string
        - name: subtype
          dtype: string
        - name: start
          dtype: int32
        - name: end
          dtype: int32
        - name: text
          dtype: string
        - name: attrs
          struct: {}
paperswithcode_id: go-emotions
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
---

# Dataset Card for **LLM-BIO-Emotions**

## Dataset Summary

**LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)**

**LLM-BIO-Emotions** is a fully LLM-generated emotion labeling and BIO tagging dataset created using `llama3:instruct` with a Level-2-style prompt.

Unlike the projection-based datasets (GoEmotions-Projected-BIO, LLM-Projected-BIO), in this dataset:

* the LLM **does not receive any ground-truth or precomputed labels**
* the LLM **predicts emotion labels entirely on its own**
* the LLM **generates BIO spans and emotional attributes entirely autonomously**

This dataset provides a **pure LLM baseline** for emotion-span extraction and serves as a comparison point for:

* Human-grounded projections
* LLM-Tagged GoEmotions → BIO projections
* Hybrid or contrastive span-tower training

All data is stored in: **`LLM-BIO-Emotions.jsonl`**

---

## Dataset Structure

### Example Record

```json
{
  "src_id": "l2_11023",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "level_2",
  "level": "level2",
  "predictions": ["annoyance"],
  "text": "Stop asking me the same question.",
  "data": {
    "tokens": ["Stop", "asking", "me", "the", "same", "question", "."],
    "labels": ["B-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "I-EMO", "O"],
    "spans": [
      {
        "type": "EMO",
        "subtype": "Annoyance",
        "start": 0,
        "end": 5,
        "text": "Stop asking me the same question",
        "attrs": {
          "valence": "neg",
          "intensity": "med",
          "certainty": "asserted",
          "temporality": "present",
          "source": "self",
          "emotion_group": "negative_affect",
          "sentence_index": 0,
          "clause_index": 0,
          "confidence": 0.91,
          "target_text": "you",
          "target_relation": "cause"
        }
      }
    ]
  }
}
```

---

## Data Fields

### Top-Level Fields

| Field         | Type         | Description                               |
| ------------- | ------------ | ----------------------------------------- |
| `src_id`      | string       | Unique row identifier.                    |
| `model`       | string       | LLM used (`llama3:instruct`).             |
| `provider`    | string       | Backend provider (`ollama-local`).        |
| `prompt`      | string       | Prompt used (Level-2 autonomous tagging). |
| `level`       | string       | Always `level2`.                          |
| `predictions` | list[string] | Emotion labels predicted by the LLM.      |
| `text`        | string       | Input sentence.                           |
| `data.tokens` | list[string] | Tokenized text.                           |
| `data.labels` | list[string] | BIO tags aligned to tokens.               |
| `data.spans`  | list[object] | Spans describing emotional segments.      |

### Span Fields

| Field     | Type   | Description                                                     |
| --------- | ------ | --------------------------------------------------------------- |
| `type`    | string | Usually `"EMO"`.                                                |
| `subtype` | string | LLM-predicted emotion name.                                     |
| `start`   | int    | Token start index.                                              |
| `end`     | int    | Token end index (inclusive in the example record above).        |
| `text`    | string | Extracted span text.                                            |
| `attrs`   | dict   | valence, intensity, certainty, temporality, emotion_group, etc. |

---

## Generation Process

### 1. Autonomous LLM Emotion Detection

The LLM receives **only the raw text** and determines:

* which emotions are present
* where the emotional trigger spans lie
* which attributes apply

This represents the pure LLM reasoning process without constraints.

### 2. Level-2 Prompt

The Level-2 prompt instructs the LLM to output:

* tokens
* BIO labels
* spans with indices
* emotional attributes
* optional target entity + relation

### 3. Cleaning & Validation

| Step                         | Description                                                             |
| ---------------------------- | ----------------------------------------------------------------------- |
| **Schema validation**        | Checks that all required fields exist.                                  |
| **Token/label alignment**    | Ensures `labels` length matches `tokens` length.                        |
| **Span consistency**         | Confirms span indices match token slices and span text reconstruction.  |
| **Attribute normalization**  | Converts attribute values to controlled vocabularies.                   |
| **Emotion label validation** | Ensures emotion names match allowed taxonomy (LLM-Simple + GoEmotions). |
| **Confidence checks**        | Ensures `confidence ∈ [0,1]`.                                           |
| **Rejected sample logging**  | Invalid samples are saved for auditing.                                 |

---

## Intended Uses

### Benchmark autonomous LLM reasoning

Study how an LLM behaves with **no supervision or projection**, including:

* over/under-prediction of emotions
* span misalignment behavior
* consistency relative to LLM-Simple and GoEmotions projections

### Train fully synthetic span taggers

BIO-tagged emotional spans can be used to train:

* sequence taggers
* span extractors
* emotion classification models

### Build contrastive or Tri-Tower models

Use spans + attributes for span-tower or attribute-tower contrastive objectives.

### Compare supervisory sources

This dataset provides the “LLM-autonomous baseline” to compare with:

* human-grounded projections (GoEmotions-Projected-BIO)
* LLM-grounded projections (LLM-Projected-BIO)
* label-only datasets (LLM-Simple)

---

## Limitations

* Entirely synthetic → includes LLM-specific biases.
* Spans may be inconsistent or subjective.
* Emotion attributes (intensity, certainty, source, etc.) vary in reliability.
* Reddit-based text → inherits domain-specific language patterns.

---

## Usage

### Load with 🤗 Datasets

```python
from datasets import load_dataset

# Load the local JSONL file with the generic "json" builder.
ds = load_dataset(
    "json",
    data_files="LLM-BIO-Emotions.jsonl",
    split="train"
)
```

### Direct JSONL Reading

```python
import json

# Each line of the file is one JSON record.
with open("LLM-BIO-Emotions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["predictions"], record["data"]["spans"])
```

---

## Citation

```bibtex
@article{demszky2020goemotions,
  title   = {GoEmotions: A Dataset of Fine-Grained Emotions},
  author  = {Demszky, Dorottya and others},
  journal = {ACL},
  year    = {2020}
}

@dataset{llm_bio_emotions,
  title  = {LLM-Generated Emotion Labels and BIO-Tagged Spans (No Projection)},
  author = {Sheryl D. and contributors},
  year   = {2025},
```
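Several of the checks listed under Cleaning & Validation (schema validation, token/label alignment, span consistency, confidence range) can be re-run on a downloaded record. The following is a minimal sketch, not the authors' pipeline: the function name `validate_record` is hypothetical, and it assumes that `end` is an inclusive token index and that span text is the tokens joined by single spaces, both inferred only from the example record.

```python
from typing import Any

REQUIRED_TOP = ("src_id", "model", "provider", "prompt",
                "level", "predictions", "text", "data")
REQUIRED_DATA = ("tokens", "labels", "spans")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems for one record (empty list = passed).

    Hypothetical helper; it only approximates the checks named in the card.
    """
    # Schema validation: stop early if required fields are missing.
    missing = [k for k in REQUIRED_TOP if k not in record]
    if missing:
        return [f"missing top-level fields: {missing}"]
    data = record["data"]
    missing = [k for k in REQUIRED_DATA if k not in data]
    if missing:
        return [f"missing data fields: {missing}"]

    problems: list[str] = []
    tokens, labels = data["tokens"], data["labels"]

    # Token/label alignment: one BIO tag per token.
    if len(tokens) != len(labels):
        problems.append(f"{len(tokens)} tokens but {len(labels)} labels")

    for i, span in enumerate(data["spans"]):
        start, end = span["start"], span["end"]
        # Assumption: `end` is inclusive, as in the example record.
        if not (0 <= start <= end < len(tokens)):
            problems.append(f"span {i}: indices ({start}, {end}) out of range")
            continue
        # Assumption: span text is the token slice joined by single spaces.
        rebuilt = " ".join(tokens[start:end + 1])
        if rebuilt != span["text"]:
            problems.append(f"span {i}: text mismatch ({rebuilt!r})")
        # Confidence check: if present, the value must lie in [0, 1].
        conf = (span.get("attrs") or {}).get("confidence")
        if conf is not None and not (0.0 <= conf <= 1.0):
            problems.append(f"span {i}: confidence {conf} outside [0, 1]")

    return problems
```

Calling `validate_record(record)` inside the JSONL reading loop and collecting the non-empty results gives a rough audit list, similar in spirit to the rejected sample logging step.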
Provided by:
sdeakin