LiXinran1/prc-emo-demonstration-retrieval-repository

Source: Hugging Face · Updated: 2025-12-10 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/LiXinran1/prc-emo-demonstration-retrieval-repository
Resource description:
---
license: cc-by-4.0
---

# 📚 PRC-Emo: A Demonstration Retrieval Repository for Emotion Recognition in Conversation

This repository introduces the first dedicated **demonstration retrieval repository** for Emotion Recognition in Conversation (ERC), as presented in the paper:

> "Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning" (AAAI 2026)

It provides high-quality, utterance-level emotional exemplars that support **retrieval-augmented prompting (RAP)** and significantly improve LLM-based emotional reasoning.

---

## 📊 Dataset Summary

The PRC-Emo repository consists of **36,712** utterance-level samples aggregated from four sources:

| Source | Count |
| :--- | :--- |
| GPT-4o Augmented Emotional Dialogues (Human Verified) | 14,009 |
| IEMOCAP | 5,163 |
| MELD | 9,989 |
| EmoryNLP | 7,551 |
| **Total** | **36,712** |

### Data Fields

Each sample is a complete entry designed for retrieval and training. The fields below are listed with the key each one uses in the example JSON entry:

* **Utterance text** (`text`)
* **Emotion label** (`label`)
* **Source dataset** (`dataset`)
* **Dialogue ID** (`conversation_id`)
* **Utterance position** (`utterance_id`)
* **SBERT embedding vector** (`vector`; 768-dim, for retrieval)

The repository is well suited to **retrieval-augmented prompting (RAP)**, **few-shot prompting**, and **ERC model training**.

---

## 🛠️ How the Repository Was Built

Because existing ERC datasets are significantly imbalanced across emotions, we constructed this repository using a multi-stage approach:

### 1. High-Quality Generation with GPT-4o

Emotional dialogues were generated using a two-stage prompting strategy across **six real-world scenarios**: *healthcare, workplace, education, family, social, and entertainment.*

* **Stage 1: Topic Generation**
    * GPT-4o generated 30 diverse subtopics for each scenario.
* **Stage 2: Dialogue Generation**
    * GPT-4o produced two-speaker emotional dialogues with sentence-level emotion labels. **Low-frequency emotions were intentionally reinforced** to ensure balance.

### 2. Label Masking and Rigorous Human Verification

A strict annotation pipeline was used to control quality:

1. GPT-4o produces labeled dialogues.
2. Emotion labels are automatically masked/removed.
3. Two human annotators independently **re-label every utterance**.
4. A sample is kept **only if both annotators' labels exactly match** the original GPT-4o label.

This process was repeated for three rounds so that the resulting dataset is diverse, balanced, and consistently labeled.

---

## 💻 Data Format

### Example JSON Entry

```json
{
  "text": "I’ve been feeling great lately! I just had a checkup, and the doctor said everything looks good!",
  "label": "happy",
  "dataset": "my_data",
  "conversation_id": null,
  "utterance_id": 0,
  "vector": [0.1532, -0.9183, ...]
}
{
  "text": "How you doing?",
  "label": "neutral",
  "dataset": "iemocap",
  "conversation_id": "Ses03F_impro06",
  "utterance_id": 0,
  "vector": [0.1532, -0.9183, ...]
}
```

---

## 🎯 Recommended Use Cases

* **Retrieval-Augmented Prompting (RAP)**
    * Retrieve the top-$k$ most similar emotional utterances (using SBERT cosine similarity) and insert them into your LLM prompts as contextual examples.
* **Few-Shot Prompting**
    * Use the repository as a demonstration bank for LLM-based ERC tasks.

---

## 📝 Additional Notes

* **Embeddings:** The provided embeddings were generated using **SBERT** (768-dim). Users may recompute and replace the `vector` field if a different encoder is preferred.
* **Contact:** 963707605@mail.dlut.edu.cn
* **License:** The dataset is released under **CC BY 4.0**.
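
The retrieval-augmented prompting use case described above (top-$k$ retrieval by cosine similarity, then inserting the hits into a prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity`, `top_k_demonstrations`, and `build_prompt`, the prompt wording, and the 3-dimensional toy vectors are all assumptions made for the example. Real entries carry 768-dim SBERT vectors, and the query vector would need to come from the same encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_demonstrations(query_vec, entries, k=3):
    """Return the k (score, entry) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, e["vector"]), e) for e in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(utterance, demonstrations):
    """Format retrieved entries as few-shot examples (hypothetical template)."""
    lines = ["Classify the emotion of the final utterance.", ""]
    for _score, entry in demonstrations:
        lines.append(f'Utterance: {entry["text"]}')
        lines.append(f'Emotion: {entry["label"]}')
        lines.append("")
    lines.append(f"Utterance: {utterance}")
    lines.append("Emotion:")
    return "\n".join(lines)


# Toy entries: field names follow the example JSON entry, but the vectors are
# made-up 3-dim stand-ins and the third entry is invented for illustration.
entries = [
    {"text": "I just had a checkup, and the doctor said everything looks good!",
     "label": "happy", "vector": [0.9, 0.1, 0.0]},
    {"text": "How you doing?",
     "label": "neutral", "vector": [0.1, 0.9, 0.0]},
    {"text": "I can't believe they cancelled it.",
     "label": "sad", "vector": [0.0, 0.2, 0.9]},
]

# Hypothetical query embedding; in practice, encode the target utterance
# with the same encoder that produced the stored vectors.
query_vec = [0.8, 0.2, 0.1]

demos = top_k_demonstrations(query_vec, entries, k=2)
prompt = build_prompt("The results came back fine!", demos)
print(prompt)
```

A pure-Python loop keeps the sketch dependency-free; over all 36,712 entries, a vectorised matrix product or an approximate nearest-neighbour index would likely be a better fit.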
Provider:
LiXinran1