What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews - Replication Package
Source: DataCite Commons. Updated 2025-06-04; indexed 2026-04-25.
Download link:
https://figshare.com/articles/dataset/What_About_Emotions_Guiding_Fine-Grained_Emotion_Extraction_from_Mobile_App_Reviews_-_Replication_Package/28548638/5
Resource description:
Emotion analysis from app reviews - Replication package

Full paper accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025), Research Track.

📚 Summary of artifact

This artifact supports the replication of the study presented in the paper "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews", accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025). It provides a comprehensive framework for conducting fine-grained emotion analysis from mobile app reviews using both human and large language model (LLM)-based annotations.

The artifact includes:
- <b>Input</b>: A dataset of user reviews, emotion annotation guidelines, and ground truth annotations from human annotators.
- <b>Process</b>: Scripts for generating emotion annotations via LLMs (GPT-4o, Mistral Large 2, and Gemini 2.0 Flash), splitting annotations into iterations, computing agreement metrics (e.g., Cohen's Kappa), and evaluating correctness and cost-efficiency.
- <b>Output</b>: Annotated datasets (human and LLM-generated), agreement analyses, emotion statistics, and evaluation metrics including accuracy, precision, recall, and F1 score.

The artifact was developed to ensure transparency, reproducibility, and extensibility of the experimental pipeline.
It enables researchers to replicate, validate, or extend the emotion annotation process across different LLMs and configurations, contributing to the broader goal of integrating emotional insights into requirements engineering practices.

🔎 Artifact Location

The artifact is available at https://doi.org/10.6084/m9.figshare.28548638.

Citation details for this replication package and author information are at the end of this README file.

📂 Description of Artifact

- <b>Literature review</b>: results from the literature review on opinion mining and emotion analysis within the context of software-based reviews.
- <b>Data</b>: data used in the study, including user reviews (input), human annotations (ground truth), and LLM-based annotations (generated by the assistants).
- <b>Code</b>: code used in the study, including the generative annotation, data processing, and evaluation.

📖 Literature review

Study selection and results are available in the <code>literature_review/study-selection.xlsx</code> file. This file contains the following sheets:
- <code>iteration_1_IC_analysis</code>: results from the first iteration of the inclusion criteria analysis.
- <code>iteration_1_feature_extraction</code>: results from the first iteration of the feature extraction analysis.
- <code>iteration_2_IC_analysis</code>: results from the second iteration of the inclusion criteria analysis.
- <code>iteration_2_feature_extraction</code>: results from the second iteration of the feature extraction analysis.
- <code>iteration_3_IC_analysis</code>: results from the third iteration of the inclusion criteria analysis.
- <code>iteration_3_feature_extraction</code>: results from the third iteration of the feature extraction analysis.
- <code>emotions</code>: statistical analysis of emotions covered by emotion taxonomies in the selected studies.

🗃️ Data

The <code>data</code> root folder contains the following files:
- <code>reviews.json</code> contains the reviews used in the study.
- <code>guidelines.txt</code> contains a .txt version of the annotation guidelines.
- <code>ground-truth.xlsx</code> contains the ground truth (human agreement) annotations for the reviews.

In addition, the <code>data</code> root folder contains the following subfolders:
- <code>assistants</code> contains the IDs of the assistants used for the generative annotation (see LLM-based annotation).
- <code>annotations</code> contains the results of the human and LLM-based annotation:
-- <code>iterations</code> contains both human and LLM-based annotations for each iteration.
-- <code>llm-annotations</code> contains the LLM-based annotations for each assistant, including results for various temperature values: low (0), medium (0.5), and high (1) (see LLM-based annotation).
- <code>agreements</code> contains the results of the agreement analysis between the human and LLM-based annotations (see Data Processing).
- <code>evaluation</code> contains the results of the evaluation of the LLM-based annotations (see Evaluation), including statistics, Cohen's Kappa, correctness, and cost-efficiency analysis, which covers token usage and reported human annotation times.

⚙️ System Requirements

All artifacts in this replication package run on any operating system that meets the following requirements:
- <b>OS</b>: Linux-based OS, macOS, or Windows with a Unix-like shell (for example, Git Bash)
- <b>Python 3.10</b>

Additionally, you will need at least one API key for OpenAI, Mistral, or Gemini. See Step 1 in Usage Instructions & Steps to reproduce.

💻 Installation Instructions

<b>⚙️ Install requirements</b>

Create a virtual environment:
<code>python -m venv venv</code>

Activate the virtual environment.
For Linux-based OS or macOS:
<code>source venv/bin/activate</code>

For Windows with a Unix-like shell (for example, Git Bash):
<code>source venv/Scripts/activate</code>

Install the Python dependencies by running the following command:
<pre>pip install -r requirements.txt</pre>

Now you're ready to start the annotation process!

💻 Usage Instructions & Steps to reproduce

We structure the code available in this replication package based on the stages involved in the LLM-based annotation process.

🤖 LLM-based annotation

The <code>llm_annotation</code> folder contains the code used to generate the LLM-based annotations.

There are two main scripts:
- <code>create_assistant.py</code> is used to create a new assistant with a particular provider and model. This script defines a common system prompt shared by all assistants, using the <code>data/guidelines.txt</code> file as the basis.
- <code>annotate_emotions.py</code> is used to annotate a set of reviews with emotions using a previously created assistant. This script checks the output format, collects some common metrics for cost-efficiency analysis, and generates the output files.

Our research includes an LLM-based annotation experiment with three LLMs: GPT-4o, Mistral Large 2, and Gemini 2.0 Flash. To illustrate the usage of the code, this README refers to the code execution for generating annotations using GPT-4o. However, full code is provided for all LLMs.

<b>🔑 Step 1: Add your API key</b>

If you haven't done this already, add your API key to the <code>.env</code> file in the root folder. For instance, for OpenAI, you can add the following:
<pre>OPENAI_API_KEY=sk-proj-...</pre>

<b>🛠️ Step 2: Create an assistant</b>

Create an assistant using the <code>create_assistant.py</code> script.
For instance, for GPT-4o, you can run the following command (note the provider-specific script name, <code>create_assistant_openai.py</code>):
<code>python ./code/llm_annotation/create_assistant_openai.py --guidelines ./data/guidelines.txt --model gpt-4o</code>

This will create an assistant loading the <code>data/guidelines.txt</code> file and using the GPT-4o model.

<b>📝 Step 3: Annotate emotions</b>

Annotate emotions using the <code>annotate_emotions.py</code> script. For instance, for GPT-4o, you can run the following command (again with the provider-specific script, <code>annotate_emotions_openai.py</code>) using a small subset of 100 reviews from the ground truth as an example:
<code>python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth-small.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10</code>

For annotating the whole dataset, run the following command (<b>IMPORTANT</b>: this will take more than 60 minutes due to the time the OpenAI, Mistral, and Gemini APIs take to process the requests!):
<code>python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10</code>

Parameters include:
- <code>input</code>: path to the input file containing the set of reviews to annotate (e.g., <code>data/ground-truth.xlsx</code>).
- <code>output</code>: path to the output folder where annotations will be saved (e.g., <code>data/annotations/llm/temperature-00/</code>).
- <code>batch_size</code>: number of reviews to annotate for each user request (e.g., 10).
- <code>model</code>: model to use for the annotation (e.g., <code>gpt-4o</code>).
- <code>temperature</code>: temperature for the model responses (e.g., 0).
- <code>sleep_time</code>: time to wait between batches, in seconds (e.g., 10).

This will annotate the emotions using the assistant created in the previous step, creating a new file with the same format as the <code>data/ground-truth.xlsx</code> file.

🔄 Data processing

In this stage, we refactor all files into iterations and consolidate the agreement between multiple annotators or LLM runs. This logic serves both human and LLM annotations. Parameters can be updated to include more annotators or LLM runs.

<b>✂️ Step 4: Split annotations into iterations</b>

We split the annotations into iterations based on the number of annotators or LLM runs. For instance, for GPT-4o (run 0), we can run the following command:
<code>python code/data_processing/split_annotations.py --input_file data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx --output_dir data/annotations/iterations/</code>

This facilitates the Kappa analysis and agreement in alignment with each human iteration.

<b>🤝 Step 5: Analyse agreement</b>

We consolidate the agreement between multiple annotators or LLM runs. For instance, for GPT-4o, we can run the following command to use the run from Step 3 (run 0) and three additional annotations (runs 1, 2, and 3) already available in the replication package (<b>NOTE</b>: we simplify the process to speed up the analysis and avoid delays in annotation):
<code>python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-0 gpt-4o-1 gpt-4o-2 gpt-4o-3</code>

For replicating our original study, run the following:
<code>python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-1 gpt-4o-2 gpt-4o-3</code>

📊 Evaluation

After consolidating agreements, we can evaluate both the Cohen's Kappa agreement and correctness between the human and LLM-based annotations. Our code allows any combination of annotators and LLM runs.

<b>📈 Step 6: Emotion statistics</b>

We evaluate the statistics of the emotions in the annotations, including emotion frequency, distribution, and correlation between emotions.
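To make these statistics concrete, here is a minimal sketch of frequency, distribution, and pairwise correlation over multi-label emotion annotations. It is not the package's <code>emotion_statistics.py</code>: the emotion labels, the <code>annotations</code> sample, and the function names are hypothetical, and the phi coefficient is only one assumed choice of correlation measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample: each review carries a set of emotion labels.
annotations = [
    {"joy", "trust"},
    {"anger"},
    {"joy"},
    {"anger", "sadness"},
    {"joy", "trust"},
]


def emotion_frequency(rows):
    """Count how many reviews carry each emotion."""
    return Counter(label for row in rows for label in row)


def emotion_distribution(rows):
    """Share of all assigned labels that each emotion accounts for."""
    freq = emotion_frequency(rows)
    total = sum(freq.values())
    return {label: count / total for label, count in freq.items()}


def phi_correlation(rows, a, b):
    """Phi coefficient between the presence of emotions a and b (assumed measure)."""
    n11 = sum(1 for r in rows if a in r and b in r)
    n10 = sum(1 for r in rows if a in r and b not in r)
    n01 = sum(1 for r in rows if a not in r and b in r)
    n00 = len(rows) - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An emotion that is always or never present has no defined correlation.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


frequency = emotion_frequency(annotations)
distribution = emotion_distribution(annotations)
joy_trust = phi_correlation(annotations, "joy", "trust")
```

A positive coefficient means the two emotions tend to appear in the same reviews, and a negative one means they tend to exclude each other.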
For instance, for GPT-4o and the example in this README file, we can run the following command:
<code>python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o-0123</code>

For replicating our original study, run the following:
<code>python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o</code>

<b>⚖️ Step 7: Cohen's Kappa pairwise agreement</b>

We measure the average pairwise Cohen's Kappa agreement between annotators or LLM runs. For instance, for GPT-4o and the example in this README file, we can run the following command:
<code>python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-0,gpt-4o-1,gpt-4o-2,gpt-4o-3</code>

For replicating our original study, run the following:
<code>python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-1,gpt-4o-2,gpt-4o-3 --exclude 0,1,2</code>

In our analysis, we exclude iterations 0, 1 and 2 as they were used for guidelines refinement.

<b>✅ Step 8: LLM-based annotation correctness</b>

We measure the correctness (accuracy, precision, recall, and F1 score) between a set of annotated reviews and a given ground truth.
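As a rough illustration of these four metrics, the sketch below scores multi-label predictions against a ground truth. It is not the package's <code>correctness.py</code>: the labels and sample data are hypothetical, and micro-averaging over every (review, emotion) decision is an assumption about how the counts are pooled.

```python
def correctness(ground_truth, predictions, labels):
    """Micro-averaged metrics over every (review, emotion) decision (assumed pooling)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(ground_truth, predictions):
        for label in labels:
            in_truth, in_pred = label in truth, label in pred
            if in_truth and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_truth:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: one set of emotion labels per review.
labels = ["joy", "anger", "sadness"]
ground_truth = [{"joy"}, {"anger", "sadness"}, {"joy"}, set()]
predictions = [{"joy"}, {"anger"}, {"anger"}, set()]

scores = correctness(ground_truth, predictions, labels)
```

With many emotions and few labels per review, true negatives dominate, so accuracy tends to look high; precision, recall, and F1 are usually the more informative figures.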
For instance, for GPT-4o agreement and the example in this README file, we can run the following command:
<code>python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o</code>

For replicating our original study, run the following:
<code>python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o</code>

<b>📝 Step 9: Check results</b>

After completing these steps, you will be able to check all generated artifacts, including:
- <b>LLM annotations</b>: available at <code>data/annotations/llm/</code>
- <b>Agreement between LLM annotations and humans</b>: available at <code>data/evaluation/kappa</code>
- <b>Correctness of LLM annotations with respect to human agreement</b>: available at <code>data/evaluation/correctness</code>

📜 License

This repository is licensed under the GPL-3.0 License. See the LICENSE file for details.

👥 Authors information

Full author list:
- Quim Motger, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, joaquim.motger@upc.edu
- Marc Oriol, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, marc.oriol@upc.edu
- Max Tiessler, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, max.tiessler@upc.edu
- Xavier Franch, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, xavier.franch@upc.edu
- Jordi Marco, Dept. of Computer Science (CS), Universitat Politècnica de Catalunya, Barcelona, Spain, jordi.marco@upc.edu

To cite this replication package, please use the following citation format:
<pre>Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews - Replication Package". figshare. Dataset. https://doi.org/10.6084/m9.figshare.28548638</pre>

To cite the full paper describing the research that produced these artifacts, please use the following citation format (DOI to be generated upon publication):
<pre>Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews," in Proc. IEEE Int. Requirements Eng. Conf. (RE), 2025.</pre>
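For readers who want to sanity-check the Step 7 figures by hand, the average pairwise Cohen's Kappa can be sketched as follows. This is an illustration rather than the package's <code>kappa.py</code>: the run names reuse the README's naming, but the label vectors are invented, and treating each emotion as a binary present/absent decision per review is an assumption.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(runs):
    """Mean Kappa over every unordered pair of annotators or LLM runs."""
    pairs = list(combinations(sorted(runs), 2))
    return sum(cohen_kappa(runs[x], runs[y]) for x, y in pairs) / len(pairs)


# Invented data: 1 = emotion present, 0 = absent, one entry per review.
runs = {
    "gpt-4o-1": [1, 1, 0, 0, 1, 0],
    "gpt-4o-2": [1, 0, 0, 0, 1, 0],
    "gpt-4o-3": [1, 1, 0, 1, 1, 0],
}

mean_kappa = average_pairwise_kappa(runs)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two runs that agree on four of six reviews can still score well below 0.67.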
Provider:
figshare
Created:
2025-05-30



