emoneil/reflections-in-peer-counseling

Name: emoneil/reflections-in-peer-counseling
Creator: emoneil
Published: 2022-10-14 03:59:04
License: No description available

Source: Hugging Face. Updated 2022-10-14; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/emoneil/reflections-in-peer-counseling


Resource description:

---
annotations_creators:
- expert-generated
language: []
language_creators: []
license: []
pretty_name: Reflections in Peer Counseling
size_categories:
- 1K<n<10K
source_datasets: []
tags:
- gpt3
- natural language processing
- natural language generation
- peer counseling
task_categories:
- summarization
- text-generation
- conversational
task_ids:
- dialogue-generation
---

# Dataset Card for Reflections in Peer Counseling

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:** Automatic Reflection Generation for Peer-to-Peer Counseling
- **Point of Contact:** emoneil@sas.upenn.edu

### Dataset Summary

The dataset derives from conversations between clients and counselors on a large peer-to-peer online counseling service.
There are a total of 1061 observations across the training and testing datasets, with 50 additional randomly sampled examples used in defining the few-shot learning prompt or for validation purposes in tuning hyperparameters, thus totaling 1111 observations across these sets. These observations were sourced from a larger dataset consisting of annotations of several different clinical counseling skills; we focus on the annotations of counselor reflections. The counselor reflections were annotated at the utterance level with counselor verbal behaviors using the Motivational Interviewing Treatment Integrity 4.2 (MITI) and the Motivational Interviewing Skill Code 2.5 (MISC) manuals. Thus, the entire dataset consists of pairs of conversational context and counselor reflection.

### Supported Tasks and Leaderboards

The dataset was used for conditioning and tuning generative models for generating reflection statements in the domain of peer-to-peer counseling.

### Languages

The language in the dataset is English.

## Dataset Structure

### Data Instances

Each instance consists of the chat room id of the conversation in which the dialogue occurred, the prompt, which is the conversational context that immediately precedes the counselor reflection (including previous utterances from either the client or counselor up until and including the most recent prior client message that immediately followed a counselor’s message), and the completion, which is the counselor reflection.

```
{
  'chat_id': "1234567",
  'prompt': "Client: I'm 19, he's 25. He's not very considerate of how I feel but says he cares about me and loves me.\nCounselor:",
  'completion': " The words are easy, actions are needed. Guys who are 25 just desire to have different experiences.\n\n",
}
```

### Data Fields

* `chat_id`: an integer defining the chat id of the conversation
* `prompt`: a string corresponding to the conversational context preceding the counselor reflection, with the messages separated by new line characters and each utterance prepended by 'Client:' or 'Counselor:'. The string ends with 'Counselor:' to indicate that it is followed by the counselor completion described below.
* `completion`: a string corresponding to the counselor reflection

### Data Splits

The dataset is split into training, testing, and a small set of 50 examples used either for designing the few-shot learning prompt or tuning hyperparameters. 911 examples were used for training. 350 of these examples also constitute a reduced training set used in comparative experiments. 150 examples were used for testing. 50 of these testing examples (randomly selected) were used in the human evaluation. We ensured that no chat identifier in the test set also appears in the training set.

## Dataset Creation

### Curation Rationale

Reflective listening is a critical skill in peer-to-peer counseling that is only effective when tailored to the context. Thus, we wanted to home in on this particular skill and explore the potential of state-of-the-art language models for text generation in this domain.

### Source Data

#### Initial Data Collection and Normalization

The dataset was created by filtering the larger dataset of utterances annotated for many different counseling skills to only those counselor messages annotated as reflections. Then, the prompt instances were created by identifying the preceding messages for each of these counselor reflection instances. After the prompts were initially created, prompts of five words or fewer were removed.

The author created reference reflections for each of the 350 training example prompts in the reduced training set and each of the 150 testing example prompts. In creating a reference reflection for each conversational context, the author intended to simulate responding to the client in roughly the same time a counselor would, as if this turn were embedded in a conversation the client was having with the author. This gauging of time is based on the author’s experience in volunteering as a counselor at crisis hotlines. It is possible that the reference reflections were created in even less time than an average counselor response, given that there were hundreds of conversational contexts for which reflections needed to be created.

#### Who are the source language producers?

The 'client' messages are utterances of those seeking mental health support on a large online counseling service platform. The 'counselor' messages are utterances of minimally-trained peer counselors of this large online counseling service.

For each of the 350 training example prompts in the reduced training set and each of the 150 testing example prompts, a reference reflection was also created by the author.

### Annotations

#### Annotation process

The human evaluation examined text from six sources: generative models fine-tuned on the full training set, a reduced training set, and reference reflections; a few-shot learning model; the actual counselor; and the reference reflection.

We administered a survey through Amazon Mechanical Turk Developer Sandbox. 50 of the testing prompts were provided along with the corresponding six response sources. Provided with the conversational context, the annotators evaluated responses based on three criteria: fluency, resemblance of reflection, and overall preference. Thus, for each context, evaluators measured the fluency, reflection resemblance, and overall preference for all six candidate responses.

We used a variation of Efficient Annotation of Scalar Labels (EASL), a hybrid approach between direct assessment and online pairwise ranking aggregation and rank-based magnitude estimation. Evaluators saw all six responses at once (without knowledge of each response’s origin) and used a sliding scale from 1 to 5 to rate the responses on each of the three dimensions. The order of the model responses for each conversational context was randomized. We provided example responses rated 1 and 5 for the overall fluency and reflection resemblance dimensions. However, we did not include an example for overall preference, noting its subjectivity.

Fluency refers to the response's overall fluency and human-likeness. In the instructions, we noted that non-capitalized words and colloquial language are acceptable and not to be considered fluency errors. Reflection resemblance refers to whether the response captures and returns to the client something the client has said. Overall preference refers to the extent to which the evaluator likes the response.

Using Krippendorff’s alpha, we measured inter-annotator agreement, obtaining alpha values of -0.0369, 0.557, and 0.358 for overall fluency, reflection resemblance, and overall preference, respectively. Although these agreement values are low, the 0.557 inter-annotator agreement we obtained for reflection resemblance is notably higher than the inter-annotator agreement obtained for reflection likeness in the most relevant prior work.

#### Who are the annotators?

The three annotators recruited for the human evaluation were familiar with counseling reflections.
All three annotators have worked with this large online counseling service dataset with IRB approval. They are quite familiar with motivational interviewing codes, annotating messages, and using large language models for mass labeling.

### Personal and Sensitive Information

Due to the sensitive nature of this dataset and privacy concerns, we are unable to publicly share the data.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset of reflections in peer-to-peer counseling can be used as a reference point in understanding and evaluating counselor clinical skills and furthering the potential of language technology to be applied in this space. Given the sensitive nature of the mental health care context and the minimal training of these counselors, the use of such data requires care in understanding the limitations of technology defined based on this language.

### Discussion of Biases

Much of the language of conversations on this online counseling service platform is very informal, and some client and counselor utterances may also contain pejorative language.

As for the generated text assessed in the human evaluation of this work, it is important to note that GPT-3 was trained on over 45 terabytes of data from the internet and books, and large volumes of data collected from online sources will inevitably contain biases that may be captured. There may thus be inadvertent discrimination against subclasses of particular protected groups. Using generated responses as a source of guidance rather than using generative systems as the counselors themselves may be able to balance the benefits and risks of using artificial intelligence in delicate mental health settings. It is imperative that such systems are not misused by companies seeking to maximize efficiency and minimize cost.

The reference reflections in this work were created by the author, whose experience with counseling and motivational interviewing derives from over one hundred hours of training at a teen-to-teen crisis hotline and textline service, and from a research fellowship developing and user testing a platform for nurses to practice and grow their motivational interviewing skills. Therefore, the reference reflections may not be as clinically precise as those a medical professional could produce, and the diversity of reflections is inherently limited.

### Other Known Limitations

## Additional Information

### Dataset Curators

Developed by Emma O'Neil, João Sedoc, Diyi Yang, Haiyi Zhu, Lyle Ungar.

### Licensing Information

### Citation Information

### Contributions

Thanks to [@emoneil](https://github.com/emoneil) for adding this dataset.

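Provided by:

emoneil

Summary of original information

Dataset overview

Dataset description

Dataset summary

The dataset derives from conversations between clients and counselors on a large online peer-to-peer counseling service. There are 1061 observations across the training and testing sets, plus 50 randomly sampled examples used to define the few-shot learning prompt or as validation data for tuning hyperparameters, for a total of 1111 observations. These observations were drawn from a larger dataset annotated for several different clinical counseling skills; we focus on the annotations of counselor reflections. The counselor reflections were annotated at the utterance level using the Motivational Interviewing Treatment Integrity 4.2 (MITI) and Motivational Interviewing Skill Code 2.5 (MISC) manuals. The entire dataset therefore consists of pairs of conversational context and counselor reflection.

Supported tasks and leaderboards

The dataset was used to condition and tune generative models that produce reflection statements in the domain of peer-to-peer counseling.

Languages

The language in the dataset is English.

Dataset structure

Data instances

Each instance contains the chat room ID of the conversation in which the dialogue occurred, the prompt (the conversational context immediately preceding the counselor reflection, including earlier utterances from either the client or the counselor, up to and including the most recent client message that immediately followed a counselor's message), and the completion (the counselor reflection).

{
  'chat_id': "1234567",
  'prompt': "Client: I'm 19, he's 25. He's not very considerate of how I feel but says he cares about me and loves me.\nCounselor:",
  'completion': " The words are easy, actions are needed. Guys who are 25 just desire to have different experiences.\n\n",
}

Data fields

chat_id: an integer defining the chat ID of the conversation
prompt: a string corresponding to the conversational context preceding the counselor reflection, with messages separated by newline characters and each utterance prefixed with Client: or Counselor:. The string ends with Counselor: to indicate that the counselor completion follows.
completion: a string corresponding to the counselor reflection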
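
The field layout above can be illustrated with a small parser. This is a minimal sketch, not the authors' code: `parse_prompt` is a hypothetical helper, and treating untagged lines as continuations of the previous message is an assumption, since the card does not say how multi-line messages are stored.

```python
import re

# Speaker tags as described in the field list: each utterance starts
# with "Client:" or "Counselor:".
SPEAKER_TAG = re.compile(r"^(Client|Counselor):\s?(.*)$")


def parse_prompt(prompt: str) -> list[tuple[str, str]]:
    """Split a prompt string into (speaker, text) turns (hypothetical helper)."""
    turns: list[tuple[str, str]] = []
    for line in prompt.split("\n"):
        match = SPEAKER_TAG.match(line)
        if match:
            speaker, text = match.groups()
            turns.append((speaker, text))
        elif turns and line.strip():
            # Assumption: an untagged line continues the previous message.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + "\n" + line)
    # The prompt ends with a bare "Counselor:" cue for the completion; drop it.
    if turns and turns[-1] == ("Counselor", ""):
        turns.pop()
    return turns


# The example instance from the card.
example = {
    "chat_id": "1234567",
    "prompt": "Client: I'm 19, he's 25. He's not very considerate of how I "
              "feel but says he cares about me and loves me.\nCounselor:",
    "completion": " The words are easy, actions are needed. Guys who are 25 "
                  "just desire to have different experiences.\n\n",
}

turns = parse_prompt(example["prompt"])
reflection = example["completion"].strip()
```

Note that the completion carries a leading space and trailing newlines in the example instance, so stripping it before display is usually convenient.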

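Data splits

The dataset is split into training, testing, and a small set of 50 examples used either for designing the few-shot learning prompt or for tuning hyperparameters. 911 examples were used for training; 350 of these also form a reduced training set used in comparative experiments. 150 examples were used for testing, and 50 of these (randomly selected) were used in the human evaluation. We ensured that no chat identifier in the test set also appears in the training set.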
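
The split sizes and the no-overlap condition on chat identifiers can be checked mechanically. The sketch below is illustrative only: the data itself is not public, so `train` and `test` are toy placeholders, and `overlapping_chat_ids` is a hypothetical helper name.

```python
# Split sizes as reported in the card.
N_TRAIN, N_TEST, N_HELD_OUT = 911, 150, 50

# 911 + 150 = 1061 train/test observations; adding the 50 held-out
# examples gives the reported total of 1111.
assert N_TRAIN + N_TEST == 1061
assert N_TRAIN + N_TEST + N_HELD_OUT == 1111


def overlapping_chat_ids(train: list[dict], test: list[dict]) -> set:
    """Return chat ids that occur in both splits (empty set means no leakage)."""
    train_ids = {example["chat_id"] for example in train}
    test_ids = {example["chat_id"] for example in test}
    return train_ids & test_ids


# Toy placeholder records, not real data.
train = [{"chat_id": "a1"}, {"chat_id": "a2"}, {"chat_id": "a2"}]
test = [{"chat_id": "b1"}]
leaked = overlapping_chat_ids(train, test)
```

Checking at the level of chat ids rather than individual messages matters because several reflections can come from the same conversation.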

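Dataset creation

Curation rationale

Reflective listening is a critical skill in peer-to-peer counseling that is only effective when tailored to the context. We therefore wanted to focus on this particular skill and explore the potential of state-of-the-art language models for text generation in this domain.

Source data

Initial data collection and normalization

The dataset was created by filtering the larger dataset of utterances, annotated for many different counseling skills, down to only those counselor messages annotated as reflections. The prompt instances were then created by identifying the messages preceding each of these counselor reflections. After the prompts were initially created, prompts of five words or fewer were removed.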
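
The filtering and prompt-construction steps described above might look roughly like the following. This is a sketch under stated assumptions, not the authors' pipeline: the `Utterance` type, the `"reflection"` label string, and `build_pairs` are hypothetical; the context window follows one reading of the card (walk back until a client message that directly follows a counselor message); and counting words over message text without speaker tags is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    chat_id: str
    speaker: str  # "Client" or "Counselor"
    text: str
    label: str    # hypothetical annotation label, e.g. "reflection"


def build_pairs(conversation: list[Utterance], min_words: int = 6) -> list[dict]:
    """Build prompt/completion pairs for counselor reflections in one chat."""
    pairs = []
    for i, utt in enumerate(conversation):
        if utt.speaker != "Counselor" or utt.label != "reflection":
            continue
        # Walk backwards until a client message that directly follows a
        # counselor message; that message starts the context.
        start = i
        for j in range(i - 1, -1, -1):
            start = j
            is_client = conversation[j].speaker == "Client"
            follows_counselor = j > 0 and conversation[j - 1].speaker == "Counselor"
            if is_client and follows_counselor:
                break
        context = conversation[start:i]
        # Drop prompts of five words or fewer (min_words=6 keeps six or more).
        n_words = sum(len(u.text.split()) for u in context)
        if n_words < min_words:
            continue
        prompt = "\n".join(f"{u.speaker}: {u.text}" for u in context) + "\nCounselor:"
        # Leading space and trailing newlines mirror the example instance.
        pairs.append({"chat_id": utt.chat_id, "prompt": prompt,
                      "completion": " " + utt.text + "\n\n"})
    return pairs


# Invented toy conversation, used only to exercise the function.
chat = [
    Utterance("c1", "Counselor", "Hi, what's going on today?", "question"),
    Utterance("c1", "Client", "I had a fight with my sister again", "none"),
    Utterance("c1", "Client", "and now she won't talk to me", "none"),
    Utterance("c1", "Counselor", "It sounds like the silence hurts.", "reflection"),
]
pairs = build_pairs(chat)
```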

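The author created reference reflections for each of the 350 training example prompts in the reduced training set and each of the 150 testing example prompts. In creating a reference reflection for each conversational context, the author aimed to respond in roughly the same time a counselor would, as if the turn were embedded in a live conversation with the client. This time estimate is based on the author's experience volunteering as a counselor at crisis hotlines. The reference reflections may have been created in even less time than an average counselor response, since reflections were needed for hundreds of conversational contexts.

Who are the source language producers?

The client messages are utterances of people seeking mental health support on a large online counseling service platform. The counselor messages are utterances of minimally trained peer counselors on the same platform.

For each of the 350 training example prompts in the reduced training set and each of the 150 testing example prompts, a reference reflection was also created by the author.

Annotations

Annotation process

The human evaluation examined text from six sources: generative models fine-tuned on the full training set, on a reduced training set, and on reference reflections; a few-shot learning model; the actual counselor; and the reference reflection.

We administered a survey through Amazon Mechanical Turk Developer Sandbox. 50 of the testing prompts were provided along with the corresponding six response sources. Given the conversational context, the annotators evaluated responses on three criteria: fluency, reflection resemblance, and overall preference. Thus, for each context, evaluators rated the fluency, reflection resemblance, and overall preference of all six candidate responses.

We used a variation of Efficient Annotation of Scalar Labels (EASL), a hybrid approach between direct assessment and online pairwise ranking aggregation and rank-based magnitude estimation. Evaluators saw all six responses at once (without knowing each response's origin) and used a sliding scale from 1 to 5 to rate the responses on each of the three dimensions. The order of the model responses for each conversational context was randomized. We provided example responses rated 1 and 5 for the overall fluency and reflection resemblance dimensions. However, we did not include an example for overall preference, noting its subjectivity.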
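
The blinded, randomized presentation described above can be sketched as follows. This is only an illustration of the idea, not the survey code that was used: `blind_and_shuffle`, the source names, and the placeholder response strings are all hypothetical.

```python
import random


def blind_and_shuffle(responses: dict[str, str], seed: int) -> tuple[list[str], dict[int, str]]:
    """Shuffle candidate responses and hide their origin.

    Returns the texts in display order plus a private key that maps each
    display position back to its source, kept for later analysis.
    """
    rng = random.Random(seed)  # seeded so the ordering is reproducible
    sources = list(responses)
    rng.shuffle(sources)
    shown = [responses[source] for source in sources]
    key = dict(enumerate(sources))
    return shown, key


# Hypothetical names for the six response sources; texts are placeholders.
candidates = {
    "finetuned_full": "<response 1>",
    "finetuned_reduced": "<response 2>",
    "finetuned_reference": "<response 3>",
    "few_shot": "<response 4>",
    "counselor": "<response 5>",
    "reference": "<response 6>",
}

# One seed per conversational context gives each context its own ordering.
shown, key = blind_and_shuffle(candidates, seed=0)
```

Evaluators would see only `shown`; ratings on the 1 to 5 scale are joined back to sources through `key` afterwards.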

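Fluency refers to the response's overall fluency and human-likeness. In the instructions, we noted that non-capitalized words and colloquial language are acceptable and should not be considered fluency errors. Reflection resemblance refers to whether the response captures and returns to the client something the client has said. Overall preference refers to the extent to which the evaluator likes the response.

Using Krippendorff’s alpha, we measured inter-annotator agreement, obtaining alpha values of -0.0369, 0.557, and 0.358 for overall fluency, reflection resemblance, and overall preference, respectively. Although these agreement values are low, the 0.557 agreement we obtained for reflection resemblance is notably higher than the agreement obtained for reflection likeness in the most relevant prior work.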
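
For readers unfamiliar with the statistic, Krippendorff's alpha is 1 minus the ratio of observed to expected disagreement. The sketch below uses the interval (squared-difference) distance, which is an assumption: the card does not state which distance metric the authors used. The function name and the toy ratings are illustrative only.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with squared-difference (interval) distance.

    `units` holds one list of ratings per rated item; items with fewer
    than two ratings cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: ordered rating pairs within each item,
    # each item weighted by 1 / (number of ratings - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs over all ratings pooled together.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - d_o / d_e


# Toy ratings, not taken from the study.
perfect = krippendorff_alpha_interval([[1, 1, 1], [5, 5, 5], [3, 3, 3]])  # 1.0
opposed = krippendorff_alpha_interval([[1, 2], [2, 1]])                   # -0.5
```

A value near zero, such as the -0.0369 reported for fluency, means raters agreed about as often as chance would predict; negative values indicate disagreement beyond chance.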

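Who are the annotators?

The three annotators recruited for the human evaluation were familiar with counseling reflections. All three have worked with this large online counseling service dataset under IRB approval. They are well versed in motivational interviewing codes, message annotation, and the use of large language models for mass labeling.

Personal and sensitive information

Due to the sensitive nature of this dataset and privacy concerns, we are unable to publicly share the data.

Considerations for using the data

Social impact of the dataset

This dataset of reflections in peer-to-peer counseling can serve as a reference point for understanding and evaluating counselor clinical skills and for furthering the potential of language technology in this space. Given the sensitive nature of the mental health care context and the minimal training of these counselors, using such data requires care in understanding the limitations of technology defined based on this language.

Discussion of biases

Much of the language in conversations on this online counseling service platform is very informal, and some client and counselor utterances may also contain pejorative language.

As for the generated text assessed in the human evaluation of this work, it is important to note that GPT-3 was trained on over 45 terabytes of data from the internet and books, and large volumes of data collected from online sources will inevitably contain biases that may be captured. There may thus be inadvertent discrimination against subclasses of particular protected groups. Using generated responses as a source of guidance, rather than using generative systems as the counselors themselves, may balance the benefits and risks of using artificial intelligence in delicate mental health settings. It is imperative that such systems are not misused by companies seeking to maximize efficiency and minimize cost.

The reference reflections in this work were created by the author, whose experience with counseling and motivational interviewing derives from over one hundred hours of training at a teen-to-teen crisis hotline and textline service, and from a research fellowship developing and user testing a platform for nurses to practice and grow their motivational interviewing skills. The reference reflections may therefore not be as clinically precise as those a medical professional could produce, and the diversity of reflections is inherently limited.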