cjerzak/HumanDisentangledText

Name: cjerzak/HumanDisentangledText
Creator: cjerzak
Published: 2024-05-28 20:47:16
License: creativeml-openrail-m

Hugging Face, updated 2024-05-28, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/cjerzak/HumanDisentangledText

Resource description:

---
license: creativeml-openrail-m
---

*Paper title:* Can Large Language Models (or Humans) Disentangle Text?

*Abstract:* We investigate the potential of large language models (LLMs) to disentangle text variables, that is, to remove the textual traces of an undesired forbidden variable in a task sometimes known as text distillation and closely related to the fairness in AI and causal inference literature. We employ a range of various LLM approaches in an attempt to disentangle text by identifying and removing information about a target variable while preserving other relevant signals. We show that in the strong test of removing sentiment, the statistical association between the processed text and sentiment is still detectable to machine learning classifiers post-LLM-disentanglement. Furthermore, we find that human annotators also struggle to disentangle sentiment while preserving other semantic content. This suggests there may be limited separability between concept variables in some text contexts, highlighting limitations of methods relying on text-level transformations and also raising questions about the robustness of disentanglement methods that achieve statistical independence in representation space if this is difficult for human coders operating on raw text to attain.

*Repository details:* This repository contains data from human-coded and processed reviews from the main paper results.

*Paper link:* https://arxiv.org/abs/2403.16584

Provided by:

cjerzak

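Summary of original information

Dataset overview

Dataset name

cjerzak/HumanDisentangledText

Dataset contents

The dataset contains human-coded and processed reviews from the main results of the paper.

Dataset purpose

The data supports research on the potential of large language models (LLMs) to disentangle text variables, in particular to remove the textual traces of an undesired forbidden variable. The task is closely related to the fairness in AI and causal inference literature.

Related research

Paper title: Can Large Language Models (or Humans) Disentangle Text?
Research content: The paper examines whether LLMs can remove specific information from text (such as sentiment) while preserving other relevant signals, and compares this with the performance of human annotators on the same task.
Research findings: Even after LLM processing, machine learning classifiers can still detect the statistical association between the processed text and sentiment. Human annotators also struggle to remove sentiment while preserving other semantic content.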
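The detectability check described in the findings can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the toy sentences in `PROCESSED_TEXTS`, the labels in `SENTIMENT_LABELS`, and the choice of a small bag-of-words Naive Bayes classifier with leave-one-out evaluation are hypothetical stand-ins, not the paper's data, classifiers, or evaluation protocol. The idea is simply that if a classifier trained on the processed text predicts the original sentiment label better than chance on held-out items, some sentiment signal has survived the processing.

```python
import math
from collections import Counter

# Hypothetical "processed" reviews: explicit opinion words are gone, but
# residual cue words (quickly/late, clean/dirty, helped/ignored) remain.
PROCESSED_TEXTS = [
    "the food arrived quickly and the staff helped us",
    "the room was clean and the staff helped carry bags",
    "the phone arrived quickly and the staff helped me",
    "the staff helped us and the room was clean",
    "the food arrived late and the staff ignored us",
    "the room was dirty and the staff ignored our bags",
    "the phone arrived late and the staff ignored me",
    "the staff ignored us and the room was dirty",
]
# Original sentiment of each review (1 = positive, 0 = negative).
SENTIMENT_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def tokenize(text):
    """Lowercase whitespace tokenization; enough for this toy corpus."""
    return text.lower().split()


def train_nb(docs, labels):
    """Collect per-class word counts for multinomial Naive Bayes."""
    word_counts = {label: Counter() for label in set(labels)}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest add-one-smoothed log posterior."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(word_counts):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for tok in tokenize(doc):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(docs, labels):
    """Leave-one-out accuracy: each item is predicted by a model
    trained on all the other items."""
    correct = 0
    for i in range(len(docs)):
        model = train_nb(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict_nb(model, docs[i]) == labels[i]
    return correct / len(docs)


accuracy = loo_accuracy(PROCESSED_TEXTS, SENTIMENT_LABELS)
print(f"held-out accuracy: {accuracy:.2f} (chance is 0.50 for balanced classes)")
```

Held-out accuracy above the chance rate would indicate that the sentiment label is still recoverable from the processed text. On real data a larger sample, a proper train/test split, and a stronger classifier would be needed before drawing any conclusion.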

Dataset license

The dataset is released under the creativeml-openrail-m license.

Paper link

https://arxiv.org/abs/2403.16584
