tomaarsen/setfit-absa-semeval-laptops

Name: tomaarsen/setfit-absa-semeval-laptops
Creator: tomaarsen
Published: 2023-11-16 10:38:19
License: no description available

Hugging Face · Updated 2023-11-16 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/tomaarsen/setfit-absa-semeval-laptops

Resource description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: span
    dtype: string
  - name: label
    dtype: string
  - name: ordinal
    dtype: int64
  splits:
  - name: train
    num_bytes: 335243
    num_examples: 2358
  - name: test
    num_bytes: 76698
    num_examples: 654
  download_size: 146971
  dataset_size: 411941
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Dataset Card for "tomaarsen/setfit-absa-semeval-laptops"

### Dataset Summary

This dataset contains the manually annotated laptop reviews from SemEval-2014 Task 4, in the format expected by [SetFit](https://github.com/huggingface/setfit) ABSA.

For more details, see https://aclanthology.org/S14-2004/

### Data Instances

Examples from the "train" split look as follows. Each line is one record; a single review text can yield several records, one per aspect span.

```json
{"text": "I charge it at night and skip taking the cord with me because of the good battery life.", "span": "cord", "label": "neutral", "ordinal": 0}
{"text": "I charge it at night and skip taking the cord with me because of the good battery life.", "span": "battery life", "label": "positive", "ordinal": 0}
{"text": "The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.", "span": "service center", "label": "negative", "ordinal": 0}
{"text": "The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.", "span": "\"sales\" team", "label": "negative", "ordinal": 0}
{"text": "The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.", "span": "tech guy", "label": "neutral", "ordinal": 0}
```

### Data Fields

The data fields are the same among all splits.

- `text`: a `string` feature.
- `span`: a `string` feature showing the aspect span from the text.
- `label`: a `string` feature showing the polarity of the aspect span.
- `ordinal`: an `int64` feature showing the n-th occurrence of the span in the text. This is useful when the span occurs within the same text multiple times.

### Data Splits

| name |train|test|
|---------|----:|---:|
|tomaarsen/setfit-absa-semeval-laptops|2358|654|

### Training ABSA models using SetFit ABSA

To train using this dataset, first install the SetFit library:

```bash
pip install setfit
```

You can then use the following script as a guideline for training an ABSA model on this dataset:

```python
from setfit import AbsaModel, AbsaTrainer, TrainingArguments
from datasets import load_dataset
from transformers import EarlyStoppingCallback

# You can initialize an AbsaModel using one or two SentenceTransformer models, or two ABSA models
model = AbsaModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# The training/eval dataset must have `text`, `span`, `label`, and `ordinal` columns
dataset = load_dataset("tomaarsen/setfit-absa-semeval-laptops")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

args = TrainingArguments(
    output_dir="models",
    use_amp=True,
    batch_size=256,
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
)

trainer = AbsaTrainer(
    model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()

metrics = trainer.evaluate(eval_dataset)
print(metrics)

trainer.push_to_hub("tomaarsen/setfit-absa-laptops")
```

You can then run inference like so:

```python
from setfit import AbsaModel

# Download from Hub and run inference
model = AbsaModel.from_pretrained(
    "tomaarsen/setfit-absa-laptops-aspect",
    "tomaarsen/setfit-absa-laptops-polarity",
)

# Run inference
preds = model([
    "Boots up fast and runs great!",
    "The screen shows great colors.",
])
```

### Citation Information

```bibtex
@inproceedings{pontiki-etal-2014-semeval,
    title = "{S}em{E}val-2014 Task 4: Aspect Based Sentiment Analysis",
    author = "Pontiki, Maria and Galanis, Dimitris and Pavlopoulos, John and Papageorgiou, Harris and Androutsopoulos, Ion and Manandhar, Suresh",
    editor = "Nakov, Preslav and Zesch, Torsten",
    booktitle = "Proceedings of the 8th International Workshop on Semantic Evaluation ({S}em{E}val 2014)",
    month = aug,
    year = "2014",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S14-2004",
    doi = "10.3115/v1/S14-2004",
    pages = "27--35",
}
```


Provider:

tomaarsen

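Collected Summary

Dataset Introduction

Construction

The tomaarsen/setfit-absa-semeval-laptops dataset derives from the laptop review corpus of SemEval-2014 Task 4. It was built by manual annotation: annotators identified specific aspect terms (aspect spans) in the raw review text and assigned each one a sentiment polarity label, including positive, negative, and neutral. Each record contains the review text, the aspect span, and its sentiment label, plus an ordinal field that distinguishes repeated occurrences of the same aspect span within one text. The result is a clearly structured, finely annotated collection of training and test samples.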
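The role of the `ordinal` field can be illustrated with a minimal sketch. It assumes that `ordinal` is 0-based (the sample records in the dataset card use `0` for a span that occurs once) and that occurrences are counted by plain substring matching; the helper name `find_span_offset` and the second sentence are hypothetical and not part of SetFit or the dataset.

```python
def find_span_offset(text: str, span: str, ordinal: int) -> int:
    """Return the character offset of the `ordinal`-th occurrence of `span`
    in `text` (0-based, an assumption), or -1 if there is no such occurrence."""
    start = -1
    for _ in range(ordinal + 1):
        # Search strictly after the previous match to reach the next occurrence.
        start = text.find(span, start + 1)
        if start == -1:
            return -1
    return start


# A record copied from the dataset card's "train" examples.
record = {
    "text": "I charge it at night and skip taking the cord with me because of the good battery life.",
    "span": "battery life",
    "label": "positive",
    "ordinal": 0,
}
offset = find_span_offset(record["text"], record["span"], record["ordinal"])
print(record["text"][offset:offset + len(record["span"])])

# Hypothetical sentence in which the same span occurs twice.
repeated = "The screen is bright, but the screen glare is annoying."
first = find_span_offset(repeated, "screen", 0)
second = find_span_offset(repeated, "screen", 1)
```

With two occurrences of "screen", `ordinal` 0 and 1 resolve to different offsets, which is what lets two records with the same `text` and `span` carry different labels.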

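Characteristics

The dataset is tailored to aspect-level sentiment analysis. Every sample is centered on a specific aspect span in the review text and carries a sentiment polarity label, which gives models a direct signal for learning fine-grained associations between aspects and sentiment. The ordinal field handles aspect spans that repeat within the same text. Because the annotations come from the SemEval-2014 Task 4 shared task, the dataset serves as an established benchmark for follow-up research.

Usage

The dataset can be used with the SetFit framework to train aspect-level sentiment analysis models. First install the SetFit library and load the dataset, then initialize the model with AbsaModel. The training data must contain the text, span, label, and ordinal fields, and TrainingArguments controls the training parameters. AbsaTrainer optimizes the model on the training set and evaluates it on the evaluation set, and callbacks such as early stopping can be added to limit overfitting. After training, the model can be evaluated on the test set, pushed to the Hugging Face Hub, or used directly to predict aspect sentiment on new review text.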
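Before handing data to a trainer, it can help to confirm that every record carries the four required fields. The following is a minimal sketch in plain Python over a list of dictionaries; the helper `check_records` is hypothetical and not a SetFit API, and the sample records are copied from the dataset card.

```python
from collections import Counter

# Field names taken from the dataset card's "Data Fields" section.
REQUIRED_COLUMNS = ("text", "span", "label", "ordinal")


def check_records(records):
    """Raise ValueError if a record lacks a required field;
    otherwise return how often each label occurs."""
    for i, rec in enumerate(records):
        missing = [col for col in REQUIRED_COLUMNS if col not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return Counter(rec["label"] for rec in records)


TEXT = "I charge it at night and skip taking the cord with me because of the good battery life."
sample = [
    {"text": TEXT, "span": "cord", "label": "neutral", "ordinal": 0},
    {"text": TEXT, "span": "battery life", "label": "positive", "ordinal": 0},
]
label_counts = check_records(sample)
print(label_counts)
```

The returned label counts also give a quick view of class balance, which matters for a dataset of this size.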

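Background and Challenges

Background Overview

In natural language processing, aspect-level sentiment analysis aims to identify specific aspects in a text and determine the sentiment polarity of each. The tomaarsen/setfit-absa-semeval-laptops dataset originates from SemEval-2014 Task 4, created by Maria Pontiki and colleagues in 2014, and focuses on fine-grained sentiment analysis of laptop reviews. Its manual annotations provide a high-quality resource for research on aspect extraction and sentiment classification, and it has supported the development of aspect-based sentiment analysis models in affective computing and opinion mining.

Current Challenges

The core challenge in aspect-level sentiment analysis is to identify the aspect terms in a text accurately and judge their sentiment precisely, especially in complex contexts where an aspect may be implicit or ambiguous. During data construction, annotators must handle the diversity and subjectivity of language while keeping annotations consistent and accurate, which requires substantial manual effort and quality control. In addition, the dataset is relatively small (2,358 training and 654 test examples), which may limit how well models generalize to broader scenarios and motivates work on data augmentation and model robustness.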

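Common Scenarios

Typical Use Cases

In affective computing, the dataset serves as a benchmark for fine-grained sentiment analysis and is widely used for aspect-level polarity classification of laptop product reviews. Researchers use its annotated spans and sentiment labels to train models that locate specific aspects in a review (such as "battery life" or "service center") and determine their sentiment (positive, negative, or neutral), yielding a structured view of user feedback.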
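One way to obtain such a structured view is to group the per-aspect records by review text. This is a minimal sketch using only the standard library; `group_aspects` is a hypothetical helper, and the records are copied from the dataset card's examples.

```python
from collections import defaultdict


def group_aspects(records):
    """Map each review text to its list of (span, label) pairs,
    preserving the order in which the records appear."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["text"]].append((rec["span"], rec["label"]))
    return dict(grouped)


TEXT = "I charge it at night and skip taking the cord with me because of the good battery life."
records = [
    {"text": TEXT, "span": "cord", "label": "neutral", "ordinal": 0},
    {"text": TEXT, "span": "battery life", "label": "positive", "ordinal": 0},
]
by_review = group_aspects(records)
for text, aspects in by_review.items():
    print(text)
    for span, label in aspects:
        print(f"  {span}: {label}")
```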

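Derived and Related Work

The dataset has been used in a range of well-known studies, including attention-based LSTM models, graph neural networks for modeling aspect-sentiment relations, and transfer learning frameworks built on pretrained language models such as BERT. This work improved performance on aspect-based sentiment analysis and extended into directions such as cross-domain adaptation and few-shot learning, laying groundwork for later SemEval competitions and related industrial solutions.

Recent Research on the Dataset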