ilos-vigil/steam-review-aspect-dataset

Name: ilos-vigil/steam-review-aspect-dataset
Creator: ilos-vigil
Published: 2024-06-11 17:10:40
License: cc-by-4.0

Hugging Face. Updated 2024-06-11, indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/ilos-vigil/steam-review-aspect-dataset

Resource overview:

---
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
language:
- en
pretty_name: Steam Aspect review dataset
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train/data-00000-of-00001.arrow"
  - split: test
    path: "data/test/data-00000-of-00001.arrow"
source_datasets:
- Steam
---

# Steam Aspect review dataset

> Content on this dataset card initially appeared on [SRec blog](https://srec.ai/blog/steam-review-aspect-dataset).

Steam Review aspect dataset is a dataset of Steam reviews annotated with 8 review aspects. It contains 1100 English Steam reviews, split into 900 train and 200 test. The dataset was initially created to identify which aspects are mentioned in English reviews as part of [Analysis of 43 million English Steam Reviews](https://srec.ai/blog/analysis-43m-english-steam-reviews).

# Data collection and annotation

The reviews come from a snapshot of the SRec database taken on 21 February 2024. SRec obtains all reviews for all games and mods using the [API provided by Steam](https://partner.steamgames.com/doc/store/getreviews). To reduce bias when selecting reviews to annotate, I chose reviews primarily based on these criteria:

* Character length.
* Helpfulness score.
* Popularity of the reviewed game.
* Genre or category of the reviewed game.

Each review is labeled with 8 aspects. I am the only annotator for this dataset. A review is deemed to contain an aspect even if the aspect is mentioned only implicitly (e.g., "but it'd be great if there's good looking characters...") or the review only mentions the lack of it (e.g., "... essentially has no story ..."). The table below lists the 8 aspects, each with a short description and an example.

> Table 1. A description and example for 8 aspects in this dataset

| Aspect | Short description | Review text |
| ------ | ----------------- | ----------- |
| Recommended | Whether the reviewer recommends the game or not. This aspect comes from the one who wrote the review. | ... In conclusion, good game |
| Story | Story, character, lore, world-building and other storytelling elements. | Excellent game, but has an awful-abrupt ending that comes out of nowhere and doesnt make sense ... |
| Gameplay | Controls, mechanics, interactivity, difficulty and other gameplay setups. | Gone are the days of mindless building and fun. Power grids? Taxes? Intense accounting and counter-intuitive path building ... |
| Visual | Aesthetic, art style, animation, visual effects and other visual elements. | Gorgeous graphics + 80s/90s anime artstyle + Spooky + Atmospheric ... |
| Audio | Sound design, music, voice acting and other auditory elements. | ... catchy music, wonderful narrator saying very kind words ... |
| Technical | The technical aspects of the game such as bug, performance, OS support, controller support and overall functionality. | bad doesnt fit a 1080p monitor u bastard ... |
| Price | Price of the game or its additional content. | Devs are on meth pricing this game at $44 |
| Suggestion | Suggestions for the state of the game, including external factors such as game's price or publisher partnership. | ... but needs a bit of personal effort to optimize the controls for PC, otherwise ... |

Take note that a few reviews contain language and content that some people may find offensive, discriminatory, or inappropriate. I **DO NOT** endorse, condone or promote any such language or content.

# Model benchmark

The model benchmark on Steam Review aspect dataset is split into 3 categories:

* Base: Non-attention based language model.
* Embedding: Inspired by MTEB, a logistic regression classifier is trained on the obtained embeddings for up to 100 epochs.
* Fine-tune.

Source code for running these models is available on [GitHub](https://github.com/ilos-vigil/steam-review-aspect-dataset/tree/main/model_benchmark). Take note that it may not follow best practice, as it was written with the goal of using it only once. I ran the benchmark on Linux with an RTX 3060 and 32GB RAM.

> Base

| Model | Macro precision | Macro recall | Macro F1 | Note |
| ----- | --------------- | ------------ | -------- | ---- |
| Spacy Bag of Words | 0.6203 | 0.5391 | 0.5494 | |
| FastText | 0.6284 | 0.5713 | 0.5871 | Minimum text preprocessing, use pretrained vector |
| FastText | 0.6933 | 0.5821 | 0.6027 | Minimum text preprocessing, choose hyperparameter based on 5-fold autotune |
| Spacy Ensemble | 0.6043 | 0.6773 | 0.6299 | Choose hyperparameter based on simple grid search |

> Embedding

| Model | Param | Max tokens | Macro precision | Macro recall | Macro F1 | Note |
| ----- | ----- | ---------- | --------------- | ------------ | -------- | ---- |
| sentence-transformers/all-mpnet-base-v2 | 110M | 514 | 0.7074 | 0.5431 | 0.5853 | |
| jinaai/jina-embeddings-v2-small-en | 137M | 8192 | 0.7068 | 0.6075 | 0.6437 | |
| jinaai/jina-embeddings-v2-base-en | 137M | 8192 | 0.6813 | 0.6501 | 0.6618 | |
| Alibaba-NLP/gte-large-en-v1.5 | 434M | 8192 | 0.7001 | 0.6501 | 0.6729 | |
| nomic-ai/nomic-embed-text-v1.5 | 137M | 8192 | 0.7075 | 0.6498 | 0.6756 | |
| McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised | 7111M | 32768 | 0.7238 | 0.6697 | 0.6928 | NF4 double quantization, instruction |
| WhereIsAI/UAE-Large-V1 | 335M | 512 | 0.7245 | 0.6718 | 0.6946 | |
| mixedbread-ai/mxbai-embed-large-v1 | 335M | 512 | 0.7215 | 0.6817 | 0.6989 | |
| intfloat/e5-mistral-7b-instruct | 7111M | 32768 | 0.7345 | 0.7000 | 0.7137 | NF4 double quantization, instruction |

> Fine-tune

| Model | Param | Max tokens | Macro precision | Macro recall | Macro F1 | Note |
| ----- | ----- | ---------- | --------------- | ------------ | -------- | ---- |
| jinaai/jina-embeddings-v2-base-en | 137M | 8192 | 0.7485 | 0.7257 | 0.7354 | Choose hyperparameter from Ray Tune (30 trials) |
| Alibaba-NLP/gte-large-en-v1.5 | 434M | 8192 | 0.8403 | 0.8152 | 0.8231 | Choose hyperparameter from Ray Tune (16 trials) |

# Download

You can download Steam review aspect dataset from here (HuggingFace) or from one of these sources:

* [GitHub](https://github.com/ilos-vigil/steam-review-aspect-dataset)
* [Kaggle](https://www.kaggle.com/datasets/ilosvigil/steam-review-aspect-dataset)

# Citation

If you wish to use this dataset in your research or project, please cite this blog post: [Steam review aspect dataset](https://srec.ai/blog/steam-review-aspect-dataset)

```
Sandy Khosasi. "Steam review aspect dataset". (2024).
```

For those who need it, a BibTeX citation has also been prepared.

```bibtex
@misc{srec:steam-review-aspect-dataset,
  title   = {Steam review aspect dataset},
  author  = {Sandy Khosasi},
  year    = {2024},
  month   = {may},
  day     = {28},
  url     = {https://srec.ai/blog/steam-review-aspect-dataset},
  urldate = {2024-05-28}
}
```

# License

Steam Review aspect dataset is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0).

Provider:

ilos-vigil

Summary of original information

Steam Aspect Review Dataset

Overview

Steam Aspect Review Dataset is a dataset of Steam reviews annotated with 8 review aspects. It contains 1,100 English Steam reviews, split into 900 for training and 200 for testing.

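Dataset details

Language: English
Task category: text classification
Task ID: multi-label classification
Data size: n<1K (the size tag in the card metadata; the dataset itself holds 1,100 reviews)
Configs:
- Default config:
  - Train: data/train/data-00000-of-00001.arrow
  - Test: data/test/data-00000-of-00001.arrow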
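
As a minimal sketch of how the default config above might be loaded, the snippet below uses the Hugging Face `datasets` library. The repository ID and the 900/200 split come from the dataset card; the helper names `load_splits` and `check_sizes` are illustrative only. The card does not document column names, so the sketch inspects `features` rather than assuming any.

```python
# Split names and sizes as stated in the dataset card.
EXPECTED_SIZES = {"train": 900, "test": 200}
REPO_ID = "ilos-vigil/steam-review-aspect-dataset"


def load_splits(repo_id: str = REPO_ID):
    """Load the default config from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the constants above stay usable without the library.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id)


def check_sizes(sizes: dict) -> bool:
    """Return True when the observed split sizes match the card's 900/200 split."""
    return all(sizes.get(name) == n for name, n in EXPECTED_SIZES.items())


# Example usage (commented out because it downloads data):
# ds = load_splits()
# print(ds["train"].features)  # column names are not documented in the card
# print(check_sizes({name: ds[name].num_rows for name in ds}))
```

Checking the split sizes right after loading is a cheap way to confirm that the copy in use matches the version the card describes.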
Source dataset: Steam

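Data collection and annotation

Data source: a snapshot of the SRec database taken on 21 February 2024.
Collection method: reviews for all games and mods are retrieved through the API provided by Steam.
Selection criteria:
- Character length
- Helpfulness score
- Popularity of the reviewed game
- Genre or category of the reviewed game
Annotation: the dataset defines 8 review aspects, all labeled by a single annotator.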

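Aspect descriptions

Aspect	Short description	Example review text
Recommended	Whether the reviewer recommends the game or not	... In conclusion, good game
Story	Story, character, lore, world-building and other storytelling elements	Excellent game, but has an awful-abrupt ending that comes out of nowhere and doesnt make sense ...
Gameplay	Controls, mechanics, interactivity, difficulty and other gameplay setups	Gone are the days of mindless building and fun. Power grids? Taxes? Intense accounting and counter-intuitive path building ...
Visual	Aesthetic, art style, animation, visual effects and other visual elements	Gorgeous graphics + 80s/90s anime artstyle + Spooky + Atmospheric ...
Audio	Sound design, music, voice acting and other auditory elements	... catchy music, wonderful narrator saying very kind words ...
Technical	Technical aspects of the game such as bugs, performance, OS support, controller support and overall functionality	bad doesnt fit a 1080p monitor u bastard ...
Price	Price of the game or its additional content	Devs are on meth pricing this game at $44
Suggestion	Suggestions for the state of the game, including external factors such as the game's price or publisher partnership	... but needs a bit of personal effort to optimize the controls for PC, otherwise ...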
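
Because a review can carry any subset of the eight aspects, the task is multi-label. One common way to represent such labels is a multi-hot vector, sketched below. The aspect names come from the table; the ordering, the `ASPECTS` constant, and the helper names are assumptions for illustration, and the card does not say how labels are actually stored in the dataset files.

```python
from typing import Iterable, List

# Aspect names from the table above; this ordering is an assumption.
ASPECTS = [
    "Recommended", "Story", "Gameplay", "Visual",
    "Audio", "Technical", "Price", "Suggestion",
]


def encode_aspects(mentioned: Iterable[str]) -> List[int]:
    """Turn a collection of aspect names into an 8-element multi-hot vector."""
    present = set(mentioned)
    unknown = present - set(ASPECTS)
    if unknown:
        raise ValueError(f"unknown aspects: {sorted(unknown)}")
    return [1 if aspect in present else 0 for aspect in ASPECTS]


def decode_aspects(vector: List[int]) -> List[str]:
    """Recover the aspect names whose flag is set in a multi-hot vector."""
    return [aspect for aspect, flag in zip(ASPECTS, vector) if flag]


# A review that discusses the ending and the price would be encoded as:
example = encode_aspects(["Story", "Price"])
print(example)  # [0, 1, 0, 0, 0, 0, 1, 0]
```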

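Model benchmark

Benchmark categories:
- Base: non-attention-based language models
- Embedding: inspired by MTEB; a logistic regression classifier is trained on the obtained embeddings for up to 100 epochs
- Fine-tune: fine-tuned models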
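
The Embedding category can be pictured as one logistic regressor per aspect trained on fixed embedding vectors. The sketch below is not the author's benchmark code (that is linked from the dataset card); it is a standard-library illustration that assumes one-vs-rest training with full-batch gradient descent, and it uses toy 2-dimensional vectors in place of real sentence embeddings. All function names, the learning rate, and the toy data are hypothetical; only the 100-epoch cap comes from the card.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) for one label


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def log_loss(model: Model, X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Mean binary cross-entropy of one label's classifier."""
    w, b = model
    eps = 1e-12
    total = 0.0
    for x, t in zip(X, y):
        p = sigmoid(logit(w, b, x))
        total -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return total / len(X)


def train_binary(X, y, epochs: int = 100, lr: float = 0.5) -> Model:
    """Full-batch gradient descent for a single binary logistic regressor."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):  # "up to 100 epochs" as in the card; no early stopping here
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(logit(w, b, x)) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_one_vs_rest(X, Y, epochs: int = 100, lr: float = 0.5) -> List[Model]:
    """Train one independent classifier per label column of Y."""
    return [train_binary(X, [row[k] for row in Y], epochs, lr) for k in range(len(Y[0]))]


def predict(models: List[Model], x: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return a multi-hot prediction, one flag per label."""
    return [int(sigmoid(logit(w, b, x)) >= threshold) for w, b in models]


# Toy stand-ins for embeddings (2-dimensional) with two labels instead of eight.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
models = train_one_vs_rest(X, Y)
print(predict(models, [1.0, 0.0]))  # [1, 0]
```

In practice the embeddings would come from one of the listed models and the classifier from a library such as scikit-learn; the point here is only that the embedding model stays frozen while a light classifier learns the labels.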

Download

Download links:
- GitHub
- Kaggle

Citation

Citation format:

Sandy Khosasi. "Steam review aspect dataset". (2024).

BibTeX format:

@misc{srec:steam-review-aspect-dataset,
  title   = {Steam review aspect dataset},
  author  = {Sandy Khosasi},
  year    = {2024},
  month   = {may},
  day     = {28},
  url     = {https://srec.ai/blog/steam-review-aspect-dataset},
  urldate = {2024-05-28}
}

License

Steam Review aspect dataset is licensed under Creative Commons Attribution 4.0 International.

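Collected summary

Dataset introduction

Construction

The Steam review aspect dataset is a resource for multi-label text classification of game reviews. The data comes from a snapshot of the SRec database taken on 21 February 2024, which collects reviews for all games and mods through the official Steam API. To reduce selection bias, the creator picked reviews based on character length, helpfulness score, game popularity, and game genre or category. A single annotator then labeled each review against eight predefined aspects, marking an aspect as present even when it is mentioned only implicitly or only as something the game lacks. Using a single annotator keeps the labeling criteria uniform.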

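Features

The dataset defines eight aspects that cover the main dimensions of a game review: recommendation, story, gameplay, visuals, audio, technical performance, price, and suggestions. It is small, with 1,100 English Steam reviews split into 900 training and 200 test samples, which suits training and evaluating small-scale multi-label classifiers. Some reviews contain offensive or inappropriate content; this reflects real user feedback and gives models a realistic robustness test.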

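Usage

Researchers can use the dataset to develop and evaluate multi-label text classification models. It is already split into training and test sets and can be loaded directly into machine learning frameworks. The published benchmark results allow comparison of base, embedding, and fine-tuned models by macro precision, macro recall, and macro F1. The benchmark source code is linked from the dataset card for reproduction, but it was written for one-time use and may not follow best practice. Use of the dataset must comply with the CC BY 4.0 license, and research that uses it should cite the blog post named in the card.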
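
To make the benchmark metrics concrete, here is a small standard-library sketch of macro precision, recall, and F1 for multi-hot labels. It assumes the common convention (used by scikit-learn, for example) of computing each metric per label and then taking the unweighted mean; the card does not state which averaging the author used, and the function name `macro_scores` and the toy labels are illustrative.

```python
from typing import List, Sequence, Tuple


def macro_scores(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, macro F1) over label columns."""
    n_labels = len(y_true[0])
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        # Undefined ratios are reported as 0.0, a common fallback.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return (
        sum(precisions) / n_labels,
        sum(recalls) / n_labels,
        sum(f1s) / n_labels,
    )


# Two labels, four reviews: label 0 has one miss and one false alarm,
# label 1 is predicted perfectly.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
print(macro_scores(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Under this convention macro F1 is the mean of per-label F1 scores, not the harmonic mean of macro precision and macro recall. The reported numbers are consistent with that reading: for the spaCy bag-of-words row, the harmonic mean of 0.6203 and 0.5391 is about 0.577, while the reported macro F1 is 0.5494.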

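Background and challenges

Background overview

User-generated content such as game reviews is a key source for assessing product experience and market feedback. Steam Review Aspect Dataset was created in 2024 by Sandy Khosasi, drawing on the SRec database and the official Steam API, to identify which aspects English game reviews mention. It targets multi-label text classification over eight review aspects (Recommended, Story, Gameplay, Visual, Audio, Technical, Price, and Suggestion) and provides 1,100 selected and annotated samples as a structured basis for fine-grained analysis of game reviews. It was originally built as part of an analysis of 43 million English Steam reviews.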

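Current challenges

The dataset addresses automatic multi-aspect classification of game reviews. The main modeling challenge is semantic ambiguity and implicit expression: a reviewer may mention an aspect indirectly or through negation, so a model needs to understand context. On the construction side, annotation consistency is a concern: because a single annotator labeled everything, the criteria are uniform, but personal subjective bias may be present. Review selection also had to balance review length, helpfulness score, game popularity, and genre to reduce sampling bias and keep the dataset representative.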

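Common scenarios

Typical use case

Analyzing user reviews is central to understanding product feedback. As a multi-label classification dataset focused on game reviews, Steam Review Aspect Dataset is typically used to train and evaluate NLP models for aspect detection. A classifier built on it can automatically identify which of the eight aspects, such as story, gameplay, or visuals, a review mentions, which makes it possible to organize large volumes of reviews by aspect. Apart from the Recommended aspect, the labels mark whether an aspect is mentioned, not the sentiment expressed about it.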

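Academic problems addressed

The dataset provides an evaluation benchmark for aspect detection framed as multi-label text classification. Its eight annotated aspects let researchers study how to recognize implicit aspect mentions and negated expressions. It supplies aspect-level labels for the game review domain and may also be useful for work on cross-domain transfer learning and few-shot learning.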

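Derived related work

The benchmark published with the dataset compares several approaches to the multi-label task. It includes classifiers trained on embeddings from models such as jinaai/jina-embeddings-v2-base-en and intfloat/e5-mistral-7b-instruct, as well as fine-tuned versions of jinaai/jina-embeddings-v2-base-en and Alibaba-NLP/gte-large-en-v1.5. Fine-tuning Alibaba-NLP/gte-large-en-v1.5 gives the highest reported macro F1 (0.8231). The base category covers non-attention models, including a spaCy ensemble.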

The content above was collected and summarized by 遇见数据集.
