arize-ai/movie_reviews_with_context_drift

Name: arize-ai/movie_reviews_with_context_drift
Creator: arize-ai
Published: 2022-07-01 17:26:12
License: MIT

Hugging Face | Updated 2022-07-01 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/arize-ai/movie_reviews_with_context_drift

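Resource overview:

This dataset was created for the authors' tutorial. It mixes a large movie review dataset with samples from a hotel review dataset. The training and validation sets come entirely from the movie review dataset, while the production set is a mixture of both. Extra features (`age`, `gender`, `context`) were added, along with a made-up timestamp, `prediction_ts`, that indicates when the inference took place. The dataset is mainly used for text classification, specifically sentiment classification: given a text, predict its sentiment (positive or negative). The text is primarily English.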
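Because the training and validation sets contain only movie reviews while the production set is mixed, one simple way to surface the shift is to compare how often each `context` value appears in the two sets. The following is a minimal sketch under stated assumptions, not the tutorial's method: the toy records are hypothetical, the `"hotels"` value is an assumed label for hotel reviews (the source only shows `"movies"`), and total variation distance is just one of several possible drift measures.

```python
from collections import Counter


def context_shares(records):
    """Return the fraction of records per `context` value."""
    counts = Counter(r["context"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ctx: n / total for ctx, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy records, not rows from the dataset.
baseline = [{"context": "movies"}] * 4
# "hotels" is an assumed context value for the hotel reviews.
production = [{"context": "movies"}] * 3 + [{"context": "hotels"}]

drift = total_variation(context_shares(baseline), context_shares(production))
print(f"context drift (total variation): {drift:.2f}")
```

A value of 0 means the two sets have the same context mix; larger values mean the production set has moved further from the baseline.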

annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- English
license:
- MIT
multilinguality:
- monolingual
pretty_name: sentiment-classification-reviews-with-drift
size_categories:
- 10K<n<100K
source_datasets:
- extended|IMDb
task_categories:
- text-classification
task_ids:
- sentiment-classification

# Dataset Card for `reviews_with_drift`

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

### Dataset Summary

This dataset was built for an accompanying tutorial (tutorial link to be added). Its core is a large movie review dataset, mixed with samples from a hotel review dataset. The training and validation sets come only from the movie review dataset, while the production set mixes both kinds of reviews. The dataset also adds three features, `age`, `gender`, and `context`, plus a made-up timestamp, `prediction_ts`, that marks when the inference took place.

### Supported Tasks and Leaderboards

`text-classification`, `sentiment-classification`: the dataset is mainly used for text classification, that is, predicting the sentiment (positive or negative) of a given text.

### Languages

The text in this dataset is written in English.

## Dataset Structure

### Data Instances

#### default

An example from the training set looks as follows:

    {
        'prediction_ts': 1650092416.0,
        'age': 44,
        'gender': 'female',
        'context': 'movies',
        'text': "An interesting premise, and Billy Drago is always good as a dangerous nut-bag (side note: I'd love to see Drago, Stephen McHattie and Lance Hendrikson in a flick together; talk about raging cheekbones!). The soundtrack wasn't terrible, either.<br /><br />But the acting--even that of such professionals as Drago and Debbie Rochon--was terrible, the directing worse (perhaps contributory to the former), the dialog chimp-like, and the camera work, barely tolerable. Still, it was the SETS that got a big 10 on my oy-vey scale. I don't know where this was filmed, but were I to hazard a guess, it would be either an open-air museum, or one of those re-enactment villages, where everything is just a bit too well-kept to do more than suggest the real Old West. Okay, so it was shot on a college kid's budget. That said, I could have forgiven one or two of the aforementioned faults. But taken all together, and being generous, I could not see giving it more than three stars.",
        'label': 0
    }

### Data Fields

#### default

The fields are the same across all splits. For the training set:

- `prediction_ts`: a float feature, the timestamp at which the inference took place
- `age`: an int feature, the reviewer's age
- `gender`: a string feature, the reviewer's gender
- `context`: a string feature, the context the review belongs to
- `text`: a string feature, the raw review text
- `label`: a ClassLabel feature, with possible values negative (0) and positive (1)

### Data Splits

| name    | training | validation | production |
|---------|---------:|-----------:|-----------:|
| default |     9916 |       2479 |      40079 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

### Annotations

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@fjcasti1](https://github.com/fjcasti1) for adding this dataset.
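The field list in the card can be mirrored by a small typed record, which makes the meaning of `prediction_ts` and `label` explicit when working with individual rows. This is a sketch under stated assumptions rather than an official schema: the `Review` class and `parse_review` helper are hypothetical names, and `prediction_ts` is assumed to be Unix epoch seconds interpreted in UTC, which the card does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Label mapping as described in the card: negative (0), positive (1).
LABEL_NAMES = {0: "negative", 1: "positive"}


@dataclass(frozen=True)
class Review:
    """One row of the dataset (hypothetical helper type)."""
    prediction_ts: float
    age: int
    gender: str
    context: str
    text: str
    label: int

    @property
    def predicted_at(self) -> datetime:
        # Assumption: the timestamp is Unix epoch seconds, read as UTC.
        return datetime.fromtimestamp(self.prediction_ts, tz=timezone.utc)

    @property
    def sentiment(self) -> str:
        return LABEL_NAMES[self.label]


def parse_review(row: dict) -> Review:
    """Coerce a raw row into a Review, rejecting unknown labels."""
    label = int(row["label"])
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return Review(
        prediction_ts=float(row["prediction_ts"]),
        age=int(row["age"]),
        gender=str(row["gender"]),
        context=str(row["context"]),
        text=str(row["text"]),
        label=label,
    )


# The training example from the card, with the review text shortened here.
example = parse_review({
    "prediction_ts": 1650092416.0,
    "age": 44,
    "gender": "female",
    "context": "movies",
    "text": "An interesting premise, and Billy Drago is always good as a dangerous nut-bag",
    "label": 0,
})
print(example.predicted_at.isoformat(), example.sentiment)
```

If the Hugging Face `datasets` library is installed, rows obtained from `load_dataset("arize-ai/movie_reviews_with_context_drift")` should be dictionaries of this shape and could be passed to `parse_review`; the exact split identifiers are not confirmed by this page.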

Provider:

arize-ai

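Summary of original information

Dataset overview

Dataset name

Name: sentiment-classification-reviews-with-drift
Alias: reviews_with_drift

Dataset attributes

Language: English (en)
License: MIT
Multilinguality: monolingual
Size: 10K<n<100K
Source dataset: extended from IMDb
Task category: text classification
Task ID: sentiment classification

Dataset contents

Dataset summary: the dataset is intended for sentiment classification. It combines movie review and hotel review data, adds the features age, gender, and context, and includes a prediction_ts timestamp.
Supported tasks and leaderboards: mainly text classification and sentiment classification, predicting the sentiment (positive or negative) of a text.
Data structure:
- Data instances: each contains the fields prediction_ts, age, gender, context, text, and label.
- Data fields:
  - prediction_ts: float feature.
  - age: int feature.
  - gender: string feature.
  - context: string feature.
  - text: string feature.
  - label: class label feature, with possible values negative (0) and positive (1).
- Data splits: 9916 training rows, 2479 validation rows, 40079 production rows.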
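The split sizes above can be checked against the stated size category with a few lines of arithmetic. This is only an illustrative sketch: the dictionary keys are descriptive labels chosen here, not confirmed split identifiers.

```python
# Row counts as listed in the data splits summary.
SPLIT_SIZES = {"training": 9916, "validation": 2479, "production": 40079}

total = sum(SPLIT_SIZES.values())

# Share of all rows held by each split.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

# The stated size category is 10K<n<100K.
in_size_category = 10_000 < total < 100_000

for name, share in shares.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>6} rows  {share:6.1%}")
print(f"total      {total:>6} rows  within 10K<n<100K: {in_size_category}")
```

The training set is exactly four times the size of the validation set, and the production set holds roughly three quarters of all rows.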

Dataset creation

Contributor: @fjcasti1
