NatLee/sentiment-classification-dataset-bundle
Source: Hugging Face. Updated 2023-05-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NatLee/sentiment-classification-dataset-bundle
Resource description:
---
task_categories:
- text-classification
language:
- en
size_categories:
- 100K<n<1M
---
# NLP: Sentiment Classification Dataset
This is a bundle of datasets for the NLP task of sentiment classification in English.
A sample project that uses this dataset is [GURA-gru-unit-for-recognizing-affect](https://github.com/NatLee/GURA-gru-unit-for-recognizing-affect).
## Content
- `myanimelist-sts`: Derived from MyAnimeList, a social networking and cataloging service for anime and manga fans. The dataset includes user reviews with ratings, which we summarized using [skip-thoughts](https://pypi.org/project/skip-thoughts/). The original source is [myanimelist-comment-dataset](https://www.kaggle.com/datasets/natlee/myanimelist-comment-dataset), version `2023-05-11`.
- `aclImdb`: The ACL IMDB dataset is a large movie review dataset collected for sentiment analysis tasks. It contains 50,000 highly polar movie reviews, divided evenly into a training set of 25,000 and a test set of 25,000. Each set includes an equal number of positive and negative reviews. The source is [sentiment](https://ai.stanford.edu/~amaas/data/sentiment/).
- `MR`: Movie Review Data (MR) contains 5,331 positive and 5,331 negative processed sentences/lines. It is suitable for binary sentiment classification and is a good starting point for text classification models. The source is [movie-review-data](http://www.cs.cornell.edu/people/pabo/movie-review-data/), section `Sentiment scale datasets`.
- `MPQA`: The Multi-Perspective Question Answering (MPQA) dataset is a resource for opinion detection and sentiment analysis research. It consists of news articles from a wide variety of sources, annotated for opinions and other private states. The source is [MPQA](https://mpqa.cs.pitt.edu/).
- `SST2`: The Stanford Sentiment Treebank version 2 (SST2) is a popular benchmark for sentence-level sentiment analysis. It includes movie review sentences with corresponding sentiment labels (positive or negative). The dataset is available from [SST2](https://huggingface.co/datasets/sst2).
- `SUBJ`: The Subjectivity dataset is used for sentiment analysis research. It consists of 5,000 subjective and 5,000 objective processed sentences, which can help a model distinguish between subjective and objective (factual) statements. The source is [movie-review-data](http://www.cs.cornell.edu/people/pabo/movie-review-data/), section `Subjectivity datasets`.
# Tokenizer
```python
from pathlib import Path
import pickle

from tensorflow.keras.preprocessing.text import Tokenizer


def check_data_path(file_path: str) -> bool:
    """Print whether a dataset file exists and return the result."""
    if Path(file_path).exists():
        print(f'[Path][OK] {file_path}')
        return True
    print(f'[Path][FAILED] {file_path}')
    return False


# Every text collected from the datasets that are present locally.
sentences = []

# =====================
# Anime Reviews
# =====================
dataset = './myanimelist-sts.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        X, Y = pickle.load(p)
    sentences.extend(X)
    sentences.extend(Y)

# =====================
# MPQA
# =====================
dataset = './MPQA.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        mpqa = pickle.load(p)
    sentences.extend(list(mpqa.sentence))

# =====================
# IMDB
# =====================
dataset = './aclImdb.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        x_test, y_test, x_train, y_train = pickle.load(p)
    # Only the training split is used to fit the tokenizer.
    sentences.extend(x_train)
    sentences.extend(y_train)

# =====================
# MR
# =====================
dataset = './MR.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        mr = pickle.load(p)
    sentences.extend(list(mr.sentence))

# =====================
# SST2
# =====================
dataset = './SST2.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        sst2 = pickle.load(p)
    sentences.extend(list(sst2.sentence))

# =====================
# SUBJ
# =====================
dataset = './SUBJ.pkl'
if check_data_path(dataset):
    with open(dataset, 'rb') as p:
        subj = pickle.load(p)
    sentences.extend(list(subj.sentence))

# Coerce every entry to str before fitting.
sentences = [str(s) for s in sentences]

# Tokenize the sentences
myTokenizer = Tokenizer(
    num_words=100,
    oov_token="{OOV}"
)
myTokenizer.fit_on_texts(sentences)
print(myTokenizer.word_index)

# Save the fitted tokenizer for later reuse.
with open('./big-tokenizer.pkl', 'wb') as p:
    pickle.dump(myTokenizer, p)
```
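The `num_words` and `oov_token` arguments in the script above are easy to misread, so the following is a simplified pure-Python sketch of the indexing scheme as we understand the Keras `Tokenizer` to apply it: words are ranked by frequency, index 1 is reserved for the OOV token, and any word that is unknown or whose index is `num_words` or higher is encoded as OOV. The helpers `build_word_index` and `encode` are hypothetical names used only for illustration, and the sketch splits on whitespace instead of applying the Keras text filters.

```python
from collections import Counter

OOV = "{OOV}"  # same placeholder string as in the script above


def build_word_index(texts, oov_token=OOV):
    """Rank words by descending frequency; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def encode(text, word_index, num_words=None, oov_token=OOV):
    """Map words to indices; unknown or low-ranked words become OOV."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is None or (num_words is not None and idx >= num_words):
            idx = oov_id
        ids.append(idx)
    return ids


corpus = ["a great movie", "a dull movie", "a great cast"]
word_index = build_word_index(corpus)
print(word_index)

# "unknown" was never seen and "cast" is ranked beyond num_words,
# so both are replaced by the OOV index.
print(encode("a great unknown cast", word_index, num_words=4))
```

Under this reading, `num_words=100` in the script would leave only the OOV token and the 98 most frequent words with their own indices at encoding time, while the printed `word_index` would still list the full vocabulary.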
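Provided by:
NatLee
Summary of original information
Dataset overview
Basic information
- Task category: text classification
- Language: English
- Dataset size: 100K<n<1M
Dataset contents
- myanimelist-sts: derived from MyAnimeList; contains user reviews with ratings, summarized using skip-thoughts.
- aclImdb: 50,000 highly polar movie reviews, split into 25,000 for training and 25,000 for testing; each set has equal numbers of positive and negative reviews.
- MR: 5,331 positive and 5,331 negative processed movie review sentences, suitable for binary sentiment classification.
- MPQA: for opinion detection and sentiment analysis research; news articles from a variety of sources, annotated for opinions and other private states.
- SST2: Stanford Sentiment Treebank version 2, for sentence-level sentiment analysis; movie review sentences with corresponding sentiment labels (positive or negative).
- SUBJ: for sentiment analysis research; 5,000 subjective and 5,000 objective processed sentences, which help a model distinguish subjective from objective statements.
Dataset usage
- Sentiment classification in natural language processing tasks.
Dataset processing
- A Tokenizer is fitted on the sentences of all datasets to build a word index, and the fitted Tokenizer is saved as a pickle file.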



