sst
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/stanfordnlp/sst
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for sst
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://nlp.stanford.edu/sentiment/index.html
- **Repository:** [Needs More Information]
- **Paper:** [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://www.aclweb.org/anthology/D13-1170/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.
### Supported Tasks and Leaderboards
- `sentiment-scoring`: Each complete sentence is annotated with a `float` label that indicates its level of positive sentiment from 0.0 to 1.0. One can decide to use only complete sentences or to include the contributions of the sub-sentences (aka phrases). The labels for each phrase are included in the `dictionary` configuration. To obtain all the phrases in a sentence we need to visit the parse tree included with each example. In contrast, the `ptb` configuration explicitly provides all the labelled parse trees in Penn Treebank format. Here the labels are binned in 5 bins from 0 to 4.
- `sentiment-classification`: We can transform the above into a binary sentiment classification task by rounding each label to 0 or 1.
### Languages
The text in the dataset is in English
## Dataset Structure
### Data Instances
For the `default` configuration:
```
{'label': 0.7222200036048889,
'sentence': 'Yet the act is still charming here .',
'tokens': 'Yet|the|act|is|still|charming|here|.',
'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'}
```
For the `dictionary` configuration:
```
{'label': 0.7361099720001221,
'phrase': 'still charming'}
```
For the `ptb` configuration:
```
{'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'}
```
### Data Fields
- `sentence`: a complete sentence expressing an opinion about a film
- `label`: the degree of "positivity" of the opinion, on a scale between 0.0 and 1.0
- `tokens`: a sequence of tokens that form a sentence
- `tree`: a sentence parse tree formatted as a parent pointer tree
- `phrase`: a sub-sentence of a complete sentence
- `ptb_tree`: a sentence parse tree formatted in Penn Treebank-style, where each component's degree of positive sentiment is labelled on a scale from 0 to 4
### Data Splits
The set of complete sentences (both `default` and `ptb` configurations) is split into a training, validation and test set. The `dictionary` configuration has only one split as it is used for reference rather than for learning.
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
Rotten Tomatoes reviewers.
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[Needs More Information]
### Citation Information
```
@inproceedings{socher-etal-2013-recursive,
title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
author = "Socher, Richard and
Perelygin, Alex and
Wu, Jean and
Chuang, Jason and
Manning, Christopher D. and
Ng, Andrew and
Potts, Christopher",
booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
month = oct,
year = "2013",
address = "Seattle, Washington, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D13-1170",
pages = "1631--1642",
}
```
### Contributions
Thanks to [@patpizio](https://github.com/patpizio) for adding this dataset.
# sst数据集卡片
## 目录
- [数据集描述](#dataset-description)
- [数据集概述](#dataset-summary)
- [支持任务与排行榜](#supported-tasks-and-leaderboards)
- [语言](#languages)
- [数据集结构](#dataset-structure)
- [数据实例](#data-instances)
- [数据字段](#data-fields)
- [数据划分](#data-splits)
- [数据集构建](#dataset-creation)
- [构建初衷](#curation-rationale)
- [源数据](#source-data)
- [标注信息](#annotations)
- [个人与敏感信息](#personal-and-sensitive-information)
- [数据集使用注意事项](#considerations-for-using-the-data)
- [数据集社会影响](#social-impact-of-dataset)
- [偏差讨论](#discussion-of-biases)
- [其他已知局限性](#other-known-limitations)
- [附加信息](#additional-information)
- [数据集维护者](#dataset-curators)
- [授权信息](#licensing-information)
- [引用信息](#citation-information)
- [贡献者](#contributions)
## 数据集描述
- **"主页"**:https://nlp.stanford.edu/sentiment/index.html
- **"代码仓库"**:[需补充更多信息]
- **"论文"**:《用于情感树库语义组合性的递归深度模型》(Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank),链接:https://www.aclweb.org/anthology/D13-1170/
- **"排行榜"**:[需补充更多信息]
- **"联系方式"**:[需补充更多信息]
### 数据集概述
斯坦福情感树库(Stanford Sentiment Treebank)是首个带有全标注句法树的语料库,可完整分析语言中情感的组合效应。
### 支持任务与排行榜
- `情感评分(sentiment-scoring)`:每个完整句子会被标注一个浮点型(float)标签,取值范围为0.0至1.0,用于表征其正面情感强度。研究者可选择仅使用完整句子,或纳入子句(即短语)的情感贡献。每个短语的标签已包含在`dictionary`配置中。若要获取句子中的所有短语,需解析每个样本附带的句法树。与之相对,`ptb`配置会以Penn Treebank(Penn Treebank)格式显式提供所有带标注的句法树,此时标签被划分为0至4共5个区间。
- `情感分类(sentiment-classification)`:可将上述任务转化为二分类情感分类任务,即把每个标签四舍五入为0或1。
### 语言
本数据集的文本语言为英语。
## 数据集结构
### 数据实例
针对`default`配置:
{'label': 0.7222200036048889,
'sentence': 'Yet the act is still charming here .',
'tokens': 'Yet|the|act|is|still|charming|here|.',
'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'}
针对`dictionary`配置:
{'label': 0.7361099720001221,
'phrase': 'still charming'}
针对`ptb`配置:
{'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'}
### 数据字段
- `sentence`:表达对某部电影观点的完整句子
- `label`:观点的正面程度,取值范围为0.0至1.0的连续区间
- `tokens`:构成句子的Token(Token)序列
- `tree`:以父指针树格式表示的句子句法树
- `phrase`:完整句子的子句片段
- `ptb_tree`:采用Penn Treebank格式的句子句法树,其中每个组成部分的正面情感强度以0至4的区间标注
### 数据划分
完整句子的数据集(包括`default`和`ptb`配置)被划分为训练集、验证集与测试集。而`dictionary`配置仅包含一个划分,因其主要用于参考而非模型训练。
## 数据集构建
### 构建初衷
[需补充更多信息]
### 源数据
#### 初始数据收集与标准化
[需补充更多信息]
#### 源文本生产者是谁?
烂番茄(Rotten Tomatoes)的影评人。
### 标注信息
#### 标注流程
[需补充更多信息]
#### 标注者是谁?
[需补充更多信息]
### 个人与敏感信息
[需补充更多信息]
## 数据集使用注意事项
### 数据集社会影响
[需补充更多信息]
### 偏差讨论
[需补充更多信息]
### 其他已知局限性
[需补充更多信息]
## 附加信息
### 数据集维护者
[需补充更多信息]
### 授权信息
[需补充更多信息]
### 引用信息
@inproceedings{socher-etal-2013-recursive,
title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
author = "Socher, Richard and
Perelygin, Alex and
Wu, Jean and
Chuang, Jason and
Manning, Christopher D. and
Ng, Andrew and
Potts, Christopher",
booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
month = oct,
year = "2013",
address = "Seattle, Washington, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D13-1170",
pages = "1631--1642",
}
其中论文标题可译为《用于情感树库语义组合性的递归深度模型》,会议名称可译为《2013年自然语言处理经验方法会议论文集》,出版方可译为国际计算语言学协会(Association for Computational Linguistics)。
### 贡献者
感谢[@patpizio](https://github.com/patpizio)贡献本数据集。
提供机构:
maas
创建时间:
2025-10-15



