sst

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/stanfordnlp/sst
Resource description:
# Dataset Card for sst

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://nlp.stanford.edu/sentiment/index.html
- **Repository:** [Needs More Information]
- **Paper:** [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://www.aclweb.org/anthology/D13-1170/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

### Dataset Summary

The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.

### Supported Tasks and Leaderboards

- `sentiment-scoring`: Each complete sentence is annotated with a `float` label that indicates its level of positive sentiment from 0.0 to 1.0. One can decide to use only complete sentences or to include the contributions of the sub-sentences (aka phrases). The labels for each phrase are included in the `dictionary` configuration.
  To obtain all the phrases in a sentence, we need to visit the parse tree included with each example. In contrast, the `ptb` configuration explicitly provides all the labelled parse trees in Penn Treebank format. Here the labels are binned into 5 bins, from 0 to 4.
- `sentiment-classification`: We can transform the above into a binary sentiment classification task by rounding each label to 0 or 1.

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

For the `default` configuration:

```
{'label': 0.7222200036048889, 'sentence': 'Yet the act is still charming here .', 'tokens': 'Yet|the|act|is|still|charming|here|.', 'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'}
```

For the `dictionary` configuration:

```
{'label': 0.7361099720001221, 'phrase': 'still charming'}
```

For the `ptb` configuration:

```
{'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'}
```

### Data Fields

- `sentence`: a complete sentence expressing an opinion about a film
- `label`: the degree of "positivity" of the opinion, on a scale between 0.0 and 1.0
- `tokens`: a sequence of tokens that form a sentence, separated by `|`
- `tree`: a sentence parse tree formatted as a parent pointer tree
- `phrase`: a sub-sentence of a complete sentence
- `ptb_tree`: a sentence parse tree formatted in Penn Treebank style, where each component's degree of positive sentiment is labelled on a scale from 0 to 4

### Data Splits

The set of complete sentences (both `default` and `ptb` configurations) is split into training, validation, and test sets. The `dictionary` configuration has only one split, as it is used for reference rather than for learning.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

Rotten Tomatoes reviewers.
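As a note on the `tree` field from Data Fields: the card does not spell out the parent pointer encoding, so the sketch below rests on an assumption inferred from the `default` example alone. Nodes are numbered from 1, the first nodes are the tokens in order, each `|`-separated entry gives the parent of the node at that position, and `0` marks the root. The function name `phrases_from_parent_pointers` is hypothetical and not part of any dataset API.

```python
def phrases_from_parent_pointers(tokens: str, tree: str) -> dict:
    """Map every tree node id to the phrase (space-joined tokens) it spans.

    Assumption: node ids are 1-based, leaves come first in token order,
    and a parent id of 0 marks the root.
    """
    words = tokens.split("|")
    parents = [int(p) for p in tree.split("|")]  # parents[i - 1] is the parent of node i

    spans = {}  # node id -> list of leaf positions covered by that node
    for leaf in range(1, len(words) + 1):
        node = leaf
        # Walk from the leaf up to the root, registering the leaf on each ancestor.
        while node != 0:
            spans.setdefault(node, []).append(leaf)
            node = parents[node - 1]

    return {
        node: " ".join(words[i - 1] for i in sorted(leaf_ids))
        for node, leaf_ids in spans.items()
    }


# The `default` instance shown under Data Instances.
example = {
    "tokens": "Yet|the|act|is|still|charming|here|.",
    "tree": "15|13|13|10|9|9|11|12|10|11|12|14|14|15|0",
}
phrases = phrases_from_parent_pointers(example["tokens"], example["tree"])
for node_id in sorted(phrases):
    print(node_id, phrases[node_id])
```

Under this reading, node 9 spans "still charming", which is the phrase shown for the `dictionary` configuration, so each recovered phrase could in principle be looked up there for its label.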
### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

```
@inproceedings{socher-etal-2013-recursive,
    title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
    author = "Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher",
    booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
    month = oct,
    year = "2013",
    address = "Seattle, Washington, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D13-1170",
    pages = "1631--1642",
}
```

### Contributions

Thanks to [@patpizio](https://github.com/patpizio) for adding this dataset.
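As a companion to the `ptb` configuration described under Data Fields, here is a minimal sketch of reading a `ptb_tree` string into (phrase, label) pairs with a small hand-written parser. The helper names `parse_ptb` and `labelled_phrases` are hypothetical, and the sketch assumes that tokens never contain literal parentheses, which the card does not confirm.

```python
def parse_ptb(text: str):
    """Parse '(label child child ...)' or '(label word)' into nested tuples.

    Returns (label, word) for a leaf and (label, [children]) otherwise.
    Assumption: words contain no literal '(' or ')' characters.
    """
    items = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert items[pos] == "(", "expected an opening parenthesis"
        pos += 1
        label = int(items[pos])  # sentiment bin, 0 to 4
        pos += 1
        if items[pos] == "(":
            children = []
            while items[pos] == "(":
                children.append(node())
            pos += 1  # consume the closing ')'
            return (label, children)
        word = items[pos]
        pos += 2  # consume the word and the closing ')'
        return (label, word)

    return node()


def labelled_phrases(tree):
    """List (phrase, label) for every node, children before parents."""
    out = []

    def walk(n):
        label, body = n
        if isinstance(body, str):
            words = [body]
        else:
            words = [w for child in body for w in walk(child)]
        out.append((" ".join(words), label))
        return words

    walk(tree)
    return out


# The `ptb` instance shown under Data Instances.
ptb = "(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))"
pairs = labelled_phrases(parse_ptb(ptb))
labels = dict(pairs)
print(labels["still charming"])  # 3
```

A hand-written parser keeps the sketch dependency-free; a treebank library would likely be the more robust choice for real use.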

Provider:
maas
Created:
2025-10-15