sst

Name: sst
Creator: maas
Published: 2025-12-05 16:54:49
License: 暂无描述

魔搭社区2025-12-05 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/stanfordnlp/sst

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for sst ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://nlp.stanford.edu/sentiment/index.html - **Repository:** [Needs More Information] - **Paper:** [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://www.aclweb.org/anthology/D13-1170/) - **Leaderboard:** [Needs More Information] - **Point of Contact:** [Needs More Information] ### Dataset Summary The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. ### Supported Tasks and Leaderboards - `sentiment-scoring`: Each complete sentence is annotated with a `float` label that indicates its level of positive sentiment from 0.0 to 1.0. One can decide to use only complete sentences or to include the contributions of the sub-sentences (aka phrases). The labels for each phrase are included in the `dictionary` configuration. To obtain all the phrases in a sentence we need to visit the parse tree included with each example. In contrast, the `ptb` configuration explicitly provides all the labelled parse trees in Penn Treebank format. Here the labels are binned in 5 bins from 0 to 4. - `sentiment-classification`: We can transform the above into a binary sentiment classification task by rounding each label to 0 or 1. ### Languages The text in the dataset is in English ## Dataset Structure ### Data Instances For the `default` configuration: ``` {'label': 0.7222200036048889, 'sentence': 'Yet the act is still charming here .', 'tokens': 'Yet|the|act|is|still|charming|here|.', 'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'} ``` For the `dictionary` configuration: ``` {'label': 0.7361099720001221, 'phrase': 'still charming'} ``` For the `ptb` configuration: ``` {'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'} ``` ### Data Fields - `sentence`: a complete sentence expressing an opinion about a film - `label`: the degree of "positivity" of the opinion, on a scale between 0.0 and 1.0 - `tokens`: a sequence of tokens that form a sentence - `tree`: a sentence parse tree formatted as a parent pointer tree - `phrase`: a sub-sentence of a complete sentence - `ptb_tree`: a sentence parse tree formatted in Penn Treebank-style, where each component's degree of positive sentiment is labelled on a scale from 0 to 4 ### Data Splits The set of complete sentences (both `default` and `ptb` configurations) is split into a training, validation and test set. The `dictionary` configuration has only one split as it is used for reference rather than for learning. ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? Rotten Tomatoes reviewers. ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information [Needs More Information] ### Citation Information ``` @inproceedings{socher-etal-2013-recursive, title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", author = "Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher", booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing", month = oct, year = "2013", address = "Seattle, Washington, USA", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D13-1170", pages = "1631--1642", } ``` ### Contributions Thanks to [@patpizio](https://github.com/patpizio) for adding this dataset.

# sst数据集卡片 ## 目录 - [数据集描述](#dataset-description) - [数据集概述](#dataset-summary) - [支持任务与排行榜](#supported-tasks-and-leaderboards) - [语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分](#data-splits) - [数据集构建](#dataset-creation) - [构建初衷](#curation-rationale) - [源数据](#source-data) - [标注信息](#annotations) - [个人与敏感信息](#personal-and-sensitive-information) - [数据集使用注意事项](#considerations-for-using-the-data) - [数据集社会影响](#social-impact-of-dataset) - [偏差讨论](#discussion-of-biases) - [其他已知局限性](#other-known-limitations) - [附加信息](#additional-information) - [数据集维护者](#dataset-curators) - [授权信息](#licensing-information) - [引用信息](#citation-information) - [贡献者](#contributions) ## 数据集描述 - **"主页"**：https://nlp.stanford.edu/sentiment/index.html - **"代码仓库"**：[需补充更多信息] - **"论文"**：《用于情感树库语义组合性的递归深度模型》(Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank)，链接：https://www.aclweb.org/anthology/D13-1170/ - **"排行榜"**：[需补充更多信息] - **"联系方式"**：[需补充更多信息] ### 数据集概述斯坦福情感树库（Stanford Sentiment Treebank）是首个带有全标注句法树的语料库，可完整分析语言中情感的组合效应。 ### 支持任务与排行榜 - `情感评分（sentiment-scoring）`：每个完整句子会被标注一个浮点型（float）标签，取值范围为0.0至1.0，用于表征其正面情感强度。研究者可选择仅使用完整句子，或纳入子句（即短语）的情感贡献。每个短语的标签已包含在`dictionary`配置中。若要获取句子中的所有短语，需解析每个样本附带的句法树。与之相对，`ptb`配置会以Penn Treebank（Penn Treebank）格式显式提供所有带标注的句法树，此时标签被划分为0至4共5个区间。 - `情感分类（sentiment-classification）`：可将上述任务转化为二分类情感分类任务，即把每个标签四舍五入为0或1。 ### 语言本数据集的文本语言为英语。 ## 数据集结构 ### 数据实例针对`default`配置： {'label': 0.7222200036048889, 'sentence': 'Yet the act is still charming here .', 'tokens': 'Yet|the|act|is|still|charming|here|.', 'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'} 针对`dictionary`配置： {'label': 0.7361099720001221, 'phrase': 'still charming'} 针对`ptb`配置： {'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'} ### 数据字段 - `sentence`：表达对某部电影观点的完整句子 - `label`：观点的正面程度，取值范围为0.0至1.0的连续区间 - `tokens`：构成句子的Token（Token）序列 - `tree`：以父指针树格式表示的句子句法树 - `phrase`：完整句子的子句片段 - `ptb_tree`：采用Penn Treebank格式的句子句法树，其中每个组成部分的正面情感强度以0至4的区间标注 ### 数据划分完整句子的数据集（包括`default`和`ptb`配置）被划分为训练集、验证集与测试集。而`dictionary`配置仅包含一个划分，因其主要用于参考而非模型训练。 ## 数据集构建 ### 构建初衷 [需补充更多信息] ### 源数据 #### 初始数据收集与标准化 [需补充更多信息] #### 源文本生产者是谁？烂番茄（Rotten Tomatoes）的影评人。 ### 标注信息 #### 标注流程 [需补充更多信息] #### 标注者是谁？ [需补充更多信息] ### 个人与敏感信息 [需补充更多信息] ## 数据集使用注意事项 ### 数据集社会影响 [需补充更多信息] ### 偏差讨论 [需补充更多信息] ### 其他已知局限性 [需补充更多信息] ## 附加信息 ### 数据集维护者 [需补充更多信息] ### 授权信息 [需补充更多信息] ### 引用信息 @inproceedings{socher-etal-2013-recursive, title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", author = "Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher", booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing", month = oct, year = "2013", address = "Seattle, Washington, USA", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D13-1170", pages = "1631--1642", } 其中论文标题可译为《用于情感树库语义组合性的递归深度模型》，会议名称可译为《2013年自然语言处理经验方法会议论文集》，出版方可译为国际计算语言学协会（Association for Computational Linguistics）。 ### 贡献者感谢[@patpizio](https://github.com/patpizio)贡献本数据集。

提供机构：

maas

创建时间：

2025-10-15

5,000+

优质数据集

54 个

任务类型

进入经典数据集