Ishaank18/screenplay-features-linguistic
Source: Hugging Face. Updated 2026-01-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ishaank18/screenplay-features-linguistic
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1K<n<10K
---
# Screenplay Features - Linguistic Categories
This dataset reorganizes the features from [screenplay-features](https://huggingface.co/datasets/Ishaank18/screenplay-features) into theoretically motivated linguistic categories.
## Dataset Structure
The dataset contains 837 features organized into 10 linguistic categories:
### 1. SURPRISAL (57 features)
Language model predictability features measuring cognitive processing difficulty.
- bert_surprisal (15)
- surprisal (5) - GPT-2 surprisal
- gpt2_char_surprisal (6)
- ngram_surprisal (5)
- ngram_char_surprisal (24)
- psychformers (2)
### 2. MORPHOSYNTACTIC (587 features)
Grammatical structure and tense features.
- gc_syntax (574) - dependency relations, parse trees
- gc_pos (8) - part-of-speech patterns
- gc_temporal (5) - tense marking
### 3. LEXICAL (44 features)
Word-level surface features.
- gc_basic (6) - sentence length, token count
- gc_readability (7) - readability formulas
- gc_char_diversity (7) - lexical diversity (TTR, Maas)
- ngram (9) - n-gram diversity
- gc_narrative (9) - action verbs, sensory words
- gc_punctuation (6) - general punctuation patterns
### 4. SEMANTIC (29 features)
Word meaning features.
- gc_academic (8) - academic register markers
- gc_concreteness (21) - word imageability/concreteness
### 5. DISCOURSE-PRAGMATIC (26 features)
Discourse structure and illocutionary features.
- gc_discourse (7) - discourse connectives
- rst (13) - rhetorical relations
- textrank_centrality (1) - discourse centrality
- gc_punctuation (5) - question/exclamation marks
### 6. DIALOGIC (20 features)
Dialogue and person reference features.
- gc_dialogue (7) - speech verbs, dialogue markers
- gc_pronouns (13) - person deixis
### 7. EMOTIONAL (26 features)
Affective and sentiment features.
- emotional (5) - affective dynamics
- gc_polarity (21) - sentiment polarity scores
### 8. NARRATIVE-STRUCTURAL (18 features)
Narrative and plot structure features.
- structure (5) - screenplay acts, callbacks
- plot_shifts (3) - plot dynamics
- character_arcs (7) - character introduction/turnover
- position (3) - temporal position in narrative
### 9. SAXENA-KELLER (3 features)
Predictions from an earlier salience model.
- saxena_keller (3)
### 10. GENRE (27 features)
Film genre features.
- genre (27)
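The per-group counts listed above can be checked against the category totals and the overall figure of 837. The following is a minimal sketch that copies the numbers from this card into a dictionary and sums them; the names `CATEGORY_GROUPS` and `category_totals` are illustrative and not part of the dataset.

```python
# Feature counts per group, copied from the category list in this card.
# The dictionary and function names are illustrative, not part of the dataset.
CATEGORY_GROUPS = {
    "SURPRISAL": {
        "bert_surprisal": 15, "surprisal": 5, "gpt2_char_surprisal": 6,
        "ngram_surprisal": 5, "ngram_char_surprisal": 24, "psychformers": 2,
    },
    "MORPHOSYNTACTIC": {"gc_syntax": 574, "gc_pos": 8, "gc_temporal": 5},
    "LEXICAL": {
        "gc_basic": 6, "gc_readability": 7, "gc_char_diversity": 7,
        "ngram": 9, "gc_narrative": 9, "gc_punctuation": 6,
    },
    "SEMANTIC": {"gc_academic": 8, "gc_concreteness": 21},
    "DISCOURSE-PRAGMATIC": {
        "gc_discourse": 7, "rst": 13, "textrank_centrality": 1,
        "gc_punctuation": 5,
    },
    "DIALOGIC": {"gc_dialogue": 7, "gc_pronouns": 13},
    "EMOTIONAL": {"emotional": 5, "gc_polarity": 21},
    "NARRATIVE-STRUCTURAL": {
        "structure": 5, "plot_shifts": 3, "character_arcs": 7, "position": 3,
    },
    "SAXENA-KELLER": {"saxena_keller": 3},
    "GENRE": {"genre": 27},
}


def category_totals(groups):
    """Sum the group counts inside each category."""
    return {name: sum(counts.values()) for name, counts in groups.items()}


totals = category_totals(CATEGORY_GROUPS)
for name, count in totals.items():
    print(f"{name}: {count}")
print("All categories:", sum(totals.values()))
```

Note that `gc_punctuation` appears under both LEXICAL (general punctuation patterns) and DISCOURSE-PRAGMATIC (question/exclamation marks), so a flat mapping keyed by group name alone would lose one of the two entries.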
## Splits
- Train: 5,247 scenes from 84 movies
- Validation: 1,312 scenes from 21 movies
- Test: 1,321 scenes from 21 movies
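Taken together, the splits cover 7,880 scenes from 126 movies, and the movie counts follow a 4:1:1 ratio. As a small sketch, the shares can be computed from the numbers above; `SPLITS` and `split_shares` are illustrative names, not part of the dataset.

```python
# Split sizes copied from this card; names are illustrative.
SPLITS = {
    "train": {"scenes": 5247, "movies": 84},
    "validation": {"scenes": 1312, "movies": 21},
    "test": {"scenes": 1321, "movies": 21},
}


def split_shares(splits, key):
    """Return each split's fraction of the total for the given key."""
    total = sum(split[key] for split in splits.values())
    return {name: split[key] / total for name, split in splits.items()}


scene_shares = split_shares(SPLITS, "scenes")
movie_shares = split_shares(SPLITS, "movies")

for name in SPLITS:
    print(
        f"{name}: {scene_shares[name]:.1%} of scenes, "
        f"{movie_shares[name]:.1%} of movies"
    )
```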
## Usage
```python
from datasets import load_dataset

# Load all categories
dataset = load_dataset("Ishaank18/screenplay-features-linguistic")

# Load a specific category
surprisal_features = load_dataset(
    "Ishaank18/screenplay-features-linguistic",
    data_files={"train": "train/surprisal.parquet"},
)
```
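To load one category across several splits, the `data_files` mapping can be built programmatically. The sketch below assumes every split follows the same `<split>/<category>.parquet` layout as the `train/surprisal.parquet` path shown above; the `validation` and `test` directory names and the helper names are assumptions, so check the repository file listing before relying on them.

```python
REPO_ID = "Ishaank18/screenplay-features-linguistic"


def category_data_files(category, splits=("train", "validation", "test")):
    """Build a data_files mapping for one category.

    Assumes a "<split>/<category>.parquet" layout. Only
    "train/surprisal.parquet" is confirmed by this card; the other
    paths are guesses that follow the same pattern.
    """
    return {split: f"{split}/{category}.parquet" for split in splits}


def load_category(category, splits=("train", "validation", "test")):
    """Load one category with the `datasets` library (needs network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(REPO_ID, data_files=category_data_files(category, splits))


# Example (not executed here because it downloads data):
# surprisal = load_category("surprisal", splits=("train",))
print(category_data_files("surprisal"))
```

If a split or category file does not exist under the assumed path, `load_dataset` will raise an error, which is a quick way to find out that the layout differs.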
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{screenplay_features_linguistic,
title={Screenplay Features - Linguistic Categories},
author={Your Name},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Ishaank18/screenplay-features-linguistic}
}
```
## License
MIT License
Provider: Ishaank18



