
Ishaank18/screenplay-features-linguistic

Hugging Face · Updated 2026-01-02 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ishaank18/screenplay-features-linguistic
Description:
---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1K<n<10K
---

# Screenplay Features - Linguistic Categories

This dataset reorganizes the features from [screenplay-features](https://huggingface.co/datasets/Ishaank18/screenplay-features) into theoretically-motivated linguistic categories.

## Dataset Structure

The dataset contains 837 features organized into 10 linguistic categories:

### 1. SURPRISAL (57 features)

Language model predictability features measuring cognitive processing difficulty.

- bert_surprisal (15)
- surprisal (5) - GPT-2 surprisal
- gpt2_char_surprisal (6)
- ngram_surprisal (5)
- ngram_char_surprisal (24)
- psychformers (2)

### 2. MORPHOSYNTACTIC (587 features)

Grammatical structure and tense features.

- gc_syntax (574) - dependency relations, parse trees
- gc_pos (8) - part-of-speech patterns
- gc_temporal (5) - tense marking

### 3. LEXICAL (44 features)

Word-level surface features.

- gc_basic (6) - sentence length, token count
- gc_readability (7) - readability formulas
- gc_char_diversity (7) - lexical diversity (TTR, MAAS)
- ngram (9) - n-gram diversity
- gc_narrative (9) - action verbs, sensory words
- gc_punctuation (6) - general punctuation patterns

### 4. SEMANTIC (29 features)

Word meaning features.

- gc_academic (8) - academic register markers
- gc_concreteness (21) - word imageability/concreteness

### 5. DISCOURSE-PRAGMATIC (26 features)

Discourse structure and illocutionary features.

- gc_discourse (7) - discourse connectives
- rst (13) - rhetorical relations
- textrank_centrality (1) - discourse centrality
- gc_punctuation (5) - question/exclamation marks

### 6. DIALOGIC (20 features)

Dialogue and person reference features.

- gc_dialogue (7) - speech verbs, dialogue markers
- gc_pronouns (13) - person deixis

### 7. EMOTIONAL (26 features)

Affective and sentiment features.

- emotional (5) - affective dynamics
- gc_polarity (21) - sentiment polarity scores

### 8. NARRATIVE-STRUCTURAL (18 features)

Narrative and plot structure features.

- structure (5) - screenplay acts, callbacks
- plot_shifts (3) - plot dynamics
- character_arcs (7) - character introduction/turnover
- position (3) - temporal position in narrative

### 9. SAXENA-KELLER (3 features)

Prior salience model predictions.

- saxena_keller (3)

### 10. GENRE (27 features)

Film genre features.

- genre (27)

## Splits

- Train: 5,247 scenes from 84 movies
- Validation: 1,312 scenes from 21 movies
- Test: 1,321 scenes from 21 movies

## Usage

```python
from datasets import load_dataset

# Load all categories
dataset = load_dataset("Ishaank18/screenplay-features-linguistic")

# Load specific category
surprisal_features = load_dataset(
    "Ishaank18/screenplay-features-linguistic",
    data_files={"train": "train/surprisal.parquet"}
)
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{screenplay_features_linguistic,
  title={Screenplay Features - Linguistic Categories},
  author={Your Name},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Ishaank18/screenplay-features-linguistic}
}
```

## License

MIT License
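
As a sanity check on the numbers quoted under Dataset Structure and Splits, the sketch below transcribes the per-subgroup counts from this card and confirms that they add up to the declared category sizes and to the stated total of 837 features. The dictionary layout and the names `CATEGORIES`, `DECLARED`, `SPLITS`, and `category_totals` are illustrative assumptions; they are not part of the dataset or of any library, and the code does not inspect the actual parquet files.

```python
# Feature counts transcribed from the dataset card: category -> subgroup -> count.
# These names are illustrative only; nothing here reads the real data files.
CATEGORIES = {
    "SURPRISAL": {"bert_surprisal": 15, "surprisal": 5, "gpt2_char_surprisal": 6,
                  "ngram_surprisal": 5, "ngram_char_surprisal": 24, "psychformers": 2},
    "MORPHOSYNTACTIC": {"gc_syntax": 574, "gc_pos": 8, "gc_temporal": 5},
    "LEXICAL": {"gc_basic": 6, "gc_readability": 7, "gc_char_diversity": 7,
                "ngram": 9, "gc_narrative": 9, "gc_punctuation": 6},
    "SEMANTIC": {"gc_academic": 8, "gc_concreteness": 21},
    "DISCOURSE-PRAGMATIC": {"gc_discourse": 7, "rst": 13,
                            "textrank_centrality": 1, "gc_punctuation": 5},
    "DIALOGIC": {"gc_dialogue": 7, "gc_pronouns": 13},
    "EMOTIONAL": {"emotional": 5, "gc_polarity": 21},
    "NARRATIVE-STRUCTURAL": {"structure": 5, "plot_shifts": 3,
                             "character_arcs": 7, "position": 3},
    "SAXENA-KELLER": {"saxena_keller": 3},
    "GENRE": {"genre": 27},
}

# Category sizes as declared in the section headings of the card.
DECLARED = {
    "SURPRISAL": 57, "MORPHOSYNTACTIC": 587, "LEXICAL": 44, "SEMANTIC": 29,
    "DISCOURSE-PRAGMATIC": 26, "DIALOGIC": 20, "EMOTIONAL": 26,
    "NARRATIVE-STRUCTURAL": 18, "SAXENA-KELLER": 3, "GENRE": 27,
}

# Split sizes from the card: split -> (scenes, movies).
SPLITS = {"train": (5247, 84), "validation": (1312, 21), "test": (1321, 21)}


def category_totals(categories):
    """Sum the subgroup counts within each category."""
    return {name: sum(groups.values()) for name, groups in categories.items()}


totals = category_totals(CATEGORIES)
total_features = sum(totals.values())
total_scenes = sum(scenes for scenes, _ in SPLITS.values())
total_movies = sum(movies for _, movies in SPLITS.values())

# Report any category whose subgroup counts disagree with its declared size.
mismatches = {name: (totals[name], DECLARED[name])
              for name in DECLARED if totals[name] != DECLARED[name]}

print("features:", total_features)
print("scenes:", total_scenes, "movies:", total_movies)
print("mismatches:", mismatches)
```

With the counts as listed, every category matches its heading and the features sum to 837. The splits add up to 7,880 scenes, and to 126 movies if no movie appears in more than one split, which the card does not state explicitly.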
Provided by:
Ishaank18