
anony156/anon123yu

Source: Hugging Face. Updated 2026-02-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/anony156/anon123yu
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
tags:
- screenplay
- narrative
- salience
- linguistics
language:
- en
size_categories:
- 100K<n<1M
---

# Screenplay Scene Salience Features

Pre-extracted linguistic and narrative features for screenplay scene salience detection from the MENSA dataset.

## Dataset Description

This dataset contains **913 linguistic features** extracted from movie screenplays in the MENSA dataset. Features are organized into **24 feature groups** covering various aspects of linguistic, narrative, and discourse analysis.

### Dataset Statistics

| Split | Samples | Size |
|-------|---------|------|
| Train | 117,503 | 172.9 MB |
| Validation | 8,052 | 16.1 MB |
| Test | 8,156 | 16.1 MB |
| **Total** | **133,711** | **140.1 MB** |

### Feature Groups (24 groups)

- `base`
- `bert_surprisal`
- `character_arcs`
- `emotional`
- `gc_academic`
- `gc_basic`
- `gc_char_diversity`
- `gc_concreteness`
- `gc_dialogue`
- `gc_discourse`
- `gc_narrative`
- `gc_polarity`
- `gc_pos`
- `gc_pronouns`
- `gc_punctuation`
- `gc_readability`
- `gc_syntax`
- `gc_temporal`
- `ngram`
- `ngram_surprisal`
- `plot_shifts`
- `rst`
- `structure`
- `surprisal`

## Usage

### Option 1: Load with Hugging Face datasets (Recommended)

```python
from datasets import load_dataset

# Load a single feature group
ds = load_dataset("anony156/anon123yu", data_files="train/base.parquet")
df = ds['train'].to_pandas()

# Load multiple groups for training
ds = load_dataset("anony156/anon123yu", data_files={
    "train": [
        "train/base.parquet",
        "train/gc_polarity.parquet",
        "train/emotional.parquet",
    ]
})
df = ds['train'].to_pandas()

# Load all splits for evaluation
ds = load_dataset("anony156/anon123yu", data_files={
    "train": "train/gc_polarity.parquet",
    "validation": "validation/gc_polarity.parquet",
    "test": "test/gc_polarity.parquet",
})
```

### Option 2: Load with pandas directly

```python
import pandas as pd

# From HuggingFace URL
df = pd.read_parquet("hf://datasets/anony156/anon123yu/train/base.parquet")

# Or if you have the repo cloned locally
df = pd.read_parquet("train/base.parquet")
```

### Option 3: Use custom loader (Easiest)

```python
from feature_cache.load_hf import load_groups

# Load features and labels
X, y = load_groups(
    groups=["base", "gc_polarity", "emotional", "rst"],
    split="train",
    hf_repo="anony156/anon123yu"
)

# Load features only (no labels)
X = load_groups(
    groups=["base", "gc_polarity"],
    split="test",
    include_label=False,
    hf_repo="anony156/anon123yu"
)
```

## Data Structure

Each parquet file contains:

- **`movie_id`** (string): Unique movie identifier
- **`scene_index`** (int): Scene index within the movie (0-indexed)
- **`label`** (int): Salience label
  - `0` = Non-salient scene
  - `1` = Salient scene
- **Feature columns**: Various linguistic/narrative features (float/int)

### Example row structure:

| movie_id | scene_index | label | feature_1 | feature_2 | ... |
|----------|-------------|-------|-----------|-----------|-----|
| tt0111161 | 42 | 1 | 0.85 | 12.3 | ... |

## Feature Categories

The features are organized into the following categories:

### Base Features

- Basic linguistic statistics (token count, sentence count, etc.)
- Structural position features (act, scene positions)

### GenreClassifier (GC) Features

- **gc_basic**: Basic linguistic metrics
- **gc_char_diversity**: Character diversity metrics
- **gc_concreteness**: Concreteness scores
- **gc_dialogue**: Dialogue-specific features
- **gc_discourse**: Discourse markers and connectives
- **gc_narrative**: Narrative structure features
- **gc_polarity**: Sentiment polarity scores
- **gc_pos**: Part-of-speech distributions
- **gc_pronouns**: Pronoun usage patterns
- **gc_punctuation**: Punctuation statistics
- **gc_readability**: Readability metrics
- **gc_syntax**: Syntactic complexity features
- **gc_temporal**: Temporal expressions

### Narrative Features

- **character_arcs**: Character development metrics
- **plot_shifts**: Plot progression indicators
- **structure**: Narrative structure features
- **emotional**: Emotional arc features

### Linguistic Features

- **ngram**: N-gram diversity metrics
- **rst**: Rhetorical Structure Theory features
- **bert_surprisal**: BERT-based surprisal scores
- **ngram_surprisal**: N-gram-based surprisal
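
Because every group file carries the same `movie_id`, `scene_index`, and `label` columns, one way to combine several groups into a single feature matrix is a column-wise join on the two key columns. The sketch below is illustrative only: `merge_groups` is a hypothetical helper (not part of this dataset's loader), the toy frames and feature names (`token_count`, `polarity_mean`) are invented stand-ins for real group files, and it assumes each file holds exactly one row per (`movie_id`, `scene_index`).

```python
from functools import reduce

import pandas as pd

# Key columns documented as present in every parquet file.
KEYS = ["movie_id", "scene_index"]


def merge_groups(frames):
    """Join per-group feature frames column-wise on the scene keys.

    Hypothetical helper. Assumes one row per (movie_id, scene_index) in
    every frame; `label` is kept from the first frame only.
    """
    first, *rest = frames
    # Drop the repeated label column from every frame after the first.
    rest = [f.drop(columns=["label"], errors="ignore") for f in rest]
    return reduce(
        lambda left, right: left.merge(
            right, on=KEYS, how="inner", validate="one_to_one"
        ),
        [first, *rest],
    )


# Toy stand-ins for two group files (feature names are made up).
base = pd.DataFrame({
    "movie_id": ["tt0111161", "tt0111161"],
    "scene_index": [0, 1],
    "label": [0, 1],
    "token_count": [120, 340],
})
polarity = pd.DataFrame({
    "movie_id": ["tt0111161", "tt0111161"],
    "scene_index": [1, 0],  # row order differs on purpose
    "label": [1, 0],
    "polarity_mean": [0.4, -0.1],
})

merged = merge_groups([base, polarity])
X = merged.drop(columns=KEYS + ["label"])  # feature matrix
y = merged["label"]                        # salience labels
```

With real data, the toy frames would be replaced by frames read from the group files, for example via `pd.read_parquet` as in Option 2. The `validate="one_to_one"` argument makes pandas raise an error if a key pair is duplicated in either frame, which guards against silently multiplying rows.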
Provider:
anony156