
vladvlasov256/opensubs-collocations

Source: Hugging Face. Updated 2026-03-29. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/vladvlasov256/opensubs-collocations
Resource description:
---
language:
- en
- nl
- sr
license: cc-by-4.0
task_categories:
- feature-extraction
tags:
- nlp
- collocations
- bigrams
- npmi
- multilingual
- language-learning
- opensubtitles
pretty_name: OpenSubtitles Collocations
size_categories:
- 10K<n<100K
configs:
- config_name: en
  data_files: data/en.jsonl
- config_name: nl
  data_files: data/nl.jsonl
- config_name: sr
  data_files: data/sr.jsonl
- config_name: all
  default: true
  data_files: data/*.jsonl
---

# OpenSubtitles Collocations

NPMI-scored bigram collocations extracted from the OpenSubtitles parallel corpus. Three languages, three relation types, ~43K bigrams total.

## Languages & Corpus Size

| Language | Code | Corpus lines | Bigrams |
|----------|------|--------------|---------|
| English  | en   | ~100M        | 15,000  |
| Dutch    | nl   | ~105M        | 15,000  |
| Serbian  | sr   | ~50M         | 13,586  |

## Relation Types

- **ADJ+NOUN**: adjective-noun pairs, e.g. "slim contract", "kreditan kartica"
- **VERB+ADP**: phrasal verbs and verb-preposition pairs, e.g. "come on", "houden van"
- **VERB+NOUN**: verb-object pairs, e.g. "earn money", "verdienen geld"

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `lang` | str | Language code (en/nl/sr) |
| `type` | str | Relation type (ADJ+NOUN, VERB+ADP, VERB+NOUN) |
| `bigram` | str | Lemmatized bigram |
| `count` | int | Co-occurrence count in corpus |
| `pmi` | float | Pointwise mutual information |
| `npmi` | float | Normalized PMI (0–1 scale) |
| `score` | float | Composite ranking score (PMI + log frequency) |
| `variants` | int | Number of surface form variants (Serbian only) |

## Usage

```python
from datasets import load_dataset

# Load one language
ds_en = load_dataset("vladvlasov256/opensubs-collocations", "en", split="train")

# Load all languages
ds_all = load_dataset("vladvlasov256/opensubs-collocations", "all", split="train")

# Filter: filter() returns a new dataset, so keep the result
phrasal = ds_all.filter(lambda x: x["type"] == "VERB+ADP" and x["npmi"] > 0.5)
```

## Extraction Method

- **Source:** OPUS OpenSubtitles v2018 (Lison & Tiedemann, 2016)
- **NLP:** Stanza (tokenize, POS, lemma, depparse)
- **Patterns:** ADJ+NOUN, VERB+NOUN, VERB+ADP extracted via dependency relations
- **Filtering:** MIN_COUNT=3, TOP_N=5000 per pattern per language
- **Scoring:** PMI and NPMI, ranked by composite score (PMI + log frequency)

## Demo

See these collocations used in a live vocabulary extraction pipeline: [vocab-nlp Space](https://huggingface.co/spaces/vladvlasov256/vocab-nlp)

## Use Cases

- Collocation whitelists for NLP pipelines
- Language learning applications (phrase extraction, vocabulary selection)
- Linguistic research on multi-word expressions

## Citation

If you use this data, please cite the underlying corpus:

```bibtex
@inproceedings{lison2016opensubtitles,
  title={OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
  author={Lison, Pierre and Tiedemann, J{\"o}rg},
  booktitle={Proceedings of the 10th LREC},
  year={2016}
}
```

## License

CC-BY 4.0 (following OpenSubtitles licensing).
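
## Scoring Sketch

The `pmi` and `npmi` fields described above follow the standard definitions of pointwise mutual information and its normalized form. The sketch below shows how such values could be computed from raw counts. It is an illustration under assumptions, not the dataset's actual extraction code: the function name `pmi_npmi`, its parameters, the use of natural logarithms, and the example counts are all hypothetical. The composite `score` is not reproduced because the card does not specify how PMI and log frequency are combined.

```python
import math


def pmi_npmi(pair_count, left_count, right_count, total_pairs):
    """Return (pmi, npmi) for one bigram from raw counts (hypothetical helper).

    pair_count  -- how often the two lemmas occur together in the relation
    left_count  -- how often the first lemma occurs in that relation slot
    right_count -- how often the second lemma occurs in that relation slot
    total_pairs -- total number of extracted pairs for the relation
    """
    p_xy = pair_count / total_pairs
    p_x = left_count / total_pairs
    p_y = right_count / total_pairs

    # PMI compares the observed joint probability with the one
    # expected if the two words were independent.
    pmi = math.log(p_xy / (p_x * p_y))

    # NPMI divides by -log p(x, y), which bounds the value to [-1, 1].
    # Guard the degenerate case p(x, y) == 1, where the divisor is zero.
    npmi = pmi / -math.log(p_xy) if p_xy < 1.0 else 1.0
    return pmi, npmi


# Made-up counts, for illustration only.
pmi, npmi = pmi_npmi(pair_count=50, left_count=100, right_count=200,
                     total_pairs=1_000_000)
print(f"pmi={pmi:.3f} npmi={npmi:.3f}")
```

Under this definition, pairs that co-occur more often than chance get positive NPMI, and a pair whose words only ever appear together reaches 1.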
Provider:
vladvlasov256