vladvlasov256/opensubs-collocations
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/vladvlasov256/opensubs-collocations
Description:
---
language:
- en
- nl
- sr
license: cc-by-4.0
task_categories:
- feature-extraction
tags:
- nlp
- collocations
- bigrams
- npmi
- multilingual
- language-learning
- opensubtitles
pretty_name: OpenSubtitles Collocations
size_categories:
- 10K<n<100K
configs:
- config_name: en
  data_files: data/en.jsonl
- config_name: nl
  data_files: data/nl.jsonl
- config_name: sr
  data_files: data/sr.jsonl
- config_name: all
  default: true
  data_files: data/*.jsonl
---
# OpenSubtitles Collocations
NPMI-scored bigram collocations extracted from the OpenSubtitles parallel corpus. Three languages, three relation types, ~43K bigrams total.
## Languages & Corpus Size
| Language | Code | Corpus lines | Bigrams |
|----------|------|-------------|---------|
| English | en | ~100M | 15,000 |
| Dutch | nl | ~105M | 15,000 |
| Serbian | sr | ~50M | 13,586 |
## Relation Types
Examples are shown in lemmatized form, as stored in the `bigram` field, so they can differ from natural surface forms:
- **ADJ+NOUN** — adjective-noun pairs: "slim contract", "kreditan kartica"
- **VERB+ADP** — phrasal verbs / verb-preposition: "come on", "houden van"
- **VERB+NOUN** — verb-object pairs: "earn money", "verdienen geld"
## Fields
| Field | Type | Description |
|-------|------|-------------|
| `lang` | str | Language code (en/nl/sr) |
| `type` | str | Relation type (ADJ+NOUN, VERB+ADP, VERB+NOUN) |
| `bigram` | str | Lemmatized bigram |
| `count` | int | Co-occurrence count in corpus |
| `pmi` | float | Pointwise mutual information |
| `npmi` | float | Normalized PMI (0–1 scale for the pairs listed here; NPMI in general ranges from -1 to 1) |
| `score` | float | Composite ranking score (PMI + log frequency) |
| `variants` | int | Number of surface form variants (Serbian only) |
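
As a rough illustration of the schema above, one row of a `data/*.jsonl` file could be parsed and checked as sketched below. The helper `parse_row` and the sample values are hypothetical: the numbers are made up for the example and are not taken from the dataset. Treating `variants` as optional is an assumption based on its "Serbian only" description.

```python
import json

# Field names and types as listed in the Fields table.
# `variants` is documented as Serbian-only, so it is treated as optional here (assumption).
REQUIRED = {"lang": str, "type": str, "bigram": str,
            "count": int, "pmi": float, "npmi": float, "score": float}
OPTIONAL = {"variants": int}


def parse_row(line):
    """Parse one JSONL line and check it against the documented fields."""
    row = json.loads(line)
    for name, typ in REQUIRED.items():
        if name not in row:
            raise ValueError(f"missing field: {name}")
        # JSON may encode a float such as 5.0 as the integer 5.
        if typ is float and type(row[name]) is int:
            row[name] = float(row[name])
        if not isinstance(row[name], typ):
            raise TypeError(f"{name} should be {typ.__name__}")
    for name, typ in OPTIONAL.items():
        if row.get(name) is not None and not isinstance(row[name], typ):
            raise TypeError(f"{name} should be {typ.__name__}")
    return row


# Hypothetical row: the values are illustrative, not real dataset values.
sample = ('{"lang": "en", "type": "VERB+NOUN", "bigram": "earn money", '
          '"count": 1200, "pmi": 6, "npmi": 0.42, "score": 9.2}')
row = parse_row(sample)
print(row["bigram"], row["pmi"])
```

Reading the JSONL files directly like this avoids any dependency beyond the standard library; the `datasets` loader shown under Usage remains the simpler route for most cases.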
## Usage
```python
from datasets import load_dataset
# Load one language
ds = load_dataset("vladvlasov256/opensubs-collocations", "en", split="train")
# Load all languages
ds = load_dataset("vladvlasov256/opensubs-collocations", "all", split="train")
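# Filter: keep strongly associated verb + adposition pairs.
# filter() returns a new dataset, so assign the result.
phrasal = ds.filter(lambda x: x["type"] == "VERB+ADP" and x["npmi"] > 0.5)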
```
## Extraction Method
- **Source:** OPUS OpenSubtitles v2018 (Lison & Tiedemann, 2016)
- **NLP:** Stanza (tokenize, POS, lemma, depparse)
- **Patterns:** ADJ+NOUN, VERB+NOUN, VERB+ADP extracted via dependency relations
- **Filtering:** MIN_COUNT=3, TOP_N=5000 per pattern per language
- **Scoring:** PMI and NPMI, ranked by composite score (PMI + log frequency)
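
The scoring and filtering steps above can be sketched as follows, using the standard definitions PMI = log(p(x,y) / (p(x)·p(y))) and NPMI = PMI / (−log p(x,y)). This is a minimal sketch under stated assumptions, not the dataset's actual extraction code: the base-2 logarithm, the probability estimates (counts divided by a total number of pairs), the unweighted sum in `composite_score`, and the `>=` comparison against `MIN_COUNT` are all assumptions, and the function names are hypothetical.

```python
import math


def pmi_npmi(count_xy, count_x, count_y, total):
    """Return (PMI, NPMI) for a pair, using base-2 logs (an assumption)."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log2(p_xy / (p_x * p_y))
    # NPMI divides PMI by -log p(x, y); guard the degenerate p(x, y) = 1 case.
    npmi = 1.0 if count_xy == total else pmi / -math.log2(p_xy)
    return pmi, npmi


def composite_score(pmi, count_xy):
    """Hypothetical reading of "PMI + log frequency": an unweighted sum."""
    return pmi + math.log2(count_xy)


def select_top(rows, min_count=3, top_n=5000):
    """Drop rare pairs, then keep the top_n rows by score (one pattern, one language)."""
    kept = [r for r in rows if r["count"] >= min_count]
    return sorted(kept, key=lambda r: r["score"], reverse=True)[:top_n]


# Made-up counts: a pair whose two words only ever occur together.
pmi, npmi = pmi_npmi(count_xy=1000, count_x=1000, count_y=1000, total=1_000_000)
print(round(pmi, 3), round(npmi, 3))
```

With three patterns and `TOP_N=5000`, each language is capped at 15,000 bigrams, which matches the English and Dutch totals in the table above; Serbian falls short of the cap at 13,586.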
## Demo
See these collocations used in a live vocabulary extraction pipeline: [vocab-nlp Space](https://huggingface.co/spaces/vladvlasov256/vocab-nlp)
## Use Cases
- Collocation whitelists for NLP pipelines
- Language learning applications (phrase extraction, vocabulary selection)
- Linguistic research on multi-word expressions
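
As an illustration of the whitelist use case, the sketch below builds a set of bigrams from dataset rows and looks for them in a lemmatized token sequence. It assumes each `bigram` value is two lemmas separated by a single space, and it matches adjacent lemmas only. That is a simplification: the dataset's pairs were extracted via dependency relations, so the two words need not be adjacent in running text. The function names, the `min_npmi` threshold, and the sample rows are hypothetical.

```python
def build_whitelist(rows, min_npmi=0.5):
    """Collect (lemma, lemma) pairs whose NPMI meets the threshold."""
    return {
        tuple(r["bigram"].split(" ", 1))
        for r in rows
        if r["npmi"] >= min_npmi
    }


def find_collocations(lemmas, whitelist):
    """Return adjacent lemma pairs that appear in the whitelist."""
    return [(a, b) for a, b in zip(lemmas, lemmas[1:]) if (a, b) in whitelist]


# Hypothetical rows with made-up scores, in the shape described under Fields.
rows = [
    {"bigram": "earn money", "npmi": 0.6},
    {"bigram": "come on", "npmi": 0.3},
]
whitelist = build_whitelist(rows)

# Lemmas would normally come from a lemmatizer such as Stanza.
lemmas = ["she", "earn", "money", "come", "on"]
matches = find_collocations(lemmas, whitelist)
print(matches)
```

A dependency-aware matcher would recover non-adjacent pairs as well, at the cost of running a parser over the input text.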
## Citation
If you use this data, please cite the underlying corpus:
```bibtex
@inproceedings{lison2016opensubtitles,
title={OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
author={Lison, Pierre and Tiedemann, J{\"o}rg},
booktitle={Proceedings of the 10th LREC},
year={2016}
}
```
## License
CC-BY 4.0 (following OpenSubtitles licensing).
Provider:
vladvlasov256



