
Nikola-92/sr-morpho-base

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Nikola-92/sr-morpho-base
Description:
---
language:
- sr
license: cc-by-nc-sa-4.0
pretty_name: "Serbian Morphological Segmentation Knowledge Base"
tags:
- morphology
- nlp
- serbian
- segmentation
- linguistics
- slavic
task_categories:
- token-classification
- text-generation
size_categories:
- 10B<n<100B
---

# Serbian Morphological Segmentation Knowledge Base (`sr-morpho-base`)

A comprehensive morphological knowledge base for the Serbian language, produced by a multi-stage NLP pipeline that mines productive morphemes from dictionary data and generalizes them across a large text corpus.

## Dataset Description

This repository contains two classes of artifacts:

1. **Morphological knowledge base**: the complete output of the segmentation pipeline, comprising segmented word forms, frequency files for all morpheme types, and the supporting knowledge structures used during segmentation.
2. **Raw corpus**: `corpus.txt`, a 60+ GB unified Serbian text corpus in both Latin and Cyrillic scripts, assembled from multiple open sources.

## What is in this repository

| File | Description |
|------|-------------|
| `corpus.txt` | Unified Serbian corpus (~60 GB, Latin + Cyrillic, one document per line) |
| `knowledge_base_final.tsv` | **Main output**: merged segmentations for all known and discovered words |
| `all_forms_segmented.tsv` | Dual-script, dual-root segmentation from dictionary only |
| `unknowns_segmented.tsv` | Segmentation results for corpus-discovered unknown words |
| `lemma_profiles.json` | Complete morphological profile for every dictionary lemma |
| `root_map.json` | Map from each root surface form to its associated lemmas |
| `canonical_root_map.json` | Data-driven mapping from non-canonical to canonical root variants |
| `candidate_prefixes.txt` / `candidate_suffixes.txt` | Discovered productive affixes |
| `stable_lemmas.txt` | All dictionary lemmas (used as phonological guard) |
| `known_forms_frequencies.tsv` | Corpus frequency counts for known word forms |
| `known_roots_frequencies.tsv` | Corpus frequency counts for known roots (surface) |
| `known_roots_abstract_frequencies.tsv` | Corpus frequency counts for abstract (canonical) roots |
| `known_prefixes_frequencies.tsv` | Corpus frequency counts for known prefixes |
| `known_suffixes_frequencies.tsv` | Corpus frequency counts for known suffixes |
| `unknowns_forms_frequencies.tsv` | Corpus frequency counts for corpus-discovered unknown word forms |
| `unknowns_roots_frequencies.tsv` | Same as the `known_` counterpart, for unknown words |
| `unknowns_roots_abstract_frequencies.tsv` | Same as the `known_` counterpart, for unknown words |
| `unknowns_prefixes_frequencies.tsv` | Same as the `known_` counterpart, for unknown words |
| `unknowns_suffixes_frequencies.tsv` | Same as the `known_` counterpart, for unknown words |

## Corpus Sources

`corpus.txt` was assembled from the following open-source Serbian datasets:

| Source | Description |
|--------|-------------|
| [procesaur/kisobran](https://huggingface.co/datasets/procesaur/kisobran) | MaCoCu, PDRS, SrpKorNews, CC100, CLASSLA, HPLT, mC4, OSCAR, srWaC |
| [procesaur/Vikipedija](https://huggingface.co/datasets/procesaur/Vikipedija) | Serbian Wikipedia |
| [procesaur/znanje](https://huggingface.co/datasets/procesaur/znanje) | enauka_sr, nardus_sr (scientific papers and theses) |
| [oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | OSCAR unshuffled deduplicated Serbian |

All sources were stripped of sentence tags, NFC-normalized, and deduplicated before merging.

## Pipeline Architecture

The segmentation pipeline operates in 9 stages:

1. **Dictionary mining** (`updated_srpmd_pipeline.py`): parses DELA-style Serbian dictionaries, mines phonological alternations, discovers abstract roots, and segments all known forms with dual-script support.
2. **Affix discovery** (`discover_affixes_from_known.py`): identifies productive prefixes and suffixes by measuring stem variety.
3. **Corpus parsing** (`parser.py`): scans `corpus.txt` in parallel, counting known morpheme frequencies and collecting unknown words.
4. **Frequency merging** (`merge.py`): consolidates frequency shards.
5. **Unknown consolidation** (`consolidate_unknowns.py`) + **Canonical map** (`build_canonical_map.py`).
6. **Unknown segmentation** (`segment_unknowns_phonologically.py`): applies phonological rules using all knowledge bases.
7. **Unknown frequency counting** (`count_unknown_frequencies.py`).
8. **Unknown frequency merging** (`merge_unknowns.py`).
9. **Knowledge base merge** (`merge_knowledge.py`).

### Phonological Engine

The pipeline implements a cascade of Serbian-specific phonological reversal rules:

- Consonant deletion (`Gubljenje suglasnika`)
- Fleeting vowel removal (`Nepostojano A`)
- L-vocalization (`Prelazak L u O`)
- Place assimilation (`Jednačenje po mestu tvorbe`)
- Voicing assimilation (`Jednačenje po zvučnosti`)
- Sibilarization and Palatalization
- Iotation (`Jotovanje`)
- Data-driven canonical root correction

## Related Repositories

| Repository | Description |
|------------|-------------|
| [`Nikola-92/sr-morpho-vocab`](https://huggingface.co/datasets/Nikola-92/sr-morpho-vocab) | Optimized 10K Serbian morpheme vocabulary for LLM injection |
| [`Nikola-92/Qwen2.5-7B-serbian-morpho`](https://huggingface.co/Nikola-92/Qwen2.5-7B-serbian-morpho) | Qwen 2.5 7B with the 10K vocabulary injected |

## Usage

```python
import pandas as pd

# Load the main knowledge base
kb = pd.read_csv("knowledge_base_final.tsv", sep="\t")
print(kb.head())

# Load root frequencies (headerless two-column file)
roots = pd.read_csv("known_roots_frequencies.tsv", sep="\t", header=None, names=["root", "frequency"])
print(roots.nlargest(20, "frequency"))
```

## Citation

If you use this dataset in your research, please cite:

```
@dataset{jankovic2025srmorpho,
  author = {Jankovic, Nikola},
  title = {Serbian Morphological Segmentation Knowledge Base},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/Nikola-92/sr-morpho-base}
}
```

## License

This dataset is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). You are free to share and adapt this material for non-commercial purposes, provided you give appropriate credit and distribute your contributions under the same license.
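
## Example: Streaming the Corpus

As a supplement to the Usage section: `corpus.txt` is described as roughly 60 GB with one document per line, so loading it whole with pandas is unlikely to be practical. The following is a minimal sketch of streaming it line by line and counting word forms. It is not the repository's `parser.py`; the function names, the letters-only tokenizer, and the `max_docs` parameter are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import islice
from typing import Iterable, Optional

# Assumed tokenizer: runs of Unicode letters, which covers both the
# Latin and Cyrillic scripts. The real pipeline may tokenize differently.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_word_forms(lines: Iterable[str]) -> Counter:
    """Count lowercased word forms over an iterable of documents."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def count_corpus(path: str, max_docs: Optional[int] = None) -> Counter:
    """Stream a one-document-per-line file without loading it into memory.

    max_docs (hypothetical parameter) limits how many lines are read,
    which is handy for a quick look at a very large file.
    """
    with open(path, encoding="utf-8") as fh:
        return count_word_forms(islice(fh, max_docs))


if __name__ == "__main__":
    # Inspect the first 10,000 documents only.
    sample = count_corpus("corpus.txt", max_docs=10_000)
    print(sample.most_common(20))
```

Note that this sketch counts Latin and Cyrillic spellings of the same word as separate forms; merging them would need a transliteration step, which is left out here.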
Provider:
Nikola-92