hyper-efficient-system-llc/bosnian-corpus-v1

Name: hyper-efficient-system-llc/bosnian-corpus-v1
Creator: hyper-efficient-system-llc
Published: 2026-01-18 23:43:40
License: CC BY-SA 4.0

Source: Hugging Face. Updated: 2026-01-18. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/hyper-efficient-system-llc/bosnian-corpus-v1

Description:

---
license: cc-by-sa-4.0
language:
- bs
tags:
- bosnian
- corpus
- nlp
- information-theory
- entropy
- language-modeling
pretty_name: Bosnian Corpus v1.0 (cleaned)
size_categories:
- 1B<n<10B
---

# Bosnian Corpus v1.0 (cleaned)

This dataset provides a cleaned and genre-annotated corpus of contemporary Bosnian, designed for quantitative analysis of language entropy, “language energy”, and modern NLP tasks.

The canonical release of this corpus is published on Zenodo:

**DOI: 10.5281/zenodo.17757098**

---

## Corpus composition

The corpus is built from three publicly available resources released via the CLARIN.SI repository:

1. **Sarajevo Corpus of SMS Messages in Bosnian 1.1**
2. **Bosnian Web Corpus bsWaC 1.1**
3. **Bosnian Web Corpus CLASSLA-web.bs 1.0**

All sources were converted to plain text, cleaned, normalized, partially deduplicated, and merged into a single consistent dataset.

---

## Size and statistics

- **Total size:** ~6.18 GB (≈ 6,182,905,888 bytes)
- **Lines:** ~46,258,935
- **Tokens:** ~942,515,845

---

## Genre structure

The web portion of the corpus is organized into the following “super-genres”:

- News
- Opinion
- Forum / Chat
- Info / HowTo
- Legal / Administrative
- Literature
- Ads / Promo
- Mix / Other

For each super-genre, a separate text file is provided, along with one **global file** concatenating all genres. The global file is intended for entropy estimation and language-model training.
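
Since the global file is intended for entropy estimation, the following is a minimal sketch of a unigram (plug-in) Shannon entropy estimate over whitespace-separated tokens. It is an illustration only, not the author's method: the function name `unigram_entropy` and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable


def unigram_entropy(lines: Iterable[str]) -> float:
    """Plug-in (maximum-likelihood) Shannon entropy in bits per token.

    Hypothetical helper: tokens are simply whitespace-separated strings.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated as count / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Four equally frequent tokens give exactly 2 bits per token.
    sample = ["a b a b", "c c d d"]
    print(f"{unigram_entropy(sample):.3f} bits/token")
```

Because the function accepts any iterable of lines, a large file can be streamed without loading it into memory, for example by passing an open file handle (`open(path, encoding="utf-8")`) directly. Note that the plug-in estimator is biased downward on small samples, so results on short excerpts should be read with caution.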

---

## Cleaning and normalization

Cleaning focuses on removing technical noise that would bias frequency distributions and entropy estimates, while preserving the linguistic signal:

- Unicode normalization (UTF-8, NFC)
- Correction of common mojibake artifacts
- Removal of URLs, e-mail addresses, file names, boilerplate, CMS/navigation lines
- Filtering of lines with a high proportion of non-letter characters
- Optional digit normalization and lowercasing
- Language filtering to keep primarily Bosnian text

---

## Files

- `bosnian_corpus_all.txt`: full corpus (all genres combined)
- Per-genre text files:
  - news
  - opinion
  - forum_chat
  - info_howto
  - legal_admin
  - literature
  - ads_promo
  - mix_other
- `README.txt`: dataset description

Associated research papers (Bosnian and English) are published separately on Zenodo.

- Download: `data/bosnian-corpus-1.0.zip`

---

## Code availability

Preprocessing, cleaning, and entropy-calculation scripts are available on GitHub:

https://github.com/H4sK0/bosnian-corpus-pipeline

---

## License

This dataset is released under the **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** license.

Users must:

- credit this Zenodo record and the original source corpora
- distribute derivative datasets under the same or a compatible license

---

## Citation

```bibtex
@dataset{kahrimanovic2025bosniancorpus,
  title     = {Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research},
  author    = {Kahrimanović, Hasan},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17757098}
}
```
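
As an illustration of the steps listed under "Cleaning and normalization", here is a minimal per-line sketch covering NFC normalization, URL and e-mail removal, the non-letter filter, and the optional digit normalization and lowercasing. It is not the published pipeline: the function name `clean_line`, the regular expressions, and the 0.6 letter-ratio threshold are all assumptions chosen for this example. Mojibake repair, boilerplate removal, and language filtering are left out.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical patterns and threshold; the dataset card does not specify them.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MIN_LETTER_RATIO = 0.6


def clean_line(
    line: str, lowercase: bool = False, normalize_digits: bool = False
) -> Optional[str]:
    """Return the cleaned line, or None if the line should be dropped."""
    # Compose characters such as "c" + combining caron into a single "č".
    text = unicodedata.normalize("NFC", line)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = " ".join(text.split())  # collapse runs of whitespace
    if not text:
        return None
    # Drop lines where too few of the non-space characters are letters.
    non_space = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in non_space)
    if letters / len(non_space) < MIN_LETTER_RATIO:
        return None
    if normalize_digits:
        text = re.sub(r"\d", "0", text)  # map every digit to "0"
    if lowercase:
        text = text.lower()
    return text
```

For instance, `clean_line("Posjetite https://example.com danas")` keeps only the two words around the URL, while a line made up mostly of digits and punctuation is dropped by returning `None`.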

Provided by:

hyper-efficient-system-llc
