
lumees/bulgarian-corpus-33b

Hugging Face · Updated 2025-11-30 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/lumees/bulgarian-corpus-33b
Resource description:
---
language:
- bg
license: apache-2.0
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Bulgarian Corpus 33B
size_categories:
- 10B<n<100B
tags:
- bulgarian
- llm
- foundation-model
- pretraining
- sft
- fineweb
- science
configs:
- config_name: pretrain
  data_files: "pretrain/*.parquet"
- config_name: sft
  data_files: "sft/*.parquet"
---

# Lumees Bulgarian Corpus (BG-Corpus-33B)

## Dataset Summary

The **Bulgarian Corpus 33B** is a massive-scale, deduplicated, and cleaned dataset designed for training Foundation Models in Bulgarian. Comprising approximately **33.4 Billion tokens** (measured with the Qwen 2.5 / Llama-3 tokenizers), it represents one of the largest open-source resources for Bulgarian LLM pretraining.

The dataset is engineered for a modern two-stage training pipeline:

1. **Pretrain Subset (~29.3B Tokens):** A diverse mix of high-quality web data, encyclopedic knowledge, and scientific abstracts.
2. **SFT Subset (~4.1B Tokens):** A curated collection of instruction-following, chat, and multitask data, strictly filtered to remove alignment artifacts.

**Training Recommendation:** With ~33B unique high-quality tokens, we recommend training for **3 Epochs** over the pretrain subset to achieve optimal convergence for models in the 7B-8B parameter range (3 × ~29.3B ≈ 88B, i.e. effectively ~90B training tokens).

---

## Dataset Statistics

*Estimates based on Qwen 2.5 / Llama-3 Tokenization.*

| Subset | Format | File Type | Documents | Token Count |
| :--- | :--- | :--- | :--- | :--- |
| **Pretrain** | Universal Schema | Parquet (Snappy) | 26,278,393 | **~29.31 Billion** |
| **SFT** | ChatML | Parquet (Snappy) | 8,663,195 | **~4.11 Billion** |
| **Total** | - | - | **34,941,588** | **~33.42 Billion** |

---

## Data Structure

### 1. Pretraining Subset (`pretrain`)

Optimized for high-throughput streaming with libraries like `datatrove`, `nanotron`, or `torchtune`.


| Column | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique identifier (vital for tracking). |
| `text` | `string` | The cleaned, deduplicated content. |
| `source` | `string` | Origin dataset (e.g., `fineweb-2`, `bpos_science`). |
| `language` | `string` | ISO Code (`bg`). |
| `meta` | `string` | Original metadata (URL, date, title, DOI) serialized as a JSON string. |

### 2. SFT Subset (`sft`)

Optimized for "Instruction Pretraining" or Fine-Tuning (Axolotl/LLaMA-Factory compatible).

| Column | Type | Description |
| :--- | :--- | :--- |
| `messages` | `list` | Standard OpenAI/ChatML format: `[{"role": "user", ...}, {"role": "assistant", ...}]` |
| `source` | `string` | Origin task (e.g., `aya_collection`, `xp3x`). |

---

## Data Composition

This corpus was built using a **Quality-First** strategy, blending massive web scale with high-density scientific and encyclopedic data.

| Source | Type | Usage Phase | Description |
| :--- | :--- | :--- | :--- |
| **[FineWeb-2 (Bulgarian)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)** | Web Crawl | Pretrain | The backbone of the corpus (cleaned web text). |
| **[FineWiki BG](https://huggingface.co/datasets/HuggingFaceFW/finewiki)** | Knowledge | Pretrain | Full Bulgarian Wikipedia dump with rich metadata. |
| **[BPOS (Open Science)](https://bpos.bg)** | Scientific | Pretrain | **4,700+** Titles and Abstracts from the Bulgarian Portal for Open Science (High density domain knowledge). |
| **[Aya Collection](https://huggingface.co/datasets/CohereLabs/aya_collection)** | Instruction | SFT | High-quality multilingual instruction following. |
| **[xP3x](https://huggingface.co/datasets/CohereLabs/xP3x)** | NLP Tasks | SFT | Massive multitask dataset (Filtered for quality). |
| **[Alpaca Dictionary BG](https://huggingface.co/datasets/vislupus/alpaca-bulgarian-dictionary)** | Linguistic | SFT | Definitions, synonyms, and linguistic tasks. |

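
The `pretrain` and `sft` schemas described under Data Structure can be exercised with a short sketch. It shows one way to parse the JSON-encoded `meta` column and to render a `messages` list as ChatML text. The helper names `decode_meta`, `to_chatml`, and `stream_config` are hypothetical, the sample rows are invented for illustration, the `content` key is assumed from the usual OpenAI chat format (the card elides it), and the `train` split name is an assumption because the card lists only the config names.

```python
import json


def decode_meta(row: dict) -> dict:
    """Return a copy of a pretrain row with `meta` parsed from its JSON string."""
    parsed = dict(row)
    parsed["meta"] = json.loads(row["meta"]) if row.get("meta") else {}
    return parsed


def to_chatml(messages: list) -> str:
    """Render a `messages` list as ChatML text, one <|im_start|>...<|im_end|> block per turn."""
    # Assumption: each message carries its text under the "content" key.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def stream_config(config_name: str):
    """Lazily stream one config ("pretrain" or "sft"); needs `datasets` and network access."""
    from datasets import load_dataset  # imported here so the helpers above work offline

    # Assumption: the split is named "train"; the card does not state split names.
    return load_dataset(
        "lumees/bulgarian-corpus-33b", config_name, split="train", streaming=True
    )


# Invented sample rows that follow the column tables in the card.
sample_pretrain = {
    "id": "example-0",
    "text": "Примерен текст.",
    "source": "fineweb-2",
    "language": "bg",
    "meta": '{"url": "https://example.com", "title": "Пример"}',
}
sample_sft = {
    "messages": [
        {"role": "user", "content": "Здравей!"},
        {"role": "assistant", "content": "Здравейте!"},
    ],
    "source": "aya_collection",
}

print(decode_meta(sample_pretrain)["meta"]["title"])
print(to_chatml(sample_sft["messages"]))
```

Keeping `stream_config` separate from the pure helpers means the parsing and rendering logic can be checked on local rows before any data is downloaded.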

---

## Processing Pipeline

This dataset was engineered to **Foundation Model** training standards:

1. **Normalization:** Multiple raw data sources were mapped to a single unified schema.
2. **PII Sanitization:**
    * **Regex Cleaning:** Automated removal of email addresses, IPv4 addresses, and **Bulgarian phone numbers** (e.g., `+359...`, `088...`).
3. **DB-Assisted Deduplication:**
    * Exact deduplication (MD5 hashing) was performed across the entire collection.
    * **Priority Strategy:** High-quality sources (Wiki/Science) were processed first to claim ownership of duplicate text, ensuring the highest quality version is kept.
4. **Quality Filtering (SFT):**
    * The SFT subset was scrubbed of "poison" rows (e.g., where the assistant replies "None", "null", or refuses to answer due to alignment errors).
5. **Sharding:** Data is split into Parquet shards of ~200k rows each for optimal download and streaming speeds.

## Limitations

* **Web Bias:** A significant portion of the data (FineWeb) comes from the open internet and may reflect societal biases found in Bulgarian web content.
* **Translation Artifacts:** Some SFT data is machine-translated or aligned; while we filtered obvious errors, some translation artifacts may remain.

-----

## Citation & Attribution

If you use this dataset in your research or product, please cite:

```bibtex
@misc{bulgariancorpus33b,
  author = {Hasan KURŞUN and Kerem Berkay YANIK},
  publisher = {Lumees AI},
  title = {Bulgarian Corpus 33B},
  year = {2025},
  publisher = {HuggingFace Community},
  howpublished = {\url{https://lumees.io}},
  email = {hello@lumees.io}
}
```
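
For readers who want to reproduce something like the Processing Pipeline on their own data, the PII cleaning, priority-ordered MD5 deduplication, and SFT filtering steps could be sketched roughly as below. This is not the authors' code: the regular expressions are illustrative guesses (the card does not publish its patterns), an in-memory set stands in for the database, refusal detection is omitted, and the `finewiki` source label, the `content` key, and all function names are hypothetical.

```python
import hashlib
import re

# Illustrative patterns only; the card does not publish the actual expressions.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed shape of Bulgarian mobile numbers: +359 88 123 4567 or 0881234567.
BG_PHONE_RE = re.compile(r"(?<!\d)(?:\+359[\s-]?|0)8[789](?:[\s-]?\d){7}(?!\d)")


def scrub_pii(text: str) -> str:
    """Remove emails, IPv4 addresses, and Bulgarian-style phone numbers."""
    text = EMAIL_RE.sub("", text)
    text = IPV4_RE.sub("", text)
    return BG_PHONE_RE.sub("", text)


def dedup_by_priority(docs: list, priority: list) -> list:
    """Keep the first copy of each exact text, visiting high-priority sources first."""
    rank = {src: i for i, src in enumerate(priority)}
    seen = set()  # stand-in for the database of MD5 digests
    kept = []
    # sorted() is stable, so documents from the same source keep their order.
    for doc in sorted(docs, key=lambda d: rank.get(d["source"], len(rank))):
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


POISON_REPLIES = {"none", "null"}


def is_clean_sft(row: dict) -> bool:
    """Return False if any assistant turn is a bare "None" or "null" reply."""
    return not any(
        m["role"] == "assistant" and m["content"].strip().lower() in POISON_REPLIES
        for m in row["messages"]
    )


# Invented documents: the same sentence appears in a web and a wiki source.
docs = [
    {"id": "web-1", "source": "fineweb-2", "text": "София е столицата на България."},
    {"id": "wiki-1", "source": "finewiki", "text": "София е столицата на България."},
    {"id": "web-2", "source": "fineweb-2", "text": "Контакт: ivan@example.bg, 0881234567"},
]
kept = dedup_by_priority(docs, priority=["finewiki", "bpos_science", "fineweb-2"])
print([d["id"] for d in kept])
print(scrub_pii(docs[2]["text"]))
```

Sorting by source priority before hashing is what lets the wiki copy "claim" the duplicated sentence, so the web copy is the one that gets dropped.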
Provider:
lumees