
krutrim-ai-labs/BhashaKritika

Hugging Face | Updated 2025-11-27 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/krutrim-ai-labs/BhashaKritika
Resource description:
---
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE
dataset_info:
- config_name: bengali
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 36819676622
    num_examples: 3110430
  download_size: 13413550672
  dataset_size: 36819676622
- config_name: gujarati
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 13359430388
    num_examples: 1234300
  download_size: 3754057820
  dataset_size: 13359430388
- config_name: hindi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 55983119941
    num_examples: 4566283
  download_size: 19417613854
  dataset_size: 55983119941
- config_name: malayalam
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 1312405460
    num_examples: 104279
  download_size: 362979120
  dataset_size: 1312405460
- config_name: marathi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 16543531776
    num_examples: 1939708
  download_size: 5184472157
  dataset_size: 16543531776
- config_name: punjabi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 15235757034
    num_examples: 1465532
  download_size: 4448746540
  dataset_size: 15235757034
- config_name: tamil
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 32911989480
    num_examples: 2829721
  download_size: 11200823373
  dataset_size: 32911989480
- config_name: telugu
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 12548600792
    num_examples: 974363
  download_size: 4178093850
  dataset_size: 12548600792
configs:
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
- config_name: gujarati
  data_files:
  - split: train
    path: gujarati/train-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
- config_name: malayalam
  data_files:
  - split: train
    path: malayalam/train-*
- config_name: marathi
  data_files:
  - split: train
    path: marathi/train-*
- config_name: punjabi
  data_files:
  - split: train
    path: punjabi/train-*
- config_name: tamil
  data_files:
  - split: train
    path: tamil/train-*
- config_name: telugu
  data_files:
  - split: train
    path: telugu/train-*
---

# BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

- The BhashaKritika paper is available here: [**Paper**](https://arxiv.org/pdf/2511.10338)

## 1. Introduction

**BhashaKritika** is a large-scale synthetic pretraining corpus for 10 Indic languages. It is built using **five generation strategies**, including document-grounded, persona-based, topic-guided, and translation-based approaches. The dataset is part of a systematic study of how grounding, instruction language, and native vs. translated generation affect data quality in multilingual settings.

To ensure consistency at scale, we develop a **modular quality evaluation pipeline** with script and language detection, metadata checks, n-gram repetition analysis, and KenLM-based perplexity filtering. BhashaKritika aims to provide a reliable, diverse, and linguistically rich synthetic corpus for pretraining high-quality Indic LLMs.

---

## 2. Dataset Details

Each entry captures both the generated text and detailed quality evaluation metadata. Each sample includes:

### Core Metadata

- **`master_id`**: Unique identifier for each generated instance.
- **`generation_technique`**: Method used for text generation (e.g., document_grounded, persona_based, topic_based, math_and_reasoning_based, translation_based).
- **`source`**: Origin of the context used for generation (indic_cc, fineweb2, etc.).
- **`style`**: Output style or format of the generation.
- **`language`**: Target Indic language.
- **`topic`**: Topical domain of the sample.
- **`prompt`**: Input instruction provided to the model.
- **`response`**: Generated output text.

### Quality & Safety Metadata

- **`flags`**: Indicators for automatically detected quality issues.
- **`language_detection`**: Language detected for the sample.
- **`word_count`**: Word count of the generated text.
- **`word_n_gram_repetition`**: 6-gram repetition score.
- **`perplexity`**: KenLM-based fluency/naturalness score.
- **`quality_classification`**: Quality label and score assigned by the fastText quality classifier.
- **`nsfw_words`**: Ratio of detected sensitive or inappropriate words.
- **`stop_words`**: Ratio of stopword occurrences, based on language-specific lists.
- **`non_li_words`**: Ratio of words written in neither Latin nor Indic scripts.

This structure combines **generation details**, **linguistic analysis**, and **quality signals**, enabling fine-grained filtering and large-scale pretraining for multilingual Indic LLMs.

---

## 3. How to Use and Run

You can load the dataset using the `datasets` library:

```python
from datasets import load_dataset

# `name` selects one language config (e.g. "bengali", "hindi", "tamil").
ds = load_dataset(
    "krutrim-ai-labs/BhashaKritika",
    name="bengali",
    split="train"
)
```

---

## 4. License

This repository is licensed under the [Krutrim Community License](LICENSE).

## 5. Citation

```
@misc{manoj2025bhashakritikabuildingsyntheticpretraining,
      title={BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages},
      author={Guduru Manoj and Neel Prabhanjan Rachamalla and Ashish Kulkarni and Gautam Rajeev and Jay Piplodiya and Arul Menezes and Shaharukh Khan and Souvik Rana and Manya Sah and Chandra Khatri and Shubham Agarwal},
      year={2025},
      eprint={2511.10338},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.10338},
}
```
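
As a complement to the loading snippet in Section 3, the quality metadata described in Section 2 could be used to filter samples before pretraining. The following is a minimal sketch, not the authors' pipeline: the three threshold constants and the helper names `passes_quality_checks` and `filter_records` are hypothetical, and the card does not publish recommended cut-offs. It only assumes the nested field names listed in the dataset schema and runs on toy records, so no download is needed.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical thresholds for illustration only; tune them per language.
MAX_PERPLEXITY = 500.0
MAX_6GRAM_REPETITION = 0.2
MAX_NSFW_RATIO = 0.01


def passes_quality_checks(record: Dict[str, Any]) -> bool:
    """Return True when all three nested quality scores are present and within bounds."""
    perplexity = (record.get("perplexity") or {}).get("perplexity_score")
    repetition = (record.get("word_n_gram_repetition") or {}).get(
        "6_gram_words_repetition_score"
    )
    nsfw_ratio = (record.get("nsfw_words") or {}).get("nsfw_words_ratio")

    # Treat a missing score as a failed check rather than guessing a value.
    if perplexity is None or repetition is None or nsfw_ratio is None:
        return False

    return (
        perplexity <= MAX_PERPLEXITY
        and repetition <= MAX_6GRAM_REPETITION
        and nsfw_ratio <= MAX_NSFW_RATIO
    )


def filter_records(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the records that pass the quality checks."""
    return (record for record in records if passes_quality_checks(record))


# Toy records that mimic the schema; all values are made up.
sample_records = [
    {
        "master_id": "ok-1",
        "perplexity": {"perplexity_score": 120.0},
        "word_n_gram_repetition": {"6_gram_words_repetition_score": 0.05},
        "nsfw_words": {"nsfw_words_ratio": 0.0},
    },
    {
        "master_id": "repetitive-1",
        "perplexity": {"perplexity_score": 90.0},
        "word_n_gram_repetition": {"6_gram_words_repetition_score": 0.6},
        "nsfw_words": {"nsfw_words_ratio": 0.0},
    },
    {
        "master_id": "high-ppl-1",
        "perplexity": {"perplexity_score": 2400.0},
        "word_n_gram_repetition": {"6_gram_words_repetition_score": 0.01},
        "nsfw_words": {"nsfw_words_ratio": 0.0},
    },
    {
        "master_id": "missing-1",
        "perplexity": {"perplexity_score": None},
        "word_n_gram_repetition": {"6_gram_words_repetition_score": 0.01},
        "nsfw_words": {"nsfw_words_ratio": 0.0},
    },
]

kept_ids = [record["master_id"] for record in filter_records(sample_records)]
print(kept_ids)
```

With a dataset loaded through `load_dataset`, the same predicate can likely be passed to `ds.filter(passes_quality_checks)`, since struct features are returned as nested dictionaries. Whether a lower perplexity or repetition score should be the only criterion is a design choice left to the user.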
Provider:
krutrim-ai-labs