krutrim-ai-labs/BhashaKritika
Source: Hugging Face. Updated: 2025-11-27. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/krutrim-ai-labs/BhashaKritika
---
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE
dataset_info:
- config_name: bengali
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 36819676622
    num_examples: 3110430
  download_size: 13413550672
  dataset_size: 36819676622
- config_name: gujarati
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 13359430388
    num_examples: 1234300
  download_size: 3754057820
  dataset_size: 13359430388
- config_name: hindi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 55983119941
    num_examples: 4566283
  download_size: 19417613854
  dataset_size: 55983119941
- config_name: malayalam
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 1312405460
    num_examples: 104279
  download_size: 362979120
  dataset_size: 1312405460
- config_name: marathi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 16543531776
    num_examples: 1939708
  download_size: 5184472157
  dataset_size: 16543531776
- config_name: punjabi
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 15235757034
    num_examples: 1465532
  download_size: 4448746540
  dataset_size: 15235757034
- config_name: tamil
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 32911989480
    num_examples: 2829721
  download_size: 11200823373
  dataset_size: 32911989480
- config_name: telugu
  features:
  - name: master_id
    dtype: string
  - name: generation_technique
    dtype: string
  - name: source
    dtype: string
  - name: style
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: flags
    dtype: string
  - name: language_detection
    dtype: string
  - name: word_count
    dtype: string
  - name: word_n_gram_repetition
    struct:
    - name: 6_gram_words_repetition_score
      dtype: float64
  - name: perplexity
    struct:
    - name: perplexity_score
      dtype: float64
  - name: quality_classification
    dtype: string
  - name: nsfw_words
    struct:
    - name: nsfw_words_ratio
      dtype: float64
  - name: stop_words
    struct:
    - name: stop_words_ratio
      dtype: float64
  - name: non_li_words
    struct:
    - name: non_li_words_ratio
      dtype: float64
  splits:
  - name: train
    num_bytes: 12548600792
    num_examples: 974363
  download_size: 4178093850
  dataset_size: 12548600792
configs:
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
- config_name: gujarati
  data_files:
  - split: train
    path: gujarati/train-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
- config_name: malayalam
  data_files:
  - split: train
    path: malayalam/train-*
- config_name: marathi
  data_files:
  - split: train
    path: marathi/train-*
- config_name: punjabi
  data_files:
  - split: train
    path: punjabi/train-*
- config_name: tamil
  data_files:
  - split: train
    path: tamil/train-*
- config_name: telugu
  data_files:
  - split: train
    path: telugu/train-*
---
# BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
- The BhashaKritika paper is available here: [**Paper**](https://arxiv.org/pdf/2511.10338)
## 1. Introduction
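**BhashaKritika** is a large-scale synthetic pretraining corpus for 10 Indic languages. It is built using **five generation strategies**: document-grounded, persona-based, topic-guided, math-and-reasoning-based, and translation-based generation. The configurations listed in this card cover eight of these languages: Bengali, Gujarati, Hindi, Malayalam, Marathi, Punjabi, Tamil, and Telugu.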
The dataset is part of a systematic study on how grounding, instruction language, and native vs. translated generation affect data quality in multilingual settings. To ensure consistency at scale, we develop a **modular quality evaluation pipeline** with script and language detection, metadata checks, n-gram repetition analysis, and KenLM-based perplexity filtering.
BhashaKritika aims to provide a reliable, diverse, and linguistically rich synthetic corpus for pretraining high-quality Indic LLMs.
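
This card does not spell out how the n-gram repetition score in the quality pipeline is computed. As a rough illustration only, the sketch below uses one common definition: the fraction of word 6-grams that duplicate an earlier 6-gram in the same text. The function name `ngram_repetition_score` and whitespace tokenization are assumptions, not the authors' implementation.

```python
from collections import Counter


def ngram_repetition_score(text, n=6):
    """Fraction of word n-grams that repeat an earlier n-gram in the text.

    Assumed definition for illustration; the pipeline's formula may differ.
    """
    words = text.split()  # naive whitespace tokenization (assumption)
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words has no n-grams
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


print(ngram_repetition_score("a b c d e f a b c d e f"))
```

Under this definition a score near 0 means almost no repeated phrasing, while a score near 1 means the text is dominated by looping 6-word sequences.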
---
## 2. Dataset Details
Each sample contains the generated text together with detailed quality-evaluation metadata, in the following fields:
### Core Metadata
- **`master_id`** — Unique identifier for each generated instance.
- **`generation_technique`** — Method used for text generation (e.g., `document_grounded`, `persona_based`, `topic_based`, `math_and_reasoning_based`, `translation_based`)
- **`source`** — Origin of the context used for generation (e.g., `indic_cc`, `fineweb2`)
- **`style`** — Output style or format of the generation
- **`language`** — Target Indic language
- **`topic`** — Topical domain of the sample
- **`prompt`** — Input instruction provided to the model
- **`response`** — Generated output text
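
The core metadata fields make it possible to slice the corpus before training. A minimal sketch, assuming rows are dictionaries with the field names listed above; `technique_counts` is a hypothetical helper and the sample values are made up for illustration:

```python
from collections import Counter


def technique_counts(rows):
    """Count rows per (generation_technique, source) pair."""
    return Counter(
        (row["generation_technique"], row["source"]) for row in rows
    )


# Hypothetical rows that follow the field names above; values are made up.
sample_rows = [
    {"generation_technique": "persona_based", "source": "indic_cc"},
    {"generation_technique": "persona_based", "source": "indic_cc"},
    {"generation_technique": "translation_based", "source": "fineweb2"},
]

for (technique, source), count in technique_counts(sample_rows).most_common():
    print(technique, source, count)
```

The same function accepts any iterable of rows, so it could be run over a loaded or streamed configuration to see how the generation techniques are distributed.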
### Quality & Safety Metadata
- **`flags`** — Indicators for automatically detected quality issues
- **`language_detection`** — Language identified by the language-detection step
- **`word_count`** — Word count of the generated text (stored as a string)
- **`word_n_gram_repetition`** — Word-level 6-gram repetition score (`6_gram_words_repetition_score`)
- **`perplexity`** — KenLM-based fluency/naturalness score (`perplexity_score`)
- **`quality_classification`** — Quality label and score assigned by the fastText quality classifier
- **`nsfw_words`** — Ratio of detected sensitive or inappropriate words (`nsfw_words_ratio`)
- **`stop_words`** — Ratio of stopword occurrences, based on language-specific lists (`stop_words_ratio`)
- **`non_li_words`** — Ratio of words outside the Latin and Indic scripts (`non_li_words_ratio`)
This structure provides a rich combination of **generation details**, **linguistic analysis**, and **quality signals**, enabling fine-grained filtering and large-scale pretraining for multilingual Indic LLMs.
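
As an illustration of such filtering, the sketch below defines a row-level predicate over the nested quality fields. The thresholds are placeholders chosen for the example, not values recommended by the authors, and it assumes `word_count` parses as an integer string; `keep_sample` is a hypothetical name.

```python
def keep_sample(row, max_perplexity=1000.0, max_repetition=0.2, min_words=50):
    """Return True if a row passes illustrative quality thresholds."""
    # Thresholds are placeholders, not values from the BhashaKritika paper.
    perplexity = (row.get("perplexity") or {}).get("perplexity_score")
    repetition = (row.get("word_n_gram_repetition") or {}).get(
        "6_gram_words_repetition_score"
    )
    try:
        # word_count has dtype string in the dataset schema.
        words = int(row.get("word_count") or 0)
    except ValueError:
        return False  # unparseable count: treat the row as failing
    if perplexity is None or repetition is None:
        return False  # missing scores: treat the row as failing
    return (
        perplexity <= max_perplexity
        and repetition <= max_repetition
        and words >= min_words
    )


# Made-up row that follows the schema in the front matter.
example = {
    "word_count": "120",
    "perplexity": {"perplexity_score": 350.0},
    "word_n_gram_repetition": {"6_gram_words_repetition_score": 0.01},
}
print(keep_sample(example))
```

With a configuration loaded through the `datasets` library, passing this predicate to `Dataset.filter` should apply it row by row; suitable thresholds would need to be tuned per language.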
---
## 3. How to Use and Run
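You can load one language configuration at a time with the `datasets` library by passing its config name (`bengali`, `gujarati`, `hindi`, `malayalam`, `marathi`, `punjabi`, `tamil`, or `telugu`) as `name`: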
```python
from datasets import load_dataset

# Load the training split of one language configuration.
ds = load_dataset(
    "krutrim-ai-labs/BhashaKritika",
    name="bengali",
    split="train",
)
```
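
The configurations are large: the front matter lists a download size of about 19.4 GB for `hindi` and about 363 MB for `malayalam`. When only a preview is needed, streaming may be more practical than a full download. The sketch below assumes the `streaming=True` mode of `load_dataset` works for this repository (not verified here); `preview` and `stream_config` are hypothetical helper names.

```python
from itertools import islice


def preview(rows, n=3, max_chars=80):
    """Return the first n rows as (master_id, truncated response) pairs."""
    return [
        (row["master_id"], row["response"][:max_chars])
        for row in islice(rows, n)
    ]


def stream_config(name):
    """Open one language configuration lazily instead of downloading it."""
    # Imported here so that `preview` can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "krutrim-ai-labs/BhashaKritika",
        name=name,
        split="train",
        streaming=True,  # yields rows on demand as an iterable dataset
    )
```

Calling `preview(stream_config("malayalam"))` would then fetch only the first few rows over the network rather than the whole configuration.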
---
## 4. License
This repository is licensed under the [Krutrim Community License](LICENSE).
## 5. Citation
```
@misc{manoj2025bhashakritikabuildingsyntheticpretraining,
      title={BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages},
      author={Guduru Manoj and Neel Prabhanjan Rachamalla and Ashish Kulkarni and Gautam Rajeev and Jay Piplodiya and Arul Menezes and Shaharukh Khan and Souvik Rana and Manya Sah and Chandra Khatri and Shubham Agarwal},
      year={2025},
      eprint={2511.10338},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.10338},
}
```
Provider:
krutrim-ai-labs



