
Polygl0t/gigalekh-v1

Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Polygl0t/gigalekh-v1
Resource description:
---
dataset_info:
- config_name: default
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: subset
    dtype: string
  - name: token_count
    dtype: int64
  - name: toxic_score
    dtype: float64
  - name: toxic_int_score
    dtype: int64
  - name: edu_score
    dtype: float64
  - name: edu_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 260681556054
    num_examples: 83081507
  download_size: 260681556054
  dataset_size: 260681556054
- config_name: excluded
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: subset
    dtype: string
  - name: token_count
    dtype: int64
  - name: toxic_score
    dtype: float64
  - name: toxic_int_score
    dtype: int64
  - name: edu_score
    dtype: float64
  - name: edu_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 1763011227
    num_examples: 498892
  download_size: 1763011227
  dataset_size: 1763011227
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: default/train-*
- config_name: excluded
  data_files:
  - split: train
    path: excluded/train-*
license: other
task_categories:
- text-generation
language:
- hi
tags:
- hindi
pretty_name: GigaLekh-v1
size_categories:
- 10M<n<100M
---

# GigaLekh: A Large Hindi Text Corpus with Educational and Toxicity Annotations

<img src="./logo.png" width="400" height="400">

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/hindi-corpus
- **Repository:** https://huggingface.co/datasets/Polygl0t/hindi-corpus
- **Point of Contact:** [Shiza Fatimah](mailto:shizafatimah15@gmail.com)

### Dataset Summary

This repository contains a large corpus of Hindi text, which has been filtered and annotated using classifiers for educational content and toxicity. The dataset is intended for training language models and other NLP applications in Hindi.

### Supported Tasks and Leaderboards

This dataset can be utilized for tasks involving language modeling.

### Languages

Hindi.

## Dataset Structure

### Data Instances

The dataset consists of the following features:

- **text:** a string of text in Hindi.
- **source:** the source where that string originated.
- **subset:** a short string indicating the name of the subset (referring to the original dataset or crawl).
- **id:** a unique identifier for each sample (md5 hash).
- **token_count:** number of tokens in the text ([Polygl0t/LilMoo](https://huggingface.co/shiza-fatimah/tok-hin-en-code-49152) tokenizer).
- **edu_score:** a float representing the educational quality score predicted by the [Polygl0t/hindi-roberta-edu-classifier](https://huggingface.co/Polygl0t/hindi-roberta-edu-classifier) (1 = non-educational, 5 = highly educational).
- **edu_int_score:** an integer-rounded version of the `edu_score`.
- **toxic_score:** a float representing the toxicity score predicted by the [Polygl0t/hindi-roberta-toxicity-classifier](https://huggingface.co/Polygl0t/hindi-roberta-toxicity-classifier) (1 = non-toxic, 5 = highly toxic).
- **toxic_int_score:** an integer-rounded version of the `toxic_score`.

### Data Fields

```json
{
  "text": "चाणक्य की गिनती भारत के श्रेष्ठ विद्वानों में की जाती है|चाणक्य को राजनीति शास्त्र, समाज शास्त्र, कूटनीति शास्त्र, सैन्य शास्त्र के साथ अर्थशास्त्र का भी गहरा ज्ञान था|",
  "source": "https://huggingface.co/datasets/HuggingFaceFW/fineweb-2",
  "subset": "finweb_2_hi",
  "id": "dc0ec0a28936637fd9ec5be2b3994eb8",
  "token_count": 410,
  "edu_score": 3.87568,
  "edu_int_score": 4,
  "toxic_score": 2.047186,
  "toxic_int_score": 2
}
```

### Subsets and Splits

After applying the complete filtering pipeline (text extraction, language identification, heuristic quality filtering, deduplication, and learned filtering using our trained classifiers), we further (1) removed documents shorter than 50 tokens and (2) removed documents with a toxicity score > 3. Because toxicity is a particularly challenging issue in Hindi web data, we preserved the filtered-out documents as a separate subset of the corpus (`excluded`), which can be useful for future research on toxicity detection and mitigation in Hindi NLP.
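The two final rules above can be expressed as a single row-level predicate. The following is a minimal sketch, not the authors' code: the names `MIN_TOKENS`, `MAX_TOXIC_SCORE`, and `passes_final_filters` are hypothetical, and it assumes the toxicity threshold is applied to the float `toxic_score` column (the card does not say whether the float or the integer-rounded score was used).

```python
# Thresholds taken from the text above; the constant names are illustrative.
MIN_TOKENS = 50          # documents shorter than this were removed
MAX_TOXIC_SCORE = 3.0    # documents scoring above this were removed


def passes_final_filters(row: dict) -> bool:
    """Return True if a row survives both final filtering rules.

    Assumption: the threshold is compared against the float `toxic_score`,
    not `toxic_int_score`.
    """
    long_enough = row["token_count"] >= MIN_TOKENS
    not_too_toxic = row["toxic_score"] <= MAX_TOXIC_SCORE
    return long_enough and not_too_toxic


# The sample record shown under "Data Fields" has 410 tokens and a
# toxicity score of 2.047186, so it passes both rules.
example = {"token_count": 410, "toxic_score": 2.047186}
print(passes_final_filters(example))  # True
```

The same predicate could be passed to a dataset's `filter` method to re-apply or tighten the thresholds on a loaded split.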
```python
from datasets import load_dataset

# Load the main dataset
ds = load_dataset("Polygl0t/hindi-corpus", "default", split="train")

# Load the excluded subset
excluded_ds = load_dataset("Polygl0t/hindi-corpus", "excluded", split="train")

# If you don't want to download the entire dataset, set streaming to `True`
ds = load_dataset("Polygl0t/hindi-corpus", "default", split="train", streaming=True)
```

#### Statistics

| Subset     | Files | Rows        | Size (GB) | Tokens         |
| ---------- | ----- | ----------- | --------- | -------------- |
| `default`  | 35    | 83,081,507  | 260.68    | 90,705,245,239 |
| `excluded` | 11    | 498,892     | 1.76      | 1,545,290,236  |
| **Total**  | 46    | 83,580,399  | 262.44    | 92,250,535,475 |

## Dataset Creation

### Curation Rationale

To curate this dataset, we developed an implementation of the FineWeb2 filtering pipeline and added a learned-filter (LLM-as-a-Judge) approach based on works like FineWeb-Edu, the original Phi paper, and other works that document a similar approach.

#### Text Extraction and Language Identification

For web-crawled data sourced from CC WARC files, we begin by extracting text content using the Trafilatura library, configured to favor precision over recall. For datasets sourced from Hugging Face, which often include pre-cleaned text, this extraction step is bypassed, allowing direct input into the subsequent filtering stages. To ensure relevance and a minimal level of filtering at this early stage, we apply an initial URL filter that removes documents from undesirable sources, leveraging blocklists ([Maravento's blackweb](https://github.com/maravento/blackweb)) to exclude low-quality or inappropriate websites. This is followed by a language identification step using the FastText FT176 model. To enhance language identification accuracy, we then run a second round of language identification using GlotLID as the backend.
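The order of operations described above (URL blocklist first, then two independent language identifiers that must both agree) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `is_blocked`, `keep_document`, the `LangID` callable type, and the `min_conf` value of 0.65 are all hypothetical, and the two identifiers are passed in as plain callables rather than real FastText or GlotLID models.

```python
from typing import Callable, Optional, Set, Tuple
from urllib.parse import urlparse

# A language identifier maps text to (language_code, confidence).
# In the pipeline this role is played by FastText FT176 and then GlotLID;
# here any callable with this shape can be plugged in.
LangID = Callable[[str], Tuple[str, float]]


def is_blocked(url: str, blocklist: Set[str]) -> bool:
    """True if the URL's host, or any parent domain of it, is blocklisted."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


def keep_document(
    url: Optional[str],
    text: str,
    blocklist: Set[str],
    first_pass: LangID,
    second_pass: LangID,
    target: str = "hi",
    min_conf: float = 0.65,  # hypothetical threshold, not from the card
) -> bool:
    """Apply the URL filter, then require both identifiers to agree on `target`."""
    if url and is_blocked(url, blocklist):
        return False
    for identify in (first_pass, second_pass):
        lang, conf = identify(text)
        if lang != target or conf < min_conf:
            return False
    return True
```

Running the cheap URL check before either language model means blocklisted pages never reach the more expensive classifiers, and requiring agreement between two identifiers trades some recall for precision.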

#### Quality Filtering and Formatting

For the purpose of building a robust Hindi corpus, we define quality based on several heuristic criteria that reflect linguistic coherence and structural integrity. In short, we import the filters developed during the creation of **FineWeb-2** and **MassiveText**, tuning them to be more sensitive to Hindi's linguistic characteristics. After filtering, we apply formatting steps to correct encoding issues, remove personally identifiable information, and eliminate or replace undesirable patterns (e.g., excessive symbols). All of these filters and formatting steps use the standard implementations from the [Datatrove](https://github.com/huggingface/datatrove) library.

#### Deduplication

Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. Hence, to address redundancy, we implement a deduplication pipeline using the MinHash algorithm. Following the FineWeb-2 implementation, we use 14 buckets, 8 hashes per bucket, and 5-grams, employing the xxHash function for hashing.

#### Learned Filtering with LLM-as-a-Judge

Learned filters, often based on large language models, can evaluate text quality more holistically, considering context, coherence, and other nuanced factors that heuristic methods might overlook. Hence, we implemented an LLM-as-a-Judge filtering approach to further enrich our dataset with annotations. Specifically, we used [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to evaluate the quality of documents that passed through the initial heuristic filters. This annotation process involved prompting the model to assess documents based on two criteria: educational quality and toxicity.
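
As a side note on the Deduplication step described earlier, the bucketing scheme (14 buckets of 8 hashes each, computed over 5-grams) can be sketched in plain Python. This is a simplified illustration, not the Datatrove implementation: it assumes whitespace tokenization, substitutes salted `hashlib.blake2b` for xxHash so that it runs with the standard library only, and all function names are hypothetical.

```python
import hashlib
from typing import List, Set, Tuple

NUM_BUCKETS = 14
HASHES_PER_BUCKET = 8
NGRAM = 5
NUM_HASHES = NUM_BUCKETS * HASHES_PER_BUCKET  # 112 hash functions in total


def shingles(text: str, n: int = NGRAM) -> Set[str]:
    """Word n-grams of the text (whitespace tokenization is an assumption)."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> List[int]:
    """MinHash signature: the minimum hash value under each hash function."""
    grams = shingles(text)
    if not grams:
        raise ValueError("cannot compute a signature for an empty document")
    return [min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES)]


def bucket_keys(sig: List[int]) -> List[Tuple[int, ...]]:
    """Split the signature into 14 buckets of 8 values each."""
    return [
        tuple(sig[b * HASHES_PER_BUCKET:(b + 1) * HASHES_PER_BUCKET])
        for b in range(NUM_BUCKETS)
    ]


def is_candidate_pair(a: str, b: str) -> bool:
    """Two documents are duplicate candidates if any bucket matches exactly."""
    keys_a = bucket_keys(signature(a))
    keys_b = bucket_keys(signature(b))
    return any(x == y for x, y in zip(keys_a, keys_b))
```

With this layout, two documents whose 5-gram sets have Jaccard similarity s share a given bucket with probability s^8, so they become candidates with probability 1 - (1 - s^8)^14. Near-identical documents are therefore caught almost always, while loosely similar ones rarely collide.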

The Qwen annotation process produced the following datasets:

- [Polygl0t/hindi-edu-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-edu-qwen-annotations)
- [Polygl0t/hindi-toxicity-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-toxicity-qwen-annotations)

With these annotations, we trained two separate classification models to automate the filtering process for our corpus curation:

- [Polygl0t/hindi-roberta-edu-classifier](https://huggingface.co/Polygl0t/hindi-roberta-edu-classifier) (1 = non-educational, 5 = highly educational)
- [Polygl0t/hindi-roberta-toxicity-classifier](https://huggingface.co/Polygl0t/hindi-roberta-toxicity-classifier) (1 = non-toxic, 5 = highly toxic)

The columns `edu_score` and `toxic_score` in this dataset correspond to the predicted scores from these classifiers, while `edu_int_score` and `toxic_int_score` are the integer-rounded versions of these scores.

### Source Data

We sourced data from a variety of existing datasets available on Hugging Face. These datasets offer a mix of curated and semi-curated content, providing a solid starting foundation for our corpus (see the table below). Unlike web crawls, these datasets often come with some level of quality control, allowing us to bypass some of the initial cleaning steps (e.g., cleaning of HTML tags). However, we also recognized the need to supplement these datasets with more extensive web-crawled data to ensure sufficient coverage and diversity, especially when it comes to adding timely data from recent snapshots of CommonCrawl.

- **Cutoff Date:** December 2025

| Source Type | Dataset / Crawl | License(s) |
|-------------|-----------------|------------|
| Common Crawl | | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [CC-MAIN-2025-30](https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-30/index.html) | |
| | [CC-MAIN-2025-26](https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-26/index.html) | |
| | [CC-MAIN-2025-05](https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-05/index.html) | |
| | [CC-MAIN-2024-51](https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/index.html) | |
| | [CC-MAIN-2023-50](https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/index.html) | |
| | [CC-MAIN-2022-49](https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/index.html) | |
| | [CC-MAIN-2021-49](https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-49/index.html) | |
| | [CC-MAIN-2020-50](https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/index.html) | |
| Hugging Face | | |
| | [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) | [cc0-1.0](https://choosealicense.com/licenses/cc0-1.0/) |
| | [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | [cc-by-sa-3.0](https://spdx.org/licenses/CC-BY-SA-3.0), [gfdl](https://www.gnu.org/licenses/fdl-1.3.html) |
| | [allenai/c4](https://huggingface.co/datasets/allenai/c4) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [statmt/cc100](https://huggingface.co/datasets/statmt/cc100) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [bigscience-data/roots_indic-hi_indic_nlp_corpus](https://huggingface.co/datasets/bigscience-data/roots_indic-hi_indic_nlp_corpus) | [cc-by-nc-4.0](https://spdx.org/licenses/CC-BY-NC-4.0/) |
| | [bigscience-data/roots_indic-hi_wikipedia](https://huggingface.co/datasets/bigscience-data/roots_indic-hi_wikipedia) | [cc-by-sa-3.0](https://spdx.org/licenses/CC-BY-SA-3.0) |
| | [soketlabs/bhasha-wiki](https://huggingface.co/datasets/soketlabs/bhasha-wiki) | [cc-by-sa-3.0](https://spdx.org/licenses/CC-BY-SA-3.0) |
| | [csebuetnlp/xlsum](https://huggingface.co/datasets/csebuetnlp/xlsum) | [cc-by-nc-sa-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0) |
| | [zicsx/OSCAR-2301-Hindi-Cleaned](https://huggingface.co/datasets/zicsx/OSCAR-2301-Hindi-Cleaned) | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) |
| | [djstrong/oscar-small](https://huggingface.co/datasets/djstrong/oscar-small) | [cc0-1.0](https://choosealicense.com/licenses/cc0-1.0/) |
| | [ganeshjcs/hindi-article-generation](https://huggingface.co/datasets/ganeshjcs/hindi-headline-article-generation) | [cc-by-nc-sa-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0) |
| | [Davlan/sib200](https://huggingface.co/datasets/Davlan/sib200) | [cc-by-sa-4.0](https://spdx.org/licenses/CC-BY-SA-4.0) |
| | [Tensoic/Bhandara](https://huggingface.co/datasets/Tensoic/Bhandara) | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) |
| | [OdiaGenAI/health_hindi_200](https://huggingface.co/datasets/OdiaGenAI/health_hindi_200) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) | [cc-by-nc-4.0](https://spdx.org/licenses/CC-BY-NC-4.0/) |
| | [dnyanesh/HindiMathQuest](https://huggingface.co/datasets/dnyanesh/HindiMathQuest) | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) |
| | [KathirKs/fineweb-edu-hindi](https://huggingface.co/datasets/KathirKs/fineweb-edu-hindi) | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) |
| | [HuggingFaceTB/finemath](https://huggingface.co/datasets/HuggingFaceTB/finemath) (11,606 rows translated to Hindi) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |
| | [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) (8,382 rows translated to Hindi) | [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/), [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use) |

#### Who are the source language producers?

All text samples are native to Hindi or translated from other languages into Hindi (slight contamination from other languages should also be expected).

### Annotations

#### Annotation process

Annotations were created using a learned-filtering approach based on large language models (LLM-as-a-Judge). Specifically, we used [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to evaluate the quality of documents that passed through the initial heuristic filters. This annotation process involved prompting the model to assess documents based on two criteria: educational quality and toxicity. With these annotations, we trained two separate classification models to automate the filtering process for our corpus curation.

### Personal and Sensitive Information

This dataset was filtered to remove personally identifiable information (PII) using standard PII detection methods as part of the formatting step in the filtering pipeline.
Moreover, users should be aware that the `excluded` subset contains documents that were filtered out due to high toxicity scores, which may include offensive or sensitive content.

## Considerations for Using the Data

### Social Impact of Dataset

The creation of a large Hindi text corpus has the potential to significantly advance NLP research and applications for Hindi speakers, who represent a substantial portion of the global population. By providing high-quality training data, this dataset can facilitate the development of more accurate and effective language models, which can be used in various applications such as machine translation, sentiment analysis, and information retrieval.

## Additional Information

### Dataset Maintainers

- [Shiza Fatimah](mailto:shizafatimah15@gmail.com).
- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de).

### Licensing Information

Please refer to the individual licenses of the source datasets used to create this corpus, as listed in the "Source Data" section above. The combined dataset does not have a single unified license, and users should ensure compliance with the terms of each source dataset when utilizing this corpus.

### Citation Information

```latex
@misc{shiza2026lilmoo,
  title={{Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  eprint={2603.03508},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03508},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!

Provider:
Polygl0t