Arjun-G-Ravi/malayalam-sangraha

Name: Arjun-G-Ravi/malayalam-sangraha
Creator: Arjun-G-Ravi
Published: 2025-12-07 05:51:49
License: cc-by-4.0 (per the dataset card metadata)

Hugging Face · Updated 2025-12-07 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Arjun-G-Ravi/malayalam-sangraha

Description:

---
license: cc-by-4.0
task_categories:
- text-generation
language:
- ml
pretty_name: human-verified-sangraha-malayalam-dataset
size_categories:
- 1M<n<10M
---

This is a cleaned version of the Malayalam subset of the Sangraha dataset. It contains only the human-verified part of the dataset, which is high-quality data obtained from Indic language PDFs and from transcriptions of Indic language videos, podcasts, movies, courses, etc.

The CSV dataset has around 6.3M rows, amounting to 32.8 GB. I've also removed the doc_id provided in the dataset, making this ideal for pretraining a Malayalam LLM.

For pretraining, I recommend using this dataset along with [Ultimate-malayalam-dataset](https://huggingface.co/datasets/Arjun-G-Ravi/Ultimate-Malayalam-Dataset) for more diverse data.

# Credits

@article{khan2024indicllmsuite,
  title = {IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages},
  author = {Mohammed Safi Ur Rahman Khan and Priyam Mehta and Ananth Sankar and Umashankar Kumaravelan and Sumanth Doddapaneni and Suriyaprasaad G and Varun Balan G and Sparsh Jain and Anoop Kunchukuttan and Pratyush Kumar and Raj Dabre and Mitesh M. Khapra},
  year = {2024},
  journal = {arXiv preprint arXiv: 2403.06350}
}
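Because the CSV is about 32.8 GB, streaming the data may be more practical than downloading it in full. The following is a minimal sketch, not the author's own loading code. It assumes the Hugging Face `datasets` package is installed, that the default split is named `train`, and that the text lives in a column named `text`. Neither the split name nor the column name is stated in the dataset card, so check both before relying on them. The helper names `iter_texts`, `stream_malayalam_sangraha`, and `preview` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping


def iter_texts(
    rows: Iterable[Mapping[str, object]],
    text_column: str = "text",  # assumed column name, not confirmed by the card
) -> Iterator[str]:
    """Yield stripped, non-empty strings from `text_column` of each row."""
    for row in rows:
        value = row.get(text_column)
        if isinstance(value, str) and value.strip():
            yield value.strip()


def stream_malayalam_sangraha(split: str = "train"):
    """Open the dataset in streaming mode so rows are fetched lazily.

    The split name "train" is an assumption; adjust it if the repository
    exposes a different split.
    """
    # Imported lazily so iter_texts can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "Arjun-G-Ravi/malayalam-sangraha", split=split, streaming=True
    )


def preview(n: int = 3, max_chars: int = 200) -> None:
    """Print the first `max_chars` characters of the first `n` documents."""
    rows = stream_malayalam_sangraha()
    for text in islice(iter_texts(rows), n):
        print(text[:max_chars])
```

Calling `preview()` needs network access to the Hugging Face Hub. If the printed output is empty, the assumed column name is probably wrong; inspecting one raw row from `stream_malayalam_sangraha()` shows the actual keys.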

Provided by:

Arjun-G-Ravi
