
Arjun-G-Ravi/malayalam-sangraha

Source: Hugging Face. Updated 2025-12-07. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Arjun-G-Ravi/malayalam-sangraha
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- ml
pretty_name: human-verified-sangraha-malayalam-dataset
size_categories:
- 1M<n<10M
---

This is a cleaned version of the Malayalam subset of the Sangraha dataset. It contains only the human-verified part of the dataset, which is high-quality data obtained from Indic-language PDFs and from transcriptions of Indic-language videos, podcasts, movies, courses, etc.

The CSV dataset has around 6.3M rows, amounting to 32.8 GB. I've also removed the doc_id field provided in the original dataset, which makes this well suited to pretraining a Malayalam LLM.

For pretraining, I recommend using this dataset along with [Ultimate-malayalam-dataset](https://huggingface.co/datasets/Arjun-G-Ravi/Ultimate-Malayalam-Dataset) for more diverse data.

# Credits

@article{khan2024indicllmsuite,
  title = {IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages},
  author = {Mohammed Safi Ur Rahman Khan and Priyam Mehta and Ananth Sankar and Umashankar Kumaravelan and Sumanth Doddapaneni and Suriyaprasaad G and Varun Balan G and Sparsh Jain and Anoop Kunchukuttan and Pratyush Kumar and Raj Dabre and Mitesh M. Khapra},
  year = {2024},
  journal = {arXiv preprint arXiv: 2403.06350}
}
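
Because the CSV is about 32.8 GB, loading it into memory in one go is unlikely to work on most machines. The following is a minimal sketch of reading such a file in fixed-size batches using only the Python standard library. The column name `text`, the helper name `iter_text_batches`, and the batch size are assumptions made for illustration; the description above does not state the actual column names, so check the CSV header before using it.

```python
import csv
import io
from typing import Iterable, Iterator, List


def iter_text_batches(
    lines: Iterable[str],
    text_column: str = "text",  # assumed column name, not confirmed by the dataset card
    batch_size: int = 1000,
) -> Iterator[List[str]]:
    """Yield lists of non-empty text rows without holding the whole CSV in memory."""
    reader = csv.DictReader(lines)
    batch: List[str] = []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip rows whose text field is empty or missing
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Tiny in-memory stand-in for the real file, including one empty row.
sample = io.StringIO('text\nആദ്യ വാക്യം\n""\nരണ്ടാം വാക്യം\nമൂന്നാം വാക്യം\n')
batches = list(iter_text_batches(sample, batch_size=2))
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` in place of `sample`. Very long documents can exceed the `csv` module's default field size limit, in which case `csv.field_size_limit` may need to be raised first.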
Provider:
Arjun-G-Ravi