Arjun-G-Ravi/Ultimate-Malayalam-Dataset
Hugging Face · Updated 2024-04-18 · Indexed 2024-04-19
Download link:
https://hf-mirror.com/datasets/Arjun-G-Ravi/Ultimate-Malayalam-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- summarization
language:
- ml
pretty_name: Malayalam dataset
size_categories:
- 1M<n<10M
---
# About
This dataset contains more than 6 million lines of Malayalam text. It was created by combining several other Malayalam datasets and cleaning the result.
It is well suited to training or fine-tuning a large language model on the Malayalam language.
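
As a rough illustration of how the text might be pulled in for training, the sketch below streams rows with the Hugging Face `datasets` library and applies a light cleanup pass. The split name `train` and the column name `text` are assumptions, not confirmed by this card, so check the dataset viewer before relying on them. The `clean_lines` helper is hypothetical and is not the cleaning procedure used to build the dataset; it only shows one simple way to normalize whitespace and drop exact duplicates.

```python
from typing import Iterable, Iterator


def clean_lines(lines: Iterable[str], min_chars: int = 1) -> Iterator[str]:
    """Collapse whitespace, skip lines shorter than min_chars, drop exact duplicates.

    Hypothetical helper for illustration only; order of first occurrence is kept.
    """
    seen = set()
    for line in lines:
        text = " ".join(line.split())  # trims ends and collapses inner whitespace
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        yield text


def stream_malayalam_lines(
    repo_id: str = "Arjun-G-Ravi/Ultimate-Malayalam-Dataset",
    split: str = "train",      # assumption: the split name is not stated in the card
    text_field: str = "text",  # assumption: the column name is not stated in the card
) -> Iterator[str]:
    """Yield raw text rows without downloading the whole dataset first."""
    # Imported lazily so the helper above works without the third-party package.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split, streaming=True)
    for row in dataset:
        yield row[text_field]


if __name__ == "__main__":
    # Requires `pip install datasets` and network access.
    for i, line in enumerate(clean_lines(stream_malayalam_lines())):
        print(line)
        if i >= 4:  # peek at the first five cleaned lines only
            break
```

Streaming avoids materializing several million lines on disk before inspecting them. Note that the duplicate filter keeps every seen line in memory, so for a full pass over the dataset a hash-based or on-disk approach may be preferable.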
# Credits
- https://www.kaggle.com/datasets/disisbig/malyalam-news-dataset
- https://www.kaggle.com/datasets/parvmodi/english-to-malayalam-machine-translation-dataset?select=train.ml
- https://www.kaggle.com/datasets/akhisreelibra/malayalam-news
- https://www.kaggle.com/datasets/vigneshvit/malayalam-news-dataset
- https://huggingface.co/datasets/rajeshradhakrishnan/malayalam_wiki
- https://huggingface.co/datasets/Sakshamrzt/IndicNLP-Malayalam?row=17
- https://www.kaggle.com/datasets/disisbig/malayalam-wikipedia-articles
Provided by:
Arjun-G-Ravi
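Summary of original information
Dataset overview
Basic information
- License: MIT
- Task categories:
  - Text generation
  - Summarization
- Language: Malayalam (ml)
- Size: 1M<n<10M
- Pretty name: Malayalam dataset
Dataset description
- Content: more than 6 million lines of Malayalam text.
- Source: created by combining and cleaning several Malayalam datasets.
- Use: suited to training or fine-tuning large language models on Malayalam.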



