malaysia-ai/dedup-text-dataset
Source: Hugging Face. Updated 2024-06-13; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/malaysia-ai/dedup-text-dataset
Resource description:
---
language:
- ms
- en
- zh
- ta
---
## Dataset Introduction
This dataset is a collection of Malaysian texts in Malay, English, Chinese, and Tamil, gathered by Malaysia AI volunteers by crawling Malaysian websites.
The dataset contains approximately 250 GB of text and has been deduplicated.
## Project Link
To learn more about the ongoing project and updates related to this dataset, visit the project board on GitHub:
https://github.com/users/huseinzol05/projects/1/views/1
## GitHub Repo
Our data preprocessing and deduplication processes are transparent and open for review.
You can find the code and documentation for these processes at https://github.com/malaysia-ai/text-dataset-dedup
## Data Format
The entire dataset is standardized in JSONL (JSON Lines) format, with each line containing one text snippet.
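
As a minimal sketch of reading this format, the Python snippet below iterates over JSONL lines and yields one snippet per line. The exact record shape is not specified here, so the code assumes each line is either a bare JSON string or a JSON object with the snippet under a key; the key name `text` and the helper name `iter_snippets` are hypothetical and should be checked against the actual files.

```python
import json
from typing import Iterable, Iterator


def iter_snippets(lines: Iterable[str], text_key: str = "text") -> Iterator[str]:
    """Yield one text snippet per non-empty JSONL line.

    `text_key` is an assumed field name, not confirmed by the dataset card.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, str):
            # The line is a bare JSON string.
            yield record
        elif isinstance(record, dict) and isinstance(record.get(text_key), str):
            # The line is a JSON object holding the snippet under `text_key`.
            yield record[text_key]
        # Records of any other shape are skipped.


# Small in-memory sample standing in for the lines of a JSONL file.
sample = [
    '"Selamat pagi"',
    '{"text": "Good morning"}',
    "",
]
snippets = list(iter_snippets(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which lets large files be streamed line by line rather than loaded into memory at once.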
Provider:
malaysia-ai