
mesolitica/fineweb-filter-malaysian-context

Source: Hugging Face. Updated 2024-08-13. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/mesolitica/fineweb-filter-malaysian-context
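Overview:
This dataset is a subset of the original FineWeb dataset, filtered for Malaysia-related content using the keywords {malay, malaysia, melayu, bursa, ringgit}. It totals 174B tokens and is intended as a specialized corpus for pretraining, continued pretraining, or generating synthetic datasets.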

We filter the original FineWeb dataset, which contains more than 15T tokens, using simple Malaysian keywords: {malay, malaysia, melayu, bursa, ringgit}. The filtering ran on an r5.16xlarge EC2 instance for 7 days. The filtered dataset totals 174,102,784,199 tokens (about 174B), counted with the encoding returned by `tiktoken.encoding_for_model("gpt2")` on a c7a.24xlarge EC2 instance in 1 hour. The goal is to let anyone use this filtered corpus for pretraining, continued pretraining, or generating synthetic datasets.
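The keyword filter and token count described above might look roughly like the sketch below. The description does not say how matching was done, so case-insensitive substring matching is an assumption here, and the function names (`matches_keywords`, `filter_documents`, `count_tokens`) are hypothetical. The tokenizer is passed in as a callable so the sketch runs without extra dependencies; a whitespace split stands in for the real tokenizer in the sample run.

```python
from typing import Callable, Iterable, Iterator, List

# Keywords as listed in the dataset description.
KEYWORDS = ("malay", "malaysia", "melayu", "bursa", "ringgit")


def matches_keywords(text: str, keywords: Iterable[str] = KEYWORDS) -> bool:
    """Return True if any keyword occurs in the text.

    Assumption: case-insensitive substring matching. The original
    filtering rule may differ (for example, whole-word matching).
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_documents(docs: Iterable[str]) -> Iterator[str]:
    """Lazily yield only the documents that match at least one keyword."""
    for doc in docs:
        if matches_keywords(doc):
            yield doc


def count_tokens(docs: Iterable[str], encode: Callable[[str], List]) -> int:
    """Sum the token counts of all documents using the given encoder."""
    return sum(len(encode(doc)) for doc in docs)


sample = [
    "The ringgit strengthened against the dollar.",
    "A recipe for sourdough bread.",
    "Bursa Malaysia closed higher today.",
]
kept = list(filter_documents(sample))

# Whitespace splitting is only a stand-in tokenizer for this illustration.
total = count_tokens(kept, str.split)
print(len(kept), total)
```

To approximate the count reported in the description, the `encode` method of the object returned by `tiktoken.encoding_for_model("gpt2")` could be passed as `encode`, assuming the `tiktoken` package is installed.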
Provider:
mesolitica