mesolitica/fineweb-filter-malaysian-context
Hugging Face | Updated 2024-08-13 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/mesolitica/fineweb-filter-malaysian-context
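Description:
This dataset is a subset of the original FineWeb dataset, filtered for Malaysia-related content using the keywords {malay, malaysia, melayu, bursa, ringgit}. It totals 174B tokens and is intended as a specialized corpus for pretraining, continued pretraining, or generating synthetic datasets.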
We filter the original FineWeb dataset, which consists of more than 15T tokens, using simple Malaysian keywords: {malay, malaysia, melayu, bursa, ringgit}. The filtered dataset totals 174102784199 tokens, i.e., about 174B tokens. Filtering ran on an r5.16xlarge EC2 instance for 7 days. The total token count was computed with the encoding returned by `tiktoken.encoding_for_model("gpt2")`, on a c7a.24xlarge EC2 instance, and took 1 hour. The goal is to let anyone use this filtered corpus for pretraining, continued pretraining, or generating synthetic datasets.
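The keyword filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the description does not say how keywords were matched, so case-insensitive substring matching is an assumption here, and the function names `has_malaysian_context` and `filter_documents` as well as the `"text"` field name are hypothetical.

```python
import re

# Keywords listed in the dataset description.
KEYWORDS = ("malay", "malaysia", "melayu", "bursa", "ringgit")

# Assumption: case-insensitive substring matching.
# The original filtering job may have matched differently.
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def has_malaysian_context(text: str) -> bool:
    """Return True if any keyword occurs anywhere in the text."""
    return _PATTERN.search(text) is not None


def filter_documents(docs):
    """Yield only the documents whose "text" field contains a keyword."""
    for doc in docs:
        if has_malaysian_context(doc.get("text", "")):
            yield doc


if __name__ == "__main__":
    sample = [
        {"text": "Bursa Malaysia closed higher on Friday."},
        {"text": "A recipe for sourdough bread."},
        {"text": "The ringgit strengthened against the dollar."},
    ]
    for doc in filter_documents(sample):
        print(doc["text"])
```

Under substring matching, `malaysia` is redundant because every text containing it also contains `malay`. A filter this simple also admits false positives, for example documents about the Turkish city of Bursa.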
Provider:
mesolitica



