fineweb-filter-malaysian-context

Name: fineweb-filter-malaysian-context
Creator: maas
Published: 2025-12-05 11:40:50
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/mesolitica/fineweb-filter-malaysian-context


Resource description:

# HuggingFaceFW/fineweb filter Malaysian context

## What is it?

We filter the original 🍷 FineWeb dataset, which consists of more than **15T tokens**, on simple Malaysian keywords. The filtered dataset totals 174102784199 tokens, or **174B tokens**.

## How we do it?

1. We filter rows using the keywords `{'malay', 'malaysia', 'melayu', 'bursa', 'ringgit'}` on an r5.16xlarge EC2 instance for 7 days.
2. We calculate total tokens using `tiktoken.encoding_for_model("gpt2")` on a c7a.24xlarge EC2 instance for 1 hour.

Source code is at https://github.com/mesolitica/malaysian-dataset/tree/master/corpus/fineweb

## Why we do it?

So anybody can use this filtered corpus to pretrain, continue pretraining, or generate synthetic datasets for their own use cases.
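The filtering and token-counting steps described in the card could look roughly like the sketch below. It is not the authors' code (that lives in the linked repository). The sketch assumes case-insensitive substring matching and a `text` field on each row, neither of which the card specifies. The names `has_malaysian_keyword`, `filter_rows`, and `count_tokens` are hypothetical.

```python
import re
from typing import Callable, Iterable, Iterator, Sequence

# Keyword set quoted in the dataset card.
KEYWORDS = {"malay", "malaysia", "melayu", "bursa", "ringgit"}

# Assumption: case-insensitive substring matching. The card does not say
# whether the authors matched substrings or whole words.
_PATTERN = re.compile("|".join(re.escape(k) for k in sorted(KEYWORDS)), re.IGNORECASE)


def has_malaysian_keyword(text: str) -> bool:
    """Return True if any keyword occurs anywhere in the text."""
    return _PATTERN.search(text) is not None


def filter_rows(rows: Iterable[dict], text_field: str = "text") -> Iterator[dict]:
    """Yield only the rows whose text field contains a keyword.

    Assumption: each row is a dict with its document body under `text_field`.
    """
    for row in rows:
        if has_malaysian_keyword(row.get(text_field) or ""):
            yield row


def count_tokens(
    rows: Iterable[dict],
    encode: Callable[[str], Sequence],
    text_field: str = "text",
) -> int:
    """Sum the token counts of all rows using the supplied encoder."""
    return sum(len(encode(row[text_field])) for row in rows)


if __name__ == "__main__":
    sample = [
        {"text": "Bursa Malaysia closed higher"},
        {"text": "Weather in Paris"},
        {"text": "paid in RINGGIT"},
    ]
    kept = list(filter_rows(sample))
    # str.split is a stand-in encoder so the sketch runs without extra packages.
    print(len(kept), count_tokens(kept, str.split))
```

To count with the tokenizer named in the card, pass the `encode` method of `tiktoken.encoding_for_model("gpt2")` in place of `str.split`. By default that method raises an error on text containing special-token strings such as `<|endoftext|>`, so `encode_ordinary` may be the safer choice for raw web text.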


Provider:

maas

Created:

2025-10-04

Collected summary

Dataset introduction