
malaysia-ai/mosaic-extra

Source: Hugging Face | Updated: 2023-11-28 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/malaysia-ai/mosaic-extra
Resource description:
---
language:
- ms
---

# Mosaic format for extra dataset to train Malaysian LLM

This repository stores dataset shards in Mosaic format.

1. Prepared at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/pretrain-llm/combine-extra.ipynb
2. Tokenized with https://huggingface.co/malaysia-ai/bpe-tokenizer
3. 4096 context length.

## how-to

1. git clone,

```bash
git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-extra
```

2. load it,

```python
from streaming import LocalDataset
import numpy as np
from streaming.base.format.mds.encodings import Encoding, _encodings

class UInt16(Encoding):
    # Token ids are stored as raw uint16 bytes.
    def encode(self, obj) -> bytes:
        return obj.tobytes()

    def decode(self, data: bytes):
        return np.frombuffer(data, np.uint16)

# Register the custom encoding before opening the dataset.
_encodings['uint16'] = UInt16

dataset = LocalDataset('mosaic-extra')
len(dataset)
```
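To make the custom `UInt16` encoding concrete, here is a minimal sketch of the same byte round trip using NumPy alone, without the `streaming` package. The helper names `encode_uint16` and `decode_uint16` and the sample token ids are hypothetical; only the `tobytes` / `np.frombuffer` calls come from the snippet above.

```python
import numpy as np

def encode_uint16(token_ids: np.ndarray) -> bytes:
    """Serialize token ids as raw bytes, mirroring UInt16.encode."""
    return token_ids.tobytes()

def decode_uint16(data: bytes) -> np.ndarray:
    """Reinterpret raw bytes as uint16 values, mirroring UInt16.decode."""
    return np.frombuffer(data, np.uint16)

# Hypothetical token ids; the array must already be uint16 before encoding,
# otherwise decode would misread the wider elements.
sample = np.array([0, 1, 32000, 65535], dtype=np.uint16)

payload = encode_uint16(sample)
restored = decode_uint16(payload)

print(len(payload))        # 8, i.e. 2 bytes per token
print(restored.tolist())   # [0, 1, 32000, 65535]
```

Note that `np.frombuffer` on a `bytes` object returns a read-only array, so call `.copy()` (or `.astype(...)`) before modifying decoded token ids in place.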
Provider:
malaysia-ai
Summary of original information

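Dataset overview

Dataset description

This dataset is used to train a Malaysian large language model (LLM). The data shards are stored in Mosaic format.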

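Dataset preparation

The shards were prepared with the combine-extra.ipynb notebook in malaysia-ai/dedup-text-dataset, tokenized with malaysia-ai/bpe-tokenizer, at a context length of 4096.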

Dataset usage

  1. Clone the dataset:

     ```bash
     git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-extra
     ```

  2. Load the dataset:

     ```python
     from streaming import LocalDataset
     import numpy as np
     from streaming.base.format.mds.encodings import Encoding, _encodings

     class UInt16(Encoding):
         def encode(self, obj) -> bytes:
             return obj.tobytes()

         def decode(self, data: bytes):
             return np.frombuffer(data, np.uint16)

     # The dictionary key and the dataset path are string literals.
     _encodings['uint16'] = UInt16

     dataset = LocalDataset('mosaic-extra')
     len(dataset)
     ```
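Once the dataset loads, a quick sanity check on the samples can catch a missing encoding registration or a wrong path early. The sketch below is illustrative only: the column name `input_ids`, the assumption that every sample holds exactly 4096 tokens, and the helpers `validate_sample` and `total_tokens` are not confirmed by the source. It runs against a stand-in sample so it does not need the real shards.

```python
import numpy as np

CONTEXT_LENGTH = 4096  # context length stated in the dataset description

def validate_sample(sample: dict, key: str = "input_ids") -> np.ndarray:
    """Check one decoded sample and widen it for model input.

    `key` is an assumed column name, not confirmed by the source.
    """
    ids = np.asarray(sample[key])
    if ids.dtype != np.uint16:
        raise TypeError(f"expected uint16 token ids, got {ids.dtype}")
    if ids.shape != (CONTEXT_LENGTH,):
        raise ValueError(f"expected {CONTEXT_LENGTH} tokens, got shape {ids.shape}")
    # Widen to int64, a common index dtype; astype also yields a writable copy.
    return ids.astype(np.int64)

def total_tokens(num_samples: int) -> int:
    """Token count, assuming every sample is a full context window."""
    return num_samples * CONTEXT_LENGTH

# Stand-in for a real sample; with the actual data this would be
# validate_sample(dataset[0]) and total_tokens(len(dataset)).
fake_sample = {"input_ids": np.arange(CONTEXT_LENGTH, dtype=np.uint16)}
ids = validate_sample(fake_sample)
print(ids.dtype, ids.shape)
print(total_tokens(1000))
```

If the real column name differs, pass it through the `key` argument rather than editing the function.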
