malaysia-ai/mosaic-dedup-text-dataset-filtered
Source: Hugging Face. Updated 2023-12-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/malaysia-ai/mosaic-dedup-text-dataset-filtered
Description:
---
language:
- ms
---
# Mosaic format for filtered dedup text dataset to train Malaysian LLM
This repository stores the dataset shards in Mosaic (MDS) format.
1. Prepared with the notebook at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/pretrain-llm/combine-dedup-text-dataset-filtered-4096.ipynb
2. Tokenized with https://huggingface.co/malaysia-ai/bpe-tokenizer
3. Context length of 4096 tokens.
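
The linked notebook is the authoritative record of how the shards were built. As a rough illustration only, fixed-length packing at a 4096 context length can look like the sketch below. The function name `pack_into_blocks`, the sample stream, and the choice to drop the trailing partial block are all assumptions made for this example, not details taken from the notebook.

```python
import numpy as np


def pack_into_blocks(token_ids: np.ndarray, block_size: int = 4096) -> np.ndarray:
    """Split a 1-D token stream into rows of exactly `block_size` tokens."""
    n_blocks = len(token_ids) // block_size
    # Assumption: the trailing partial block is dropped rather than padded.
    return token_ids[: n_blocks * block_size].reshape(n_blocks, block_size)


# Hypothetical stream of 10,000 token IDs, used only for illustration.
stream = np.arange(10_000, dtype=np.uint16)
blocks = pack_into_blocks(stream)
print(blocks.shape)  # (2, 4096)
```

Each row would then be one training sample of 4096 token IDs; the remaining 1,808 tokens of this toy stream are discarded under the stated assumption.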
## how-to
1. Clone the repository:
```bash
git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-dedup-text-dataset-filtered
```
2. Load it:
```python
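import numpy as np
from streaming import LocalDataset
from streaming.base.format.mds.encodings import Encoding, _encodings


class UInt16(Encoding):
    def encode(self, obj) -> bytes:
        # Serialize a NumPy array to its raw byte buffer.
        return obj.tobytes()

    def decode(self, data: bytes):
        # Reinterpret the raw bytes as a 1-D uint16 array.
        return np.frombuffer(data, np.uint16)


# Register the encoding under the name 'uint16' so MDS can look it up.
_encodings['uint16'] = UInt16

# Point LocalDataset at the local directory that holds the shards.
dataset = LocalDataset('mosaic-dedup-text-dataset-filtered')
len(dataset)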
```
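
To see what the `UInt16` encoding does without downloading any shards, the round trip can be reproduced with NumPy alone. This is a minimal sketch: the token values are made up for illustration, and it assumes the array is already `uint16` before encoding, since `tobytes()` writes whatever dtype it is given.

```python
import numpy as np

# Hypothetical token IDs for illustration; real samples come from the dataset.
tokens = np.array([0, 17, 4095, 65535], dtype=np.uint16)

raw = tokens.tobytes()                    # same operation as UInt16.encode
restored = np.frombuffer(raw, np.uint16)  # same operation as UInt16.decode

print(len(raw))                           # 8, i.e. 2 bytes per token
print(np.array_equal(tokens, restored))   # True
print(restored.flags.writeable)           # False: frombuffer on bytes is read-only

# Copy before modifying a decoded sample in place.
editable = restored.copy()
```

Storing IDs as `uint16` limits them to the range 0 to 65535 and keeps each token at two bytes on disk.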
Provider:
malaysia-ai
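Summary of original information
Dataset overview
Dataset name
Mosaic format for filtered dedup text dataset to train Malaysian LLM
Dataset description
This dataset stores data shards in Mosaic format for training a Malaysian large language model.
Dataset preparation
- Preparation notebook: combine-dedup-text-dataset-filtered-4096.ipynb
- Tokenizer: malaysia-ai/bpe-tokenizer
- Context length: 4096.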



