malaysia-ai/mosaic-embedding-pairs
Source: Hugging Face | Updated: 2023-12-03 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/malaysia-ai/mosaic-embedding-pairs
Resource description:
---
language:
- ms
---
# Mosaic format for embedding task text pair dataset
This repository stores dataset shards in the Mosaic (MDS) format.
1. Prepared at https://github.com/mesolitica/llama2-embedding/blob/main/notebooks/combine-embedding.ipynb
## how-to
1. Clone the repository (requires Git LFS),
```bash
git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-embedding-pairs
```
2. Load it with `LocalDataset`, registering the custom `liststr` encoding first,
```python
import json
from streaming import LocalDataset
from streaming.base.format.mds.encodings import Encoding, _encodings

class ListStr(Encoding):
    """Custom MDS encoding: a Python list stored as JSON bytes."""
    def encode(self, obj):
        return json.dumps(obj).encode()

    def decode(self, data):
        return json.loads(data)

# Register the encoding before opening the dataset.
_encodings['liststr'] = ListStr

# Directory created by the clone step above.
dataset = LocalDataset('mosaic-embedding-pairs')
len(dataset)
```
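
To make the role of the `liststr` encoding clearer, here is a minimal standalone sketch of the same encode/decode round trip. It deliberately does not import `streaming`, so `ListStrCodec` is a hypothetical stand-in for the `ListStr` class above, and the sample strings are invented for illustration rather than taken from the dataset.

```python
import json


class ListStrCodec:
    """Hypothetical stand-in mirroring ListStr, without the streaming base class."""

    def encode(self, obj):
        # Serialise the list to a JSON string, then to UTF-8 bytes.
        return json.dumps(obj).encode()

    def decode(self, data):
        # json.loads accepts bytes directly and returns the original list.
        return json.loads(data)


codec = ListStrCodec()

# Invented example values; the real column contents are not shown in this card.
texts = ["ayat pertama", "ayat kedua"]

raw = codec.encode(texts)      # bytes as they would be written to a shard
restored = codec.decode(raw)   # list recovered when a sample is read back
```

Because the stored bytes are plain JSON, the encoding has to be registered under the same name (`liststr`) before the shards are opened, otherwise the reader has no decoder for that column type.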
Provider:
malaysia-ai
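Summary of original information
Mosaic format for embedding task text pair dataset
Dataset description
This dataset stores shards of text-pair data for embedding tasks in the Mosaic format.
Dataset preparation
The preparation process is documented in: combine-embedding.ipynb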



