minishlab/tokenlearn-c4-multilingual-bge-m3
Source: Hugging Face · Updated 2026-03-27 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/minishlab/tokenlearn-c4-multilingual-bge-m3
Resource description:
---
dataset_info:
- config_name: 10m
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 0
    num_examples: 9999954
- config_name: 2m
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 11700400216
    num_examples: 1999949
  download_size: 11384180946
  dataset_size: 11700400216
configs:
- config_name: 10m
  data_files:
  - split: train
    path: 10m/train-*
  default: true
- config_name: 2m
  data_files:
  - split: train
    path: 2m/train-*
---
# minishlab/tokenlearn-c4-multilingual-bge-m3 Dataset Card
This dataset was created with [Tokenlearn](https://github.com/MinishLab/tokenlearn) for training [Model2Vec](https://github.com/MinishLab/model2vec) models. It contains mean token embeddings produced by a sentence transformer, used as training targets for static embedding distillation.
## Dataset Details
| Field | Value |
|---|---|
| **Source dataset** | [allenai/c4](https://huggingface.co/datasets/allenai/c4) |
| **Source split** | `train` |
| **Embedding model** | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) |
| **Embedding dimension** | 1024 |
| **Rows** | 9,999,954 (`10m`, default config); 1,999,949 (`2m`) |
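The metadata reports `num_bytes: 0` for the `10m` config, so a rough size estimate can help with capacity planning. The sketch below multiplies the row counts by the embedding dimension and the 4-byte width of `float32`. It counts raw embedding values only, not the `text` column or any file-format overhead, and the helper name `embedding_bytes` is made up for illustration.

```python
# Figures taken from the dataset card: 1024-dimensional float32 embeddings.
EMBEDDING_DIM = 1024
BYTES_PER_FLOAT32 = 4
ROWS = {"10m": 9_999_954, "2m": 1_999_949}


def embedding_bytes(rows: int, dim: int = EMBEDDING_DIM) -> int:
    """Bytes needed to hold `rows` dense float32 vectors of length `dim`."""
    return rows * dim * BYTES_PER_FLOAT32


sizes = {name: embedding_bytes(n) for name, n in ROWS.items()}
for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.2f} GB of raw float32 embeddings")
```

For the `2m` config this gives about 8.19 GB, which is consistent with the listed `dataset_size` of 11,700,400,216 bytes once text and storage overhead are added. By the same arithmetic, the `10m` embeddings alone would need about 40.96 GB.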
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `text` | `string` | Truncated input text |
| `embedding` | `list[float32]` | Mean token embedding from `BAAI/bge-m3`, excluding BOS/EOS tokens |
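To make the `embedding` column concrete, here is a minimal NumPy sketch of mean pooling that skips the BOS and EOS positions. It assumes a single unpadded sequence with BOS in the first row and EOS in the last. The function name and toy values are illustrative, and this is not Tokenlearn's actual implementation.

```python
import numpy as np


def mean_token_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """Average per-token vectors, skipping the first (BOS) and last (EOS) rows.

    Assumes shape (num_tokens, dim) for one unpadded sequence.
    """
    if token_embeddings.shape[0] < 3:
        raise ValueError("need at least one token between BOS and EOS")
    return token_embeddings[1:-1].mean(axis=0)


# Toy input: BOS, two content tokens, EOS, with dimension 3 instead of 1024.
toy = np.array(
    [
        [9.0, 9.0, 9.0],  # BOS (ignored)
        [1.0, 2.0, 3.0],
        [3.0, 4.0, 5.0],
        [7.0, 7.0, 7.0],  # EOS (ignored)
    ],
    dtype=np.float32,
)
pooled = mean_token_embedding(toy)
print(pooled)  # mean of the two middle rows
```

With batched, padded input, the padding positions would also need to be masked out before averaging.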
## Usage
Load with the `datasets` library:
```python
from datasets import load_dataset

# With no config name given, this loads the default `10m` config.
dataset = load_dataset("minishlab/tokenlearn-c4-multilingual-bge-m3")
```
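Because the card defines two configs, it can be useful to pick one explicitly and stream it rather than download everything first. The wrapper below is a hypothetical convenience helper, not part of Tokenlearn. It relies only on the `name`, `split`, and `streaming` arguments of `datasets.load_dataset`, and the row counts are copied from the card metadata.

```python
# Row counts per config, copied from the dataset card metadata.
CONFIG_ROWS = {"10m": 9_999_954, "2m": 1_999_949}

DATASET_ID = "minishlab/tokenlearn-c4-multilingual-bge-m3"


def load_tokenlearn_config(config: str = "2m", streaming: bool = True):
    """Load the train split of one config; streaming avoids a full download."""
    if config not in CONFIG_ROWS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_ROWS)}")
    # Imported lazily so the helper can be defined without the dependency installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split="train", streaming=streaming)
```

Calling `next(iter(load_tokenlearn_config("2m")))` should then yield a single record with `text` and `embedding` fields, assuming network access to the Hub.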
Train a Model2Vec model on this dataset using Tokenlearn:
```bash
python -m tokenlearn.train \
--model-name "BAAI/bge-m3" \
--data-path "minishlab/tokenlearn-c4-multilingual-bge-m3" \
--save-path "<path-to-save-model>"
```
## Creation
Both the 10M and 2M datasets were generated with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) on [multilingual C4](https://huggingface.co/datasets/allenai/c4) using Tokenlearn, across all C4 language subsets with temperature-smoothed sampling.
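The card does not state the exact sampling formula or temperature. As an illustration only, the sketch below uses one common form of temperature smoothing, where each language's share of the corpus is raised to an exponent `alpha` (equivalently `1/T`) and renormalized. The language counts are made up and the function is hypothetical, not Tokenlearn's code.

```python
def smoothed_sampling_probs(counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Return per-language sampling probabilities proportional to share ** alpha.

    alpha = 1.0 reproduces the raw proportions; smaller values flatten the
    distribution, giving low-resource languages a larger share.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Made-up document counts for three languages.
counts = {"en": 900, "nl": 90, "sw": 10}
probs = smoothed_sampling_probs(counts, alpha=0.5)
for lang, p in probs.items():
    print(f"{lang}: raw={counts[lang] / 1000:.3f} smoothed={p:.3f}")
```

In this toy case the smallest language ends up sampled well above its raw 1% share, while the largest drops below 90%.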
## Library Authors
Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
## Citation
```
@software{minishlab2024model2vec,
author = {Stephan Tulkens and {van Dongen}, Thomas},
title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.17270888},
url = {https://github.com/MinishLab/model2vec},
license = {MIT}
}
```
Provided by:
minishlab



