minishlab/tokenlearn-c4-en-bge-base-en-v1.5
Source: Hugging Face. Updated 2026-03-27; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/minishlab/tokenlearn-c4-en-bge-base-en-v1.5
Description:
---
language: en
tags:
  - tokenlearn
  - embeddings
  - model2vec
configs:
  - config_name: 10m
    data_files:
      - split: train
        path: 10m/train-*
    default: true
---
# minishlab/tokenlearn-c4-en-bge-base-en-v1.5 Dataset Card
This dataset was created with [Tokenlearn](https://github.com/MinishLab/tokenlearn) for training [Model2Vec](https://github.com/MinishLab/model2vec) models. It contains mean token embeddings produced by a sentence transformer, used as training targets for static embedding distillation.
## Dataset Details
| Field | Value |
|---|---|
| **Source dataset** | [allenai/c4](https://huggingface.co/datasets/allenai/c4) |
| **Source split** | `train` |
| **Embedding model** | [baai/bge-base-en-v1.5](https://huggingface.co/baai/bge-base-en-v1.5) |
| **Embedding dimension** | 768 |
| **Rows** | 10,000,000 |
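
The row count and embedding dimension above imply a rough lower bound on memory if all embeddings are materialized at once. The sketch below is a back-of-the-envelope estimate only: it assumes 4 bytes per value (the `embedding` column is documented as `float32`), ignores the `text` column, and says nothing about the on-disk size, which depends on the storage format and compression. The function name `raw_embedding_bytes` is hypothetical.

```python
# Back-of-the-envelope size of the raw embedding matrix.
ROWS = 10_000_000        # row count from the details table
DIM = 768                # embedding dimension from the details table
BYTES_PER_FLOAT32 = 4    # assumption: each value is held as a 4-byte float


def raw_embedding_bytes(rows: int = ROWS, dim: int = DIM,
                        itemsize: int = BYTES_PER_FLOAT32) -> int:
    """Bytes needed for a dense rows x dim matrix, ignoring the text column."""
    return rows * dim * itemsize


total = raw_embedding_bytes()
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB)")
```

An estimate of this size suggests that streaming or batched loading may be preferable to holding every embedding in memory.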
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `text` | `string` | Truncated input text |
| `embedding` | `list[float32]` | Mean token embedding from `baai/bge-base-en-v1.5`, excluding BOS/EOS tokens |
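
To illustrate what "mean token embedding, excluding BOS/EOS tokens" means, here is a minimal pure-Python sketch. It assumes the special tokens occupy the first and last positions of an unpadded sequence; the function name `mean_pool_without_special_tokens` and the toy vectors are hypothetical, and Tokenlearn's actual featurization may handle padding, truncation, and batching differently.

```python
from statistics import fmean


def mean_pool_without_special_tokens(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors, skipping the first (BOS) and last (EOS) position."""
    inner = token_vectors[1:-1]
    if not inner:
        raise ValueError("need at least one token between BOS and EOS")
    # zip(*inner) iterates over dimensions, so each fmean averages one coordinate.
    return [fmean(column) for column in zip(*inner)]


# Toy input: 4 positions x 2 dimensions; rows 0 and 3 stand in for BOS and EOS.
toy = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool_without_special_tokens(toy)
print(pooled)  # [2.0, 3.0]
```

In this dataset the same idea would apply to 768-dimensional token vectors, yielding one 768-dimensional vector per text.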
## Usage
Load with the `datasets` library:
```python
from datasets import load_dataset

# Loads the default `10m` config, which has a single `train` split.
dataset = load_dataset("minishlab/tokenlearn-c4-en-bge-base-en-v1.5")
```
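
Once rows are available, it can help to check them against the documented schema before training. The sketch below is one possible way to do that, not part of Tokenlearn: `iter_valid_rows` is a hypothetical helper, the column names `text` and `embedding` and the dimension 768 are taken from the tables in this card, and the sample rows are synthetic.

```python
from typing import Iterable, Iterator

EXPECTED_DIM = 768  # embedding dimension stated in the details table


def iter_valid_rows(rows: Iterable[dict],
                    dim: int = EXPECTED_DIM) -> Iterator[tuple[str, list[float]]]:
    """Yield (text, embedding) pairs, skipping rows that do not match the schema."""
    for row in rows:
        text, embedding = row.get("text"), row.get("embedding")
        if isinstance(text, str) and embedding is not None and len(embedding) == dim:
            yield text, list(embedding)


# Synthetic rows standing in for dataset records.
sample = [
    {"text": "hello world", "embedding": [0.0] * 768},
    {"text": "wrong size", "embedding": [0.0] * 3},
]
valid = list(iter_valid_rows(sample))
print(len(valid))  # 1
```

Because the helper accepts any iterable of dictionaries, it could also be applied to the `train` split loaded with `streaming=True`, which avoids downloading every row before inspection.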
Train a Model2Vec model on this dataset using Tokenlearn:
```bash
python -m tokenlearn.train \
  --model-name "baai/bge-base-en-v1.5" \
  --data-path "minishlab/tokenlearn-c4-en-bge-base-en-v1.5" \
  --save-path "<path-to-save-model>"
```
## Creation
This dataset was created using the `tokenlearn-featurize` CLI:
```bash
python -m tokenlearn.featurize \
  --model-name "baai/bge-base-en-v1.5" \
  --dataset-path "allenai/c4" \
  --dataset-name "en" \
  --dataset-split "train" \
  --output-dir "<output-dir>"
```
## Library Authors
Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team, which consists of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
## Citation
```bibtex
@software{minishlab2024model2vec,
  author    = {Stephan Tulkens and {van Dongen}, Thomas},
  title     = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17270888},
  url       = {https://github.com/MinishLab/model2vec},
  license   = {MIT}
}
```
Provider: minishlab



