
minishlab/tokenlearn-c4-multilingual-bge-m3

Source: Hugging Face. Updated 2026-03-27. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/minishlab/tokenlearn-c4-multilingual-bge-m3
Description:
---
dataset_info:
- config_name: 10m
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 0
    num_examples: 9999954
- config_name: 2m
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 11700400216
    num_examples: 1999949
  download_size: 11384180946
  dataset_size: 11700400216
configs:
- config_name: 10m
  data_files:
  - split: train
    path: 10m/train-*
  default: true
- config_name: 2m
  data_files:
  - split: train
    path: 2m/train-*
---

# minishlab/tokenlearn-c4-multilingual-bge-m3 Dataset Card

This dataset was created with [Tokenlearn](https://github.com/MinishLab/tokenlearn) for training [Model2Vec](https://github.com/MinishLab/model2vec) models. It contains mean token embeddings produced by a sentence transformer, used as training targets for static embedding distillation.

## Dataset Details

| Field | Value |
|---|---|
| **Source dataset** | [allenai/c4](https://huggingface.co/datasets/allenai/c4) |
| **Source split** | `train` |
| **Embedding model** | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) |
| **Embedding dimension** | 1024 |
| **Rows** | 9999954 |

## Dataset Structure

| Column | Type | Description |
|---|---|---|
| `text` | `string` | Truncated input text |
| `embedding` | `list[float32]` | Mean token embedding from `BAAI/bge-m3`, excluding BOS/EOS tokens |

## Usage

Load with the `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("minishlab/tokenlearn-c4-multilingual-bge-m3")
```

Train a Model2Vec model on this dataset using Tokenlearn:

```bash
python -m tokenlearn.train \
  --model-name "BAAI/bge-m3" \
  --data-path "minishlab/tokenlearn-c4-multilingual-bge-m3" \
  --save-path "<path-to-save-model>"
```

## Creation

Both the 10M and 2M datasets were generated with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) on [multilingual C4](https://huggingface.co/datasets/allenai/c4) using Tokenlearn, across all C4 language subsets with temperature-smoothed sampling.

## Library Authors

Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

## Citation

```
@software{minishlab2024model2vec,
  author    = {Stephan Tulkens and {van Dongen}, Thomas},
  title     = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17270888},
  url       = {https://github.com/MinishLab/model2vec},
  license   = {MIT}
}
```
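The card describes the `embedding` column as a mean token embedding that excludes BOS/EOS tokens, and says the language subsets were drawn with temperature-smoothed sampling. The sketch below illustrates both ideas on toy inputs. It is not Tokenlearn's implementation: the function names `mean_token_embedding` and `temperature_smoothed_probs`, the assumption of exactly one BOS row and one EOS row with no padding, the smoothing formula `p_i ∝ (n_i / N) ** (1 / T)`, and the subset sizes are all illustrative assumptions, since the card does not state the exact formula or temperature used.

```python
import numpy as np


def mean_token_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """Average per-token vectors, skipping the first and last rows.

    Assumption: the input has shape (seq_len, dim), with one BOS row at
    index 0, one EOS row at index -1, and no padding rows.
    """
    if token_embeddings.ndim != 2 or token_embeddings.shape[0] < 3:
        raise ValueError("expected shape (seq_len, dim) with seq_len >= 3")
    inner = token_embeddings[1:-1]  # drop the BOS and EOS rows
    return inner.mean(axis=0).astype(np.float32)


def temperature_smoothed_probs(sizes: dict[str, int], temperature: float) -> dict[str, float]:
    """Sampling probabilities p_i proportional to (n_i / N) ** (1 / temperature).

    temperature = 1 keeps the natural proportions; larger values flatten
    the distribution toward uniform, upweighting small subsets.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    weights = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Toy token matrix: rows 0 and 3 stand in for BOS and EOS.
tokens = np.array([[9, 9], [1, 2], [3, 4], [9, 9]], dtype=np.float32)
print(mean_token_embedding(tokens))

# Hypothetical subset sizes, not the real C4 counts.
print(temperature_smoothed_probs({"en": 900, "nl": 100}, temperature=2.0))
```

With these toy sizes, a temperature of 2 moves the smaller subset from 10% of the samples to 25%, which is the general effect smoothing is meant to have on low-resource languages.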
Provider:
minishlab