
minishlab/tokenlearn-c4-en-bge-base-en-v1.5

Source: Hugging Face. Updated 2026-03-27; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/minishlab/tokenlearn-c4-en-bge-base-en-v1.5
Resource description:
---
language: en
tags:
- tokenlearn
- embeddings
- model2vec
configs:
- config_name: 10m
  data_files:
  - split: train
    path: 10m/train-*
  default: true
---

# minishlab/tokenlearn-c4-en-bge-base-v1.5 Dataset Card

This dataset was created with [Tokenlearn](https://github.com/MinishLab/tokenlearn) for training [Model2Vec](https://github.com/MinishLab/model2vec) models. It contains mean token embeddings produced by a sentence transformer, used as training targets for static embedding distillation.

## Dataset Details

| Field | Value |
|---|---|
| **Source dataset** | [allenai/c4](https://huggingface.co/datasets/allenai/c4) |
| **Source split** | `train` |
| **Embedding model** | [baai/bge-base-en-v1.5](https://huggingface.co/baai/bge-base-en-v1.5) |
| **Embedding dimension** | 768 |
| **Rows** | 10,000,000 |

## Dataset Structure

| Column | Type | Description |
|---|---|---|
| `text` | `string` | Truncated input text |
| `embedding` | `list[float32]` | Mean token embedding from `baai/bge-base-en-v1.5`, excluding BOS/EOS tokens |

## Usage

Load with the `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("minishlab/tokenlearn-c4-en-bge-base-v1.5")
```

Train a Model2Vec model on this dataset using Tokenlearn:

```bash
python -m tokenlearn.train \
  --model-name "baai/bge-base-en-v1.5" \
  --data-path "minishlab/tokenlearn-c4-en-bge-base-v1.5" \
  --save-path "<path-to-save-model>"
```

## Creation

This dataset was created using the `tokenlearn-featurize` CLI:

```bash
python -m tokenlearn.featurize \
  --model-name "baai/bge-base-en-v1.5" \
  --dataset-path "allenai/c4" \
  --dataset-name "en" \
  --dataset-split "train" \
  --output-dir "<output-dir>"
```

## Library Authors

Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

## Citation

```
@software{minishlab2024model2vec,
  author = {Stephan Tulkens and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17270888},
  url = {https://github.com/MinishLab/model2vec},
  license = {MIT}
}
```
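
The `embedding` column is described as a mean token embedding that excludes the BOS/EOS tokens. As a rough illustration of that pooling step, here is a minimal sketch in plain Python. It is not Tokenlearn's actual implementation: the function name `mean_token_embedding` is hypothetical, and the layout `[BOS, tok_1, ..., tok_n, EOS]` is an assumption made for the example.

```python
from typing import Sequence


def mean_token_embedding(token_vectors: Sequence[Sequence[float]]) -> list[float]:
    """Average per-token vectors, skipping the first and last positions.

    Assumption: the sequence is laid out as [BOS, tok_1, ..., tok_n, EOS],
    so the special tokens sit at index 0 and index -1.
    """
    inner = token_vectors[1:-1]  # drop the special tokens at both ends
    if not inner:
        raise ValueError("need at least one non-special token")
    dim = len(inner[0])
    # Component-wise mean over the remaining token vectors.
    return [sum(vec[i] for vec in inner) / len(inner) for i in range(dim)]


# Toy input: 4 positions (BOS, two real tokens, EOS), dimension 3.
toy_vectors = [
    [9.0, 9.0, 9.0],  # BOS, ignored
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [7.0, 7.0, 7.0],  # EOS, ignored
]
pooled = mean_token_embedding(toy_vectors)
print(pooled)  # [2.0, 3.0, 4.0]
```

In the real dataset each pooled vector would have 768 components, matching the embedding dimension listed in the card; the toy example uses 3 only to keep the arithmetic easy to check by hand.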
Provider:
minishlab