argilla-warehouse/personahub-fineweb-edu-4-embeddings
Source: Hugging Face. Updated: 2024-09-10. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings
Summary:
This dataset contains embeddings for the argilla-warehouse/personahub-fineweb-edu-4-dedup dataset, computed with the Alibaba-NLP/gte-large-en-v1.5 model loaded through sentence-transformers. It was built with distilabel and includes a pipeline.yaml file that can be used to reproduce the generating pipeline with the distilabel CLI. The features are id, persona, model_name_embeddings, and embedding. The train split contains 21,071,228 examples with a total size of 178,032,127,144 bytes, which places the dataset in the 10M to 100M size category. The language is English, and the tags include synthetic and distilabel.
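Because the train split is large (21,071,228 examples, 178,032,127,144 bytes), it may be more practical to stream a few rows than to download everything. The following is a minimal sketch, not part of the original listing: it assumes the Hugging Face `datasets` package is installed, that the repository ID shown on this page still resolves on the Hub, and that the split is named `train`. The helper names `bytes_per_example` and `peek` are hypothetical and exist only for illustration.

```python
from itertools import islice

# Figures quoted in the summary on this page.
NUM_EXAMPLES = 21_071_228
TOTAL_BYTES = 178_032_127_144

# Repository ID as given in this listing; assumed to still resolve on the Hub.
REPO_ID = "argilla-warehouse/personahub-fineweb-edu-4-embeddings"


def bytes_per_example(total_bytes: int, num_examples: int) -> float:
    """Average stored size of one row, in bytes."""
    return total_bytes / num_examples


def peek(n: int = 3) -> list:
    """Stream the first n rows instead of downloading the whole dataset.

    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Pure arithmetic on the published figures; no network access needed.
    avg = bytes_per_example(TOTAL_BYTES, NUM_EXAMPLES)
    print(f"total: {TOTAL_BYTES / 1e9:.1f} GB, average row: {avg:.0f} bytes")
```

Calling `peek()` should return a list of dictionaries with the keys id, persona, model_name_embeddings, and embedding listed above. The average row size works out to roughly 8.4 KB, most of which is presumably the embedding vector.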
Provider:
argilla-warehouse



