epfml/FineWeb2-embedded
Source: Hugging Face. Updated: 2025-02-19. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/epfml/FineWeb2-embedded
Description:
FineWeb2-embedded is a dataset of document-level XLM-RoBERTa embeddings for 20 languages, suited to multilingual tasks such as document clustering and filtering. It is based on the FineWeb2 dataset; each document's embeddings are obtained by mean-pooling the XLM-RoBERTa output over 512-token chunks.
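The chunked mean-pooling described above can be illustrated with a minimal sketch. It is not the dataset authors' pipeline: it assumes the per-token output vectors of the encoder are already available as plain lists of floats, and it assumes one pooled vector is produced per 512-token chunk. The helper names `mean_pool` and `chunk_mean_pool` are hypothetical, and the toy example uses 2-dimensional vectors and a chunk size of 2 only to keep the arithmetic readable.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot mean-pool an empty sequence")
    dim = len(vectors[0])
    sums = [0.0] * dim
    for vec in vectors:
        if len(vec) != dim:
            raise ValueError("all vectors must have the same dimension")
        for i, x in enumerate(vec):
            sums[i] += x
    return [s / len(vectors) for s in sums]


def chunk_mean_pool(
    token_vectors: Sequence[Sequence[float]], chunk_size: int = 512
) -> List[Vector]:
    """Split per-token vectors into consecutive chunks and mean-pool each one.

    Assumption: one pooled vector per chunk; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        mean_pool(token_vectors[start:start + chunk_size])
        for start in range(0, len(token_vectors), chunk_size)
    ]


# Toy stand-in for encoder output: 5 tokens, 2 dimensions each.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
pooled = chunk_mean_pool(toy_tokens, chunk_size=2)
print(pooled)
```

With real data, `token_vectors` would hold the encoder's hidden states for one document and `chunk_size` would stay at its default of 512, so a document of up to 512 tokens yields a single pooled vector.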
Provider:
epfml



