JQL-AI/fw2_embeddings
Hugging Face | Updated 2025-08-21 | Indexed 2025-08-09
Download link:
https://hf-mirror.com/datasets/JQL-AI/fw2_embeddings
Description:
FineWeb2-embeddings is an extension of the FineWeb2 dataset, annotated with document-level embeddings for 36 languages. The embeddings can be used for tasks such as document clustering, filtering, and multilingual research. They were computed with Snowflake's Arctic-embed-m-v2.0, which has a sequence length limit of 8192 tokens, and each document is represented by its CLS token embedding. The dataset includes subsets for different languages and configurations, each with specific file paths for filtered and removed data. The dataset is derived from web content collected from 2013 to 2024 and may contain personally identifiable information (PII). It is recommended to review the FineWeb2 documentation for social impact considerations, potential biases, and known limitations.
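As a rough illustration of how document-level embeddings can support filtering, the sketch below keeps documents whose embedding has a cosine similarity to a reference vector above a threshold. It is only a minimal example under stated assumptions: the vectors are toy 3-dimensional values rather than data from this dataset, and the helper names `cosine_similarity` and `filter_by_similarity` and the threshold are hypothetical, not part of the dataset or of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(embeddings, reference, threshold):
    """Return the indices of embeddings at least `threshold` similar to `reference`."""
    return [
        i
        for i, emb in enumerate(embeddings)
        if cosine_similarity(emb, reference) >= threshold
    ]


# Toy vectors standing in for document embeddings (assumption: real
# embeddings from the dataset would have many more dimensions).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
reference = [1.0, 0.0, 0.0]

kept = filter_by_similarity(docs, reference, threshold=0.8)
print(kept)
```

With real data, the reference vector could be the mean embedding of a few hand-picked documents, and the threshold would need to be tuned per language.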
Provided by:
JQL-AI



