AI71ai/Arctic-Wiki-Arabic-1M

Name: AI71ai/Arctic-Wiki-Arabic-1M
Creator: AI71ai
Published: 2026-01-29 05:51:58
License: not specified

Source: Hugging Face; updated 2026-01-29; indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/AI71ai/Arctic-Wiki-Arabic-1M

Description:

Arctic-Wiki-Arabic-1M is a VDBBench-compatible vector benchmark case published as a Hugging Face dataset repository. It contains vector representations of Arabic Wikipedia articles, generated using the Snowflake Arctic Embed L v2.0 model (dimension 1024). The training set consists of 1,000,000 vectors, and the test set includes 1,000 query vectors. The dataset filters articles by character length to ensure they are neither too short (low-signal) nor too long (risking truncation or poor embedding behavior). Test set queries are derived from Wikipedia titles of documents in the training stream, and an optional shuffled version of the training set is provided to evaluate vector DB sensitivity to ingestion order. The dataset includes ground truth with exact top-400 nearest train IDs for each test query, computed using cosine similarity. Optional ID mapping sidecars are included to trace vector IDs back to original Wikipedia/source IDs. The dataset is designed for evaluating vector databases, particularly compressed representations (including binarized/quantized vectors).
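The ground-truth and compressed-representation ideas described above can be illustrated with a minimal sketch. It is not the dataset's actual build pipeline: it uses small synthetic random vectors in place of the real 1024-dimensional embeddings, a small k in place of 400, and the function names (`exact_cosine_topk`, `binarized_topk`, `recall_at_k`) are hypothetical helpers introduced only for illustration. Sign binarization with Hamming distance is assumed here as one simple example of a compressed representation.

```python
import numpy as np

def exact_cosine_topk(train, queries, k):
    """Return (n_queries, k) train indices ranked by descending cosine similarity."""
    train_n = train / np.linalg.norm(train, axis=1, keepdims=True)
    query_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = query_n @ train_n.T  # cosine similarity of unit-normalized vectors
    # Stable sort on negated similarities gives a deterministic descending order.
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]

def binarized_topk(train, queries, k):
    """Return top-k train indices by Hamming distance between sign bits."""
    train_bits = train > 0
    query_bits = queries > 0
    # Hamming distance = number of positions where the sign bits differ.
    dists = (query_bits[:, None, :] != train_bits[None, :, :]).sum(axis=2)
    return np.argsort(dists, axis=1, kind="stable")[:, :k]

def recall_at_k(approx_ids, truth_ids):
    """Mean fraction of ground-truth IDs recovered per query."""
    hits = [len(set(a) & set(t))
            for a, t in zip(approx_ids.tolist(), truth_ids.tolist())]
    return float(np.mean(hits)) / truth_ids.shape[1]

# Synthetic stand-in data (assumption): 2,000 "train" vectors of dimension 64,
# and 20 queries that are noisy copies of the first 20 train vectors.
rng = np.random.default_rng(0)
train = rng.standard_normal((2000, 64)).astype(np.float32)
queries = train[:20] + 0.1 * rng.standard_normal((20, 64)).astype(np.float32)

truth = exact_cosine_topk(train, queries, k=10)   # exact ground truth
approx = binarized_topk(train, queries, k=10)     # compressed search
recall = recall_at_k(approx, truth)
print(f"recall@10 of sign-binarized search: {recall:.3f}")
```

A benchmark harness would compare a vector database's returned IDs against the stored ground truth in the same way; the brute-force matrix product shown here is only practical at small scale and would need batching for 1,000,000 vectors.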

Provider:

AI71ai
