GreenNode/nano-nq-vn
Source: Hugging Face. Updated: 2025-12-30. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/GreenNode/nano-nq-vn
Description:
NanoNQ-VN is a Vietnamese translated dataset derived from NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval. VN-MTEB (Vietnamese Massive Text Embedding Benchmark) is created from English samples by a new automated system:
- Large language models (LLMs), specifically Cohere's Aya model, translate the samples.
- Advanced embedding models filter the translations.
- An LLM-as-a-judge scores the quality of each sample based on multiple criteria.
The dataset is part of MTEB (Massive Text Embedding Benchmark) and is primarily used for evaluating text embedding models. It includes three configurations: corpus (the documents), qrels (query-document relevance judgments), and queries, each with its own features and splits. The task categories are text retrieval, multiple-choice QA, and question answering.
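To illustrate how the three configurations fit together in a retrieval evaluation, here is a minimal sketch on toy in-memory rows. The field names (`_id`, `text`, `query-id`, `corpus-id`, `score`) follow the layout commonly used by MTEB/BEIR retrieval datasets and are an assumption, not something the description above confirms; the helper functions `build_relevance_map` and `recall_at_k` and the example ranking are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the three configurations. The field names are assumed
# (common MTEB/BEIR retrieval layout), not taken from the dataset card.
corpus = [
    {"_id": "d1", "text": "Hà Nội là thủ đô của Việt Nam."},
    {"_id": "d2", "text": "Phở là một món ăn truyền thống."},
]
queries = [{"_id": "q1", "text": "Thủ đô của Việt Nam là gì?"}]
qrels = [{"query-id": "q1", "corpus-id": "d1", "score": 1}]


def build_relevance_map(qrels_rows):
    """Group relevant document ids by query id, keeping only positive scores."""
    relevant = defaultdict(set)
    for row in qrels_rows:
        if row["score"] > 0:
            relevant[row["query-id"]].add(row["corpus-id"])
    return dict(relevant)


def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


relevance = build_relevance_map(qrels)

# Hypothetical ranking that some embedding model might return for query "q1".
ranking = ["d2", "d1"]
print(recall_at_k(ranking, relevance["q1"], k=1))
print(recall_at_k(ranking, relevance["q1"], k=2))
```

To run this against the real data, each configuration could probably be loaded with the Hugging Face `datasets` library, for example `load_dataset("GreenNode/nano-nq-vn", "corpus")`; the split names are not stated here, so they would need to be checked on the dataset page.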
Provider:
GreenNode



