Hemimoon/JaTextRel
Source: Hugging Face | Updated: 2024-07-17 | Indexed: 2024-07-22
Download link:
https://hf-mirror.com/datasets/Hemimoon/JaTextRel
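Overview:
This dataset contains Japanese sentences and passages together with relevance scores, and is intended to support relevance retrieval tasks. It is mainly used to train and validate Japanese text embedding models, particularly for Retrieval Augmented Generation (RAG) systems. It comprises about 766,000 training examples and about 42,000 examples each for validation and testing, drawn from datasets for several tasks: Semantic Textual Similarity (STS), Natural Language Inference (NLI), Question Answering (QA), Multiple-Choice Question Answering (MCQA), and Text Summarization (TS). Each entry contains metadata, a sentence, a text, and a relevance score that expresses how relevant the sentence is to the text. The conversion step maps the labels of the different tasks onto a single relevance score using specific mathematical functions. Finally, the data is divided into training, validation, and test sets.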
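The listing does not say which functions perform the label conversion. As a purely illustrative sketch, assuming a similarity label on a known closed range and a three-way NLI label, one plausible way to bring such labels onto a 0 to 1 scale is linear rescaling plus a lookup table. The names `rescale_label` and `NLI_TO_SCORE`, the 0 to 5 range, and the NLI score values are all hypothetical and are not taken from the dataset.

```python
# Hypothetical illustration only: the dataset's actual conversion functions
# are not described in this listing.

def rescale_label(value: float, low: float, high: float) -> float:
    """Linearly map a label from [low, high] onto [0, 1], clamping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    scaled = (value - low) / (high - low)
    return min(1.0, max(0.0, scaled))


# Assumed mapping for a three-way NLI label; the values are placeholders.
NLI_TO_SCORE = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}

if __name__ == "__main__":
    # A similarity label on an assumed 0-5 scale.
    print(rescale_label(2.5, 0.0, 5.0))
    print(NLI_TO_SCORE["neutral"])
```

Clamping keeps out-of-range labels inside the target interval, which may or may not match how the dataset authors handled such cases.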
This dataset contains Japanese sentences and passages with relevance scores to facilitate relevance retrieval tasks. It is constructed based on multiple task datasets, including Semantic Textual Similarity (STS), Natural Language Inference (NLI), Question Answering (QA), Multiple-Choice Question Answering (MCQA), and Text Summarization (TS). The labels from these datasets are converted into a unified relevance score, represented by a float64 number between 0 and 1, where 0 indicates no relevance and 1 indicates high relevance. The dataset is divided into a training set with 766085 samples, a validation set with 42560 samples, and a test set with 42561 samples.
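The split sizes quoted above can be sanity-checked with a few lines of arithmetic. The sketch below computes each split's share of the total from the stated counts. The `load_jatextrel` helper is an assumption: it supposes the repository loads through the Hugging Face `datasets` library with its default configuration, which this listing does not confirm, and the dictionary keys are labels chosen here rather than confirmed split names.

```python
# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 766085, "validation": 42560, "test": 42561}


def split_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_jatextrel():
    """Assumed loading path; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("Hemimoon/JaTextRel")


if __name__ == "__main__":
    for name, fraction in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {fraction:.4f}")
```

The counts sum to 851,206 samples, and the shares work out to roughly 90% training, 5% validation, and 5% test.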
Provider:
Hemimoon



