SNLI Indo: Dataset Recognizing Textual Entailment (RTE)
Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-26.
Download link:
https://data.mendeley.com/datasets/k4tjhzs2gd
Description:
SNLI Indo is derived from the SNLI corpus: each premise (P) and hypothesis (H) sentence is translated directly from English to Indonesian using the Google Cloud Translation API. Translation is applied to both premise and hypothesis sentences, from index 0 to the last index of each data file (train, val, and test), so the translated set initially contains the same number of sentence pairs as the original SNLI dataset (570k). A filtering step then removes incomplete sentence pairs and pairs whose gold label is '-', leaving 569,030 sentence pairs.
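The filtering step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' actual script: the field names `sentence1`, `sentence2`, and `gold_label` follow the original SNLI JSONL files and are only assumed to carry over to SNLI Indo, and the helper names `is_complete` and `filter_pairs` are hypothetical. "Incomplete" is interpreted here as a missing or blank premise or hypothesis, which is also an assumption.

```python
from typing import Dict, Iterable, List


def is_complete(pair: Dict) -> bool:
    """Return True when both sentences are present and non-blank.

    Assumption: "incomplete" means a missing or empty premise/hypothesis.
    """
    for key in ("sentence1", "sentence2"):  # assumed field names (from SNLI)
        value = pair.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def filter_pairs(pairs: Iterable[Dict]) -> List[Dict]:
    """Drop incomplete pairs and pairs whose gold label is '-'."""
    return [
        p for p in pairs
        if is_complete(p) and p.get("gold_label") != "-"
    ]


# Tiny made-up sample, only to show the behaviour of the filter.
sample = [
    {"sentence1": "Seorang pria bermain gitar.",
     "sentence2": "Seorang pria memainkan alat musik.",
     "gold_label": "entailment"},
    {"sentence1": "Dua anak berlari di taman.",
     "sentence2": "Anak-anak sedang tidur.",
     "gold_label": "-"},            # no annotator consensus: removed
    {"sentence1": "Seekor anjing melompat.",
     "sentence2": "   ",
     "gold_label": "neutral"},      # blank hypothesis: removed
    {"sentence1": "Seorang wanita membaca buku.",
     "sentence2": "Seorang wanita sedang memasak.",
     "gold_label": "contradiction"},
]

kept = filter_pairs(sample)
```

Applying the same function to each split (train, val, test) after translation would give the filtered files; the exact per-split counts are not stated in the description and are not reproduced here.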
Created:
2024-01-23



