
ASOIF and Harry Potter Dataset

Source: arXiv. Updated: 2019-03-04. Indexed: 2024-06-21.
Download link:
https://github.com/cicling2018-dhdata/dh-dataset
Resource overview:
This study presents a high-quality dataset for the digital humanities, intended for evaluating language models, particularly word embedding models. It comprises unigram and n-gram datasets derived from two fantasy novel series, *A Song of Ice and Fire* and *Harry Potter*, totaling 31,362 test units across 80 task sections. The dataset was built through manual collection, filtering, and expansion, and was then refined by domain experts. It is primarily designed to address the extraction and validation of fine-grained relationships and knowledge in small-scale corpora, and it supports the training and evaluation of several word embedding models, including word2vec, GloVe, fastText, and LexVec.
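As a rough illustration of what evaluating a word embedding model on a test unit can look like, the sketch below scores word-analogy units with the common vector-offset method. The four-word unit format, the function names, and the hand-made toy vectors are all assumptions made for this example; none of them is taken from the dataset or confirmed by this listing.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the vocabulary word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves.
    """
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def accuracy(vectors, units):
    """Fraction of (a, b, c, expected) units answered correctly.

    Units containing an out-of-vocabulary word are counted as wrong.
    """
    if not units:
        return 0.0
    correct = 0
    for a, b, c, expected in units:
        if all(w in vectors for w in (a, b, c, expected)):
            if solve_analogy(vectors, a, b, c) == expected:
                correct += 1
    return correct / len(units)

# Hand-made toy vectors for illustration only; a real run would load
# embeddings trained on the book corpus instead.
toy_vectors = {
    "stark": [1.0, 0.0, 0.0],
    "winterfell": [1.0, 0.0, 1.0],
    "lannister": [0.0, 1.0, 0.0],
    "casterly_rock": [0.0, 1.0, 1.0],
    "riverrun": [1.0, 1.0, 1.0],
}

# Hypothetical test units in an assumed (a, b, c, expected) format.
toy_units = [
    ("stark", "winterfell", "lannister", "casterly_rock"),
    ("lannister", "casterly_rock", "stark", "winterfell"),
]

if __name__ == "__main__":
    print(solve_analogy(toy_vectors, "stark", "winterfell", "lannister"))
    print(accuracy(toy_vectors, toy_units))
```

Counting out-of-vocabulary units as wrong is one possible design choice; on a small corpus it may be preferable to report coverage separately so that missing words do not mask the quality of the vectors that do exist.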
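Provided by:
International Laboratory of Information Science and Semantic Technologies, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)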
Created:
2019-03-04