KelvinJiang/freebase_qa
Dataset Overview
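Dataset Name
- Name: FreebaseQA
- Alias: FreebaseQA
Dataset Properties
- Language: English
- License: Unknown
- Multilinguality: Monolingual
- Size category: 10K<n<100K
- Task category: Question answering
- Task ID: open-domain-qa
- Papers with Code ID: freebaseqa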
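Dataset Structure
- Features:
  - Question-ID: string, the ID of each question.
  - RawQuestion: string, the original question as collected from the data sources.
  - ProcessedQuestion: string, the question after some processing, such as removal of the trailing question mark and decapitalization.
  - Parses: dictionary, the semantic parses of the question, with the following sub-features:
    - Parse-Id: string, the ID of each semantic parse.
    - PotentialTopicEntityMention: string, the potential topic entity mention in the question.
    - TopicEntityName: string, the name or alias of the topic entity in the question.
    - TopicEntityMid: string, the Freebase MID of the topic entity in the question.
    - InferentialChain: string, the path from the topic entity node to the answer node in Freebase, labeled as a predicate.
    - Answers: dictionary, the answers found from this parse, with the following sub-features:
      - AnswersMid: string, the Freebase MID of the answer.
      - AnswersName: list of strings, the answer strings from the original question-answer pair.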
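As a rough illustration of how the nested fields above fit together, the sketch below builds one hand-written record and collects its answer strings. Every value in the record is a hypothetical placeholder rather than real dataset content, `answer_names` is a helper introduced only for this example, and the list-of-dicts nesting is an assumption: a particular loader may expose `Parses` and `Answers` in a different (for example column-wise) layout.

```python
from typing import Any, Dict, List

# Hypothetical record that follows the field list above.
# All IDs, MIDs, and strings are invented placeholders.
example: Dict[str, Any] = {
    "Question-ID": "example-0",
    "RawQuestion": "Who wrote the novel Example Book?",
    "ProcessedQuestion": "who wrote the novel example book",
    "Parses": [
        {
            "Parse-Id": "example-0.P0",
            "PotentialTopicEntityMention": "example book",
            "TopicEntityName": "example book",
            "TopicEntityMid": "m.000001",
            "InferentialChain": "book.written_work.author",
            "Answers": [
                {"AnswersMid": "m.000002", "AnswersName": ["example author"]},
            ],
        }
    ],
}


def answer_names(record: Dict[str, Any]) -> List[str]:
    """Collect the distinct answer strings across all parses, in order."""
    names: List[str] = []
    for parse in record["Parses"]:
        for answer in parse["Answers"]:
            for name in answer["AnswersName"]:
                if name not in names:
                    names.append(name)
    return names


print(answer_names(example))
```

A question can carry several parses, so de-duplicating across parses keeps the result to one entry per distinct answer string.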
Dataset Splits
- Train: 20,358 examples, 10,235,375 bytes
- Test: 3,996 examples, 1,987,874 bytes
- Validation: 3,994 examples, 1,974,114 bytes
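The split sizes listed above can be checked with a few lines of arithmetic. The sketch below only uses the example counts from this card; the variable names are illustrative.

```python
# Example counts per split, copied from the list above.
split_examples = {"train": 20358, "test": 3996, "validation": 3994}

total = sum(split_examples.values())
shares = {name: count / total for name, count in split_examples.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three splits add up to 28,348 examples, which works out to roughly a 72/14/14 division between train, test, and validation.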
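Dataset Creation
- Source data:
  - Initial data collection and normalization: The dataset was generated by matching trivia-type question-answer pairs with subject-predicate-object triples in Freebase. For each collected question-answer pair, all entities in the question are first tagged, and Freebase is then searched for relevant predicates that bridge a tagged entity with the answer. Finally, human annotation is used to remove false positives from the matched triples.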
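The matching procedure described above can be sketched on a toy scale. This is not the authors' pipeline: the triple list stands in for Freebase, the substring-based `tag_entities` stands in for a real entity tagger, only single-predicate links are considered, and the final human annotation step is not modelled. All names and triples are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy stand-in for Freebase; entity names are invented placeholders
# and the predicate strings are illustrative.
TOY_TRIPLES: List[Triple] = [
    ("example book", "book.written_work.author", "example author"),
    ("example book", "book.book.genre", "example genre"),
    ("example city", "location.location.containedby", "example country"),
]


def tag_entities(question: str, known_entities: List[str]) -> List[str]:
    """Naive tagger: keep known entity names that occur in the question."""
    lowered = question.lower()
    return [entity for entity in known_entities if entity in lowered]


def candidate_matches(
    question: str, answer: str, triples: List[Triple]
) -> List[Triple]:
    """Return triples whose subject is tagged in the question and whose
    object equals the answer. These are candidates only; in the described
    procedure, human annotators would still filter out false positives."""
    subjects = sorted({subject for subject, _, _ in triples})
    mentions = tag_entities(question, subjects)
    target = answer.lower()
    return [
        (subject, predicate, obj)
        for subject, predicate, obj in triples
        if subject in mentions and obj == target
    ]


matches = candidate_matches(
    "Who wrote the novel Example Book?", "Example Author", TOY_TRIPLES
)
print(matches)
```

Each surviving triple corresponds to one candidate parse: the subject plays the role of the topic entity, the predicate the inferential chain, and the object the answer.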
Citation Information
@inproceedings{jiang-etal-2019-freebaseqa,
    title = "{F}reebase{QA}: A New Factoid {QA} Data Set Matching Trivia-Style Question-Answer Pairs with {F}reebase",
    author = "Jiang, Kelvin and Wu, Dekun and Jiang, Hui",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1028",
    doi = "10.18653/v1/N19-1028",
    pages = "318--323",
    abstract = "In this paper, we present a new data set, named FreebaseQA, for open-domain factoid question answering (QA) tasks over structured knowledge bases, like Freebase. The data set is generated by matching trivia-type question-answer pairs with subject-predicate-object triples in Freebase. For each collected question-answer pair, we first tag all entities in each question and search for relevant predicates that bridge a tagged entity with the answer in Freebase. Finally, human annotation is used to remove any false positive in these matched triples. Using this method, we are able to efficiently generate over 54K matches from about 28K unique questions with minimal cost. Our analysis shows that this data set is suitable for model training in factoid QA tasks beyond simpler questions since FreebaseQA provides more linguistically sophisticated questions than other existing data sets.",
}




