MOIED: Magi Open Information Extraction Dataset
Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-27
Download link:
https://zenodo.org/record/3666039
Resource description:
Description

Magi Open Information Extraction Dataset (MOIED) is a Chinese Open IE dataset containing 7,618,181 records extracted from plain text across 3,319,763 webpages in various domains. Each record consists of a (subject, predicate, object) tuple, the associated confidence score, and context information. The dataset comprises 1,427,742 distinct facts covering 272,522 entities and 117,731 predicates.

A notable property of MOIED is that each distinct fact has multiple records, with URLs referring to mentions in diverse contexts, which enables multiple-instance learning (MIL) and other correlative approaches. As a paragraph-level Open IE dataset, at least 45.1% of the records in MOIED can only be extracted by synthesizing information from multiple sentences.

Magi is an extraction engine that continuously learns from the Internet. It combines cross-referencing, timeline analysis, and other heuristics to mitigate the inevitable false positives in the extractions. All records in MOIED were randomly sampled from a database dump of magi.com in January 2020. To provide more reliable evaluation results, human annotators examined the dataset and selected 19,161 verified records for the dev and test sets.

Disclaimers

The dataset is expected to be used in weakly supervised scenarios, since the records in the training set are not human-annotated and could be imprecise or erroneous.

Records are not guaranteed to be universally correct. The correctness of an extraction should be evaluated based on its context (specified by the URL).

Each extraction was made at the time Magi visited the URL, so there is no guarantee that the URL is still accessible or that the content has remained unmodified since the extraction was conducted.

Due to legal and regulatory issues, the webpage URLs are mostly ones accessible from Mainland China. Nevertheless, the content of certain webpages, as well as the extraction results, could violate the laws and regulations of certain countries or regions.
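The record structure and the multiple-records-per-fact property described above can be illustrated with a minimal Python sketch. The field names (`subject`, `predicate`, `obj`, `confidence`, `url`) and the sample values are hypothetical, chosen only for illustration; the actual MOIED file format and schema may differ. The sketch groups records that share the same (subject, predicate, object) tuple into one "bag", which is the usual starting point for multiple-instance learning.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical field names; the real MOIED schema may differ.
    subject: str
    predicate: str
    obj: str
    confidence: float
    url: str  # context in which the fact was mentioned


def group_into_bags(records):
    """Group records sharing a (subject, predicate, object) fact into one bag."""
    bags = defaultdict(list)
    for record in records:
        bags[(record.subject, record.predicate, record.obj)].append(record)
    return dict(bags)


# Made-up sample records: two mentions of one fact, one mention of another.
records = [
    Record("entity_a", "founder", "entity_b", 0.91, "https://example.com/page1"),
    Record("entity_a", "founder", "entity_b", 0.74, "https://example.com/page2"),
    Record("entity_c", "capital", "entity_d", 0.88, "https://example.com/page3"),
]

bags = group_into_bags(records)
for fact, mentions in bags.items():
    print(fact, len(mentions))
```

Each bag then holds every context in which a fact was mentioned, so a model can be trained or evaluated at the fact level rather than on individual, possibly noisy, records.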
Created:
2023-06-28



