MOIED: Magi Open Information Extraction Dataset

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/3666038
Description:
Magi Open Information Extraction Dataset (MOIED) is a Chinese Open IE dataset containing 7,618,181 records extracted from plain text across 3,319,763 webpages in various domains. Each record consists of a (subject, predicate, object) tuple, the associated confidence score, and the context information. The dataset comprises 1,427,742 distinct facts involving 272,522 entities and 117,731 predicates.

A notable property of MOIED is that each distinct fact has multiple records, with URLs referring to mentions in diverse contexts, which enables multiple-instance learning (MIL) and other correlative approaches. As a paragraph-level Open IE dataset, at least 45.1% of the records in MOIED can only be extracted by synthesizing information from multiple sentences.

Magi is an extraction engine that continuously learns from the Internet. It combines cross-referencing, timeline analysis, and other heuristics to mitigate the inevitable false positives in the extractions. All records in MOIED were randomly sampled from a database dump of magi.com in January 2020. To provide more reliable evaluation results, human annotators examined the dataset and selected 19,161 verified records for the dev and test sets.

Disclaimers:
The dataset is expected to be used in weakly supervised scenarios, since the records in the training set are not human-annotated and could be imprecise or erroneous.
Records are not guaranteed to be universally correct. The correctness of an extraction should be evaluated against its context (specified by the URL).
Each extraction was made at the time Magi visited the URL, so there is no guarantee that the URL is still accessible or that the content has remained unmodified since the extraction was conducted.
Due to legal and regulatory issues, the webpage URLs are mostly ones accessible from Mainland China. Nevertheless, the content of certain webpages, as well as the extraction results, could violate the laws and regulations of certain countries or regions.
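
The description notes that each distinct fact has multiple records, which is what makes multiple-instance learning possible. The following is a minimal sketch of how records might be grouped into per-fact bags, with an optional confidence threshold to reflect the weak supervision caveat. The field names (subject, predicate, object, confidence, url, context) and the function group_into_bags are assumptions chosen for illustration; the listing does not specify the actual file schema, and the toy records are invented rather than taken from MOIED.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical field names; the real MOIED schema may differ.
    subject: str
    predicate: str
    object: str
    confidence: float
    url: str
    context: str


def group_into_bags(records, min_confidence=0.0):
    """Group records by (subject, predicate, object).

    Each distinct fact becomes one bag holding all of its mentions,
    which is the usual input shape for multiple-instance learning.
    Records below min_confidence are dropped before grouping.
    """
    bags = defaultdict(list)
    for rec in records:
        if rec.confidence >= min_confidence:
            fact = (rec.subject, rec.predicate, rec.object)
            bags[fact].append(rec)
    return dict(bags)


# Toy records invented for illustration only.
records = [
    Record("A", "p", "B", 0.9, "https://example.com/1", "first mention"),
    Record("A", "p", "B", 0.7, "https://example.com/2", "second mention"),
    Record("A", "q", "C", 0.3, "https://example.com/3", "low confidence"),
    Record("D", "p", "E", 0.8, "https://example.com/4", "single mention"),
]

bags = group_into_bags(records, min_confidence=0.5)
for fact, mentions in bags.items():
    print(fact, len(mentions))
```

Keying the bags on the full tuple keeps every mention's URL and context available, so correctness can still be judged per context as the disclaimers recommend.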
Created:
2024-07-22