Schema.org mark-up data for named entities
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7897608
Description:
This dataset contains two files: original_data.zip and website_5folds.zip.
original_data.zip unpacks into three .csv files: Place.csv, CreativeWork.csv, and LocalBusiness.csv. Each file contains one entity per row, and each entity belongs to a subclass of the class named by the file. There are 8 columns:
the first 2 columns are simply the index of the row
description_t: the long textual description of the entity
schemaorg_class: the schema.org class assigned to the entity
name_tpage_domain: always empty
name_t: the name of the entity
page_domain: the website where the entity mark-up data is found
label: an index for the schemaorg_class
description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
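The column layout above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loading code. It assumes that the CSV files sit at the root of the archive, that they are UTF-8 encoded, and that the first row is a header; read_entities and class_counts are hypothetical helper names.

```python
import csv
import io
import zipfile
from collections import Counter


def read_entities(zip_path, member):
    """Return (header, rows) for one CSV file inside the archive.

    Assumptions: UTF-8 encoding and a header row. The two leading index
    columns may have empty or duplicate header names, so only the named
    columns should be looked up in the returned dicts.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader)
            rows = [dict(zip(header, row)) for row in reader]
    return header, rows


def class_counts(rows):
    """Count how many entities carry each schema.org class."""
    return Counter(row["schemaorg_class"] for row in rows)


# Hypothetical usage, assuming the archive has been downloaded locally:
# header, rows = read_entities("original_data.zip", "Place.csv")
# print(class_counts(rows).most_common(10))
```

Counting schemaorg_class values is a quick way to see how many subclasses each file covers before training a classifier.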
website_5folds.zip is a transformation of original_data.zip. It unzips into three folders: Place, LocalBusiness, and CreativeWork. Each folder contains five sub-folders, 0, 1, 2, 3, and 4, one per fold. Each numbered sub-folder contains a train.csv and a test.csv file. Each csv file contains one entity per row, with the following columns:
the first column is simply the index of the row
schemaorg_class: the schema.org class assigned to the entity
name_t: the name of the entity
description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
page_domain: the name of the entity plus the processed domain name. The processing consists of parsing the domain URL, extracting the host name, applying word segmentation (tescobank -> tesco bank), and removing stopwords and TLDs (co, uk, com, fr)
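The page_domain processing described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure used to build the dataset: the stopword list, the TLD list beyond the four examples given, the tiny segmentation vocabulary, and the choice of joining name and domain with a space are all invented here, and segment, process_domain, and build_page_domain are hypothetical names. Word segmentation is done with a simple dictionary-based dynamic program that prefers the split with the fewest words.

```python
from urllib.parse import urlparse

# Illustrative lists only; the dataset's real lists are not documented.
TLDS = {"co", "uk", "com", "fr"}
STOPWORDS = {"www", "the", "of", "and"}
VOCAB = {"tesco", "bank", "book", "shop"}


def segment(token, vocab=VOCAB):
    """Split token into vocabulary words, preferring the fewest words.

    Returns [token] unchanged when no complete split exists.
    """
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def process_domain(url):
    """Parse the URL, take the host name, segment it, drop stopwords and TLDs."""
    # urlparse only fills in the host when a scheme or a leading // is present.
    host = urlparse(url if "://" in url else "//" + url).hostname or ""
    words = []
    for label in host.lower().split("."):
        if label in TLDS or label in STOPWORDS:
            continue
        words.extend(w for w in segment(label) if w not in STOPWORDS)
    return " ".join(words)


def build_page_domain(name, url):
    """Entity name followed by the processed domain (space-joined by assumption)."""
    return f"{name} {process_domain(url)}".strip()
```

With these toy lists, process_domain("https://www.tescobank.com/loans") yields "tesco bank", which matches the example given in the column description.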
As mentioned, website_5folds.zip is a transformation of original_data.zip and in fact contains multiple copies of the data in original_data.zip. It was created for 5-fold cross-validation experiments while ensuring that there is no overlap between the page_domain values of entities in the training and test sets.
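One way to obtain folds with no shared page_domain between training and test sets is to split by domain rather than by row. The sketch below is an assumption about how such a split can be built, not the authors' procedure: it groups rows by their original page_domain value (the website, as in original_data.zip) and greedily places whole domains into the currently smallest fold. domain_folds, train_test_splits, and domains_overlap are hypothetical names.

```python
from collections import defaultdict


def domain_folds(rows, k=5):
    """Assign row indices to k folds so that no domain spans two folds."""
    by_domain = defaultdict(list)
    for i, row in enumerate(rows):
        by_domain[row["page_domain"]].append(i)
    folds = [[] for _ in range(k)]
    # Greedy balancing: largest domains first, each into the smallest fold.
    for domain in sorted(by_domain, key=lambda d: (-len(by_domain[d]), d)):
        min(folds, key=len).extend(by_domain[domain])
    return folds


def train_test_splits(rows, k=5):
    """Yield (train, test) row lists, holding out one fold at a time."""
    folds = domain_folds(rows, k)
    for held_out in range(k):
        test = [rows[i] for i in folds[held_out]]
        train = [
            rows[i]
            for f, fold in enumerate(folds)
            if f != held_out
            for i in fold
        ]
        yield train, test


def domains_overlap(train, test):
    """True if any page_domain value occurs in both sets."""
    train_domains = {r["page_domain"] for r in train}
    test_domains = {r["page_domain"] for r in test}
    return bool(train_domains & test_domains)
```

Under this scheme each entity appears in exactly one test set and in the other four training sets, which is consistent with the folds holding several copies of the original data. Note that domains_overlap is meaningful on the raw website values; the page_domain column inside website_5folds.zip also includes the entity name, so it would need the domain part separated out first.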
Created:
2023-05-05



