
Schema.org mark-up data for named entities

Indexed by NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7897608
Description:
This dataset contains two files: original_data.zip and website_5folds.zip.

original_data.zip unpacks into three .csv files: Place.csv, CreativeWork.csv, and LocalBusiness.csv. Each file contains one entity per row, and each entity belongs to a subclass of the class indicated by the file name. There are 8 columns:
the first 2 columns are simply the index of the row
description_t: the long textual description of the entity
schemaorg_class: the schema.org class assigned to the entity
name_tpage_domain: always empty
name_t: the name of the entity
page_domain: the website where the entity mark-up data is found
label: an index for the schemaorg_class
description: the name of the entity (name_t) plus the first sentence of its description (from description_t)

website_5folds.zip is a transformation of original_data.zip. It unzips into three folders: Place, LocalBusiness, and CreativeWork. Each folder contains five sub-folders, 0, 1, 2, 3 and 4, one per fold. Each numbered sub-folder contains a train.csv and a test.csv file. Each csv file contains one entity per row, with the following columns:
the first column is simply the index of the row
schemaorg_class: the schema.org class assigned to the entity
name_t: the name of the entity
description: the name of the entity (name_t) plus the first sentence of its description (from description_t)
page_domain: the name of the entity plus the processed domain name. The processing consists of parsing the domain URL, extracting the host name, applying word segmentation (tescobank -> tesco bank), and removing stopwords and TLDs (co, uk, com, fr)

Because website_5folds.zip is a transformation of original_data.zip, it contains multiple replications of the data in original_data.zip. It was created for 5-fold validation experiments, and it ensures that there is no overlap between the page_domain values of entities in the training and test sets.
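The domain processing and the domain-disjoint folds described above can be illustrated with a minimal Python sketch. This is not the dataset authors' code. The stopword list, the toy segmentation vocabulary, the greedy longest-match segmenter, and the function names `segment`, `process_domain`, and `domain_disjoint_folds` are all assumptions made for illustration; only the TLD examples (co, uk, com, fr) and the tescobank example come from the description. The sketch covers only the processed domain name, not the entity name that the page_domain column also includes.

```python
from collections import defaultdict
from urllib.parse import urlparse

TLDS = {"co", "uk", "com", "fr"}          # the examples named in the description
STOPWORDS = {"www", "the", "and", "of"}   # hypothetical stopword list
VOCAB = {"tesco", "bank"}                 # hypothetical toy vocabulary


def segment(token, vocab=VOCAB):
    """Greedy longest-match word segmentation (an assumed, simplistic method)."""
    words, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                words.append(token[i:j])
                i = j
                break
        else:
            # No vocabulary word starts here: keep the remainder as one token.
            words.append(token[i:])
            break
    return words


def process_domain(url):
    """Parse the URL, take the host name, segment it, drop stopwords and TLDs."""
    host = urlparse(url).hostname or ""
    words = []
    for label in host.split("."):
        if label in TLDS or label in STOPWORDS:
            continue
        words.extend(w for w in segment(label) if w not in STOPWORDS)
    return " ".join(words)


def domain_disjoint_folds(rows, n_folds=5):
    """Put all rows of one page_domain into the same fold, balancing fold sizes."""
    by_domain = defaultdict(list)
    for idx, row in enumerate(rows):
        by_domain[row["page_domain"]].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Place the largest domains first, always into the currently smallest fold.
    for domain in sorted(by_domain, key=lambda d: (-len(by_domain[d]), d)):
        min(folds, key=len).extend(by_domain[domain])
    return folds
```

With these assumptions, `process_domain("https://www.tescobank.co.uk/loans")` returns `"tesco bank"`. Using fold k as the test set and the remaining folds as the training set then guarantees that no page_domain appears on both sides, which is the property the description states for website_5folds.zip.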
Created:
2023-05-05