imbWBI: Classification of Business Entities on Multilingual Web - The Main Results

Mendeley Data. Updated: 2018-02-26. Indexed: 2026-04-09.

Download link:

https://data.mendeley.com/datasets/8x9n2mn7h4/2


Description:

In this research, we proposed and developed an open-source business stakeholder classification system, capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The output is a single label indicating the particular industry of the stakeholder.

+ Summary
Spreadsheets with the most relevant findings and research sample data.

+ TF-IDF Evaluation
Contains 16 configurations in total, evaluated in a 10-fold cross-validation schema: the same 8 models were run with page sorting (by text size, descending) at the input of the content processing pipeline, and 8 without. Besides the traditional TF-IDF (2 experiments), another 6 modified versions were evaluated: without IDF, with DFC 1.1 and 2.0, and with and without HTML Tag Factors (TW).

+ Results with CSSRM
Cosine SSRM is our customized method for semantic similarity computation. The reports in this folder were produced near and at the optimum configuration of the system.

+ System evaluation
Reports and summary spreadsheets on experiments performed for the system's 10-fold cross-validation.

+ Unstable performance
Experiments with a different sample set (several sites), where the system achieved up to F1=0.893 effectiveness while being unstable because of the high number of parallel threads. The morphosyntactic resource interpreter and the content decomposition pipeline were producing different results at each run. The results are discarded as not reproducible with a single run.

-------------------------------------------------------------------
Sample set contains: 5 categories, each having 10 companies (web sites).
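As a rough illustration of the difference between the traditional TF-IDF weighting and the "without IDF" variant mentioned above, the following minimal sketch computes both and compares documents by cosine similarity. It uses only the textbook formulation; the DFC, TW, and CSSRM modifications are not specified in this description and are not reproduced here. The function names `term_weights` and `cosine` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs, use_idf=True):
    """Return one {term: weight} dict per tokenized document.

    With use_idf=True this is textbook TF-IDF; with use_idf=False
    only the normalized term frequency is kept.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term]) if use_idf else 1.0
            vec[term] = tf * idf
        weighted.append(vec)
    return weighted


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tokenized page texts, for illustration only.
docs = [
    ["steel", "pipes", "steel", "valves"],
    ["steel", "pipes", "fittings"],
    ["wooden", "furniture", "chairs"],
]
tfidf = term_weights(docs)
tf_only = term_weights(docs, use_idf=False)
```

Comparing `cosine(tfidf[0], tfidf[1])` with `cosine(tf_only[0], tf_only[1])` shows how dropping the IDF factor changes the influence of terms shared by many documents.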
Specific challenges addressed in this research:
- multilingual web content
- limited availability of domain-specific training data sets
- heterogeneous linguistic resources of variable quality
- absence of production-ready and publicly available general semantic lexicons, like WordNet

Problems that are addressed by this research:
- construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
- adaptation of the similarity computation schema, based on the Semantic Similarity Retrieval Model
- development of an efficient and effective Feature Vector Extraction mechanism, used to reduce the number of dimensions in the Feature Vector to the number of categories (5)
- evaluation of a wide range of classification algorithms and configuration parameters: kNN, Naive Bayes, Multiclass SVM and Neural Networks (17 classifier models are evaluated in every experiment)

All software tools (the application and the libraries) developed during this research are published under the GNU GPL3 licence and are thus available to other researchers and professionals.

----
Goran Grubić
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia
goran.grubic@koplas.co.rs, +381 62 27 27 55
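The reduction of the Feature Vector to the number of categories, described above, could look roughly like the following sketch: each web site is represented by one similarity score per category. This is an assumption about the general idea, not the mechanism implemented in the published tools; the category names, term weights, and the helper `extract_features` are hypothetical, and plain cosine similarity stands in for the authors' similarity computation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_features(site_terms, category_profiles):
    """Map a sparse term vector to one similarity score per category."""
    return [cosine(site_terms, profile) for profile in category_profiles.values()]


# Hypothetical category lexicons (term -> weight), one per category.
category_profiles = {
    "metal": {"steel": 1.0, "pipes": 0.8},
    "furniture": {"chairs": 1.0, "wooden": 0.7},
    "plastics": {"polymer": 1.0, "molding": 0.6},
    "textile": {"fabric": 1.0, "yarn": 0.9},
    "food": {"bakery": 1.0, "flour": 0.5},
}

# Hypothetical term weights extracted from one company web site.
site = {"steel": 0.6, "pipes": 0.3, "valves": 0.1}

features = extract_features(site, category_profiles)  # 5 values, one per category
labels = list(category_profiles)
predicted = labels[features.index(max(features))]
```

The resulting five-value vector can then be passed to any of the listed classifiers; taking the largest score, as in `predicted`, is only the simplest possible decision rule.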

Created:

2018-02-26
