imbWBI: Classification of Business Entities on Multilingual Web: configuration optimization, auxiliary experimental reports and other resources

Mendeley Data | Updated: 2018-02-26 | Indexed: 2026-04-09

Download link:

https://data.mendeley.com/datasets/mg98ypgc8s/2


Resource description:

The DataSet contains auxiliary experimental reports, configuration files and other material produced during the research "Classification of Business Entities on Multilingual Web using Natural Language Processing...".

# Content of the folders:

## Configuration optimization process

Experiments performed for system optimization.

## General resources

Graphic representations of Semantic Clouds created by the system. Collection of ACE Scripts executed during the research. Color renders were made with an older cloud construction algorithm.

## Particular aspects of the system

Results of the experiments performed to evaluate particular mechanisms of the system.

## imbWBI_ITM_ProjectFiles

Configuration files with sample specification and other resources required for reproducing the results.

----

In this research, we proposed and developed an open source business stakeholder classification system, capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The sole external data source is content retrieved from the web site of the stakeholder, processed with an array of Natural Language Processing, Web Data Mining and statistical techniques. The output is a single-label result pointing to the particular industry of the stakeholder. The sample set contains 5 categories, each with 10 manufacturing companies (web sites).
Specific challenges addressed:

- multilingual web content
- limited availability of domain-specific training data sets
- heterogeneous linguistic resources of variable quality
- absence of production-ready and publicly available general semantic lexicons, like WordNet

Problems solved in this research:

- construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
- adaptation of a similarity computation schema based on the Semantic Similarity Retrieval Model
- development of an efficient and effective Feature Vector Extraction mechanism, used to reduce the number of dimensions in the Feature Vector to the number of categories (5)
- evaluation of a wide range of classification algorithms and configuration parameters: kNN, NaiveBayes, Multiclass SVM and Neural Networks (17 classifier models are evaluated in every experiment)

All software tools (the application and the libraries) developed during this research are published under the GNU GPL3 licence and are thus available to other researchers and professionals.

----

Goran Grubić
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia
goran.grubic@koplas.co.rs, +381 62 27 27 55
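
The idea of reducing the Feature Vector to one dimension per category can be illustrated with a small sketch. This is not the imbWBI implementation: the function names, the toy `CLOUDS` terms and weights, and the use of plain cosine similarity are all assumptions made for illustration. The description above mentions a Semantic Similarity Retrieval Model schema and a trained classifier (kNN, NaiveBayes, SVM or Neural Network) on top of the reduced vector, whereas this sketch simply picks the category with the highest similarity.

```python
import math
from collections import Counter

# Hypothetical semantic clouds: one weighted term set per category.
# Terms and weights are invented for illustration only.
CLOUDS = {
    "furniture": {"chair": 1.0, "table": 0.8, "wood": 0.6},
    "plastics": {"plastic": 1.0, "polymer": 0.8, "molding": 0.6},
    "textile": {"fabric": 1.0, "cotton": 0.8, "yarn": 0.6},
    "metal": {"steel": 1.0, "welding": 0.8, "alloy": 0.6},
    "food": {"bakery": 1.0, "flour": 0.8, "dairy": 0.6},
}


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feature_vector(text, clouds):
    """Map a document to one similarity score per category cloud."""
    doc = Counter(tokenize(text))
    return [cosine(doc, cloud) for cloud in clouds.values()]


def classify(text, clouds):
    """Stand-in for the final classifier: return the most similar category."""
    scores = feature_vector(text, clouds)
    labels = list(clouds)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]
```

With five clouds, `feature_vector` always returns five numbers regardless of vocabulary size, which is the dimensionality reduction the list above refers to. In a fuller setup, those five-dimensional vectors would be the training input for the evaluated classifier models rather than being resolved by a simple maximum.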

Created:

2018-02-26
