pborchert/CompanyWeb
Hugging Face, updated 2024-02-27, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/pborchert/CompanyWeb
Resource description:
---
license: cc-by-4.0
task_categories:
- fill-mask
- text-classification
language:
- en
tags:
- business
- company website
- industry classification
pretty_name: CompanyWeb
size_categories:
- 1M<n<10M
task_ids:
- masked-language-modeling
---
# Dataset Card for "CompanyWeb"
### Dataset Summary
The dataset contains textual content extracted from 1,788,413 company web pages of 393,542 companies. The companies are small, medium, and large international enterprises, including publicly listed companies. Each instance also carries the company's 4-digit Standard Industrial Classification (SIC) code in the `sic4` field.
The text comprises all textual information found on each website, with a timeline ranging from 2014 to 2021. The search covers all subsequent pages that are linked from the homepage and contain the company domain name.
We filter the resulting text to keep only English content, using the FastText language detection API [(Joulin et al., 2016)](https://aclanthology.org/E17-2068/).
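The English-only filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `keep_english` helper, the `min_confidence` threshold, and the `toy_predict` stand-in are all assumptions introduced here. In practice, `predict` would wrap a real language-identification model such as the FastText one; the toy predictor only keeps the example self-contained.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor maps a single-line text to (language_code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def keep_english(
    texts: Iterable[str],
    predict: Predictor,
    min_confidence: float = 0.5,  # assumed threshold; the card does not state one
) -> Iterator[str]:
    """Yield only the texts that the predictor labels as English."""
    for text in texts:
        # Collapse whitespace and newlines: language-ID models usually
        # expect single-line input.
        flat = " ".join(text.split())
        if not flat:
            continue  # skip empty pages
        lang, confidence = predict(flat)
        if lang == "en" and confidence >= min_confidence:
            yield text


def toy_predict(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real model: a crude stopword heuristic."""
    english_stopwords = {"the", "and", "of", "we", "our", "is", "for"}
    words = text.lower().split()
    hits = sum(word in english_stopwords for word in words)
    share = hits / len(words)
    return ("en", share) if hits else ("unknown", 0.0)


pages = [
    "We are the leading supplier of industrial valves.",
    "Wir sind ein führender Anbieter von Industrieventilen.",
    "   ",
]
english_pages = list(keep_english(pages, toy_predict, min_confidence=0.2))
```

Passing the predictor as a callable keeps the filtering logic independent of any particular language-identification library.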
### Languages
- en
## Dataset Structure
### Data Instances
- **#Instances:** 1789413
- **#Companies:** 393542
- **#Timeline:** 2014-2021
### Data Fields
- `id`: instance identifier `(string)`
- `cid`: company identifier `(string)`
- `text`: website text `(string)`
- `sic4`: 4-digit SIC `(string)`
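As a rough illustration of how these fields fit together, the sketch below models one instance as a typed dictionary, groups pages by `cid`, and derives the 2-digit SIC major group from `sic4`. The record values and the helper names (`CompanyWebRecord`, `sic_major_group`, `pages_by_company`) are made up for this example and are not part of the dataset. Because `sic4` is a string, codes with a leading zero stay intact.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, TypedDict


class CompanyWebRecord(TypedDict):
    """One instance, mirroring the four documented fields."""
    id: str    # instance identifier
    cid: str   # company identifier
    text: str  # website text
    sic4: str  # 4-digit SIC code, kept as a string


def sic_major_group(sic4: str) -> str:
    """Return the 2-digit SIC major group (the first two digits)."""
    if len(sic4) != 4 or not sic4.isdigit():
        raise ValueError(f"expected a 4-digit SIC code, got {sic4!r}")
    return sic4[:2]


def pages_by_company(records: Iterable[CompanyWebRecord]) -> Dict[str, List[str]]:
    """Group page texts by company identifier."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["cid"]].append(record["text"])
    return dict(grouped)


# Made-up records for illustration only; real instances come from the dataset,
# e.g. via datasets.load_dataset("pborchert/CompanyWeb").
records: List[CompanyWebRecord] = [
    {"id": "p1", "cid": "c1", "text": "About our software.", "sic4": "7372"},
    {"id": "p2", "cid": "c1", "text": "Contact us.", "sic4": "7372"},
    {"id": "p3", "cid": "c2", "text": "Our wheat farm.", "sic4": "0111"},
]

grouped = pages_by_company(records)
major_groups = {r["cid"]: sic_major_group(r["sic4"]) for r in records}
```

Grouping by `cid` is useful because each company typically contributes several pages, all of which share the same `sic4` label.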
### Citation Information
```bibtex
@article{BORCHERT2024,
title = {Industry-sensitive language modeling for business},
journal = {European Journal of Operational Research},
year = {2024},
issn = {0377-2217},
doi = {https://doi.org/10.1016/j.ejor.2024.01.023},
url = {https://www.sciencedirect.com/science/article/pii/S0377221724000444},
author = {Philipp Borchert and Kristof Coussement and Jochen {De Weerdt} and Arno {De Caigny}},
}
```
Provider:
pborchert