
pborchert/CompanyWeb

Source: Hugging Face. Updated 2024-02-27. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pborchert/CompanyWeb
Resource description:

---
license: cc-by-4.0
task_categories:
- fill-mask
- text-classification
language:
- en
tags:
- business
- company website
- industry classification
pretty_name: CompanyWeb
size_categories:
- 1M<n<10M
task_ids:
- masked-language-modeling
---

# Dataset Card for "CompanyWeb"

### Dataset Summary

The dataset contains textual content extracted from 1,788,413 company web pages of 393,542 companies. The companies included are small, medium, and large international enterprises, including publicly listed companies. Additional company information is provided in the form of the corresponding Standard Industrial Classification (SIC) label `sic4`. The text includes all textual information contained on the website, with a timeline ranging from 2014 to 2021. The search includes all pages linked from the homepage that contain the company domain name. We filter the resulting textual data to include only English text, using the FastText language detection API [(Joulin et al., 2016)](https://aclanthology.org/E17-2068/).

### Languages

- en

## Dataset Structure

### Data Instances

- **#Instances:** 1789413
- **#Companies:** 393542
- **#Timeline:** 2014-2021

### Data Fields

- `id`: instance identifier `(string)`
- `cid`: company identifier `(string)`
- `text`: website text `(string)`
- `sic4`: 4-digit SIC `(string)`

### Citation Information

```bibtex
@article{BORCHERT2024,
  title = {Industry-sensitive language modeling for business},
  journal = {European Journal of Operational Research},
  year = {2024},
  issn = {0377-2217},
  doi = {https://doi.org/10.1016/j.ejor.2024.01.023},
  url = {https://www.sciencedirect.com/science/article/pii/S0377221724000444},
  author = {Philipp Borchert and Kristof Coussement and Jochen {De Weerdt} and Arno {De Caigny}},
}
```
Provided by:
pborchert
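Summary of original information

Dataset Card for "CompanyWeb"

Dataset Summary

The dataset contains textual content extracted from 1,788,413 company web pages of 393,542 companies. The companies are small, medium, and large international enterprises, including publicly listed companies. The dataset also provides the corresponding Standard Industrial Classification (SIC) label, sic4. The text includes all textual information on the website, with a timeline ranging from 2014 to 2021. The search includes all pages linked from the homepage that contain the company domain name. The resulting text is filtered with the FastText language detection API (Joulin et al., 2016) so that only English text is included.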
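
The English-only filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function `keep_english`, the `min_confidence` threshold, and the stand-in `toy_predict` are all hypothetical. The only thing assumed about fastText is that its language identification model returns a pair of labels and probabilities, with labels of the form `__label__en`.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Shape of a fastText-style prediction: (labels, probabilities).
Prediction = Tuple[Sequence[str], Sequence[float]]


def keep_english(
    texts: Iterable[str],
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.5,  # hypothetical threshold, not from the dataset card
) -> Iterator[str]:
    """Yield only the texts whose top predicted language is English."""
    for text in texts:
        # Collapse whitespace so the predictor sees a single line of text.
        single_line = " ".join(text.split())
        if not single_line:
            continue
        labels, probs = predict(single_line)
        if labels and labels[0] == "__label__en" and probs[0] >= min_confidence:
            yield text


def toy_predict(text: str) -> Prediction:
    """Stand-in predictor for demonstration only; NOT a real language detector."""
    english_markers = {"the", "and", "our", "we"}
    hits = sum(word in english_markers for word in text.lower().split())
    if hits:
        return (("__label__en",), (0.9,))
    return (("__label__de",), (0.9,))


pages = ["We build the best widgets.", "Wir bauen Maschinen."]
english_pages = list(keep_english(pages, toy_predict))
```

In practice, `predict` would be replaced by the prediction method of a loaded fastText language identification model. Passing the predictor in as a parameter keeps the filtering logic testable without downloading a model.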

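Languages

  • English (en)

Dataset Structure

Data Instances

  • Number of instances: 1,789,413
  • Number of companies: 393,542
  • Timeline: 2014-2021

Data Fields

  • id: instance identifier (string)
  • cid: company identifier (string)
  • text: website text (string)
  • sic4: 4-digit SIC (string)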
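
As a rough illustration of how the four fields fit together, the sketch below types a record and groups page texts by company. The names `CompanyWebRecord`, `pages_per_company`, and `sic_major_group` are hypothetical helpers, and the sample rows are invented toy values rather than real dataset content. The zero-padding in `sic_major_group` is a defensive assumption in case a leading zero was lost upstream.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, TypedDict


class CompanyWebRecord(TypedDict):
    """One row of the dataset, using the field names from the card."""
    id: str    # instance identifier
    cid: str   # company identifier
    text: str  # website text
    sic4: str  # 4-digit SIC code, stored as a string


def pages_per_company(records: Iterable[CompanyWebRecord]) -> Dict[str, List[str]]:
    """Group page texts by company identifier, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["cid"]].append(record["text"])
    return dict(grouped)


def sic_major_group(sic4: str) -> str:
    """Return the 2-digit SIC major group (the first two digits of the code)."""
    # Assumption: pad to four digits in case a leading zero was dropped.
    return sic4.zfill(4)[:2]


# Invented toy rows, for illustration only.
sample: List[CompanyWebRecord] = [
    {"id": "a", "cid": "c1", "text": "Home page text", "sic4": "7372"},
    {"id": "b", "cid": "c1", "text": "About page text", "sic4": "7372"},
    {"id": "c", "cid": "c2", "text": "Farm products", "sic4": "0111"},
]
by_company = pages_per_company(sample)
```

With the Hugging Face `datasets` library installed, rows of this shape could presumably be obtained with `load_dataset("pborchert/CompanyWeb", split="train", streaming=True)`. The split name is an assumption and should be checked against the dataset repository.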

