nyuuzyou/wb-products
Source: Hugging Face. Updated 2024-01-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nyuuzyou/wb-products
Resource description:
---
annotations_creators:
- crowdsourced
language:
- ru
language_creators:
- crowdsourced
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: Wildberries products
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
---
# Dataset Card for Wildberries products
### Dataset Summary
This dataset was scraped from product pages on the Russian marketplace [Wildberries](https://www.wildberries.ru). It includes all information from the product card and metadata from the API, excluding image URLs. Collection started from the first product and covered approximately 160 million of a potential 230 million products; it had to stop when severe rate limiting prevented further progress. The data is distributed as zstd-compressed archives containing JSONL files. Each archive holds the data from one Wildberries data server, identified by its basket server number.
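As a rough illustration of reading this format, the sketch below streams records from JSONL text. It assumes each archive is a single zstd-compressed JSONL stream (the card does not confirm the exact container layout) and that the third-party `zstandard` package is installed for decompression. The helper names `iter_records` and `open_zst_text`, and the sample values, are hypothetical.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path: str) -> io.TextIOWrapper:
    """Open a .zst file as a UTF-8 text stream.

    Assumption: the archive is one zstd-compressed JSONL stream and the
    third-party `zstandard` package is available.
    """
    import zstandard  # imported lazily so the parser above works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Demo on in-memory text; the two records are made up for illustration.
demo = io.StringIO(
    '{"nm_id": 1, "imt_name": "example item A"}\n'
    "\n"
    '{"nm_id": 2, "imt_name": "example item B"}\n'
)
for record in iter_records(demo):
    print(record["nm_id"], record["imt_name"])
```

With a real archive, `iter_records(open_zst_text(path))` would stream records one at a time without loading the whole file into memory, which matters at this dataset's scale.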
### Languages
The dataset is mostly in Russian, but there may be other languages present.
## Dataset Structure
### Data Fields
This dataset includes the following fields:
- `imt_id`: Identifier for the item (integer)
- `nm_id`: Numeric identifier associated with the item (integer)
- `imt_name`: Name of the product (string)
- `subj_name`: Subject name (string)
- `subj_root_name`: Root subject name (string)
- `nm_colors_names`: Color names (string, may be empty)
- `vendor_code`: Vendor code (string)
- `description`: Description of the product (string, may be empty)
- `brand_name`: Name of the brand (string)
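The field list above can be mirrored as a typed record, which makes missing or mistyped values visible early. This is a minimal sketch, not part of the dataset's tooling: the `Product` class and `from_record` helper are hypothetical, and treating absent string fields as empty strings is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Product:
    imt_id: int
    nm_id: int
    imt_name: str
    subj_name: str
    subj_root_name: str
    nm_colors_names: str  # may be empty
    vendor_code: str
    description: str  # may be empty
    brand_name: str

    @classmethod
    def from_record(cls, record: Mapping[str, Any]) -> "Product":
        """Build a Product from one parsed JSONL record."""
        values = {}
        for f in fields(cls):
            raw = record.get(f.name)
            if f.type is int:
                # Identifiers are required; int(None) raises TypeError.
                values[f.name] = int(raw)
            else:
                # Assumption: a missing string field is treated as empty.
                values[f.name] = "" if raw is None else str(raw)
        return cls(**values)


# Made-up record for illustration only.
example = Product.from_record(
    {
        "imt_id": 10,
        "nm_id": 20,
        "imt_name": "example item",
        "subj_name": "example subject",
        "subj_root_name": "example root subject",
        "vendor_code": "ABC-1",
        "brand_name": "example brand",
    }
)
print(example)
```

Making the dataclass frozen keeps records hashable and immutable, which is convenient for deduplication by `nm_id` or similar keys.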
### Data Splits
All examples are in the train split; there is no validation split.
## Additional Information
### License
This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can:
* Use it for any purpose, including commercial projects.
* Modify it however you like.
* Distribute it without asking permission.
No attribution is required, but it's always appreciated!
CC0 license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
To learn more about CC0, visit the Creative Commons website: https://creativecommons.org/publicdomain/zero/1.0/
### Dataset Curators
- [nyuuzyou](https://ducks.party)
Provider:
nyuuzyou