smartcat/Amazon_Baby_Products_2023
Source: Hugging Face. Updated: 2024-10-31. Indexed: 2025-04-12.
Download link: https://hf-mirror.com/datasets/smartcat/Amazon_Baby_Products_2023
Description:
---
dataset_info:
  features:
  - name: main_category
    dtype: string
  - name: title
    dtype: string
  - name: average_rating
    dtype: float64
  - name: rating_number
    dtype: int64
  - name: features
    dtype: string
  - name: description
    dtype: string
  - name: price
    dtype: float64
  - name: images
    list:
    - name: thumb
      dtype: string
    - name: large
      dtype: string
    - name: variant
      dtype: string
    - name: hi_res
      dtype: string
  - name: videos
    list:
    - name: title
      dtype: string
    - name: url
      dtype: string
    - name: user_id
      dtype: string
  - name: store
    dtype: string
  - name: categories
    sequence: string
  - name: parent_asin
    dtype: string
  - name: item_weight
    dtype: string
  - name: brand
    dtype: string
  - name: item_model_number
    dtype: string
  - name: product_dimensions
    dtype: string
  - name: batteries_required
    dtype: string
  - name: color
    dtype: string
  - name: material
    dtype: string
  - name: material_type
    dtype: string
  - name: style
    dtype: string
  - name: number_of_items
    dtype: string
  - name: manufacturer
    dtype: string
  - name: package_dimensions
    dtype: string
  - name: date_first_available
    dtype: int64
  - name: best_sellers_rank
    dtype: string
  - name: age_range_(description)
    dtype: string
  splits:
  - name: train
    num_bytes: 74252952
    num_examples: 22767
  download_size: 33492627
  dataset_size: 74252952
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for Amazon_Baby_Products_2023
The original dataset can be found at: https://amazon-reviews-2023.github.io/
## Dataset Details
This dataset was downloaded from the link above. It is the metadata set for the Baby Products category.
### Dataset Description
This dataset is a refined version of the Amazon Baby Products 2023 meta dataset, which originally contained metadata for baby products sold on Amazon. It includes detailed product information such as descriptions, ratings, prices, images, and features. The modifications focus on ensuring that key fields are complete while simplifying the dataset by removing irrelevant or empty columns.
The table below represents the original structure of the dataset.
<table border="1" cellpadding="5" cellspacing="0">
<tr>
<th>Field</th>
<th>Type</th>
<th>Explanation</th>
</tr>
<tr>
<td>main_category</td>
<td>str</td>
<td>Main category (i.e., domain) of the product.</td>
</tr>
<tr>
<td>title</td>
<td>str</td>
<td>Name of the product.</td>
</tr>
<tr>
<td>average_rating</td>
<td>float</td>
<td>Rating of the product shown on the product page.</td>
</tr>
<tr>
<td>rating_number</td>
<td>int</td>
<td>Number of ratings for the product.</td>
</tr>
<tr>
<td>features</td>
<td>list</td>
<td>Bullet-point format features of the product.</td>
</tr>
<tr>
<td>description</td>
<td>list</td>
<td>Description of the product.</td>
</tr>
<tr>
<td>price</td>
<td>float</td>
<td>Price in US dollars (at time of crawling).</td>
</tr>
<tr>
<td>images</td>
<td>list</td>
<td>Images of the product. Each image comes in different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td>
</tr>
<tr>
<td>videos</td>
<td>list</td>
<td>Videos of the product including title and url.</td>
</tr>
<tr>
<td>store</td>
<td>str</td>
<td>Store name of the product.</td>
</tr>
<tr>
<td>categories</td>
<td>list</td>
<td>Hierarchical categories of the product.</td>
</tr>
<tr>
<td>details</td>
<td>dict</td>
<td>Product details, including materials, brand, sizes, etc.</td>
</tr>
<tr>
<td>parent_asin</td>
<td>str</td>
<td>Parent ID of the product.</td>
</tr>
<tr>
<td>bought_together</td>
<td>list</td>
<td>Recommended bundles from the website.</td>
</tr>
</table>
### Modifications made
<ul>
<li>Products without a description, title, images, or details were removed.</li>
<li>Lists in features and description were converted into strings, with items concatenated by newlines.</li>
<li>For the details column, only the 16 most frequent detail types were kept. The details column was then split into 16 new columns, one per kept detail type.</li>
<li>Products with a date first available before the year 2015 were dropped.</li>
<li>Products with is_discontinued_by_manufacturer set to 'true' or 'yes' were dropped. That column was then dropped as well.</li>
<li>The bought_together column was dropped due to missing values.</li>
</ul>
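The steps above could be sketched roughly as follows. This is a minimal pure-Python illustration on plain dictionaries, not the authors' actual pipeline. The helper names `snake` and `refine`, the `top_k` and `min_year` parameters, the snake_case key normalization, and the `"%B %d, %Y"` date format are all assumptions made for the example. Products with no date are kept here, which the card does not specify.

```python
from collections import Counter
from datetime import datetime


def snake(key):
    # "Item Weight" -> "item_weight" (assumed rule, inferred from the final column names)
    return key.strip().lower().replace(" ", "_")


def refine(records, top_k=16, min_year=2015):
    # 1. Keep only products that have a title, description, images and details.
    kept = [r for r in records
            if r.get("title") and r.get("description")
            and r.get("images") and r.get("details")]

    # 2. Find the top_k most frequent detail types among the remaining products.
    counts = Counter(snake(k) for r in kept for k in r["details"])
    top_keys = [k for k, _ in counts.most_common(top_k)]

    out = []
    for r in kept:
        details = {snake(k): v for k, v in r["details"].items()}

        # 3. Drop products first available before min_year (date format is assumed).
        raw_date = details.get("date_first_available")
        if raw_date is not None:
            if datetime.strptime(raw_date, "%B %d, %Y").year < min_year:
                continue

        # 4. Drop products flagged as discontinued by the manufacturer.
        flag = str(details.get("is_discontinued_by_manufacturer", "")).lower()
        if flag in ("true", "yes"):
            continue

        # 5. Drop the details and bought_together columns themselves.
        row = {k: v for k, v in r.items() if k not in ("details", "bought_together")}

        # 6. Join list-valued text fields with newlines.
        for field in ("features", "description"):
            if isinstance(row.get(field), list):
                row[field] = "\n".join(row[field])

        # 7. One column per kept detail type; the discontinued flag is not kept.
        for k in top_keys:
            if k != "is_discontinued_by_manufacturer":
                row[k] = details.get(k)
        out.append(row)
    return out
```

The sketch leaves `date_first_available` as the raw string. The published dataset stores it as `int64`, and the card does not say how that integer is encoded, so no conversion is attempted here.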
### Dataset Size
<ul>
<li>Total entries: 22,767</li>
<li>Total columns: 27</li>
</ul>
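To check these figures locally, something like the following may help. It assumes the Hugging Face `datasets` library is installed and that the repository id matches the name shown in the download link. The helper names `load_baby_products` and `summarize` are made up for this example.

```python
def load_baby_products(repo_id="smartcat/Amazon_Baby_Products_2023"):
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # The card's config lists a single "train" split.
    return load_dataset(repo_id, split="train")


def summarize(ds):
    # Works for any object exposing num_rows and column_names,
    # as datasets.Dataset does.
    return {"rows": ds.num_rows, "columns": len(ds.column_names)}
```

Calling `summarize(load_baby_products())` downloads the data. If the card is accurate, it should report 22,767 rows and 27 columns.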
### Final Structure
<table border="1" cellpadding="5" cellspacing="0">
<tr>
<th>Field</th>
<th>Type</th>
<th>Explanation</th>
</tr>
<tr>
<td>main_category</td>
<td>str</td>
<td>Main category</td>
</tr>
<tr>
<td>title</td>
<td>str</td>
<td>Name of the product</td>
</tr>
<tr>
<td>average_rating</td>
<td>float</td>
<td>Rating of the product shown on the product page.</td>
</tr>
<tr>
<td>rating_number</td>
<td>int</td>
<td>Number of ratings for the product.</td>
</tr>
<tr>
<td>features</td>
<td>str</td>
<td>Bullet-point features of the product, joined into a single newline-separated string.</td>
</tr>
<tr>
<td>description</td>
<td>str</td>
<td>Description of the product, joined into a single newline-separated string.</td>
</tr>
<tr>
<td>price</td>
<td>float</td>
<td>Price in US dollars (at time of crawling).</td>
</tr>
<tr>
<td>images</td>
<td>list</td>
<td>Images of the product. Each image comes in different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td>
</tr>
<tr>
<td>videos</td>
<td>list</td>
<td>Videos of the product including title and url.</td>
</tr>
<tr>
<td>store</td>
<td>str</td>
<td>Store name of the product.</td>
</tr>
<tr>
<td>categories</td>
<td>list</td>
<td>Hierarchical categories of the product.</td>
</tr>
<tr>
<td>parent_asin</td>
<td>str</td>
<td>Parent ID of the product.</td>
</tr>
<tr>
<td>item_weight</td>
<td>str</td>
<td>Weight of the item</td>
</tr>
<tr>
<td>brand</td>
<td>str</td>
<td>Brand name</td>
</tr>
<tr>
<td>item_model_number</td>
<td>str</td>
<td>Model number of the item</td>
</tr>
<tr>
<td>product_dimensions</td>
<td>str</td>
<td>Dimensions of the product</td>
</tr>
<tr>
<td>batteries_required</td>
<td>str</td>
<td>Batteries required</td>
</tr>
<tr>
<td>color</td>
<td>str</td>
<td>Color</td>
</tr>
<tr>
<td>material</td>
<td>str</td>
<td>Material</td>
</tr>
<tr>
<td>material_type</td>
<td>str</td>
<td>Material type</td>
</tr>
<tr>
<td>style</td>
<td>str</td>
<td>Style</td>
</tr>
<tr>
<td>number_of_items</td>
<td>str</td>
<td>Number of items</td>
</tr>
<tr>
<td>manufacturer</td>
<td>str</td>
<td>Manufacturer</td>
</tr>
<tr>
<td>package_dimensions</td>
<td>str</td>
<td>Package dimensions</td>
</tr>
<tr>
<td>date_first_available</td>
<td>int64</td>
<td>Date the product was first available</td>
</tr>
<tr>
<td>best_sellers_rank</td>
<td>str</td>
<td>Best seller rank</td>
</tr>
<tr>
<td>age_range_(description)</td>
<td>str</td>
<td>Age range</td>
</tr>
</table>
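Because `features` and `description` are stored as newline-joined strings and `images` is a list of records, a consumer may want small helpers like the ones below. This is only a sketch: `split_lines` and `best_image_url` are hypothetical names, and the split is not an exact inverse of the join if an original bullet contained a newline itself.

```python
def split_lines(text):
    # Turn a newline-joined string back into a list of non-blank lines.
    # Missing or empty text yields an empty list.
    if not text:
        return []
    return [line for line in text.split("\n") if line.strip()]


def best_image_url(images):
    # Return the first non-empty URL, preferring hi_res over large over thumb
    # within each image record; None if no URL is present.
    for image in images or []:
        for key in ("hi_res", "large", "thumb"):
            url = image.get(key)
            if url:
                return url
    return None
```

For example, `split_lines(row["features"])` gives the bullet points of one row as a list, and `best_image_url(row["images"])` picks one URL to display.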
Provider: smartcat



