
smartcat/Amazon_Baby_Products_2023

Source: Hugging Face. Updated: 2024-10-31. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/smartcat/Amazon_Baby_Products_2023
Resource overview:
---
dataset_info:
  features:
  - name: main_category
    dtype: string
  - name: title
    dtype: string
  - name: average_rating
    dtype: float64
  - name: rating_number
    dtype: int64
  - name: features
    dtype: string
  - name: description
    dtype: string
  - name: price
    dtype: float64
  - name: images
    list:
    - name: thumb
      dtype: string
    - name: large
      dtype: string
    - name: variant
      dtype: string
    - name: hi_res
      dtype: string
  - name: videos
    list:
    - name: title
      dtype: string
    - name: url
      dtype: string
    - name: user_id
      dtype: string
  - name: store
    dtype: string
  - name: categories
    sequence: string
  - name: parent_asin
    dtype: string
  - name: item_weight
    dtype: string
  - name: brand
    dtype: string
  - name: item_model_number
    dtype: string
  - name: product_dimensions
    dtype: string
  - name: batteries_required
    dtype: string
  - name: color
    dtype: string
  - name: material
    dtype: string
  - name: material_type
    dtype: string
  - name: style
    dtype: string
  - name: number_of_items
    dtype: string
  - name: manufacturer
    dtype: string
  - name: package_dimensions
    dtype: string
  - name: date_first_available
    dtype: int64
  - name: best_sellers_rank
    dtype: string
  - name: age_range_(description)
    dtype: string
  splits:
  - name: train
    num_bytes: 74252952
    num_examples: 22767
  download_size: 33492627
  dataset_size: 74252952
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for Amazon_Baby_Products_2023

The original dataset can be found at: https://amazon-reviews-2023.github.io/

## Dataset Details

This dataset was downloaded from the link above; it is the metadata set for the Baby Products category.

### Dataset Description

This dataset is a refined version of the Amazon Baby Products 2023 meta dataset, which originally contained metadata for baby products sold on Amazon. The dataset includes detailed information about products, such as their descriptions, ratings, prices, images, and features.

The primary focus of this modification was to ensure the completeness of key fields while simplifying the dataset by removing irrelevant or empty columns.

The table below shows the original structure of the dataset.

<table border="1" cellpadding="5" cellspacing="0">
<tr> <th>Field</th> <th>Type</th> <th>Explanation</th> </tr>
<tr> <td>main_category</td> <td>str</td> <td>Main category (i.e., domain) of the product.</td> </tr>
<tr> <td>title</td> <td>str</td> <td>Name of the product.</td> </tr>
<tr> <td>average_rating</td> <td>float</td> <td>Rating of the product shown on the product page.</td> </tr>
<tr> <td>rating_number</td> <td>int</td> <td>Number of ratings for the product.</td> </tr>
<tr> <td>features</td> <td>list</td> <td>Bullet-point format features of the product.</td> </tr>
<tr> <td>description</td> <td>list</td> <td>Description of the product.</td> </tr>
<tr> <td>price</td> <td>float</td> <td>Price in US dollars (at time of crawling).</td> </tr>
<tr> <td>images</td> <td>list</td> <td>Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td> </tr>
<tr> <td>videos</td> <td>list</td> <td>Videos of the product, including title and url.</td> </tr>
<tr> <td>store</td> <td>str</td> <td>Store name of the product.</td> </tr>
<tr> <td>categories</td> <td>list</td> <td>Hierarchical categories of the product.</td> </tr>
<tr> <td>details</td> <td>dict</td> <td>Product details, including materials, brand, sizes, etc.</td> </tr>
<tr> <td>parent_asin</td> <td>str</td> <td>Parent ID of the product.</td> </tr>
<tr> <td>bought_together</td> <td>list</td> <td>Recommended bundles from the website.</td> </tr>
</table>

### Modifications made

<ul>
<li>Products without a description, title, images, or details were removed.</li>
<li>Lists in features and description were converted into strings joined with newlines.</li>
<li>For the details column, only the 16 most frequent detail types were kept. The details column was then split into 16 new columns, one per retained detail type.</li>
<li>Products first available before 2015 were dropped.</li>
<li>Products with is_discontinued_by_manufacturer set to 'true' or 'yes' were dropped. That column was then dropped as well.</li>
<li>The bought_together column was dropped due to missing values.</li>
</ul>

### Dataset Size

<ul>
<li>Total entries: 22,767</li>
<li>Total columns: 27</li>
</ul>

### Final Structure

<table border="1" cellpadding="5" cellspacing="0">
<tr> <th>Field</th> <th>Type</th> <th>Explanation</th> </tr>
<tr> <td>main_category</td> <td>str</td> <td>Main category (i.e., domain) of the product.</td> </tr>
<tr> <td>title</td> <td>str</td> <td>Name of the product.</td> </tr>
<tr> <td>average_rating</td> <td>float</td> <td>Rating of the product shown on the product page.</td> </tr>
<tr> <td>rating_number</td> <td>int</td> <td>Number of ratings for the product.</td> </tr>
<tr> <td>features</td> <td>str</td> <td>Bullet-point features of the product, joined with newlines.</td> </tr>
<tr> <td>description</td> <td>str</td> <td>Description of the product, joined with newlines.</td> </tr>
<tr> <td>price</td> <td>float</td> <td>Price in US dollars (at time of crawling).</td> </tr>
<tr> <td>images</td> <td>list</td> <td>Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td> </tr>
<tr> <td>videos</td> <td>list</td> <td>Videos of the product, including title and url.</td> </tr>
<tr> <td>store</td> <td>str</td> <td>Store name of the product.</td> </tr>
<tr> <td>categories</td> <td>list</td> <td>Hierarchical categories of the product.</td> </tr>
<tr> <td>parent_asin</td> <td>str</td> <td>Parent ID of the product.</td> </tr>
<tr> <td>item_weight</td> <td>str</td> <td>Weight of the item.</td> </tr>
<tr> <td>brand</td> <td>str</td> <td>Brand name.</td> </tr>
<tr> <td>item_model_number</td> <td>str</td> <td>Model number of the item.</td> </tr>
<tr> <td>product_dimensions</td> <td>str</td> <td>Dimensions of the product.</td> </tr>
<tr> <td>batteries_required</td> <td>str</td> <td>Batteries required.</td> </tr>
<tr> <td>color</td> <td>str</td> <td>Color.</td> </tr>
<tr> <td>material</td> <td>str</td> <td>Material.</td> </tr>
<tr> <td>material_type</td> <td>str</td> <td>Material type.</td> </tr>
<tr> <td>style</td> <td>str</td> <td>Style.</td> </tr>
<tr> <td>number_of_items</td> <td>str</td> <td>Number of items.</td> </tr>
<tr> <td>manufacturer</td> <td>str</td> <td>Manufacturer.</td> </tr>
<tr> <td>package_dimensions</td> <td>str</td> <td>Package dimensions.</td> </tr>
<tr> <td>date_first_available</td> <td>int64</td> <td>Date the product first became available.</td> </tr>
<tr> <td>best_sellers_rank</td> <td>str</td> <td>Best sellers rank.</td> </tr>
<tr> <td>age_range_(description)</td> <td>str</td> <td>Age range.</td> </tr>
</table>
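
The steps listed under "Modifications made" can be sketched in plain Python. This is an illustrative sketch only, not the authors' actual pipeline: the function name `refine`, its parameters, and the toy records are hypothetical, detail keys are assumed to be already in snake_case, `date_first_available` is assumed here to be an ISO `YYYY-MM-DD` string (the published column is `int64`, and its encoding is not documented in the card), and products with no date are assumed to be kept.

```python
from collections import Counter


def refine(products, top_n=16, min_year=2015):
    """Hypothetical re-creation of the card's cleaning steps on a list of dicts."""
    # Step 1: drop products missing a description, title, images, or details.
    required = ("description", "title", "images", "details")
    kept = [p for p in products if all(p.get(k) for k in required)]

    # Step 2: find the top_n most frequent detail types.
    counts = Counter(key for p in kept for key in p["details"])
    top_keys = [key for key, _ in counts.most_common(top_n)]

    rows = []
    for p in kept:
        # The details and bought_together columns do not survive as-is.
        row = {k: v for k, v in p.items() if k not in ("details", "bought_together")}
        # Step 3: join list-valued text fields into newline-separated strings.
        for field in ("features", "description"):
            row[field] = "\n".join(p.get(field) or [])
        # Step 4: one new column per retained detail type (None when absent).
        for key in top_keys:
            row[key] = p["details"].get(key)
        rows.append(row)

    # Step 5: drop products first available before min_year.
    # ASSUMPTION: ISO date strings; rows without a date are kept.
    def year(row):
        date = row.get("date_first_available")
        return int(date[:4]) if date else None

    rows = [r for r in rows if year(r) is None or year(r) >= min_year]

    # Step 6: drop discontinued products, then drop the flag column itself.
    flag = "is_discontinued_by_manufacturer"
    rows = [r for r in rows if str(r.get(flag) or "").lower() not in ("true", "yes")]
    for r in rows:
        r.pop(flag, None)
    return rows


# Toy records (invented for illustration, not taken from the dataset).
products = [
    {"title": "Bib", "description": ["Soft", "Washable"], "features": ["BPA free"],
     "images": [{"thumb": "t.jpg"}],
     "details": {"brand": "Acme", "date_first_available": "2018-05-01"},
     "bought_together": None},
    {"title": "Old rattle", "description": ["Classic"], "features": [],
     "images": [{"thumb": "r.jpg"}],
     "details": {"brand": "Acme", "date_first_available": "2012-01-15"}},
    {"title": "Bottle", "description": ["Glass"], "features": ["8 oz"],
     "images": [{"thumb": "b.jpg"}],
     "details": {"brand": "Zed", "date_first_available": "2020-02-02",
                 "is_discontinued_by_manufacturer": "Yes"}},
    {"title": "No description", "description": [], "features": [],
     "images": [{"thumb": "n.jpg"}], "details": {"brand": "Zed"}},
]

refined = refine(products)
print([row["title"] for row in refined])
```

In this toy run only the first record survives: the second predates 2015, the third is flagged as discontinued, and the fourth has an empty description. Keeping 16 detail types and then dropping the discontinued flag leaves 15 detail-derived columns, which together with the 12 retained original columns matches the 27 columns reported above.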

Provider:
smartcat