Name: smartcat/Amazon_Fashion_2023
Creator: smartcat
Published: 2024-10-31 08:29:50
License: No description available

Download link:

https://hf-mirror.com/datasets/smartcat/Amazon_Fashion_2023

Resource overview:

---
dataset_info:
  features:
  - name: title
    dtype: string
  - name: average_rating
    dtype: string
  - name: rating_number
    dtype: string
  - name: features
    dtype: string
  - name: description
    dtype: string
  - name: price
    dtype: string
  - name: images
    dtype: string
  - name: videos
    dtype: string
  - name: store
    dtype: string
  - name: categories
    dtype: string
  - name: parent_asin
    dtype: string
  - name: date_first_available
    dtype: timestamp[s]
  - name: department
    dtype: string
  - name: manufacturer
    dtype: string
  - name: item_model_number
    dtype: string
  - name: package_dimensions
    dtype: string
  - name: country_of_origin
    dtype: string
  - name: is_discontinued_by_manufacturer
    dtype: string
  - name: product_dimensions
    dtype: string
  - name: item_weight
    dtype: string
  - name: brand
    dtype: string
  - name: color
    dtype: string
  - name: material
    dtype: string
  - name: age_range_(description)
    dtype: string
  - name: style
    dtype: string
  - name: size
    dtype: string
  - name: closure_type
    dtype: string
  splits:
  - name: train
    num_bytes: 110486655
    num_examples: 43070
  download_size: 46484678
  dataset_size: 110486655
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Amazon Fashion 2023 Dataset

The original dataset can be found at https://amazon-reviews-2023.github.io/

## Dataset Details

This dataset was downloaded from the link above; it is the metadata ("meta") file for the Amazon Fashion category.

### Dataset Description

The Amazon Fashion 2023 dataset contains information on fashion products from Amazon, including titles, ratings, descriptions, prices, and categories. Each entry includes product-specific attributes such as color, size, brand, and material, along with identifiers such as the parent ASIN and department. It also holds metadata such as the date the product was first available, package dimensions, and country of origin.

The table below shows the original structure of the dataset.

<table border="1" cellpadding="5" cellspacing="0">
<tr> <th>Field</th> <th>Type</th> <th>Explanation</th> </tr>
<tr> <td>main_category</td> <td>str</td> <td>Main category (i.e., domain) of the product.</td> </tr>
<tr> <td>title</td> <td>str</td> <td>Name of the product.</td> </tr>
<tr> <td>average_rating</td> <td>float</td> <td>Rating of the product shown on the product page.</td> </tr>
<tr> <td>rating_number</td> <td>int</td> <td>Number of ratings the product has received.</td> </tr>
<tr> <td>features</td> <td>list</td> <td>Bullet-point format features of the product.</td> </tr>
<tr> <td>description</td> <td>list</td> <td>Description of the product.</td> </tr>
<tr> <td>price</td> <td>float</td> <td>Price in US dollars (at time of crawling).</td> </tr>
<tr> <td>images</td> <td>list</td> <td>Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td> </tr>
<tr> <td>videos</td> <td>list</td> <td>Videos of the product, including title and URL.</td> </tr>
<tr> <td>store</td> <td>str</td> <td>Store name of the product.</td> </tr>
<tr> <td>categories</td> <td>list</td> <td>Hierarchical categories of the product.</td> </tr>
<tr> <td>details</td> <td>dict</td> <td>Product details, including materials, brand, sizes, etc.</td> </tr>
<tr> <td>parent_asin</td> <td>str</td> <td>Parent ID of the product.</td> </tr>
<tr> <td>bought_together</td> <td>list</td> <td>Recommended bundles from the website.</td> </tr>
</table>

### Modifications Made

<ul>
<li>Products without a description, title, images, or details were removed.</li>
<li>Lists in features and description were converted to strings, with items joined by newlines.</li>
<li>For the details column, only the 16 most frequent detail types were kept. The details column was then split into 16 new columns, one per retained detail type.</li>
<li>Products with date_first_available before 2015 were dropped.</li>
<li>Products with is_discontinued_by_manufacturer set to 'true' or 'yes' were dropped.</li>
<li>The bought_together column was dropped due to missing values.</li>
</ul>

### Dataset Size

<ul>
<li>Total entries: 43,070</li>
<li>Total columns: 27</li>
</ul>

### Final structure

<table border="1" cellpadding="5" cellspacing="0">
<tr> <th>Field</th> <th>Type</th> <th>Explanation</th> </tr>
<tr> <td>title</td> <td>str</td> <td>Name of the product.</td> </tr>
<tr> <td>average_rating</td> <td>float</td> <td>Rating of the product shown on the product page.</td> </tr>
<tr> <td>rating_number</td> <td>int</td> <td>Number of ratings the product has received.</td> </tr>
<tr> <td>features</td> <td>str</td> <td>Bullet-point features of the product, joined with newlines.</td> </tr>
<tr> <td>description</td> <td>str</td> <td>Description of the product, joined with newlines.</td> </tr>
<tr> <td>price</td> <td>float</td> <td>Price in US dollars (at time of crawling).</td> </tr>
<tr> <td>images</td> <td>list</td> <td>Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of the image.</td> </tr>
<tr> <td>videos</td> <td>list</td> <td>Videos of the product, including title and URL.</td> </tr>
<tr> <td>store</td> <td>str</td> <td>Store name of the product.</td> </tr>
<tr> <td>categories</td> <td>list[str]</td> <td>Subcategories of the product.</td> </tr>
<tr> <td>parent_asin</td> <td>str</td> <td>Parent ID of the product.</td> </tr>
<tr> <td>date_first_available</td> <td>timestamp</td> <td>First date when the product was available.</td> </tr>
<tr> <td>department</td> <td>str</td> <td>Department of the product (e.g., womens, mens).</td> </tr>
<tr> <td>country_of_origin</td> <td>str</td> <td>Name of the country of origin.</td> </tr>
<tr> <td>item_weight</td> <td>str</td> <td>Weight of the product in ounces or pounds.</td> </tr>
<tr> <td>brand</td> <td>str</td> <td>Brand name associated with the product.</td> </tr>
<tr> <td>manufacturer</td> <td>str</td> <td>Name of the company or manufacturer responsible for producing the product.</td> </tr>
<tr> <td>product_dimensions</td> <td>str</td> <td>Dimensions of the product, typically including length, width, and height.</td> </tr>
<tr> <td>package_dimensions</td> <td>str</td> <td>Dimensions of the product's packaging.</td> </tr>
<tr> <td>color</td> <td>str</td> <td>Primary color or color variants of the product.</td> </tr>
<tr> <td>material</td> <td>str</td> <td>Main materials used in the product’s construction.</td> </tr>
<tr> <td>is_discontinued_by_manufacturer</td> <td>str</td> <td>Indicates whether the product has been discontinued by the manufacturer.</td> </tr>
<tr> <td>item_model_number</td> <td>str</td> <td>Model number of the product as assigned by the manufacturer.</td> </tr>
<tr> <td>age_range_(description)</td> <td>str</td> <td>Recommended age range for the product, often used for toys or children’s products.</td> </tr>
<tr> <td>style</td> <td>str</td> <td>Style or design of the product.</td> </tr>
<tr> <td>size</td> <td>str</td> <td>Size of the product.</td> </tr>
<tr> <td>closure_type</td> <td>str</td> <td>Closure type (e.g., zipper, button).</td> </tr>
</table>

### Usage

#### Download the dataset

```python
from datasets import load_dataset

# Load the single "train" split from the Hugging Face Hub.
dataset = load_dataset("smartcat/Amazon_Fashion_2023", split="train")
```
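
The `dataset_info` header stores `average_rating`, `rating_number`, and `price` as strings, although the field tables describe them as numeric. The following is a minimal sketch of converting them per record. It assumes that missing values show up as `None`, empty strings, or text such as `"None"`; the card does not state the actual encoding. `parse_number` and `cast_numeric_fields` are hypothetical helper names, and the sample record is made up.

```python
import math
from typing import Optional


def parse_number(text) -> Optional[float]:
    """Return the value as a finite float, or None if it cannot be parsed."""
    if text is None:
        return None
    try:
        value = float(str(text).strip())
    except ValueError:
        return None
    # Treat "nan" and "inf" as missing rather than as usable numbers.
    return value if math.isfinite(value) else None


def cast_numeric_fields(record: dict) -> dict:
    """Return a copy of the record with the numeric columns parsed from strings."""
    out = dict(record)
    out["average_rating"] = parse_number(record.get("average_rating"))
    out["price"] = parse_number(record.get("price"))
    count = parse_number(record.get("rating_number"))
    out["rating_number"] = int(count) if count is not None else None
    return out


# Toy record for illustration only (not taken from the dataset).
example = {"title": "Sample scarf", "average_rating": "4.3",
           "rating_number": "128", "price": "None"}
converted = cast_numeric_fields(example)
```

A function of this shape could probably be passed to `Dataset.map` from the `datasets` library, though that path is not exercised here; it is worth inspecting the resulting column types before relying on them.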
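
The steps listed under "Modifications Made" can be sketched in plain Python. This is an illustrative reconstruction, not the authors' preprocessing code. It assumes raw records shaped like the original-structure table (with a `details` dict), that detail labels such as `Closure Type` map to column names by lowercasing and replacing spaces with underscores, that dates look like `June 5, 2019`, and that products without a date are kept. All helper names and sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime

REQUIRED = ("description", "title", "images", "details")


def normalize_key(key: str) -> str:
    """Turn a detail label such as 'Closure Type' into 'closure_type'."""
    return key.strip().lower().replace(" ", "_")


def preprocess(records: list[dict], top_k: int = 16) -> list[dict]:
    # 1. Drop products missing a description, title, images, or details.
    kept = [r for r in records if all(r.get(f) for f in REQUIRED)]

    # 2. Find the top_k most frequent detail types across the kept products.
    counts = Counter(normalize_key(k) for r in kept for k in r["details"])
    top_keys = [k for k, _ in counts.most_common(top_k)]

    rows = []
    for r in kept:
        details = {normalize_key(k): v for k, v in r["details"].items()}

        # 3. Drop products first available before 2015 (assumed date format).
        raw_date = details.get("date_first_available")
        if raw_date:
            first_available = datetime.strptime(raw_date, "%B %d, %Y")
            if first_available.year < 2015:
                continue
            details["date_first_available"] = first_available

        # 4. Drop products flagged as discontinued by the manufacturer.
        flag = str(details.get("is_discontinued_by_manufacturer", "")).lower()
        if flag in ("true", "yes"):
            continue

        # 5. Drop details/bought_together, join list fields with newlines,
        #    and expand the retained detail types into their own columns.
        row = {k: v for k, v in r.items() if k not in ("details", "bought_together")}
        row["features"] = "\n".join(r.get("features") or [])
        row["description"] = "\n".join(r["description"])
        for key in top_keys:
            row[key] = details.get(key)
        rows.append(row)
    return rows


# Made-up records for illustration only.
raw = [
    {"title": "Wool scarf", "description": ["Soft.", "Warm."], "features": ["100% wool"],
     "images": [{"large": "https://example.com/a.jpg"}],
     "details": {"Date First Available": "June 5, 2019", "Brand": "Acme"},
     "bought_together": None},
    {"title": "Old hat", "description": ["Classic."], "features": [],
     "images": [{"large": "https://example.com/b.jpg"}],
     "details": {"Date First Available": "March 1, 2012", "Brand": "Acme"}},
    {"title": "No details", "description": ["x"], "features": [], "images": [], "details": {}},
]
cleaned = preprocess(raw)
```

In this toy run only the first record survives: the second is older than 2015 and the third lacks images and details. The most frequent detail types are counted before the date and discontinuation filters, matching the order in which the card lists the steps.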


Use cases: