
Studeni/AMAZON-Products-2023-Arabic

Source: Hugging Face. Updated 2024-05-21; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Studeni/AMAZON-Products-2023-Arabic
Resource description:
---
task_categories:
- text-classification
- feature-extraction
- sentence-similarity
- text2text-generation
- translation
language:
- en
tags:
- e-commerce
- products
- amazon
- arabic
size_categories:
- 100K<n<1M
---

# Dataset Card for Amazon Products 2023 Arabic

## Dataset Summary

This dataset contains product metadata from Amazon, filtered to include only products that became available in 2023. The dataset is intended for use in semantic search applications and includes a variety of product categories.

- **Number of Rows:** 117,243
- **Number of Columns:** 17

## Data Source

The data is sourced from [Amazon Reviews 2023](https://amazon-reviews-2023.github.io/). It includes product information across multiple categories, with additional embeddings.

[NLLB 1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) was used for translation to `Modern Standard Arabic`.

Embeddings were created from the **title_arb** + **description_arb** using the **text-embedding-3-small** model.

## Dataset Structure

### Number of Products by Filename

|    | filename                         | product_count |
|---:|:---------------------------------|--------------:|
|  0 | meta_Amazon_Fashion              |           470 |
|  1 | meta_Appliances                  |           573 |
|  2 | meta_Arts_Crafts_and_Sewing      |          2948 |
|  3 | meta_Automotive                  |          7161 |
|  4 | meta_Baby_Products               |           526 |
|  5 | meta_Beauty_and_Personal_Care    |          1402 |
|  6 | meta_Books                       |             2 |
|  7 | meta_CDs_and_Vinyl               |          1319 |
|  8 | meta_Cell_Phones_and_Accessories |          5062 |
|  9 | meta_Clothing_Shoes_and_Jewelry  |         41777 |
| 10 | meta_Digital_Music               |            56 |
| 11 | meta_Electronics                 |          7681 |
| 12 | meta_Gift_Cards                  |             8 |
| 13 | meta_Grocery_and_Gourmet_Food    |            96 |
| 14 | meta_Handmade_Products           |          1018 |
| 15 | meta_Health_and_Household        |          4760 |
| 16 | meta_Health_and_Personal_Care    |            93 |
| 17 | meta_Home_and_Kitchen            |         17326 |
| 18 | meta_Industrial_and_Scientific   |          1216 |
| 19 | meta_Magazine_Subscriptions      |             3 |
| 20 | meta_Musical_Instruments         |           639 |
| 21 | meta_Office_Products             |          3545 |
| 22 | meta_Patio_Lawn_and_Garden       |          3075 |
| 23 | meta_Pet_Supplies                |          2742 |
| 24 | meta_Software                    |           157 |
| 25 | meta_Sports_and_Outdoors         |          6343 |
| 26 | meta_Tools_and_Home_Improvement  |          4776 |
| 27 | meta_Toys_and_Games              |          1367 |
| 28 | meta_Unknown                     |           541 |
| 29 | meta_Video_Games                 |           561 |

### Columns

- **parent_asin (str):** Unique identifier for the product.
- **date_first_available (datetime64[ns]):** The date when the product first became available.
- **title (str):** Title of the product.
- **title_arb (str):** Title of the product translated to `Modern Standard Arabic`.
- **description (str):** Description of the product.
- **description_arb (str):** Description of the product translated to `Modern Standard Arabic`.
- **filename (str):** Filename associated with the product metadata.
- **main_category (str):** Main category of the product.
- **categories (List[str]):** Subcategories of the product.
- **store (str):** Store information for the product.
- **average_rating (float64):** Average rating of the product.
- **rating_number (float64):** Number of ratings for the product.
- **price (float64):** Price of the product.
- **features (List[str]):** Features of the product.
- **details (str):** Additional details of the product. The string is JSON serializable.
- **embeddings (List[float64]):** Embeddings generated for the product using **text-embedding-3-small** model.
- **image (str):** URL of the product image.

### Missing Values

- **main_category:** 24,805 missing values
- **store:** 253 missing values
- **rating_number:** 6 missing values
- **price:** 35,869 missing values

### Sample Data

```json
[
  {
    "parent_asin": "B000044U2O",
    "date_first_available": "2023-04-29T00:00:00",
    "title": "Anomie & Bonhomie",
    "description": "Amazon.com Fans of Scritti Politti's synth-pop-funk masterpiece Cupid & Psyche 85 may be shocked by how far afield Scritti mastermind Green Gartside has gone since then. Anomie & Bonhomie, his return to recording after a decadelong absence, ranges from guest shots by rappers and funksters such as Mos Def and Me'Shell Ndegeocello to Foo Fighters tributes. Gartside's trademark breathy vocals and spot-on melodicism do find their places here, but are often forced to make way for other influences. Neither a total success nor a total failure, Anomie does display a spark that makes one hope that Gartside doesn't wait so long to record again. --Rickey Wright",
    "filename": "meta_Digital_Music",
    "main_category": "Digital Music",
    "categories": [],
    "store": "Scritti Politti Format: Audio CD",
    "average_rating": 4.2,
    "rating_number": 56.0,
    "price": null,
    "features": [],
    "details": "{'Date First Available': 'April 29, 2023'}",
    "embeddings": [],
    "image": "https://m.media-amazon.com/images/I/41T618NE88L.jpg"
  },
  ...
]
```

### Usage

This dataset can be used for various applications, including:

- Semantic Search: Utilizing the embeddings to find similar products based on textual descriptions.
- Product Recommendation: Enhancing recommendation systems with detailed product metadata.

### Citation

```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```

### Contact

For questions or issues regarding the dataset, please contact [Amazon Reviews 2023](https://amazon-reviews-2023.github.io/).
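Provider:
Studeni
Summary of Original Information

Dataset Overview

Dataset name: Amazon Products 2023 Arabic

Purpose: Semantic search applications, covering a variety of product categories.

Dataset size:

  • Rows: 117,243
  • Columns: 17

Languages: English and Modern Standard Arabic

Task categories:

  • Text classification
  • Feature extraction
  • Sentence similarity
  • Text-to-text generation
  • Translation

Tags:

  • E-commerce
  • Products
  • Amazon
  • Arabic

Data source: The data comes from Amazon Reviews 2023 and includes product information across multiple categories, with additional embeddings.

Translation tool: NLLB 1.3B was used to translate to Modern Standard Arabic.

Embedding generation: Embeddings were created from title_arb + description_arb using the text-embedding-3-small model.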

Dataset Structure

Product counts by filename:

Filename Product count
meta_Amazon_Fashion 470
meta_Appliances 573
... ...
meta_Video_Games 561
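
As a quick consistency check, the per-file counts from the full table in the dataset card can be summed and compared with the stated row count. The sketch below hard-codes those counts; the variable names are illustrative and not part of the dataset.

```python
# Product counts per source file, copied from the dataset card's table.
PRODUCT_COUNTS = {
    "meta_Amazon_Fashion": 470,
    "meta_Appliances": 573,
    "meta_Arts_Crafts_and_Sewing": 2948,
    "meta_Automotive": 7161,
    "meta_Baby_Products": 526,
    "meta_Beauty_and_Personal_Care": 1402,
    "meta_Books": 2,
    "meta_CDs_and_Vinyl": 1319,
    "meta_Cell_Phones_and_Accessories": 5062,
    "meta_Clothing_Shoes_and_Jewelry": 41777,
    "meta_Digital_Music": 56,
    "meta_Electronics": 7681,
    "meta_Gift_Cards": 8,
    "meta_Grocery_and_Gourmet_Food": 96,
    "meta_Handmade_Products": 1018,
    "meta_Health_and_Household": 4760,
    "meta_Health_and_Personal_Care": 93,
    "meta_Home_and_Kitchen": 17326,
    "meta_Industrial_and_Scientific": 1216,
    "meta_Magazine_Subscriptions": 3,
    "meta_Musical_Instruments": 639,
    "meta_Office_Products": 3545,
    "meta_Patio_Lawn_and_Garden": 3075,
    "meta_Pet_Supplies": 2742,
    "meta_Software": 157,
    "meta_Sports_and_Outdoors": 6343,
    "meta_Tools_and_Home_Improvement": 4776,
    "meta_Toys_and_Games": 1367,
    "meta_Unknown": 541,
    "meta_Video_Games": 561,
}

STATED_ROWS = 117_243  # "Number of Rows" from the dataset summary

total = sum(PRODUCT_COUNTS.values())
largest = max(PRODUCT_COUNTS, key=PRODUCT_COUNTS.get)
largest_share = PRODUCT_COUNTS[largest] / total

print(f"total products: {total:,} (matches stated rows: {total == STATED_ROWS})")
print(f"largest file: {largest} ({largest_share:.1%} of rows)")
```

The counts add up to the stated 117,243 rows, and meta_Clothing_Shoes_and_Jewelry alone accounts for roughly 36% of them, so category-balanced sampling may be worth considering for classification tasks.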

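Columns:

  • parent_asin (str): Unique identifier for the product.
  • date_first_available (datetime64[ns]): Date the product first became available.
  • title (str): Product title.
  • title_arb (str): Product title translated to Modern Standard Arabic.
  • description (str): Product description.
  • description_arb (str): Product description translated to Modern Standard Arabic.
  • filename (str): Filename associated with the product metadata.
  • main_category (str): Main category of the product.
  • categories (List[str]): Subcategories of the product.
  • store (str): Store information for the product.
  • average_rating (float64): Average rating of the product.
  • rating_number (float64): Number of ratings for the product.
  • price (float64): Price of the product.
  • features (List[str]): Features of the product.
  • details (str): Additional product details, stored as a JSON-serializable string.
  • embeddings (List[float64]): Embeddings generated with the text-embedding-3-small model.
  • image (str): URL of the product image.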
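
The `details` column is stored as a string, and the sample record in the dataset card shows it with single quotes (`{'Date First Available': 'April 29, 2023'}`), which strict JSON parsers reject. A minimal sketch of a tolerant parser follows, assuming the strings are either valid JSON or Python-literal dicts; `parse_details` is a hypothetical helper, not something shipped with the dataset.

```python
import ast
import json


def parse_details(raw):
    """Parse a `details` string into a dict; return {} when it cannot be parsed."""
    if not isinstance(raw, str) or not raw.strip():
        return {}
    try:
        # First attempt: strict JSON (double-quoted keys and strings).
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Fallback: Python-literal syntax, as seen in the sample record.
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return {}
    return parsed if isinstance(parsed, dict) else {}


# The `details` value from the sample record in the dataset card.
sample = "{'Date First Available': 'April 29, 2023'}"
details = parse_details(sample)
print(details["Date First Available"])
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for untrusted strings.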

Missing values:

  • main_category: 24,805 missing values
  • store: 253 missing values
  • rating_number: 6 missing values
  • price: 35,869 missing values
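
To put these counts in proportion, they can be divided by the stated row count. The sketch below uses only the figures listed above; the variable names are illustrative.

```python
TOTAL_ROWS = 117_243  # stated number of rows

# Missing-value counts as listed for the dataset.
MISSING_COUNTS = {
    "main_category": 24_805,
    "store": 253,
    "rating_number": 6,
    "price": 35_869,
}

# Fraction of rows with a missing value, per column.
missing_share = {col: n / TOTAL_ROWS for col, n in MISSING_COUNTS.items()}

# Print columns from most to least affected.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<15} {MISSING_COUNTS[col]:>7,} missing ({share:.2%})")
```

Price is missing for roughly 31% of rows and main_category for roughly 21%, so analyses that depend on either column will likely need filtering or imputation first.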

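Dataset Uses

  • Semantic search: Use the embeddings to find similar products based on textual descriptions.
  • Product recommendation: Enhance recommendation systems with detailed product metadata.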
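
The semantic search use case can be sketched as a cosine-similarity ranking over the `embeddings` column. The example below is a minimal illustration that assumes the query vector was produced by the same embedding model as the stored vectors; the rows, ASINs, and 3-dimensional vectors are made up for the demonstration, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Return the k (score, parent_asin) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query, row["embeddings"]), row["parent_asin"])
        for row in rows
        if row["embeddings"]  # skip rows with an empty embedding list
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Toy rows with made-up ASINs and 3-dimensional vectors (not real data).
rows = [
    {"parent_asin": "TOY-A", "embeddings": [1.0, 0.0, 0.0]},
    {"parent_asin": "TOY-B", "embeddings": [0.9, 0.1, 0.0]},
    {"parent_asin": "TOY-C", "embeddings": [0.0, 1.0, 0.0]},
    {"parent_asin": "TOY-D", "embeddings": []},
]
query = [1.0, 0.0, 0.0]

results = top_k(query, rows)
for score, asin in results:
    print(f"{asin}: {score:.4f}")
```

For the full 117,243 rows, a vectorized library or an approximate nearest-neighbor index would likely be more practical than this pure-Python loop.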

Citation

BibTeX:
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

Contact

For questions or issues regarding the dataset, please contact Amazon Reviews 2023.
