ckandemir/amazon-products

Name: ckandemir/amazon-products
Creator: ckandemir
Published: 2023-11-21 09:46:07
License: not listed on this page (the dataset card metadata specifies apache-2.0)

Source: Hugging Face. Updated 2023-11-21; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ckandemir/amazon-products


Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: eval
    path: data/eval-*
dataset_info:
  features:
  - name: Product Name
    dtype: string
  - name: Category
    dtype: string
  - name: Description
    dtype: string
  - name: Selling Price
    dtype: string
  - name: Product Specification
    dtype: string
  - name: Image
    dtype: string
  splits:
  - name: train
    num_bytes: 12542887
    num_examples: 23993
  - name: test
    num_bytes: 3499375
    num_examples: 6665
  - name: eval
    num_bytes: 1376174
    num_examples: 2666
  download_size: 6391314
  dataset_size: 17418436
license: apache-2.0
task_categories:
- image-classification
- image-to-text
language:
- en
size_categories:
- 10K<n<100K
---

## Dataset Creation and Processing Overview

This dataset underwent a comprehensive process of loading, cleaning, processing, and preparing, incorporating a range of data manipulation and NLP techniques to optimize its utility for machine learning models, particularly in natural language processing.

### Data Loading and Initial Cleaning

- **Source**: Loaded from the Hugging Face dataset repository [bprateek/amazon_product_description](https://huggingface.co/datasets/bprateek/amazon_product_description).
- **Conversion to Pandas DataFrame**: For ease of data manipulation.
- **Null Value Removal**: Rows with null values in the 'About Product' column were discarded.

### Data Cleaning and NLP Processing

- **Sentence Extraction**: 'About Product' descriptions were split into sentences, identifying common phrases.
- **Emoji and Special Character Removal**: A regex function removed these elements from the product descriptions.
- **Common Phrase Elimination**: A function was used to strip common phrases from each product description.
- **Improving Writing Standards**: Adjusted capitalization, punctuation, and replaced '&' with 'and' for better readability and formalization.

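The cleaning steps listed above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the dataset author's code: the helper names (`strip_special`, `split_sentences`, `common_phrases`, `clean`), the regular expressions, the `min_count` threshold, and the sample rows are all assumptions made for the example. The card itself mentions a Pandas DataFrame; plain dictionaries are used here only to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Hypothetical helpers: names, regexes, and thresholds are illustrative only.

# Anything outside printable ASCII (emoji, symbols) is treated as unwanted.
_NON_ASCII = re.compile(r"[^\x20-\x7E]+")
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_special(text: str) -> str:
    """Replace non-ASCII runs with a space, then collapse whitespace."""
    return re.sub(r"\s+", " ", _NON_ASCII.sub(" ", text)).strip()


def split_sentences(text: str) -> list[str]:
    """Split a description into non-empty sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def common_phrases(descriptions: list[str], min_count: int = 2) -> set[str]:
    """Return sentences that appear in at least `min_count` descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(set(split_sentences(description)))
    return {s for s, n in counts.items() if n >= min_count}


def clean(text: str, boilerplate: set[str]) -> str:
    """Drop common phrases, replace '&' with 'and', fix capitalization and final punctuation."""
    out = []
    for s in split_sentences(text):
        if s in boilerplate:
            continue
        s = s.replace("&", "and")
        s = s[0].upper() + s[1:]
        if s[-1] not in ".!?":
            s += "."
        out.append(s)
    return " ".join(out)


# Made-up sample rows standing in for the 'About Product' column.
rows = [
    {"About Product": "Make sure this fits by entering your model number. soft cotton tee & shorts 😀"},
    {"About Product": "Make sure this fits by entering your model number. Durable steel frame"},
    {"About Product": None},
]

rows = [r for r in rows if r["About Product"] is not None]  # null value removal
texts = [strip_special(r["About Product"]) for r in rows]   # emoji / special characters
boiler = common_phrases(texts)                               # phrases shared across rows
cleaned = [clean(t, boiler) for t in texts]
print(cleaned)
```

Counting a sentence once per description (via `set`) keeps a phrase repeated inside a single product from being mistaken for cross-product boilerplate; whether the original pipeline made the same choice is not stated in the card.
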
### Sentence Similarity Analysis

- **Model Application**: The pre-trained Sentence Transformer model 'all-MiniLM-L6-v2' was used.
- **Sentence Comparison**: Identified the most similar sentence to each product name within the cleaned product descriptions.

### Dataset Refinement

- **Column Selection**: Retained relevant columns for final dataset.
- **Image URL Processing**: Split multiple image URLs into individual URLs, removing specific unwanted URLs.

### Image Validation

- **Image URL Validation**: Implemented a function to verify the validity of each image URL.
- **Filtering Valid Images**: Retained only rows with valid image URLs.

### Dataset Splitting for Machine Learning

- **Creation of Train, Test, and Eval Sets**: Used scikit-learn's `train_test_split` for dataset division.

For further details or to contribute to enhancing the dataset card, please refer to the [Hugging Face Dataset Card Contribution Guide](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards).
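
The train, test, and eval division described in the card can be illustrated with a two-stage random split. The card says scikit-learn's `train_test_split` was used but gives no ratios or seed, so the sketch below swaps in the standard library's `random` module to stay dependency-free, and the fractions `0.2` and `0.1`, the seed, and the function name `two_stage_split` are assumptions. With ceiling rounding, these assumed fractions happen to reproduce the published split sizes for the 33,324 total rows, which is suggestive but does not confirm the original settings.

```python
import math
import random


def two_stage_split(items, test_frac=0.2, eval_frac=0.1, seed=42):
    """Shuffle once, carve off a test set, then carve an eval set from the rest.

    test_frac, eval_frac, and seed are assumed values, not taken from the card.
    Held-out sizes are rounded up with ceil.
    """
    items = list(items)
    random.Random(seed).shuffle(items)

    n_test = math.ceil(test_frac * len(items))
    test, rest = items[:n_test], items[n_test:]

    n_eval = math.ceil(eval_frac * len(rest))
    eval_, train = rest[:n_eval], rest[n_eval:]
    return train, test, eval_


# 33324 = 23993 + 6665 + 2666, the sum of the split sizes in the card metadata.
train, test, eval_ = two_stage_split(range(33324))
print(len(train), len(test), len(eval_))  # 23993 6665 2666
```

The three parts are disjoint and together cover every row, so no example leaks between splits.
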

Provider:

ckandemir

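Summary of original information

Dataset overview

Data processing workflow

The dataset went through a comprehensive process of loading, cleaning, processing, and preparation.
A range of data manipulation and natural language processing (NLP) techniques was applied.

Optimization goal

Optimize the dataset for use with machine learning models, particularly in natural language processing.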
