
ckandemir/amazon-products

Source: Hugging Face. Updated 2023-11-21. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ckandemir/amazon-products
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: eval
    path: data/eval-*
dataset_info:
  features:
  - name: Product Name
    dtype: string
  - name: Category
    dtype: string
  - name: Description
    dtype: string
  - name: Selling Price
    dtype: string
  - name: Product Specification
    dtype: string
  - name: Image
    dtype: string
  splits:
  - name: train
    num_bytes: 12542887
    num_examples: 23993
  - name: test
    num_bytes: 3499375
    num_examples: 6665
  - name: eval
    num_bytes: 1376174
    num_examples: 2666
  download_size: 6391314
  dataset_size: 17418436
license: apache-2.0
task_categories:
- image-classification
- image-to-text
language:
- en
size_categories:
- 10K<n<100K
---

## Dataset Creation and Processing Overview

This dataset underwent a comprehensive process of loading, cleaning, processing, and preparing, incorporating a range of data manipulation and NLP techniques to optimize its utility for machine learning models, particularly in natural language processing.

### Data Loading and Initial Cleaning

- **Source**: Loaded from the Hugging Face dataset repository [bprateek/amazon_product_description](https://huggingface.co/datasets/bprateek/amazon_product_description).
- **Conversion to Pandas DataFrame**: For ease of data manipulation.
- **Null Value Removal**: Rows with null values in the 'About Product' column were discarded.

### Data Cleaning and NLP Processing

- **Sentence Extraction**: 'About Product' descriptions were split into sentences, identifying common phrases.
- **Emoji and Special Character Removal**: A regex function removed these elements from the product descriptions.
- **Common Phrase Elimination**: A function was used to strip common phrases from each product description.
- **Improving Writing Standards**: Adjusted capitalization, punctuation, and replaced '&' with 'and' for better readability and formalization.

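
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: the function names (`find_common_phrases`, `clean_description`), the `min_count` threshold, the character allow-list, and the sample strings are all assumptions, since the card does not give the actual regex or phrase list.

```python
import re
from collections import Counter

# Assumption: "special characters" means anything outside this allow-list.
# The card does not publish the real pattern.
_DISALLOWED = re.compile(r"""[^A-Za-z0-9\s.,;:!?'"()%/-]""")


def find_common_phrases(descriptions, min_count=2):
    """Return sentences that occur in at least `min_count` descriptions."""
    counts = Counter()
    for description in descriptions:
        # Split after sentence-ending punctuation; a set avoids counting a
        # sentence twice within the same description.
        sentences = {s.strip() for s in re.split(r"(?<=[.!?])\s+", description)}
        counts.update(s for s in sentences if s)
    return [sentence for sentence, n in counts.items() if n >= min_count]


def clean_description(text, common_phrases=()):
    """Strip common phrases, emoji and symbols, then tidy the writing."""
    for phrase in common_phrases:
        text = text.replace(phrase, " ")
    text = text.replace("&", " and ")              # '&' -> 'and'
    text = _DISALLOWED.sub("", text)               # drop emoji and symbols
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)   # no space before punctuation
    # Capitalize the first letter of each sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    if text and text[-1] not in ".!?":
        text += "."                                # ensure terminal punctuation
    return text


if __name__ == "__main__":
    raw = [
        "Make sure this fits by entering your model number. soft & durable fabric \u2728 great for kids",
        "Make sure this fits by entering your model number. 100% cotton \U0001F600. machine washable",
    ]
    common = find_common_phrases(raw)
    for description in raw:
        print(clean_description(description, common))
```

Replacing `&` before the symbol filter matters here: with the allow-list above, the ampersand would otherwise be deleted rather than rewritten as "and".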
### Sentence Similarity Analysis

- **Model Application**: The pre-trained Sentence Transformer model 'all-MiniLM-L6-v2' was used.
- **Sentence Comparison**: Identified the most similar sentence to each product name within the cleaned product descriptions.

### Dataset Refinement

- **Column Selection**: Retained relevant columns for final dataset.
- **Image URL Processing**: Split multiple image URLs into individual URLs, removing specific unwanted URLs.

### Image Validation

- **Image URL Validation**: Implemented a function to verify the validity of each image URL.
- **Filtering Valid Images**: Retained only rows with valid image URLs.

### Dataset Splitting for Machine Learning

- **Creation of Train, Test, and Eval Sets**: Used scikit-learn's `train_test_split` for dataset division.

For further details or to contribute to enhancing the dataset card, please refer to the [Hugging Face Dataset Card Contribution Guide](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards).
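
The image URL processing and validation steps described in the card could look something like the sketch below. Several details are assumptions rather than facts from the card: the `|` separator, the `transparent-pixel` marker used to spot unwanted URLs, the function names, the `example.com` sample URLs, and the choice of an HTTP `HEAD` request as the validity check. The card only says that URLs were split, specific unwanted ones removed, and the rest validated.

```python
import http.client
import urllib.request
from urllib.parse import urlparse

# Assumptions (not stated in the card): separator and unwanted-URL marker.
SEPARATOR = "|"
UNWANTED_MARKERS = ("transparent-pixel",)


def split_image_urls(raw, separator=SEPARATOR, unwanted=UNWANTED_MARKERS):
    """Split a multi-URL field and drop empty or unwanted entries."""
    urls = [part.strip() for part in raw.split(separator)]
    return [u for u in urls if u and not any(marker in u for marker in unwanted)]


def looks_like_http_url(url):
    """Cheap syntactic check: http(s) scheme and a host are present."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_reachable_image(url, timeout=5.0):
    """Network check: HEAD request answered with 200 and an image type.

    One possible definition of a "valid" image URL; the card does not
    say which check was actually used.
    """
    if not looks_like_http_url(url):
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return response.status == 200 and content_type.startswith("image/")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers URLError/HTTPError, timeouts, and malformed responses.
        return False


if __name__ == "__main__":
    field = (
        "https://example.com/images/a.jpg|"
        "https://example.com/images/transparent-pixel.jpg|"
        "https://example.com/images/b.jpg"
    )
    print(split_image_urls(field))
```

Keeping the syntactic check separate from the network check lets obviously malformed entries be filtered without issuing a request for each one.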
Provider:
ckandemir
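Summary of original information

Dataset overview

Data processing pipeline

  • The dataset went through a full pipeline of loading, cleaning, processing, and preparation.
  • A range of data manipulation and natural language processing (NLP) techniques were applied.

Optimization goal

  • Optimize the dataset for use with machine learning models, particularly in natural language processing.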