---
dataset_info:
  features:
  - name: asin
    dtype: string
  - name: category
    dtype: string
  - name: img_url
    dtype: string
  - name: title
    dtype: string
  - name: feature-bullets
    sequence: string
  - name: tech_data
    sequence:
      sequence: string
  - name: labels
    dtype: string
  - name: tech_process
    dtype: string
  splits:
  - name: train
    num_bytes: 75797
    num_examples: 20
  download_size: 62474
  dataset_size: 75797
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
size_categories:
- n<1K
---
# Dataset Card for "amazon-product-data-filter"
## Dataset Description
- **Homepage:** [τenai.io - AI Consulting](https://www.tenai.io/)
- **Point of Contact:** [Iftach Arbel](mailto:ia@momentum-ai.io)
### Dataset Summary
The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more.
**NOTICE:** This is a sample of the full [Amazon Product Dataset](https://huggingface.co/datasets/iarbel/amazon-product-data-filter), which contains 1K examples. Follow the link to gain access to the full dataset.
### Languages
The text in the dataset is in English.
## Dataset Structure
### Data Instances
Each data point provides product information, such as ASIN (Amazon Standard Identification Number), title, feature-bullets, and more.
### Data Fields
- `asin`: Amazon Standard Identification Number.
- `category`: The product category. This field holds the search string used to obtain the listing; it is not the product category as it appears on Amazon.com.
- `img_url`: Main image URL from the product page.
- `title`: Product title, as appears on the product page.
- `feature-bullets`: Product feature-bullets list, as they appear on the product page.
- `tech_data`: Product technical data (material, style, etc.), as it appears on the product page. Structured as a list of tuples, where the first element is a feature (e.g. material) and the second element is a value (e.g. plastic).
- `labels`: A processed version of the `feature-bullets` field. The original feature-bullets were aligned to a standard structure with a capitalized prefix, emojis were removed, etc. Finally, the list items were concatenated into a single string with a `\n` separator.
- `tech_process`: A processed version of the `tech_data` field. The original tech data was filtered and transformed from a `(key, value)` structure into natural-language text.
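As a rough illustration of the field layout described above, the following Python sketch works on a hand-written record. All values in `example` are invented for illustration and are not taken from the dataset; in particular, the exact wording of `labels` and `tech_process` in real records may differ. The helper names `tech_data_to_dict` and `split_labels` are hypothetical and not part of any library.

```python
from typing import Dict, List, Sequence


def tech_data_to_dict(tech_data: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Turn a list of (feature, value) pairs into a dict, skipping malformed pairs."""
    return {pair[0]: pair[1] for pair in tech_data if len(pair) == 2}


def split_labels(labels: str) -> List[str]:
    """Split the newline-joined `labels` string back into individual bullets."""
    return [line for line in labels.split("\n") if line.strip()]


# Invented record that mimics the documented schema (not real dataset content).
example = {
    "asin": "B000000000",
    "category": "water bottle",
    "img_url": "https://example.com/image.jpg",
    "title": "Example Water Bottle",
    "feature-bullets": ["leak proof lid", "bpa free"],
    "tech_data": [["Material", "Plastic"], ["Color", "Blue"]],
    "labels": "LEAK PROOF: Lid seals tightly.\nBPA FREE: Made from safe plastic.",
    "tech_process": "Material is Plastic. Color is Blue.",
}

specs = tech_data_to_dict(example["tech_data"])
bullets = split_labels(example["labels"])
print(specs)
print(bullets)
```

Converting `tech_data` to a dict makes feature lookups convenient, but it keeps only the last value if a feature name repeats; keep the original list of pairs when duplicates matter.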
### Data Splits
The sample dataset has 20 train examples. For the full dataset, click [here](https://huggingface.co/datasets/iarbel/amazon-product-data-filter).
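The single `train` split can presumably be loaded with the Hugging Face `datasets` library. The sketch below assumes the repository id `iarbel/amazon-product-data-sample` (taken from the citation URL in this card), that the `datasets` package is installed, and that the repository is reachable without extra authentication. `load_sample` and `count_categories` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def load_sample():
    """Load the sample's train split; needs `pip install datasets` and network access."""
    # Imported lazily so that count_categories() below also works offline.
    from datasets import load_dataset

    # Repository id is an assumption based on the citation URL in this card.
    return load_dataset("iarbel/amazon-product-data-sample", split="train")


def count_categories(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count examples per `category` (the search string used to find each listing)."""
    return dict(Counter(str(row["category"]) for row in rows))


# Usage (requires network access):
# ds = load_sample()
# print(len(ds))
# print(count_categories(ds))
```

Because `category` stores the search string rather than Amazon's own taxonomy, these counts reflect how the listings were collected, not how Amazon classifies them.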
## Dataset Creation
### Curation Rationale
This dataset was built to provide high-quality data in the e-commerce domain and to support fine-tuning LLMs for specific tasks. Raw, unstructured data was collected from Amazon.com, then parsed, processed, and filtered using various techniques (annotations, rule-based methods, models).
### Source Data
#### Initial Data Collection and Normalization
The data was obtained by collecting raw HTML data from Amazon.com.
### Annotations
The dataset does not contain any additional annotations.
### Personal and Sensitive Information
There is no personal information in the dataset.
## Considerations for Using the Data
### Social Impact of Dataset
To the best of our knowledge, there is no social impact for this dataset. The data is highly technical, and usage for product text-generation or classification does not pose a risk.
### Other Known Limitations
The quality of product listings varies, and the information in them may not be accurate.
## Additional Information
### Dataset Curators
The dataset was collected and curated by [Iftach Arbel](mailto:ia@momentum-ai.io).
### Licensing Information
The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.
### Citation Information
```
@misc{amazon_product_filter,
author = {Iftach Arbel},
title = {Amazon Product Dataset Sample},
year = {2023},
publisher = {Huggingface},
journal = {Huggingface dataset},
howpublished = {https://huggingface.co/datasets/iarbel/amazon-product-data-sample},
}
```