
smartcat/Amazon_Luxury_Beauty_2018

Source: Hugging Face · Updated: 2024-10-22 · Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/smartcat/Amazon_Luxury_Beauty_2018
Resource description:
---
dataset_info:
- config_name: metadata
  features:
  - name: asin
    dtype: string
  - name: title
    dtype: string
  - name: description
    dtype: string
  - name: brand
    dtype: string
  - name: main_cat
    dtype: string
  - name: category
    sequence: 'null'
  - name: also_buy
    sequence: string
  - name: also_view
    sequence: string
  - name: imageURL
    sequence: string
  - name: imageURLHighRes
    sequence: string
  splits:
  - name: train
    num_bytes: 24898313
    num_examples: 12299
  download_size: 9490355
  dataset_size: 24898313
- config_name: reviews
  features:
  - name: reviewerID
    dtype: string
  - name: reviewerName
    dtype: string
  - name: overall
    sequence: int64
  - name: reviewTime
    sequence: timestamp[us]
  - name: asin
    sequence: string
  - name: reviewText
    sequence: string
  - name: summary
    sequence: string
  splits:
  - name: train
    num_bytes: 24663121
    num_examples: 6107
  download_size: 11955594
  dataset_size: 24663121
configs:
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*
- config_name: reviews
  data_files:
  - split: train
    path: reviews/train-*
---

# Amazon Luxury Beauty Dataset

## Directory Structure

- **metadata**: Contains product information.
- **reviews**: Contains user reviews about the products.
- **filtered**:
  - **e5-base-v2_embeddings.jsonl**: Contains "asin" and "embeddings" created with [e5-base-v2](https://huggingface.co/intfloat/e5-base-v2).
  - **metadata.jsonl**: Contains "asin" and "text", where text is created from the title, description, brand, main category, and category.
  - **reviews.jsonl**: Contains "reviewerID", "reviewTime", and "asin". Reviews are filtered to include only perfect 5-star ratings with a minimum of 5 ratings.

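
Because the files in the "filtered" directory share the "asin" key, they can be paired once loaded. The following is a minimal sketch of that pairing, not part of the dataset's own tooling. It assumes each JSONL line is an object with the fields named above; the sample lines, the two-dimensional toy vectors, and the helper names (`read_jsonl`, `index_by_asin`, `cosine`, `most_similar`) are all hypothetical.

```python
import json
import math


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def index_by_asin(records, field):
    """Map each record's "asin" to the value stored under `field`."""
    return {record["asin"]: record[field] for record in records}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_asin, vectors):
    """Return the asin whose vector is closest to the query's, excluding the query itself."""
    query = vectors[query_asin]
    candidates = (asin for asin in vectors if asin != query_asin)
    return max(candidates, key=lambda asin: cosine(query, vectors[asin]))


# Hypothetical stand-ins for lines of filtered/e5-base-v2_embeddings.jsonl
# and filtered/metadata.jsonl; the real vectors are much longer than two values.
embedding_lines = [
    '{"asin": "B000000001", "embeddings": [1.0, 0.0]}',
    '{"asin": "B000000002", "embeddings": [0.6, 0.8]}',
    '{"asin": "B000000003", "embeddings": [0.0, 1.0]}',
]
metadata_lines = [
    '{"asin": "B000000001", "text": "toy product one"}',
    '{"asin": "B000000002", "text": "toy product two"}',
    '{"asin": "B000000003", "text": "toy product three"}',
]

vectors = index_by_asin(read_jsonl(embedding_lines), "embeddings")
texts = index_by_asin(read_jsonl(metadata_lines), "text")

nearest = most_similar("B000000001", vectors)
print(nearest, texts[nearest])
```

Keying both files by "asin" keeps the lookup independent of line order, which the source does not state is aligned between the two files.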
## Usage

### Download metadata

```python
from datasets import load_dataset
metadata = load_dataset(path="smartcat/Amazon_Luxury_Beauty_2018", name="metadata", split="train")
```

### Download reviews

```python
reviews = load_dataset(path="smartcat/Amazon_Luxury_Beauty_2018", name="reviews", split="train")
```

### Download filtered files

```python
filtered_reviews = load_dataset(
    path="smartcat/Amazon_Luxury_Beauty_2018",
    data_files="filtered/reviews.jsonl",
    split="train",
)
```

**📎 Note:** You can set any file or list of files from the "filtered" directory as the "data_files" argument.
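
In the `reviews` config, `reviewerID` and `reviewerName` are plain strings, while `overall`, `reviewTime`, `asin`, `reviewText`, and `summary` are sequences, so each row appears to hold all reviews by one reviewer. The sketch below flattens such a row into one dict per review. It assumes the sequence fields are parallel (same length, aligned by index), which the schema suggests but does not guarantee; the `explode_reviews` helper and the sample record are hypothetical.

```python
from datetime import datetime

# Sequence-typed fields of the "reviews" config, per the dataset schema.
SEQUENCE_FIELDS = ("overall", "reviewTime", "asin", "reviewText", "summary")


def explode_reviews(record):
    """Turn one per-reviewer record into a list of per-review dicts.

    Assumes the sequence fields are index-aligned; raises ValueError otherwise.
    """
    lengths = {len(record[field]) for field in SEQUENCE_FIELDS}
    if len(lengths) != 1:
        raise ValueError("sequence fields have mismatched lengths")

    rows = []
    for values in zip(*(record[field] for field in SEQUENCE_FIELDS)):
        row = {
            "reviewerID": record["reviewerID"],
            "reviewerName": record["reviewerName"],
        }
        row.update(zip(SEQUENCE_FIELDS, values))
        rows.append(row)
    return rows


# Hypothetical record shaped like one row of the "reviews" config.
sample = {
    "reviewerID": "A0000000000001",
    "reviewerName": "Example Reviewer",
    "overall": [5, 4],
    "reviewTime": [datetime(2018, 1, 5), datetime(2018, 3, 9)],
    "asin": ["B000000001", "B000000002"],
    "reviewText": ["Works well.", "Decent, a bit pricey."],
    "summary": ["Great", "Okay"],
}

rows = explode_reviews(sample)
for row in rows:
    print(row["reviewerID"], row["asin"], row["overall"])
```

The same function could be applied to each row returned by `load_dataset(..., name="reviews", split="train")`, since iterating a loaded split yields dicts keyed by feature name.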

Provider:
smartcat