
cogsci13/Amazon-Reviews-2023-Books-Review

Source: Hugging Face. Updated 2024-04-18; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/cogsci13/Amazon-Reviews-2023-Books-Review
Resource description:

---
language:
- en
tags:
- recommendation
- reviews
size_categories:
- 100M<n<1B
---

# Amazon Reviews 2023 (Books Only)

**This is a subset of the Amazon Reviews 2023 dataset. Please visit [amazon-reviews-2023.github.io/](https://amazon-reviews-2023.github.io/) for more details, loading scripts, and preprocessed benchmark files.**

**[April 18, 2024]** Update

1. This dataset was created and pushed for the first time.

---

This is a large-scale **Amazon Reviews** dataset, collected in **2023** by [McAuley Lab](https://cseweb.ucsd.edu/~jmcauley/), and it includes rich features such as:

1. **User Reviews** (*ratings*, *text*, *helpfulness votes*, etc.);
2. **Item Metadata** (*descriptions*, *price*, *raw image*, etc.);

## What's New?

In Amazon Reviews'23, we provide:

1. **Larger Dataset:** We collected 571.54M reviews, 245.2% larger than the last version;
2. **Newer Interactions:** Current interactions range from May 1996 to Sep. 2023;
3. **Richer Metadata:** More descriptive features in item metadata;
4. **Fine-grained Timestamp:** Interaction timestamp at the second or finer level;
5. **Cleaner Processing:** Cleaner item metadata than previous versions;
6. **Standard Splitting:** Standard data splits to encourage RecSys benchmarking.

## Basic Statistics

> We define <b>#R_Tokens</b> as the number of [tokens](https://pypi.org/project/tiktoken/) in user reviews and <b>#M_Tokens</b> as the number of [tokens](https://pypi.org/project/tiktoken/) obtained when the dictionaries of item attributes are treated as strings. We emphasize them as important statistics in the era of LLMs.
>
> We count the number of items based on user reviews rather than item metadata files. Note that some items lack metadata.

### Grouped by Category

| Category | #User | #Item | #Rating | #R_Token | #M_Token | Download |
| -------- | ----: | ----: | ------: | -------: | -------: | -------: |
| Books | 10.3M | 4.4M | 29.5M | 2.9B | 3.7B | <a href='https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Books.jsonl.gz' download>review</a>, <a href='https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/meta_categories/meta_Books.jsonl.gz' download>meta</a> |

> Check Pure ID files and corresponding data splitting strategies in the <b>[Common Data Processing](https://amazon-reviews-2023.github.io/data_processing/index.html)</b> section.

## Quick Start

### Load User Reviews

```python
from datasets import load_dataset

dataset = load_dataset("cogsci13/Amazon-Reviews-2023-Books-Review", "raw_review_Books", trust_remote_code=True)
print(dataset["full"][0])
```

```json
{'rating': {0: 1.0}, 'title': {0: 'Not a watercolor book! Seems like copies imo.'}, 'text': {0: 'It is definitely not a watercolor book. The paper bucked completely. The pages honestly appear to be photo copies of other pictures. I say that bc if you look at the seal pics you can see the tell tale line at the bottom of the page. As someone who has made many photocopies of pages in my time so I could try out different colors & mediums that black line is a dead giveaway to me. It’s on other pages too. The entire book just seems off. Nothing is sharp & clear. There is what looks like toner dust on all the pages making them look muddy. There are no sharp lines & there is no clear definition. At least there isn’t in my copy. And the Coloring Book for Adult on the bottom of the front cover annoys me. Why is it singular & not plural? They usually say coloring book for kids or coloring book for kids & adults or coloring book for adults- plural. Lol Plus it would work for kids if you can get over the grey scale nature of it. 
Personally I’m not going to waste expensive pens & paints trying to paint over the grey & black mess. I grew up in SW Florida minutes from the beaches & I was really excited about the sea life in this. I hope the printers & designers figure out how to clean up the mess bc some of the designs are really cute. They just aren’t worth my time to hand trace & transfer them, but I’m sure there are ppl that will be up to the challenge. This is one is a hard no. Going back. I tried.'}, 'images': {0: array([{'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 
'https://m.media-amazon.com/images/I/8109NwjpHKL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/8109NwjpHKL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/8109NwjpHKL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 
'https://m.media-amazon.com/images/I/81nir7bf91L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81nir7bf91L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81nir7bf91L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL256_.jpg'}], dtype=object)}, 'asin': {0: 'B09BGPFTDB'}, 'parent_asin': {0: 'B09BGPFTDB'}, 'user_id': {0: 'AFKZENTNBQ7A7V7UXW5JJI6UGRYQ'}, 'timestamp': {0: 1642399598485}, 'helpful_vote': {0: 0}, 'verified_purchase': {0: True}}
```

### Load Item Metadata

```python
from datasets import load_dataset

# Item metadata lives in a separate repository from the reviews.
dataset = load_dataset("cogsci13/Amazon-Reviews-2023-Books-Meta", "raw_meta_Books", split="full", trust_remote_code=True)
print(dataset[0])
```

```json
{'main_category': {0: 'Books'}, 'title': {0: 'Chaucer'}, 'average_rating': {0: 4.5}, 'rating_number': {0: 29}, 'features': {0: array([], dtype=object)}, 'description': {0: array([], dtype=object)}, 'price': {0: '8.23'}, 'images': {0: {'hi_res': array([None], dtype=object), 'large': array(['https://m.media-amazon.com/images/I/41X61VPJYKL._SX334_BO1,204,203,200_.jpg'], dtype=object), 'thumb': array([None], dtype=object), 'variant': array(['MAIN'], dtype=object)}}, 'videos': {0: {'title': array([], dtype=object), 'url': 
array([], dtype=object), 'user_id': array([], dtype=object)}}, 'store': {0: 'Peter Ackroyd (Author)'}, 'categories': {0: array(['Books', 'Literature & Fiction', 'History & Criticism'], dtype=object)}, 'details': {0: '{"Publisher": "Chatto & Windus; First Edition (January 1, 2004)", "Language": "English", "Hardcover": "196 pages", "ISBN 10": "0701169850", "ISBN 13": "978-0701169855", "Item Weight": "10.1 ounces", "Dimensions": "5.39 x 0.71 x 7.48 inches"}'}, 'parent_asin': {0: '0701169850'}, 'bought_together': {0: None}, 'subtitle': {0: 'Hardcover – Import, January 1, 2004'}, 'author': {0: "{'avatar': 'https://m.media-amazon.com/images/I/21Je2zja9pL._SY600_.jpg', 'name': 'Peter Ackroyd', 'about': ['Peter Ackroyd, (born 5 October 1949) is an English biographer, novelist and critic with a particular interest in the history and culture of London. For his novels about English history and culture and his biographies of, among others, William Blake, Charles Dickens, T. S. Eliot and Sir Thomas More, he won the Somerset Maugham Award and two Whitbread Awards. He is noted for the volume of work he has produced, the range of styles therein, his skill at assuming different voices and the depth of his research.', 'He was elected a fellow of the Royal Society of Literature in 1984 and appointed a Commander of the Order of the British Empire in 2003.', 'Bio from Wikipedia, the free encyclopedia.']}"}}
```

> Check data loading examples and Huggingface datasets APIs in the <b>[Common Data Loading](https://amazon-reviews-2023.github.io/data_loading/index.html)</b> section.

## Data Fields

### For User Reviews

| Field | Type | Explanation |
| ----- | ---- | ----------- |
| rating | float | Rating of the product (from 1.0 to 5.0). |
| title | str | Title of the user review. |
| text | str | Text body of the user review. |
| images | list | Images that users post after they have received the product. Each image has different sizes (small, medium, large), represented by the small_image_url, medium_image_url, and large_image_url respectively. |
| asin | str | ID of the product. |
| parent_asin | str | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually parent ID. <b>Please use parent ID to find product meta.</b> |
| user_id | str | ID of the reviewer. |
| timestamp | int | Time of the review (unix time). |
| verified_purchase | bool | User purchase verification. |
| helpful_vote | int | Helpful votes of the review. |

### For Item Metadata

| Field | Type | Explanation |
| ----- | ---- | ----------- |
| main_category | str | Main category (i.e., domain) of the product. |
| title | str | Name of the product. |
| average_rating | float | Rating of the product shown on the product page. |
| rating_number | int | Number of ratings in the product. |
| features | list | Bullet-point format features of the product. |
| description | list | Description of the product. |
| price | float | Price in US dollars (at time of crawling). |
| images | list | Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of image. |
| videos | list | Videos of the product including title and url. |
| store | str | Store name of the product. |
| categories | list | Hierarchical categories of the product. |
| details | dict | Product details, including materials, brand, sizes, etc. |
| parent_asin | str | Parent ID of the product. |
| bought_together | list | Recommended bundles from the websites. |

## Citation

```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```

## Contact Us

- **Report Bugs**: To report bugs in the dataset, please file an issue on our [GitHub](https://github.com/hyp1231/AmazonReviews2023/issues/new).
- **Others**: For research collaborations or other questions, please email **yphou AT ucsd.edu**.

Provider:
cogsci13
Summary of Original Information

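Dataset overview: Amazon Reviews 2023 (Books Only)

Basic Dataset Information

  • Name: Amazon Reviews 2023 (Books Only)
  • Language: English
  • Tags: recommendation, reviews
  • Size category: 100M<n<1B

Dataset Contents

  • User reviews: ratings, text, helpfulness votes, and more.
  • Item metadata: descriptions, price, raw images, and more.

Dataset Updates

  • First release: April 18, 2024
  • Dataset size: 571.54M reviews across the full Amazon Reviews 2023 collection, 245.2% larger than the previous version (the Books subset accounts for 29.5M ratings).
  • Interaction time range: May 1996 to September 2023.
  • Metadata richness: more descriptive features in item metadata.
  • Timestamp precision: second-level or finer.
  • Data processing: cleaner item metadata than previous versions.
  • Data splitting: standard splits to encourage recommender-system benchmarking.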

Dataset Statistics

  • Category: Books
  • Users: 10.3M
  • Items: 4.4M
  • Ratings: 29.5M
  • Tokens in user reviews (#R_Token): 2.9B
  • Tokens in item metadata (#M_Token): 3.7B
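
The headline counts above imply a few derived quantities that can help when sizing an experiment. The following is a rough sketch that works them out from the rounded figures in the statistics table, so every result is approximate; the variable names are illustrative and not part of the dataset.

```python
# Rounded counts taken from the Books row of the statistics table.
users = 10.3e6
items = 4.4e6
ratings = 29.5e6
review_tokens = 2.9e9

# Average activity per user and per item.
ratings_per_user = ratings / users
ratings_per_item = ratings / items

# Average review length in tokens, assuming one review per rating.
tokens_per_rating = review_tokens / ratings

# Fraction of the user-item matrix that holds an observed rating.
density = ratings / (users * items)

print(f"ratings per user:  {ratings_per_user:.2f}")
print(f"ratings per item:  {ratings_per_item:.2f}")
print(f"tokens per rating: {tokens_per_rating:.1f}")
print(f"matrix density:    {density:.2e}")
```

With fewer than three ratings per user on average, the interaction matrix is very sparse, which is worth keeping in mind when choosing filtering thresholds or evaluation splits.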

Dataset Loading

  • Loading user reviews: use the load_dataset function from the Hugging Face datasets library.
  • Loading item metadata: use the same load_dataset function.

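Dataset Fields

User Reviews

  • rating: rating (1.0 to 5.0)
  • title: review title
  • text: review text
  • images: images uploaded by the user
  • asin: product ID
  • parent_asin: parent product ID
  • user_id: user ID
  • timestamp: review time (Unix time)
  • verified_purchase: purchase verification
  • helpful_vote: number of helpful votes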
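
The sample review in the dataset card carries the timestamp 1642399598485, a 13-digit value, which suggests milliseconds rather than seconds since the Unix epoch. Treating that as an assumption (the card only says "unix time"), a minimal conversion to a timezone-aware datetime might look like this; `review_time_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def review_time_utc(timestamp_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime.

    Assumes the value is in milliseconds, as the 13-digit sample suggests.
    """
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(timestamp_ms, 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=millis * 1000)


# Timestamp of the sample review shown in the dataset card.
print(review_time_utc(1642399598485).isoformat())
```

If a record turned out to hold seconds instead, this function would return a date far in the future, so a sanity check against the stated 1996 to 2023 range is a cheap safeguard.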

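Item Metadata

  • main_category: main category
  • title: product name
  • average_rating: average rating shown on the product page
  • rating_number: number of ratings
  • features: product features
  • description: product description
  • price: product price (US dollars)
  • images: product images
  • videos: product videos
  • store: product store
  • categories: product categories
  • details: product details
  • parent_asin: parent product ID
  • bought_together: recommended bundles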
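
In the sample metadata record from the dataset card, `price` appears as the string '8.23', `details` as a JSON-encoded string, and `author` as a string holding a Python-style literal with single quotes. Assuming other records follow the same shapes, which the card does not guarantee, the sketch below decodes them defensively. The helper names and the fallback values for missing or malformed input are assumptions made for illustration.

```python
import ast
import json
from typing import Any, Dict, Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Return the price as a float, or None when it is missing or not numeric."""
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def parse_details(raw: Optional[str]) -> Dict[str, Any]:
    """Decode the JSON-encoded `details` string; return {} if missing or malformed."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def parse_author(raw: Optional[str]) -> Dict[str, Any]:
    """Decode `author`, which in the sample is a Python literal, not JSON."""
    if not raw:
        return {}
    try:
        parsed = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Values shortened from the sample record in the dataset card.
print(parse_price("8.23"))
print(parse_details('{"Language": "English", "Hardcover": "196 pages"}'))
print(parse_author("{'name': 'Peter Ackroyd'}"))
```

Returning None or an empty dict instead of raising keeps a long preprocessing pass from stopping on a single bad record; stricter pipelines may prefer to log or count such cases.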

Dataset Citation

@article{hou2024bridging, title={Bridging Language and Items for Retrieval and Recommendation}, author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian}, journal={arXiv preprint arXiv:2403.03952}, year={2024} }

Contact

  • Report bugs: GitHub issue tracker
  • Other inquiries: email yphou AT ucsd.edu
Collected Summary
Dataset Introduction
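Construction
The dataset was collected by McAuley Lab in 2023 to provide large-scale Amazon product review data for recommender-system research. It contains user reviews and item metadata for the Books category: user ratings, review text, and helpfulness votes, together with item descriptions, prices, and raw images. Construction involved crawling data from the Amazon website, followed by cleaning and preprocessing steps intended to keep the data accurate and usable.
Features
Amazon Reviews 2023 (Books Only) has the following characteristics. Scale: the full collection contains 571.54M reviews, 245.2% larger than the previous version. Recency: interactions range from May 1996 to September 2023. Rich metadata: more descriptive item features. Precise timestamps: interaction times recorded at second-level or finer granularity. Cleaner processing: higher-quality item metadata. Standardized splits: convenient for recommender-system benchmarking.
Usage
To use Amazon Reviews 2023 (Books Only): first, load the user reviews or the item metadata with the Hugging Face datasets library; second, select the data split you need, such as a training or validation split; finally, run the analysis or model training your research requires. For example, the user reviews can be used to train a sentiment analysis model, and the item metadata can be used to train a recommender model.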
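
A common preparation step before analysis or training is attaching item metadata to each review. The dataset card asks users to look up metadata by parent ID and notes that some items lack metadata, so the sketch below joins on `parent_asin` and tolerates misses. It uses plain in-memory dictionaries and toy records; the function names and the `u1`/`u2` user IDs are hypothetical, while the two parent IDs are the ones shown in the card's samples.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Record = Dict[str, Any]


def index_metadata(items: Iterable[Record]) -> Dict[str, Record]:
    """Build a lookup table from parent_asin to its metadata record."""
    return {item["parent_asin"]: item for item in items}


def attach_metadata(
    reviews: Iterable[Record], meta_by_parent: Dict[str, Record]
) -> Iterator[Tuple[Record, Optional[Record]]]:
    """Yield (review, metadata) pairs; metadata is None when the item has none."""
    for review in reviews:
        yield review, meta_by_parent.get(review["parent_asin"])


# Toy records for illustration only.
reviews = [
    {"user_id": "u1", "parent_asin": "0701169850", "rating": 5.0},
    {"user_id": "u2", "parent_asin": "B09BGPFTDB", "rating": 1.0},
]
items = [{"parent_asin": "0701169850", "title": "Chaucer"}]

meta_by_parent = index_metadata(items)
for review, meta in attach_metadata(reviews, meta_by_parent):
    title = meta["title"] if meta else "<no metadata>"
    print(review["user_id"], review["rating"], title)
```

A dictionary index is convenient for small samples; at the scale of the full Books subset, a streaming or on-disk join is likely to be more practical.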
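Background and Challenges
Background Overview
In e-commerce and recommender systems, user review datasets are essential for researching and developing recommendation algorithms. The cogsci13/Amazon-Reviews-2023-Books-Review dataset was created by McAuley Lab in 2023 to give researchers large-scale user review data and to advance recommender-system research. It contains rich review features such as ratings, text, and helpfulness votes, as well as item metadata such as descriptions, prices, and raw images. Its recency, update cadence, and rich features make it an important resource for the recommender-system community.
Current Challenges
Building the cogsci13/Amazon-Reviews-2023-Books-Review dataset involved several challenges. First, the domain problem it addresses is the construction of a large-scale, high-quality recommender-system dataset that supports research and development of advanced recommendation algorithms. Second, the construction process required data cleaning, data integration, and feature extraction: for example, how to extract useful features from raw data, how to handle missing data, and how to keep the data consistent and accurate. In addition, as the data volume keeps growing, storing and managing it efficiently has also become a challenge.
Common Scenarios
Classic Use Cases
Amazon Reviews 2023 (Books Only) is a rich resource for recommender-system research, particularly for book recommendation. By analyzing user reviews, ratings, and helpfulness votes, researchers can train models to predict users' book preferences and recommend new books they may find interesting. The dataset also includes item metadata such as descriptions, prices, and raw images, which can help models better understand item characteristics and improve recommendation accuracy.
Derived and Related Work
The release of Amazon Reviews 2023 (Books Only) has supported progress in related research areas. Researchers can use it to develop new recommendation algorithms and improve recommendation accuracy. It can also be used to study the relationship between user behavior and item characteristics, providing market insight for e-commerce platforms and publishers, and to develop new natural language processing models with stronger text analysis and sentiment analysis capabilities.
Recent Research on the Dataset
Latest Research Directions
In recommender systems, recent research on Amazon Reviews 2023 (Books Only) focuses on using user reviews and item metadata to improve recommendation algorithms. The dataset provides a large volume of user reviews, ratings, and helpfulness votes, along with item descriptions, prices, and raw images, giving researchers a rich data foundation. Researchers are exploring how to integrate these signals into recommender systems to better understand user behavior and item attributes and so deliver more accurate personalized recommendations. With the rise of large language models, combining such models with recommender systems, and using their language understanding and generation capabilities to strengthen recommendations, is another active research direction.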
The content above was collected and summarized by 遇见数据集.