
booking-reviews-dataset | Accommodation review dataset | Market analysis dataset

Source: Hugging Face. Updated 2024-07-11. Indexed 2024-12-12.
Accommodation reviews
Market analysis
Download link:
https://huggingface.co/datasets/efainman/booking-reviews-dataset
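A minimal sketch of loading the data from the link above with the third-party Hugging Face `datasets` library. The helper names `repo_id_from_url` and `load_reviews` are hypothetical. That the repository loads directly with `load_dataset`, and that the split is named `train`, are assumptions not confirmed by this page; access may also require accepting the license terms on the hosting site.

```python
from urllib.parse import urlparse

DATASET_URL = "https://huggingface.co/datasets/efainman/booking-reviews-dataset"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repository id from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def load_reviews(split: str = "train"):
    """Download one split of the dataset (assumed split name: 'train').

    Needs network access and `pip install datasets`; the import is kept
    inside the function so the rest of this sketch runs without it.
    """
    from datasets import load_dataset  # third-party library

    return load_dataset(repo_id_from_url(DATASET_URL), split=split)


print(repo_id_from_url(DATASET_URL))
```

Calling `load_reviews()` is left to the reader, since it downloads a large dataset.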
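Summary:
The dataset contains about 1.6 million Booking.com reviews covering more than 40,000 accommodations worldwide. All reviews are in English, were published in 2023, and have been moderated for authenticity and compliance. Each review covers at least three topics, as identified by the Text2topic model. The columns are: review title, positive and negative review parts, guest score, helpful votes, guest type, guest country, room nights, month of stay, accommodation ID, accommodation type, accommodation score, accommodation country, accommodation star rating, and accommodation location flags (beach, ski, city center).
Created:
2024-07-11
Summary of the original information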

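Booking.com Accommodation Review Dataset

Dataset overview

This dataset is a training set of user-generated Booking.com reviews: about 1.6 million reviews of more than 40,000 accommodations worldwide. All reviews were written by guests who actually stayed at the property and passed a moderation process that checks they are genuine and do not violate platform guidelines. To protect user privacy, the data contains no personally identifiable information. To protect commercially sensitive statistics, the dataset is limited to tens of thousands of accommodations. In addition, the dataset keeps only informative reviews, defined as reviews that contain at least 3 topics according to the Text2topic model.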

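Dataset field descriptions

The following table describes the fields in the dataset:

Column name                 Description
review_title                Title of the review
review_positive             Positive ("liked") part of the review
review_negative             Negative ("disliked") part of the review
guest_score                 Score the guest gave to the stay
review_helpful_votes        Number of users who marked the review as helpful
guest_type                  Traveler type: solo traveler (1 adult) / couple (2 adults) / group (>2 adults) / family (adults and children)
guest_country               Anonymized country from which the booking was made
room_nights                 Number of nights booked
month                       Month of the stay for the booking
accommodation_id            Anonymized accommodation ID
accommodation_type          Type of accommodation, e.g. hotel, apartment, hostel
accommodation_score         Overall average guest score of the accommodation
accommodation_country       Country where the accommodation is located
accommodation_star_rating   Star rating of the accommodation, usually provided by an official accommodation rating body or a third party
location_is_beach           Whether the accommodation is in a beach location
location_is_ski             Whether the accommodation is in a ski location
location_is_city_center     Whether the accommodation is in a city center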
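
The field list above can be sketched as a typed record. The class name `BookingReview` is hypothetical, and every Python type below is an assumption: the page gives column names and meanings but not storage types.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BookingReview:
    # Column names come from the field table; all types are assumed.
    review_title: Optional[str]
    review_positive: Optional[str]
    review_negative: Optional[str]
    guest_score: float
    review_helpful_votes: int
    guest_type: str
    guest_country: str
    room_nights: int
    month: int
    accommodation_id: int
    accommodation_type: str
    accommodation_score: float
    accommodation_country: str
    accommodation_star_rating: Optional[float]
    location_is_beach: bool
    location_is_ski: bool
    location_is_city_center: bool


# Column names in table order, handy for selecting or validating columns.
COLUMN_NAMES = [f.name for f in fields(BookingReview)]
print(len(COLUMN_NAMES), COLUMN_NAMES[:3])
```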

License

The dataset is released under a non-commercial license.

Citation

The associated paper is available on arXiv.

AI-collected summary
Dataset introduction
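Construction
The Booking.com Accommodation Review Dataset is built from user-generated reviews published in 2023: 1.6 million reviews covering about 40,000 accommodations worldwide. All reviews come from guests who actually stayed at the property and were moderated by the platform for authenticity and compliance. To protect user privacy, the data contains no personally identifiable information, and to avoid leaking commercially sensitive information, it covers only tens of thousands of accommodations. The Text2topic model was used to keep only reviews that contain at least 3 topics, so that each retained review is informative.
Characteristics
The dataset is notable for its diversity and its structured information. Each review is split into a positive part and a negative part, which is convenient for sentiment analysis and topic mining. It also carries metadata such as guest score, accommodation type, location, and length of stay, which supports analysis along several dimensions. Anonymization protects privacy, and the Text2topic-based filtering is intended to raise data quality. These properties make the dataset a good fit for research on travel decision-making, sentiment analysis, and personalized recommendation.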
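As a rough illustration of how the positive/negative split can feed sentiment analysis, the sketch below turns one review row into labeled text examples. The function name `to_sentiment_examples` is hypothetical, the sample row is invented for illustration and not taken from the dataset, and representing a missing part as an empty string or `None` is an assumption.

```python
def to_sentiment_examples(row):
    """Return (text, label) pairs for the non-empty parts of one review row."""
    examples = []
    for column, label in (
        ("review_positive", "positive"),
        ("review_negative", "negative"),
    ):
        # A missing part is assumed to be None or an empty string.
        text = (row.get(column) or "").strip()
        if text:
            examples.append((text, label))
    return examples


# Invented row, for illustration only.
row = {
    "review_title": "Nice stay",
    "review_positive": "Friendly staff, clean room.",
    "review_negative": "",
}
print(to_sentiment_examples(row))
```

Labeling by column is a weak-supervision shortcut: a "disliked" field can still contain neutral or positive wording, so the labels are only approximate.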
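Usage
Researchers can run sentiment analysis on the positive and negative parts of each review and combine the results with metadata such as guest score and accommodation type to study user preferences and the factors that shape the accommodation experience. Fields such as location and length of stay support further study of travel behavior patterns. The structured information also supports training machine learning models, for example for personalized recommendation systems. Users must comply with the non-commercial license and cite the associated paper.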
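A small sketch of the metadata analysis described above: averaging `guest_score` per group for any categorical column. The function name `mean_score_by` is hypothetical, and the rows, including the category spellings "Hotel", "Apartment", "Couple", and "Family", are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(rows, key):
    """Average guest_score for each distinct value of the column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["guest_score"])
    return {value: mean(scores) for value, scores in groups.items()}


# Invented rows, for illustration only.
rows = [
    {"accommodation_type": "Hotel", "guest_type": "Couple", "guest_score": 9.0},
    {"accommodation_type": "Hotel", "guest_type": "Family", "guest_score": 7.0},
    {"accommodation_type": "Apartment", "guest_type": "Couple", "guest_score": 8.0},
]
print(mean_score_by(rows, "accommodation_type"))
print(mean_score_by(rows, "guest_type"))
```

On the full 1.6 million rows, a dataframe library would likely be more practical; the plain-Python version only shows the grouping logic.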
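Background and challenges
Background
The Booking.com Accommodation Review Dataset is a dataset of user-generated reviews published on Booking.com in 2023, intended to support natural language processing research in travel and accommodation. It was built by a team led by Reda Igebaria and contains about 1.6 million English reviews of more than 40,000 accommodations worldwide. All reviews were moderated by the platform for authenticity and compliance. The core research question is how user reviews can improve personalized recommendation, particularly for accommodation choice and travel decisions. The dataset provides corpus material for tasks such as personalized recommendation, sentiment analysis, and text classification, and is relevant to both academic research and commercial applications in travel.
Current challenges
The challenges fall into two groups. First, on the domain side, extracting useful information from a very large volume of reviews to support personalized recommendation and decision optimization is hard: the reviews are rich, but their diversity and subjectivity complicate sentiment analysis and topic extraction. Second, on the construction side, the authors had to balance privacy against informativeness. Removing all personally identifiable information and limiting the number of accommodations may reduce how representative and diverse the data is. In addition, keeping only reviews that the Text2topic model marks as "informative" may exclude some reviews that carry useful information.
Common scenarios
Typical use cases
The dataset is used in travel and hospitality management research, especially for analysis of user-generated content (UGC). Researchers can use it for sentiment analysis, topic modeling, and user behavior prediction. Analyzing the positive and negative parts of reviews helps explain guest satisfaction with a stay and the factors that drive it.
Academic problems addressed
The dataset supports several research problems in travel and hospitality management, including sentiment polarity analysis of reviews, optimization of accommodation recommendation systems, and personalized modeling of user preferences. With a large volume of real reviews, researchers can develop more accurate models of guest satisfaction and give accommodation managers evidence for improving service. Its multi-dimensional metadata (accommodation type, location, and so on) also supports cross-domain research.
Derived work
Work based on this dataset includes research on personalized review ranking with contrastive learning (see arXiv:2407.00787). This line of work aims to improve recommendation performance and offers new approaches to personalized services in travel. The dataset also provides a data foundation and a reference point for later work on sentiment analysis and topic modeling.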
The content above was collected and summarized by AI.