MongoDB/airbnb_embeddings
Source: Hugging Face, updated 2024-03-28, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/MongoDB/airbnb_embeddings
---
license: apache-2.0
task_categories:
- question-answering
- text-retrieval
- text-to-image
language:
- en
tags:
- vector search
- multimodal
- retrieval augmented generation
size_categories:
- 1K<n<10K
---
## Overview
This dataset consists of AirBnB listings with property descriptions, reviews, and other metadata.
It also contains text embeddings of the property descriptions as well as image embeddings of the listing image. The text embeddings were created using OpenAI's **text-embedding-3-small** model and the image embeddings using OpenAI's [**clip-vit-base-patch32**](https://huggingface.co/openai/clip-vit-base-patch32) model available on Hugging Face.
The text embeddings have 1536 dimensions, while the image embeddings have 512 dimensions.
## Dataset Structure
Here is a full list of fields contained in the dataset. Some noteworthy fields have been highlighted:
- _id: Unique identifier for the listing
- listing_url: URL for the listing on AirBnB
- **name**: Title or name of the listing
- **summary**: Short overview of listing
- **space**: Short description of the space, amenities etc.
- **description**: Full listing description
- neighborhood_overview: Description of surrounding area
- notes: Special instructions or notes
- transit: Nearby public transportation options
- access: How to access the property. Door codes etc.
- interaction: Host's preferred interaction medium
- house_rules: Rules guests must follow
- **property_type**: Type of property
- room_type: Listing's room category
- bed_type: Type of bed provided
- minimum_nights: Minimum stay required
- maximum_nights: Maximum stay allowed
- cancellation_policy: Terms for cancelling booking
- first_review: Date of first review
- last_review: Date of latest review
- **accommodates**: Number of guests accommodated
- **bedrooms**: Number of bedrooms available
- **beds**: Number of beds available
- number_of_reviews: Total reviews received
- bathrooms: Number of bathrooms available
- **amenities**: List of amenities offered
- **price**: Nightly price for listing
- security_deposit: Required security deposit amount
- cleaning_fee: Additional cleaning fee charged
- extra_people: Fee for additional guests
- guests_included: Number of guests included in the base price
- **images**: Links to listing images
- host: Information about the host
- **address**: Physical address of listing
- **availability**: Availability dates for listing
- **review_scores**: Aggregate review scores
- reviews: Individual guest reviews
- weekly_price: Discounted price for week
- monthly_price: Discounted price for month
- **text_embeddings**: Embeddings of the property description in the `space` field
- **image_embeddings**: Embeddings of the image at `picture_url` in the `images` field
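
As a quick sanity check on the two embedding fields, the sketch below verifies that a record carries vectors of the dimensions stated in the Overview (1536 for text, 512 for images). It assumes each record is a plain dict whose embedding fields are Python lists; `check_embedding_dims` and the synthetic `sample` record are hypothetical and only illustrate the idea.

```python
# Expected dimensions, as stated in the Overview section.
EXPECTED_DIMS = {"text_embeddings": 1536, "image_embeddings": 512}


def check_embedding_dims(record):
    """Return a dict mapping each embedding field to True or False.

    A field passes when it is present, is a list, and has the expected length.
    This is a hypothetical helper, not part of any library.
    """
    results = {}
    for field, dim in EXPECTED_DIMS.items():
        value = record.get(field)
        results[field] = isinstance(value, list) and len(value) == dim
    return results


# Synthetic record standing in for one dataset row (all values are made up).
sample = {
    "name": "Example listing",
    "text_embeddings": [0.0] * 1536,
    "image_embeddings": [0.0] * 512,
}
print(check_embedding_dims(sample))
```

Running the same check over rows loaded from the dataset is one way to spot missing or truncated vectors before building an index on them.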
## Usage
This dataset can be useful for:
- Building Multimodal Search applications. Embed text queries using the CLIP model, and retrieve relevant images using the image embeddings provided.
- Building Hybrid Search applications. Use the embeddings provided for vector search and the metadata fields for pre-filtering and/or full-text search.
- Building RAG applications
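
The search use cases above come down to ranking listings by the similarity between a query vector and a stored embedding, optionally after filtering on metadata. The following is a minimal in-memory sketch of that idea, not how MongoDB Atlas implements it. It assumes cosine similarity as the metric and treats `price` as a plain number; the toy 3-dimensional vectors stand in for the real 512-dimensional image embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, field, k=3, max_price=None):
    """Rank records by similarity of records[field] to query_vector.

    If max_price is given, records priced above it are dropped first,
    which mimics a metadata pre-filter before the vector comparison.
    """
    candidates = [
        r for r in records
        if max_price is None or r["price"] <= max_price
    ]
    scored = [(cosine_similarity(query_vector, r[field]), r["name"]) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: made-up listings with 3-dimensional stand-in embeddings.
listings = [
    {"name": "Beach house", "price": 300, "image_embeddings": [1.0, 0.0, 0.0]},
    {"name": "City loft", "price": 120, "image_embeddings": [0.0, 1.0, 0.0]},
    {"name": "Seaside studio", "price": 90, "image_embeddings": [0.9, 0.1, 0.0]},
]
query = [1.0, 0.0, 0.0]  # in practice this would come from the CLIP text encoder

print(top_k(query, listings, "image_embeddings", k=2))
print(top_k(query, listings, "image_embeddings", k=2, max_price=150))
```

For multimodal search, the query vector would be produced by embedding the text query with the same CLIP model used for the images, so that both live in the same 512-dimensional space.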
## Ingest Data
To experiment with this dataset using MongoDB Atlas, first [create a MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=community&utm_medium=organic_social&utm_content=Hugging%20Face%20Dataset&utm_term=apoorva.joshi).
You can then use the following script to load this dataset into your MongoDB Atlas cluster:
```
import os
from pymongo import MongoClient
from datasets import load_dataset
from bson import json_util

# MongoDB Atlas URI and client setup
uri = os.environ.get('MONGODB_ATLAS_URI')
client = MongoClient(uri)

# Change to the appropriate database and collection names
db_name = 'your_database_name'  # Change this to your actual database name
collection_name = 'airbnb_embeddings'  # Change this to your actual collection name
collection = client[db_name][collection_name]

# Load the "airbnb_embeddings" dataset from Hugging Face
dataset = load_dataset("MongoDB/airbnb_embeddings")

insert_data = []

# Iterate through the dataset and prepare the documents for insertion
# The loop below ingests 1000 records into the database at a time
for item in dataset['train']:
    # Convert the dataset item to MongoDB document format
    doc_item = json_util.loads(json_util.dumps(item))
    insert_data.append(doc_item)

    # Insert in batches of 1000 documents
    if len(insert_data) == 1000:
        collection.insert_many(insert_data)
        print("1000 records ingested")
        insert_data = []

# Insert any remaining documents
if len(insert_data) > 0:
    collection.insert_many(insert_data)
    print("{} records ingested".format(len(insert_data)))

print("All records ingested successfully!")
```
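
Once the data is ingested, querying it by vector in Atlas requires a vector search index and an aggregation pipeline that starts with a `$vectorSearch` stage. The sketch below only builds the two definitions as plain dicts so it can run without a cluster. Several details are assumptions rather than anything stated in this dataset card: the `cosine` similarity setting, the index name `text_vector_index`, the choice of `accommodates` as a filter field, and the `numCandidates` multiplier. Both helper functions are hypothetical.

```python
def vector_index_definition(path, num_dimensions, filter_paths=()):
    """Build an Atlas Vector Search index definition as a plain dict."""
    fields = [{
        "type": "vector",
        "path": path,
        "numDimensions": num_dimensions,
        "similarity": "cosine",  # assumption: the card does not name a metric
    }]
    # Fields used in a $vectorSearch pre-filter must be indexed as "filter".
    fields.extend({"type": "filter", "path": p} for p in filter_paths)
    return {"fields": fields}


def vector_search_pipeline(index_name, path, query_vector, limit=5, pre_filter=None):
    """Build an aggregation pipeline that starts with a $vectorSearch stage."""
    stage = {
        "index": index_name,
        "path": path,
        "queryVector": query_vector,
        "numCandidates": limit * 20,  # arbitrary oversampling factor, tune as needed
        "limit": limit,
    }
    if pre_filter is not None:
        stage["filter"] = pre_filter
    return [
        {"$vectorSearch": stage},
        {"$project": {"_id": 0, "name": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


definition = vector_index_definition(
    "text_embeddings", 1536, filter_paths=["accommodates"]
)
pipeline = vector_search_pipeline(
    "text_vector_index",   # hypothetical index name
    "text_embeddings",
    [0.0] * 1536,          # placeholder; use a real text-embedding-3-small vector
    limit=5,
    pre_filter={"accommodates": {"$gte": 2}},
)

# With a live cluster and the index in place, the query would run as:
# results = list(collection.aggregate(pipeline))
```

The index definition can be pasted into the Atlas UI when creating a vector search index. The same pattern applies to `image_embeddings` with 512 dimensions.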
Provider: MongoDB



