Yelp Academic Dataset
Yelp Dataset Challenge for Python
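Dataset overview
This dataset contains the data from round 6 of the Yelp Dataset Challenge, stored as pandas pickle files. The files are hosted on AWS S3 and cover several tables: users, businesses, reviews, check-ins, and tips.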
Dataset structure
The dataset contains the following tables:
User table (366k rows)
- Fields: average_stars, compliments, elite, fans, friends, name, review_count, type, user_id, votes, yelping_since
Business table (61k rows)
- Fields: attributes, business_id, categories, city, full_address, hours, latitude, longitude, name, neighborhoods, open, review_count, stars, state, type
Review table (1.5M rows)
- Fields: business_id, date, review_id, stars, text, type, user_id, votes_cool, votes_funny, votes_useful
Check-in table (45k rows)
- Fields: business_id, checkin_info, type
Tip table (495k rows)
- Fields: business_id, date, likes, text, type, user_id
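The tables share key columns (business_id links businesses to reviews, check-ins, and tips; user_id links users to reviews and tips), so they can be combined with an ordinary pandas merge. The following is a minimal sketch using tiny made-up frames that reuse only the column names listed above; the rows and values are illustrative assumptions, not real dataset content.

```python
import pandas as pd

# Made-up stand-ins for the business and review tables (illustrative rows only).
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Diner B"],
    "city": ["Phoenix", "Las Vegas"],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1"],
    "stars": [4, 2, 5],
})

# business_id is the shared key between the two tables.
merged = review.merge(
    business[["business_id", "name", "city"]],
    on="business_id",
    how="left",
)

# Mean review rating per city.
stars_by_city = merged.groupby("city")["stars"].mean()
print(stars_by_city)
```

With the real tables, the same merge would presumably be applied to the frames loaded from the pickle files; selecting only the needed business columns first keeps the 1.5M-row result smaller.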
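Downloading and reading the data
The data can be downloaded with the yelp_util package; the downloaded files are stored in the data folder. A pickle file can be read as follows:
```python
import pandas as pd

# Load the review table from the downloaded pickle file.
review = pd.read_pickle("data/yelp_academic_dataset_review.pickle")
review.head()
```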
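After loading a table, it can be useful to confirm that the expected columns are present before running any analysis. The sketch below shows the idea on a small made-up frame written to a temporary pickle file; the file name sample_review.pickle and the sample row are hypothetical, and only the column names come from the field list in this document.

```python
import os
import tempfile

import pandas as pd

# A one-row, made-up frame standing in for the review table.
sample = pd.DataFrame({
    "business_id": ["b1"],
    "date": pd.to_datetime(["2015-01-01"]),
    "stars": [5],
    "text": ["Great tacos."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_review.pickle")  # hypothetical file name
    sample.to_pickle(path)          # write the frame in pickle format
    loaded = pd.read_pickle(path)   # read it back, dtypes included

# Check that the columns we rely on are present.
expected = {"business_id", "date", "stars", "text"}
missing = expected - set(loaded.columns)
print("missing columns:", missing)
```

Note that the pandas documentation warns that unpickling data from an untrusted source can be unsafe, so pickle files should only be loaded from a source you trust.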
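Data processing examples
Business clustering
Businesses can be clustered with the KMeans algorithm:
```python
import pandas as pd
from sklearn.cluster import KMeans

import yelp_util

business = pd.read_pickle("data/yelp_academic_dataset_business.pickle")

# Turn each business's category list into a row of a tag count matrix.
tags = business.categories.tolist()
tag_countmatrix = yelp_util.taglist_to_matrix(tags)

# Fit three clusters and store the assigned cluster for each business.
km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business["cluster"] = km.predict(tag_countmatrix)
```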
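The source does not show what yelp_util.taglist_to_matrix does internally. To illustrate the general idea of clustering on category tags without that helper, the sketch below swaps in scikit-learn's MultiLabelBinarizer to build a binary tag matrix; this is a stand-in technique, not necessarily what yelp_util uses, and the category lists are made up.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up category lists, one per business (illustrative only).
tags = [
    ["Restaurants", "Mexican"],
    ["Restaurants", "Mexican", "Bars"],
    ["Shopping", "Fashion"],
    ["Shopping", "Fashion", "Shoes"],
]

# One row per business, one 0/1 column per distinct tag.
mlb = MultiLabelBinarizer()
tag_matrix = mlb.fit_transform(tags)

# Fixing random_state makes the cluster assignment reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tag_matrix)

print(mlb.classes_)
print(labels)
```

The numeric cluster ids are arbitrary; what matters is which businesses share an id. On this toy input the two restaurant rows end up together and the two shopping rows end up together.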
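Training a word2vec model
The review data can be used to train a word2vec model:
```python
import pandas as pd

import yelp_util

review = pd.read_pickle("data/yelp_academic_dataset_review.pickle")

# Train on a 10,000-review slice of the review text.
yelp_review_sample = list(review.text.iloc[10000:20000])
model = yelp_util.create_word2vec_model(yelp_review_sample)
```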
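Word2vec trainers generally consume text as lists of tokens rather than raw strings. Whether create_word2vec_model tokenizes the reviews itself is not stated in the source, so the following is only a sketch of one simple preprocessing step, with a hypothetical tokenize helper and made-up review strings.

```python
import re


def tokenize(text):
    """Lowercase a review and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Made-up review texts (illustrative only).
reviews = [
    "Great tacos, friendly staff!",
    "Slow service... wouldn't return.",
]

# One token list per review.
sentences = [tokenize(r) for r in reviews]
print(sentences)
```

A regular expression like this drops punctuation and keeps apostrophes inside words; a real pipeline might prefer a dedicated tokenizer.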




