A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions
Source: Zenodo; updated 2026-03-25; indexed 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.15658129
Description:
This repository contains the two semantically enriched trajectory datasets introduced in the paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI), available at https://arxiv.org/pdf/2510.02333
The two datasets were generated with an open-source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and on our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; these files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
Input data
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as output by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a sample of some trajectory, and the dataframe has the following columns:
traj_id: trajectory identifier;
user: user identifier;
lat: latitude of a trajectory sample;
lon: longitude of a trajectory sample;
time: timestamp of a sample;
pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system, as output by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:
osmid: POI OSM identifier;
element_type: POI OSM element type;
name: POI native name;
name:en: POI English name;
wikidata: POI WikiData identifier;
geometry: geometry associated with the POI;
category: POI category.
social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:
tweet_ID: identifier of the post;
text: post's text;
tweet_created: post's timestamp;
uid: identifier of the user who posted.
weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, as output by the notebook 7 - Meteostat daily data downloader.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorded on a single day, and the dataframe has the following columns:
DATE: date in which the weather observation was recorded;
TAVG_C: average temperature in degrees Celsius;
DESCRIPTION: weather conditions.
Output data: the semantically enriched Paris and New York City datasets
Tabular Representation
The [paris|nyc]_output_tabular.zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying MAT-Builder's preprocessing step to raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time, which in this dataframe has been renamed to datetime. The operations performed in MAT-Builder's preprocessing step were:
(1) we filtered out trajectories having fewer than 2 samples;
(2) we filtered out noisy samples inducing velocities above 300 km/h;
(3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.
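The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not MAT-Builder's actual implementation: samples are assumed to be (lat, lon, t) tuples sorted by time with t in seconds, speed is measured against the last kept sample, and compression is applied to runs of consecutive samples. All function names are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_fast_samples(samples, max_kmh=300.0):
    """Step (2): drop a sample if reaching it from the last kept sample
    would require a speed above max_kmh."""
    kept = []
    for lat, lon, t in samples:
        if kept:
            plat, plon, pt = kept[-1]
            dt = t - pt
            if dt <= 0:
                continue  # assumption: skip duplicate or out-of-order timestamps
            if haversine_m(plat, plon, lat, lon) / dt * 3.6 > max_kmh:
                continue
        kept.append((lat, lon, t))
    return kept


def compress(samples, radius_m=20.0):
    """Step (3): merge each run of consecutive samples lying within radius_m
    of the run's initial sample into one point with the median coordinates
    of the run and the time of the initial sample."""
    out, i = [], 0
    while i < len(samples):
        lat0, lon0, t0 = samples[i]
        j = i + 1
        while j < len(samples) and haversine_m(lat0, lon0, samples[j][0], samples[j][1]) <= radius_m:
            j += 1
        run = samples[i:j]
        out.append((median(p[0] for p in run), median(p[1] for p in run), t0))
        i = j
    return out


def preprocess(samples):
    """Steps (1) to (3) applied to a single trajectory."""
    if len(samples) < 2:  # step (1): too few samples, discard the trajectory
        return []
    return compress(drop_fast_samples(samples))


# Toy trajectory: the third sample is a ~111 km jump in 10 s and gets dropped.
demo = [
    (48.8566, 2.3522, 0),
    (48.85661, 2.35221, 10),
    (49.8566, 2.3522, 20),
    (48.8600, 2.3522, 600),
]
cleaned = preprocess(demo)
```

In this toy run the first two samples are about 1 m apart and collapse into a single point, so `cleaned` holds two points.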
stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:
datetime, which indicates when a stop segment starts;
leaving_datetime, which indicates when a stop segment ends;
uid, the trajectory user's identifier;
tid, the trajectory's identifier;
lat, the stop segment's centroid latitude;
lng, the stop segment's centroid longitude. NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).
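As a minimal pandas sketch of the note above, the dataframe index can be exposed as an explicit column so that the identifying triple becomes a regular key. The toy dataframe stands in for stops.parquet; the helper name, the duration column, and the toy values are assumptions made for illustration.

```python
import pandas as pd

STOP_KEY = ["stop_id", "uid", "tid"]


def with_stop_key(stops: pd.DataFrame) -> pd.DataFrame:
    """Expose the index as 'stop_id' and add the stop duration."""
    out = stops.copy()
    out.insert(0, "stop_id", out.index)
    out["duration"] = out["leaving_datetime"] - out["datetime"]
    if out.duplicated(STOP_KEY).any():
        raise ValueError("(stop_id, uid, tid) does not uniquely identify the rows")
    return out.reset_index(drop=True)


# Toy data: the index value 0 is reused by two different trajectories.
demo_stops = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-01-01 08:00", "2020-01-01 12:00", "2020-01-01 09:00"]),
        "leaving_datetime": pd.to_datetime(["2020-01-01 08:30", "2020-01-01 13:00", "2020-01-01 09:45"]),
        "uid": [1, 1, 2],
        "tid": [10, 10, 20],
        "lat": [48.85, 48.86, 40.71],
        "lng": [2.35, 2.36, -74.00],
    }
)
demo_stops.index = [0, 1, 0]

keyed_stops = with_stop_key(demo_stops)
```

With the real file, `demo_stops` would be replaced by the result of `pd.read_parquet("stops.parquet")`.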
moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample belonging to some move segment of some trajectory. The columns are:
datetime, which indicates a sample's timestamp;
uid, the sample's trajectory user's identifier;
tid, the sample's trajectory's identifier;
lat, the sample's latitude;
lng, the sample's longitude;
move_id, the identifier of a move segment. NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).
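Since each row is a sample rather than a whole move, a per-move view requires grouping by the triple mentioned in the note. The following pandas sketch shows one way to do it; the function name, the summary columns, and the toy data are assumptions.

```python
import pandas as pd

MOVE_KEY = ["uid", "tid", "move_id"]


def summarize_moves(moves: pd.DataFrame) -> pd.DataFrame:
    """One row per move segment with its start, end, sample count and duration."""
    grouped = moves.groupby(MOVE_KEY)["datetime"]
    summary = grouped.agg(start="min", end="max", n_samples="count").reset_index()
    summary["duration"] = summary["end"] - summary["start"]
    return summary


# Toy data: move_id 0 appears in two different trajectories.
demo_moves = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2020-01-01 08:00",
                "2020-01-01 08:05",
                "2020-01-01 09:00",
                "2020-01-01 10:00",
                "2020-01-01 10:10",
            ]
        ),
        "uid": [1, 1, 1, 2, 2],
        "tid": [10, 10, 10, 20, 20],
        "lat": [48.850, 48.851, 48.860, 40.710, 40.720],
        "lng": [2.350, 2.351, 2.360, -74.000, -74.010],
        "move_id": [0, 0, 1, 0, 0],
    }
)

move_summary = summarize_moves(demo_moves)
```

Grouping by move_id alone would merge the two unrelated moves that share move_id 0 in the toy data, which is why the full triple is used.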
enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when there are multiple POIs located near a stop's centroid. The columns found in this dataframe are the same as in stops.parquet, plus two sets of columns. The first set of columns concerns a stop's characteristics:
stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;
geometry_stop, which is a Shapely Point representing a stop's centroid;
geometry, which is the aforementioned Shapely Point plus a 50-meter buffer around it.
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
index_poi, which is the index of the associated POI in the pois.parqet file;
osmid, which is the identifier given by OpenStreetMap to the POI;
name, the POI's name;
wikidata, the POI's identifier on Wikidata;
category, the POI's category;
geometry_poi, a Shapely (multi)polygon describing the POI's geometry;
distance, the distance between the stop segment's centroid and the POI.
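Because an occasional stop is repeated once per nearby POI, a common first step is to keep a single POI per stop. The pandas sketch below keeps the closest one using the distance column. It conservatively keys stops on (stop_id, uid, tid); the function name, the toy POI names, and the toy distances are hypothetical.

```python
import pandas as pd

STOP_KEY = ["stop_id", "uid", "tid"]


def nearest_poi_per_stop(enriched: pd.DataFrame) -> pd.DataFrame:
    """Keep, for every stop, only the row of the POI with the smallest distance."""
    ranked = enriched.sort_values(STOP_KEY + ["distance"])
    return ranked.drop_duplicates(subset=STOP_KEY, keep="first").reset_index(drop=True)


# Toy data: stop 0 has two candidate POIs, stop 1 has one.
demo_enriched = pd.DataFrame(
    {
        "stop_id": [0, 0, 1],
        "uid": [1, 1, 1],
        "tid": [10, 10, 10],
        "name": ["Museum B", "Cafe A", "Park C"],
        "category": ["tourism", "amenity", "leisure"],
        "distance": [35.5, 12.0, 5.0],
    }
)

nearest = nearest_poi_per_stop(demo_enriched)
```

Sorting and then dropping duplicates is used instead of an index-based lookup so that the sketch does not depend on the dataframe having a unique index.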
enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics as enriched_occasional.parquet, plus the following columns:
systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;
frequency, the number of systematic stops within a systematic stop's cluster;
home, the probability that the systematic stop's cluster represents the home of the associated user;
work, the probability that the systematic stop's cluster represents the workplace of the associated user;
other, the probability that the systematic stop's cluster represents some other place the associated user consistently visits (e.g., gym);
importance, which represents the fraction of time a user spends in the cluster associated with a systematic stop - this is relevant if a user has been associated with multiple clusters of systematic stops.
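To illustrate how these columns might be used, the pandas sketch below reduces the dataframe to one row per cluster and picks the most probable of home, work and other. It assumes that a cluster is identified by the pair (uid, systematic_id) and that the cluster-level columns are constant across the rows of a cluster; neither is stated explicitly above. The function name, the top_label column, and the toy values are hypothetical.

```python
import pandas as pd

CLUSTER_KEY = ["uid", "systematic_id"]  # assumption: systematic_id is scoped per user
CLUSTER_COLS = CLUSTER_KEY + ["frequency", "home", "work", "other", "importance"]


def label_clusters(enriched: pd.DataFrame) -> pd.DataFrame:
    """One row per cluster, with the most probable label among home/work/other."""
    clusters = enriched[CLUSTER_COLS].drop_duplicates(subset=CLUSTER_KEY).copy()
    clusters["top_label"] = clusters[["home", "work", "other"]].idxmax(axis=1)
    return clusters.reset_index(drop=True)


# Toy data: cluster 0 appears twice because its stop matched two POIs.
demo_systematic = pd.DataFrame(
    {
        "uid": [1, 1, 1],
        "systematic_id": [0, 0, 1],
        "frequency": [12, 12, 7],
        "home": [0.8, 0.8, 0.1],
        "work": [0.1, 0.1, 0.7],
        "other": [0.1, 0.1, 0.2],
        "importance": [0.6, 0.6, 0.4],
    }
)

clusters = label_clusters(demo_systematic)
```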
enriched_moves.parquet: parquet file storing the dataframe containing the augmented move segments. The dataframe has exactly the same structure as moves.parquet, plus the following columns:
label, which indicates the inferred means of transportation. The values are numerical and their semantics are as follows:
0, which corresponds to walk;
1, which corresponds to bike;
2, which corresponds to bus;
3, which corresponds to car;
4, which corresponds to subway;
5, which corresponds to train;
6, which corresponds to taxi.
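The mapping above can be written down once and reused when decoding the label column. The sketch below uses plain Python; the constant name, the function name, and the "unknown" fallback for unexpected values are assumptions.

```python
from collections import Counter

# Numeric label -> means of transportation, as listed above.
TRANSPORT_LABELS = {
    0: "walk",
    1: "bike",
    2: "bus",
    3: "car",
    4: "subway",
    5: "train",
    6: "taxi",
}


def decode_labels(labels):
    """Translate numeric labels into names; unexpected values become 'unknown'."""
    return [TRANSPORT_LABELS.get(int(value), "unknown") for value in labels]


# Toy label column, e.g. the values of the label column for a few samples.
demo_labels = [0, 0, 3, 6, 4]
decoded = decode_labels(demo_labels)
mode_counts = Counter(decoded)
```

With a pandas dataframe, the same dictionary can be passed to `Series.map`, e.g. `df["label"].map(TRANSPORT_LABELS)`.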
weather_enrichment.parquet: parquet file storing the dataframe containing the weather information that has been associated with the trajectories. Recall that our weather data has a daily frequency. Consequently, each row represents the association between the weather conditions recorded on a specific day and some trajectory. The columns to consider in the dataframe are:
tid, the trajectory's identifier;
uid, the trajectory user's identifier;
datetime, the timestamp of the trajectory's first sample that falls within the day to which the weather conditions refer;
lat, the latitude of the trajectory's first sample that falls within the day to which the weather conditions refer;
lng, the longitude of the trajectory's first sample that falls within the day to which the weather conditions refer;
end_datetime, the timestamp of the trajectory's last sample that falls within the day to which the weather conditions refer;
end_lat, the latitude of the trajectory's last sample that falls within the day to which the weather conditions refer;
end_lng, the longitude of the trajectory's last sample that falls within the day to which the weather conditions refer;
DATE, the date, in the form YYYY-MM-DD, to which the weather conditions used in a row refer;
TAVG_C, the average temperature recorded for a given DATE;
DESCRIPTION, string describing the inferred sky conditions for a given DATE. Possible values are sunny, light rain, moderate rain, heavy rain.
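The row layout described above can be reproduced with a small pandas sketch: samples are grouped by trajectory and calendar day, the first and last sample of each group are kept, and the daily weather is joined on DATE. This is an illustration of the association logic, not MAT-Builder's code. The input is assumed to have the columns tid, uid, datetime, lat and lng; the function name and the toy values are hypothetical.

```python
import pandas as pd

DAY_KEY = ["tid", "uid", "DATE"]


def associate_daily_weather(samples: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """One row per (trajectory, day) with first/last sample and that day's weather."""
    s = samples.sort_values("datetime").copy()
    s["DATE"] = s["datetime"].dt.strftime("%Y-%m-%d")
    grouped = s.groupby(DAY_KEY, as_index=False)
    first = grouped.first()  # first sample of each trajectory within each day
    last = grouped.last().rename(
        columns={"datetime": "end_datetime", "lat": "end_lat", "lng": "end_lng"}
    )
    per_day = first.merge(last, on=DAY_KEY)
    w = weather.copy()
    w["DATE"] = pd.to_datetime(w["DATE"]).dt.strftime("%Y-%m-%d")  # normalize the key
    return per_day.merge(w, on="DATE", how="left")


# Toy data: one trajectory spanning two days.
demo_samples = pd.DataFrame(
    {
        "tid": [10, 10, 10],
        "uid": [1, 1, 1],
        "datetime": pd.to_datetime(
            ["2020-01-01 08:00", "2020-01-01 18:00", "2020-01-02 09:00"]
        ),
        "lat": [48.85, 48.86, 48.87],
        "lng": [2.35, 2.36, 2.37],
    }
)
demo_weather = pd.DataFrame(
    {
        "DATE": ["2020-01-01", "2020-01-02"],
        "TAVG_C": [5.0, 3.5],
        "DESCRIPTION": ["sunny", "light rain"],
    }
)

daily = associate_daily_weather(demo_samples, demo_weather)
```

A trajectory that spans several days yields one row per day, and a day with a single sample yields a row whose start and end fields coincide.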
RDF representation
The [paris|nyc]_output_RDF.zip archives contain the RDF knowledge graphs output by MAT-Builder, which represent the semantically enriched Paris and New York City datasets. The graphs' internal structure follows the one defined by our customized STEPv2 ontology, which has been described in the IEEE Access MAT-Builder paper. The knowledge graphs are stored in Turtle (.ttl) files, which can be imported into any popular triplestore, such as GraphDB.
Acknowledgments
This research has been partially funded by the European Union’s Horizon Europe research and innovation program EFRA (Grant Agreement Number 101093026) and the MUSIT Project through the European Union’s Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement no. 101182585. Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or European Commission-EU. Neither the European Union nor the granting authority can be held responsible for them.
Provider:
Zenodo
Created:
2025-06-13



