From CSV to Arrow: Creating a Unified Data Set for Efficient Cross-Platform Analysis

Source: DataCite Commons. Updated 2025-01-22. Indexed 2025-05-07.
Download link:
https://tandf.figshare.com/articles/dataset/From_CSV_to_Arrow_Creating_a_Unified_Data_Set_for_Efficient_Cross-Platform_Analysis/28255680
Description:
Handling open data, like the vast repository of New York City (NYC) 311 service requests, often starts with the ubiquitous CSV (comma-separated values) file format. However, CSV files are notoriously inefficient for curation, bogged down by redundancy and potential misinterpretations. Enter Apache Arrow, a game-changing approach that not only slashes storage requirements but also primes data for seamless analysis across popular platforms like R, Python, and Julia. Using the NYC 311 service request data, we demonstrate the conversion of a CSV file to the Arrow IPC (Inter-Process Communication) format. An Arrow file stores the table schema with the data in a binary format that can be memory-mapped for reading, enabling instantaneous access to potentially large datasets. The Arrow IPC data serves as a universal starting point for analysis across various environments. In our example, this conversion is done in Julia, which has powerful packages for reading and writing CSV or Arrow files and calling functions in other popular environments such as R and Python.
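The idea of keeping the schema with the data in a memory-mappable binary file can be illustrated with a small sketch. This is not the Arrow IPC format and not the authors' Julia workflow: it is a toy columnar layout built with Python's standard library only, and the column names, sample rows, file name, and the helpers `write_binary` and `read_row` are all hypothetical. It shows two of the properties the description mentions: repeated strings are stored once (dictionary encoding), and a single value can be read from a computed byte offset without parsing the rest of the file.

```python
import csv
import io
import json
import mmap
import os
import struct
import tempfile

# Invented miniature of 311-style data, for illustration only.
CSV_TEXT = """complaint_type,zip
Noise,10001
Noise,10002
Heating,10001
Noise,10003
"""


def write_binary(csv_text, path):
    """Write a toy file: length-prefixed JSON schema, then one block per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Dictionary-encode the repeated strings: each distinct label is stored once.
    labels = sorted({r["complaint_type"] for r in rows})
    code = {label: i for i, label in enumerate(labels)}
    schema = json.dumps({
        "columns": [["complaint_type", "uint8 dictionary"], ["zip", "uint32"]],
        "dictionary": labels,
        "n_rows": len(rows),
    }).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(schema)))  # 4-byte little-endian schema length
        f.write(schema)
        # Column 1: one byte per row holding the dictionary code.
        f.write(bytes(code[r["complaint_type"]] for r in rows))
        # Column 2: one little-endian uint32 per row.
        f.write(struct.pack(f"<{len(rows)}I", *(int(r["zip"]) for r in rows)))


def read_row(path, index):
    """Memory-map the file and decode one row from computed offsets."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (schema_len,) = struct.unpack_from("<I", m, 0)
        schema = json.loads(m[4:4 + schema_len].decode("utf-8"))
        n = schema["n_rows"]
        if not 0 <= index < n:
            raise IndexError(index)
        base = 4 + schema_len  # first byte of the code column
        label = schema["dictionary"][m[base + index]]
        (zip_code,) = struct.unpack_from("<I", m, base + n + 4 * index)
        return label, zip_code


path = os.path.join(tempfile.mkdtemp(), "requests.bin")
write_binary(CSV_TEXT, path)
print(read_row(path, 2))
```

Because the types are recorded in the header, the reader never has to guess whether `zip` is a number or a string, which is one of the misinterpretations a bare CSV file invites. A real Arrow file adds much more (alignment, null bitmaps, record batches, many data types), so for actual work an Arrow library should be used rather than a hand-rolled layout like this one.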
Provider:
Taylor & Francis
Created:
2025-01-22