From CSV to Arrow: Creating a Unified Data Set for Efficient Cross-Platform Analysis
Source: DataCite Commons. Updated: 2025-01-22. Indexed: 2025-05-07.
Download link:
https://tandf.figshare.com/articles/dataset/From_CSV_to_Arrow_Creating_a_Unified_Data_Set_for_Efficient_Cross-Platform_Analysis/28255680
Resource description:
Handling open data, like the vast repository of New York City (NYC) 311 service requests, often starts with the ubiquitous CSV (comma-separated values) file format. However, CSV files are notoriously inefficient for curation, bogged down by redundancy and potential misinterpretations. Enter Apache Arrow, a game-changing approach that not only slashes storage requirements but also primes data for seamless analysis across popular platforms like R, Python, and Julia. Using the NYC 311 service request data, we demonstrate the conversion of a CSV file to the Arrow IPC (Inter-Process Communication) format. An Arrow file stores the table schema with the data in a binary format that can be memory-mapped for reading, enabling near-instantaneous access to potentially large datasets. The Arrow IPC data serves as a universal starting point for analysis across various environments. In our example, this conversion is done in Julia, which has powerful packages for reading and writing CSV or Arrow files and for calling functions in other popular environments such as R and Python.
Provider:
Taylor & Francis
Created:
2025-01-22



