From CSV to Arrow: Creating a Unified Data Set for Efficient Cross-Platform Analysis

Name: From CSV to Arrow: Creating a Unified Data Set for Efficient Cross-Platform Analysis
Creator: Taylor & Francis
Published: 2025-01-22 16:00:10
License: No description available

Source: DataCite Commons. Updated: 2025-01-22. Indexed: 2025-05-07.

Download link:

https://tandf.figshare.com/articles/dataset/From_CSV_to_Arrow_Creating_a_Unified_Data_Set_for_Efficient_Cross-Platform_Analysis/28255680

Description:

Handling open data, like the vast repository of New York City (NYC) 311 service requests, often starts with the ubiquitous CSV (comma-separated values) file format. However, CSV files are notoriously inefficient for curation: they store data redundantly as text and carry no type information, so values can be misinterpreted when read. Apache Arrow offers an alternative that not only reduces storage requirements but also prepares data for analysis across popular platforms such as R, Python, and Julia. Using the NYC 311 service request data, we demonstrate the conversion of a CSV file to the Arrow IPC (Inter-Process Communication) format. An Arrow file stores the table schema with the data in a binary format that can be memory-mapped for reading, enabling near-instantaneous access to potentially large datasets. The Arrow IPC data serves as a universal starting point for analysis across various environments. In our example, this conversion is done in Julia, which has powerful packages for reading and writing CSV or Arrow files and for calling functions in other popular environments such as R and Python.
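
The misinterpretation problem mentioned above can be illustrated with a minimal sketch. It uses only Python's standard library and is not Arrow itself; the sample rows, the column names, and the helpers `naive_parse` and `read_rows` are invented for illustration and do not come from the actual NYC 311 data or from the dataset's Julia code. The sketch contrasts guessing types from CSV text with applying an explicit schema, which is the kind of information an Arrow file keeps alongside the data.

```python
import csv
import io

# Hypothetical sample resembling a service-request extract; not real NYC 311 rows.
SAMPLE_CSV = """request_id,incident_zip,complaint_type
1001,00083,Noise
1002,10001,Noise
1003,,Heating
"""

# An explicit schema: column name -> Python type (an assumption for this sketch).
SCHEMA = {"request_id": int, "incident_zip": str, "complaint_type": str}


def naive_parse(value):
    """Guess a type from the text alone, as a schema-less reader has to."""
    try:
        return int(value)
    except ValueError:
        return value


def read_rows(text, convert):
    """Read CSV text into a list of dicts, converting fields with convert(column, value)."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: convert(col, val) for col, val in row.items()} for row in reader]


# Without a schema, the ZIP code is guessed to be an integer.
guessed = read_rows(SAMPLE_CSV, lambda col, val: naive_parse(val))

# With a schema, each column gets its declared type and empty fields become None.
typed = read_rows(
    SAMPLE_CSV,
    lambda col, val: SCHEMA[col](val) if val != "" else None,
)

print(guessed[0]["incident_zip"])  # 83 (leading zeros lost)
print(typed[0]["incident_zip"])    # 00083
print(typed[2]["incident_zip"])    # None (missing value kept distinct from text)
```

With a plain CSV file, every reader has to repeat this guessing or be handed the schema separately. A format that stores the schema in the file removes that step, so each environment that opens the file should see the same column types.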

Provider:

Taylor & Francis

Created:

2025-01-22