Upcoming feature: Arrow file format support #1091

lwhite1 · 2022-04-01T19:59:57Z

lwhite1
Apr 1, 2022
Maintainer

In a couple of weeks I will be starting a new gig as an open source dev working on Apache Arrow, so I thought I would provide Arrow support in Tablesaw.

If you are not familiar with Arrow, it provides a high-performance, cross-platform 'backend' for dataframes and a standard way to exchange data between data science tools and projects. The project was created by Wes McKinney, who also started the Pandas project. Arrow is widely supported as a data interchange standard, so it would give Tablesaw another way to exchange data with projects like Dremio and Apache Spark, as well as with other dataframes like Pandas and the R dataframe.

While Arrow is at its best when used as an in-memory data structure, Tablesaw support will, at least initially, be limited to reading and writing files in the Arrow format. Arrow's data model is based on a memory-mapped binary structure that would require an extensive rewrite to implement within Tablesaw, so that's off the table, at least for now.
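To give a rough feel for what a binary column layout means, here is a minimal sketch in plain Java of an Arrow-style fixed-width integer column: one validity bit per value (least-significant bit first) plus a contiguous little-endian values buffer. The class name `ArrowStyleIntColumn` and its methods are hypothetical and exist only for illustration. This is not Tablesaw code and it does not use the Arrow Java library, and it ignores details such as buffer padding and alignment that the real format specifies.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Tablesaw or the Arrow Java library.
final class ArrowStyleIntColumn {
    private final ByteBuffer validity; // one bit per value, least-significant bit first
    private final ByteBuffer values;   // 4 bytes per value, little-endian
    private final int length;

    ArrowStyleIntColumn(Integer[] data) {
        length = data.length;
        // Freshly allocated buffers are zero-filled, so every slot starts out null.
        validity = ByteBuffer.allocate((length + 7) / 8);
        values = ByteBuffer.allocate(length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < length; i++) {
            if (data[i] != null) {
                int byteIndex = i / 8;
                // Set the validity bit for slot i, then store the value at its fixed offset.
                validity.put(byteIndex, (byte) (validity.get(byteIndex) | (1 << (i % 8))));
                values.putInt(i * Integer.BYTES, data[i]);
            }
        }
    }

    // A cleared validity bit means the slot holds no value.
    boolean isNull(int i) {
        return (validity.get(i / 8) & (1 << (i % 8))) == 0;
    }

    int get(int i) {
        if (isNull(i)) {
            throw new IllegalStateException("value " + i + " is null");
        }
        return values.getInt(i * Integer.BYTES);
    }

    int size() {
        return length;
    }

    public static void main(String[] args) {
        ArrowStyleIntColumn col = new ArrowStyleIntColumn(new Integer[] {7, null, 42});
        for (int i = 0; i < col.size(); i++) {
            System.out.println(col.isNull(i) ? "null" : Integer.toString(col.get(i)));
        }
    }
}
```

Every read and write here goes through byte offsets in a buffer rather than through an ordinary Java array or collection. Column classes built on arrays or collections would have to be reworked to fit that model, which is why a rewrite would be extensive.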

Arrow IO will be implemented as a separate subproject, like our Excel support, so there should be no additional dependencies and few if any changes to Tablesaw core or other subprojects.

