In a couple of weeks I will be starting a new gig as an open source dev working on Apache Arrow, so I thought I would provide Arrow support in Tablesaw.
If you are not familiar with Arrow, it provides a high-performance, cross-platform 'backend' for dataframes, and a standard way to exchange data between data science tools and projects. The project was created by Wes McKinney, who also started the Pandas project. Arrow is widely supported as a data interchange standard, so it would give Tablesaw another way to exchange data with projects like Dremio and Apache Spark, as well as with other dataframes like Pandas and the R dataframe.
While Arrow is at its best when used as an in-memory data structure, Tablesaw support will, at least initially, be limited to reading and writing files in the Arrow format. Arrow's data model is based on a memory-mapped binary structure that would require an extensive rewrite to implement within Tablesaw, so that's off the table, at least for now.
Arrow IO will be implemented as a separate sub-project, like our Excel support, so there should be no additional dependencies for Tablesaw core and few, if any, changes to core or the other sub-projects.