Improve data performance #4

Open
schr476 opened this issue Jun 10, 2020 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@schr476
Collaborator

schr476 commented Jun 10, 2020

Develop a TF data object, i.e. a `tf.data` input pipeline (https://www.tensorflow.org/guide/data_performance).

@schr476 schr476 added the enhancement New feature or request label Jun 10, 2020
@gnperdue

Are the inputs CSV files? Might be worth converting to TFRecord for storage and faster read back later (depends on workflow).

@schr476
Collaborator Author

schr476 commented Jun 24, 2020

@gnperdue I believe we need both: data reformatting and data processing (pre-fetching, etc.).
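
For the data processing half (pre-fetching, etc.), a minimal sketch of a `tf.data` pipeline could look like the following. It assumes TensorFlow 2.4 or newer (older releases spell the constant `tf.data.experimental.AUTOTUNE`), and the array shapes, the `make_dataset` helper, and the cast step are hypothetical stand-ins rather than this project's actual data or preprocessing.

```python
import numpy as np
import tensorflow as tf


def make_dataset(features, labels, batch_size=32, shuffle=True):
    """Build a batched, prefetching tf.data pipeline from in-memory arrays."""
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        # Shuffle over the whole (small) dataset; use a smaller buffer for large data.
        ds = ds.shuffle(buffer_size=len(features))
    # Hypothetical preprocessing step: cast the features to float32 in parallel.
    ds = ds.map(
        lambda x, y: (tf.cast(x, tf.float32), y),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    ds = ds.batch(batch_size)
    # Prefetch so that input preparation overlaps with model execution.
    return ds.prefetch(tf.data.AUTOTUNE)


# Stand-in data: 100 examples with 4 integer features each.
features = np.arange(100 * 4).reshape(100, 4)
labels = np.arange(100) % 2

dataset = make_dataset(features, labels, batch_size=32, shuffle=False)
first_x, first_y = next(iter(dataset))
```

The resulting `dataset` can be passed directly to `model.fit`. Whether prefetching helps in practice depends on how expensive the per-example preprocessing is relative to a training step.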

@schr476 schr476 assigned schr476 and ommiaa and unassigned schr476 Jun 24, 2020
@gnperdue

@ommiaa - I have some code for converting HDF5 to TFRecords - it was for a neutrino experiment, and probably hard to fully grok, but I can dig it out if you'd like to look at it.

@gnperdue

The TF documentation has examples for CSV to TFRecord also...
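
As a rough illustration of what a CSV to TFRecord conversion involves, here is a minimal sketch, not the code from the TF documentation. It assumes every column except a hypothetical `label` column is numeric; the `csv_to_tfrecord` helper, the feature names, and the toy file are made up for the example.

```python
import csv
import os
import tempfile

import tensorflow as tf


def csv_to_tfrecord(csv_path, tfrecord_path, label_column="label"):
    """Write each CSV row as one tf.train.Example; return the number of rows written."""
    count = 0
    with open(csv_path, newline="") as f, tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(f):
            # Assumption: the label is an integer, all other columns are floats.
            label = int(row.pop(label_column))
            values = [float(v) for v in row.values()]
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(
                    float_list=tf.train.FloatList(value=values)),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())
            count += 1
    return count


# Toy input file standing in for the real CSV data.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy.csv")
with open(csv_path, "w", newline="") as f:
    f.write("x0,x1,label\n1.0,2.0,0\n3.0,4.0,1\n")

tfrecord_path = os.path.join(tmp_dir, "toy.tfrecord")
n_written = csv_to_tfrecord(csv_path, tfrecord_path)
```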

@ommiaa

ommiaa commented Jun 24, 2020

Thanks @gnperdue, having an example to start from certainly helps! No rush, of course: I am just getting started, and I want to meet with @schr476 to make sure my environment is set up correctly.

@gnperdue

@ommiaa actually, even better than my neutrino stuff -- look at this

https://github.com/gnperdue/RandomData/tree/master/TensorFlow

It is an HDF5 -> TFRecord converter for the "fashion MNIST" dataset. It is using TF1.X era code (it has been a while since I touched this), but the TF folks have nice TF1->TF2 conversion utilities, so if it doesn't work, you can try that to update for TF2.

The HDF5 inputs exist here

https://github.com/gnperdue/RandomData/tree/master/hdf5

(in the same repo). IIRC, the official TF documentation for CSV -> TFRecord conversion is pretty good. That is a use case Google cares about (CSV, and JPG/PNG images -> TFRecord is easy, but they don't care about HDF5, so you have to do a bit more "by hand").

@ommiaa

ommiaa commented Jul 8, 2020

@gnperdue, @schr476: I am ramping up with Jason's help. Please confirm the following assumptions (I know they might seem obvious):

  • The data we save at FNAL is stored in the HDF5 format
  • The TF data object allows for faster performance
  • The goal of this ticket is to write the code that converts HDF5 data to TF data objects

@gnperdue

@ommiaa yes, that is correct. We want to convert HDF5 to TFRecords. Actually, there is an upstream step - we first go CSV to HDF5. You could skip that and go right to TFRecord.
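
The HDF5 to TFRecord step, plus reading the records back through `tf.data`, could be sketched roughly as below. This is not the converter from the RandomData repo: the dataset keys `images` and `labels`, the `uint8` image layout, and both helper functions are assumptions for illustration, and it needs `h5py` alongside TensorFlow 2.4 or newer.

```python
import os
import tempfile

import h5py
import numpy as np
import tensorflow as tf


def hdf5_to_tfrecord(hdf5_path, tfrecord_path, x_key="images", y_key="labels"):
    """Serialize each (image, label) pair in an HDF5 file as one tf.train.Example."""
    with h5py.File(hdf5_path, "r") as f, tf.io.TFRecordWriter(tfrecord_path) as writer:
        images, labels = f[x_key], f[y_key]  # hypothetical dataset names
        n = len(labels)
        for i in range(n):
            image = np.asarray(images[i], dtype=np.uint8)
            example = tf.train.Example(features=tf.train.Features(feature={
                # Store the raw image bytes; the shape must be known when reading.
                "image_raw": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(labels[i])])),
            }))
            writer.write(example.SerializeToString())
    return n


def load_tfrecord(tfrecord_path, image_shape, batch_size=32):
    """Read the records back as a batched, prefetching tf.data.Dataset."""
    spec = {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, spec)
        image = tf.io.decode_raw(parsed["image_raw"], tf.uint8)
        return tf.reshape(image, image_shape), parsed["label"]

    ds = tf.data.TFRecordDataset(tfrecord_path)
    ds = ds.map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


# Toy HDF5 file standing in for the real inputs: five 28x28 grayscale images.
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
toy_labels = np.arange(5, dtype=np.int64)

tmp_dir = tempfile.mkdtemp()
hdf5_path = os.path.join(tmp_dir, "toy.hdf5")
with h5py.File(hdf5_path, "w") as f:
    f.create_dataset("images", data=toy_images)
    f.create_dataset("labels", data=toy_labels)

tfrecord_path = os.path.join(tmp_dir, "toy.tfrecord")
n_written = hdf5_to_tfrecord(hdf5_path, tfrecord_path)
images_out, labels_out = next(iter(load_tfrecord(tfrecord_path, (28, 28))))
```

Going directly from CSV to TFRecord would replace the `h5py` read loop with a CSV reader and leave the reading side unchanged.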


3 participants