[Question] Can i read parquet data from HDFS? #443
Comments
Hi @wangxingda, thanks for trying HugeCTR with HDFS. We used to have a notebook sample demonstrating the usage of HDFS. Can you confirm that the metadata file exists? In addition, could you please post your cmake log here? I'd like to confirm whether the macro ENABLE_ARROW_PARQUET is defined.
@JacoCheung Thanks for your help. I just use the CMakeLists.txt from the main branch of the HugeCTR repo, and I confirm that my metadata file exists. Did you notice this line in CMakeLists.txt: if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS)? The notebook seems to be out of date; I cannot run it successfully with both Parquet and HDFS.
Hi @wangxingda, thanks for the reminder! There was a breaking change to remote reading in the v23.02 release, where the line if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS) in CMakeLists.txt came into play. Specifically, to optimize the reading process in HugeCTR, we need to know the row_group_size of all training data files (Parquet) in advance, before any actual data reading. We obtain that information by using the Arrow Parquet reader to read the metadata from the Parquet files on the local filesystem. Therefore, HDFS has been disabled since the v23.02 release. We should mark this as a known issue; sorry for the inconvenience. May I know the reason for and the importance of trying HDFS? Is it a toy trial or not? If you need HDFS support in the short term, could you try a release prior to v23.02?
@JacoCheung Thanks. I plan to use HugeCTR in a production environment, and the training data is stored in HDFS. So does the HugeCTR team have a plan to support HDFS with the Parquet format? I would very much like this feature to be supported.
Hi @wangxingda, thanks for your reply. Yes, we will surely restore the remote I/O (HDFS) feature. As I mentioned, this should be treated as an issue to be fixed. We're planning to refactor our data reader and fix the HDFS problem, but until that happens, you can use a release prior to v23.02. Thanks.
Thanks very much!
I recompiled HugeCTR with -DENABLE_HDFS=ON, and I get this error when I read Parquet data from HDFS:
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)