[Question] Can i read parquet data from HDFS? #443
Comments
Hi @wangxingda, thanks for trying HugeCTR with HDFS. We used to have a notebook sample demonstrating the usage of HDFS. Can you confirm that the metadata file exists? In addition, could you please post your cmake log here? I'd like to confirm whether the macro ENABLE_ARROW_PARQUET is defined.
@JacoCheung Thanks for your help. I just use the CMakeLists.txt from the main branch of the HugeCTR repo, and I confirm that my metadata file exists. Did you notice this line in CMakeLists.txt: if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS)? The notebook seems to be out of date; I cannot run it successfully with both Parquet and HDFS.
Hi @wangxingda, thanks for the reminder! There was a breaking change to remote reading in the v23.02 release, where the line if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS) in CMakeLists.txt came into play. Specifically, to optimize the reading process in HugeCTR, we need to know the row_group_size of all training data files (Parquet) in advance, before any actual data reading. We obtain that information by using the Arrow Parquet reader to read the metadata from the Parquet files on the local filesystem. Therefore, HDFS has been disabled since the v23.02 release. We should mark this as a known issue; sorry for the inconvenience. May I know the reason for and the importance of trying HDFS? Is it a toy trial or not? If you need HDFS support in the short term, could you try a release prior to v23.02?
@JacoCheung Thanks. I plan to use HugeCTR in a production environment, and the training data is stored in HDFS. So does the HugeCTR team have a plan to support HDFS with the Parquet format? I would very much like this feature to be supported.
Hi @wangxingda, thanks for your reply. Yes, we will surely restore the remote I/O (HDFS) feature. As I mentioned, this should be treated as an issue to be fixed. We're planning to refactor our data reader and fix the HDFS problem, but until that happens, you can use a release prior to v23.02. Thanks.
Thanks very much!
I recompiled HugeCTR with -DENABLE_HDFS=ON, and I get this error when I read Parquet data from HDFS:
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)