My company currently operates a Recommender model trained with TensorFlow 2 (TF2) and served on CPU pods. We are exploring the potential of HugeCTR due to its promising GPU embedding cache capabilities and are considering switching our model to it.
We have successfully retrained our existing TF2 model with the SparseOperationsKit (more info) and created the inference graph with HPS, as demonstrated in these notebooks: sok_to_hps_dlrm_demo.ipynb and demo_for_tf_trained_model.ipynb
Result:
We deployed the model and used Triton's perf_analyzer to test its performance with varying batch sizes. The results were as follows:
- Batch size: 24k
  - GPU memory usage: 13013MiB / 15109MiB (with "gpucacheper" set at 0.8)
  - GPU utilization: 51%
- Batch size: 16k
  - GPU memory usage: 14013MiB / 15109MiB (with "gpucacheper" set at 1.0)
To maximize throughput, we plan to test the model across different instance types with varying GPU memory sizes. However, tuning the config parameters and selecting the best instance type for inference requires a clear understanding of how the embedding cache size is calculated.
Details about the current model and embedding tables:
Our current model has a mix of dense, sparse, and pre-trained sparse features. After exporting the TF+SOK model to HPS, we have 42 embedding tables in total (the sparse_files entries in hps_config.json). Here are the stats:

- dense features: 1 embedding table
  - embedding_dimension: 2
  - num features: 221
  - total rows: 16343 (sum of (num quantiles + 1))
  - max_nnz: 1
- trainable sparse features: 38 embedding tables in total
Questions:

1. Given the specific details of our HPS model and the context above, can you guide us on how to estimate the GPU memory needed to store the embedding cache for different batch sizes with the HugeCTR backend in inference scenarios? This will help us determine the optimal configuration and instance type to maximize the model's throughput during inference. (A back-of-envelope sketch of our current assumptions follows this list.)
2. Assuming the GPU memory is insufficient to store all embeddings, what would be the best configuration? I understand that I could reduce the GPU cache ratio and cache the entire embedding table in the CPU Memory Database (volatile_db). Could you confirm whether this is the correct approach?
3. I also have a question about the allocation_rate setting in the above volatile_db. I observed that I must reduce allocation_rate to 1e6, or else the default allocation (256 MiB) leads to an out-of-memory error during hps.init. Could you explain why this happens and provide some insight into this matter?
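For context, here is the back-of-envelope arithmetic we have been working with so far. The per-entry layout (fp32 values plus int64 keys), the neglect of hash-table and allocator overhead, and the assumption that batch size mainly drives per-request query buffers rather than the cache capacity are all our own guesses, not anything taken from the HugeCTR source, so corrections are welcome.

```python
# Rough lower bound on the GPU memory used by the HPS embedding cache.
# Assumptions (ours): fp32 embedding values (4 bytes each), int64 keys
# (8 bytes each), no hash-table or allocator overhead included.

def embedding_cache_bytes(num_rows, embedding_dim, gpucacheper,
                          value_bytes=4, key_bytes=8):
    """Estimated footprint of one table's GPU embedding cache."""
    cached_rows = int(num_rows * gpucacheper)
    return cached_rows * (embedding_dim * value_bytes + key_bytes)

# Example: the dense-feature table above (16343 rows, embedding_dimension 2)
# with "gpucacheper" = 0.8.
size = embedding_cache_bytes(16343, 2, 0.8)
print(f"{size / 2**20:.2f} MiB")  # ~0.20 MiB for this single table

# Summing this over all 42 tables would give the total estimate.
```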
Regarding 2:
Using the parallel_hash_map as your volatile_db is the suggested approach if you cannot put the entire embedding table directly into GPU memory.
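For illustration, the relevant part of hps_config.json might then look roughly like the sketch below. Only "type": "parallel_hash_map" and the existing "gpucacheper" knob come from this thread; the "models" entry is a placeholder for the real deployment and should be checked against the HPS documentation.

```python
import json

# Minimal sketch (not a verified config): keep only a fraction of each table
# in the GPU embedding cache via "gpucacheper", and back it with a
# parallel_hash_map volatile_db that holds the full tables in CPU memory.
hps_config = {
    "models": ["<existing model entries: sparse_files, gpucacheper, ...>"],
    "volatile_db": {
        "type": "parallel_hash_map"  # CPU-memory database backing the GPU cache
    },
}

with open("hps_config.json", "w") as f:
    json.dump(hps_config, f, indent=4)
```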
Regarding 3:
For performance reasons (to avoid frequent small allocations) and to limit long-term memory fragmentation, the hash_map backends allocate memory in chunks. The default chunk size is 256 MiB. Since you have 42 tables, that means at least 42 x 256 MiB = 10752 MiB will be allocated. Given that your EC2 instance only has 16 GiB of memory, it is not too surprising that you are seeing that OOM (out-of-memory) error. However, I noticed that your tables are rather small. I think it should be fine to decrease the allocation rate to 128 MiB, 100 MiB, or even lower (e.g., 64 MiB) without any loss of performance.
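To make that arithmetic explicit (nothing here beyond the numbers already given above, assuming one pre-allocated chunk per table at hps.init):

```python
# CPU memory pre-allocated by the hash_map backend, assuming one chunk
# of allocation_rate bytes per embedding table.
num_tables = 42
MiB, GiB = 2**20, 2**30

for allocation_rate in (256 * MiB, 128 * MiB, 64 * MiB):
    total = num_tables * allocation_rate
    print(f"allocation_rate = {allocation_rate // MiB:3d} MiB "
          f"-> at least {total / GiB:.2f} GiB up front")
# 256 MiB -> 10.50 GiB (already crowding a 16 GiB instance); 64 MiB -> 2.62 GiB.
```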
@tuanavu Regarding the 2nd question, I have some comments here. We already support fp8 quantization in the static embedding cache since v23.08. When "fp8_quant": true and "embedding_cache_type": "static" are set in the HPS JSON configuration file, HPS performs fp8 quantization on the embedding vectors while reading the embedding table, and performs fp32 dequantization on the embedding vectors corresponding to the queried embedding keys in the static embedding cache, so the accuracy of the dense part of the prediction is preserved.
Because the embeddings are stored in fp8, the GPU memory footprint is greatly reduced. However, since business use cases differ, the precision loss caused by quantization/dequantization still needs to be evaluated in real production. For now we only offer experimental support for this in the static embedding cache, for POC verification. If quantization brings significant benefits in your case, we will add quantization support to the dynamic cache and to the upcoming lock-free optimized GPU cache.
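For anyone who wants to try this, a hedged sketch of where the two switches would go is below. Only "embedding_cache_type": "static" and "fp8_quant": true come from the comment above; the remaining fields are placeholders and the exact placement inside the model entry should be verified against the HPS documentation for v23.08+.

```python
# Hypothetical model entry in hps_config.json enabling the experimental
# static embedding cache with fp8 quantization (field placement is an assumption).
model_entry = {
    "model": "my_model",                # placeholder name
    "sparse_files": ["table_0.model"],  # placeholder paths (42 tables in this case)
    "gpucacheper": 1.0,                 # fraction of each table kept in the GPU cache
    "embedding_cache_type": "static",   # use the static cache instead of the default dynamic one
    "fp8_quant": True,                  # store fp8 in the cache, dequantize to fp32 on lookup
}
```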