Base operator for HugeCTR serving support #129
CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Running as SYSTEM
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/125/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
> git rev-list --no-walk 088570474e008fa0580cb7ae6de1c4a2bceadf4e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins11789234233452956815.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items
rerun tests
CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Running as SYSTEM
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/126/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
> git rev-list --no-walk 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13412966895579345381.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items
CI Results
GitHub pull request #129 of commit ac56b79d882d571f189c2aa3db3d5dc2f3d71083, no merge conflicts.
Running as SYSTEM
Setting status of ac56b79d882d571f189c2aa3db3d5dc2f3d71083 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/140/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse ac56b79d882d571f189c2aa3db3d5dc2f3d71083^{commit} # timeout=10
Checking out Revision ac56b79d882d571f189c2aa3db3d5dc2f3d71083 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f ac56b79d882d571f189c2aa3db3d5dc2f3d71083 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
> git rev-list --no-walk 74b88a50a8974327d917509b551a08015f5c7c81 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13320333107056980916.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items
Left some style suggestions, but nothing that would block this PR once the tests pass
merlin/systems/dag/ops/hugectr.py
Outdated
if "opt" not in path.name
]

config_dict = dict()
I think the linter suggestion for this is to use {} instead of dict()
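As a minimal sketch of that suggestion: both spellings produce an equal empty dictionary, and the literal form is what checks such as pylint's `use-dict-literal` tend to ask for (which linter this repository runs is an assumption here). The variable names and values below are illustrative only and are not taken from the PR.

```python
# Both spellings build an equal, empty dictionary.
config_from_call = dict()
config_from_literal = {}
assert config_from_call == config_from_literal

# The literal form also works for a populated dictionary, which avoids a
# run of item assignments after construction. Keys and values here are
# placeholders, not the real HugeCTR configuration.
model = {"model": "example_model", "gpucache": True}
```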
merlin/systems/dag/ops/hugectr.py
Outdated
model = dict()
model["model"] = model_name
model["slot_num"] = num_cat_columns
model["sparse_files"] = sparse_paths
model["dense_file"] = dense_path
model["maxnum_des_feature_per_sample"] = data_layer["dense"]["dense_dim"]
model["network_file"] = network_file
model["num_of_worker_buffer_in_pool"] = 4
model["num_of_refresher_buffer_in_pool"] = 1
model["deployed_device_list"] = self.device_list
model["max_batch_size"] = self.max_batch_size
model["default_value_for_each_table"] = [0.0] * len(sparse_layers)
model["hit_rate_threshold"] = 0.9
model["gpucacheper"] = self.hugectr_params["gpucacheper"]
model["gpucache"] = True
model["cache_refresh_percentage_per_iteration"] = 0.2
model["maxnum_catfeature_query_per_table_per_sample"] = [
    len(x["sparse_embedding_hparam"]["slot_size_array"]) for x in sparse_layers
]
model["embedding_vecsize_per_table"] = vec_size
model["embedding_table_names"] = [x["top"] for x in sparse_layers]
I wonder if it might be worthwhile to extract a helper function to construct this dictionary.
merlin/systems/dag/ops/hugectr.py
Outdated
return config


def _hugectr_config(name, hugectr_params, max_batch_size=None):
It seems like there's a fair amount of repetition in this method. Maybe some of this can be done with a for loop?
I'm pretty sure we implemented this as a for loop, @jperez999. It must have been lost during the splitting of commits from #125.
CI Results
GitHub pull request #129 of commit 92070d02437d7679280097b7eaf495c1f5b19541, no merge conflicts.
Running as SYSTEM
Setting status of 92070d02437d7679280097b7eaf495c1f5b19541 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/146/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 92070d02437d7679280097b7eaf495c1f5b19541^{commit} # timeout=10
Checking out Revision 92070d02437d7679280097b7eaf495c1f5b19541 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 92070d02437d7679280097b7eaf495c1f5b19541 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
> git rev-list --no-walk b2f89fe1c8f53060270d0483dcccc04b46b29164 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins4798257444405123681.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items
92070d0 to 221c35c
CI Results
GitHub pull request #129 of commit 221c35c040eb96d183e8302fb1cae4d8542d514e, no merge conflicts.
Running as SYSTEM
Setting status of 221c35c040eb96d183e8302fb1cae4d8542d514e to PENDING with url https://10.20.13.93:8080/job/merlin_systems/310/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 221c35c040eb96d183e8302fb1cae4d8542d514e^{commit} # timeout=10
Checking out Revision 221c35c040eb96d183e8302fb1cae4d8542d514e (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
Commit message: "Split out model and dataset creation into conftest"
> git rev-list --no-walk 4269cf90c507f051348b5b63ad6236b3638e05ba # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins5580325469944632981.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 72 items
221c35c to 4d99847
CI Results
GitHub pull request #129 of commit 4d99847a4d45afb83050acc2c99235edc09ac0eb, no merge conflicts.
Running as SYSTEM
Setting status of 4d99847a4d45afb83050acc2c99235edc09ac0eb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/311/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 4d99847a4d45afb83050acc2c99235edc09ac0eb^{commit} # timeout=10
Checking out Revision 4d99847a4d45afb83050acc2c99235edc09ac0eb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
Commit message: "Add slot_sizes parameter"
> git rev-list --no-walk 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins15556838871102646320.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items
CI Results
GitHub pull request #129 of commit 027f495e62b6030a2cd712f532280b54c3b54a5a, no merge conflicts.
Running as SYSTEM
Setting status of 027f495e62b6030a2cd712f532280b54c3b54a5a to PENDING with url https://10.20.13.93:8080/job/merlin_systems/312/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 027f495e62b6030a2cd712f532280b54c3b54a5a^{commit} # timeout=10
Checking out Revision 027f495e62b6030a2cd712f532280b54c3b54a5a (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 027f495e62b6030a2cd712f532280b54c3b54a5a # timeout=10
Commit message: "Extract config to methods and extend with all known params"
> git rev-list --no-walk 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins10890944273429405522.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items
@jperez999 @oliverholworthy Could you resolve the conflicts on this when you get a chance? I'd still like to add this support, and maybe we can nudge the HugeCTR team to help us out again.
This PR introduces the initial HugeCTR operator. The operator works alone and will need a wrapper operator to handle inputs coming from a dataframe. The PR lays the foundation for using HugeCTR in systems: you can pass a model or a path to a model, and the model is loaded, the relevant information is extracted, and the necessary artifacts are created (ps.json, model files, model.json, and config.pbtxt for Triton).