Base operator for HugeCTR serving support #129

Open
wants to merge 10 commits into
base: main
from

Conversation

jperez999
Collaborator

This PR introduces the initial HugeCTR operator. The operator works on its own and will need a wrapper operator to handle inputs coming from a dataframe. The PR lays the foundation for using HugeCTR in Systems: you pass a model or a path to a model, and the operator loads it, extracts the relevant information, and creates the necessary artifacts (ps.json, model files, model.json, and config.pbtxt for Triton).
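To illustrate what such a wrapper would have to do, here is a minimal plain-Python sketch that flattens dataframe-like rows into the three tensors the test in this PR feeds to the operator (`DES`, `CATCOLUMN`, `ROWINDEX`). The function name `to_hugectr_inputs` is hypothetical and not part of this PR. The sketch assumes one key per categorical slot, keys shifted by the cumulative slot sizes, and CSR-style row pointers; the real wrapper may differ.

```python
from itertools import accumulate


def to_hugectr_inputs(rows, dense_cols, cat_cols, slot_sizes):
    """Flatten dict rows into (dense, cats, row_ptrs) lists.

    Hypothetical helper for illustration; assumes one key per categorical slot.
    """
    # Offset of each slot in the combined key space: [0, s0, s0 + s1, ...]
    offsets = [0] + list(accumulate(slot_sizes))[:-1]
    # Dense features, row by row
    dense = [float(row[c]) for row in rows for c in dense_cols]
    # Categorical keys, shifted so that every slot has its own key range
    cats = [int(row[c]) + off for row in rows for c, off in zip(cat_cols, offsets)]
    # CSR-style row pointers: one entry per (row, slot) plus a final end marker
    row_ptrs = list(range(len(rows) * len(cat_cols) + 1))
    return dense, cats, row_ptrs


rows = [
    {"a": 0, "b": 0, "c": 0, "d": 0.5},
    {"a": 1, "b": 2, "c": 3, "d": 0.25},
]
dense, cats, row_ptrs = to_hugectr_inputs(rows, ["d"], ["a", "b", "c"], [64, 64, 64])
```

With slot sizes of 64 each, column `b` keys are shifted by 64 and column `c` keys by 128, so the three columns never collide in the shared embedding key space.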

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Running as SYSTEM
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/125/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
 > git rev-list --no-walk 088570474e008fa0580cb7ae6de1c4a2bceadf4e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins11789234233452956815.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items

tests/unit/test_version.py . [ 2%]
tests/unit/systems/test_ensemble.py .... [ 10%]
tests/unit/systems/test_ensemble_ops.py .. [ 14%]
tests/unit/systems/test_export.py . [ 16%]
tests/unit/systems/test_graph.py . [ 18%]
tests/unit/systems/test_inference_ops.py .. [ 22%]
tests/unit/systems/test_op_runner.py .... [ 31%]
tests/unit/systems/test_tensorflow_inf_op.py ... [ 37%]
tests/unit/systems/fil/test_fil.py .......................... [ 91%]
tests/unit/systems/fil/test_forest.py ... [ 97%]
tests/unit/systems/hugectr/test_hugectr.py F [100%]

=================================== FAILURES ===================================
________________________________ test_training _________________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-3/test_training0')

def test_training(tmpdir):
    cat_dtypes = {"a": int, "b": int, "c": int}
    dataset = cudf.datasets.randomdata(1, dtypes={**cat_dtypes, "label": bool})
    dataset["label"] = dataset["label"].astype("int32")

    categorical_columns = list(cat_dtypes.keys())

    gdf = cudf.DataFrame(
        {
            "a": np.arange(64),
            "b": np.arange(64),
            "c": np.arange(64),
            "d": np.random.rand(64).tolist(),
            "label": [0] * 64,
        },
        dtype="int64",
    )
    gdf["label"] = gdf["label"].astype("float32")
    train_dataset = nvt.Dataset(gdf)

    dense_columns = ["d"]

    dict_dtypes = {}
    for col in dense_columns:
        dict_dtypes[col] = np.float32

    for col in categorical_columns:
        dict_dtypes[col] = np.int64

    for col in ["label"]:
        dict_dtypes[col] = np.float32

    train_path = os.path.join(tmpdir, "train/")
    os.mkdir(train_path)

    train_dataset.to_parquet(
        output_path=train_path,
        shuffle=nvt.io.Shuffle.PER_PARTITION,
        cats=categorical_columns,
        conts=dense_columns,
        labels=["label"],
        dtypes=dict_dtypes,
    )

    embeddings = {"a": (64, 16), "b": (64, 16), "c": (64, 16)}

    total_cardinality = 0
    slot_sizes = []

    for column in cat_dtypes:
        slot_sizes.append(embeddings[column][0])
        total_cardinality += embeddings[column][0]

    # slot sizes = list of caridinalities per column, total is sum of individual
    model = _run_model(slot_sizes, train_path, len(dense_columns))

    model_op = HugeCTR(model, max_nnz=2, device_list=[0])

    model_repository_path = os.path.join(tmpdir, "model_repository")

    input_schema = Schema(
        [
            ColumnSchema("DES", dtype=np.float32),
            ColumnSchema("CATCOLUMN", dtype=np.int64),
            ColumnSchema("ROWINDEX", dtype=np.int32),
        ]
    )
    triton_chain = ColumnSelector(["DES", "CATCOLUMN", "ROWINDEX"]) >> model_op
    ens = Ensemble(triton_chain, input_schema)

    os.makedirs(model_repository_path)

    enc_config, node_configs = ens.export(model_repository_path)

    assert enc_config
    assert len(node_configs) == 1
    assert node_configs[0].name == "0_hugectr"

    df = train_dataset.to_ddf().compute()[:5]
    dense, cats, rowptr = _convert(df, slot_sizes, categorical_columns, labels=["label"])

    inputs = [
        grpcclient.InferInput("DES", dense.shape, triton.np_to_triton_dtype(dense.dtype)),
        grpcclient.InferInput("CATCOLUMN", cats.shape, triton.np_to_triton_dtype(cats.dtype)),
        grpcclient.InferInput("ROWINDEX", rowptr.shape, triton.np_to_triton_dtype(rowptr.dtype)),
    ]
    inputs[0].set_data_from_numpy(dense)
    inputs[1].set_data_from_numpy(cats)
    inputs[2].set_data_from_numpy(rowptr)
>   response = _run_ensemble_on_tritonserver(
        model_repository_path,
        ["OUTPUT0"],
        inputs,
        "0_hugectr",
        backend_config=f"hugectr,ps={tmpdir}/model_repository/ps.json",
    )

tests/unit/systems/hugectr/test_hugectr.py:230:


tests/unit/systems/utils/triton.py:39: in _run_ensemble_on_tritonserver
with run_triton_server(tmpdir, backend_config=backend_config) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/pytest-of-jenkins/pytest-3/test_training0/model_repository'
backend_config = 'hugectr,ps=/tmp/pytest-of-jenkins/pytest-3/test_training0/model_repository/ps.json'

@contextlib.contextmanager
def run_triton_server(modelpath, backend_config="tensorflow,version=2"):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        f"--backend-config={backend_config}",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                       raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=1)

merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stdout call -----------------------------
HugeCTR Version: 3.7
====================================================Model Init=====================================================
[HCTR][00:08:17.887][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][00:08:17.887][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][00:08:17.887][INFO][RK0][main]: Global seed is 2344316770
[HCTR][00:08:17.933][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][00:08:18.489][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:08:18.489][INFO][RK0][main]: Start all2all warmup
[HCTR][00:08:18.489][INFO][RK0][main]: End all2all warmup
[HCTR][00:08:18.490][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][00:08:18.490][INFO][RK0][main]: Device 0: Tesla P100-DGXS-16GB
[HCTR][00:08:18.491][INFO][RK0][main]: num of DataReader workers: 1
[HCTR][00:08:18.491][INFO][RK0][main]: Vocabulary size: 0
[HCTR][00:08:18.491][INFO][RK0][main]: max_vocabulary_size_per_gpu_=584362
[HCTR][00:08:18.491][DEBUG][RK0][tid #139974410213120]: file_name_ /tmp/pytest-of-jenkins/pytest-3/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:08:18.491][DEBUG][RK0][tid #139973350016768]: file_name_ /tmp/pytest-of-jenkins/pytest-3/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:08:18.493][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][00:08:18.783][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][00:08:18.783][INFO][RK0][main]: gpu0 init embedding done
[HCTR][00:08:18.784][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][00:08:18.785][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][00:08:18.785][INFO][RK0][main]: label Dense Sparse
label dense data1
(None, 1) (None, 1)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type Input Name Output Name Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash data1 sparse_embedding1 (None, 3, 16)

InnerProduct dense fc1 (None, 512)

Reshape sparse_embedding1 reshape1 (None, 48)

InnerProduct reshape1 fc2 (None, 1)
fc1

BinaryCrossEntropyLoss fc2 loss
label

=====================================================Model Fit=====================================================
[HCTR][00:08:18.785][INFO][RK0][main]: Use non-epoch mode with number of iterations: 20
[HCTR][00:08:18.785][INFO][RK0][main]: Training batchsize: 10, evaluation batchsize: 10
[HCTR][00:08:18.785][INFO][RK0][main]: Evaluation interval: 200, snapshot interval: 10
[HCTR][00:08:18.785][INFO][RK0][main]: Dense network trainable: True
[HCTR][00:08:18.785][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][00:08:18.785][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][00:08:18.785][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][00:08:18.785][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][00:08:18.785][INFO][RK0][main]: Training source file: /tmp/pytest-of-jenkins/pytest-3/test_training0/train/file_list.txt
[HCTR][00:08:18.785][INFO][RK0][main]: Evaluation source file: /tmp/pytest-of-jenkins/pytest-3/test_training0/train/file_list.txt
[HCTR][00:08:18.790][DEBUG][RK0][tid #139974410213120]: file_name_ /tmp/pytest-of-jenkins/pytest-3/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:08:18.795][DEBUG][RK0][tid #139974410213120]: file_name_ /tmp/pytest-of-jenkins/pytest-3/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:08:18.800][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][00:08:18.800][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][00:08:18.825][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:08:18.861][INFO][RK0][main]: Done
[HCTR][00:08:18.879][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:08:18.917][INFO][RK0][main]: Done
[HCTR][00:08:18.918][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][00:08:18.919][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][00:08:18.919][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][00:08:18.924][DEBUG][RK0][tid #139974410213120]: file_name_ /tmp/pytest-of-jenkins/pytest-3/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:08:18.927][INFO][RK0][main]: Finish 20 iterations with batchsize: 10 in 0.14s.
[HCTR][00:08:18.928][INFO][RK0][main]: Save the model graph to /tmp/pytest-of-jenkins/pytest-3/test_training0/model_repository/0_hugectr/1/0_hugectr.json successfully
[HCTR][00:08:18.929][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][00:08:18.929][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][00:08:18.946][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:08:18.982][INFO][RK0][main]: Done
[HCTR][00:08:19.001][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:08:19.039][INFO][RK0][main]: Done
[HCTR][00:08:19.040][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][00:08:19.040][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][00:08:19.040][INFO][RK0][main]: Dumping dense optimizer states to file, successful
----------------------------- Captured stderr call -----------------------------
I0706 00:08:19.331564 12291 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f192e000000' with size 268435456
I0706 00:08:19.332344 12291 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0706 00:08:19.334954 12291 model_repository_manager.cc:1191] loading: 0_hugectr:1
I0706 00:08:19.468733 12291 hugectr.cc:1738] TRITONBACKEND_Initialize: hugectr
I0706 00:08:19.468771 12291 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.9
I0706 00:08:19.468780 12291 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.10
I0706 00:08:19.468789 12291 hugectr.cc:1827] TRITONBACKEND_Backend Finalize: HugectrBackend
E0706 00:08:19.468800 12291 model_repository_manager.cc:1348] failed to load '0_hugectr' version 1: Unsupported: Triton backend API version does not support this backend
E0706 00:08:19.468872 12291 model_repository_manager.cc:1551] Invalid argument: ensemble 'ensemble_model' depends on '0_hugectr' which has no loaded version
I0706 00:08:19.468962 12291 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0706 00:08:19.469001 12291 server.cc:583]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0706 00:08:19.469071 12291 server.cc:626]
+-----------+---------+------------------------------------------------------------------------------------+
| Model | Version | Status |
+-----------+---------+------------------------------------------------------------------------------------+
| 0_hugectr | 1 | UNAVAILABLE: Unsupported: Triton backend API version does not support this backend |
+-----------+---------+------------------------------------------------------------------------------------+

I0706 00:08:19.532783 12291 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0706 00:08:19.533709 12291 tritonserver.cc:2138]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.22.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-3/test_training0/model_repository |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0706 00:08:19.533750 12291 server.cc:257] Waiting for in-flight requests to complete.
I0706 00:08:19.533758 12291 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
I0706 00:08:19.533769 12291 server.cc:288] All models are stopped, unloading models
I0706 00:08:19.533775 12291 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
W0706 00:08:20.552696 12291 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0706 00:08:20.552770 12291 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
=============================== warnings summary ===============================
../../../.local/lib/python3.8/site-packages/nvtabular/framework_utils/__init__.py:18
/var/jenkins_home/.local/lib/python3.8/site-packages/nvtabular/framework_utils/__init__.py:18: DeprecationWarning: The nvtabular.framework_utils module is being replaced by the Merlin Models library. Support for importing from nvtabular.framework_utils is deprecated, and will be removed in a future version. Please consider using the models and layers from Merlin Models instead.
warnings.warn(

tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_config_verification[parquet]
tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_multi_op_run[parquet]
tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
tests/unit/systems/test_inference_ops.py::test_workflow_op_validates_schemas[parquet]
tests/unit/systems/test_inference_ops.py::test_workflow_op_exports_own_config[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_config[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_multiple_ops_same[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_multiple_ops_same_execute[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_single_node_export[parquet]
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column x is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column y is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column id is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/fil/test_fil.py::test_binary_classifier_default[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_binary_classifier_with_proba[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_multi_classifier[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_regressor[sklearn_forest_regressor-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_model_file[sklearn_forest_regressor-checkpoint.tl]
/usr/local/lib/python3.8/dist-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
warnings.warn(msg, category=FutureWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/systems/hugectr/test_hugectr.py::test_training - RuntimeErr...
============ 1 failed, 47 passed, 18 warnings in 161.19s (0:02:41) =============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins13773288337511600376.sh
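
Note on the failure above: the captured stderr shows Triton reporting TRITONBACKEND API version 1.9 while the `hugectr` backend reports 1.10, followed by "Triton backend API version does not support this backend". That suggests the backend in the CI image was built against a newer API than the Triton 2.22.0 server provides, rather than a bug in the operator itself. A small sketch for pulling the two versions out of a Triton log is shown below; the function name `backend_api_versions` is hypothetical, and the regular expressions only assume the two log line formats seen in this run.

```python
import re


def backend_api_versions(log_text):
    """Return (server_version, backend_version) as (major, minor) tuples, or None."""
    # Line printed for the API version the Triton server offers
    server = re.search(r"Triton TRITONBACKEND API version: (\d+)\.(\d+)", log_text)
    # Line printed for the API version the backend was built against
    backend = re.search(r"'[^']+' TRITONBACKEND API version: (\d+)\.(\d+)", log_text)
    if not (server and backend):
        return None
    return tuple(map(int, server.groups())), tuple(map(int, backend.groups()))


log = """\
I0706 00:08:19.468771 12291 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.9
I0706 00:08:19.468780 12291 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.10
"""
server, backend = backend_api_versions(log)
# Compare as integer tuples: 1.10 is newer than 1.9, which a float or string
# comparison would get wrong.
mismatch = backend > server
```

Comparing the versions as integer tuples matters here because "1.10" sorts before "1.9" as a string and equals 1.1 as a float.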

@github-actions

github-actions bot commented Jul 6, 2022

Documentation preview

https://nvidia-merlin.github.io/systems/review/pr-129

@jperez999
Collaborator Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Running as SYSTEM
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/126/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
 > git rev-list --no-walk 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13412966895579345381.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items

tests/unit/test_version.py . [ 2%]
tests/unit/systems/test_ensemble.py .... [ 10%]
tests/unit/systems/test_ensemble_ops.py .. [ 14%]
tests/unit/systems/test_export.py . [ 16%]
tests/unit/systems/test_graph.py . [ 18%]
tests/unit/systems/test_inference_ops.py .. [ 22%]
tests/unit/systems/test_op_runner.py .... [ 31%]
tests/unit/systems/test_tensorflow_inf_op.py ... [ 37%]
tests/unit/systems/fil/test_fil.py .......................... [ 91%]
tests/unit/systems/fil/test_forest.py ... [ 97%]
tests/unit/systems/hugectr/test_hugectr.py F [100%]

=================================== FAILURES ===================================
________________________________ test_training _________________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_training0')

def test_training(tmpdir):
    cat_dtypes = {"a": int, "b": int, "c": int}
    dataset = cudf.datasets.randomdata(1, dtypes={**cat_dtypes, "label": bool})
    dataset["label"] = dataset["label"].astype("int32")

    categorical_columns = list(cat_dtypes.keys())

    gdf = cudf.DataFrame(
        {
            "a": np.arange(64),
            "b": np.arange(64),
            "c": np.arange(64),
            "d": np.random.rand(64).tolist(),
            "label": [0] * 64,
        },
        dtype="int64",
    )
    gdf["label"] = gdf["label"].astype("float32")
    train_dataset = nvt.Dataset(gdf)

    dense_columns = ["d"]

    dict_dtypes = {}
    for col in dense_columns:
        dict_dtypes[col] = np.float32

    for col in categorical_columns:
        dict_dtypes[col] = np.int64

    for col in ["label"]:
        dict_dtypes[col] = np.float32

    train_path = os.path.join(tmpdir, "train/")
    os.mkdir(train_path)

    train_dataset.to_parquet(
        output_path=train_path,
        shuffle=nvt.io.Shuffle.PER_PARTITION,
        cats=categorical_columns,
        conts=dense_columns,
        labels=["label"],
        dtypes=dict_dtypes,
    )

    embeddings = {"a": (64, 16), "b": (64, 16), "c": (64, 16)}

    total_cardinality = 0
    slot_sizes = []

    for column in cat_dtypes:
        slot_sizes.append(embeddings[column][0])
        total_cardinality += embeddings[column][0]

    # slot sizes = list of caridinalities per column, total is sum of individual
    model = _run_model(slot_sizes, train_path, len(dense_columns))

    model_op = HugeCTR(model, max_nnz=2, device_list=[0])

    model_repository_path = os.path.join(tmpdir, "model_repository")

    input_schema = Schema(
        [
            ColumnSchema("DES", dtype=np.float32),
            ColumnSchema("CATCOLUMN", dtype=np.int64),
            ColumnSchema("ROWINDEX", dtype=np.int32),
        ]
    )
    triton_chain = ColumnSelector(["DES", "CATCOLUMN", "ROWINDEX"]) >> model_op
    ens = Ensemble(triton_chain, input_schema)

    os.makedirs(model_repository_path)

    enc_config, node_configs = ens.export(model_repository_path)

    assert enc_config
    assert len(node_configs) == 1
    assert node_configs[0].name == "0_hugectr"

    df = train_dataset.to_ddf().compute()[:5]
    dense, cats, rowptr = _convert(df, slot_sizes, categorical_columns, labels=["label"])

    inputs = [
        grpcclient.InferInput("DES", dense.shape, triton.np_to_triton_dtype(dense.dtype)),
        grpcclient.InferInput("CATCOLUMN", cats.shape, triton.np_to_triton_dtype(cats.dtype)),
        grpcclient.InferInput("ROWINDEX", rowptr.shape, triton.np_to_triton_dtype(rowptr.dtype)),
    ]
    inputs[0].set_data_from_numpy(dense)
    inputs[1].set_data_from_numpy(cats)
    inputs[2].set_data_from_numpy(rowptr)
>   response = _run_ensemble_on_tritonserver(
        model_repository_path,
        ["OUTPUT0"],
        inputs,
        "0_hugectr",
        backend_config=f"hugectr,ps={tmpdir}/model_repository/ps.json",
    )

tests/unit/systems/hugectr/test_hugectr.py:230:


tests/unit/systems/utils/triton.py:39: in _run_ensemble_on_tritonserver
with run_triton_server(tmpdir, backend_config=backend_config) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/pytest-of-jenkins/pytest-1/test_training0/model_repository'
backend_config = 'hugectr,ps=/tmp/pytest-of-jenkins/pytest-1/test_training0/model_repository/ps.json'

@contextlib.contextmanager
def run_triton_server(modelpath, backend_config="tensorflow,version=2"):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        f"--backend-config={backend_config}",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                       raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=1)

merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stdout call -----------------------------
HugeCTR Version: 3.7
====================================================Model Init=====================================================
[HCTR][12:22:22.148][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][12:22:22.148][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][12:22:22.148][INFO][RK0][main]: Global seed is 1130608250
[HCTR][12:22:22.188][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][12:22:22.771][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][12:22:22.771][INFO][RK0][main]: Start all2all warmup
[HCTR][12:22:22.771][INFO][RK0][main]: End all2all warmup
[HCTR][12:22:22.771][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][12:22:22.772][INFO][RK0][main]: Device 0: Tesla P100-DGXS-16GB
[HCTR][12:22:22.772][INFO][RK0][main]: num of DataReader workers: 1
[HCTR][12:22:22.772][INFO][RK0][main]: Vocabulary size: 0
[HCTR][12:22:22.772][INFO][RK0][main]: max_vocabulary_size_per_gpu_=584362
[HCTR][12:22:22.772][DEBUG][RK0][tid #140454641243904]: file_name_ /tmp/pytest-of-jenkins/pytest-1/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][12:22:22.773][DEBUG][RK0][tid #140454117918464]: file_name_ /tmp/pytest-of-jenkins/pytest-1/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][12:22:22.774][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][12:22:23.060][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][12:22:23.061][INFO][RK0][main]: gpu0 init embedding done
[HCTR][12:22:23.062][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][12:22:23.063][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][12:22:23.063][INFO][RK0][main]: label Dense Sparse
label dense data1
(None, 1) (None, 1)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type Input Name Output Name Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash data1 sparse_embedding1 (None, 3, 16)

InnerProduct dense fc1 (None, 512)

Reshape sparse_embedding1 reshape1 (None, 48)

InnerProduct reshape1 fc2 (None, 1)
fc1

BinaryCrossEntropyLoss fc2 loss
label

=====================================================Model Fit=====================================================
[HCTR][12:22:23.063][INFO][RK0][main]: Use non-epoch mode with number of iterations: 20
[HCTR][12:22:23.063][INFO][RK0][main]: Training batchsize: 10, evaluation batchsize: 10
[HCTR][12:22:23.063][INFO][RK0][main]: Evaluation interval: 200, snapshot interval: 10
[HCTR][12:22:23.063][INFO][RK0][main]: Dense network trainable: True
[HCTR][12:22:23.063][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][12:22:23.063][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][12:22:23.063][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][12:22:23.063][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][12:22:23.063][INFO][RK0][main]: Training source file: /tmp/pytest-of-jenkins/pytest-1/test_training0/train/file_list.txt
[HCTR][12:22:23.063][INFO][RK0][main]: Evaluation source file: /tmp/pytest-of-jenkins/pytest-1/test_training0/train/file_list.txt
[HCTR][12:22:23.068][DEBUG][RK0][tid #140454641243904]: file_name_ /tmp/pytest-of-jenkins/pytest-1/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][12:22:23.073][DEBUG][RK0][tid #140454641243904]: file_name_ /tmp/pytest-of-jenkins/pytest-1/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][12:22:23.077][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][12:22:23.077][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][12:22:23.103][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][12:22:23.137][INFO][RK0][main]: Done
[HCTR][12:22:23.156][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][12:22:23.193][INFO][RK0][main]: Done
[HCTR][12:22:23.195][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][12:22:23.195][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][12:22:23.195][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][12:22:23.200][DEBUG][RK0][tid #140454641243904]: file_name_ /tmp/pytest-of-jenkins/pytest-1/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][12:22:23.202][INFO][RK0][main]: Finish 20 iterations with batchsize: 10 in 0.14s.
[HCTR][12:22:23.204][INFO][RK0][main]: Save the model graph to /tmp/pytest-of-jenkins/pytest-1/test_training0/model_repository/0_hugectr/1/0_hugectr.json successfully
[HCTR][12:22:23.204][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][12:22:23.205][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][12:22:23.222][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][12:22:23.258][INFO][RK0][main]: Done
[HCTR][12:22:23.276][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][12:22:23.314][INFO][RK0][main]: Done
[HCTR][12:22:23.315][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][12:22:23.315][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][12:22:23.315][INFO][RK0][main]: Dumping dense optimizer states to file, successful
----------------------------- Captured stderr call -----------------------------
I0706 12:22:23.594422 16422 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f3dc6000000' with size 268435456
I0706 12:22:23.595167 16422 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0706 12:22:23.597604 16422 model_repository_manager.cc:1191] loading: 0_hugectr:1
I0706 12:22:23.731106 16422 hugectr.cc:1738] TRITONBACKEND_Initialize: hugectr
I0706 12:22:23.731136 16422 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.9
I0706 12:22:23.731144 16422 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.10
I0706 12:22:23.731154 16422 hugectr.cc:1827] TRITONBACKEND_Backend Finalize: HugectrBackend
E0706 12:22:23.731164 16422 model_repository_manager.cc:1348] failed to load '0_hugectr' version 1: Unsupported: Triton backend API version does not support this backend
E0706 12:22:23.731230 16422 model_repository_manager.cc:1551] Invalid argument: ensemble 'ensemble_model' depends on '0_hugectr' which has no loaded version
I0706 12:22:23.731309 16422 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0706 12:22:23.731345 16422 server.cc:583]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0706 12:22:23.731407 16422 server.cc:626]
+-----------+---------+------------------------------------------------------------------------------------+
| Model | Version | Status |
+-----------+---------+------------------------------------------------------------------------------------+
| 0_hugectr | 1 | UNAVAILABLE: Unsupported: Triton backend API version does not support this backend |
+-----------+---------+------------------------------------------------------------------------------------+

I0706 12:22:23.795726 16422 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0706 12:22:23.796619 16422 tritonserver.cc:2138]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.22.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-1/test_training0/model_repository |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0706 12:22:23.796660 16422 server.cc:257] Waiting for in-flight requests to complete.
I0706 12:22:23.796668 16422 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
I0706 12:22:23.796678 16422 server.cc:288] All models are stopped, unloading models
I0706 12:22:23.796684 16422 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
W0706 12:22:24.815550 16422 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0706 12:22:24.815610 16422 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
=============================== warnings summary ===============================
../../../.local/lib/python3.8/site-packages/nvtabular/framework_utils/__init__.py:18
/var/jenkins_home/.local/lib/python3.8/site-packages/nvtabular/framework_utils/__init__.py:18: DeprecationWarning: The nvtabular.framework_utils module is being replaced by the Merlin Models library. Support for importing from nvtabular.framework_utils is deprecated, and will be removed in a future version. Please consider using the models and layers from Merlin Models instead.
warnings.warn(

tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_config_verification[parquet]
tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_multi_op_run[parquet]
tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
tests/unit/systems/test_inference_ops.py::test_workflow_op_validates_schemas[parquet]
tests/unit/systems/test_inference_ops.py::test_workflow_op_exports_own_config[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_config[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_multiple_ops_same[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_loads_multiple_ops_same_execute[parquet]
tests/unit/systems/test_op_runner.py::test_op_runner_single_node_export[parquet]
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column x is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column y is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column id is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/fil/test_fil.py::test_binary_classifier_default[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_binary_classifier_with_proba[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_multi_classifier[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_regressor[sklearn_forest_regressor-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_model_file[sklearn_forest_regressor-checkpoint.tl]
/usr/local/lib/python3.8/dist-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
warnings.warn(msg, category=FutureWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/systems/hugectr/test_hugectr.py::test_training - RuntimeErr...
============ 1 failed, 47 passed, 18 warnings in 167.71s (0:02:47) =============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins14199478240402209974.sh

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit ac56b79d882d571f189c2aa3db3d5dc2f3d71083, no merge conflicts.
Running as SYSTEM
Setting status of ac56b79d882d571f189c2aa3db3d5dc2f3d71083 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/140/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse ac56b79d882d571f189c2aa3db3d5dc2f3d71083^{commit} # timeout=10
Checking out Revision ac56b79d882d571f189c2aa3db3d5dc2f3d71083 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f ac56b79d882d571f189c2aa3db3d5dc2f3d71083 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
 > git rev-list --no-walk 74b88a50a8974327d917509b551a08015f5c7c81 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13320333107056980916.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items

tests/unit/test_version.py . [ 2%]
tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py . [ 4%]
[ 4%]
tests/unit/systems/test_ensemble.py .... [ 12%]
tests/unit/systems/test_ensemble_ops.py .. [ 16%]
tests/unit/systems/test_export.py . [ 18%]
tests/unit/systems/test_graph.py . [ 20%]
tests/unit/systems/test_inference_ops.py .. [ 24%]
tests/unit/systems/test_op_runner.py .... [ 32%]
tests/unit/systems/test_tensorflow_inf_op.py ... [ 38%]
tests/unit/systems/fil/test_fil.py .......................... [ 91%]
tests/unit/systems/fil/test_forest.py ... [ 97%]
tests/unit/systems/hugectr/test_hugectr.py F [100%]

=================================== FAILURES ===================================
________________________________ test_training _________________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_training0')

def test_training(tmpdir):
    cat_dtypes = {"a": int, "b": int, "c": int}
    dataset = cudf.datasets.randomdata(1, dtypes={**cat_dtypes, "label": bool})
    dataset["label"] = dataset["label"].astype("int32")

    categorical_columns = list(cat_dtypes.keys())

    gdf = cudf.DataFrame(
        {
            "a": np.arange(64),
            "b": np.arange(64),
            "c": np.arange(64),
            "d": np.random.rand(64).tolist(),
            "label": [0] * 64,
        },
        dtype="int64",
    )
    gdf["label"] = gdf["label"].astype("float32")
    train_dataset = nvt.Dataset(gdf)

    dense_columns = ["d"]

    dict_dtypes = {}
    for col in dense_columns:
        dict_dtypes[col] = np.float32

    for col in categorical_columns:
        dict_dtypes[col] = np.int64

    for col in ["label"]:
        dict_dtypes[col] = np.float32

    train_path = os.path.join(tmpdir, "train/")
    os.mkdir(train_path)

    train_dataset.to_parquet(
        output_path=train_path,
        shuffle=nvt.io.Shuffle.PER_PARTITION,
        cats=categorical_columns,
        conts=dense_columns,
        labels=["label"],
        dtypes=dict_dtypes,
    )

    embeddings = {"a": (64, 16), "b": (64, 16), "c": (64, 16)}

    total_cardinality = 0
    slot_sizes = []

    for column in cat_dtypes:
        slot_sizes.append(embeddings[column][0])
        total_cardinality += embeddings[column][0]

    # slot sizes = list of caridinalities per column, total is sum of individual
    model = _run_model(slot_sizes, train_path, len(dense_columns))

    model_op = HugeCTR(model, max_nnz=2, device_list=[0])

    model_repository_path = os.path.join(tmpdir, "model_repository")

    input_schema = Schema(
        [
            ColumnSchema("DES", dtype=np.float32),
            ColumnSchema("CATCOLUMN", dtype=np.int64),
            ColumnSchema("ROWINDEX", dtype=np.int32),
        ]
    )
    triton_chain = ColumnSelector(["DES", "CATCOLUMN", "ROWINDEX"]) >> model_op
    ens = Ensemble(triton_chain, input_schema)

    os.makedirs(model_repository_path)

    enc_config, node_configs = ens.export(model_repository_path)

    assert enc_config
    assert len(node_configs) == 1
    assert node_configs[0].name == "0_hugectr"

    df = train_dataset.to_ddf().compute()[:5]
    dense, cats, rowptr = _convert(df, slot_sizes, categorical_columns, labels=["label"])

    inputs = [
        grpcclient.InferInput("DES", dense.shape, triton.np_to_triton_dtype(dense.dtype)),
        grpcclient.InferInput("CATCOLUMN", cats.shape, triton.np_to_triton_dtype(cats.dtype)),
        grpcclient.InferInput("ROWINDEX", rowptr.shape, triton.np_to_triton_dtype(rowptr.dtype)),
    ]
    inputs[0].set_data_from_numpy(dense)
    inputs[1].set_data_from_numpy(cats)
    inputs[2].set_data_from_numpy(rowptr)

    response = _run_ensemble_on_tritonserver(
        model_repository_path,
        ["OUTPUT0"],
        inputs,
        "0_hugectr",
        backend_config=f"hugectr,ps={tmpdir}/model_repository/ps.json",
    )
    assert len(response.as_numpy("OUTPUT0")) == df.shape[0]

    model_config = node_configs[0].parameters["config"].string_value

    hugectr_name = node_configs[0].name
    dense_path = f"{tmpdir}/model_repository/{hugectr_name}/1/_dense_0.model"
    sparse_files = [f"{tmpdir}/model_repository/{hugectr_name}/1/0_sparse_0.model"]
    out_predict = _predict(
        dense, cats, rowptr, model_config, hugectr_name, dense_path, sparse_files
    )

tests/unit/systems/hugectr/test_hugectr.py:244:


dense_features = array([[0., 0., 0., 0., 0.]], dtype=float32)
embedding_columns = array([[ 0, 64, 128, 1, 65, 129, 2, 66, 130, 3, 67, 131, 4,
68, 132]])
row_ptrs = array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]],
dtype=int32)
config_file = '/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json'
model_name = '0_hugectr'
dense_path = '/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/_dense_0.model'
sparse_paths = ['/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_sparse_0.model']

def _predict(
    dense_features, embedding_columns, row_ptrs, config_file, model_name, dense_path, sparse_paths
):
    inference_params = InferenceParams(
        model_name=model_name,
        max_batchsize=64,
        hit_rate_threshold=0.5,
        dense_model_file=dense_path,
        sparse_model_files=sparse_paths,
        device_id=0,
        use_gpu_embedding_cache=True,
        cache_size_percentage=0.2,
        i64_input_key=True,
        use_mixed_precision=False,
    )
    inference_session = CreateInferenceSession(config_file, inference_params)
    output = inference_session.predict(
        dense_features[0].tolist(), embedding_columns[0].tolist(), row_ptrs[0].tolist()
    )

E RuntimeError: Runtime error: an illegal memory access was encountered
E cudaStreamSynchronize(resource_manager_->get_local_gpu(0)->get_stream()) at predict(/hugectr/HugeCTR/src/inference/inference_session.cpp:203)

tests/unit/systems/hugectr/test_hugectr.py:267: RuntimeError
----------------------------- Captured stdout call -----------------------------
HugeCTR Version: 3.7
====================================================Model Init=====================================================
[HCTR][00:07:05.716][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][00:07:05.716][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][00:07:05.716][INFO][RK0][main]: Global seed is 177617637
[HCTR][00:07:05.759][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][00:07:06.022][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:07:06.022][INFO][RK0][main]: Start all2all warmup
[HCTR][00:07:06.023][INFO][RK0][main]: End all2all warmup
[HCTR][00:07:06.023][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][00:07:06.023][INFO][RK0][main]: Device 0: Tesla P100-DGXS-16GB
[HCTR][00:07:06.024][INFO][RK0][main]: num of DataReader workers: 1
[HCTR][00:07:06.024][INFO][RK0][main]: Vocabulary size: 0
[HCTR][00:07:06.024][INFO][RK0][main]: max_vocabulary_size_per_gpu_=584362
[HCTR][00:07:06.024][DEBUG][RK0][tid #140080073135872]: file_name_ /tmp/pytest-of-jenkins/pytest-13/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:07:06.025][DEBUG][RK0][tid #140080064743168]: file_name_ /tmp/pytest-of-jenkins/pytest-13/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:07:06.026][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][00:07:06.319][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][00:07:06.320][INFO][RK0][main]: gpu0 init embedding done
[HCTR][00:07:06.321][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][00:07:06.321][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][00:07:06.321][INFO][RK0][main]: label                                   Dense                         Sparse
label                                   dense                         data1
(None, 1)                               (None, 1)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash      data1                         sparse_embedding1             (None, 3, 16)

InnerProduct                            dense                         fc1                           (None, 512)

Reshape                                 sparse_embedding1             reshape1                      (None, 48)

InnerProduct                            reshape1                      fc2                           (None, 1)
                                        fc1

BinaryCrossEntropyLoss                  fc2                           loss
                                        label

=====================================================Model Fit=====================================================
[HCTR][00:07:06.321][INFO][RK0][main]: Use non-epoch mode with number of iterations: 20
[HCTR][00:07:06.321][INFO][RK0][main]: Training batchsize: 10, evaluation batchsize: 10
[HCTR][00:07:06.321][INFO][RK0][main]: Evaluation interval: 200, snapshot interval: 10
[HCTR][00:07:06.321][INFO][RK0][main]: Dense network trainable: True
[HCTR][00:07:06.321][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][00:07:06.321][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][00:07:06.321][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][00:07:06.321][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][00:07:06.321][INFO][RK0][main]: Training source file: /tmp/pytest-of-jenkins/pytest-13/test_training0/train/file_list.txt
[HCTR][00:07:06.321][INFO][RK0][main]: Evaluation source file: /tmp/pytest-of-jenkins/pytest-13/test_training0/train/file_list.txt
[HCTR][00:07:06.326][DEBUG][RK0][tid #140080073135872]: file_name_ /tmp/pytest-of-jenkins/pytest-13/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:07:06.331][DEBUG][RK0][tid #140080073135872]: file_name_ /tmp/pytest-of-jenkins/pytest-13/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:07:06.347][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][00:07:06.347][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][00:07:06.365][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:07:06.400][INFO][RK0][main]: Done
[HCTR][00:07:06.419][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:07:06.457][INFO][RK0][main]: Done
[HCTR][00:07:06.458][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][00:07:06.458][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][00:07:06.458][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][00:07:06.463][DEBUG][RK0][tid #140080073135872]: file_name_ /tmp/pytest-of-jenkins/pytest-13/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][00:07:06.466][INFO][RK0][main]: Finish 20 iterations with batchsize: 10 in 0.14s.
[HCTR][00:07:06.468][INFO][RK0][main]: Save the model graph to /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json successfully
[HCTR][00:07:06.469][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][00:07:06.469][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][00:07:06.487][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:07:06.522][INFO][RK0][main]: Done
[HCTR][00:07:06.541][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][00:07:06.578][INFO][RK0][main]: Done
[HCTR][00:07:06.580][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][00:07:06.580][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][00:07:06.580][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][00:07:06.988][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][00:07:06.988][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][00:07:06.988][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][00:07:06.988][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][00:07:07.229][INFO][RK0][main]: Table: hps_et.0_hugectr.sparse_embedding1; cached 64 / 64 embeddings in volatile database (PreallocatedHashMapBackend); load: 64 / 18446744073709551615 (0.00%).
[HCTR][00:07:07.229][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][00:07:07.229][INFO][RK0][main]: Create embedding cache in device 0.
[HCTR][00:07:07.230][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][00:07:07.230][INFO][RK0][main]: Configured cache hit rate threshold: 0.900000
[HCTR][00:07:07.230][INFO][RK0][main]: The size of thread pool: 16
[HCTR][00:07:07.230][INFO][RK0][main]: The size of worker memory pool: 4
[HCTR][00:07:07.230][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][00:07:07.247][INFO][RK0][main]: Global seed is 1959148862
[HCTR][00:07:07.842][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:07:07.842][INFO][RK0][main]: Start all2all warmup
[HCTR][00:07:07.842][INFO][RK0][main]: End all2all warmup
[HCTR][00:07:07.843][INFO][RK0][main]: Create inference session on device: 0
[HCTR][00:07:07.843][INFO][RK0][main]: Model name: 0_hugectr
[HCTR][00:07:07.843][INFO][RK0][main]: Use mixed precision: False
[HCTR][00:07:07.843][INFO][RK0][main]: Use cuda graph: True
[HCTR][00:07:07.843][INFO][RK0][main]: Max batchsize: 64
[HCTR][00:07:07.843][INFO][RK0][main]: Use I64 input key: True
[HCTR][00:07:07.843][INFO][RK0][main]: start create embedding for inference
[HCTR][00:07:07.843][INFO][RK0][main]: sparse_input name data1
[HCTR][00:07:07.843][INFO][RK0][main]: create embedding for inference success
[HCTR][00:07:07.843][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
Signal (2) received.
[HCTR][00:07:12.183][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][00:07:12.183][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][00:07:12.183][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][00:07:12.183][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][00:07:12.434][INFO][RK0][main]: Table: hps_et.0_hugectr.sparse_embedding1; cached 64 / 64 embeddings in volatile database (PreallocatedHashMapBackend); load: 64 / 18446744073709551615 (0.00%).
[HCTR][00:07:12.434][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][00:07:12.434][INFO][RK0][main]: Create embedding cache in device 0.
[HCTR][00:07:12.434][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.200000
[HCTR][00:07:12.434][INFO][RK0][main]: Configured cache hit rate threshold: 0.500000
[HCTR][00:07:12.434][INFO][RK0][main]: The size of thread pool: 16
[HCTR][00:07:12.434][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][00:07:12.434][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][00:07:12.437][INFO][RK0][main]: Global seed is 2946692632
[HCTR][00:07:12.437][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][00:07:12.466][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:07:12.466][INFO][RK0][main]: Start all2all warmup
[HCTR][00:07:12.466][INFO][RK0][main]: End all2all warmup
[HCTR][00:07:12.467][INFO][RK0][main]: Create inference session on device: 0
[HCTR][00:07:12.467][INFO][RK0][main]: Model name: 0_hugectr
[HCTR][00:07:12.467][INFO][RK0][main]: Use mixed precision: False
[HCTR][00:07:12.467][INFO][RK0][main]: Use cuda graph: True
[HCTR][00:07:12.467][INFO][RK0][main]: Max batchsize: 64
[HCTR][00:07:12.467][INFO][RK0][main]: Use I64 input key: True
[HCTR][00:07:12.467][INFO][RK0][main]: start create embedding for inference
[HCTR][00:07:12.467][INFO][RK0][main]: sparse_input name data1
[HCTR][00:07:12.467][INFO][RK0][main]: create embedding for inference success
[HCTR][00:07:12.467][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
----------------------------- Captured stderr call -----------------------------
I0714 00:07:06.852815 31037 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f7d66000000' with size 268435456
I0714 00:07:06.853566 31037 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0714 00:07:06.856049 31037 model_repository_manager.cc:1191] loading: 0_hugectr:1
I0714 00:07:06.988109 31037 hugectr.cc:1738] TRITONBACKEND_Initialize: hugectr
I0714 00:07:06.988136 31037 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.9
I0714 00:07:06.988144 31037 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.9
I0714 00:07:06.988149 31037 hugectr.cc:1772] The HugeCTR backend Repository location: /opt/tritonserver/backends/hugectr
I0714 00:07:06.988155 31037 hugectr.cc:1781] The HugeCTR backend configuration: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/ps.json","default-max-batch-size":"4"}}
I0714 00:07:06.988176 31037 hugectr.cc:345] *****Parsing Parameter Server Configuration from /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/ps.json
I0714 00:07:06.988231 31037 hugectr.cc:366] Support 64-bit keys = 1
I0714 00:07:06.988265 31037 hugectr.cc:591] Model name = 0_hugectr
I0714 00:07:06.988273 31037 hugectr.cc:600] Model '0_hugectr' -> network file = /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0714 00:07:06.988280 31037 hugectr.cc:607] Model '0_hugectr' -> max. batch size = 64
I0714 00:07:06.988286 31037 hugectr.cc:613] Model '0_hugectr' -> dense model file = /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/_dense_0.model
I0714 00:07:06.988294 31037 hugectr.cc:619] Model '0_hugectr' -> sparse model files = [/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_sparse_0.model]
I0714 00:07:06.988300 31037 hugectr.cc:630] Model '0_hugectr' -> use GPU embedding cache = 1
I0714 00:07:06.988318 31037 hugectr.cc:639] Model '0_hugectr' -> hit rate threshold = 0.9
I0714 00:07:06.988325 31037 hugectr.cc:647] Model '0_hugectr' -> per model GPU cache = 0.5
I0714 00:07:06.988337 31037 hugectr.cc:664] Model '0_hugectr' -> use_mixed_precision = 0
I0714 00:07:06.988344 31037 hugectr.cc:671] Model '0_hugectr' -> scaler = 1
I0714 00:07:06.988350 31037 hugectr.cc:677] Model '0_hugectr' -> use_algorithm_search = 1
I0714 00:07:06.988356 31037 hugectr.cc:685] Model '0_hugectr' -> use_cuda_graph = 1
I0714 00:07:06.988362 31037 hugectr.cc:692] Model '0_hugectr' -> num. pool worker buffers = 4
I0714 00:07:06.988368 31037 hugectr.cc:700] Model '0_hugectr' -> num. pool refresh buffers = 1
I0714 00:07:06.988374 31037 hugectr.cc:708] Model '0_hugectr' -> cache refresh rate per iteration = 0.2
I0714 00:07:06.988381 31037 hugectr.cc:717] Model '0_hugectr' -> deployed device list = [0]
I0714 00:07:06.988389 31037 hugectr.cc:725] Model '0_hugectr' -> default value for each table = [0]
I0714 00:07:06.988394 31037 hugectr.cc:733] Model '0_hugectr' -> maxnum_des_feature_per_sample = 1
I0714 00:07:06.988400 31037 hugectr.cc:741] Model '0_hugectr' -> refresh_delay = 0
I0714 00:07:06.988406 31037 hugectr.cc:747] Model '0_hugectr' -> refresh_interval = 0
I0714 00:07:06.988414 31037 hugectr.cc:755] Model '0_hugectr' -> maxnum_catfeature_query_per_table_per_sample list = [3]
I0714 00:07:06.988421 31037 hugectr.cc:766] Model '0_hugectr' -> embedding_vecsize_per_table list = [16]
I0714 00:07:06.988428 31037 hugectr.cc:773] Model '0_hugectr' -> embedding model names = [, sparse_embedding1]
I0714 00:07:06.988433 31037 hugectr.cc:780] Model '0_hugectr' -> label_dim = 1
I0714 00:07:06.988439 31037 hugectr.cc:785] Model '0_hugectr' -> the number of slots = 3
I0714 00:07:06.988451 31037 hugectr.cc:806] *****The HugeCTR Backend Parameter Server is creating... *****
I0714 00:07:06.988579 31037 hugectr.cc:814] ***** Parameter Server(Int64) is creating... *****
I0714 00:07:07.234997 31037 hugectr.cc:825] *****The HugeCTR Backend Backend created the Parameter Server successfully! *****
I0714 00:07:07.235061 31037 hugectr.cc:1844] TRITONBACKEND_ModelInitialize: 0_hugectr (version 1)
I0714 00:07:07.235069 31037 hugectr.cc:1857] Repository location: /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr
I0714 00:07:07.235075 31037 hugectr.cc:1872] backend configuration in mode: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/ps.json","default-max-batch-size":"4"}}
I0714 00:07:07.235085 31037 hugectr.cc:1888] Parsing the latest Parameter Server json config file for deploying model 0_hugectr online
I0714 00:07:07.235090 31037 hugectr.cc:1893] Hierarchical PS version is 0 and the current Model Version is 1
I0714 00:07:07.235096 31037 hugectr.cc:345] *****Parsing Parameter Server Configuration from /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/ps.json
I0714 00:07:07.235139 31037 hugectr.cc:366] Support 64-bit keys = 1
I0714 00:07:07.235159 31037 hugectr.cc:591] Model name = 0_hugectr
I0714 00:07:07.235166 31037 hugectr.cc:600] Model '0_hugectr' -> network file = /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0714 00:07:07.235173 31037 hugectr.cc:607] Model '0_hugectr' -> max. batch size = 64
I0714 00:07:07.235179 31037 hugectr.cc:613] Model '0_hugectr' -> dense model file = /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/_dense_0.model
I0714 00:07:07.235187 31037 hugectr.cc:619] Model '0_hugectr' -> sparse model files = [/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_sparse_0.model]
I0714 00:07:07.235193 31037 hugectr.cc:630] Model '0_hugectr' -> use GPU embedding cache = 1
I0714 00:07:07.235203 31037 hugectr.cc:639] Model '0_hugectr' -> hit rate threshold = 0.9
I0714 00:07:07.235210 31037 hugectr.cc:647] Model '0_hugectr' -> per model GPU cache = 0.5
I0714 00:07:07.235222 31037 hugectr.cc:664] Model '0_hugectr' -> use_mixed_precision = 0
I0714 00:07:07.235229 31037 hugectr.cc:671] Model '0_hugectr' -> scaler = 1
I0714 00:07:07.235234 31037 hugectr.cc:677] Model '0_hugectr' -> use_algorithm_search = 1
I0714 00:07:07.235240 31037 hugectr.cc:685] Model '0_hugectr' -> use_cuda_graph = 1
I0714 00:07:07.235246 31037 hugectr.cc:692] Model '0_hugectr' -> num. pool worker buffers = 4
I0714 00:07:07.235252 31037 hugectr.cc:700] Model '0_hugectr' -> num. pool refresh buffers = 1
I0714 00:07:07.235259 31037 hugectr.cc:708] Model '0_hugectr' -> cache refresh rate per iteration = 0.2
I0714 00:07:07.235266 31037 hugectr.cc:717] Model '0_hugectr' -> deployed device list = [0]
I0714 00:07:07.235273 31037 hugectr.cc:725] Model '0_hugectr' -> default value for each table = [0]
I0714 00:07:07.235279 31037 hugectr.cc:733] Model '0_hugectr' -> maxnum_des_feature_per_sample = 1
I0714 00:07:07.235285 31037 hugectr.cc:741] Model '0_hugectr' -> refresh_delay = 0
I0714 00:07:07.235291 31037 hugectr.cc:747] Model '0_hugectr' -> refresh_interval = 0
I0714 00:07:07.235299 31037 hugectr.cc:755] Model '0_hugectr' -> maxnum_catfeature_query_per_table_per_sample list = [3]
I0714 00:07:07.235305 31037 hugectr.cc:766] Model '0_hugectr' -> embedding_vecsize_per_table list = [16]
I0714 00:07:07.235313 31037 hugectr.cc:773] Model '0_hugectr' -> embedding model names = [, sparse_embedding1]
I0714 00:07:07.235318 31037 hugectr.cc:780] Model '0_hugectr' -> label_dim = 1
I0714 00:07:07.235324 31037 hugectr.cc:785] Model '0_hugectr' -> the number of slots = 3
I0714 00:07:07.236185 31037 hugectr.cc:1078] Verifying model configuration: {
"name": "0_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 64,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "0_hugectr_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"des_feature_num": {
"string_value": "1"
},
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "3"
},
"cat_feature_num": {
"string_value": "3"
},
"config": {
"string_value": "/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "2"
},
"embedding_vector_size": {
"string_value": "16"
},
"gpucacheper": {
"string_value": "0.5"
}
},
"model_warmup": []
}
I0714 00:07:07.236225 31037 hugectr.cc:1164] The model configuration: {
"name": "0_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 64,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "0_hugectr_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"des_feature_num": {
"string_value": "1"
},
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "3"
},
"cat_feature_num": {
"string_value": "3"
},
"config": {
"string_value": "/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "2"
},
"embedding_vector_size": {
"string_value": "16"
},
"gpucacheper": {
"string_value": "0.5"
}
},
"model_warmup": []
}
I0714 00:07:07.236243 31037 hugectr.cc:1209] slots set = 3
I0714 00:07:07.236250 31037 hugectr.cc:1213] slots set = 3
I0714 00:07:07.236256 31037 hugectr.cc:1221] desene number = 1
I0714 00:07:07.236262 31037 hugectr.cc:1239] The max categorical feature number = 3
I0714 00:07:07.236269 31037 hugectr.cc:1244] embedding size = 16
I0714 00:07:07.236274 31037 hugectr.cc:1250] embedding size = 16
I0714 00:07:07.236280 31037 hugectr.cc:1256] maxnnz = 2
I0714 00:07:07.236288 31037 hugectr.cc:1265] refresh_interval = 0
I0714 00:07:07.236294 31037 hugectr.cc:1273] refresh_delay = 0
I0714 00:07:07.236300 31037 hugectr.cc:1281] HugeCTR model config path = /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0714 00:07:07.236308 31037 hugectr.cc:1329] support mixed_precision = 0
I0714 00:07:07.236317 31037 hugectr.cc:1348] gpu cache per = 0.5
I0714 00:07:07.236323 31037 hugectr.cc:1366] hit-rate threshold = 0.9
I0714 00:07:07.236329 31037 hugectr.cc:1374] Label dim = 1
I0714 00:07:07.236335 31037 hugectr.cc:1383] support 64-bit embedding key = 1
I0714 00:07:07.236341 31037 hugectr.cc:1394] Model_Inference_Para.max_batchsize: 64
I0714 00:07:07.236347 31037 hugectr.cc:1398] max_batch_size in model config.pbtxt is 64
I0714 00:07:07.236354 31037 hugectr.cc:1468] ******Creating Embedding Cache for model 0_hugectr in device 0
I0714 00:07:07.236360 31037 hugectr.cc:1495] ******Creating Embedding Cache for model 0_hugectr successfully
I0714 00:07:07.236709 31037 hugectr.cc:1996] TRITONBACKEND_ModelInstanceInitialize: 0_hugectr_0 (device 0)
I0714 00:07:07.236720 31037 hugectr.cc:1637] Triton Model Instance Initialization on device 0
I0714 00:07:07.236727 31037 hugectr.cc:1647] Dense Feature buffer allocation:
I0714 00:07:07.246912 31037 hugectr.cc:1654] Categorical Feature buffer allocation:
I0714 00:07:07.246951 31037 hugectr.cc:1672] Categorical Row Index buffer allocation:
I0714 00:07:07.246964 31037 hugectr.cc:1680] Predict result buffer allocation:
I0714 00:07:07.246977 31037 hugectr.cc:2009] Loading HugeCTR Model
I0714 00:07:07.246984 31037 hugectr.cc:1698] The model origin json configuration file path is: /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0714 00:07:07.976853 31037 hugectr.cc:1706] ******Loading HugeCTR model successfully
I0714 00:07:07.977016 31037 model_repository_manager.cc:1345] successfully loaded '0_hugectr' version 1
I0714 00:07:07.977364 31037 model_repository_manager.cc:1191] loading: ensemble_model:1
I0714 00:07:08.077712 31037 model_repository_manager.cc:1345] successfully loaded 'ensemble_model' version 1
I0714 00:07:08.077849 31037 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0714 00:07:08.077988 31037 server.cc:583]
+---------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+---------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository/ps.json","default-max-batch-size":"4"}} |
+---------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0714 00:07:08.078051 31037 server.cc:626]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| 0_hugectr | 1 | READY |
| ensemble_model | 1 | READY |
+----------------+---------+--------+

I0714 00:07:08.111960 31037 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0714 00:07:08.112788 31037 tritonserver.cc:2138]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.22.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-13/test_training0/model_repository |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0714 00:07:08.113590 31037 grpc_server.cc:4589] Started GRPCInferenceService at 0.0.0.0:8001
I0714 00:07:08.113797 31037 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0714 00:07:08.154879 31037 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
W0714 00:07:09.134720 31037 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0714 00:07:09.134777 31037 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
I0714 00:07:09.690520 31037 server.cc:257] Waiting for in-flight requests to complete.
I0714 00:07:09.690556 31037 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
I0714 00:07:09.690577 31037 model_repository_manager.cc:1223] unloading: ensemble_model:1
I0714 00:07:09.690667 31037 model_repository_manager.cc:1223] unloading: 0_hugectr:1
I0714 00:07:09.690757 31037 server.cc:288] All models are stopped, unloading models
I0714 00:07:09.690777 31037 server.cc:295] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0714 00:07:09.690789 31037 model_repository_manager.cc:1328] successfully unloaded 'ensemble_model' version 1
I0714 00:07:09.691248 31037 hugectr.cc:2026] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0714 00:07:09.704461 31037 hugectr.cc:1957] TRITONBACKEND_ModelFinalize: delete model state
I0714 00:07:09.705264 31037 hugectr.cc:1505] ******Destorying Embedding Cache for model 0_hugectr successfully
I0714 00:07:09.705305 31037 model_repository_manager.cc:1328] successfully unloaded '0_hugectr' version 1
W0714 00:07:10.134947 31037 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0714 00:07:10.135014 31037 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
I0714 00:07:10.690873 31037 server.cc:295] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
I0714 00:07:10.692544 31037 hugectr.cc:1827] TRITONBACKEND_Backend Finalize: HugectrBackend
W0714 00:07:11.155751 31037 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0714 00:07:11.155847 31037 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18
/usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18: DeprecationWarning: The nvtabular.framework_utils module is being replaced by the Merlin Models library. Support for importing from nvtabular.framework_utils is deprecated, and will be removed in a future version. Please consider using the models and layers from Merlin Models instead.
warnings.warn(

tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py: 1 warning
tests/unit/systems/test_ensemble.py: 2 warnings
tests/unit/systems/test_export.py: 1 warning
tests/unit/systems/test_inference_ops.py: 2 warnings
tests/unit/systems/test_op_runner.py: 4 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column x is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column y is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column id is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/fil/test_fil.py::test_binary_classifier_default[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_binary_classifier_with_proba[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_multi_classifier[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_regressor[sklearn_forest_regressor-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_model_file[sklearn_forest_regressor-checkpoint.tl]
/usr/local/lib/python3.8/dist-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
warnings.warn(msg, category=FutureWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/systems/hugectr/test_hugectr.py::test_training - RuntimeErr...
============ 1 failed, 48 passed, 19 warnings in 257.57s (0:04:17) =============
terminate called without an active exception
/tmp/jenkins13320333107056980916.sh: line 14: 29517 Aborted (core dumped) pytest tests/unit
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins1274115635018183239.sh

Contributor

@karlhigley left a comment

Left some style suggestions, but nothing that would block this PR once the tests pass.

if "opt" not in path.name
]

config_dict = dict()
Contributor

I think the linter suggestion for this is to use {} instead of dict()
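
For reference, a minimal illustration of the literal form this suggestion points to. The key below is a placeholder chosen for the example and does not come from the PR.

```python
# Literal form: builds the same empty dict as dict(), without a name lookup and call.
config_dict = {}

# "example_key" is a placeholder for illustration, not a key from this PR.
config_dict["example_key"] = True
```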

Comment on lines 180 to 203
model = dict()
model["model"] = model_name
model["slot_num"] = num_cat_columns
model["sparse_files"] = sparse_paths
model["dense_file"] = dense_path
model["maxnum_des_feature_per_sample"] = data_layer["dense"]["dense_dim"]
model["network_file"] = network_file
model["num_of_worker_buffer_in_pool"] = 4
model["num_of_refresher_buffer_in_pool"] = 1
model["deployed_device_list"] = self.device_list
model["max_batch_size"] = self.max_batch_size
model["default_value_for_each_table"] = [0.0] * len(sparse_layers)
model["hit_rate_threshold"] = 0.9
model["gpucacheper"] = self.hugectr_params["gpucacheper"]
model["gpucache"] = True
model["cache_refresh_percentage_per_iteration"] = 0.2
model["maxnum_catfeature_query_per_table_per_sample"] = [
    len(x["sparse_embedding_hparam"]["slot_size_array"]) for x in sparse_layers
]
model["embedding_vecsize_per_table"] = vec_size
model["embedding_table_names"] = [x["top"] for x in sparse_layers]
Contributor

I wonder if it might be worthwhile to extract a helper function to construct this dictionary.
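
One way such a helper could look, as a rough sketch only. The function name `_build_model_entry` and its parameter list are hypothetical; the keys and constants are copied from the lines under review, and the sample `sparse_layers` value is made up to match the shape the snippet appears to expect.

```python
def _build_model_entry(
    model_name,
    network_file,
    dense_path,
    sparse_paths,
    sparse_layers,
    dense_dim,
    num_cat_columns,
    vec_size,
    device_list,
    max_batch_size,
    gpucacheper,
):
    """Hypothetical helper: assemble the per-model dictionary in one place."""
    return {
        "model": model_name,
        "slot_num": num_cat_columns,
        "sparse_files": sparse_paths,
        "dense_file": dense_path,
        "maxnum_des_feature_per_sample": dense_dim,
        "network_file": network_file,
        "num_of_worker_buffer_in_pool": 4,
        "num_of_refresher_buffer_in_pool": 1,
        "deployed_device_list": device_list,
        "max_batch_size": max_batch_size,
        # One default value per sparse embedding table.
        "default_value_for_each_table": [0.0] * len(sparse_layers),
        "hit_rate_threshold": 0.9,
        "gpucacheper": gpucacheper,
        "gpucache": True,
        "cache_refresh_percentage_per_iteration": 0.2,
        # Number of slots per table, taken from each layer's slot_size_array.
        "maxnum_catfeature_query_per_table_per_sample": [
            len(x["sparse_embedding_hparam"]["slot_size_array"]) for x in sparse_layers
        ],
        "embedding_vecsize_per_table": vec_size,
        "embedding_table_names": [x["top"] for x in sparse_layers],
    }


# Made-up input resembling a single sparse embedding layer with three slots.
example_layers = [
    {"top": "sparse_embedding1", "sparse_embedding_hparam": {"slot_size_array": [64, 64, 64]}}
]
example_entry = _build_model_entry(
    model_name="0_hugectr",
    network_file="0_hugectr.json",
    dense_path="_dense_0.model",
    sparse_paths=["0_sparse_0.model"],
    sparse_layers=example_layers,
    dense_dim=1,
    num_cat_columns=3,
    vec_size=[16],
    device_list=[0],
    max_batch_size=64,
    gpucacheper=0.5,
)
```

Returning a dict literal keeps every key visible in one expression and leaves the calling method with only the work of gathering the inputs.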

return config


def _hugectr_config(name, hugectr_params, max_batch_size=None):
Contributor

It seems like there's a fair amount of repetition in this method. Maybe some of this can be done with a for loop?

Member

I'm pretty sure we implemented this as a for loop. @jperez999, it must have been dropped during the splitting of commits from #125.
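
The body of `_hugectr_config` is not shown in this thread, so the following only sketches the loop-based shape being discussed. It assumes the repetition comes from setting one string parameter per key, as the `parameters` block in the CI log suggests. The helper name is hypothetical, and a plain dict stands in for the Triton config object.

```python
def _set_string_parameters(parameters, values):
    """Hypothetical helper: store every value as a string parameter in one loop."""
    for key, value in values.items():
        if isinstance(value, bool):
            # The CI log shows booleans serialized as "true"/"false".
            text = str(value).lower()
        else:
            text = str(value)
        # A plain dict stands in for the real config parameter entry here.
        parameters[key] = {"string_value": text}
    return parameters


# Values mirror a few of the parameters printed in the CI log above.
example_params = _set_string_parameters(
    {},
    {"slots": 3, "des_feature_num": 1, "max_nnz": 2, "gpucache": True, "gpucacheper": 0.5},
)
```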

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 92070d02437d7679280097b7eaf495c1f5b19541, no merge conflicts.
Running as SYSTEM
Setting status of 92070d02437d7679280097b7eaf495c1f5b19541 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/146/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 92070d02437d7679280097b7eaf495c1f5b19541^{commit} # timeout=10
Checking out Revision 92070d02437d7679280097b7eaf495c1f5b19541 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 92070d02437d7679280097b7eaf495c1f5b19541 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
 > git rev-list --no-walk b2f89fe1c8f53060270d0483dcccc04b46b29164 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins4798257444405123681.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items

tests/unit/test_version.py . [ 2%]
tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py . [ 4%]
tests/unit/systems/test_ensemble.py .... [ 12%]
tests/unit/systems/test_ensemble_ops.py .. [ 16%]
tests/unit/systems/test_export.py . [ 18%]
tests/unit/systems/test_graph.py . [ 20%]
tests/unit/systems/test_inference_ops.py .. [ 24%]
tests/unit/systems/test_op_runner.py .... [ 32%]
tests/unit/systems/test_tensorflow_inf_op.py ... [ 38%]
tests/unit/systems/fil/test_fil.py .......................... [ 91%]
tests/unit/systems/fil/test_forest.py ... [ 97%]
tests/unit/systems/hugectr/test_hugectr.py F [100%]

=================================== FAILURES ===================================
________________________________ test_training _________________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-4/test_training0')

def test_training(tmpdir):
    cat_dtypes = {"a": int, "b": int, "c": int}
    dataset = cudf.datasets.randomdata(1, dtypes={**cat_dtypes, "label": bool})
    dataset["label"] = dataset["label"].astype("int32")

    categorical_columns = list(cat_dtypes.keys())

    gdf = cudf.DataFrame(
        {
            "a": np.arange(64),
            "b": np.arange(64),
            "c": np.arange(64),
            "d": np.random.rand(64).tolist(),
            "label": [0] * 64,
        },
        dtype="int64",
    )
    gdf["label"] = gdf["label"].astype("float32")
    train_dataset = nvt.Dataset(gdf)

    dense_columns = ["d"]

    dict_dtypes = {}
    for col in dense_columns:
        dict_dtypes[col] = np.float32

    for col in categorical_columns:
        dict_dtypes[col] = np.int64

    for col in ["label"]:
        dict_dtypes[col] = np.float32

    train_path = os.path.join(tmpdir, "train/")
    os.mkdir(train_path)

    train_dataset.to_parquet(
        output_path=train_path,
        shuffle=nvt.io.Shuffle.PER_PARTITION,
        cats=categorical_columns,
        conts=dense_columns,
        labels=["label"],
        dtypes=dict_dtypes,
    )

    embeddings = {"a": (64, 16), "b": (64, 16), "c": (64, 16)}

    total_cardinality = 0
    slot_sizes = []

    for column in cat_dtypes:
        slot_sizes.append(embeddings[column][0])
        total_cardinality += embeddings[column][0]

    # slot sizes = list of cardinalities per column, total is sum of individual
    model = _run_model(slot_sizes, train_path, len(dense_columns))

    model_op = HugeCTR(model, max_nnz=2, device_list=[0])

    model_repository_path = os.path.join(tmpdir, "model_repository")

    input_schema = Schema(
        [
            ColumnSchema("DES", dtype=np.float32),
            ColumnSchema("CATCOLUMN", dtype=np.int64),
            ColumnSchema("ROWINDEX", dtype=np.int32),
        ]
    )
    triton_chain = ColumnSelector(["DES", "CATCOLUMN", "ROWINDEX"]) >> model_op
    ens = Ensemble(triton_chain, input_schema)

    os.makedirs(model_repository_path)

    enc_config, node_configs = ens.export(model_repository_path)

    assert enc_config
    assert len(node_configs) == 1
    assert node_configs[0].name == "0_hugectr"

    df = train_dataset.to_ddf().compute()[:5]
    dense, cats, rowptr = _convert(df, slot_sizes, categorical_columns, labels=["label"])

    inputs = [
        grpcclient.InferInput("DES", dense.shape, triton.np_to_triton_dtype(dense.dtype)),
        grpcclient.InferInput("CATCOLUMN", cats.shape, triton.np_to_triton_dtype(cats.dtype)),
        grpcclient.InferInput("ROWINDEX", rowptr.shape, triton.np_to_triton_dtype(rowptr.dtype)),
    ]
    inputs[0].set_data_from_numpy(dense)
    inputs[1].set_data_from_numpy(cats)
    inputs[2].set_data_from_numpy(rowptr)

    response = _run_ensemble_on_tritonserver(
        model_repository_path,
        ["OUTPUT0"],
        inputs,
        "0_hugectr",
        backend_config=f"hugectr,ps={tmpdir}/model_repository/ps.json",
    )
    assert len(response.as_numpy("OUTPUT0")) == df.shape[0]

    model_config = node_configs[0].parameters["config"].string_value

    hugectr_name = node_configs[0].name
    dense_path = f"{tmpdir}/model_repository/{hugectr_name}/1/_dense_0.model"
    sparse_files = [f"{tmpdir}/model_repository/{hugectr_name}/1/0_sparse_0.model"]
>   out_predict = _predict(
        dense, cats, rowptr, model_config, hugectr_name, dense_path, sparse_files
    )

tests/unit/systems/hugectr/test_hugectr.py:244:


dense_features = array([[0., 0., 0., 0., 0.]], dtype=float32)
embedding_columns = array([[ 0, 64, 128, 1, 65, 129, 2, 66, 130, 3, 67, 131, 4,
68, 132]])
row_ptrs = array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]],
dtype=int32)
config_file = '/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json'
model_name = '0_hugectr'
dense_path = '/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/_dense_0.model'
sparse_paths = ['/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_sparse_0.model']

def _predict(
    dense_features, embedding_columns, row_ptrs, config_file, model_name, dense_path, sparse_paths
):
    inference_params = InferenceParams(
        model_name=model_name,
        max_batchsize=64,
        hit_rate_threshold=0.5,
        dense_model_file=dense_path,
        sparse_model_files=sparse_paths,
        device_id=0,
        use_gpu_embedding_cache=True,
        cache_size_percentage=0.2,
        i64_input_key=True,
        use_mixed_precision=False,
    )
    inference_session = CreateInferenceSession(config_file, inference_params)
>   output = inference_session.predict(
        dense_features[0].tolist(), embedding_columns[0].tolist(), row_ptrs[0].tolist()
    )

E RuntimeError: Runtime error: an illegal memory access was encountered
E cudaStreamSynchronize(resource_manager_->get_local_gpu(0)->get_stream()) at predict(/hugectr/HugeCTR/src/inference/inference_session.cpp:203)

tests/unit/systems/hugectr/test_hugectr.py:267: RuntimeError
----------------------------- Captured stdout call -----------------------------
HugeCTR Version: 3.7
====================================================Model Init=====================================================
[HCTR][13:08:41.090][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][13:08:41.090][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][13:08:41.090][INFO][RK0][main]: Global seed is 206256206
[HCTR][13:08:41.137][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][13:08:41.396][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:08:41.396][INFO][RK0][main]: Start all2all warmup
[HCTR][13:08:41.396][INFO][RK0][main]: End all2all warmup
[HCTR][13:08:41.397][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][13:08:41.397][INFO][RK0][main]: Device 0: Tesla P100-DGXS-16GB
[HCTR][13:08:41.397][INFO][RK0][main]: num of DataReader workers: 1
[HCTR][13:08:41.397][INFO][RK0][main]: Vocabulary size: 0
[HCTR][13:08:41.398][INFO][RK0][main]: max_vocabulary_size_per_gpu_=584362
[HCTR][13:08:41.398][DEBUG][RK0][tid #140438409291520]: file_name_ /tmp/pytest-of-jenkins/pytest-4/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][13:08:41.399][DEBUG][RK0][tid #140437340702464]: file_name_ /tmp/pytest-of-jenkins/pytest-4/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][13:08:41.399][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][13:08:41.693][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][13:08:41.693][INFO][RK0][main]: gpu0 init embedding done
[HCTR][13:08:41.695][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][13:08:41.696][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][13:08:41.696][INFO][RK0][main]: label                                   Dense                         Sparse
label                                   dense                         data1
(None, 1)                               (None, 1)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash      data1                         sparse_embedding1             (None, 3, 16)

InnerProduct                            dense                         fc1                           (None, 512)

Reshape                                 sparse_embedding1             reshape1                      (None, 48)

InnerProduct                            reshape1                      fc2                           (None, 1)
                                        fc1

BinaryCrossEntropyLoss                  fc2                           loss
                                        label

=====================================================Model Fit=====================================================
[HCTR][13:08:41.696][INFO][RK0][main]: Use non-epoch mode with number of iterations: 20
[HCTR][13:08:41.696][INFO][RK0][main]: Training batchsize: 10, evaluation batchsize: 10
[HCTR][13:08:41.696][INFO][RK0][main]: Evaluation interval: 200, snapshot interval: 10
[HCTR][13:08:41.696][INFO][RK0][main]: Dense network trainable: True
[HCTR][13:08:41.696][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][13:08:41.696][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][13:08:41.696][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][13:08:41.696][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][13:08:41.696][INFO][RK0][main]: Training source file: /tmp/pytest-of-jenkins/pytest-4/test_training0/train/file_list.txt
[HCTR][13:08:41.696][INFO][RK0][main]: Evaluation source file: /tmp/pytest-of-jenkins/pytest-4/test_training0/train/file_list.txt
[HCTR][13:08:41.701][DEBUG][RK0][tid #140438409291520]: file_name_ /tmp/pytest-of-jenkins/pytest-4/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][13:08:41.706][DEBUG][RK0][tid #140438409291520]: file_name_ /tmp/pytest-of-jenkins/pytest-4/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][13:08:41.723][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][13:08:41.723][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][13:08:41.741][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][13:08:41.777][INFO][RK0][main]: Done
[HCTR][13:08:41.795][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][13:08:41.833][INFO][RK0][main]: Done
[HCTR][13:08:41.834][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][13:08:41.834][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][13:08:41.834][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][13:08:41.839][DEBUG][RK0][tid #140438409291520]: file_name_ /tmp/pytest-of-jenkins/pytest-4/test_training0/train/part_0.parquet file_total_rows_ 64
[HCTR][13:08:41.842][INFO][RK0][main]: Finish 20 iterations with batchsize: 10 in 0.15s.
[HCTR][13:08:41.843][INFO][RK0][main]: Save the model graph to /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json successfully
[HCTR][13:08:41.844][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][13:08:41.844][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][13:08:41.862][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][13:08:41.897][INFO][RK0][main]: Done
[HCTR][13:08:41.916][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][13:08:41.954][INFO][RK0][main]: Done
[HCTR][13:08:41.955][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][13:08:41.955][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][13:08:41.955][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][13:08:42.367][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][13:08:42.367][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][13:08:42.367][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][13:08:42.367][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][13:08:42.631][INFO][RK0][main]: Table: hps_et.0_hugectr.sparse_embedding1; cached 64 / 64 embeddings in volatile database (PreallocatedHashMapBackend); load: 64 / 18446744073709551615 (0.00%).
[HCTR][13:08:42.631][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][13:08:42.631][INFO][RK0][main]: Create embedding cache in device 0.
[HCTR][13:08:42.632][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][13:08:42.632][INFO][RK0][main]: Configured cache hit rate threshold: 0.900000
[HCTR][13:08:42.632][INFO][RK0][main]: The size of thread pool: 16
[HCTR][13:08:42.632][INFO][RK0][main]: The size of worker memory pool: 4
[HCTR][13:08:42.632][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][13:08:42.649][INFO][RK0][main]: Global seed is 3322202506
[HCTR][13:08:43.268][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:08:43.268][INFO][RK0][main]: Start all2all warmup
[HCTR][13:08:43.268][INFO][RK0][main]: End all2all warmup
[HCTR][13:08:43.269][INFO][RK0][main]: Create inference session on device: 0
[HCTR][13:08:43.269][INFO][RK0][main]: Model name: 0_hugectr
[HCTR][13:08:43.269][INFO][RK0][main]: Use mixed precision: False
[HCTR][13:08:43.269][INFO][RK0][main]: Use cuda graph: True
[HCTR][13:08:43.269][INFO][RK0][main]: Max batchsize: 64
[HCTR][13:08:43.269][INFO][RK0][main]: Use I64 input key: True
[HCTR][13:08:43.269][INFO][RK0][main]: start create embedding for inference
[HCTR][13:08:43.269][INFO][RK0][main]: sparse_input name data1
[HCTR][13:08:43.269][INFO][RK0][main]: create embedding for inference success
[HCTR][13:08:43.269][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
Signal (2) received.
[HCTR][13:08:47.604][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][13:08:47.605][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][13:08:47.605][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][13:08:47.605][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][13:08:47.897][INFO][RK0][main]: Table: hps_et.0_hugectr.sparse_embedding1; cached 64 / 64 embeddings in volatile database (PreallocatedHashMapBackend); load: 64 / 18446744073709551615 (0.00%).
[HCTR][13:08:47.897][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][13:08:47.897][INFO][RK0][main]: Create embedding cache in device 0.
[HCTR][13:08:47.898][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.200000
[HCTR][13:08:47.898][INFO][RK0][main]: Configured cache hit rate threshold: 0.500000
[HCTR][13:08:47.898][INFO][RK0][main]: The size of thread pool: 16
[HCTR][13:08:47.898][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][13:08:47.898][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][13:08:47.902][INFO][RK0][main]: Global seed is 2525407045
[HCTR][13:08:47.902][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][13:08:47.931][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:08:47.931][INFO][RK0][main]: Start all2all warmup
[HCTR][13:08:47.931][INFO][RK0][main]: End all2all warmup
[HCTR][13:08:47.931][INFO][RK0][main]: Create inference session on device: 0
[HCTR][13:08:47.931][INFO][RK0][main]: Model name: 0_hugectr
[HCTR][13:08:47.931][INFO][RK0][main]: Use mixed precision: False
[HCTR][13:08:47.931][INFO][RK0][main]: Use cuda graph: True
[HCTR][13:08:47.931][INFO][RK0][main]: Max batchsize: 64
[HCTR][13:08:47.931][INFO][RK0][main]: Use I64 input key: True
[HCTR][13:08:47.931][INFO][RK0][main]: start create embedding for inference
[HCTR][13:08:47.931][INFO][RK0][main]: sparse_input name data1
[HCTR][13:08:47.931][INFO][RK0][main]: create embedding for inference success
[HCTR][13:08:47.931][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
----------------------------- Captured stderr call -----------------------------
I0715 13:08:42.231054 19468 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1916000000' with size 268435456
I0715 13:08:42.231852 19468 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0715 13:08:42.234315 19468 model_repository_manager.cc:1191] loading: 0_hugectr:1
I0715 13:08:42.367296 19468 hugectr.cc:1738] TRITONBACKEND_Initialize: hugectr
I0715 13:08:42.367327 19468 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.9
I0715 13:08:42.367334 19468 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.9
I0715 13:08:42.367340 19468 hugectr.cc:1772] The HugeCTR backend Repository location: /opt/tritonserver/backends/hugectr
I0715 13:08:42.367347 19468 hugectr.cc:1781] The HugeCTR backend configuration: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/ps.json","default-max-batch-size":"4"}}
I0715 13:08:42.367372 19468 hugectr.cc:345] *****Parsing Parameter Server Configuration from /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/ps.json
I0715 13:08:42.367425 19468 hugectr.cc:366] Support 64-bit keys = 1
I0715 13:08:42.367460 19468 hugectr.cc:591] Model name = 0_hugectr
I0715 13:08:42.367468 19468 hugectr.cc:600] Model '0_hugectr' -> network file = /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0715 13:08:42.367475 19468 hugectr.cc:607] Model '0_hugectr' -> max. batch size = 64
I0715 13:08:42.367481 19468 hugectr.cc:613] Model '0_hugectr' -> dense model file = /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/_dense_0.model
I0715 13:08:42.367490 19468 hugectr.cc:619] Model '0_hugectr' -> sparse model files = [/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_sparse_0.model]
I0715 13:08:42.367496 19468 hugectr.cc:630] Model '0_hugectr' -> use GPU embedding cache = 1
I0715 13:08:42.367515 19468 hugectr.cc:639] Model '0_hugectr' -> hit rate threshold = 0.9
I0715 13:08:42.367523 19468 hugectr.cc:647] Model '0_hugectr' -> per model GPU cache = 0.5
I0715 13:08:42.367535 19468 hugectr.cc:664] Model '0_hugectr' -> use_mixed_precision = 0
I0715 13:08:42.367543 19468 hugectr.cc:671] Model '0_hugectr' -> scaler = 1
I0715 13:08:42.367549 19468 hugectr.cc:677] Model '0_hugectr' -> use_algorithm_search = 1
I0715 13:08:42.367555 19468 hugectr.cc:685] Model '0_hugectr' -> use_cuda_graph = 1
I0715 13:08:42.367561 19468 hugectr.cc:692] Model '0_hugectr' -> num. pool worker buffers = 4
I0715 13:08:42.367567 19468 hugectr.cc:700] Model '0_hugectr' -> num. pool refresh buffers = 1
I0715 13:08:42.367574 19468 hugectr.cc:708] Model '0_hugectr' -> cache refresh rate per iteration = 0.2
I0715 13:08:42.367582 19468 hugectr.cc:717] Model '0_hugectr' -> deployed device list = [0]
I0715 13:08:42.367589 19468 hugectr.cc:725] Model '0_hugectr' -> default value for each table = [0]
I0715 13:08:42.367595 19468 hugectr.cc:733] Model '0_hugectr' -> maxnum_des_feature_per_sample = 1
I0715 13:08:42.367601 19468 hugectr.cc:741] Model '0_hugectr' -> refresh_delay = 0
I0715 13:08:42.367608 19468 hugectr.cc:747] Model '0_hugectr' -> refresh_interval = 0
I0715 13:08:42.367615 19468 hugectr.cc:755] Model '0_hugectr' -> maxnum_catfeature_query_per_table_per_sample list = [3]
I0715 13:08:42.367622 19468 hugectr.cc:766] Model '0_hugectr' -> embedding_vecsize_per_table list = [16]
I0715 13:08:42.367629 19468 hugectr.cc:773] Model '0_hugectr' -> embedding model names = [, sparse_embedding1]
I0715 13:08:42.367635 19468 hugectr.cc:780] Model '0_hugectr' -> label_dim = 1
I0715 13:08:42.367641 19468 hugectr.cc:785] Model '0_hugectr' -> the number of slots = 3
I0715 13:08:42.367651 19468 hugectr.cc:806] *****The HugeCTR Backend Parameter Server is creating... *****
I0715 13:08:42.367785 19468 hugectr.cc:814] ***** Parameter Server(Int64) is creating... *****
I0715 13:08:42.636847 19468 hugectr.cc:825] *****The HugeCTR Backend Backend created the Parameter Server successfully! *****
I0715 13:08:42.636911 19468 hugectr.cc:1844] TRITONBACKEND_ModelInitialize: 0_hugectr (version 1)
I0715 13:08:42.636919 19468 hugectr.cc:1857] Repository location: /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr
I0715 13:08:42.636925 19468 hugectr.cc:1872] backend configuration in mode: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/ps.json","default-max-batch-size":"4"}}
I0715 13:08:42.636935 19468 hugectr.cc:1888] Parsing the latest Parameter Server json config file for deploying model 0_hugectr online
I0715 13:08:42.636943 19468 hugectr.cc:1893] Hierarchical PS version is 0 and the current Model Version is 1
I0715 13:08:42.636951 19468 hugectr.cc:345] *****Parsing Parameter Server Configuration from /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/ps.json
I0715 13:08:42.636991 19468 hugectr.cc:366] Support 64-bit keys = 1
I0715 13:08:42.637013 19468 hugectr.cc:591] Model name = 0_hugectr
I0715 13:08:42.637020 19468 hugectr.cc:600] Model '0_hugectr' -> network file = /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0715 13:08:42.637027 19468 hugectr.cc:607] Model '0_hugectr' -> max. batch size = 64
I0715 13:08:42.637033 19468 hugectr.cc:613] Model '0_hugectr' -> dense model file = /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/_dense_0.model
I0715 13:08:42.637041 19468 hugectr.cc:619] Model '0_hugectr' -> sparse model files = [/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_sparse_0.model]
I0715 13:08:42.637047 19468 hugectr.cc:630] Model '0_hugectr' -> use GPU embedding cache = 1
I0715 13:08:42.637058 19468 hugectr.cc:639] Model '0_hugectr' -> hit rate threshold = 0.9
I0715 13:08:42.637067 19468 hugectr.cc:647] Model '0_hugectr' -> per model GPU cache = 0.5
I0715 13:08:42.637079 19468 hugectr.cc:664] Model '0_hugectr' -> use_mixed_precision = 0
I0715 13:08:42.637086 19468 hugectr.cc:671] Model '0_hugectr' -> scaler = 1
I0715 13:08:42.637092 19468 hugectr.cc:677] Model '0_hugectr' -> use_algorithm_search = 1
I0715 13:08:42.637097 19468 hugectr.cc:685] Model '0_hugectr' -> use_cuda_graph = 1
I0715 13:08:42.637103 19468 hugectr.cc:692] Model '0_hugectr' -> num. pool worker buffers = 4
I0715 13:08:42.637109 19468 hugectr.cc:700] Model '0_hugectr' -> num. pool refresh buffers = 1
I0715 13:08:42.637119 19468 hugectr.cc:708] Model '0_hugectr' -> cache refresh rate per iteration = 0.2
I0715 13:08:42.637126 19468 hugectr.cc:717] Model '0_hugectr' -> deployed device list = [0]
I0715 13:08:42.637134 19468 hugectr.cc:725] Model '0_hugectr' -> default value for each table = [0]
I0715 13:08:42.637139 19468 hugectr.cc:733] Model '0_hugectr' -> maxnum_des_feature_per_sample = 1
I0715 13:08:42.637145 19468 hugectr.cc:741] Model '0_hugectr' -> refresh_delay = 0
I0715 13:08:42.637151 19468 hugectr.cc:747] Model '0_hugectr' -> refresh_interval = 0
I0715 13:08:42.637159 19468 hugectr.cc:755] Model '0_hugectr' -> maxnum_catfeature_query_per_table_per_sample list = [3]
I0715 13:08:42.637165 19468 hugectr.cc:766] Model '0_hugectr' -> embedding_vecsize_per_table list = [16]
I0715 13:08:42.637174 19468 hugectr.cc:773] Model '0_hugectr' -> embedding model names = [, sparse_embedding1]
I0715 13:08:42.637180 19468 hugectr.cc:780] Model '0_hugectr' -> label_dim = 1
I0715 13:08:42.637186 19468 hugectr.cc:785] Model '0_hugectr' -> the number of slots = 3
I0715 13:08:42.638097 19468 hugectr.cc:1078] Verifying model configuration: {
"name": "0_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 64,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "0_hugectr_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "3"
},
"config": {
"string_value": "/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json"
},
"cat_feature_num": {
"string_value": "3"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "2"
},
"gpucacheper": {
"string_value": "0.5"
},
"embedding_vector_size": {
"string_value": "16"
},
"des_feature_num": {
"string_value": "1"
}
},
"model_warmup": []
}
I0715 13:08:42.638148 19468 hugectr.cc:1164] The model configuration: {
"name": "0_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 64,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "0_hugectr_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "3"
},
"config": {
"string_value": "/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json"
},
"cat_feature_num": {
"string_value": "3"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "2"
},
"gpucacheper": {
"string_value": "0.5"
},
"embedding_vector_size": {
"string_value": "16"
},
"des_feature_num": {
"string_value": "1"
}
},
"model_warmup": []
}
I0715 13:08:42.638167 19468 hugectr.cc:1209] slots set = 3
I0715 13:08:42.638173 19468 hugectr.cc:1213] slots set = 3
I0715 13:08:42.638179 19468 hugectr.cc:1221] desene number = 1
I0715 13:08:42.638188 19468 hugectr.cc:1239] The max categorical feature number = 3
I0715 13:08:42.638196 19468 hugectr.cc:1244] embedding size = 16
I0715 13:08:42.638201 19468 hugectr.cc:1250] embedding size = 16
I0715 13:08:42.638207 19468 hugectr.cc:1256] maxnnz = 2
I0715 13:08:42.638215 19468 hugectr.cc:1265] refresh_interval = 0
I0715 13:08:42.638221 19468 hugectr.cc:1273] refresh_delay = 0
I0715 13:08:42.638227 19468 hugectr.cc:1281] HugeCTR model config path = /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0715 13:08:42.638234 19468 hugectr.cc:1329] support mixed_precision = 0
I0715 13:08:42.638246 19468 hugectr.cc:1348] gpu cache per = 0.5
I0715 13:08:42.638253 19468 hugectr.cc:1366] hit-rate threshold = 0.9
I0715 13:08:42.638260 19468 hugectr.cc:1374] Label dim = 1
I0715 13:08:42.638266 19468 hugectr.cc:1383] support 64-bit embedding key = 1
I0715 13:08:42.638272 19468 hugectr.cc:1394] Model_Inference_Para.max_batchsize: 64
I0715 13:08:42.638277 19468 hugectr.cc:1398] max_batch_size in model config.pbtxt is 64
I0715 13:08:42.638284 19468 hugectr.cc:1468] ******Creating Embedding Cache for model 0_hugectr in device 0
I0715 13:08:42.638290 19468 hugectr.cc:1495] ******Creating Embedding Cache for model 0_hugectr successfully
I0715 13:08:42.638654 19468 hugectr.cc:1996] TRITONBACKEND_ModelInstanceInitialize: 0_hugectr_0 (device 0)
I0715 13:08:42.638666 19468 hugectr.cc:1637] Triton Model Instance Initialization on device 0
I0715 13:08:42.638673 19468 hugectr.cc:1647] Dense Feature buffer allocation:
I0715 13:08:42.649219 19468 hugectr.cc:1654] Categorical Feature buffer allocation:
I0715 13:08:42.649261 19468 hugectr.cc:1672] Categorical Row Index buffer allocation:
I0715 13:08:42.649274 19468 hugectr.cc:1680] Predict result buffer allocation:
I0715 13:08:42.649287 19468 hugectr.cc:2009] Loading HugeCTR Model
I0715 13:08:42.649295 19468 hugectr.cc:1698] The model origin json configuration file path is: /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/0_hugectr/1/0_hugectr.json
I0715 13:08:43.402591 19468 hugectr.cc:1706] ******Loading HugeCTR model successfully
I0715 13:08:43.402756 19468 model_repository_manager.cc:1345] successfully loaded '0_hugectr' version 1
I0715 13:08:43.403105 19468 model_repository_manager.cc:1191] loading: ensemble_model:1
I0715 13:08:43.503483 19468 model_repository_manager.cc:1345] successfully loaded 'ensemble_model' version 1
I0715 13:08:43.503619 19468 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0715 13:08:43.503758 19468 server.cc:583]
+---------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+---------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository/ps.json","default-max-batch-size":"4"}} |
+---------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 13:08:43.503820 19468 server.cc:626]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| 0_hugectr | 1 | READY |
| ensemble_model | 1 | READY |
+----------------+---------+--------+

I0715 13:08:43.540290 19468 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0715 13:08:43.541137 19468 tritonserver.cc:2138]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.22.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-4/test_training0/model_repository |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 13:08:43.541956 19468 grpc_server.cc:4589] Started GRPCInferenceService at 0.0.0.0:8001
I0715 13:08:43.542488 19468 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0715 13:08:43.583764 19468 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
W0715 13:08:44.561628 19468 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0715 13:08:44.561690 19468 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
I0715 13:08:45.059789 19468 server.cc:257] Waiting for in-flight requests to complete.
I0715 13:08:45.059819 19468 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
I0715 13:08:45.059837 19468 model_repository_manager.cc:1223] unloading: ensemble_model:1
I0715 13:08:45.059912 19468 model_repository_manager.cc:1223] unloading: 0_hugectr:1
I0715 13:08:45.059975 19468 server.cc:288] All models are stopped, unloading models
I0715 13:08:45.059994 19468 server.cc:295] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0715 13:08:45.060004 19468 model_repository_manager.cc:1328] successfully unloaded 'ensemble_model' version 1
I0715 13:08:45.060397 19468 hugectr.cc:2026] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0715 13:08:45.075745 19468 hugectr.cc:1957] TRITONBACKEND_ModelFinalize: delete model state
I0715 13:08:45.076894 19468 hugectr.cc:1505] ******Destorying Embedding Cache for model 0_hugectr successfully
I0715 13:08:45.076965 19468 model_repository_manager.cc:1328] successfully unloaded '0_hugectr' version 1
W0715 13:08:45.561810 19468 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0715 13:08:45.561866 19468 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
I0715 13:08:46.060056 19468 server.cc:295] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
I0715 13:08:46.061463 19468 hugectr.cc:1827] TRITONBACKEND_Backend Finalize: HugectrBackend
W0715 13:08:46.582229 19468 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0715 13:08:46.582329 19468 metrics.cc:507] Unable to get memory usage for GPU 0. Memory usage status:Success, value:0. Memory total status:Success, value:0
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18
/usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18: DeprecationWarning: The nvtabular.framework_utils module is being replaced by the Merlin Models library. Support for importing from nvtabular.framework_utils is deprecated, and will be removed in a future version. Please consider using the models and layers from Merlin Models instead.
warnings.warn(

tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py: 1 warning
tests/unit/systems/test_ensemble.py: 2 warnings
tests/unit/systems/test_export.py: 1 warning
tests/unit/systems/test_inference_ops.py: 2 warnings
tests/unit/systems/test_op_runner.py: 4 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column x is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column y is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column id is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/fil/test_fil.py::test_binary_classifier_default[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_binary_classifier_with_proba[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_multi_classifier[sklearn_forest_classifier-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_regressor[sklearn_forest_regressor-get_model_params4]
tests/unit/systems/fil/test_fil.py::test_model_file[sklearn_forest_regressor-checkpoint.tl]
/usr/local/lib/python3.8/dist-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
warnings.warn(msg, category=FutureWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/systems/hugectr/test_hugectr.py::test_training - RuntimeErr...
============ 1 failed, 48 passed, 19 warnings in 509.81s (0:08:29) =============
terminate called without an active exception
/tmp/jenkins4798257444405123681.sh: line 14: 13874 Aborted (core dumped) pytest tests/unit
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins5671734044207979905.sh

@karlhigley changed the title from "Hugectr base" to "Base operator for HugeCTR serving support" on Jul 29, 2022
@karlhigley added this to the Merlin 22.08 milestone on Jul 29, 2022
@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 221c35c040eb96d183e8302fb1cae4d8542d514e, no merge conflicts.
Running as SYSTEM
Setting status of 221c35c040eb96d183e8302fb1cae4d8542d514e to PENDING with url https://10.20.13.93:8080/job/merlin_systems/310/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 221c35c040eb96d183e8302fb1cae4d8542d514e^{commit} # timeout=10
Checking out Revision 221c35c040eb96d183e8302fb1cae4d8542d514e (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
Commit message: "Split out model and dataset creation into conftest"
 > git rev-list --no-walk 4269cf90c507f051348b5b63ad6236b3638e05ba # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins5580325469944632981.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 72 items

tests/unit/test_version.py . [ 1%]
tests/unit/examples/test_serving_an_xgboost_model_with_merlin_systems.py . [ 2%]
tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py . [ 4%]
tests/unit/systems/test_export.py . [ 5%]
tests/unit/systems/dag/test_graph.py .. [ 8%]
tests/unit/systems/dag/test_model_registry.py .. [ 11%]
tests/unit/systems/dag/test_op_runner.py .... [ 16%]
tests/unit/systems/dag/ops/test_ops.py .. [ 19%]
tests/unit/systems/ops/feast/test_op.py ....... [ 29%]
tests/unit/systems/ops/fil/test_ensemble.py . [ 30%]
tests/unit/systems/ops/fil/test_forest.py .... [ 36%]
tests/unit/systems/ops/fil/test_op.py .......................... [ 72%]
tests/unit/systems/ops/hugectr/test_op.py EE [ 75%]
tests/unit/systems/ops/implicit/test_op.py ..Fatal Python error: Segmentation fault

Thread 0x00007f281bfff700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/threading.py", line 558 in wait
File "/usr/local/lib/python3.8/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28c6ffd700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28c77fe700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28c7fff700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28ccff9700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28cd7fa700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28ceffd700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28cffff700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28f0b64700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28f27f4700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/queue.py", line 179 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28f1365700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/queue.py", line 179 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28f2ff5700 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/distributed/profile.py", line 275 in _watch
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f28f37f6700 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 256 in is_storage_device
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 1157 in disk_io_counters
File "/usr/local/lib/python3.8/dist-packages/psutil/__init__.py", line 2064 in disk_io_counters
File "/usr/local/lib/python3.8/dist-packages/distributed/system_monitor.py", line 115 in update
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921 in _run
File "/usr/lib/python3.8/asyncio/events.py", line 81 in _run
File "/usr/lib/python3.8/asyncio/base_events.py", line 1859 in _run_once
File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
File "/usr/local/lib/python3.8/dist-packages/tornado/platform/asyncio.py", line 215 in start
File "/usr/local/lib/python3.8/dist-packages/distributed/utils.py", line 456 in run_loop
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f297373f700 (most recent call first):
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f2c4a0d2b80 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/_sputils.py", line 228 in isshape
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/_compressed.py", line 37 in __init__
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/_compressed.py", line 1230 in _with_data
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/_data.py", line 71 in astype
File "/var/jenkins_home/workspace/merlin_systems/systems/tests/unit/systems/ops/implicit/test_op.py", line 47 in test_reload_from_config
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 1761 in runtest
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 166 in pytest_runtest_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 259 in <lambda>
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 338 in from_call
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 258 in call_runtest_hook
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 219 in call_and_report
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 130 in runtestprotocol
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 347 in pytest_runtestloop
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 322 in _main
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 268 in wrap_session
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 315 in pytest_cmdline_main
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 164 in main
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 187 in console_main
File "/usr/local/bin/pytest", line 8 in <module>
/tmp/jenkins5580325469944632981.sh: line 16: 2166 Segmentation fault (core dumped) pytest tests/unit
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins13845742082461438711.sh

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 4d99847a4d45afb83050acc2c99235edc09ac0eb, no merge conflicts.
Running as SYSTEM
Setting status of 4d99847a4d45afb83050acc2c99235edc09ac0eb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/311/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 4d99847a4d45afb83050acc2c99235edc09ac0eb^{commit} # timeout=10
Checking out Revision 4d99847a4d45afb83050acc2c99235edc09ac0eb (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
Commit message: "Add slot_sizes parameter"
 > git rev-list --no-walk 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins15556838871102646320.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items

tests/unit/test_version.py . [ 1%]
tests/unit/examples/test_serving_an_xgboost_model_with_merlin_systems.py . [ 2%]
tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py . [ 4%]
tests/unit/systems/test_export.py . [ 5%]
tests/unit/systems/dag/test_graph.py .. [ 8%]
tests/unit/systems/dag/test_model_registry.py .. [ 11%]
tests/unit/systems/dag/test_op_runner.py .... [ 16%]
tests/unit/systems/dag/ops/test_ops.py .. [ 19%]
tests/unit/systems/hugectr/test_hugectr.py . [ 21%]
tests/unit/systems/ops/feast/test_op.py ....... [ 30%]
tests/unit/systems/ops/fil/test_ensemble.py Fatal Python error: Segmentation fault

Thread 0x00007f978ed26700 (most recent call first):
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f9ad041ab80 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/cudf/core/column/column.py", line 302 in from_arrow
File "/usr/local/lib/python3.8/dist-packages/cudf/core/column/column.py", line 1776 in as_column
File "/usr/local/lib/python3.8/dist-packages/cudf/core/column/column.py", line 1983 in as_column
File "/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py", line 4386 in from_pandas
File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101 in inner
File "/usr/local/lib/python3.8/dist-packages/merlin/core/dispatch.py", line 567 in convert_data
File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 262 in __init__
File "/var/jenkins_home/workspace/merlin_systems/systems/tests/unit/systems/ops/fil/test_ensemble.py", line 59 in test_workflow_with_forest_inference
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 1761 in runtest
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 166 in pytest_runtest_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 259 in <lambda>
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 338 in from_call
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 258 in call_runtest_hook
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 219 in call_and_report
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 130 in runtestprotocol
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 347 in pytest_runtestloop
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 322 in _main
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 268 in wrap_session
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 315 in pytest_cmdline_main
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 164 in main
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 187 in console_main
File "/usr/local/bin/pytest", line 8 in <module>
/tmp/jenkins15556838871102646320.sh: line 16: 4300 Segmentation fault (core dumped) pytest tests/unit
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins7581381615603073211.sh

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #129 of commit 027f495e62b6030a2cd712f532280b54c3b54a5a, no merge conflicts.
Running as SYSTEM
Setting status of 027f495e62b6030a2cd712f532280b54c3b54a5a to PENDING with url https://10.20.13.93:8080/job/merlin_systems/312/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
 > git rev-parse 027f495e62b6030a2cd712f532280b54c3b54a5a^{commit} # timeout=10
Checking out Revision 027f495e62b6030a2cd712f532280b54c3b54a5a (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 027f495e62b6030a2cd712f532280b54c3b54a5a # timeout=10
Commit message: "Extract config to methods and extend with all known params"
 > git rev-list --no-walk 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins10890944273429405522.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items

tests/unit/test_version.py . [ 1%]
tests/unit/examples/test_serving_an_xgboost_model_with_merlin_systems.py . [ 2%]
tests/unit/examples/test_serving_ranking_models_with_merlin_systems.py . [ 4%]
tests/unit/systems/test_export.py . [ 5%]
tests/unit/systems/dag/test_graph.py .. [ 8%]
tests/unit/systems/dag/test_model_registry.py .. [ 11%]
tests/unit/systems/dag/test_op_runner.py .... [ 16%]
tests/unit/systems/dag/ops/test_ops.py .. [ 19%]
tests/unit/systems/hugectr/test_hugectr.py F [ 21%]
tests/unit/systems/ops/feast/test_op.py ....... [ 30%]
tests/unit/systems/ops/fil/test_ensemble.py . [ 32%]
tests/unit/systems/ops/fil/test_forest.py .... [ 38%]
tests/unit/systems/ops/fil/test_op.py .......................... [ 74%]
tests/unit/systems/ops/implicit/test_op.py ...... [ 83%]
tests/unit/systems/ops/nvtabular/test_op.py .. [ 85%]
tests/unit/systems/ops/tf/test_ensemble.py Fatal Python error: Segmentation fault

Thread 0x00007f8a407f4700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/threading.py", line 558 in wait
File "/usr/local/lib/python3.8/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ab67fc700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ab6ffd700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ab77fe700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ab7fff700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ad4ff9700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ad5ffb700 (most recent call first):
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27 in poll
File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47 in wait
File "/usr/lib/python3.8/multiprocessing/process.py", line 149 in join
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 216 in _watch_process
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ad67fc700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ad77fe700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 302 in wait
File "/usr/lib/python3.8/queue.py", line 170 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/process.py", line 201 in _watch_message_queue
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8b08ffd700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/queue.py", line 179 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8ad7fff700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/queue.py", line 179 in get
File "/usr/local/lib/python3.8/dist-packages/distributed/threadpoolexecutor.py", line 51 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8b097fe700 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/distributed/profile.py", line 275 in _watch
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8b09fff700 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/psutil/_common.py", line 741 in open_text
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 1020 in net_io_counters
File "/usr/local/lib/python3.8/dist-packages/psutil/__init__.py", line 2114 in net_io_counters
File "/usr/local/lib/python3.8/dist-packages/distributed/system_monitor.py", line 98 in update
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921 in _run
File "/usr/lib/python3.8/asyncio/events.py", line 81 in _run
File "/usr/lib/python3.8/asyncio/base_events.py", line 1859 in _run_once
File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
File "/usr/local/lib/python3.8/dist-packages/tornado/platform/asyncio.py", line 215 in start
File "/usr/local/lib/python3.8/dist-packages/distributed/utils.py", line 456 in run_loop
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f8b5173f700 (most recent call first):
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
File "/usr/lib/python3.8/threading.py", line 870 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f8e931d2b80 (most recent call first):
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 501 in init
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 272 in extend
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 282 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1328 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 274 in extend
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 282 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1328 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1336 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/message.py", line 129 in CopyFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 499 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1328 in MergeFrom
File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/containers.py", line 274 in extend
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/function_def_to_graph.py", line 215 in function_def_to_graph_def
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/function_def_to_graph.py", line 82 in function_def_to_graph
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/function_deserialization.py", line 409 in load_function_def_library
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 151 in __init__
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 912 in load_partial
File "/usr/local/lib/python3.8/dist-packages/keras/saving/saved_model/load.py", line 141 in load
File "/usr/local/lib/python3.8/dist-packages/keras/saving/save.py", line 209 in load_model
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64 in error_handler
File "/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/dag/ops/tensorflow.py", line 147 in _construct_schemas_from_model
File "/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/dag/ops/tensorflow.py", line 61 in __init__
File "/var/jenkins_home/workspace/merlin_systems/systems/tests/unit/systems/ops/tf/test_ensemble.py", line 86 in test_workflow_tf_e2e_config_verification
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 1761 in runtest
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 166 in pytest_runtest_call
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 259 in <lambda>
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 338 in from_call
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 258 in call_runtest_hook
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 219 in call_and_report
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 130 in runtestprotocol
File "/usr/local/lib/python3.8/dist-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 347 in pytest_runtestloop
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 322 in _main
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 268 in wrap_session
File "/usr/local/lib/python3.8/dist-packages/_pytest/main.py", line 315 in pytest_cmdline_main
File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 39 in _multicall
File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 80 in _hookexec
File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 265 in __call__
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 164 in main
File "/usr/local/lib/python3.8/dist-packages/_pytest/config/__init__.py", line 187 in console_main
File "/usr/local/bin/pytest", line 8 in <module>
/tmp/jenkins10890944273429405522.sh: line 16: 6555 Segmentation fault (core dumped) pytest tests/unit
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins781435638442693619.sh

@karlhigley karlhigley removed this from the Merlin 22.08 milestone Aug 26, 2022
@EvenOldridge
Member

@minseokl @zehuanw We're very close to HugeCTR support in Systems, but we're running into this segfault. Could you please review it with the team?

@karlhigley
Contributor

@jperez999 @oliverholworthy Could you resolve the conflicts on this when you get a chance? I'd still like to add this support, and maybe we can nudge the HugeCTR team to help us out again.

@karlhigley karlhigley modified the milestones: Merlin 22.11, Merlin 23.05 Apr 4, 2023
@karlhigley karlhigley modified the milestones: Merlin 23.05, Future May 4, 2023
6 participants