From 9fae20973a3b89e8a83fd544f0052606edb0ac2d Mon Sep 17 00:00:00 2001 From: David Gardner <96306125+dagardner-nv@users.noreply.github.com> Date: Thu, 1 Aug 2024 14:43:03 -0700 Subject: [PATCH] Add documentation checks to CI (#1821) * Add [vale](https://github.com/errata-ai/vale) as a documentation linter to CI * Add custom vocabulary (`ci/vale/styles/config/vocabularies/morpheus/accept.txt`). * Add sphinx's linkcheck to the documentation builds * Fix spelling & grammar errors found in existing documentation Closes #545 ## By Submitting this PR I confirm: - I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md). - When the PR is ready for review, new or existing tests cover these changes. - When the PR is ready for review, the documentation is up to date with these changes. Authors: - David Gardner (https://github.com/dagardner-nv) Approvers: - Michael Demoret (https://github.com/mdemoret-nv) - https://github.com/hsin-c URL: https://github.com/nv-morpheus/Morpheus/pull/1821 --- .devcontainer/README.md | 12 +- .vale.ini | 29 ++++ ci/conda/channel/README.md | 12 +- ci/scripts/documentation_checks.sh | 27 ++++ ci/scripts/github/checks.sh | 3 + ci/scripts/github/docs.sh | 3 + .../config/vocabularies/morpheus/accept.txt | 77 ++++++++++ .../config/vocabularies/morpheus/reject.txt | 3 + .../all_cuda-121_arch-x86_64.yaml | 3 + .../dev_cuda-121_arch-x86_64.yaml | 3 + dependencies.yaml | 3 + docs/CMakeLists.txt | 15 +- docs/source/basics/building_a_pipeline.md | 2 +- docs/source/basics/overview.rst | 2 +- docs/source/cloud_deployment_guide.md | 16 +-- docs/source/conf.py | 11 ++ docs/source/developer_guide/architecture.md | 59 ++------ docs/source/developer_guide/contributing.md | 11 +- docs/source/developer_guide/guides.md | 2 +- ...modular_pipeline_digital_fingerprinting.md | 46 +++--- .../guides/1_simple_python_stage.md | 38 ++--- .../guides/2_real_world_phishing.md | 30 ++-- .../guides/3_simple_cpp_stage.md | 16 +-- .../guides/4_source_cpp_stage.md | 8 +- .../guides/5_digital_fingerprinting.md | 124 ++++++++-------- .../6_digital_fingerprinting_reference.md | 100 ++++++------- .../guides/7_python_modules.md | 4 +- .../developer_guide/guides/8_cpp_modules.md | 2 +- docs/source/examples.md | 2 +- docs/source/examples/llm/README.md | 8 +- docs/source/extra_info/glossary.md | 2 +- docs/source/extra_info/known_issues.md | 2 +- docs/source/extra_info/troubleshooting.md | 4 +- docs/source/getting_started.md | 18 +-- docs/source/loaders/core/file_to_df_loader.md | 24 ++-- docs/source/loaders/core/fsspec_loader.md | 6 +- docs/source/loaders/core/rest_to_df_loader.md | 28 ++-- docs/source/loaders/core/sql_loader.md | 14 +- docs/source/loaders/index.md | 4 +- docs/source/models_and_datasets.md | 4 +- docs/source/modules/core/data_loader.md | 2 +- docs/source/modules/core/file_batcher.md | 6 +- docs/source/modules/core/file_to_df.md | 14 +- .../modules/core/filter_control_message.md | 8 +- docs/source/modules/core/filter_detections.md | 29 ++-- .../modules/core/mlflow_model_writer.md | 14 +- docs/source/modules/core/payload_batcher.md | 2 +- docs/source/modules/core/serialize.md | 10 +- .../modules/core/write_to_elasticsearch.md | 10 +- docs/source/modules/core/write_to_file.md | 10 +- .../digital_fingerprinting/dfp_data_prep.md | 10 +- .../digital_fingerprinting/dfp_deployment.md | 134 +++++++++--------- .../digital_fingerprinting/dfp_inference.md | 14 +- .../dfp_inference_pipe.md | 59 ++++---- 
.../digital_fingerprinting/dfp_monitor.md | 14 +- .../digital_fingerprinting/dfp_preproc.md | 46 +++--- .../dfp_rolling_window.md | 14 +- .../digital_fingerprinting/dfp_split_users.md | 14 +- .../digital_fingerprinting/dfp_training.md | 8 +- .../dfp_training_pipe.md | 72 +++++----- .../spear_phishing/sp_email_enrichment.md | 12 +- .../spear_phishing/sp_inference_intent.md | 24 ++-- .../sp_inference_sp_classifier.md | 12 +- .../spear_phishing/sp_label_and_score.md | 8 +- .../spear_phishing/sp_preprocessing.md | 8 +- .../sp_sender_sketch_aggregator.md | 10 +- .../sp_sender_sketch_query_constructor.md | 7 +- .../spear_phishing/sp_sender_sketch_update.md | 14 +- .../sp_spear_phishing_post_inference.md | 10 +- .../sp_spear_phishing_pre_inference.md | 10 +- docs/source/stages/morpheus_stages.md | 22 +-- examples/abp_nvsmi_detection/README.md | 2 +- examples/abp_pcap_detection/README.md | 6 +- examples/abp_pcap_detection/run.py | 2 +- .../developer_guide/2_2_rabbitmq/README.md | 2 +- .../4_rabbitmq_cpp_stage/README.md | 4 +- .../demo/submit_messages.md | 2 +- .../digital_fingerprinting/demo/training.md | 2 +- .../production/README.md | 14 +- .../production/grafana/README.md | 10 +- .../production/morpheus/benchmarks/README.md | 52 +++---- .../digital_fingerprinting/starter/README.md | 58 ++++---- .../visualization/README.md | 2 +- examples/doca/README.md | 12 +- examples/doca/vdb_realtime/README.md | 2 +- examples/llm/agents/README.md | 12 +- examples/llm/rag/README.md | 24 ++-- examples/llm/vdb_upload/README.md | 28 ++-- examples/nlp_si_detection/README.md | 22 +-- examples/ransomware_detection/README.md | 6 +- examples/root_cause_analysis/README.md | 4 +- examples/sid_visualization/README.md | 2 +- models/README.md | 40 ++++-- models/datasets/README.md | 24 ++-- models/mlflow/README.md | 8 +- models/model-cards/abp-model-card.md | 85 +++++------ models/model-cards/dfp-model-card.md | 31 ++-- models/model-cards/gnn-fsi-model-card.md | 67 ++++----- models/model-cards/phishing-model-card.md | 74 +++++----- .../root-cause-analysis-model-card.md | 66 ++++----- .../fraud-detection-models/README.md | 4 +- models/triton-model-repo/README.md | 2 +- morpheus/_lib/README.md | 15 +- .../stages/input/cloud_trail_source_stage.py | 2 +- scripts/validation/kafka_testing.md | 12 +- tests/benchmarks/README.md | 36 ++--- 106 files changed, 1133 insertions(+), 1005 deletions(-) create mode 100644 .vale.ini create mode 100755 ci/scripts/documentation_checks.sh create mode 100644 ci/vale/styles/config/vocabularies/morpheus/accept.txt create mode 100644 ci/vale/styles/config/vocabularies/morpheus/reject.txt diff --git a/.devcontainer/README.md b/.devcontainer/README.md index 8a28296fe9..8a56b70fa6 100644 --- a/.devcontainer/README.md +++ b/.devcontainer/README.md @@ -17,20 +17,20 @@ limitations under the License. # Morpheus Devcontainer -The Morpheus devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). The devcontainer is a lightweight container which mounts-in a conda environment with cached packages, alleviating long conda download times on subsequent launches. It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorperates some helpful Code plugins, such as clangd and cmake support. +The Morpheus devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). 
The devcontainer is a lightweight container which mounts-in a Conda environment with cached packages, alleviating long Conda download times on subsequent launches. It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorporates some helpful Code plugins, such as clangd and CMake support. -More information about devcontainers can be found at [containers.dev](https://containers.dev/). +More information about devcontainers can be found at [`containers.dev`](https://containers.dev/). ## Getting Started -To get started, simply open the morpheus repository root folder within Code. A window should appear at the bottom-right corner of the editor asking if you would like to reopen the workspace inside of the dev container. After clicking the confirmation dialog, the container will first build, then launch, then remote-attach. +To get started, simply open the Morpheus repository root folder within Code. A window should appear at the bottom-right corner of the editor asking if you would like to reopen the workspace inside of the dev container. After clicking the confirmation dialog, the container will first build, then launch, then remote-attach. If the window does not appear, or you would like to rebuild the container, click ctrl-shift-p and search for `Dev Containers: Rebuild and Reopen in Container`. Hit enter, and the container will first build, then launch, then remote-attach. -Once remoted in to the devcontainer within code, the `setup-morpheus-env` script will begin to run and solve a morpheus conda environment (this conda environment is local to the morpheus repository and dev container and will not override any host environments). You should see the script executing in one of Code's integrated terminal. Once the script has completed, we're ready to start development or exploration of Morpheus. By default, each _new_ integrated terminal will automatically conda activate the morpheus environment. +Once connected to the devcontainer within code, the `setup-morpheus-env` script will begin to run and solve a Morpheus Conda environment (this Conda environment is local to the Morpheus repository and dev container and will not override any host environments). You should see the script executing in one of Code's integrated terminal. Once the script has completed, we're ready to start development or exploration of Morpheus. By default, each _new_ integrated terminal will automatically Conda activate the Morpheus environment. ## Development Scripts -Several convienient scripts are available in the devcontainer's `PATH` (`.devcontainer/bin`) for starting, stopping, and interacting with Triton and Kafka. More scripts can be added as needed. +Several convenient scripts are available in the devcontainer's `PATH` (`.devcontainer/bin`) for starting, stopping, and interacting with Triton and Kafka. More scripts can be added as needed. ### Interacting with Triton To start Triton and connect it to the devcontainer network, the `dev-triton-start` script can be used. The following example starts _or restarts_ Triton with the `abp-pcap-xgb` model loaded. @@ -54,7 +54,7 @@ To start Kafka and connect it to the devcontainer network, the `dev-kafka-start` ``` dev-kafka-start ``` -Kafka should now be started and DNS resolveable as `kafka`. +Kafka should now be started and DNS resolvable as `kafka`. 
``` ping kafka ``` diff --git a/.vale.ini b/.vale.ini new file mode 100644 index 0000000000..733f33a6c5 --- /dev/null +++ b/.vale.ini @@ -0,0 +1,29 @@ +StylesPath = ci/vale/styles + +MinAlertLevel = error + +Vocab = morpheus + +Packages = Microsoft, write-good + +# Configs for markdown and reStructuredText files +[*{.md,.rst}] + +BasedOnStyles = Vale, write-good, Microsoft + +# Lower these checks to just 'suggestion' level. + +# This check enforces usage of contractions (ex: "it is" -> "it's") lowering to suggestion to allow it +Microsoft.Contractions = suggestion + +# This check disallows the use of "there is" and "there are" at the start of a sentence, I tried looking this up to +# determine the reasoning behind the rule but could not find one. Lowering to suggestion to allow it +write-good.ThereIs = suggestion + +# Allow writing dates in numeric form 02/10/2022 +Microsoft.DateOrder = suggestion + +# reStructuredText specific configs +[*.rst] +# Ignore template items inside of curly braces +TokenIgnores = ({.*}) diff --git a/ci/conda/channel/README.md b/ci/conda/channel/README.md index 854464f9c9..5ad67700f5 100644 --- a/ci/conda/channel/README.md +++ b/ci/conda/channel/README.md @@ -15,13 +15,13 @@ See the License for the specific language governing permissions and limitations under the License. --> -Creates a local conda channel using docker-compose and nginx. Can be helpful when testing new conda packages +Creates a local Conda channel using Docker Compose and nginx. Can be helpful when testing new Conda packages To Use: -1. Ensure `docker-compose` is installed -2. Set the location of the conda-bld folder to host as a conda channel to the variable `$CONDA_REPO_DIR` - 1. i.e. `export CONDA_REPO_DIR=$CONDA_PREFIX/conda-bld` -3. Launch docker-compose +1. Ensure Docker Compose is installed +2. Set the location of the `conda-bld` folder to host as a Conda channel to the variable `$CONDA_REPO_DIR` + 1. For example, `export CONDA_REPO_DIR=$CONDA_PREFIX/conda-bld` +3. Launch Docker Compose 1. `docker compose up -d` -4. Install conda packages using the local channel +4. Install Conda packages using the local channel 1. `conda install -c http://localhost:8080 ` diff --git a/ci/scripts/documentation_checks.sh b/ci/scripts/documentation_checks.sh new file mode 100755 index 0000000000..4db5a35589 --- /dev/null +++ b/ci/scripts/documentation_checks.sh @@ -0,0 +1,27 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" +source ${SCRIPT_DIR}/common.sh + +set +e + +# Intentionally excluding CHANGELOG.md as it immutable +DOC_FILES=$(git ls-files "*.md" "*.rst" | grep -v -E '^CHANGELOG\.md$') + +vale ${DOC_FILES} +RETVAL=$? 
+exit $RETVAL diff --git a/ci/scripts/github/checks.sh b/ci/scripts/github/checks.sh index 5f90a50828..2ce0c648aa 100755 --- a/ci/scripts/github/checks.sh +++ b/ci/scripts/github/checks.sh @@ -60,3 +60,6 @@ ${MORPHEUS_ROOT}/ci/scripts/version_checks.sh rapids-logger "Runing C++ style checks" ${MORPHEUS_ROOT}/ci/scripts/cpp_checks.sh + +rapids-logger "Runing Documentation checks" +${MORPHEUS_ROOT}/ci/scripts/documentation_checks.sh diff --git a/ci/scripts/github/docs.sh b/ci/scripts/github/docs.sh index 0a5696188d..c47ea4bbc2 100755 --- a/ci/scripts/github/docs.sh +++ b/ci/scripts/github/docs.sh @@ -44,6 +44,9 @@ rapids-logger "Building docs" cmake --build ${BUILD_DIR} --parallel ${PARALLEL_LEVEL} --target install cmake --build ${BUILD_DIR} --parallel ${PARALLEL_LEVEL} --target morpheus_docs +rapids-logger "Checking documentation links" +cmake --build ${BUILD_DIR} --parallel ${PARALLEL_LEVEL} --target morpheus_docs_linkcheck + rapids-logger "Archiving the docs" tar cfj "${WORKSPACE_TMP}/docs.tar.bz" ${BUILD_DIR}/docs/html diff --git a/ci/vale/styles/config/vocabularies/morpheus/accept.txt b/ci/vale/styles/config/vocabularies/morpheus/accept.txt new file mode 100644 index 0000000000..157edebd18 --- /dev/null +++ b/ci/vale/styles/config/vocabularies/morpheus/accept.txt @@ -0,0 +1,77 @@ +# List of case-sensitive regular expressions matching words that should be accepted by Vale. For product names like +# "cuDF" or "cuML", we want to ensure that they are capitalized the same way they're written by the product owners. +# Regular expressions are parsed according to the Go syntax: https://golang.org/pkg/regexp/syntax/ + +API(s?) +[Aa]utoencoder +[Aa]nonymize(d?) +[Bb]ackpressure +[Bb]atcher +[Bb]oolean +# Documentation for ccache only capitalizes the name at the start of a sentence https://ccache.dev/ +[Cc]cache +[Cc]hatbot(s?) +# clangd is never capitalized even at the start of a sentence https://clangd.llvm.org/ +clangd +CMake +[Cc]omposable +Conda +CPython +[Cc]ryptocurrenc[y|ies] +[Cc]yber +[Cc]ybersecurity +Cython +Dask +Databricks +[Dd]eserialize +[Dd]ev +[Dd]ocstring(s?) +[Ee]ngineerable +[Ee]xplainability +[Gg]eneratable +glog +GPU(s?) +Grafana +[Gg]ranularities +[Hh]ashable +[Hh]yperparameter(s?) +[Ii]nferencing +jsonlines +# libcudf isn't styled in the way that cuDF is https://docs.rapids.ai/api/libcudf/stable/ +libcudf +LLM(s?) +# https://github.com/logpai/loghub/ +Loghub +Milvus +[Mm]ixin +MLflow +Morpheus +[Nn]amespace(s?) +NeMo +nginx +NIC +NIM(s?) +NVIDIA +[Pp]arallelization +[Pp]arsable +PCIe +PDF(s?) +[Pp]reprocess +[Pp]retrained +pytest +[Rr]epo +[Rr]etarget(ed?) +[Ss]erializable +[Ss]ubclassing +[Ss]ubcard(s?) +[Ss]ubgraph(s?) +[Ss]ubword(s?) +[Tt]imestamp(s?) +[Tt]okenization +[Tt]okenizer(s?) +triages +[Uu]nencrypted +[Uu]nittest(s?) +[Uu]ploader +XGBoost +zsh diff --git a/ci/vale/styles/config/vocabularies/morpheus/reject.txt b/ci/vale/styles/config/vocabularies/morpheus/reject.txt new file mode 100644 index 0000000000..07e5703ce1 --- /dev/null +++ b/ci/vale/styles/config/vocabularies/morpheus/reject.txt @@ -0,0 +1,3 @@ +# List of regular expressions matching words we want to reject. 
Even though we don't have any words listed this +# file needs to exitst in order for vale to pick up our accept.txt file +# Regular expressions are parsed according to the Go syntax: https://golang.org/pkg/regexp/syntax/ diff --git a/conda/environments/all_cuda-121_arch-x86_64.yaml b/conda/environments/all_cuda-121_arch-x86_64.yaml index f8bb9b9529..af1e756709 100644 --- a/conda/environments/all_cuda-121_arch-x86_64.yaml +++ b/conda/environments/all_cuda-121_arch-x86_64.yaml @@ -110,6 +110,9 @@ dependencies: - transformers=4.36.2 - tritonclient=2.34 - typing_utils=0.1 +- vale-styles-microsoft +- vale-styles-write-good +- vale=3.7 - versioneer - versioneer-518 - watchdog=3.0 diff --git a/conda/environments/dev_cuda-121_arch-x86_64.yaml b/conda/environments/dev_cuda-121_arch-x86_64.yaml index e0d60b211e..9baf0eb1a9 100644 --- a/conda/environments/dev_cuda-121_arch-x86_64.yaml +++ b/conda/environments/dev_cuda-121_arch-x86_64.yaml @@ -90,6 +90,9 @@ dependencies: - tqdm=4 - tritonclient=2.34 - typing_utils=0.1 +- vale-styles-microsoft +- vale-styles-write-good +- vale=3.7 - versioneer - versioneer-518 - watchdog=3.0 diff --git a/dependencies.yaml b/dependencies.yaml index 84b58f0e11..a66dc0c4ad 100644 --- a/dependencies.yaml +++ b/dependencies.yaml @@ -284,6 +284,9 @@ dependencies: - include-what-you-use=0.20 - isort - pylint=3.0.3 + - vale=3.7 + - vale-styles-microsoft + - vale-styles-write-good - versioneer - yapf=0.40.1 diff --git a/docs/CMakeLists.txt b/docs/CMakeLists.txt index d103e03115..a718ce584f 100644 --- a/docs/CMakeLists.txt +++ b/docs/CMakeLists.txt @@ -20,14 +20,25 @@ find_package(Sphinx REQUIRED) set(SPHINX_SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/source) set(SPHINX_BUILD ${CMAKE_CURRENT_BINARY_DIR}/html) -set(SPHINX_ARGS -b html -j auto -T -W) +set(SPHINX_LINKCHECK_OUT ${CMAKE_CURRENT_BINARY_DIR}/linkcheck) +set(SPHINX_ARGS -j auto -T -W) +set(SPHINX_HTML_ARGS -b html ${SPHINX_ARGS}) +set(SPHINX_LINKCHECK_ARGS -b linkcheck ${SPHINX_ARGS}) add_custom_target(${PROJECT_NAME}_docs COMMAND - BUILD_DIR=${CMAKE_CURRENT_BINARY_DIR} ${SPHINX_EXECUTABLE} ${SPHINX_ARGS} ${SPHINX_SOURCE} ${SPHINX_BUILD} + BUILD_DIR=${CMAKE_CURRENT_BINARY_DIR} ${SPHINX_EXECUTABLE} ${SPHINX_HTML_ARGS} ${SPHINX_SOURCE} ${SPHINX_BUILD} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR} COMMENT "Generating documentation with Sphinx" DEPENDS morpheus-package-outputs ) +add_custom_target(${PROJECT_NAME}_docs_linkcheck + COMMAND + BUILD_DIR=${CMAKE_CURRENT_BINARY_DIR} ${SPHINX_EXECUTABLE} ${SPHINX_LINKCHECK_ARGS} ${SPHINX_SOURCE} ${SPHINX_LINKCHECK_OUT} + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR} + COMMENT "Checking documentation links with Sphinx" + DEPENDS morpheus-package-outputs +) + list(POP_BACK CMAKE_MESSAGE_CONTEXT) diff --git a/docs/source/basics/building_a_pipeline.md b/docs/source/basics/building_a_pipeline.md index 9aa1aa8645..e667b34573 100644 --- a/docs/source/basics/building_a_pipeline.md +++ b/docs/source/basics/building_a_pipeline.md @@ -142,7 +142,7 @@ morpheus run pipeline-nlp --help ## Basic Usage Examples ### Remove Fields from JSON Objects -This example only copies the fields 'timestamp', 'src_ip' and 'dest_ip' from `examples/data/pcap_dump.jsonlines` to +This example only copies the fields `timestamp`, `src_ip` and `dest_ip` from `examples/data/pcap_dump.jsonlines` to `out.jsonlines`. 
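For comparison, roughly the same pipeline can be expressed through the Python API. The following is only a sketch: the stage import paths and constructor parameters are assumptions based on the standard Morpheus stages, not a verbatim copy of this example.

```python
from morpheus.config import Config
from morpheus.pipeline import LinearPipeline
from morpheus.stages.input.file_source_stage import FileSourceStage
from morpheus.stages.output.write_to_file_stage import WriteToFileStage
from morpheus.stages.postprocess.serialize_stage import SerializeStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage

config = Config()

pipeline = LinearPipeline(config)

# Read the raw JSON lines and deserialize them into messages.
pipeline.set_source(FileSourceStage(config, filename="examples/data/pcap_dump.jsonlines"))
pipeline.add_stage(DeserializeStage(config))

# Keep only the three fields of interest when re-serializing.
pipeline.add_stage(SerializeStage(config, include=["timestamp", "src_ip", "dest_ip"]))

# Write the filtered records back out as JSON lines.
pipeline.add_stage(WriteToFileStage(config, filename="out.jsonlines", overwrite=True))

pipeline.run()
```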
![../img/remove_fields_from_json_objects.png](../img/remove_fields_from_json_objects.png) diff --git a/docs/source/basics/overview.rst b/docs/source/basics/overview.rst index f8d8b10c7a..ca1f8b6981 100644 --- a/docs/source/basics/overview.rst +++ b/docs/source/basics/overview.rst @@ -106,7 +106,7 @@ queried in the same manner: AutoComplete ------------ -The Morpheus CLI supports bash, fish, zsh, and powershell autocompletion. To set up autocomplete, it must first be +The Morpheus CLI supports bash, fish, zsh, and PowerShell autocompletion. To set up autocomplete, it must first be installed. Morpheus comes with a tool to assist with this: .. code-block:: console diff --git a/docs/source/cloud_deployment_guide.md b/docs/source/cloud_deployment_guide.md index 0f6adc7345..1dac95c9ae 100644 --- a/docs/source/cloud_deployment_guide.md +++ b/docs/source/cloud_deployment_guide.md @@ -75,7 +75,7 @@ Continue with the setup steps below once the host system is installed, configure ### Set up NGC API Key and Install NGC Registry CLI -First, you will need to set up your NGC API Key to access all the Morpheus components, using the linked instructions from the [NGC Registry CLI User Guide](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_4_1). +First, you will need to set up your NGC API Key to access all the Morpheus components, using the linked instructions from the [NGC Registry CLI User Guide](https://docs.nvidia.com/ngc/gpu-cloud/ngc-private-registry-user-guide/index.html#generating-personal-api-key). Once you've created your API key, create an environment variable containing your API key for use by the commands used further in this document: @@ -83,7 +83,7 @@ Once you've created your API key, create an environment variable containing your export API_KEY="" ``` -Next, install and configure the NGC Registry CLI on your system using the linked instructions from the [NGC Registry CLI User Guide](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_4_1). +Next, install and configure the NGC Registry CLI on your system using the linked instructions from the [NGC Registry CLI User Guide](https://docs.nvidia.com/ngc/gpu-cloud/ngc-private-registry-user-guide/index.html#generating-personal-api-key). ### Create Namespace for Morpheus @@ -221,7 +221,7 @@ kubectl -n $NAMESPACE exec -it deploy/mlflow -- bash (mlflow) root@mlflow-6d98:/mlflow# ``` -`Important`: When (mlflow) is present, commands are directly within the container. +`Important`: When `(mlflow)` is present, commands are directly within the container. First let's examine the syntax of the commands we will be using to communicate with the MLflow Triton plugin before we start deploying models. Publish models to MLflow server is in the form of: @@ -427,9 +427,9 @@ helm install --set ngc.apiKey="$API_KEY" \ ### Run AutoEncoder Digital Fingerprinting Pipeline The following AutoEncoder pipeline example shows how to train and validate the AutoEncoder model and write the inference results to a specified location. Digital fingerprinting has also been referred to as **HAMMAH (Human as Machine <> Machine as Human)**. -These use cases are currently implemented to detect user behavior changes that indicate a change from a human to a machine or a machine to a human, thus leaving a "digital fingerprint". The model is an ensemble of an autoencoder and fast fourier transform reconstruction. 
+These use cases are currently implemented to detect user behavior changes that indicate a change from a human to a machine or a machine to a human, thus leaving a "digital fingerprint." The model is an ensemble of an autoencoder and fast Fourier transform reconstruction. -Inference and training based on a userid (`user123`). The model is trained once and inference is conducted on the supplied input entries in the example pipeline below. The `--train_data_glob` parameter must be removed for continuous training. +Inference and training based on a user ID (`user123`). The model is trained once and inference is conducted on the supplied input entries in the example pipeline below. The `--train_data_glob` parameter must be removed for continuous training. ```bash helm install --set ngc.apiKey="$API_KEY" \ @@ -620,7 +620,7 @@ kubectl -n $NAMESPACE exec -it deploy/broker -c broker -- kafka-console-producer > **Note**: This should be used for development purposes only via this developer kit. Loading from the file into Kafka should not be used in production deployments of Morpheus. ### Run FIL Anomalous Behavior Profiling Pipeline -The following Anomalous Behavior Profiling pipeline examples use a pre-trained FIL model to ingest and analyze NVIDIA System Management Interface (nvidia-smi) logs, like the example below, as input sample data to identify crypto mining activity on GPU devices. +The following Anomalous Behavior Profiling pipeline examples use a pre-trained FIL model to ingest and analyze NVIDIA System Management Interface (`nvidia-smi`) logs, like the example below, as input sample data to identify cryptocurrency mining activity on GPU devices. ```json {"nvidia_smi_log.gpu.pci.tx_util": "0 KB/s", "nvidia_smi_log.gpu.pci.rx_util": "0 KB/s", "nvidia_smi_log.gpu.fb_memory_usage.used": "3980 MiB", "nvidia_smi_log.gpu.fb_memory_usage.free": "12180 MiB", "nvidia_smi_log.gpu.bar1_memory_usage.total": "16384 MiB", "nvidia_smi_log.gpu.bar1_memory_usage.used": "11 MiB", "nvidia_smi_log.gpu.bar1_memory_usage.free": "16373 MiB", "nvidia_smi_log.gpu.utilization.gpu_util": "0 %", "nvidia_smi_log.gpu.utilization.memory_util": "0 %", "nvidia_smi_log.gpu.temperature.gpu_temp": "61 C", "nvidia_smi_log.gpu.temperature.gpu_temp_max_threshold": "90 C", "nvidia_smi_log.gpu.temperature.gpu_temp_slow_threshold": "87 C", "nvidia_smi_log.gpu.temperature.gpu_temp_max_gpu_threshold": "83 C", "nvidia_smi_log.gpu.temperature.memory_temp": "57 C", "nvidia_smi_log.gpu.temperature.gpu_temp_max_mem_threshold": "85 C", "nvidia_smi_log.gpu.power_readings.power_draw": "61.77 W", "nvidia_smi_log.gpu.clocks.graphics_clock": "1530 MHz", "nvidia_smi_log.gpu.clocks.sm_clock": "1530 MHz", "nvidia_smi_log.gpu.clocks.mem_clock": "877 MHz", "nvidia_smi_log.gpu.clocks.video_clock": "1372 MHz", "nvidia_smi_log.gpu.applications_clocks.graphics_clock": "1312 MHz", "nvidia_smi_log.gpu.applications_clocks.mem_clock": "877 MHz", "nvidia_smi_log.gpu.default_applications_clocks.graphics_clock": "1312 MHz", "nvidia_smi_log.gpu.default_applications_clocks.mem_clock": "877 MHz", "nvidia_smi_log.gpu.max_clocks.graphics_clock": "1530 MHz", "nvidia_smi_log.gpu.max_clocks.sm_clock": "1530 MHz", "nvidia_smi_log.gpu.max_clocks.mem_clock": "877 MHz", "nvidia_smi_log.gpu.max_clocks.video_clock": "1372 MHz", "nvidia_smi_log.gpu.max_customer_boost_clocks.graphics_clock": "1530 MHz", "nvidia_smi_log.gpu.processes.process_info.0.process_name": "python", "nvidia_smi_log.gpu.processes.process_info.1.process_name": "tritonserver", "hostname": 
"ip-10-100-8-98", "timestamp": 1615542360.9566503} @@ -794,7 +794,7 @@ This section lists solutions to problems you might encounter with Morpheus or fr - Models Unloaded After Reboot - When the pod is restarted, K8s will not automatically load the models. Since models are deployed to *ai-engine* in explicit mode using MLflow, we'd have to manually deploy them again using the [Model Deployment](#model-deployment) process. - AI Engine CPU Only Mode - - After a server restart, the ai-engine pod on k8s can start up before the GPU operator infrastructure is available, making it "think" there is no driver installed (i.e., CPU -only mode). + - After a server restart, the ai-engine pod on k8s can start up before the GPU operator infrastructure is available, making it "think" there is no driver installed (that is, CPU -only mode). - Improve Pipeline Message Processing Rate - Below settings need to be considered - Provide the workflow with the optimal number of threads (`—num threads`), as having more or fewer threads can have an impact on pipeline performance. @@ -804,6 +804,6 @@ This section lists solutions to problems you might encounter with Morpheus or fr ```console 1649207839.253|COMMITFAIL|rdkafka#consumer-2| [thrd:main]: Offset commit (manual) failed for 1/1 partition(s) in join-state wait-unassign-call: Broker: Unknown member: topic[0]@112071(Broker: Unknown member) ``` - - Problem: If the standalone kafka cluster is receiving significant message throughput from the producer, this error may happen. + - Problem: If the standalone Kafka cluster is receiving significant message throughput from the producer, this error may happen. - Solution: Reinstall the Morpheus workflow and reduce the Kafka topic's message retention time and message producing rate. diff --git a/docs/source/conf.py b/docs/source/conf.py index ca862cf78c..330053a5ac 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -191,6 +191,17 @@ numpydoc_show_inherited_class_members = True numpydoc_class_members_toctree = False +# Config linkcheck +# Ignore localhost and url prefix fragments +# Ignore openai.com links, as these always report a 403 when requested by the linkcheck agent +linkcheck_ignore = [ + r'http://localhost:\d+/', + r'https://localhost:\d+/', + r'^http://$', + r'^https://$', + r'https://(platform\.)?openai.com', +] + # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] diff --git a/docs/source/developer_guide/architecture.md b/docs/source/developer_guide/architecture.md index fcadb0a0c7..6070304b4f 100644 --- a/docs/source/developer_guide/architecture.md +++ b/docs/source/developer_guide/architecture.md @@ -24,7 +24,7 @@ The organization of Morpheus can be broken down into four different layers. Work * Orchestration Layer * Responsible for coordinating pipelines and facilitating communication. * That is, monitoring pipelines, transferring messages between pipelines, starting and stopping pipelines, - assigning resources to pipelines, and so on. + and assigning resources to pipelines. * Plays a large role in multi-machine pipelines but works out of the box for single-machine pipelines. * Pipeline Layer * Composed of one or more stages connected by edges. @@ -45,32 +45,15 @@ The organization of Morpheus can be broken down into four different layers. Work ## Pipeline Details -Pipelines are a collection of one or more stages that are connected via edges. Data flows from one stage to the next -across these edges using buffers. 
We utilize these buffers to allow stages to process messages at different rates. Once -each stage is done processing a message, the pipeline will move it onto the next stage's buffer for processing. This -process continues until the message has made it through the entire pipeline. +Pipelines are a collection of one or more stages that are connected via edges. Data flows from one stage to the next across these edges using buffers. We utilize these buffers to allow stages to process messages at different rates. Once each stage is done processing a message, the pipeline will move it onto the next stage's buffer for processing. This process continues until the message has made it through the entire pipeline. -The main goal of the pipeline is to maximize throughput via parallel execution of the stages. So we can utilize hardware -optimally and avoid processing individual messages sequentially. Given a multi-stage pipeline consisting of stages 1 and -2. Stage 1 collects its first message from its data source and begins processing it. Once Stage 1 is done with its first -message, the resulting output message will be forwarded to Stage 2. At this point, Stage 1 immediately begins processing -the next input to the pipeline, while Stage 2 begins work on the output of Stage 1. This allows for multiple messages to -be in flight in the pipeline at a time, increasing parallelization. +The main goal of the pipeline is to maximize throughput via parallel execution of the stages. Such that we can utilize hardware optimally and avoid processing individual messages sequentially. Given a multi-stage pipeline consisting of stages 1 and 2. Stage 1 collects its first message from its data source and begins processing it. Once Stage 1 is done with its first message, the resulting output message will be forwarded to Stage 2. At this point, Stage 1 immediately begins processing the next input to the pipeline, while Stage 2 begins work on the output of Stage 1. This allows for multiple messages to be in flight in the pipeline at a time, increasing parallelization. -Utilizing buffers between stages in this way does come at a cost. Increasing the size of the buffers helps improve -parallelization by ensuring all stages have some work to do. But this also increases latency since messages can sit in a -buffer waiting to be processed. The inverse is also true. Decreasing the buffer sizes improves latency, but can starve -some stages of work to do, decreasing parallelization. The pipeline has to walk a fine line of keeping all stages -supplied with data with the smallest buffers possible. +Utilizing buffers between stages in this way does come at a cost. Increasing the size of the buffers helps improve parallelization by ensuring all stages have some work to do. But this also increases latency since messages can sit in a buffer waiting to be processed. The inverse is also true. Decreasing the buffer sizes improves latency, but can starve some stages of work to do, decreasing parallelization. The pipeline has to walk a fine line of keeping all stages supplied with data with the smallest buffers possible. ## Stage Details -A stage is the fundamental building block in Morpheus and is responsible for performing all of the work in a pipeline. A -stage can encapsulate any piece of functionality and is capable of integrating with any service or external library. -This freedom allows stages to range from very small Python map functions up to very complex inference stages, which -connect to services and work in multiple threads. 
For example, Morpheus has simple stages for actions like reading and -writing to a file and more complex stages like the Triton inference stage, which can send many asynchronous inference -requests using shared device memory. +A stage is the fundamental building block in Morpheus and is responsible for performing all of the work in a pipeline. A stage can encapsulate any piece of functionality and is capable of integrating with any service or external library. This freedom allows stages to range from very small Python map functions up to very complex inference stages, which connect to services and work in multiple threads. For example, Morpheus has simple stages for actions like reading and writing to a file and more complex stages like the Triton inference stage, which can send many asynchronous inference requests using shared device memory. While stages are very flexible, they all comprise three main pieces: identification, type inference, and node creation. @@ -80,38 +63,20 @@ The stage identifier is a unique string used in both logging and creating the st ### Type Inference -To perform work, each stage needs to know what type of data it will be operating on. Since Morpheus can pass any type of -data from stage to stage, the pipeline must ensure compatible types at every edge connection between stages. This -process is called stage type inference and is performed during the pipeline build phase. +To perform work, each stage needs to know what type of data it will be operating on. Morpheus can pass any type of data from stage to stage, the pipeline must ensure compatible types at every edge connection between stages. This process is called stage type inference and is performed during the pipeline build phase. -Stage type inference is necessary because the output type of some stages may depend on the output type of the previous -stage. For example, consider a simple pass through stage that passes the input message to the next stage unmodified. If -our pass through stage is preceded by a stage generating a string, its output type will be a string. Instead, if it's -preceded by a stage generating an integer, its output type will be an integer. +Stage type inference is necessary because the output type of some stages can depend on the output type of the previous stage. For example, consider a simple pass through stage that passes the input message to the next stage unmodified. If our pass through stage is preceded by a stage generating a string, its output type will be a string. Instead, if it's preceded by a stage generating an integer, its output type will be an integer. -Due to the dynamic nature of the output type of a stage, stages must specify a type inference function that accepts an -input type and returns the output type. Starting at the source stages, the pipeline will use this function to determine -the output type of the source stages. This result will then be passed to the type inference function of the next stage, -and so on until the input and output types of every stage in the pipeline have been determined. +Due to the dynamic nature of the output type of a stage, stages must specify a type inference function that accepts an input type and returns the output type. Starting at the source stages, the pipeline will use this function to determine the output type of the source stages. This result is then be passed to the type inference function of the next stage, until the input and output types of every stage in the pipeline have been determined. 
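As a concrete illustration, the type inference hook for a pass-through stage can be sketched as follows, mirroring the `compute_schema` pattern shown later in the Python stage guide (the class name is hypothetical and unrelated methods are omitted):

```python
from morpheus.pipeline.stage_schema import StageSchema


class PassThruExample:
    """Illustrative only: a stage whose output type always matches its input type."""

    def compute_schema(self, schema: StageSchema):
        # Whatever type the upstream stage emits is exactly what this stage will emit.
        schema.output_schema.set_type(schema.input_schema.get_type())
```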
-After the build phase, the output types of stages cannot be changed. Returning a different type than specified during -the build phase will result in undefined behavior. +After the build phase, the output types of stages cannot be changed. Returning a different type than specified during the build phase will result in undefined behavior. ### Node Creation -The most important piece of a stage is node creation. The node creation function is responsible for creating the -instances of the nodes which will make up a stage. Like a pipeline, stages can be built up of one or more smaller nodes -connected by edges. +The most important piece of a stage is node creation. The node creation function is responsible for creating the instances of the nodes which will make up a stage. Like a pipeline, stages can be built up of one or more smaller nodes connected by edges. -The difference between stages and nodes is that stages guarantee that the same machine will run all nodes in the same -process space. This allows nodes to optimize the information they pass between themselves to ensure maximum performance. -For example, two nodes could pass a raw GPU device pointer between them, allowing maximum performance with minimum -overhead. Without this guarantee that both nodes are running in the same process space, passing such a low-level piece -of information would be unsafe. +The difference between stages and nodes is that stages guarantee that the same machine will run all nodes in the same process space. This allows nodes to optimize the information they pass between themselves to ensure maximum performance. For example, two nodes could pass a raw GPU device pointer between them, allowing maximum performance with minimum overhead. Without this guarantee that both nodes are running in the same process space, passing such a low-level piece of information would be unsafe. ## Morpheus Modules -Modules, introduced in the 23.03 release, introduce a new method for defining units of work which are compact, -composable, nestable, and fully reusable. Once a module has been defined and registered, it can be used in new and -existing pipelines as either a new ModuleStage or loaded directly within the context of an existing stage using -`builder.load_module(...)`. +Modules, introduced in the 23.03 release, introduce a new method for defining units of work which are compact, composable, nestable, and fully reusable. Once a module has been defined and registered, it can be used in new and existing pipelines as either a new ModuleStage or loaded directly within the context of an existing stage using `builder.load_module(...)`. diff --git a/docs/source/developer_guide/contributing.md b/docs/source/developer_guide/contributing.md index 2ccdb1a2d1..84ed389753 100644 --- a/docs/source/developer_guide/contributing.md +++ b/docs/source/developer_guide/contributing.md @@ -105,9 +105,8 @@ This workflow utilizes a Docker container to set up most dependencies ensuring a ```shell DOCKER_TARGET=development_pydbg ./docker/build_container_dev.sh ``` - 1. Note: When debugging Python code, you just need to add `ci/conda/recipes/python-dbg/source` to your debugger's - source path. - 1. Once created, you will be able to introspect Python objects from within GDB. For example, if we were to break within a generator setup call and examine its PyFrame_Object `f`, it might be similar to: + 1. Note: When debugging Python code, you just need to add `ci/conda/recipes/python-dbg/source` to the source path your debugger. + 1. 
Once created, you will be able to introspect Python objects from within GDB. For example, if we were to break within a generator setup call and examine its `PyFrame_Object` `f`, it might be similar to: ```shell #4 0x000056498ce685f4 in gen_send_ex (gen=0x7f3ecc07ad40, arg=, exc=, closing=) at Objects/genobject.c:222 (gdb) pyo f @@ -171,7 +170,7 @@ Note: These instructions assume the user is using `mamba` instead of `conda` sin - **Note:** `mamba` should only be installed once in the base environment -1. Set up env variables and clone the repo: +1. Set up environment variables and clone the repo: ```bash export MORPHEUS_ROOT=$(pwd)/morpheus git clone https://github.com/nv-morpheus/Morpheus.git $MORPHEUS_ROOT @@ -286,12 +285,12 @@ Launching a full production Kafka cluster is outside the scope of this project; $ echo $KAFKA_ADVERTISED_HOST_NAME "172.17.0.1" ``` -6. Launch kafka with 3 instances: +6. Launch Kafka with 3 instances: ```bash docker compose up -d --scale kafka=3 ``` - In practice, 3 instances have been shown to work well. Use as many instances as required. Keep in mind each instance takes about 1 Gb of memory. + In practice, 3 instances have been shown to work well. Use as many instances as required. Keep in mind each instance takes about 1 GB of memory. 7. Launch the Kafka shell 1. To configure the cluster, you will need to launch into a container that has the Kafka shell. 2. You can do this with: diff --git a/docs/source/developer_guide/guides.md b/docs/source/developer_guide/guides.md index 9e4fba5ff7..f6244f3f67 100644 --- a/docs/source/developer_guide/guides.md +++ b/docs/source/developer_guide/guides.md @@ -33,7 +33,7 @@ in both Python and C++. - [Simple C++ Stage](./guides/3_simple_cpp_stage.md) - [Creating a C++ Source Stage](./guides/4_source_cpp_stage.md) -> **Note**: The code for the above guides can be found in the `examples/developer_guide` directory of the Morpheus repository. To build the C++ examples, pass `-DMORPHEUS_BUILD_EXAMPLES=ON` to CMake when building Morpheus. Users building Morpheus with the provided `scripts/compile.sh` script can do do by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: +> **Note**: The code for the above guides can be found in the `examples/developer_guide` directory of the Morpheus repository. To build the C++ examples, pass `-DMORPHEUS_BUILD_EXAMPLES=ON` to CMake when building Morpheus. Users building Morpheus with the provided `scripts/compile.sh` script can do so by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: > ```bash > CMAKE_CONFIGURE_EXTRA_ARGS="-DMORPHEUS_BUILD_EXAMPLES=ON" ./scripts/compile.sh > ``` diff --git a/docs/source/developer_guide/guides/10_modular_pipeline_digital_fingerprinting.md b/docs/source/developer_guide/guides/10_modular_pipeline_digital_fingerprinting.md index 46ccd34446..968119b08f 100644 --- a/docs/source/developer_guide/guides/10_modular_pipeline_digital_fingerprinting.md +++ b/docs/source/developer_guide/guides/10_modular_pipeline_digital_fingerprinting.md @@ -27,7 +27,7 @@ limitations under the License. - [Setting up Morpheus](#setting-up-morpheus) - [Morpheus Modules](#morpheus-modules) - [DFP Deployment](#dfp-deployment) - - [fsspec Dataloader](#fsspec-dataloader) + - [`fsspec` Data Loader](#fsspec-data-loader) - [DFP Training and Inference Pipelines](#dfp-training-and-inference-pipelines) - [DFP Preprocessing](#dfp-preprocessing) - [Control Message Filter](#control-message-filter) @@ -38,7 +38,7 @@ limitations under the License. 
- [DFP Data Prep](#dfp-data-prep) - [DFP Training Pipeline](#dfp-training-pipeline) - [DFP Training](#dfp-training) - - [MLFlow Model Writer](#mlflow-model-writer) + - [MLflow Model Writer](#mlflow-model-writer) - [DFP Inference Pipeline](#dfp-inference-pipeline) - [DFP Inference](#dfp-inference) - [Filter Detections](#filter-detections) @@ -106,7 +106,7 @@ pipeline.run() ## Setting up Morpheus -For a full introduction in how to set up and run morpheus, please refer to the [Getting Started](../../getting_started.md) guide. +For a full introduction in how to set up and run Morpheus, please refer to the [Getting Started](../../getting_started.md) guide. ## Morpheus Modules @@ -147,12 +147,12 @@ def dfp_deployment(builder: mrc.Builder): builder.register_module_input("input", fsspec_dataloader_module.input_port("input")) ``` -### fsspec Dataloader +### `fsspec` Data Loader Source: `morpheus/loaders/fsspec_loader.py` -This is an instance of the new DataLoader module, utilizing a pre-defined 'fsspec' style loader. The module is used to transform glob specified file lists into individual file paths and update the control message with those paths. +This is an instance of the new DataLoader module, utilizing a pre-defined `fsspec` style loader. The module is used to transform glob specified file lists into individual file paths and update the control message with those paths. For a complete reference, refer to: [DataLoader Module](../../modules/core/data_loader.md) @@ -164,7 +164,7 @@ There are a number of modules that are used in both the training and inference p Source: `examples/digital_fingerprinting/production/morpheus/dfp/modules/dfp_preproc.py` -The `dfp_preproc` module is a functional component within the Morpheus framework that combines multiple data filtering and processing pipeline modules related to inference and training. This module simplifies the pipeline by consolidating various modules into a single, cohesive unit. The `dfp_preproc` module offers configurability for parameters such as cache directory, timestamp column name, pre-filter options, batching options, user splitting options, and supported data loaders for different file types. +The `dfp_preproc` module is a functional component within the Morpheus framework that combines multiple data filtering and processing pipeline modules related to inference and training. This module simplifies the pipeline by consolidating various modules into a single, cohesive unit. The `dfp_preproc` module supports configuration parameters such as the cache directory, timestamp column name, pre-filter options, batching options, user splitting options, and supported data loaders for various file types. The module itself consists of a series of chained sub-modules, which are connected in a logical sequence: @@ -173,11 +173,11 @@ The module itself consists of a series of chained sub-modules, which are connect - `file_batcher_module` - Responsible for batching files, either into a single control message in the case of an encapsulated training message, or into a series of control messages in the of streaming data. - `file_to_df_dataloader_module` - - Responsible for file retrieval and insertion into a cuDF dataframe. + - Responsible for file retrieval and insertion into a cuDF DataFrame. - `dfp_split_users_module` - - Responsible for splitting the dataframe into a series of dataframes, one per user. + - Responsible for splitting the DataFrame into a series of DataFrames, one per user. 
-For a complete reference, refer to: [DFP Preproc](../../modules/examples/digital_fingerprinting/dfp_preproc.md) +For a complete reference, refer to: [`dfp_preproc`](../../modules/examples/digital_fingerprinting/dfp_preproc.md) ```python @register_module(DFP_PREPROC, MORPHEUS_MODULE_NAMESPACE) @@ -208,7 +208,7 @@ For a complete reference, refer to: [Filter Control Message](../../modules/core/ Source: `morpheus/modules/file_batcher.py` -The `file_batcher` module is a component that is responsible for loading input files, filtering out files older than the specified time window, and grouping the remaining files by periods that fall within the time window. This module offers configurability for parameters such as batching options, cache directory, file type, filtering null values, data schema, and the timestamp column name. The `file_batcher` module processes control messages, validates them, and generates a list of files with their timestamps. The module then groups files by the given period, creates control messages for each batch, and sends them downstream for further processing. A node function is used to handle the processing of control messages, and input and output ports are registered to integrate the module into the data processing pipeline seamlessly. +The `file_batcher` module is a component that is responsible for loading input files, filtering out files older than the specified time window, and grouping the remaining files by periods that fall within the time window. This module offers configuration for parameters such as batching options, cache directory, file type, filtering null values, data schema, and the timestamp column name. The `file_batcher` module processes control messages, validates them, and generates a list of files with their timestamps. The module then groups files by the given period, creates control messages for each batch, and sends them downstream for further processing. A node function is used to handle the processing of control messages, and input and output ports are registered to integrate the module into the data processing pipeline seamlessly. The file batcher is one of the first pipeline components that begins to differ more substantially from the previous raw-data pipeline, prior to 23.03. In addition to its previous functionality, the file batcher is now control message aware, and can handle both streaming and encapsulated control messages, a property denoted by the `data_type` property of the control message's metadata being set as either `streaming` or `payload`. Additionally, the file batcher's default processing criteria for `period`, `sampling_rate_s`, `start_time`, and `end_time` can now be overridden by their corresponding values in the control message's `batching_options` metadata entry. @@ -227,7 +227,7 @@ def file_batcher(builder: mrc.Builder): Source: `morpheus/loaders/file_to_df_loader.py` -This is an instance of the new DataLoader module, utilizing a pre-defined 'file_to_df' style loader. The module is used to process 'load' tasks that reference files which need to be retrieved, possibly cached, and then loaded into a cuDF dataframe with is set as the control message payload. +This is an instance of the new DataLoader module, utilizing a pre-defined `file_to_df` style loader. The module is used to process `load` tasks that reference files which need to be retrieved, possibly cached, and then loaded into a cuDF DataFrame with is set as the control message payload. 
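For context, the relationship between the loaded DataFrame and the control message payload can be sketched as follows; this is not the loader's actual code, and the input file name is a placeholder.

```python
import cudf

from morpheus.messages import ControlMessage, MessageMeta

# Load some data into a cuDF DataFrame (placeholder file name).
df = cudf.read_csv("downloaded_batch.csv")

# Attach the DataFrame to a control message as its payload...
message = ControlMessage()
message.payload(MessageMeta(df))

# ...and read it back downstream.
payload_df = message.payload().df
```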
For a complete reference, refer to: [DataLoader Module](../../modules/core/data_loader.md) @@ -275,7 +275,7 @@ Source: `examples/digital_fingerprinting/production/morpheus/dfp/modules/dfp_dat The `dfp_data_prep` module is responsible for preparing data for either inference or model training. The module requires a defined schema for data preparation. -The main functionality of the module is in the `process_features` function. For each control message containing data, the function processes the columns of the data according to the given schema. The processed dataframe is then applied to the control message payload. +The main functionality of the module is in the `process_features` function. For each control message containing data, the function processes the columns of the data according to the given schema. The processed DataFrame is then applied to the control message payload. For a complete reference, refer to: [DFP Data Prep](../../modules/examples/digital_fingerprinting/dfp_data_prep.md) @@ -339,7 +339,7 @@ def dfp_inference(builder: mrc.Builder): ... ``` -### MLFlow Model Writer +### MLflow Model Writer Source: `morpheus/modules/mlflow_model_writer.py` @@ -462,7 +462,7 @@ Source: `morpheus/modules/serialize.py` The serialize module function is responsible for filtering columns from a `MultiMessage` object and emitting a `MessageMeta` object. -The `convert_to_df` function converts a dataframe to JSON lines. It takes a `MultiMessage` instance, `include_columns` (a pattern for columns to include), `exclude_columns` (a list of patterns for columns to exclude), and `columns` (a list of columns to include). The function filters the columns of the input dataframe based on the include and exclude patterns and retrieves the metadata of the filtered columns. +The `convert_to_df` function converts a DataFrame to JSON lines. It takes a `MultiMessage` instance, `include_columns` (a pattern for columns to include), `exclude_columns` (a list of patterns for columns to exclude), and `columns` (a list of columns to include). The function filters the columns of the input DataFrame based on the include and exclude patterns and retrieves the metadata of the filtered columns. The module function compiles the include and exclude patterns into regular expressions. It then creates a node using the `convert_to_df` function with the compiled include and exclude patterns and the specified columns. @@ -481,7 +481,7 @@ Source: `morpheus/modules/write_to_file.py` The `write_to_file` module function writes all messages to a file. -The convert_to_strings function takes a `DataFrame`` (either pandas or cuDF) and converts it into the appropriate string format based on the file type (JSON or CSV). It checks whether to include the index column or not. +The `convert_to_strings` function takes a DataFrame (either pandas or cuDF) and converts it into the appropriate string format based on the file type (JSON or CSV). It checks whether to include the index column or not. ```python @register_module(WRITE_TO_FILE, MORPHEUS_MODULE_NAMESPACE) @@ -497,10 +497,10 @@ For a complete reference, refer to: [Write to File](../../modules/core/write_to_ The following are steps to run modular DFP pipelines with example Azure and Duo datasets. 
### System requirements -* [Docker](https://docs.docker.com/get-docker/) and [docker-compose](https://docs.docker.com/compose/) installed on the host machine​ -* Supported GPU with [nvidia-docker runtime​](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) +* [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/) installed on the host machine​ +* Supported GPU with [NVIDIA Container Toolkit​](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) -> **Note:** For GPU Requirements refer to [getting_started](../../getting_started.md#requirements) +> **Note:** For GPU requirements refer to the [Getting Started](../../getting_started.md#requirements) guide. ### Building the services From the root of the Morpheus repo, run: @@ -520,7 +520,7 @@ docker compose build > This is most likely due to using an older version of the `docker-compose` command, instead re-run the build with `docker compose`. Refer to [Migrate to Compose V2](https://docs.docker.com/compose/migrate/) for more information. ### Downloading the example datasets -First, we will need to install `s3fs` and then run the `examples/digital_fingerprinting/fetch_example_data.py` script. This will download the example data into the `examples/data/dfp` dir. +First, we will need to install `s3fs` and then run the `examples/digital_fingerprinting/fetch_example_data.py` script. This will download the example data into the `examples/data/dfp` dir. From the Morpheus repo, run: ```bash @@ -609,10 +609,10 @@ The output files, `dfp_detectiions_duo.csv` and `dfp_detections_azure.csv`, will Most of the fields in the output files generated by running the above examples are input fields or derived from input fields. The additional output fields are: | Field | Type | Description | | ----- | ---- | ----------- | -| event_time | TEXT | ISO 8601 formatted date string, the time the anomaly was detected by Morpheus | -| model_version | TEXT | Name and version of the model used to performed the inference, in the form of `:` | -| max_abs_z | FLOAT | Max z-score across all features | -| mean_abs_z | FLOAT | Average z-score across all features | +| `event_time` | TEXT | ISO 8601 formatted date string, the time the anomaly was detected by Morpheus | +| `model_version` | TEXT | Name and version of the model used to performed the inference, in the form of `:` | +| `max_abs_z` | FLOAT | Max z-score across all features | +| `mean_abs_z` | FLOAT | Average z-score across all features | In addition to this, for each input feature the following output fields will exist: | Field | Type | Description | diff --git a/docs/source/developer_guide/guides/1_simple_python_stage.md b/docs/source/developer_guide/guides/1_simple_python_stage.md index adb813a8b2..fe9c901de4 100644 --- a/docs/source/developer_guide/guides/1_simple_python_stage.md +++ b/docs/source/developer_guide/guides/1_simple_python_stage.md @@ -25,7 +25,7 @@ Morpheus makes use of the MRC graph-execution framework. Morpheus pipelines are ## The Pass Through Stage -To start, we will implement a single stage that could be included in a pipeline. For illustration, this stage will do nothing but take the input from the previous stage and forward it to the next stage. All Morpheus stages have several things in common, so while this doesn't do too much, it ends up being a good starting point for writing a new stage. From there, we can add our functionality as needed. 
Morpheus provides two ways of defining a stage, as a stand-alone function or as a class. +To start, we will implement a single stage that could be included in a pipeline. For illustration, this stage will do nothing but take the input from the previous stage and forward it to the next stage. All Morpheus stages have several things in common, so while this doesn't do too much, it ends up being a good starting point for writing a new stage. From there, we can add our functionality as needed. Morpheus provides two ways of defining a stage, as a stand-alone function or as a class. ### Stand-alone Function @@ -76,11 +76,11 @@ pipe.add_stage(multiplier(config, column='probs', value=5)) ### Stage Class -The class based aproach to defining a stage offers a bit more flexibility, specifically the ability to validate constructor arguments, and perform any needed setup prior to being invoked in a pipeline. Defining this stage requires us to specify the stage type. Morpheus stages which contain a single input and a single output typically inherit from `SinglePortStage`. Stages that act as sources of data, in that they do not take an input from a prior stage but rather produce data from a source such as a file, Kafka service, or other external sources, will need to inherit from the `SingleOutputSource` base class. +The class based approach to defining a stage offers a bit more flexibility, specifically the ability to validate constructor arguments, and perform any needed setup prior to being invoked in a pipeline. Defining this stage requires us to specify the stage type. Morpheus stages which contain a single input and a single output typically inherit from `SinglePortStage`. Stages that act as sources of data, in that they do not take an input from a prior stage but rather produce data from a source such as a file, Kafka service, or other external sources, will need to inherit from the `SingleOutputSource` base class. -Stages in Morpheus define what types of data they accept, and the type of data that they emit. In this example we are emitting messages of the same type that is received, this is actually quite common and Morpheus provides a mixin class, `PassThruTypeMixin`, to simplify this. +Stages in Morpheus define what types of data they accept, and the type of data that they emit. In this example we are emitting messages of the same type that is received, this is actually quite common and Morpheus provides a mixin class, `PassThruTypeMixin`, to simplify this. -Optionally, stages can be registered as a command with the Morpheus CLI using the `register_stage` decorator. This allows for pipelines to be constructed from both pre-built stages and custom user stages via the command line. Any constructor arguments will be introspected using [numpydoc](https://numpydoc.readthedocs.io/en/latest/) and exposed as command line flags. Similarly, the class's docstrings will be exposed in the help string of the stage on the command line. +Optionally, stages can be registered as a command with the Morpheus CLI using the `register_stage` decorator. This allows for pipelines to be constructed from both pre-built stages and custom user stages via the command line. Any constructor arguments will be introspected using [`numpydoc`](https://numpydoc.readthedocs.io/en/latest/) and exposed as command line flags. Similarly, the class's docstrings will be exposed in the help string of the stage on the command line. 
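To tie these ideas together before the step-by-step walkthrough that follows, here is a hypothetical stage whose constructor is both validated and documented in numpydoc style. The stage name, the `threshold` argument, and the import paths are assumptions made for this sketch, following the conventions used elsewhere in this guide; it is not part of the example code itself.

```python
import typing

import mrc
from mrc.core import operators as ops

from morpheus.cli.register_stage import register_stage
from morpheus.config import Config
from morpheus.pipeline.pass_thru_type_mixin import PassThruTypeMixin
from morpheus.pipeline.single_port_stage import SinglePortStage


@register_stage("hypothetical-threshold-check")
class HypotheticalThresholdStage(PassThruTypeMixin, SinglePortStage):
    """This docstring becomes the stage's help text on the command line."""

    def __init__(self, config: Config, threshold: float = 0.5):
        """
        Parameters
        ----------
        threshold : float
            Documented in numpydoc style so it can be exposed as a command line
            flag; values outside [0, 1] are rejected up front rather than
            failing after the pipeline has started.
        """
        super().__init__(config)
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
        self._threshold = threshold

    @property
    def name(self) -> str:
        return "hypothetical-threshold-check"

    def accepted_types(self) -> tuple:
        return (typing.Any,)

    def supports_cpp_node(self) -> bool:
        return False

    def _build_single(self, builder: mrc.Builder, input_node: mrc.SegmentObject) -> mrc.SegmentObject:
        # Pass messages through unchanged; the interesting part of this sketch
        # is the validated, documented constructor above.
        node = builder.make_node(self.unique_name, ops.map(lambda msg: msg))
        builder.make_edge(input_node, node)
        return node
```

With the decorator in place, the constructor argument and the docstrings would surface on the command line as described above.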
We start our class definition with a few basic imports: @@ -101,7 +101,7 @@ class PassThruStage(PassThruTypeMixin, SinglePortStage): There are four methods that need to be defined in our new subclass to implement the stage interface: `name`, `accepted_types`, `compute_schema`, `supports_cpp_node`, and `_build_single`. In practice, it is often necessary to define at least one more method which will perform the actual work of the stage; by convention, this method is typically named `on_data`, which we will define in our examples. -`name` is a property method; it should return a user-friendly name for the stage. Currently, this property is only used for debugging purposes, and there are no requirements on the content or format of the name. However by convention the string returned by this method should be the same as the string passed to the `register_stage` decorator. +`name` is a property method; it should return a user-friendly name for the stage. Currently, this property is only used for debugging purposes, and there are no requirements on the content or format of the name. However by convention the string returned by this method should be the same as the string passed to the `register_stage` decorator. ```python @property def name(self) -> str: @@ -114,7 +114,7 @@ The `accepted_types` method returns a tuple of message classes that this stage i return (typing.Any,) ``` -As mentioned previously we are making use of the `PassThruTypeMixin`, which defines the `compute_schema` method for us. This method returns the schema of the output message type. The `PassThruTypeMixin`, should be used anytime a stage receives and emits messages of the same type, even if it only accepts messages of a spefic type and modifies the data, the data type remains the same. Had we not used the `PassThruTypeMixin`, we would have defined the `compute_schema` method as follows: +As mentioned previously we are making use of the `PassThruTypeMixin`, which defines the `compute_schema` method for us. This method returns the schema of the output message type. The `PassThruTypeMixin`, should be used anytime a stage receives and emits messages of the same type, even if it only accepts messages of a specific type and modifies the data, the data type remains the same. Had we not used the `PassThruTypeMixin`, we would have defined the `compute_schema` method as follows: ```python from morpheus.pipeline.stage_schema import StageSchema ``` @@ -148,7 +148,7 @@ Finally, the `_build_single` method will be used at stage build time to construc return node ``` -For our purposes, a Morpheus _stage_ defines the input data type the stage will accept, the unit of work to be performed on that data, and the output data type. In contrast each individual node or nodes comprising a _stage_'s unit of work are wired into the underlying MRC execution pipeline. To build the node, we will call the `make_node` method of the builder instance, passing it our `unique_name` property method and applying MRC's map operator to the `on_data` method. We used the `unique_name` property, which will take the `name` property which we already defined and append a unique id to it. +For our purposes, a Morpheus _stage_ defines the input data type the stage will accept, the unit of work to be performed on that data, and the output data type. In contrast each individual node or nodes comprising a _stage_'s unit of work are wired into the underlying MRC execution pipeline. 
To build the node, we will call the `make_node` method of the builder instance, passing it our `unique_name` property method and applying the map operator to the `on_data` method. We used the `unique_name` property, which will take the `name` property which we already defined and append a unique id to it. ```python node = builder.make_node(self.unique_name, ops.map(self.on_data)) ``` @@ -209,7 +209,7 @@ To start testing both our new function-based and class-based stages, we are goin 1. This data will be read and processed by our pass through stage, in this case simply forwarding on the data. 1. A monitoring stage will record the messages from our pass through stage and terminate the pipeline. -First we will need to import a few things from Morpheus for this example to work. Note that this test script, which we will name "run.py", assumes that we saved the code for the class based `PassThruStage` in a file named "pass_thru.py" and the function based `pass_thru_stage` named "pass_thru_deco.py" in the same directory. +First we will need to import a few things from Morpheus for this example to work. Note that this test script, which we will name `"run.py"`, assumes that we saved the code for the class based `PassThruStage` in a file named `"pass_thru.py"` and the function based `pass_thru_stage` named "pass_thru_deco.py" in the same directory. ```python import logging @@ -278,18 +278,18 @@ The output should display: ``` ====Pipeline Pre-build==== ====Pre-Building Segment: linear_segment_0==== -====Pre-Building Segment Complete!==== -====Pipeline Pre-build Complete!==== -====Registering Pipeline==== -====Building Pipeline==== -====Building Pipeline Complete!==== -====Registering Pipeline Complete!==== -====Starting Pipeline==== -====Pipeline Started==== +====Pre-Building Segment Complete!==== +====Pipeline Pre-build Complete!==== +====Registering Pipeline==== +====Building Pipeline==== +====Building Pipeline Complete!==== +====Registering Pipeline Complete!==== +====Starting Pipeline==== +====Pipeline Started==== ====Building Segment: linear_segment_0==== Added source: └─> morpheus.MessageMeta -Added stage: , on_data_args=(), accept_type=None, return_type=None, needed_columns=None, on_data_kwargs={})> +Added stage: , on_data_args=(), accept_type=None, return_type=None, needed_columns=None, on_data_kwargs={})> └─ morpheus.MessageMeta -> morpheus.MessageMeta Added stage: └─ morpheus.MessageMeta -> morpheus.MessageMeta @@ -297,7 +297,7 @@ Added stage: └─ morpheus.MessageMeta -> morpheus.MessageMeta Added stage: └─ morpheus.MessageMeta -> morpheus.MessageMeta -====Building Segment Complete!==== +====Building Segment Complete!==== Progress[Complete]: 100 messages [00:01, 69.97 messages/s] Progress[Complete]: 100 messages [00:01, 69.76 messages/s] ====Pipeline Complete==== @@ -356,7 +356,7 @@ if __name__ == "__main__": ### Alternate Morpheus CLI example -The above example makes use of the Morpheus Python API. Alternately, we could test the class-based stage in a pipeline constructed using the Morpheus command line tool. We will need to pass in the path to our stage via the `--plugin` argument so that it will be visible to the command line tool. +The above example makes use of the Morpheus Python API. Alternately, we could test the class-based stage in a pipeline constructed using the Morpheus command line tool. We will need to pass in the path to our stage via the `--plugin` argument so that it will be visible to the command line tool. 
> **Note**: For now, registering a stage with the CLI tool is currently only available to class based stages. diff --git a/docs/source/developer_guide/guides/2_real_world_phishing.md b/docs/source/developer_guide/guides/2_real_world_phishing.md index 16b2b30f3d..e104288a35 100644 --- a/docs/source/developer_guide/guides/2_real_world_phishing.md +++ b/docs/source/developer_guide/guides/2_real_world_phishing.md @@ -29,7 +29,7 @@ For this task, we'll need to define a new stage, which we will call our `Recipie 1. Count the number of recipients in the email's metadata. 1. Emit a Morpheus `MessageMeta` object that will contain the record content along with the augmented metadata. -For this stage, the code will be similar to the previous example with a few notable changes. We will be working with the `MessageMeta` class. This is a Morpheus message containing a [cuDF](https://docs.rapids.ai/api/cudf/stable/) [DataFrame](https://docs.rapids.ai/api/cudf/stable/api_docs/dataframe.html). Since we will expect our new stage to operate on `MessageMeta` types, our new `accepted_types` method is defined as: +For this stage, the code will be similar to the previous example with a few notable changes. We will be working with the `MessageMeta` class. This is a Morpheus message containing a [cuDF](https://docs.rapids.ai/api/cudf/stable/) [DataFrame](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/dataframe/). Since we will expect our new stage to operate on `MessageMeta` types, our new `accepted_types` method is defined as: ```python def accepted_types(self) -> tuple: @@ -182,7 +182,7 @@ class RecipientFeaturesStage(PassThruTypeMixin, SinglePortStage): ### Stand-alone Function -For this example we started with the class based aproach. However we could have just as easily written this as a stand-alone function. The following example is equivalent to the class based example above: +For this example we started with the class based approach. However we could have just as easily written this as a stand-alone function. The following example is equivalent to the class based example above: ```python from morpheus.common import TypeId @@ -223,13 +223,13 @@ In the above the `needed_columns` were provided to as an argument to the `stage` Now we'll use the `RecipientFeaturesStage` that we just made in a real-world pipeline to detect fraudulent emails. The pipeline we will be building makes use of the `TritonInferenceStage` which is a pre-defined Morpheus stage designed to support the execution of Natural Language Processing (NLP) models via NVIDIA's [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server). NVIDIA Triton Inference Server allows for GPU accelerated ML/DL and seamless co-location and execution of a wide variety of model frameworks. For our application, we will be using the `phishing-bert-onnx` model, which is included with Morpheus models Docker container as well as in the `models/triton-model-repo/phishing-bert-onnx` directory. -It's important to note here that Triton is a service that is external to the Morpheus pipeline and often will not reside on the same machine(s) as the rest of the pipeline. The `TritonInferenceStage` will use HTTP and [gRPC](https://grpc.io/) network protocols to allow us to interact with the machine learning models that are hosted by the Triton server. +It's important to note here that Triton is a service that is external to the Morpheus pipeline and often will not reside on the same machine as the rest of the pipeline. 
The `TritonInferenceStage` will use HTTP and [gRPC](https://grpc.io/) network protocols to allow us to interact with the machine learning models that are hosted by the Triton server. ### Launching Triton -Triton will need to be running while we execute our pipeline. For simplicity, we will be using the Morpheus models container which includes both Trtion and the Morpheus models. +Triton will need to be running while we execute our pipeline. For simplicity, we will be using the Morpheus models container which includes both Triton and the Morpheus models. -> **Note**: This step assumes you have both [Docker](https://docs.docker.com/engine/install/) and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide) installed. +> **Note**: This step assumes you have both [Docker](https://docs.docker.com/engine/install/) and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation) installed. We will launch a Triton Docker container with: @@ -257,7 +257,7 @@ We can also query Triton for the available models: curl -X POST "localhost:8000/v2/repository/index" ``` -Let's ask Triton for some information about the `phishing-bert-onnx` model which we are going to be using, parsing the large JSON output with [jq](https://stedolan.github.io/jq/): +Let's ask Triton for some information about the `phishing-bert-onnx` model which we are going to be using, parsing the large JSON output with [`jq`](https://stedolan.github.io/jq/): ```shell curl "localhost:8000/v2/models/phishing-bert-onnx/config" | jq @@ -401,7 +401,7 @@ The `feature_length` property needs to match the dimensions of the model inputs, Ground truth classification labels are read from the `morpheus/data/labels_phishing.txt` file included in Morpheus. -Now that our config object is populated, we move on to the pipeline itself. We will be using the same input file from the previous example. +Now that our configuration object is populated, we move on to the pipeline itself. We will be using the same input file from the previous example. Next, we will add our custom recipient features stage to the pipeline. We imported both implementations of the stage, allowing us to add the appropriate one based on the `use_stage_function` value provided by the command-line. @@ -413,7 +413,7 @@ else: pipeline.add_stage(RecipientFeaturesStage(config)) ``` -To tokenize the input data we will use Morpheus' `PreprocessNLPStage`. This stage uses the [cudf subword tokenizer](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.subword_tokenizer.SubwordTokenizer.__call__.html) to transform strings into a tensor of numbers to be fed into the neural network model. Rather than split the string by characters or whitespaces, we split them into meaningful subwords based upon the occurrence of the subwords in a large training corpus. You can find more details here: [https://arxiv.org/abs/1810.04805v2](https://arxiv.org/abs/1810.04805v2). All we need to know for now is that the text will be converted to subword token ids based on the vocabulary file that we provide (`vocab_hash_file=vocab file`). +To tokenize the input data we will use Morpheus' `PreprocessNLPStage`. This stage uses the [cuDF subword tokenizer](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/subword_tokenize/#subwordtokenizer) to transform strings into a tensor of numbers to be fed into the neural network model. 
Rather than split the string by characters or whitespaces, we split them into meaningful subwords based upon the occurrence of the subwords in a large training corpus. You can find more details here: [https://arxiv.org/abs/1810.04805v2](https://arxiv.org/abs/1810.04805v2). All we need to know for now is that the text will be converted to subword token ids based on the vocabulary file that we provide (`vocab_hash_file=vocab file`). Let's go ahead and instantiate our `PreprocessNLPStage` and add it to the pipeline: @@ -452,7 +452,7 @@ pipeline.add_stage( pipeline.add_stage(MonitorStage(config, description="Inference Rate", smoothing=0.001, unit="inf")) ``` -Here we add a postprocessing stage that adds the probability score for `is_phishing`: +Here we add a post-processing stage that adds the probability score for `is_phishing`: ```python pipeline.add_stage(AddScoresStage(config, labels=["is_phishing"])) @@ -639,9 +639,9 @@ morpheus --log_level=debug --plugin examples/developer_guide/2_1_real_world_phis ## Stage Constructors -In our `RecipientFeaturesStage` example we added a constructor to our stage, however we didn't go into much detail on the implementation. Every stage constructor must receive an instance of a `morpheus.config.Config` object as its first argument and is then free to define additional stage-specific arguments after that. The Morpheus config object will contain configuration parameters needed by multiple stages in the pipeline, and the constructor in each Morpheus stage is free to inspect these. In contrast, parameters specific to a single stage are typically defined as constructor arguments. It is a best practice to perform any necessary validation checks in the constructor, and raising an exception in the case of mis-configuration. This allows us to fail early rather than after the pipeline has started. +In our `RecipientFeaturesStage` example we added a constructor to our stage, however we didn't go into much detail on the implementation. Every stage constructor must receive an instance of a `morpheus.config.Config` object as its first argument and is then free to define additional stage-specific arguments after that. The Morpheus configuration object will contain configuration parameters needed by multiple stages in the pipeline, and the constructor in each Morpheus stage is free to inspect these. In contrast, parameters specific to a single stage are typically defined as constructor arguments. It is a best practice to perform any necessary validation checks in the constructor, and raising an exception in the case of mis-configuration. This allows us to fail early rather than after the pipeline has started. -In our `RecipientFeaturesStage` example, we hard-coded the Bert separator token. Let's instead refactor the code to receive that as a constructor argument. This new constructor argument is documented following the [numpydoc](https://numpydoc.readthedocs.io/en/latest/format.html#parameters) formatting style allowing it to be documented properly for both API and CLI users. Let's also take the opportunity to verify that the pipeline mode is set to `morpheus.config.PipelineModes.NLP`. +In our `RecipientFeaturesStage` example, we hard-coded the Bert separator token. Let's instead refactor the code to receive that as a constructor argument. This new constructor argument is documented following the [`numpydoc`](https://numpydoc.readthedocs.io/en/latest/format.html#parameters) formatting style allowing it to be documented properly for both API and CLI users. 
Let's also take the opportunity to verify that the pipeline mode is set to `morpheus.config.PipelineModes.NLP`. > **Note**: Setting the pipeline mode in the `register_stage` decorator restricts usage of our stage to NLP pipelines when using the Morpheus command line tool, however there is no such enforcement with the Python API. @@ -748,7 +748,7 @@ In this example, we will create a source that reads messages from a [RabbitMQ](h The `PreallocatorMixin` when added to a stage class, typically a source stage, indicates that the stage emits newly constructed DataFrames either directly or contained in a `MessageMeta` instance into the pipeline. Adding this mixin allows any columns needed by other stages to be inserted into the DataFrame. -The `compute_schema` method allows us to define our output type of `MessageMeta`, we do so by calling the `set_type` method of the `output_schema` attribute of the `StageSchema` object passed into the method. Of note here is that it is perfectly valid for a stage to determine its output type based upon configuration arguments passed into the constructor. However the stage must document a single output type per output port. If a stage emitted multiple output types, then the types must share a common base class which would serve as the stage's output type. +The `compute_schema` method allows us to define our output type of `MessageMeta`, we do so by calling the `set_type` method of the `output_schema` attribute of the `StageSchema` object passed into the method. Of note here is that it is perfectly valid for a stage to determine its output type based upon configuration arguments passed into the constructor. However the stage must document a single output type per output port. If a stage emitted multiple output types, then the types must share a common base class which would serve as the stage's output type. ```python def compute_schema(self, schema: StageSchema): schema.output_schema.set_type(MessageMeta) @@ -785,7 +785,7 @@ def source_generator(self) -> collections.abc.Iterator[MessageMeta]: self._connection.close() ``` -Note that we read messages as quickly as we can from the queue. When the queue is empty we call `time.sleep`, allowing for a context switch to occur if needed. We acknowledge the message (by calling `basic_ack`) only once we have successfully emitted the message or failed to deserialize the message. This means that if the pipeline shuts down while consuming the queue, we will not lose any messages. However, in that situation we may end up with a duplicate message (i.e., if the pipeline is shut down after we have yielded the message but before calling `basic_ack`). +Note that we read messages as quickly as we can from the queue. When the queue is empty we call `time.sleep`, allowing for a context switch to occur if needed. We acknowledge the message (by calling `basic_ack`) only once we have successfully emitted the message or failed to deserialize the message. This means that if the pipeline shuts down while consuming the queue, we will not lose any messages. However, in that situation we may end up with a duplicate message (that is, if the pipeline is shut down after we have yielded the message but before calling `basic_ack`). #### The Completed Source Stage @@ -1000,7 +1000,7 @@ def _build_single(self, builder: mrc.Builder, input_node: mrc.SegmentObject) -> return node ``` -Similar to our previous examples, most of the actual business logic of the stage is contained in the `on_data` method. 
In this case, we grab a reference to the [cuDF](https://docs.rapids.ai/api/cudf/stable/) [DataFrame](https://docs.rapids.ai/api/cudf/stable/api_docs/dataframe.html) attached to the incoming message. We then serialize to an [io.StringIO](https://docs.python.org/3.10/library/io.html?highlight=stringio#io.StringIO) buffer, which is then sent to RabbitMQ. +Similar to our previous examples, most of the actual business logic of the stage is contained in the `on_data` method. In this case, we grab a reference to the [cuDF](https://docs.rapids.ai/api/cudf/stable/) [DataFrame](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/dataframe/) attached to the incoming message. We then serialize to an [`io.StringIO`](https://docs.python.org/3.10/library/io.html?highlight=stringio#io.StringIO) buffer, which is then sent to RabbitMQ. ```python def on_data(self, message: MessageMeta): @@ -1106,4 +1106,4 @@ class WriteToRabbitMQStage(PassThruTypeMixin, SinglePortStage): self._connection.close() ``` -> **Note**: For information about testing the `RabbitMQSourceStage`, `rabbitmq_source`, and `WriteToRabbitMQStage` stages refer to `examples/developer_guide/2_2_rabbitmq/README.md` in the the Morpheus repo. +> **Note**: For information about testing the `RabbitMQSourceStage`, `rabbitmq_source`, and `WriteToRabbitMQStage` stages refer to `examples/developer_guide/2_2_rabbitmq/README.md` in the Morpheus repo. diff --git a/docs/source/developer_guide/guides/3_simple_cpp_stage.md b/docs/source/developer_guide/guides/3_simple_cpp_stage.md index 6b1fbd9339..206b4eb13e 100644 --- a/docs/source/developer_guide/guides/3_simple_cpp_stage.md +++ b/docs/source/developer_guide/guides/3_simple_cpp_stage.md @@ -17,7 +17,7 @@ limitations under the License. # Simple C++ Stage ## Building the Example -The code for this guide can be found in the `examples/developer_guide/3_simple_cpp_stage` directory of the Morpheus repository. There are two ways to build the example. The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to cmake, for users using the `scripts/compile.sh` at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: +The code for this guide can be found in the `examples/developer_guide/3_simple_cpp_stage` directory of the Morpheus repository. There are two ways to build the example. The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to CMake, for users using the `scripts/compile.sh` at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: ```bash CMAKE_CONFIGURE_EXTRA_ARGS="-DMORPHEUS_BUILD_EXAMPLES=ON" ./scripts/compile.sh ``` @@ -34,7 +34,7 @@ pip install ./ ## Overview Morpheus offers the choice of writing pipeline stages in either Python or C++. For many use cases, a Python stage is perfectly fine. However, in the event that a Python stage becomes a bottleneck for the pipeline, then writing a C++ implementation for the stage becomes advantageous. The C++ implementations of Morpheus stages and messages utilize the [pybind11](https://pybind11.readthedocs.io/en/stable/index.html) library to provide Python bindings. -So far we have been defining our stages in Python, the option of defining a C++ implementation is only available to stages implemented as classes. 
Many of the stages included with Morpheus have both a Python and a C++ implementation, and Morpheus will use the C++ implementations by default. You can explicitly disable the use of C++ stage implementations by calling `morpheus.config.CppConfig.set_should_use_cpp(False)`: +We have been defining our stages in Python up to this point, the option of defining a C++ implementation is only available to stages implemented as classes. Many of the stages included with Morpheus have both a Python and a C++ implementation, and Morpheus will use the C++ implementations by default. You can explicitly disable the use of C++ stage implementations by calling `morpheus.config.CppConfig.set_should_use_cpp(False)`: ```python from morpheus.config import CppConfig @@ -54,20 +54,20 @@ def supports_cpp_node(self): return True ``` -C++ message object declarations can be found in the header files that are located in the `morpheus/_lib/include/morpheus/messages` directory. For example, the `MessageMeta` class declaration is located in `morpheus/_lib/include/morpheus/messages/meta.hpp`. Since this code is outside of the morpheus directory it would be included as: +C++ message object declarations can be found in the header files that are located in the `morpheus/_lib/include/morpheus/messages` directory. For example, the `MessageMeta` class declaration is located in `morpheus/_lib/include/morpheus/messages/meta.hpp`. Since this code is outside of the `morpheus` directory it would be included as: ```cpp #include ``` -Morpheus C++ source stages inherit from MRC's `PythonSource` class: +Morpheus C++ source stages inherit from the `PythonSource` class from MRC: ```cpp template class PythonSource : ... ``` -The `OutputT` type will be the datatype emitted by this stage. In contrast, general stages and sinks must inherit from MRC's `PythonNode` class, which specifies both receive and emit types: +The `OutputT` type will be the datatype emitted by this stage. In contrast, general stages and sinks must inherit from the `PythonNode` class from MRC, which specifies both receive and emit types: ```cpp template @@ -134,7 +134,7 @@ std::function, rxcpp::subscriber`. -All Morpheus C++ stages receive an instance of an MRC Segment Builder and a name (Typically this is the Python class' `unique_name` property) when constructed from Python. Note that C++ stages don't receive an instance of the Morpheus config. Therefore, if there are any attributes in the config needed by the C++ class, it is the responsibility of the Python class to extract them and pass them in as parameters to the C++ class. +All Morpheus C++ stages receive an instance of an MRC Segment Builder and a name (Typically this is the Python class' `unique_name` property) when constructed from Python. Note that C++ stages don't receive an instance of the Morpheus `Config` object. Therefore, if there are any attributes in the `Config` needed by the C++ class, it is the responsibility of the Python class to extract them and pass them in as parameters to the C++ class. We will also define an interface proxy object to keep the class definition separated from the Python interface. This isn't strictly required, but it is a convention used internally by Morpheus. Our proxy object will define a static method named `init` which is responsible for constructing a `PassThruStage` instance and returning it wrapped in a `shared_ptr`. 
There are many common Python types that pybind11 [automatically converts](https://pybind11.readthedocs.io/en/latest/advanced/cast/overview.html#conversion-table) to their associated C++ types. The MRC `Builder` is a C++ object with Python bindings. However there are other instances such as checking for values of `None` where the casting from Python to C++ types is not automatic. The proxy interface object fulfills this need and is used to help insulate Python bindings from internal implementation details. @@ -219,7 +219,7 @@ PassThruStage::PassThruStage() : {} ``` -However, this doesn't illustrate well how to customize a stage. So we will be using the long form signature for our examples. +However, this doesn't illustrate well how to customize a stage. For this reason, we will be using the long form signature for our examples. The `build_operator` method defines an observer which is subscribed to our input `rxcpp::observable`. The observer consists of three functions that are typically lambdas: `on_next`, `on_error`, and `on_completed`. Typically, these three functions call the associated methods on the output subscriber. @@ -237,7 +237,7 @@ PassThruStage::subscribe_fn_t PassThruStage::build_operator() Note the use of `std::move` in the `on_next` function. In Morpheus, our messages often contain both large payloads as well as Python objects where performing a copy necessitates acquiring the Python [Global Interpreter Lock (GIL)](https://docs.python.org/3.10/glossary.html#term-global-interpreter-lock). In either case, unnecessary copies can become a performance bottleneck, and much care is taken to limit the number of copies required for data to move through the pipeline. -There are situations in which a C++ stage does need to interact with Python, and therefore acquiring the GIL is a requirement. This is typically accomplished using pybind11's [gil_scoped_acquire](https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil) RAII class inside of a code block. Conversely there are situations in which we want to ensure that we are not holding the GIL and in these situations pybind11's [gil_scoped_release](https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil) class can be used. +There are situations in which a C++ stage does need to interact with Python, and therefore acquiring the GIL is a requirement. This is typically accomplished using pybind11's [`gil_scoped_acquire`](https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil) RAII class inside of a code block. Conversely there are situations in which we want to ensure that we are not holding the GIL and in these situations pybind11's [`gil_scoped_release`](https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil) class can be used. For stages it is important to ensure that the GIL is released before calling the output's `on_next` method. Consider the following `on_next` lambda function: diff --git a/docs/source/developer_guide/guides/4_source_cpp_stage.md b/docs/source/developer_guide/guides/4_source_cpp_stage.md index 74bf59f9dd..4b8f9eb601 100644 --- a/docs/source/developer_guide/guides/4_source_cpp_stage.md +++ b/docs/source/developer_guide/guides/4_source_cpp_stage.md @@ -17,7 +17,7 @@ limitations under the License. # Creating a C++ Source Stage ## Building the Example -The code for this guide can be found in the `examples/developer_guide/4_rabbitmq_cpp_stage` directory of the Morpheus repository. 
There are two ways to build the example. The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to cmake, for users using the `scripts/compile.sh` at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: +The code for this guide can be found in the `examples/developer_guide/4_rabbitmq_cpp_stage` directory of the Morpheus repository. There are two ways to build the example. The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to CMake, for users using the `scripts/compile.sh` at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: ```bash CMAKE_CONFIGURE_EXTRA_ARGS="-DMORPHEUS_BUILD_EXAMPLES=ON" ./scripts/compile.sh ``` @@ -73,7 +73,7 @@ class MORPHEUS_EXPORT RabbitMQSourceStage : public mrc::pymrc::PythonSource`, which we are going to use as it will occur in some of our function signatures. The way to think about `source_type_t` is it is the stage we are writing emits objects of type `MessageMeta`. The `subscriber_fn_t` is an alias for a function which will receive an `rxcpp::subscriber` instance and emit messages into the pipeline. The class we are deriving from `PythonSource` defines both of these to make writing function signatures easier. +Our base class defines `source_type_t` as an alias for `std::shared_ptr`, which we are going to use as it will occur in some of our function signatures. The way to think about `source_type_t` is it is the stage we are writing emits objects of type `MessageMeta`. The `subscriber_fn_t` is an alias for a function which will receive an `rxcpp::subscriber` instance and emit messages into the pipeline. The class we are deriving from `PythonSource` defines both of these to make writing function signatures easier. Our constructor is similar to the constructor of our Python class with the majority of the parameters being specific to communicating with RabbitMQ. In this case the default destructor is sufficient. @@ -98,7 +98,7 @@ void close(); The `build` method is responsible for returning a function with a signature matching `subscriber_fn_t`, the result of which will be passed into our base's constructor. Typically, this function is the center of a source stage, making calls to the `subscriber`'s `on_next`, `on_error`, and `on_completed` methods. For this example, the RabbitMQ-specific logic was broken out into the `source_generator` method, which should be analogous to the `source_generator` method from the Python class, and will emit new messages into the pipeline by calling `subscriber.on_next(message)`. -The `from_json` method parses a JSON string to a cuDF [table_with_metadata](https://docs.rapids.ai/api/libcudf/stable/structcudf_1_1io_1_1table__with__metadata.html). Lastly, the `close` method disconnects from the RabbitMQ exchange. +The `from_json` method parses a JSON string to a cuDF [`table_with_metadata`](https://docs.rapids.ai/api/libcudf/stable/structcudf_1_1io_1_1table__with__metadata.html). Lastly, the `close` method disconnects from the RabbitMQ exchange. We will also need three private attributes specific to our interactions with RabbitMQ: our polling interval, the name of the queue we are listening to, and a pointer to our channel object. 
@@ -285,7 +285,7 @@ void RabbitMQSourceStage::source_generator(rxcpp::subscriber **Note:** For GPU Requirements refer to [getting_started](../../getting_started.md#requirements) +> **Note:** For GPU Requirements refer to the [Getting Started](../../getting_started.md#requirements) guide. #### Building the services From the root of the Morpheus repo, run: @@ -186,7 +186,7 @@ docker compose build > This is most likely due to using an older version of the `docker-compose` command, instead re-run the build with `docker compose`. Refer to [Migrate to Compose V2](https://docs.docker.com/compose/migrate/) for more information. #### Downloading the example datasets -First, we will need to install `s3fs` and then run the `examples/digital_fingerprinting/fetch_example_data.py` script. This will download the example data into the `examples/data/dfp` dir. +First, we will need to install `s3fs` and then run the `examples/digital_fingerprinting/fetch_example_data.py` script. This will download the example data into the `examples/data/dfp` dir. From the Morpheus repo, run: ```bash @@ -242,15 +242,15 @@ Both scripts are capable of running either a training or inference pipeline for | `--train_users` | One of: `all`, `generic`, `individual`, `none` | Indicates whether or not to train per user or a generic model for all users. Selecting `none` runs the inference pipeline. | | `--skip_user` | TEXT | User IDs to skip. Mutually exclusive with `only_user` | | `--only_user` | TEXT | Only users specified by this option will be included. Mutually exclusive with `skip_user` | -| `--start_time` | TEXT | The start of the time window, if undefined start_date will be `now()-duration` | -| `--duration` | TEXT | The duration to run starting from `start_time` [default: 60d] | -| `--cache_dir` | TEXT | The location to cache data such as S3 downloads and pre-processed data [env var: `DFP_CACHE_DIR`; default: `./.cache/dfp`] | -| `--log_level` | One of: `CRITICAL`, `FATAL`, `ERROR`, `WARN`, `WARNING`, `INFO`, `DEBUG` | Specify the logging level to use. [default: `WARNING`] | -| `--sample_rate_s` | INTEGER | Minimum time step, in milliseconds, between object logs. [env var: `DFP_SAMPLE_RATE_S`; default: 0] | -| `-f`, `--input_file` | TEXT | List of files to process. Can specify multiple arguments for multiple files. Also accepts glob (*) wildcards and schema prefixes such as `s3://`. For example, to make a local cache of an s3 bucket, use `filecache::s3://mybucket/*`. Refer to [fsspec documentation](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files) for list of possible options. | +| `--start_time` | TEXT | The start of the time window, if undefined `start_date` will be `now()-duration` | +| `--duration` | TEXT | The duration to run starting from `start_time` [default: `60d`] | +| `--cache_dir` | TEXT | The location to cache data such as S3 downloads and pre-processed data [environment variable: `DFP_CACHE_DIR`; default: `./.cache/dfp`] | +| `--log_level` | One of: `CRITICAL`, `FATAL`, `ERROR`, `WARN`, `WARNING`, `INFO`, `DEBUG` | Specify the logging level to use. [default: `WARNING`] | +| `--sample_rate_s` | INTEGER | Minimum time step, in milliseconds, between object logs. [environment variable: `DFP_SAMPLE_RATE_S`; default: 0] | +| `-f`, `--input_file` | TEXT | List of files to process. Can specify multiple arguments for multiple files. Also accepts glob (*) wildcards and schema prefixes such as `s3://`. 
For example, to make a local cache of an s3 bucket, use `filecache::s3://mybucket/*`. Refer to [`fsspec` documentation](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files) for list of possible options. | | `--watch_inputs` | FLAG | Instructs the pipeline to continuously check the paths specified by `--input_file` for new files. This assumes that the at least one paths contains a wildcard. | | `--watch_interval` | FLOAT | Amount of time, in seconds, to wait between checks for new files. Only used if --watch_inputs is set. [default `1.0`] | -| `--tracking_uri` | TEXT | The MLflow tracking URI to connect to the tracking backend. [default: `http://localhost:5000`] | +| `--tracking_uri` | TEXT | The MLflow tracking URI to connect to. [default: `http://localhost:5000`] | | `--help` | | Show this message and exit. | @@ -282,10 +282,10 @@ The output files will contain those logs from the input dataset for which an ano Most of the fields in the output files generated by running the above examples are input fields or derived from input fields. The additional output fields are: | Field | Type | Description | | ----- | ---- | ----------- | -| event_time | TEXT | ISO 8601 formatted date string, the time the anomaly was detected by Morpheus | -| model_version | TEXT | Name and version of the model used to performed the inference, in the form of `:` | -| max_abs_z | FLOAT | Max z-score across all features | -| mean_abs_z | FLOAT | Average z-score across all features | +| `event_time` | TEXT | ISO 8601 formatted date string, the time the anomaly was detected by Morpheus | +| `model_version` | TEXT | Name and version of the model used to performed the inference, in the form of `:` | +| `max_abs_z` | FLOAT | Max z-score across all features | +| `mean_abs_z` | FLOAT | Average z-score across all features | In addition to this, for each input feature the following output fields will exist: | Field | Type | Description | @@ -297,7 +297,7 @@ In addition to this, for each input feature the following output fields will exi Refer to [DFPInferenceStage](6_digital_fingerprinting_reference.md#inference-stage-dfpinferencestage) for more on these fields. ##### Optional MLflow Service -Starting the `morpheus_pipeline` or the `jupyter` service, will start the `mlflow` service in the background. For debugging purposes, it can be helpful to view the logs of the running MLflow service. +Starting the `morpheus_pipeline` or the `jupyter` service, will start the `mlflow` service in the background. For debugging purposes, it can be helpful to view the logs of the running MLflow service. From the `examples/digital_fingerprinting/production` dir, run: ```bash @@ -309,7 +309,7 @@ docker compose up mlflow * [Kubernetes](https://kubernetes.io/) cluster configured with GPU resources​ * [NVIDIA GPU Operator](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator) installed in the cluster -> **Note:** For GPU Requirements refer to [getting_started](../../getting_started.md#requirements) +> **Note:** For GPU Requirements refer to the [Getting Started](../../getting_started.md#requirements) guide. ## Customizing DFP For details on customizing the DFP pipeline refer to [Digital Fingerprinting (DFP) Reference](./6_digital_fingerprinting_reference.md). 
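Before moving on to the reference guide, note that the detection output described above is a plain CSV file and can be inspected with ordinary pandas. The snippet below is a hypothetical post-processing step, not part of the example pipeline; the file name matches the Azure example output and the column names come from the output fields table above.

```python
import pandas as pd

# Load the detections produced by the example pipeline (assumed to exist locally).
detections = pd.read_csv("dfp_detections_azure.csv")

# Rank the most anomalous rows using the aggregate z-score columns.
top = detections.sort_values("mean_abs_z", ascending=False).head(10)
print(top[["event_time", "model_version", "max_abs_z", "mean_abs_z"]])
```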
diff --git a/docs/source/developer_guide/guides/6_digital_fingerprinting_reference.md b/docs/source/developer_guide/guides/6_digital_fingerprinting_reference.md index 3307d2ece5..cd9c2c99bd 100644 --- a/docs/source/developer_guide/guides/6_digital_fingerprinting_reference.md +++ b/docs/source/developer_guide/guides/6_digital_fingerprinting_reference.md @@ -23,11 +23,11 @@ limitations under the License. ### Pipeline Structure Configuration ![Pipeline Structure Configuration](img/dfp_pipeline_structure.png) -The stages in both the Training and Inference pipelines can be mixed and matched with little impact​, that is, the `MultiFileSource` can be configured to pull from S3 or from local files and can be replaced altogether with any other Morpheus input stage. Similarly, the S3 writer can be replaced with any Morpheus output stage. Regardless of the inputs and outputs the core pipeline should remain unchanged. While stages in the core of the pipeline (inside the blue areas in the above diagram) perform common actions that should be configured not exchanged. +The stages in both the Training and Inference pipelines can be mixed and matched with little impact, that is, the `MultiFileSource` can be configured to pull from S3 or from local files and can be replaced altogether with any other Morpheus input stage. Similarly, the S3 writer can be replaced with any Morpheus output stage. Regardless of the inputs and outputs the core pipeline should remain unchanged. While stages in the core of the pipeline (inside the blue areas in the above diagram) perform common actions that should be configured not exchanged. -### Morpheus Config +### Morpheus `Config` -For both inference and training pipeline the Morpheus config object should be constructed with the same values, for example: +For both inference and training pipeline the Morpheus `Config` object should be constructed with the same values, for example: ```python import os @@ -56,9 +56,9 @@ Other attributes which might be needed: ### Schema Definition #### DataFrame Input Schema (`DataFrameInputSchema`) -The {py:class}`~morpheus.utils.column_info.DataFrameInputSchema` class defines the schema specifying the columns to be included in the output `DataFrame`. Within the DFP pipeline there are two stages where pre-processing is performed, the `DFPFileToDataFrameStage` stage and the `DFPPreprocessingStage`. This decoupling of the pre-processing stages from the actual operations needed to be performed allows for the actual schema to be user-defined in the pipeline and re-usability of the stages. It is up to the user to define the fields which will appear in the `DataFrame`. Any column in the input data that isn't specified in either `column_info` or `preserve_columns` constructor arguments will not appear in the output. The exception to this are JSON fields, specified in the `json_columns` argument which defines json fields which are to be normalized. +The {py:class}`~morpheus.utils.column_info.DataFrameInputSchema` class defines the schema specifying the columns to be included in the output `DataFrame`. Within the DFP pipeline there are two stages where pre-processing is performed, the `DFPFileToDataFrameStage` stage and the `DFPPreprocessingStage`. This decoupling of the pre-processing stages from the actual operations needed to be performed allows for the actual schema to be user-defined in the pipeline and re-usability of the stages. It is up to the user to define the fields which will appear in the `DataFrame`. 
Any column in the input data that isn't specified in either `column_info` or `preserve_columns` constructor arguments will not appear in the output. The exception to this are JSON fields, specified in the `json_columns` argument which defines JSON fields which are to be normalized. -It is important to note that the fields defined in `json_columns` are normalized prior to the processing of the fields in `column_info`, allowing for processing to be performed on fields nested in JSON columns. For example, say we had a JSON field named `event` containing a key named `timestamp`, which in the JSON data appears as an ISO 8601 formatted date string, we could ensure it was converted to a datetime object to downstream stages with the following: +It is important to note that the fields defined in `json_columns` are normalized prior to the processing of the fields in `column_info`, allowing for processing to be performed on fields nested in JSON columns. For example, say we had a JSON field named `event` containing a key named `timestamp`, which in the JSON data appears as an ISO 8601 formatted date string, we could ensure it was converted to a `datetime` object to downstream stages with the following: ```python from morpheus.utils.column_info import DataFrameInputSchema from morpheus.utils.column_info import DateTimeColumn @@ -71,8 +71,8 @@ schema = DataFrameInputSchema( In the above examples, three operations were performed: 1. The `event` JSON field was normalized, resulting in new fields prefixed with `event.` to be included in the output `DataFrame`. -2. The newly created field `event.timestamp` is parsed into a datetime field. -3. Since the DFP pipeline explicitly requires a timestamp field, we name this new column with the `config.ae.timestamp_column_name` config attribute ensuring it matches the pipeline configuration. When `name` and `input_name` are the same the old field is overwritten, and when they differ a new field is created. +2. The newly created field `event.timestamp` is parsed into a `datetime` field. +3. Since the DFP pipeline explicitly requires a timestamp field, we name this new column with the `config.ae.timestamp_column_name` attribute ensuring it matches the pipeline configuration. When `name` and `input_name` are the same the old field is overwritten, and when they differ a new field is created. The `DFPFileToDataFrameStage` is executed first and is responsible for flattening potentially nested JSON data and performing any sort of data type conversions. The `DFPPreprocessingStage` is executed later after the `DFPSplitUsersStage` allowing for the possibility of per-user computed fields such as the `logcount` and `locincrement` fields mentioned previously. Both stages are performed after the `DFPFileBatcherStage` allowing for per time period (per-day by default) computed fields. @@ -98,18 +98,18 @@ Subclass of `ColumnInfo`, defines a column to be computed by a user-defined func | `name` | `str` | Name of the column | | `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) | | `process_column_fn` | `function` | Function which receives the entire `DataFrame` as its only input, returning a new [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object to be stored in column `name`. | -| `input_column_types` | `dict[str, str]` | The input columns and the expected dtypes that are needed for this Column to successfully process. 
Setting this as `None` will pass all columns. Specifying which columns are needed improves performance. | +| `input_column_types` | `dict[str, str]` | The input columns and the expected [`dtype` strings](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) that are needed for this Column to successfully process. Setting this as `None` will pass all columns. Specifying which columns are needed improves performance. | #### Rename Column (`RenameColumn`) Subclass of `ColumnInfo`, adds the ability to also perform a rename. | Argument | Type | Description | | -------- | ---- | ----------- | | `name` | `str` | Name of the destination column | -| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) | +| `dtype` | `str` or Python type | Any type string or Python class recognized by [pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) | | `input_name` | `str` | Original column name | #### Boolean Column (`BoolColumn`) -Subclass of `RenameColumn`, adds the ability to map a set custom values as boolean values. For example say we had a string input field containing one of five possible enum values: `OK`, `SUCCESS`, `DENIED`, `CANCELED` and `EXPIRED` we could map these values into a single boolean field as: +Subclass of `RenameColumn`, adds the ability to map a set custom values as boolean values. For example say we had a string input field containing one of five possible `enum` values: `OK`, `SUCCESS`, `DENIED`, `CANCELED` and `EXPIRED` we could map these values into a single boolean field as: ```python from morpheus.utils.column_info import BoolColumn ``` @@ -121,7 +121,7 @@ field = BoolColumn(name="result", false_values=["DENIED", "CANCELED", "EXPIRED"]) ``` -We used strings in this example; however, we also could have just as easily mapped integer status codes. We also have the ability to map onto types other than boolean by providing custom values for true and false (for example, `1`/`0`, `yes`/`no`) . +We used strings in this example; however, we also could have just as easily mapped integer status codes. We also have the ability to map onto types other than boolean by providing custom values for true and false (for example, `1`/`0`, `yes`/`no`) . | Argument | Type | Description | | -------- | ---- | ----------- | @@ -134,7 +134,7 @@ We used strings in this example; however, we also could have just as easily mapp | `false_values` | `List[str]` | List of string values to be interpreted as false. | #### Date-Time Column (`DateTimeColumn`) -Subclass of `RenameColumn`, specific to casting UTC localized datetime values. When incoming values contain a time-zone offset string the values are converted to UTC, while values without a time-zone are assumed to be UTC. +Subclass of `RenameColumn`, specific to casting UTC localized `datetime` values. When incoming values contain a time-zone offset string the values are converted to UTC, while values without a time-zone are assumed to be UTC. 
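The conversion behavior described here can be illustrated with a small standalone pandas sketch. This only demonstrates the described semantics, it is not the `DateTimeColumn` implementation, and the sample values are made up:

```python
import pandas as pd

# Values with an explicit offset are converted to UTC; values without an
# offset are treated as already being UTC.
raw = pd.Series([
    "2022-08-21T22:05:23Z",        # explicit UTC
    "2022-08-21T18:05:23-04:00",   # offset, converted to 22:05:23 UTC
    "2022-08-21T22:05:23",         # no offset, assumed UTC
])

timestamps = pd.to_datetime(raw, utc=True)
print(timestamps)
```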
| Argument | Type | Description | | -------- | ---- | ----------- | @@ -177,32 +177,32 @@ Subclass of `DateTimeColumn`, counts the unique occurrences of a value in `group ![Input Stages](img/dfp_input_config.png) #### Source Stage (`MultiFileSource`) -The `MultiFileSource` (`examples/digital_fingerprinting/production/morpheus/dfp/stages/multi_file_source.py`) receives a path or list of paths (`filenames`), and will collectively be emitted into the pipeline as an [fsspec.core.OpenFiles](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFiles) object. The paths may include wildcards `*` as well as URLs (ex: `s3://path`) to remote storage providers such as S3, FTP, GCP, Azure, Databricks and others as defined by [fsspec](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files). In addition to this paths can be cached locally by prefixing them with `filecache::` (ex: `filecache::s3://bucket-name/key-name`). +The `MultiFileSource` (`examples/digital_fingerprinting/production/morpheus/dfp/stages/multi_file_source.py`) receives a path or list of paths (`filenames`), and will collectively be emitted into the pipeline as an [`fsspec.core.OpenFiles`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFiles) object. The paths may include wildcards `*` as well as URLs (ex: `s3://path`) to remote storage providers such as S3, FTP, GCP, Azure, Databricks and others as defined by [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files). In addition to this paths can be cached locally by prefixing them with `filecache::` (ex: `filecache::s3://bucket-name/key-name`). > **Note:** This stage does not actually download the data files, allowing the file list to be filtered and batched prior to being downloaded. | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `filenames` | `List[str]` or `str` | Paths to source file to be read from | -| `watch` | `bool` | Optional: when True will repeatedly poll `filenames` for new files. This assumes that at least one of the paths in `filenames` containes a wildcard. By default False. | -| `watch_interval` | `float` | When `watch` is True, this is the time in seconds between polling the paths in `filenames` for new files. Ignored when `watch` is False. | +| `watch` | `bool` | Optional: when `True` will repeatedly poll `filenames` for new files. This assumes that at least one of the paths in `filenames` contains a wildcard. By default `False`. | +| `watch_interval` | `float` | When `watch` is `True`, this is the time in seconds between polling the paths in `filenames` for new files. Ignored when `watch` is `False`. | #### File Batcher Stage (`DFPFileBatcherStage`) -The `DFPFileBatcherStage` (`examples/digital_fingerprinting/production/morpheus/dfp/stages/dfp_file_batcher_stage.py`) groups data in the incoming `DataFrame` in batches of a time period (per day default), and optionally filtering incoming data to a specific time window. This stage can potentially improve performance by combining multiple small files into a single batch. This stage assumes that the date of the logs can be easily inferred such as encoding the creation time in the file name (for example, `AUTH_LOG-2022-08-21T22.05.23Z.json`), or using the modification time as reported by the file system. 
The actual method for extracting the date is encoded in a user-supplied `date_conversion_func` function (more on this later). +The `DFPFileBatcherStage` (`examples/digital_fingerprinting/production/morpheus/dfp/stages/dfp_file_batcher_stage.py`) groups data in the incoming `DataFrame` in batches of a time period (per day default), and optionally filtering incoming data to a specific time window. This stage can potentially improve performance by combining multiple small files into a single batch. This stage assumes that the date of the logs can be easily inferred such as encoding the creation time in the file name (for example, `AUTH_LOG-2022-08-21T22.05.23Z.json`), or using the modification time as reported by the file system. The actual method for extracting the date is encoded in a user-supplied `date_conversion_func` function (more on this later). | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | -| `date_conversion_func` | `function` | Function receives a single [fsspec.core.OpenFile](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFile) argument and returns a `datetime.datetime` object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | +| `date_conversion_func` | `function` | Function receives a single [`fsspec.core.OpenFile`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFile) argument and returns a `datetime.datetime` object | | `period` | `str` | Time period to group data by, value must be [one of pandas' offset strings](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases) | -| `sampling_rate_s` | `int` | Optional, default=`None`. Deprecated consider using `sampling` instead. When defined a subset of the incoming data files will be sampled, taking the first row for each `sampling_rate_s` seconds.| +| `sampling_rate_s` | `int` | Optional, default=`None`. Deprecated consider using `sampling` instead. When defined a subset of the incoming data files will be sampled, taking the first row for each `sampling_rate_s` seconds.| | `start_time` | `datetime` | Optional, default=`None`. When not None incoming data files will be filtered, excluding any files created prior to `start_time` | | `end_time` | `datetime` | Optional, default=`None`. When not None incoming data files will be filtered, excluding any files created after `end_time` | | `sampling` | `str`, `float`, `int` | Optional, When non-None a subset of the incoming data files will be sampled. When a string, the value is interpreted as a pandas frequency. The first row for each frequency will be taken. When the value is between [0,1), a percentage of rows will be taken. When the value is greater than 1, the value is interpreted as the random count of rows to take. | -For situations where the creation date of the log file is encoded in the filename, the `date_extractor` in the `morpheus/utils/file_utils.py` module can be used. The `date_extractor` assumes that the timestamps are localized to UTC and will need to have a regex pattern bound to it before being passed in as a parameter to `DFPFileBatcherStage`. The regex pattern will need to contain the following named groups: `year`, `month`, `day`, `hour`, `minute`, `second`, and optionally `microsecond`. In cases where the regular expression does not match the `date_extractor` function will fallback to using the modified time of the file. 
+For situations where the creation date of the log file is encoded in the filename, the `date_extractor` in the `morpheus/utils/file_utils.py` module can be used. The `date_extractor` assumes that the timestamps are localized to UTC and will need to have a regex pattern bound to it before being passed in as a parameter to `DFPFileBatcherStage`. The regex pattern will need to contain the following named groups: `year`, `month`, `day`, `hour`, `minute`, `second`, and optionally `microsecond`. In cases where the regular expression does not match the `date_extractor` function will fallback to using the modified time of the file. For input files containing an ISO 8601 formatted date string the `iso_date_regex` regex can be used ex: ```python @@ -219,23 +219,23 @@ pipeline.add_stage( date_conversion_func=functools.partial(date_extractor, filename_regex=iso_date_regex))) ``` -> **Note:** If `date_conversion_func` returns time-zone aware timestamps, then `start_time` and `end_time` if not-None need to also be timezone aware datetime objects. +> **Note:** If `date_conversion_func` returns time-zone aware timestamps, then `start_time` and `end_time` if not `None` need to also be timezone aware `datetime` objects. #### File to DataFrame Stage (`DFPFileToDataFrameStage`) -The `DFPFileToDataFrameStage` (examples/digital_fingerprinting/production/morpheus/dfp/stages/dfp_file_to_df.py) stage receives a `list` of an [fsspec.core.OpenFiles](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFiles) and loads them into a single `DataFrame` which is then emitted into the pipeline. When the parent stage is `DFPFileBatcherStage` each batch (typically one day) is concatenated into a single `DataFrame`. If the parent was `MultiFileSource` the entire dataset is loaded into a single `DataFrame`. Because of this, it is important to choose a `period` argument for `DFPFileBatcherStage` small enough such that each batch can fit into memory. +The `DFPFileToDataFrameStage` (`examples/digital_fingerprinting/production/morpheus/dfp/stages/dfp_file_to_df.py`) stage receives a `list` of an [`fsspec.core.OpenFiles`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFiles) and loads them into a single `DataFrame` which is then emitted into the pipeline. When the parent stage is `DFPFileBatcherStage` each batch (typically one day) is concatenated into a single `DataFrame`. If the parent was `MultiFileSource` the entire dataset is loaded into a single `DataFrame`. Because of this, it is important to choose a `period` argument for `DFPFileBatcherStage` small enough such that each batch can fit into memory. | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `schema` | `DataFrameInputSchema` | Schema specifying columns to load, along with any necessary renames and data type conversions | | `filter_null` | `bool` | Optional: Whether to filter null rows after loading, by default True. | -| `file_type` | `morpheus.common.FileTypes` (enum) | Optional: Indicates file type to be loaded. Currently supported values at time of writing are: `FileTypes.Auto`, `FileTypes.CSV`, `FileTypes.JSON` and `FileTypes.PARQUET`. Default value is `FileTypes.Auto` which will infer the type based on the file extension, set this value if using a custom extension | +| `file_type` | `morpheus.common.FileTypes` (`enum`) | Optional: Indicates file type to be loaded. 
Currently supported values at time of writing are: `FileTypes.Auto`, `FileTypes.CSV`, `FileTypes.JSON` and `FileTypes.PARQUET`. Default value is `FileTypes.Auto` which will infer the type based on the file extension, set this value if using a custom extension | | `parser_kwargs` | `dict` or `None` | Optional: additional keyword arguments to be passed into the `DataFrame` parser, currently this is going to be either [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), [`pandas.read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) or [`pandas.read_parquet`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) | | `cache_dir` | `str` | Optional: path to cache location, defaults to `./.cache/dfp` | -This stage is able to download and load data files concurrently by multiple methods. Currently supported methods are: `single_thread`, `dask`, and `dask_thread`. The method used is chosen by setting the {envvar}`MORPHEUS_FILE_DOWNLOAD_TYPE` environment variable, and `dask_thread` is used by default, and `single_thread` effectively disables concurrent loading. +This stage is able to download and load data files concurrently by multiple methods. Currently supported methods are: `single_thread`, `dask`, and `dask_thread`. The method used is chosen by setting the {envvar}`MORPHEUS_FILE_DOWNLOAD_TYPE` environment variable, and `dask_thread` is used by default, and `single_thread` effectively disables concurrent loading. -This stage will cache the resulting `DataFrame` in `cache_dir`, since we are caching the `DataFrame`s and not the source files, a cache hit avoids the cost of parsing the incoming data. In the case of remote storage systems, such as S3, this avoids both parsing and a download on a cache hit. One consequence of this is that any change to the `schema` will require purging cached files in the `cache_dir` before those changes are visible. +This stage will cache the resulting `DataFrame` in `cache_dir`, since we are caching the `DataFrame`s and not the source files, a cache hit avoids the cost of parsing the incoming data. In the case of remote storage systems, such as S3, this avoids both parsing and a download on a cache hit. One consequence of this is that any change to the `schema` will require purging cached files in the `cache_dir` before those changes are visible. > **Note:** This caching is in addition to any caching which may have occurred when using the optional `filecache::` prefix. @@ -249,34 +249,36 @@ This final stage will write all received messages to a single output file in eit | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `filename` | `str` | The file to write anomalous log messages to. | | `overwrite` | `bool` | Optional, defaults to `False`. If the file specified in `filename` already exists, it will be overwritten if this option is set to `True` | #### Write to S3 Stage (`WriteToS3Stage`) -The {py:obj}`~dfp.stages.write_to_s3_stage.WriteToS3Stage` stage writes the resulting anomaly detections to S3. The `WriteToS3Stage` decouples the S3 specific operations from the Morpheus stage, and as such receives an `s3_writer` argument. +The {py:obj}`~dfp.stages.write_to_s3_stage.WriteToS3Stage` stage writes the resulting anomaly detections to S3. 
The `WriteToS3Stage` decouples the S3 specific operations from the Morpheus stage, and as such receives an `s3_writer` argument. | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `s3_writer` | `function` | User defined function which receives an instance of a `morpheus.messages.message_meta.MessageMeta` and returns that same message instance. Any S3 specific configurations, such as bucket name, should be bound to the method. | ### Core Pipeline These stages are common to both the training and inference pipelines, unlike the input and output stages these are specific to the DFP pipeline and intended to be configured but not replaceable. #### Split Users Stage (`DFPSplitUsersStage`) -The {py:obj}`~dfp.stages.dfp_split_users_stage.DFPSplitUsersStage` stage receives an incoming `DataFrame` and emits a `list` of `DFPMessageMeta` where each `DFPMessageMeta` represents the records associated for a given user. This allows for downstream stages to perform all necessary operations on a per user basis. +The {py:obj}`~dfp.stages.dfp_split_users_stage.DFPSplitUsersStage` stage receives an incoming `DataFrame` and emits a `list` of `DFPMessageMeta` where each `DFPMessageMeta` represents the records associated for a given user. This allows for downstream stages to perform all necessary operations on a per user basis. | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `include_generic` | `bool` | When `True` a `DFPMessageMeta` will be constructed for the generic user containing all records not excluded by the `skip_users` and `only_users` filters | | `include_individual` | `bool` | When `True` a `DFPMessageMeta` instance will be constructed for each user not excluded by the `skip_users` and `only_users` filters | -| `skip_users` | `List[str]` or `None` | List of users to exclude, when `include_generic` is `True` excluded records will also be excluded from the generic user. Mutually exclusive with `only_users`. | +| `skip_users` | `List[str]` or `None` | List of users to exclude, when `include_generic` is `True` excluded records will also be excluded from the generic user. Mutually exclusive with `only_users`. | | `only_users` | `List[str]` or `None` | Limit records to a specific list of users, when `include_generic` is `True` the generic user's records will also be limited to the users in this list. Mutually exclusive with `skip_users`. | #### Rolling Window Stage (`DFPRollingWindowStage`) The {py:obj}`~dfp.stages.dfp_rolling_window_stage.DFPRollingWindowStage` stage performs several key pieces of functionality for DFP. + + 1. This stage keeps a moving window of logs on a per user basis * These logs are saved to disk to reduce memory requirements between logs from the same user 1. It only emits logs when the window history requirements are met @@ -286,31 +288,33 @@ The {py:obj}`~dfp.stages.dfp_rolling_window_stage.DFPRollingWindowStage` stage p * To support all column feature types, incoming log messages can be combined with existing history and sent to downstream stages. * For example, to calculate a feature that increments a counter for the number of logs a particular user has generated in a single day, we must have the user's log history for the past 24 hours. 
To support this, this stage will combine new logs with existing history into a single `DataFrame`.
   * It is the responsibility of downstream stages to distinguish between new logs and existing history.
+
 | Argument | Type | Description |
 | -------- | ---- | ----------- |
-| `c` | `morpheus.config.Config` | Morpheus config object |
+| `c` | `morpheus.config.Config` | Morpheus configuration object |
 | `min_history` | `int` | Exclude users with less than `min_history` records, setting this to `1` effectively disables this feature |
 | `min_increment` | `int` | Exclude incoming batches for users where less than `min_increment` new records have been added since the last batch, setting this to `0` effectively disables this feature |
 | `max_history` | `int`, `str` or `None` | When not `None`, include up to `max_history` records. When `max_history` is an int, then the last `max_history` records will be included. When `max_history` is a `str` it is assumed to represent a duration parsable by [`pandas.Timedelta`](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html) and only those records within the window of [latest timestamp - `max_history`, latest timestamp] will be included. |
 | `cache_dir` | `str` | Optional path to cache directory, cached items will be stored in a subdirectory under `cache_dir` named `rolling-user-data` this directory, along with `cache_dir` will be created if it does not already exist. |
+
 > **Note:** this stage computes a row hash for the first and last rows of the incoming `DataFrame` as such all data contained must be hashable, any non-hashable values such as `lists` should be dropped or converted into hashable types in the `DFPFileToDataFrameStage`.

 #### Preprocessing Stage (`DFPPreprocessingStage`)
-The {py:obj}`~dfp.stages.dfp_preprocessing_stage.DFPPreprocessingStage` stage, the actual logic of preprocessing is defined in the `input_schema` argument. Since this stage occurs in the pipeline after the `DFPFileBatcherStage` and `DFPSplitUsersStage` stages all records in the incoming `DataFrame` correspond to only a single user within a specific time period allowing for columns to be computer on a per-user per-time period basis such as the `logcount` and `locincrement` features mentioned above. Making the type of processing performed in this stage different from those performed in the `DFPFileToDataFrameStage`.
+In the {py:obj}`~dfp.stages.dfp_preprocessing_stage.DFPPreprocessingStage` stage, the actual logic of preprocessing is defined in the `input_schema` argument. Since this stage occurs in the pipeline after the `DFPFileBatcherStage` and `DFPSplitUsersStage` stages, all records in the incoming `DataFrame` correspond to a single user within a specific time period, allowing columns to be computed on a per-user, per-time-period basis, such as the `logcount` and `locincrement` features mentioned above. This makes the type of processing performed in this stage different from that performed in the `DFPFileToDataFrameStage`.

 | Argument | Type | Description |
 | -------- | ---- | ----------- |
-| `c` | `morpheus.config.Config` | Morpheus config object |
+| `c` | `morpheus.config.Config` | Morpheus configuration object |
 | `input_schema` | `DataFrameInputSchema` | Schema specifying columns to be included in the output `DataFrame` including computed columns |

 ## Training Pipeline

 ![Training PipelineOverview](img/dfp_training_overview.png)

-Training must begin with the generic user model which is trained with the logs from all users.
This model serves as a fallback model for users and accounts without sufficient training data​. The name of the generic user is defined in the `ae.fallback_username` attribute of the Morpheus config object and defaults to `generic_user`. +Training must begin with the generic user model which is trained with the logs from all users. This model serves as a fallback model for users and accounts without sufficient training data. The name of the generic user is defined in the `ae.fallback_username` attribute of the Morpheus configuration object and defaults to `generic_user`. -After training the generic model, individual user models can be trained​. Individual user models provide better accuracy but require sufficient data​. Many users do not have sufficient data to train the model accurately​. +After training the generic model, individual user models can be trained. Individual user models provide better accuracy but require sufficient data. Many users do not have sufficient data to train the model accurately. ### Training Stages @@ -319,7 +323,7 @@ The {py:obj}`~dfp.stages.dfp_training.DFPTraining` trains a model for each incom | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `model_kwargs` | `dict` or `None` | Optional dictionary of keyword arguments to be used when constructing the model. Refer to [`AutoEncoder`](https://github.com/nv-morpheus/dfencoder/blob/master/dfencoder/autoencoder.py) for information on the available options.| | `epochs` | `int` | Number of training epochs. Default is 30.| | `validation_size` | `float` | Proportion of the input dataset to use for training validation. Should be between 0.0 and 1.0. Default is 0.0.| @@ -329,10 +333,10 @@ The {py:obj}`~dfp.stages.dfp_mlflow_model_writer.DFPMLFlowModelWriterStage` stag | Argument | Type | Description | | -------- | ---- | ----------- | -| `c` | `morpheus.config.Config` | Morpheus config object | +| `c` | `morpheus.config.Config` | Morpheus configuration object | | `model_name_formatter` | `str` | Optional format string to control the name of models stored in MLflow, default is `dfp-{user_id}`. Currently available field names are: `user_id` and `user_md5` which is an md5 hexadecimal digest as returned by [`hash.hexdigest`](https://docs.python.org/3.10/library/hashlib.html?highlight=hexdigest#hashlib.hash.hexdigest). | -| `experiment_name_formatter` | `str` | Optional format string to control the experiment name for models stored in MLflow, default is `/dfp-models/{reg_model_name}`. Currently available field names are: `user_id`, `user_md5` and `reg_model_name` which is the model name as defined by `model_name_formatter` once the field names have been applied. | -| `databricks_permissions` | `dict` or `None` | Optional, when not `None` sets permissions needed when using a databricks hosted MLflow server | +| `experiment_name_formatter` | `str` | Optional format string to control the experiment name for models stored in MLflow, default is `/dfp-models/{reg_model_name}`. Currently available field names are: `user_id`, `user_md5` and `reg_model_name` which is the model name as defined by `model_name_formatter` once the field names have been applied. 
|
+| `databricks_permissions` | `dict` or `None` | Optional, when not `None` sets permissions needed when using a Databricks hosted MLflow server |

 > **Note:** If using a remote MLflow server, users will need to call [`mlflow.set_tracking_uri`](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_tracking_uri) before starting the pipeline.

@@ -342,21 +346,21 @@ The {py:obj}`~dfp.stages.dfp_mlflow_model_writer.DFPMLFlowModelWriterStage` stag
 ### Inference Stages

 #### Inference Stage (`DFPInferenceStage`)
-The {py:obj}`~dfp.stages.dfp_inference_stage.DFPInferenceStage` stage loads models from MLflow and performs inferences against those models. This stage emits a message containing the original `DataFrame` along with new columns containing the z score (`mean_abs_z`), as well as the name and version of the model that generated that score (`model_version`). For each feature in the model, three additional columns will also be added:
+The {py:obj}`~dfp.stages.dfp_inference_stage.DFPInferenceStage` stage loads models from MLflow and performs inferences against those models. This stage emits a message containing the original `DataFrame` along with new columns containing the z score (`mean_abs_z`), as well as the name and version of the model that generated that score (`model_version`). For each feature in the model, three additional columns will also be added:
 * `_loss` : The loss
 * `_z_loss` : The loss z-score
 * `_pred` : The predicted value

 For a hypothetical feature named `result`, the three added columns will be: `result_loss`, `result_z_loss`, `result_pred`.

-For performance models fetched from MLflow are cached locally and are cached for up to 10 minutes allowing updated models to be routinely updated. In addition to caching individual models, the stage also maintains a cache of which models are available, so a newly trained user model published to MLflow won't be visible to an already running inference pipeline for up to 10 minutes.
+For performance, models fetched from MLflow are cached locally for up to 10 minutes, allowing updated models to be picked up routinely. In addition to caching individual models, the stage also maintains a cache of which models are available, so a newly trained user model published to MLflow won't be visible to an already running inference pipeline for up to 10 minutes.

-For any user without an associated model in MLflow, the model for the generic user is used. The name of the generic user is defined in the `ae.fallback_username` attribute of the Morpheus config object defaults to `generic_user`.
+For any user without an associated model in MLflow, the model for the generic user is used. The name of the generic user is defined in the `ae.fallback_username` attribute of the Morpheus configuration object and defaults to `generic_user`.

 | Argument | Type | Description |
 | -------- | ---- | ----------- |
-| `c` | `morpheus.config.Config` | Morpheus config object |
-| `model_name_formatter` | `str` | Format string to control the name of models fetched from MLflow. Currently available field names are: `user_id` and `user_md5` which is an md5 hexadecimal digest as returned by [`hash.hexdigest`](https://docs.python.org/3.10/library/hashlib.html?highlight=hexdigest#hashlib.hash.hexdigest). |
+| `c` | `morpheus.config.Config` | Morpheus configuration object |
+| `model_name_formatter` | `str` | Format string to control the name of models fetched from MLflow.
Currently available field names are: `user_id` and `user_md5` which is an md5 hexadecimal digest as returned by [`hash.hexdigest`](https://docs.python.org/3.10/library/hashlib.html?highlight=hexdigest#hashlib.hash.hexdigest). | #### Filter Detection Stage (`FilterDetectionsStage`) The {py:obj}`~morpheus.stages.postprocess.filter_detections_stage.FilterDetectionsStage` stage filters the output from the inference stage for any anomalous messages. Logs which exceed the specified Z-Score will be passed onto the next stage. All remaining logs which are below the threshold will be dropped. For the purposes of the DFP pipeline, this stage is configured to use the `mean_abs_z` column of the DataFrame as the filter criteria. @@ -364,7 +368,7 @@ The {py:obj}`~morpheus.stages.postprocess.filter_detections_stage.FilterDetectio | Name | Type | Default | Description | | --- | --- | --- | :-- | | `threshold` | `float` | `0.5` | The threshold value above which logs are considered to be anomalous. The default is `0.5`; however, the DFP pipeline uses a value of `2.0`. All normal logs will be filtered out and anomalous logs will be passed on. | -| `copy` | `bool` | `True` | When the `copy` argument is `True` (default), rows that meet the filter criteria are copied into a new dataframe. When `False` sliced views are used instead. This is a performance optimization, and has no functional impact. | +| `copy` | `bool` | `True` | When the `copy` argument is `True` (default), rows that meet the filter criteria are copied into a new DataFrame. When `False` sliced views are used instead. This is a performance optimization, and has no functional impact. | | `filter_source` | `FilterSource` | `FilterSource.Auto` | Indicates if the filter criteria exists in an output tensor (`FilterSource.TENSOR`) or a column in a DataFrame (`FilterSource.DATAFRAME`). | | `field_name` | `str` | `probs` | Name of the tensor (`filter_source=FilterSource.TENSOR`) or DataFrame column (`filter_source=FilterSource.DATAFRAME`) to use as the filter criteria. | diff --git a/docs/source/developer_guide/guides/7_python_modules.md b/docs/source/developer_guide/guides/7_python_modules.md index 4bb1e64a3b..82bf5a71ea 100644 --- a/docs/source/developer_guide/guides/7_python_modules.md +++ b/docs/source/developer_guide/guides/7_python_modules.md @@ -21,9 +21,9 @@ limitations under the License. Morpheus makes use of the MRC graph-execution framework. Morpheus pipelines are built on top of MRC pipelines, which are comprised of collections of nodes and edges called segments (think sub-graphs), which can in turn be connected by ingress/egress ports. In many common cases, an MRC pipeline will consist of only a single segment. While Morpheus stages are the primary building blocks of Morpheus pipelines, Morpheus modules can be thought of as a way to define basic units of work, which can in turn be composed and (re)used to build more complex stages. Modules can be written in Python or C++. -## The Passthrough Module +## The Pass-through Module -The `passthrough` module is a simple module that takes a single input port and a single output port. It simply passes it forward, in much the same way that the example stage defined in the [Simple Python Stage](./1_simple_python_stage.md) does; however, it only defines the actual unit of work, and must then be loaded either as its own Morpheus stage, or within the context of another stage in order to be used. +The pass-through module is a simple module that takes a single input port and a single output port. 
It simply passes it forward, in much the same way that the example stage defined in the [Simple Python Stage](./1_simple_python_stage.md) does; however, it only defines the actual unit of work, and must then be loaded either as its own Morpheus stage, or within the context of another stage in order to be used. ### Module Definition and Registration diff --git a/docs/source/developer_guide/guides/8_cpp_modules.md b/docs/source/developer_guide/guides/8_cpp_modules.md index a35f6a9f0f..dce2afa880 100644 --- a/docs/source/developer_guide/guides/8_cpp_modules.md +++ b/docs/source/developer_guide/guides/8_cpp_modules.md @@ -21,7 +21,7 @@ limitations under the License. See [Simple Python Module](./7_python_modules.md) for an introduction to Morpheus modules. -## The Passthrough Module +## The Pass-through Module The following example will create a simple C++ module that passes through the input data without modification. This module will be written in C++ and would be compiled into the Morpheus core library. diff --git a/docs/source/examples.md b/docs/source/examples.md index 8596e7de5d..3c7b8bc424 100644 --- a/docs/source/examples.md +++ b/docs/source/examples.md @@ -29,7 +29,7 @@ Ensure the environment is set up by following [Getting Started with Morpheus](./ * [Agents](../../examples/llm/agents/README.md) * [Completion](../../examples/llm/completion/README.md) * [VDB Upload](../../examples/llm/vdb_upload/README.md) - * [Retreival Augmented Generation (RAG)](../../examples/llm/rag/README.md) + * [Retrieval Augmented Generation (RAG)](../../examples/llm/rag/README.md) ## Environments diff --git a/docs/source/examples/llm/README.md b/docs/source/examples/llm/README.md index f84656cf6a..dec8d88dfe 100644 --- a/docs/source/examples/llm/README.md +++ b/docs/source/examples/llm/README.md @@ -17,7 +17,7 @@ limitations under the License. # LLM -- [completion](./completion/README.md) -- [vdb_upload](./vdb_upload/README.md) -- [rag](./rag/README.md) -- [agents](./agents/README.md) \ No newline at end of file +- [`completion`](./completion/README.md) +- [`vdb_upload`](./vdb_upload/README.md) +- [`rag`](./rag/README.md) +- [`agents`](./agents/README.md) diff --git a/docs/source/extra_info/glossary.md b/docs/source/extra_info/glossary.md index a369ad5c1c..fe982cd9d9 100644 --- a/docs/source/extra_info/glossary.md +++ b/docs/source/extra_info/glossary.md @@ -30,7 +30,7 @@ A Helm Chart for deploying the infrastructure of Morpheus. It includes the [Trit ## Morpheus SDK CLI A Helm Chart that deploys the Morpheus container. Refer to [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/morpheus/helm-charts/morpheus-sdk-client](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/morpheus/helm-charts/morpheus-sdk-client) -## morpheus-sdk-client +## `morpheus-sdk-client` Another name for the [Morpheus SDK CLI](#morpheus-sdk-cli) Helm Chart. ## MRC diff --git a/docs/source/extra_info/known_issues.md b/docs/source/extra_info/known_issues.md index af9cb91077..7feb1dca62 100644 --- a/docs/source/extra_info/known_issues.md +++ b/docs/source/extra_info/known_issues.md @@ -18,6 +18,6 @@ limitations under the License. 
# Known Issues - TrainAEStage fails with a Segmentation fault ([#1641](https://github.com/nv-morpheus/Morpheus/issues/1641)) -- vdb_upload example pipeline triggers an internal error in Triton ([#1649](https://github.com/nv-morpheus/Morpheus/issues/1649)) +- `vdb_upload` example pipeline triggers an internal error in Triton ([#1649](https://github.com/nv-morpheus/Morpheus/issues/1649)) Refer to [open issues in the Morpheus project](https://github.com/nv-morpheus/Morpheus/issues) diff --git a/docs/source/extra_info/troubleshooting.md b/docs/source/extra_info/troubleshooting.md index 6dd62a3121..5f3b14be20 100644 --- a/docs/source/extra_info/troubleshooting.md +++ b/docs/source/extra_info/troubleshooting.md @@ -56,7 +56,7 @@ loaded_model = model_cache.load_model(self._client) ModuleNotFoundError: No module named 'dfencoder' ``` -The work arounds available for this issue are: +The workarounds available for this issue are: * Revert to the previous version of Morpheus until the models can be re-trained. * Re-train the model using the current version of Morpheus @@ -76,7 +76,7 @@ docker compose up mlflow **Debugging Python Code** -To debug issues in python code, several Visual Studio Code launch configurations have been included in the repo. These launch configurations can be found in `${MORPHEUS_ROOT}/morpheus.code-workspace`. To launch the debugging environment, ensure Visual Studio Code has opened the morpheus workspace file (File->Open Workspace from File...). Once the workspace has been loaded, the launch configurations should be available in the debugging tab. +To debug issues in python code, several Visual Studio Code launch configurations have been included in the repo. These launch configurations can be found in `${MORPHEUS_ROOT}/morpheus.code-workspace`. To launch the debugging environment, ensure Visual Studio Code has opened the Morpheus workspace file (File->Open Workspace from File...). Once the workspace has been loaded, the launch configurations should be available in the debugging tab. **Debugging C++ Code** diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 7abe40bd1f..a8c55c2741 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -30,7 +30,7 @@ More advanced users, or those who are interested in using the latest pre-release - Volta architecture GPU or better - [CUDA 12.1](https://developer.nvidia.com/cuda-12-1-0-download-archive) - [Docker](https://docs.docker.com/get-docker/) -- [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) +- [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation) - [NVIDIA Triton Inference Server](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) `23.06` or higher > **Note about Docker:** @@ -57,7 +57,7 @@ More advanced users, or those who are interested in using the latest pre-release > Users who want to ensure they are running with the latest bug fixes should use a release image tag (`YY.MM-runtime`). Users who need to deploy a specific version into production should use a point release image tag (`vYY.MM.00-runtime`). ### Starting the Morpheus Container -1. Ensure that [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) is installed. +1. 
Ensure that [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation) is installed. 1. Start the container downloaded from the previous section: ```bash docker run --rm -ti --runtime=nvidia --gpus=all --net=host -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/morpheus/morpheus:24.10-runtime bash @@ -67,7 +67,7 @@ Note about some of the flags above: | Flag | Description | | ---- | ----------- | | `--runtime=nvidia` | Choose the NVIDIA docker runtime, this enables access to the GPU inside the container. This flag isn't needed if the `nvidia` runtime is already set as the default runtime for Docker. | -| `--gpus=all` | Specify which GPUs the container has access to. Alternately, a specific GPU could be chosen with `--gpus=` | +| `--gpus=all` | Specify which GPUs the container has access to. Alternately, a specific GPU could be chosen with `--gpus=` | | `--net=host` | Most of the Morpheus pipelines utilize [NVIDIA Triton Inference Server](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver), which will be running in another container. For simplicity we will give the container access to the host system's network, production deployments may opt for an explicit network configuration. | | `-v /var/run/docker.sock:/var/run/docker.sock` | Enables access to the Docker socket file from within the running container, this allows launching other Docker containers from within the Morpheus container. This flag is required for launching Triton with access to the included Morpheus models, users with their own models can omit this. | @@ -101,10 +101,10 @@ scripts/fetch_data.py fetch [...] ``` At time of writing the defined datasets are: -* all - Metaset includes all others +* all - Meta-set includes all others * datasets - Input files needed for many of the examples * docs - Graphics needed for documentation -* examples - Data needed by scripts in the `examples` subdir +* examples - Data needed by scripts in the `examples` directory * models - Morpheus models (largest dataset) * tests - Data used by unittests * validation - Subset of the models dataset needed by some unittests @@ -173,7 +173,7 @@ docker run --rm -ti --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 \ This will launch Triton using the default network ports (8000 for HTTP, 8001 for GRPC, and 8002 for metrics), loading all of the examples models in the Morpheus repo. -Note: The above command is useful for testing out Morpheus, however it does load several models into GPU memory, which at time of writing consumes roughly 2GB of GPU memory. Production users should consider only loading the specific model(s) they plan on using with the `--model-control-mode=explicit` and `--load-model` flags. For example to launch Triton only loading the `abp-nvsmi-xgb` model: +Note: The above command is useful for testing out Morpheus, however it does load several models into GPU memory, which at time of writing consumes roughly 2GB of GPU memory. Production users should consider only loading the specific models they plan on using with the `--model-control-mode=explicit` and `--load-model` flags. For example to launch Triton only loading the `abp-nvsmi-xgb` model: ```bash docker run --rm -ti --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 \ nvcr.io/nvidia/morpheus/morpheus-tritonserver-models:24.10 \ @@ -352,8 +352,8 @@ Commands: deserialize Messages are logically partitioned based on the pipeline config's `pipeline_batch_size` parameter. 
dropna Drop null data entries from a DataFrame. filter Filter message by a classification threshold. - from-appshield Source stage is used to load Appshield messages from one or more plugins into a dataframe. It normalizes nested json messages and arranges them - into a dataframe by snapshot and source. + from-appshield Source stage is used to load Appshield messages from one or more plugins into a DataFrame. It normalizes nested json messages and arranges them + into a DataFrame by snapshot and source. from-file Load messages from a file. from-kafka Load messages from a Kafka cluster. inf-identity Perform inference for testing that performs a no-op. @@ -384,7 +384,7 @@ Commands: delay (Deprecated) Delay results for a certain duration. filter Filter message by a classification threshold. from-azure Source stage is used to load Azure Active Directory messages. - from-cloudtrail Load messages from a Cloudtrail directory. + from-cloudtrail Load messages from a CloudTrail directory. from-duo Source stage is used to load Duo Authentication messages. inf-pytorch Perform inference with PyTorch. inf-triton Perform inference with Triton Inference Server. diff --git a/docs/source/loaders/core/file_to_df_loader.md b/docs/source/loaders/core/file_to_df_loader.md index a80fe17687..39fe7233ae 100644 --- a/docs/source/loaders/core/file_to_df_loader.md +++ b/docs/source/loaders/core/file_to_df_loader.md @@ -17,13 +17,13 @@ limitations under the License. ## File to DataFrame Loader -[DataLoader](../../modules/core/data_loader.md) module is used to load data files content into a dataframe using custom loader function. This loader function can be configured to use different processing methods, such as single-threaded, dask, or dask_thread, as determined by the `MORPHEUS_FILE_DOWNLOAD_TYPE` environment variable. When download_method starts with "dask," a dask client is created to process the files, otherwise, a single thread is used. +[DataLoader](../../modules/core/data_loader.md) module is used to load data files content into a DataFrame using custom loader function. This loader function can be configured to use different processing methods, such as `"single_thread"`, `"dask"`, or `"dask_thread"`, as determined by the `MORPHEUS_FILE_DOWNLOAD_TYPE` environment variable. When `download_method` is `"dask"`, or `"dask_thread"`, a Dask client is created to process the files, otherwise, a single thread is used. -After processing, the resulting dataframe is cached using a hash of the file paths. This loader also has the ability to load file content from S3 buckets, in addition to loading data from the disk. +After processing, the resulting DataFrame is cached using a hash of the file paths. This loader also has the ability to load file content from S3 buckets, in addition to loading data from the disk. ### Example Loader Configuration -Using below configuration while loading DataLoader module, specifies that the DataLoader module should utilize the `file_to_df` loader when loading files into a dataframe. +Using below configuration while loading DataLoader module, specifies that the DataLoader module should utilize the `file_to_df` loader when loading files into a DataFrame. 
```json { @@ -41,21 +41,21 @@ The parameters that can be configured for this specific loader at load task leve | Parameter | Type | Description | Example Value | Default Value | | ------------------ | ---------- | -------------------------------- | ------------------------ | -------------- | -| `batcher_config ` | dictionary | Options for batching | See below | `[Required]` | -| `files` | array | List of files to load | ["/path/to/input/files"] | `[]` | -| `loader_id` | string | Unique identifier for the loader | "file_to_df" | `[Required]` | +| `batcher_config ` | dictionary | Options for batching | Refer Below | `[Required]` | +| `files` | array | List of files to load | `["/path/to/input/files"]` | `[]` | +| `loader_id` | string | Unique identifier for the loader | `"file_to_df"` | `[Required]` | ### `batcher_config` | Key | Type | Description | Example Value | Default Value | |-------------------------|------------|--------------------------------------------|----------------------|---------------| -| `cache_dir` | string | Directory to cache the rolling window data | "/path/to/cache" | `-` | -| `file_type` | string | Type of the input file | "csv" | `"JSON"` | -| `filter_null` | boolean | Whether to filter out null values | true | `false` | -| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser | {"delimiter": ","} | `-` | -| `schema` | dictionary | Schema of the input data | See Below | `-` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `-` | +| `cache_dir` | string | Directory to cache the rolling window data | `"/path/to/cache"` | `-` | +| `file_type` | string | Type of the input file | `"csv"` | `"JSON"` | +| `filter_null` | boolean | Whether to filter out null values | `true` | `false` | +| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser | `{"delimiter": ","}` | `-` | +| `schema` | dictionary | Schema of the input data | Refer Below | `-` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `-` | ### Example Load Task Configuration diff --git a/docs/source/loaders/core/fsspec_loader.md b/docs/source/loaders/core/fsspec_loader.md index e21ec94049..7e503349fd 100644 --- a/docs/source/loaders/core/fsspec_loader.md +++ b/docs/source/loaders/core/fsspec_loader.md @@ -17,7 +17,7 @@ limitations under the License. ## Filesystem Spec Loader -[DataLoader](../../modules/core/data_loader.md) module is configured to use this loader function. It is responsible for loading data from external sources using the fsspec library, and returns the updated ControlMessage object with payload as MessageMeta, which contains dataframe (with filenames). +[DataLoader](../../modules/core/data_loader.md) module is configured to use this loader function. It is responsible for loading data from external sources using the [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) library, and returns the updated ControlMessage object with payload as MessageMeta, which contains DataFrame (with filenames). 
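As background, the path expansion this loader relies on can be sketched directly with `fsspec` (a standalone illustration rather than the loader's own code; the wildcard path below is made up):

```python
import fsspec

# Expand a wildcard (remote URLs such as "s3://bucket/prefix/*.json" behave the
# same way) into a list of OpenFile objects without reading any file contents.
files = fsspec.open_files("./input/AUTH_LOG-*.json")
for open_file in files:
    print(open_file.path)
```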
### Example Loader Configuration @@ -38,8 +38,8 @@ The parameters that can be configured for this specific loader at load task leve | Parameter | Type | Description | Example Value | Default Value | | ------------------ | ---------- | -------------------------------- | --------------------------------- | -------------- | -| `files` | array | List of files to load | ["/your/input/filepath"] | `[]` | -| `loader_id` | string | Unique identifier for the loader | "file_to_df" | `[Required]` | +| `files` | array | List of files to load | `["/your/input/filepath"]` | `[]` | +| `loader_id` | string | Unique identifier for the loader | `"file_to_df"` | `[Required]` | diff --git a/docs/source/loaders/core/rest_to_df_loader.md b/docs/source/loaders/core/rest_to_df_loader.md index 950c697ce2..07e98dd6f2 100644 --- a/docs/source/loaders/core/rest_to_df_loader.md +++ b/docs/source/loaders/core/rest_to_df_loader.md @@ -17,11 +17,11 @@ limitations under the License. ## REST to DataFrame Loader -[DataLoader](../../modules/core/data_loader.md) module is used to load data files content into a dataframe using custom loader function. This loader function can be configured to send REST requests with customized parameters to retrieve data from endpoints. See below for the specific configuration format. +[DataLoader](../../modules/core/data_loader.md) module is used to load data files content into a DataFrame using custom loader function. This loader function can be configured to send REST requests with customized parameters to retrieve data from endpoints. Refer Below for the specific configuration format. ### Example Loader Configuration -Using below configuration while loading DataLoader module, specifies that the DataLoader module should utilize the `rest` loader when loading files into a dataframe. +Using below configuration while loading DataLoader module, specifies that the DataLoader module should utilize the `rest` loader when loading files into a DataFrame. 
```json { @@ -39,23 +39,23 @@ The parameters that can be configured for this specific loader at load task leve | Parameter | Type | Description | Example Value | Default Value | | ----------- | ------ | ----------------------------------- | ------------- | ------------- | -| `loader_id` | string | Unique identifier for the loader | "rest" | `[Required]` | -| `strategy` | string | Strategy for constructing dataframe | "aggregate" | `[Required]` | -| `queries` | array | parameters of REST queries | See below | `[Required]` | +| `loader_id` | string | Unique identifier for the loader | `"rest"` | `[Required]` | +| `strategy` | string | Strategy for constructing DataFrame | `"aggregate"` | `[Required]` | +| `queries` | array | parameters of REST queries | Refer Below | `[Required]` | ### `queries` | Key | Type | Description | Example Value | Default Value | | -------------- | ---------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------------- | -| `method` | string | Method of request | "GET" | `"GET"` | -| `endpoint` | string | Endpoint of request | "0.0.0.0/path/to/target?param1=true" | `[Required]` | -| `port` | string | Target port of request | "80" | `"80"` | -| `http_version` | string | HTTP version of request | "1.1" | `"1.1"` | -| `content_type` | string | Content type of request body in a POST request | "text/plain" | `-` | -| `body` | string | Request body in a POST request | "param1=true¶m2=false" | `-` | -| `X-Headers` | dictionary | Customized X-Headers of request | "{"X-Header1":"header1", "X-Header2":"header2"}" | `-` | -| `params` | array | Parameters of requested URL, override values included in endpoint | "[{"param1": "true", "param2":"false"}, {"param1": "false", "param2":"true"}]" | `-` | +| `method` | string | Method of request | `"GET"` | `"GET"` | +| `endpoint` | string | Endpoint of request | `"0.0.0.0/path/to/target?param1=true"` | `[Required]` | +| `port` | string | Target port of request | `"80"` | `"80"` | +| `http_version` | string | HTTP version of request | `"1.1"` | `"1.1"` | +| `content_type` | string | Content type of request body in a POST request | `"text/plain"` | `-` | +| `body` | string | Request body in a POST request | `"param1=true¶m2=false"` | `-` | +| `X-Headers` | dictionary | Customized X-Headers of request | `'{"X-Header1":"header1", "X-Header2":"header2"}'` | `-` | +| `params` | array | Parameters of requested URL, override values included in endpoint | `'[{"param1": "true", "param2":"false"}, {"param1": "false", "param2":"true"}]'` | `-` | ### Example Load Task Configuration @@ -96,4 +96,4 @@ Below JSON configuration specifies how to pass additional configuration to the l ] } } -``` \ No newline at end of file +``` diff --git a/docs/source/loaders/core/sql_loader.md b/docs/source/loaders/core/sql_loader.md index 7a5318f746..c9b148a1ad 100644 --- a/docs/source/loaders/core/sql_loader.md +++ b/docs/source/loaders/core/sql_loader.md @@ -41,23 +41,23 @@ The parameters that can be configured for this specific loader at load task leve | Parameter | Type | Description | Example Value | Default Value | |--------------|------------|------------------------------------------|--------------------|---------------| -| `strategy` | string | Strategy for combining queries | "aggregate" | `aggregate` | -| `loader_id` | string | Unique identifier for the loader | "file_to_df" | `[Required]` | -| `sql_config` | dictionary | Dictionary containing SQL queries 
to run | "file_to_df" | `See below` | +| `strategy` | string | Strategy for combining queries | `"aggregate"` | `aggregate` | +| `loader_id` | string | Unique identifier for the loader | `"file_to_df"` | `[Required]` | +| `sql_config` | dictionary | Dictionary containing SQL queries to run | `"file_to_df"` | Refer Below | `sql_config` | Parameter | Type | Description | Example Value | Default Value | |-----------|------|---------------------------------------------------|--------------------------------------------|---------------| -| `queries` | list | List of dictionaries composing a query definition | "[query_dict_1, ..., query_dict_n]" | `See below` | +| `queries` | list | List of dictionaries composing a query definition | `"[query_dict_1, ..., query_dict_n]"` | Refer Below | `queries` | Parameter | Type | Description | Example Value | Default Value | |---------------------|------------|--------------------------------------|-----------------------------------------------------------------|---------------| -| `connection_string` | string | Strategy for combining queries | "postgresql://postgres:postgres@localhost:5432/postgres" | `[required]` | -| `query` | string | SQL Query to execute | "SELECT * FROM test_table WHERE id IN (?, ?, ?)" | `[Required]` | -| `params` | dictionary | Named or positional paramters values | "[foo, bar, baz]" | `-` | +| `connection_string` | string | Strategy for combining queries | `"postgresql://postgres:postgres@localhost:5432/postgres"` | `[required]` | +| `query` | string | SQL Query to execute | `"SELECT * FROM test_table WHERE id IN (?, ?, ?)"` | `[Required]` | +| `params` | dictionary | Named or positional parameters values | `"[foo, bar, baz]"` | `-` | ### Example Load Task Configuration diff --git a/docs/source/loaders/index.md b/docs/source/loaders/index.md index 9e68c342a8..cc92a8d684 100644 --- a/docs/source/loaders/index.md +++ b/docs/source/loaders/index.md @@ -17,9 +17,7 @@ limitations under the License. # Loaders -Custom functions called "Loaders" can be utilized by the DataLoader Module to load data into the pipeline. The user can -choose to register their own customized loader function and add it to a dataloader registry, which will then become -accessible to the DataLoader module during module loading. +Custom functions called "Loaders" can be utilized by the DataLoader Module to load data into the pipeline. The user can choose to register their own customized loader function and add it to a data loader registry, which will then become accessible to the DataLoader module during module loading. **Note** : Loaders receive configuration from the `load` task via [control message](../../developer_guide/guides/9_control_messages.md) during runtime. diff --git a/docs/source/models_and_datasets.md b/docs/source/models_and_datasets.md index 1ab12f7650..c204a2fea9 100644 --- a/docs/source/models_and_datasets.md +++ b/docs/source/models_and_datasets.md @@ -21,8 +21,8 @@ Morpheus comes with a number of pre-trained models with corresponding training, |Model|GPU Mem Req|Description| |-----|-----------|-----------| -|Anomalous Behavior Profiling (ABP)|2015MiB|This model is an example of a binary classifier to differentiate between anomalous GPU behavior such as crypto mining / GPU malware, and non-anomalous GPU-based workflows (for example, ML/DL training). 
The model is an XGBoost model.| +|Anomalous Behavior Profiling (ABP)|2015MiB|This model is an example of a binary classifier to differentiate between anomalous GPU behavior such as cryptocurrency mining / GPU malware, and non-anomalous GPU-based workflows (for example, ML/DL training). The model is an XGBoost model.| |Digital Fingerprinting (DFP)|4.97MiB|This use case is currently implemented to detect changes in a users' behavior that indicates a change from a human to a machine or a machine to a human. The model is an ensemble of an Autoencoder and fast Fourier transform reconstruction.| |Fraud Detection|76.55MiB|This model shows an application of a graph neural network for fraud detection in a credit card transaction graph. A transaction dataset that includes three types of nodes, transaction, client, and merchant nodes is used for modeling. A combination of [GraphSAGE](https://snap.stanford.edu/graphsage/) along with [XGBoost](https://xgboost.readthedocs.io/en/stable/) is used to identify frauds in the transaction networks.| -|Ransomware Detection Model|n/a|This model shows an application of DOCA AppShield to use data from volatile memory to classify processes as ransomware or bengin. This model uses a sliding window over time and feeds derived data into a random forest classifiers of various lengths depending on the amount of data collected.| +|Ransomware Detection Model|n/a|This model shows an application of DOCA AppShield to use data from volatile memory to classify processes as ransomware or benign. This model uses a sliding window over time and feeds derived data into a random forest classifiers of various lengths depending on the amount of data collected.| |Flexible Log Parsing|1612MiB|This model is an example of using Named Entity Recognition (NER) for log parsing, specifically [Apache HTTP Server](https://httpd.apache.org/) logs.| diff --git a/docs/source/modules/core/data_loader.md b/docs/source/modules/core/data_loader.md index eb175206b3..530399e5c3 100644 --- a/docs/source/modules/core/data_loader.md +++ b/docs/source/modules/core/data_loader.md @@ -25,7 +25,7 @@ are specified in the module configuration file at the time of object constructio | Parameter | Type | Description | Example Value | Default Value | |-----------|-------|---------------------------------------------------|---------------|-----------------| -| `loaders` | array | An array containing information on loaders to use | See Below | `[]` | +| `loaders` | array | An array containing information on loaders to use | Refer Below | `[]` | ### `loaders` diff --git a/docs/source/modules/core/file_batcher.md b/docs/source/modules/core/file_batcher.md index fe7e7b4006..da5d12ec76 100644 --- a/docs/source/modules/core/file_batcher.md +++ b/docs/source/modules/core/file_batcher.md @@ -24,18 +24,18 @@ remaining files by period that fall inside the window. 
| Parameter | Type | Description | Example Value | Default Value | |-------------------------|------------|-------------------------------|------------------------|---------------| -| `batching_options` | dictionary | Options for batching | See below | `-` | +| `batching_options` | dictionary | Options for batching | Refer Below | `-` | | `cache_dir` | string | Cache directory | "./file_batcher_cache" | `None` | | `file_type` | string | File type | "JSON" | `"JSON"` | | `filter_nulls` | boolean | Whether to filter null values | false | `false` | -| `schema` | dictionary | Data schema | See below | `[Required]` | +| `schema` | dictionary | Data schema | Refer Below | `[Required]` | | `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `"timestamp"` | ### `batching_options` | Key | Type | Description | Example Value | Default Value | |--------------------------|-----------------|-------------------------------------|---------------------------------------------|----------------------------| -| `end_time` | datetime/string | Endtime of the time window | "2023-03-14T23:59:59" | `None` | +| `end_time` | datetime/string | End of the time window | "2023-03-14T23:59:59" | `None` | | `iso_date_regex_pattern` | string | Regex pattern for ISO date matching | "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}" | `` | | `parser_kwargs` | dictionary | Additional arguments for the parser | {} | `{}` | | `period` | string | Time period for grouping files | "1d" | `"D"` | diff --git a/docs/source/modules/core/file_to_df.md b/docs/source/modules/core/file_to_df.md index ec3133d39f..f5c6e121c8 100644 --- a/docs/source/modules/core/file_to_df.md +++ b/docs/source/modules/core/file_to_df.md @@ -17,19 +17,19 @@ limitations under the License. ## File to DataFrame Module -This module reads data from the batched files into a dataframe after receiving input from the "FileBatcher" module. In +This module reads data from the batched files into a DataFrame after receiving input from the `FileBatcher` module. In addition to loading data from the disk, it has the ability to load the file content from S3 buckets. 
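Assuming this module shares the download machinery described for the `file_to_df` loader above, the loading method would be selected through the `MORPHEUS_FILE_DOWNLOAD_TYPE` environment variable before the pipeline starts; for example:

```python
import os

# Assumption: this module honors the same MORPHEUS_FILE_DOWNLOAD_TYPE setting as
# the file_to_df loader. "dask_thread" is the default, while "single_thread"
# disables concurrent loading, which can simplify debugging.
os.environ["MORPHEUS_FILE_DOWNLOAD_TYPE"] = "single_thread"
```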
### Configurable Parameters | Parameter | Type | Description | Example Value | Default Value | |-------------------------|------------|--------------------------------------------|----------------------|---------------| -| `cache_dir` | string | Directory to cache the rolling window data | "/path/to/cache" | `-` | -| `file_type` | string | Type of the input file | "csv" | `"JSON"` | -| `filter_null` | boolean | Whether to filter out null values | true | `false` | -| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser | {"delimiter": ","} | `-` | -| `schema` | dictionary | Schema of the input data | See Below | `-` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `-` | +| `cache_dir` | string | Directory to cache the rolling window data | `"/path/to/cache"` | `-` | +| `file_type` | string | Type of the input file | `"csv"` | `"JSON"` | +| `filter_null` | boolean | Whether to filter out null values | `True` | `False` | +| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser | `{"delimiter": ","}` | `-` | +| `schema` | dictionary | Schema of the input data | Refer Below | `-` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `-` | ### Example JSON Configuration diff --git a/docs/source/modules/core/filter_control_message.md b/docs/source/modules/core/filter_control_message.md index 085104f569..a24fc8cb38 100644 --- a/docs/source/modules/core/filter_control_message.md +++ b/docs/source/modules/core/filter_control_message.md @@ -23,10 +23,10 @@ When the requirements are met, this module gently discards the control messages. | Parameter | Type | Description | Example Value | Default Value | |------------------------------|---------|--------------------------------------|---------------------|---------------| -| `enable_data_type_filtering` | boolean | Enables filtering based on data type | true | `false` | -| `enable_task_filtering` | boolean | Enables filtering based on task type | true | `false` | -| `filter_data_type` | string | The data type to be used as a filter | "desired_data_type" | `None` | -| `filter_task_type` | string | The task type to be used as a filter | "specific_task" | `None` | +| `enable_data_type_filtering` | boolean | Enables filtering based on data type | `true` | `false` | +| `enable_task_filtering` | boolean | Enables filtering based on task type | `true` | `false` | +| `filter_data_type` | string | The data type to be used as a filter | `"desired_data_type"` | `None` | +| `filter_task_type` | string | The task type to be used as a filter | `"specific_task"` | `None` | ### Example JSON Configuration diff --git a/docs/source/modules/core/filter_detections.md b/docs/source/modules/core/filter_detections.md index dd26d8f5b9..a3328a1cdd 100644 --- a/docs/source/modules/core/filter_detections.md +++ b/docs/source/modules/core/filter_detections.md @@ -19,27 +19,27 @@ limitations under the License. Filter message by a classification threshold. -The Filter Detections module is used to filter rows from a dataframe based on values in a tensor using a specified -criteria. Rows in the `meta` dataframe are excluded if their associated value in the `probs` array is less than or equal +The Filter Detections module is used to filter rows from a DataFrame based on values in a tensor using specified +criteria. Rows in the `meta` DataFrame are excluded if their associated value in the `probs` array is less than or equal to `threshold`. 
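Sketching the parameters documented below with placeholder values (and leaving out the pickled `schema` block), a configuration could resemble:

```json
{
  "copy": true,
  "field_name": "probs",
  "filter_source": "AUTO",
  "threshold": 0.5
}
```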
### Configurable Parameters | Parameter | Type | Description | Example Value | Default Value | |-----------------|------------|----------------------------------------|---------------|-----------------| -| `copy` | boolean | Whether to copy the rows or slice them | true | `true` | -| `field_name` | string | Name of the field to filter on | "probs" | `probs` | -| `filter_source` | string | Source of the filter field | "AUTO" | `AUTO` | -| `schema` | dictionary | Schema configuration | See Below | `-` | -| `threshold` | float | Threshold value to filter on | 0.5 | `0.5` | +| `copy` | boolean | Whether to copy the rows or slice them | `true` | `true` | +| `field_name` | string | Name of the field to filter on | `"probs"` | `"probs"` | +| `filter_source` | string | Source of the filter field | `"AUTO"` | `"AUTO"` | +| `schema` | dictionary | Schema configuration | Refer Below | `-` | +| `threshold` | float | Threshold value to filter on | `0.5` | `0.5` | ### `schema` | Key | Type | Description | Example Value | Default Value | |----------------------|--------|----------------------|-----------------------|---------------| -| `encoding` | string | Encoding | "latin1" | `latin1` | -| `input_message_type` | string | Pickled message type | "pickle_message_type" | `[Required]` | -| `schema_str` | string | Schema string | "string" | `[Required]` | +| `encoding` | string | Encoding | `"latin1"` | `"latin1"` | +| `input_message_type` | string | Pickled message type | `"pickle_message_type"` | `[Required]` | +| `schema_str` | string | Schema string | `"string"` | `[Required]` | ### Example JSON Configuration @@ -55,12 +55,3 @@ to `threshold`. } } ``` - -### Default Settings - -| Property | Value | -| -------------| --------| -| copy | False | -| field_name | probs | -| filter_source| AUTO | -| threshold | 0.5 | diff --git a/docs/source/modules/core/mlflow_model_writer.md b/docs/source/modules/core/mlflow_model_writer.md index 1ba2161323..fd3f310d4f 100644 --- a/docs/source/modules/core/mlflow_model_writer.md +++ b/docs/source/modules/core/mlflow_model_writer.md @@ -23,18 +23,18 @@ This module uploads trained models to the MLflow server. 
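A hypothetical configuration combining the parameters listed below, with placeholder values rather than the repository's own example, might be:

```json
{
  "conda_env": "path/to/conda_env.yml",
  "databricks_permissions": {
    "read": ["read_user1"],
    "write": ["write_user1"]
  },
  "experiment_name_formatter": "experiment_name_{timestamp}",
  "model_name_formatter": "model_name_{timestamp}",
  "timestamp_column_name": "timestamp"
}
```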
| Parameter | Type | Description | Example Value | Default Value | |-----------------------------|------------|-----------------------------------|-------------------------------|---------------| -| `conda_env` | string | Conda environment for the model | "path/to/conda_env.yml" | `[Required]` | -| `databricks_permissions` | dictionary | Permissions for the model | See Below | `None` | -| `experiment_name_formatter` | string | Formatter for the experiment name | "experiment_name_{timestamp}" | `[Required]` | -| `model_name_formatter` | string | Formatter for the model name | "model_name_{timestamp}" | `[Required]` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `conda_env` | string | Conda environment for the model | `"path/to/conda_env.yml"` | `[Required]` | +| `databricks_permissions` | dictionary | Permissions for the model | Refer Below | `None` | +| `experiment_name_formatter` | string | Formatter for the experiment name | `"experiment_name_{timestamp}"` | `[Required]` | +| `model_name_formatter` | string | Formatter for the model name | `"model_name_{timestamp}"` | `[Required]` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `timestamp` | ### `databricks_permissions` | Key | Type | Description | Example Value | Default Value | |---------|-------|--------------------------------------|----------------------------------|---------------| -| `read` | array | List of users with read permissions | ["read_user1", "read_user2"] | `-` | -| `write` | array | List of users with write permissions | ["write_user1", "write_user2"] | `-` | +| `read` | array | List of users with read permissions | `["read_user1", "read_user2"]` | `-` | +| `write` | array | List of users with write permissions | `["write_user1", "write_user2"]` | `-` | ### Example JSON Configuration diff --git a/docs/source/modules/core/payload_batcher.md b/docs/source/modules/core/payload_batcher.md index 06729cbe1f..4a7a767853 100644 --- a/docs/source/modules/core/payload_batcher.md +++ b/docs/source/modules/core/payload_batcher.md @@ -25,7 +25,7 @@ This module batches incoming control message data payload into smaller batches b |-----------------------------|------------|-----------------------------------|---------------------------------|---------------| | `max_batch_size` | integer | The maximum size of each batch | 256 | `256` | | `raise_on_failure` | boolean | Whether to raise an exception if a failure occurs during processing | false | `false` | -| `group_by_columns` | list | The column names to group by when batching | ["col1", "col2"] | `[]` | +| `group_by_columns` | list | The column names to group by when batching | `["col1", "col2"]` | `[]` | | `disable_max_batch_size` | boolean | Whether to disable the `max_batch_size` and only batch by group | false | `false` | | `timestamp_column_name` | string | The name of the timestamp column | None | `None` | | `timestamp_pattern` | string | The pattern to parse the timestamp column | None | `None` | diff --git a/docs/source/modules/core/serialize.md b/docs/source/modules/core/serialize.md index 01b32b8d54..ea7528a1eb 100644 --- a/docs/source/modules/core/serialize.md +++ b/docs/source/modules/core/serialize.md @@ -23,11 +23,11 @@ This module filters columns from a `MultiMessage` object, emitting a `MessageMet | Parameter | Type | Description | Example Value | Default Value | 
|-----------------|--------------|--------------------------------------------------------------|-------------------------------------|-----------------------| -| `columns` | list[string] | List of columns to include | ["column1", "column2", "column3"] | `None` | -| `exclude` | list[string] | List of regex patterns to exclude columns | ["column_to_exclude"] | `[r'^ID$', r'^_ts_']` | -| `fixed_columns` | bool | If true, the columns are fixed and not determined at runtime | true | `true` | -| `include` | string | Regex to include columns | "^column" | `None` | -| `use_cpp` | bool | If true, use C++ to serialize | true | `false` | +| `columns` | list[string] | List of columns to include | `["column1", "column2", "column3"]` | `None` | +| `exclude` | list[string] | List of regex patterns to exclude columns | `["column_to_exclude"]` | `[r'^ID$', r'^_ts_']` | +| `fixed_columns` | boolean | If true, the columns are fixed and not determined at runtime | `true` | `true` | +| `include` | string | Regex to include columns | `"^column"` | `None` | +| `use_cpp` | boolean | If true, use C++ to serialize | `true` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/core/write_to_elasticsearch.md b/docs/source/modules/core/write_to_elasticsearch.md index 5ddca72707..51cc6a06a7 100644 --- a/docs/source/modules/core/write_to_elasticsearch.md +++ b/docs/source/modules/core/write_to_elasticsearch.md @@ -23,11 +23,11 @@ This module reads an input data stream, converts each row of data to a document | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------------|---------------------------------------------------------------------------------------------------------|-------------------------------|---------------| -| `index` | str | Elasticsearch index. | "my_index" | `[Required]` | -| `connection_kwargs` | dict | Elasticsearch connection kwargs configuration. | {"hosts": ["host": "localhost", ...} | `[Required]` | -| `raise_on_exception` | bool | Raise or suppress exceptions when writing to Elasticsearch. | true | `false` | -| `pickled_func_config` | str | Pickled custom function configuration to update connection_kwargs as needed for the client connection. | See below | None | -| `refresh_period_secs` | int | Time in seconds to refresh the client connection. | 3600 | `2400` | +| `index` | `str` | Elasticsearch index. | `"my_index"` | `[Required]` | +| `connection_kwargs` | `dict` | Elasticsearch connection keyword arguments configuration. | `{"hosts": [{"host": "localhost", ...}]}` | `[Required]` | +| `raise_on_exception` | `bool` | Raise or suppress exceptions when writing to Elasticsearch. | `true` | `false` | +| `pickled_func_config` | `str` | Pickled custom function configuration to update `connection_kwargs` as needed for the client connection. | Refer Below | `None` | +| `refresh_period_secs` | `int` | Time in seconds to refresh the client connection. | `3600` | `2400` | ### Example JSON Configuration diff --git a/docs/source/modules/core/write_to_file.md b/docs/source/modules/core/write_to_file.md index 61b6f6983c..2304e54faa 100644 --- a/docs/source/modules/core/write_to_file.md +++ b/docs/source/modules/core/write_to_file.md @@ -23,11 +23,11 @@ This module writes messages to a file. 
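An illustrative configuration built from the parameters below (placeholder values only) could look like:

```json
{
  "filename": "output.csv",
  "file_type": "CSV",
  "flush": false,
  "include_index_col": false,
  "overwrite": true
}
```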
| Parameter | Type | Description | Example Value | Default Value | |---------------------|-----------|------------------------------------------|---------------|------------------| -| `filename` | string | Path to the output file | "output.csv" | `None` | -| `file_type` | string | Type of file to write | "CSV" | `AUTO` | -| `flush` | bool | If true, flush the file after each write | false | `false ` | -| `include_index_col` | bool | If true, include the index column | false | `true` | -| `overwrite` | bool | If true, overwrite the file if it exists | true | `false` | +| `filename` | string | Path to the output file | `"output.csv"` | `None` | +| `file_type` | string | Type of file to write | `"CSV"` | `AUTO` | +| `flush` | boolean | If true, flush the file after each write | `false` | `false ` | +| `include_index_col` | boolean | If true, include the index column | `false` | `true` | +| `overwrite` | boolean | If true, overwrite the file if it exists | `true` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_data_prep.md b/docs/source/modules/examples/digital_fingerprinting/dfp_data_prep.md index 4f304f30ce..24bb2dea87 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_data_prep.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_data_prep.md @@ -23,16 +23,16 @@ This module function prepares data for either inference or model training. | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------|------------------------------|---------------|---------------| -| `schema` | dict | Schema configuration | See Below | `-` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `schema` | dict | Schema configuration | Refer Below | `-` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `timestamp` | #### `schema` | Key | Type | Description | Example Value | Default Value | |----------------------|--------|----------------------------------|-------------------------|---------------| -| `schema_str` | string | Serialized schema string | "cPickle schema string" | `-` | -| `encoding` | string | Encoding used for the schema_str | "latin1" | `-` | -| `input_message_type` | string | Pickled message type | "message type" | `-` | +| `schema_str` | string | Serialized schema string | `"cPickle schema string"` | `-` | +| `encoding` | string | Encoding used for the `schema_str` | `"latin1"` | `-` | +| `input_message_type` | string | Pickled message type | `"message type"` | `-` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_deployment.md b/docs/source/modules/examples/digital_fingerprinting/dfp_deployment.md index ad094ee81a..5695dcc664 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_deployment.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_deployment.md @@ -23,126 +23,126 @@ This module function sets up modular Digital Fingerprinting Pipeline instance. 
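The top-level layout, with the nested option groups documented below abbreviated to a few placeholder entries, might be sketched as:

```json
{
  "training_options": {
    "timestamp_column_name": "timestamp",
    "cache_dir": "./.cache"
  },
  "inference_options": {
    "model_name_formatter": "model_{timestamp}",
    "fallback_username": "generic_user",
    "timestamp_column_name": "timestamp"
  }
}
```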
| Parameter | Type | Description | Example Value | Default Value | |---------------------|------|-------------------------------------------|---------------|---------------| -| `inference_options` | dict | Options for the inference pipeline module | See Below | `[Required]` | -| `training_options` | dict | Options for the training pipeline module | See Below | `[Required]` | +| `inference_options` | `dict` | Options for the inference pipeline module | Refer Below | `[Required]` | +| `training_options` | `dict` | Options for the training pipeline module | Refer Below | `[Required]` | ### Training Options Parameters | Parameter | Type | Description | Example Value | Default Value | |------------------------------|------|------------------------------------------------|----------------------|---------------| -| `batching_options` | dict | Options for batching the data | See Below | `-` | -| `cache_dir` | str | Directory to cache the rolling window data | "/path/to/cache/dir" | `./.cache` | -| `dfencoder_options` | dict | Options for configuring the data frame encoder | See Below | `-` | -| `mlflow_writer_options` | dict | Options for the MLflow model writer | See Below | `-` | -| `preprocessing_options` | dict | Options for preprocessing the data | See Below | `-` | -| `stream_aggregation_options` | dict | Options for aggregating the data by stream | See Below | `-` | -| `timestamp_column_name` | str | Name of the timestamp column used in the data | "my_timestamp" | `timestamp` | -| `user_splitting_options` | dict | Options for splitting the data by user | See Below | `-` | +| `batching_options` | `dict` | Options for batching the data | Refer Below | `-` | +| `cache_dir` | `str` | Directory to cache the rolling window data |` "/path/to/cache/dir"` | `"./.cache"` | +| `dfencoder_options` | `dict` | Options for configuring the data frame encoder | Refer Below | `-` | +| `mlflow_writer_options` | `dict` | Options for the MLflow model writer | Refer Below | `-` | +| `preprocessing_options` | `dict` | Options for preprocessing the data | Refer Below | `-` | +| `stream_aggregation_options` | `dict` | Options for aggregating the data by stream | Refer Below | `-` | +| `timestamp_column_name` | `str` | Name of the timestamp column used in the data | `"my_timestamp"` | `"timestamp"` | +| `user_splitting_options` | `dict` | Options for splitting the data by user | Refer Below | `-` | ### Inference Options Parameters | Parameter | Type | Description | Example Value | Default Value | |------------------------------|------|------------------------------------------------|----------------------|----------------| -| `batching_options` | dict | Options for batching the data | See Below | `-` | -| `cache_dir` | str | Directory to cache the rolling window data | "/path/to/cache/dir" | `./.cache` | -| `detection_criteria` | dict | Criteria for filtering detections | See Below | `-` | -| `fallback_username` | str | User ID to use if user ID not found | "generic_user" | `generic_user` | -| `inference_options` | dict | Options for the inference module | See Below | `-` | -| `model_name_formatter` | str | Format string for the model name | "model_{timestamp}" | `[Required]` | -| `num_output_ports` | int | Number of output ports for the module | 3 | `-` | -| `timestamp_column_name` | str | Name of the timestamp column in the input data | "timestamp" | `timestamp` | -| `stream_aggregation_options` | dict | Options for aggregating the data by stream | See Below | `-` | -| `user_splitting_options` | dict | Options for 
splitting the data by user | See Below | `-` | -| `write_to_file_options` | dict | Options for writing the detections to a file | See Below | `-` | +| `batching_options` | `dict` | Options for batching the data | Refer Below | `-` | +| `cache_dir` | `str` | Directory to cache the rolling window data | `"/path/to/cache/dir"` | `"./.cache"` | +| `detection_criteria` | `dict` | Criteria for filtering detections | Refer Below | `-` | +| `fallback_username` | `str` | User ID to use if user ID not found | `"generic_user"` | `"generic_user"` | +| `inference_options` | `dict` | Options for the inference module | Refer Below | `-` | +| `model_name_formatter` | `str` | Format string for the model name | `"model_{timestamp}"` | `[Required]` | +| `num_output_ports` | `int` | Number of output ports for the module | `3` | `-` | +| `timestamp_column_name` | `str` | Name of the timestamp column in the input data | `"timestamp"` | `"timestamp"` | +| `stream_aggregation_options` | `dict` | Options for aggregating the data by stream | Refer Below | `-` | +| `user_splitting_options` | `dict` | Options for splitting the data by user | Refer Below | `-` | +| `write_to_file_options` | `dict` | Options for writing the detections to a file | Refer Below | `-` | ### `batching_options` | Key | Type | Description | Example Value | Default Value | |--------------------------|-----------------|-------------------------------------|---------------------------------------------|----------------------------| -| `end_time` | datetime/string | Endtime of the time window | "2023-03-14T23:59:59" | `None` | -| `iso_date_regex_pattern` | string | Regex pattern for ISO date matching | "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}" | `` | -| `parser_kwargs` | dictionary | Additional arguments for the parser | {} | `{}` | -| `period` | string | Time period for grouping files | "1d" | `D` | -| `sampling_rate_s` | integer | Sampling rate in seconds | 0 | `None` | -| `start_time` | datetime/string | Start time of the time window | "2023-03-01T00:00:00" | `None` | +| `end_time` | `datetime`/`str` | End of the time window | `"2023-03-14T23:59:59"` | `None` | +| `iso_date_regex_pattern` | `str` | Regex pattern for ISO date matching | `"\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}"` | `` | +| `parser_kwargs` | `dict` | Additional arguments for the parser | `{}` | `{}` | +| `period` | `str` | Time period for grouping files | `"1d"` | `D` | +| `sampling_rate_s` | `int` | Sampling rate in seconds | `0` | `None` | +| `start_time` | `datetime`/`str` | Start time of the time window | `"2023-03-01T00:00:00"` | `None` | ### `dfencoder_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------|-------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| -| `feature_columns` | list | List of feature columns to train on | ["column1", "column2", "column3"] | `-` | -| `epochs` | int | Number of epochs to train for | 50 | `-` | -| `model_kwargs` | dict | Keyword arguments to pass to the model | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | `-` | -| 
`validation_size` | float | Size of the validation set | 0.1 | `-` | +| `feature_columns` | `list` | List of feature columns to train on | `["column1", "column2", "column3"]` | `-` | +| `epochs` | `int` | Number of epochs to train for | `50` | `-` | +| `model_kwargs` | `dict` | Keyword arguments to pass to the model | `{"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": False, "device": "cpu"}` | `-` | +| `validation_size` | `float` | Size of the validation set | `0.1` | `-` | ### `monitor_options` | Key | Type | Description | Example Value | Default Value | | ----------------------------|---------|------------------------------------------------------------|---------------|---------------| -| `description` | string | Name to show for this Monitor Stage in the console window | "Progress" | `Progress` | -| `silence_monitors` | bool | Silence the monitors on the console | True | `False` | -| `smoothing` | float | Smoothing parameter to determine how much the throughput should be averaged | 0.01 | `0.05` | -| `unit` | string | Units to show in the rate value | "messages" | `messages` | -| `delayed_start` | bool | When delayed_start is enabled, the progress bar will not be shown until the first message is received. Otherwise, the progress bar is shown on pipeline startup and will begin timing immediately. In large pipelines, this option may be desired to give a more accurate timing. | True | `False` | -| `determine_count_fn_schema` | string | Custom function for determining the count in a message | "Progress" | `Progress` | -| `log_level` | string | Enable this stage when the configured log level is at `log_level` or lower. | "DEBUG" | `INFO` | +| `description` | `str` | Name to show for this Monitor Stage in the console window | `"Progress"` | `Progress` | +| `silence_monitors` | `bool` | Silence the monitors on the console | `True` | `False` | +| `smoothing` | `float` | Smoothing parameter to determine how much the throughput should be averaged | `0.01` | `0.05` | +| `unit` | `str` | Units to show in the rate value | `"messages"` | `"messages"` | +| `delayed_start` | `bool` | When enabled, the progress bar will not be shown until the first message is received. Otherwise, the progress bar is shown on pipeline startup and will begin timing immediately. In large pipelines, this option may be desired to give a more accurate timing. | `True`|`False` | +| `determine_count_fn_schema` | `str` | Custom function for determining the count in a message | `"Progress"` | `"Progress"` | +| `log_level` | `str` | Enable this stage when the configured log level is at `log_level` or lower. 
| `"DEBUG"` | `"INFO"` | ### `mlflow_writer_options` | Key | Type | Description | Example Value | Default Value | |-----------------------------|------------|-----------------------------------|-------------------------------|---------------| -| `conda_env` | string | Conda environment for the model | "path/to/conda_env.yml" | `[Required]` | -| `databricks_permissions` | dictionary | Permissions for the model | See Below | `None` | -| `experiment_name_formatter` | string | Formatter for the experiment name | "experiment_name_{timestamp}" | `[Required]` | -| `model_name_formatter` | string | Formatter for the model name | "model_name_{timestamp}" | `[Required]` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `conda_env` | `str` | Conda environment for the model | `"path/to/conda_env.yml"` | `[Required]` | +| `databricks_permissions` | `dict` | Permissions for the model | Refer Below | `None` | +| `experiment_name_formatter` | `str` | Formatter for the experiment name | `"experiment_name_{timestamp}"` | `[Required]` | +| `model_name_formatter` | `str` | Formatter for the model name | `"model_name_{timestamp}"` | `[Required]` | +| `timestamp_column_name` | `str` | Name of the timestamp column | `"timestamp"` | `"timestamp"` | ### `stream_aggregation_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------|-------------------------------------------------------------|---------------|---------------| -| `cache_mode` | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. | "batch" | `batch` | -| `min_history` | int | Minimum history to trigger a new training event | 1 | `1` | -| `max_history` | int | Maximum history to include in a new training event | 0 | `0` | -| `timestamp_column_name` | string | Name of the column containing timestamps | "timestamp" | `timestamp` | -| `aggregation_span` | string | Lookback timespan for training data in a new training event | "60d" | `60d` | -| `cache_to_disk` | bool | Whether or not to cache streaming data to disk | false | `false` | -| `cache_dir` | string | Directory to use for caching streaming data | "./.cache" | `./.cache` | +| `cache_mode` | `str` | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. 
| `"batch"` | `"batch"` | +| `min_history` | `int` | Minimum history to trigger a new training event | `1` | `1` | +| `max_history` | `int` | Maximum history to include in a new training event | `0` | `0` | +| `timestamp_column_name` | `str` | Name of the column containing timestamps | `"timestamp"` | `"timestamp"` | +| `aggregation_span` | `str` | Look back time span for training data in a new training event | `"60d"` | `"60d"` | +| `cache_to_disk` | `bool` | Whether or not to cache streaming data to disk | `False` | `False` | +| `cache_dir` | `str` | Directory to use for caching streaming data | `"./.cache"` | `"./.cache"` | ### `user_splitting_options` | Key | Type | Description | Example Value | Default Value | |-------------------------|------|------------------------------------------------------|-----------------------------|----------------| -| `fallback_username` | str | The user ID to use if the user ID is not found | "generic_user" | `generic_user` | -| `include_generic` | bool | Whether to include a generic user ID in the output | false | `false` | -| `include_individual` | bool | Whether to include individual user IDs in the output | true | `false` | -| `only_users` | list | List of user IDs to include; others will be excluded | ["user1", "user2", "user3"] | `[]` | -| `skip_users` | list | List of user IDs to exclude from the output | ["user4", "user5"] | `[]` | -| `timestamp_column_name` | str | Name of the column containing timestamps | "timestamp" | `timestamp` | -| `userid_column_name` | str | Name of the column containing user IDs | "username" | `username` | +| `fallback_username` | `str` | The user ID to use if the user ID is not found | `"generic_user"` | `"generic_user"` | +| `include_generic` | `bool` | Whether to include a generic user ID in the output | `False` | `False` | +| `include_individual` | `bool` | Whether to include individual user IDs in the output | `True` | `False` | +| `only_users` | `list` | List of user IDs to include; others will be excluded | `["user1", "user2", "user3"]` | `[]` | +| `skip_users` | `list` | List of user IDs to exclude from the output | `["user4", "user5"]` | `[]` | +| `timestamp_column_name` | `str` | Name of the column containing timestamps | `"timestamp"` | `"timestamp"` | +| `userid_column_name` | `str` | Name of the column containing user IDs | `"username"` | `"username"` | ### `detection_criteria` | Key | Type | Description | Example Value | Default Value | |--------------|-------|------------------------------------------|---------------|---------------| -| `threshold` | float | Threshold for filtering detections | 0.5 | `0.5` | -| `field_name` | str | Name of the field to filter by threshold | "score" | `probs` | +| `threshold` | `float` | Threshold for filtering detections | `0.5` | `0.5` | +| `field_name` | `str` | Name of the field to filter by threshold | `"score"` | `"probs"` | ### `inference_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------|------------------------------------------------------|-------------------------|---------------| -| `model_name_formatter` | string | Formatter for model names | "user_{username}_model" | `[Required]` | -| `fallback_username` | string | Fallback user to use if no model is found for a user | "generic_user" | `generic_user`| -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `model_name_formatter` | `str` | Formatter for model names | `"user_{username}_model"` | `[Required]` 
| +| `fallback_username` | `str` | Fallback user to use if no model is found for a user | `"generic_user"` | `"generic_user"`| +| `timestamp_column_name` | `str` | Name of the timestamp column | `"timestamp"` | `"timestamp"` | ### `write_to_file_options` | Key | Type | Description | Example Value | Default Value | |---------------------|-----------|------------------------------------------|-----------------|------------------| -| `filename` | string | Path to the output file | "output.csv" | `None` | -| `file_type` | string | Type of file to write | "CSV" | `AUTO` | -| `flush` | bool | If true, flush the file after each write | false | `false` | -| `include_index_col` | bool | If true, include the index column | false | `true` | -| `overwrite` | bool | If true, overwrite the file if it exists | true | `false` | +| `filename` | `str` | Path to the output file | `"output.csv"` | `None` | +| `file_type` | `str` | Type of file to write | `"CSV"` | `"AUTO"` | +| `flush` | `bool` | If true, flush the file after each write | `False` | `False` | +| `include_index_col` | `bool` | If true, include the index column | `False` | `True` | +| `overwrite` | `bool` | If true, overwrite the file if it exists | `True` | `False` | diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_inference.md b/docs/source/modules/examples/digital_fingerprinting/dfp_inference.md index 50f0698aa1..203f4eeee7 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_inference.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_inference.md @@ -23,9 +23,9 @@ This module function performs the inference process. | Parameter | Type | Description | Example Value | Default Value | |-----------------------|--------|------------------------------------------------------|-------------------------|-----------------| -| model_name_formatter | string | Formatter for model names | "user_{username}_model" | `[Required]` | -| fallback_username | string | Fallback user to use if no model is found for a user | "generic_user" | `generic_user` | -| timestamp_column_name | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `model_name_formatter` | string | Formatter for model names | `"user_{username}_model"` | `[Required]` | +| `fallback_username` | string | Fallback user to use if no model is found for a user | `"generic_user"` | `"generic_user"` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `"timestamp"` | ### Example JSON Configuration @@ -36,11 +36,3 @@ This module function performs the inference process. "timestamp_column_name": "timestamp" } ``` - -### Default Settings - -| Property | Value | -| -------- | ----- | -| fallback_username | generic_user | -| model_name_formatter | None | -| timestamp_column_name | timestamp | diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_inference_pipe.md b/docs/source/modules/examples/digital_fingerprinting/dfp_inference_pipe.md index b13fc42b10..88431f3302 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_inference_pipe.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_inference_pipe.md @@ -15,66 +15,65 @@ See the License for the specific language governing permissions and limitations under the License. --> -## dfp_inference_pipe +## `dfp_inference_pipe` -This module function allows for the consolidation of multiple dfp pipeline modules relevant to the inference process -into a single module. 
+This module function allows for the consolidation of multiple DFP pipeline modules relevant to the inference process into a single module. ### Configurable Parameters | Parameter | Type | Description | Example Value | Default Value | |------------------------------|------------|--------------------------------------------------|---------------|---------------| -| `batching_options` | dictionary | Options for batching files. | See below | `-` | -| `cache_dir` | string | Directory used for caching intermediate results. | "/tmp/cache" | `-` | +| `batching_options` | dictionary | Options for batching files. | Refer below | `-` | +| `cache_dir` | string | Directory used for caching intermediate results. | `"/tmp/cache"` | `-` | | `detection_criteria` | dictionary | Criteria for filtering detections. | - | `-` | -| `inference_options` | dictionary | Options for configuring the inference process. | See below | `-` | +| `inference_options` | dictionary | Options for configuring the inference process. | Refer below | `-` | | `preprocessing_options` | dictionary | Options for preprocessing data. | - | `-` | -| `stream_aggregation_options` | dictionary | Options for aggregating data by stream. | See below | `-` | -| `timestamp_column_name` | string | Name of the column containing timestamps. | "timestamp" | `-` | -| `user_splitting_options` | dictionary | Options for splitting data by user. | See below | `-` | +| `stream_aggregation_options` | dictionary | Options for aggregating data by stream. | Refer below | `-` | +| `timestamp_column_name` | string | Name of the column containing timestamps. | `"timestamp"` | `-` | +| `user_splitting_options` | dictionary | Options for splitting data by user. | Refer below | `-` | | `write_to_file_options` | dictionary | Options for writing results to a file. | - | `-` | #### `batching_options` | Parameter | Type | Description | Example Value | Default Value | |--------------------------|--------|------------------------------------------|----------------------------------------------|---------------| -| `end_time` | string | End time of the time range to process. | "2022-01-01T00:00:00Z" | `-` | -| `iso_date_regex_pattern` | string | ISO date regex pattern. | "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z" | `-` | +| `end_time` | string | End time of the time range to process. | `"2022-01-01T00:00:00Z"` | `-` | +| `iso_date_regex_pattern` | string | ISO date regex pattern. | `"\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z"` | `-` | | `parser_kwargs` | dict | Keyword arguments to pass to the parser. | - | `-` | -| `period` | string | Time period to batch the data. | "1D" | `-` | -| `sampling_rate_s` | float | Sampling rate in seconds. | "1.0" | `-` | -| `start_time` | string | Start time of the time range to process. | "2021-01-01T00:00:00Z" | `-` | +| `period` | string | Time period to batch the data. | `"1D"` | `-` | +| `sampling_rate_s` | float | Sampling rate in seconds. | `"1.0"` | `-` | +| `start_time` | string | Start time of the time range to process. | `"2021-01-01T00:00:00Z"` | `-` | #### `user_splitting_options` | Parameter | Type | Description | Example Value | Default Value | |----------------------|---------|-------------------------------------------------------|-------------------------|-----------------| -| `fallback_username` | string | Fallback user to use if no model is found for a user. | "generic_user" | `generic_user` | -| `include_generic` | boolean | Include generic models in the results. 
| true | `true` | -| `include_individual` | boolean | Include individual models in the results. | true | `false` | -| `only_users` | list | List of users to include in the results. | ["user_a","user_b"] | `-` | -| `skip_users` | list | List of users to exclude from the results. | ["user_c"] | `-` | -| `userid_column_name` | string | Column | "name for the user ID." | `user_id` | +| `fallback_username` | string | Fallback user to use if no model is found for a user. | `"generic_user"` | `"generic_user"` | +| `include_generic` | boolean | Include generic models in the results. | `True` | `True` | +| `include_individual` | boolean | Include individual models in the results. | `True` | `False` | +| `only_users` | list | List of users to include in the results. | `["user_a","user_b"]` | `-` | +| `skip_users` | list | List of users to exclude from the results. | `["user_c"]` | `-` | +| `userid_column_name` | string | Column name for the user ID. | `"user_id"` | `"user_id"` | ### `stream_aggregation_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------|-------------------------------------------------------------|---------------|---------------| -| `cache_mode` | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. | "batch" | `batch` | -| `min_history` | int | Minimum history to trigger a new training event | 1 | `1` | -| `max_history` | int | Maximum history to include in a new training event | 0 | `0` | -| `timestamp_column_name` | string | Name of the column containing timestamps | "timestamp" | `timestamp` | -| `aggregation_span` | string | Lookback timespan for training data in a new training event | "60d" | `60d` | -| `cache_to_disk` | bool | Whether or not to cache streaming data to disk | false | `false` | -| `cache_dir` | string | Directory to use for caching streaming data | "./.cache" | `./.cache` | +| `cache_mode` | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. 
| `"batch"` | `batch` | +| `min_history` | int | Minimum history to trigger a new training event | `1` | `1` | +| `max_history` | int | Maximum history to include in a new training event | `0` | `0` | +| `timestamp_column_name` | string | Name of the column containing timestamps | `"timestamp"` | `timestamp` | +| `aggregation_span` | string | Look back time span for training data in a new training event | `"60d"` | `60d` | +| `cache_to_disk` | boolean | Whether or not to cache streaming data to disk | `False` | `False` | +| `cache_dir` | string | Directory to use for caching streaming data | `"./.cache"` | `"./.cache"` | ### `inference_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------------|--------|------------------------------------------------------|-------------------------|-----------------| -| `model_name_formatter` | string | Formatter for model names | "user_{username}_model" | `[Required]` | -| `fallback_username` | string | Fallback user to use if no model is found for a user | "generic_user" | `generic_user` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `model_name_formatter` | string | Formatter for model names | `"user_{username}_model"` | `[Required]` | +| `fallback_username` | string | Fallback user to use if no model is found for a user | `"generic_user"` | `"generic_user"` | +| `timestamp_column_name` | string | Name of the timestamp column | `"timestamp"` | `"timestamp"` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_monitor.md b/docs/source/modules/examples/digital_fingerprinting/dfp_monitor.md index 6f026dd396..933a16f668 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_monitor.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_monitor.md @@ -23,10 +23,10 @@ This module function monitors the pipeline message flow rate. | Key | Type | Description | Example Value | Default Value | | ----------------------------|---------|------------------------------------------------------------|---------------|---------------| -| `description` | string | Name to show for this Monitor Stage in the console window | "Progress" | `Progress` | -| `silence_monitors` | bool | Silence the monitors on the console | True | `False` | -| `smoothing` | float | Smoothing parameter to determine how much the throughput should be averaged | 0.01 | `0.05` | -| `unit` | string | Units to show in the rate value | "messages" | `messages` | -| `delayed_start` | bool | When delayed_start is enabled, the progress bar will not be shown until the first message is received. Otherwise, the progress bar is shown on pipeline startup and will begin timing immediately. In large pipelines, this option may be desired to give a more accurate timing. | True | `False` | -| `determine_count_fn_schema` | string | Custom function for determining the count in a message | "Progress" | `Progress` | -| `log_level` | string | Enable this stage when the configured log level is at `log_level` or lower. 
| "DEBUG" | `INFO` | +| `description` | string | Name to show for this Monitor Stage in the console window | `"Progress"` | `"Progress"` | +| `silence_monitors` | boolean | Silence the monitors on the console | `True` | `False` | +| `smoothing` | float | Smoothing parameter to determine how much the throughput should be averaged | `0.01` | `0.05` | +| `unit` | string | Units to show in the rate value | `"messages"` | `"messages"` | +| `delayed_start` | boolean | When enabled, the progress bar will not be shown until the first message is received. Otherwise, the progress bar is shown on pipeline startup and will begin timing immediately. In large pipelines, this option may be desired to give a more accurate timing. | `True` | `False` | +| `determine_count_fn_schema` | string | Custom function for determining the count in a message | `"Progress"` | `"Progress"` | +| `log_level` | string | Enable this stage when the configured log level is at `log_level` or lower. | `"DEBUG"` | `"INFO"` | diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_preproc.md b/docs/source/modules/examples/digital_fingerprinting/dfp_preproc.md index 39ec800ed3..886e3614fb 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_preproc.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_preproc.md @@ -15,52 +15,52 @@ See the License for the specific language governing permissions and limitations under the License. --> -## dfp_preproc +## `dfp_preproc` -This module function allows for the consolidation of multiple dfp pipeline modules relevant to inference/training +This module function allows for the consolidation of multiple DFP pipeline modules relevant to inference/training process into a single module. ### Configurable Parameters | Parameter | Type | Description | Example Value | Default Value | |--------------------------|------------|--------------------------------------------------|---------------|----------------| -| `cache_dir` | string | Directory used for caching intermediate results. | "/tmp/cache" | `-` | -| `timestamp_column_name` | string | Name of the column containing timestamps. | "timestamp" | `-` | -| `pre_filter_options` | dictionary | Options for pre-filtering control messages. | See Below | `-` | -| `batching_options` | dictionary | Options for batching files. | See Below | `-` | -| `user_splitting_options` | dictionary | Options for splitting data by user. | See Below | `-` | +| `cache_dir` | string | Directory used for caching intermediate results. | `"/tmp/cache"` | `-` | +| `timestamp_column_name` | string | Name of the column containing timestamps. | `"timestamp"` | `-` | +| `pre_filter_options` | dictionary | Options for pre-filtering control messages. | Refer Below | `-` | +| `batching_options` | dictionary | Options for batching files. | Refer Below | `-` | +| `user_splitting_options` | dictionary | Options for splitting data by user. | Refer Below | `-` | | `supported_loaders` | dictionary | Supported data loaders for different file types. | - | `-` | #### `pre_filter_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------------|---------|---------------------------------------|---------------|---------------| -| `enable_task_filtering` | boolean | Enables filtering based on task type. | true | `-` | -| `filter_task_type` | string | The task type to be used as a filter. | "task_a" | `-` | -| `enable_data_filtering` | boolean | Enables filtering based on data type. 
| true | `-` | -| `filter_data_type` | string | The data type to be used as a filter. | "type_a" | `-` | +| `enable_task_filtering` | boolean | Enables filtering based on task type. | `true` | `-` | +| `filter_task_type` | string | The task type to be used as a filter. | `"task_a"` | `-` | +| `enable_data_filtering` | boolean | Enables filtering based on data type. | `true` | `-` | +| `filter_data_type` | string | The data type to be used as a filter. | `"type_a"` | `-` | #### `batching_options` | Parameter | Type | Description | Example Value | Default Value | |--------------------------|------------|------------------------------------------|----------------------------------------|---------------| -| `end_time` | string | End time of the time range to process. | "2022-01-01T00:00:00Z" | `-` | -| `iso_date_regex_pattern` | string | ISO date regex pattern. | "\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z" | `-` | -| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser. | {} | `-` | -| `period` | string | Time period to batch the data. | "1D" | `-` | -| `sampling_rate_s` | float | Sampling rate in seconds. | "1.0" | `-` | -| `start_time` | string | Start time of the time range to process. | "2021-01-01T00:00:00Z" | `-` | +| `end_time` | string | End time of the time range to process. | `"2022-01-01T00:00:00Z"` | `-` | +| `iso_date_regex_pattern` | string | ISO date regex pattern. | `"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"` | `-` | +| `parser_kwargs` | dictionary | Keyword arguments to pass to the parser. | `{}` | `-` | +| `period` | string | Time period to batch the data. | `"1D"` | `-` | +| `sampling_rate_s` | float | Sampling rate in seconds. | `"1.0"` | `-` | +| `start_time` | string | Start time of the time range to process. | `"2021-01-01T00:00:00Z"` | `-` | #### `user_splitting_options` | Parameter | Type | Description | Example Value | Default Value | |----------------------|---------|-------------------------------------------------------|------------------------|---------------| -| `fallback_username` | string | Fallback user to use if no model is found for a user. | "generic" | `-` | -| `include_generic` | boolean | Include generic models in the results. | "true" | `-` | -| `include_individual` | boolean | Include individual models in the results. | "true" | `-` | -| `only_users` | list | List of users to include in the results. | ["user_a", "user_b"] | `-` | -| `skip_users` | list | List of users to exclude from the results. | ["user_c"] | `-` | -| `userid_column_name` | string | Column name for the user ID. | "user_id" | `-` | +| `fallback_username` | string | Fallback user to use if no model is found for a user. | `"generic"` | `-` | +| `include_generic` | boolean | Include generic models in the results. | `true` | `-` | +| `include_individual` | boolean | Include individual models in the results. | `true` | `-` | +| `only_users` | list | List of users to include in the results. | `["user_a", "user_b"]` | `-` | +| `skip_users` | list | List of users to exclude from the results. | `["user_c"]` | `-` | +| `userid_column_name` | string | Column name for the user ID. 
| `"user_id"` | `-` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_rolling_window.md b/docs/source/modules/examples/digital_fingerprinting/dfp_rolling_window.md index 7104937956..7508852163 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_rolling_window.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_rolling_window.md @@ -23,13 +23,13 @@ This module is responsible for maintaining a rolling window of historical data, | Parameter | Type | Description | Example Value | Default Value | |--------------------------|--------|--------------------------------------------------------------|---------------|---------------| -| cache_mode | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. | "batch" | "batch" | -| trigger_on_min_history | int | Minimum history to trigger a new training event | 1 | 1 | -| trigger_on_min_increment | int | Minmum increment from the last trained to new training event | 0 | 0 | -| timestamp_column_name | string | Name of the column containing timestamps | "timestamp" | "timestamp" | -| aggregation_span | string | Lookback timespan for training data in a new training event | "60d" | "60d" | -| cache_to_disk | bool | Whether or not to cache streaming data to disk | false | false | -| cache_dir | string | Directory to use for caching streaming data | "./.cache" | "./.cache" | +| `cache_mode` | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. | `"batch"` | `"batch"` | +| `trigger_on_min_history` | integer | Minimum history to trigger a new training event | `1` | `1` | +| `trigger_on_min_increment` | integer | Minimum increment from the last trained to new training event | `0` | `0` | +| `timestamp_column_name` | string | Name of the column containing timestamps | `"timestamp"` | `"timestamp"` | +| `aggregation_span` | string | Look back time span for training data in a new training event | `"60d"` | `"60d"` | +| `cache_to_disk` | boolean | Whether or not to cache streaming data to disk | `false` | `false` | +| `cache_dir` | string | Directory to use for caching streaming data | `"./.cache"` | `"./.cache"` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_split_users.md b/docs/source/modules/examples/digital_fingerprinting/dfp_split_users.md index 1479d3c166..9adcad2c4a 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_split_users.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_split_users.md @@ -23,13 +23,13 @@ This module function splits the data based on user IDs. 
| Key | Type | Description | Example Value | Default Value | |-----------------------|------|------------------------------------------------------|-----------------------------|----------------| -| fallback_username | str | The user ID to use if the user ID is not found | "generic_user" | `generic_user` | -| include_generic | bool | Whether to include a generic user ID in the output | false | `false` | -| include_individual | bool | Whether to include individual user IDs in the output | true | `false` | -| only_users | list | List of user IDs to include; others will be excluded | ["user1", "user2", "user3"] | `[]` | -| skip_users | list | List of user IDs to exclude from the output | ["user4", "user5"] | `[]` | -| timestamp_column_name | str | Name of the column containing timestamps | "timestamp" | `timestamp` | -| userid_column_name | str | Name of the column containing user IDs | "username" | `username` | +| `fallback_username` | `str` | The user ID to use if the user ID is not found | `"generic_user"` | `"generic_user"` | +| `include_generic` | `bool` | Whether to include a generic user ID in the output | `false` | `false` | +| `include_individual` | `bool` | Whether to include individual user IDs in the output | `true` | `false` | +| `only_users` | `list` | List of user IDs to include; others will be excluded | `["user1", "user2", "user3"]` | `[]` | +| `skip_users` | `list` | List of user IDs to exclude from the output | `["user4", "user5"]` | `[]` | +| `timestamp_column_name` | `str` | Name of the column containing timestamps | `"timestamp"` | `"timestamp"` | +| `userid_column_name` | `str` | Name of the column containing user IDs | `"username"` | `"username"` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_training.md b/docs/source/modules/examples/digital_fingerprinting/dfp_training.md index 8c8ffdd5ed..c987b04189 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_training.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_training.md @@ -23,10 +23,10 @@ This module function is responsible for training the model. 
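A short, illustrative configuration (placeholder values, `model_kwargs` abbreviated) might look like:

```json
{
  "feature_columns": ["column1", "column2", "column3"],
  "epochs": 50,
  "model_kwargs": {"encoder_layers": [64, 32], "decoder_layers": [32, 64]},
  "validation_size": 0.1
}
```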
| Parameter | Type | Description | Example Value | Default Value | |-----------------|-------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| -| feature_columns | list | List of feature columns to train on | ["column1", "column2", "column3"] | `-` | -| epochs | int | Number of epochs to train for | 50 | `-` | -| model_kwargs | dict | Keyword arguments to pass to the model | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | `-` | -| validation_size | float | Size of the validation set | 0.1 | `-` | +| `feature_columns` | `list` | List of feature columns to train on | `["column1", "column2", "column3"]` | `-` | +| `epochs` | `int` | Number of epochs to train for | `50` | `-` | +| `model_kwargs` | `dict` | Keyword arguments to pass to the model | `{"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"}` | `-` | +| `validation_size` | `float` | Size of the validation set | `0.1 ` | `-` | ## JSON Example diff --git a/docs/source/modules/examples/digital_fingerprinting/dfp_training_pipe.md b/docs/source/modules/examples/digital_fingerprinting/dfp_training_pipe.md index c1a9699886..637a28fdf6 100644 --- a/docs/source/modules/examples/digital_fingerprinting/dfp_training_pipe.md +++ b/docs/source/modules/examples/digital_fingerprinting/dfp_training_pipe.md @@ -23,65 +23,65 @@ This module function consolidates multiple DFP pipeline modules relevant to the | Key | Type | Description | Example Value | Default Value | |------------------------------|------|-----------------------------------------------------------------------------------------|---------------|---------------| -| `timestamp_column_name` | str | Name of the timestamp column used in the data. | "timestamp" | `-` | -| `cache_dir` | str | Directory to cache the rolling window data. | "/tmp/cache" | `-` | -| `batching_options` | dict | Options for batching files. | See Below | `-` | -| `user_splitting_options` | dict | Options for splitting data by user. | See Below | `-` | -| `stream_aggregation_options` | dict | Options for aggregating data by stream. | See Below | `-` | -| `preprocessing_options` | dict | Options for preprocessing the data. | `-` | `-` | -| `dfencoder_options` | dict | Options for configuring the data frame encoder, used for training the model. | See Below | `-` | -| `mlflow_writer_options` | dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See Below | `-` | +| `timestamp_column_name` | `str` | Name of the timestamp column used in the data. | `"timestamp"` | `-` | +| `cache_dir` | `str` | Directory to cache the rolling window data. | `"/tmp/cache"` | `-` | +| `batching_options` | `dict` | Options for batching files. | Refer Below | `-` | +| `user_splitting_options` | `dict` | Options for splitting data by user. | Refer Below | `-` | +| `stream_aggregation_options` | `dict` | Options for aggregating data by stream. 
| Refer Below | `-` | +| `preprocessing_options` | `dict` | Options for preprocessing the data. | `-` | `-` | +| `dfencoder_options` | `dict` | Options for configuring the data frame encoder, used for training the model. | Refer Below | `-` | +| `mlflow_writer_options` | `dict` | Options for the MLflow model writer, which is responsible for saving the trained model. | Refer Below | `-` | ### `batching_options` | Key | Type | Description | Example Value | Default Value | |--------------------------|-------|------------------------------------------|---------------------------------------------------------|---------------| -| `end_time` | str | End time of the time range to process. | "2023-03-01T00:00:00" | `-` | -| `iso_date_regex_pattern` | str | ISO date regex pattern. | "\\\\d{4}-\\\\d{2}-\\\\d{2}T\\\\d{2}:\\\\d{2}:\\\\d{2}" | `-` | -| `parser_kwargs` | dict | Keyword arguments to pass to the parser. | {} | `-` | -| `period` | str | Time period to batch the data. | "1min" | `-` | -| `sampling_rate_s` | float | Sampling rate in seconds. | 60 | `-` | -| `start_time` | str | Start time of the time range to process. | "2023-02-01T00:00:00" | `-` | +| `end_time` | `str` | End time of the time range to process. | `"2023-03-01T00:00:00"` | `-` | +| `iso_date_regex_pattern` | `str` | ISO date regex pattern. | `"\\\\d{4}-\\\\d{2}-\\\\d{2}T\\\\d{2}:\\\\d{2}:\\\\d{2}"` | `-` | +| `parser_kwargs` | `dict` | Keyword arguments to pass to the parser. | `{}` | `-` | +| `period` | `str` | Time period to batch the data. | `"1min"` | `-` | +| `sampling_rate_s` | `float` | Sampling rate in seconds. | `60` | `-` | +| `start_time` | `str` | Start time of the time range to process. | `"2023-02-01T00:00:00"` | `-` | ### `user_splitting_options` | Key | Type | Description | Example Value | Default Value | |----------------------|-----------|-------------------------------------------------------|---------------|---------------| -| `fallback_username` | str | Fallback user to use if no model is found for a user. | "generic" | `-` | -| `include_generic` | bool | Include generic models in the results. | true | `-` | -| `include_individual` | bool | Include individual models in the results. | true | `-` | -| `only_users` | List[str] | List of users to include in the results. | [] | `-` | -| `skip_users` | List[str] | List of users to exclude from the results. | [] | `-` | -| `userid_column_name` | str | Column name for the user ID. | "user_id" | `-` | +| `fallback_username` | `str` | Fallback user to use if no model is found for a user. | `"generic"` | `-` | +| `include_generic` | `bool` | Include generic models in the results. | `true` | `-` | +| `include_individual` | `bool` | Include individual models in the results. | `true` | `-` | +| `only_users` | `list[str]` | List of users to include in the results. | `[]` | `-` | +| `skip_users` | `list[str]` | List of users to exclude from the results. | `[]` | `-` | +| `userid_column_name` | `str` | Column name for the user ID. | `"user_id"` | `-` | ### `stream_aggregation_options` | Key | Type | Description | Example Value | Default Value | |-------------------------|--------|-------------------------------------------------------------|---------------|---------------| -| `cache_mode` | string | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. 
| "batch" | `batch` | -| `min_history` | int | Minimum history to trigger a new training event | 1 | `1` | -| `max_history` | int | Maximum history to include in a new training event | 0 | `0` | -| `timestamp_column_name` | string | Name of the column containing timestamps | 'timestamp' | `timestamp` | -| `aggregation_span` | string | Lookback timespan for training data in a new training event | "60d" | `60d` | -| `cache_to_disk` | bool | Whether or not to cache streaming data to disk | false | `false` | -| `cache_dir` | string | Directory to use for caching streaming data | "./.cache" | `./.cache` | +| `cache_mode` | `str` | Mode for managing user cache. Setting to `batch` flushes cache once trigger conditions are met. Otherwise, continue to aggregate user's history. | `"batch"` | `"batch"` | +| `min_history` | `int` | Minimum history to trigger a new training event | `1` | `1` | +| `max_history` | `int` | Maximum history to include in a new training event | `0` | `0` | +| `timestamp_column_name` | `str` | Name of the column containing timestamps | `'timestamp'` | `'timestamp'` | +| `aggregation_span` | `str` | Look back time span for training data in a new training event | `"60d"` | `60d` | +| `cache_to_disk` | `bool` | Whether or not to cache streaming data to disk | `false` | `false` | +| `cache_dir` | `str` | Directory to use for caching streaming data | `"./.cache"` | `"./.cache"` | ### `dfencoder_options` | Parameter | Type | Description | Example Value | Default Value | |-------------------|-------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| -| `feature_columns` | list | List of feature columns to train on | ["column1", "column2", "column3"] | `- ` | -| `epochs` | int | Number of epochs to train for | 50 | `-` | -| `model_kwargs` | dict | Keyword arguments to pass to the model | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | `-` | -| `validation_size` | float | Size of the validation set | 0.1 | `-` | +| `feature_columns` | `list` | List of feature columns to train on | `["column1", "column2", "column3"]`| `- ` | +| `epochs` | `int` | Number of epochs to train for | `50` | `-` | +| `model_kwargs` | `dict` | Keyword arguments to pass to the model | `{"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"}` | `-` | +| `validation_size` | `float` | Size of the validation set | `0.1` | `-` | ### `mlflow_writer_options` | Key | Type | Description | Example Value | Default Value | |-----------------------------|------------|-----------------------------------|-------------------------------|---------------| -| `conda_env` | string | Conda environment for the model | "path/to/conda_env.yml" | `[Required]` | -| `databricks_permissions` | dictionary | Permissions for the model | See Below | `None` | -| `experiment_name_formatter` | string | Formatter for the experiment name | "experiment_name_{timestamp}" | `[Required]` | -| 
`model_name_formatter` | string | Formatter for the model name | "model_name_{timestamp}" | `[Required]` | -| `timestamp_column_name` | string | Name of the timestamp column | "timestamp" | `timestamp` | +| `conda_env` | `str` | Conda environment for the model | `"path/to/conda_env.yml"` | `[Required]` | +| `databricks_permissions` | `dict` | Permissions for the model | - | `None` | +| `experiment_name_formatter` | `str` | Formatter for the experiment name | `"experiment_name_{timestamp}"` | `[Required]` | +| `model_name_formatter` | `str` | Formatter for the model name | `"model_name_{timestamp}"` | `[Required]` | +| `timestamp_column_name` | `str` | Name of the timestamp column | `"timestamp"` | `"timestamp"` | diff --git a/docs/source/modules/examples/spear_phishing/sp_email_enrichment.md b/docs/source/modules/examples/spear_phishing/sp_email_enrichment.md index c501b6a752..3cf1f06b72 100644 --- a/docs/source/modules/examples/spear_phishing/sp_email_enrichment.md +++ b/docs/source/modules/examples/spear_phishing/sp_email_enrichment.md @@ -17,8 +17,8 @@ limitations under the License. ## Spear Phishing Email Enrichment Module -Module ID: email_enrichment -Module Namespace: morpheus_spear_phishing +Module ID: `email_enrichment` +Module Namespace: `morpheus_spear_phishing` This module performs spear phishing email enrichment. @@ -26,10 +26,10 @@ This module performs spear phishing email enrichment. | Parameter | Type | Description | Example Value | Default Value | |--------------------------|------|---------------------------------------------------------------------|------------------------|---------------| -| `sender_sketches` | list | List of sender strings naming sender sketch inputs. | ["sender1", "sender2"] | `[]` | -| `intents` | list | List of intent strings naming computed intent inputs. | ["intent1", "intent2"] | `[]` | -| `raise_on_failure` | boolean | Indicate if we should treat processing errors as pipeline failures. | false | `false` | -| `token_length_threshold` | integer | Minimum token length to use when computing syntax similarity | 5 | None | +| `sender_sketches` | list | List of sender strings naming sender sketch inputs. | `["sender1", "sender2"]` | `[]` | +| `intents` | list | List of intent strings naming computed intent inputs. | `["intent1", "intent2"]` | `[]` | +| `raise_on_failure` | boolean | Indicate if we should treat processing errors as pipeline failures. | `false` | `false` | +| `token_length_threshold` | integer | Minimum token length to use when computing syntax similarity | `5` | `None` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_inference_intent.md b/docs/source/modules/examples/spear_phishing/sp_inference_intent.md index 05ba8d81c7..3e687bc867 100644 --- a/docs/source/modules/examples/spear_phishing/sp_inference_intent.md +++ b/docs/source/modules/examples/spear_phishing/sp_inference_intent.md @@ -17,8 +17,8 @@ limitations under the License. ## Inference Intent -Module ID: infer_email_intent -Module Namespace: morpheus_spear_phishing +Module ID: `infer_email_intent` +Module Namespace: `morpheus_spear_phishing` Infers an 'intent' for a given email body. @@ -26,16 +26,16 @@ Infers an 'intent' for a given email body. 
| Parameter | Type | Description | Example Value | Default Value | |--------------------|------|-----------------------------------------|-----------------------|-------------------------| -| `intent` | string | The intent for the model | "classify" | `None` | -| `task` | string | The task for the model | "text-classification" | `"text-classification"` | -| `model_path` | string | The path to the model | "/path/to/model" | `None` | -| `truncation` | boolean | If true, truncates inputs to max_length | true | `true` | -| `max_length` | integer | Maximum length for model input | 512 | `512` | -| `batch_size` | integer | The size of batches for processing | 256 | `256` | -| `feature_col` | string | The feature column to use | "body" | `"body"` | -| `label_col` | string | The label column to use | "label" | `"label"` | -| `device` | integer | The device to run on | 0 | `0` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `intent` | string | The intent for the model | `"classify"` | `None` | +| `task` | string | The task for the model | `"text-classification"` | `"text-classification"` | +| `model_path` | string | The path to the model | `"/path/to/model"` | `None` | +| `truncation` | boolean | If true, truncates inputs to `max_length` | `true` | `true` | +| `max_length` | integer | Maximum length for model input | `512` | `512` | +| `batch_size` | integer | The size of batches for processing | `256` | `256` | +| `feature_col` | string | The feature column to use | `"body"` | `"body"` | +| `label_col` | string | The label column to use | `"label"` | `"label"` | +| `device` | integer | The device to run on | `0` | `0` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_inference_sp_classifier.md b/docs/source/modules/examples/spear_phishing/sp_inference_sp_classifier.md index 34f17cf18c..c07788da95 100644 --- a/docs/source/modules/examples/spear_phishing/sp_inference_sp_classifier.md +++ b/docs/source/modules/examples/spear_phishing/sp_inference_sp_classifier.md @@ -17,8 +17,8 @@ limitations under the License. ## Spear Phishing Inference Module -Module ID: inference -Module Namespace: morpheus_spear_phishing +Module ID: `inference` +Module Namespace: `morpheus_spear_phishing` This module defines a setup for spear-phishing inference. @@ -26,10 +26,10 @@ This module defines a setup for spear-phishing inference. 
| Parameter | Type | Description | Example Value | Default Value | |------------------------|------|---------------------------------------|--------------------|---------------| -| `tracking_uri` | string | The tracking URI for the model | "/path/to/uri" | `None` | -| `registered_model` | string | The registered model for inference | "model_1" | `None` | -| `input_model_features` | list | The input features for the model | ["feat1", "feat2"] | `[]` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `tracking_uri` | string | The tracking URI for the model | `"/path/to/uri"` | `None` | +| `registered_model` | string | The registered model for inference | `"model_1"` | `None` | +| `input_model_features` | list | The input features for the model | `["feat1", "feat2"]` | `[]` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_label_and_score.md b/docs/source/modules/examples/spear_phishing/sp_label_and_score.md index 01a34bcf8b..edb39c4c00 100644 --- a/docs/source/modules/examples/spear_phishing/sp_label_and_score.md +++ b/docs/source/modules/examples/spear_phishing/sp_label_and_score.md @@ -17,8 +17,8 @@ limitations under the License. ## Spear Phishing Email Scoring Module -Module ID: label_and_score -Module Namespace: morpheus_spear_phishing +Module ID: `label_and_score` +Module Namespace: `morpheus_spear_phishing` This module defines a setup for spear-phishing email scoring. @@ -26,8 +26,8 @@ This module defines a setup for spear-phishing email scoring. | Parameter | Type | Description | Example Value | Default Value | |--------------------|------|---------------------------------------|---------------------------|---------------| -| `scoring_config` | dictionary | The scoring configuration | {"method": "probability"} | `None` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `scoring_config` | dictionary | The scoring configuration | `{"method": "probability"}` | `None` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_preprocessing.md b/docs/source/modules/examples/spear_phishing/sp_preprocessing.md index 29eefa121b..8be5b4b195 100644 --- a/docs/source/modules/examples/spear_phishing/sp_preprocessing.md +++ b/docs/source/modules/examples/spear_phishing/sp_preprocessing.md @@ -17,8 +17,8 @@ limitations under the License. ## Spear Phishing Inference Pipeline Preprocessing Module -Module ID: inference_pipeline_preproc -Module Namespace: morpheus_spear_phishing +Module ID: `inference_pipeline_preproc` +Module Namespace: `morpheus_spear_phishing` This module defines a pre-processing setup for the spear phishing inference pipeline. 
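For orientation, the two parameters this pre-processing module accepts (described in the parameter table below) combine into a configuration as small as the following sketch. This is an illustrative example assembled from the example values in that table, not an additional default taken from the code:

```json
{
  "attach_uuid": true,
  "raise_on_failure": false
}
```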
@@ -26,8 +26,8 @@ This module defines a pre-processing setup for the spear phishing inference pipe | Parameter | Type | Description | Example Value | Default Value | |--------------------|------|---------------------------------------------------|---------------|---------------| -| `attach_uuid` | boolean | If true, attach a unique identifier to each input | true | `false` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `attach_uuid` | boolean | If true, attach a unique identifier to each input | `true` | `false` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_aggregator.md b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_aggregator.md index a6a6c47a98..04bf2b028d 100644 --- a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_aggregator.md +++ b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_aggregator.md @@ -17,8 +17,8 @@ limitations under the License. ## Spear Phishing Sender Sketch Aggregator Module -Module ID: sender_sketch_aggregator -Module Namespace: morpheus_spear_phishing +Module ID: `sender_sketch_aggregator` +Module Namespace: `morpheus_spear_phishing` This module aggregates sender sketches in the spear phishing detection pipeline. @@ -26,14 +26,14 @@ This module aggregates sender sketches in the spear phishing detection pipeline. | Parameter | Type | Description | Example Value | Default Value | |------------------------|------------|--------------------------------------------|---------------|---------------| -| `sender_sketch_config` | dictionary | The configuration for the sender sketches. | See Below | `None` | +| `sender_sketch_config` | dictionary | The configuration for the sender sketches. | Refer Below | `None` | ### `sender_sketch_config` | Key | Type | Description | Example Value | Default Value | |--------------------|-------|------------------------------------------|------------------------|---------------| -| `sender_sketches` | array | The list of sender sketches to aggregate | ["sketch1", "sketch2"] | `[]` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `sender_sketches` | array | The list of sender sketches to aggregate | `["sketch1", "sketch2"]` | `[]` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_query_constructor.md b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_query_constructor.md index 1e00b3c219..40804b0687 100644 --- a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_query_constructor.md +++ b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_query_constructor.md @@ -18,8 +18,8 @@ limitations under the License. ## SQL Sender Sketch Query Constructor Module -Module ID: sql_sender_sketch_query_constructor -Module Namespace: morpheus_spear_phishing +Module ID: `sql_sender_sketch_query_constructor` +Module Namespace: `morpheus_spear_phishing` This module constructs SQL sender sketch queries in the spear phishing detection pipeline. 
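Before the query-constructor parameter below, it may help to see the sender sketch aggregator configuration from the preceding section assembled end to end. This is an illustrative sketch only, using the example values from the aggregator's tables and nesting them under `sender_sketch_config` as the sub-table implies:

```json
{
  "sender_sketch_config": {
    "sender_sketches": ["sketch1", "sketch2"],
    "raise_on_failure": false
  }
}
```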
@@ -27,7 +27,7 @@ This module constructs SQL sender sketch queries in the spear phishing detection | Parameter | Type | Description | Example Value | Default Value | |--------------------|------|---------------------------------------|---------------|---------------| -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration @@ -35,3 +35,4 @@ This module constructs SQL sender sketch queries in the spear phishing detection { "raise_on_failure": false } +``` diff --git a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_update.md b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_update.md index ae90c2f402..8d8017ba62 100644 --- a/docs/source/modules/examples/spear_phishing/sp_sender_sketch_update.md +++ b/docs/source/modules/examples/spear_phishing/sp_sender_sketch_update.md @@ -18,8 +18,8 @@ limitations under the License. ## Sender Sketch Update Module -Module ID: sender_sketch_update -Module Namespace: morpheus_spear_phishing +Module ID: `sender_sketch_update` +Module Namespace: `morpheus_spear_phishing` This module updates the sender sketch for spear phishing detection. @@ -27,16 +27,16 @@ This module updates the sender sketch for spear phishing detection. | Parameter | Type | Description | Example Value | Default Value | |------------------------|------------|---------------------------------|---------------|---------------| -| `sender_sketch_config` | dictionary | Configuration for sender sketch | See Below | `None` | +| `sender_sketch_config` | dictionary | Configuration for sender sketch | Refer Below | `None` | ### `sender_sketch_config` | Key | Type | Description | Example Value | Default Value | |-------------------------------|------------|----------------------------------------|--------------------------|---------------| -| `endpoint` | string | The endpoint configuration | "http://my-endpoint.com" | `None` | -| `required_intents` | list | List of required intents | ["intent1", "intent2"] | `[]` | -| `sender_sketch_tables_config` | dictionary | Configuration for sender sketch tables | {"table1": "config1"} | `{}` | -| `raise_on_failure` | boolean | If true, raise exceptions on failures | false | `false` | +| `endpoint` | string | The endpoint configuration | `"http://my-endpoint.com"` | `None` | +| `required_intents` | list | List of required intents | `["intent1", "intent2"]` | `[]` | +| `sender_sketch_tables_config` | dictionary | Configuration for sender sketch tables | `{"table1": "config1"}` | `{}` | +| `raise_on_failure` | boolean | If true, raise exceptions on failures | `false` | `false` | ### Example JSON Configuration diff --git a/docs/source/modules/examples/spear_phishing/sp_spear_phishing_post_inference.md b/docs/source/modules/examples/spear_phishing/sp_spear_phishing_post_inference.md index fcf786cf09..f728a361ae 100644 --- a/docs/source/modules/examples/spear_phishing/sp_spear_phishing_post_inference.md +++ b/docs/source/modules/examples/spear_phishing/sp_spear_phishing_post_inference.md @@ -17,8 +17,8 @@ limitations under the License. ## Pre-inference Module -Module ID: post_inference -Module Namespace: morpheus_spear_phishing +Module ID: `post_inference` +Module Namespace: `morpheus_spear_phishing` This module represents the post-inference phase of the spear phishing inference pipeline. 
It handles the output from the label and score module, updates the sender sketch, and prepares the final output. @@ -27,7 +27,7 @@ label and score module, updates the sender sketch, and prepares the final output | Parameter | Type | Description | |------------------------|------|--------------------------------------------------------------------------------------------------------------| -| `scoring_config` | dictionary | Configuration for scoring, can include custom parameters for the scoring module. See below for more details. | +| `scoring_config` | dictionary | Configuration for scoring, can include custom parameters for the scoring module. Refer below for more details. | | `sender_sketch_config` | dictionary | Configuration for sender sketch module, including parameters such as endpoint details and sketch settings. | #### `scoring_config` @@ -35,14 +35,14 @@ label and score module, updates the sender sketch, and prepares the final output | Key | Type | Description | |--------------------|-------|--------------------------------------------------------------------| | `threshold` | float | Detection threshold for scoring. | -| `scoring_type` | string | Type of scoring to use. Currently only "probability" is supported. | +| `scoring_type` | string | Type of scoring to use. Currently only `"probability"` is supported. | | `raise_on_failure` | boolean | If true, raise exceptions on failures. Default is False. | #### `sender_sketch_config` | Key | Type | Description | Default Value | |-------------------------------|------|--------------------------------------------------------------|---------------| -| `endpoint` | dictionary | See `endpoint` subparameters | `None` | +| `endpoint` | dictionary | See `endpoint` sub-parameters | `None` | | `sender_sketches` | list | List of sender sketches | `[]` | | `required_intents` | list | List of required intents | `[]` | | `raise_on_failure` | boolean | If true, raise exceptions on failures | `False` | diff --git a/docs/source/modules/examples/spear_phishing/sp_spear_phishing_pre_inference.md b/docs/source/modules/examples/spear_phishing/sp_spear_phishing_pre_inference.md index c946f3aad4..37aff003b8 100644 --- a/docs/source/modules/examples/spear_phishing/sp_spear_phishing_pre_inference.md +++ b/docs/source/modules/examples/spear_phishing/sp_spear_phishing_pre_inference.md @@ -17,8 +17,8 @@ limitations under the License. ## Pre-inference Module -Module ID: pre_inference -Module Namespace: morpheus_spear_phishing +Module ID: `pre_inference` +Module Namespace: `morpheus_spear_phishing` Pre-inference phase of the spear phishing inference pipeline. It loads the necessary modules and establishes the required connections between modules. @@ -29,8 +29,8 @@ required connections between modules. |------------------------|------|------------------------------------------|---------------| | `raise_on_failure` | boolean | If true, raise exceptions on failures | `False` | | `max_batch_size` | integer | Maximum size of each batch | `500` | -| `intent_config` | dictionary | See `intent_config` subparameters | `{}` | -| `sender_sketch_config` | dictionary | See `sender_sketch_config` subparameters | `None` | +| `intent_config` | dictionary | See `intent_config` sub-parameters | `{}` | +| `sender_sketch_config` | dictionary | See `sender_sketch_config` sub-parameters | `None` | #### `intent_config` @@ -43,7 +43,7 @@ required connections between modules. 
| Key | Type | Description | Default Value |
|-------------------------------|------|--------------------------------------------------------------|---------------|
-| `endpoint` | dictionary | See `endpoint` subparameters | `None` |
+| `endpoint` | dictionary | See `endpoint` sub-parameters | `None` |
 | `sender_sketches` | list | List of sender sketches | `[]` |
 | `required_intents` | list | List of required intents | `[]` |
 | `raise_on_failure` | boolean | If true, raise exceptions on failures | `False` |
diff --git a/docs/source/stages/morpheus_stages.md b/docs/source/stages/morpheus_stages.md
index 7b6e31d832..db2d533606 100644
--- a/docs/source/stages/morpheus_stages.md
+++ b/docs/source/stages/morpheus_stages.md
@@ -20,7 +20,7 @@ limitations under the License.
 Stages are the building blocks of Morpheus pipelines. Below is a list of the most commonly used stages. For a full list of stages, refer to the stages API {py:mod}`morpheus.stages`. In addition to this there are several custom stages contained in the [Examples](../examples.md) and [Developer Guides](../developer_guide/guides.md).
 
 ## Table of Contents
-- [Doca](#doca)
+- [DOCA](#doca)
 - [General](#general)
 - [Inference](#inference)
 - [Input](#input)
@@ -30,10 +30,10 @@ Stages are the building blocks of Morpheus pipelines. Below is a list of the mos
 - [Pre-process](#pre-process)
 
 
-## Doca
+## DOCA
 
-- Doca Source Stage {py:class}`~morpheus.stages.doca.doca_source_stage.DocaSourceStage` A source stage used to receive raw packet data in GPU memory from a ConnectX NIC using DOCA GPUNetIO function within a CUDA kernel to actually receive and process Ethernet network packets. Receive packets information is passed to next pipeline stage in the form of RawPacketMessage. This stage is not compiled by default refer to the [Doca Example](../../../examples/doca/README.md) for details on building this stage.
-- Doca Convert Stage {py:class}`~morpheus.stages.doca.doca_source_stage.DocaConvertStage` Convert the RawPacketMessage format received by the DOCA Source Stage into a more complex message format MetaMessage. Packets' info never leave the GPU memory. This stage is not compiled by default refer to the [Doca Example](../../../examples/doca/README.md) for details on building this stage.
+- DOCA Source Stage {py:class}`~morpheus.stages.doca.doca_source_stage.DocaSourceStage` A source stage used to receive raw packet data in GPU memory from a ConnectX NIC, using a DOCA GPUNetIO function within a CUDA kernel to receive and process Ethernet network packets. Received packet information is passed to the next pipeline stage in the form of a RawPacketMessage. This stage is not compiled by default; refer to the [DOCA Example](../../../examples/doca/README.md) for details on building this stage.
+- DOCA Convert Stage {py:class}`~morpheus.stages.doca.doca_source_stage.DocaConvertStage` Converts the RawPacketMessage format received by the DOCA Source Stage into the more complex MetaMessage format. Packet data never leaves GPU memory. This stage is not compiled by default; refer to the [DOCA Example](../../../examples/doca/README.md) for details on building this stage.
 
 ## General
 
@@ -50,21 +50,21 @@ Stages are the building blocks of Morpheus pipelines. Below is a list of the mos
 
 ## Input
 
-- AppShield Source Stage {py:class}`~morpheus.stages.input.appshield_source_stage.AppShieldSourceStage` Load Appshield messages from one or more plugins into a dataframe.
+- App Shield Source Stage {py:class}`~morpheus.stages.input.appshield_source_stage.AppShieldSourceStage` Load App Shield messages from one or more plugins into a DataFrame.
 - Azure Source Stage {py:class}`~morpheus.stages.input.azure_source_stage.AzureSourceStage` Load Azure Active Directory messages.
-- Cloud Trail Source Stage {py:class}`~morpheus.stages.input.cloud_trail_source_stage.CloudTrailSourceStage` Load messages from a Cloudtrail directory.
-- Control Message File Source Stage {py:class}`~morpheus.stages.input.control_message_file_source_stage.ControlMessageFileSourceStage` Recieves control messages from different sources specified by a list of (fsspec)[https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files] strings.
+- Cloud Trail Source Stage {py:class}`~morpheus.stages.input.cloud_trail_source_stage.CloudTrailSourceStage` Load messages from a CloudTrail directory.
+- Control Message File Source Stage {py:class}`~morpheus.stages.input.control_message_file_source_stage.ControlMessageFileSourceStage` Receives control messages from different sources specified by a list of [fsspec](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files) strings.
 - Control Message Kafka Source Stage {py:class}`~morpheus.stages.input.control_message_kafka_source_stage.ControlMessageKafkaSourceStage` Load control messages from a Kafka cluster.
 - Databricks Delta Lake Source Stage {py:class}`~morpheus.stages.input.databricks_deltalake_source_stage.DataBricksDeltaLakeSourceStage` Source stage used to load messages from a DeltaLake table.
 - Duo Source Stage {py:class}`~morpheus.stages.input.duo_source_stage.DuoSourceStage` Load Duo Authentication messages.
 - File Source Stage {py:class}`~morpheus.stages.input.file_source_stage.FileSourceStage` Load messages from a file.
 - HTTP Client Source Stage {py:class}`~morpheus.stages.input.http_client_source_stage.HttpClientSourceStage` Poll a remote HTTP server for incoming data.
 - HTTP Server Source Stage {py:class}`~morpheus.stages.input.http_server_source_stage.HttpServerSourceStage` Start an HTTP server and listens for incoming requests on a specified endpoint.
-- In Memory Source Stage {py:class}`~morpheus.stages.input.in_memory_source_stage.InMemorySourceStage` Input source that emits a pre-defined list of dataframes.
+- In Memory Source Stage {py:class}`~morpheus.stages.input.in_memory_source_stage.InMemorySourceStage` Input source that emits a pre-defined list of DataFrames.
 - Kafka Source Stage {py:class}`~morpheus.stages.input.kafka_source_stage.KafkaSourceStage` Load messages from a Kafka cluster.
 - RSS Source Stage {py:class}`~morpheus.stages.input.rss_source_stage.RSSSourceStage` Load RSS feed items into a pandas DataFrame.
 
-## LLM 
+## LLM
 
 - LLM Engine Stage {py:class}`~morpheus.stages.llm.llm_engine_stage.LLMEngineStage` Execute an LLM engine within a Morpheus pipeline.
 
@@ -86,11 +86,11 @@ Stages are the building blocks of Morpheus pipelines. Below is a list of the mos
 - Generate Viz Frames Stage {py:class}`~morpheus.stages.postprocess.generate_viz_frames_stage.GenerateVizFramesStage` Write out visualization DataFrames.
 - MLflow Drift Stage {py:class}`~morpheus.stages.postprocess.ml_flow_drift_stage.MLFlowDriftStage` Report model drift statistics to MLflow.
 - Serialize Stage {py:class}`~morpheus.stages.postprocess.serialize_stage.SerializeStage` Include & exclude columns from messages.
-- Timeseries Stage {py:class}`~morpheus.stages.postprocess.timeseries_stage.TimeSeriesStage` Perform time series anomaly detection and add prediction. +- Time Series Stage {py:class}`~morpheus.stages.postprocess.timeseries_stage.TimeSeriesStage` Perform time series anomaly detection and add prediction. ## Pre-process -- Deserialize Stage {py:class}`~morpheus.stages.preprocess.deserialize_stage.DeserializeStage` Partition messages based on the pipeline config's `pipeline_batch_size` parameter. +- Deserialize Stage {py:class}`~morpheus.stages.preprocess.deserialize_stage.DeserializeStage` Partition messages based on the `pipeline_batch_size` parameter of the pipeline's `morpheus.config.Config` object. - Drop Null Stage {py:class}`~morpheus.stages.preprocess.drop_null_stage.DropNullStage` Drop null data entries from a DataFrame. - Preprocess AE Stage {py:class}`~morpheus.stages.preprocess.preprocess_ae_stage.PreprocessAEStage` Prepare Autoencoder input DataFrames for inference. - Preprocess FIL Stage {py:class}`~morpheus.stages.preprocess.preprocess_fil_stage.PreprocessFILStage` Prepare FIL input DataFrames for inference. diff --git a/examples/abp_nvsmi_detection/README.md b/examples/abp_nvsmi_detection/README.md index cbaf809086..fd63821568 100644 --- a/examples/abp_nvsmi_detection/README.md +++ b/examples/abp_nvsmi_detection/README.md @@ -55,7 +55,7 @@ $ nvidia-smi dmon Each line in the output represents the GPU metrics at a single point in time. As the tool progresses the GPU begins to be utilized and the SM% and Mem% values increase as memory is loaded into the GPU and computations are performed. The model we will be using can ingest this information and determine whether or not the GPU is mining cryptocurrencies without needing additional information from the host machine. -In this example we will be using the `examples/data/nvsmi.jsonlines` dataset that is known to contain mining behavior profiles. The dataset is in the `.jsonlines` format which means each new line represents a new JSON object. In order to parse this data, it must be ingested, split by lines into individual JSON objects, and parsed into cuDF dataframes. This will all be handled by Morpheus. +In this example we will be using the `examples/data/nvsmi.jsonlines` dataset that is known to contain mining behavior profiles. The dataset is in the `.jsonlines` format which means each new line represents a new JSON object. In order to parse this data, it must be ingested, split by lines into individual JSON objects, and parsed into cuDF DataFrames. This will all be handled by Morpheus. #### Generating your own dataset diff --git a/examples/abp_pcap_detection/README.md b/examples/abp_pcap_detection/README.md index 3220da33c9..77beb6675b 100644 --- a/examples/abp_pcap_detection/README.md +++ b/examples/abp_pcap_detection/README.md @@ -51,7 +51,7 @@ Once Triton server finishes starting up, it will display the status of all loade ``` ## ABP Detection Pipeline -Use Morpheus to run the Anomalous Behavior Profiling Detection Pipeline with the pcap data. A pipeline has been configured in `run.py` with several command line options: +Use Morpheus to run the Anomalous Behavior Profiling Detection Pipeline with the PCAP data. A pipeline has been configured in `run.py` with several command line options: From the root of the Morpheus repo, run: ```bash @@ -79,8 +79,8 @@ Options: [x>=1] --model_name TEXT The name of the model that is deployed on Tritonserver. - --iterative Iterative mode will emit dataframes one at a - time. 
Otherwise a list of dataframes is + --iterative Iterative mode will emit DataFrames one at a + time. Otherwise a list of DataFrames is emitted. Iterative mode is good for interleaving source stages. --server_url TEXT Tritonserver url. [required] diff --git a/examples/abp_pcap_detection/run.py b/examples/abp_pcap_detection/run.py index 8937351d16..bdd7fb7fb8 100644 --- a/examples/abp_pcap_detection/run.py +++ b/examples/abp_pcap_detection/run.py @@ -83,7 +83,7 @@ "--iterative", is_flag=True, default=False, - help=("Iterative mode will emit dataframes one at a time. Otherwise a list of dataframes is emitted. " + help=("Iterative mode will emit DataFrames one at a time. Otherwise a list of DataFrames is emitted. " "Iterative mode is good for interleaving source stages."), ) @click.option("--server_url", required=True, help="Tritonserver url.", default="localhost:8000") diff --git a/examples/developer_guide/2_2_rabbitmq/README.md b/examples/developer_guide/2_2_rabbitmq/README.md index 51ebee8347..cadd6075a2 100644 --- a/examples/developer_guide/2_2_rabbitmq/README.md +++ b/examples/developer_guide/2_2_rabbitmq/README.md @@ -36,7 +36,7 @@ docker run --rm -it --hostname my-rabbit -p 15672:15672 -p 5672:5672 rabbitmq:3- The image can be verified with the web management console by opening http://localhost:15672 in a web browser. Enter "guest" for both the username and the password. ## Installing Pika -The `RabbitMQSourceStage` and `WriteToRabbitMQStage` stages use the [pika](https://pika.readthedocs.io/en/stable/#) RabbitMQ client for Python. To install this into the current env run: +The `RabbitMQSourceStage` and `WriteToRabbitMQStage` stages use the [pika](https://pika.readthedocs.io/en/stable/#) RabbitMQ client for Python. To install this into the current environment run: ```bash pip install -r examples/developer_guide/2_2_rabbitmq/requirements.txt ``` diff --git a/examples/developer_guide/4_rabbitmq_cpp_stage/README.md b/examples/developer_guide/4_rabbitmq_cpp_stage/README.md index 313fa34f98..8c9414ae88 100644 --- a/examples/developer_guide/4_rabbitmq_cpp_stage/README.md +++ b/examples/developer_guide/4_rabbitmq_cpp_stage/README.md @@ -29,13 +29,13 @@ This example adds two flags to the `read_simple.py` script. A `--use_cpp` flag w | Dev Container | ✘ | | ## Installing Pika -The `RabbitMQSourceStage` and `WriteToRabbitMQStage` stages use the [pika](https://pika.readthedocs.io/en/stable/#) RabbitMQ client for Python. To install this into the current env run: +The `RabbitMQSourceStage` and `WriteToRabbitMQStage` stages use the [pika](https://pika.readthedocs.io/en/stable/#) RabbitMQ client for Python. To install this into the current environment run: ```bash pip install -r examples/developer_guide/4_rabbitmq_cpp_stage/requirements.txt ``` ## Building the Example -There are two ways to build the example. The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to cmake, for users using the `scripts/compile.sh` at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable: +There are two ways to build the example. 
The first is to build the examples along with Morpheus by passing the `-DMORPHEUS_BUILD_EXAMPLES=ON` flag to CMake. Users using the `scripts/compile.sh` script at the root of the Morpheus repo can do this by setting the `CMAKE_CONFIGURE_EXTRA_ARGS` environment variable:
 ```bash
 CMAKE_CONFIGURE_EXTRA_ARGS="-DMORPHEUS_BUILD_EXAMPLES=ON" ./scripts/compile.sh
 ```
diff --git a/examples/digital_fingerprinting/demo/submit_messages.md b/examples/digital_fingerprinting/demo/submit_messages.md
index 03a30fbd9b..bfadadd015 100644
--- a/examples/digital_fingerprinting/demo/submit_messages.md
+++ b/examples/digital_fingerprinting/demo/submit_messages.md
@@ -28,7 +28,7 @@ The UI will look like a dynamic form with various buttons and input fields that
 
 By clicking on the `Add Control Message` button adds a new control message to the form. Each control message has a type selector and three buttons, one to add metadata properties, to add task and the other to remove control message.
 ![DFP Add Control Message](./images/dfp_add_control_message.png)
-- `Type`: A user may select a control message of either the `streaming` or `payload` kind. In the backend digital fingerprinting workflow handles the message in accordance with the type provided.
+- `Type`: A user may select a control message of either the `streaming` or `payload` kind. The DFP pipeline handles the message in accordance with the type provided.
 - `Add Metadata`: button adds a new metadata section to the control message. Each metadata section has a key selector, a data type selector, a value input field, and a `Remove` button.
 - `Add Task`: button adds a new task section to the control message. Each task section has a type selector, a `Properties` section, and a `Remove` button.
 - `Add Property`: button inside the `Properties` section adds a new property to the task. Each property has a key input field, a data type selector, a value input field, and a `Remove` button.
diff --git a/examples/digital_fingerprinting/demo/training.md b/examples/digital_fingerprinting/demo/training.md
index ff50cdbb65..f9004ab34c 100644
--- a/examples/digital_fingerprinting/demo/training.md
+++ b/examples/digital_fingerprinting/demo/training.md
@@ -18,7 +18,7 @@ limitations under the License.
 
 # Training Control Message GUI
 ## Introduction
-This document demonstrates how to use a GUI to submit training control messages to a Kafka topic, which will be consumed by the DFP Morpheus pipeline in the backend. To begin, let's assume that we have a set of training data files located in the file system at `/workspace/examples/data/dfp/duo-training-data`. We can use these files as an input data for the training message.
+This document demonstrates how to use a GUI to submit training control messages to a Kafka topic, which will be consumed by the DFP Morpheus pipeline in the back end. To begin, let's assume that we have a set of training data files located in the file system at `/workspace/examples/data/dfp/duo-training-data`. We can use these files as input data for the training message.
 
 ## Home
 To submit a training message, we need to provide some input values.
The following screenshot shows the input values we need to enter: diff --git a/examples/digital_fingerprinting/production/README.md b/examples/digital_fingerprinting/production/README.md index 48a3033a20..289634790e 100644 --- a/examples/digital_fingerprinting/production/README.md +++ b/examples/digital_fingerprinting/production/README.md @@ -100,15 +100,15 @@ Both scripts are capable of running either a training or inference pipeline for | `--train_users` | One of: `all`, `generic`, `individual`, `none` | Indicates whether or not to train per user or a generic model for all users. Selecting `none` runs the inference pipeline. | | `--skip_user` | TEXT | User IDs to skip. Mutually exclusive with `only_user` | | `--only_user` | TEXT | Only users specified by this option will be included. Mutually exclusive with `skip_user` | -| `--start_time` | TEXT | The start of the time window, if undefined start_date will be `now()-duration` | +| `--start_time` | TEXT | The start of the time window, if undefined `start_date` will be `now()-duration` | | `--duration` | TEXT | The duration to run starting from now [default: 60d] | -| `--cache_dir` | TEXT | The location to cache data such as S3 downloads and pre-processed data [env var: `DFP_CACHE_DIR`; default: `./.cache/dfp`] | +| `--cache_dir` | TEXT | The location to cache data such as S3 downloads and pre-processed data [environment variable: `DFP_CACHE_DIR`; default: `./.cache/dfp`] | | `--log_level` | One of: `CRITICAL`, `FATAL`, `ERROR`, `WARN`, `WARNING`, `INFO`, `DEBUG` | Specify the logging level to use. [default: `WARNING`] | -| `--sample_rate_s` | INTEGER | Minimum time step, in milliseconds, between object logs. [env var: `DFP_SAMPLE_RATE_S`; default: 0] | -| `-f`, `--input_file` | TEXT | List of files to process. Can specify multiple arguments for multiple files. Also accepts glob (*) wildcards and schema prefixes such as `s3://`. For example, to make a local cache of an s3 bucket, use `filecache::s3://mybucket/*`. Refer to [fsspec documentation](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files) for list of possible options. | +| `--sample_rate_s` | INTEGER | Minimum time step, in milliseconds, between object logs. [environment variable: `DFP_SAMPLE_RATE_S`; default: 0] | +| `-f`, `--input_file` | TEXT | List of files to process. Can specify multiple arguments for multiple files. Also accepts glob (*) wildcards and schema prefixes such as `s3://`. For example, to make a local cache of an s3 bucket, use `filecache::s3://mybucket/*`. Refer to [`fsspec` documentation](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=open_files#fsspec.open_files) for list of possible options. | | `--watch_inputs` | FLAG | Instructs the pipeline to continuously check the paths specified by `--input_file` for new files. This assumes that the at least one paths contains a wildcard. | | `--watch_interval` | FLOAT | Amount of time, in seconds, to wait between checks for new files. Only used if --watch_inputs is set. [default `1.0`] | -| `--tracking_uri` | TEXT | The MLflow tracking URI to connect to the tracking backend. [default: `http://localhost:5000`] | +| `--tracking_uri` | TEXT | The MLflow tracking URI to connect to. [default: `http://localhost:5000`] | | `--help` | | Show this message and exit. | ##### Steps to Run Example Pipeline @@ -164,7 +164,7 @@ The commands in the previous section run stage-based example DFP pipelines. 
The Commands to run equivalent module-based DFP pipelines can be found [here](../../../docs/source/developer_guide/guides/10_modular_pipeline_digital_fingerprinting.md#running-example-modular-dfp-pipelines). #### Optional MLflow Service -Starting either the `morpheus_pipeline` or the `jupyter` service, will start the `mlflow` service in the background. For debugging purposes it can be helpful to view the logs of the running MLflow service. +Starting either the `morpheus_pipeline` or the `jupyter` service, will start the `mlflow` service in the background. For debugging purposes it can be helpful to view the logs of the running MLflow service. From the `examples/digital_fingerprinting/production` dir run: ```bash @@ -188,7 +188,7 @@ MLflow for this production digital fingerprint use case can be installed from NG The deployment of the [Morpheus SDK Client](../../../docs/source/cloud_deployment_guide.md#install-morpheus-sdk-client) is also done _almost_ the same way as what's specified in the Cloud Deployment Guide. However, you would specify command arguments differently for this production DFP use case. -Note: The published Morpheus image includes a minimal set of packages for launching JupyterLab but you will likely still want to update the conda environment inside the running pod with the `conda_env.yml` file in this same directory to install other use case dependencies such as boto3 and s3fs. +Note: The published Morpheus image includes a minimal set of packages for launching JupyterLab but you will likely still want to update the Conda environment inside the running pod with the `conda_env.yml` file in this same directory to install other use case dependencies such as boto3 and s3fs. #### Notebooks diff --git a/examples/digital_fingerprinting/production/grafana/README.md b/examples/digital_fingerprinting/production/grafana/README.md index f79fe6e92b..a63f6d30b2 100644 --- a/examples/digital_fingerprinting/production/grafana/README.md +++ b/examples/digital_fingerprinting/production/grafana/README.md @@ -20,7 +20,7 @@ This example builds on the [Azure DFP pipeline example](../production/README.md) ## Grafana Configuration -The data sources and dashboards in this example are managed using config files. [Grafana's provisioning system](https://grafana.com/docs/grafana/latest/administration/provisioning/) then uses these files to add the data sources and dashboards to Grafana upon startup. +The data sources and dashboards in this example are managed using configuration files. [Grafana's provisioning system](https://grafana.com/docs/grafana/latest/administration/provisioning/) then uses these files to add the data sources and dashboards to Grafana upon startup. ### Data Sources @@ -38,7 +38,7 @@ The [CSV data source plugin](https://grafana.com/grafana/plugins/marcusolsson-cs Please note that the use of the CSV plugin is for demonstration purposes only. Grafana includes support for many data sources more suitable for production deployments. See [here](https://grafana.com/docs/grafana/latest/datasources/) for more information. -#### Updates to grafana.ini +#### Updates to `grafana.ini` The following is added to the default `grafana.ini` to enable local mode for CSV data source plugin. This allows the CSV data source plugin to access files on local file system. @@ -111,7 +111,7 @@ Click on `DFP Logs` in the `General` folder. 
You may need to expand the `General -This dashboard was provisioned using config files but can also be manually created with the following steps: +This dashboard was provisioned using configuration files but can also be manually created with the following steps: 1. Click `Dashboards` in the left-side menu. 2. Click `New` and select `New Dashboard`. 3. On the empty dashboard, click `+ Add visualization`. @@ -146,9 +146,9 @@ python run.py --log_level DEBUG --train_users generic --start_time "2022-08-01" -12. Finally, click `Save rule and exit` at top right of the page. +12. Finally, click `Save rule and exit` at top right of the page. -By default, all alerts will be sent through the `grafana-default-email` contact point. You can add email addresses to this contact point by clicking on `Contact points` under `Alerting` in the left-side menu. You would also have to configure SMTP in the `[smtp]` section of your `grafana.ini`. More information about about Grafana Alerting contact points can found [here](https://grafana.com/docs/grafana/latest/alerting/fundamentals/contact-points/). +By default, all alerts will be sent through the `grafana-default-email` contact point. You can add email addresses to this contact point by clicking on `Contact points` under `Alerting` in the left-side menu. You would also have to configure SMTP in the `[smtp]` section of your `grafana.ini`. More information about Grafana Alerting contact points can found [here](https://grafana.com/docs/grafana/latest/alerting/fundamentals/contact-points/). ## Run Azure DFP Inference: diff --git a/examples/digital_fingerprinting/production/morpheus/benchmarks/README.md b/examples/digital_fingerprinting/production/morpheus/benchmarks/README.md index a9c09197d2..b002e7fa47 100644 --- a/examples/digital_fingerprinting/production/morpheus/benchmarks/README.md +++ b/examples/digital_fingerprinting/production/morpheus/benchmarks/README.md @@ -43,7 +43,7 @@ Now install Morpheus: pip install -e /workspace ``` -Install additonal required dependencies: +Install additional required dependencies: ```bash mamba env update \ -n ${CONDA_DEFAULT_ENV} \ @@ -101,15 +101,15 @@ To ensure the [file_to_df_loader.py](../../../../../morpheus/loaders/file_to_df_ export MORPHEUS_FILE_DOWNLOAD_TYPE=dask ``` -Benchmarks for an individual workflow can be run from `examples/digital_fingerprinting/production/morpheus` in your dev container: +Benchmarks for an individual workflow can be run from `examples/digital_fingerprinting/production/morpheus` in your container: ``` - pytest -s --log-level=WARN --benchmark-enable --benchmark-warmup=on --benchmark-warmup-iterations=1 --benchmark-autosave benchmarks/test_bench_e2e_dfp_pipeline.py:: ``` + The `-s` option allows outputs of pipeline execution to be displayed so you can ensure there are no errors while running your benchmarks. -The `--benchmark-warmup` and `--benchmark-warmup-iterations` options are used to run the workflow(s) once before starting measurements. This is because, if it does not already exist, the preprocessed data is cached during the initial run. +The `--benchmark-warmup` and `--benchmark-warmup-iterations` options are used to run the workflows once before starting measurements. This is because, if it does not already exist, the preprocessed data is cached during the initial run. `` is the name of the test to run benchmarks on. 
This can be one of the following: - `test_dfp_modules_azure_payload_inference_e2e` @@ -143,7 +143,7 @@ To run E2E benchmarks on all workflows: pytest -s --benchmark-enable --benchmark-warmup=on --benchmark-warmup-iterations=1 --benchmark-autosave benchmarks/test_bench_e2e_dfp_pipeline.py ``` -Here are the benchmark comparisons for individual tests. When the control message type is "payload", the rolling window stage is bypassed, whereas when it is "streaming", the windows are created with historical data. +Here are the benchmark comparisons for individual tests. When the control message type is `payload`, the rolling window stage is bypassed, whereas when it is `streaming`, the windows are created with historical data. #### Training (Azure): ``` @@ -228,7 +228,7 @@ with `000N` where N is incremented for every run. For example, the report file n A hook to `pytest-benchmark` was developed to add the following information to the JSON report: -GPU(s) used by Morpheus. For example: +GPUs used by Morpheus. For example: ``` "gpu_0": { "id": 0, @@ -241,25 +241,25 @@ GPU(s) used by Morpheus. For example: } ``` -Morpheus config for each workflow: -- num_threads -- pipeline_batch_size -- edge_buffer_size -- start_time -- duration -- userid_column_name -- timestamp_column_name -- source -- use_cpp +Morpheus configuration for each workflow: +- `num_threads` +- `pipeline_batch_size` +- `edge_buffer_size` +- `start_time` +- `duration` +- `userid_column_name` +- `timestamp_column_name` +- `source` +- `use_cpp` Additional benchmark stats for each workflow: -- input_lines -- min_throughput_lines -- max_throughput_lines -- mean_throughput_lines -- median_throughput_lines -- input_bytes -- min_throughput_bytes -- max_throughput_bytes -- mean_throughput_bytes -- median_throughput_bytes +- `input_lines` +- `min_throughput_lines` +- `max_throughput_lines` +- `mean_throughput_lines` +- `median_throughput_lines` +- `input_bytes` +- `min_throughput_bytes` +- `max_throughput_bytes` +- `mean_throughput_bytes` +- `median_throughput_bytes` diff --git a/examples/digital_fingerprinting/starter/README.md b/examples/digital_fingerprinting/starter/README.md index 75c6040707..013300ceed 100644 --- a/examples/digital_fingerprinting/starter/README.md +++ b/examples/digital_fingerprinting/starter/README.md @@ -22,7 +22,7 @@ We show here how to set up and run the DFP pipeline for three log types: CloudTr ## Environment Setup -Follow the instructions [here](../../../docs/source/developer_guide/contributing.md) to set up your development environment in either a Docker container or conda environment. +Follow the instructions [here](../../../docs/source/developer_guide/contributing.md) to set up your development environment in either a Docker container or Conda environment. ## Morpheus CLI @@ -80,7 +80,7 @@ Commands: delay (Deprecated) Delay results for a certain duration filter Filter message by a classification threshold from-azure Source stage is used to load Azure Active Directory messages. - from-cloudtrail Load messages from a Cloudtrail directory + from-cloudtrail Load messages from a CloudTrail directory from-duo Source stage is used to load Duo Authentication messages. gen-viz (Deprecated) Write out visualization data frames inf-pytorch Perform inference with PyTorch @@ -98,46 +98,46 @@ Commands: The commands above correspond to the Morpheus stages that can be used to construct your DFP pipeline. Options are available to configure pipeline and stages. 
The following table shows mapping between the main Morpheus CLI commands and underlying Morpheus Python stage classes:
 
-| CLI Command | Stage Class | Python File |
-| ---------------| -------------------------| ---------------------------------------------------------
-| from-azure | AzureSourceStage | morpheus/stages/input/azure_source_stage.py
-| from-cloudtrail| CloudTrailSourceStage | morpheus/stages/input/clout_trail_source_stage.py
-| from-duo | DuoSourceStage | morpheus/stages/input/duo_source_stage.py
-| train-ae | TrainAEStage | morpheus/stages/preprocess/train_ae_stage.py
-| preprocess | PreprocessAEStage | morpheus/stages/preprocess/preprocess_ae_stage.py
-| inf-pytorch | AutoEncoderInferenceStage| morpheus/stages/inference/auto_encoder_inference_stage.py
-| add-scores | AddScoresStage | morpheus/stages/postprocess/add_scores_stage.py
-| serialize | SerializeStage | morpheus/stages/postprocess/serialize_stage.py
-| to-file | WriteToFileStage | morpheus/stages/output/write_to_file_stage.py
+| CLI Command | Stage Class | Python File |
+| ------------------| ----------------------------| -----------------------------------------------------------
+| `from-azure` | `AzureSourceStage` | `morpheus/stages/input/azure_source_stage.py`
+| `from-cloudtrail` | `CloudTrailSourceStage` | `morpheus/stages/input/cloud_trail_source_stage.py`
+| `from-duo` | `DuoSourceStage` | `morpheus/stages/input/duo_source_stage.py`
+| `train-ae` | `TrainAEStage` | `morpheus/stages/preprocess/train_ae_stage.py`
+| `preprocess` | `PreprocessAEStage` | `morpheus/stages/preprocess/preprocess_ae_stage.py`
+| `inf-pytorch` | `AutoEncoderInferenceStage` | `morpheus/stages/inference/auto_encoder_inference_stage.py`
+| `add-scores` | `AddScoresStage` | `morpheus/stages/postprocess/add_scores_stage.py`
+| `serialize` | `SerializeStage` | `morpheus/stages/postprocess/serialize_stage.py`
+| `to-file` | `WriteToFileStage` | `morpheus/stages/output/write_to_file_stage.py`
 
 ## Morpheus DFP Stages
 
-**Source stages** - These include `AzureSourceStage`, `CloudTrailSourceStage` and `DuoSourceStage`. They are responsible for reading log file(s) that match provided `--input_glob` (e.g. `/duo_logs/*.json`). Data is grouped by user so that each batch processed by the pipeline will only contain rows corresponding to a single user. Feature engineering also happens in this stage. All DFP source stages must extend `AutoencoderSourceStage` and implement the `files_to_dfs_per_user` abstract method. Feature columns can be managed by overriding the `derive_features` method. Otherwise, all columns from input data pass through to next stage.
+**Source stages** - These include `AzureSourceStage`, `CloudTrailSourceStage` and `DuoSourceStage`. They are responsible for reading log files that match the provided `--input_glob` (for example `/duo_logs/*.json`). Data is grouped by user so that each batch processed by the pipeline will only contain rows corresponding to a single user. Feature engineering also happens in this stage. All DFP source stages must extend `AutoencoderSourceStage` and implement the `files_to_dfs_per_user` abstract method. Feature columns can be managed by overriding the `derive_features` method. Otherwise, all columns from input data pass through to the next stage.
 
 **Preprocessing stages**
 
 `TrainAEStage` can either train user models using data matching a provided `--train_data_glob` or load pre-trained models from file using `--pretrained_filename`.
When using `--train_data_glob`, user models can be saved using the `--models_output_filename` option. The `--source_stage_class` must also be used with `--train_data_glob` so that the training stage knows how to read the training data. The autoencoder implementation used for user model training can be found [here](https://github.com/nv-morpheus/dfencoder). The following are the available CLI options for the `TrainAEStage` (train-ae):

-| Option | Description
-| ----------------------| ---------------------------------------------------------
-| pretrained_filename | File path to pickled user models saved from previous training run using `--models_output_filename`.
-| train_data_glob | Glob path to training data.
-| source_stage_class | Source stage so that training stage knows how to read/parse training data.
-| train_epochs | Number of training epochs. Default is 25.
-| min_train_rows | Minimum number of training rows required to train user model. Default is 300.
-| train_max_history | Maximum number of training rows per user. Default is 1000.
-| seed | When not None, ensure random number generators are seeded with `seed` to control reproducibility of user model.
-| sort_glob | If true the list of files matching `input_glob` will be processed in sorted order. Default is False.
-| models_output_filename| Can be used with `--train_data_glob` to save trained user models to file using provided file path. Models can be loaded later using `--pretrained_filename`.
+| Option | Description
+| -------------------------| ---------------------------------------------------------
+| `pretrained_filename` | File path to pickled user models saved from a previous training run using `--models_output_filename`.
+| `train_data_glob` | Glob path to training data.
+| `source_stage_class` | Source stage class so that the training stage knows how to read/parse the training data.
+| `train_epochs` | Number of training epochs. Default is 25.
+| `min_train_rows` | Minimum number of training rows required to train a user model. Default is 300.
+| `train_max_history` | Maximum number of training rows per user. Default is 1000.
+| `seed` | When not None, ensures random number generators are seeded with `seed` to control reproducibility of the user model.
+| `sort_glob` | If true, the list of files matching `input_glob` will be processed in sorted order. Default is False.
+| `models_output_filename` | Can be used with `--train_data_glob` to save trained user models to file using the provided file path. Models can be loaded later using `--pretrained_filename`.

The `PreprocessAEStage` is responsible for creating a Morpheus message that contains everything needed by the inference stage. For DFP inference, this stage must pass a `MultiInferenceAEMessage` to the inference stage. Each message will correspond to a single user and include the input feature columns, the user's model and training data anomaly scores.

-**Inference stage** - `AutoEncoderInferenceStage` calculates anomaly scores (i.e., reconstruction loss) and z-scores for each user input dataset.
+**Inference stage** - `AutoEncoderInferenceStage` calculates anomaly scores (specifically, reconstruction loss) and z-scores for each user input dataset.

**Post-processing stage** - The DFP pipeline uses the `AddScoresStage` for post-processing to add anomaly scores and z-scores from previous inference stage with matching labels.
-**Serialize stage** - `SerializeStage` is used to convert `MultiResponseMessage` from previous stage to a `MessageMeta` to make it suitable for output (i.e., write to file or Kafka).
+**Serialize stage** - `SerializeStage` is used to convert `MultiResponseMessage` from the previous stage to a `MessageMeta` to make it suitable for output (for example, writing to file or Kafka).

**Write stage** - `WriteToFileStage` writes input data with inference results to an output file path.

@@ -281,10 +281,10 @@ morpheus --log_level=DEBUG \

## Using Morpheus Python API

-The DFP pipelines can also be constructed and run via the Morpheus Python API. An [example](./run_cloudtrail_dfp.py) is included for the Cloudtrail DFP pipeline. The following are some commands to
+The DFP pipelines can also be constructed and run via the Morpheus Python API. An [example](./run_cloudtrail_dfp.py) is included for the CloudTrail DFP pipeline. The following are some commands to
 run the example.

-Train user models from files in `models/datasets/training-data/dfp-cloudtrail-*.csv` and saves user models to file. Pipeline then uses these models to run inference on Cloudtrail validation data in `models/datasets/validation-data/dfp-cloudtrail-*-input.csv`. Inference results are written to `cloudtrail-dfp-results.csv`.
+The following command trains user models from files in `models/datasets/training-data/dfp-cloudtrail-*.csv` and saves them to file. The pipeline then uses these models to run inference on CloudTrail validation data in `models/datasets/validation-data/dfp-cloudtrail-*-input.csv`. Inference results are written to `cloudtrail-dfp-results.csv`.
 ```
 python ./examples/digital_fingerprinting/starter/run_cloudtrail_dfp.py \
     --columns_file=morpheus/data/columns_ae_cloudtrail.txt \
diff --git a/examples/digital_fingerprinting/visualization/README.md b/examples/digital_fingerprinting/visualization/README.md
index 820b808b36..9d6542e2e7 100644
--- a/examples/digital_fingerprinting/visualization/README.md
+++ b/examples/digital_fingerprinting/visualization/README.md
@@ -72,7 +72,7 @@ Duo training data will be saved to `/workspace/examples/data/dfp/duo-training-da

## Running pipeline to generate input for DFP Visualization

-The pipeline uses `DFPVizPostprocStage` to perform post-processing on DFP inference output. The inference output is converted to input format expected by the DFP Visualization and saves to multiple files based on specified time period. Time period to group data by must be [one of pandas' offset strings](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). The default period is one day (D). The output files will be named by appending period to prefix (e.g. `dfp-viz-2022-08-30.csv`). These are the available options used for `DFPVizPostprocStage`:
+The pipeline uses `DFPVizPostprocStage` to perform post-processing on DFP inference output. The inference output is converted to the input format expected by the DFP Visualization and saved to multiple files based on the specified time period. The time period to group data by must be [one of pandas' offset strings](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). The default period is one day (`D`). The output files will be named by appending the period to the prefix (for example, `dfp-viz-2022-08-30.csv`). These are the available options used for `DFPVizPostprocStage`:
 ```
 --period Time period to batch input data and save output files by.
[default: `D`] diff --git a/examples/doca/README.md b/examples/doca/README.md index 9dc88c009d..c7c5d21edf 100644 --- a/examples/doca/README.md +++ b/examples/doca/README.md @@ -17,12 +17,12 @@ limitations under the License. # DOCA GPU Real-Time traffic analysis -Examples in this directory use the DOCA Source Stage to receive and pre-process network packets in real-time before passing packets info to the next Morphues stages. +Examples in this directory use the DOCA Source Stage to receive and pre-process network packets in real-time before passing packets info to the next Morpheus stages. ## Obtaining the Morpheus DOCA Container DOCA Support is in early access and may only be used via the Morpheus DOCA Container found in NGC. Please speak to your NVIDIA Morpheus contact for more information. -The container must be run in privileged mode and mount in hugepages as configured according to the DOCA GPUNetIO documentation. +The container must be run in privileged mode and mount `/dev/hugepages` as configured according to the DOCA GPUNetIO documentation. ``` docker run -v /dev/hugepages:/dev/hugepages --privileged --rm -ti --runtime=nvidia --net=host --gpus=all --cap-add=sys_nice ${MORPHEUS_DOCA_IMAGE} bash @@ -30,7 +30,7 @@ docker run -v /dev/hugepages:/dev/hugepages --privileged --rm -ti --runtime=nvid ## Preparing the environment -Prior to running the example, the `rdma-core` conda package needs to be _removed by force_ from the conda environment, otherwise the environment is incompatible with the DOCA-provided packages. +Prior to running the example, the `rdma-core` Conda package needs to be _removed by force_ from the Conda environment, otherwise the environment is incompatible with the DOCA-provided packages. ``` conda remove --force rdma-core ``` @@ -80,7 +80,7 @@ In case of UDP traffic, the sample will launch a simple pipeline with the DOCA S ``` python ./examples/doca/run_udp_raw.py --nic_addr 17:00.1 --gpu_addr ca:00.0 ``` -UDP traffic can be easily sent with nping to the interface where Morpheus is listening: +UDP traffic can be easily sent with `nping` to the interface where Morpheus is listening: ``` nping --udp -c 100000 -p 4100 192.168.2.27 --data-length 1024 --delay 0.1ms ``` @@ -115,13 +115,13 @@ Added stage: **Note:** For this to function correctly, the VDB upload pipeline must have been run previously. @@ -135,7 +133,7 @@ pipeline option of `rag`: ### Run example (Standalone Pipeline): -**Using NGC Nemo LLMs** +**Using NGC NeMo LLMs** ```bash export NGC_API_KEY=[YOUR_KEY_HERE] diff --git a/examples/llm/vdb_upload/README.md b/examples/llm/vdb_upload/README.md index 4f9b00a484..7348a9cde6 100644 --- a/examples/llm/vdb_upload/README.md +++ b/examples/llm/vdb_upload/README.md @@ -30,8 +30,8 @@ limitations under the License. 
- [Milvus Service](#milvus-service) - [Triton Service](#triton-service) - [Running the Morpheus Pipeline](#running-the-morpheus-pipeline) - - [Options for vdb_upload Command](#options-for-vdb_upload-command) - - [Exporting and Deploying a Different Model from Huggingface](#exporting-and-deploying-a-different-model-from-huggingface) + - [Options for `vdb_upload` Command](#options-for-vdb_upload-command) + - [Exporting and Deploying a Different Model from Hugging Face](#exporting-and-deploying-a-different-model-from-hugging-face) ## Supported Environments All environments require additional Conda packages which can be installed with either the `conda/environments/all_cuda-121_arch-x86_64.yaml` or `conda/environments/examples_cuda-121_arch-x86_64.yaml` environment files. @@ -71,7 +71,7 @@ tasks: ### Embedding Model - The pipeline can accommodate various embedding models that transform text into vectors of floating-point numbers. - Several models from Huggingface, such as `paraphrase-multilingual-mpnet-base-v2`, `e5-large-v2`, + Several models from Hugging Face, such as `paraphrase-multilingual-mpnet-base-v2`, `e5-large-v2`, and `all-mpnet-base-v2`, have been evaluated for compatibility. - For the purposes of this demonstration, the model `all-MiniLM-L6-v2` is employed. This model is included via LFS @@ -97,12 +97,10 @@ The pipeline is composed of three primary components: the feeds, perform preliminary data cleaning, and standardize the format for subsequent steps. 2. **Embedding Generator**: This is the heart of the pipeline, which takes the preprocessed text chunks and computes - their embeddings. Leveraging the model `all-MiniLM-L6-v2` from Huggingface, the text data is transformed into + their embeddings. Leveraging the model `all-MiniLM-L6-v2` from Hugging Face, the text data is transformed into embeddings with a dimension of 384. -3. **Vector Database Uploader**: Post embedding generation, this module takes the embeddings alongside their associated - metadata and pushes them to a Vector Database (VDB). For our implementation, Milvus, a GPU-accelerated vector - database, has been chosen. +3. **Vector Database Uploader**: Post embedding generation, this module takes the embeddings alongside their associated metadata and pushes them to a Vector Database (VDB). For our implementation, Milvus, a GPU-accelerated vector database, has been chosen. ### Rationale Behind Design Decisions @@ -161,7 +159,7 @@ To retrieve datasets from LFS run the following: ### Running the Morpheus Pipeline The top-level entry point for each of the LLM example pipelines is examples/llm/main.py. This script accepts a set of -options and a pipeline to run. For the purposes of this document, we'll focus on the vdb_upload pipeline option, which +options and a pipeline to run. For the purposes of this document, we'll focus on the `vdb_upload` pipeline option, which incorporates various functionalities like handling RSS and filesystem sources, embedding configurations, and vector database (VDB) settings. 
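As a point of reference for the embedding model mentioned above (`all-MiniLM-L6-v2`, 384-dimensional vectors), the following is a minimal sketch, independent of the Morpheus pipeline itself, that uses the `sentence-transformers` package to embed one text chunk and confirm the dimensionality the vector database collection is expected to store. The sample sentence is made up.

```python
# Minimal sketch: embed a text chunk with all-MiniLM-L6-v2 and verify that the
# resulting vector has the 384 dimensions assumed by the VDB collection schema.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["NVIDIA Morpheus is a GPU-accelerated cybersecurity framework."]
embeddings = model.encode(chunks)

print(embeddings.shape)  # -> (1, 384)
```

This standalone snippet only illustrates the input/output shape; the pipeline itself serves the same model through Triton rather than calling it in-process.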
@@ -241,13 +239,13 @@ The `vdb_upload` command has its own set of options and commands: - `langchain` - `pipeline` -### Exporting and Deploying a Different Model from Huggingface +### Exporting and Deploying a Different Model from Hugging Face -If you're looking to incorporate a different embedding model from Huggingface into the pipeline, follow the steps below +If you're looking to incorporate a different embedding model from Hugging Face into the pipeline, follow the steps below using `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as an example: 1. **Identify the Desired Model**: - - Head over to the [Huggingface Model Hub](https://huggingface.co/models) and search for the model you want. For + - Head over to the [Hugging Face Model Hub](https://huggingface.co/models) and search for the model you want. For this example, we are looking at `e5-large-v2`. 2. **Run the Pipeline Call with the Chosen Model**: @@ -263,7 +261,7 @@ using `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as an exampl ```text requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: ``` - This typically means the model name you provided does not match the one available on Huggingface. Double-check + This typically means the model name you provided does not match the one available on Hugging Face. Double-check the model name and try again. 4. **Confirm Successful Model Export**: @@ -314,11 +312,9 @@ using `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as an exampl sentence-transformers/paraphrase-multilingual-mpnet-base-v2 ``` -### Running the Langchain Pipeline (Optional) +### Running the LangChain Pipeline (Optional) -- Optional guide for running the Langchain pipeline, if applicable.## Developer Docs - -- A link to the developer documentation where the README.md is also linked. +- Optional guide for running the LangChain pipeline, if applicable. > **Note**: This pipeline will, by default, run continuously repeatedly polling the configured RSS sources. To run for a > fixed number of iterations, add the `--stop_after=N` flag. diff --git a/examples/nlp_si_detection/README.md b/examples/nlp_si_detection/README.md index ab69546d22..d08df4ffed 100644 --- a/examples/nlp_si_detection/README.md +++ b/examples/nlp_si_detection/README.md @@ -46,7 +46,7 @@ In this example, we will be using Morpheus' provided NLP SI Detection model. Thi ### The Dataset -The dataset that this workflow was designed to process is PCAP, or Packet Capture data, that is serialized into a JSON format. Several different applications are capable of capurting this type of network traffic. Each packet contains information about the source, destination, timestamp, and body of the packet, among other things. For example, below is a single packet that is from a HTTP POST request to cumulusnetworks.com: +The dataset that this workflow was designed to process is PCAP, or Packet Capture data, that is serialized into a JSON format. Several different applications are capable of capturing this type of network traffic. Each packet contains information about the source, destination, timestamp, and body of the packet, among other things. 
For example, below is a single packet that is from a HTTP POST request to cumulusnetworks.com: ```json { @@ -197,16 +197,16 @@ Inference Rate[Complete]: 93085inf [00:07, 12673.63inf/s] ``` The output file `detections.jsonlines` will contain the original PCAP messages with the following additional fields added: -* address -* bank_acct -* credit_card -* email -* govt_id -* name -* password -* phone_num -* secret_keys -* user +* `address` +* `bank_acct` +* `credit_card` +* `email` +* `govt_id` +* `name` +* `password` +* `phone_num` +* `secret_keys` +* `user` The value for these fields will be a `1` indicating a detection or a `0` indicating no detection. An example row with a detection is: ```json diff --git a/examples/ransomware_detection/README.md b/examples/ransomware_detection/README.md index 84c48147a4..0388140227 100644 --- a/examples/ransomware_detection/README.md +++ b/examples/ransomware_detection/README.md @@ -15,7 +15,7 @@ See the License for the specific language governing permissions and limitations under the License. --> -# Example Ransomware Detection Morpheus Pipeline for AppShield Data +# Example Ransomware Detection Morpheus Pipeline for App Shield Data Example of a Morpheus Pipeline using Triton Inference server. @@ -37,10 +37,6 @@ Example: ```bash docker pull nvcr.io/nvidia/morpheus/morpheus-tritonserver-models:24.10 ``` -##### Setup Env Variable -```bash -export MORPHEUS_ROOT=$(pwd) -``` ##### Start Triton Inference Server Container From the Morpheus repo root directory, run the following to launch Triton and load the `ransomw-model-short-rf` model: diff --git a/examples/root_cause_analysis/README.md b/examples/root_cause_analysis/README.md index 47a4cd1dc1..c7273f7761 100644 --- a/examples/root_cause_analysis/README.md +++ b/examples/root_cause_analysis/README.md @@ -29,7 +29,7 @@ These examples illustrate how to use Morpheus to build a binary sequence classif ## Background -Like any other Linux based machine, DGX's generate a vast amount of logs. Analysts spend hours trying to identify the root causes of each failure. There could be infinitely many types of root causes of the failures. Some patterns might help to narrow it down; however, regular expressions can only help to identify previously known patterns. Moreover, this creates another manual task of maintaining a search script. +Like any other Linux based machine, DGX systems generate a vast amount of logs. Analysts spend hours trying to identify the root causes of each failure. There could be infinitely many types of root causes of the failures. Some patterns might help to narrow it down; however, regular expressions can only help to identify previously known patterns. Moreover, this creates another manual task of maintaining a search script. In this example, we demonstrate how using Morpheus can accelerate the analysis of the enormous amount of logs using machine learning. Another benefit of analyzing in a probabilistic way is that we can pin down previously undetected root causes. To achieve this, we will fine-tune a pre-trained BERT[^1] model with a classification layer using HuggingFace library. @@ -39,7 +39,7 @@ Once the model is capable of identifying even the new root causes, it can also b ### The Dataset -The dataset comprises kern.log files from multiple DGX's. Each line inside has been labelled as either 0 for ordinary or 1 for root cause by a script that uses some known patterns. 
We will be especially interested in lines that are marked as ordinary in the test set but predicted as a root cause as they may be new types of root causes of failures. +The dataset comprises kern.log files from multiple DGX systems. Each line inside has been labelled as either 0 for ordinary or 1 for root cause by a script that uses some known patterns. We will be especially interested in lines that are marked as ordinary in the test set but predicted as a root cause as they may be new types of root causes of failures. ## Pipeline Architecture diff --git a/examples/sid_visualization/README.md b/examples/sid_visualization/README.md index faf2c666c3..10aeb4cbee 100644 --- a/examples/sid_visualization/README.md +++ b/examples/sid_visualization/README.md @@ -32,7 +32,7 @@ git submodule update --init --recursive ### Build Morpheus Dev Container -Before launching the demo, we need the dev container for Morpheus to be created: +Before launching the demo, we need the docker container for Morpheus to be created: ```bash export DOCKER_IMAGE_TAG="sid-viz" ./docker/build_container_dev.sh diff --git a/models/README.md b/models/README.md index 87be631170..39a9a4aa47 100644 --- a/models/README.md +++ b/models/README.md @@ -20,9 +20,9 @@ limitations under the License. Pretrained models for Morpheus with corresponding training, validation scripts, and datasets. ## Repo Structure -Every Morpheus use case has a subfolder, **`-models`**, that contains the model files for the use case. Training and validation datasets and scripts are also provided in [datasets](./datasets/), [training-tuning-scripts](./training-tuning-scripts/), and [validation-inference-scripts](./validation-inference-scripts/). Jupyter notebook (`.ipynb`) version of the training and fine-tuning scripts are also provided. +Every Morpheus use case has a directory, **`-models`**, that contains the model files for the use case. Training and validation datasets and scripts are also provided in [`datasets`](./datasets/), [`training-tuning-scripts`](./training-tuning-scripts/), and [`validation-inference-scripts`](./validation-inference-scripts/). Jupyter notebook (`.ipynb`) version of the training and fine-tuning scripts are also provided. -The `triton_model_repo` contains the necessary directory structure and configuration files in order to run the Morpheus Models in Triton Inference Server. This includes symlinks to the above-mentioned model files along with corresponding Triton config files (`.pbtxt`). More information on how to deploy this repository to Triton can be found in the [README](./triton-model-repo/README.md). +The `triton_model_repo` contains the necessary directory structure and configuration files in order to run the Morpheus Models in Triton Inference Server. This includes symlinks to the above-mentioned model files along with corresponding Triton configuration files (`.pbtxt`). More information on how to deploy this repository to Triton can be found in the [README](./triton-model-repo/README.md). Models can also be published to an [MLflow](https://mlflow.org/) server and deployed to Triton using the [MLflow Triton plugin](https://github.com/triton-inference-server/server/tree/main/deploy/mlflow-triton-plugin). The [MLflow](./mlflow/README.md) directory contains information on how to set up a Docker container to run an MLflow server for publishing Morpheus models and deploying them to Triton. 
@@ -50,10 +50,10 @@ In the root directory, the file `model-information.csv` contains the following i - **Memory footprint** - Memory required by the model - **Thresholds** - Values of thresholds used for validation - **NLP hash file** - Hash file for tokenizer vocabulary - - **NLP max length** - Max_length value for tokenizer + - **NLP max length** - Max length value for tokenizer - **NLP stride** - stride value for tokenizer - - **NLP do lower** - do_lower value for tokenizer - - **NLP do truncate** - do_truncate value for tokenizer + - **NLP do lower** - `do_lower` value for tokenizer + - **NLP do truncate** - `do_truncate` value for tokenizer - **Version CUDA** - CUDA version used during training - **Version Python** - Python version used during training - **Version Ubuntu** - Ubuntu version used during training @@ -62,11 +62,11 @@ In the root directory, the file `model-information.csv` contains the following i # Model Card Info ## Sensitive Information Detection (SID) ### Model Overview -SID is a classifier, designed to detect sensitive information (e.g., AWS credentials, GitHub credentials) in unencrypted data. This example model classifies text containing these 10 categories of sensitive information- address, bank account, credit card number, email address, government id number, full name, password, phone number, secret keys, and usernames. +SID is a classifier, designed to detect sensitive information (for example, AWS credentials, GitHub credentials) in unencrypted data. This example model classifies text containing these 10 categories of sensitive information- address, bank account, credit card number, email address, government id number, full name, password, phone number, secret keys, and usernames. ### Model Architecture Compact BERT-mini transformer model ### Training -Training consisted of fine-tuning the original pretrained [model from google](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4). The labeled training dataset is 2 million synthetic pcap payloads generated using the [faker package](https://github.com/joke2k/faker) to mimic sensitive and benign data found in nested JSON(s) from web APIs and environmental variables. +Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4). The labeled training dataset is 2 million synthetic PCAP payloads generated using the [faker package](https://github.com/joke2k/faker) to mimic sensitive and benign data found in nested JSON objects from web APIs and environmental variables. ### How To Use This Model This model is an example of customized transformer-based sensitive information detection. It can be further fine-tuned for specific detection needs or retrained for alternative categorizations using the fine-tuning scripts in the repo. #### Input @@ -97,11 +97,11 @@ https://arxiv.org/abs/1810.04805 ## Anomalous Behavior Profiling (ABP) ### Model Overview -This model is an example of a binary classifier to differentiate between anomalous GPU behavior such as crypto mining / GPU malware, and non-anomalous GPU-based workflows (e.g., ML/DL training). The model is an XGBoost model. +This model is an example of a binary classifier to differentiate between anomalous GPU behavior such as cryptocurrency mining / GPU malware, and non-anomalous GPU-based workflows (for example, ML/DL training). The model is an XGBoost model. 
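To make the ABP description above more concrete, here is a minimal sketch of training a binary XGBoost classifier over `nvidia-smi`-style GPU statistics. It is not the Morpheus training script; the feature names, values, and labels below are hypothetical.

```python
# Minimal sketch: binary XGBoost classifier over nvidia-smi style GPU statistics.
# Feature names, values, and labels are invented for illustration only.
import pandas as pd
from xgboost import XGBClassifier

df = pd.DataFrame({
    "utilization_gpu":    [98, 97, 12, 3, 99, 5],
    "utilization_memory": [90, 88, 10, 2, 95, 4],
    "power_draw":         [240, 250, 60, 45, 260, 50],
    "label":              [1, 1, 0, 0, 1, 0],  # 1 = anomalous (e.g. mining), 0 = benign
})

model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(df.drop(columns="label"), df["label"])

print(model.predict(df.drop(columns="label").head(2)))  # array of 0/1 predictions
```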
### Model Architecture XGBoost ### Training -Training consisted of ~1000 labeled nv-smi logs generated from processes running either GPU malware or bengin GPU-based workflows. +Training consisted of ~1000 labeled nv-smi logs generated from processes running either GPU malware or benign GPU-based workflows. ### How To Use This Model This model can be used to flag anomalous GPU activity. #### Input @@ -109,7 +109,10 @@ nv-smi data #### Output Binary classification as anomalous or benign. ### References + + Chen, Guestrin (2016) XGBoost. A scalable tree boosting system. https://arxiv.org/abs/1603.02754 + ## Digital Fingerprinting (DFP) ### Model Overview @@ -125,9 +128,12 @@ aws-cloudtrail logs ### Output Anomalous score of Autoencoder, Binary classification of time series anomaly detection ### References + + - https://github.com/AlliedToasters/dfencoder/blob/master/dfencoder/autoencoder.py - https://github.com/rapidsai/clx/blob/branch-22.12/notebooks/anomaly_detection/FFT_Outlier_Detection.ipynb - Rasheed Peng Alhajj Rokne Jon: Fourier Transform Based Spatial Outlier Mining 2009 - https://link.springer.com/chapter/10.1007/978-3-642-04394-9_39 + ## Flexible Log Parsing ### Model Overview @@ -135,7 +141,7 @@ This model is an example of using Named Entity Recognition (NER) for log parsing ### Model Architecture BERT-based cased transformer model with NER classification layer ### Training -Training consisted of fine-tuning the original pretrained [model from google](https://huggingface.co/bert-base-cased). The labeled training dataset is 1000 parsed apache web logs from a public dataset [logpai](https://github.com/logpai/loghub) +Training consisted of fine-tuning the original pretrained [model from Google](https://huggingface.co/bert-base-cased). The labeled training dataset is 1000 parsed apache web logs from a public dataset [Loghub](https://github.com/logpai/loghub) ### How To Use This Model This model is one example of a BERT-model trained to parse raw logs. It can be used to parse apache web logs or retrained to parse other types of logs as well. The model file has a corresponding config.json file with the names of the fields it parses. #### Input @@ -162,26 +168,32 @@ Transaction data with nodes including transaction, client, and merchant. #### Output An anomalous score of transactions indicates a probability score of being a fraud. ### References + + - https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage - https://github.com/rapidsai/clx/blob/branch-22.12/examples/forest_inference/xgboost_training.ipynb - Rafaël Van Belle, Charles Van Damme, Hendrik Tytgat, Jochen De Weerdt,Inductive Graph Representation Learning for fraud detection (https://www.sciencedirect.com/science/article/abs/pii/S0957417421017449) + ## Ransomware Detection via AppShield ### Model Overview -This model shows an application of DOCA AppShield to use data from volatile memory to classify processes as ransomware or bengin. This model uses a sliding window over time and feeds derived data into a random forest classifiers of various lengths depending on the amount of data collected. +This model shows an application of DOCA AppShield to use data from volatile memory to classify processes as ransomware or benign. This model uses a sliding window over time and feeds derived data into a random forest classifiers of various lengths depending on the amount of data collected. 
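As an illustration of the sliding-window idea described above, the sketch below derives rolling per-process features from a few hypothetical AppShield-style snapshots and scores them with a random forest. It is not the Morpheus implementation; the column names, values, and labels are invented.

```python
# Minimal sketch: rolling per-process features over snapshots, scored by a
# random forest classifier. Columns, values, and labels are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

snapshots = pd.DataFrame({
    "process_id":   [101, 101, 101, 202, 202, 202],
    "snapshot":     [1, 2, 3, 1, 2, 3],
    "handle_count": [12, 15, 40, 9, 10, 11],
    "vad_count":    [3, 3, 8, 2, 2, 2],
})

# Rolling means per process over a window of up to 3 consecutive snapshots.
features = (
    snapshots.sort_values(["process_id", "snapshot"])
    .groupby("process_id")[["handle_count", "vad_count"]]
    .rolling(window=3, min_periods=1)
    .mean()
    .reset_index(drop=True)
)

labels = [0, 0, 1, 0, 0, 0]  # 1 = ransomware-like window (made up)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)
print(clf.predict_proba(features)[:, 1])  # probability of ransomware per window
```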
### Model Architecture The model uses input from Volatility plugins in DOCA AppShield to aggregate and derive features over snapshots in time. The features are used as input into three random forest binary classifiers. ### Training -Training data consists of 87968 labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities. +Training data consists of 87968 labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities. ### How To Use This Model Combined with host data from DOCA AppShield, this model can be used to detect ransomware. A training notebook is also included so that users can update the model as more labeled data is collected. #### Input Snapshots collected from DOCA AppShield #### Output -For each process_id and snapshot there is a probability score between 1 and 0, where 1 is ransomware and 0 is benign. +For each `process_id` and snapshot there is a probability score between 1 and 0, where 1 is ransomware and 0 is benign. ### References + + - Cohen, A,. & Nissim, N. (2018). Trusted detection of ransomware in a private cloud using machine learning methods leveraging meta-features from volatile memory. In Expert Systems With Applications. (https://www.sciencedirect.com/science/article/abs/pii/S0957417418301283) - https://developer.nvidia.com/networking/doca + ## Root Cause Analysis ### Model Overview @@ -191,7 +203,7 @@ BERT-base uncased transformer model ### Training Training consisted of fine-tuning the original pre-trained [model from google](https://huggingface.co/bert-base-uncased). The labeled dataset is Linux kernel logs, and it has two parts. Kernel errors and new errors. Kernel logs will be split into two parts so that the new and unseen error logs can be appended to the test set after the split to later check if the model can catch them despite not seeing such errors in the training. ### How To Use This Model -This model is an example of customized transformer-based root cause analysis. It can be further fine-tuned for specific root cause analysis or predictive maintenance needs and of your enterprise using the fine-tuning scripts in the repo. The hyper parameters can be optimised to adjust to get the best results with your dataset. The aim is to get the model to predict some false positives that could be previously unknown error types. Users can use this root cause analysis method with other log types too. If they have known failures in their logs, they can use them to train along with ordinary logs and can detect other root causes they weren't aware of before. +This model is an example of customized transformer-based root cause analysis. It can be further fine-tuned for specific root cause analysis or predictive maintenance needs and of your enterprise using the fine-tuning scripts in the repo. The hyper parameters can be optimised to adjust to get the best results with your dataset. The aim is to get the model to predict some false positives that could be previously unknown error types. Users can use this root cause analysis method with other log types too. If they have known failures in their logs, they can use them to train along with ordinary logs and can detect other root causes they weren't aware of before. #### Input Kernel logs #### Output diff --git a/models/datasets/README.md b/models/datasets/README.md index 5e5f6930a0..4c3b83603c 100644 --- a/models/datasets/README.md +++ b/models/datasets/README.md @@ -65,7 +65,7 @@ This dataset is stored in our S3 bucket. 
It can be downloaded using a script. - [fetch_example_data.py](../../examples/digital_fingerprinting/fetch_example_data.py) -### DFP Cloudtrail Logs +### DFP CloudTrail Logs This is a synthetic dataset of AWS CloudTrail logs events with activities from two entities/users in separate files. @@ -93,7 +93,7 @@ Files for `role-g` include a single CSV and split JSON version of the same data: ## Fraud Detection -This is a small dataset augmented from the artificially generated transaction network demo data from the authors of [Inductive Graph Representation Learning for Fraud Detection](https://www.researchgate.net/publication/357706343_Inductive_Graph_Representation_Learning_for_fraud_detection). The original demo data of 753 labeled transactions was downloaded from the paper's [github repo](https://github.com/Charlesvandamme/Inductive-Graph-Representation-Learning-for-Fraud-Detection/blob/master/Demo/demo_ccf.csv) on 02/10/2022 with an MD5 hash `64af64fcc6e3d55d25111a3f257378a4`. We augmented the training dataset to increase benign transactions by replicating that portion of the dataset for a total of 12053 transactions. +This is a small dataset augmented from the artificially generated transaction network demo data from the authors of [Inductive Graph Representation Learning for Fraud Detection](https://www.researchgate.net/publication/357706343_Inductive_Graph_Representation_Learning_for_fraud_detection). The original demo data of 753 labeled transactions was downloaded from the paper's [GitHub repo](https://github.com/Charlesvandamme/Inductive-Graph-Representation-Learning-for-Fraud-Detection/blob/master/Demo/demo_ccf.csv) on 02/10/2022 with an MD5 hash `64af64fcc6e3d55d25111a3f257378a4`. We augmented the training dataset to increase benign transactions by replicating that portion of the dataset for a total of 12053 transactions. ### Sample Training Data - [fraud-detection-training-data.csv](./training-data/fraud-detection-training-data.csv) @@ -104,7 +104,7 @@ This is a small dataset augmented from the artificially generated transaction ne ## Log Parsing -This sample dataset consists of a subset of Apache logs collected from a Linux system running Apache Web server as part of a larger public log dataset on [loghub](https://github.com/logpai/loghub/blob/master/Apache/Apache_2k.log). The file was downloaded on 01/14/2020 with an MD5 hash of `1c3a706386b3ebc03a2ae07a2d864d66`. The logs were parsed using an apache log parsing [package](https://github.com/amandasaurus/apache-log-parser) to create a labeled dataset. +This sample dataset consists of a subset of Apache logs collected from a Linux system running Apache Web server as part of a larger public log dataset on [Loghub](https://github.com/logpai/loghub/blob/master/Apache/Apache_2k.log). The file was downloaded on 01/14/2020 with an MD5 hash of `1c3a706386b3ebc03a2ae07a2d864d66`. The logs were parsed using an apache log parsing [package](https://github.com/amandasaurus/apache-log-parser) to create a labeled dataset. ### Sample Training Data @@ -133,37 +133,37 @@ Additionally a subset of 100 messages from the dataset were augmented to include ## Ransomware Detection via AppShield -The dataset was generated by running ransomware and benign processes in a lab environment and recording the output from several plugins from the [Volatility framework](https://github.com/volatilityfoundation/volatility3) including `cmdline`, `envars`, `handles`, `ldrmodules`, `netscan`, `pslist`, `threadlist`, `vadinfo`. 
The training csv file contains 530 columns- a combination of features from the Volatility Plugins. This data collection is part of [DOCA AppShield](https://developer.nvidia.com/networking/doca). +The dataset was generated by running ransomware and benign processes in a lab environment and recording the output from several plugins from the [Volatility framework](https://github.com/volatilityfoundation/volatility3) including `cmdline`, `envars`, `handles`, `ldrmodules`, `netscan`, `pslist`, `threadlist`, `vadinfo`. The training CSV file contains 530 columns- a combination of features from the Volatility Plugins. This data collection is part of [DOCA AppShield](https://developer.nvidia.com/networking/doca). ### Sample Training Data Training data CSV consists of 87968 preprocessed and labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities. -- [ransomware-training-data.csv](./training-data/ransomware-training-data.csv) +- [`ransomware-training-data.csv`](./training-data/ransomware-training-data.csv) ### Pipeline Validation Data The validation set contains raw data from 27 AppShield snapshots. -- [appshield data directory](../../examples/data/appshield/Heur) +- [`appshield` data directory](../../examples/data/appshield/Heur) ## Root Cause This dataset contains a small sample of anonymized Linux kernel logs of a DGX machine prior to a hardware failure. The training dataset contains 1359 logs labeled as indicators of the root cause or not. A model trained on this set can be robust enough to correctly identify previously undetected errors from the `unseen-errors` file as a root cause as well. ### Sample Training Data -- [root-cause-training-data.csv](./training-data/root-cause-training-data.csv) -- [root-cause-unseen-errors.csv](./training-data/root-cause-unseen-errors.csv) +- [`root-cause-training-data.csv`](./training-data/root-cause-training-data.csv) +- [`root-cause-unseen-errors.csv`](./training-data/root-cause-unseen-errors.csv) ### Pipeline Validation Data -- [root-cause-validation-data-input.jsonlines](./validation-data/root-cause-validation-data-input.jsonlines) +- [`root-cause-validation-data-input.jsonlines`](./validation-data/root-cause-validation-data-input.jsonlines) ## Sensitive Information Detection (SID) -This data contains 2000 synthetic pcap payloads generated to mimic sensitive and benign data found in nested JSONs from web APIs and environmental variables. Each row is labeled for the presence or absence of 10 different kinds of sensitive information. The data was generated using the python [faker](https://faker.readthedocs.io/en/master/#) package and lists of most [common passwords](https://github.com/danielmiessler/SecLists/tree/master/Passwords/Common-Credentials). If there is any resemblance to real individuals, it is purely coincidental. +This data contains 2000 synthetic PCAP payloads generated to mimic sensitive and benign data found in nested JSON objects from web APIs and environmental variables. Each row is labeled for the presence or absence of 10 different kinds of sensitive information. The data was generated using the python [faker](https://faker.readthedocs.io/en/master/#) package and lists of most [common passwords](https://github.com/danielmiessler/SecLists/tree/master/Passwords/Common-Credentials). If there is any resemblance to real individuals, it is purely coincidental. 
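For illustration, a synthetic record in the spirit of the SID dataset described above could be generated with Faker roughly as follows. This is a sketch only; the field names and the 0/1 labeling convention are hypothetical, not the actual dataset schema.

```python
# Minimal sketch: build one synthetic "sensitive" record with Faker. Field names
# and the labeling convention here are hypothetical.
import json
from faker import Faker

fake = Faker()
payload = {
    "user": fake.user_name(),
    "email": fake.email(),
    "credit_card": fake.credit_card_number(),
    "phone_num": fake.phone_number(),
    "address": fake.address(),
}
labels = {name: 1 for name in payload}  # 1 = field contains sensitive data

print(json.dumps({"data": payload, "labels": labels}))
```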
### Sample Training Data -- [sid-sample-training-data.csv](./training-data/sid-sample-training-data.csv) +- [`sid-sample-training-data.csv`](./training-data/sid-sample-training-data.csv) ### Pipeline Validation Data -- [sid-validation-data.csv](./validation-data/sid-validation-data.csv) +- [`sid-validation-data.csv`](./validation-data/sid-validation-data.csv) ## Disclaimer diff --git a/models/mlflow/README.md b/models/mlflow/README.md index aabaace726..ddc3eb1a29 100644 --- a/models/mlflow/README.md +++ b/models/mlflow/README.md @@ -65,11 +65,7 @@ cp -RL models /opt/triton_models ## Start Triton Inference Server in EXPLICIT mode -Use the following command to run Triton with our model -repository you just created. The [NVIDIA Container -Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) must be installed -for Docker to recognize the GPU(s). The --gpus=1 flag indicates that 1 -system GPU should be made available to Triton for inferencing. +Use the following command to run Triton with our model repository you just created. The [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) must be installed for Docker to recognize the GPUs. The `--gpus=1` flag indicates that the GPU with ID `1` should be made available to Triton for inferencing. ```bash docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /opt/triton_models:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models --model-control-mode=explicit @@ -77,7 +73,7 @@ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /opt/triton_mode ## MLflow container -Build MLflow image from Dockerfile from the root of the Morpheus repo: +Build MLflow image, from the root of the Morpheus repo: ```bash cd models/mlflow diff --git a/models/model-cards/abp-model-card.md b/models/model-cards/abp-model-card.md index 634df78317..874530ba78 100644 --- a/models/model-cards/abp-model-card.md +++ b/models/model-cards/abp-model-card.md @@ -21,81 +21,84 @@ limitations under the License. # Model Overview ## Description: -* This model is an example of a binary XGBoost classifier to differentiate between anomalous GPU behavior, such as crypto mining / GPU malware, and non-anomalous GPU-based workflows (e.g., ML/DL training). This model is for demonstration purposes and not for production usage.
+* This model is an example of a binary XGBoost classifier to differentiate between anomalous GPU behavior, such as cryptocurrency mining / GPU malware, and non-anomalous GPU-based workflows (for example, ML/DL training). This model is for demonstration purposes and not for production usage.
-## References(s): -* Chen, Guestrin (2016) XGBoost. A scalable tree boosting system. https://arxiv.org/abs/1603.02754
+## References: + + +* Chen, Guestrin (2016) XGBoost. A scalable tree boosting system. https://arxiv.org/abs/1603.02754
+ -## Model Architecture: -**Architecture Type:** +## Model Architecture: +**Architecture Type:** * Gradient boosting
-**Network Architecture:** -* XGBOOST
+**Network Architecture:** +* XGBoost
## Input: (Enter "None" As Needed) -**Input Format:** -* nvidia-smi output
+**Input Format:** +* `nvidia-smi` output
-**Input Parameters:** -* GPU statistics that are included in the nvidia-smi output
+**Input Parameters:** +* GPU statistics that are included in the `nvidia-smi` output
**Other Properties Related to Input:** N/A
## Output: (Enter "None" As Needed) -**Output Format:** +**Output Format:** * Binary Results
-**Output Parameters:** +**Output Parameters:** * N/A
-**Other Properties Related to Output:** -* N/A
+**Other Properties Related to Output:** +* N/A
## Software Integration: -**Runtime(s):** +**Runtime:** * Morpheus
-**Supported Hardware Platform(s):**
+**Supported Hardware Platforms:**
* Ampere/Turing
-**Supported Operating System(s):**
+**Supported Operating Systems:**
* Linux
-## Model Version(s): +## Model Versions: * v1
-# Training & Evaluation: +# Training & Evaluation: ## Training Dataset: -**Link:** +**Link:** * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/training-data/abp-sample-nvsmi-training-data.json
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** -* Sample dataset consists of over 1000 nvidia-smi outputs
+**Properties (Quantity, Dataset Descriptions, Sensors):** +* Sample dataset consists of over 1000 `nvidia-smi` outputs
## Evaluation Dataset: -**Link:** +**Link:** * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/validation-data/abp-validation-data.jsonlines
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** -* Sample dataset consists of over 1000 nvidia-smi outputs
+**Properties (Quantity, Dataset Descriptions, Sensors):** +* Sample dataset consists of over 1000 `nvidia-smi` outputs
## Inference: -**Engine:** +**Engine:** * Triton
**Test Hardware:**
* DGX (V100)
## Ethical Considerations: -NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). +NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). # Subcards @@ -109,40 +112,40 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ## Model Card ++ Explainability Subcard -### Name example applications and use cases for this model. +### Name example applications and use cases for this model. * The model is primarily designed for testing purposes and serves as a small model specifically used to evaluate and validate the ABP pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. ### Intended Users. * The model is primarily designed for testing purposes. This model is intended to be an example for developers that want to test Morpheus ABP pipeline. -### Name who is intended to benefit from this model. -* The intended beneficiaries of this model are developers who aim to test the functionality of the ABP models for detecting crypto mining. +### Name who is intended to benefit from this model. +* The intended beneficiaries of this model are developers who aim to test the functionality of the ABP models for detecting cryptocurrency mining. -### Describe the model output. -* This model output can be used as a binary result, Crypto mining or legitimate GPU usage. +### Describe the model output. +* This model output can be used as a binary result, cryptocurrency mining or legitimate GPU usage. ### Describe how this model works. -* nvidia-smi features are used as the input and the model predicts a label for each row +* `nvidia-smi` features are used as the input and the model predicts a label for each row -### List the technical limitations of the model. +### List the technical limitations of the model. * For different GPU workloads different models need to be trained. ### Has this been verified to have met prescribed NVIDIA quality standards? * Yes - + ### What performance metrics were used to affirm the model's performance? * Accuracy ### What are the potential known risks to users and stakeholders? 
* N/A -### Link the relevant end user license agreement +### Link the relevant end user license agreement * [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) -## Model Card ++ Saftey & Security Subcard +## Model Card ++ Safety & Security Subcard -### Link the location of the training dataset's repository. +### Link the location of the repository for the training dataset. * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/training-data/abp-sample-nvsmi-training-data.json ### Describe the life critical impact (if present). @@ -178,14 +181,14 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### Protected classes used to create this model? (The following were used in model the model's training:) * N/A - + ### How often is dataset reviewed? * The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes. ### Is a mechanism in place to honor data subject right of access or deletion of personal data? * N/A -### If PII collected for the development of this AI model, was it minimized to only what was required? +### If PII collected for the development of this AI model, was it minimized to only what was required? * N/A ### Is there data provenance? diff --git a/models/model-cards/dfp-model-card.md b/models/model-cards/dfp-model-card.md index f17a6911dd..71c0eebc04 100644 --- a/models/model-cards/dfp-model-card.md +++ b/models/model-cards/dfp-model-card.md @@ -20,9 +20,12 @@ limitations under the License. ## Description: This use case is currently implemented to detect changes in users' behavior that indicate a change from a human to a machine or a machine to a human. The model architecture consists of an Autoencoder, where the reconstruction loss of new log data is used as an anomaly score. -## References(s): +## References: + + - https://github.com/AlliedToasters/dfencoder/blob/master/dfencoder/autoencoder.py - Rasheed Peng Alhajj Rokne Jon: Fourier Transform Based Spatial Outlier Mining 2009 - https://link.springer.com/chapter/10.1007/978-3-642-04394-9_39 + ## Model Architecture: The model architecture consists of an Autoencoder, where the reconstruction loss of new log data is used as an anomaly score. @@ -35,7 +38,7 @@ The model architecture consists of an Autoencoder, where the reconstruction loss ## Input: **Input Format:** -* AWS CloudTrail logs in json format +* AWS CloudTrail logs in JSON format **Input Parameters:** * None @@ -49,19 +52,19 @@ The model architecture consists of an Autoencoder, where the reconstruction loss * Reconstruction loss (per feature) **Output Parameters:** -* Pandas Dataframe +* Pandas DataFrame ## Software Integration: -**Runtime(s):** +**Runtime:** * Morpheus -**Supported Hardware Platform(s):**
+**Supported Hardware Platforms:**
* Ampere/Turing
-**Supported Operating System(s):**
+**Supported Operating Systems:**
* Linux
-## Model Version(s): +## Model Versions: * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/dfp-models/hammah-role-g-20211017-dill.pkl * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/dfp-models/hammah-user123-20211017-dill.pkl @@ -72,7 +75,7 @@ The model architecture consists of an Autoencoder, where the reconstruction loss **Link:** * https://github.com/nv-morpheus/Morpheus/tree/branch-24.10/models/datasets/training-data/cloudtrail -**Properties (Quantity, Dataset Descriptions, Sensor(s)):** +**Properties (Quantity, Dataset Descriptions, Sensors):** The training dataset consists of AWS CloudTrail logs. It contains logs from two entities, providing information about their activities within the AWS environment. * [hammah-role-g-training-part1.json](https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/training-data/cloudtrail/hammah-role-g-training-part1.json): 700 records
@@ -85,7 +88,7 @@ The training dataset consists of AWS CloudTrail logs. It contains logs from two **Link:** * https://github.com/nv-morpheus/Morpheus/tree/branch-24.10/models/datasets/validation-data/cloudtrail
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** +**Properties (Quantity, Dataset Descriptions, Sensors):** The evaluation dataset consists of AWS CloudTrail logs. It contains logs from two entities, providing information about their activities within the AWS environment. * [hammah-role-g-validation.json](https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/validation-data/cloudtrail/hammah-role-g-validation.json): 314 records @@ -101,7 +104,7 @@ The evaluation dataset consists of AWS CloudTrail logs. It contains logs from tw * Other ## Ethical Considerations (For NVIDIA Models Only): -NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcard +NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcard # Subcards @@ -122,7 +125,7 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe * This model is designed for developers seeking to test the DFP pipeline with a small pretrained model trained on a synthetic dataset. ### Name who is intended to benefit from this model. -* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the DFP pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world cloudtrail logs analysis. +* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the DFP pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world CloudTrail logs analysis. ### Describe the model output. * The model calculates an anomaly score for each input based on the reconstruction loss obtained from the trained Autoencoder. This score represents the level of anomaly detected in the input data. Higher scores indicate a higher likelihood of anomalous behavior. @@ -133,7 +136,7 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe * [Training notebook](https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/training-tuning-scripts/dfp-models/hammah-20211017.ipynb) ### List the technical limitations of the model. -* The model expects cloudtrail logs with specific features that match the training dataset. Data lacking the required features or requiring a different feature set may not be compatible with the model. +* The model expects CloudTrail logs with specific features that match the training dataset. Data lacking the required features or requiring a different feature set may not be compatible with the model. 
### Has this been verified to have met prescribed NVIDIA quality standards? * Yes @@ -147,9 +150,9 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### Link the relevant end user license agreement * [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) -## Model Card ++ Saftey & Security Subcard +## Model Card ++ Safety & Security Subcard -### Link the location of the training dataset's repository (if able to share). +### Link the location of the repository for the training dataset (if able to share). * https://github.com/nv-morpheus/Morpheus/tree/branch-24.10/models/datasets/training-data/cloudtrail ### Describe the life critical impact (if present). diff --git a/models/model-cards/gnn-fsi-model-card.md b/models/model-cards/gnn-fsi-model-card.md index ac37a34b58..27fc6f73a7 100644 --- a/models/model-cards/gnn-fsi-model-card.md +++ b/models/model-cards/gnn-fsi-model-card.md @@ -18,85 +18,88 @@ limitations under the License. # Model Overview ### Description: -* This model shows an application of a graph neural network for fraud detection in a credit card transaction graph. A transaction dataset that includes three types of nodes, transaction, client, and merchant nodes is used for modeling. A combination of `GraphSAGE` along `XGBoost` is used to identify frauds in the transaction networks. This model is for demonstration purposes and not for production usage.
+* This model shows an application of a graph neural network for fraud detection in a credit card transaction graph. A transaction dataset that includes three node types (transaction, client, and merchant) is used for modeling. A combination of `GraphSAGE` along with `XGBoost` is used to identify fraud in the transaction networks. This model is for demonstration purposes and not for production usage.
-## References(s): +## References: + + 1. https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage 2. https://github.com/rapidsai/clx/blob/branch-22.12/examples/forest_inference/xgboost_training.ipynb -3. Rafaël Van Belle, Charles Van Damme, Hendrik Tytgat, Jochen De Weerdt,Inductive Graph Representation Learning for fraud detection (https://www.sciencedirect.com/science/article/abs/pii/S0957417421017449)
+3. Rafaël Van Belle, Charles Van Damme, Hendrik Tytgat, Jochen De Weerdt, Inductive Graph Representation Learning for fraud detection (https://www.sciencedirect.com/science/article/abs/pii/S0957417421017449)
+ ## Model Architecture: It uses a bipartite heterogeneous graph representation as input for `GraphSAGE` for feature learning and `XGBoost` as a classifier. Since the input graph is heterogeneous, a heterogeneous implementation of `GraphSAGE` (HinSAGE) is used for feature embedding.
-**Architecture Type:** +**Architecture Type:** * Graph Neural Network and Binary classification
-**Network Architecture:** +**Network Architecture:** * GraphSAGE and XGBoost
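As a companion to the architecture description above, the following is a minimal sketch of the two-stage pattern: GNN node embeddings fed into an `XGBoost` classifier. The embedding matrix is mocked with random numbers rather than produced by a real `GraphSAGE`/HinSAGE model, and all names are illustrative only.

```python
import numpy as np
import xgboost as xgb

# Stand-in for HinSAGE output: one embedding vector per transaction node.
# In the real pipeline these embeddings come from the trained GNN.
rng = np.random.default_rng(0)
num_transactions, embedding_dim = 1000, 64
embeddings = rng.random((num_transactions, embedding_dim)).astype(np.float32)
labels = rng.integers(0, 2, size=num_transactions)  # 1 = fraud, 0 = legitimate

# Second stage: a gradient-boosted tree classifier trained on the embeddings.
classifier = xgb.XGBClassifier(n_estimators=100, max_depth=6)
classifier.fit(embeddings, labels)

# Scoring returns a fraud probability between 0 and 1 for each transaction.
fraud_probability = classifier.predict_proba(embeddings[:5])[:, 1]
print(fraud_probability)
```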
## Input Transaction data with nodes including transaction, client, and merchant.
-**Input Parameters:** +**Input Parameters:** * None
-**Input Format:** +**Input Format:** * CSV format
-**Other Properties Related to Output:** +**Other Properties Related to Output:** * None
## Output
An anomaly score for each transaction, representing the probability that the transaction is fraudulent.
-**Output Parameters:** +**Output Parameters:** * None
-**Output Format:** +**Output Format:** * CSV
-**Other Properties Related to Output:** -* None
+**Other Properties Related to Output:** +* None
## Software Integration: -**Runtime(s):** +**Runtime:** * Morpheus
-**Supported Hardware Platform(s):**
+**Supported Hardware Platforms:**
* Ampere/Turing
-**Supported Operating System(s):**
+**Supported Operating Systems:**
* Linux
- -## Model Version(s): + +## Model Versions: * 1.0
### How To Use This Model This model is an example of a fraud detection pipeline using a graph neural network and gradient boosting trees. This can be further retrained or fine-tuned to be used for similar types of transaction networks with similar graph structures. -# Training & Evaluation: +# Training & Evaluation: ## Training Dataset: **Link:** * [fraud-detection-training-data.csv](models/dataset/fraud-detection-training-data.csv)
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):**
+**Properties (Quantity, Dataset Descriptions, Sensors):**
* The training data consists of 753 raw synthetic labeled credit card transactions, augmented to a total of 12053 labeled transactions.
## Evaluation Dataset: -**Link:** +**Link:** * [fraud-detection-validation-data.csv](models/dataset/fraud-detection-validation-data.csv)
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):**
+**Properties (Quantity, Dataset Descriptions, Sensors):**
* The data consists of 265 raw, synthetically created, labeled credit card transactions.
## Inference: -**Engine:** +**Engine:** * Triton
**Test Hardware:**
* DGX (V100)
## Ethical Considerations: -NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). +NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). # Subcards ## Model Card ++ Bias Subcard @@ -109,19 +112,19 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ## Model Card ++ Explainability Subcard -### Name example applications and use cases for this model. +### Name example applications and use cases for this model. * The model is primarily designed for testing purposes and serves as a small pretrained model specifically used to evaluate and validate the GNN FSI pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. ### Fill in the blank for the model technique. * This model is designed for developers seeking to test the GNN fraud detection pipeline with a small pretrained model on a synthetic dataset. ### Intended Users. -* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the GNN fraud detection pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world transactions. +* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the GNN fraud detection pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world transactions. ### Describe the model output. -* This model outputs fraud probability score b/n (0 & 1). +* This model outputs fraud probability score b/n (0 & 1). -### Describe how this model works. +### Describe how this model works. * The model uses a bipartite heterogeneous graph representation as input for `GraphSAGE` for feature learning and `XGBoost` as a classifier. Since the input graph is heterogeneous, a heterogeneous implementation of `GraphSAGE` (HinSAGE) is used for feature embedding. ### List the technical limitations of the model. @@ -133,15 +136,15 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### What performance metrics were used to affirm the model's performance? 
* Area under ROC curve and Accuracy -### What are the potential known risks to users and stakeholders? +### What are the potential known risks to users and stakeholders? * None -### Link the relevant end user license agreement +### Link the relevant end user license agreement * [Apache 2.0](https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/LICENSE) -## Model Card ++ Saftey & Security Subcard +## Model Card ++ Safety & Security Subcard -### Link the location of the training dataset's repository (if able to share). +### Link the location of the repository for the training dataset (if able to share). * [training dataset](models/datasets/training-data/fraud-detection-training-data.csv) ### Describe the life critical impact (if present). @@ -183,7 +186,7 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### Is a mechanism in place to honor data subject right of access or deletion of personal data? * Yes -### If PII collected for the development of this AI model, was it minimized to only what was required? +### If PII collected for the development of this AI model, was it minimized to only what was required? * Not applicable ### Is there data provenance? diff --git a/models/model-cards/phishing-model-card.md b/models/model-cards/phishing-model-card.md index b02f561b7a..e5f9e1908a 100644 --- a/models/model-cards/phishing-model-card.md +++ b/models/model-cards/phishing-model-card.md @@ -21,84 +21,84 @@ limitations under the License. # Model Overview ## Description: -* Phishing detection is a binary classifier differentiating between phishing/spam and benign emails and SMS messages. This model is for demonstration purposes and not for production usage.
+* Phishing detection is a binary classifier differentiating between phishing/spam and benign emails and SMS messages. This model is for demonstration purposes and not for production usage.
-## References(s): +## References: * https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
-* Devlin J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
+* Devlin J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
-## Model Architecture: -**Architecture Type:** +## Model Architecture: +**Architecture Type:** * Transformers
-**Network Architecture:** +**Network Architecture:** * BERT
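For orientation, the sketch below shows how a fine-tuned BERT binary classifier of this kind is typically queried with the Hugging Face `transformers` API. The checkpoint path and the label mapping are assumptions; the packaged Morpheus model is served through Triton rather than called directly like this.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical local checkpoint of a BERT model fine-tuned for phishing detection.
checkpoint = "path/to/fine-tuned-phishing-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

messages = ["You won a free prize, click this link now!", "Meeting moved to 3 pm."]
inputs = tokenizer(messages, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label mapping: index 1 = phishing/spam, index 0 = benign.
phishing_scores = torch.softmax(logits, dim=-1)[:, 1]
print(phishing_scores.tolist())
```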
## Input: (Enter "None" As Needed) -**Input Format:** -* Evaluation script downloads the smsspamcollection.zip and extract tabular information into a dataframe
+**Input Format:**
+* The evaluation script downloads `smsspamcollection.zip` and extracts the tabular information into a DataFrame
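For reference, loading the extracted SMS Spam Collection into a DataFrame typically looks like the following sketch. The file name and column names are assumptions based on the UCI archive layout, not part of the packaged evaluation script.

```python
import pandas as pd

# The UCI archive extracts to a tab-separated file (assumed name: SMSSpamCollection)
# with two columns: the label ("ham" or "spam") and the raw message text.
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

# Map the labels onto the binary classes used by the phishing pipeline.
df["is_spam"] = (df["label"] == "spam").astype(int)
print(df.head())
```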
-**Input Parameters:** +**Input Parameters:** * SMS/emails
-**Other Properties Related to Output:** +**Other Properties Related to Output:** * N/A
## Output: (Enter "None" As Needed) -**Output Format:** +**Output Format:** * Binary Results, Fraudulent or Benign
-**Output Parameters:** +**Output Parameters:** * N/A
-**Other Properties Related to Output:** -* N/A
+**Other Properties Related to Output:** +* N/A
## Software Integration: -**Runtime(s):** +**Runtime:** * Morpheus
-**Supported Hardware Platform(s):**
+**Supported Hardware Platforms:**
* Ampere/Turing
-**Supported Operating System(s):**
+**Supported Operating Systems:**
* Linux
-## Model Version(s): +## Model Versions: * v1
-# Training & Evaluation: +# Training & Evaluation: ## Training Dataset: -**Link:** +**Link:** * http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** -* Dataset consists of SMSs
+**Properties (Quantity, Dataset Descriptions, Sensors):** +* Dataset consists of SMS messages
## Evaluation Dataset: -**Link:** +**Link:** * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/validation-data/phishing-email-validation-data.jsonlines
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** -* Dataset consists of SMSs
+**Properties (Quantity, Dataset Descriptions, Sensors):** +* Dataset consists of SMS messages
## Inference: -**Engine:** +**Engine:** * Triton
**Test Hardware:**
* DGX (V100)
## Ethical Considerations: -NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). +NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). # Subcards @@ -118,22 +118,22 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ## Model Card ++ Explainability Subcard -### Name example applications and use cases for this model. +### Name example applications and use cases for this model. * The model is primarily designed for testing purposes and serves as a small pre-trained model specifically used to evaluate and validate the phishing detection pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. ### Intended Users. * This model is designed for developers seeking to test the phishing detection pipeline with a small pre-trained model. -### Name who is intended to benefit from this model. -* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the phishing pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world phishing messages. +### Name who is intended to benefit from this model. +* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the phishing pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world phishing messages. -### Describe the model output. -* This model output can be used as a binary result, Phishing/Spam or Benign +### Describe the model output. +* This model output can be used as a binary result, Phishing/Spam or Benign ### Describe how this model works. * A BERT model gets fine-tuned with the dataset and in the inference it predicts one of the binary classes. Phishing/Spam or Benign. -### List the technical limitations of the model. +### List the technical limitations of the model. * For different email/SMS types and content, different models need to be trained. ### Has this been verified to have met prescribed NVIDIA standards? 
@@ -145,12 +145,12 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### What are the potential known risks to users and stakeholders? * N/A -### Link the relevant end user license agreement +### Link the relevant end user license agreement * [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) -## Model Card ++ Saftey & Security Subcard +## Model Card ++ Safety & Security Subcard -### Link the location of the training dataset's repository. +### Link the location of the repository for the training dataset. * http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip ### Describe the life critical impact (if present). @@ -194,7 +194,7 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe ### Is a mechanism in place to honor data subject right of access or deletion of personal data? * N/A -### If PII collected for the development of this AI model, was it minimized to only what was required? +### If PII collected for the development of this AI model, was it minimized to only what was required? * N/A ### Is there data provenance? diff --git a/models/model-cards/root-cause-analysis-model-card.md b/models/model-cards/root-cause-analysis-model-card.md index 92d88d916e..1c2f8bd6d9 100644 --- a/models/model-cards/root-cause-analysis-model-card.md +++ b/models/model-cards/root-cause-analysis-model-card.md @@ -23,73 +23,73 @@ limitations under the License. ## Description: * Root cause analysis is a binary classifier differentiating between ordinary logs and errors/problems/root causes in the log files.
-## References(s): -* Devlin J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
+## References: +* Devlin J. et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
-## Model Architecture: -**Architecture Type:** +## Model Architecture: +**Architecture Type:** * Transformers
-**Network Architecture:** +**Network Architecture:** * BERT
## Input: (Enter "None" As Needed) -**Input Format:** +**Input Format:** * CSV
-**Input Parameters:** +**Input Parameters:** * kern.log file contents
-**Other Properties Related to Output:** +**Other Properties Related to Output:** * N/A
## Output: (Enter "None" As Needed) -**Output Format:** +**Output Format:** * Binary Results, Root Cause or Ordinary
-**Output Parameters:** +**Output Parameters:** * N/A
-**Other Properties Related to Output:** -* N/A
+**Other Properties Related to Output:** +* N/A
## Software Integration: -**Runtime(s):** +**Runtime:** * Morpheus
-**Supported Hardware Platform(s):**
+**Supported Hardware Platforms:**
* Ampere/Turing
-**Supported Operating System(s):**
+**Supported Operating Systems:**
* Linux
-## Model Version(s): +## Model Versions: * v1
-# Training & Evaluation: +# Training & Evaluation: ## Training Dataset: -**Link:** +**Link:** * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/training-data/root-cause-training-data.csv
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** +**Properties (Quantity, Dataset Descriptions, Sensors):** * kern.log files from DGX machines
## Evaluation Dataset: -**Link:** +**Link:** * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/validation-data/root-cause-validation-data-input.jsonlines
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** +**Properties (Quantity, Dataset Descriptions, Sensors):** * kern.log files from DGX machines
## Inference: -**Engine:** +**Engine:** * Triton
**Test Hardware:**
@@ -107,22 +107,22 @@ limitations under the License. ## Model Card ++ Explainability Subcard -### Name example applications and use cases for this model. +### Name example applications and use cases for this model. * The model is primarily designed for testing purposes and serves as a small pre-trained model specifically used to evaluate and validate the Root Cause Analysis pipeline. This model is an example of customized transformer-based root cause analysis. It can be used for pipeline testing purposes. It needs to be re-trained for specific root cause analysis or predictive maintenance needs with the fine-tuning scripts in the repo. The hyperparameters can be optimised to adjust to get the best results with another dataset. The aim is to get the model to predict some false positives that could be previously unknown error types. Users can use this root cause analysis approach with other log types too. If they have known failures in their logs, they can use them to train along with ordinary logs and can detect other root causes they weren't aware of before. ### Intended Users. * This model is designed for developers seeking to test the root cause analysis pipeline with a small pre-trained model trained on a very small `kern.log` file from a DGX. -### Name who is intended to benefit from this model. +### Name who is intended to benefit from this model. * The intended beneficiaries of this model are developers who aim to test the functionality of the DFP pipeline using synthetic datasets -### Describe the model output. -* This model output can be used as a binary result, Root cause or Ordinary +### Describe the model output. +* This model output can be used as a binary result, Root cause or Ordinary -### Describe how this model works. +### Describe how this model works. * A BERT model gets fine-tuned with the kern.log dataset and in the inference it predicts one of the binary classes. Root cause or Ordinary. -### List the technical limitations of the model. +### List the technical limitations of the model. * For different log types and content, different models need to be trained. ### Has this been verified to have met prescribed NVIDIA quality standards? @@ -134,13 +134,13 @@ limitations under the License. ### What are the potential known risks to users and stakeholders? * N/A -### Link the relevant end user license agreement +### Link the relevant end user license agreement * [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)
-## Model Card ++ Saftey & Security Subcard +## Model Card ++ Safety & Security Subcard -### Link the location of the training dataset's repository. +### Link the location of the repository for the training dataset. * https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/models/datasets/training-data/root-cause-training-data.csv ### Describe the life critical impact (if present). @@ -183,11 +183,11 @@ limitations under the License. ### Is a mechanism in place to honor data subject right of access or deletion of personal data? * N/A -### If PII collected for the development of this AI model, was it minimized to only what was required? +### If PII collected for the development of this AI model, was it minimized to only what was required? * N/A ### Is there data provenance? -* Original raw logs are not saved. The small sample in the repo is saved for testing the pipeline. +* Original raw logs are not saved. The small sample in the repo is saved for testing the pipeline. ### Does data labeling (annotation, metadata) comply with privacy laws? * N/A diff --git a/models/training-tuning-scripts/fraud-detection-models/README.md b/models/training-tuning-scripts/fraud-detection-models/README.md index 14e4b32084..025d871ceb 100644 --- a/models/training-tuning-scripts/fraud-detection-models/README.md +++ b/models/training-tuning-scripts/fraud-detection-models/README.md @@ -60,6 +60,6 @@ python training.py --training-data $DATASET/training-data/fraud-detection-traini --model_dir models\ --model-type HinSAGE ``` -This results is a trained models of HeteroRGCN/HinSAGE (model.pt) and Gradient boosting tree (xgb.pt), hyperparmeters at the `model` directory. +This results is a trained models of HeteroRGCN/HinSAGE (model.pt) and Gradient boosting tree (xgb.pt), hyperparameters at the `model` directory. -Note the `model.py` used for both training & inference script is a symbolink for the `../../../examples/gnn_fraud_detection_pipeline/stages/model.py`. +Note the `model.py` used for both training & inference script is a symbolic link for the `../../../examples/gnn_fraud_detection_pipeline/stages/model.py`. diff --git a/models/triton-model-repo/README.md b/models/triton-model-repo/README.md index 07bce2c454..a173c2078a 100644 --- a/models/triton-model-repo/README.md +++ b/models/triton-model-repo/README.md @@ -63,7 +63,7 @@ To launch Triton with one of the models in `triton-model-repo`, this entire repo docker run --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD:/models --name tritonserver nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --model-control-mode=explicit --load-model sid-minibert-onnx ``` -### Load `abp-nvsmi-xgb` Model with FIL Backend Triton +### Load `abp-nvsmi-xgb` Model with FIL Back-end Triton ```bash docker run --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD:/models --name tritonserver triton_fil tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --model-control-mode=explicit --load-model abp-nvsmi-xgb diff --git a/morpheus/_lib/README.md b/morpheus/_lib/README.md index 38c08efa73..43b1e6f7e1 100644 --- a/morpheus/_lib/README.md +++ b/morpheus/_lib/README.md @@ -18,13 +18,12 @@ limitations under the License. **General architectural ideas** We build three libraries: -- libmorpheus : defines all the python aware library code for morpheus, and interface proxies for python modules. 
- - Interface proxies are designed to provide a single consolidated point of interaction between the morpheus - library code and their associated pybind11 module definitions. - - Please avoid declaring adhoc functions/interfaces that link to python modules. -- libmorpheus_utils : matx and table manipulation functions. -- libcudf_helpers : small bridge module used to extract cython based dataframe, and series information from cuDF. +- `libmorpheus` : Defines all the python aware library code for Morpheus, and interface proxies for python modules. + - Interface proxies are designed to provide a single consolidated point of interaction between the Morpheus library code and their associated pybind11 module definitions. + - Please avoid declaring ad-hoc functions/interfaces that link to python modules. +- `libmorpheus_utils` : MatX and table manipulation functions. +- `libcudf_helpers` : Small bridge module used to extract Cython based DataFrame, and series information from cuDF. -Python modules should be defined in `_lib/src/python_modules`, with an associated cmake declaration in -`_lib/cmake/.cmake` which can be included in `_lib/CMakeLists.txt`. \ No newline at end of file +Python modules should be defined in `_lib/src/python_modules`, with an associated CMake declaration in +`_lib/cmake/.cmake` which can be included in `_lib/CMakeLists.txt`. diff --git a/morpheus/stages/input/cloud_trail_source_stage.py b/morpheus/stages/input/cloud_trail_source_stage.py index cb27ce93f5..968fee7ef2 100644 --- a/morpheus/stages/input/cloud_trail_source_stage.py +++ b/morpheus/stages/input/cloud_trail_source_stage.py @@ -32,7 +32,7 @@ @register_stage("from-cloudtrail", modes=[PipelineModes.AE]) class CloudTrailSourceStage(AutoencoderSourceStage): """ - Load messages from a Cloudtrail directory. + Load messages from a CloudTrail directory. """ diff --git a/scripts/validation/kafka_testing.md b/scripts/validation/kafka_testing.md index 1b64f2aa2f..7b66be04c3 100644 --- a/scripts/validation/kafka_testing.md +++ b/scripts/validation/kafka_testing.md @@ -39,7 +39,7 @@ pytest --run_slow --run_kafka ``` 1. Launch Kafka using instructions from the [Quick Launch Kafka Cluster](../../docs/source/developer_guide/contributing.md#quick-launch-kafka-cluster) section of [contributing.md](../../docs/source/developer_guide/contributing.md) following steps 1-6. -1. The testing steps below will require two separate terminal windows. Each will need to have the `KAFKA_ADVERTISED_HOST_NAME`, `BROKER_LIST` and `MORPHEUS_ROOT` environment variables set. In the example below both morpheus and kafka-docker repositories have been checked out into the `~work` directory, replacing these paths with the location of your checkouts. +1. The testing steps below will require two separate terminal windows. Each will need to have the `KAFKA_ADVERTISED_HOST_NAME`, `BROKER_LIST` and `MORPHEUS_ROOT` environment variables set. In the example below both `morpheus` and `kafka-docker` repositories have been checked out into the `~work` directory, replacing these paths with the location of your checkouts. ```bash export MORPHEUS_ROOT=~/work/morpheus export KAFKA_ADVERTISED_HOST_NAME=$(docker network inspect bridge | jq -r '.[0].IPAM.Config[0].Gateway') @@ -52,7 +52,7 @@ pytest --run_slow --run_kafka -v ${MORPHEUS_ROOT}:/workspace wurstmeister/kafka /bin/bash ``` - Leave this terminal open the testing steps will refer to these as the "Kafka terminal", and commands executed from this terminal will be within the kafka container. 
+ Leave this terminal open the testing steps will refer to these as the "Kafka terminal" and commands executed from this terminal will be within the Kafka container. 1. Open a new terminal and navigate to the root of the Morpheus repo, this will be referred to as the "Morpheus terminal" and will be used for running Morpheus pipelines and verifying output. @@ -70,7 +70,7 @@ ulimit -n 4096 ## Simple Data Copying ### Checking KafkaSourceStage #### Single Partition Topic Test -1. From the Kafka terminal, create a topic called "morpheus-src-copy-test" with only a single partition. +1. From the Kafka terminal, create a topic called `morpheus-src-copy-test` with only a single partition. ```bash $KAFKA_HOME/bin/kafka-topics.sh --create --topic=morpheus-src-copy-test --partitions 1 --bootstrap-server `broker-list.sh` ``` @@ -134,7 +134,7 @@ ulimit -n 4096 ### Checking WriteToKafkaStage #### Single Partition Topic Test -1. From the Kafka terminal create a topic called "morpheus-sink-copy-test" with only a single partition, and start a consumer on that topic: +1. From the Kafka terminal create a topic called `morpheus-sink-copy-test` with only a single partition, and start a consumer on that topic: ```bash $KAFKA_HOME/bin/kafka-topics.sh --create --topic=morpheus-sink-copy-test --partitions 1 --bootstrap-server `broker-list.sh` @@ -162,12 +162,12 @@ ulimit -n 4096 ```bash diff -q --ignore-all-space <(cat ${MORPHEUS_ROOT}/.tmp/morpheus-sink-copy-test.jsonlines | jq --sort-keys) <(cat ${MORPHEUS_ROOT}/tests/tests_data/filter_probs.jsonlines | jq --sort-keys) ``` - Note the usage of `jq --sort-keys` which will reformat the json output, sorting the keys, this ensures that `{"a": 5, "b": 6}` and `{"b": 6, "a": 5}` are considered equivalent. + Note the usage of `jq --sort-keys` which will reformat the JSON output, sorting the keys, this ensures that `{"a": 5, "b": 6}` and `{"b": 6, "a": 5}` are considered equivalent. 1. Stop the consumer in the Kafka terminal. #### Partitioned Topic Test -1. From the Kafka terminal create a new topic named "morpheus-sink-copy-test-p" with three partitions, and start a consumer on that topic: +1. From the Kafka terminal create a new topic named `morpheus-sink-copy-test-p` with three partitions, and start a consumer on that topic: ```bash $KAFKA_HOME/bin/kafka-topics.sh --create --topic=morpheus-sink-copy-test-p --partitions 3 --bootstrap-server `broker-list.sh` diff --git a/tests/benchmarks/README.md b/tests/benchmarks/README.md index 7d6352c1ee..9aa0bd105a 100644 --- a/tests/benchmarks/README.md +++ b/tests/benchmarks/README.md @@ -108,7 +108,7 @@ pytest -s --run_benchmark --run_milvus --benchmark-enable --benchmark-warmup=on The `-s` option allows outputs of pipeline execution to be displayed so you can ensure there are no errors while running your benchmarks. -The `--benchmark-warmup` and `--benchmark-warmup-iterations` options are used to run the workflow(s) once before starting measurements. This is because the models deployed to Triton are configured to convert from ONNX to TensorRT on first use. Since the conversion can take a considerable amount of time, we don't want to include it in the measurements. The `--run_milvus` flag enables benchmarks which require the Milvus database. +The `--benchmark-warmup` and `--benchmark-warmup-iterations` options are used to run the workflows once before starting measurements. This is because the models deployed to Triton are configured to convert from ONNX to TensorRT on first use. 
Since the conversion can take a considerable amount of time, we don't want to include it in the measurements. The `--run_milvus` flag enables benchmarks which require the Milvus database. #### Running with an existing Milvus database @@ -158,7 +158,7 @@ with `000N` where N is incremented for every run. For example, the report file n A hook to `pytest-benchmark` was developed to add the following information to the JSON report: -GPU(s) used by Morpheus. For example: +GPUs used by Morpheus. For example: ``` "gpu_0": { "id": 0, @@ -171,24 +171,24 @@ GPU(s) used by Morpheus. For example: } ``` -Morpheus config for each workflow: -- num_threads -- pipeline_batch_size -- model_max_batch_size -- feature_length -- edge_buffer_size +Morpheus configuration for each workflow: +- `num_threads` +- `pipeline_batch_size` +- `model_max_batch_size` +- `feature_length` +- `edge_buffer_size` Additional benchmark stats for each workflow: -- input_lines -- min_throughput_lines -- max_throughput_lines -- mean_throughput_lines -- median_throughput_lines -- input_bytes -- min_throughput_bytes -- max_throughput_bytes -- mean_throughput_bytes -- median_throughput_bytes +- `input_lines` +- `min_throughput_lines` +- `max_throughput_lines` +- `mean_throughput_lines` +- `median_throughput_lines` +- `input_bytes` +- `min_throughput_bytes` +- `max_throughput_bytes` +- `mean_throughput_bytes` +- `median_throughput_bytes` ### Production DFP E2E Benchmarks