To run algorithms such as SM-NAS and SP-NAS, you must first install the open-source software mmdetection. For installation details, see the mmdetection installation guide.
Before running the benchmark, install the open-source software NASBench. For installation details, see the NASBench installation guide.
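As a rough sketch only, one common way to install both packages is shown below; the package name `mmdet` and the NASBench repository URL are the upstream defaults and may not match the exact versions Vega expects, so verify them against each project's installation guide.

```bash
# mmdetection: verify the version required by SM-NAS/SP-NAS in its official guide.
pip3 install mmdet --user

# NASBench: installed from the upstream source repository.
git clone https://github.com/google-research/nasbench.git
pip3 install ./nasbench --user
```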
After Vega is installed with the `pip3 install noah-vega --user` command, the third-party open-source packages that Vega depends on are not yet installed. Run the `python3 -m vega.tools.install_pkgs` command to install them, and run `python3 -m vega.tools.install_pkgs -h` to view the available installation options. During installation, an `install_dependencies.sh` file is generated in the current directory to facilitate troubleshooting and debugging.
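Put together, the sequence looks like this:

```bash
# Install Vega for the current user.
pip3 install noah-vega --user

# Review the available options, then install the third-party packages Vega depends on.
python3 -m vega.tools.install_pkgs -h
python3 -m vega.tools.install_pkgs

# If the installation fails, inspect the script generated in the current directory.
cat install_dependencies.sh
```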
The possible causes are as follows:
- The network is not registered with Vega. Before invoking the network, register it with the `@ClassFactory.register` decorator, as shown in the sketch after this list. For details, see https://github.com/huawei-noah/vega/tree/master/examples/fully_train/fmd.
- The model description file of the network is incorrect. You can locate the fault based on the `<model desc>` information in the exception message.
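A minimal registration sketch is given below, assuming a recent Vega release; the exact import path and `ClassType` constant may differ between versions, so check the linked example for the form your release uses.

```python
# Hedged sketch: register a custom network so Vega can resolve it by class name.
# The import path below matches recent Vega releases; older releases may expose
# ClassFactory/ClassType from a different module.
import torch.nn as nn
from vega.common import ClassFactory, ClassType


@ClassFactory.register(ClassType.NETWORK)
class MyNetwork(nn.Module):
    """Example network, referenced from the model description by its class name."""

    def __init__(self, num_class=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_class)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)
```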
1.5 Exception `ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory`
A system dependency library required by `opencv-python` is missing. Run the following command to rectify the fault:
sudo apt install libglib2.0-0
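After installing, you can optionally confirm that the library is visible to the dynamic linker. On RHEL/CentOS-style systems the equivalent package is typically named glib2; that name is an assumption, so check your distribution's package index.

```bash
# Confirm that libgthread can now be resolved.
ldconfig -p | grep libgthread
```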
This type of exception is usually caused by the failure to find `dask-scheduler` in `PATH`. Generally, the file is installed in `/<user home path>/.local/bin`.
After Vega is installed and `python3 -m vega.tools.install_pkgs` is executed, `/<user home path>/.local/bin/` is automatically added to the `PATH` environment variable. The setting does not take effect immediately; it takes effect only after you log in to the server again.
If the problem persists after you log in to the server again, check whether the `dask-scheduler` file exists in the `/<user home path>/.local/bin` directory. If the file does not exist, run the `python3 -m vega.tools.install_pkgs` command again and ensure that no exception occurs during the installation.
If the file already exists, manually add `/<user home path>/.local/bin` to the `PATH` environment variable, for example as sketched below.
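A sketch of the manual check and fix, assuming the default user-local install location mentioned above:

```bash
# Check whether dask-scheduler was installed into the user-local bin directory.
ls ~/.local/bin/dask-scheduler

# Add the directory to PATH for the current shell ...
export PATH="$HOME/.local/bin:$PATH"

# ... and persist the setting for future logins.
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
```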
If multiple GPUs or NPUs are deployed on the host running Vega, you can set the following configuration items to use them:
general:
    parallel_search: True
    parallel_fully_train: True
    devices_per_trainer: 1
Where:
- parallel_search: controls whether multiple models are searched in parallel during the model search phase; each model uses one or more GPUs/NPUs.
- parallel_fully_train: controls whether multiple models are trained in parallel during the fully train phase; each model uses one or more GPUs/NPUs.
- devices_per_trainer: when either of the preceding parameters is set to True, this parameter specifies the number of GPUs/NPUs used by each model.
Note: The CARS and DARTS algorithms do not support parallel search.
If there are multiple GPUs in the running environment, you can control which GPUs Vega uses through the `CUDA_VISIBLE_DEVICES` environment variable, for example:
Using a single GPU:
CUDA_VISIBLE_DEVICES=1 python3 ./run_pipeline.py ./nas/backbone_nas/backbone_nas.yml
Using multiple GPUs:
CUDA_VISIBLE_DEVICES=2,3 python3 ./run_pipeline.py ./nas/backbone_nas/backbone_nas.yml
You can load a pre-trained model by modifying the configuration. For example, to load the pre-trained model `simple_cnn.pth`:
model:
    model_desc:
        modules: [backbone]
        backbone:
            type: SimpleCnn
            num_class: 10
            fp16: False
    pretrained_model_file: "./simple_cnn.pth"
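For reference, the following is a hedged sketch of producing such a checkpoint file, assuming `pretrained_model_file` points at a standard PyTorch state_dict (verify this against your Vega version); `TinyCnn` below is a hypothetical stand-in, not Vega's `SimpleCnn`.

```python
# Hypothetical example: save a PyTorch state_dict that a configuration entry such
# as pretrained_model_file could point to. TinyCnn is only a stand-in model.
import torch
import torch.nn as nn


class TinyCnn(nn.Module):
    def __init__(self, num_class=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, num_class)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))


torch.save(TinyCnn().state_dict(), "./simple_cnn.pth")
```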
By default, Vega logs are stored in the following path:
./tasks/<task id>/logs
To configure the log level, modify the following configuration items:
general:
    logger:
        level: info  # debug|info|warn|error
Vega can visualize the progress of the model search process. To enable it, set `VisualCallBack` in the trainer callbacks within `USER.yml` as follows:
trainer:
    type: Trainer
    callbacks: [VisualCallBack, ]
The output directory of the visualized information is as follows:
./tasks/<task id>/visual
Run the `tensorboard --logdir PATH` command on the active node to start the service, and view the progress in a browser. For details, see the TensorBoard commands and instructions.
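For example, pointing TensorBoard at the visual output directory above (assuming a TensorBoard 2.x release; the port is arbitrary):

```bash
# Serve the visualization data written by VisualCallBack.
# Replace <task id> with your actual task ID.
tensorboard --logdir ./tasks/<task id>/visual --port 6006 --bind_all
```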