This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Once the experiment reaches a certain point, it generally stops running and reports an error. #5802

Open
EternityJune25 opened this issue Aug 9, 2024 · 1 comment

@EternityJune25

Describe the issue:

"Once the experiment reaches a certain point, it generally stops running and reports an error."

[2024-08-09 23:59:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 10
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/root/miniconda3/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 10

content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "PERIODICAL", "sequence": 199, "value": "0.2895440735801888"}'
}
[2024-08-10 00:00:06] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'ME',
content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "FINAL", "sequence": 0, "value": "0.2898187191127104"}'
}
[2024-08-10 00:00:07] INFO (NNIManager) Trial job YbXt7 status changed from RUNNING to SUCCEEDED
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'EN',
content: '{"trial_job_id":"YbXt7","event":"SUCCEEDED","hyper_params":"{\"parameter_id\": 12, \"parameter_source\": \"algorithm\", \"parameters\": {\"activate\": \"elu\", \"d_emb\": 64, \"d_hid\": 32, \"drop\": 0.3884039376983632, \"gamma\": 6.4905452738897065, \"l1\": 1.4578424787079767, \"l2\": 38.44410448714523, \"l4\": 0.29277084068918136, \"lr\": 9.015207683143664e-05, \"mask\": 0.004542790568841141, \"mode\": \"GAT\", \"t\": 0.6139793721895512, \"mask_edge\": 0.07705512469912157, \"instance_temperature\": 0.6737029785000441, \"cluster_temperature\": 0.5472419195458156}, \"parameter_index\": 0}"}'
}
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'GE', content: '1' }
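
For readers following the traceback: the failing statement is `self._running_params.pop(parameter_id)`, which raises `KeyError` when the given id is not in the dictionary. The sketch below reproduces only that pattern. `ToyTuner`, `receive_trial_result_tolerant`, and the parameter values are hypothetical stand-ins and are not NNI's code; the sketch does not show why id 10 was missing in this experiment, and the tolerant variant is an illustration, not a confirmed fix.

```python
class ToyTuner:
    """Hypothetical stand-in for a tuner; NOT NNI's TpeTuner."""

    def __init__(self):
        # Maps parameter_id -> parameters of trials that are still running.
        self._running_params = {}
        self.history = []

    def generate_parameters(self, parameter_id):
        params = {"lr": 0.001}  # made-up value for illustration
        self._running_params[parameter_id] = params
        return params

    def receive_trial_result(self, parameter_id, value):
        # Same pattern as the failing line in the traceback:
        # pop() without a default raises KeyError for an unknown id.
        params = self._running_params.pop(parameter_id)
        self.history.append((params, value))

    def receive_trial_result_tolerant(self, parameter_id, value):
        # Assumed defensive variant: skip results for ids that are not tracked.
        params = self._running_params.pop(parameter_id, None)
        if params is None:
            return False
        self.history.append((params, value))
        return True


tuner = ToyTuner()
tuner.generate_parameters(10)
tuner.receive_trial_result(10, 0.29)      # first result: id 10 is removed

try:
    tuner.receive_trial_result(10, 0.29)  # id 10 is no longer tracked
except KeyError as exc:
    error_repr = repr(exc)

tolerant_ok = tuner.receive_trial_result_tolerant(10, 0.29)
print(error_repr, tolerant_ok)
```

In this toy case the second call fails because the id was already removed; any other path that leaves the id untracked would produce the same `KeyError`.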

Environment:

  • NNI version:
  • Training service (local|remote|pai|aml|etc):
  • Client OS:
  • Server OS (for remote mode only):
  • Python version:
  • PyTorch/TensorFlow version:
  • Is conda/virtualenv/venv used?:
  • Is running in Docker?:

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

  • nnimanager.log:
  • dispatcher.log:
  • nnictl stdout and stderr:

How to reproduce it?:

@DiamondNova

I have the same issue.
