Describe the bug
Connecting to AWS IoT using iot.AwsIotMqtt5ClientConfigBuilder.newWebsocketMqttBuilderWithSigv4Auth works fine on the initial connection.
After 24h the broker closed the socket (this is normal and expected), and the client then tries to reconnect but fails, forever, with the error reasonString: 'CONNACK:ClientId is invalid:...', reasonCode: 133.
Some investigation:
Logging the initial connection event's connack, the client id looks something like assignedClientIdentifier: '$GEN/8b42676c-3300-4b4d-8c4f-9b75a17f7999', while the failed attempts' connack states reasonString: 'CONNACK:ClientId is invalid:c91911b9-d401-8fe2-261c-e4374f470aff'. I'm not sure where the $GEN/ prefix is coming from, but maybe it's a clue.
Expected Behavior
The re-connection succeeds, just like the initial connection.
Current Behavior
The Mqtt5Client fires a connectionFailure event on re-connection and tries to reconnect forever. Restarting the process/client helps and the new initial connection succeeds immediately.
Reproduction Steps
Make sure AWS credentials (like AWS_SECRET_ACCESS_KEY in the env) are present and belong to an identity with the AWSIoTDataAccess policy attached.
Put your AWS IoT Core endpoint URL into the env at BROKER_ENDPOINT.
Notice the initial connection succeeding via the console log of connectionSuccess (see the sketch at the end of this section).
Now wait 24h for the broker to close the websocket 😄
...or force-close the TCP socket like this (under Linux). Maybe there is an easier way, like disconnecting the network interface, but I haven't tried that.
1. netstat -tnp to find the PID of the node process connected to AWS IoT (the foreign address should be port 443, as we are using MQTT over WebSocket)
2. lsof -np $PID where $PID is the PID found in the previous step. Look for the file descriptor column ("FD") and note the FD of the TCP connection to the broker (the one connected to remote port 443)
3. gdb -p $PID where $PID is from step 1
4. In the gdb console run call close($FD) where $FD is the file descriptor obtained via lsof
5. Type quit to exit gdb
6. Wait at most 2 minutes for the client to heartbeat, realize the socket is closed, and attempt to reconnect
Here is my log output, attached as log.txt.
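For reference, the repro boils down to roughly the following (a minimal sketch, not my exact script; it assumes BROKER_ENDPOINT is set as described above and that credentials are resolved from the default provider chain):

import { iot, mqtt5 } from 'aws-iot-device-sdk-v2';

// Build a websocket config whose upgrade request is signed with SigV4.
const builder = iot.AwsIotMqtt5ClientConfigBuilder.newWebsocketMqttBuilderWithSigv4Auth(
    process.env.BROKER_ENDPOINT! // e.g. xxxxxxxx-ats.iot.<region>.amazonaws.com
);

const client = new mqtt5.Mqtt5Client(builder.build());

client.on('connectionSuccess', (event) => {
    // On the very first connect this logs an assignedClientIdentifier of the form '$GEN/<uuid>'.
    console.log('connectionSuccess', JSON.stringify(event.connack));
});

client.on('connectionFailure', (event) => {
    // After the broker drops the socket, every retry lands here with
    // reasonCode 133 and "CONNACK:ClientId is invalid:<uuid>".
    console.log('connectionFailure', event.error, JSON.stringify(event.connack));
});

client.start();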
Possible Solution
I read in the docs that the client id is regenerated on re-connection, which is precisely what we want, but maybe it is malformed. Also, the $GEN/ part in the initial connection's id might be a clue.
Additional Information/Context
As for authentication, I tried both Fargate task roles and plain access keys in a local test environment; same behavior. The attached policy was AWSIoTDataAccess, which is pretty permissive, so I think I've ruled out an authorization issue.
SDK version used
1.17.0
Environment details (OS name and version, etc.)
Docker container running node:20.9 image, Linux host OS.
Thanks @bretambrose for your quick reply and identifying the issue. Using a locally generated UUID seems to have fixed the problem, thanks for that. I'm not sure whether I should close this issue or you'd rather it stay open.
For anyone else facing the issue, here is some code for the workaround:
import { v4 as uuidv4 } from 'uuid';
// ...
// Pin a locally generated client id (and a 120s keep-alive) so the broker no
// longer has to assign one for us.
builder.withConnectProperties({ keepAliveIntervalSeconds: 120, clientId: uuidv4() });
// ...
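Note that withConnectProperties has to be called on the config builder before builder.build() / constructing the Mqtt5Client; with the clientId pinned this way, the same id is sent on the initial connect and on every reconnect attempt, and the broker accepts the session again.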
Let's keep it open until the overall issue is resolved. We don't know yet if IoT Core is willing to try and change the behavior here (it's not compliant with the MQTT5 spec, but sometimes you need to bend the rules a bit when scaling a service to the size of AWS). If they fix it, great; otherwise we need to disable the ability to use auto-assigned topic aliasing from the SDK (likely by using a uuid for the client id if none is provided).
jmklix added the p2 label (standard priority issue) and removed the p1 label (high priority issue) on Jan 2, 2024