Describe the bug
Connecting to AWS IoT using iot.AwsIotMqtt5ClientConfigBuilder.newWebsocketMqttBuilderWithSigv4Auth works fine on the initial connection.
After 24h the broker closed the socket (this is normal and expected), and the client then tries to reconnect but fails, forever, with the error reasonString: 'CONNACK:ClientId is invalid:...', reasonCode: 133.
Some investigation:
Logging the initial connection event's connack, the client id looks something like assignedClientIdentifier: '$GEN/8b42676c-3300-4b4d-8c4f-9b75a17f7999', while the failed attempts' connack states reasonString: 'CONNACK:ClientId is invalid:c91911b9-d401-8fe2-261c-e4374f470aff'. I'm not sure where the $GEN/ prefix is coming from, but maybe it's a clue.
Expected Behavior
The re-connection succeeds, just like the initial connection.
Current Behavior
The Mqtt5Client fires a connectionFailure event on re-connection and tries to reconnect forever. Restarting the process/client helps and the new initial connection succeeds immediately.
Reproduction Steps
Make sure AWS credentials (like AWS_SECRET_ACCESS_KEY in the env) are present and belong to an identity with the AWSIoTDataAccess policy attached.
Put your AWS IoT Core endpoint URL into the env at BROKER_ENDPOINT.
Notice the initial connection succeeding via the console log of connectionSuccess (see the sketch at the end of this section).
Now wait 24h for the broker to close the websocket 😄
...or force-close the TCP socket like this (under Linux). Maybe there is an easier way, like disconnecting the network interface, but I haven't tried that.
1. netstat -tnp to find the PID of the node process connected to AWS IoT (the foreign address should be port 443, as we are using MQTT over WebSocket)
2. lsof -np $PID where $PID is the PID found in the previous step. Look for the file descriptor column ("FD") and note the FD of the TCP connection to the broker (the one connected to remote port 443)
3. gdb -p $PID where $PID is from step 1
4. In the gdb console run call close($FD) where $FD is the file descriptor obtained via lsof
5. Type quit to exit gdb
6. Wait at most 2 minutes for the client to heartbeat, realize the socket is closed, and attempt to reconnect
Here is my log output, attached as log.txt.
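For reference, the repro boils down to roughly the following (a minimal sketch, not my exact script; it assumes BROKER_ENDPOINT is set as described above and that credentials are resolved from the default provider chain):

import { iot, mqtt5 } from 'aws-iot-device-sdk-v2';

// Build a websocket config whose upgrade request is signed with SigV4.
const builder = iot.AwsIotMqtt5ClientConfigBuilder.newWebsocketMqttBuilderWithSigv4Auth(
    process.env.BROKER_ENDPOINT! // e.g. xxxxxxxx-ats.iot.<region>.amazonaws.com
);

const client = new mqtt5.Mqtt5Client(builder.build());

client.on('connectionSuccess', (event) => {
    // On the very first connect this logs an assignedClientIdentifier of the form '$GEN/<uuid>'.
    console.log('connectionSuccess', JSON.stringify(event.connack));
});

client.on('connectionFailure', (event) => {
    // After the broker drops the socket, every retry lands here with
    // reasonCode 133 and "CONNACK:ClientId is invalid:<uuid>".
    console.log('connectionFailure', event.error, JSON.stringify(event.connack));
});

client.start();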
Possible Solution
I read in the docs that the client id is regenerated on re-connection, which is precisely what we want, but maybe it is malformed. Also, the $GEN/ part in the initial connection's id might be a clue.
Additional Information/Context
As for authentication, I tried both Fargate task roles and plain access keys in a local test environment; same behavior. The attached policy was AWSIoTDataAccess, which is pretty permissive, so I think I've ruled out an authorization issue.
SDK version used
1.17.0
Environment details (OS name and version, etc.)
Docker container running node:20.9 image, Linux host OS.
Thanks @bretambrose for your quick reply and identifying the issue. Using a locally generated UUID seems to have fixed the problem, thanks for that. I'm not sure whether I should close this issue or you'd rather it stay open.
For anyone else facing the issue, here is some code for the workaround:
import { v4 as uuidv4 } from 'uuid';
// ...
// Pin a locally generated client id (and a 120s keep-alive) so the broker no
// longer has to assign one for us.
builder.withConnectProperties({ keepAliveIntervalSeconds: 120, clientId: uuidv4() });
// ...
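Note that withConnectProperties has to be called on the config builder before builder.build() / constructing the Mqtt5Client; with the clientId pinned this way, the same id is sent on the initial connect and on every reconnect attempt, and the broker accepts the session again.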
Let's keep it open until the overall issue is resolved. We don't know yet if IoT Core is willing to try and change the behavior here (it's not compliant with the MQTT5 spec, but sometimes you need to bend the rules a bit when scaling a service to the size of AWS). If they fix it, great; otherwise we need to disable the ability to use auto-assigned topic aliasing from the SDK (likely by using a uuid for the client id if none is provided).
jmklix added the p2 label (standard priority issue) and removed the p1 label (high priority issue) on Jan 2, 2024