Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE attributes are not published / can't be set after a while #155

Open
cyberhuman opened this issue May 8, 2022 · 5 comments
Open

BLE attributes are not published / can't be set after a while #155

cyberhuman opened this issue May 8, 2022 · 5 comments
Labels

Comments

@cyberhuman
Copy link

Describe the bug
ble2mqtt stops publishing BLE GATT values after working for some time. After it happens, trying to set any GATT value does not have any effect either.
Interestingly, it still continues to publish beacon information.

To Reproduce

  1. Subscribe to the MQTT topic prefix defined in the configuration.
  2. Reset the device.
  3. Observe updates are being published for nearby devices.
  4. After less than 10 minutes, observe no updates are being published.
  5. Trying to set a GATT value by publishing to <mqtt_prefix>/<bt_mac>/<service_uuid>/<attr_uuid>/Set does not have any effect.

Expected behavior
GATT values can be read and written at all times.

Configuration and logs

{
  "network": {
    "hostname": "ble2mqtt/rooms/or",
    "wifi": {
      "ssid": "ssid",
      "password": "password"
    }
  },
  "mqtt": {
    "server": {
      "host": "192.168.1.1",
      "port": 1883,
      "username": "username",
      "password": "password"
    },
    "publish": {
      "retain": false
    },
    "topics": {
      "prefix": "ble2mqtt/devices/"
    }
  },
  "ble": {
    "whitelist": [
      "a4:0e:17:83:fa:f6",
      "cb:71:a1:6b:1c:b7",
      "dd:34:02:??:??:??",
      "fd:5a:5a:2b:ed:99",
      "f0:70:f0:c7:2c:3f"
    ]
  }
}

Logs obtained by running ./idf.py monitor:
2022-05-08.log

MQTT:
mqtt.log

Additional context
I'm running esp32-ble2mqtt on M5Stack's AtomU ESP32 Development Kit with USB-A.

@cyberhuman cyberhuman added the bug label May 8, 2022
@shmuelzon
Copy link
Owner

Hey,

Thanks for the log! First of all, there's a crash there, which is kind of odd, but I can make sure it doesn't happen.
As for it stopping to publish, I can see in line 1337 that it's trying to connect to device a4:0e:17:83:fa:f6 but I don't see any response from the ESP32 BLE libraries about that connection attempt succeeding or failing. I do see that while trying to connect to it, we disconnected from two other devices (lines 1345 and 1348). The same goes later for fd:5a:5a:2b:ed:99 and cb:71:a1:6b:1c:b7 where we try to reconnect to them but no response is received.
To be honest, I'm not quite sure what to do with this. Does it happen often?

It might also be related that the ESP32 and the BLE devices have a very poor connection. I see many connection attempts that fail and the ones that do succeed disconnect shortly after.

@cyberhuman
Copy link
Author

Hi! Thank you for looking into this issue. Yes it happens to all four of my esp32 sticks all the time, just need to wait 10 minutes or so. I can help with debugging if you provide instructions; I really want it to work 🙂 But idk where to start, I don't write anything for esp32.

@cyberhuman
Copy link
Author

I figured out how to enable verbose logging and here it is: 2022-05-08-002.log
For some reason, this time it took much longer for the issue to appear.

@shmuelzon
Copy link
Owner

Thanks for the log, some weird and interesting stuff there.

When we try to connect to a BLE device, we stop scanning during that time frame. IIRC, at the time, the ESP had issues to do both simultaneously so that was added. What I see from the logs is that at some point, we try to connect to a device and stop scanning, then (before we get a response from that connection attempt) we try to connect to a different device (which happens because we got a notification on it before the scanning was actually stopped) and that also tries to stop scanning. That second stopping event fails with an error Device in wrong mode for request which makes sense since it's already stopped. In the meantime, we get a response for the first connection attempt and try to re-enable scanning but that also fails, for some reason, with Device has been reset 🤷‍♂️. After the second connection attempt we try again to start scanning but receive the same error and then we no longer scan for BLE devices. We never detect new devices (or old ones that we've disconnected from) and we basically do nothing.

So, we can first try to remove that mechanism altogether since it might not be needed anymore and will remove some noise. Can you apply the below (untested) patch and see what happens?

diff --git a/main/ble.c b/main/ble.c
index e189601..7ab75a3 100644
--- a/main/ble.c
+++ b/main/ble.c
@@ -159,8 +159,6 @@ static inline void ble_operation_perform(ble_operation_t *operation)
     switch (operation->type)
     {
     case BLE_OPERATION_TYPE_CONNECT:
-        /* Stop scanning while attempting to connect */
-        esp_ble_gap_stop_scanning();
         esp_ble_gattc_open(g_gattc_if, operation->device->mac,
             operation->device->addr_type, true);
         break;
@@ -805,10 +803,6 @@ static void esp_gattc_cb(esp_gattc_cb_event_t event, esp_gatt_if_t gattc_if,

         need_dequeue = 1;

-        /* Resume scanning, if requested */
-        if (scan_requested)
-            esp_ble_gap_start_scanning(-1);
-
         if (param->open.status != ESP_GATT_OK)
         {
             ESP_LOGE(TAG, "Open failed, status = 0x%x", param->open.status);

As for a real fix, we have a queue that's supposed to make sure we don't try to perform multiple BLE operations at once and we wait until the operation completed before dequeuing the next operation. There's also another mechanism that if the queue is empty, we don't do the operation right away but wait for 0.5s to let it fill up (like when you first connect to a device and want to read all of its characteristics). This was intended just to keep the CPU at bay when querying a device and a burst of BLE operations is added without the overhead of actually doing anything yet. I now understand that this also had an added (unintended) benefit for the case where the last operation in the queue, which is still in progress (like trying to read a characteristic), has time to complete before a new operation is added to the (now empty) queue is actually executed.

What happens in your case is that connecting to devices often fails. I don't know if it's because of the distance between the ESP and the BLE devices or just a noisy environment, but connection attempts do timeout and that takes over 0.5 seconds. So, I have a connect event in the queue, I dequeue it and ask the ESP library to connect. Then I add another connection request to queue, wait 0.5s and send it too to the ESP library before the first one completed. We can try to add another flag somewhere that an operation is still pending a response and make sure not to dequeue anything if the 0.5s timer expires and that way it'll be dequeued only when the first operation completed (as is the regular case).

This might be achieved with something (also untested) like:

diff --git a/main/ble.c b/main/ble.c
index e189601..5396858 100644
--- a/main/ble.c
+++ b/main/ble.c
@@ -59,6 +59,7 @@ typedef struct ble_operation_t {
 
 /* Internal state */
 static uint8_t scan_requested = 0;
+static uint8_t operation_in_progress = 0;
 static esp_gatt_if_t g_gattc_if = ESP_GATT_IF_NONE;
 static ble_device_t *devices_list = NULL;
 static ble_operation_t *operation_queue = NULL;
@@ -200,10 +201,14 @@ static void ble_operation_dequeue(ble_operation_t **queue)
         "len: %u, val: %p", operation->type, MAC_PARAM(operation->device->mac),
         UUID_PARAM(operation->characteristic->uuid), operation->len,
         operation->value);
+    operation_in_progress = 1;
     ble_operation_perform(operation);
 
     if (operation->type == BLE_OPERATION_TYPE_WRITE_CHAR)
+    {
+        operation_in_progress = 0;
         ble_operation_dequeue(queue);
+    }
 
     if (operation->len)
         free(operation->value);
@@ -212,7 +217,9 @@ static void ble_operation_dequeue(ble_operation_t **queue)
 
 static void ble_queue_timer_cb(TimerHandle_t xTimer)
 {
-    ESP_LOGD(TAG, "Queue timer expired");
+    ESP_LOGD(TAG, "Queue timer expired, operation_in_progress: %u", operation_in_progress);
+    if (operation_in_progress)
+        return;
     ble_operation_dequeue(&operation_queue);
 }
 
@@ -994,7 +1001,10 @@ static void esp_gattc_cb(esp_gattc_cb_event_t event, esp_gatt_if_t gattc_if,
     }
 
     if (need_dequeue)
+    {
+        operation_in_progress = 0;
         ble_operation_dequeue(&operation_queue);
+    }
 }
 
 static void ble_purge_device_list_timer_cb(TimerHandle_t xTimer)

I apologize if the above was too much but at least I have a better understanding of your issue now.

@cyberhuman
Copy link
Author

Thanks for the very detailed analysis!
I tried the proposed patches. With the first patch, ESP32 doesn't try to connect to any device at all 🤷🏻‍♂️ Here is the log with both patches applied (the second patch doesn't have any effect on this behaviour):
2022-05-10-001.log.
Currently I'm running the device with only the second patch, will send the log later if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants