Loop detection
This page gives some background on Z-Push's loop detection mechanism.
By a "loop" we mean that a mobile device gets stuck at a certain point in the synchronization. This has several negative side effects, such as:
- Battery drain
- No new messages arrive on the phone, and changes made on the phone are not saved on the server
- Increased mobile data usage, possibly leading to additional costs
A loop is basically a re-request of the same synchronization state. Normally, Z-Push answers the request the same way it did the first time (if there were no other changes on the server).
Requesting the same data again can be perfectly normal. Imagine driving into a tunnel: your connection drops and your phone only receives part of the response. After the connection is restored, the mobile requests the same data again (with the same synchronization identifier) and receives it again. This is normal, healthy and expected.
The other case is that the mobile is not able to understand the data it received from the server. For example, imagine that the server sends the mobile a new appointment whose start time is after its end time. This is semantically wrong (how can it end before it started?) and the phone is not able to process this appointment. Most mobile phones just re-request the same synchronization state, "hoping" to receive data that makes sense. If the data really is wrong on the server, Z-Push sends the same broken appointment again, causing the phone to re-request it again. This goes on forever: the phone keeps retrying to obtain correct data, but the server keeps serving the same data. This causes the battery drain and increased data usage mentioned above. It only stops when your battery is dead or your mobile data plan is cut off by the operator. As the synchronization cannot proceed, no other changes from the server or the mobile are synchronized to the other side.
Just to be clear: this case is constructed to explain what a loop is. Z-Push will not send such an appointment to the mobile.
The so-called loop detection of Z-Push is a set of logic that tries to identify these situations and to "solve" the issue by itself, preventing the side effects mentioned above.
As said, there are "natural" cases for a re-request (loop), especially because we are talking about mobile and potentially unstable connections. You move at high speed, and sometimes a connection gets interrupted unexpectedly. This is aggravated by the fact that the mobile decides how many objects (e.g. emails) it wants to receive from the server. Most mobiles request between 5 and 50 objects, but a value of up to 512 is possible. Processing a large number of objects (imagine 100 emails) takes time. If these emails also contain attachments, we are easily talking about 10 MB or more that need to be streamed to your phone.
The process of "detecting" is easy: Z-Push knows which synchronization state was requested the last time. Let's call the last successfully completed state state5, and say the mobile has already requested state6 with 50 emails once. Z-Push knows this. If the phone now requests state6 again, we conclude that the last response was either not successfully received by the phone, or that one data object could not be processed. We do not know the exact cause, but we know that something went wrong.
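The detection step can be illustrated with a minimal sketch. This is not Z-Push's actual code: the `DeviceSyncRecord` class and the `is_rerequest` function are hypothetical names, and synchronization states are modelled as plain integers for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceSyncRecord:
    """Hypothetical per-device bookkeeping (an assumption, not Z-Push's real structure)."""
    last_requested_state: Optional[int] = None


def is_rerequest(record: DeviceSyncRecord, requested_state: int) -> bool:
    """Return True if the device asks for the same state it asked for last time."""
    repeated = record.last_requested_state == requested_state
    # Remember this request so the next one can be compared against it.
    record.last_requested_state = requested_state
    return repeated


record = DeviceSyncRecord()
print(is_rerequest(record, 6))  # False: first request for state6
print(is_rerequest(record, 6))  # True: state6 requested again, something went wrong
print(is_rerequest(record, 7))  # False: the device advanced to state7
```

Note that the check only tells us that a re-request happened, not why. That is exactly the situation described above: an unstable connection and a broken item look the same at this point.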
It could be that your mobile connection is unstable or that one of the 50 items is broken. The potential solution for both cases is the same: we enter the "loop detection mode". The server indicates this in the log file by printing:
Mobile loop detected! Messages sent to the mobile will be restricted to 1 items in order to identify the conflict.
For this request, instead of sending 50 emails, Z-Push only sends 1 for state6. In the case of an unstable connection, this will probably succeed, as only about 1/50th of the data is sent by the server (this is not numerically exact, as the amount of data per item differs). If the phone receives it correctly and is able to process the data, it will ask for the next state, state7, in the next request. At this point we have advanced (from state6 to state7), and we also know that the object (email) sent to the phone was processed correctly.
This goes on until all 50 items initially sent in the first request for state6 have been delivered to the phone. It also means that you will see the above message 50 times, for different requests, in your log.
This is still kind of normal, because this could happen in case of an unstable connection. It should go through fine, but at some cost:
- More battery will be used, because instead of 1 request, at least 51 requests are made by the phone in this example.
- More mobile data will be used, because the headers and other metadata of the 50 requests add overhead on top of the real data to be sent.
- There is more load on the server, because there are more requests to be handled.
Still, there is a good part: the data will get to your phone without getting stuck.
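To make the request count concrete, here is a small simulation of the unstable-connection case. It is a simplified model under stated assumptions, not Z-Push's implementation: the function name `simulate_sync` is hypothetical, exactly one response (the first one) is assumed to be lost, and every single-item response is assumed to arrive.

```python
def simulate_sync(total_items: int, window_size: int, first_response_lost: bool) -> int:
    """Count the requests needed to deliver total_items in this simplified model."""
    requests = 0
    delivered = 0
    if first_response_lost:
        # The first request for the full window never reaches the phone intact.
        requests += 1
        # Loop detection mode: the items of that first window go out one per request.
        for _ in range(min(window_size, total_items)):
            requests += 1
            delivered += 1
    # Normal mode: every request delivers up to window_size items.
    while delivered < total_items:
        requests += 1
        delivered += min(window_size, total_items - delivered)
    return requests


print(simulate_sync(50, 50, first_response_lost=False))  # 1
print(simulate_sync(50, 50, first_response_lost=True))   # 51
```

The second call reproduces the "at least 51 requests" figure from the list above: one lost request plus 50 single-item requests.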
There is of course still the case of the "broken" object, the incorrect appointment from above. Returning to the loop detection walkthrough, let's assume the synchronization advanced up to state20, so 14 objects have been successfully synchronized. The next object to be synchronized is the broken one.
So, the mobile requests state20, Z-Push sends the broken appointment, the mobile is not able to process it and requests state20 again. In this case, the loop detection knows that only one object was sent before (we are in loop detection mode) and that this item has been sent once already. Z-Push does not know (in this example) that the object is broken. As this could still be an unstable connection, Z-Push just sends the object again, but increases an internal counter to record that the same item has now been sent twice. In this example we of course know that this will fail again, so the phone asks for state20 again. The internal counter then reaches 3. This could still be a very bad connection, so we send the item one more time (3 times in total, each time alone, with only one object per request). If the phone is still not able to process it and requests state20 yet again, Z-Push ignores the item: it does not send it, but sends the next item on the list instead. The item is flagged internally as "broken - reason: causing loop". You can see this information in the device details via z-push-admin (or via the output of the z-push webservice), in the "needs attention" list.
If one item is ignored, loop detection is cleared immediately, because we assume that only one broken item was in the original list of 50 objects.
If the mobile then requests state20 again, after one object has already been ignored, the internal counter for the new item is set to 1 (counting starts for the next item as well). If state21 is requested, we resume as normal, sending the requested amount of objects, e.g. 50.
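The per-item counter described above can be sketched as follows. This is an illustrative model only: the `LoopDetector` class, its attributes and the item identifiers are hypothetical, and the limit of 3 sends is taken from the walkthrough rather than from Z-Push's source.

```python
from typing import Dict, List, Optional

MAX_SENDS = 3  # per the walkthrough: an item is sent 3 times before it is ignored


class LoopDetector:
    """Hypothetical model of the per-item counter used in loop detection mode."""

    def __init__(self) -> None:
        self.send_counts: Dict[str, int] = {}
        self.ignored: Dict[str, str] = {}

    def next_item(self, queue: List[str]) -> Optional[str]:
        """Pick the single item to send when the same state is requested again."""
        for item_id in queue:
            if item_id in self.ignored:
                continue
            count = self.send_counts.get(item_id, 0)
            if count >= MAX_SENDS:
                # Already sent 3 times without progress: flag it and move on.
                self.ignored[item_id] = "broken - reason: causing loop"
                continue
            self.send_counts[item_id] = count + 1
            return item_id
        return None


detector = LoopDetector()
queue = ["broken-appointment", "email-15"]
# The phone requests state20 four times in a row.
for attempt in range(1, 5):
    print(attempt, detector.next_item(queue))
print(detector.ignored)
```

In this run the broken appointment is sent on attempts 1 to 3. On attempt 4 it is ignored and the next item is sent instead, with its own counter starting at 1.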
So where does this leave us:
- Z-Push splits big synchronization requests into 1 object per request.
- We keep sending them 1 by 1 until we have reached the total number of items sent in the first repeated request, or until a broken object is ignored.
- You might see several log messages in your log. None of them indicates something is really wrong, it's just code in action. Be patient.
- There may be items that are not visible on your phone. Keep an eye on z-push-admin to identify such cases. To force the synchronization of such an item, locate it (e.g. with Zarafa Webapp), edit it and save it. This puts it back on the synchronization list, and the sync will be attempted again.
I hope this clarifies the loop detection a little bit. This is just an overview; there are a few details and corner cases that are not described here.
Some additional notes:
- These are just examples. Z-Push would ignore an appointment with "a start date after the end date" even before sending it to the mobile, because it's semantically broken. For the user the result is the same: the item is not on the mobile. The reason will be stated in the output of z-push-admin.
- A "windowSize" of 50 was used in this example (the number of items requested by the phone). This can be any other number, but usually it's 5, 10, 20, 25, 50 or 100. This also determines how long the loop detection might take (5 requests take roughly a quarter of the time of 20 requests).
- Whether we speak about "emails", "objects", "appointments" or "items", it is all the same to us here. The synchronization and loop detection logic is the same for any kind of data object (email, contact, appointment, task or note).
z-push-admin has an action called "clearloop". This throws away the internal counters of the loop detection, either system-wide (if no user/deviceid is specified) or only for a specific user/device combination.
This can be useful when loop detection should not be attempted. After calling "clearloop", any running loop detection is dropped. But if re-requests happen again, loop detection will be initiated again.
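The two scopes of "clearloop" can be pictured with a small model. This sketch does not show how z-push-admin is invoked or how Z-Push stores its data: the `clear_loop` function and the dictionary layout keyed by (user, device id) are assumptions made purely for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical storage: (user, device id) -> per-item send counters.
Counters = Dict[Tuple[str, str], Dict[str, int]]


def clear_loop(counters: Counters,
               user: Optional[str] = None,
               device_id: Optional[str] = None) -> int:
    """Drop loop detection counters and return how many user/device entries were removed."""
    if user is None and device_id is None:
        # No user/device given: clear system-wide.
        removed = len(counters)
        counters.clear()
        return removed
    if user is None or device_id is None:
        raise ValueError("specify both user and device_id, or neither")
    # Clear only the given user/device combination.
    return 1 if counters.pop((user, device_id), None) is not None else 0


counters: Counters = {
    ("alice", "dev1"): {"item-1": 2},
    ("bob", "dev2"): {"item-9": 1},
}
print(clear_loop(counters, "alice", "dev1"))  # 1: only alice's device is cleared
print(clear_loop(counters))                   # 1: the remaining entry is cleared system-wide
```

Clearing only removes the counters. As described above, a new re-request simply starts loop detection from scratch.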
If you write your own backend, you should critically look at the data sent by the server (WBXML log). Most probably, some of the values (or even tags) are not understood by the mobile.
Try eliminating corner cases (e.g. recurrences) first in order to narrow down the potential issue.