New functionality:
- The topic tab now also displays partition information.
- Implemented a Java-based in-memory data store. Right now the front-end uses it only lightly, but over time it will replace the current implementation. It made the following new functionality possible:
- Report if consumer-group is currently active
- This will eventually allow us to report on inactive consumer-groups
- (Experimental) Report a Burrow-like consumer-group status calculation via a REST endpoint (/consumergroup), with Burrow's rules adjusted slightly. The rules I implemented are as follows:
- Evaluate per consumer-group topic-partition:
- Rule 0: If there are no committed offsets, then there is nothing to calculate and the period is OK.
- Rule 1: If the difference between now and the last offset timestamp is greater than the difference between the last and first offset timestamps, the consumer has stopped committing offsets for that partition (error).
- Rule 2: If the consumer offset decreases from one interval to the next, the partition is marked as a rewind (error).
- Rule 3: If the lag is ever zero for the partition over the stored period, the period is OK.
- Rule 4: If the consumer offset does not change and the lag is non-zero, it's an error (partition is stalled).
- Rule 5: If the consumer offsets are moving but the lag is consistently increasing, it's a warning (consumer is slow).
- Roll-up all consumer-group topic-partitions per consumer-group and report a consumer-group status:
- Set consumer-group status to ERROR if any topic-partition status is STOP
- Set consumer-group status to ERROR if any topic-partition status is REWIND
- Set consumer-group status to ERROR if any topic-partition status is STALL
- Set consumer-group status to WARN if any topic-partition status is WARN
- Set consumer-group status to OK if none of the above rules match
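As a rough illustration of the rules and roll-up listed above, the following Java sketch evaluates one topic-partition and then rolls a group's partition statuses up into a group status. The class, method, and field names are hypothetical and not taken from the actual code. It also assumes that the rules are applied in numbered order with the first match winning, and that "consistently increasing" means the lag grows between every pair of consecutive stored samples.

```java
import java.util.Collection;
import java.util.List;

final class ConsumerGroupStatusSketch {

    enum PartitionStatus { OK, WARN, STOP, REWIND, STALL }
    enum GroupStatus { OK, WARN, ERROR }

    /** One stored observation for a topic-partition (hypothetical shape). */
    static final class Sample {
        final long timestampMs;
        final long offset;
        final long lag;

        Sample(long timestampMs, long offset, long lag) {
            this.timestampMs = timestampMs;
            this.offset = offset;
            this.lag = lag;
        }
    }

    /** Applies Rules 0-5 in order; samples must be sorted oldest first. */
    static PartitionStatus evaluate(List<Sample> samples, long nowMs) {
        // Rule 0: no committed offsets, nothing to calculate.
        if (samples.isEmpty()) {
            return PartitionStatus.OK;
        }
        Sample first = samples.get(0);
        Sample last = samples.get(samples.size() - 1);

        // Rule 1: time since the last commit exceeds the span of the stored window.
        if (nowMs - last.timestampMs > last.timestampMs - first.timestampMs) {
            return PartitionStatus.STOP;
        }
        // Rule 2: the offset went backwards between two consecutive samples.
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i).offset < samples.get(i - 1).offset) {
                return PartitionStatus.REWIND;
            }
        }
        // Rule 3: the consumer caught up at least once during the window.
        for (Sample s : samples) {
            if (s.lag == 0) {
                return PartitionStatus.OK;
            }
        }
        // Rule 4: offsets never moved (they are non-decreasing here) and lag is non-zero.
        if (first.offset == last.offset) {
            return PartitionStatus.STALL;
        }
        // Rule 5: offsets are moving, but the lag grew at every step.
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i).lag <= samples.get(i - 1).lag) {
                return PartitionStatus.OK;
            }
        }
        return PartitionStatus.WARN;
    }

    /** Rolls all topic-partition statuses of one group up into a group status. */
    static GroupStatus rollUp(Collection<PartitionStatus> statuses) {
        GroupStatus result = GroupStatus.OK;
        for (PartitionStatus status : statuses) {
            switch (status) {
                case STOP:
                case REWIND:
                case STALL:
                    return GroupStatus.ERROR;
                case WARN:
                    result = GroupStatus.WARN;
                    break;
                default:
                    break;
            }
        }
        return result;
    }
}
```

Because Rule 3 returns before Rule 4 is checked, every lag value is already known to be non-zero by the time the stall check runs, so comparing the first and last offsets is enough there.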
Of course, some of the bugs you were seeing were fixed as well:
- Synchronized all SQLite DB activity, since SQLite allows only one write to the DB file at a time.
- This fixed all DB create/update/delete issues, at the expense of sometimes blocking a DB operation while another one is in progress. This is unavoidable with SQLite. The long-term fix is to replace SQLite with a more appropriate DB engine.
- Fixed an issue where LogEndOffset and Lag could display incorrect values.
- Added retry logic around building the ZkUtils object. This fixed the issue where we would not reconnect to ZooKeeper if the ZooKeeper service went down and was then restored.
- Updated some dependency versions.
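For anyone curious about the retry logic around building the ZkUtils object, a minimal sketch of the general pattern is shown here. It is not the actual code: `RetrySketch`, `retry`, and the fixed delay between attempts are illustrative assumptions, and the factory is a plain `Callable` so that the example does not depend on Kafka.

```java
import java.util.concurrent.Callable;

final class RetrySketch {

    /**
     * Calls the factory until it succeeds or maxAttempts is reached,
     * sleeping delayMs between failed attempts. Rethrows the last failure.
     */
    static <T> T retry(Callable<T> factory, int maxAttempts, long delayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                }
            }
        }
        throw lastFailure;
    }
}
```

In the real application the `Callable` would wrap whatever call constructs the ZkUtils object, so a temporary ZooKeeper outage leads to another attempt instead of a permanently missing connection.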