Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xor filters #44

Open
est31 opened this issue Jan 14, 2020 · 14 comments
Open

xor filters #44

est31 opened this issue Jan 14, 2020 · 14 comments

Comments

@est31
Copy link

est31 commented Jan 14, 2020

(originally filed at mozilla/rust-cascade#18 when this repo hasn't yet been opened)

It would be interesting to investigate usage of xor filters for crlite: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/

The authors claim they are both faster and create smaller filters than bloom filters. Probably this is true for crlite's use case as well?

@BruceMaggs
Copy link
Collaborator

BruceMaggs commented Jan 15, 2020 via email

@lemire
Copy link

lemire commented Jan 4, 2021

Note that xor filters can also be faster than Bloom filter. A Bloom filter requires k has values and k memory accesses whereas the xor filter uses a flat 3-wise model.

mozkeeler pushed a commit that referenced this issue Apr 19, 2021
mozkeeler pushed a commit that referenced this issue Apr 21, 2021
mozkeeler added a commit that referenced this issue Apr 22, 2021
* S3 Population code

* Initial python code

* Move to the latest github.com/google/certificate-transparency/go

* Add metadata storage for disk database

* Convert diskdatabase to write single-files, but it isn't handling threading correctly yet

* Complete single-threaded conversion

* Update dependencies for the move of Google's CT go lib to its own repo

* Fix indentation

* Fix #4, #5, by tolerating cert errors

* Fix #6

* Fix #2 to increase throughput

Thanks to @damz for this fix.

* Change away from iniflags

* Remove S3 support

* Add a least-recently-used cache to speed up concurrent disk access.

* Fix #7 - Make DiskDatabase threadsafe

* Move to glog logging

* Add a cache size config option

* Fix a cache timing error

* Put log URL instead of ID into pem files

* Nicer percentage complete info

* Refactor log sync code to one object per URL

* Rename the LogDownloader to LogSyncEngine

* Reorder

* Change progress bars to github.com/vbauerster/mpb

* Improve styling and accuracy of progress bars

* Smooth out EWMA

* Add jitter to polling delay

* Update imports

* Update to mpb 3.3.4 using contexts

* Fix #13 - Add a output_refresh_ms CLI option

* Fix pre-commit errors

* WIP - rework diskdatabase

* Rename storage adapters

DiskDatabase -> FilesystemDatabase
DiskBackend -> LocalDiskBackend

* Restructure the Backend interface to use explicit methods

Mostly breaks the LocalDiskBackend. I might fix it or remove it later.

* Add firestore config

* Move to int64 since uint64 is forbidden for firestore

* Refactor log state storage

* Issuer ID should be urlsafeb64(sha256(issuer.rawSPKI))

Fixes #29

Make the issuer value across the datastore be the SHA256 digest of the
issuer SPKI as a urlsafebase64-encoded string.

* Catch setting up both local disk and firestore

* Keep track of log entry IDs in the PEM headers

* Disable local disk store

* Abstract out the storageDB construction to an engine package

* Fix output speed flag to expect lowercase

* Use standard SIGINT for signal handling

* Convert to using redis cache for known certificates

* Add a savePeriod

* Add mechanism for metric gathering

* Add useful timing metrics

* Fix config processing. Use a normally-distributed jitter mean for runForever.

* Fix readme

* Add more diagnostic metrics

* emit startup stats

* Improve metrics management

* Intelligently choose whether to call Exists

* Skip CAs and provide more context when warning about missing issuers

* Fix #17 - DownloadCTRangeToChannel should not drop entries on contention

* Print log position on erroneous certs

* Fix off-by-one error in log prints

* Deepen the work queue size for more CPUs

* Add new StoreKnownCertificateList method

* Make CTConfig init an explicit step

Works around golang/go issue #31859

* Make a method to initialize telemetry

* Flush glog at closure

* Move context construction to util setup

* Fix issue #14 - Use mpb v4

* Fix issue #19 - Close progress bars when the downloader errors out

* Fix #22 - Back-off on HTTP 429

* Fix logging for retries

* Fix #12 - Make time estimates real

* Adjust backoff

* Metrics for 429 backoffs

* Make metrics more uniform

* Rename ct-fetch/main.go to ct-fetch.go

* Fix log message

* Fix stackdriver errors on URL-scheming

* Remove some metrics

* Move to google/certificate-transparency-go v1.1 and set a user agent

* Move to mpb v5

* More details for exiting with improper usage

* Fix #40: Add a health check (#41)

* Add the infrastructure to do health checks

* Use the update time instead of entry time for the health check

* Use a configuration value and more explicit error reporting

* Wire up the server address configuration to the health server

* Revert "Fix #40: Add a health check (#41)" (#44)

This reverts commit 2b0680e.

* Provide a higher-performance health check (#42) (#45)

This changes the health check from #41 from a database-check to a local
timestamp updated by all the CT downloader threads.

* Add a -nobars option to quiet log output

* update references to jcjones/ct-mapreduce, run go fmt

Co-authored-by: J.C. Jones <[email protected]>
Co-authored-by: J.C. Jones <[email protected]>
mozkeeler pushed a commit that referenced this issue Apr 26, 2021
mozkeeler pushed a commit that referenced this issue Apr 26, 2021
@bitwiseshiftleft
Copy link

Another possibility, similar to xor filters and specifically tuned for CRLs:

https://github.com/bitwiseshiftleft/frayed_cascade

These improve on CRLite's Bloom filters in two ways: by a factor of 1.44 by using "frayed ribbon filters" instead of Bloom filters, and by another factor of ~1.15 by using only a 2-level cascade instead of (log n)-level, so a typical full database would use about 40% less space than a CRLite one. The 1.44 factor is constant, while the 1.15 one mostly increases as more certs get revoked, so it's less than 1.15 for compressing daily updates, and much more if there's another Heartbleed. This filter is always within 11% of the Shannon entropy of the "revoked vs not" distribution of the certs, so to the extent that revocation is iid, it's almost the best you can do.

In principle, frayed cascade has slower construction and faster queries than CRLite, but due to being written in C instead of Python it's faster for construction as well. However, the implementation also definitely needs testing and fuzzing -- it's currently research grade. But I'm happy to help out with this if you think the space and speed improvements are worthwhile.

Also, other researchers have found other good ways to compress dictionaries, and xor filters are not quite state of the art anymore. Ribbon filters and binary fuse filters are both faster to construct than frayed ribbon filters, and can also be used in a 2-level cascade structure, but they achieve 5-8% worse compression than frayed cascade.

@thomasmueller
Copy link

thomasmueller commented Dec 14, 2021

@bitwiseshiftleft I'm afraid I don't fully understand this topic yet (CRLite's, frayed_cascade). In frayed_cascade, you wrote "The other ~10.9% comes from the encoding that handles non-uniform values." Is this because you encode ranges (0...x) in the form of a number of bits? If yes, then I wonder if non-power-of-two functions would help. E.g. xor filters support 10.666 bits but really any range is possible (of course there is a performance penalty). The same can be applied to e.g. cuckoo filters, and probably (not sure) ribbon filters.

@bitwiseshiftleft
Copy link

Hi Thomas,

Something like that, yes. So for CRLite, you encode the CRL as a Bloom filter B1 of revoked certs. Bloom filters have false positives, but you know the universe of issued certs (issued by a certain collection of CAs as of a certain time) from CT logs, so you know all the false positives. So you also can create a Bloom filter B2 of all false positives in B1, but that also has false positives, so you need a B3, B4 etc. The filters get smaller by a certain factor every time, so after O(log n) levels there are no more false positives. To get the best compression, the false positive rates of the filters are carefully tuned.

For Frayed Cascade, you can use a static function as an approximate set (i.e. a Bloom filter replacement that's more efficient by a factor of log 2). So you can encode the CRL as a frayed ribbon filter F1 representing an approximate set of revoked certs, again with an appropriately tuned false positive rate. (If approximately 50% of certs are revoked, then the optimal false positive probability is 100% so the first layer is skipped.) The second level is then a function that maps { false positives in F1 => 0; true positives in F1 => 1 }, again encoded using frayed ribbons. So you don't need a third level. This technique is more efficient than the log-n-level filter cascade, but it doesn't reach the Shannon entropy. The worst case is about 10.9% overhead vs Shannon, which occurs when about 20% of certs are revoked.

This technique can't be done with Bloom filters, but it can be done with xor filters, ribbon filters, binary fuse filters etc because they work as static functions, not just approximate set membership. However, if I understand correctly, balanced ribbon filters (the variant with the least space overhead) supports only approximate set membership and not static functions, so they can't be used for the second level of the cascade. Instead standard ribbon filters (or xor+ filters, or binary fuse filters...) would need to be used for that stage, with an overhead more like 8%. Also those libraries would need to be reworked slightly to support an arbitrary integer number of bits instead of just {8, 16, 10.666 etc}, but that's straightforward.

The space overhead can be improved from 10.9% to something like 9% (I don't remember the exact value) by supporting non-power-of-two functions, but it still doesn't reach the Shannon limit. To get the the best compression you'd need to support mod-n filters for all n, and then compress the filter on the wire using arithmetic coding. However, since the worst cases are for a small number of bits, you might get most of the way there by implementing just mod-3 filters (1.58 bits) and maybe mod-5 filters (2.32 bits). This could be done with a fixed encoding such as 19 bits / 12 symbols or 65 bits / 41 symbols, but it won't align as cleanly as the 10.666-bit filters.

But for this case, the improvement in worst-case outcomes is only about 2% and the cost is having to implement matrix arithmetic and encoding over multiple rings, so I didn't do it.

For the more general case of multiple possible outputs according to an arbitrary distribution (e.g. if for some reason you wanted to encode revocation reasons), the generalization also uses at most 10.9% overhead, though it may take more than 2 levels. Of course for some cases arbitrary rings would be a big improvement: eg if you have 3 equally likely outputs then using a GF(3) map (1.58 bits / key-value pair) would of course be much better than building it out of GF(2) maps and an encoding (uses 5/3 bits/kvp ~ 5% worse). But again you'd need to support those extra rings.

@thomasmueller
Copy link

Thanks a lot! A few random notes:

3-way binary fuse filters have an overhead of 12.5%, 4-way 7.5%, and e.g. 8-way would have less overhead but I don't know how low we could go - maybe it makes sense to test this.

The definitions, and numbers, are not completely clear to me... In the original CRLite paper if I read correctly there are 30 million valid, unexpired certificates, but 12.7 are revoked. So the universe is 30 million? If yes, then with a static function over the valid, unexpired certificates, that should be around 30 million bits (plus overhead, e.g. 7.5% for 4-way binary fuse filter, but we can go within a few percent with 8-way or so), that would be 3.9 MB or 3.7 MiB. CRLite is 10 MB.

Is the "universe" (certificates we need to care about) the valid, unexpired certificates, or does it include all certificates (around 184 million when the CRLite paper was written)?

CRLite uses Bloom filters. I assume they are "optimal" k, so the overhead is 44% per Bloom filter. But if e.g. k=1 is used, then the Bloom filter could be compressed over the wire (e.g. arithmetic coding) to a lower overhead, I believe. So I wonder how important is compression over the wire, vs. compression on disk / in memory (runtime).

CRLite uses log(n) Bloom filters. That sounds like BBHash, a minimal perfect hash algorithm (which needs about 3 bits / key). I wonder if we indeed create a minimal perfect hash function here. It sounds like. There would be other minimal perfect hash functions that need much less space, e.g. RecSplit. The theoretical lower bound to create a static function like this would be much higher than with xor / binary fuse / ribbon, as we would need 1.44 bits / key plus the array of function results (which could be compressed with arithmetic coding).

Frayed Cascade uses one or two levels. To me it sounds a simpler design. But I don't currently understand part of the overhead you talk about... e.g. I don't understand what you mean with "mod-n filters": is that a static function that can return a value between 0 (including) and n (excluding)? Don't you need just 0 (valid) vs. 1 (invalid)? I see that it's likely not 50% are 0 and 50% are 1, but it's different from a static function that can return other values than 0 and 1.

@bitwiseshiftleft
Copy link

Huh, yeah the original CRLite paper has very different numbers. The more recent blog post has 750k revoked from a universe of 100 million. In the paper case it's best to use a static function directly, since the entropy is about 0.98 bits. And indeed that's what frayed_cascade does on toy data of this size: it skips the first level and uses a single static function, taking up 1.001 bits per key = 3.75 MB = Shannon + 1.8%. A binary fuse filter would be only slightly larger. CRLite performs worse the more certs are revoked, so that's why it's so much larger in this case.

The universe is only unexpired, valid-except-possibly-revoked certificates from participating CAs -- I understand this to be CAs who publish CRLs in a timely fashion and participate in cert transparency.

It might be worth using a different CRLite (or improved version) filter per CA. They will have different user bases, different revocation policies etc, and thus different revocation rates. Any information about the revocation probability of a cert will help compression. Eg if one CA attracts shady users that get revoked a lot, or if it revokes certs a few days after you request a new one as a matter of policy (I think maybe Apple'd developer CA does this??) then it will have lots of revocations, whereas maybe Let's Encrypt issues zillions of certs but doesn't revoke them often because they're short-lived (but they don't publish a CRL so I have no idea).

CRLite uses optimal uncompressed Bloom filters. Using compressed Bloom filters would also work, but also if you're going to reimplement CRLite there's not point in using Bloom filters anymore, since these other filters work better ... except maybe if there's some way to rescue CRLite's incremental update scheme. Their update scheme scheme can't use less bandwidth than sending down a daily filter of "what certs were revoked today" and simply querying all daily filters since either a full download or since the cert was issued; and it probably won't use less disk space either if it's using compressed Bloom filters; but maybe it would have faster queries.

I agree that minimal perfect hashing is not ideal for this use case.

Frayed cascade is optimized for {0,1} output but it does support wider ranges, since it's designed with CRLite in mind but not exclusively for that case. For the {0,1} case, if the revocation fraction is less than 1/3, it has two levels: first an approximate set membership filter using k bits per revocation achieving a 2^-k false positive rate; and then a static function which tells you whether a positive result is a false positive or a true positive using 1 bit per (true or false) positive. The approximate set membership filter is just a static function which returns k bits, but always returns (eg) 0 for revoked certs, and should return an arbitrary result for non-revoked ones. Here the false positive rate 2^-k must be tuned for optimal compression, as something like k = floor(log2(nonrevoked / revoked)). (With revoked > 1/3, this gives k=0, i.e. the first level is skipped.)

If the false positive rate could be tuned to something other than a power of 2, you'd get slightly better compression in some cases. For this, a mod-n filter could be used to get a non-power-of-2 false positive rate, e.g. a static function that returns {0,1,2,3,4} but always returns 0 for revoked certs. Likewise xor filters with 10.666 bits per key are really just doing all the math mod 1625 = 2^10.666..., and packing 3 such coefficients into each 32-bit word. But since the worst cases for frayed_cascade are for rather larger p (eg p = 1/5 or 1/9, not 1/1625) you'd want mod-3 or mod-5 filters and not mod-1625.

But anyway this only saves a couple percent in the worst case, and it's kind of a pain, so I didn't do it.

@thomasmueller
Copy link

@bitwiseshiftleft thanks a lot for your time! I think I now fully understand the problem. What is needed (on a high level) is a static function for 100 million entries that returns 1 in 0.7% of the cases and 0 in 99.3% of the cases. You do this with an approximate membership filter, plus a static function to correct the false positives. Yes this all makes sense.

I ask myself, could this be done with just one static function over the whole 101 million entries, and then compression (arithmetic coding / ANS). I guess you have already thought about that, right, and came to the conclusion that it won't work :-)

@bitwiseshiftleft
Copy link

Oh also: I have no idea if something like 8-way binary fuse filters would work. It's unpublished, but as I understand it it's still based on hypergraph peeling, which depends on some hash buckets being almost-empty. At least for xor filters, the global optimum is actually 3-way, so I wouldn't be surprised if 4-way is the global optimum for BFFs. Could be worth asking the authors of BFF for insight.

@bitwiseshiftleft
Copy link

I ask myself, could this be done with just one static function over the whole 101 million entries, and then compression (arithmetic coding / ANS). I guess you have already thought about that, right, and came to the conclusion that it won't work :-)

Well, not so much a conclusion, but I didn't figure out how to do it. But I'm actually mainly a cryptographer and not a data structures researcher so maybe there is a more efficient way that I just don't know / couldn't figure out.

@lemire
Copy link

lemire commented Dec 15, 2021

@thomasmueller : I am game to investigate this issue further if you are (in 2022).

@bitwiseshiftleft
Copy link

Oh also: I have no idea if something like 8-way binary fuse filters would work. It's unpublished, but as I understand it it's still based on hypergraph peeling, which depends on some hash buckets being almost-empty. At least for xor filters, the global optimum is actually 3-way, so I wouldn't be surprised if 4-way is the global optimum for BFFs. Could be worth asking the authors of BFF for insight.

... wait a minute, you all are the authors of BFF. So you know better than I do on this. How do BFFs work anyway? Skimming the code it looks like if you choose the indices "just so" then the graph remains peelable even with a lower load?

@thomasmueller
Copy link

How do BFFs work anyway?

The theory behind this is found in https://arxiv.org/pdf/1907.04749.pdf "Dense Peelable Random Uniform Hypergraphs" - here you see that the overhead for 7-way would be about 3% over the lower bound.

BTW I tried "static function over the whole 101 million entries, and then compression"... with xor and fuse filters at least, compression didn't work: even if there is just one entry that has a different value than all the others, I couldn't compress the filter.

So what you do seems to be the best possible solutions, as far as I can tell. I could only imagine that a static function built from a perfect hash might work. (It wouldn't be a minimal perfect hash... it would be similar to a k-perfect hash function.) I don't think it would be any smaller than what you do however.

@bitwiseshiftleft
Copy link

That's super cool. It's simpler (to generate anyway) and performs better than frayed ribbons, so it's definitely a reasonable tradeoff to take a 3-7% space hit for many applications.

I wonder if it would significantly help with frayed ribbons to make them not wrap around, like fuse filters? It might improve performance but I guess the generation code wouldn't be any simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants