Use BoltDB to map "Names" to Files #8
As far as caching is concerned, it doesn't really care what the original name was. Why do you need to know the name at all? Seems like
That's only the start of the ugliness I do in there. It basically says: if the key can be base64-encoded and fit in a standard filename, then do that (so the key is decodable from the filename). Otherwise, hash the key and use the hash as the filename, then create another file named "${hash}.key" and write the key to it so it can be restored later. I'd much rather have a nice, clean key/value store for managing which keys map to which files.
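In rough Go, the naming scheme described above looks something like this (a sketch of the behavior, not the actual implementation; `maxNameLen` and the choice of SHA-256 are assumptions):

```go
package cachefs

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"os"
	"path/filepath"
)

// maxNameLen is an assumed cap; most filesystems limit a single
// filename to 255 bytes.
const maxNameLen = 255

// toFilename sketches the scheme described above: base64-encode the key
// when the result still fits in a filename (so the key can be decoded
// back from the name); otherwise hash it and persist the original key
// in a "<hash>.key" sidecar file so it can be restored on Reload.
func toFilename(dir, key string) (string, error) {
	enc := base64.URLEncoding.EncodeToString([]byte(key))
	if len(enc) <= maxNameLen {
		return enc, nil
	}
	hash := fmt.Sprintf("%x", sha256.Sum256([]byte(key)))
	// Sidecar file mapping hash -> original key.
	if err := os.WriteFile(filepath.Join(dir, hash+".key"), []byte(key), 0o600); err != nil {
		return "", err
	}
	return hash, nil
}
```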
Did this go anywhere? I am using this in a prototype to build multi-topology caches.
Submitting a sample version (very rough), but boltdb is incredibly slow on my Windows box (boltdb/bolt#516), so I don't know if it's worth including.
Badger will get you better perf. It's often used instead of bolt.
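For comparison, here's a minimal sketch of the key db on Badger (v2-style API; `PutName`/`GetName` are illustrative helper names, not part of any existing package):

```go
package keydb

import (
	badger "github.com/dgraph-io/badger/v2"
)

// PutName records which cache file a key maps to.
func PutName(db *badger.DB, key, filename string) error {
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte(key), []byte(filename))
	})
}

// GetName looks up the cache file for a key.
func GetName(db *badger.DB, key string) (string, error) {
	var name string
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte(key))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			name = string(val) // copy out; val is only valid inside the txn
			return nil
		})
	})
	return name, err
}
```

Opened with `badger.Open(badger.DefaultOptions("/path/to/keydb"))`, this would replace both the base64 names and the "${hash}.key" sidecar files.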
Badger seems like a neat idea to use instead of bolt for managing the key db (certainly open to pull requests to add new "FileSystems" etc., though I might consider adding a subpkg. or something for holding extra ones besides the basic set). I've certainly considered trying to build an FS which uses a distributed/cloud-based file system; the main issues I've run into are:
Actually, I'd like to build a new library that takes advantage of:
Hey @djherbis, I didn't get why you can't use an MD5 hash of the key as the filename and avoid having a key file. That would also help with a potential issue of having too many files in one folder. This way, every time you need to look up some key, you'd do something like this:
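(A rough sketch of the lookup being described; `hashedName` and `lookup` are illustrative helpers, and MD5-hex filenames are the assumption:)

```go
package cache

import (
	"crypto/md5"
	"fmt"
	"os"
	"path/filepath"
)

// hashedName derives the on-disk filename directly from the key,
// so no ".key" sidecar file is needed.
func hashedName(key string) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(key)))
}

// lookup hashes the key and stats the corresponding file.
func lookup(dir, key string) (path string, ok bool) {
	path = filepath.Join(dir, hashedName(key))
	if _, err := os.Stat(path); err != nil {
		return "", false
	}
	return path, true
}
```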
Thoughts?
The main issue was with Filesystem.Reload, which needs to produce the key from the name. That being said, I might be able to drop that for a Filesystem impl, depending on which features that degrades. What's the number-of-files-in-a-folder limitation you're seeing? I thought the limit on that was pretty high.
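(For context, the FileSystem contract under discussion is roughly shaped like this; paraphrased from memory, so treat the exact signatures as approximate rather than authoritative:)

```go
package fs

import "io"

// File is a stand-in for the cache's file handle.
type File interface {
	io.Writer
	Name() string
}

// FileSystem paraphrases the interface being discussed. Reload is the
// crux: it walks existing cache files and must report each one's
// original key, which is what forces the key to be recoverable from,
// or stored alongside, the filename.
type FileSystem interface {
	Reload(add func(key, name string)) error
	Create(name string) (File, error)
	Remove(name string) error
}
```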
I haven't experienced any slowdowns based on the number of files (yet), but this is a notorious thing to avoid when storing data in the filesystem. A good example is Git, which implements a scheme similar to what I proposed. For my use case (the music server) I can get as many as 2-10 times the number of music files in a library (one file for each image size requested by the clients). For my personal library of 40K songs, this would be 400K files. I have a user with 150K+ songs, which would yield 1.5 million files! I was looking into Reload: can't the Filesystem implementation just return the hash identifiers for the files, instead of a key/name pair? That way, you convert the key to the hash before calling the Filesystem methods (Create, Delete, etc.).
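(The Git-style fan-out proposed here would look roughly like this sketch; the two-level /xx/xx split matches the variant that comes up later in this thread:)

```go
package cache

import (
	"crypto/md5"
	"fmt"
	"path/filepath"
)

// spreadPath fans files out across subdirectories the way Git's object
// store does, so no single directory accumulates hundreds of thousands
// of entries: e.g. "<root>/ab/cd/abcdef...".
func spreadPath(root, key string) string {
	sum := fmt.Sprintf("%x", md5.Sum([]byte(key)))
	return filepath.Join(root, sum[0:2], sum[2:4], sum)
}
```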
I think the main problem is that things like expiration run over all the keys, and we don't know what all the keys are if we can't reload them. But that may not be all that important. Good to know about files-per-dir and performance; I'll look into another Filesystem implementation.
Hey @djherbis, I took a stab at creating a FileSystem with the structure I proposed above, storing only the files and not the keys: https://github.com/deluan/navidrome/blob/spread-fscache/core/cache/spread_fs.go
Without making any changes to the
@deluan I took a quick look. I think the problem is that your Reload method returns different keys than were passed to Create. This means when you call cache.Get with a key that should already exist (from Reload), the cache won't think it has it (the map will contain the hashed paths instead of the passed-in key) and you'll get a cache miss on things that should have been reloaded.
Now I'm thinking of a simpler solution, though: I think I could lift the key-hashing logic into the cache map. Then the cache map keys and the filenames would match, instead of having this silly conversion logic. I might run some tests to verify backwards compatibility, or just introduce a feature to control the hashing and a boring filesystem.
You are right, it was late at night and I overlooked the key mapping... I was trying to do two things with this filesystem:
Well, looks like item 2 is unattainable without changes to the
Anyway, I added a key mapping, similar to the original FS, to my SpreadFS; it seems to be working now. Would really appreciate it if you could take another look: https://github.com/deluan/navidrome/blob/spread-fscache/core/cache/spread_fs.go
Again, thanks for your awesome library!
From a very brief look at your updates, I think your FS should work (though testing is of course the best way to verify). Key mapping does already happen in the filesystem.
Not sure what you mean about not needing to load all keys into memory; we do need an internal map of [key] => [fileStream] at least for running streams, because that's how we're able to internally look up the stream.Stream which lets you read while it is being written. I think signature-wise Reload/Traverse would actually be the same, as both iterate the FS and report all (key, filename) pairs. I considered allowing a cache miss in memory to check the filesystem and load it 'on demand'; I suppose we'd need to be able to do a spot-check on that load to verify it hasn't expired, if that's even possible (depends on how the expiration system works).
I'm still gonna give my other idea a shot, moving the key production up a level (optionally), since that would solve "getting rid of the key file".
Thanks for the appreciation of this library :) I'm really amazed how much value people get from it, because looking back at it 4-5 years later I'd probably have built it differently from the ground up, but I'm still happy it's useful even if the API is fairly messy imho. Someday I'll get around to building a v2 or new package redesigned from the ground up with a cleaner API, and hopefully support for more of the cool 'cloud' use cases that are more relevant today.
I pushed some changes which should allow this kind of control; there's a (dumb) example in the tests. I'll expand a bit here:
I think it would take a little more messing with EncodeKey/DecodeKey/SetKeyMapper to get a set of functions that will work correctly going forward but can also load existing cached files. But I think this is enough for a Saturday :)
^ Note: I think your new FS is just a special case of this. If you want to remove the key files, you can do exactly this, but set your KeyMapper function to the one that produces the /xx/xx/xxxxxxxx hashed filenames.
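Concretely, wiring that up might look like this (a sketch only: I'm assuming SetKeyMapper takes a func(string) string, and the FileSystem has to create the intermediate directories, as SpreadFS does; check the repo for the actual signature):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"path/filepath"
	"time"

	"github.com/djherbis/fscache"
)

func main() {
	c, err := fscache.New("./cache", 0700, time.Hour)
	if err != nil {
		panic(err)
	}
	// Map every key to its /xx/xx/<hash> spread path, so the cache-map
	// keys and the on-disk names coincide and no ".key" files are needed.
	c.SetKeyMapper(func(key string) string {
		sum := fmt.Sprintf("%x", md5.Sum([]byte(key)))
		return filepath.Join(sum[0:2], sum[2:4], sum)
	})
	_ = c
}
```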
The default FileSystem has to do some ugly things to map the "key/name" into a filename. I should just use a Key/Value store like BoltDB.
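For reference, a minimal sketch of the proposed Bolt-backed name store (`PutName`/`GetName` are illustrative names, mirroring the Badger sketch above):

```go
package keydb

import (
	"github.com/boltdb/bolt"
)

var bucket = []byte("names")

// PutName records which cache file a key ("name") maps to.
func PutName(db *bolt.DB, key, filename string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put([]byte(key), []byte(filename))
	})
}

// GetName looks up the cache file for a key; ok is false if absent.
func GetName(db *bolt.DB, key string) (name string, ok bool, err error) {
	err = db.View(func(tx *bolt.Tx) error {
		if b := tx.Bucket(bucket); b != nil {
			if v := b.Get([]byte(key)); v != nil {
				name, ok = string(v), true // copy out; v is only valid inside the txn
			}
		}
		return nil
	})
	return name, ok, err
}
```

Opened with something like `bolt.Open(filepath.Join(dir, "names.db"), 0600, nil)`, this would replace the base64/hash filename juggling entirely.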