Improving `url_to_filename` #20
I like @kgrz's idea, as adapted by you, Dan, of not stashing everything in a flat directory, e.g. `stashes/#{clean_index_url}/#{clean_url}/`. I do think readable filenames are important. Nothing wrong with your suggestion of letting people choose between `CGI.escape` and MD5 in the options hash, though I'm still not sure there's ever a benefit to MD5. (It might be limited to the downloader itself and overridden in Upton proper?)
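A minimal sketch of that options-hash idea, for illustration only; the `url_to_filename` method name and the `:filename_strategy` key are assumptions, not Upton's actual API:

```ruby
require 'cgi'
require 'digest/md5'

# Hypothetical sketch: let the caller pick a filename strategy via
# an options hash. Names here are illustrative, not Upton's real API.
def url_to_filename(url, options = {})
  case options.fetch(:filename_strategy, :escape)
  when :escape
    CGI.escape(url)             # readable and reversible
  when :md5
    Digest::MD5.hexdigest(url)  # fixed-length, but one-way
  end
end

url_to_filename("http://example.com/page?id=1")
# => "http%3A%2F%2Fexample.com%2Fpage%3Fid%3D1"
url_to_filename("http://example.com/page?id=1", filename_strategy: :md5)
# => a 32-character hex digest
```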
Come to think of it, though, how useful is it for the user to recognize the stashed filenames? If the scrape goes well, they shouldn't ever have to worry about the stash. When a scrape fails, a user may want to find the stashed file to debug it, but in my opinion that's kind of an anti-pattern: finding the actual file (i.e. by looking through the OS Finder) will vary in difficulty depending on the number of pages scraped and the ugliness of the URL. They should be used to debugging against the live page at the remote URL.

Moreover, users probably shouldn't have to worry about the existence of the stash at all, as it's mainly a system to speed up performance (through caching). If users intend to save these pages for exploratory purposes (beyond scraping the data points), then they should be using a site-spidering library like http://anemone.rubyforge.org/

MD5 may not be the ultimate solution (for example, you might find it better to have Upton stash in SQLite by default), but it may be worthwhile in the long run not to treat a readable stash as an actual feature: it's not useful in the majority of cases but is pretty hard to perfect and maintain.
In most cases, the live site is a better thing to debug against, for sure. But there are a few cases where what Upton gets isn't easily replicable in the browser, e.g. debugging logins, or wacky sites that behave differently based on POST data or headers. It's useful in these cases to have access to the pages themselves. One middle ground would be to output a CSV mapping URLs to MD5 hashes... but that might be an unnecessary complication.
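A sketch of what that CSV middle ground might look like; the index path and the `.html` extension are assumptions, and the `stash/` directory is assumed to exist:

```ruby
require 'csv'
require 'digest/md5'

# Hypothetical sketch: alongside the stash, append a row mapping each
# scraped URL to its MD5-based stash filename, so hashed names stay
# traceable back to their URLs.
def record_stash(url, index_path = 'stash/index.csv')
  CSV.open(index_path, 'a') do |csv|
    csv << [url, "#{Digest::MD5.hexdigest(url)}.html"]
  end
end

record_stash('http://example.com/articles/42')
# stash/index.csv gains a row: <url>,<md5>.html
```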
Yeah, one thing that just came to mind: what has been useful to me is to have, if not readable cache-file names, reversible cache names (the solution just ends up being both). The use case is that I've hammered a site but am not sure whether I got all the pages I wanted, so I write myself a little script that tells me how many (or which) pages I've downloaded from a given domain, or for a particular scrape. This is pretty much impossible if the URLs are hashed with a one-way function. But again, that's an edge case (i.e. you're not expecting the average Upton user to be writing their own scraping-status-report tool).

Ultimately, I think the happy medium will be to abstract the caching system from Upton's core spidering/scraping functionality so that you can drop in SQLite, which would let you handle all the edge and collision cases while providing a better interface for users who do want to query which pages they've downloaded (and when). And for users who don't want to go through the hassle of installing SQLite, you still have your default save to the /stash folder. I think it's safe to assume that users who don't/can't handle the hassle of installing SQLite will have relatively basic scraping needs. And as they scrape bigger things, well, they'll just have to make the jump in complexity themselves :)
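A hypothetical version of that little status-report script, assuming the stash files were named with `CGI.escape` (i.e. reversibly) and carry an `.html` extension; both are assumptions for illustration:

```ruby
require 'cgi'
require 'uri'

# Hypothetical scraping-status-report: reverse each stash filename back
# into a URL and count stashed pages per host. Only works because the
# filenames are reversible; MD5-hashed names would make this impossible.
counts = Hash.new(0)
Dir.glob('stash/*.html').each do |path|
  url  = CGI.unescape(File.basename(path, '.html'))
  host = URI.parse(url).host rescue next  # skip anything unparseable
  counts[host] += 1
end
counts.each { |host, n| puts "#{host}: #{n} page(s)" }
```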
The readable-URL method I added in #15, which I just merged, uses the first 233 characters of the gsubbed URL, then a timestamp (it's off by default, in favor of MD5 filenames). This should solve the collision issue while retaining readability. Obviously still open to better suggestions and the SQLite solution. One good call @kgrz made was, as you suggested, pulling the caching stuff into its own file.
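A rough sketch of that scheme, not the actual code merged in #15; the method name is hypothetical:

```ruby
# Sketch of the readable-URL naming described above: gsub away
# non-word characters, truncate to 233 characters, then append a
# timestamp to avoid collisions between truncated names.
def readable_filename(url)
  safe = url.gsub(/[^A-Za-z0-9]/, '_')
  "#{safe[0, 233]}_#{Time.now.to_i}"
end
```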
Yep. Ignoring the reversibility of the MD5-ed filenames was my mistake; it never crossed my mind. As @dannguyen suggested, an SQLite or similar interface should be present. I will see if I can make that easy and include it in the gem. That way, Upton need not be polluted with all that logic.
Hmm... I should have seen that coming, but timestamps don't work, since they're not reversible. If I'm looking for a cached file, I don't know what time it was cached. (Without the as-yet-non-existent database.)
A tangent to this discussion: #15
The gsubbing of all non-word characters may cause more collisions than desired in some edge cases, and more importantly, makes it difficult/impossible to reverse the filename and get the original URL. Maybe it's possible to use `CGI.escape` to convert to a proper file name and then `CGI.unescape` to reverse the change?
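For instance, the escape/unescape round trip might look like this (the URL is just an example):

```ruby
require 'cgi'

url = 'http://example.com/articles?id=42&page=2'
filename = CGI.escape(url)
# => "http%3A%2F%2Fexample.com%2Farticles%3Fid%3D42%26page%3D2"

# Reversing the escape recovers the original URL exactly:
CGI.unescape(filename) == url
# => true
```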
**Second issue: extremely long filenames**
I've run into this before but am not going to take the time to reproduce it: operating systems may restrict the length of a filename to something shorter than the saved URL. This could be mitigated by saving the file to subdirectories according to its URL path, so that each path segment becomes a directory and no single filename has to hold the entire URL.
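A hypothetical illustration of that subdirectory mapping; the `stash_path` name is an assumption, and query-string handling is omitted for brevity:

```ruby
require 'cgi'
require 'uri'

# Sketch: escape each URL path segment separately and join them into a
# nested stash path, so filename-length limits apply per segment only.
def stash_path(url)
  uri = URI.parse(url)
  segments = uri.path.split('/').reject(&:empty?).map { |s| CGI.escape(s) }
  File.join('stash', CGI.escape(uri.host), *segments)
end

stash_path('http://example.com/2013/07/some-very-long-article-slug')
# => "stash/example.com/2013/07/some-very-long-article-slug"
```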
In some extreme scraping cases, having tens of thousands of files in the stash directory might cause some performance issues. I'm not sure how slow it would be to look up a cached file via a recursive subdirectory search (but that's probably premature optimization).