Downloading and caching extracted #15
Conversation
Extracting the downloading and caching logic into its own class. This cleans up the main class quite a bit. Some of the explicit encoding options are removed. Modified: spec/spec_helper.rb. The repo uses WebMock, which makes stubbing HTTP connections easier.
@jeremybmerrill Before merging this commit, please check if you have any encoding-related test cases that fail. If yes, I should add them to the
Oh, this is awesome! Thanks! I want to take a closer look, but I expect to merge this soon. I stupidly forgot to note where the encoding stuff was broken, but I'll try to find it and write a test with it. I definitely didn't put in that encoding stuff for no reason... :) I may put back in those accept headers, since some sites don't work right without them; I'll put them in some Downloader class variable, though. I was able to find which site I was scraping that gave me this problem and I'll try to reduce it to a minimal test case.
Using tmpdir is an interesting choice... I suppose it makes sense as a default, since debugging files won't need to persist after a reboot (in most cases) and the default can be overridden if you want to hold onto the stashed files. Good idea! I'm curious, before I merge, why did you want to write cached files with md5 filenames? I'd love to hear your thought process. (Worried about collisions with the old method of removing special chars from the URI?) I think having human-readable filenames is a benefit, since I sometimes look at the stashed files. I haven't been able to come up with a test case for either the accept headers or the encoding stuff. If it comes up, I'll re-add it.
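To make the tradeoff being discussed concrete, here is a minimal sketch of MD5-keyed stashing under the system tmpdir. The method names (stash_path, cache_fetch) are hypothetical, for illustration only, and are not Upton's actual API:

```ruby
require 'digest'
require 'tmpdir'
require 'fileutils'

# Hypothetical sketch: stash files under Dir.tmpdir, keyed by
# the MD5 of the URL. Not Upton's real API.
def stash_path(url, stash_dir = File.join(Dir.tmpdir, "upton_stash"))
  FileUtils.mkdir_p(stash_dir)
  File.join(stash_dir, Digest::MD5.hexdigest(url) + ".html")
end

def cache_fetch(url)
  path = stash_path(url)
  return File.read(path) if File.exist?(path)  # cache hit: skip the download
  html = yield                                 # caller does the actual download
  File.write(path, html)                       # stash it for the next run
  html
end

# The same URL always maps to the same 32-hex-character filename,
# regardless of what special or UTF-8 characters the URL contains.
puts stash_path("http://example.com/articles?page=1")
```

The upside is that collisions and filesystem-unsafe characters are a non-issue; the downside, as noted above, is that you can no longer tell which stashed file corresponds to which page by looking at the directory.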
I think I wrote the first implementation of the caching part as one where the downloaded files are kept in separate directories named after the URL, in turn grouped under the main stash directory (we could add paginated pages under the same URL hostname or parent page to the same directory), and so there would be too many parameters to pass around. In retrospect, however, I suppose this is a fair choice for avoiding collisions, as you mentioned, and also for dealing with UTF-8 characters in the URL that might be difficult to escape.
Maybe there should be an option in the Upton constructor to allow MD5 hashing? I do think that use case ends up being inferior when it comes to stashing on the file system. However, if Upton is extended to allow storage in a database, that kind of hashing would be useful.
Good call, Dan. I'll cut out the MD5 stuff and then merge (when I get a chance, which may not be today), and we can continue this discussion on #20 to eventually improve url_to_filename. I'll note that I really like the idea of putting stashed pages in subfolders (though it makes sense for that to live in Upton proper, not the downloader).
I didn't get this. We are not MD5-ing the whole file, just the name of the file. That straight away makes it easy to deal with the edge cases mentioned in #20.
I think MD5 makes perfectly good sense for the downloader. In Upton proper, however, I think it comes with a very big downside: namely, that you can't easily figure out where a specific stashed page is in order to examine the stashed HTML (for whatever reason). I don't think CGI.escape and the stuff discussed in #20 are hard enough to justify not doing it.
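The two filename strategies being weighed here can be compared side by side. This is a sketch of the tradeoff only; the example URL is made up:

```ruby
require 'cgi'
require 'digest'

url = "http://example.com/articles?page=2&q=café"

# Human-readable: you can eyeball which stashed file is which,
# but long URLs and odd characters need care (the #20 discussion).
readable = CGI.escape(url)

# Opaque, but collision-resistant and always filesystem-safe.
hashed = Digest::MD5.hexdigest(url)

puts readable  # percent-escaped URL, still recognizable
puts hashed    # 32 hex characters, reveals nothing about the URL
```

CGI.escape handles the UTF-8 characters (they become percent-escaped bytes), but very long URLs can still exceed filesystem filename limits, which the fixed-length hash sidesteps.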
@kgrz Sorry, what I meant was that if it is important for the user to manage/read through the stash, then MD5 is not a good solution... however, in the last 5 minutes I've changed my mind on that :) (tl;dr: for what seems like the typical Upton use case, in which a user has implemented a successful scraper and data-extraction routine and just needs to run it from time to time, the stash should be of no interest to the user, so why make it readable in the first place?) If the choice is between MD5 and the somewhat difficult work of keeping filenames readable (while preventing as many collisions as possible), MD5 seems like the more reasonable tradeoff at this point.
What do you all think (keeping in mind the discussion on #20) about keeping the md5 filename method, and keeping the gsub method in hopes of eventually implementing a
(I made that merge at home, but forgot to push last night; will push to master tonight.) PS: Good call on the file tree structure changes, i.e.
I haven't taken a close look at the new downloading/caching code yet; I may try to look at it this weekend. My first step, if it isn't already, would be to create an abstract UptonStorage class that interfaces with the file system or db (or whatever)... for some reason, though, that sounds like a "let's make an ORM, but for files!" project, which seems weird :) Then the next step would be to figure out a basic schema for each saved page... probably something like a table with columns for URL and hostname (for quicker indexing, even if redundant), joined to another table that contains just blobs for content.
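The abstract-storage idea floated above could be sketched as follows. Every class name here (UptonStorage, FileStorage) is hypothetical; nothing like this exists in Upton itself:

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical interface for the storage abstraction discussed above.
# A db-backed subclass could implement the same two methods.
class UptonStorage
  def read(url)
    raise NotImplementedError
  end

  def write(url, content)
    raise NotImplementedError
  end
end

# One concrete backend: flat files keyed by an MD5 of the URL.
class FileStorage < UptonStorage
  def initialize(dir = File.join(Dir.tmpdir, "upton_stash"))
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  def read(url)
    path = path_for(url)
    File.exist?(path) ? File.read(path) : nil   # nil signals a cache miss
  end

  def write(url, content)
    File.write(path_for(url), content)
  end

  private

  def path_for(url)
    File.join(@dir, Digest::MD5.hexdigest(url))
  end
end

store = FileStorage.new
store.write("http://example.com/", "<html></html>")
puts store.read("http://example.com/")
```

The appeal of this shape is that the scraper only ever calls read and write, so swapping the file system for a database (with the URL/hostname/blob schema suggested above) wouldn't touch the scraping logic.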
Most of the logic inside #get_page is now extracted to the Upton::Downloader class. I did remove most of the encoding-specific parts. If they are needed, I will add them in.