
Downloading and caching extracted #15

Merged · 2 commits · Aug 16, 2013

Conversation

@kgrz (Contributor) commented Aug 9, 2013

Most of the logic inside #get_page is now extracted to the Upton::Downloader class. I did remove most of the encoding-specific parts; if they are needed, I will add them back in.
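For readers skimming the diff, here is a minimal sketch of what such an extraction might look like. The class shape, option names, and the Net::HTTP call are my guesses at the approach, not the PR's exact code:

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'
require 'net/http'
require 'uri'

# Hypothetical outline: #get consults a cache directory keyed by an
# MD5 digest of the URL before falling back to a network fetch.
class Downloader
  def initialize(url, cache: true, cache_dir: File.join(Dir.tmpdir, "upton"))
    @url = url
    @cache = cache
    @cache_dir = cache_dir
  end

  def get
    @cache ? from_cache : from_resource
  end

  private

  def cache_file
    File.join(@cache_dir, Digest::MD5.hexdigest(@url))
  end

  def from_cache
    # Serve the stashed copy if present; otherwise fetch and stash it.
    return File.read(cache_file) if File.exist?(cache_file)
    body = from_resource
    FileUtils.mkdir_p(@cache_dir)
    File.write(cache_file, body)
    body
  end

  def from_resource
    Net::HTTP.get(URI(@url))
  end
end
```

In a spec, the network call can be avoided by overriding `from_resource` (or by stubbing HTTP with webmock, as this PR's spec_helper does).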

kgrz added 2 commits August 9, 2013 20:20
Extracting the downloading and caching logic to its own class. This
cleans up the main class quite a bit. Some of the explicit encoding
options are removed.

	modified:   spec/spec_helper.rb
The repo uses webmock, which makes stubbing HTTP connections easier
@kgrz (Contributor, Author) commented Aug 9, 2013

@jeremybmerrill Before merging this commit, please check if you have any encoding-related test cases that fail. If so, I should add them to the Downloader class and the corresponding spec.

@jeremybmerrill (Contributor):

Oh, this is awesome! Thanks! I want to take a closer look, but I expect to merge this soon.

I stupidly forgot to note where the encoding stuff was broken, but I'll try to find it and write a test with it. I definitely didn't put in that encoding stuff for no reason... :)

I may put back in those accept headers, since some sites don't work right without them; I'll put them in some Downloader class variable though. I was able to find which site I was scraping that gave me this problem and I'll try to reduce it to a minimal test case.

@jeremybmerrill (Contributor):

Using tmpdir is an interesting choice... I suppose it makes sense as a default, since debugging files won't need to persist after a reboot (in most cases) and the default can be overridden if you want to hold onto the stashed files. Good idea!

I'm curious, before I merge, why did you want to write cached files with md5 filenames? I'd love to hear your thought process. (Worried about collisions with old method of removing special chars from URI?) I think having human-readable filenames is a benefit, since I sometimes look at the stashed files.
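For concreteness, here is a toy illustration of the collision worry (the URLs are made up): stripping special characters can map two different URLs to the same filename, while MD5 digests stay distinct.

```ruby
require 'digest'

a = "http://example.com/page-1"
b = "http://example.com/page1"

# Removing non-alphanumerics collapses both URLs to the same name.
a.gsub(/[^A-Za-z0-9]/, "") == b.gsub(/[^A-Za-z0-9]/, "")  # => true (collision)

# MD5 keeps them distinct, at the cost of readability.
Digest::MD5.hexdigest(a) == Digest::MD5.hexdigest(b)      # => false
```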

I haven't been able to come up with a test case for either the accept headers or the encoding stuff. If it comes up, I'll re-add it.

@kgrz (Contributor, Author) commented Aug 14, 2013

I'm curious, before I merge, why did you want to write cached files with md5 filenames? I'd love to hear your thought process. (Worried about collisions with old method of removing special chars from URI?)

I think I wrote the first implementation of the caching part so that downloaded files were kept in separate directories named after the URL, grouped in turn under the main stash directory (we could add paginated pages under the same URL hostname or parent page to the same directory). Since there would be too many parameters to gsub out of the URL, I just hashed the name. When I extracted the functionality into a gem, I thought it made more sense to generalize the name and make sure a unique file is saved per URL.

In retrospect, however, I suppose this is a fair choice for avoiding collisions, as you mentioned, and also for dealing with UTF-8 characters in the URL that might be difficult to gsub with a regex.

@dannguyen (Contributor):

Maybe there should be an option in the Upton constructor to allow MD5 hashing? I do think that that use case ends up being inferior when it comes to stashing on the file system. However, if Upton is extended to allow storage into a database, that kind of hashing would be useful.

@jeremybmerrill (Contributor):

Good call, Dan.

I'll cut out the md5 stuff and then merge (when I get a chance, which may not be today), and we can continue this discussion on #20 to eventually improve url_to_filename.

I'll note that I really like the idea of putting stashed pages in subfolders (though it makes sense for that to live in Upton proper, not the downloader).

@kgrz (Contributor, Author) commented Aug 14, 2013

I do think that that use case ends up being inferior when it comes to stashing on the file system.

I didn't get this. We are not MD5-ing the whole file, just its name. That straight away makes it easy to deal with the edge cases mentioned in #20.

@jeremybmerrill (Contributor):

I think MD5 makes perfectly good sense for the downloader. In Upton proper, however, I think it comes with a very big downside: namely, that you can't easily figure out where a specific stashed page is to examine the stashed HTML (for whatever reason). I don't think CGI.escape and the stuff discussed in #20 are hard enough to justify not doing it.
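For comparison, a CGI.escape-based filename stays both readable and reversible, unlike an MD5 digest (the URL here is made up):

```ruby
require 'cgi'

url = "http://example.com/articles?id=42&lang=en"

# Escaped form is a valid filename and still legible:
fname = CGI.escape(url)
# => "http%3A%2F%2Fexample.com%2Farticles%3Fid%3D42%26lang%3Den"

# And it round-trips back to the original URL:
CGI.unescape(fname)  # => "http://example.com/articles?id=42&lang=en"
```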

@dannguyen (Contributor):

@kgrz Sorry, what I meant was that if it is important for the user to manage/read through the stash, then MD5 is not a good solution...however, in the last 5 minutes, I've since changed my mind on that :)

#20 (comment)

(tl;dr for what seems like the typical Upton use-case, in which a user has implemented a successful scraper and data extraction routine and just needs to run it from time to time, the stash should be of no interest to the user, so why make it readable in the first place?)

If the choice is between MD5 and doing the somewhat difficult work of keeping filenames readable (while preventing as many collisions as possible), MD5 seems like the more reasonable tradeoff at this point.

@jeremybmerrill (Contributor):

What do you all think (keeping in mind the discussion on #20) about keeping the md5 filename method, keeping the gsub method in hopes of eventually implementing a CGI.escape-type method, and, if @dannguyen writes a PR, adding a SQLite method? Which one is the default isn't a big deal to me; md5 solves most of the problems in easy cases, but having the other options easily available is important.

@jeremybmerrill (Contributor):

(I made that merge at home, but forgot to push last night; will push to master tonight)

PS: Good call on the file tree structure changes, e.g. lib/upton/utils.rb etc.

@dannguyen (Contributor):

I haven't taken a close look at the new downloading/caching code yet; I may try to look at it this weekend. My first step, if it isn't already there, would be to create an abstract UptonStorage class that interfaces with the file system or a db (or whatever)... for some reason, though, that sounds like a "let's make an ORM, but for files!" project, which seems weird :)

Then the next step would be to figure out a basic schema for each saved page... probably something like a table with columns for URL and hostname (for quicker indexing, even if redundant), joined to another table that contains just blobs for content.
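The two-table layout described above might look something like this; all table and column names are illustrative guesses, not an agreed design:

```sql
-- Hypothetical schema for a SQLite-backed stash.
CREATE TABLE pages (
  id         INTEGER PRIMARY KEY,
  url        TEXT NOT NULL UNIQUE,
  hostname   TEXT NOT NULL,   -- redundant with url, but cheap to index
  content_id INTEGER REFERENCES page_contents(id)
);
CREATE INDEX idx_pages_hostname ON pages(hostname);

CREATE TABLE page_contents (
  id      INTEGER PRIMARY KEY,
  content BLOB                -- raw fetched HTML
);
```

Splitting the blobs into their own table keeps scans over URL/hostname metadata from dragging page bodies through the page cache.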

@jeremybmerrill merged commit 57703cc into propublica:master on Aug 16, 2013