Refactor API #5
@adelevie, the immediate fix is to put the url in a single-element array, since in Scraper.new(), Upton expects a list of pages to scrape. |
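As a concrete illustration, a minimal sketch of that workaround, assuming scrape_to_csv takes an output filename and a block over each page's HTML as in the examples later in this thread (the filename is a placeholder):

require 'upton'

# Wrapping the single URL in an array gives Scraper.new the list of pages it expects.
u = Upton::Scraper.new(["http://www.fcc.gov/encyclopedia/fcc-and-courts"])
u.scrape_to_csv("output.csv") do |html|
  # parse each page's HTML here and return rows for the CSV
  []
end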
@jeremybmerrill I did some refactoring after facing the same issue: kgrz@4db4507. Two changes:
u = Upton::Scraper.new("www.google.com", "http://www.fcc.gov/encyclopedia/fcc-and-courts")
u.scrape_to_csv(..)
i = Upton::Utils::Index.new("http://website.com/list_of_stories.html", "a#article-link", :css)
u = Upton::Scraper.new("www.google.com", i, "http://www.fcc.gov/encyclopedia/fcc-and-courts")

(Some things might break as I haven't written any tests. This is just to give an idea :) ) |
This is a smart idea, @kgrz. Lemme think for a bit on the issue, but I like your idea of an |
No hurry :) Two more reasons to consider: It would also create an easy channel to expand the Index class's |
Oh, I'm already convinced! Great idea, seriously. How does this sound as a game plan? Would love your feedback and, if you're interested, your contributions.
I created the |
Yep, that looks good. How about a little renaming though:
Example:

Upton::Uri.new("<url>") do |i|
  i.login "username", "password"
end

# But what if the login fields are similar to:
# <input type="text" name="uid"></input>
# <input type="password" name="pass"></input>
#
Upton::Uri.new("<uri>") do |i|
  i.login :uid => "username", :pass => "password"
end

# Index page
Upton::Uri.new("<uri>") do |i|
  i.selector "//table/tr"
  i.selector_type :xpath
end

This way, if the page index needs to be fetched after logging in, it can be done. For some deeper idea of the way this would work, the |
(That said, I agree that this is too different from where the library is right now :| ) |
@jeremybmerrill Here is an example of what I was talking about: https://gist.github.com/kgrz/6095520. In the call |
Or, on second thought, something like this won't change the API in a big way. Instead of:

a = Upton::Uri.new(<uri>, {:xpath => "//table/tr"})

a = Upton::Uri.new(<uri>) do
  def get_index
    # Jus' overriddin'
  end
end

a = Upton::Uri.new(<uri>, {:xpath => "//table/tr"}) do
  def get_instance
    # Jus' overriddin'
  end
end |
I will take a look at this tomorrow and respond! Was out of the country and without internet access. |
I'd suggest skipping all of what I said above altogether for now. These would be the best two things to do for now:
|
I think the better solution is to be agnostic about whether the page is an "index" or a data-page and let the user define via blocks how the scraper should act on a given type of page. Right now, it seems like it'd be awkward to scrape a nested index. Or what if an index page contains some data (other than links to end-point-pages) that the user wants to include in a CSV file (for example, the :category parameter of an index of Ebay searches). The Spidey gem is an example implementation of this. The user has to define Scraper behavior for each anticipated page type. For 95% of the use cases, that's just an index and each individual listed page, so either way, it's the same amount of work for that kind of user, but it allows the Upton framework to be a lot simpler internally, while at the same time allowing flexibility for power-use-cases. I'm not sure how the implementation would be much different than what Spidey does, but the API would definitely have to be re-worked quite a bit. |
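Purely as a sketch of that "agnostic page type" idea (this is not Spidey's or Upton's actual API; the handler structure, selectors, and URLs are all hypothetical):

require 'open-uri'
require 'nokogiri'

# One generic crawl loop; per-page-type behavior is supplied by the user as blocks.
handlers = {
  index: lambda do |page, queue|
    page.css("a#article-link").each { |a| queue << [a["href"], :article] }
    # an index page can also contribute data rows directly, e.g. a category column
    page.css("td.category").map { |td| { category: td.text } }
  end,
  article: lambda do |page, _queue|
    [{ headline: page.css("h1").text, body: page.css("div.story").text }]
  end
}

queue = [["http://website.com/list_of_stories.html", :index]]
rows  = []
until queue.empty?
  url, type = queue.shift
  page = Nokogiri::HTML(URI.open(url).read)   # no caching/stashing, assumes absolute links
  rows.concat(handlers[type].call(page, queue) || [])
end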
I haven't had a chance to check out Spidey in-depth yet, but it's on my todo list. I'll try to come up with a way to allow greater flexibility for the examples you mention. I want to strike a balance between enabling power-user cases and enabling super-simple scraping in common cases. @kgrz's idea of breaking out stashing into another gem (on which Upton'll depend) will help towards enabling power users to use the useful parts of Upton without being constrained by the choices made to enable simple cases. |
I'm thinking, too, about changing the API to be organized like ActiveRecord, where methods can be chained onto each other. I'm imagining something like (in principle):
Each Upton object would have access to methods that can be chained onto each other. That could be used for simple cases of scraping a single page:
Or for nested index cases too.
I'd love to hear y'all's thoughts. I'm (perhaps obviously) totally just fleshing this out right now. |
@dannguyen The one reason why I would separate the
@jeremybmerrill That looks neat. But aren't |
@kgrz I think it's possible to keep the concept of #index and #instance type pages in the public facing API, as conveniences. But under the hood, there should be no distinction. An HTML page is an HTML page, and the parsing is all the same, whether it's to find links or to extract text values from other kinds of nodes. The problem is that as is, the initialization of an Upton instance is unnecessarily confusing, and results in a convoluted set of flows between index and instance parsing. |
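A tiny sketch of that point: a single parsing routine can serve both link-finding and value extraction, with index and instance kept only as thin conveniences on top. The method names below are hypothetical, not Upton's internals.

require 'nokogiri'

# One generic routine: apply a selector to an HTML page, return the matching nodes.
def parse_page(html, selector, selector_method = :css)
  doc = Nokogiri::HTML(html)
  selector_method == :xpath ? doc.xpath(selector) : doc.css(selector)
end

# "Index" and "instance" parsing become thin wrappers over the same call.
def index_links(html, selector)
  parse_page(html, selector).map { |a| a["href"] }
end

def instance_values(html, selector)
  parse_page(html, selector).map(&:text)
end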
In any case, I think rewriting and reorganizing the tests, as per #6, should be the proper course of action before radically refactoring the API. If it seems that the |
@kgrz, I agree with @dannguyen that keeping index and instance methods is important, even if they (eventually) do mostly the same thing. A main goal of the Upton project is to make scraping in most cases easy. Given the choice between domain coverage and ease-of-use, ease-of-use ought to be the priority. Obviously, we should aim to maximize domain coverage given the ease-of-use constraint. Therefore, I am totally open to ways of creating custom scraping techniques beyond the built-in ones.

My idea on ActiveRecord-style querying would be to have a (yes, artificial) distinction between "things to scrape" (i.e. links) and "things having been scraped" (i.e. data). There obviously would be a mapping relationship between these.

But yes, I agree that better test coverage is more important than radical refactoring [perhaps I'll start a branch, if I get inspired], though figuring out a roadmap is neat. That said, I've never written RSpec before, so I appreciate y'all's pull requests and critiques. I may have time over the weekend to work on this. |
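Loosely, that distinction might look like two collections with the URL as the mapping between them; a small sketch under that reading (placeholder URLs and selector, not a proposed implementation):

require 'open-uri'
require 'nokogiri'

# "Things to scrape": links harvested from an index page.
to_scrape = ["http://example.com/story1.html", "http://example.com/story2.html"]

# "Things having been scraped": data keyed by the URL it came from,
# which is the mapping relationship between the two collections.
scraped = to_scrape.each_with_object({}) do |url, data|
  data[url] = Nokogiri::HTML(URI.open(url).read).css("h1").text
end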
@jeremybmerrill I'd be careful with creating a chaining concept without getting a better idea of how people approach scraping beyond the simple scenario of index page -> data page. In the simple scenario, method chaining wouldn't make things much simpler, because it's basically
I admit that's an uncommon case but presumably, the point of reworking the API would be to allow for more complicated cases...but I think reworking it toward a chainability model may not make things significantly easier for the end user (and certainly would complicate your life as the developer :) ) |
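For comparison, the simple index page -> data page scenario is already only a couple of calls in the current README-style form (the exact signature may differ; URL, selector, and output filename are placeholders):

require 'upton'
require 'nokogiri'

# Index URL, link selector, selector type; then one block per instance page.
u = Upton::Scraper.new("http://example.com/index.html", "a.story-link", :css)
u.scrape_to_csv("data.csv") { |html| [Nokogiri::HTML(html).css("h1").text] }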
@dannguyen, I've actually stumbled into one of those more complicated cases, so I plan on using that to do some thinking about how to be more flexible to account for those sorts of patterns. |
Code:
Error:
The readme example uses a String as the argument for Upton::Scraper::new, but for some reason something that responds to #each_with_index is expected here. It could be that the xpath isn't good, but if so, the error shouldn't be pointing to the String provided on line 8.