
rbind error while retrieving results #39

Open
jmecahansen opened this issue Feb 15, 2018 · 10 comments

@jmecahansen

jmecahansen commented Feb 15, 2018

Hi,

I'm writing an R extractor for both Google Analytics (GA) and Google Search Console (GSC) and I'm running into a problem I don't know how to solve.

Since I don't know how many rows I will get for a given day in GSC, I extract the data inside a repeat loop. I start with a default row limit (5000) and, if the number of rows returned matches that limit, I retry the extraction with the limit increased by another 5000 rows on each iteration, so it fetches 5000, 10000, 15000, 20000, ..., N rows.
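
A condensed sketch of that loop (placeholder site URL, dates and dimensions; the real values are in the attached sample below):

# condensed sketch of the growing-rowLimit loop (placeholder values)
library(searchConsoleR)
# assumes scr_auth() has already been run

rowLimit <- 5000

repeat {
  data <- search_analytics(
    siteURL = "https://www.example.com",   # placeholder
    startDate = "2018-02-12",              # placeholder single day
    endDate = "2018-02-12",
    dimensions = c("date", "page"),
    rowLimit = rowLimit,
    walk_data = "byBatch"
  )

  # stop once a fetch comes back empty or short of the current limit
  if (!is.data.frame(data) || nrow(data) < rowLimit) break

  # otherwise raise the limit by another 5000 and try again
  rowLimit <- rowLimit + 5000
}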

If a given day has fewer than 5000 rows, it works well. But if it happens to have more than that, I tend to get the following error message in the console:

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
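
For context, a minimal toy reproduction of how rbind() produces that message when two pieces disagree on their columns (made-up data frames, not the real API output):

# toy reproduction: combining data frames with different column sets
batch_ok  <- data.frame(page = "/home", clicks = 10, impressions = 100)
batch_bad <- data.frame(page = NA)   # e.g. a placeholder for an empty response

rbind(batch_ok, batch_bad)
# Error in rbind(deparse.level, ...) :
#   numbers of columns of arguments do not match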

When I'm not hitting the rbind error, the extraction sometimes seems to get stuck and, after a while, it fails with the following error message in the console:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached

This happens once in a while, but never in a predictable fashion: sometimes it times out after successfully fetching 12 or 13 results, sometimes right after the first retrieval.

I've attached a sample in case you can lend me a hand. It's split into several files: the -base.R file initializes some project data, the -upload.R file is responsible for extraction and upload, and the remaining file just contains the specific data for the customer I want to extract from.

Can you give it a look?

Thanks in advance,
Julio :)

sample.zip

@MarkEdmondson1234
Owner

??

@jmecahansen
Author

Sorry, it's my first post and I blew it while pressing CTRL+RETURN. My bad :)

@daauerbach

@jmecahansen, I haven't dug into your zip, and this may not be the fix, but have you experimented with Sys.sleep() or other pauses in your loop to make sure you're not hitting API call limits? Also, have you switched to your own Google Cloud project, or are you using the default scopes/client/secret etc.?

I've dealt with similar issues but haven't tried to automate it, as it's not a very frequent need and I've found it faster to just trial-and-error "by hand" with larger jumps in the max rows...

@jmecahansen
Author

I tried it... and no luck. Even if I add some spacing between calls (say, 5 seconds), I still get bitten by the curl_fetch_memory error. I have tried different configurations and the result is always the same: seemingly random errors related to cURL.

@MarkEdmondson1234
Owner

I think I need to see some more code. I don't understand why you don't set your rowLimit to 99999 or similar so it gets all rows; it should batch it for you.

If it doesn't get any results, it returns a data.frame of NAs instead; that may be the source of your trouble.

I also tend to use tryCatch() to test the API output, so you can do that before your rbind.
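
For instance, a minimal sketch (made-up object names) of checking batches before they reach rbind():

# minimal sketch (hypothetical names): keep only usable batches before rbind()
batches <- list(batch_day1, batch_day2)   # results from search_analytics() calls

usable <- Filter(
  function(x) is.data.frame(x) && nrow(x) > 0 && !all(is.na(x)),
  batches
)

combined <- do.call(rbind, usable)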

But add some code and I can take a better look.

@MarkEdmondson1234
Owner

Oh OK, the code is in the zip - in the future, could you paste the code into the question itself?

@MarkEdmondson1234
Owner

The pertinent bit seems to be

# configuration: row retrieval limit
config.gsc.rowLimit <- 5000

....

    # extract Google Search Console (GSC) data
    rowCount <- 0
    rowLimit <- config.gsc.rowLimit

    repeat {
      data <- search_analytics(
        siteURL = config.gsc.domain,
        startDate = date_range[d],
        endDate = date_range[d],
        dimensions = s,
        rowLimit = rowLimit,
        walk_data = "byBatch"
      )

      if (is.data.frame(data)) {
        rowCount <- nrow(data)
      }

      if (rowCount == 0 || rowCount < rowLimit) {
        break
      } else {
        rowLimit <- rowLimit + config.gsc.rowLimit
        Sys.sleep(5)
      }
    }

which I think could be

config.gsc.rowLimit <- 9999999L
....
      data <- tryCatch(search_analytics(
        siteURL = config.gsc.domain,
        startDate = date_range[d],
        endDate = date_range[d],
        dimensions = s,
        rowLimit = rowLimit,
        walk_data = "byBatch"
      ), error = function(err){
         message("Problem fetching data, returning NULL")
         NULL
      })
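
A hypothetical continuation (all_data is a made-up accumulator, not part of the package) showing how the NULL from the error handler could be handled before anything is combined:

all_data <- list()   # set up once, before any fetching

# ... after each tryCatch() above, skip NULL / empty results:
if (is.data.frame(data) && nrow(data) > 0) {
  all_data[[length(all_data) + 1]] <- data
}

# finally, combine only the batches that passed the check
combined <- do.call(rbind, all_data)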

@jmecahansen
Author

That solution isn't right... at least not for my case. I've tried your example and I had to manually stop it. Here's the short log from RStudio:

Batching data via method: byBatch
With rowLimit set to 9999999 will need up to [2000] API calls
Request #: 0 : 5000 : 10000
Request #: 15000 : 20000 : 25000
Request #: 30000 : 35000 : 40000
Request #: 45000 : 50000 : 55000
Request #: 60000 : 65000 : 70000
Request #: 75000 : 80000 : 85000
Request #: 90000 : 95000 : 1e+05
Request #: 105000 : 110000 : 115000
Request #: 120000 : 125000 : 130000
Request #: 135000 : 140000 : 145000
Request #: 150000 : 155000 : 160000
Request #: 165000 : 170000 : 175000
Request #: 180000 : 185000 : 190000
Request #: 195000 : 2e+05 : 205000
Request #: 210000 : 215000 : 220000
Request #: 225000 : 230000 : 235000
Request #: 240000 : 245000 : 250000
Request #: 255000 : 260000 : 265000

By default, the date range is Sys.Date() - 3 to match GSC, and in this case the dataset contains 33224 rows for one of the dimension sets (date, page, country and device).

It doesn't seem right to me to blindly gather N batches: with the row limit set at 9999999L and the per-call batch being 5000 rows IIRC, that means processing ~2000 calls to gather just 33224 rows :P

@jmecahansen
Author

Maybe a good approach would be to provide a function which makes a first call to the API to get the total number of rows for a given query; we could then set the limit from that value. The call count would be (N / 5000) + 1, rather than something in the range 1..2000.
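
As a quick sanity check on that arithmetic, using the 33224-row day from the previous comment:

# rough call-count comparison for the 33224-row day mentioned above
ceiling(33224 / 5000)   # 7 calls if the total were known up front
9999999 / 5000          # ~2000 calls when walking the full 9999999 rowLimit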

@MarkEdmondson1234
Owner

Ah OK, fair enough. The problem is that the API doesn't return the total number of rows like the Google Analytics API does, so you can't plan how many calls you will need, AFAIK.
