
rbind error while retrieving results #39

Open
jmecahansen opened this issue Feb 15, 2018 · 10 comments

@jmecahansen

jmecahansen commented Feb 15, 2018

Hi,

I'm writing an R extractor for both Google Analytics (GA) and Google Search Console (GSC) and I'm running into a problem I don't know how to solve.

Since I don't know how many rows I will get for a given day in GSC, I extract the data inside a repeat loop. I start with a default row limit (5000) and, if the number of rows returned matches that limit, I retry the extraction with the limit increased by another 5000 rows on each iteration, so it fetches 5000, 10000, 15000, 20000, ..., N rows.
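
A condensed sketch of that loop (placeholder site URL, dates and dimensions; the real values are in the attached sample below):

# condensed sketch of the growing-rowLimit loop (placeholder values)
library(searchConsoleR)
# assumes scr_auth() has already been run

rowLimit <- 5000

repeat {
  data <- search_analytics(
    siteURL = "https://www.example.com",   # placeholder
    startDate = "2018-02-12",              # placeholder single day
    endDate = "2018-02-12",
    dimensions = c("date", "page"),
    rowLimit = rowLimit,
    walk_data = "byBatch"
  )

  # stop once a fetch comes back empty or short of the current limit
  if (!is.data.frame(data) || nrow(data) < rowLimit) break

  # otherwise raise the limit by another 5000 and try again
  rowLimit <- rowLimit + 5000
}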

If a given day has fewer than 5000 rows, it works well. But if it happens to have more than that, I tend to get the following error message in the console:

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
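
For context, a minimal toy reproduction of how rbind() produces that message when two pieces disagree on their columns (made-up data frames, not the real API output):

# toy reproduction: combining data frames with different column sets
batch_ok  <- data.frame(page = "/home", clicks = 10, impressions = 100)
batch_bad <- data.frame(page = NA)   # e.g. a placeholder for an empty response

rbind(batch_ok, batch_bad)
# Error in rbind(deparse.level, ...) :
#   numbers of columns of arguments do not match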

When I'm not hitting the rbind error, the extraction sometimes seems to get stuck and, after a while, it fails with the following error message in the console:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached

This happens once in a while, but never in a predictable fashion: sometimes it times out after successfully fetching 12 or 13 results, sometimes right after the first retrieval.

I've attached a sample in case you can lend me a hand. It's split into several files: the -base.R file initializes some project data, the -upload.R file is responsible for extraction and upload, and the remaining file just contains the specific data for the customer I want to extract from.

Can you give it a look?

Thanks in advance,
Julio :)

sample.zip

@MarkEdmondson1234
Owner

??

@jmecahansen
Author

Sorry, it's my first post and I blew it while pressing CTRL+RETURN. My bad :)

@daauerbach

@jmecahansen, I haven't dug into your zip, and this may not be the fix, but have you experimented with Sys.sleep() or other pauses in your loop to make sure you're not hitting API call limits? Also, have you switched to your own Google Cloud project, or are you using the default scopes/client/secret etc.?

I've dealt with similar issues but haven't tried to automate it, as it's not a very frequent need and I've found it faster to just trial-and-error "by hand" with larger jumps in the max rows...

@jmecahansen
Author

I tried it... and no luck. Even if I add some spacing between calls (say, 5 seconds), I still get bitten by the curl_fetch_memory error. I have tried different configurations and the result is always the same: seemingly random errors related to cURL.

@MarkEdmondson1234
Owner

I think I need to see some more code. I don't understand why you don't set your rowLimit to 99999 or similar so it gets all rows; it should batch it for you.

If it doesn't get any results, it returns a data.frame of NAs instead; that may be the source of your trouble.

I also tend to use tryCatch() to test the API output, so you can do that before your rbind.
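
For instance, a minimal sketch (made-up object names) of checking batches before they reach rbind():

# minimal sketch (hypothetical names): keep only usable batches before rbind()
batches <- list(batch_day1, batch_day2)   # results from search_analytics() calls

usable <- Filter(
  function(x) is.data.frame(x) && nrow(x) > 0 && !all(is.na(x)),
  batches
)

combined <- do.call(rbind, usable)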

But add some code and I can take a better look.

@MarkEdmondson1234
Owner

Oh OK, the code is in the zip - in the future, could you paste the code into the question itself?

@MarkEdmondson1234
Owner

The pertinent bit seems to be

# configuration: row retrieval limit
config.gsc.rowLimit <- 5000

....

    # extract Google Search Console (GSC) data
    rowCount <- 0
    rowLimit <- config.gsc.rowLimit

    repeat {
      data <- search_analytics(
        siteURL = config.gsc.domain,
        startDate = date_range[d],
        endDate = date_range[d],
        dimensions = s,
        rowLimit = rowLimit,
        walk_data = "byBatch"
      )

      if (is.data.frame(data)) {
        rowCount <- nrow(data)
      }

      if (rowCount == 0 || rowCount < rowLimit) {
        break
      } else {
        rowLimit <- rowLimit + config.gsc.rowLimit
        Sys.sleep(5)
      }
    }

which I think could be

config.gsc.rowLimit <- 9999999L
....
      data <- tryCatch(search_analytics(
        siteURL = config.gsc.domain,
        startDate = date_range[d],
        endDate = date_range[d],
        dimensions = s,
        rowLimit = rowLimit,
        walk_data = "byBatch"
      ), error = function(err){
         message("Problem fetching data, returning NULL")
         NULL
      })
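
A hypothetical continuation (all_data is a made-up accumulator, not part of the package) showing how the NULL from the error handler could be handled before anything is combined:

all_data <- list()   # set up once, before any fetching

# ... after each tryCatch() above, skip NULL / empty results:
if (is.data.frame(data) && nrow(data) > 0) {
  all_data[[length(all_data) + 1]] <- data
}

# finally, combine only the batches that passed the check
combined <- do.call(rbind, all_data)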

@jmecahansen
Author

That solution isn't right... at least not for my case. I've tried your example and I had to manually stop it. Here's the short log from RStudio:

Batching data via method: byBatch
With rowLimit set to 9999999 will need up to [2000] API calls
Request #: 0 : 5000 : 10000
Request #: 15000 : 20000 : 25000
Request #: 30000 : 35000 : 40000
Request #: 45000 : 50000 : 55000
Request #: 60000 : 65000 : 70000
Request #: 75000 : 80000 : 85000
Request #: 90000 : 95000 : 1e+05
Request #: 105000 : 110000 : 115000
Request #: 120000 : 125000 : 130000
Request #: 135000 : 140000 : 145000
Request #: 150000 : 155000 : 160000
Request #: 165000 : 170000 : 175000
Request #: 180000 : 185000 : 190000
Request #: 195000 : 2e+05 : 205000
Request #: 210000 : 215000 : 220000
Request #: 225000 : 230000 : 235000
Request #: 240000 : 245000 : 250000
Request #: 255000 : 260000 : 265000

By default, the date range is Sys.Date() - 3 to match GSC, and in this case the dataset contains 33224 rows for one of the dimension sets (date, page, country and device).

It doesn't seem right to me to blindly gather N batches: with the row limit set at 9999999L and the per-call batch being 5000 rows IIRC, that means processing ~2000 calls to gather just 33224 rows :P

@jmecahansen
Author

Maybe a good approach would be to provide a function which makes a first call to the API to get the total number of rows for a given query; we could then set the limit from that value. The call count would be (N / 5000) + 1, rather than something in the range 1..2000.
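
As a quick sanity check on that arithmetic, using the 33224-row day from the previous comment:

# rough call-count comparison for the 33224-row day mentioned above
ceiling(33224 / 5000)   # 7 calls if the total were known up front
9999999 / 5000          # ~2000 calls when walking the full 9999999 rowLimit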

@MarkEdmondson1234
Owner

Ah OK, fair enough. The problem is that the API doesn't return the total number of rows like the Google Analytics API does, so you can't plan how many calls you will need, AFAIK.
