
Only one worker collecting repos data despite parallel setup? #18

vlad43210 opened this issue Feb 1, 2021 · 5 comments
@vlad43210

Describe the bug
When I run the example code in the repos script, I do get output, but it's slow (a half hour or more per batch); furthermore, the "From worker" line only ever lists one worker, which makes me concerned the code is not actually parallelized.

Sample output:

      From worker 2:    10×4 SubDataFrame
      From worker 2:     Row │ spdx      created                            queries  query_group
      From worker 2:         │ String    Interval…                          Int16    Int16
      From worker 2:    ─────┼───────────────────────────────────────────────────────────────────
      From worker 2:       1 │ 0BSD      [2020-10-11T00:00:00 .. 2020-10-…      100            1
      From worker 2:       2 │ AGPL-3.0  [2011-07-01T00:00:00 .. 2012-08-…      100            1
      From worker 2:       3 │ AGPL-3.0  [2014-02-10T00:00:00 .. 2014-04-…      100            1
      From worker 2:       4 │ AGPL-3.0  [2015-02-22T00:00:00 .. 2015-04-…      100            1
      From worker 2:       5 │ AGPL-3.0  [2015-04-02T00:00:00 .. 2015-05-…      100            1
      From worker 2:       6 │ AGPL-3.0  [2015-05-12T00:00:00 .. 2015-06-…      100            1
      From worker 2:       7 │ AGPL-3.0  [2015-10-27T00:00:00 .. 2015-12-…      100            1
      From worker 2:       8 │ AGPL-3.0  [2015-12-07T00:00:00 .. 2016-01-…      100            1
      From worker 2:       9 │ AGPL-3.0  [2016-05-16T00:00:00 .. 2016-06-…      100            1
      From worker 2:      10 │ AGPL-3.0  [2016-10-31T00:00:00 .. 2016-11-…      100            1

Environment
I set up the GHOST package using Pkg as described in the installation instructions. I am running it inside a Docker container. I double-checked and I have 6 CPUs allocated per container.

Minimal Reproducible Example (MRE)
Steps to reproduce the behavior:
Run the example code in the repos script

Expected behavior
I would expect to see different worker ids in the "From worker" output lines.

@vlad43210 vlad43210 added the bug Something isn't working label Feb 1, 2021
@Nosferican
Collaborator

Hi!
May I ask how many GitHub Personal Access Tokens you have written to the pats table in the database?
setup_parallel will spawn as many workers as there are GitHub Personal Access Tokens available (by default; this can be overridden with the limit keyword argument). If there aren't multiple tokens, it will just run sequentially. Generating the full queries table with around 38 GitHub PATs took about an hour; it would probably take a few hours with 5 PATs.
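
As a quick sanity check after setup_parallel, you can confirm how many worker processes were actually spawned (a minimal sketch using the standard Distributed API, not anything GHOST-specific):

using Distributed

nworkers()   # 1 means everything runs on a single process (sequential)
workers()    # the process ids you should see in the "From worker N:" lines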

If a batch took half an hour or so, that is most likely due to having hit the rate limit for that account (5,000 requests per hour). When that happens, the code sleeps until the reset time, when it is allowed to query the API again. You can check the remaining quota and reset time for that token through:

# after setup() / setup_parallel();
# when in parallel you can query the variable from the worker to get the information for that PAT
julia> GHOST.PARALLELENABLER.pat 
GitHub Personal Access Token
  login: Nosferican
  core remaining: 5000
  core reset: 2021-02-01T17:56:05
  graphql remaining: 5000
  graphql reset: 2021-02-01T17:56:05
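
When running in parallel, here is a minimal sketch of how you could pull that information from every worker (assuming setup_parallel has already loaded GHOST and assigned a PAT on each worker):

using Distributed

# Ask each worker for its assigned PAT and rate-limit state.
for w in workers()
    pat = remotecall_fetch(() -> GHOST.PARALLELENABLER.pat, w)
    @show w pat
end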

Just in case: you only need to run the package on one container. If you run multiple containers with the same tokens, it will not work, since each container would be making calls with the same PAT and hitting the rate limits. It should be a single container with multiple cores; the package, through setup_parallel, handles the parallelization (e.g., initializes the processes, assigns a PAT and a database connection to each worker, distributes the work, and uploads data to the database).

As a side note on queries, see for example this file:

using GHOST
using Dates: now, canonicalize, CompoundPeriod  # timing utilities live in Dates

time_start = now()
setup()             # single-process connection and PAT
setup_parallel(5)   # spawn up to 5 workers, one PAT each
# Collect every license identifier from the licenses table.
spdxs = execute(GHOST.PARALLELENABLER.conn,
                "SELECT spdx FROM gh_2007_2020.licenses ORDER BY spdx;",
                not_null = true) |>
    (obj -> getproperty.(obj, :spdx))
# Build the queries table one license at a time.
for spdx in spdxs
    queries(spdx)
end
time_end = now()
canonicalize(CompoundPeriod(time_end - time_start))

For queries, each license is queried sequentially and the parallelization happens on demand internally, which is different from, say, querying repositories, commits, or users. Discovering the intervals for a license depends on the results of other intervals within that license. When querying each license, the package divides up the intervals and uses multiple processes as needed (small licenses like 0BSD will likely run faster sequentially, so they will not hit the multiple workers; MIT will definitely use multiple processes, since it has many intervals to query and identify).
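
For instance, here is a rough way to see that difference once setup() and setup_parallel() have run (a hypothetical timing sketch; any licenses from the licenses table would do):

# 0BSD has few intervals and will likely stay on a single process;
# MIT has many intervals and should fan out across the workers.
@time queries("0BSD")
@time queries("MIT")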

Let me know if you have any additional questions.

@vlad43210
Author

vlad43210 commented Feb 1, 2021 via email

@Nosferican
Collaborator

Aye. You can have multiple tokens, but they would all share the rate limit of the same account. To actually increase the rate limits, you need tokens from multiple GitHub accounts.

@vlad43210
Author

That makes sense. I am considering whether to add more tokens.

For what it's worth, with one token each batch takes between 15 and 20 minutes. A quick count yielded 1376 remaining query batches (after having processed some), so it looks like it would take between 14 and 19 days to build the full queries table, and I don't know how much longer it will take to get the commits. That is quite a bit longer than your estimate: you said it took about an hour to build the queries table with 38 tokens, so I would expect it to take roughly 40 hours with 1 token.
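
The arithmetic behind that estimate, for reference:

# 1376 remaining batches at 15-20 minutes each, expressed in days.
batches = 1376
to_days(minutes_per_batch) = batches * minutes_per_batch / 60 / 24
to_days(15), to_days(20)   # ≈ (14.3, 19.1) days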

Alternatively... this code grabs absolutely everything with a public license, right? Could I ask you to post a sample query for grabbing repos by name or by tag? For example, your repo has the "open-source," "julia-language," and "research-tool" tags.

Thank you for all your help :)

@Nosferican
Collaborator

Getting the queries table should take a couple of hours with 5 tokens in my experience (more than 5 tokens won't help there, since around 5 is more than enough and additional ones won't be used). Using 5 tokens allows the program to cycle through them so it never has to sleep while waiting for a reset.

Currently the code looks for repositories that are public, have the given license, and are not forks, mirrors, or archived. The query is defined in

subsquery = join([ string("_$idx:search(query:\"is:public fork:false mirror:false archived:false license:$spdx created:",

and by default it looks for all repos that meet the filter up to the end of the previous year. I have a few ideas on how to make it more customizable (e.g., keyword arguments such as archived::Bool = false; see the sketch after the fragment below). More generally, you could pass the GraphQL file directly, but for search queries that is a bit harder than for other query types. For example, for repos I use

fragment A on SearchResultItemConnection {
  pageInfo {
    endCursor
    hasNextPage
  }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage {
          name
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: $until) {
                totalCount
              }
            }
          }
        }
      }
    }
  }
}
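
On the customization point mentioned before the fragment, here is a hypothetical sketch (not part of GHOST.jl) of what keyword-argument control over the search qualifiers could look like:

# Hypothetical helper, not in GHOST.jl: build the search qualifier string
# with keyword arguments instead of hard-coded filters.
function search_qualifiers(spdx::AbstractString;
                           fork::Bool = false,
                           mirror::Bool = false,
                           archived::Bool = false)
    "is:public fork:$fork mirror:$mirror archived:$archived license:$spdx"
end

search_qualifiers("mit")
# "is:public fork:false mirror:false archived:false license:mit"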

As for filtering based on tags (topics), I don't think that is supported in the search syntax (it could be submitted as a feature request). One approach would be to collect, say, all the public repositories using a similar approach to how the package currently finds queries, use those queries to collect the repositories, and then fetch the topic attributes for each repository. For example, you can try the following query in the GraphQL Explorer:

{
  repository(owner: "uva-bi-sdad", name: "GHOST.jl") {
    repositoryTopics(first: 100) {
      edges {
        node {
          id
          topic {
            id
            name
          }
        }
      }
    }
  }
}
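
If you want to run that same query from Julia rather than in the explorer, here is a minimal sketch (independent of GHOST.jl's internals) using HTTP.jl and JSON3.jl, assuming a PAT is available in the GITHUB_TOKEN environment variable:

using HTTP, JSON3

query = """
{
  repository(owner: "uva-bi-sdad", name: "GHOST.jl") {
    repositoryTopics(first: 100) {
      edges { node { topic { name } } }
    }
  }
}
"""

# POST the query to the GitHub GraphQL endpoint with a Personal Access Token.
resp = HTTP.post("https://api.github.com/graphql",
                 ["Authorization" => "bearer $(ENV["GITHUB_TOKEN"])",
                  "Content-Type" => "application/json"],
                 JSON3.write((query = query,)))
data = JSON3.read(resp.body)
# Extract the topic names, e.g. "open-source", "julia-language", "research-tool".
[edge.node.topic.name for edge in data.data.repository.repositoryTopics.edges]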
