
Only one worker collecting repos data despite parallel setup? #18

vlad43210 opened this issue Feb 1, 2021 · 5 comments
@vlad43210

Describe the bug
When I run the example code in the repos script, I do get output, but it's slow (a half hour or more per batch); furthermore, the "From worker" line only ever lists one worker, which makes me concerned the code is not actually parallelized.

Sample output:

      From worker 2:    10×4 SubDataFrame
      From worker 2:     Row │ spdx      created                            queries  query_group
      From worker 2:         │ String    Interval…                          Int16    Int16
      From worker 2:    ─────┼───────────────────────────────────────────────────────────────────
      From worker 2:       1 │ 0BSD      [2020-10-11T00:00:00 .. 2020-10-…      100            1
      From worker 2:       2 │ AGPL-3.0  [2011-07-01T00:00:00 .. 2012-08-…      100            1
      From worker 2:       3 │ AGPL-3.0  [2014-02-10T00:00:00 .. 2014-04-…      100            1
      From worker 2:       4 │ AGPL-3.0  [2015-02-22T00:00:00 .. 2015-04-…      100            1
      From worker 2:       5 │ AGPL-3.0  [2015-04-02T00:00:00 .. 2015-05-…      100            1
      From worker 2:       6 │ AGPL-3.0  [2015-05-12T00:00:00 .. 2015-06-…      100            1
      From worker 2:       7 │ AGPL-3.0  [2015-10-27T00:00:00 .. 2015-12-…      100            1
      From worker 2:       8 │ AGPL-3.0  [2015-12-07T00:00:00 .. 2016-01-…      100            1
      From worker 2:       9 │ AGPL-3.0  [2016-05-16T00:00:00 .. 2016-06-…      100            1
      From worker 2:      10 │ AGPL-3.0  [2016-10-31T00:00:00 .. 2016-11-…      100            1

Environment
I set up the GHOST package using Pkg as described in the installation instructions. I am running it inside a Docker container. I double-checked and I have 6 CPUs allocated per container.

Minimal Reproducible Example (MRE)
Steps to reproduce the behavior:
Run the example code in the repos script

Expected behavior
I would expect to see different worker ids in the "From worker" output lines.

@vlad43210 vlad43210 added the bug Something isn't working label Feb 1, 2021
@Nosferican
Collaborator

Hi!
May I ask how many GitHub Personal Access Tokens you have written to the pats table in the database?
setup_parallel will spawn as many workers as there are GitHub Personal Access Tokens available (by default; this can be overridden with the limit keyword argument). If there aren't multiple tokens, it will just run sequentially. Generating the full queries table with around 38 GitHub PATs took about an hour; it would probably take a few hours with 5 PATs.
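
As a quick sanity check after setup_parallel, you can confirm how many worker processes were actually spawned (a minimal sketch using the standard Distributed API, not anything GHOST-specific):

using Distributed

nworkers()   # 1 means everything runs on a single process (sequential)
workers()    # the process ids you should see in the "From worker N:" lines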

If a batch took half an hour or so, that is most likely due to having hit the rate limit for that account (5,000 requests per hour). When that happens, the code sleeps until the reset time, when it is allowed to query the API again. You can check the remaining quota and reset time for that token through:

# after setup() / setup_parallel();
# when in parallel you can query the variable from the worker to get the information for that PAT
julia> GHOST.PARALLELENABLER.pat 
GitHub Personal Access Token
  login: Nosferican
  core remaining: 5000
  core reset: 2021-02-01T17:56:05
  graphql remaining: 5000
  graphql reset: 2021-02-01T17:56:05
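
When running in parallel, here is a minimal sketch of how you could pull that information from every worker (assuming setup_parallel has already loaded GHOST and assigned a PAT on each worker):

using Distributed

# Ask each worker for its assigned PAT and rate-limit state.
for w in workers()
    pat = remotecall_fetch(() -> GHOST.PARALLELENABLER.pat, w)
    @show w pat
end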

Just in case: you only need to run the package on one container. If you run multiple containers with the same tokens, it will not work, since each container would be making calls with the same PAT and hitting the rate limits. It should be a single container with multiple cores; the package, through setup_parallel, handles the parallelization (e.g., initializes the processes, assigns a PAT and a database connection to each worker, distributes the work, and uploads data to the database).

As a side note on queries, see for example this file:

using GHOST
using Dates: now, canonicalize, CompoundPeriod  # timing utilities live in Dates

time_start = now()
setup()             # single-process connection and PAT
setup_parallel(5)   # spawn up to 5 workers, one PAT each
# Collect every license identifier from the licenses table.
spdxs = execute(GHOST.PARALLELENABLER.conn,
                "SELECT spdx FROM gh_2007_2020.licenses ORDER BY spdx;",
                not_null = true) |>
    (obj -> getproperty.(obj, :spdx))
# Build the queries table one license at a time.
for spdx in spdxs
    queries(spdx)
end
time_end = now()
canonicalize(CompoundPeriod(time_end - time_start))

For queries, each license is queried sequentially and the parallelization happens on demand internally, which is different from, say, querying repositories, commits, or users. Discovering the intervals for a license depends on the results of other intervals within that license. When querying each license, the package divides up the intervals and uses multiple processes as needed (small licenses like 0BSD will likely run faster sequentially, so they will not hit the multiple workers; MIT will definitely use multiple processes, since it has many intervals to query and identify).
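
For instance, here is a rough way to see that difference once setup() and setup_parallel() have run (a hypothetical timing sketch; any licenses from the licenses table would do):

# 0BSD has few intervals and will likely stay on a single process;
# MIT has many intervals and should fan out across the workers.
@time queries("0BSD")
@time queries("MIT")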

Let me know if you have any additional questions.

@vlad43210
Author

vlad43210 commented Feb 1, 2021 via email

@Nosferican
Collaborator

Aye. You can have multiple tokens, but they would all share the rate limit of the same account. To actually increase the rate limits, you need tokens from multiple GitHub accounts.

@vlad43210
Author

That makes sense. I am considering whether to add more tokens.

For what it's worth, with one token each batch takes between 15 and 20 minutes. A quick count yielded 1376 remaining query batches (after having processed some), so it looks like it would take between 14 and 19 days to build the full queries table, and I don't know how much longer it will take to get the commits. That is quite a bit longer than your estimate: you said it took about an hour to build the queries table with 38 tokens, so I would expect it to take roughly 40 hours with 1 token.
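
The arithmetic behind that estimate, for reference:

# 1376 remaining batches at 15-20 minutes each, expressed in days.
batches = 1376
to_days(minutes_per_batch) = batches * minutes_per_batch / 60 / 24
to_days(15), to_days(20)   # ≈ (14.3, 19.1) days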

Alternatively... this code grabs absolutely everything with a public license, right? Could I ask you to post a sample query for grabbing repos by name or by tag? For example, your repo has the "open-source," "julia-language," and "research-tool" tags.

Thank you for all your help :)

@Nosferican
Collaborator

Getting the queries table should take a couple of hours with 5 tokens in my experience (more than 5 tokens won't help there, since around 5 is more than enough and additional ones won't be used). Using 5 tokens allows the program to cycle through them so it never has to sleep while waiting for a reset.

Currently the code looks for repositories that are public, have the given license, and are not forks, mirrors, or archived. The query is defined in

subsquery = join([ string("_$idx:search(query:\"is:public fork:false mirror:false archived:false license:$spdx created:",

and by default it looks for all repos that meet the filter up to the end of the previous year. I have a few ideas on how to make it more customizable (e.g., keyword arguments such as archived::Bool = false; see the sketch after the fragment below). More generally, you could pass the GraphQL file directly, but for search queries that is a bit harder than for other query types. For example, for repos I use

fragment A on SearchResultItemConnection {
  pageInfo {
    endCursor
    hasNextPage
  }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage {
          name
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: $until) {
                totalCount
              }
            }
          }
        }
      }
    }
  }
}
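
On the customization point mentioned before the fragment, here is a hypothetical sketch (not part of GHOST.jl) of what keyword-argument control over the search qualifiers could look like:

# Hypothetical helper, not in GHOST.jl: build the search qualifier string
# with keyword arguments instead of hard-coded filters.
function search_qualifiers(spdx::AbstractString;
                           fork::Bool = false,
                           mirror::Bool = false,
                           archived::Bool = false)
    "is:public fork:$fork mirror:$mirror archived:$archived license:$spdx"
end

search_qualifiers("mit")
# "is:public fork:false mirror:false archived:false license:mit"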

As for filtering based on tags (topics), I don't think that is supported in the search syntax (it could be submitted as a feature request). One approach would be to collect, say, all the public repositories using a similar approach to how the package currently finds queries, use those queries to collect the repositories, and then fetch the topic attributes for each repository. For example, you can try the following query in the GraphQL Explorer:

{
  repository(owner: "uva-bi-sdad", name: "GHOST.jl") {
    repositoryTopics(first: 100) {
      edges {
        node {
          id
          topic {
            id
            name
          }
        }
      }
    }
  }
}
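
If you want to run that same query from Julia rather than in the explorer, here is a minimal sketch (independent of GHOST.jl's internals) using HTTP.jl and JSON3.jl, assuming a PAT is available in the GITHUB_TOKEN environment variable:

using HTTP, JSON3

query = """
{
  repository(owner: "uva-bi-sdad", name: "GHOST.jl") {
    repositoryTopics(first: 100) {
      edges { node { topic { name } } }
    }
  }
}
"""

# POST the query to the GitHub GraphQL endpoint with a Personal Access Token.
resp = HTTP.post("https://api.github.com/graphql",
                 ["Authorization" => "bearer $(ENV["GITHUB_TOKEN"])",
                  "Content-Type" => "application/json"],
                 JSON3.write((query = query,)))
data = JSON3.read(resp.body)
# Extract the topic names, e.g. "open-source", "julia-language", "research-tool".
[edge.node.topic.name for edge in data.data.repository.repositoryTopics.edges]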
