Only one worker collecting repos data despite parallel setup? #18
Hi! If a batch took half an hour or so, that is most likely due to having hit the resource limit for that account (5k requests / hour). When that happens, the code sleeps until the reset time when it is allowed to query the API again. You can check the remaining requests and reset time for that token through `GHOST.PARALLELENABLER.pat` (see the snippet in the quoted reply below).

Just in case, you only need to run the package on one container. If you run multiple containers using the same tokens, it will not work since each of those would be using the same PATs: the PATs would be making calls from each container and hitting the resource limits. It should be a single container with multiple cores, and the package, through `setup_parallel`, handles the parallelization.

A side note on `queries`: each license is queried sequentially and the parallelization occurs on-demand internally, which is different from, say, querying repositories, commits, or users. Discovering the intervals depends on the results for other intervals within the same license. When querying each license, the package divides up the intervals and uses multiple processes as needed (small licenses like 0BSD will likely run faster sequentially, while MIT will definitely use multiple processes).

Let me know if you have any additional questions.
Thank you very much, that's very helpful! I didn't realize the parallelization depended on using multiple tokens. Do you happen to know if multiple tokens means multiple accounts, or if multiple tokens can be linked to one GitHub account? I would assume GitHub would have one shared rate limit for all tokens coming from the same account, but maybe I'm wrong!

Vlad
…On Mon, Feb 1, 2021 at 12:15 PM José Bayoán Santiago Calderón < ***@***.***> wrote:
Hi!
May I ask how many GitHub Personal Access Tokens you have written to the pats table in the database?

`setup_parallel` will spawn as many workers as there are GitHub Personal Access Tokens available (by default; this can be overridden by the limit keyword argument). If there aren't multiple, it will just run sequentially. Generating the full queries tables with around 38 GitHub PATs took around an hour or so, I believe. It would probably be a few hours with 5 PATs.
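The spawning rule described above (one worker per token, optionally capped, falling back to sequential execution with a single token) can be sketched as follows. This is an illustrative sketch, not GHOST's actual implementation; the function name and the exact fallback behavior are assumptions:

```python
def workers_to_spawn(pats, limit=None):
    """One worker per Personal Access Token, optionally capped by `limit`.

    With fewer than two tokens there is nothing to parallelize, so no
    extra workers are spawned and the work runs sequentially.
    """
    n = len(pats)
    if limit is not None:
        n = min(n, limit)
    return n if n >= 2 else 0

# One PAT: no extra workers, runs sequentially on the main process.
print(workers_to_spawn(["pat1"]))                                  # 0
# Five PATs, capped at 3 by the limit keyword.
print(workers_to_spawn(["p1", "p2", "p3", "p4", "p5"], limit=3))   # 3
```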
If a batch took half an hour or so, that is most likely due to having hit the resource limit for that account (5k requests / hour). When that happens, the code sleeps until the reset time when it is allowed to query the API again. You can check the remaining and reset time for that token through:

```julia
# after setup() / setup_parallel();
# when in parallel you can query the variable from the worker to get the information for that PAT
julia> GHOST.PARALLELENABLER.pat
GitHub Personal Access Token
login: Nosferican
core remaining: 5000
core reset: 2021-02-01T17:56:05
graphql remaining: 5000
graphql reset: 2021-02-01T17:56:05
```
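The sleep-until-reset behavior described above boils down to: if the request budget is spent, wait until the reset timestamp before querying again. A minimal sketch (a hypothetical helper, not the package's actual code; the timestamps mirror the example output above):

```python
from datetime import datetime, timezone

def seconds_until_reset(remaining, reset_at, now):
    """How long to sleep before the next API call: zero while requests
    remain in the budget, otherwise the time left until the limit resets."""
    if remaining > 0:
        return 0.0
    return max((reset_at - now).total_seconds(), 0.0)

now = datetime(2021, 2, 1, 17, 0, 0, tzinfo=timezone.utc)
reset = datetime(2021, 2, 1, 17, 56, 5, tzinfo=timezone.utc)
print(seconds_until_reset(5000, reset, now))  # 0.0   -> budget left, no sleep
print(seconds_until_reset(0, reset, now))     # 3365.0 -> sleep until reset
```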
Just in case, you only need to run the package on one container. If you run multiple containers using the same tokens, it will not work since each of those would be using the same PATs: the PATs would be making calls from each container and hitting the resource limits. It should be a single container with multiple cores, and the package, through `setup_parallel`, handles the parallelization (e.g., initializes the processes, assigns a PAT and a database connection to each worker, handles work distribution, uploads data to the database).
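The per-worker assignment described above amounts to a one-to-one mapping so that no two workers ever share a token's rate limit. A sketch of that idea (illustrative only; function and variable names are made up):

```python
def assign_resources(worker_ids, pats):
    """Pair each worker with exactly one PAT so no two workers share a
    token's rate limit. Requires at least as many PATs as workers."""
    if len(pats) < len(worker_ids):
        raise ValueError("need at least one PAT per worker")
    return dict(zip(worker_ids, pats))

# Julia worker ids typically start at 2 (1 is the main process).
print(assign_resources([2, 3, 4], ["patA", "patB", "patC"]))
# {2: 'patA', 3: 'patB', 4: 'patC'}
```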
A side note on `queries`; see for example this file:
```julia
using GHOST
# now, canonicalize, and CompoundPeriod come from the Dates stdlib
using Dates: now, canonicalize, CompoundPeriod

time_start = now()
setup()
setup_parallel(5)
spdxs = execute(GHOST.PARALLELENABLER.conn,
                "SELECT spdx FROM gh_2007_2020.licenses ORDER BY spdx;",
                not_null = true) |>
        (obj -> getproperty.(obj, :spdx))
for spdx in spdxs
    queries(spdx)
end
time_end = now()
canonicalize(CompoundPeriod(time_end - time_start))
```
For `queries`, each license is queried sequentially and the parallelization occurs on-demand internally, which is different from, say, querying repositories, commits, or users. Discovering the intervals depends on the results for other intervals within the license. When querying each license, it is actually dividing up the intervals and using multiple processes as needed (small licenses like 0BSD will likely run faster sequentially so they will not hit the multiple workers; MIT will definitely use the multiple processes as that one will hit multiple intervals to query / identify).

Let me know if you have any additional questions.
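The on-demand interval splitting described above can be pictured as recursively bisecting a time interval whenever it would return too many results for a single query. This is a sketch under assumptions: the 1000-result cap mirrors GitHub's search limit, and `count` stands in for an API call; it is not GHOST's actual algorithm:

```python
def split_intervals(lo, hi, count, cap=1000):
    """Recursively bisect [lo, hi) until each interval's result count fits
    under the per-query cap; returns the list of queryable intervals."""
    if count(lo, hi) <= cap:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_intervals(lo, mid, count, cap) +
            split_intervals(mid, hi, count, cap))

# Pretend one repo is created per day: a 4000-day range needs 4 intervals,
# while a small license's 500-day range needs none (runs as one query).
count = lambda lo, hi: hi - lo
print(split_intervals(0, 4000, count))
# [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
print(split_intervals(0, 500, count))
# [(0, 500)]
```

A license with few repositories never recurses, so it stays on one worker; a large one like MIT fans out into many intervals that can be distributed across workers.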
Aye. You can have multiple tokens, but they would all share the same resources for that account. To be able to increase the rate limits, you need access to tokens from multiple GitHub accounts.
That makes sense. I am considering whether to add more tokens. For what it's worth, with one token it does look like each batch takes between 15 and 20 minutes. A quick count yielded 1376 query batches (after having processed some), so to me it looks like it would take between 14 and 19 days to build the full queries table. I don't know how much longer it will take to get the commits. That seems quite a bit longer than your estimate: you said it took an hour to build the queries table with 38 tokens, so I would expect ~40 hours with 1 token.

Alternatively... this code is for grabbing absolutely everything with a public license, right? Could I ask you to post a sample query for grabbing repos by name or by tag? E.g., your repo has the "open-source," "julia-language," and "research-tool" tags. Thank you for all your help :)
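The batch-count arithmetic above works out as a quick sanity check (assuming 15-20 minutes per batch and 1376 batches, per the figures in this thread):

```python
batches = 1376
for minutes_per_batch in (15, 20):
    # total minutes -> hours -> days
    days = batches * minutes_per_batch / 60 / 24
    print(f"{minutes_per_batch} min/batch -> {days:.1f} days")
# 15 min/batch -> 14.3 days
# 20 min/batch -> 19.1 days
```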
Getting the queries table should be a couple of hours with 5 tokens in my experience (more than 5 tokens won't help there, since around 5 is more than enough and additional ones won't be used). Using 5 tokens allows the program to cycle through them, so it never has to sleep waiting for a reset. Currently the code looks for public repositories with a given license that are non-fork, non-mirror, non-archived (see line 8 in d1332ec), and by default it looks for all repos that meet the filter until the end of the previous year. I have a few ideas on how to make it more customizable (e.g., keyword arguments); see `GHOST.jl/src/assets/graphql/02_repos.graphql`, lines 1 to 29 in d1332ec.
For, say, filtering based on tags, I don't think it is supported in the search syntax (this could be submitted as a feature request). One approach would be to collect all the public repositories using an approach similar to how the package currently finds queries, use those queries to collect the repositories, and then get the attributes for the repositories, for example through a query in the GraphQL Explorer.
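As a sketch of the kind of Explorer query that reads a repository's topics: the `repositoryTopics` field comes from GitHub's GraphQL v4 schema, but the owner/name here are placeholders and this is an illustration, not the query from the original comment. Packaging it as a JSON body (the shape the GraphQL endpoint expects for a POST):

```python
import json

# Hypothetical example: read a repository's topics via the GraphQL API.
# Field names follow GitHub's GraphQL v4 schema; owner/name are placeholders.
query = """
query {
  repository(owner: "octocat", name: "Hello-World") {
    repositoryTopics(first: 10) {
      nodes { topic { name } }
    }
  }
}
"""
payload = json.dumps({"query": query})  # body for a POST to the GraphQL endpoint
print("repositoryTopics" in payload)  # True
```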
Describe the bug
When I run the example code in the repos script, I do get output, but it's slow (a half hour or more per batch); furthermore, the "From worker" line only ever lists one worker, which makes me concerned the code is not actually parallelized.
Sample output:
Environment
I set up the GHOST package using Pkg as described in the installation instructions. I am running it inside a Docker container. I double checked and I have 6 CPUs allocated per container.
Minimal Reproducible Example (MRE)
Steps to reproduce the behavior:
Run the example code in the repos script
Expected behavior
I would expect different worker ids in the "From worker" output lines.