Git scraper is a tool developed by Nikola Nikushev for a university project at the Vienna University of Economics and Business. The tool collects GitHub repository information using the GitHub REST API client.
The git scraper collects the information needed to create a user activity analysis.
It links a repository with all of its contributors in order to collect the information relevant for each contributor.
The information found relevant for one project can be seen in the graph below:
- A user creates a repository.
- Multiple users can contribute to a project using one or more of the following methods:
  - Add commits
  - Add comments on pull requests or issues
  - Create issues or pull requests
  - Close issues or pull requests
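The relationship between a repository, its contributors, and their activity can be sketched as a small data model. The type names (`Activity`, `ActivityKind`) and the function `summarizeByUser` are hypothetical and chosen for illustration; the project's real types may look different.

```typescript
// Hypothetical types for illustration; the project's real types may differ.
type ActivityKind =
  | "commit"
  | "comment"
  | "issue_or_pr_opened"
  | "issue_or_pr_closed";

interface Activity {
  repo: string; // e.g. "owner/name"
  user: string; // contributor login
  kind: ActivityKind;
}

// Count how many activities of each kind every contributor has in one repository.
function summarizeByUser(
  repo: string,
  activities: Activity[]
): Map<string, Record<ActivityKind, number>> {
  const summary = new Map<string, Record<ActivityKind, number>>();
  for (const activity of activities) {
    if (activity.repo !== repo) continue; // ignore other repositories
    let counts = summary.get(activity.user);
    if (!counts) {
      counts = { commit: 0, comment: 0, issue_or_pr_opened: 0, issue_or_pr_closed: 0 };
      summary.set(activity.user, counts);
    }
    counts[activity.kind] += 1;
  }
  return summary;
}

// Made-up sample data.
const sample: Activity[] = [
  { repo: "octo/demo", user: "alice", kind: "commit" },
  { repo: "octo/demo", user: "alice", kind: "comment" },
  { repo: "octo/demo", user: "bob", kind: "issue_or_pr_opened" },
  { repo: "octo/other", user: "bob", kind: "commit" },
];

const summary = summarizeByUser("octo/demo", sample);
```

Keying the summary by user keeps one row of counters per contributor, which maps naturally onto one CSV line per contributor and repository.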
The information relevant for the analysis is stored in a CSV folder. For context, we also store multiple results in JSON format, so that it is possible to understand what data produced the CSVs.
- To use this project you need to have NodeJS 10+.
- Obtain a personal GitHub token from the settings page.
- Copy the .env.sample file to .env with the following contents, replacing <your token> with your token:
    GITHUB_TOKEN=<your token>
    OUTPUT_FOLDER=<your output folder>
    SINGLE_CSV_FILE=<true/false>
    RETRY_ON_RATE_LIMIT_REACHED=<true/false>
- Run yarn or npm install.
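As a minimal sketch, the four variables from .env.sample could be turned into a typed configuration object like this. The names `ScraperConfig`, `parseFlag`, and `readConfig` are hypothetical, and the fallback folder "output" is an assumption; the project's loadEnv.ts may work differently.

```typescript
// Hypothetical config shape; the fields mirror the variables in .env.sample.
interface ScraperConfig {
  githubToken: string;
  outputFolder: string;
  singleCsvFile: boolean;
  retryOnRateLimitReached: boolean;
}

type Env = Record<string, string | undefined>;

// Interpret "true" (case-insensitive) as true and anything else as false.
function parseFlag(value: string | undefined): boolean {
  return (value ?? "").trim().toLowerCase() === "true";
}

function readConfig(env: Env): ScraperConfig {
  const githubToken = (env["GITHUB_TOKEN"] ?? "").trim();
  if (githubToken === "") {
    throw new Error("GITHUB_TOKEN is missing; copy .env.sample to .env and set it");
  }
  return {
    githubToken,
    // Assumed fallback folder; the project may use a different default.
    outputFolder: (env["OUTPUT_FOLDER"] ?? "").trim() || "output",
    singleCsvFile: parseFlag(env["SINGLE_CSV_FILE"]),
    retryOnRateLimitReached: parseFlag(env["RETRY_ON_RATE_LIMIT_REACHED"]),
  };
}

// Example with made-up values; RETRY_ON_RATE_LIMIT_REACHED is left unset.
const config = readConfig({
  GITHUB_TOKEN: "example-token",
  OUTPUT_FOLDER: "results",
  SINGLE_CSV_FILE: "true",
});
```

In a real run the argument would be `process.env` after the dotenv package has loaded the .env file. Failing early on a missing token avoids starting a long scrape that can only end in authentication errors.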
- Assuming you have the project configured, you can provide the following variables in the .env file:
  - OUTPUT_FOLDER - configures where the outputs of the project are written.
  - SINGLE_CSV_FILE - whether all outputs should go into one single file instead of being grouped by project.
  - RETRY_ON_RATE_LIMIT_REACHED - if you request data for projects with more than 3000 issues, you will get an error that you have reached your RATE_LIMIT. If you set this to true, the process continues from where it last left off when a rate limit error is thrown. Authenticated users have a rate limit of 5000 requests per hour; you can check the current limits in GitHub's rate limit documentation.
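One way to picture what retrying on a rate limit involves is sketched here. It is not the project's actual implementation: the `ApiError` shape and the function names are hypothetical. The only facts it relies on are that GitHub answers a rate-limited request with status 403 or 429, sets `x-ratelimit-remaining` to 0, and reports the reset time in `x-ratelimit-reset` as UTC epoch seconds.

```typescript
// Hypothetical error shape for illustration; it is not the octokit error type.
interface ApiError {
  status: number;
  headers: Record<string, string | undefined>;
}

function isApiError(value: unknown): value is ApiError {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Partial<ApiError>;
  return (
    typeof candidate.status === "number" &&
    typeof candidate.headers === "object" &&
    candidate.headers !== null
  );
}

// Return how long to wait (in ms) before retrying, or null when the error
// is not a rate-limit error and should be rethrown.
function rateLimitWaitMs(error: ApiError, nowMs: number): number | null {
  const limited = error.status === 403 || error.status === 429;
  if (!limited || error.headers["x-ratelimit-remaining"] !== "0") return null;
  // GitHub reports the reset time as UTC epoch seconds.
  const resetSeconds = Number(error.headers["x-ratelimit-reset"]);
  if (!Number.isFinite(resetSeconds)) return null;
  return Math.max(0, resetSeconds * 1000 - nowMs);
}

// Run `task`; when it fails with a rate-limit error, sleep until the limit
// resets and try again, up to `maxRetries` times.
async function withRateLimitRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const waitMs = isApiError(error) ? rateLimitWaitMs(error, Date.now()) : null;
      if (waitMs === null || attempt >= maxRetries) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

Wrapping each individual API call, rather than the whole run, means that only the failed request is repeated after the wait, so work that already succeeded is not requested again.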
- Then you can provide a list of projects inside the input.json. All the projects inside the input.json will be loaded and their output written to the OUTPUT_FOLDER.
If for some reason you believe your data might have duplicates, you can run:
yarn neek --input <your_output_folder>/csv/<fileName>.csv --output pathToOther/unique.csv
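To illustrate what removing duplicates means here, the following sketch drops repeated lines from CSV text. It is not how neek is implemented; `uniqueLines` is a hypothetical helper, and skipping blank lines is an assumption of this sketch.

```typescript
// Keep the first occurrence of every line and preserve the original order.
function uniqueLines(csvText: string): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const line of csvText.split(/\r?\n/)) {
    if (line === "" || seen.has(line)) continue; // skip blanks and repeats
    seen.add(line);
    kept.push(line);
  }
  return kept.join("\n");
}

// Made-up sample: the second "alice,3" line is a duplicate and is dropped.
const deduplicated = uniqueLines("user,commits\nalice,3\nbob,1\nalice,3\n");
```

Comparing whole lines only treats two rows as duplicates when every column matches exactly, which fits the case where the same record was written twice.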
The project structure is as follows:
- src - Folder with the main core code
- example - Folder which shows examples for a single repo
- api.ts - Main class wrapper around the octokit REST API endpoints that we use
- CustomOctokit.ts - Creates a wrapper on the octokit client to configure rate limiting
- index.ts - Main workflow file that loads all the projects inside the input.json
- input.json - Holds all the projects that will be loaded when we run yarn start
- loadEnv.ts - Uses the dotenv package to load the environment variables
- ProjectLoader.ts - Creates a class that executes the API calls, wraps the responses from the REST API, and writes the outputs to JSON and CSV files
- toCSV.ts - Holds functions that transform a JSON into a CSV file entry
- writeToFile.ts - Holds functions that write contents to a file. All folders get created recursively and we do not overwrite/delete old files
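Transforming a JSON object into a CSV file entry, as toCSV.ts is described to do, could look roughly like the sketch below. The names `Row`, `escapeCsv`, `toCsvLine`, and `toCsv` are hypothetical, and the quoting rules are the common CSV convention rather than something confirmed from the project's code.

```typescript
type Row = Record<string, string | number | boolean | null>;

// Quote a value when it contains a comma, a double quote, or a line break.
function escapeCsv(value: string | number | boolean | null): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn one JSON object into one CSV line, following the given column order.
function toCsvLine(row: Row, columns: string[]): string {
  return columns.map((column) => escapeCsv(row[column] ?? null)).join(",");
}

// Build a full CSV document: a header line followed by one line per row.
function toCsv(rows: Row[], columns: string[]): string {
  const header = columns.map((column) => escapeCsv(column)).join(",");
  return [header, ...rows.map((row) => toCsvLine(row, columns))].join("\n");
}

// Made-up sample rows.
const csv = toCsv(
  [
    { user: "alice", comment: 'said "hi", then left', commits: 3 },
    { user: "bob", comment: null, commits: 1 },
  ],
  ["user", "commits", "comment"]
);
```

Passing the column order explicitly keeps every line aligned with the header, even when an object is missing a field or lists its keys in a different order.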
To contribute, make a pull request or post an issue and tag NikolaNikushev.
All commits follow the rules for Semantic Commits.
The allowed commit types are listed in ./commitlint.config.js.
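As a rough illustration of what a semantic commit header looks like, the sketch below checks a message against the pattern `type(scope): subject`. The list of types is an assumption based on the widely used conventional-commit set; the rules that actually apply to this project are the ones in ./commitlint.config.js, and `isSemanticCommit` is a hypothetical helper, not part of commitlint.

```typescript
// Assumed type list (the common conventional-commit set); the project's
// actual rules live in ./commitlint.config.js and may differ.
const ALLOWED_TYPES = [
  "build", "chore", "ci", "docs", "feat", "fix",
  "perf", "refactor", "revert", "style", "test",
];

// Matches "type(scope): subject" or "type: subject", with an optional "!".
const HEADER_PATTERN = /^(\w+)(\([^)]+\))?(!)?: (.+)$/;

function isSemanticCommit(header: string): boolean {
  const type = HEADER_PATTERN.exec(header)?.[1];
  return type !== undefined && ALLOWED_TYPES.includes(type);
}

// Made-up example messages.
const accepted = isSemanticCommit("feat(api): add retry on rate limit");
const rejected = isSemanticCommit("Added some stuff");
```

Under these assumptions the first message is accepted and the second is rejected, because it has no `type:` prefix.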
The project uses automatic release notes generated with GitHub Actions in the semantic-release workflow.
Once a change has been pushed to master, the workflow processes the current commits by running the semantic-release command.