Git scraper

Git scraper is a tool developed by Nikola Nikushev for a university project at the Vienna University of Economics and Business. The tool collects GitHub repository information using a GitHub REST API client.

What content does the scraper collect?

The git scraper collects the information needed to create a user activity analysis. It links a repository with all of its contributors and collects the information relevant to each contributor. The information considered relevant for one project is outlined below:

A user creates a repository.
Multiple users can contribute to a project using one or more of the following methods:
- Adding commits
- Adding comments on pull requests or issues
- Creating issues or pull requests
- Closing issues or pull requests
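
The relationships above can be pictured as a small data model. This is a minimal sketch, not the project's actual types: the names `ActivityKind`, `Activity`, and `countByUser` are hypothetical and chosen only for illustration.

```typescript
// Hypothetical activity kinds, mirroring the contribution methods listed above.
type ActivityKind =
  | "commit"
  | "comment"
  | "issue_opened"
  | "pull_request_opened"
  | "issue_closed"
  | "pull_request_closed";

// One contribution by one user to one repository (illustrative shape).
interface Activity {
  repo: string; // for example "owner/name"
  user: string; // contributor login
  kind: ActivityKind;
}

// Count how many activities each user contributed to a given repository.
function countByUser(activities: Activity[], repo: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const activity of activities) {
    if (activity.repo !== repo) continue;
    counts.set(activity.user, (counts.get(activity.user) ?? 0) + 1);
  }
  return counts;
}
```

Keeping one flat record per activity makes it straightforward to aggregate per user or per repository later on.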

The information relevant for analysis is stored as CSV files in a csv folder. For context, we also store multiple results in JSON format, to show what data produced the CSVs.
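
To illustrate how a JSON record can become a CSV entry, here is a minimal sketch. It is not the code in toCSV.ts; the function names `escapeCsvValue` and `toCsvRow` and the example columns are assumptions made for this example.

```typescript
// Escape one value for CSV: quote it when it contains a comma, quote, or newline.
function escapeCsvValue(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn a flat record into one CSV line, using `columns` to fix the column order.
function toCsvRow(record: Record<string, unknown>, columns: string[]): string {
  return columns.map((column) => escapeCsvValue(record[column])).join(",");
}

// Example record with hypothetical fields.
const exampleRow = toCsvRow(
  { repo: "owner/name", user: "alice", message: 'fix "bug", again' },
  ["repo", "user", "message"]
);
console.log(exampleRow); // owner/name,alice,"fix ""bug"", again"
```

Passing the column list explicitly keeps every row aligned with the header, even when a record is missing a field.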

Setup

To use this project you need Node.js 10 or later.
Obtain a personal GitHub token from the settings page.

Copy the .env.example file to .env with the following contents, replacing each <...> placeholder with your own value:

GITHUB_TOKEN=<your token>
OUTPUT_FOLDER=<your output folder>
SINGLE_CSV_FILE=<true/false>
RETRY_ON_RATE_LIMIT_REACHED=<true/false>
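
As a rough sketch of how these variables might be read and validated, consider the following. It does not reproduce loadEnv.ts: the `ScraperConfig` shape, the function names, and the fallback folder name are all assumptions for illustration.

```typescript
// Illustrative configuration shape; the field names are assumptions.
interface ScraperConfig {
  githubToken: string;
  outputFolder: string;
  singleCsvFile: boolean;
  retryOnRateLimitReached: boolean;
}

type Env = Record<string, string | undefined>;

// Treat "true" (case-insensitive) as true; anything else, including unset, as false.
function parseFlag(value: string | undefined): boolean {
  return (value ?? "").trim().toLowerCase() === "true";
}

// Build a config object from environment-like key/value pairs.
function readConfig(env: Env): ScraperConfig {
  const githubToken = env.GITHUB_TOKEN;
  if (!githubToken) {
    throw new Error("GITHUB_TOKEN is required");
  }
  return {
    githubToken,
    outputFolder: env.OUTPUT_FOLDER || "output", // hypothetical fallback
    singleCsvFile: parseFlag(env.SINGLE_CSV_FILE),
    retryOnRateLimitReached: parseFlag(env.RETRY_ON_RATE_LIMIT_REACHED),
  };
}
```

In a Node.js process this could be called as `readConfig(process.env)` once the .env file has been loaded.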

Run yarn or npm install.

How to use

Assuming you have the project configured, you can set the following variables in the .env file:
- OUTPUT_FOLDER - where the outputs of the project are written
- SINGLE_CSV_FILE - whether all outputs go into one single file, instead of being grouped by project
- RETRY_ON_RATE_LIMIT_REACHED - if you request data for projects with more than 3000 issues, you will get an error saying that you have reached your rate limit. If set to true, the process continues from where it left off when a rate limit error is thrown. Authenticated users have a default limit of 5000 requests per hour; check GitHub's rate limit documentation for current values.

Then you can provide a list of projects inside input.json.

All the projects inside input.json are loaded and their output is written to the OUTPUT_FOLDER.
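
The RETRY_ON_RATE_LIMIT_REACHED behaviour described above could look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: the `RateLimitError` shape, `isRateLimitError`, and `processWithResume` are hypothetical names, and a real API client reports rate limits in its own way.

```typescript
// Hypothetical shape of a rate limit error; real API clients report this differently.
interface RateLimitError {
  rateLimited: true;
  retryAfterMs: number;
}

function isRateLimitError(error: unknown): error is RateLimitError {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as RateLimitError).rateLimited === true &&
    typeof (error as RateLimitError).retryAfterMs === "number"
  );
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Process `items` in order. When `handle` fails with a rate limit error, wait and
// retry the same item, so work resumes where it left off instead of restarting.
async function processWithResume<T>(
  items: T[],
  handle: (item: T) => Promise<void>,
  maxRetriesPerItem = 3
): Promise<number> {
  let index = 0;
  let retries = 0;
  while (index < items.length) {
    try {
      await handle(items[index]);
      index++;
      retries = 0;
    } catch (error) {
      // Any other error, or too many retries for one item, is passed on to the caller.
      if (!isRateLimitError(error) || retries >= maxRetriesPerItem) throw error;
      retries++;
      await sleep(error.retryAfterMs);
    }
  }
  return index; // number of items processed
}
```

Tracking the current index outside the try block is what lets the loop pick up at the failed item rather than at the start of the list.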

Make the CSV unique

If you believe your data might contain duplicates, you can run:

yarn neek --input <your_output_folder>/csv/<fileName>.csv --output pathToOther/unique.csv
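
For readers who want to see the idea behind this step, the sketch below removes exact duplicate lines while keeping the first occurrence of each. It only illustrates line-level deduplication and makes no claim about how neek works internally; `uniqueLines` is a hypothetical helper.

```typescript
// Keep the first occurrence of each line and drop later exact duplicates.
function uniqueLines(content: string): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const line of content.split(/\r?\n/)) {
    if (seen.has(line)) continue;
    seen.add(line);
    kept.push(line);
  }
  return kept.join("\n");
}
```

Because the comparison is on whole lines, two rows count as duplicates only when every column matches exactly.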

Project architecture

The project structure is as follows:

src - Folder with the core code
- example - Folder with examples for a single repo
- api.ts - Main class wrapping the Octokit REST API endpoints that we use
- CustomOctokit.ts - Wraps the Octokit client to configure rate limiting
- index.ts - Main workflow file that loads all the projects inside input.json
- input.json - Holds all the projects that are loaded when we run yarn start
- loadEnv.ts - Uses the dotenv package to load the environment variables
- ProjectLoader.ts - Defines a class that executes the API calls, wraps the responses from the REST API, and writes the outputs to JSON and CSV files
- toCSV.ts - Holds functions that transform a JSON object into a CSV file entry
- writeToFile.ts - Holds functions that write contents to a file. All folders are created recursively, and old files are not overwritten or deleted

Contributing

Make a pull request or post an issue, and tag NikolaNikushev.

Commits

All commits follow the rules for Semantic Commits.

The allowed commit types are detailed in ./commitlint.config.js.
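
As a hedged illustration of what such a rule checks, the sketch below validates a commit header of the form "type(scope): subject". The list of types is an assumption based on the commonly used semantic commit types; the authoritative rules are the ones in ./commitlint.config.js, and `isSemanticCommit` is a hypothetical helper.

```typescript
// Assumed set of allowed types; the authoritative list lives in commitlint.config.js.
const ALLOWED_TYPES = ["feat", "fix", "docs", "style", "refactor", "test", "chore"];

// Matches "type(scope): subject", where the scope is optional.
const COMMIT_PATTERN = /^(\w+)(\([^)]+\))?: (.+)$/;

// Return true when the header has the expected shape and an allowed type.
function isSemanticCommit(header: string): boolean {
  const match = COMMIT_PATTERN.exec(header);
  return match !== null && ALLOWED_TYPES.indexOf(match[1]) !== -1;
}
```

For example, "feat(api): add retry" passes this check, while "updated stuff" does not.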

Release

The project uses automatic release notes generated with GitHub Actions in the semantic-release workflow. Once a change has been pushed to master, the workflow processes the current commits by running the semantic-release command.
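
To give a feel for how commits can drive a release, the sketch below maps commit messages to a release type using the convention semantic-release applies by default: a breaking change gives a major release, a feat gives a minor release, and a fix gives a patch release. This is a simplified illustration; the actual behaviour depends on the configuration in .releaserc, and `releaseTypeFor` is a hypothetical helper.

```typescript
type ReleaseType = "major" | "minor" | "patch" | null;

// Pick the highest release type implied by a list of commit messages.
function releaseTypeFor(commitMessages: string[]): ReleaseType {
  let result: ReleaseType = null;
  for (const message of commitMessages) {
    if (/BREAKING CHANGE/.test(message)) return "major";
    if (/^feat(\(.+\))?:/.test(message)) {
      result = "minor";
    } else if (/^fix(\(.+\))?:/.test(message) && result === null) {
      result = "patch";
    }
  }
  return result; // null means no release is triggered
}
```

Commits of other types, such as docs or chore, leave the result unchanged, so a push containing only those would not produce a release in this sketch.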
