diff --git a/README.md b/README.md index 9e2c02d..12aa4c6 100644 --- a/README.md +++ b/README.md @@ -1,96 +1,50 @@ ![OSCI Logo](images/OSCI_Logo.png) -# OSCI, the Open Source Contributor Index - -## What is OSCI? - -* OSCI ranks corporate contributions to open source based on the organization’s number of Active Contributors to GitHub -* OSCI also tracks the Total Community of open source contributors for these companies -* The OSCI rankings are published monthly to dynamically track corporate contributions to GitHub. The latest result can be found at EPAM SolutionsHub website's [OSCI page](https://opensourceindex.io/) - -## [News update](news.md) -### [July 12th, 2021](news.md#july-12th-2021) - -The OSCI ranking has now been updated with the data for [**June 2021**](https://opensourceindex.io/) - -The table shows the OSCI ranking for GitHub activity in June 2021. The leading organisations remain consistent once again this month and the overall level of activity has stabilised approaching second half of the year. In addition Amazon and IBM are progressing well above their closest neighbours in growth figures this month. - -[Previous updates](news.md) - +# Open Source Contributor Index (OSCI) +OSCI is an open source project that aims to track and measure open source activity on GitHub by commercial organizations. It gives organizations, communities, analysts and individuals involved in open source insight into contribution trends among commercial organizations by providing access to up-to-date data through an intuitive interface. 
+ +### Table of contents +- [How does OSCI work?](#how-does-osci-work) +- [How are commit authors linked to commercial organizations?](#how-are-commit-authors-linked-to-commercial-organizations) +- [How can I submit my company for ranking?](#how-can-i-submit-my-company-for-ranking) +- [How can I contribute to OSCI?](#how-can-i-contribute-to-osci) +- [Quick Start](#quickstart) + * [Installation](#installation) + * [Configuration](#configuration) + * [Sample run](#sample-run) +- [OSCI Versioning](#osci-versioning) +- [License](#license) +- [Contact Us](#contact-us) ## How does OSCI work? -* OSCI analyses push event data from [GH Archive](https://www.gharchive.org/) -* The Author Email address field in the commit event data is used to identify the organization to which the commit author belongs -* OSCI measures the Active Community (10+ commits) and the Total Community (1+ commit) at each organization -* Analysis is done for the current year-to-date -* OSCI’s algorithm is transparently published as an open source project on GitHub - +To create this index, the system processes GitHub push event data from [GH Archive](https://www.gharchive.org/): ![GitHub OSCI Schematic Diagram](images/OSCI_Schematic_Architecture.png) -## OSCI Versioning :newspaper: -We decided to use special versioning `(..)` e.g. `2021.05.0`. This will provide us with a -clearer understanding of the relevance of the product. -Also, date of -[adding a new company](#how-can-i-add-a-company-which-is-missing-from-the-osci-ranking) is very important and versioning -releases depending on the date looks more logical. -This is supposed to be a monthly update of the release. - - -## How did we decide on the ranking logic? - -We realize that there are many approaches which could be used to develop this ranking. We experimented extensively with these before arriving at the logic now used. 
- -We concluded the Author's Email Domain in the Push Event data is the most reliable way to identify the organization to which a commit author belongs. It is a single unambiguous identifier. -It is true that many people on GitHub do not use the email address of their employer or keep it private, but it is still a more reliable single measure compared with alternatives. The org of the repo is not a good measure of the employer of an individual because the employees of most companies maintain projects in multiple GitHub orgs. - -We concluded that 10+ commits in a year - while arbitrary - is a reasonable measure of whether a person is an active contributor to GitHub. -It is interesting to compare this with the size of the broader community at an organization - this is why we also record the number of people with just 1 or more commits in the time period. - -We decided to base the ranking on the number of people making commits, rather than the number of commits. The GitHub push event data includes large numbers of pushes made by automated processes using an email domain from an organization. Counting the number of pushes is not a good measure of the community at an organization, when the results are considerably skewed by these automated processes. +OSCI tracks two measures at each organization: + - **Active contributors**, the number of people who authored 10 or more commits over a period of time + - **Total community**, the number of people who made at least one commit over a period of time -Our technical design assessed multiple source of data and we concluded that [GHArchive](https://www.gharchive.org/) is best suited for our needs, based on ease of access to the data, the amount of data it records, and the size of the data. -[GHTorrent](http://ghtorrent.org/) is also very interesting for future use, since it contains a richer set of data, however the data sizes are considerably larger which drove our decision to go with GHArchive. 
+## How are commit authors linked to commercial organizations? -## What does OSCI not do? +The system uses the email domain of the commit author to identify the organization. Is your organization missing from the ranking? Feel free to add it to the list. -OSCI does not include educational and research institutions, contributions from free email providers, etc. The focus is on commercial organizations. +*Note: OSCI does not rank open source activity contributed by universities, research institutions and individual entrepreneurs.* -## Prior work in this area +## How can I submit my company for ranking? -Our inspiration for OSCI is the work done earlier in the Open Source community, so we wish to give credit to these: -* GitHub published an analysis for 2016 which included the organizations with the most open source contributors https://octoverse.github.com/2016/. GitHub have also published similar studies in 2017 and 2018. -* Felipe Hoffa published a detailed analysis of 2017 data at https://medium.freecodecamp.org/the-top-contributors-to-github-2017-be98ab854e87. Felipe's logic used email domain to identify organizations, counted commits only to projects with more than 20 stars during the period, and counted users with more than 3 pushes during the period. Further details are available in the article. -* Fil Maj also analysed 2017 data and counted users with 10 or more commits during the period https://www.infoworld.com/article/3253948/who-really-contributes-to-open-source.html - - -## Where can I see the latest rankings -This project is created by EPAM Systems and the latest results are visible on the [OSCI page](https://opensourceindex.io/). The results will be updated and published each month. 
- -## How I can contribute to OSCI -If you would like to contribute to OSCI, please take a look at our guidelines [here.](CONTRIBUTING.md) - -## What if your think your organization is missing or you believe there is an error in our logic -If your organization is missing from our ranking then simply follow the instructions below to modify the companies filter and add your own organisation. We're also more than happy to listen to any feedback you have that may help us to improve. Contact us at [OSCI@epam.com](mailto:OSCI@epam.com) to share your feedback and raise any questions. - -## How can I add a company which is missing from the OSCI ranking -The goal of the OSCI is to rank the GitHub contributions by companies (commercial organizations). - -In order to add a company to the OSCI ranking, do the following: - -1) Check whether the organization you propose to add matches our definition of a company: - - is not an educational, governmental, non-profit or research institution; - - is not a free-mail, like gmail.com, yahoo.com, hotmail.com, etc; - - is a registered, commercial company; - - a simple "rule of thumb" - does the organization's website sell a product or service? If not, it is probably not a company by our definition. +1) Check whether the organization you propose to add matches the OSCI definition of a commercial organization: + - it is not an educational, governmental, non-profit or research institution; + - it is a registered commercial organization; + - it sells goods or services for the purpose of making a profit. 1) Create a new pull request. 1) Go to company domain match list ([company_domain_match_list.yaml](osci/preprocess/match_company/company_domain_match_list.yaml)) -1) Confirm that the company you wish to add is not listed. +1) Double check that the organization you want to add is not listed. -1) Add the **main domain** of the company and the company name to the table. For example: +1) Add the **email domain** of the company and the company name to the table. 
For example: ```yaml - company: Facebook domains: @@ -134,16 +88,18 @@ In order to add a company to the OSCI ranking, do the following: industry: Media & Telecoms ``` -We will review your pull request and if it matches our requirements, we will merge it. -It's important to add at **the start -of the month** a new company, because the rating depends on previous data, i.e. data for the beginning of the month. -Furthermore, this will lead to OSCI release consistency. +Our team will review your pull request and merge it if everything is correct. -# QuickStart -## Technical Note -We built OSCI this an Azure cloud environment using Azure DataFactory, Azure Function and Azure HDInsight. -The code published here on GitHub does not require the Azure cloud. You can reproduce everything in the corresponding instruction with the CLI (command line interface). -## Installation +*Note: since OSCI processes the data for the previous month, you'll see your organization's rank at the beginning of the next month.* + +## How can I contribute to OSCI? +See [CONTRIBUTING.md](CONTRIBUTING.md) for details on the contribution process. + +## QuickStart +OSCI is deployed in an Azure cloud environment using Azure Data Factory, Azure Functions and Azure Databricks. However, the code available on GitHub does not require Azure. +Run the application from the command line using the instructions below. + +### Installation 1) Clone repository ```shell script git clone https://github.com/epam/OSCI.git @@ -157,11 +113,11 @@ The code published here on GitHub does not require the Azure cloud. You can repr pip install -r requirements.txt ``` -## Configuration +### Configuration Create a file `local.yml` (by default this file added to .gitignore) in the directory [`osci/config/files`](osci/config/files). 
A sample file [`default.yml`](osci/config/files/default.yml) is included, please don't change values in this file -## Sample run +### Sample run 1) Run script to download data from archive (for example for 01 January 2020) ```shell script python3 osci-cli.py get-github-daily-push-events -d 2020-01-01 @@ -174,6 +130,12 @@ A sample file [`default.yml`](osci/config/files/default.yml) is included, please ```shell script python3 osci-cli.py daily-osci-rankings -td 2020-01-02 ``` - -# License + +## OSCI Versioning +For OSCI versioning we adopted the following approach `..`) e.g. 2021.05.0. We expect regular monthly updates, including releases associated with the submission of a new company for ranking. + +## License OSCI is licensed under the [GNU General Public License v3.0](LICENSE). + +## Contact Us +For support or help using OSCI, please contact us at [OSCI@epam.com](mailto:OSCI@epam.com). diff --git a/archive/2019.08_OSCI_Ranking.xlsx b/archive/2019.08_OSCI_Ranking.xlsx deleted file mode 100644 index e9d212e..0000000 Binary files a/archive/2019.08_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2019.09_OSCI_Ranking.xlsx b/archive/2019.09_OSCI_Ranking.xlsx deleted file mode 100644 index fc7c7e1..0000000 Binary files a/archive/2019.09_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2019.10_OSCI_Ranking.xlsx b/archive/2019.10_OSCI_Ranking.xlsx deleted file mode 100644 index 3b16fb4..0000000 Binary files a/archive/2019.10_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2019.11_OSCI_Ranking.xlsx b/archive/2019.11_OSCI_Ranking.xlsx deleted file mode 100644 index f6585f9..0000000 Binary files a/archive/2019.11_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2019.12_OSCI_Ranking.xlsx b/archive/2019.12_OSCI_Ranking.xlsx deleted file mode 100644 index caaff5e..0000000 Binary files a/archive/2019.12_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2020.01_OSCI_Ranking.xlsx 
b/archive/2020.01_OSCI_Ranking.xlsx deleted file mode 100644 index 127e85b..0000000 Binary files a/archive/2020.01_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2020.02_OSCI_Ranking.xlsx b/archive/2020.02_OSCI_Ranking.xlsx deleted file mode 100644 index 4fb3314..0000000 Binary files a/archive/2020.02_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2020.03_OSCI_Ranking.xlsx b/archive/2020.03_OSCI_Ranking.xlsx deleted file mode 100644 index 3d65875..0000000 Binary files a/archive/2020.03_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2020.04_OSCI_Ranking.xlsx b/archive/2020.04_OSCI_Ranking.xlsx deleted file mode 100644 index 0ba5ce3..0000000 Binary files a/archive/2020.04_OSCI_Ranking.xlsx and /dev/null differ diff --git a/archive/2020.05_OSCI_Ranking.xlsx b/archive/2020.05_OSCI_Ranking.xlsx deleted file mode 100644 index c050d6f..0000000 Binary files a/archive/2020.05_OSCI_Ranking.xlsx and /dev/null differ diff --git a/images/OSCI_Logo.png b/images/OSCI_Logo.png index 234c146..f1f7e13 100644 Binary files a/images/OSCI_Logo.png and b/images/OSCI_Logo.png differ diff --git a/news.md b/news.md deleted file mode 100644 index 2432961..0000000 --- a/news.md +++ /dev/null @@ -1,272 +0,0 @@ -# News update - -## Navigation - -| [2019][2019] | | | | -|------ |----- |----- |----- | -| Jan | Feb | Mar | Apr | -| May | Jun | Jul | Aug | -| Sep | [Oct][2019.10] | [Nov][2019.11] | [Dec][2019.10] | - -| [2020][2020] | | | | -|------ |----- |----- |----- | -| [Jan][2020.01] | [Feb][2020.02] | [Mar][2020.03] | [Apr][2020.04] | -| [May][2020.05] | [Jun][2020.06] | [Jul][2020.07] | [Aug][2020.08] | -| [Sep][2020.09] | [Oct][2020.10] | [Nov][2020.11] | [Dec][2020.12] | - -| [2021][2021] | | | | -|------ |----- |----- |----- | -| Jan | [Feb][2021.02] | [Mar][2021.03] | [Apr][2021.04] | -| [May][2021.05] | [Jun][2021.06] | [Jul][2021.07] | Aug | -| Sep | Oct | Nov | Dec | - -## 2021 updates - -### July 12th, 2021 -The OSCI ranking has now been 
updated with the data for [**June 2021**][opensourceindex.io] - -The table shows the OSCI ranking for GitHub activity in June 2021. The leading organisations remain consistent once again this month and the overall level of activity has stabilised approaching second half of the year. In addition Amazon and IBM are progressing well above their closest neighbours in growth figures this month. - -## 2021 updates - -### June 9th, 2021 -The OSCI ranking has now been updated with the data for [**May 2021**][opensourceindex.io] - -The table below shows the OSCI ranking for GitHub activity in May 2021. Microsoft shows the biggest growth in Active Contributors this month and remains the top organization. In addition, Unity Technologies, HashiCorp and Linkedin each climbed 2 or 3 positions, demonstrating a consistent upward trend in their growth. - - -### May 21st 2021 -The OSCI ranking has now been updated with the data for [**April 2021**][rank-2021.04] - -The leading organizations remain consistent once again this month, although a slight decrease in growth of Active Contributor is noticeable this month. In addition Amazon, NVidia and Huawei are all progressing well above their closest neighbours in growth figures this month. - -[Download rank][rank-2021.04] - -### April 5th 2021 -The OSCI ranking has now been updated with the data for [**March 2021**][rank-2021.03] - -This month Microsoft has reached the top spot in the ranking. Huawei, Tencent and Alibaba climbed in to the Top 20 following a strong upwards trend in activity from the start of the year. - -[Download rank][rank-2021.03] - - -### March 3rd 2021 -The OSCI ranking has now been updated with the data for [**February 2021**][rank-2021.02] - -A significant observation this month is the overall increase in the level of engagement compared to February 2020; suggesting a long-term upward trend in the growth of corporate Open Source activity. 
- -[Download rank][rank-2021.02] - -### February 10th 2021 -The OSCI ranking has now been updated with the data for [**January 2021**][rank-2021.01] - -We’ve started the new year with an improved algorithm that considers Open Source licenses as a factor of measuring contributions. We’re looking forward to seeing which organisations continue their commitment to Open Source this year and which new comers climb the rankings. - -[Download rank][rank-2021.01] - -### February 2nd 2021 -The OSCI ranking has now been updated with the data for [**December 2020**][rank-2020.12] - -The leading organizations remain consistent once again this month, although the overall trend in a decrease of Active Contributor growth at year end is noticeable. Organizations such as Intel, GitHub and Apple are well above their closest neighbors in their growth figures this month. - -[Download rank][rank-2020.12] - ---- -## 2020 updates -### December 2nd 2020 -The OSCI ranking has now been updated with the data for [**November 2020**][rank-2020.11] - -The top leading organisations remain the same once again this month. There’s steady growth in the total Active Contributors from Huawei, Tencent and Alibaba too. - -[Download rank][rank-2020.11] - ---- -### November 16th 2020 -The OSCI ranking has now been updated with the data for [**October 2020**][rank-2020.10] - -The latest data continues to show Google solidifying their hold on the #1 position, with both Verizon & Capgemini making impressive gains this month too. - -[Download rank][rank-2020.10] - ---- -### October 6th 2020 -The OSCI ranking has now been updated with the data for [**September 2020**][rank-2020.09] - -The latest data continues to show Google solidifying their hold on the #1 position, and we observe significantly above-average growth in contributors at both Huawei and Tencent, reflecting the rise in open source activity in Asia. 
- -[Download rank][rank-2020.09] - ---- -### September 8th 2020 -The OSCI ranking has now been updated with the data for [**August 2020**][rank-2020.08] - -The latest data shows that Google have strongly established themselves in the #1 position. - -[Download rank][rank-2020.08] - ---- - -### August 5th 2020 - -The OSCI ranking has now been updated with the data for [**July 2020**][rank-2020.07] - -[Download rank][rank-2020.07] - ---- - -### July 6th 2020 - -The OSCI ranking has now been updated with the data for [**June 2020**][rank-2020.06] - -This month we added a further 24 companies to our analysis, and this resulted in 2 new companies appearing in the top 100 - Dell and Rockchip. - -[Download rank][rank-2020.06] - ---- - -### June 9th 2020 - -The OSCI ranking has now been updated with the data for [**May 2020**][rank-2020.05] - -This month we added domains for Alibaba resulting in them leaping up 8 places. - -[Download rank][rank-2020.05] - ---- - -### May 12th 2020 - -The OSCI ranking has now been updated with the data for [**April 2020**][rank-2020.04] - -This month we revised our algorithm to include 165 new companies and many new email domains, resulting in 27 new companies appearing in the top 100. - -[Download rank][rank-2020.04] - ---- - -### April 7th 2020 - -The OSCI ranking has now been updated with the data for [**March 2020**][rank-2020.03] - -[Download rank][rank-2020.03] - ---- - -### March 10th 2020 - -The OSCI ranking has now been updated with the data for [**February 2020**][rank-2020.02] - -We have also published the [OSCI 2016-2019 Deep Dive](https://solutionshub.epam.com/rise-of-open-source) which analyses the OSCI ranking changes from 2016 until now. 
- -[Download rank][rank-2020.02] - ---- - -### February 6th 2020 - -The OSCI ranking has now been updated with the data for [**January 2020**][rank-2020.01] - -[Download rank][rank-2020.01] - ---- - -### January 2nd 2020 - -The OSCI ranking has now been updated with the data for [**December 2019**][rank-2019.12] - -[Download rank][rank-2019.12] - ---- - - -## 2019 updates -### December 17th 2019 - -The OSCI ranking has now been updated with the data for [**November 2019**][rank-2019.11] - -[Download rank][rank-2019.11] - ---- - -### November 14th 2019 - -The OSCI ranking has now been updated with the data for [**October 2019**][rank-2019.10] -The OSCI ranking has now been extended to the top 50 companies, and updated to the end of October. - -[Download rank][rank-2019.10] - ---- - -### October 10th 2019 - -The OSCI ranking has now been updated with the data for [**September 2019**][rank-2019.09] and [**August 2019**][rank-2019.08] - -[Download rank][rank-2019.09] - ---- - - -[Archive]: Archive - -[rank-2019.08]: Archive/2019.08_OSCI_Ranking.xlsx -[rank-2019.09]: Archive/2019.09_OSCI_Ranking.xlsx -[rank-2019.10]: Archive/2019.10_OSCI_Ranking.xlsx -[rank-2019.11]: Archive/2019.11_OSCI_Ranking.xlsx -[rank-2019.12]: Archive/2019.12_OSCI_Ranking.xlsx - -[rank-2020.01]: Archive/2020.01_OSCI_Ranking.xlsx -[rank-2020.02]: Archive/2020.02_OSCI_Ranking.xlsx -[rank-2020.03]: Archive/2020.03_OSCI_Ranking.xlsx -[rank-2020.04]: Archive/2020.04_OSCI_Ranking.xlsx -[rank-2020.05]: Archive/2020.05_OSCI_Ranking.xlsx -[rank-2020.06]: Archive/2020.06_OSCI_Ranking.xlsx -[rank-2020.07]: Archive/2020.07_OSCI_Ranking.xlsx -[rank-2020.08]: Archive/2020.08_OSCI_Ranking.xlsx -[rank-2020.09]: Archive/2020.09_OSCI_Ranking.xlsx -[rank-2020.10]: Archive/2020.10_OSCI_Ranking.xlsx -[rank-2020.11]: Archive/2020.11_OSCI_Ranking.xlsx -[rank-2020.12]: Archive/2020.12_OSCI_Ranking.xlsx - -[rank-2021.01]: archive/2021.01_OSCI_Ranking.xlsx -[rank-2021.02]: archive/2021.02_OSCI_Ranking.xlsx 
-[rank-2021.03]: archive/2021.03_OSCI_Ranking.xlsx -[rank-2021.04]: archive/2021.04_OSCI_Ranking.xlsx -[opensourceindex.io]: https://opensourceindex.io/ - - -[2019]: #2019-updates - -[2019.10]: #october-10th-2019 -[2019.11]: #november-14th-2019 -[2019.12]: #december-17th-2019 - - -[2020]: #2020-updates - -[2020.01]: #january-2nd-2020 -[2020.02]: #february-6th-2020 -[2020.03]: #march-10th-2020 -[2020.04]: #april-7th-2020 -[2020.05]: #may-12th-2020 -[2020.06]: #june-9th-2020 -[2020.07]: #july-6th-2020 -[2020.08]: #august-5th-2020 -[2020.09]: #september-8th-2020 -[2020.10]: #october-6th-2020 -[2020.11]: #november-16th-2020 -[2020.12]: #december-2nd-2020 - - -[2021]: #2021-updates - - -[2021]: #2021-updates - -[2021.02]: #february-10th-2021 -[2021.03]: #march-3rd-2021 -[2021.04]: #april-5th-2021 -[2021.05]: #may-21st-2021 -[2021.06]: #june-9th-2021 -[2021.07]: #july-12th-2021 - - diff --git a/osci/__init__.py b/osci/__init__.py index a5f3856..b156011 100644 --- a/osci/__init__.py +++ b/osci/__init__.py @@ -1 +1 @@ -__version__ = '2021.12.0' +__version__ = '2021.12.1' diff --git a/osci/preprocess/match_company/company_domain_match_list.yaml b/osci/preprocess/match_company/company_domain_match_list.yaml index ebddd79..1c4393a 100644 --- a/osci/preprocess/match_company/company_domain_match_list.yaml +++ b/osci/preprocess/match_company/company_domain_match_list.yaml @@ -1110,7 +1110,7 @@ - company: Optum domains: - optum.com - industry: Healthcare and Pharma + industry: Healthcare & Pharma regex: - company: Oracle domains: @@ -1162,7 +1162,7 @@ - company: Philips domains: - philips.com - industry: Healthcare + industry: Healthcare & Pharma regex: - company: Pilz domains: @@ -1572,7 +1572,7 @@ - company: United Healthcare domains: - uhc.com - industry: Healthcare and Pharma + industry: Healthcare & Pharma regex: - company: Unity Technologies domains: @@ -1675,7 +1675,7 @@ - company: datavisyn domains: - datavisyn.io - industry: Healthcare and Pharma + industry: Healthcare & 
Pharma regex: - company: eyeo domains: