Improve identification of committers' organizations #24

irynastr · 2020-08-14T12:55:20Z

The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.

We would like to improve the identification of committers' organizations using the data in their user profiles.

We have already run an experiment to do this, with minimal success. It is described below.
The basic matching algorithm works like this:

1. The domain is extracted from the committer's email address.
2. The domain is compared, case-insensitively, with the list of company domains (google.com, microsoft.com, etc.).
3. If no match is found, a regular-expression check is performed to handle domains of the third level and deeper.
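
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the OSCI code: the `COMPANY_DOMAINS` sample, the function name `match_company_by_email`, and the exact regular expression are assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical sample of the company-domain list; the real list is much larger.
COMPANY_DOMAINS = {
    "google.com": "Google",
    "microsoft.com": "Microsoft",
}

# One pattern per company domain, matching any deeper-level subdomain of it
# (for example "research.microsoft.com"), but not look-alikes such as "notgoogle.com".
SUBDOMAIN_PATTERNS = {
    domain: re.compile(r"\." + re.escape(domain) + r"$")
    for domain in COMPANY_DOMAINS
}


def match_company_by_email(email: str) -> Optional[str]:
    """Return the company for an email address, or None if nothing matches."""
    # Step 1: take the domain from the committer's email.
    _, sep, domain = email.rpartition("@")
    if not sep or not domain:
        return None
    domain = domain.strip().lower()  # comparison is case-insensitive

    # Step 2: exact comparison with the list of company domains.
    if domain in COMPANY_DOMAINS:
        return COMPANY_DOMAINS[domain]

    # Step 3: regular-expression fallback for third-level and deeper domains.
    for known, pattern in SUBDOMAIN_PATTERNS.items():
        if pattern.search(domain):
            return COMPANY_DOMAINS[known]
    return None
```

With this sketch, `dev@Google.com` and `dev@research.microsoft.com` resolve to a company, while a personal or noreply address returns `None`.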

If the basic algorithm produced no match, an extended algorithm was proposed:

1. The user's GitHub profile information is downloaded.
2. The website field is taken from the profile. If it is empty, go to step 3. Otherwise, the basic algorithm is applied to the specified domain; if that yields no match, go to step 3.
3. The company field is taken and compared with the list of companies that we are processing, using fuzzy matching algorithms (Levenshtein distance, Sørensen-Dice coefficient, etc.).
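
A rough sketch of steps 2 and 3, for readers who want to see the shape of the fallback. Everything here is illustrative: the `COMPANIES` sample, the thresholds `max_distance` and `min_dice`, and the function names are assumptions, and the profile is assumed to be an already-downloaded dict using the `blog` and `company` keys of the GitHub REST API user object. The basic domain matcher is passed in as `match_domain` rather than reimplemented.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical sample of the list of companies being processed.
COMPANIES = ["Google", "Microsoft", "Red Hat"]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution
            ))
        previous = current
    return previous[-1]


def dice_coefficient(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over character bigrams, in [0, 1]."""
    bigrams_a = [a[i:i + 2] for i in range(len(a) - 1)]
    bigrams_b = [b[i:i + 2] for i in range(len(b) - 1)]
    if not bigrams_a or not bigrams_b:
        return 1.0 if a == b else 0.0
    remaining = list(bigrams_b)  # treated as a multiset
    overlap = 0
    for bigram in bigrams_a:
        if bigram in remaining:
            remaining.remove(bigram)
            overlap += 1
    return 2 * overlap / (len(bigrams_a) + len(bigrams_b))


def normalize(name: str) -> str:
    """Lower-case, drop a leading '@' style mention, collapse whitespace."""
    return " ".join(name.lower().replace("@", " ").split())


def match_company_field(company: str, max_distance: int = 1,
                        min_dice: float = 0.8) -> Optional[str]:
    """Step 3: fuzzy-compare the company field with the company list."""
    candidate = normalize(company)
    if not candidate:
        return None
    best_name, best_score = None, -1.0
    for known in COMPANIES:
        target = normalize(known)
        score = dice_coefficient(candidate, target)
        close = levenshtein(candidate, target) <= max_distance or score >= min_dice
        if close and score > best_score:
            best_name, best_score = known, score
    return best_name


def match_company_from_profile(
    profile: dict, match_domain: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Steps 2 and 3 of the extended algorithm on a downloaded profile."""
    website = (profile.get("blog") or "").strip()
    if website:
        try:
            # urlparse only fills in the host when a scheme or "//" is present.
            parsed = urlparse(website if "//" in website else "//" + website)
            host = (parsed.hostname or "").lower()
        except ValueError:
            host = ""
        if host:
            company = match_domain(host)  # step 2: basic algorithm on the domain
            if company:
                return company
    return match_company_field(profile.get("company") or "")  # step 3
```

Loosening `max_distance` or `min_dice` corresponds to the "milder match rules" mentioned in the results; the trade-off is more false positives on short company names.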

Result of experiment:
We managed to match a company from the profile for only 38% of the users examined. The remaining profiles did not have a clear match. Relaxing the match rules adds only another 5%.

It is also worth noting that in every case where we managed to match a company from the profile, it was the same company we had already obtained from the email.

Finally, this method of identifying a committer's company carries a large overhead. Implementing it would require downloading profile information for every user who made push events in 2020. As of June 2020 there were 5M such users, and loading their profiles would take about 42 days under the GitHub API usage limits. We would also have to load new profiles every day, and that download may, in turn, not fit into the daily usage limits.
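
The 42-day figure is consistent with the standard authenticated GitHub REST API limit of 5,000 requests per hour, assuming one request per profile and a single token. The issue does not say which limit was used, so the inputs below are assumptions; the function name `days_to_fetch` is made up for this example.

```python
def days_to_fetch(profiles: int, requests_per_hour: int = 5000, tokens: int = 1) -> float:
    """Days needed to download `profiles` at one API request per profile."""
    hours = profiles / (requests_per_hour * tokens)
    return hours / 24


# 5,000,000 profiles / 5,000 requests per hour = 1,000 hours
print(round(days_to_fetch(5_000_000), 1))  # 41.7
```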

octogonz · 2022-01-03T23:38:10Z

Including your real email address in public Git commits is a great way to invite spam. 🙁

Instead, the best practice is to commit using an anonymized address like 4673363+octogonz@users.noreply.github.com. GitHub itself uses anonymized addresses when it generates commits such as a PR merge.
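
A noreply address like this carries no company domain, so email-domain matching cannot place it, but the GitHub login can still be recovered from it for a profile lookup. A small sketch, assuming the two address shapes GitHub uses (`ID+login@users.noreply.github.com` and the older `login@users.noreply.github.com`); the function name `login_from_noreply` is hypothetical.

```python
import re
from typing import Optional

# Optional numeric ID and "+", then the login, then the fixed noreply domain.
NOREPLY = re.compile(
    r"(?:(?P<id>\d+)\+)?(?P<login>[A-Za-z0-9-]+(?:\[bot\])?)@users\.noreply\.github\.com",
    re.IGNORECASE,
)


def login_from_noreply(email: str) -> Optional[str]:
    """Return the GitHub login embedded in a noreply address, else None."""
    match = NOREPLY.fullmatch(email.strip())
    return match.group("login") if match else None
```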

octogonz · 2022-01-03T23:54:43Z

> Finally, this method of identifying company carries a large overhead. When implementing this approach, it will be necessary to download information for all users who made push events in 2020, their number (as of June 2020) is 5M - loading their profiles will take about 42 days calculating with GitHub API usage limits.

This overhead seems worthwhile and a valuable service to the community. 👍👍 Perhaps GitHub would be willing to publish the aggregated data directly, if asked nicely.

> We would also have to additionally load new profiles every day, and their download, in turn, may not fit into the daily usage limits.

Why? An annual ranking would be sufficient for most purposes. Certainly the ranking is not going to change substantially from day to day.

jeffwilcox · 2022-02-25T12:45:41Z

Regarding the volume of GitHub API calls - we use Conditional Requests and cache responses for so many things on GitHub. As long as someone's profile doesn't change, you aren't charged an API call, for example.
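
For readers unfamiliar with the technique, here is a minimal sketch of an ETag-based conditional-request cache. It is not the code used at any company in this thread: `ProfileCache`, the `Transport` shape, and `make_fake_transport` are all invented for the example, and the fake transport stands in for a real HTTP client so the sketch runs offline. The idea is to store the `ETag` of each response and send it back as `If-None-Match`; per GitHub's documentation, a `304 Not Modified` reply to a properly authorized request does not count against the primary rate limit.

```python
from typing import Callable, Dict, Optional, Tuple

# (status, headers, body) as returned by a transport; a hypothetical shape.
Response = Tuple[int, Dict[str, str], Optional[dict]]
Transport = Callable[[str, Dict[str, str]], Response]


class ProfileCache:
    """Caches response bodies by URL and revalidates them with If-None-Match."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self._entries: Dict[str, Tuple[str, dict]] = {}  # url -> (etag, body)

    def get(self, url: str) -> dict:
        headers: Dict[str, str] = {}
        cached = self._entries.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]  # ask "has this changed?"
        status, response_headers, body = self._transport(url, headers)
        if status == 304 and cached:
            return cached[1]  # unchanged: reuse the stored body
        if status == 200 and body is not None:
            etag = response_headers.get("ETag")
            if etag:
                self._entries[url] = (etag, body)
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")


def make_fake_transport(profile: dict, etag: str = '"v1"'):
    """Stand-in for an HTTP client: replies 304 when the ETag matches."""
    calls = {"full": 0, "not_modified": 0}

    def transport(url: str, headers: Dict[str, str]) -> Response:
        if headers.get("If-None-Match") == etag:
            calls["not_modified"] += 1
            return 304, {"ETag": etag}, None
        calls["full"] += 1
        return 200, {"ETag": etag}, profile

    return transport, calls
```

In this sketch the first `get` for a URL is a full download and later calls are revalidations, so a daily refresh of unchanged profiles would consume far less quota than the initial load.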

Might help make it more of a reality...

At our company, we ask employees to try and maintain a professional profile, and we internally allow them to choose to tell us who they are...

- by proving control and authenticating to both our corporate systems and their GitHub account via OAuth, creating a "link" (this data is not broadly available outside the company, but helps us improve our own data)
- by noting the company name in their profile
- by publicizing their membership in a company org on their profile
- by using their corporate email in public

Hard problem to do at scale, for sure. Happy to help brainstorm.

anausa4eva · 2022-03-10T07:41:38Z

Hey Jeff,

Thanks for your comments! After much analysis, we concluded it was best to use the email address of the commit author to identify the organization to which they belong. Otherwise we lose almost 80% of contributions, because a lot of engineers don't note their companies in their profiles.
However, we're really keen to explore your first idea about authenticating to both the corporate system and the GitHub account. Do you mean that employees make a commit and then authorize it in your corporate system?

Sealjay · 2022-11-16T18:31:55Z

I can confirm that we take the same approach at @Avanade - and whilst I love the OSCI tool, I worry that the contribution data could become inaccurate over time.

Taking a recent update to the companies list as an example - Release v2022.09.0 (#144) · epam/OSCI@cbf6b35 (github.com) - if we assume that James, Mohit, Guilherme and Justin work at Credera, Infosys, Farfetch, and eBay, then only Mohit’s contribution would have been associated with Infosys, as the others are contributing with personal email addresses or users.noreply emails.

We ask employees to use one GitHub account - and then complete the Organization field / be invited to the Avanade GitHub org. Employees log into both GitHub & our own corporate system.

That way, if someone moves to another organization, they can keep their private commit history, which we feel provides them with a good "CV" and we'd want to support people wherever they choose to go in their career.

This is a GitHub native feature too, if you use something like GitHub Enterprise - https://docs.github.com/en/enterprise-server@3.7/admin/user-management/managing-users-in-your-enterprise/viewing-people-in-your-enterprise

irynastr added the enhancement New feature or request label Aug 25, 2020

patrickstephens1 changed the title ~~Improve identification of commiters' organizations~~ Improve identification of committers' organizations Aug 25, 2020
