Improve identification of committers' organizations #24

Open
irynastr opened this issue Aug 14, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@irynastr
Contributor

The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.

We would like to improve the identification of committers' organizations using the data in their user profiles.

We have already run an experiment to do this, but with minimal success. It is described below.
The basic matching algorithm works like this:

  1. The domain is extracted from the committer's email;
  2. The domain is compared, case-insensitively, with the list of company domains (google.com, microsoft.com, etc.);
  3. If no match is found, a regular-expression check is performed to handle domains with three or more levels (subdomains).
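
The three steps above could look roughly like the sketch below. It is an illustration and not the actual OSCI code: `COMPANY_DOMAINS`, `email_domain`, and `match_company` are hypothetical names, the two-entry domain list is a stand-in for the real one, and the subdomain pattern is only one possible reading of step 3.

```python
import re

# Hypothetical sample of the company-domain list; the real list is much larger.
COMPANY_DOMAINS = {
    "google.com": "Google",
    "microsoft.com": "Microsoft",
}

# One pattern per company that also accepts subdomains, e.g. "research.google.com".
SUBDOMAIN_PATTERNS = {
    re.compile(r"(?:^|\.)" + re.escape(domain) + r"$", re.IGNORECASE): company
    for domain, company in COMPANY_DOMAINS.items()
}


def email_domain(email):
    """Return the lower-cased part after the last '@', or None if there is none."""
    _, sep, domain = email.rpartition("@")
    domain = domain.strip().lower()
    return domain if sep and domain else None


def match_company(email):
    """Step 1: extract the domain. Step 2: exact match. Step 3: subdomain regex."""
    domain = email_domain(email)
    if domain is None:
        return None
    # Step 2: the domain is already lower-cased, so this lookup ignores case.
    if domain in COMPANY_DOMAINS:
        return COMPANY_DOMAINS[domain]
    # Step 3: fallback for domains with three or more levels.
    for pattern, company in SUBDOMAIN_PATTERNS.items():
        if pattern.search(domain):
            return company
    return None
```

Anchoring the pattern on a leading dot keeps look-alike domains such as `notgoogle.com` from matching `google.com`.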

If after applying the basic algorithm the matching did not occur, an extended algorithm was proposed:

  1. The user's GitHub profile information is loaded;
  2. The website field is taken from the profile. If it is empty, go to step 3. Otherwise, the basic algorithm is applied to the domain of that website; if it finds no match, go to step 3;
  3. The company field is taken and compared with the list of companies that we process, using fuzzy matching algorithms (Levenshtein distance, Sørensen-Dice coefficient, etc.).
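
As a rough illustration of step 3, the sketch below compares a profile's company field against a list of known names using the two measures mentioned. The function names, the normalization rule, and the thresholds (`max_distance=2`, `min_dice=0.8`) are assumptions made for this example and are not the values used in the experiment.

```python
from collections import Counter


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dice_coefficient(a, b):
    """Sørensen-Dice coefficient over character bigrams (multiset version)."""
    grams_a, grams_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def normalize(name):
    """Trim, drop a leading '@' (common in company fields), and lower-case."""
    return name.strip().lstrip("@").strip().lower()


def match_company_field(company_field, known_companies, max_distance=2, min_dice=0.8):
    """Return the best-scoring known company that passes either threshold, else None."""
    candidate = normalize(company_field)
    if not candidate:
        return None
    best, best_score = None, 0.0
    for company in known_companies:
        target = normalize(company)
        score = dice_coefficient(candidate, target)
        close = levenshtein(candidate, target) <= max_distance or score >= min_dice
        if close and (best is None or score > best_score):
            best, best_score = company, score
    return best
```

For instance, `match_company_field("@Microsft", ["Google", "Microsoft"])` returns `"Microsoft"`. An absolute edit-distance threshold is risky for very short names (two unrelated two-letter names are at most two edits apart), so a real implementation would probably scale the threshold with name length.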

Result of experiment:
We managed to match a company from the profile for only 38% of the users examined. The remaining profiles had no clear match. Relaxing the match rules adds only another 5%.

It is also worth noting that for users where we managed to match their company from their profile, the company is the same as that received from the email in all cases.

Finally, this method of identifying a company carries a large overhead. Implementing it would require downloading profile information for all users who made push events in 2020. As of June 2020 that is about 5M users, and loading their profiles would take about 42 days under GitHub API usage limits. We would also have to load new profiles every day, and those downloads may not fit within the daily usage limits.
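
The 42-day figure can be reproduced with a quick calculation, assuming one request per profile and the standard limit of 5,000 requests per hour for an authenticated GitHub REST API user (the issue does not state which limit was used, so this is an assumption):

```python
PROFILES = 5_000_000        # users with push events in 2020, per the figure above
REQUESTS_PER_HOUR = 5_000   # assumed limit for one authenticated token


def days_to_fetch(profiles, requests_per_hour):
    """Days needed at one request per profile, running at the limit around the clock."""
    hours = profiles / requests_per_hour
    return hours / 24


print(round(days_to_fetch(PROFILES, REQUESTS_PER_HOUR), 1))  # 41.7
```

That is 1,000 hours of continuous requests, or just under 42 days.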

@irynastr irynastr added the enhancement New feature or request label Aug 25, 2020
@patrickstephens1 patrickstephens1 changed the title from "Improve identification of commiters' organizations" to "Improve identification of committers' organizations" Aug 25, 2020
@octogonz

octogonz commented Jan 3, 2022

Including your real email address in public Git commits is a great way to invite spam. 🙁

Instead, the best practice is to commit using an anonymized address like 4673363+octogonz@users.noreply.github.com. GitHub itself uses anonymized addresses when it generates commits such as a PR merge.
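
Addresses of this kind carry no organizational domain, so domain matching cannot classify them, but they do embed the GitHub login. A minimal sketch of recognizing them follows; `parse_noreply` is a hypothetical helper, and the pattern covers the `ID+login` form shown above plus the older form without a numeric prefix.

```python
import re

NOREPLY = re.compile(
    r"^(?:(?P<user_id>\d+)\+)?(?P<login>[^@+\s]+)@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def parse_noreply(email):
    """Return (user_id, login) for a GitHub noreply address, else None."""
    match = NOREPLY.match(email.strip())
    if match is None:
        return None
    user_id = match.group("user_id")
    return (int(user_id) if user_id else None, match.group("login"))
```

For the address above this returns `(4673363, "octogonz")`; the login could then be used for a profile lookup instead of discarding the commit.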

@octogonz

octogonz commented Jan 3, 2022

> Finally, this method of identifying company carries a large overhead. When implementing this approach, it will be necessary to download information for all users who made push events in 2020, their number (as of June 2020) is 5M - loading their profiles will take about 42 days calculating with GitHub API usage limits.

This overhead seems worthwhile and a valuable service to the community. 👍👍 Perhaps GitHub would be willing to publish the aggregated data directly, if asked nicely.

> We would also have to additionally load new profiles every day, and their download, in turn, may not fit into the daily usage limits.

Why? An annual ranking would be sufficient for most purposes. Certainly the ranking is not going to change substantially from day to day.

@jeffwilcox

Regarding the volume of GitHub API calls - we use Conditional Requests and cache responses for so many things on GitHub. As long as someone's profile doesn't change, you aren't charged an API call, for example.

Might help make it more of a reality...
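
A minimal sketch of the caching idea, assuming the usual ETag / `If-None-Match` flow: the first response's `ETag` is stored, later requests send it back, and a `304 Not Modified` reply means the cached body is still valid. `ProfileCache` and its `fetch` hook are hypothetical; `fetch(url, headers)` stands in for a real HTTP client and is expected to return `(status, response_headers, body)`.

```python
class ProfileCache:
    """Caches response bodies by URL and revalidates them with ETags."""

    def __init__(self, fetch):
        # Hypothetical transport hook: fetch(url, headers) -> (status, headers, body).
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached is not None:
            # Ask the server to reply 304 if the resource is unchanged.
            headers["If-None-Match"] = cached[0]
        status, response_headers, body = self._fetch(url, headers)
        if status == 304 and cached is not None:
            return cached[1]  # unchanged: reuse the cached body
        if status == 200:
            etag = response_headers.get("ETag")
            if etag:
                self._entries[url] = (etag, body)
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")
```

With this in place, a daily refresh would only spend rate-limit budget on profiles that actually changed, which is the saving described above.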

At our company, we ask employees to try and maintain a professional profile, and we internally allow them to choose to tell us who they are...

  • by proving control and authenticating to both our corporate systems and their GitHub account via OAuth, creating a "link" (this data is not broadly available outside the company, but helps us improve our own data)
  • by noting the company name in their profile
  • by publicizing their membership in a company org on their profile
  • by using their corporate email in public

Hard problem to do at scale, for sure. Happy to help brainstorm.

@anausa4eva
Contributor

anausa4eva commented Mar 10, 2022

Hey Jeff,

Thanks for your comments! After much analysis, we concluded it was best to use the email address of the commit author to identify the organization to which they belong. Otherwise we lose almost 80% of contributions, because many engineers don't list their company in their profile.
However, we're really keen to explore your first idea about authenticating to both the corporate system and the GitHub account. Do you mean that employees make a commit and then authorize it in your corporate system?

@Sealjay

Sealjay commented Nov 16, 2022

I can confirm that we take the same approach at @Avanade - and whilst I love the OSCI tool, I worry that the contribution data could become inaccurate over time.

Taking a recent update to the companies list as an example - Release v2022.09.0 (#144) · epam/OSCI@cbf6b35 (github.com) - if we assume that James, Mohit, Guilherme and Justin work at Credera, Infosys, Farfetch, and eBay, then only Mohit’s contribution would have been associated with Infosys, as the others are contributing with personal email addresses or users.noreply emails.

We ask employees to use one GitHub account - and then complete the Organization field / be invited to the Avanade GitHub org. Employees log into both GitHub & our own corporate system.

That way, if someone moves to another organization, they can keep their private commit history, which we feel provides them with a good "CV" and we'd want to support people wherever they choose to go in their career.

This is a GitHub native feature too, if you use something like GitHub Enterprise - https://docs.github.com/en/enterprise-server@3.7/admin/user-management/managing-users-in-your-enterprise/viewing-people-in-your-enterprise

5 participants