[WIP] Add repository metrics #177

steffen · 2019-03-01T20:14:58Z

Description

Earlier this week I presented the idea of a modified Hubble to display repository metrics to a few peers (including @larsxschneider) which received very positive feedback.
The proof of concept can be seen here: https://steffen.github.io/rosat/

I called the tool Rosat as that was the next satellite that has been deployed to space after Hubble and fittingly it was a German satellite, built in collaboration with the US and UK that used X-ray imaging to study its orbiting plane (earth).

@larsxschneider suggested that it would simplify maintenance for users if the repository metrics tool could be integrated directly into Hubble, which received a 👍 from @pluehne to my knowledge. 🙂

Therefore I'm creating this PR as a proof of concept to explore how an integration could look like.

Demo

Open Hubble:
https://steffen.github.io/hubble/
Go to Repositories -> Monitored:
https://steffen.github.io/hubble/repos-monitored
Click on a repository:
https://steffen.github.io/hubble/repository/?nwo=test-org/large-repo
See Overview with the repository name and size (fetched via a SQL query)
Click on GitHub Metrics:
https://steffen.github.io/hubble/repository/github-api-requests?nwo=test-org/large-repo
See API Request Types by Count (gathered from the log files)

Screenshots

Comments

Configuration

Repositories to be monitored can be configured through a new configuration option in config.py:
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/config.py.example#L75-L81

Color

I made the header background blue to make it obvious to the user that they are in "repository view mode". I found the blue fitting as it represents the sky which becomes bluer as one gets closer to planet earth (to follow-up on that Rosat analogy). 🙂

Monitored repositories page

Right now I added the list of monitored repositories and the links to the repository view mode to the Repositories section. I believe it would be handier to the user if we'd display the monitored repositories on Hubble's overview dashboard page, which does not exist yet. I would find it convenient to have an overview dashboard page similar to Rosat's overview dashboard page that displays a few key metrics of the instance. We can discuss this in a separate issue though.

"Back to instance statistics" link

I'm open for style and position suggestions of the "Back to instance statistics" link. For now I found the top left position within the navigation to be the best place.

Question about passing repository name with owner to bash scripts

The repository name with owner has to be passed to the bash scripts, such as to this one_
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/scripts/repository/github-api-request-types-by-count.sh

I'm not sure yet what the best way is to accomplish that via the executeScript method. Any ideas?

The code

I programmed with Python for the first time this week. 😁 As I have a background in Ruby and JavaScript, it was not hard and I rather enjoyed it. Though, don't expect the best Python code just yet. 🙂 I'm open for suggestions!
Also, obviously it needs more testing, but I think the changes are a good base for discussing the integration.
Having said that, huge kudos to the current Hubble code base, it's fun building on it! ✨

Looking forward to your thoughts!

pluehne · 2019-03-01T22:22:08Z

@steffen: Thank you so much for this huge contribution 👍! I’ll want to look at this in detail, but I’ll be off the next days unfortunately. It will take a week or two until I can come back to your pull request. I hope that’s okay 🙂.

steffen · 2019-03-01T22:37:09Z

@pluehne No worries! Thanks for looking into it! Enjoy your time off. 🙂

steffen · 2019-03-13T07:15:24Z

Hi @pluehne 👋,

Just a "quick" question about my question about passing repository name with owner to bash scripts from the original post.
Do you have any pointers or idea of how to go about it? Here's the question:

The repository name with owner has to be passed to the bash scripts, such as to this one:
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/scripts/repository/github-api-request-types-by-count.sh

I'm not sure yet what the best way is to accomplish that via the executeScript method.

Basically I want to pass this.repository (which is a string with "owner/repo") to repository/github-api-request-types-by-count.sh via executeScript here: https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/reports/repository/ReportGitHubApiRequestTypesByCount.py#L15

Thanks!!

pluehne · 2019-03-21T16:37:51Z

@steffen: The best way to achieve that would be to extend the executeScript method to take a list of environment variables that it then, it turn, forwards to the executed script. That should make it possible to pass arguments to the script, what do you think?

And sorry that I still haven’t had the time to look over your pull request!

steffen · 2019-03-21T16:41:09Z

@pluehne I've thought about using environment variables as well. Any pointers on how executeScript could be extended to support that? 😬Thanks!! 🙂

pluehne · 2019-03-21T16:47:07Z

@steffen: executeScript uses executeCommand under the hood, which makes a call to subprocess.Popen. According to the Python documentation, subprocess.Popen optionally takes a named parameter called env through which one can set the environment variables passed to the command to be invoked. There’s an answer on StackOverflow that suggests copying the preexisting environment as a basis to add new variables to be sure nothing important is removed by setting a new environment. That’s how I would do it I guess 🙂. If you want to implement this, I’d suggest you open a separate pull request so that I can review and merge it more quickly.

steffen · 2019-03-21T17:45:43Z

@pluehne Great, thanks! Speaking of reviewing and merging the PR. I thought about keeping this one here as small as possible, but big enough to get the whole picture of how I imagine the integration with repository specific metrics. I do have a few changes locally I want to push here which include the functionality of having working directories of the actual repositories in a temporary folder next to hubble-data. Do you think that's best to keep it all in this PR or in what way do you mean I should open a separate pull request for the executeScript extension? Right now the API Request Types by Count chart data gathering is not working correctly because of the missing executeScript functionality. Also, this PR is based off a branch on my fork so not sure against which branch you want me to open the PR. Thanks!! 🙂

larsxschneider · 2019-04-05T18:41:45Z

docs/repository/git-repository-size.html

+						"trees",
+						"blobs",
+						"tags",
+						"refs"


These series can have dramatically different ranges. Should we put them into individual charts?

I separated counts and bytes into different charts. The charts allow to view selected series by clicking on the legend items, which I thought is better than too many charts. But yeah, worth thinking about splitting it up. Which series would you suggest to separate, e.g. tags and refs into a separate chart?

Do you know if refs include tags? If they do, then I would even drop tags.

Plus, I am not sure if the total "number of trees" and/or "number of blobs" are actionable metrics. Maybe number of trees/files of HEAD in the default branch represents the current performance characteristics of the repo for the user better?

In general, I believe we should track all this data but only visualize the most important/actionable subset.

@larsxschneider Great feedback! I changed the git-sizer report to include all values and now we are flexible in picking what to display in the charts.

The refs count includes the tags, so I agree that we only need to show the number of refs.
I tweaked the charts to show more actionable metrics and included some comments from your article:
https://steffen.github.io/hubble/repository/git-repository-size?nwo=test-org/large-repo

Great point on the number of trees/files of HEAD in the default branch. git-sizer doesn't include that metric, but we can just get it directly from the repo.

Thank you! And Happy Eastern! 🙂🐰

larsxschneider · 2019-04-05T18:45:34Z

updater/reports/repository/ReportGitRepositorySizeObjectsSize.py

+		return round(bytes / 1024 ** 2, 3)
+
+	def updateData(self):
+		stdout = self.executeScript(["ghe-repo", self.repository, "-c", self.configuration["gitSizerPath"] + " --json --json-version=2"])


That is nice! But don't we run git-sizer here twice per repo? Here and there: https://github.com/Autodesk/hubble/pull/177/files#diff-6b012ceb81adedd4a22e2089a0102dd3R10

Yep, I'm aware and thought it's ok to optimize later. 🙂It would need a bit of refactoring I think, as the common design in the updater is to have one report file and class per report. But it shouldn't be too hard to find a solution for this.

I see two options:

(1) All reports are executed sequentially. That means we could pass all reports a "cache dictionary" which is used to use data that a previous report calculated.

(2) We add a cache flag to executeScript. If this flag is set, then the output is cached and reused in a subsequent call with the same command.

@pluehne/@steffen Do you see another option? If no, which one would you prefer? I am leaning towards (2).

@larsxschneider Thanks for these ideas! I consolidated the git-sizer reports into one: 7008ec9#diff-66c529dd9266003f0fc5892af8ac0cb8

All quantity values are now tracked in one report. This gives us more flexibility on what to show while having all the data as discussed here. I didn't really think of this idea of using the same report for different charts until I worked on the git-download reports which does exactly that. 🙂

larsxschneider · 2019-04-05T18:48:23Z

updater/scripts/repository/github-api-request-types-by-count.sh

+    uniq -ic |
+    sort -rn |
+    head -20 |
+    awk '{ printf("%s\t%s\n", $2, $1) }'


That looks pretty similar to https://github.com/Autodesk/hubble/blob/master/updater/scripts/api-requests-by-user.sh ... should we add a repo option to the original script to avoid code duplication?

Good point! There are other reports I wanted to add for repo metrics that are pretty much the same as well. I'll explore adding a repo option to the original scripts.

@larsxschneider I added a repo option for git-download.sh. Let me know if you think that's a good solution and if it's ok to use for other reports/scripts as well.

toddocon · 2019-04-05T20:44:25Z

I ran the same greps by hand against my top repo as reported in Hubble:API requests per repo. The results were oddly formatted.

"/repositories/:repository_id/contents/?" 79534
"/repositories/:repository_id/pulls/:id" 12816
"/repositories/:repository_id/collaborators/:collab/permission" 7321
"/repositories/:repository_id/pulls" 1181
"/repositories/:repository_id/branches" 870
"/repositories/:repository_id" 762
"/repositories/:repository_id/statuses/:sha" 45
/user 26
git-server-23205134-da01-11e6-a478-005056849626 25
"/repositories/:repository_id/pulls/:id/commits" 19
"/repositories/:repository_id/git/refs/" 13
/ 7
"/repositories/:repository_id/issues" 6
"/user/:user_id" 4
/user/teams 4
/user/repos 3
"/repositories/:repository_id/pulls/:id/comments" 3
"/repositories/:repository_id/issues/:id" 3
"/repositories/:repository_id/status/*" 2
"/repositories/:repository_id/git" 2

Whereas on another repo, the results looked OK.

"/repositories/:repository_id/pulls/:id/comments" 25860
"/repositories/:repository_id/pulls" 19176
"/repositories/:repository_id/pulls/:id" 10798
"/repositories/:repository_id/collaborators" 9576
"/repositories/:repository_id" 9476
"/repositories/:repository_id/statuses/:sha" 4523
"/repositories/:repository_id/branches" 3977
"/repositories/:repository_id/contents/?" 3158
"/repositories/:repository_id/issues/:id/comments" 1832
"/repositories/:repository_id/git/refs/" 1696
"/repositories/:repository_id/statuses/" 1251
"/repositories/:repository_id/status/" 1196
"/repositories/:repository_id/commits/" 1171
"/repositories/:repository_id/branches/" 988
"/repositories/:repository_id/pulls/:id/files" 513
"/repositories/:repository_id/issues/comments/:id" 293
"/repositories/:repository_id/git" 287
"/repositories/:repository_id/pulls/:id/commits" 178
"/repositories/:repository_id/commits" 160
"/repositories/:repository_id/issues" 62

This was the api-request-types

zcat -f /var/log/github/unicorn.log.1* |
grep -F 'request_category=api' |
grep -Fv 'remote_address=127.0.0.1' |
grep -Fv "controller="Api::Internal" |
grep "repo=OWNER/REPO" |
grep -oP 'route=\K\S+' |
grep -Fvx 'nil' |
sort |
uniq -ic |
sort -rn |
head -20 |
awk '{ printf("%s\t%s\n", $2, $1) }'

toddocon · 2019-04-05T20:50:25Z

Here's the raw data up to the uniq -ic command if it helps.

$ zcat -f /var/log/github/unicorn.log.1* |
grep -F 'request_category=api' |
grep -Fv 'remote_address=127.0.0.1' |
grep -Fv "controller="Api::Internal" |
grep "repo=OWNER/REPO" |
grep -oP 'route=\K\S+' |
grep -Fvx 'nil' |
sort |
uniq -ic
7 /
25 git-server-23205134-da01-11e6-a478-005056849626
2 "/organizations/:organization_id"
1 "/organizations/:organization_id/repos"
1 /rate_limit
762 "/repositories/:repository_id"
870 "/repositories/:repository_id/branches"
7321 "/repositories/:repository_id/collaborators/:collab/permission"
1 "/repositories/:repository_id/commits/"
79534 "/repositories/:repository_id/contents/?"
1 "/repositories/:repository_id/events"
2 "/repositories/:repository_id/git"
13 "/repositories/:repository_id/git/refs/"
1 "/repositories/:repository_id/hooks"
6 "/repositories/:repository_id/issues"
3 "/repositories/:repository_id/issues/:id"
1 "/repositories/:repository_id/issues/:id/comments"
1181 "/repositories/:repository_id/pulls"
12816 "/repositories/:repository_id/pulls/:id"
3 "/repositories/:repository_id/pulls/:id/comments"
19 "/repositories/:repository_id/pulls/:id/commits"
2 "/repositories/:repository_id/status/"
1 "/repositories/:repository_id/statuses/*"
45 "/repositories/:repository_id/statuses/:sha"
1 /search/code
26 /user
1 /user/emails
3 /user/repos
4 /user/teams
4 "/user/:user_id"

steffen · 2019-04-08T07:38:10Z

Hi @toddocon, thank you for running the API queries!!
I'm trying to distinguish what exactly you see as oddly formatted.
I see some routes without quotes, and then I see git-server-*, which fall out of line a little bit. Are these two type of entries the ones that you see as oddly formatted?

Thanks again for your feedback!

codecov-io · 2019-04-17T11:35:22Z

Codecov Report

Merging #177 into master will decrease coverage by 0.19%.
The diff coverage is n/a.

This allows running `python3 update-stats.py` on the GitHub Enterprise appliance

codecov-io · 2019-05-09T13:22:28Z

Codecov Report

Merging #177 into master will decrease coverage by 0.19%.
The diff coverage is n/a.

Which aligns with Hubble install instructions

steffen added 11 commits February 27, 2019 20:48

First version of integrated repository metrics

6cebdbb

Add passing around nwo in query string in navigation links

d5320c7

Add loading repository specific data

69b3577

Add back to instance statistics link to repo view

2480fd4

Add overview table to overview page

4495986

Add first version of creating a report for monitored repositories

5b33604

Add report for GitHub api request types by count

b68d0a3

Add abstract ReportRepository class

0d2b923

Add monitored repositories

93c8f0f

Add showing repository name in page header

06384ab

Add writing repository name directly into the header to avoid flickering

13ee2ec

pluehne self-requested a review March 1, 2019 22:22

pluehne self-assigned this Mar 1, 2019

pluehne added enhancement suggestion labels Mar 1, 2019

steffen added 3 commits April 5, 2019 15:58

Add git-sizer metrics

d9495fd

Make "passing" repository to scripts work

44b62be

Exclude internal api requests from api request types data

221fafc

larsxschneider reviewed Apr 5, 2019

View reviewed changes

steffen added 2 commits April 17, 2019 08:59

Add repository metrics support for non-remote runs

a75e34a

Add repository Git traffic metrics

654b75b

steffen changed the title ~~Proof of concept for repository metrics~~ [WIP] Add repository metrics Apr 17, 2019

Consolidate git-sizer report and include all quantity values

7008ec9

steffen added 2 commits April 18, 2019 17:01

Tweaked charts to focus on actionable metrics

189d46c

Make updater scripts executable

c1b1a8a

This allows running `python3 update-stats.py` on the GitHub Enterprise appliance

Fix issue with relative dataURLs by only supporting absolute dataURLs

eff5fca

Which aligns with Hubble install instructions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add repository metrics #177

[WIP] Add repository metrics #177

steffen commented Mar 1, 2019 •

edited

Loading

pluehne commented Mar 1, 2019

steffen commented Mar 1, 2019

steffen commented Mar 13, 2019

pluehne commented Mar 21, 2019

steffen commented Mar 21, 2019

pluehne commented Mar 21, 2019

steffen commented Mar 21, 2019

larsxschneider Apr 5, 2019

steffen Apr 5, 2019

larsxschneider Apr 8, 2019

steffen Apr 18, 2019

larsxschneider Apr 5, 2019

steffen Apr 5, 2019

larsxschneider Apr 8, 2019

steffen Apr 17, 2019

larsxschneider Apr 5, 2019

steffen Apr 5, 2019

steffen Apr 17, 2019

toddocon commented Apr 5, 2019 •

edited

Loading

toddocon commented Apr 5, 2019 •

edited

Loading

steffen commented Apr 8, 2019

codecov-io commented Apr 17, 2019

codecov-io commented May 9, 2019

[WIP] Add repository metrics #177

Are you sure you want to change the base?

[WIP] Add repository metrics #177

Conversation

steffen commented Mar 1, 2019 • edited Loading

Description

Demo

Screenshots

Comments

Configuration

Color

Monitored repositories page

"Back to instance statistics" link

Question about passing repository name with owner to bash scripts

The code

pluehne commented Mar 1, 2019

steffen commented Mar 1, 2019

steffen commented Mar 13, 2019

pluehne commented Mar 21, 2019

steffen commented Mar 21, 2019

pluehne commented Mar 21, 2019

steffen commented Mar 21, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddocon commented Apr 5, 2019 • edited Loading

toddocon commented Apr 5, 2019 • edited Loading

steffen commented Apr 8, 2019

codecov-io commented Apr 17, 2019

Codecov Report

codecov-io commented May 9, 2019

Codecov Report

steffen commented Mar 1, 2019 •

edited

Loading

toddocon commented Apr 5, 2019 •

edited

Loading

toddocon commented Apr 5, 2019 •

edited

Loading