Generate a matrix of pairwise path lengths for all neuroelectro authors #8

stripathy · 2016-02-17T21:28:52Z

@svdavid , here's the neuroelectro spreadsheet, the relevant column name here is 'Pmid': http://dev.neuroelectro.org/static/src/article_ephys_metadata_curated.csv

No need to re-download if you have the link I sent out from a couple weeks ago, the two spreadsheets will be very similar.

svdavid · 2016-02-18T06:51:29Z

Ok, some progress. First pass at the link between NE pmids and NT pids is
here:
http://neurotree.org/tmp/ne_nt_match_v1.txt

tab-delimited, first column is PMID, second is NT pid of last author (0
means no match), third is confidence (0 low, 1 high).

353 matches to some pid gives ~62000 pairs to calculate distances. This
calculation is underway.
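
As a rough sketch of how the match file could be consumed, assuming the three tab-delimited columns described above: `load_matches` and the sample rows are hypothetical and not taken from the real file. It drops unmatched rows (pid 0) and counts unordered pairs. If the 353 matches are 353 distinct pids, the count is 353 x 352 / 2 = 62,128, in line with the ~62000 quoted above.

```python
import csv
import io
from math import comb


def load_matches(text):
    """Return {pmid: (pid, confidence)} for rows whose pid is non-zero."""
    matches = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        pmid, pid, conf = int(row[0]), int(row[1]), float(row[2])
        if pid != 0:  # 0 means no Neurotree match
            matches[pmid] = (pid, conf)
    return matches


# Made-up sample rows (PMID, NT pid, confidence), for illustration only
sample = "111\t77\t1\n222\t0\t0\n333\t135\t0\n"
matches = load_matches(sample)
unique_pids = {pid for pid, _ in matches.values()}
n_pairs = comb(len(unique_pids), 2)  # unordered pairs needing a distance
```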


rgerkin · 2016-02-18T17:30:36Z

@stripathy @svdavid Since the new approach I proposed for using path length vectors as features requires the path length matrix, should I wait until this is more fleshed out, or proceed with the one we have now?

stripathy · 2016-02-18T21:54:19Z

@rgerkin it's up to you - either way we'll want to do the analysis on just CA1 pyramidal cells and the analysis should be mostly the same given more cell types.

svdavid · 2016-02-18T23:23:35Z

At this point it might make sense to see if you can deal with the current
data formatting and/or if there's any additional info you'd want to
extract. The path matrix and pmid-pid links may be improved, but their
structure should not change.

Speaking of that, the table where I'm storing the pairwise distances has
one row for each (pid1,pid2) pair. It might be a little easier to export
it that way, just to avoid having so many hundreds of columns, e.g.:

mysql> select p1,p2,d from pairDist limit 20;
+----+-----+------+
| p1 | p2  | d    |
+----+-----+------+
| 77 | 135 |    5 |
| 77 | 169 |    8 |
| 77 | 219 |    6 |
| 77 | 266 |    5 |
| 77 | 272 |    6 |
| 77 | 297 |    8 |
| 77 | 366 |    4 |
| 77 | 368 |    5 |
| 77 | 404 |    5 |
| 77 | 455 |    8 |
| 77 | 458 |    8 |
| 77 | 685 |    8 |
| 77 | 691 |    4 |
| 77 | 700 |   -1 |
| 77 | 809 |    6 |
| 77 | 874 |    5 |
| 77 | 875 |    4 |
| 77 | 892 |    5 |
| 77 | 906 |    8 |
| 77 | 968 |    5 |
etc....

That work for you?

stephen
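
For anyone who still wants the wide layout, a one-row-per-pair export like the one above can be pivoted back into a square matrix on the client side. This is a minimal sketch: `pairs_to_matrix` is a hypothetical helper, and it assumes that d = -1 marks an unconnected pair and that distances are symmetric.

```python
def pairs_to_matrix(rows):
    """Pivot (p1, p2, d) rows into (sorted pid list, square list of lists).

    Missing pairs and d == -1 both become None; the diagonal is 0.
    """
    pids = sorted({p for p1, p2, _ in rows for p in (p1, p2)})
    index = {pid: i for i, pid in enumerate(pids)}
    n = len(pids)
    matrix = [[0 if i == j else None for j in range(n)] for i in range(n)]
    for p1, p2, d in rows:
        value = None if d == -1 else d
        matrix[index[p1]][index[p2]] = value
        matrix[index[p2]][index[p1]] = value  # assumed symmetric
    return pids, matrix


# Three rows copied from the listing above
rows = [(77, 135, 5), (77, 169, 8), (77, 700, -1)]
pids, matrix = pairs_to_matrix(rows)
```

Pairs that never appear in the export (here 135 and 169) stay `None`, which keeps "not yet computed" and "not connected" from being mistaken for a real distance.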


svdavid · 2016-02-19T00:39:32Z

Extracted the pairwise distance matrix (distance measured through a common ancestor). For ease of use, this contains redundancies, i.e., pairs (p1,p2) and (p2,p1) have the same distance. Does this format work? It should generalize to any connection matrix.

http://neurotree.org/beta/include/dist_mtx.php

This is currently being populated (2/18/16 pm). Please don't query repeatedly, since it's not a trivial-sized query.

Note in passing: You'll see that lower pids tend to be a lot more connected while higher pids (added more recently) are more likely not to be connected to most others.
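
One way the "distance through a common ancestor" could be computed is sketched below. This is not the code running on the server; it assumes a hypothetical `parents` mapping from each pid to the pids of that person's mentors, treats a node as its own ancestor at depth 0, and returns (-1, 0) when no shared ancestor exists.

```python
from collections import deque


def ancestor_depths(pid, parents):
    """Breadth-first walk upward: {ancestor: hops from pid}, with pid at 0."""
    depths = {pid: 0}
    queue = deque([pid])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


def common_ancestor_distance(a, b, parents):
    """Return (d, p0): shortest path via a shared ancestor, or (-1, 0) if none."""
    up_a = ancestor_depths(a, parents)
    up_b = ancestor_depths(b, parents)
    shared = up_a.keys() & up_b.keys()
    if not shared:
        return -1, 0
    # Closest shared ancestor; ties broken by the smaller pid
    p0 = min(shared, key=lambda p: (up_a[p] + up_b[p], p))
    return up_a[p0] + up_b[p0], p0


# Made-up toy tree: child pid -> list of mentor pids
parents = {4: [2], 5: [2], 2: [1], 3: [1], 6: [3], 9: []}
siblings = common_ancestor_distance(4, 5, parents)
cousins = common_ancestor_distance(4, 6, parents)
unrelated = common_ancestor_distance(4, 9, parents)
```

Because the result does not depend on argument order, each unordered pair only needs to be computed once and can then be stored in both directions.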

svdavid · 2016-02-19T00:41:14Z

Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. Probably if we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to analyze and generate.
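
The N x M idea could look something like the sketch below, where each author's fingerprint is the vector of distances to a fixed list of hub nodes. The helper name `fingerprints`, the choice of hubs, and the 135-169 distance are all made up for illustration; -1 is assumed to mean "no connection".

```python
def fingerprints(pair_dist, authors, hubs, missing=-1):
    """Return {author: [distance to each hub, in hub order]}."""
    def lookup(a, b):
        if a == b:
            return 0
        # Accept either ordering of the pair
        return pair_dist.get((a, b), pair_dist.get((b, a), missing))

    return {a: [lookup(a, h) for h in hubs] for a in authors}


# Hypothetical distances keyed by (p1, p2)
pair_dist = {(77, 135): 5, (77, 169): 8, (135, 169): 6}
hubs = [135, 169]  # the M "interesting" nodes
vectors = fingerprints(pair_dist, authors=[77, 135, 700], hubs=hubs)
```

Only N x M distances are needed, so the hubs do not have to be NE authors themselves as long as their distances to the authors are available.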

rgerkin · 2016-02-19T00:44:02Z

@svdavid I think we can discover the M interesting nodes automatically by first doing the full analysis on the full N x N matrix, and then extracting the interesting M of those N. Or am I overestimating the likelihood that the M of interest will even have papers here (maybe they trained many of the N but otherwise haven't published much in the neuroelectro CA1 time frame)?

@stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

rgerkin · 2016-02-19T00:44:37Z

Should I abandon some of @stripathy's code in favor of http://neurotree.org/beta/include/dist_mtx.php, or should I wait on that?

stripathy · 2016-02-19T03:39:14Z

> Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. Probably if we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to analyze and generate.

Yeah - I agree. @rgerkin I think that @svdavid is proposing this mostly as a practicality thing. My guess is that it's not computationally trivial to generate the full N x N NE author matrix (for whatever reason) but it'd be much quicker just to generate the N x M matrix. @svdavid am I right in this?

> @stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

Ah sorry! Hopefully it's not too painful. This side project seems like a good excuse to migrate over to python 3.

svdavid · 2016-02-19T06:53:49Z

Learning some interesting things about database table locks and the
intricacies of running multiple processes at once on the same db. Running
the NE distance matrix calc on the mirror server now, and it's going a lot
faster. Also realizing how it can be really sped up if/when I get around
to writing some more code.

In the meantime, take 1 of the full matrix should be done tomorrow morning.

I'm also realizing that a reduced size fingerprint scheme may provide an
interesting way of clustering the whole tree and visualizing people's
training profile. Eg, if you pick some big nodes in different fields
(chemistry, physics, math, anthropology, etc), then you can easily
visualize how far their training is from each of those different areas. So
I'll be thinking about ways to integrate fingerprints and active updating
of them into the bigger database.


rgerkin · 2016-02-19T06:56:06Z

If you are going to pick some key nodes on which to build the reduced fingerprint, will you just take the N nodes with the most descendants (that have papers in neuroelectro)? Or the N that maximize some other metric?

svdavid · 2016-02-19T07:01:20Z

Haven't decided on what strategy to use yet, but that sounds like a good
one. Possibly could do some sort of eigenvector-like reduction of the big
matrix to find important nodes. Thing about just picking based on big
descendant counts is that you tend to get a lot of nodes that are really
close to each other.

For now, we can just stick with the big matrix, since that's worth having
as a baseline for testing any reduction.


rgerkin · 2016-02-19T07:07:15Z

Possibly this would be solved by some eigenvector approach (although I don't know what that looks like in a directed graph), but another approach would be:

1. Find the node with the most descendants.
2. Remove all of those descendants from the tree.
3. Repeat step 1.

Thereby finding a lot of nodes that don't overlap much.
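
The pruning procedure above can be sketched roughly as follows. The `children` mapping (mentor pid to trainee pids) and the toy data are hypothetical, and the sketch also removes each chosen hub along with its descendants so that it is not picked twice, which is an assumption beyond the steps as written. It counts all descendants rather than only those with papers in neuroelectro.

```python
def descendants(node, children, removed):
    """All not-yet-removed descendants of node (iterative depth-first search)."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen and child not in removed:
                seen.add(child)
                stack.append(child)
    return seen


def greedy_hubs(children, n_hubs):
    """Repeatedly pick the node with the most remaining descendants."""
    nodes = set(children) | {c for kids in children.values() for c in kids}
    removed, hubs = set(), []
    for _ in range(n_hubs):
        candidates = sorted(nodes - removed)  # sorted for a stable tie-break
        if not candidates:
            break
        best = max(candidates,
                   key=lambda n: len(descendants(n, children, removed)))
        hubs.append(best)
        removed |= descendants(best, children, removed) | {best}
    return hubs


# Made-up forest: mentor pid -> trainee pids
children = {1: [2, 3, 4], 2: [5], 10: [11, 12], 20: [21]}
hubs = greedy_hubs(children, 3)
```

On a tree with a single root, the first pick is the root and nothing is left afterwards, so in practice the candidate set would probably need to exclude the very top of the tree.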

rgerkin · 2016-02-19T07:51:30Z

@stripathy The notebook is working for me now after a few changes (747cf1e). I'll tackle the fingerprint construction and use in model fitting next.

svdavid · 2016-02-19T08:04:02Z

@rgerkin The most-prolific pruning option might work. Only complication
there would be if one or two people were ancestors for basically everyone
in the tree. The tree does get narrow toward the top.

So, yeah, we do want to find hubs with non-overlapping descendants, but not
necessarily a lot of them. Maybe some sort of clustering and then pick the
best exemplars from each cluster?

For an NE-specific analysis, we could reference back to the pub data, eg,
define people with the greatest average methodological differences as hubs.
This might be too tautological.

Another passing thought: The hubs don't have to be in the NE pub matrix.
Though maybe you've got papers from all the grandparents?

Need to sleep on it.


stripathy · 2016-02-19T18:18:33Z

> Another passing thought: The hubs don't have to be in the NE pub matrix. Though maybe you've got papers from all the grandparents?

Most of the papers indexed in NE are published after 1997, after journals switched to HTML from PDF. So NE doesn't usually have papers from the uber-grandfathers of ephys like Eccles, Sherrington, Kuffler, Hodgkin or Huxley, etc.

While our immediate goal is to integrate NE with NT, maybe for the purpose of indexing NT paths, or the intrinsic ephys-relevant part of NT, we'd want to include at a minimum all the NE authors + all their grandfathers going back at least 2-3 hops. Remembering back, I think relationships follow sort of an "out of Africa" type model, where if you go back far enough there's really just 10 neuroscientists that everyone trained with. But if you go back only 2-3 generations (i.e., who trained the people who published in NE), there really are specific neuroscience schools, centered around a small number of people (like Sakmann, Llinas, David Prince, etc).

svdavid · 2016-02-19T18:22:58Z

Ha... That's an interesting idea. We can find the minimum set of
grandparents, ie, no more than X hops back from the NE set that span the
entire set. Fairly objective and should automatically give us the
diversity we need. Need to do a little coding to take care of that. Which
may happen today if I can get my Cosyne poster done soon.
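
A possible shape for that "minimum set of grandparents" search is the greedy set-cover heuristic sketched below. Greedy selection is a swapped-in approximation and is not guaranteed to find the true minimum set. The `parents` mapping, the function names, and the toy pids are hypothetical.

```python
def ancestors_within(pid, parents, max_hops):
    """Ancestors of pid reachable in 1..max_hops steps up the tree."""
    found, frontier = set(), {pid}
    for _ in range(max_hops):
        frontier = {p for node in frontier
                    for p in parents.get(node, ())} - found
        found |= frontier
    return found


def covering_ancestors(ne_pids, parents, max_hops):
    """Greedy set cover: return (chosen ancestors, NE pids left uncovered)."""
    covers = {}  # ancestor -> set of NE pids it reaches within max_hops
    for pid in ne_pids:
        for anc in ancestors_within(pid, parents, max_hops):
            covers.setdefault(anc, set()).add(pid)
    uncovered, chosen = set(ne_pids), []
    while uncovered and covers:
        best = max(sorted(covers), key=lambda a: len(covers[a] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # the remaining pids have no ancestor within max_hops
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Made-up toy data: child pid -> mentor pids; 105 has no recorded mentor
parents = {101: [11], 102: [11], 103: [12], 104: [12],
           11: [1], 12: [1], 105: []}
ne = [101, 102, 103, 104, 105]
one_hop = covering_ancestors(ne, parents, max_hops=1)
two_hops = covering_ancestors(ne, parents, max_hops=2)
```

Raising the hop limit trades many specific "schools" for fewer, more distant ancestors, so the limit X effectively sets how coarse the resulting hubs are.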

For some reason my distance matrix calculator slowed down overnight.
Annoying. But it's getting there.


svdavid · 2016-02-19T22:51:33Z

I broke down and rewrote the pairwise distance code and now it's running at a sane speed (ie, ~ 100x faster). I also noticed that it's interesting to look at who shows up as a common ancestor. I'm now recording that as "p0" in the output of http://neurotree.org/beta/include/dist_mtx.php.

If I look at the most frequent occurrences of common ancestors, I get a list of the usual suspects. Maybe these are good fingerprint hubs? This is what I see for the first 200 or so NE people:

mysql> select p0, count(p1),people.firstname,people.lastname from pairDist left join people on p0=pid where p0>0 group by p0 order by count(p1) desc limit 30;
+-------+-----------+-------------+---------------+
| p0    | count(p1) | firstname   | lastname      |
+-------+-----------+-------------+---------------+
|   114 |     10295 | Sir John    | Eccles        |
|   115 |      6201 | Sir Charles | Sherrington   |
|  1713 |      2705 | John        | Langley       |
|   172 |      2641 | Sir Michael | Foster        |
|   151 |      2134 | Johannes    | Müller        |
|   146 |      2034 | Hermann     | von Helmholtz |
|   223 |      1396 | Carl        | Ludwig        |
|   517 |      1370 | Ernst       | Weber         |
|  3011 |      1179 | Rudolf      | Virchow       |
|    65 |      1050 | Stephen     | Kuffler       |
|   116 |       962 | Edgar       | Adrian        |
|   119 |       741 | Karl        | Lashley       |
|  6684 |       671 | Friedrich   | Goltz         |
|   511 |       650 | Franz       | Nissl         |
|   195 |       631 | Henry       | Bowditch      |
|  1857 |       606 | David       | Prince        |
|   122 |       589 | James       | Angell        |
|   196 |       573 | Claude      | Bernard       |
|  4339 |       544 | Otto        | Meyerhof      |
|   135 |       489 | Bert        | Sakmann       |
|   134 |       471 | Otto        | Creutzfeldt   |
|  1716 |       448 | Archibald   | Hill          |
|   188 |       432 | Philip      | Bard          |
|   204 |       402 | John        | Fulton        |
|   524 |       350 | Thomas      | Huxley        |
|   206 |       349 | Harvey      | Cushing       |
|   812 |       335 | Roger       | Nicoll        |
| 21405 |       280 | Robert      | Bunsen        |
|   171 |       278 | Bernard     | Katz          |
|  1888 |       270 | Oswald      | Schmiedeberg  |
+-------+-----------+-------------+---------------+
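
The same tally can be reproduced client-side once the rows are exported. A minimal sketch, assuming each row is a (p1, p2, d, p0) tuple and that p0 is 0 when there is no common ancestor (as the `p0>0` filter in the query suggests); the sample rows and the helper name are made up.

```python
from collections import Counter


def common_ancestor_counts(rows, top=30):
    """Return the `top` most frequent common ancestors as (p0, count) pairs."""
    counts = Counter(p0 for _, _, _, p0 in rows if p0 > 0)
    return counts.most_common(top)


# Made-up rows: (p1, p2, d, p0)
rows = [
    (77, 135, 5, 114),
    (77, 169, 8, 114),
    (135, 169, 6, 115),
    (77, 700, -1, 0),  # unconnected pair, skipped by the p0 > 0 filter
]
ranking = common_ancestor_counts(rows)
```

Joining the resulting pids against a name table would give the same kind of listing as the query output above.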

svdavid · 2016-02-20T00:17:18Z

matrix is complete! Of course, about half the pairs (60K/120K) are not connected (yet!)

stripathy · 2016-02-20T01:24:51Z

@svdavid I get a 500 error upon trying to load this page: http://neurotree.org/beta/include/dist_mtx.php. If it's a big file, you could just add it to the github repo.

stripathy · 2016-02-20T01:36:42Z

From looking at this: http://neurotree.org/neurotree/tree.php?pid=115&fontsize=0&pnodecount=4&cnodecount=2 , if my "Out of Africa" hypothesis is true, then Eccles and Sherrington are basically Africa.

stripathy · 2016-02-20T19:49:32Z

Thanks @svdavid for committing this: 8ed9ba7, I'm closing this issue for now.

nathaliebin · 2016-10-11T19:03:24Z

Hi @svdavid,

Could you please update the distance matrix displayed here: http://neurotree.org/beta/include/dist_mtx.php with the neurotree author PIDs in this file: UniquePID.txt

We have updated the listing of authors in NeuroElectro and we noticed that not all of these authors had corresponding entries in the distance matrix output that you had previously generated.

Thanks,
@nathaliebin and @stripathy

svdavid · 2016-10-19T16:56:49Z

@nathaliebin : got your request, and working on it! I've been traveling and tied up with other stuff. Should be able to get to it soon.

stripathy · 2016-11-01T19:57:15Z

hi @svdavid have you been able to update this yet?

svdavid · 2016-11-02T00:40:56Z

Ok! The table is updating. When I merge the new set of pids and previously included pids, I get 1147 distinct nodes. You might check the list that you get out of http://neurotree.org/beta/include/dist_mtx.php and make sure it includes all the pids you need. It's taking a little while, so the 1147 x 1147 distance matrix may not be completely done til late tonight.

svdavid · 2016-11-03T00:50:33Z

Now there are too many entries and dist_mtx is running out of memory. Do you have a mysql client you can use? I can give you a query to pull directly from the database.

stripathy · 2016-11-03T00:56:29Z

Yes - there's a mysql client we can use. Feel free to send me credentials
to login to your database or a dump of the database that we can load up
here locally.


Shreejoy Tripathy
Post-Doctoral Researcher
Department of Psychiatry
University of British Columbia

svdavid · 2016-11-03T16:09:39Z

@nathaliebin @stripathy : OK, give this a whirl. And let me know how it goes.

To connect:
host=klab.c3se0dtaabmj.us-west-2.rds.amazonaws.com
user=dacuna
pw=dacuna
database=academictree

Then run the query: select * from pairDistNE;

p1,p2 = pid of node pair
d = distance between them through common ancestor (d=-1 means no connection)
p0 = pid of common ancestor
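
For pulling the table from Python rather than the mysql client, something along these lines might work with any DB-API 2.0 driver. This is only a sketch: an in-memory `sqlite3` database stands in for the MySQL server so that it runs without a connection, `load_pair_dist` is a hypothetical helper, and the sample rows are made up. The column meanings follow the description above.

```python
import sqlite3


def load_pair_dist(cursor, table="pairDistNE"):
    """Return ({(p1, p2): (d, p0)} for connected pairs, count of unconnected rows).

    `table` is interpolated into the SQL, so pass only a trusted constant.
    """
    cursor.execute(f"select p1, p2, d, p0 from {table}")
    connected, n_unconnected = {}, 0
    for p1, p2, d, p0 in cursor.fetchall():
        if d == -1:  # -1 means no connection
            n_unconnected += 1
        else:
            connected[(p1, p2)] = (d, p0)
    return connected, n_unconnected


# Stand-in database with made-up rows; a real run would use a MySQL cursor
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "create table pairDistNE (p1 integer, p2 integer, d integer, p0 integer)"
)
cur.executemany(
    "insert into pairDistNE values (?, ?, ?, ?)",
    [(77, 135, 5, 114), (135, 77, 5, 114), (77, 700, -1, 0)],
)
connected, n_unconnected = load_pair_dist(cur)
conn.close()
```

Both orderings of a pair are kept as separate keys, matching the redundant layout of the table.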

stripathy assigned svdavid Feb 17, 2016

stripathy closed this as completed Feb 20, 2016

nathaliebin reopened this Oct 11, 2016
