Generate a matrix of pairwise path lengths for all neuroelectro authors #8

Open
stripathy opened this issue Feb 17, 2016 · 29 comments

@stripathy
Contributor

@svdavid , here's the neuroelectro spreadsheet, the relevant column name here is 'Pmid': http://dev.neuroelectro.org/static/src/article_ephys_metadata_curated.csv

No need to re-download if you have the link I sent out a couple of weeks ago; the two spreadsheets will be very similar.

@svdavid
Collaborator

svdavid commented Feb 18, 2016

Ok, some progress. First pass at the link between NE pmids and NT pids is
here:
http://neurotree.org/tmp/ne_nt_match_v1.txt

tab-delimited, first column is PMID, second is NT pid of last author (0
means no match), third is confidence (0 low, 1 high).

353 matches to some pid give ~62000 pairs for which to calculate distances.
This calculation is underway.
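
A minimal sketch of reading a file in that layout, assuming three tab-delimited columns (PMID, NT pid, confidence) and no header row. The function names and the sample rows are hypothetical, not taken from the real file. The pair count is n(n-1)/2 for n matched nodes, which is in line with the ~62000 figure quoted for 353 matches.

```python
import csv
import io


def read_matches(text):
    """Parse tab-delimited rows of (PMID, NT pid, confidence); skip pid == 0."""
    matches = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        pmid, pid, conf = int(row[0]), int(row[1]), float(row[2])
        if pid != 0:  # 0 means no Neurotree match for the last author
            matches.append((pmid, pid, conf))
    return matches


def n_unordered_pairs(n):
    """Number of distinct unordered pairs among n nodes."""
    return n * (n - 1) // 2


# Hypothetical sample rows, for illustration only.
sample = "1111\t77\t1\n2222\t0\t0\n3333\t135\t0\n"
print(read_matches(sample))
print(n_unordered_pairs(353))
```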


@rgerkin
Collaborator

rgerkin commented Feb 18, 2016

@stripathy @svdavid Since the new approach I proposed for using path length vectors as features requires the path length matrix, should I wait until this is more fleshed out, or proceed with the one we have now?

@stripathy
Contributor Author

@rgerkin it's up to you - either way we'll want to do the analysis on just CA1 pyramidal cells and the analysis should be mostly the same given more cell types.

@svdavid
Collaborator

svdavid commented Feb 18, 2016

At this point it might make sense to see if you can deal with the current
data formatting and/or if there's any additional info you'd want to
extract. The path matrix and pmid-pid links may be improved, but their
structure should not change.

Speaking of that, the table where I'm storing the pairwise distances has
one row for each (pid1,pid2) pair. It might be a little easier to export
it that way, just to avoid having so many hundreds of columns, e.g.:

mysql> select p1,p2,d from pairDist limit 20;
+----+-----+------+
| p1 | p2  | d    |
+----+-----+------+
| 77 | 135 |    5 |
| 77 | 169 |    8 |
| 77 | 219 |    6 |
| 77 | 266 |    5 |
| 77 | 272 |    6 |
| 77 | 297 |    8 |
| 77 | 366 |    4 |
| 77 | 368 |    5 |
| 77 | 404 |    5 |
| 77 | 455 |    8 |
| 77 | 458 |    8 |
| 77 | 685 |    8 |
| 77 | 691 |    4 |
| 77 | 700 |   -1 |
| 77 | 809 |    6 |
| 77 | 874 |    5 |
| 77 | 875 |    4 |
| 77 | 892 |    5 |
| 77 | 906 |    8 |
| 77 | 968 |    5 |
etc....

That work for you?

stephen
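
If the square form is still wanted downstream, the long (p1, p2, d) rows can be pivoted back. This is only a sketch: `to_square` is a hypothetical helper, it assumes distances are symmetric, it puts 0 on the diagonal by assumption, and it marks pairs that are not listed at all as `None` so they are not confused with the -1 values in the table.

```python
def to_square(rows):
    """Build a symmetric nested-dict matrix from (p1, p2, d) rows."""
    pids = sorted({p for p1, p2, _ in rows for p in (p1, p2)})
    # Assumption: self-distance is 0; unlisted pairs stay None.
    mat = {a: {b: (0 if a == b else None) for b in pids} for a in pids}
    for p1, p2, d in rows:
        mat[p1][p2] = d
        mat[p2][p1] = d  # assumed symmetric
    return pids, mat


# Three rows copied from the table above.
rows = [(77, 135, 5), (77, 169, 8), (77, 700, -1)]
pids, mat = to_square(rows)
print(pids)
print(mat[135][77], mat[77][700], mat[135][169])
```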


@svdavid
Collaborator

svdavid commented Feb 19, 2016

Here is an extract of the pairwise distance matrix, with distance measured through a common ancestor. For ease of use, this contains redundancies, i.e., pairs (p1,p2) and (p2,p1) both appear with the same distance. Does this format work? It should generalize to any connection matrix.

http://neurotree.org/beta/include/dist_mtx.php

This is currently being populated (2/18/16 pm). Please don't query repeatedly, since it's not a trivial-sized query.

Note in passing: You'll see that lower pids tend to be a lot more connected while higher pids (added more recently) are more likely not to be connected to most others.
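
One way to picture "distance through a common ancestor" is sketched below. This is not the code running on the server; it assumes the distance is the number of hops up from p1 to a shared ancestor plus the hops up from p2, minimized over all shared ancestors. The `parents` mapping (child to list of mentors) and both function names are hypothetical. Unconnected pairs return -1, matching the -1 entries in the exported table.

```python
from collections import deque


def ancestor_depths(parents, start):
    """Breadth-first search upward; returns {ancestor: hops from start}."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


def common_ancestor_distance(parents, p1, p2):
    """Return (d, p0): shortest path via a shared ancestor, or (-1, None)."""
    up1 = ancestor_depths(parents, p1)
    up2 = ancestor_depths(parents, p2)
    shared = set(up1) & set(up2)
    if not shared:
        return -1, None
    # Ties are broken by node id so the result is deterministic.
    p0 = min(shared, key=lambda a: (up1[a] + up2[a], a))
    return up1[p0] + up2[p0], p0


# Hypothetical toy tree: child -> list of mentors.
parents = {"c": ["a"], "d": ["a"], "e": ["d"], "x": ["y"]}
print(common_ancestor_distance(parents, "c", "e"))
print(common_ancestor_distance(parents, "c", "x"))
```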

@svdavid
Collaborator

svdavid commented Feb 19, 2016

Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. If we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to generate and analyze.
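
A rough sketch of what an N x M fingerprint could look like when built from long-format (p1, p2, d) rows. The helper name `fingerprints`, the choice of hubs, and the use of -1 for pairs that are missing from the rows are all assumptions made for illustration.

```python
def fingerprints(rows, authors, hubs, missing=-1):
    """Return {author: [distance to each hub, in hub order]}."""
    dist = {}
    for p1, p2, d in rows:
        dist[(p1, p2)] = d
        dist[(p2, p1)] = d  # assumed symmetric
    return {
        a: [0 if a == h else dist.get((a, h), missing) for h in hubs]
        for a in authors
    }


# Two (p1, p2, d) rows as they appear in the pairDist sample earlier in
# the thread; treating pids 135 and 169 as hubs is purely illustrative.
rows = [(77, 135, 5), (77, 169, 8)]
print(fingerprints(rows, authors=[77, 999], hubs=[135, 169]))
```

Each author then gets a vector of length M instead of length N, which is the size reduction being proposed.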

@rgerkin
Collaborator

rgerkin commented Feb 19, 2016

@svdavid I think we can discover the M interesting nodes automatically by first doing the full analysis on the full N x N matrix, and then extracting the interesting M of those N. Or am I overestimating the likelihood that the M of interest will even have papers here? (Maybe they trained many of the N but otherwise haven't published much in the neuroelectro CA1 time frame.)

@stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

@rgerkin
Collaborator

rgerkin commented Feb 19, 2016

Should I abandon some of @stripathy's code in favor of http://neurotree.org/beta/include/dist_mtx.php, or should I wait on that?

@stripathy
Contributor Author

Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. If we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to generate and analyze.

Yeah - I agree. @rgerkin I think that @svdavid is proposing this mostly as a practicality thing. My guess is that it's not computationally trivial to generate the full N x N NE author matrix (for whatever reason) but it'd be much quicker just to generate the N x M matrix. @svdavid am I right in this?

@stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

Ah sorry! Hopefully it's not too painful. This side project seems like a good excuse to migrate over to Python 3.

@svdavid
Collaborator

svdavid commented Feb 19, 2016

Learning some interesting things about database table locks and the
intricacies of running multiple processes at once on the same db. Running
the NE distance matrix calc on the mirror server now, and it's going a lot
faster. Also realizing how it can be really sped up if/when I get around
to writing some more code.

In the meantime, take 1 of the full matrix should be done tomorrow morning.

I'm also realizing that a reduced size fingerprint scheme may provide an
interesting way of clustering the whole tree and visualizing people's
training profile. Eg, if you pick some big nodes in different fields
(chemistry, physics, math, anthropology, etc), then you can easily
visualize how far their training is from each of those different areas. So
I'll be thinking about ways to integrate fingerprints and active updating
of them into the bigger database.


@rgerkin
Collaborator

rgerkin commented Feb 19, 2016

If you are going to pick some key nodes on which to build the reduced fingerprint, will you just take the N nodes with the most descendants (that have papers in neuroelectro)? Or the N that maximize some other metric?

@svdavid
Collaborator

svdavid commented Feb 19, 2016

Haven't decided on what strategy to use yet, but that sounds like a good
one. Possibly could do some sort of eigenvector-like reduction of the big
matrix to find important nodes. The thing about just picking based on big
descendant counts is that you tend to get a lot of nodes that are really
close to each other.

For now, we can just stick with the big matrix, since that's worth having
as a baseline for testing any reduction.


@rgerkin
Collaborator

rgerkin commented Feb 19, 2016

Possibly this would be solved by some eigenvector approach (although I don't know what that looks like in a directed graph), but another approach would be:

  1. Find the node with the most descendants.
  2. Remove all of those descendants from the tree.
  3. Repeat from step 1.

This should find a lot of nodes whose descendants don't overlap much.
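
The steps above can be sketched as a greedy loop. This is an illustration under assumptions, not tested project code: `children` is a hypothetical mapping from a node to its direct trainees, the chosen hub is removed along with its descendants, and ties go to the smallest node id.

```python
def descendants(children, node, removed):
    """All descendants of node that have not been removed yet."""
    seen = set()
    stack = [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen and child not in removed:
                seen.add(child)
                stack.append(child)
    return seen


def pick_hubs(children, k):
    """Greedily pick up to k hubs with mostly non-overlapping descendants."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    removed, hubs = set(), []
    for _ in range(k):
        candidates = sorted(nodes - removed)
        if not candidates:
            break
        best = max(candidates,
                   key=lambda n: len(descendants(children, n, removed)))
        found = descendants(children, best, removed)
        if not found:
            break  # nothing left that has any descendants
        hubs.append(best)
        removed |= found | {best}
    return hubs


# Hypothetical toy forest: node -> direct trainees.
children = {"a": ["a1", "a2", "a3"], "a1": ["a4"], "b": ["b1", "b2"], "c": ["a2"]}
print(pick_hubs(children, 3))
```

Note that on a tree with a single root the first pick removes everything, so in practice the search would need to be restricted, for example to nodes within a few generations of the NE authors.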

@rgerkin
Collaborator

rgerkin commented Feb 19, 2016

@stripathy The notebook is working for me now after a few changes (747cf1e). I'll tackle the fingerprint construction and use in model fitting next.

@svdavid
Collaborator

svdavid commented Feb 19, 2016

@rgerkin The most-prolific pruning option might work. Only complication
there would be if one or two people were ancestors for basically everyone
in the tree. The tree does get narrow toward the top.

So, yeah, we do want to find hubs with non-overlapping descendants, but not
necessarily a lot of them. Maybe some sort of clustering and then pick the
best exemplars from each cluster?

For an NE-specific analysis, we could reference back to the pub data, e.g.,
define people with the greatest average methodological differences as hubs.
This might be too tautological.

Another passing thought: The hubs don't have to be in the NE pub matrix.
Though maybe you've got papers from all the grandparents?

Need to sleep on it.


@stripathy
Contributor Author

Another passing thought: The hubs don't have to be in the NE pub matrix. Though maybe you've got papers from all the grandparents?

Most of the papers indexed in NE were published after 1997, after journals switched to HTML from PDF. So NE doesn't usually have papers from the uber-grandfathers of ephys like Eccles, Sherrington, Kuffler, Hodgkin, or Huxley.

While our immediate goal is to integrate NE with NT, maybe for the purpose of indexing NT paths, or the intrinsic ephys-relevant part of NT, we'd want to include at a minimum all the NE authors + all their grandfathers going back at least 2-3 hops. Remembering back, I think relationships follow sort of an "out of Africa" type model, where if you go back far enough there are really just 10 neuroscientists that everyone trained with. But if you go back only 2-3 generations (i.e., who trained the people who published in NE), there really are specific neurosci schools, centered around a small number of people (like Sakmann, Llinas, David Prince, etc).

@svdavid
Collaborator

svdavid commented Feb 19, 2016

Ha... That's an interesting idea. We can find the minimum set of
grandparents (i.e., ancestors no more than X hops back from the NE set) that
spans the entire set. Fairly objective and should automatically give us the
diversity we need. Need to do a little coding to take care of that, which
may happen today if I can get my Cosyne poster done soon.

For some reason my distance matrix calculator slowed down overnight.
Annoying. But it's getting there.
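
Finding a truly minimum spanning set of grandparents is a set-cover problem, so the sketch below swaps in the usual greedy approximation rather than an exact solution. Everything here is an assumption for illustration: the `parents` mapping (child to mentors), the function names, and the toy data. Authors with no ancestor within `max_hops` are returned separately as uncovered.

```python
def ancestors_within(parents, start, max_hops):
    """Ancestors reachable from start in at most max_hops steps upward."""
    found, frontier = set(), {start}
    for _ in range(max_hops):
        frontier = {p for n in frontier for p in parents.get(n, ())} - found
        found |= frontier
    return found


def greedy_cover(parents, authors, max_hops):
    """Greedy set cover: returns (chosen ancestors, authors left uncovered)."""
    covers = {}
    for a in authors:
        for anc in ancestors_within(parents, a, max_hops):
            covers.setdefault(anc, set()).add(a)
    uncovered, chosen = set(authors), []
    while uncovered and covers:
        # Pick the ancestor that covers the most still-uncovered authors.
        best = max(sorted(covers), key=lambda anc: len(covers[anc] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy data: child -> mentors.
parents = {"n1": ["m1"], "n2": ["m1"], "n3": ["m2"], "m1": ["g"], "m2": ["g"]}
authors = ["n1", "n2", "n3", "n4"]
print(greedy_cover(parents, authors, max_hops=1))
print(greedy_cover(parents, authors, max_hops=2))
```

Raising X lets fewer, more senior ancestors span the set, which is the trade-off between diversity and the "everyone descends from a handful of people" effect discussed above.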


@svdavid
Collaborator

svdavid commented Feb 19, 2016

I broke down and rewrote the pairwise distance code and now it's running at a sane speed (i.e., ~100x faster). I also noticed that it's interesting to look at who shows up as a common ancestor. I'm now recording that as "p0" in the output of http://neurotree.org/beta/include/dist_mtx.php.

If I look at the most frequent occurrences of common ancestors, I get a list of the usual suspects. Maybe these are good fingerprint hubs? This is what I see for the first 200 or so NE people:

mysql> select p0, count(p1),people.firstname,people.lastname from pairDist left join people on p0=pid where p0>0 group by p0 order by count(p1) desc limit 30;
+-------+-----------+-------------+---------------+
| p0    | count(p1) | firstname   | lastname      |
+-------+-----------+-------------+---------------+
|   114 |     10295 | Sir John    | Eccles        |
|   115 |      6201 | Sir Charles | Sherrington   |
|  1713 |      2705 | John        | Langley       |
|   172 |      2641 | Sir Michael | Foster        |
|   151 |      2134 | Johannes    | Müller        |
|   146 |      2034 | Hermann     | von Helmholtz |
|   223 |      1396 | Carl        | Ludwig        |
|   517 |      1370 | Ernst       | Weber         |
|  3011 |      1179 | Rudolf      | Virchow       |
|    65 |      1050 | Stephen     | Kuffler       |
|   116 |       962 | Edgar       | Adrian        |
|   119 |       741 | Karl        | Lashley       |
|  6684 |       671 | Friedrich   | Goltz         |
|   511 |       650 | Franz       | Nissl         |
|   195 |       631 | Henry       | Bowditch      |
|  1857 |       606 | David       | Prince        |
|   122 |       589 | James       | Angell        |
|   196 |       573 | Claude      | Bernard       |
|  4339 |       544 | Otto        | Meyerhof      |
|   135 |       489 | Bert        | Sakmann       |
|   134 |       471 | Otto        | Creutzfeldt   |
|  1716 |       448 | Archibald   | Hill          |
|   188 |       432 | Philip      | Bard          |
|   204 |       402 | John        | Fulton        |
|   524 |       350 | Thomas      | Huxley        |
|   206 |       349 | Harvey      | Cushing       |
|   812 |       335 | Roger       | Nicoll        |
| 21405 |       280 | Robert      | Bunsen        |
|   171 |       278 | Bernard     | Katz          |
|  1888 |       270 | Oswald      | Schmiedeberg  |
+-------+-----------+-------------+---------------+
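
For anyone working from an export rather than the database, the same tally can be approximated in Python. This is a sketch that assumes rows of (p1, p2, d, p0) and, like the `where p0>0` clause in the query, skips rows without a common ancestor; the function name and sample rows are hypothetical, and the name lookup from the `people` table is left out.

```python
from collections import Counter


def top_common_ancestors(rows, limit=30):
    """Count how often each p0 appears as the common ancestor of a pair."""
    counts = Counter(
        p0 for _p1, _p2, _d, p0 in rows if p0 is not None and p0 > 0
    )
    return counts.most_common(limit)


# Hypothetical rows (p1, p2, d, p0); p0 == 0 stands for "no common ancestor".
rows = [(1, 2, 4, 114), (1, 3, 6, 114), (2, 3, 5, 115), (1, 9, -1, 0)]
print(top_common_ancestors(rows))
```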

@svdavid
Collaborator

svdavid commented Feb 20, 2016

The matrix is complete! Of course, about half the pairs (60K/120K) are not connected (yet!).

@stripathy
Contributor Author

@svdavid I get a 500 error upon trying to load this page: http://neurotree.org/beta/include/dist_mtx.php. If it's a big file, you could just add it to the github repo.

@stripathy
Contributor Author

From looking at this: http://neurotree.org/neurotree/tree.php?pid=115&fontsize=0&pnodecount=4&cnodecount=2 , if my "Out of Africa" hypothesis is true, then Eccles and Sherrington are basically Africa.

@stripathy
Contributor Author

Thanks @svdavid for committing this: 8ed9ba7. I'm closing this issue for now.

@nathaliebin
Collaborator

Hi @svdavid,

Could you please update the distance matrix displayed here: http://neurotree.org/beta/include/dist_mtx.php with the neurotree author PIDs in this file: UniquePID.txt

We have updated the listing of authors in NeuroElectro and we noticed that not all of these authors had corresponding entries in the distance matrix output that you had previously generated.

Thanks,
@nathaliebin and @stripathy

@svdavid
Collaborator

svdavid commented Oct 19, 2016

@nathaliebin : got your request, and working on it! I've been traveling and tied up with other stuff. Should be able to get to it soon.

@stripathy
Contributor Author

hi @svdavid have you been able to update this yet?

@svdavid
Collaborator

svdavid commented Nov 2, 2016

Ok! The table is updating. When I merge the new set of pids and previously included pids, I get 1147 distinct nodes. You might check the list that you get out of http://neurotree.org/beta/include/dist_mtx.php and make sure it includes all the pids you need. It's taking a little while, so the 1147 x 1147 distance matrix may not be completely done until late tonight.
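
A quick way to run that check once the rows are downloaded is sketched here. It assumes the required pids have been read into a list (for example from UniquePID.txt) and that each exported row starts with p1 and p2; the function name and sample values are hypothetical.

```python
def missing_pids(required, rows):
    """Return the required pids that never appear as p1 or p2 in rows."""
    present = {p for p1, p2, *_ in rows for p in (p1, p2)}
    return sorted(set(required) - present)


# Hypothetical example: pid 999 is required but absent from the export.
rows = [(77, 135, 5, 114), (77, 169, 8, 115)]
print(missing_pids([77, 135, 999], rows))
```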

@svdavid
Collaborator

svdavid commented Nov 3, 2016

Now there are too many entries and dist_mtx is running out of memory. Do you have a mysql client you can use? I can give you a query to pull directly from the database.

@stripathy
Contributor Author

Yes - there's a mysql client we can use. Feel free to send me credentials
to login to your database or a dump of the database that we can load up
here locally.


Shreejoy Tripathy
Post-Doctoral Researcher
Department of Psychiatry
University of British Columbia

@svdavid
Collaborator

svdavid commented Nov 3, 2016

@nathaliebin @stripathy : OK, give this a whirl. And let me know how it goes.

To connect:
host=klab.c3se0dtaabmj.us-west-2.rds.amazonaws.com
user=dacuna
pw=dacuna
database=academictree

Then run the query: select * from pairDistNE;

p1,p2 = pid of node pair
d = distance between them through common ancestor (d=-1 means no connection)
p0 = pid of common ancestor
