Generate a matrix of pairwise path lengths for all neuroelectro authors #8
Comments
Ok, some progress. First pass at the link between NE pmids and NT pids is tab-delimited, first column is PMID, second is NT pid of last author (0 353 matches to some pid gives ~62000 pairs to calculate distances. This On Wed, Feb 17, 2016 at 1:28 PM, Shreejoy Tripathy <notifications@github.com
|
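For reference, the ~62000 figure is the number of unordered pairs among 353 matched authors. A minimal Python sketch of reading the mapping and counting pairs, assuming the file has two tab-separated columns (PMID, then last-author pid) and that a pid of 0 means "no match". The sample rows and the function name `load_pmid_to_pid` are hypothetical, not taken from the actual file.

```python
import csv
import io
from math import comb


def load_pmid_to_pid(handle):
    """Read tab-delimited rows of (PMID, last-author pid) into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[int(row[0])] = int(row[1])
    return mapping


# Made-up sample rows standing in for the real mapping file.
sample = io.StringIO("1001\t12\n1002\t0\n1003\t57\n1004\t12\n")
mapping = load_pmid_to_pid(sample)

# Assumption: pid 0 marks a PMID with no NT match.
matched_pids = {pid for pid in mapping.values() if pid != 0}
n_pairs_sample = comb(len(matched_pids), 2)

# Unordered pairs among 353 matched authors.
n_pairs_353 = comb(353, 2)
print(n_pairs_353)
```

Counting ordered pairs instead (both (p1,p2) and (p2,p1)) doubles this to 353 * 352.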
@stripathy @svdavid Since the new approach I proposed for using path length vectors as features requires the path length matrix, should I wait until this is more fleshed out, or proceed with the one we have now? |
@rgerkin it's up to you - either way we'll want to do the analysis on just CA1 pyramidal cells and the analysis should be mostly the same given more cell types. |
At this point it might make sense to see if you can deal with the current Speaking of that, the table where I'm storing the pairwise distances has mysql> select p1,p2,d from pairDist limit 20; That work for you? stephen On Thu, Feb 18, 2016 at 1:54 PM, Shreejoy Tripathy <notifications@github.com
|
Extract pairwise distance matrix through a common ancestor. For ease of use, this contains redundancies, i.e., pairs (p1,p2) and (p2,p1) have the same distance. Does this format work? Should generalize to any connection matrix. http://neurotree.org/beta/include/dist_mtx.php This is currently being populated (2/18/16 pm). Please don't query repeatedly, since it's not a trivial-sized query. Note in passing: You'll see that lower pids tend to be a lot more connected while higher pids (added more recently) are more likely not to be connected to most others. |
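One way to read "distance through a common ancestor" is sketched below in Python. This is not the code behind dist_mtx.php. It assumes a hypothetical `parents` dict mapping each pid to the pids of its mentors, walks upward from both nodes, and takes the shared ancestor that minimizes the summed hop counts. A node counts as its own ancestor at depth 0, so direct mentor/trainee pairs are covered.

```python
from collections import deque


def ancestor_depths(pid, parents):
    """BFS upward: map pid and each of its ancestors to the minimum hop count."""
    depths = {pid: 0}
    queue = deque([pid])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


def common_ancestor_distance(p1, p2, parents):
    """Return (distance, ancestor) via the closest shared ancestor, or (None, None)."""
    d1 = ancestor_depths(p1, parents)
    d2 = ancestor_depths(p2, parents)
    shared = d1.keys() & d2.keys()
    if not shared:
        return None, None  # the pair is not connected
    # Ties are broken by the smaller pid so the result is deterministic.
    best = min(shared, key=lambda a: (d1[a] + d2[a], a))
    return d1[best] + d2[best], best


# Hypothetical toy tree: 1 trained 2 and 3; 2 trained 4; 3 trained 5; 6 trained 7.
parents = {2: [1], 3: [1], 4: [2], 5: [3], 7: [6]}
print(common_ancestor_distance(4, 5, parents))
```

Because the two upward walks are summed, the result is the same for (p1,p2) and (p2,p1), which matches the redundancy described above.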
Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. Probably if we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to analyze and generate. |
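A rough Python sketch of the N x M idea, assuming the pairwise distances are available as (p1, p2, d) rows like the pairDist output. The function name, the toy pids, and the use of `None` for unconnected pairs are illustrative assumptions.

```python
def fingerprint_matrix(rows, authors, hubs, missing=None):
    """Build one row per author holding that author's distance to each hub."""
    dist = {(p1, p2): d for p1, p2, d in rows}
    matrix = []
    for author in authors:
        vector = []
        for hub in hubs:
            if author == hub:
                vector.append(0)
            else:
                # Look up either ordering, since the table may or may not be redundant.
                d = dist.get((author, hub), dist.get((hub, author), missing))
                vector.append(d)
        matrix.append(vector)
    return matrix


# Toy data: hubs 1 and 2, authors 10, 11, 12 (12 is connected to neither hub).
rows = [(10, 1, 2), (1, 10, 2), (11, 1, 3), (11, 2, 1)]
fingerprints = fingerprint_matrix(rows, authors=[10, 11, 12], hubs=[1, 2])
print(fingerprints)
```

Each row is then a length-M feature vector for one author, instead of a length-N row of the full matrix.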
@svdavid I think we can discover the M interesting nodes automatically by first doing the full analysis on the full N x N matrix, and then extracting the interesting M of those N. Or am I overestimating the likelihood that the M of interest will even have papers here (maybe they trained many of the N but otherwise haven't published much in the neuroelectro CA1 time frame). @stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such. |
Should I abandon some of @stripathy's code in favor of http://neurotree.org/beta/include/dist_mtx.php, or should I wait on that? |
Yeah - I agree. @rgerkin I think that @svdavid is proposing this mostly as a practicality thing. My guess is that it's not computationally trivial to generate the full N x N NE author matrix (for whatever reason) but it'd be much quicker just to generate the N x M matrix. @svdavid am I right in this?
Ah sorry! Hopefully it's not too painful. This side project seems like a good excuse to migrate over to Python 3. |
Learning some interesting things about database table locks and the In the meantime, take 1 of the full matrix should be done tomorrow morning. I'm also realizing that a reduced size fingerprint scheme may provide an On Thu, Feb 18, 2016 at 7:39 PM, Shreejoy Tripathy <notifications@github.com
|
If you are going to pick some key nodes on which to build the reduced fingerprint, will you just take the N nodes with the most descendants (that have papers in neuroelectro)? Or the N that maximize some other metric? |
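The "most descendants with papers in neuroelectro" option could look something like the Python sketch below. It assumes a hypothetical `children` dict (mentor pid to trainee pids) and a set of NE author pids. This is only one candidate metric, not a decided strategy.

```python
def descendants(pid, children):
    """Collect every node reachable downward from pid (excluding pid itself)."""
    seen = set()
    stack = list(children.get(pid, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def top_hubs(children, ne_pids, k):
    """Rank mentors by how many of their descendants are NE authors."""
    scored = [(len(descendants(p, children) & ne_pids), p) for p in children]
    # Highest count first; ties broken by the smaller pid.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(pid, count) for count, pid in scored[:k]]


# Toy tree: 1 trained 2 and 3; 2 trained 4 and 5; 3 trained 6; 7 trained 8.
children = {1: [2, 3], 2: [4, 5], 3: [6], 7: [8]}
ne_pids = {4, 5, 6, 8}
print(top_hubs(children, ne_pids, k=3))
```

Note that this ranking favors nested hubs (1 and its trainee 2 both score highly here), so a pruning step would still be needed to get hubs with non-overlapping descendants.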
Haven't decided on what strategy to use yet, but that sounds like a good For now, we can just stick with the big matrix, since that's worth having On Thu, Feb 18, 2016 at 10:56 PM, Richard C Gerkin <notifications@github.com
|
Possibly this would be solved by some eigenvector approach (although I don't know what that looks like in a directed graph), but another approach would be:
|
@stripathy The notebook is working for me now after a few changes (747cf1e). I'll tackle the fingerprint construction and use in model fitting next. |
@rgerkin The most-prolific pruning option might work. Only complication So, yeah, we do want to find hubs with non-overlapping descendants, but not For an NE specific analysis, we could reference back to the pub data, eg, Another passing thought: The hubs don't have to be in the NE pub matrix. Need to sleep on it. On Thu, Feb 18, 2016 at 11:51 PM, Richard C Gerkin <notifications@github.com
|
Most of the papers indexed in NE were published after 1997, after journals switched to HTML from PDF. So NE doesn't usually have papers from the uber-grandfathers of ephys like Eccles, Sherrington, Kuffler, Hodgkin or Huxley, etc. While our immediate goal is to integrate NE with NT, maybe for the purpose of indexing NT paths, or the intrinsic ephys relevant part of NT, we'd want to include at a minimum all the NE authors + all their grandfathers going back at least 2-3 hops. Remembering back, I think relationships follow sort of an "out of Africa" type model, where if you go back far enough there's really just 10 neuroscientists that everyone trained with. But if you go back only 2-3 generations (i.e., who trained the people who published in NE), there really are specific neurosci schools, centered around a small number of people (like Sakmann, Llinas, David Prince, etc). |
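Collecting "all the NE authors + their ancestors going back 2-3 hops" could be sketched in Python as below. The `parents` dict (pid to mentor pids) and the toy pids are hypothetical.

```python
def ancestors_within(pids, parents, max_hops):
    """Return the starting pids plus every ancestor reachable in at most max_hops steps."""
    included = set(pids)
    frontier = set(pids)
    for _ in range(max_hops):
        # Move one generation up, keeping only nodes not already included.
        frontier = {p for node in frontier for p in parents.get(node, ())} - included
        if not frontier:
            break
        included |= frontier
    return included


# Toy lineage: 0 -> 1 -> {2, 3}; 2 -> 4; 3 -> 5 (arrows point mentor to trainee).
parents = {4: [2], 2: [1], 1: [0], 5: [3], 3: [1]}
print(sorted(ancestors_within({4, 5}, parents, max_hops=2)))
```

With `max_hops=2` the toy set stops at the shared "school" founder (pid 1); one more hop pulls in the older common root (pid 0).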
Ha... That's an interesting idea. We can find the minimum set of For some reason my distance matrix calculator slowed down overnight. On Fri, Feb 19, 2016 at 10:18 AM, Shreejoy Tripathy <
|
I broke down and rewrote the pairwise distance code and now it's running at a sane speed (i.e., ~100x faster). I also noticed that it's interesting to look at who shows up as a common ancestor. I'm now recording that as "p0" in the output of http://neurotree.org/beta/include/dist_mtx.php. If I look at the most frequent occurrences of common ancestors, I get a list of the usual suspects. Maybe these are good fingerprint hubs? This is what I see for the first 200 or so NE people: mysql> select p0, count(p1),people.firstname,people.lastname from pairDist left join people on p0=pid where p0>0 group by p0 order by count(p1) desc limit 30; |
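For anyone working from a downloaded copy of the table rather than the database, the same tally can be approximated in Python. This sketch assumes rows of (p0, p1, p2, d), that p0 <= 0 means no common ancestor (mirroring the `where p0>0` filter), and a hypothetical `names` dict standing in for the `people` table. The sample rows and names are made up.

```python
from collections import Counter


def most_common_ancestors(rows, names=None, limit=30):
    """Count how often each p0 appears as a common ancestor, most frequent first."""
    names = names or {}
    counts = Counter(p0 for p0, p1, p2, d in rows if p0 > 0)
    # names.get(...) returns None for unknown pids, like the LEFT JOIN in the query.
    return [(p0, n, names.get(p0)) for p0, n in counts.most_common(limit)]


# Made-up rows: (p0, p1, p2, d); the row with p0 == 0 has no common ancestor.
rows = [
    (1, 10, 11, 2),
    (1, 11, 10, 2),
    (2, 10, 12, 4),
    (0, 10, 13, -1),
    (1, 10, 12, 3),
]
names = {1: ("Ada", "Example")}
print(most_common_ancestors(rows, names))
```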
matrix is complete! Of course, about half the pairs (60K/120K) are not connected (yet!) |
@svdavid I get a 500 error upon trying to load this page: http://neurotree.org/beta/include/dist_mtx.php. If it's a big file, you could just add it to the github repo. |
From looking at this: http://neurotree.org/neurotree/tree.php?pid=115&fontsize=0&pnodecount=4&cnodecount=2 , if my "Out of Africa" hypothesis is true, then Eccles and Sherrington are basically Africa. |
Hi @svdavid, Could you please update the distance matrix displayed here: http://neurotree.org/beta/include/dist_mtx.php with the neurotree author PIDs in this file: UniquePID.txt We have updated the listing of authors in NeuroElectro and we noticed that not all of these authors had corresponding entries in the distance matrix output that you had previously generated. Thanks, |
@nathaliebin : got your request, and working on it! I've been traveling and tied up with other stuff. Should be able to get to it soon. |
hi @svdavid have you been able to update this yet? |
Ok! The table is updating. When I merge the new set of pids and previously included pids, I get 1147 distinct nodes. You might check the list that you get out of http://neurotree.org/beta/include/dist_mtx.php and make sure it includes all the pids you need. It's taking a little while, so the 1147 x 1147 distance matrix may not be completely done until late tonight. |
Now there are too many entries and dist_mtx is running out of memory. Do you have a mysql client you can use? I can give you a query to pull directly from the database. |
Yes - there's a mysql client we can use. Feel free to send me credentials On Wed, Nov 2, 2016 at 5:50 PM Stephen D notifications@github.com wrote:
|
@nathaliebin @stripathy : OK, give this a whirl. And let me know how it goes. To connect: Then run the query: select * from pairDistNE; p1,p2 = pid of node pair |
@svdavid , here's the neuroelectro spreadsheet, the relevant column name here is 'Pmid': http://dev.neuroelectro.org/static/src/article_ephys_metadata_curated.csv
No need to re-download if you have the link I sent out from a couple weeks ago, the two spreadsheets will be very similar.