Calculate pairwise similarity of neuroelectro methods and neurotree node distances #2
Yes! This seems like the thing to do. I guess the first order of business is to assess how easy it is to pull out the distances that you want from NT. That raises two questions: Are the necessary people in there? And is the author-publication linking robust enough?
Snapshot of Neurotree (as of this afternoon) now available here:
Here's the initial version of the analysis, for authors publishing on CA1 pyramidal cells and with useful path info from NeuroTree (the 36 authors). Details in this ipython notebook: https://github.com/neuroelectro/neuroelectro_neurotree/blob/master/ca1_analysis.ipynb

Matrix of pairwise min path lengths (numbers are neurotree IDs, blue is shorter path lengths). Fewer cliques than I would expect.

Lastly, correlations for the scatter plots above:
Correlations, for all neurotree paths
Correlations, only for neurotree paths 10 or less hops
Correlations, only for neurotree paths 5 or less hops

A few issues here:
Cool! I see the positive correlation up to 5 hops for NT-ephys and NT-methods, and then it looks like it levels out (though I guess you still see a positive correlation below 10?). Think that means NT distance captures something? I've been working on ways to add missing authors/pubs from the NE pub list. Several of those papers have been added, thanks to the revised Scopus scan. However, it looks like a certain amount of supervision will be required to get all the papers in, especially for adding mentor links for new authors if we want a solid NT distance.
@svdavid and @rgerkin maybe we can have a quick call soon to talk about this? It'd be good for me to discuss before I rerun the analysis on the entire set of neuroelectro authors. Before we (or rather @svdavid) invest a lot of time trying to add nodes / edges, I wonder if we can quickly see if adding neurotree path info buys us anything in terms of explanatory power from the edges/nodes that are already present.

The quickest thing to try is a multiple regression with NT distance and methods as inputs and ephys results as outputs. And maybe restrict to the set of pairs with distances that appear to be linear (<= 5?). I agree, it'd be nice to try for a few ephys variables with the data we've got. @stripathy, if you want to talk, maybe sometime mid-day Weds? Leaving town Friday for a couple conferences and will be mostly out of commission next week.

stephen
This sounds good, but I'm not sure how to immediately incorporate the pairwise distances into a regression framework. NeuroTree gives me a distance per pair of pmids/authors, whereas for regression I think I need something per individual pmid/neuron record. Or am I missing something? Talking midday on Wed sounds good; how about 11am or 3pm?
I was thinking of each pair as a datapoint for the regression: ephys difference as a function of (methods distance, NT distance). Possibly some multiple comparisons buried in there for any serious significance tests, but probably ok for a first pass.
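A minimal sketch of this pairwise-regression setup, on made-up data (all names and numbers here are illustrative, not from the repo): each unordered article pair contributes one row, with methods distance and NT path length as regressors and the ephys difference as the response.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 200

# Hypothetical per-pair predictors: methods (euclidean) distance and
# neurotree path length for each pair of articles.
meth_dist = rng.uniform(0, 2, n_pairs)
nt_dist = rng.integers(1, 6, n_pairs).astype(float)
# Simulated ephys difference with contributions from both predictors.
ephys_diff = 0.5 * meth_dist + 0.3 * nt_dist + rng.normal(0, 0.2, n_pairs)

# Each unordered article pair is one row of the design matrix.
X = np.column_stack([np.ones(n_pairs), meth_dist, nt_dist])
beta, *_ = np.linalg.lstsq(X, ephys_diff, rcond=None)
intercept, b_meth, b_path = beta
```

With enough pairs, both coefficients are recovered near their true values, which is the "both regressors significant" situation discussed later in the thread.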
11 or 3 Weds is fine. Slight preference for 11.
Great - let's do 11 on Wednesday. @rgerkin, feel free to join us if you're available.

On Tue, Feb 16, 2016 at 12:36 AM Stephen D notifications@github.com wrote:
@svdavid as you requested I fit a simple regression model using pairwise methods distances (euclidean) and neurotree path lengths to predict pairwise ephys differences (ipython notebook updated accordingly: https://github.com/neuroelectro/neuroelectro_neurotree/blob/master/ca1_analysis.ipynb). Here, I'm only analyzing pairs of articles with a neurotree path length of <= 5. Both regressors, methods and path lengths (denoted by 'meth' and 'path' below), are significant and non-zero. They're also not completely explaining identical sources of ephys variance.
@stripathy @svdavid I'll try to make the call on Wednesday (11am PT / 12pm MT). I agree that 5 looks like the right cutoff from the data, if you want a cutoff. Alternatively you could fit something like: @stripathy What do you mean by "They're also not completely explaining identical sources of ephys variance"?
In terms of model variance explained,
Agreed, if I was fitting this in R, I'd use something like a smoothing spline rather than a single linear function. But in 5 mins I couldn't quickly figure out how to make python's statsmodels do that.
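For what it's worth, scipy (rather than statsmodels) has a smoothing spline in `UnivariateSpline`. A toy sketch on synthetic data with the kind of flattening-after-5-hops shape discussed above (not the actual notebook data):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
# Synthetic ephys-difference vs. path-length data that saturates after ~5 hops.
path_len = np.sort(rng.uniform(0, 10, 200))
ephys_diff = np.minimum(path_len, 5.0) + rng.normal(0, 0.3, 200)

# Smoothing spline; s trades smoothness against fit (here roughly n * noise_var).
spline = UnivariateSpline(path_len, ephys_diff, s=200 * 0.3 ** 2)
```

The fitted curve rises over the first few hops and then flattens, rather than forcing a single global slope.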
@stripathy, cool! So NT dist explains more than methods??? Following up on @rgerkin, you should be able to put both distance and methods into a single regression analysis, and then you can say how much each term contributes, right? More conservative is to do it stepwise, where you regress methods versus ephys and then regress NT distance versus the residual. Regarding the upper limit of 5, I agree, we can do something smart to nail down the statistics, but so far the relationships look like they flatten out pretty quickly. Ok, let's plan on gchat at 11 PT tomorrow for all who can attend.

In the above, what does "identical sources of ephys variance" correspond to?
I mean like, the predictive power of neurotree pairwise distance is not completely collinear with methods distance, meaning that a model with both terms is better than a model with just one or the other (looking at adjusted R^2, which accounts for simple differences in model complexity). @svdavid this kind of gets at your point to do the analysis in a stepwise manner.
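One way to see the "model with both terms is better" claim concretely is to compare adjusted R^2 for nested models. A sketch on purely simulated, illustrative data:

```python
import numpy as np

def adjusted_r2(y, X):
    """Fit OLS and return adjusted R^2 (penalizing extra regressors)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    n, k = X.shape  # k includes the intercept column
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k)

rng = np.random.default_rng(2)
n = 300
meth = rng.normal(size=n)
path = rng.normal(size=n)  # independent of meth, i.e. not collinear
y = 0.6 * meth + 0.4 * path + rng.normal(0, 0.5, n)

ones = np.ones(n)
r2_meth = adjusted_r2(y, np.column_stack([ones, meth]))
r2_both = adjusted_r2(y, np.column_stack([ones, meth, path]))
```

When the second regressor carries independent signal, the two-term model wins even after the complexity penalty; if the regressors were collinear, adjusted R^2 would barely move.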
Yes, but I wouldn't read too much into that. I think it's partially because summarizing the difference between two articles' methods sections as a single "euclidean distance" is a terrible way to express their difference. Let's talk more about details tomorrow. @rgerkin feel free to rerun and iterate on the analysis in the ipython notebook.
Thanks, I think I know what I need to do for this - first, I need to fit a model between ephys and methods (using the current models that @rgerkin, @dtebaykin, and I have developed), and then calculate the residual between ephys and predicted ephys given methods info alone. Then redo the regression analysis using this ephys residual (after accounting for methods differences) with pairwise neurotree distance.
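The stepwise plan above can be sketched like this, on synthetic data with illustrative variable names: regress ephys differences on methods distances, then correlate the residual with neurotree path length.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n = 250
meth_dist = rng.uniform(0, 2, n)
nt_dist = rng.uniform(0, 5, n)
ephys_diff = 0.7 * meth_dist + 0.3 * nt_dist + rng.normal(0, 0.2, n)

# Step 1: regress ephys differences on methods distances only.
X = np.column_stack([np.ones(n), meth_dist])
beta, *_ = np.linalg.lstsq(X, ephys_diff, rcond=None)
resid = ephys_diff - X @ beta

# Step 2: correlate the residual with neurotree path length.
r, p = pearsonr(nt_dist, resid)
```

If neurotree distance carries signal beyond the methods model, the residual correlation comes out positive and significant.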
Here's my rerunning of the analysis, first accounting for the metadata-ephys relationship and then assessing the correlation between neurotree pairwise path length and ephys differences. I used a generic statistical model (@rgerkin's random forest implementation) to predict ephys data given known metadata variables. Steps:
Here's my interpretation in a nutshell - lineage explains both known methods and ephys results out to a path length of about 3-5. Even after accounting for known methods, lineage still explains some residual ephys differences, but considerably less. So what could this variance be? My bet is calculation variance (i.e., whether someone reports spike amplitude as peak to trough vs threshold to peak), but of course it could be literally anything, from how cells are selected to things in methods sections that we're not yet mining, like ion concentrations of glucose, ATP, HEPES, etc. Also, I wonder if low methods difference between author pairs with higher than expected path distances may be an indication of a missing edge...

Correlations, only for neurotree paths less than 5 hops
That sounds reasonable. For preliminary analysis, it seems ok to allow the

stephen

On Tue, Feb 16, 2016 at 6:18 PM, Shreejoy Tripathy <notifications@github.com
@stripathy As a sanity check, can you repeat this, but just have step 3 be:
to show that the correlation is zero or approximately so when using the thing on which the original model was run?
@rgerkin good idea! The last row below is the correlation between methods and ephys residuals after accounting for methods. It's much higher than I expected, but then looking at the graph, it seems almost entirely due to the fact that there are always going to be some points at the origin when you do a correlation of distance matrices (Mantel's test). I think we can decide later whether it makes sense to remove these origin points or not.

Correlations, only for neurotree paths less than 5 hops
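A quick illustration of how origin points inflate a distance-matrix correlation (toy numbers, not the CA1 data): same-author pairs sit at (0, 0) in both distance measures, while the remaining pairs are unrelated by construction.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
# Unraveled distance-matrix entries for hypothetical article pairs.
# Same-author pairs are zero in both measures; the other pairs have
# methods and neurotree distances drawn independently of each other.
n_same, n_diff = 40, 200
meth = np.concatenate([np.zeros(n_same), rng.uniform(0.5, 2.0, n_diff)])
nt = np.concatenate([np.zeros(n_same), rng.uniform(1.0, 5.0, n_diff)])

r_all, _ = pearsonr(meth, nt)            # inflated by the origin cluster
nonzero = nt > 0
r_nonzero, _ = pearsonr(meth[nonzero], nt[nonzero])
```

Even with zero true relationship among distinct-author pairs, the origin cluster alone manufactures a sizable positive correlation, which vanishes once those points are dropped.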
Yeah, zero distance definitely contributes a lot -- I'm assuming there are

Seems like an agenda item for today.

On Wed, Feb 17, 2016 at 12:06 AM, Shreejoy Tripathy <
Results after removing neurotree path lengths = 0:

Correlations, only for 0 < path len < 5

@svdavid you're right - a lot of the previous results were coming from a path length = 0 (i.e., the same author). But I want to redo this analysis in a bigger dataset where I have more instances of pathlen = 1 and pathlen = 2, since they're relatively rare in the CA1-only dataset.
@rgerkin updated ipython notebook available here: https://github.com/neuroelectro/neuroelectro_neurotree/blob/master/ca1_analysis.ipynb
@stripathy @svdavid I've updated this with the new distance matrix in 8ed9ba7, which slightly improves the neurotree correlations for nonzero path lengths to:
from the previous version (with the much smaller and sparser distance matrix):
I should note that the correlations for shorter path lengths look stronger (visually), at least in the path length 1-3 range. Next up I will work on prediction from the matrix fingerprint (i.e. from columns of the matrix).
Cool! I'm being pulled in lots of directions at the ARO meeting, but as

On Mon, Feb 22, 2016 at 9:08 PM, Richard C Gerkin notifications@github.com
@stripathy I just noticed something about the residuals analysis that needs to be addressed. When you are fitting the random forest, you are fitting directly to the whole dataset and not splitting training and test sets. With a sufficient number of features, the correlation between the predicted and actual values will be very high (I get well above 0.9 for each of the seven ephys properties), and the residuals become less and less meaningful. With a sufficient amount of metadata, the correlation approaches 1 and the residuals approach 0, but that doesn't actually mean you've explained anything, since it is just massive overfitting and will not generalize. So if you want to look at correlation with the residuals, they should be residuals from a held-out test set. But perhaps it is better just to see predictive power increase when including neurotree path lengths and not worry about residuals. I will take care of this, but I wanted to get your feedback first in case I am missing something and you had something else in mind.
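A small demonstration of the overfitting point, assuming scikit-learn and purely synthetic data: fit a random forest on one half and compare in-sample vs held-out residuals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, n_features = 400, 20
X = rng.normal(size=(n, n_features))
# Only the first feature carries signal; the rest play the role of
# abundant-but-uninformative metadata columns.
y = X[:, 0] + rng.normal(0, 1.0, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# In-sample residuals shrink toward zero (overfitting)...
resid_train = y_tr - rf.predict(X_tr)
# ...while held-out residuals retain the genuinely unexplained variance.
resid_test = y_te - rf.predict(X_te)
```

The train-set residuals come out much smaller than the test-set residuals even though the extra features explain nothing, which is why residuals for a follow-up correlation need to come from held-out data.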
@rgerkin you're right, and it didn't occur to me that the residuals would be so completely overfit as to not be useful. I just wanted to fit "some model" that accounted for the effect of experimental metadata.
@stripathy @rgerkin I just uploaded fingerprint_mtx.txt. This contains the distance between each NE author (p1) and the 50 top nodes (p2, i.e., the most frequent common ancestors among NE authors). This 358 x 50 matrix may be easier to work with for clustering and determining similarity between NE authors. Distances are calculated via common ancestor, so there's now a new "p0" field that indicates the common ancestor for the NE author and the hub node. The new p0 is probably not interesting, but I left it in just for completeness. Does this seem useful?
@stripathy @svdavid Working on this again today...

@rgerkin Good to hear from you! Just got back from the Metaknowledge Network spring workshop with some new fuel for the fire. A couple different groups are doing latent semantic analysis and related things with collections of abstracts, and I think I've got them interested in things Neurotree-related. Not to divert you from the current analysis, but this might lead to some interesting new dimensions to throw into the regression mix in the long term.
Yeah - it'd be great to put together a poster panel showing something from

Sorry I've been quiet on this the past couple weeks, been busy with

On Mon, Mar 21, 2016 at 4:59 PM Stephen D notifications@github.com wrote:
I updated the ca1-analysis notebook to do cross-validation (i.e. to compute correlations on holdout samples) and not to worry about residual analysis, which I think is going to be misleading. Instead I compared prediction quality (correlation between predicted and observed) of ephys values in three ways:
(2) uses distances to those authors identified in @stripathy's original 'min_df' matrix that he developed, but recalculated with a more complete distance matrix. (3) uses the nodes in the fingerprint matrix that @svdavid sent a while back, which contains 50 grandfathers. I should note that most of those 50 people are pretty old, and they seem a little too far back to matter (what is being well-connected to Louis Agassiz really going to tell you about AHP amplitudes?), so maybe there is still some better way to do this. The results are in the bottom of the notebook, but basically the neurotree data isn't really adding anything. Although I may have done something wrong, because "Heinz Beck" (who?) is appearing as the second most important feature in analysis 2.
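The kind of three-way comparison described above can be organized roughly like this, again on synthetic stand-in data (cross-validated predictions via scikit-learn; none of the feature names come from the actual matrices):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n = 300
metadata = rng.normal(size=(n, 5))   # stand-in for NE methods metadata
nt_feats = rng.normal(size=(n, 5))   # stand-in for neurotree distance columns
weights = np.array([1.0, 0.5, 0.5, 0.3, 0.2])
# Metadata carries most of the signal; neurotree adds only a small component.
y = metadata @ weights + 0.3 * nt_feats[:, 0] + rng.normal(0, 1.0, n)

def cv_corr(X, y):
    """Correlation between observed values and 5-fold cross-validated predictions."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    pred = cross_val_predict(model, X, y, cv=5)
    return pearsonr(pred, y)[0]

r_meta = cv_corr(metadata, y)                         # (1) metadata only
r_nt = cv_corr(nt_feats, y)                           # neurotree only
r_both = cv_corr(np.hstack([metadata, nt_feats]), y)  # combined
```

Comparing the held-out correlations directly, rather than residuals, sidesteps the overfitting issue raised earlier in the thread.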
That's not too promising, is it? I guess there's still the question of whether NT similarity can predict similarity of reported methods (known knowns). And btw, even if the fingerprint people are old, they may still be relevant, inasmuch as the relative distance between researchers and some old hub gives you an approximation of how close they are to each other. But I have gotten an offline pairwise distance calculator working. Would you be interested in the pairwise distance between all NE people?

@rgerkin I'm a little confused. Where are you getting "Heinz Beck" from? I don't see him as one of the hub nodes for the fingerprint. I'm wondering if I exported some of the data wrong or in a weird way. It's been a while, so I guess my real question is: what data/file are you using for the NT fingerprint?
@svdavid Yes, I'll look at NT similarity alone next. Pairwise similarity between all NE people would be good to have. Then maybe there is some other dimensional reduction of that data that I can try. As for Heinz Beck, that is for analysis 2, which is not using the fingerprint data but rather the original dist_mtx.txt, restricted to the authors in NE. The fingerprint data is used in analysis 3, using fingerprint_mtx.txt. And my guess is that Heinz Beck had a paper that is either miscurated in NE (@stripathy maybe you can check), so it has values that are off by an order of magnitude from everything else, or else he just has data that is very different.
Hmm, I just did a quick check for Heinz Beck in neuroelectro - he had 3

On Tue, Mar 22, 2016 at 12:13 PM Richard C Gerkin notifications@github.com
@stripathy @svdavid In all cases I am just dropping the entries with no neurotree ID, since otherwise all-neuroelectro gets an unfair advantage. The two graphs are for method 2 above (distances to well-connected authors) and method 3 above (distances to fingerprint grandfathers). For most features, near 0 seems best. In some cases, the curve is actually pretty flat, though, meaning that the neurotree model is not a horrible substitute for the neuroelectro model (e.g. for input resistance), but it is still worse outright. Many other model combinations are possible as well, for example using only some of the neuroelectro features (like ignoring solution metadata) and then averaging with the neurotree model, although I didn't do these.
Good that NT has some sort of information in it! How easy/hard is it to rotate the problem and simply ask how well NT distance can predict the neuroelectro free parameters? E.g., do two NT cousins tend to use the same solutions and slice temperatures?
New publication-based similarity scheme. How does this look to you?
@rgerkin Thinking about the demo I'm giving at FORCE 11 in a couple weeks. How hard is it to modify the regression analysis to test how well the mentorship network predicts methodological parameters (e.g., how well does distance to well-connected authors predict r_in, tau, etc.)? I could also show the model combination results if you think they're real.
Hi Stephen,

Have you printed your poster for Force11 yet? If not, is there time for me

I'm really sorry I've been incommunicado about this over the past couple

On Thu, Mar 31, 2016 at 9:30 AM Stephen D notifications@github.com wrote:
Alas the poster has been printed and uploaded. If you have figures, though,

And regardless, we should catch up and talk about plans.
For each publication in neuroelectro, calculate the pairwise similarity of electrode solutions (or some other set of experimental metadata). For these same publications' last authors, calculate the neurotree pairwise path lengths. Given these two matrices, we can then ask whether shorter neurotree paths are correlated with more similar publication methodologies.
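A rough sketch of this proposed analysis on made-up data (the "neurotree" matrix here is simulated rather than pulled from NT, and all variable names are illustrative): build both pairwise matrices, then correlate their upper triangles so each unordered pair is counted once.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n_pubs = 30
# Hypothetical per-publication methods vectors (e.g. electrode-solution
# concentrations) for each publication.
methods = rng.normal(size=(n_pubs, 4))

# Pairwise euclidean distances between publications' methods vectors.
meth_dist = np.linalg.norm(methods[:, None, :] - methods[None, :, :], axis=-1)

# Stand-in for neurotree path lengths: here just a noisy, rounded function of
# methods distance, so the two matrices should come out correlated.
nt_path = np.round(meth_dist + rng.normal(0, 0.5, meth_dist.shape))
nt_path = np.maximum((nt_path + nt_path.T) / 2, 0)  # symmetrize, clip at 0

# Correlate the upper triangles (each unordered pair counted once).
iu = np.triu_indices(n_pubs, k=1)
r, p = pearsonr(meth_dist[iu], nt_path[iu])
```

In the real analysis a Mantel test with permutations would be more appropriate than a plain Pearson p-value, since the pairs sharing a publication are not independent.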