
Calculate pairwise similarity of neuroelectro methods and neurotree node distances #2

stripathy opened this issue Feb 12, 2016 · 44 comments

@stripathy
Contributor

For each publication in neuroelectro, calculate the pairwise similarity of electrode solutions (or some other set of experimental metadata). For these same publications' last authors, calculate the pairwise neurotree path lengths. Given these two matrices, we can then ask whether shorter neurotree paths are correlated with more similar publication methodologies.
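
A minimal sketch of that comparison (all data below are placeholders; the real methods vectors and path-length matrix would come from neuroelectro and the neurotree API):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

# Placeholder inputs: one numeric methods-metadata vector per publication,
# and a symmetric matrix of neurotree path lengths between last authors.
n_pubs = 36
methods = np.random.rand(n_pubs, 10)
tree_dist = np.random.randint(1, 20, (n_pubs, n_pubs)).astype(float)
tree_dist = (tree_dist + tree_dist.T) / 2  # symmetrize for the sketch

# Pairwise euclidean distances between publications' methods vectors
meth_dist = squareform(pdist(methods, metric='euclidean'))

# Correlate the two distance matrices over the upper triangle
# (each pair counted once, diagonal excluded)
iu = np.triu_indices(n_pubs, k=1)
r, p = pearsonr(tree_dist[iu], meth_dist[iu])
print("neurotree : metadata corr %.2f" % r)
```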

@stripathy stripathy self-assigned this Feb 12, 2016
@svdavid
Collaborator

svdavid commented Feb 12, 2016

Yes! This seems like the thing to do. I guess the first order of business is to assess how easy it is to pull the distances you want out of NT. That raises two questions: are the necessary people in there? And is the author-publication linking robust enough?

@svdavid
Collaborator

svdavid commented Feb 13, 2016

A snapshot of Neurotree (as of this afternoon) is now available here:
http://hearingbrain.org/neurotree/
The same queries should all work, but it runs on a separate database backend that won't interfere with regular site function.

stripathy added a commit that referenced this issue Feb 14, 2016
stripathy added a commit that referenced this issue Feb 16, 2016
@stripathy
Contributor Author

Here's the initial version of the analysis, for authors publishing on CA1 pyramidal cells who have useful path info in NeuroTree (36 authors). Details in this ipython notebook: https://github.com/neuroelectro/neuroelectro_neurotree/blob/master/ca1_analysis.ipynb

Matrix of pairwise min path lengths (numbers are neurotree IDs, blue is shorter path lengths). Fewer cliques than I would expect.
[image: pairwise min path length matrix]
Comparison of the pairwise path lengths above with corresponding neuroelectro methods euclidean distances and ephys euclidean distances. Each dot denotes a pairwise comparison of two publications' methods/ephys or two last authors' path lengths from neurotree.
[image: path length vs. methods/ephys distance scatter plots]
And ephys vs metadata distance:
[image: ephys vs. metadata distance scatter plot]

Lastly, correlations for the scatter plots above:

Correlations, for all neurotree paths
neurotree : metadata corr 0.29
neurotree : ephys corr 0.29
metadata : ephys corr 0.35

Correlations, only for neurotree paths of 10 or fewer hops
neurotree : metadata corr 0.35
neurotree : ephys corr 0.37
metadata : ephys corr 0.36

Correlations, only for neurotree paths of 5 or fewer hops
neurotree : metadata corr 0.64
neurotree : ephys corr 0.65
metadata : ephys corr 0.63

A few issues here:

  1. What's the influence of edges missing in NeuroTree?
  2. Above which path length are two authors effectively completely unrelated? I thresholded at 10 here, but calculate up to 20 hops when I query the API.
  3. Adding data from more cell types should help, as it'll add more authors and add more cliques based on traineeship.

@svdavid
Collaborator

svdavid commented Feb 16, 2016

Cool! I see the positive correlation up to 5 hops for NT-ephys and NT-methods, and then it looks like it levels out (though I guess you still see a positive correlation below 10?). Think that means NT distance captures something?

I've been working on ways to add missing authors/pubs from the NE pub list. Several of those papers have been added, thanks to the revised Scopus scan. However, it looks like a certain amount of supervision will be required to get all the papers in, especially for adding mentor links for new authors, if we want a solid NT distance.

@stripathy
Contributor Author

@svdavid and @rgerkin, maybe we can have a quick call soon to talk about this? It'd be good for me to discuss it before I rerun the analysis on the entire set of neuroelectro authors.

Before we (or rather @svdavid) invest a lot of time trying to add nodes/edges, I wonder if we can quickly see whether adding neurotree path info buys us anything in terms of explanatory power from the edges/nodes that are already present.

@svdavid
Collaborator

svdavid commented Feb 16, 2016

The quickest thing to try is a multiple regression with NT distance and methods as inputs and ephys results as outputs. And maybe restrict to the set of pairs with distances that appear to be linear (<= 5?). I agree, it'd be nice to try for a few ephys variables with the data we've got.

@stripathy, if you want to talk, maybe sometime mid-day Weds? Leaving town Friday for a couple conferences and will be mostly out of commission next week.


@stripathy
Contributor Author

> The quickest thing to try is a multiple regression with NT distance and methods as inputs and ephys results as outputs. And maybe restrict to the set of pairs with distances that appear to be linear (<= 5?). I agree, it'd be nice to try for a few ephys variables with the data we've got.

This sounds good, but I'm not sure how to immediately incorporate the pairwise distances into a regression framework. NeuroTree gives me a distance per pair of pmids/authors, whereas for regression I think I need something per individual pmid/neuron record. Or am I missing something?

Talking midday on Wednesday sounds good; how about 11am or 3pm?

@svdavid
Collaborator

svdavid commented Feb 16, 2016

I was thinking of each pair as a datapoint for the regression: ephys difference as a function of (methods distance, NT distance). There may be some multiple-comparisons issues buried in there for any serious significance tests, but it's probably OK for a first pass.
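
Roughly, in statsmodels (a sketch with placeholder arrays; the real pair vectors come from the notebook):

```python
import numpy as np
import statsmodels.api as sm

# One row per publication pair: neurotree path length, methods distance,
# and the ephys difference to be explained (placeholder arrays here).
n_pairs = 457
path = np.random.randint(1, 6, n_pairs).astype(float)
meth = np.abs(np.random.randn(n_pairs)) * 3
ephys_dist = 0.85 + 0.27 * path + 0.12 * meth + 0.5 * np.random.randn(n_pairs)

X = sm.add_constant(np.column_stack([path, meth]))
fit = sm.OLS(ephys_dist, X).fit()
print(fit.summary(xname=['const', 'path', 'meth']))
```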

@svdavid
Collaborator

svdavid commented Feb 16, 2016

11 or 3 Weds is fine. Slight preference for 11.

@stripathy
Contributor Author

Great - let's do 11 on Wednesday. @rgerkin, feel free to join us if you're free.


@stripathy
Contributor Author

@svdavid as you requested, I fit a simple regression model using pairwise methods distances (euclidean) and neurotree path lengths to predict pairwise ephys differences (ipython notebook updated accordingly: https://github.com/neuroelectro/neuroelectro_neurotree/blob/master/ca1_analysis.ipynb). Here, I'm only analyzing pairs of articles with a neurotree path length of <= 5.

Both regressors, methods and path lengths, are significant and non-zero (denoted by 'meth' and 'path' below). They're also not completely explaining identical sources of ephys variance.

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.502
Model:                            OLS   Adj. R-squared:                  0.499
Method:                 Least Squares   F-statistic:                     228.4
Date:                Tue, 16 Feb 2016   Prob (F-statistic):           2.33e-69
Time:                        10:31:58   Log-Likelihood:                -608.92
No. Observations:                 457   AIC:                             1224.
Df Residuals:                     454   BIC:                             1236.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.8477      0.089      9.474      0.000         0.672     1.024
path           0.2741      0.028      9.910      0.000         0.220     0.328
meth           0.1208      0.015      8.185      0.000         0.092     0.150
==============================================================================
Omnibus:                       31.919   Durbin-Watson:                   1.563
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.786
Skew:                           0.686   Prob(JB):                     1.03e-08
Kurtosis:                       3.220   Cond. No.                         17.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


@rgerkin
Collaborator

rgerkin commented Feb 16, 2016

@stripathy @svdavid I'll try to make the call on Wednesday (11am PT / 12pm MT). I agree that 5 looks like the right cutoff from the data, if you want a cutoff. Alternatively you could fit something like:
ephys_dist ~ f(pair_dist) + constant + noise, where f is a function that goes to zero as pair_dist increases. Presumably there is nothing actually special about 5 so this should really be modeled smoothly.
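
A sketch of fitting such a decaying f, e.g. an exponential, via scipy (placeholder data; not the notebook's actual model):

```python
import numpy as np
from scipy.optimize import curve_fit

# ephys_dist ~ c - a * exp(-path / tau): the lineage effect decays
# smoothly toward a constant instead of cutting off at a hard threshold.
def decay(path, a, tau, c):
    return c - a * np.exp(-path / tau)

path = np.random.randint(1, 21, 500).astype(float)            # placeholder
ephys_dist = decay(path, 1.0, 3.0, 2.0) + 0.3 * np.random.randn(500)

params, _ = curve_fit(decay, path, ephys_dist, p0=[1.0, 3.0, 2.0])
print("a=%.2f tau=%.2f c=%.2f" % tuple(params))
```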

@stripathy What do you mean by "They're also not completely explaining identical sources of ephys variance"?

@stripathy
Contributor Author

> @stripathy What do you mean by "They're also not completely explaining identical sources of ephys variance"?

In terms of model variance explained,
model(ephys_dist ~ meth_dist) < model(ephys_dist ~ pair_dist) < model(ephys_dist ~ meth_dist + pair_dist)

> ephys_dist ~ f(pair_dist) + constant + noise, where f is a function that goes to zero as pair_dist increases. Presumably there is nothing actually special about 5, so this should really be modeled smoothly.

Agreed; if I were fitting this in R, I'd use something like a smoothing spline rather than a single linear function. But in 5 minutes I couldn't figure out how to make Python's statsmodels do that.
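
One possibility, for later: patsy's bs() gives statsmodels a B-spline basis inside a formula (a sketch with hypothetical column names, untested against the real data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# patsy's bs() builds a B-spline basis inside the formula, so the
# path-length effect need not be a single straight line.
df = pd.DataFrame({
    'ephys_dist': np.random.rand(457),          # placeholder pairwise data
    'path': np.random.randint(1, 11, 457),
    'meth': np.random.rand(457),
})
fit = smf.ols('ephys_dist ~ bs(path, df=4) + meth', data=df).fit()
print(fit.summary())
```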

@svdavid
Collaborator

svdavid commented Feb 16, 2016

@stripathy, cool! So NT dist explains more than methods???

Following up on @rgerkin: you should be able to put both distance and methods into a single regression analysis, and then you can say how much each term contributes, right? More conservative is to do it stepwise, where you regress methods versus ephys and then regress NT distance versus the residual.

Regarding the upper limit of 5, I agree, we can do something smart to nail down the statistics, but so far the relationships look like they flatten out pretty quickly.

OK, let's plan on gchat at 11 PT tomorrow for all who can attend.

@rgerkin
Collaborator

rgerkin commented Feb 16, 2016

> In terms of model variance explained,
> model(ephys_dist ~ meth_dist) < model(ephys_dist ~ pair_dist) < model(ephys_dist ~ meth_dist + pair_dist)

In the above, what does "identical sources of ephys variance" correspond to?

@stripathy
Contributor Author

> In the above, what does "identical sources of ephys variance" correspond to?

I mean that the predictive power of neurotree pairwise distance is not completely collinear with methods distance, so a model with both terms is better than a model with just one or the other (looking at adjusted R^2, which accounts for simple differences in model complexity). @svdavid, this kind of gets at your point about doing the analysis in a stepwise manner.

> @stripathy, cool! So NT dist explains more than methods???

Yes, but I wouldn't read too much into that. I think it's partially because summarizing the difference between two articles' methods sections as a single "euclidean distance" is a terrible way to express their difference.

Let's talk more about details tomorrow. @rgerkin feel free to rerun and iterate on the analysis in the ipython notebook.

@stripathy
Contributor Author

> Following up on @rgerkin: you should be able to put both distance and methods into a single regression analysis, and then you can say how much each term contributes, right? More conservative is to do it stepwise, where you regress methods versus ephys and then regress NT distance versus the residual.

Thanks, I think I know what I need to do for this: first, fit a model between ephys and methods (using the current models that @rgerkin, @dtebaykin, and I have developed). Then calculate the residual between ephys and the ephys predicted from methods info alone. Then redo the regression analysis using this ephys residual (after accounting for methods differences) against pairwise neurotree distance.

@stripathy
Contributor Author

Here's my rerun of the analysis, first accounting for the metadata-ephys relationship and then assessing the correlation between neurotree pairwise path length and the residual ephys differences. I used a generic statistical model (@rgerkin's random forest implementation) to predict ephys data given known metadata variables.

Steps (sketched in code below):

  1. metadata_predicted_ephys = model(ephys ~ metadata)
  2. ephys_residuals = ephys - metadata_predicted_ephys
  3. corr(neurotree path len, ephys_residual pairwise diffs)
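
A minimal sketch of these steps with placeholder data (all inputs hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: metadata features and one ephys value per record,
# plus a condensed vector of neurotree path lengths between record pairs.
metadata = np.random.rand(100, 8)
ephys = np.random.rand(100)
path_len = pdist(np.random.rand(100, 1)) * 10  # stand-in pairwise path lengths

# 1. metadata_predicted_ephys = model(ephys ~ metadata)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(metadata, ephys)
metadata_predicted_ephys = rf.predict(metadata)

# 2. ephys_residuals = ephys - metadata_predicted_ephys
ephys_residuals = ephys - metadata_predicted_ephys

# 3. corr(neurotree path len, ephys_residual pairwise diffs)
resid_diffs = pdist(ephys_residuals.reshape(-1, 1))  # |resid_i - resid_j|
r, _ = pearsonr(path_len, resid_diffs)
print("neurotree : ephys resid corr %.2f" % r)
```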

Here's my interpretation in a nutshell: lineage explains both known methods and ephys results out to a path length of about 3-5. Even after accounting for known methods, lineage still explains some residual ephys differences, but considerably less. So what could this variance be? My bet is calculation variance (i.e., whether someone reports spike amplitude as peak-to-trough vs. threshold-to-peak), but of course it could be literally anything, from how cells are selected to things in methods sections that we're not yet mining, like ion concentrations of glucose, ATP, HEPES, etc. Also, I wonder if a low methods difference between author pairs with a higher-than-expected path distance may be an indication of a missing edge...

Correlations, only for neurotree paths less than 5 hops
neurotree : metadata corr 0.64
neurotree : ephys corr 0.65
neurotree : ephys resid corr 0.28

[images: scatter plots for the correlations above]

@svdavid
Collaborator

svdavid commented Feb 17, 2016

That sounds reasonable. For preliminary analysis, it seems OK to allow the different steps of the regression to have different analytical forms (e.g., linear vs. random forest), though any final model should probably have a single architecture. One thing we should keep in mind: any similarities we can explain by NT path length should be attributable to methods, either explained in the pubs or hidden. So inasmuch as path length matters, we should expect path length and methods to be somewhat correlated.


@rgerkin
Collaborator

rgerkin commented Feb 17, 2016

@stripathy As a sanity check, can you repeat this, but just have step 3 be:

  3. corr(metadata, ephys_residual pairwise diffs)

to show that the correlation is zero or approximately so when using the thing on which the original model was run?

@stripathy
Contributor Author

> @stripathy As a sanity check, can you repeat this, but just have step 3 be:
> corr(metadata, ephys_residual pairwise diffs)
> to show that the correlation is zero or approximately so when using the thing on which the original model was run?

@rgerkin good idea! The last row below is the correlation between methods and the ephys residuals after accounting for methods. It's much higher than I expected, but looking at the graph, it seems almost entirely due to the fact that there are always going to be some points at the origin when you do a correlation of distance matrices (Mantel's test). I think we can decide later whether it makes sense to remove these origin points or not.

Correlations, only for neurotree paths less than 5 hops
neurotree : metadata corr 0.73
neurotree : ephys corr 0.67
neurotree : ephys resid corr 0.27
metadata : ephys resid corr 0.34

[image: scatter plots for the correlations above]

@svdavid
Collaborator

svdavid commented Feb 17, 2016

Yeah, zero distance definitely contributes a lot -- I'm assuming there are a lot of points at (0,0)? We probably should see what happens if we exclude points at 0 on the x axis as well. Are these the same people reporting the exact same data twice? In that case, they definitely should be excluded. The same person running the same experiment twice seems less problematic, but that shouldn't give you data at (0,0).

Seems like an agenda item for today.


@stripathy
Contributor Author

> We probably should see what happens if we exclude points at 0 on the x axis as well.

Results after removing neurotree path lengths = 0:

Correlations, only for 0 < path len < 5
neurotree : metadata corr 0.17
neurotree : ephys corr 0.28
neurotree : ephys resid corr -0.00
metadata : ephys resid corr 0.08

@svdavid you're right - a lot of the previous results were coming from path length = 0 (i.e., the same author). But I want to redo this analysis on a bigger dataset where I have more instances of path len = 1 and path len = 2, since they're relatively rare in the CA1-only dataset.
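
For reference, the exclusion itself is just a mask over the condensed pair vectors (a sketch with placeholder data):

```python
import numpy as np
from scipy.stats import pearsonr

# path_len and meta_dist are condensed pair vectors (one entry per pair)
path_len = np.random.randint(0, 20, 630).astype(float)   # placeholder
meta_dist = np.random.rand(630)

mask = (path_len > 0) & (path_len < 5)  # drop same-author pairs, long paths
r, _ = pearsonr(path_len[mask], meta_dist[mask])
print("neurotree : metadata corr %.2f (n=%d)" % (r, mask.sum()))
```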

@rgerkin
Collaborator

rgerkin commented Feb 23, 2016

@stripathy @svdavid I've updated this with the new distance matrix in 8ed9ba7, which slightly improves the neurotree correlations for nonzero path lengths to:

neurotree : metadata corr 0.14
neurotree : ephys corr 0.13

from the previous version (with the much smaller and sparser distance matrix):

neurotree : metadata corr 0.07
neurotree : ephys corr 0.08

I should note that the correlations for shorter path lengths look stronger (visually), at least in the path length 1-3 range.

Next up I will work on prediction from the matrix fingerprint (i.e. from columns of the matrix).

@svdavid
Collaborator

svdavid commented Feb 23, 2016

Cool! I'm being pulled in lots of directions at the ARO meeting, but as soon as I have a chance I'll generate a distance matrix for all the NE authors to the hubs. That will make an alternative, reduced-dimensionality fingerprint.


@rgerkin
Collaborator

rgerkin commented Feb 25, 2016

@stripathy I just noticed something about the residuals analysis that needs to be addressed. When you fit the random forest, you fit directly to the whole dataset without splitting into training and test sets. With a sufficient number of features, the correlation between the predicted and actual values will be very high (I get well above 0.9 for each of the seven ephys properties), and the residuals become less and less meaningful. With a sufficient amount of metadata, the correlation approaches 1 and the residuals approach 0, but that doesn't actually mean you've explained anything, since it is just massive overfitting and will not generalize.

So if you want to look at correlation with the residuals, they should be residuals from a held-out test set. But perhaps it is better just to see whether predictive power increases when including neurotree path lengths, and not worry about residuals.

I will take care of this but I wanted to get your feedback first in case I am missing something and you had something else in mind.
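
For example, the held-out residuals could come from out-of-fold predictions, along these lines (a sketch with placeholder data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each record is predicted by a model that never
# saw it, so the residuals are not deflated by overfitting.
metadata = np.random.rand(100, 8)   # placeholder features
ephys = np.random.rand(100)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
pred = cross_val_predict(rf, metadata, ephys, cv=5)
ephys_residuals = ephys - pred      # held-out residuals
```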

@stripathy
Contributor Author

@rgerkin you're right; it didn't occur to me that the residuals would be so completely overfit as to be useless. I just wanted to fit "some model" that accounted for the effect of experimental metadata.

@svdavid
Collaborator

svdavid commented Feb 26, 2016

@stripathy @rgerkin I just uploaded fingerprint_mtx.txt. This contains the distance between each NE author (p1) and the 50 top nodes (p2, i.e., the most frequent common ancestors among NE authors). This 358 x 50 matrix may be easier to work with for clustering and determining similarity between NE authors.

Distances are calculated via common ancestor, so there's now a new "p0" field that indicates the common ancestor for the NE author and the hub node. The new p0 is probably not interesting, but I left it in just for completeness.
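
A minimal loading sketch (the tab separator and the 'dist' column name are assumptions):

```python
import pandas as pd

# fingerprint_mtx.txt: one row per (NE author, hub node) pair with fields
# p1 (NE author), p2 (hub node), p0 (common ancestor), and a distance.
df = pd.read_csv('fingerprint_mtx.txt', sep='\t')
fingerprint = df.pivot(index='p1', columns='p2', values='dist')
print(fingerprint.shape)  # expect (358, 50)
```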

Does this seem useful?

@rgerkin
Collaborator

rgerkin commented Mar 21, 2016

@stripathy @svdavid Working on this again today...

@svdavid
Collaborator

svdavid commented Mar 21, 2016

@rgerkin Good to hear from you! Just got back from the Metaknowledge Network spring workshop with some new fuel for the fire. A couple of different groups are doing latent semantic analysis and related things with collections of abstracts, and I think I've got them interested in Neurotree-related things. Not to divert you from the current analysis, but this might lead to some interesting new dimensions to throw into the regression mix in the long term.

Also, I'm signed up to give a demo and poster at FORCE 11 ~April 18. If we've got anything to show from the NE analysis, I'd be happy to throw it in.

@stripathy
Contributor Author

Yeah - it'd be great to put together a poster panel showing something from this work, even if it's a simple proof of concept that neurotree distance predicts ephys methods similarity.

Sorry I've been quiet on this the past couple of weeks; I've been busy with post-doc stuff, but I'm hoping to get back into this next week.


@svdavid
Collaborator

svdavid commented Mar 22, 2016

[image: dist_400d — pairwise distance comparison scatter plots]
Maybe interesting. I compared common-ancestor distance (# of steps between two researchers through a common ancestor) against 1 - (MeSH term overlap) and got a correlation (upper left). I'm also testing out a 400-D LSA space computed from the associated abstracts. That didn't work quite as well (upper right), but there is a surprisingly strong correlation between MeSH term overlap and the 400-D vector space (lower right). Perhaps the more interesting thing is that I'm working on ways to generate different low-dimensional spaces into which to project abstracts/papers.

@rgerkin
Collaborator

rgerkin commented Mar 22, 2016

I updated the ca1-analysis notebook to do cross-validation (i.e. to compute correlations on holdout samples) and not to worry about residual analysis, which I think is going to be misleading. Instead I compared prediction quality (correlation between predicted and observed) of ephys values in three ways:

  1. With no neurotree data (only neuroelectro metadata)
  2. With distances to well-connected authors
  3. With distances to historical luminaries

(2) uses distances to those authors identified in @stripathy's original 'min_df' matrix, but recalculated with a more complete distance matrix. (3) uses the nodes in the fingerprint matrix that @svdavid sent a while back, which contains 50 grandfathers. I should note that most of those 50 people are pretty old, and they seem a little too far back to matter (what is being well-connected to Louis Agassiz really going to tell you about AHP amplitudes?), so maybe there is still some better way to do this.

The results are at the bottom of the notebook, but basically the neurotree data isn't really adding anything. Although I may have done something wrong, because "Heinz Beck" (who?) appears as the second most important feature in analysis 2.
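
For reference, that feature ranking presumably comes from something like the forest's importances over the distance columns (a sketch with placeholder data; names hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Each column (distance to one neurotree author) becomes a feature;
# the forest's importances rank which authors carry predictive signal.
dist_cols = pd.DataFrame(np.random.rand(100, 50),
                         columns=['author_%d' % i for i in range(50)])
ephys = np.random.rand(100)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(dist_cols, ephys)
ranking = pd.Series(rf.feature_importances_, index=dist_cols.columns)
print(ranking.sort_values(ascending=False).head())
```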

@svdavid
Collaborator

svdavid commented Mar 22, 2016

That's not too promising, is it? I guess there's still the question of whether NT similarity can predict similarity of reported methods (known knowns).

And by the way, even if the fingerprint people are old, they may still be relevant, inasmuch as the relative distances between researchers and some old hub give you an approximation of how close they are to each other. I have also gotten an offline pairwise distance calculator working. Would you be interested in the pairwise distances between all NE people?

@svdavid
Collaborator

svdavid commented Mar 22, 2016

@rgerkin I'm a little confused. Where are you getting "Heinz Beck" from? I don't see him as one of the hub nodes for the fingerprint. I'm wondering if I exported some of the data wrong or in a weird way. It's been a while, so I guess my real question is: what data/file are you using for the NT fingerprint?

@rgerkin
Collaborator

rgerkin commented Mar 22, 2016

@svdavid Yes, I'll look at NT similarity alone next. Pairwise similarity between all NE people would be good to have. Then maybe there is some other dimensionality reduction of that data that I can try.

As for Heinz Beck, that is for analysis 2, which is not using the fingerprint data but rather the original dist_mtx.txt, restricted to the authors in NE. The fingerprint data is used in analysis 3, via fingerprint_mtx.txt. My guess is that Heinz Beck had a paper that is miscurated in NE (@stripathy maybe you can check), so it has values that are off by an order of magnitude from everything else, or else he just has data that is very different.

@stripathy
Contributor Author

Hmm, I just did a quick check for Heinz Beck in neuroelectro - he had 3 papers. In 1 paper I only found 1 value (of about 20 total) that was miscurated, resulting in a 1 order of magnitude difference in AHP amplitude. My sense is that it shouldn't be a huge bias, but I need to look into the details of your analysis.


@rgerkin
Collaborator

rgerkin commented Mar 23, 2016

@stripathy @svdavid
Model averaging sometimes works better than pooling features, especially when the feature groups are very different from each other, so here is what the prediction quality looks like with model averaging ranging from all-neuroelectro-metadata to all-neurotree-distances and everything in between. A value on the x-axis of, say, 0.3 means that the prediction comes 70% from a pure-neuroelectro model and 30% from a pure-neurotree model. If neurotree were helpful above and beyond the neuroelectro data, you would see peaks somewhere to the right of 0 (i.e., giving some weight to the neurotree model).

In all cases I am just dropping the entries with no neurotree ID, since otherwise all-neuroelectro gets an unfair advantage. The two graphs are for method 2 above (distances to well-connected authors) and method 3 above (distances to fingerprint grandfathers).

For most features, near 0 seems best. In some cases, though, the curve is actually pretty flat, meaning that the neurotree model is not a horrible substitute for the neuroelectro model (e.g., for input resistance), but it is still worse outright. Many other model combinations are possible as well, for example using only some of the neuroelectro features (like ignoring solution metadata) and then averaging with the neurotree model, although I didn't do these.
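
The averaging itself is just a convex combination of the two models' held-out predictions (a sketch with placeholder predictions):

```python
import numpy as np
from scipy.stats import pearsonr

# pred_ne, pred_nt: held-out predictions from the pure-neuroelectro and
# pure-neurotree models; ephys: observed values (placeholders here).
pred_ne = np.random.rand(100)
pred_nt = np.random.rand(100)
ephys = np.random.rand(100)

for alpha in np.linspace(0, 1, 11):
    blended = (1 - alpha) * pred_ne + alpha * pred_nt
    r, _ = pearsonr(blended, ephys)
    print("alpha=%.1f  corr=%.2f" % (alpha, r))
```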

Using distances to well-connected authors:
[image: model-averaging curves using distances to well-connected authors]

Using distances to fingerprint grandfathers:
[image: model-averaging curves using distances to fingerprint grandfathers]

@svdavid
Collaborator

svdavid commented Mar 24, 2016

Good that NT has some sort of information in it! How easy/hard would it be to rotate the problem and simply ask how well NT distance can predict the neuroelectro free parameters? E.g., do two NT cousins tend to use the same solutions and slice temperatures?

@svdavid
Collaborator

svdavid commented Mar 31, 2016

New publication-based similarity scheme. How does this look to you?
http://neurotree.org/beta/similarity.php?pid=4800
http://neurotree.org/beta/similarity.php?pid=77014

@svdavid
Collaborator

svdavid commented Mar 31, 2016

@rgerkin Thinking about the demo I'm giving at FORCE 11 in a couple of weeks. How hard would it be to modify the regression analysis to test how well the mentorship network predicts methodological parameters (e.g., how well does distance to well-connected authors predict r_in, tau, etc.)?

I could also show the model combination results if you think they're real.

@stripathy
Contributor Author

Hi Stephen,

Have you printed your poster for Force11 yet? If not, is there time for me to put together a couple figures?

I'm really sorry I've been incommunicado about this over the past couple months. I'm hoping to get back to this over the summer.


@svdavid
Collaborator

svdavid commented Apr 15, 2016

Alas, the poster has been printed and uploaded. If you have figures, though, I could keep them handy during the demo session.

And regardless, we should catch up and talk about plans.
