From 6472dc38e4f5d3d536f0539d88f0e0534b9d9057 Mon Sep 17 00:00:00 2001
From: Milton Pividori
Date: Fri, 17 May 2024 15:23:45 -0600
Subject: [PATCH] supplementary material: add Diego's suggestions

Co-authored-by: Diego Milone <47642730+dmilone@users.noreply.github.com>
---
 content/20.00.supplementary_material.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/content/20.00.supplementary_material.md b/content/20.00.supplementary_material.md
index da28298..8b0b033 100644
--- a/content/20.00.supplementary_material.md
+++ b/content/20.00.supplementary_material.md
@@ -10,7 +10,7 @@ While they share certain similarities, there are also notable differences betwee
 Conceptually, CCC is grounded in clustering input data using each variable separately.
 This process effectively transforms each variable into a set of partitions, each containing a different number of clusters.
 The CCC then quantifies the correlation between variables by assessing the similarity of these partitions.
-This allows to processing of various types of variables, including both numerical and categorical variables, even when the categories are nominal (i.e., they lack intrinsic order), as explained in [Methods](#sec:ccc_algo).
+This allows processing of various types of variables, including both numerical and categorical variables, even when the categories are nominal (i.e., they lack intrinsic order), as explained in [Methods](#sec:ccc_algo).
 MIC, however, is specifically designed for numerical variables.
 Additionally, in theory, CCC should also support correlating variables with different dimensions.
 For 1-dimensional variables (such as genes), CCC obtains partitions using a quantiles-based approach.
@@ -18,19 +18,18 @@ For multidimensional variables, CCC could potentially use a standard clustering
 Now, consider two variables with $n$ data points on a scatterplot.
 We can overlay a grid on this scatterplot with $x$ columns and $y$ rows, where each cell of this grid contains a portion of the data points, thereby defining a bivariate probability distribution.
-The MIC algorithm seeks an optimal grid configuration that maximizes the ratio of mutual information $I$ to $\log \min \{x, y\}$, subject to the constraint that $xy < n^{0.6}$
+The MIC algorithm seeks an optimal grid configuration that maximizes the ratio of mutual information to $\log \min \{x, y\}$, subject to the constraint that $xy < n^{0.6}$.
 This normalization process using $\log \min \{x, y\}$ scales the MIC score between zero and one.
 The CCC, as defined in [Methods](#sec:ccc_algo), also generates a symmetric, normalized score between zero and one.
 However, unlike MIC, which utilizes normalized mutual information, CCC employs the Adjusted Rand Index (ARI).
 The ARI has an advantageous property: it consistently returns a baseline (zero) for independently drawn partitions, irrespective of the number of clusters (see Figure @fig:constant_baseline:k_max).
 This property is not inherent in mutual information, which can produce varied values for independent variables if the grid dimensions vary.
-MIC mitigates this by limiting the grid size with the constraint $xy < n^{0.6}$.
+MIC mitigates this by limiting the grid size with the constraint $xy < n^{0.6}$, which could also limit its ability to detect complex relationships.
 Both CCC and MIC involve binning the input data vectors, aiming to maximize the mutual information and the ARI, respectively.
 However, their approaches differ significantly in complexity and execution.
 MIC utilizes a sophisticated dynamic programming algorithm to identify the optimal grid.
 In contrast, CCC employs a more straightforward and faster method, partitioning the data points separately using the two vectors.
-Our analysis on gene expression data (shown later), indicates that CCC's simpler method achieves comparable results to MIC.
 While CCC might benefit from adopting MIC's more complex grid search approach, it remains uncertain whether MIC could maintain its performance using CCC's simpler partitioning strategy.
 Regarding their parameters, CCC's $k_{\mathrm{max}}$ (maximum number of clusters) and MIC's $B(n)$ (maximum grid size) serve similar purposes.
@@ -38,7 +37,7 @@ They control both the complexity of the patterns detected and the computational
 For example, as illustrated in Figure @fig:datasets_rel (Anscombe I and III), a $k_{\mathrm{max}}$ of 2 is adequate for identifying linear patterns but insufficient for more complex patterns like quadratic or two-lines patterns.
 A similar principle applies to MIC's $B(n)$.
 However, a critical distinction exists between the two: the constant baseline property of the ARI ensures that CCC returns a value close to zero for independent variables, regardless of $k_{\mathrm{max}}$.
-In contrast, MIC may produce non-zero scores for independent data if $B(n)$ is set too high, as discussed in section 2.2.1 of the supplementary material in [@pmid:22174245].
+In contrast, MIC may produce non-zero scores for independent data if $B(n)$ is set too high, as discussed in Section 2.2.1 of the supplementary material in [@pmid:22174245].
 The authors of MIC suggest that a value of $B(n) = n^{0.6}$ is generally effective in practice.

@@ -92,7 +91,7 @@ This suggests that new implementations using more advanced processing units (suc
 ![
 **The expression levels of *KDM6A* and *DDX3Y* display sex-specific associations across GTEx tissues.**
-CCC captures this nonlinear relationship in all?? GTEx tissues (nine examples are shown in the first three rows), except in female-specific organs (last row).
+CCC captures this nonlinear relationship in all GTEx tissues (nine examples are shown in the first three rows), except in female-specific organs (last row).
 ](images/coefs_comp/kdm6a_vs_ddx3y/gtex-KDM6A_vs_DDX3Y-main.svg "KDM6A and DDX3Y across different GTEx tissues"){#fig:gtex_tissues:kdm6a_ddx3y width="95%" tag="S3"}
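
To make the partition-then-compare idea in the first hunk concrete, here is a minimal Python sketch. It follows the steps the text describes for 1-dimensional numerical variables: build quantile-based partitions of each variable for every number of clusters from 2 to $k_{\mathrm{max}}$, then take the best Adjusted Rand Index across all partition pairs, floored at zero so the score lies between zero and one. The names `quantile_partition` and `ccc_sketch` are illustrative, not the authors' reference implementation, and the sketch assumes continuous values without heavy ties.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def quantile_partition(x, k):
    # Cut x at its k-1 interior quantiles, yielding cluster labels 0..k-1.
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, x)

def ccc_sketch(x, y, k_max=10):
    # One set of partitions per variable: one partition for each k in 2..k_max.
    parts_x = [quantile_partition(x, k) for k in range(2, k_max + 1)]
    parts_y = [quantile_partition(y, k) for k in range(2, k_max + 1)]
    # The score is the best agreement (ARI) over all partition pairs,
    # floored at zero so it stays within [0, 1].
    best = max(adjusted_rand_score(px, py) for px in parts_x for py in parts_y)
    return max(0.0, best)
```

Consistent with the $k_{\mathrm{max}}$ discussion above, `ccc_sketch` with $k_{\mathrm{max}} = 2$ can capture a linear pattern, but larger values are needed for quadratic or two-lines patterns.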
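The MIC side of the comparison can be sketched in the same spirit. The hypothetical `mic_sketch` below brute-forces grid shapes subject to $xy < n^{0.6}$ and normalizes the mutual information by $\log \min \{x, y\}$; unlike MIC proper, which optimizes the cut positions with dynamic programming, it simply fixes the cuts at quantiles, and it again assumes continuous data so the bin edges are distinct.

```python
import numpy as np
from itertools import product

def grid_mutual_information(x, y, nx, ny):
    # Overlay an nx-by-ny grid (cuts at quantiles) on the scatterplot and
    # compute the mutual information of the resulting joint distribution.
    counts, _, _ = np.histogram2d(
        x, y,
        bins=[np.quantile(x, np.linspace(0, 1, nx + 1)),
              np.quantile(y, np.linspace(0, 1, ny + 1))],
    )
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mic_sketch(x, y):
    # Keep the best normalized score I / log(min(nx, ny)) over all grid
    # shapes satisfying the constraint nx * ny < n**0.6.
    n = len(x)
    max_cells = n ** 0.6
    best = 0.0
    for nx, ny in product(range(2, int(max_cells) + 1), repeat=2):
        if nx * ny >= max_cells:
            continue
        mi = grid_mutual_information(x, y, nx, ny)
        best = max(best, mi / np.log(min(nx, ny)))
    return best
```

Because the mutual information and the normalizer use the same logarithm base, the ratio is base-independent; the normalization by $\log \min \{x, y\}$ is what bounds the score by one, while the $xy < n^{0.6}$ constraint caps the size of the grid search.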