Tutorial IV
Methods for Integrating SNP marker info, Mouse Body Weight and
Network Information
Steve Horvath
Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm
This tutorial is the direct sequel of tutorials I, II, III.
The tutorials and data files can be found at the following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/MouseWeight/
More material on weighted network analysis can be found here
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Content
-
Network properties are integrated with genetic information to characterize weight related genes of the blue module
-
Technically, we regress the body weight based gene significance (GSweight) on SNP significance measures (GSmQTL) and intramodular connectivity (k)
The data and biological implications are described in
Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang,Eric E. Schadt, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network Analysis to Characterize Genes Related to Mouse Weight. PloS Genetics.
Statistical References
To cite this tutorial or the statistical methods please use
-
Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
For the generalized topological overlap matrix as applied to unweighted networks see
-
Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.
For some additional theoretical insights consider
-
Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and Other Network Concepts in Co-expression Networks. Submitted.
Abstract
The traditional quantitative trait locus (QTL) mapping approach, relates clinical traits to genetic markers directly. Several authors have proposed a strategy that uses genome-wide gene expression data to help map clinical traits. The strategy uses the genetics of gene expression to reconstruct metabolic or regulatory pathways and utilizes gene expression as a quantitative trait, thereby mapping expression QTL (eQTL). Using this method, several groups have found ‘hotspot’ regions (regulatory gene regions) in the genome that are involved in regulating the expression of many genes. These hotspots highlight the genomic loci that determine the relationship between the phenotype and groups of functionally related genes. We identify chromosomal loci (referred to as module quantitative trait loci, mQTL) that perturb the modules and describe a novel approach that integrates network properties with genetic marker information to model gene/trait relationships. Specifically, using the mQTL and the intramodular connectivity of a body weight related module, we describe which factors determine the relationship between gene expression profiles and weight.
Methods
We propose a novel approach for integrating network properties with genetic information to determine the relationship between clinical traits and group of physiologically relevant genes. The following steps summarize our overall approach:
-
A gene co-expression network is constructed from genome-wide expression data from a segregating population.
-
The biochemical and physiological significance of the network modules are determined.
-
Genetic loci regulating gene modules within the network are identified.
-
Network properties are integrated with genetic information to characterize weight related genes of the blue module
-
Steps 1-3 form the topic of other software tutorials posted on our webpage.
This document provides statistical details on step 4.
We will analyze the gene expression data from genes that comprised of Blue module genes. The Blue module was found in step 1. The eQTL hotspots (mQTLs) for the Blue module were found in step 2. Each mQTL can be represented by a SNP marker.
Description of microarray data RNA preparation and array hybridizations were performed at Rosetta Informatics Microarray data of the Blue module genes
Here we restricted the analysis to the Blue module since the genes of this module were related to Body Weight. Characterizing the genes related to Body Weight is the main goal of this analysis.
Network construction
The theory of the network construction algorithm is described in Zhang and Horvath (2006). Briefly, a gene co-expression similarity measure (absolute value of the Pearson product moment correlation) was used to relate every pairwise gene-gene relationship. An adjacency matrix was then constructed using a ‘soft’ power adjacency function aij = |cor(xi, xj)| where the absolute value of the Pearson correlation measures gene is the co-expression similarity, and aij represents the resulting adjacency that measures the connection strengths. The network connectivity (kall) of the ith gene is the sum of the connection strengths with the other genes, i.e. . This summation performed over all genes in a particular module is the intramodular connectivity (kin).
Definition of the Module Eigengene
For our multivariate regression analyses, we find it expedient to sum up the gene expression profiles of an entire module by one idealized gene expression profile. Toward this end we make use of the concept of a module eigengene. For a detailed theoretical discussion see (Horvath, Dong, Yip 2006) and http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/.
Briefly, to compute the module eigengene, one decomposes the standardized gene-expression profile of each module via the singular value decomposition (X=UDVT). The first column V1 of V corresponds to the module eigengene. The singular value decomposition is closely related to principal component analysis, we denote V1 by PC1 in the text.
Intramodular Connectivity
It has been demonstrated in lower organisms that targeted attacks on genes with very high connectivity often result in lethal phenotypes (Han et al., 2004; Jeong et al., 2001, Carlson et al 2006). Intramodular connectivity has also arisen as an important concept for identifying prognostically significant genes in cancer (Carter et al., 2004; Horvath and Zhang 2005)
These results motivated us to study genes with high values of the intramodular connectivity measure (kin). As described in the previous section, kin for each gene is based on its Pearson correlation with all the other genes in the module (Materials and Methods). To assess the significance of kin, we looked within the Blue module, which showed the highest MS value in our previous analysis, and plotted k.in against gene significance for all the 534 genes in this module.
The module eigengene gives rise to a measure of intramodular connectivity (kme) as follows kme(i)=|cor(x(i),PC1)| (Horvath, Dong, Yip 2006) Thus the connectivity of the ith gene is defined as the absolute value of the correlation between the ith expression profile and the module eigengene. The module eigengene based connectivity measure (kme) is highly correlated (r=0.86) with the standard connectivity measure (kin). kme was used only in our regression analysis because it is slightly more informative in our regression models. For all other applications, kin was utilized.
Weight based gene significance measure
For a given physiological trait (e.g. body weight), we defined a measure of gene significance by forming the absolute value of the Spearman correlation between trait and gene expression values. For example, the body weight can be used to define a gene significance of the ith gene expression GSweight(i) = |cor(x(i), weight)| where x(i) is the gene expression profile of the ith gene.
Integration of genetics and intramodular connectivity to explain physiological significance of the module
Next, we sought to explore the relationships between the intramodular connectivity (kme), weight, and mQTL. To carry out this analysis we defined two terms: GSweight and GSmQTL. As mentioned before, GSweight is defined as the absolute value of the correlation between gene expression profile and weight. GSmQTL is defined as a measure of gene significance for each of the SNPs underlying the mQTL. Using an additive marker coding (0,1,2) to code the three possible genotypes at each locus in the F2 mice, we defined the SNP gene significance measure of the ith gene expression x(i) by the absolute value of its correlation with the SNP under consideration.
GSmQTL(i) = |cor(x(i), SNP)|.
This measure is highly correlated with the traditional lod score (r=0.86). The mQTL based significance measure for a particular SNP is designated "GSmQTL" followed by the chromosome number on which the corresponding SNP is located. For example, GSmQTL2 is the mQTL based significance for the mQTL peak marker SNP on Chromosome 2.
The SNPs used in the multivariate linear regression analysis were chosen based on two criteria: 1) They had to be peak markers at the Blue module mQTL and 2) they had to be significant predictors of weight (p<0.05) in the linear model where weight was regressed on all 9 mQTL SNPs. These two criteria resulted in Chromosomes 2, 5, 10a, and 19 mQTL SNPs to be included in the multivariate linear regression analysis.
To assess the relationship between GSweight, kme, and GSmQTL we used multivariate linear regression. When we regressed GSweight on GSmQTL we find that GSmQTL19 is positively correlated with GSweight (Figure 5C) while GSmQTL2, GSmQTL5, and GSmQTL10a are all inversely correlated with GSweight. The latter motivated us to define a summary measure "GSmQTL*"= GSmQTL2 + GSmQTL5 + GSmQTL10a, which quantifies the relationship between a gene expression profile and the mQTL on Chromosomes 2, 5, and proximal 10. When regressing GSweight on GSmQTL* and GSmQTL19, we find that 37% of the variation in the GSweight can be explained by GSmQTL on Chromosomes 2, 5, 10, and 19 (Table 2, model 1). When we regressed GSweight on connectivity (kme), we find that connectivity explains 34% of the variation in GSweight (Table 2, model 2). However, while GSmQTL*, GSmQTL19, and connectivity explain 19%(=r2=0.442) (Figure 5B), 29% (Figure 5C), and 34% (Figure 5A) of the variation in GSweight respectively, a model combining these three covariates was able to explain 70% of GSweight variation (Figure 5D, Table 2, model 3). This analysis demonstrates that a model that integrates genetic information (SNP based significance) with network properties (intramodular connectivity) is a better predictor of the relationship between Blue module genes and weight.
To further explain the relationship between GSweight and the module variables we visualized the distribution of GSweight in different strata using boxplots (Figure 5E). The 9 different boxplots of Figure 5E correspond to the 9 groups of genes that resulted by splitting the Blue module genes by their median GSmQTL*, GSmQTL19 and kme values. We find that genes with high GSmQTL19, high connectivity, and low GSmQTL* have the highest GSweight (Figure 5E, labeled ‘q- 19+ k+’). This means genes with strong linkage to the Chromosome 19 locus, and weak linkage to the SNPs described on Chromosomes 2, 5, and 10, and high connectivity are most likely to be associated with weight. (Detailed information regarding the Blue module genes and the multivariate regression analyses can be found in Appendix C).
Conclusion
Our unsupervised module detection method identifies a weight related gene network module. In that module, genes with high intramodular connectivity (kin) tend to be highly correlated with body weight. Importantly, the availability of genetic marker data greatly enhances our understanding of the relationship between connectivity and GSweight. Two novel concepts enable us to integrate the gene co-expression network analysis with the genetic analysis: First, the use of intramodular connectivity as a covariate characterizing a particular gene. Second, the use of the correlation of gene expression with SNP alleles at mQTL (GSmQTL) as a covariate to characterize a particular gene.
We find that the most significant predictor for the weight based gene significance measure (GSweight) is the intramodular connectivity (kme). However, several of the mQTL based gene significance measures (GSmQTL19, GSmQTL2, GSmQTL5, GSmQTL10a) are highly significant independent predictors as well. The fact that the mQTL based significance measures are independent of connectivity can also be seen from the weak correlations between kme and GSmQTL19 (r= 0.083), GSmQTL2 (r = 0.15), GSmQTL5 (r = -0.071) and GSmQTL10 (r = 0.050). The multivariate regression models highlight the value of taking a network perspective. When integrating co-expression network concepts (connectivity) and genetic marker information (GSmQTL) we are able to explain 70% of the variation in GSweight. This compares favorably to the univariate model that uses only connectivity (R2= 0.34) or a model that uses only the genetic marker information (R2= 0.37). Integrating gene co-expression networks with genetic marker information allows one to understand what factors influence the relationship between gene expression and weight.
Statistical References for the Network Analysis
More material on weighted network analysis can be found here
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Other references
-
Carter, S. L., Brechbuhler, C. M., Griffin, M., and Bond, A. T. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20, 2242-2250.
-
Han, J. D., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy, D., Walhout, A. J., Cusick, M. E., Roth, F. P., and Vidal, M. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88-93.
-
Jeong, H., Mason, S. P., Barabási, A. L., and Oltvai, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41-42.
-
Carlson MRJ, Zhang B, Fang Z, Mischel P, Horvath S, Nelson SF (2006) Gene Connectivity, Function, and Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks. BMC Genomics 2006, 7:40 (3 March 2006). http://www.biomedcentral.com/1471-2164/7/40/
Getting the software and data -
Downloading the R software Go to http://www.R-project.org, download R and install it on your computer. After installing R, you need to install several additional R library packages: For example to install Hmisc, open R, go to menu "Packages\Install package(s) from CRAN", then choose Hmisc. R will automatically install the package. Do the same for some of the other libraries mentioned below. But note that several libraries are already present in the software so there is no need to re-install them.
-
Download the file "BluemoduleGenesWeightandSNPs.csv" and put it into a directory.
-
Download the R function file: "NetworkFunctions.txt", which contains several R functions needed for Weighted Gene Co-Expression Network Analysis. http://www.genetics.ucla.edu/labs/horvath/GeneralFramework/NetworkFunctions.txt
Put the file into the same working directory as the data
The text file contains short description of some network functions. DISCLAIMER: Absolutely no warranty on the code. Please contact SH with suggestions.
# STARTING THE R session:
# Open the R software by double clicking the corresponding icon
#To interact with the R software copy and paste the commands into the R console.
#Text after "#" is a comment and is automatically ignored by R.
# Set the working directory of the R session by using the following command.
# Change the file path and use / instead of \.
setwd("C:/Documents and Settings/shorvath/My Documents/ADAG/AnatoleGhazalpour/MOUSEWEIGHT/TUTORIALS/FourthTutorial")
# read in the custom network functions
source("C:/Documents and Settings/shorvath/My Documents/RFunctions/NetworkFunctions.txt")
# CUSTOM MADE FUNCTIONS that will be used below
# read in the R libraries.
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
library(sma) # install it for the function plot.mat
library(impute) # install it for imputing missing value
library(faraway) # this library is useful for some of the linear model diagnostics
#Memory
# check the maximum memory that can be allocated
memory.size(TRUE)/1024
# increase the available memory
memory.limit(size=4000)
## Read in the data
# these are the data related to the blue module
dat1=read.csv("BluemoduleGenesWeightandSNPs.csv",header=T)
# this data frame contains annotation and other information on the genes
datSummary= data.frame(dat1[-c(1:10), c(1:8, 144:158)])
# this data frame contains the gene expression data (rows are samples, columns are genes)
datExprBlue=data.frame(t(dat1[-c(1:10), c(9:143)]))
# This vector contains the contains the module color (blue) for each gene
color1=rep("blue",dim(datExprBlue)[[2]] )
# This defines the module eigengene
PC1=ModulePrinComps1(datExprBlue,color1)[[1]]$PCblue
# This data frame contains the SNP markers of the mQTLs for the mice
# Rows are mQTL SNP markers and columns are female mouse liver samples
SNP= data.frame(dat1[1:9, c(9:143) ])
dimnames(SNP)[[1]]=as.character(dat1[1:9,1])
#body weight of each mouse
weight=as.numeric(dat1[10, c(9:143) ])
# This defines the weight based gene significance measure
GSweight=as.numeric(abs(cor(weight, datExprBlue,use="p")))
# This defines the SNP (mQTL) based gene significance measure
GSSNP=data.frame(matrix(NA, nrow=dim(SNP)[[1]],ncol=dim(datExprBlue)[[2]] ))
for (i in c(1:dim(SNP)[[1]]) ){GSSNP[i,]= as.numeric(abs(cor(as.numeric(SNP[i,]), datExprBlue,use="p")))}
dimnames(GSSNP)[[1]]=paste("GS",as.character(dat1[1:9,1]),sep="")
# This defines the intramodular connectivity
# Note that this assumes beta=6 used for the power adjacency function
kIN=as.numeric(apply(abs(cor(datExprBlue,use="p"))^6,2,sum))
# This defines the module eigengene based connectivity measure.
kME= as.numeric(abs(cor(PC1,datExprBlue,use="p")))
# Note that kME and kIN are highly correlated, which is always true for module genes.
# See Horvath, Dong, Yip 2006
cor.test(kME,kIN)
Pearson's product-moment correlation
data: kME and kIN
t = 38.4575, df = 533, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8331551 0.8783076
sample estimates:
cor
0.8573722
# Let us now use a multivariable linear regression model to study which of the mQTLs affect
# body weight directly.
# We transformed weight so that it is more normally distributed
lm1= lm(sqrt(weight)~., data=data.frame(t(SNP)))
summary(lm1)
coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.10361 0.16912 36.091 < 2e-16 ***
mQTL2.056 0.03086 0.06312 0.489 0.62587
mQTL3.079 -0.07322 0.06262 -1.169 0.24472
mQTL5.082 -0.09025 0.06369 -1.417 0.15914
mQTL10.058 0.04218 0.07667 0.550 0.58321
mQTL10.101 -0.10538 0.07914 -1.332 0.18563
mQTL12.016 0.05492 0.10642 0.516 0.60680
mQTL12.038 0.03798 0.10600 0.358 0.72077
mQTL14.049 -0.05806 0.06256 -0.928 0.35529
mQTL19.047 0.19943 0.06220 3.206 0.00174 **
Residual standard error: 0.4845 on 116 degrees of freedom
Multiple R-Squared: 0.1453, Adjusted R-squared: 0.07898
Message: only the mQTL on chromosome 19 is a significant predictor of weight. The percent variance explained by the odel is only 14.5%. However, once we make use of the module eigengene, the results change:
summary(lm(sqrt(weight)~PC1+., data=data.frame(t(SNP))))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84842 0.12816 45.633 < 2e-16 ***
PC1 4.25970 0.43540 9.783 < 2e-16 ***
mQTL2.056 0.14555 0.04828 3.015 0.00316 **
mQTL3.079 0.01225 0.04728 0.259 0.79602
mQTL5.082 -0.13178 0.04744 -2.778 0.00640 **
mQTL10.058 0.11609 0.05738 2.023 0.04539 *
mQTL10.101 -0.02922 0.05924 -0.493 0.62274
mQTL12.016 0.05579 0.07896 0.706 0.48131
mQTL12.038 -0.00859 0.07879 -0.109 0.91338
mQTL14.049 0.05000 0.04771 1.048 0.29687
mQTL19.047 0.09810 0.04730 2.074 0.04029 *
Residual standard error: 0.3595 on 115 degrees of freedom
Multiple R-Squared: 0.5335, Adjusted R-squared: 0.493
F-statistic: 13.15 on 10 and 115 DF, p-value: 4.21e-15
Message: after including the module eigengene in the model, the mQTL on chromosome 19 loses much of its significance. But interestingly mQTL mQTL2.056 and mQTL5.082 become highly significant predictors of weight.
GENE NETWORK VIEW
# Let us now take the network view
lm1= lm(GSweight~., data=data.frame(t(GSSNP)))
summary(lm1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.31219 0.01998 15.626 < 2e-16 ***
GSmQTL2.056 -0.29988 0.06324 -4.742 2.74e-06 ***
GSmQTL3.079 0.17480 0.05999 2.914 0.00372 **
GSmQTL5.082 -0.31403 0.05489 -5.721 1.78e-08 ***
GSmQTL10.058 -0.41002 0.07579 -5.410 9.59e-08 ***
GSmQTL10.101 0.31326 0.07818 4.007 7.05e-05 ***
GSmQTL12.016 -0.17205 0.11542 -1.491 0.13666
GSmQTL12.038 0.25288 0.12128 2.085 0.03754 *
GSmQTL14.049 0.36388 0.06119 5.947 4.98e-09 ***
GSmQTL19.047 0.54617 0.05435 10.050 < 2e-16 ***
Residual standard error: 0.09856 on 525 degrees of freedom
Multiple R-Squared: 0.4573, Adjusted R-squared: 0.448
Message: most important locus is on chromosome 19
summary(lm(GSweight~sqrt(kIN)+., data=data.frame(t(GSSNP))))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.196043 0.017614 11.130 < 2e-16 ***
sqrt(kIN) 0.078512 0.004717 16.645 < 2e-16 ***
GSmQTL2.056 -0.332205 0.051237 -6.484 2.07e-10 ***
GSmQTL3.079 -0.021228 0.049973 -0.425 0.6712
GSmQTL5.082 -0.300528 0.044444 -6.762 3.64e-11 ***
GSmQTL10.058 -0.502735 0.061607 -8.160 2.50e-15 ***
GSmQTL10.101 0.128487 0.064261 1.999 0.0461 *
GSmQTL12.016 0.149016 0.095411 1.562 0.1189
GSmQTL12.038 -0.018787 0.099526 -0.189 0.8504
GSmQTL14.049 0.105254 0.051913 2.027 0.0431 *
GSmQTL19.047 0.544618 0.043997 12.378 < 2e-16 ***
Residual standard error: 0.07979 on 524 degrees of freedom
Multiple R-Squared: 0.645, Adjusted R-squared: 0.6382
F-statistic: 95.21 on 10 and 524 DF, p-value: < 2.2e-16
Message: It is intersting to contrast this to the table that relates SNPs to weigh directly after adjusting for the PC.
The SNPs have a much stronger effect on the correlation between gene expression and weight.
# Which loci affect connectivity?
lm1= lm(sqrt(kIN)~. , data=data.frame(t(GSSNP)))
summary(lm1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.47931 0.14965 9.885 < 2e-16 ***
GSmQTL2.056 0.41174 0.47375 0.869 0.385181
GSmQTL3.079 2.49674 0.44936 5.556 4.39e-08 ***
GSmQTL5.082 -0.17203 0.41116 -0.418 0.675832
GSmQTL10.058 1.18097 0.56770 2.080 0.037986 *
GSmQTL10.101 2.35350 0.58565 4.019 6.71e-05 ***
GSmQTL12.016 -4.08942 0.86459 -4.730 2.89e-06 ***
GSmQTL12.038 3.46020 0.90843 3.809 0.000156 ***
GSmQTL14.049 3.29411 0.45832 7.187 2.28e-12 ***
GSmQTL19.047 0.01974 0.40709 0.048 0.961344
Residual standard error: 0.7383 on 525 degrees of freedom
Multiple R-Squared: 0.261, Adjusted R-squared: 0.2483
F-statistic: 20.6 on 9 and 525 DF, p-value: < 2.2e-16
# Interestingly, the mQTL on chromosome 19 has no effect on connectivity.
This suggests that genes with high connectivity are controlled by the mQTL on chromosomes 14, 12, 3, etc.
These loci would have been missed by conventional analysis that ignores connectivity.
# in the following data frame each row corresponds to a mouse
datMouse=data.frame(PC1,weight, t(SNP))
# in the following data frame each row corresponds to a network node
datNode=data.frame(kIN,GSweight, t(GSSNP))
cordatNode=cor(datNode)
signif(cordatNode,2)
kIN GSweight GSmQTL2.056 GSmQTL3.079 GSmQTL5.082 GSmQTL10.058 GSmQTL10.101 GSmQTL12.016 GSmQTL12.038 GSmQTL14.049 GSmQTL19.047
kIN 1.000 0.420 0.210 0.2700 -0.039 0.170 0.260 -0.011 0.140 0.200 -0.0140
GSweight 0.420 1.000 -0.270 0.2000 -0.190 -0.350 -0.027 0.130 0.190 0.200 0.5400
GSmQTL2.056 0.210 -0.270 1.000 0.0950 0.002 0.310 0.250 0.250 0.400 0.041 -0.2500
GSmQTL3.079 0.270 0.200 0.095 1.0000 0.023 -0.280 0.012 0.049 0.230 0.035 0.0029
GSmQTL5.082 -0.039 -0.190 0.002 0.0230 1.000 -0.100 -0.030 0.071 -0.058 0.130 -0.0800
GSmQTL10.058 0.170 -0.350 0.310 -0.2800 -0.100 1.000 0.570 -0.130 -0.110 -0.081 -0.2800
GSmQTL10.101 0.260 -0.027 0.250 0.0120 -0.030 0.570 1.000 0.094 0.160 -0.150 0.0130
GSmQTL12.016 -0.011 0.130 0.250 0.0490 0.071 -0.130 0.094 1.000 0.850 0.220 0.1700
GSmQTL12.038 0.140 0.190 0.400 0.2300 -0.058 -0.110 0.160 0.850 1.000 0.140 0.2100
GSmQTL14.049 0.200 0.200 0.041 0.0350 0.130 -0.081 -0.150 0.220 0.140 1.000 0.0760
GSmQTL19.047 -0.014 0.540 -0.250 0.0029 -0.080 -0.280 0.013 0.170 0.210 0.076 1.0000
cordatMouse=cor(datMouse,use="p")
signif(cordatMouse,2)
> signif(cor(datMouse,use="p"),2)
PC1 weight mQTL2.056 mQTL3.079 mQTL5.082 mQTL10.058 mQTL10.101 mQTL12.016 mQTL12.038 mQTL14.049 mQTL19.047
PC1 1.00 0.630 -0.230 -0.21000 0.13000 -0.190 -0.2100 0.14000 0.200 -0.2700 0.20000
weight 0.63 1.000 0.048 -0.09700 -0.10000 -0.011 -0.1200 0.07600 0.140 -0.0560 0.32000
mQTL2.056 -0.23 0.048 1.000 0.07500 0.07400 0.075 0.0530 -0.17000 -0.190 0.0670 0.14000
mQTL3.079 -0.21 -0.097 0.075 1.00000 -0.19000 -0.120 -0.0350 0.00045 -0.073 0.0710 -0.04400
mQTL5.082 0.13 -0.100 0.074 -0.19000 1.00000 0.092 0.0600 0.15000 0.110 -0.1100 -0.00097
mQTL10.058 -0.19 -0.011 0.075 -0.12000 0.09200 1.000 0.5400 0.09300 0.063 0.0590 -0.07600
mQTL10.101 -0.21 -0.120 0.053 -0.03500 0.06000 0.540 1.0000 0.09900 -0.010 0.0032 -0.11000
mQTL12.016 0.14 0.076 -0.170 0.00045 0.15000 0.093 0.0990 1.00000 0.800 -0.0430 -0.05800
mQTL12.038 0.20 0.140 -0.190 -0.07300 0.11000 0.063 -0.0100 0.80000 1.000 -0.0110 0.04500
mQTL14.049 -0.27 -0.056 0.067 0.07100 -0.11000 0.059 0.0032 -0.04300 -0.011 1.0000 0.02600
mQTL19.047 0.20 0.320 0.140 -0.04400 -0.00097 -0.076 -0.1100 -0.05800 0.045 0.0260 1.00000
plot(cordatMouse[2,-2],cordatNode[2,-2])
par(mfrow=c(3,3))
for(i in 3:11) {scatterplot1(datNode[,1] , datNode[,i],xlab1="K",ylab1=names(datNode)[i],col1="blue")}
par(mfrow=c(3,3))
for(i in 3:11) {scatterplot1(datNode[,2] , datNode[,i],xlab1="GSweight",ylab1=names(datNode)[i],col1="blue")}
attach(datMouse)
# Here we define a shorter name
GSmQTL2= data.frame(t(GSSNP))$GSmQTL2.056
GSmQTL3= data.frame(t(GSSNP))$GSmQTL3.079
GSmQTL5= data.frame(t(GSSNP))$GSmQTL5.082
GSmQTL10a= data.frame(t(GSSNP))$GSmQTL10.058
GSmQTL10b= data.frame(t(GSSNP))$GSmQTL10.101
GSmQTL12a= data.frame(t(GSSNP))$GSmQTL12.016
GSmQTL12b= data.frame(t(GSSNP))$GSmQTL12.038
GSmQTL14= data.frame(t(GSSNP))$GSmQTL14.049
GSmQTL19= data.frame(t(GSSNP))$GSmQTL19.047
mQTL19=datMouse$mQTL19.047
par(mfrow=c(1,2))
boxplot(split(weight,mQTL19),notch=T,varwidth=T,xlab="mQTL19",ylab="weight",col="blue")
boxplot(split(PC1,mQTL19),notch=T,varwidth=T,xlab="mQTL19",ylab="PC1",col="blue")
# Here we fit different multivariable models
summary(step(lm(GSweight~kME+., data=datNode[,-c(1,2)]))) #Adjusted R-squared: 0.7033
summary(step(lm(GSweight~kIN+., data=datNode[,-c(1,2)]))) # Adjusted R-squared: 0.6267
summary(lm(GSweight~kIN, data=datNode[,-c(1,2)]))# Adjusted R-squared: 0.1723
summary(lm(GSweight~kME, data=datNode[,-c(1,2)])) # Adjusted R-squared: 0.337
summary(lm(GSweight~kME+ GSmQTL19, data=datNode[,-c(1,2)])) # Adjusted R-squared: 0.5835
Based on the R^2 values we prefer the module based connectivity measure kME over kIN.
# Our favourite model for integrating gene connectivity and GSSNP data is given by the following
summary(step(lm(GSweight~kME+., data=datNode[,-c(1,2)])))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03603 0.01935 1.863 0.0631 .
kME 0.66571 0.02887 23.058 < 2e-16 ***
GSmQTL2.056 -0.28510 0.04630 -6.158 1.46e-09 ***
GSmQTL3.079 -0.06360 0.04491 -1.416 0.1573
GSmQTL5.082 -0.25280 0.03995 -6.328 5.32e-10 ***
GSmQTL10.058 -0.39250 0.04438 -8.844 < 2e-16 ***
GSmQTL12.016 0.20553 0.08366 2.457 0.0143 *
GSmQTL12.038 -0.12794 0.08956 -1.428 0.1538
GSmQTL19.047 0.52881 0.03895 13.578 < 2e-16 ***
Multiple R-Squared: 0.7077, Adjusted R-squared: 0.7033
# The estimates of the coefficients of GSmQTL2.056, GSmTQL5, and GSmQTL10.058 are similar.
#This motivate us to define the following variable GSmQTL* or shortly
GSmQTLstar=I(GSmQTL2+GSmQTL5+GSmQTL10a)
# This model is reported in our paper (model 3)
summary(lm(GSweight~kME+ GSmQTLstar +GSmQTL19, data=datNode[,-c(1,2)]))
Call:
lm(formula = GSweight ~ kME + GSmQTLstar + GSmQTL19, data = datNode[,
-c(1, 2)])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04421 0.01911 2.313 0.0211 *
kME 0.63572 0.02665 23.858 <2e-16 ***
GSmQTLstar -0.30415 0.02173 -13.997 <2e-16 ***
GSmQTL19 0.55183 0.03711 14.868 <2e-16 ***
Residual standard error: 0.07325 on 531 degrees of freedom
Multiple R-Squared: 0.6969, Adjusted R-squared: 0.6952
# Using the model, we predict the following GSweight
GSweightpredict=predict(lm(GSweight~kME+GSmQTLstar+GSmQTL19))
#To understand the meaning of the model, we dichotomize the variables by their median values
GSmQTL19high= GSmQTL19 >= median(GSmQTL19)
kMEhigh= kME >= median(kME)
GSmQTLstarhigh= GSmQTLstar >= median(GSmQTLstar)
summary(lm(GSweight~kMEhigh +GSmQTLstarhigh+GSmQTL19high ))
kMElabel=factor(ifelse(kMEhigh,"k+","k-"))
GSmQTLstarlabel= factor(ifelse(GSmQTLstarhigh,"q+","q-"), levels=c("q+", "q-"))
GSmQTL19label=factor(ifelse(GSmQTL19high,"19+","19-"))
boxplot(GSweight~ GSmQTLstarlabel+GSmQTL19label+ kMElabel,notch=T,varwidth=T,ylab="GSweight",xlab="Module Group",col="blue")
#This code produces Figure 5 (with Spearman correlations instead of Pearson correlations, see below)
par(mfrow=c(3,2),mar=c(4,4,3,1))
scatterplot1(kME, GSweight,xlab1="k.ME",ylab1="GSweight",col1="blue")
scatterplot1(GSmQTLstar, GSweight,xlab1="GSmQTL*",ylab1="GSweight" ,col1="blue")
scatterplot1(GSmQTL19, GSweight,xlab1="GSmQTL19",ylab1="GSweight" ,col1="blue")
scatterplot1(GSweightpredict,GSweight,xlab1="Predicted GSweight",ylab1="Observed GSweight" ,col1="blue")
boxplot(GSweight~ GSmQTLstarlabel+GSmQTL19label+ kMElabel,notch=T,varwidth=T,ylab="GSweight",xlab="Module Group",col="blue", cex.axis=1.5,cex.lab=1.5, cex.main=1.5)
# Comment: note that the correlations are slightly different from those in our article.
# The reason is that here we report Spearman correlations while in Figure 5 we report
# Pearson correlations. We prefer Pearson correlations for our main article since
# they allow us to estimate the variance explained (=PearsonCorrelation^2)
# Now we output some of our data
datout=data.frame(datSummary, kME, kMEhigh, GSmQTL19high, GSmQTLstarhigh,datNode, GSmQTLstar)
write.table(datout, "BlueModuleSummary.csv", sep=",", row.names=T)
THE END
Share with your friends: |