Tutorial IV methods for Integrating snp marker info, Mouse Body Weight and Network Information

Download 74.34 Kb.

Date	29.04.2017
Size	74.34 Kb.
	#16722

Description of microarray data
Getting the software and data

Tutorial IV

Methods for Integrating SNP marker info, Mouse Body Weight and

Network Information

Steve Horvath

Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm
This tutorial is the direct sequel of tutorials I, II, III.

The tutorials and data files can be found at the following webpage:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/MouseWeight/

More material on weighted network analysis can be found here

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Content

Network properties are integrated with genetic information to characterize weight related genes of the blue module
Technically, we regress the body weight based gene significance (GSweight) on SNP significance measures (GSmQTL) and intramodular connectivity (k)

The data and biological implications are described in

Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang,Eric E. Schadt, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network Analysis to Characterize Genes Related to Mouse Weight. PloS Genetics.
Statistical References

To cite this tutorial or the statistical methods please use

Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17

For the generalized topological overlap matrix as applied to unweighted networks see

Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.

For some additional theoretical insights consider

Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and Other Network Concepts in Co-expression Networks. Submitted.

Abstract

The traditional quantitative trait locus (QTL) mapping approach, relates clinical traits to genetic markers directly. Several authors have proposed a strategy that uses genome-wide gene expression data to help map clinical traits. The strategy uses the genetics of gene expression to reconstruct metabolic or regulatory pathways and utilizes gene expression as a quantitative trait, thereby mapping expression QTL (eQTL). Using this method, several groups have found ‘hotspot’ regions (regulatory gene regions) in the genome that are involved in regulating the expression of many genes. These hotspots highlight the genomic loci that determine the relationship between the phenotype and groups of functionally related genes. We identify chromosomal loci (referred to as module quantitative trait loci, mQTL) that perturb the modules and describe a novel approach that integrates network properties with genetic marker information to model gene/trait relationships. Specifically, using the mQTL and the intramodular connectivity of a body weight related module, we describe which factors determine the relationship between gene expression profiles and weight.

Methods

We propose a novel approach for integrating network properties with genetic information to determine the relationship between clinical traits and group of physiologically relevant genes. The following steps summarize our overall approach:

A gene co-expression network is constructed from genome-wide expression data from a segregating population.
The biochemical and physiological significance of the network modules are determined.
Genetic loci regulating gene modules within the network are identified.
Network properties are integrated with genetic information to characterize weight related genes of the blue module
Steps 1-3 form the topic of other software tutorials posted on our webpage.

This document provides statistical details on step 4.

We will analyze the gene expression data from genes that comprised of Blue module genes. The Blue module was found in step 1. The eQTL hotspots (mQTLs) for the Blue module were found in step 2. Each mQTL can be represented by a SNP marker.

Description of microarray data

RNA preparation and array hybridizations were performed at Rosetta Informatics

Microarray data of the Blue module genes

Here we restricted the analysis to the Blue module since the genes of this module were related to Body Weight. Characterizing the genes related to Body Weight is the main goal of this analysis.
Network construction

The theory of the network construction algorithm is described in Zhang and Horvath (2006). Briefly, a gene co-expression similarity measure (absolute value of the Pearson product moment correlation) was used to relate every pairwise gene-gene relationship. An adjacency matrix was then constructed using a ‘soft’ power adjacency function a_ij = |cor(x_i, x_j)|^ where the absolute value of the Pearson correlation measures gene is the co-expression similarity, and aij represents the resulting adjacency that measures the connection strengths. The network connectivity (kall) of the ith gene is the sum of the connection strengths with the other genes, i.e. . This summation performed over all genes in a particular module is the intramodular connectivity (kin).

Definition of the Module Eigengene

For our multivariate regression analyses, we find it expedient to sum up the gene expression profiles of an entire module by one idealized gene expression profile. Toward this end we make use of the concept of a module eigengene. For a detailed theoretical discussion see (Horvath, Dong, Yip 2006) and http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/.

Briefly, to compute the module eigengene, one decomposes the standardized gene-expression profile of each module via the singular value decomposition (X=UDVT). The first column V1 of V corresponds to the module eigengene. The singular value decomposition is closely related to principal component analysis, we denote V1 by PC1 in the text.
Intramodular Connectivity

It has been demonstrated in lower organisms that targeted attacks on genes with very high connectivity often result in lethal phenotypes (Han et al., 2004; Jeong et al., 2001, Carlson et al 2006). Intramodular connectivity has also arisen as an important concept for identifying prognostically significant genes in cancer (Carter et al., 2004; Horvath and Zhang 2005)

These results motivated us to study genes with high values of the intramodular connectivity measure (kin). As described in the previous section, kin for each gene is based on its Pearson correlation with all the other genes in the module (Materials and Methods). To assess the significance of kin, we looked within the Blue module, which showed the highest MS value in our previous analysis, and plotted k.in against gene significance for all the 534 genes in this module.

The module eigengene gives rise to a measure of intramodular connectivity (kme) as follows kme(i)=|cor(x(i),PC1)| (Horvath, Dong, Yip 2006) Thus the connectivity of the ith gene is defined as the absolute value of the correlation between the ith expression profile and the module eigengene. The module eigengene based connectivity measure (kme) is highly correlated (r=0.86) with the standard connectivity measure (kin). kme was used only in our regression analysis because it is slightly more informative in our regression models. For all other applications, kin was utilized.

Weight based gene significance measure

For a given physiological trait (e.g. body weight), we defined a measure of gene significance by forming the absolute value of the Spearman correlation between trait and gene expression values. For example, the body weight can be used to define a gene significance of the i^th gene expression GSweight(i) = |cor(x(i), weight)| where x(i) is the gene expression profile of the i^th gene.

Integration of genetics and intramodular connectivity to explain physiological significance of the module

Next, we sought to explore the relationships between the intramodular connectivity (kme), weight, and mQTL. To carry out this analysis we defined two terms: GSweight and GSmQTL. As mentioned before, GSweight is defined as the absolute value of the correlation between gene expression profile and weight. GSmQTL is defined as a measure of gene significance for each of the SNPs underlying the mQTL. Using an additive marker coding (0,1,2) to code the three possible genotypes at each locus in the F2 mice, we defined the SNP gene significance measure of the ith gene expression x(i) by the absolute value of its correlation with the SNP under consideration.

GSmQTL(i) = |cor(x(i), SNP)|.

This measure is highly correlated with the traditional lod score (r=0.86). The mQTL based significance measure for a particular SNP is designated "GSmQTL" followed by the chromosome number on which the corresponding SNP is located. For example, GSmQTL2 is the mQTL based significance for the mQTL peak marker SNP on Chromosome 2.

The SNPs used in the multivariate linear regression analysis were chosen based on two criteria: 1) They had to be peak markers at the Blue module mQTL and 2) they had to be significant predictors of weight (p<0.05) in the linear model where weight was regressed on all 9 mQTL SNPs. These two criteria resulted in Chromosomes 2, 5, 10a, and 19 mQTL SNPs to be included in the multivariate linear regression analysis.

To assess the relationship between GSweight, kme, and GSmQTL we used multivariate linear regression. When we regressed GSweight on GSmQTL we find that GSmQTL19 is positively correlated with GSweight (Figure 5C) while GSmQTL2, GSmQTL5, and GSmQTL10a are all inversely correlated with GSweight. The latter motivated us to define a summary measure "GSmQTL*"= GSmQTL2 + GSmQTL5 + GSmQTL10a, which quantifies the relationship between a gene expression profile and the mQTL on Chromosomes 2, 5, and proximal 10. When regressing GSweight on GSmQTL* and GSmQTL19, we find that 37% of the variation in the GSweight can be explained by GSmQTL on Chromosomes 2, 5, 10, and 19 (Table 2, model 1). When we regressed GSweight on connectivity (kme), we find that connectivity explains 34% of the variation in GSweight (Table 2, model 2). However, while GSmQTL*, GSmQTL19, and connectivity explain 19%(=r2=0.442) (Figure 5B), 29% (Figure 5C), and 34% (Figure 5A) of the variation in GSweight respectively, a model combining these three covariates was able to explain 70% of GSweight variation (Figure 5D, Table 2, model 3). This analysis demonstrates that a model that integrates genetic information (SNP based significance) with network properties (intramodular connectivity) is a better predictor of the relationship between Blue module genes and weight.

To further explain the relationship between GSweight and the module variables we visualized the distribution of GSweight in different strata using boxplots (Figure 5E). The 9 different boxplots of Figure 5E correspond to the 9 groups of genes that resulted by splitting the Blue module genes by their median GSmQTL*, GSmQTL19 and kme values. We find that genes with high GSmQTL19, high connectivity, and low GSmQTL* have the highest GSweight (Figure 5E, labeled ‘q- 19+ k+’). This means genes with strong linkage to the Chromosome 19 locus, and weak linkage to the SNPs described on Chromosomes 2, 5, and 10, and high connectivity are most likely to be associated with weight. (Detailed information regarding the Blue module genes and the multivariate regression analyses can be found in Appendix C).
Conclusion

Our unsupervised module detection method identifies a weight related gene network module. In that module, genes with high intramodular connectivity (kin) tend to be highly correlated with body weight. Importantly, the availability of genetic marker data greatly enhances our understanding of the relationship between connectivity and GSweight. Two novel concepts enable us to integrate the gene co-expression network analysis with the genetic analysis: First, the use of intramodular connectivity as a covariate characterizing a particular gene. Second, the use of the correlation of gene expression with SNP alleles at mQTL (GSmQTL) as a covariate to characterize a particular gene.

We find that the most significant predictor for the weight based gene significance measure (GSweight) is the intramodular connectivity (kme). However, several of the mQTL based gene significance measures (GSmQTL19, GSmQTL2, GSmQTL5, GSmQTL10a) are highly significant independent predictors as well. The fact that the mQTL based significance measures are independent of connectivity can also be seen from the weak correlations between kme and GSmQTL19 (r= 0.083), GSmQTL2 (r = 0.15), GSmQTL5 (r = -0.071) and GSmQTL10 (r = 0.050). The multivariate regression models highlight the value of taking a network perspective. When integrating co-expression network concepts (connectivity) and genetic marker information (GSmQTL) we are able to explain 70% of the variation in GSweight. This compares favorably to the univariate model that uses only connectivity (R2= 0.34) or a model that uses only the genetic marker information (R2= 0.37). Integrating gene co-expression networks with genetic marker information allows one to understand what factors influence the relationship between gene expression and weight.

Statistical References for the Network Analysis

More material on weighted network analysis can be found here

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Other references

Carter, S. L., Brechbuhler, C. M., Griffin, M., and Bond, A. T. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20, 2242-2250.
Han, J. D., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy, D., Walhout, A. J., Cusick, M. E., Roth, F. P., and Vidal, M. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88-93.
Jeong, H., Mason, S. P., Barabási, A. L., and Oltvai, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41-42.
Carlson MRJ, Zhang B, Fang Z, Mischel P, Horvath S, Nelson SF (2006) Gene Connectivity, Function, and Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks. BMC Genomics 2006, 7:40 (3 March 2006). http://www.biomedcentral.com/1471-2164/7/40/

Getting the software and data

Downloading the R software Go to http://www.R-project.org, download R and install it on your computer. After installing R, you need to install several additional R library packages: For example to install Hmisc, open R, go to menu "Packages\Install package(s) from CRAN", then choose Hmisc. R will automatically install the package. Do the same for some of the other libraries mentioned below. But note that several libraries are already present in the software so there is no need to re-install them.
Download the file "BluemoduleGenesWeightandSNPs.csv" and put it into a directory.
Download the R function file: "NetworkFunctions.txt", which contains several R functions needed for Weighted Gene Co-Expression Network Analysis. http://www.genetics.ucla.edu/labs/horvath/GeneralFramework/NetworkFunctions.txt

Put the file into the same working directory as the data

The text file contains short description of some network functions. DISCLAIMER: Absolutely no warranty on the code. Please contact SH with suggestions.

# STARTING THE R session:

# Open the R software by double clicking the corresponding icon

#To interact with the R software copy and paste the commands into the R console.

#Text after "#" is a comment and is automatically ignored by R.
# Set the working directory of the R session by using the following command.

# Change the file path and use / instead of \.

setwd("C:/Documents and Settings/shorvath/My Documents/ADAG/AnatoleGhazalpour/MOUSEWEIGHT/TUTORIALS/FourthTutorial")
# read in the custom network functions

source("C:/Documents and Settings/shorvath/My Documents/RFunctions/NetworkFunctions.txt")

# CUSTOM MADE FUNCTIONS that will be used below
# read in the R libraries.

library(MASS) # standard, no need to install

library(class) # standard, no need to install

library(cluster)

library(sma) # install it for the function plot.mat

library(impute) # install it for imputing missing value

library(faraway) # this library is useful for some of the linear model diagnostics
#Memory

# check the maximum memory that can be allocated

memory.size(TRUE)/1024

# increase the available memory

memory.limit(size=4000)

## Read in the data
# these are the data related to the blue module

dat1=read.csv("BluemoduleGenesWeightandSNPs.csv",header=T)

# this data frame contains annotation and other information on the genes

datSummary= data.frame(dat1[-c(1:10), c(1:8, 144:158)])

# this data frame contains the gene expression data (rows are samples, columns are genes)

datExprBlue=data.frame(t(dat1[-c(1:10), c(9:143)]))

# This vector contains the contains the module color (blue) for each gene

color1=rep("blue",dim(datExprBlue)[[2]] )

# This defines the module eigengene

PC1=ModulePrinComps1(datExprBlue,color1)[[1]]$PCblue

# This data frame contains the SNP markers of the mQTLs for the mice

# Rows are mQTL SNP markers and columns are female mouse liver samples

SNP= data.frame(dat1[1:9, c(9:143) ])

dimnames(SNP)[[1]]=as.character(dat1[1:9,1])

#body weight of each mouse

weight=as.numeric(dat1[10, c(9:143) ])

# This defines the weight based gene significance measure

GSweight=as.numeric(abs(cor(weight, datExprBlue,use="p")))

# This defines the SNP (mQTL) based gene significance measure

GSSNP=data.frame(matrix(NA, nrow=dim(SNP)[[1]],ncol=dim(datExprBlue)[[2]] ))

for (i in c(1:dim(SNP)[[1]]) ){GSSNP[i,]= as.numeric(abs(cor(as.numeric(SNP[i,]), datExprBlue,use="p")))}

dimnames(GSSNP)[[1]]=paste("GS",as.character(dat1[1:9,1]),sep="")

# This defines the intramodular connectivity

# Note that this assumes beta=6 used for the power adjacency function

kIN=as.numeric(apply(abs(cor(datExprBlue,use="p"))^6,2,sum))
# This defines the module eigengene based connectivity measure.

kME= as.numeric(abs(cor(PC1,datExprBlue,use="p")))

# Note that kME and kIN are highly correlated, which is always true for module genes.

# See Horvath, Dong, Yip 2006

cor.test(kME,kIN)
Pearson's product-moment correlation
data: kME and kIN

t = 38.4575, df = 533, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8331551 0.8783076

sample estimates:

cor

0.8573722

# Let us now use a multivariable linear regression model to study which of the mQTLs affect

# body weight directly.

# We transformed weight so that it is more normally distributed

lm1= lm(sqrt(weight)~., data=data.frame(t(SNP)))

summary(lm1)
coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.10361 0.16912 36.091 < 2e-16 ***

mQTL2.056 0.03086 0.06312 0.489 0.62587

mQTL3.079 -0.07322 0.06262 -1.169 0.24472

mQTL5.082 -0.09025 0.06369 -1.417 0.15914

mQTL10.058 0.04218 0.07667 0.550 0.58321

mQTL10.101 -0.10538 0.07914 -1.332 0.18563

mQTL12.016 0.05492 0.10642 0.516 0.60680

mQTL12.038 0.03798 0.10600 0.358 0.72077

mQTL14.049 -0.05806 0.06256 -0.928 0.35529

mQTL19.047 0.19943 0.06220 3.206 0.00174 **

Residual standard error: 0.4845 on 116 degrees of freedom

Multiple R-Squared: 0.1453, Adjusted R-squared: 0.07898

Message: only the mQTL on chromosome 19 is a significant predictor of weight. The percent variance explained by the odel is only 14.5%. However, once we make use of the module eigengene, the results change:

summary(lm(sqrt(weight)~PC1+., data=data.frame(t(SNP))))

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.84842 0.12816 45.633 < 2e-16 ***

PC1 4.25970 0.43540 9.783 < 2e-16 ***

mQTL2.056 0.14555 0.04828 3.015 0.00316 **

mQTL3.079 0.01225 0.04728 0.259 0.79602

mQTL5.082 -0.13178 0.04744 -2.778 0.00640 **

mQTL10.058 0.11609 0.05738 2.023 0.04539 *

mQTL10.101 -0.02922 0.05924 -0.493 0.62274

mQTL12.016 0.05579 0.07896 0.706 0.48131

mQTL12.038 -0.00859 0.07879 -0.109 0.91338

mQTL14.049 0.05000 0.04771 1.048 0.29687

mQTL19.047 0.09810 0.04730 2.074 0.04029 *
Residual standard error: 0.3595 on 115 degrees of freedom

Multiple R-Squared: 0.5335, Adjusted R-squared: 0.493

F-statistic: 13.15 on 10 and 115 DF, p-value: 4.21e-15
Message: after including the module eigengene in the model, the mQTL on chromosome 19 loses much of its significance. But interestingly mQTL mQTL2.056 and mQTL5.082 become highly significant predictors of weight.

GENE NETWORK VIEW

# Let us now take the network view

lm1= lm(GSweight~., data=data.frame(t(GSSNP)))
summary(lm1)
Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.31219 0.01998 15.626 < 2e-16 ***

GSmQTL2.056 -0.29988 0.06324 -4.742 2.74e-06 ***

GSmQTL3.079 0.17480 0.05999 2.914 0.00372 **

GSmQTL5.082 -0.31403 0.05489 -5.721 1.78e-08 ***

GSmQTL10.058 -0.41002 0.07579 -5.410 9.59e-08 ***

GSmQTL10.101 0.31326 0.07818 4.007 7.05e-05 ***

GSmQTL12.016 -0.17205 0.11542 -1.491 0.13666

GSmQTL12.038 0.25288 0.12128 2.085 0.03754 *

GSmQTL14.049 0.36388 0.06119 5.947 4.98e-09 ***

GSmQTL19.047 0.54617 0.05435 10.050 < 2e-16 ***

Residual standard error: 0.09856 on 525 degrees of freedom

Multiple R-Squared: 0.4573, Adjusted R-squared: 0.448

Message: most important locus is on chromosome 19
summary(lm(GSweight~sqrt(kIN)+., data=data.frame(t(GSSNP))))

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.196043 0.017614 11.130 < 2e-16 ***

sqrt(kIN) 0.078512 0.004717 16.645 < 2e-16 ***

GSmQTL2.056 -0.332205 0.051237 -6.484 2.07e-10 ***

GSmQTL3.079 -0.021228 0.049973 -0.425 0.6712

GSmQTL5.082 -0.300528 0.044444 -6.762 3.64e-11 ***

GSmQTL10.058 -0.502735 0.061607 -8.160 2.50e-15 ***

GSmQTL10.101 0.128487 0.064261 1.999 0.0461 *

GSmQTL12.016 0.149016 0.095411 1.562 0.1189

GSmQTL12.038 -0.018787 0.099526 -0.189 0.8504

GSmQTL14.049 0.105254 0.051913 2.027 0.0431 *

GSmQTL19.047 0.544618 0.043997 12.378 < 2e-16 ***

Residual standard error: 0.07979 on 524 degrees of freedom

Multiple R-Squared: 0.645, Adjusted R-squared: 0.6382

F-statistic: 95.21 on 10 and 524 DF, p-value: < 2.2e-16
Message: It is intersting to contrast this to the table that relates SNPs to weigh directly after adjusting for the PC.

The SNPs have a much stronger effect on the correlation between gene expression and weight.

# Which loci affect connectivity?

lm1= lm(sqrt(kIN)~. , data=data.frame(t(GSSNP)))

summary(lm1)
Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.47931 0.14965 9.885 < 2e-16 ***

GSmQTL2.056 0.41174 0.47375 0.869 0.385181

GSmQTL3.079 2.49674 0.44936 5.556 4.39e-08 ***

GSmQTL5.082 -0.17203 0.41116 -0.418 0.675832

GSmQTL10.058 1.18097 0.56770 2.080 0.037986 *

GSmQTL10.101 2.35350 0.58565 4.019 6.71e-05 ***

GSmQTL12.016 -4.08942 0.86459 -4.730 2.89e-06 ***

GSmQTL12.038 3.46020 0.90843 3.809 0.000156 ***

GSmQTL14.049 3.29411 0.45832 7.187 2.28e-12 ***

GSmQTL19.047 0.01974 0.40709 0.048 0.961344

Residual standard error: 0.7383 on 525 degrees of freedom

Multiple R-Squared: 0.261, Adjusted R-squared: 0.2483

F-statistic: 20.6 on 9 and 525 DF, p-value: < 2.2e-16
# Interestingly, the mQTL on chromosome 19 has no effect on connectivity.

This suggests that genes with high connectivity are controlled by the mQTL on chromosomes 14, 12, 3, etc.

These loci would have been missed by conventional analysis that ignores connectivity.

# in the following data frame each row corresponds to a mouse

datMouse=data.frame(PC1,weight, t(SNP))

# in the following data frame each row corresponds to a network node

datNode=data.frame(kIN,GSweight, t(GSSNP))

cordatNode=cor(datNode)

signif(cordatNode,2)

kIN GSweight GSmQTL2.056 GSmQTL3.079 GSmQTL5.082 GSmQTL10.058 GSmQTL10.101 GSmQTL12.016 GSmQTL12.038 GSmQTL14.049 GSmQTL19.047

kIN 1.000 0.420 0.210 0.2700 -0.039 0.170 0.260 -0.011 0.140 0.200 -0.0140

GSweight 0.420 1.000 -0.270 0.2000 -0.190 -0.350 -0.027 0.130 0.190 0.200 0.5400

GSmQTL2.056 0.210 -0.270 1.000 0.0950 0.002 0.310 0.250 0.250 0.400 0.041 -0.2500

GSmQTL3.079 0.270 0.200 0.095 1.0000 0.023 -0.280 0.012 0.049 0.230 0.035 0.0029

GSmQTL5.082 -0.039 -0.190 0.002 0.0230 1.000 -0.100 -0.030 0.071 -0.058 0.130 -0.0800

GSmQTL10.058 0.170 -0.350 0.310 -0.2800 -0.100 1.000 0.570 -0.130 -0.110 -0.081 -0.2800

GSmQTL10.101 0.260 -0.027 0.250 0.0120 -0.030 0.570 1.000 0.094 0.160 -0.150 0.0130

GSmQTL12.016 -0.011 0.130 0.250 0.0490 0.071 -0.130 0.094 1.000 0.850 0.220 0.1700

GSmQTL12.038 0.140 0.190 0.400 0.2300 -0.058 -0.110 0.160 0.850 1.000 0.140 0.2100

GSmQTL14.049 0.200 0.200 0.041 0.0350 0.130 -0.081 -0.150 0.220 0.140 1.000 0.0760

GSmQTL19.047 -0.014 0.540 -0.250 0.0029 -0.080 -0.280 0.013 0.170 0.210 0.076 1.0000

cordatMouse=cor(datMouse,use="p")

signif(cordatMouse,2)

> signif(cor(datMouse,use="p"),2)

PC1 weight mQTL2.056 mQTL3.079 mQTL5.082 mQTL10.058 mQTL10.101 mQTL12.016 mQTL12.038 mQTL14.049 mQTL19.047

PC1 1.00 0.630 -0.230 -0.21000 0.13000 -0.190 -0.2100 0.14000 0.200 -0.2700 0.20000

weight 0.63 1.000 0.048 -0.09700 -0.10000 -0.011 -0.1200 0.07600 0.140 -0.0560 0.32000

mQTL2.056 -0.23 0.048 1.000 0.07500 0.07400 0.075 0.0530 -0.17000 -0.190 0.0670 0.14000

mQTL3.079 -0.21 -0.097 0.075 1.00000 -0.19000 -0.120 -0.0350 0.00045 -0.073 0.0710 -0.04400

mQTL5.082 0.13 -0.100 0.074 -0.19000 1.00000 0.092 0.0600 0.15000 0.110 -0.1100 -0.00097

mQTL10.058 -0.19 -0.011 0.075 -0.12000 0.09200 1.000 0.5400 0.09300 0.063 0.0590 -0.07600

mQTL10.101 -0.21 -0.120 0.053 -0.03500 0.06000 0.540 1.0000 0.09900 -0.010 0.0032 -0.11000

mQTL12.016 0.14 0.076 -0.170 0.00045 0.15000 0.093 0.0990 1.00000 0.800 -0.0430 -0.05800

mQTL12.038 0.20 0.140 -0.190 -0.07300 0.11000 0.063 -0.0100 0.80000 1.000 -0.0110 0.04500

mQTL14.049 -0.27 -0.056 0.067 0.07100 -0.11000 0.059 0.0032 -0.04300 -0.011 1.0000 0.02600

mQTL19.047 0.20 0.320 0.140 -0.04400 -0.00097 -0.076 -0.1100 -0.05800 0.045 0.0260 1.00000

plot(cordatMouse[2,-2],cordatNode[2,-2])

par(mfrow=c(3,3))

for(i in 3:11) {scatterplot1(datNode[,1] , datNode[,i],xlab1="K",ylab1=names(datNode)[i],col1="blue")}

par(mfrow=c(3,3))

for(i in 3:11) {scatterplot1(datNode[,2] , datNode[,i],xlab1="GSweight",ylab1=names(datNode)[i],col1="blue")}

attach(datMouse)

# Here we define a shorter name

GSmQTL2= data.frame(t(GSSNP))$GSmQTL2.056

GSmQTL3= data.frame(t(GSSNP))$GSmQTL3.079

GSmQTL5= data.frame(t(GSSNP))$GSmQTL5.082

GSmQTL10a= data.frame(t(GSSNP))$GSmQTL10.058

GSmQTL10b= data.frame(t(GSSNP))$GSmQTL10.101

GSmQTL12a= data.frame(t(GSSNP))$GSmQTL12.016

GSmQTL12b= data.frame(t(GSSNP))$GSmQTL12.038

GSmQTL14= data.frame(t(GSSNP))$GSmQTL14.049

GSmQTL19= data.frame(t(GSSNP))$GSmQTL19.047

mQTL19=datMouse$mQTL19.047

par(mfrow=c(1,2))

boxplot(split(weight,mQTL19),notch=T,varwidth=T,xlab="mQTL19",ylab="weight",col="blue")

boxplot(split(PC1,mQTL19),notch=T,varwidth=T,xlab="mQTL19",ylab="PC1",col="blue")

# Here we fit different multivariable models

summary(step(lm(GSweight~kME+., data=datNode[,-c(1,2)]))) #Adjusted R-squared: 0.7033

summary(step(lm(GSweight~kIN+., data=datNode[,-c(1,2)]))) # Adjusted R-squared: 0.6267

summary(lm(GSweight~kIN, data=datNode[,-c(1,2)]))# Adjusted R-squared: 0.1723

summary(lm(GSweight~kME, data=datNode[,-c(1,2)])) # Adjusted R-squared: 0.337

summary(lm(GSweight~kME+ GSmQTL19, data=datNode[,-c(1,2)])) # Adjusted R-squared: 0.5835
Based on the R^2 values we prefer the module based connectivity measure kME over kIN.

# Our favourite model for integrating gene connectivity and GSSNP data is given by the following

summary(step(lm(GSweight~kME+., data=datNode[,-c(1,2)])))
Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.03603 0.01935 1.863 0.0631 .

kME 0.66571 0.02887 23.058 < 2e-16 ***

GSmQTL2.056 -0.28510 0.04630 -6.158 1.46e-09 ***

GSmQTL3.079 -0.06360 0.04491 -1.416 0.1573

GSmQTL5.082 -0.25280 0.03995 -6.328 5.32e-10 ***

GSmQTL10.058 -0.39250 0.04438 -8.844 < 2e-16 ***

GSmQTL12.016 0.20553 0.08366 2.457 0.0143 *

GSmQTL12.038 -0.12794 0.08956 -1.428 0.1538

GSmQTL19.047 0.52881 0.03895 13.578 < 2e-16 ***

Multiple R-Squared: 0.7077, Adjusted R-squared: 0.7033

# The estimates of the coefficients of GSmQTL2.056, GSmTQL5, and GSmQTL10.058 are similar.

#This motivate us to define the following variable GSmQTL* or shortly

GSmQTLstar=I(GSmQTL2+GSmQTL5+GSmQTL10a)
# This model is reported in our paper (model 3)

summary(lm(GSweight~kME+ GSmQTLstar +GSmQTL19, data=datNode[,-c(1,2)]))

Call:

lm(formula = GSweight ~ kME + GSmQTLstar + GSmQTL19, data = datNode[,

-c(1, 2)])

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.04421 0.01911 2.313 0.0211 *

kME 0.63572 0.02665 23.858 <2e-16 ***

GSmQTLstar -0.30415 0.02173 -13.997 <2e-16 ***

GSmQTL19 0.55183 0.03711 14.868 <2e-16 ***

Residual standard error: 0.07325 on 531 degrees of freedom

Multiple R-Squared: 0.6969, Adjusted R-squared: 0.6952
# Using the model, we predict the following GSweight

GSweightpredict=predict(lm(GSweight~kME+GSmQTLstar+GSmQTL19))

#To understand the meaning of the model, we dichotomize the variables by their median values

GSmQTL19high= GSmQTL19 >= median(GSmQTL19)

kMEhigh= kME >= median(kME)

GSmQTLstarhigh= GSmQTLstar >= median(GSmQTLstar)

summary(lm(GSweight~kMEhigh +GSmQTLstarhigh+GSmQTL19high ))
kMElabel=factor(ifelse(kMEhigh,"k+","k-"))

GSmQTLstarlabel= factor(ifelse(GSmQTLstarhigh,"q+","q-"), levels=c("q+", "q-"))

GSmQTL19label=factor(ifelse(GSmQTL19high,"19+","19-"))

boxplot(GSweight~ GSmQTLstarlabel+GSmQTL19label+ kMElabel,notch=T,varwidth=T,ylab="GSweight",xlab="Module Group",col="blue")

#This code produces Figure 5 (with Spearman correlations instead of Pearson correlations, see below)

par(mfrow=c(3,2),mar=c(4,4,3,1))

scatterplot1(kME, GSweight,xlab1="k.ME",ylab1="GSweight",col1="blue")

scatterplot1(GSmQTLstar, GSweight,xlab1="GSmQTL*",ylab1="GSweight" ,col1="blue")

scatterplot1(GSmQTL19, GSweight,xlab1="GSmQTL19",ylab1="GSweight" ,col1="blue")

scatterplot1(GSweightpredict,GSweight,xlab1="Predicted GSweight",ylab1="Observed GSweight" ,col1="blue")

boxplot(GSweight~ GSmQTLstarlabel+GSmQTL19label+ kMElabel,notch=T,varwidth=T,ylab="GSweight",xlab="Module Group",col="blue", cex.axis=1.5,cex.lab=1.5, cex.main=1.5)

# Comment: note that the correlations are slightly different from those in our article.

# The reason is that here we report Spearman correlations while in Figure 5 we report

# Pearson correlations. We prefer Pearson correlations for our main article since

# they allow us to estimate the variance explained (=PearsonCorrelation^2)

# Now we output some of our data

datout=data.frame(datSummary, kME, kMEhigh, GSmQTL19high, GSmQTLstarhigh,datNode, GSmQTLstar)

write.table(datout, "BlueModuleSummary.csv", sep=",", row.names=T)

THE END

Download 74.34 Kb.

Share with your friends: