Longitudinal trajectory

We propose SAGELD, a method for gene-environment interaction analysis of longitudinal traits while controlling for sample relatedness. Consider a linear mixed model:

\[y_{ij} = X_{ij}^T \beta + G_i \beta_g + (G_i \circ E_{ij}) \beta_{g\times e} + Z_{ij}^T \gamma_i + \varepsilon_{ij}\]

where $E$ is a time-varying variable (e.g., age or BMI) that changes within subjects. We primarily focus on testing $\beta_{g\times e}=0$.

SAGELD consists of two steps:

1) fit an LMM without any genetic variant (i.e., $H_c:\beta_g=\beta_{g\times e}=0$), and estimate model parameters and calculate residuals.

2) use a matrix projection strategy to construct test statistics and then apply SPAGRM to conduct GWAS while controlling for sample relatedness.

Quick start for lme4

Please run the following code in R

Display the example data set

frq = sample(1:10, 500, replace = TRUE)

LongPheno = data.frame(IID = rep(paste0("Subj-", 1:500), frq), AGE = rnorm(sum(frq)), GENDER = rep(rbinom(500, 1, 0.5), frq))

LongPheno$LongPheno = LongPheno$AGE + LongPheno$GENDER + 0.5 * rep(rnorm(500), frq) + 0.5 * rep(rnorm(500), frq) * LongPheno$AGE + rnorm(sum(frq))
print(LongPheno)
#            IID         AGE GENDER  LongPheno
#    1:   Subj-1  0.36854165      1  1.2758600
#    2:   Subj-2  0.66326627      0 -0.3935064
#    3:   Subj-2  0.39415976      0 -1.6903284
#    4:   Subj-2 -0.17028342      0  0.1731918
#    5:   Subj-2  0.74694481      0 -1.3294894
#   ---                                       
# 2759: Subj-500  0.68823610      1 -1.0359893
# 2760: Subj-500 -0.86605339      1  0.2060031
# 2761: Subj-500 -0.28382077      1  0.1007282
# 2762: Subj-500 -0.12656440      1  1.1798696
# 2763: Subj-500 -0.06535694      1  1.0822086

Fit the null model

library(dplyr)
library(tidyr)
library(lme4)

nullmodel = lmer(LongPheno ~ AGE + GENDER + (AGE|IID), data = LongPheno)

summary(nullmodel)$coefficients
#               Estimate Std. Error    t value
# (Intercept) 0.05424788 0.07516685  0.7216995
# AGE         0.98810759 0.05436085 18.1768254
# GENDER      1.06136528 0.10528724 10.0806638

Pre-calculate joint distributions of genotypes

We can use SAGELD.NullModel function in GRAB package to pre-calculate the joint distribution of genotypes.

Prepare input files

SAGELD.NullModel needs the SparseGRMFile and PairwiseIBDFile from step 0, and ResidMatFile from step 1. SparseGRMFile and PairwiseIBDFile only need to be calculated once for a certain cohort.

Run SAGELD.NullModel to pre-calculate genotype distributions

SAGELD.NullModel(NullModel,
                 UsedMethod,
                 PlinkFile,
                 SparseGRMFile,
                 PairwiseIBDFile,
                 PvalueCutoff,
                 control = list())
  • NullModel: required. A fitted null model without any genetic variants.
  • UsedMethod: optional. “SAGELD” (default) or “GALLOP”, decide which method to use.
  • PlinkFile: required. A path to the plink file containing 1,000-2,000 genetic variants to calculate the mean $\lambda$.
  • SparseGRMFile: required. A path to the sparse GRM file.
  • PairwiseIBDFile: required. A path to the pairwise IBD file.
  • PvalueCutoff: optional. A cutoff to decide use the mean $\lambda$ or marker level $\lambda$ (default is 0.001).
  • control: optional. control is to specify a list of parameters for controlling the association testing process. See here for more details.

Example:

# Load in the filepath
library(GRAB)
SparseGRMFile = system.file("SparseGRM", "SparseGRM.txt", package = "GRAB")
PairwiseIBDFile = system.file("PairwiseIBD", "PairwiseIBD.txt", package = "GRAB")
PlinkFile = system.file("extdata", "simuPLINK.bed", package = "GRAB")
PlinkFile = strsplit(PlinkFile,".bed")[[1]]
obj.SAGELD = SAGELD.NullModel(NullModel = nullmodel,
                              PlinkFile = PlinkFile, 
                              SparseGRMFile = SparseGRMFile, 
                              PairwiseIBDFile = PairwiseIBDFile)
# Process the null model product...
# Too many SNPs in the PLINK file! We randomly select 2000 SNPs.
# Reading bim file:	 C:/Users/XH/AppData/Local/R/win-library/4.2/GRAB/extdata/simuPLINK.bim 
# Reading fam file:	 C:/Users/XH/AppData/Local/R/win-library/4.2/GRAB/extdata/simuPLINK.fam 
# setting PLINK object in CPP....
# Reading bim file....
# Reading fam file:	C:/Users/XH/AppData/Local/R/win-library/4.2/GRAB/extdata/simuPLINK.fam
# m_N:	1000
# Setting position of samples in PLINK files....
# posSampleInPlink:	501 510 600
# SampleInModel:	Subj-1 Subj-10 Subj-100
# SampleInPlink:	f1_1 f1_2 f1_3
# Number of samples:	500
# Based on the 'GenoFile' and 'GenoFileIndex', PLINK format is used for genotype data.
# Number of Samples:	 500 
# Number of Markers:	 2000 
# Complete the genotype reading.
# Mean lambda:	 0.001080904 
# cutoffVec:	 -2.902171 2.960981 
# ControlOutlier = TRUE (default) to keep the outliers < 5%;
# Set ControlOutlier = FALSE for higher accuracy.
# Outliers information is as below
# A tibble: 24 x 3
#    SubjID   Resid Outlier
# 1 Subj-304 -4.91 TRUE   
#  2 Subj-474 -4.10 TRUE   
#  3 Subj-367 -4.05 TRUE   
#  4 Subj-242 -4.01 TRUE   
#  5 Subj-109 -3.69 TRUE   
#  6 Subj-253 -3.67 TRUE   
#  7 Subj-458 -3.62 TRUE   
#  8 Subj-399 -3.50 TRUE   
#  9 Subj-227 -3.41 TRUE   
# 10 Subj-300 -3.38 TRUE   
# i 14 more rows
# i Use `print(n = ...)` to see more rows

Conduct genome-wide association studies

We can use GRAB.marker function in GRAB package to conduct marker-level genome-wide association studies. See here for more details.

Run GRAB.marker to conduct genome-wide association studies

GRAB.Marker(objNull,
            GenoFile,
            GenoFileIndex = NULL,
            OutputFile,
            OutputFileIndex = NULL,
            control = NULL)
# run the GRAB.Marker function
GenoFile = system.file("extdata", "simuPLINK.bed", package = "GRAB")
OutputDir = system.file("results", package = "GRAB")
OutputFile = paste0(OutputDir, "/simuOutput.txt")

GRAB.Marker(objNull = obj.SAGELD,
            GenoFile = GenoFile,
            OutputFile = OutputFile)

Value

The analysis results are written in a file of OutputFile. Most columns are common as in SPAGRM, with minor difference:

  • zScore_G: standardized score statistics for SNP main effects, usually follows a standard normal distribution.
  • zScore_GxE: standardized score statistics for SNP GxE effects, usually follows a standard normal distribution.
  • Pvalue_G: association test p-value for each marker’s main effect.
  • Pvalue_GxE: association test p-value for each marker’s GxE effect.
results = data.table::fread(OutputFile)
print(results)
#           Marker        Info    AltFreq AltCounts MissingRate Method   zScore_G  zScore_GxE    Pvalue_G
#     1:     SNP_1     1:1:G:A 0.39501040       380       0.038 SAGELD -1.1243540 -0.51412937 0.260862875
#     2:     SNP_2     1:2:G:A 0.46234310       442       0.044 SAGELD -0.3090360  0.26472862 0.757294124
#     3:     SNP_3     1:3:G:A 0.40400844       383       0.052 SAGELD -1.9107842 -0.03805962 0.056032323
#     4:     SNP_4     1:4:G:A 0.46788009       437       0.066 SAGELD -0.1615199  0.08725930 0.871683968
#     5:     SNP_5     1:5:G:A 0.25630252       244       0.048 SAGELD  0.1972576 -1.89378144 0.843625952
#    ---                                                                                                 
#  9996:  SNP_9996  1:9996:G:A 0.33954451       328       0.034 SAGELD  0.5786508 -0.16980515 0.562824815
#  9997:  SNP_9997  1:9997:G:A 0.26415094       252       0.046 SAGELD  3.0584812 -0.80338128 0.002224621
#  9998:  SNP_9998  1:9998:G:A 0.06526316        62       0.050 SAGELD  1.2286993  0.79542958 0.219184557
#  9999:  SNP_9999  1:9999:G:A 0.24838013       230       0.074 SAGELD  0.5652946  1.08282372 0.571873424
# 10000: SNP_10000 1:10000:G:A 0.16315789       155       0.050 SAGELD  0.2971961  0.28412809 0.766316834
#        Pvalue_GxE   hwepval
#     1: 0.60716154 0.5603599
#     2: 0.79121853 0.5118404
#     3: 0.96964014 0.9440939
#     4: 0.93046540 0.9213428
#     5: 0.05825403 0.9484430
#    ---                     
#  9996: 0.86516338 0.3813232
#  9997: 0.42175439 0.4393182
#  9998: 0.42636366 0.4625118
#  9999: 0.27888668 0.8883984
# 10000: 0.77631222 0.1433092

Note

  • Adding a random slope related to the environmental exposure in the LMM is required to control for heterogeneity introduced by the environmental variable.
  • If lme4 fails due to sparse longitudinal data, you can use glmmTMB. For example,
    nullmodel = glmmTMB::glmmTMB(LongPheno ~ AGE + GENDER + (AGE|IID), data = LongPheno)
    
  • When using UsedMethod = "GALLOP" in SAGELD.NullModel function, e.g.,
    obj.GALLOP = SAGELD.NullModel(NullModel = nullmodel,
                                UsedMethod = "GALLOP")
    

    other parameters will be out of use, the method will perform GALLOP algorithm, and the GWAS results will have two additional columns:

    • Beta_G: effect size estimates for SNP main effects.
    • Beta_GxE: effect size estimates for SNP GxE effects.