Evaluation of GWAS Method Performance Focusing on Population Stratification and Cryptic Relatedness
Master thesis
Permanent link
http://hdl.handle.net/11250/284181
Publication date
2015-05-28
Collections
- Master's theses (KBM) [930]
Abstract
Genetic association studies are primarily used to identify genes associated with complex disease. They can be conducted by genotyping either intentionally selected or randomly chosen markers. Numerous statistical and computational algorithms have been developed to analyze genome-wide association study (GWAS) datasets. These are classified as parametric, non-parametric and Bayesian methods. However, there are methodological and computational challenges related to population stratification and to the vast volume of data generated by chip-based and sequencing-based technologies. The packages SNPRelate and GenABEL were built to overcome this burden. SNPRelate uses parallel computing and loads genotypes block by block to make efficient use of high-speed cache memory. It is designed for principal component analysis (PCA) and identity by descent (IBD) analyses, which are used to correct for population structure. GenABEL, by contrast, incorporates genome-wide rapid association using mixed model and regression (GRAMMAR). It was developed to overcome the limitations of efficiently storing, handling and analyzing GWAS data by integrating a data format called gwaa.data. To evaluate and compare these packages, this study obtained PLINK-formatted data from a heritable dog osteosarcoma study. The PLINK data were then converted into the genomic data structure (GDS) file format for SNPRelate and into a gwaa.data file for GenABEL. In GenABEL, the association analysis was performed both ignoring and taking into account population structure. In SNPRelate, LD-based pruning was performed prior to the PCA and IBD calculations. For the three dog breeds, the first and second principal components (PCs) together account for almost 50% of the variation. Interpreting the PCA in terms of IBD indicates that Irish wolfhounds are more inbred than the other two dog breeds. Of the correction methods compared, PCA correction for population structure gave the most accurate estimates, ahead of genomic control and of including PCs as predictors.
Comparing the two packages, the method SNPRelate uses for PCA calculation is faster and handles larger data sets than GenABEL, which uses EIGENSTRAT for PCA calculation.
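
Genomic control, one of the correction methods compared above, can be illustrated with a short sketch. This is not code from the thesis and does not use the SNPRelate or GenABEL APIs. It is a minimal standard-library Python illustration that assumes each SNP has a 1-degree-of-freedom chi-square association statistic. The function names, the example statistics, and the choice to floor the inflation factor at 1 (a common convention) are assumptions made for illustration.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Genomic inflation factor: observed median statistic / expected median."""
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics).

    Lambda is floored at 1 so that statistics are never inflated
    (an assumed convention, not taken from the thesis).
    """
    lam = max(1.0, inflation_factor(chi2_stats))
    return lam, [s / lam for s in chi2_stats]


# Hypothetical per-SNP chi-square statistics, invented for illustration only.
example_stats = [0.2, 0.5, 0.9, 1.4, 2.8, 6.1, 11.3]
lam, corrected = genomic_control(example_stats)
print(f"lambda = {lam:.3f}")
```

An inflation factor well above 1 suggests that test statistics are inflated genome-wide, which may reflect population stratification or cryptic relatedness. Dividing every statistic by the same factor rescales them uniformly, which is one reason approaches that model structure directly, such as PCA-based correction, can behave differently.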