Genotype imputation based on discriminant and cluster analysis

Mahmoud, Medhat

Master's thesis

Permanent link

http://hdl.handle.net/11250/186135

Date of publication

2012-11-09

Collections

Master's theses (IHA) [318]

Abstract

The recent development of high-throughput systems for genotyping SNPs in eukaryotes has led to an extraordinary amount of research activity, particularly in areas such as whole-genome selection of livestock and genome-wide association studies for the detection of quantitative trait loci (Van Tassell et al., 2008). Recent technological advances allow us to rapidly genotype more than 10 million SNPs in an individual, accounting for 10% of the estimated number of common SNPs (more than 1% minor allele frequency) across the population. Because many SNPs remain ungenotyped, true associations might be missed if the causal SNP is not genotyped or if the causal variant is unknown. SNP imputation is important for reducing the cost of re-sequencing, and for cases where genotyping all animals under consideration is too costly or not feasible because DNA is not available for every animal. Computational algorithms and statistical methods have been developed to account for some of the unobserved variants. The main idea behind these methods is the observation that SNPs in close proximity to one another in the genome tend to be correlated, that is, in non-random association (linkage disequilibrium). Several powerful methods to impute missing SNP genotypes already exist that, apart from the genotypic information at the locus of interest, use “only pedigree data” (Gengler, 2007, 2008), “only surrounding markers” (fastPHASE; Scheet and Stephens, 2006), or both (Li and Jiang, 2003; Kong et al., 2008; Meuwissen and Goddard, 2010; Mulder et al., 2010b). The mixed model (BLUP) method presented by Gengler et al. (2007) uses BLUP to estimate the missing gene content conditional on the genotypic information of relatives. “Several articles have described comparisons of imputation methods with respect to computational efficiency and the accuracy of results” (Pei YF, 2008; Yu Z, 2007; Nothnagel M, 2009).
Overall, MACH, BEAGLE, and IMPUTE have been shown to have approximately similar accuracy, and all of these programs have been shown to outperform other imputation methods such as fastPHASE (Scheet P, 2006) and PLINK (Purcell S, 2007). Consequently, we saw a substantial need to propose a new technique for SNP imputation that applies linear discriminant analysis and cluster analysis algorithms. To evaluate the factors potentially affecting imputation accuracy rates (ARs), we used simulated data sets to investigate the effects of linkage disequilibrium (LD), minor allele frequency (MAF) of untyped SNPs, marker density (MD), reference sample size (n), and the number of SNPs in each haplotype block on the imputation accuracy rate and on the performance of linear discriminant analysis and cluster analysis as SNP imputation methods.
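
As a rough illustration of two quantities the abstract relies on, the sketch below computes the pairwise LD statistic r² between two SNPs from phased haplotypes and a simple imputation accuracy rate. It is not taken from the thesis: the function names, the 0/1 allele coding, the toy haplotypes, and the definition of the accuracy rate as the fraction of exactly matching genotypes are all assumptions made for this example.

```python
# Minimal sketch (assumed names and data, not the thesis's implementation).

def ld_r_squared(haplotypes, i, j):
    """Return r^2 between SNP columns i and j of 0/1-coded phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic, so r^2 is undefined; report 0
    return d * d / denom


def accuracy_rate(true_genotypes, imputed_genotypes):
    """Fraction of imputed genotypes equal to the true ones (assumed definition)."""
    if not true_genotypes or len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must be non-empty and of equal length")
    matches = sum(t == g for t, g in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


# Hypothetical toy data: six haplotypes, three SNPs.
# SNP 0 and SNP 1 always carry the same allele; SNP 2 is only weakly associated.
haps = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]

r2_tight = ld_r_squared(haps, 0, 1)  # SNPs in complete LD
r2_loose = ld_r_squared(haps, 0, 2)  # SNPs in weak LD
ar = accuracy_rate([0, 1, 2, 1], [0, 1, 1, 1])
```

In this toy setting, a SNP in high r² with a typed neighbour is the kind of marker an LD-based imputation method can predict well, while a SNP in low r² with its neighbours gives the method little to work with.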

Publisher

Norwegian University of Life Sciences, Ås