Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Insights on Array Design and Genotyping of over 50,000 Diverse Individuals for the Next Generation of Association Studies (#511)

Christopher R Gignoux 1 , Genevieve L Wojcik 1 , Henry Rich Johnston 2 , Christian Fuchsberger 3 , Suyash S Shringarpure 1 , Alicia R Martin 1 , Stephanie Rosse 4 , Niha Zubair 4 , Daniel Taliun 3 , Ryan Welch 3 , Carsten Rosenow 5 , Noura S Abul-Husn 6 , Gillian Belbin 6 , Hyun M Kang 3 , Goncalo Abecasis 3 , Michael Boehnke 3 , Zhaohui S Qin 2 , Christopher Carlson 4 , Kathleen C Barnes 7 , Carlos D Bustamante 1 , Eimear E Kenny 6 , on behalf of the PAGE Network 8
  1. Genetics, Stanford University, Stanford, CA, USA
  2. Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
  3. Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, USA
  4. Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  5. Illumina, Inc, San Diego, CA, USA
  6. Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  7. Medicine, Johns Hopkins University, Baltimore, MD, USA
  8. Population Architecture using Genomics and Epidemiology Consortium

The past decade has seen numerous successes in genome-wide association studies (GWAS) of complex traits. However, much of this work has been performed only in populations of European descent. To address this disparity, we developed the Multi-Ethnic Genotyping Array (MEGA), a single platform designed for balanced GWAS coverage across the globe that also incorporates a catalog of functional variation.

To maximize trans-ethnic utility, we designed the GWAS backbone using whole-genome sequences from the 26 populations of the 1000 Genomes Project (TGP), bolstered by tag SNPs from 642 high-coverage whole genomes from individuals of African descent in the CAAPA consortium. We developed a novel cross-population tag SNP selection strategy to capture low-frequency variants across the diverse populations in Phase 3 of TGP. Importantly, because the strategy optimizes imputation accuracy rather than pairwise LD, the array performs well across all continental TGP super-populations (>90% imputation accuracy for MAF >=1%). We deconvolved admixture to evaluate per-ancestry imputation performance, and devised a whole-genome sequencing panel to balance existing reference datasets. A reference panel of several thousand individuals, including the Human Genome Diversity Panel and a large panel of indigenous Americans, will be available on MEGA to aid in rare variant calling, ancestry characterization, and admixture analyses.
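
As an illustration of how imputation accuracy above a MAF threshold might be summarized, the sketch below averages a per-variant accuracy score over variants with MAF >=1%. It is not the evaluation pipeline used for MEGA: the abstract does not specify the accuracy metric, so the squared Pearson correlation between true genotypes and imputed dosages (a commonly used measure) is an assumption here, and the function names and toy data are hypothetical.

```python
from statistics import mean


def minor_allele_frequency(genotypes):
    """MAF from a list of alternate-allele counts (0, 1, or 2 per individual)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumed metric; returns None when either vector has no variance.
    """
    mean_true = mean(true_genotypes)
    mean_dose = mean(imputed_dosages)
    cov = sum((t - mean_true) * (d - mean_dose)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_dose = sum((d - mean_dose) ** 2 for d in imputed_dosages)
    if var_true == 0 or var_dose == 0:
        return None
    return cov * cov / (var_true * var_dose)


def mean_r2_above_maf(variants, maf_threshold=0.01):
    """Average r^2 over variants whose MAF meets the threshold.

    `variants` is an iterable of (true_genotypes, imputed_dosages) pairs.
    """
    scores = []
    for true_genotypes, imputed_dosages in variants:
        if minor_allele_frequency(true_genotypes) < maf_threshold:
            continue  # skip variants below the frequency cutoff
        r2 = dosage_r2(true_genotypes, imputed_dosages)
        if r2 is not None:
            scores.append(r2)
    return mean(scores) if scores else None


# Toy data (hypothetical): one well-imputed common variant, one monomorphic site.
toy_variants = [
    ([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0]),
    ([0, 0, 0, 0], [0.1, 0.0, 0.0, 0.0]),
]
summary = mean_r2_above_maf(toy_variants)
```

Running the same summary separately within each super-population, or within each local-ancestry class after admixture deconvolution, would give per-group figures of the kind reported above.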

To date, we have genotyped >50,000 African-American, Hispanic/Latino, Asian American, Native American, and Native Hawaiian individuals from PAGE cohorts. From these diverse populations we can infer an extraordinary breadth of population structure, admixture, and differential relatedness, with important implications for complex trait association studies within and across ethnicities. Here, we highlight the need for methods that can capture and model such high levels of diversity, both to optimize statistical power and to improve biological interpretation.