################################################################################
##########            May 2016 MALARIAGEN DATA RELEASE                  ########
##########            Study ID: EGAS00001001311                         ########
################################################################################

Study Title: Genome-wide study of resistance to severe malaria in eleven
worldwide populations

Background
----------

This release contains genome-wide summary statistics accompanying the paper
"A novel locus of resistance to severe malaria in a region of ancient balancing
selection", Malaria Genomic Epidemiology Network, Nature 2015 [1].  The included
data reflects association tests between severe malaria cases and controls from
three African populations.  Sample counts for the cases and controls included in
the association tests are given in the following table:

population            total       cases       controls
--------------------  ----------  ----------  --------
Gambia                4921        2430        2491
Malawi                3752        1194        1322
Kenya                 2984        1506        1478

For full methods see [1].  In brief, all cases were diagnosed as meeting WHO
definitions of severe malaria (see [1-3]).  Controls were sampled from the
general population and from new births.  Individuals were genotyped on the
Illumina Omni 2.5M platform, quality controlled (QC), and imputed using the
1000 Genomes Phase I reference panel [2].

The data included here represents summary statistics for 15122093 autosomal
variants and 476377 X chromosome variants that had a minimum IMPUTE info score
of 0.75 across populations and a minor allele frequency of at least 0.005 in at
least one population.  This data forms the basis for Extended Data Figure 6 in
[1].

REFERENCES
----------

[1] The MalariaGEN Consortium, "A novel locus of resistance to severe malaria in
    a region of ancient balancing selection", Nature (2015), PMID:26416757,
    doi:10.1038/nature15390.
[2] The 1000 Genomes Project Consortium, "An integrated map of genetic variation
    from 1,092 human genomes", Nature (2012), PMID:23128226,
    doi:10.1038/nature11632.

Description of data
--------------------

Data is provided both as a gzipped text file
(EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.csv.gz) and as a sqlite
file (EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite) that can
be accessed using the sqlite program (http://www.sqlite.org).  Both files have
the same content: a table of summary statistics for autosomal and X chromosome
SNPs and indels, with the following columns:

- chromosome: the chromosome of the variant.  (This is always a two-character
  field, padded on the left with zeros.)
- position: the position of the variant in NCBI build 37 coordinates.
- rsid: the identifier of the variant.  Note that the ID given here is taken
  from the genotyping array or reference panel and may not correspond to a
  dbSNP identifier.
- alleleA: the reference allele of the variant.
- alleleB: the non-reference allele of the variant.
- add:meta_beta: the log odds ratio of the B allele under an additive model of
  association, computed by fixed-effect meta-analysis across populations.
- add:meta_se: the standard error of `add:meta_beta`.
- add:meta_pvalue: the 2-sided P-value for `add:meta_beta`.
- mean_bf: the model-averaged Bayes factor; see [1] for details.
- Gambia:impute_type: the IMPUTE type reflecting imputation of Gambia data; this
  is either 0 (imputed) or nonzero (directly typed).
- Gambia:impute_info: the IMPUTE info score reflecting imputation of Gambia
  data; higher scores reflect more confident imputation, up to a maximum of 1.
- Gambia:controls_alleleB_frequency: expected frequency of allele B in controls.
- Malawi:impute_type: the IMPUTE type reflecting imputation of Malawi data; this
  is either 0 (imputed) or nonzero (directly typed).
- Malawi:impute_info: the IMPUTE info score reflecting imputation of Malawi
  data; higher scores reflect more confident imputation, up to a maximum of 1.
- Malawi:controls_alleleB_frequency: expected frequency of allele B in controls.
- Kenya:impute_type: the IMPUTE type reflecting imputation of Kenya data; this
  is either 0 (imputed) or nonzero (directly typed).
- Kenya:controls_alleleB_frequency: expected frequency of allele B in controls.

All allele frequencies are based on imputed genotype probabilities for imputed
SNPs, or on prephased genotypes for directly typed SNPs.  Info scores for X
chromosome variants are computed in females only.

Using the .csv file
-------------------

The CSV file is large but can be loaded directly into analysis programs such as
R (https://www.r-project.org).  As an example, the following R code loads the
first 10,000 rows into an R session:

# R code:
# Note: by default read.csv() rewrites the ':' in column names as '.',
# e.g. 'add:meta_beta' becomes 'add.meta_beta'.
data = read.csv(
    'EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.csv.gz',
    header = TRUE, as.is = TRUE, nrows = 10000
)
# end of R code

Note that loading the whole file in this way may take a considerable amount of
time and will use several gigabytes of memory.  (We do not recommend loading
this file into a spreadsheet program such as Excel.)  Alternatively, UNIX tools
such as grep and awk can be used to extract desired subsets of the data.

Using the sqlite file
---------------------

The sqlite file can be accessed via several programs, including the `sqlite3`
command shell (http://www.sqlite.org), which is installed by default on most
UNIX systems.  It can also be accessed directly from programming languages such
as R or Python using add-on libraries.
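
For Python, the standard-library `sqlite3` module is one option.  The following
is a minimal sketch, not part of the original release.  It assumes that the
table is named `summary_statistics` (as in the R example in this section) and
that the column names match the list above exactly.  The function name
`top_hits` and the default threshold are illustrative choices.  Column names
that contain a colon must be double-quoted in SQL.

```python
import sqlite3

# Assumptions: the table name `summary_statistics` and the column names below
# are taken from this README; adjust them if the released file differs.
QUERY = """
    SELECT chromosome, position, rsid, "add:meta_beta", "add:meta_pvalue"
    FROM summary_statistics
    WHERE "add:meta_pvalue" < ?
    ORDER BY "add:meta_pvalue"
    LIMIT ?
"""


def top_hits(connection, max_pvalue=5e-8, limit=20):
    """Return up to `limit` rows whose meta-analysis P-value is below `max_pvalue`.

    Rows are ordered from the smallest P-value upwards.
    """
    return connection.execute(QUERY, (max_pvalue, limit)).fetchall()
```

To use it, open the file with
sqlite3.connect("EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite")
and pass the resulting connection to top_hits().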

As an example, the following R code loads summary statistics for a 200kb region
around the sickle cell anaemia locus rs334:

# R code:
library( RSQLite )
db = dbConnect(
    dbDriver( "SQLite" ),
    "EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite"
)
# chromosome is a zero-padded two-character field, so it is compared as text
data = dbGetQuery(
    db,
    "SELECT * FROM summary_statistics WHERE chromosome = '11' AND position BETWEEN 5148232 AND 5348232"
)
dbDisconnect( db )
# end of R code
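
Because `add:meta_beta` is a log odds ratio and `add:meta_se` is its standard
error, the two can be converted to an odds ratio with an approximate confidence
interval.  The Python sketch below illustrates this; it is not part of the
original release.  The function names are hypothetical, and the normal (Wald)
approximation used for the interval and the P-value is an assumption: see [1]
for how the released `add:meta_pvalue` values were actually computed.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.959964):
    """Convert a log odds ratio and its standard error to (OR, lower, upper).

    The default z gives an approximate 95% interval under a normal
    approximation (an assumption of this sketch).
    """
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))


def wald_pvalue(beta, se):
    """Two-sided P-value for beta / se under a standard normal approximation."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))
```

For a row loaded from either file, odds_ratio_with_ci(beta, se) returns the odds
ratio for each additional copy of allele B together with its interval bounds.
An odds ratio below 1 corresponds to a negative `add:meta_beta`.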