################################################################################
##########   May 2016  MALARIAGEN DATA RELEASE                          ########
##########   Study ID: EGAS00001001311                                  ########
################################################################################

Study Title: Genome-wide study of resistance to severe malaria in eleven
worldwide populations

Background
----------

This release contains genome-wide summary statistics accompanying the paper
"A novel locus of resistance to severe malaria in a region of ancient balancing selection",
Malaria Genomic Epidemiology Network, Nature 2015 [1].  The included data reflects association
tests between individuals with severe malaria and controls from three African populations.
Sample counts for cases and controls included in association tests are given in the following table:

population            total       cases       controls
--------------------  ----------  ----------  --------
Gambia                4921        2430        2491
Malawi                3752        1194        1322
Kenya                 2984        1506        1478

For full methods see [1].  In brief, all cases were diagnosed as meeting WHO definitions of
severe malaria (see [1-3]).  Controls were sampled from the general population and from new births.
Individuals were genotyped on the Illumina Omni 2.5M platform, quality controlled and imputed into the
1000 Genomes Phase I reference panel [2].  The data included here represents summary statistics
from 15122093 autosomal variants and 476377 X chromosome variants that had a minimum
IMPUTE info score of 0.75 across populations, and a minor allele frequency of at least
0.005 in at least one population.  This data forms the basis for Extended Data Figure 6 in [1].

REFERENCES
----------

[1] The MalariaGEN Consortium, "A novel locus of resistance to severe malaria in a
region of ancient balancing selection", Nature (2015), PMID:26416757,
doi:10.1038/nature15390.

[2] The 1000 Genomes Project Consortium, "An integrated map of genetic variation from 1,092 human genomes",
Nature (2012), PMID:23128226, doi:10.1038/nature11632.


Description of data
--------------------

Data is provided both as a gzipped text file (EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.csv.gz) and as a sqlite file
(EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite) that can be accessed using the sqlite program (http://www.sqlite.org).
The data content of both files is the same, and contains a table of summary statistics for autosomal and X chromosome SNPs and indels
with the following columns:

- chromosome: the chromosome of the variant.  (This is always a two-character field padded on the left by zeros).
- position:  the position of the variant in NCBI build 37 coordinates.
- rsid: the identifier of the variant.  Note that the ID given here is taken from the genotyping array or reference panel and may not correspond to dbSNP identifiers.
- alleleA: the reference allele of the variant.
- alleleB: the non-reference allele of the variant.
- add:meta_beta: the log odds ratio of the B allele under an additive model of association, computed by fixed-effect meta-analysis across populations.
- add:meta_se: the standard error of `add:meta_beta`.
- add:meta_pvalue: the 2-sided P-value for `add:meta_beta`.
- mean_bf: the model-averaged Bayes factor; see [1] for details.
- Gambia:impute_type: the IMPUTE type reflecting imputation of Gambia data; this is either 0 (imputed) or nonzero (directly typed).
- Gambia:impute_info: the IMPUTE info score reflecting imputation of Gambia data; higher scores reflect more confident imputation up to a maximum of 1.
- Gambia:controls_alleleB_frequency: expected frequency of allele B in controls.
- Malawi:impute_type: the IMPUTE type reflecting imputation of Malawi data; this is either 0 (imputed) or nonzero (directly typed).
- Malawi:impute_info: the IMPUTE info score reflecting imputation of Malawi data; higher scores reflect more confident imputation up to a maximum of 1.
- Malawi:controls_alleleB_frequency: expected frequency of allele B in controls.
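- Kenya:impute_type: the IMPUTE type reflecting imputation of Kenya data; this is either 0 (imputed) or nonzero (directly typed).
- Kenya:impute_info: the IMPUTE info score reflecting imputation of Kenya data; higher scores reflect more confident imputation up to a maximum of 1.
- Kenya:controls_alleleB_frequency: expected frequency of allele B in controls.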

All allele frequencies are based on imputed genotype probabilities for imputed SNPs, or on prephased genotypes for directly-typed SNPs.  Info scores
for X chromosome variants are computed in females only.
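
As an illustration of how the association columns relate to one another, the following Python sketch converts
`add:meta_beta` and `add:meta_se` into an odds ratio with a 95% confidence interval and a two-sided Wald P-value.
It assumes a normal approximation for the meta-analysis estimate, which may differ in detail from the computation
described in [1].  The function name `summarise_effect` and the input values are hypothetical and are not taken
from the data.

```python
# Python code:
import math

def summarise_effect(beta, se):
    """Return (odds ratio, CI lower, CI upper, two-sided Wald P-value).

    beta is a log odds ratio (as in add:meta_beta) and se its standard
    error (as in add:meta_se).  A normal approximation is assumed.
    """
    z = beta / se
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - 1.96 * se)
    upper = math.exp(beta + 1.96 * se)
    # Two-sided tail probability of a standard normal at |z|
    pvalue = math.erfc(abs(z) / math.sqrt(2.0))
    return odds_ratio, lower, upper, pvalue

# Made-up values for illustration only
odds_ratio, lower, upper, pvalue = summarise_effect(beta=-0.40, se=0.05)
print(f"OR = {odds_ratio:.3f} (95% CI {lower:.3f} to {upper:.3f}), P = {pvalue:.3g}")
# end of Python code
```

A negative `add:meta_beta` gives an odds ratio below 1, i.e. the B allele is associated with reduced odds of
severe malaria under the additive model.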


Using the .csv file
-------------------

The CSV file is large but can be directly loaded into analysis programs such as R (https://www.r-project.org).
As an example, the following R code loads the first 10,000 rows into an R session:

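# R code:
# check.names = FALSE keeps column names such as 'add:meta_beta' unchanged
data = read.csv( 'EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.csv.gz', header = TRUE, as.is = TRUE, nrows = 10000, check.names = FALSE )
# end of R code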

Note that loading the whole file in this way may take a considerable amount of time and will use up several
gigabytes of memory.  (We do not recommend loading this file into a spreadsheet program such as Excel).

Alternatively, UNIX tools such as grep and awk can be used to extract desired subsets of the data.
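
The same kind of subsetting can be done without loading the whole file by streaming it from a scripting language.
The following Python sketch reads the gzipped file row by row and keeps only variants in a given region.  It assumes
the file is comma-separated with a header row containing the column names listed above, and that `chromosome` is the
two-character zero-padded field described there.  The function name `extract_region` is hypothetical.

```python
# Python code:
import csv
import gzip

def extract_region(path, chromosome, start, end):
    """Yield rows (dicts keyed by column name) with start <= position <= end."""
    # Text mode decompresses on the fly, so only one row is held in memory at a time
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["chromosome"] == chromosome and start <= int(row["position"]) <= end:
                yield row

# Example usage (assumes the release file is in the working directory):
# path = "EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.csv.gz"
# for row in extract_region(path, "11", 5148232, 5348232):
#     print(row["rsid"], row["add:meta_pvalue"])
# end of Python code
```

Note that all values are returned as text and the whole file is still scanned once, so for repeated queries
the sqlite file is likely to be more convenient.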


Using the sqlite file
---------------------

The sqlite file can be accessed via several programs including the `sqlite3` command shell (http://www.sqlite.org), which
is installed by default on most UNIX systems.  It can also be accessed directly by programming languages such as R or python using
add-on libraries.  As an example, the following R code can be used to load summary statistics in a 200kb region around
the sickle cell anaemia locus rs334:

# R code:
library( RSQLite )
db = dbConnect( dbDriver( "SQLite" ), "EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite" )
# chromosome is a two-character, zero-padded text field, so compare it to a quoted string
data = dbGetQuery( db, "SELECT * FROM summary_statistics WHERE chromosome = '11' AND position BETWEEN 5148232 AND 5348232" )
dbDisconnect( db )
# end of R code
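
A comparable query can be run from python using the standard `sqlite3` module.  The following sketch assumes the table
is named `summary_statistics`, as in the R example, and that the column names match the list given above.  Column names
containing a colon, such as `add:meta_pvalue`, must be double-quoted in SQL.  The function name `load_region` is
hypothetical.

```python
# Python code:
import sqlite3

def load_region(db_path, chromosome, start, end):
    """Return (column names, rows) for variants in a region, smallest P-value first."""
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute(
            "SELECT * FROM summary_statistics "
            "WHERE chromosome = ? AND position BETWEEN ? AND ? "
            'ORDER BY "add:meta_pvalue"',
            (chromosome, start, end),
        )
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
    finally:
        connection.close()

# Example usage (assumes the release file is in the working directory):
# columns, rows = load_region(
#     "EGAS00001001311_MalariaGEN_GWAS_summary_statistics_2015.sqlite",
#     "11", 5148232, 5348232 )
# end of Python code
```

Passing the region as bound parameters rather than pasting values into the SQL string avoids quoting mistakes
with the zero-padded chromosome field.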