This data release contains SNP genotype data and association test results from our ongoing analysis of severe malaria in eleven populations. Data for three populations (Gambia, Malawi and Kenya) are currently available; additional populations will be added as they become available.
If you use these data, please cite: Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature. 2015 Oct 8;526(7572):253-7. doi: 10.1038/nature15390.
This release contains two types of data:
- SNP genotype data.
These data reflect genotyping of all samples on the Illumina Omni 2.5M array and are provided in VCF format. Additionally, we provide the clinical status, gender, and sickle trait status of each sample, and information on quality control. Full details are provided below.
- Association test summary statistics.
Identifiers, allele frequencies, imputation status and meta-analysis results for directly typed SNPs and variants imputed from the 1000 Genomes Project Phase 1 reference panel. For full details see the association test summary statistics README file (95.6 KB).
These data have been deposited in the European Genome-phenome Archive under EGA Study ID: EGAS00001001311.
- All cases were diagnosed as meeting the WHO definition of severe malaria (see References [1-3]).
- Controls were sampled from the general population and from new births.
- Samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant <population>.relationships.txt file (described below).
The information provided here is common to each of the three population-specific datasets. For the association test summary statistics please see the separate README file (95.6 KB).
Data set structure
Each data set contains a README file and a set of three data files:
Each data set includes three sample-related files:
- A sample file
- A sample metadata file
- A file with information about any familial relationships
These are space-delimited files in a format suitable for use with the program SNPTEST, and contain information on the samples included in this study.
Samples are identified both by a sample identifier and a chip assay identifier.
Note that in some cases the same sample was genotyped multiple times, giving multiple chip IDs.
The first row of this file gives column names. Columns are described below and in the file sample_metadata.csv.
The second row of this file contains information on the type of values stored in the file, as follows:
- 0 - an identifier field
- D - a discrete or categorical field
- B - a binary (case/control) phenotype
- C - a continuous or numerical covariate
Note that for some tools it may be necessary to rename the first two columns of this file as 'ID_1' and 'ID_2'.
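The two-row header described above can be handled with a short parser. The following is a minimal sketch, assuming whitespace-delimited fields with no embedded spaces; the function names and the example sample values (CHIP001, SAMPLE001 and so on) are hypothetical and are not taken from the released files.

```python
import io


def read_sample_file(handle):
    """Parse a SNPTEST-style sample file into (columns, types, rows).

    Row 1 holds column names, row 2 holds the type codes (0, D, B, C),
    and every later row describes one chip assay.
    """
    lines = [line.split() for line in handle if line.strip()]
    columns, types, records = lines[0], lines[1], lines[2:]
    if len(columns) != len(types):
        raise ValueError("header row and type row differ in length")
    rows = [dict(zip(columns, record)) for record in records]
    return columns, types, rows


def rename_id_columns(columns):
    """Return column names with the first two renamed to ID_1 and ID_2."""
    return ["ID_1", "ID_2"] + list(columns[2:])


# Hypothetical miniature sample file, for illustration only.
example = io.StringIO(
    "chip_id sample_id missing status severe_malaria\n"
    "0 0 0 D B\n"
    "CHIP001 SAMPLE001 0 CASE 1\n"
    "CHIP002 SAMPLE002 0 CONTROL 0\n"
)
columns, types, rows = read_sample_file(example)
print(rename_id_columns(columns))
print(rows[0]["status"])
```

To read a real file, pass an open text handle for the sample file in place of the in-memory example.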
Columns in this file are as follows:
- chip_id - identifier for the chip assay
- sample_id - identifier for the DNA sample
- missing - not used in this dataset
- dataset - the name of the dataset
- plate - the id of the 96-well plate on which the sample was supplied for genotyping
- well - the well on the 96-well plate on which the sample was supplied for genotyping
- status - either 'CASE' (for severe malaria cases) or 'CONTROL' (for population controls). Please note that some samples have no control or malaria assignment. These are samples collected as parents of affected children (reported as 'PARENT') or samples with other designation (not reported here). Where applicable, family structure is described in the file Gambia_GWAS-2.5M_b37.relationships.txt (described below). There are also 3 HapMap samples (NA12878, NA12891, NA12892).
- severe_malaria - A binary (0/1) indicator of case/control status based on the status column above. We include this to simplify association testing.
- clinical_sex - Gender as reported on sample collection. M = male, F = female, NA = missing or unknown gender.
- estimated_sex - Gender as determined by comparison of assay intensities on the X and Y chromosomes. This is only provided for samples that passed QC thresholds.
- ethnicity - Reported ethnic group. Where maternal and paternal ethnic groups differ, this is reported in the format '<maternal ethnic group>_MIXED'. Ethnic information is provided only for the major ethnic groups (those comprising at least 5% of our sample). All other groups have been pooled together and labelled as "OTHER".
- rs334_genotype - Assayed HbS (rs334) genotype for each individual as typed on the Sequenom iPLEX platform. See URLs below for links to further details on this SNP. The genotype data for rs334 are provided with respect to the forward strand of the human reference sequence (T: Major allele/ancestral allele/reference allele and A: Minor allele/alternative allele/non-reference allele). Note that although this SNP is reported as multi-allelic in dbSNP, we have assayed only the segregating T and A alleles. The genome position with respect to GRCh37 is 11:5204808. Where we were unable to determine a genotype the data are represented by NA.
- PC1 to PC10 - The first 10 principal components used to control for population structure in genome-wide association analysis (GWAS). Missing values are set to NA; samples with missing values are those that were excluded from GWAS analyses, and these samples also appear in the exclusion lists.
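As an illustration of how the status, severe_malaria and PC columns fit together, the sketch below selects cases and controls that have principal components and checks that the binary indicator agrees with the status column. It is not the analysis pipeline used in the study. The rows are hypothetical, and the assumption that non-case, non-control samples carry NA in severe_malaria is ours rather than something stated above.

```python
from collections import Counter


def analysis_samples(rows):
    """Keep cases and controls whose PC1 is not NA (i.e. not excluded from GWAS)."""
    return [
        row for row in rows
        if row["status"] in ("CASE", "CONTROL") and row["PC1"] != "NA"
    ]


def status_matches_indicator(row):
    """Check that severe_malaria is 1 for cases and 0 for controls."""
    expected = "1" if row["status"] == "CASE" else "0"
    return row["severe_malaria"] == expected


# Hypothetical rows, as they might be returned by a sample-file parser.
rows = [
    {"chip_id": "CHIP001", "status": "CASE", "severe_malaria": "1", "PC1": "0.012"},
    {"chip_id": "CHIP002", "status": "CONTROL", "severe_malaria": "0", "PC1": "NA"},
    {"chip_id": "CHIP003", "status": "PARENT", "severe_malaria": "NA", "PC1": "NA"},
    {"chip_id": "CHIP004", "status": "CONTROL", "severe_malaria": "0", "PC1": "-0.004"},
]

kept = analysis_samples(rows)
counts = Counter(row["status"] for row in kept)
print(counts)
print(all(status_matches_indicator(row) for row in kept))
```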
Sample metadata file
Each data package is accompanied by a sample metadata file: samples/sample_metadata.csv.
This is a tab-separated file listing columns in the above sample file, and giving an abbreviated form of the above descriptions. This file may be useful for automated processing.
Files reflecting family structure
The samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant <population>.relationships.txt file.
These files describe known blood (i.e. familial) relationships in this study, as reported in our clinical data. (These data contain a small number of trio and parent-child relationships.)
* This file does not contain any information, as all samples in this data set are unrelated.
A directory called ‘vcf’ contains the genotype data in per-chromosome files.
Where ?? represents the chromosome number with zero-padded prefix.
Genotype and normalised intensity data are provided in bgzipped VCF format. A tabix index (.tbi) file is provided with each VCF file. See the ‘Useful links’ section below for links to software that can be used to access these data.
Column names in the VCF files refer to the chip_id in the sample information file described above, and appear in the same order as in that file.
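Because the VCF sample columns are stated to follow the order of the sample file, it can be worth confirming this before joining genotypes to phenotypes. The sketch below reads the sample names from the #CHROM header line and compares them with a list of chip IDs. The helper names and example identifiers are hypothetical; bgzipped files are gzip-compatible, so Python's standard gzip module is assumed to be sufficient for sequential reading.

```python
import gzip


def open_vcf(path):
    """Open a bgzipped VCF for sequential text reading."""
    return gzip.open(path, "rt")


def vcf_sample_ids(lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The nine fixed columns (CHROM .. FORMAT) precede the samples.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break
    raise ValueError("no #CHROM header line found")


def same_order(vcf_ids, chip_ids):
    """True if the VCF columns match the sample-file chip IDs exactly."""
    return list(vcf_ids) == list(chip_ids)


# Hypothetical header lines, for illustration only.
header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tCHIP001\tCHIP002\n",
]
ids = vcf_sample_ids(header)
print(ids)
print(same_order(ids, ["CHIP001", "CHIP002"]))
```

With a real file, `vcf_sample_ids(open_vcf(path))` would read only as far as the header line.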
VCF files contain the following fields:
- GT - consensus genotype call, representing a consensus among three algorithms (Illuminus, GenoSNP, and Illumina GenCall). See [1,2] for full methodology.
- GLI - genotype call from Illuminus
- GLG - genotype call from GenoSNP
- GC - genotype call from Illumina's GenCall algorithm
- GCS - genotype call score from Illumina's GenCall algorithm
- XY - normalised assay intensity information for each SNP and each sample
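The per-sample fields listed above are carried in the FORMAT column of each VCF record. The sketch below splits one record into a dictionary of field values per sample, following the generic VCF convention of colon-separated keys and values. The record shown is hypothetical, and the encoding of each value (for example how GC or XY is written) is an assumption that should be checked against the VCF header.

```python
def parse_genotype_fields(record_line):
    """Split one VCF data line into a list of {field: value} dicts, one per sample."""
    columns = record_line.rstrip("\n").split("\t")
    keys = columns[8].split(":")  # FORMAT column, e.g. GT:GC:GCS
    return [dict(zip(keys, sample.split(":"))) for sample in columns[9:]]


# Hypothetical record with two samples; values are made up for illustration.
record = "1\t1000\tsnp1\tA\tG\t.\t.\t.\tGT:GC:GCS\t0/1:0/1:0.91\t0/0:0/0:0.88\n"
per_sample = parse_genotype_fields(record)
print(per_sample[0]["GT"])
print(per_sample[1]["GCS"])
```

For larger analyses, a dedicated VCF library is likely to be a better choice than manual string splitting.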
All chromosomes and positions in the files are in NCBI build 37/GRCh37 coordinates. All data were typed on the Illumina Omni 2.5M [quad/oct] platform using the [HumanOmni2.5-4v1_D/HumanOmni2.5-8v1_A] Illumina chip manifest, which is available from Illumina.
Note that variant IDs and alleles in these files reflect the Name, IlmnID and SNP columns of the chip manifest.
These files list the samples that were excluded from our analysis due to QC criteria, including missing call rate and heterozygosity, or as genetic duplicates. Each file has two columns: the first gives the chip ID of the excluded sample, and the second gives the reason for exclusion.
Possible reasons for exclusion are 'quality' (excluded due to high missingness, outlying heterozygosity, or outlying average intensities), 'relatedness' (excluded due to high relatedness with another sample), 'technical' (excluded for technical reasons), or 'hapmap' (a HapMap sample).
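A sample exclusion list of this shape can be applied as a simple lookup. The sketch below assumes a whitespace-delimited file with no header row, which is not confirmed above; the function names and chip IDs are hypothetical.

```python
def read_exclusions(lines):
    """Map chip ID to exclusion reason from a two-column exclusion list."""
    reasons = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            reasons[parts[0]] = parts[1]
    return reasons


def retained(chip_ids, reasons):
    """Return the chip IDs that do not appear in the exclusion list."""
    return [chip_id for chip_id in chip_ids if chip_id not in reasons]


# Hypothetical exclusion list, for illustration only.
exclusion_lines = [
    "CHIP002 quality\n",
    "CHIP005 relatedness\n",
]
reasons = read_exclusions(exclusion_lines)
print(retained(["CHIP001", "CHIP002", "CHIP003"], reasons))
print(reasons["CHIP005"])
```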
These files list the SNPs that were excluded from our analysis during QC prior to imputation.
Each file has six columns: the SNPID, rsid, chromosome, position and two alleles of the excluded SNP.
These data were used in the following manuscripts:
 Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature, 2015;526(7572):253-7. DOI: 10.1038/nature15390.
 Band et al. Imputation-based meta-analysis of severe malaria in three African populations. PLOS Genetics, 2013; 9(5): e1003509. DOI: 10.1371/journal.pgen.1003509
The following manuscript may also be of use in interpreting these data:
 Rockett et al. Reappraisal of known malaria resistance loci in a large multicenter study. Nature Genetics, 2014; 46(11): 1197-204. DOI: 10.1038/ng.3107
VCF format http://www.htslib.org/doc/vcf.html
SNPTEST file formats
More information on rs334
The following tools may be useful in manipulating the files contained in this data release:
VariantAnnotation R package http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html