Imputation-based meta-analysis of severe malaria

Project: Consortial Project 1

Released on 30 Jun 2014

The initial study and data description are published in: Band G et al. (2013). Imputation-based meta-analysis of severe malaria in three African populations. PLoS Genet. 9:e1003509.

This data release contains three separate data packages of SNP genotype data for cases and controls from three populations: Gambia, Kenya and Malawi.

This data has been deposited in the European Genome-phenome Archive (EGA) under EGA Study Code EGAS00001000807.

  • All cases have been diagnosed with malaria in a hospital.
  • Controls were sampled from the general population and from new births.
  • All samples are unrelated (but see Readme files for further details).

The information provided here is common to all three country datasets; differences are noted where they exist.

Data set structure

Each data package contains:

All files are in the 'oxford' format suitable for use with SNPTEST v2. For a more detailed description of these file formats, see the SNPTEST documentation.

Sample information:

  • Gambia_650Y_HM3_information.sample
  • Kenya_HM3_information.sample
  • Malawi_1M_HM3.sample

These are space-delimited files.

These files contain information on the samples. The table below gives an example of the data:

ID_1 ID_2 missing sex ethnicity country control MALARIA rs334 PCA_1 ... PCA_10
0 0 0 D D D B B D C C C
WTCCC130585 WTCCC130585 0.076 M MANDINKA GM 0 1 TT 0.0481 ... -0.00276

The second row gives the data type of each column:

  • 0 = N/A
  • D = discrete variable
  • B = binary variable (0/1)
  • C = continuous variable

Rows 3 onwards contain the sample data, listed in the same row order as in the other files in this dataset. Please do not change the sample sort order of this file or of the other files.


  • ID_1 and ID_2 describe the sample_ids and are identical
  • Missing: proportion of SNP data missing
  • Sex: M = male, F = female, NA = missing
  • Ethnicity: only ethnic information for the major ethnic groups is provided and all other groups have been pooled together and labelled as "OTHER"
  • Country_code: GM, KE, MW  (ISO code for Gambia, Kenya and Malawi respectively)
  • Control: sample collected from the general population (0=NO, 1=YES)
  • Malaria: sample collected from a patient with severe malaria (0=NO, 1=YES)
  • rs334: probable HbS (rs334) genotype for each individual typed using the Sequenom iPLEX platform.
  • PCA_x: PCA_1 to PCA_10 are the first 10 principal components used in the above paper to control for population structure in the genome-wide association study (GWAS) analyses. Missing values are set to NA; samples with missing PCs are those that were excluded from the GWAS analyses in the above paper. These samples also appear in the exclusion lists.
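
The layout described above (a row of column names, a row of data types, then one row per sample) can be read with a few lines of Python. This is a minimal sketch, not part of the release: the function name read_sample_file and the example rows are made up for illustration, and it assumes that no field value contains a space.

```python
import io

def read_sample_file(handle):
    """Parse an Oxford-format .sample file into (column_types, records)."""
    names = handle.readline().split()   # row 1: column names
    types = handle.readline().split()   # row 2: data types (0, D, B, C)
    if len(names) != len(types):
        raise ValueError("header and data-type rows differ in length")
    column_types = dict(zip(names, types))
    records = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError("unexpected number of fields in sample row")
        record = {}
        for name, kind, value in zip(names, types, fields):
            if value == "NA":
                record[name] = None          # missing value
            elif kind == "C" or name == "missing":
                record[name] = float(value)  # continuous columns
            else:
                record[name] = value         # keep IDs and codes as text
        records.append(record)
    return column_types, records

# Made-up rows in the same layout as the example above.
example = io.StringIO(
    "ID_1 ID_2 missing sex country control MALARIA PCA_1\n"
    "0 0 0 D D B B C\n"
    "S1 S1 0.076 M GM 0 1 0.0481\n"
    "S2 S2 0.010 F GM 1 0 NA\n"
)
column_types, records = read_sample_file(example)
cases = [r["ID_1"] for r in records if r["MALARIA"] == "1"]
print(cases)
```

Records are returned in file order, so list positions can be used to index the genotype columns of the other files.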

Excluded samples:

  • Gambia_650Y_HM3_samples.excluded
  • Kenya_HM3_samples.excluded
  • Malawi_1M_HM3_samples.excluded

These files contain lists of samples that were excluded from final analysis due to various QC criteria including:

  • pass_rate
  • heterozygosity

Note that there is no header line or column data-type row at the beginning of these files.
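
Because the sample order must not be changed, one way to apply an exclusion list is to build a keep/drop mask rather than delete rows. The sketch below is illustrative only. It assumes that the sample ID is the first whitespace-separated token on each line of the exclusion file, which is not stated in this description; the function names and IDs are hypothetical.

```python
def read_excluded_ids(lines):
    """Collect sample IDs from an exclusion list.

    Assumption: the sample ID is the first token on each line.
    """
    ids = set()
    for line in lines:
        tokens = line.split()
        if tokens:
            ids.add(tokens[0])
    return ids

def inclusion_mask(sample_ids, excluded_ids):
    """Return one boolean per sample, in file order, without reordering."""
    return [sid not in excluded_ids for sid in sample_ids]

sample_ids = ["S1", "S2", "S3"]               # made-up IDs in .sample row order
excluded = read_excluded_ids(["S2\n", "\n"])  # made-up exclusion list
mask = inclusion_mask(sample_ids, excluded)
print(mask)
```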


Imputed genotype data:

These files contain posterior probabilities of genotype calls from imputation using IMPUTE2 into the HapMap 3 (release #2) reference panel obtained from the IMPUTE webpage.

Coordinates refer to NCBI build 36.

A directory called 'gen' contains the SNP data by chromosome.

Files are provided per chromosome, with names of the form:

  • Gambia_650Y_HM3_imputed.??.gen.gz
  • Kenya_650Y_HM3_imputed.??.gen.gz
  • Malawi_HM3_imputed.??.gen.gz

where ?? is the chromosome number, zero-padded to two digits.
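
As an illustration of the naming scheme, the per-chromosome file names can be generated as follows. This sketch assumes chromosomes 1 to 22, which is not stated here; check the 'gen' directory for the files actually present. The helper name gen_file_name is hypothetical.

```python
def gen_file_name(prefix, chromosome):
    """Build a per-chromosome file name with a two-digit chromosome number."""
    return f"{prefix}_imputed.{chromosome:02d}.gen.gz"

# Assumption: autosomes 1 to 22.
names = [gen_file_name("Gambia_650Y_HM3", c) for c in range(1, 23)]
print(names[0])
```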

Each file contains the posterior probabilities of genotyping calls for HapMap3 SNPs.

The file format is as per SNPTEST v2:

SNP1 rs1 1000 A C 1 0 0 1 0 0 ...
SNP2 rs2 2000 G T 1 0 0 0 1 0 ...
SNP3 rs3 3000 C T 1 0 0 0 1 0 ...

There are no header rows.

Columns correspond to:

  • Column1: SNP name
  • Column2: rs id
  • Column3: chromosomal position
  • Column4 and 5: SNP alleles from Illumina Manifests
  • Column6 ... ColumnN: contain the posterior probabilities of imputation. Each individual is represented by 3 values corresponding to the genotypes AA, AB and BB respectively.
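
A line in this format can be split into its five leading columns and one probability triplet per sample. The sketch below is a minimal illustration with hypothetical function names. The expected dosage of the second allele (P(AB) + 2 × P(BB)) is included only as a common way to summarise a triplet; it is not a field of the files.

```python
def parse_gen_line(line):
    """Split one .gen line into SNP details and per-sample probability triplets."""
    fields = line.split()
    snp_name, rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("probability columns are not a multiple of 3")
    # One (P(AA), P(AB), P(BB)) triplet per sample, in .sample row order.
    triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "snp": snp_name,
        "rsid": rs_id,
        "position": int(position),
        "alleles": (allele_a, allele_b),
        "probs": triplets,
    }

def expected_b_dosage(triplet):
    """Expected count of the second allele (B) for one sample."""
    p_aa, p_ab, p_bb = triplet
    return p_ab + 2.0 * p_bb

# Made-up line with two samples.
record = parse_gen_line("SNP1 rs1 1000 A C 1 0 0 0.25 0.5 0.25")
dosages = [expected_b_dosage(t) for t in record["probs"]]
print(dosages)
```

The number of triplets on each line should equal the number of sample rows in the matching .sample file.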

Other files

A ReadMe file:

  • ReadMe_Gambia_Illumina-Imputation_data_EGAS00001000807.txt
  • ReadMe_Kenya_Illumina-Imputation_data_EGAS00001000807.txt
  • ReadMe_Malawi_Illumina-Imputation_data_EGAS00001000807.txt

A directory called 'bgen':

  • Gambia_650Y_HM3_imputed.bgen
  • Kenya_650Y_HM3_imputed.bgen
  • Malawi_HM3_imputed.bgen

Imputed genotype probabilities in bgen format.

A directory called 'info':


These files provide imputation information.

Note: The .bgen file can be converted into other formats using qctool, e.g. from bgen to gen format. See the qctool manual for more details. Example command:

./qctool -g Gambia_650Y_HM3_imputed.bgen -og Gambia_650Y_HM3_imputed.gen

Extra information specific to Kenya

This file describes samples in trios or partial trios.

Example format:

family_ID sample_ID father_ID mother_ID family_member
Kenya_fam_02 MLCP1_1M1300381 MLCP1_1M1424842 MLCP1_1M1424843 Index
Kenya_fam_02 MLCP1_1M1424842 NA NA Father
Kenya_fam_02 MLCP1_1M1424843 NA NA Mother

  • family_ID: ID used to group family members
  • sample_ID: unique individual ID that maps to the Kenya_HM3_information.sample file
  • father_ID: individual ID of the father, which maps to the sample_ID field
  • mother_ID: individual ID of the mother, which maps to the sample_ID field
  • family_member: relationship of the individual to the family index member
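
The family file can be grouped by family to find complete and partial trios. This is a minimal sketch, assuming the file is whitespace-delimited with a header row as in the example above; the function names read_trios and is_complete_trio are hypothetical.

```python
from collections import defaultdict

def read_trios(lines):
    """Group rows of the family file by family_ID."""
    rows = iter(lines)
    header = next(rows).split()
    families = defaultdict(list)
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for key in ("father_ID", "mother_ID"):
            if row[key] == "NA":
                row[key] = None  # parent not recorded
        families[row["family_ID"]].append(row)
    return dict(families)

def is_complete_trio(members):
    """True if some member has both parents present in the same family."""
    ids = {m["sample_ID"] for m in members}
    return any(m["father_ID"] in ids and m["mother_ID"] in ids for m in members)

# Rows copied from the example format above.
example = [
    "family_ID sample_ID father_ID mother_ID family_member",
    "Kenya_fam_02 MLCP1_1M1300381 MLCP1_1M1424842 MLCP1_1M1424843 Index",
    "Kenya_fam_02 MLCP1_1M1424842 NA NA Father",
    "Kenya_fam_02 MLCP1_1M1424843 NA NA Mother",
]
families = read_trios(example)
print(is_complete_trio(families["Kenya_fam_02"]))
```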


Gambia

EGA Data Study: EGAS00001000807

EGA Data Set ID: EGAD00010000572 (1,533 controls and 1,247 cases)

Method: Illumina 650Y with 1000G imputation


Kenya

EGA Data Study: EGAS00001000807

EGA Data Set ID: EGAD00010000570 (1,544 controls and 1,711 cases)

Method: Illumina 2.5M with 1000G imputation


Malawi

EGA Data Study: EGAS00001000807

EGA Data Set ID: Not yet available (2,239 controls and 1,451 cases)

Method: Illumina 1.2M with 1000G imputation

Release notes

9 Oct 2015
Samples may also be included in other data releases

Some of the samples included in this data release may also be present in other MalariaGEN data releases where different genotyping technologies or chip designs were used. The sample_ids provide the primary way to identify these samples between the different data releases.

10 Oct 2015
Malawi data set

Please note that this dataset has been prepared for release by MalariaGEN and will be released as soon as the relevant ethics committee confirms the range of acceptable research uses.