The initial study and data description are published in: Band G et al. (2013). Imputation-based meta-analysis of severe malaria in three African populations. PLoS Genet. 9:e1003509.
This data release contains three separate data packages of SNP genotype data for cases and controls from three populations: Gambia, Kenya and Malawi.
These data have been deposited in the European Genome-phenome Archive (EGA) under EGA Study Code EGAS00001000807.
- All cases have been diagnosed with malaria in a hospital.
- Controls were sampled from within the general population and from new births.
- All samples are unrelated (but see Readme files for further details).
The information provided here is common to all three country datasets; where differences exist, these are noted.
Data set structure
Each data package contains:
Files are all in the 'oxford' format suitable for use with SNPTEST v2: http://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
For a more detailed description of the file formats used in SNPTEST, see http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html
These are space-delimited files.
These files contain information on the samples. The table below gives an example of the data:
- 0 = N/A
- D = discrete variable
- C = continuous variable
Rows 3 onwards contain the sample data, listed in the same row order as in the other files in this dataset. Please do not change the sample sort order of this file or the other files.
- ID_1 and ID_2 describe the sample_ids and are identical
- Missing: proportion of SNP data missing
- Sex: M = male, F = female, NA = missing
- Ethnicity: only ethnic information for the major ethnic groups is provided and all other groups have been pooled together and labelled as "OTHER"
- Country_code: GM, KE, MW (ISO codes for Gambia, Kenya and Malawi respectively)
- Control: sample collected from the general population (0=NO, 1=YES)
- Malaria: sample collected from a patient with severe malaria (0=NO, 1=YES)
- rs334: probable HbS (rs334) genotype for each individual typed using the Sequenom iPLEX platform.
- For further details on this SNP see: http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=11:5226502-5227502;v=rs334;vdb=variation;vf=328 and http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs334
- The genotype data are provided with respect to the plus '+' strand
- T: Major allele/ancestral allele/reference allele
- A: Minor allele/alternative allele/non-reference allele
- The genome position with respect to NCBI build 36 is 11:5204808
- Where we were unable to determine a genotype the data are represented by NA
- PCA_x: PCA_1 to PCA_10 are the first 10 principal components used in the above paper to control for population structure in the genome-wide association analysis (GWAS). Missing values are set to NA; samples with missing PCs are those that were excluded from GWAS analyses in the above paper; these samples also appear in the exclusion lists.
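The sample file layout described above (a header row, a data-type row, then one row per sample) can be read with a few lines of Python. This is a minimal sketch, not part of the data release: the function name read_sample_file, the exact column spellings and the two example samples are made up for illustration, and it assumes that every field is a single space-free token and that missing values are written as NA.

```python
import io


def read_sample_file(handle):
    """Parse a space-delimited sample file.

    Row 1 holds the column names, row 2 the data-type codes
    (0, D or C), and rows 3 onwards hold one sample each.
    The original row order is preserved.
    """
    names = handle.readline().split()
    types = handle.readline().split()
    if len(names) != len(types):
        raise ValueError("header and data-type rows differ in length")

    samples = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError("unexpected number of fields: " + line.strip())
        row = {}
        for name, kind, value in zip(names, types, fields):
            if value == "NA":
                row[name] = None          # missing value
            elif kind == "C":
                row[name] = float(value)  # continuous variable
            else:
                row[name] = value         # 0 and D columns stay as text
        samples.append(row)
    return names, types, samples


# Made-up example data; the real column names and values may differ.
example = io.StringIO(
    "ID_1 ID_2 missing Sex PCA_1\n"
    "0 0 0 D C\n"
    "S1 S1 0.01 F 0.12\n"
    "S2 S2 0.02 M NA\n"
)
names, types, samples = read_sample_file(example)
```

Because the samples are returned as a list in file order, the index of each entry can be used to line it up with the genotype columns of the other files.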
These files contain lists of samples that were excluded from final analysis due to various QC criteria including:
Note that there is no header line or column data-type row at the beginning of these files.
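Since the sample order must stay fixed across files, one way to apply an exclusion list is to build a keep/drop mask rather than deleting rows. The sketch below assumes that each exclusion file holds one sample id per line, which is not stated here and should be checked against the actual files; the function name inclusion_mask and the sample ids are hypothetical.

```python
def inclusion_mask(sample_ids, excluded_ids):
    """Return one boolean per sample, in the original order.

    True means the sample is kept, False means it appears in
    the exclusion list.
    """
    excluded = set(excluded_ids)
    return [sample_id not in excluded for sample_id in sample_ids]


# Made-up ids standing in for the ID_1 column of a sample file.
sample_ids = ["S1", "S2", "S3", "S4"]

# Made-up exclusion file content (assumed: one id per line, no header).
exclusion_lines = ["S2\n", "S4\n"]
excluded_ids = [line.strip() for line in exclusion_lines if line.strip()]

mask = inclusion_mask(sample_ids, excluded_ids)
kept = [sample_id for sample_id, keep in zip(sample_ids, mask) if keep]
```

The same mask can then be applied to the per-sample columns of the genotype files without reordering anything.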
These files contain posterior probabilities of genotyping calls from imputation using IMPUTE2 into the HapMap 3 (release #2) reference panel obtained from the IMPUTE webpage (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference).
Coordinates refer to NCBI build 36.
A directory called 'gen' contains the SNP data by chromosome.
Files are provided per chromosome, with names of the format:
where ?? represents the zero-padded chromosome number.
Each file contains the posterior probabilities of genotyping calls for HapMap3 SNPs.
The file format is as per SNPTEST v2:
There are no header rows.
Columns correspond to:
- Column1: SNP name
- Column2: rs id
- Column3: chromosomal position
- Column4 and 5: SNP alleles from Illumina Manifests
- Column6 ... ColumnN: contain the posterior probabilities of imputation. Each individual is represented by 3 values corresponding to the genotypes AA, AB and BB respectively.
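The column layout above can be unpacked as follows. This is a minimal sketch under stated assumptions: the function names and the example line are made up, and the expected dosage of the second allele (AB + 2 x BB) is a commonly derived summary shown only for illustration, not a quantity supplied in the files.

```python
def parse_gen_line(line):
    """Split one line of a gen file into SNP fields and per-sample
    genotype probability triples (AA, AB, BB)."""
    fields = line.split()
    snp_name, rsid, position, allele_a, allele_b = fields[:5]
    values = [float(v) for v in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("probability columns are not a multiple of 3")
    # Group the flat list into one (AA, AB, BB) triple per individual.
    probs = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_name": snp_name,
        "rsid": rsid,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probs": probs,
    }


def expected_b_dosage(probs):
    """Expected number of copies of the second allele per individual."""
    return [ab + 2 * bb for (_aa, ab, bb) in probs]


# Made-up line describing three individuals; not taken from the data.
line = "SNP1 rs_example 12345 A G 1 0 0 0 0.5 0.5 0 0 1"
record = parse_gen_line(line)
dosages = expected_b_dosage(record["probs"])
```

The individuals appear in the same order as the rows of the sample file, so the i-th triple belongs to the i-th sample.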
A ReadMe file:
A directory called 'bgen':
Genotype probabilities of imputation output in bgen format (http://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#input_file_formats)
A directory called 'info':
These files provide imputation information.
Note: The .bgen file can be converted into other formats using qctool, e.g. bgen to gen format; see the qctool manual for more details (http://www.well.ox.ac.uk/~gav/qctool/#tutorial). Example command:
./qctool -g Gambia_650Y_HM3_imputed.bgen -og Gambia_650Y_HM3_imputed.gen
Extra information specific to Kenya
This file describes samples in trios or part trios.
- family_id: id to group family members
- sample_id: unique individual_id that maps to the Kenya_HM3_information.samples file
- father_id: individual_id of the father that maps to the individual_id field
- mother_id: individual_id of the mother that maps to the individual_id field
- family_member: relationship of the individual to the family index member
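The fields listed above can be used to group individuals by family and to tell full trios from part trios. The sketch below rests on several assumptions that are not confirmed here: that the file is whitespace-delimited with a header row using exactly these field names, that an unknown parent is written as NA, and that a full trio means a sample whose father and mother are both present in the same family. The function names and example rows, including the family_member values, are made up.

```python
import io
from collections import defaultdict


def read_trio_rows(handle):
    """Read whitespace-delimited rows into dicts keyed by the header names."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]


def group_by_family(rows):
    """Collect the rows belonging to each family_id."""
    families = defaultdict(list)
    for row in rows:
        families[row["family_id"]].append(row)
    return dict(families)


def is_full_trio(members):
    """True if some member has both parents present in the same family."""
    ids = {m["sample_id"] for m in members}
    return any(
        m["father_id"] in ids and m["mother_id"] in ids for m in members
    )


# Made-up example: F1 is a full trio, F2 is a part trio (father absent).
example = io.StringIO(
    "family_id sample_id father_id mother_id family_member\n"
    "F1 K1 K2 K3 child\n"
    "F1 K2 NA NA father\n"
    "F1 K3 NA NA mother\n"
    "F2 K4 NA K5 child\n"
    "F2 K5 NA NA mother\n"
)
families = group_by_family(read_trio_rows(example))
trio_status = {fid: is_full_trio(members) for fid, members in families.items()}
```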