Genome-wide study of resistance to severe malaria in eleven populations

Project: Consortial Project 1

Released on 21 Mar 2016

Background

This data release contains SNP genotype data and association test results from our ongoing analysis of severe malaria in eleven populations. Data for three populations (Gambia, Malawi and Kenya) are currently available; additional populations will be added as they become available.

If you use these data, please cite: Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature. 2015 Oct 8;526(7572):253-7. doi: 10.1038/nature15390.

This release contains two types of data:

  • SNP genotype data.

These data reflect genotyping of all samples on the Illumina Omni 2.5M array and are provided in VCF format. Additionally, we provide the clinical status, gender, and sickle trait status of each sample, and information on quality control. Full details are provided below.

  • Association test summary statistics. 

Identifiers, allele frequencies, imputation status and meta-analysis results for directly typed SNPs and variants imputed from the 1000 Genomes Project Phase 1 reference panel. For full details see the association test summary statistics README file (95.6 KB).

These data have been deposited in the European Genome-phenome Archive under EGA Study ID: EGAS00001001311.

  • All cases were diagnosed as meeting the WHO definition of severe malaria (see References [1-3]).
  • Controls were samples from within the general population and from new births.
  • Samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant <population>.relationships.txt file (described below).

The information provided here is common to each of the three population-specific datasets. For the association test summary statistics please see the separate README file (95.6 KB).

Data set structure

Each data set contains a README file and a set of three data files:

README files

  • EGAS00001001311_Kenya_GWAS-2.5M_b37_releasenote.txt
  • EGAS00001001311_Gambia_GWAS-2.5M_b37_releasenote.txt
  • EGAS00001001311_Malawi_GWAS-2.5M_b37_releasenote.txt 

Samples

Each data set includes three sample-related files:

  • A sample file
  • A sample metadata file
  • A file with information about any familial relationships

Sample files:

  • samples/Kenya_GWAS-2.5M_b37.sample
  • samples/Gambia_GWAS-2.5M_b37.sample
  • samples/Malawi_GWAS-2.5M_b37.sample

These are space-delimited files in a format suitable for use with the program SNPTEST, and contain information on the samples included in this study.

Samples are identified both by a sample identifier and a chip assay identifier.

Note that in some cases the same sample was genotyped multiple times, giving multiple chip IDs.

The first row of this file gives column names. Columns are described below and in the file sample_metadata.csv.

The second row of this file contains information on the type of values stored in the file, as follows:

  • 0 - an identifier field
  • D - a discrete or categorical field
  • B - a binary (case/control) phenotype
  • C - a continuous or numerical covariate

Note that for some tools it may be necessary to rename the first two columns of this file as 'ID_1' and 'ID_2'.

Columns in this file are as follows:

  • chip_id - identifier for the chip assay
  • sample_id - identifier for the DNA sample
  • missing - not used in this dataset
  • dataset - the name of the dataset
  • plate - the id of the 96-well plate on which the sample was supplied for genotyping
  • well - the well on the 96-well plate on which the sample was supplied for genotyping
  • status - either 'CASE' (for severe malaria cases) or 'CONTROL' (for population controls). Please note that some samples are neither cases nor controls: these are samples collected as parents of affected children (reported as 'PARENT') or samples with another designation (not reported here). Where applicable, family structure is described in the relevant <population>.relationships.txt file (described below). There are also 3 HapMap samples (NA12878, NA12891, NA12892).
  • severe_malaria - A binary (0/1) indicator of case/control status based on the status column above. We include this to simplify association testing.
  • clinical_sex - Gender as reported on sample collection. M = male, F = female, NA = missing or unknown gender.
  • estimated_sex - Gender as determined by comparison of assay intensities on the X and Y chromosomes. This is only provided for samples that passed QC thresholds.
  • ethnicity - Reported ethnic group. Where maternal and paternal ethnic groups differ, this is reported in the format '<maternal ethnic group>_MIXED'. Ethnic information is provided only for the major ethnic groups (those comprising at least 5% of our sample). All other groups have been pooled together and labelled as "OTHER".
  • rs334_genotype - Assayed HbS (rs334) genotype for each individual as typed on the Sequenom iPLEX platform. See URLs below for links to further details on this SNP. The genotype data for rs334 are provided with respect to the forward strand of the human reference sequence (T: Major allele/ancestral allele/reference allele and A: Minor allele/alternative allele/non-reference allele). Note that although this SNP is reported as multi-allelic in dbSNP, we have assayed only the segregating T and A alleles. The genome position with respect to GRCh37 is 11:5204808. Where we were unable to determine a genotype the data are represented by NA.
  • PC1 to PC10 - The first 10 principal components used in [1] to control for population structure in the genome-wide association study (GWAS) analysis. Missing values are set to NA. Samples with missing values are those that were excluded from the GWAS analyses in [1]; these samples also appear in the exclusion lists.
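The two header rows described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that fields are separated by spaces and that no field contains whitespace or quoting. The function name parse_sample_file and the three-column excerpt are hypothetical and used for illustration only.

```python
def parse_sample_file(text):
    """Parse the text of a SNPTEST-style sample file.

    Returns (types, records): `types` maps each column name to its type
    code from the second row ('0', 'D', 'B' or 'C'), and `records` is a
    list of dicts, one per sample.
    Assumption: space-delimited fields, none containing whitespace.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()   # first row: column names
    codes = lines[1].split()   # second row: column type codes
    if len(names) != len(codes):
        raise ValueError("the two header rows differ in length")
    types = dict(zip(names, codes))
    records = []
    for line in lines[2:]:
        values = line.split()
        if len(values) != len(names):
            raise ValueError("unexpected number of fields: " + line)
        # 'NA' marks a missing value; store it as None.
        records.append({n: (None if v == "NA" else v)
                        for n, v in zip(names, values)})
    return types, records


# Hypothetical three-column excerpt, for illustration only.
example = """chip_id sample_id severe_malaria
0 0 B
chipA sample1 1
chipB sample2 NA
"""
types, records = parse_sample_file(example)
```

All values are kept as strings, so numeric columns such as PC1 to PC10 would still need converting before analysis.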

Sample metadata file

Each data package is accompanied by a sample metadata file: samples/sample_metadata.csv.

This is a tab-separated file listing columns in the above sample file, and giving an abbreviated form of the above descriptions. This file may be useful for automated processing.

Files reflecting family structure

The samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant <population>.relationships.txt file. 

These files describe known blood (i.e. familial) relationships in this study, as reported in our clinical data. (These data contain a small number of trio and parent-child relationships.)

  • samples/Kenya_GWAS-2.5M_b37.relationships.txt
  • samples/Gambia_GWAS-2.5M_b37.relationships.txt
  • samples/Malawi_GWAS-2.5M_b37.relationships.txt*

* This file does not contain any information, as all samples in this data set are unrelated.

Example format:

Family Child Father Mother
family_1 MLCP1_1M1300381 MLCP1_1M1424842 MLCP1_1M1424843
family_2 MLCP1_1M1300381 NA MLCP1_1M1424843
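A file in the example format above could be read as follows. This is a sketch under stated assumptions: fields are whitespace-separated, the first line is a header, and NA marks a missing parent. The function name parse_relationships is hypothetical.

```python
def parse_relationships(text):
    """Return a list of (family, child, father, mother) tuples.

    Assumptions: whitespace-separated fields and one header line, as in
    the example format; 'NA' is converted to None.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header line
        family, child, father, mother = line.split()
        rows.append((family, child,
                     None if father == "NA" else father,
                     None if mother == "NA" else mother))
    return rows


example = """Family Child Father Mother
family_1 MLCP1_1M1300381 MLCP1_1M1424842 MLCP1_1M1424843
family_2 MLCP1_1M1300381 NA MLCP1_1M1424843
"""
relationships = parse_relationships(example)
# Complete trios are the rows in which both parents are present.
trios = [r for r in relationships if r[2] is not None and r[3] is not None]
```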

Genotypes

A directory called ‘vcf’ contains the genotype data in per-chromosome files.

Genotype files:

  • vcf/Kenya_GWAS-2.5M_b37_chr??.vcf.gz
  • vcf/Gambia_GWAS-2.5M_b37_chr??.vcf.gz
  • vcf/Malawi_GWAS-2.5M_b37_chr??.vcf.gz

Index files:

  • vcf/Kenya_GWAS-2.5M_b37_chr??.vcf.gz.tbi
  • vcf/Gambia_GWAS-2.5M_b37_chr??.vcf.gz.tbi
  • vcf/Malawi_GWAS-2.5M_b37_chr??.vcf.gz.tbi

Here ?? stands for the two-digit, zero-padded chromosome number.
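The naming pattern above can be expanded programmatically. The sketch below builds the VCF and index paths for one population. The function name vcf_paths is hypothetical, and the default range of chromosomes 1 to 22 is an assumption; the release may contain other chromosomes that this sketch does not cover.

```python
def vcf_paths(population, chromosomes=range(1, 23)):
    """Build (vcf, index) path pairs for the given chromosome numbers.

    Assumption: chromosomes are numbered 1-22 by default; adjust the
    `chromosomes` argument to match the files actually present.
    """
    paths = []
    for chrom in chromosomes:
        # {chrom:02d} zero-pads the chromosome number to two digits.
        vcf = f"vcf/{population}_GWAS-2.5M_b37_chr{chrom:02d}.vcf.gz"
        paths.append((vcf, vcf + ".tbi"))
    return paths


kenya = vcf_paths("Kenya")
```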

Genotype and normalised intensity data are provided in bgzipped VCF format. A tabix index (.tbi) file is provided with each VCF file. See the ‘Useful links’ section below for links to software that can be used to access these data.

Column names in the VCF files refer to the chip_id in the sample information file described above, and appear in the same order as in that file.
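Before combining genotypes with sample information, it may be worth confirming this ordering. The sketch below extracts the sample names from a VCF column header line, relying only on the standard VCF layout of nine fixed columns followed by one column per sample. The function names and the chip IDs in the example are hypothetical.

```python
def vcf_sample_names(header_line):
    """Return the sample column names from a VCF '#CHROM' header line."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    # The first nine columns are fixed (CHROM to FORMAT); samples follow.
    return fields[9:]


def same_order(header_line, chip_ids):
    """True if the VCF sample columns match chip_ids, in the same order."""
    return vcf_sample_names(header_line) == list(chip_ids)


# Hypothetical header with two sample columns, for illustration only.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "chipA", "chipB"])
ok = same_order(header, ["chipA", "chipB"])
swapped = same_order(header, ["chipB", "chipA"])
```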

VCF files contain the following fields:

  • GT - consensus genotype call, representing a consensus among three algorithms (Illuminus, GenoSNP, and Illumina GenCall). See [1,2] for full methodology.
  • GLI - genotype call from Illuminus
  • GLG - genotype call from GenoSNP
  • GC - genotype call from Illumina's GenCall algorithm
  • GCS - genotype call score from Illumina's GenCall algorithm
  • XY - normalised assay intensity information for each SNP and each sample
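In a VCF record, the FORMAT column lists the field keys separated by colons, and each sample column holds the corresponding values in the same order. The sketch below pairs them up. The function name is hypothetical, and the example values are invented for illustration: the exact encoding of fields such as GCS and XY in these files is not described here, so the files themselves should be checked.

```python
def split_sample_field(format_field, sample_field):
    """Pair the keys in a VCF FORMAT column with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # VCF allows trailing values to be dropped; zip stops at the shorter list.
    return dict(zip(keys, values))


# Hypothetical FORMAT and sample strings; the real encodings may differ.
call = split_sample_field("GT:GCS:XY", "0/1:0.85:0.51,0.48")
```

Values are returned as strings, leaving any numeric conversion to the caller.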

All chromosomes and positions in the files are in NCBI build 37/GRCh37 coordinates. All data were typed on the Illumina Omni 2.5M [quad/oct] platform using the [HumanOmni2.5-4v1_D/HumanOmni2.5-8v1_A] Illumina chip manifest, which is available from Illumina.

Note that variant IDs and alleles in these files reflect the Name, IlmnID and SNP columns of the chip manifest.

Quality control (QC) information

Sample exclusions:

  • exclusions/Kenya_GWAS-2.5M_b37_sample_exclusions.txt
  • exclusions/Gambia_GWAS-2.5M_b37_sample_exclusions.txt
  • exclusions/Malawi_GWAS-2.5M_b37_sample_exclusions.txt

These files list the samples that were excluded from our analysis due to QC criteria including missing call rate and heterozygosity, or as genetic duplicates. Each file has two columns: the first gives the chip ID of the excluded sample, and the second gives the reason for exclusion.

Possible reasons for exclusion are 'quality' (excluded due to high missingness, outlying heterozygosity, or outlying average intensities), 'relatedness' (excluded due to high relatedness with another sample), 'technical' (excluded for technical reasons), or 'hapmap' (a HapMap sample).
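A sample exclusion list could be applied as sketched below. The code assumes whitespace-separated columns and no header line, neither of which is stated above, so the files should be checked first. The function names and chip IDs are hypothetical.

```python
def parse_sample_exclusions(text):
    """Map each excluded chip ID to its exclusion reason.

    Assumptions: two whitespace-separated columns and no header line.
    """
    exclusions = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        chip_id, reason = line.split(None, 1)
        exclusions[chip_id] = reason.strip()
    return exclusions


def keep_samples(chip_ids, exclusions, drop_reasons=None):
    """Return the chip IDs that are not excluded.

    If drop_reasons is given, only samples excluded for one of those
    reasons are dropped; by default every listed sample is dropped.
    """
    kept = []
    for chip_id in chip_ids:
        reason = exclusions.get(chip_id)
        if reason is None or (drop_reasons is not None
                              and reason not in drop_reasons):
            kept.append(chip_id)
    return kept


# Hypothetical exclusion list, for illustration only.
exclusions = parse_sample_exclusions("chipB quality\nchipC hapmap\n")
passing = keep_samples(["chipA", "chipB", "chipC"], exclusions)
quality_only = keep_samples(["chipA", "chipB", "chipC"], exclusions,
                            drop_reasons={"quality"})
```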

SNP exclusions:

  • Kenya_GWAS-2.5M_b37_snp_exclusions.txt
  • Gambia_GWAS-2.5M_b37_snp_exclusions.txt
  • Malawi_GWAS-2.5M_b37_snp_exclusions.txt 

These files contain a list of SNPs that were excluded from our analysis during QC prior to imputation.

Each file has six columns giving the SNPID, rsid, chromosome, position and two alleles of the excluded SNP.

References

These data were used in the following manuscripts:

[1] Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature, 2015;526(7572):253-7. DOI: 10.1038/nature15390.

[2] Band et al. Imputation-based meta-analysis of severe malaria in three African populations. PLOS Genetics, 2013; 9(5): e1003509. DOI: 10.1371/journal.pgen.1003509

The following manuscript may also be of use in interpreting these data:

[3] Rockett et al. Reappraisal of known malaria resistance loci in a large multicenter study. Nature Genetics, 2014; 46(11): 1197-204. DOI: 10.1038/ng.3107

Useful links

File formats

VCF format http://www.htslib.org/doc/vcf.html

SNPTEST https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html

SNPTEST file formats

https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#input_file_formats

More information on rs334

Ensembl genome browser http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=11:5226502-5227502;v=rs334;vdb=variation;vf=328

dbSNP http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs334

The following tools may be useful in manipulating the files contained in this data release:

VCFtools https://vcftools.github.io/index.html

tabix http://www.htslib.org/doc/

QCTOOL http://www.well.ox.ac.uk/~gav/qctool/#overview

VariantAnnotation R package http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

21 Mar 2016

Kenya

EGA Study ID: EGAS00001001311

EGA Data Set ID: EGAD00010000904 (1,708 controls; 1,944 cases; 180 parents; 33 other)

Method: Illumina Omni 2.5M genotyping

21 Mar 2016

Gambia

EGA Study ID: EGAS00001001311

EGA Data Set ID: EGAD00010000902 (2,786 controls; 2,807 cases; 1 parent)

Method: Illumina Omni 2.5M genotyping

17 Mar 2016

Malawi

EGA Study ID: EGAS00001001311

EGA Data Set ID: EGAD00010000903 (1,498 controls; 1,590 cases)

Method: Illumina Omni 2.5M genotyping

17 Oct 2016

Association test summary statistics (three populations)

EGA Study ID: EGAS00001001311

EGA Data Set ID: EGAD00010001081 (5,291 controls; 5,130 cases)

Description: Allele frequencies and meta analysis summary statistics for Kenya, The Gambia and Malawi.

Release notes

21 Mar 2016
Samples may also be included in other data releases

Some of the samples included in this data release may also be present in other MalariaGEN data releases where different genotyping technologies or chip designs were used. The sample_ids provide the primary way to identify these samples between the different data releases.
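Matching on sample_id can be sketched as follows. The input is assumed to be a list of (sample_id, chip_id) pairs per release, for example taken from the first two columns of a sample file. Because a sample may have been genotyped more than once, each sample_id maps to a list of chip IDs. The function names and identifiers are hypothetical.

```python
from collections import defaultdict


def chips_by_sample(pairs):
    """Group chip IDs by sample_id; a sample may have several chip IDs."""
    grouped = defaultdict(list)
    for sample_id, chip_id in pairs:
        grouped[sample_id].append(chip_id)
    return dict(grouped)


def shared_samples(pairs_a, pairs_b):
    """Map each sample_id found in both releases to its chip IDs in each."""
    a = chips_by_sample(pairs_a)
    b = chips_by_sample(pairs_b)
    return {s: (a[s], b[s]) for s in sorted(a.keys() & b.keys())}


# Hypothetical identifiers, for illustration only.
this_release = [("s1", "c1"), ("s1", "c2"), ("s2", "c3")]
other_release = [("s1", "x9"), ("s3", "x7")]
shared = shared_samples(this_release, other_release)
```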

17 Oct 2016
Association test summary statistics (three populations) data set

The README file for the Association test summary statistics (three populations) is available here: README (5.96 KB).