Gambia, Ghana, and Malawi Trios

Released on 20 Jan 2011.

Human

This release contains SNP genotype data from mother-father-child trios genotyped on the Illumina 650Y array.

  • These data include parents and a single offspring (so-called trios) from three partner studies within Consortial Project 1.
  • All children have been diagnosed with malaria in a hospital.
  • This data release contains a total of 4,174 samples that have passed quality control.

These data have been deposited in the European Genome-phenome Archive (EGA) under EGA Study IDs: EGAS00000000087 (Gambia) and EGAS00000000088 (Ghana).

Data files

Annotations

Sample Support Files

(650Y_samples_[country]Trios.txt)

The tab-delimited Sample Support Files list the following information for each sample:

  • Family ID
  • Sample ID
  • Paternal ID (Missing parental IDs are set to ‘0’)
  • Maternal ID (Missing parental IDs are set to ‘0’)
  • Gender (1 = Male, 2 = Female)
  • Phenotype (2 = Affected, 1 = Unaffected)
  • Ethnicity*

*In the Ghanaian and Gambian Sample Support Files a seventh column lists self-reported ethnicity. Ethnicity is only reported for groups with a substantial number of samples in the data; this includes four groups in Gambia (Mandinka, Wolof, Fula, Jola) and two groups in Ghana (Akan and Northerners). All other ethnic groups and samples with unknown/missing ethnicity have been set to ‘Other’. A child’s ethnicity is assigned according to the mother’s ethnicity. No ethnicity is released for Malawi as the samples do not exhibit population substructure.

The following is an example of the Sample Support file format:

FamilyID SampleID PaternalID MaternalID Gender Phenotype Ethnicity
GAMTDT0420 WGA_T123428 TDT204237 TDT204236 1 2 Jola
GAMTDT0420 TDT204236 0 0 2 1 Jola
GAMTDT0420 TDT204237 0 0 1 1 Jola
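The layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the release: it assumes each file starts with the header line shown in the example, and the function names `read_samples` and `children` are hypothetical.

```python
import csv
import io
from collections import defaultdict

def read_samples(handle):
    """Group the rows of a Sample Support file by Family ID."""
    families = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumed: first line names the columns
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        families[record["FamilyID"]].append(record)
    return families

def children(family):
    """Members with at least one recorded parent ('0' marks a missing parental ID)."""
    return [m for m in family
            if m["PaternalID"] != "0" or m["MaternalID"] != "0"]

# The three example rows from this page, joined with tabs.
example = (
    "FamilyID\tSampleID\tPaternalID\tMaternalID\tGender\tPhenotype\tEthnicity\n"
    "GAMTDT0420\tWGA_T123428\tTDT204237\tTDT204236\t1\t2\tJola\n"
    "GAMTDT0420\tTDT204236\t0\t0\t2\t1\tJola\n"
    "GAMTDT0420\tTDT204237\t0\t0\t1\t1\tJola\n"
)
families = read_samples(io.StringIO(example))
kids = children(families["GAMTDT0420"])
```

Because the Malawi files have no Ethnicity column, building each record from the header line rather than from fixed positions keeps the same reader usable for all three countries.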

Sample overlap with Gambian Case-Control on Affymetrix 500k platform

(GM_CC_affy500k_Trios_Illumina650Y_data_release_overlap.txt)

This file provides details on samples that are common between the Gambian Case-Control experiment (EGAS00000000026) run on the Affymetrix 500k platform and the Gambian Trios experiment (EGAS00000000087) run on the Illumina 650Y platform.

The samples pertinent to both platforms are malaria cases only.
The file details samples present in both data releases that passed all relevant QC stages.

The file identifies the trio of relevance and then gives the sample ID (or IDs) relevant to the genotyping platform.

There are 279 cases common to both experiment datasets:

  • 276 individuals are represented by a single sample on both platforms
  • 3 individuals have singletons on Affymetrix and duplicates on Illumina

The file is tab-delimited and formatted as shown below:

TRIO_ID GM_CC_Affy500K GM_CC_Affy500K GM_TRIOS_Illumina650Y
GAMTDT0002 WTCCC131484 WTCCC131484
GAMTDT0004 WTCCC131481 WTCCC131481
GAMTDT0006 WTCCC131244 WTCCC131244
GAMTDT0149 WTCCC130093 WTCCC130141 WTCCC130093
GAMTDT0160 WTCCC130573 WTCCC130621 WTCCC130621

Sample QC

This data release contains trios where all three members have passed quality control.

We excluded samples which fell into at least one of the following categories:

  1. Missing genotypes at > 5% of autosomal SNPs
  2. Relatedness defined as ‘samples with 85% – 98% identity-by-state (IBS)’. We excluded all samples in each collection of related individuals except the one with the highest call rate
  3. Duplicates defined as ‘samples with > 98% IBS’. We excluded all samples from each duplicate collection except the one with the highest call rate
  4. Contaminated samples defined as ‘excess levels of heterozygosity’
  5. Mis-specified trios defined as ‘Mendelian errors at greater than 2.5% of SNPs passing QC’
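The IBS thresholds in categories 2 and 3 can be expressed as a small classifier. This is an illustrative sketch only: the function names and the representation of IBS as a fraction between 0 and 1 are assumptions, and it is not the pipeline that was actually run.

```python
def ibs_category(ibs):
    """Classify a pair of samples by identity-by-state, given as a fraction."""
    if ibs > 0.98:
        return "duplicate"   # category 3: > 98% IBS
    if ibs >= 0.85:
        return "related"     # category 2: 85% to 98% IBS
    return "unrelated"

def sample_to_keep(call_rates):
    """From a collection of related or duplicate samples, keep the highest call rate."""
    return max(call_rates, key=call_rates.get)

# Hypothetical sample IDs and call rates, for illustration only.
kept = sample_to_keep({"S1": 0.970, "S2": 0.995})
```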

SNP QC

We excluded SNPs that fell into at least one of the following categories:

  1. Missing genotypes at more than 5% of samples
  2. Departure from HWE (calculated in parents only) at a p value of less than 10⁻⁷
  3. Minor allele frequency of less than 1% in the parents
  4. Greater than 10% Mendelian errors
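For users who want to reapply these four criteria to the unfiltered genotype files, the thresholds can be written as a single predicate. The sketch below assumes the per-SNP summary statistics have already been computed elsewhere; the `SnpStats` class and its field names are hypothetical, and treating the Mendelian error rate as a fraction of trios is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    missing_rate: float       # fraction of samples with a missing genotype
    hwe_p_parents: float      # HWE p value computed in parents only
    maf_parents: float        # minor allele frequency in parents
    mendel_error_rate: float  # assumed: fraction of trios with a Mendelian error

def passes_snp_qc(s):
    """True if the SNP falls into none of the four exclusion categories."""
    return (
        s.missing_rate <= 0.05
        and s.hwe_p_parents >= 1e-7
        and s.maf_parents >= 0.01
        and s.mendel_error_rate <= 0.10
    )

# Made-up statistics for two SNPs.
good = passes_snp_qc(SnpStats(0.01, 0.5, 0.20, 0.0))
bad = passes_snp_qc(SnpStats(0.01, 1e-9, 0.20, 0.0))  # fails the HWE criterion
```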

Also:

  • After sample QC, the genotypes of a trio at a SNP with a Mendelian error were set to missing in all three members of that trio.
  • ~250 SNPs failing visual clusterplot inspection have also been excluded.
  • As the Malawi genotypes were noticeably noisier than the other two cohorts we excluded SNPs with an Illuminus perturbation score of less than 0.95 from the Malawi dataset (~47K SNPs).
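The first point above, setting a trio's genotypes to missing at a SNP with a Mendelian error, can be sketched as follows for autosomal SNPs. This is an illustration under stated assumptions, not the code used for the release: genotypes are two-letter strings such as `TC`, `NN` is a hypothetical missing code, and the function names are invented.

```python
from itertools import product

MISSING = "NN"  # assumed missing-genotype code for this sketch

def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele of each parent."""
    if MISSING in (father, mother, child):
        return True  # cannot be assessed, so not counted as an error here
    possible = {"".join(sorted(pair)) for pair in product(father, mother)}
    return "".join(sorted(child)) in possible

def mask_mendelian_error(father, mother, child):
    """Return the trio unchanged, or all-missing if it shows a Mendelian error."""
    if is_mendelian_consistent(father, mother, child):
        return father, mother, child
    return MISSING, MISSING, MISSING

consistent = mask_mendelian_error("AC", "CC", "CA")
masked = mask_mendelian_error("AA", "AA", "AC")
```

Sorting the alleles before comparison makes the check independent of allele order, so `CA` and `AC` are treated as the same genotype. X-chromosome SNPs in male children would need separate handling.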

Description of Data

(MalariaGEN_TriosReleaseDocumentation_2011.pdf)

This file is a pdf version of this web page, although without the supplementary information. Details of supplementary fields can be found in the Supplementary_data directory.

Intensities (Normalised signals)

(650Y_Intensities_fwd_[country]_chr[number].txt)

Appropriately normalized signal data were generated from the Illumina intensity (“IDAT”) files via BeadStudio, and these were used as input to the ILLUMINUS genotype calling program. The format of the signal data is tab-delimited plain text; one line per SNP, consisting of ID, coordinate (NCBI Build 36), alleles and one pair of intensities per sample for each of the two alleles. All genotypes have also been configured to the ‘+’ strand of the SNP.

The following is an example of the signal file format.

RS Coord Allele1 Allele2 ID-XXX1_A ID-XXX1_B ID-XXX2_A ID-XXX2_B
rs5994034 15274090 T C 0.0056 0.420 0.023 0.343
rs2027653 15298335 T C 0.083 0.180 0.090 0.149
rs9604967 15492342 T C 0.091 0.770 0.051 0.508

Please note that these files may contain very long lines and are not intended to be human-readable.
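Since the lines are long, it is usually easier to process these files one SNP at a time. The sketch below splits one line into per-sample intensity pairs. It assumes the columns after the fourth come in consecutive allele A, allele B pairs per sample and that the sample name is the column name without its final `_A` or `_B` suffix; the function name is hypothetical.

```python
def parse_intensity_line(header_fields, line):
    """Split one SNP line into (rs id, coordinate, alleles, {sample: (A, B)})."""
    fields = line.rstrip("\n").split("\t")
    rs_id, coord, allele1, allele2 = fields[:4]
    values = [float(v) for v in fields[4:]]
    sample_columns = header_fields[4:]
    if len(values) % 2 != 0 or len(values) != len(sample_columns):
        raise ValueError("expected one pair of intensities per sample")
    intensities = {}
    for i in range(0, len(values), 2):
        # Name the sample after its first (allele A) column, minus the suffix.
        sample = sample_columns[i].rsplit("_", 1)[0]
        intensities[sample] = (values[i], values[i + 1])
    return rs_id, int(coord), (allele1, allele2), intensities

header = "RS\tCoord\tAllele1\tAllele2\tID-XXX1_A\tID-XXX1_B\tID-XXX2_A\tID-XXX2_B".split("\t")
row = "rs5994034\t15274090\tT\tC\t0.0056\t0.420\t0.023\t0.343"
rs_id, coord, alleles, intensities = parse_intensity_line(header, row)
```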

Genotypes

(650Y_genotypes_fwd_[country]_chr[number].txt)

Because the Illumina 650Y SNP chip can yield several GBs per cohort, the genotype data have been partitioned by chromosome. Each file is presented in tab-delimited format and contains one genotype per line. The score is the posterior probability of the genotype called using the Illuminus program. Regardless of how the SNPs are organized, all assays are sorted according to sample so that the file can be readily separated into sample blocks. It should also be noted that all genotypes have been configured to the ‘+’ strand of the SNP. The following is an example of the genotype data format:

SNP SAMPLE GENOTYPE SCORE
rs1020382 ID-XXXXXXX TC 1
rs12459906 ID-XXXXXXX TC 0.9999
rs12151104 ID-XXXXXXX AA 0.9983

Data are provided for all SNPs allowing the user to set their own QC metrics.
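As one example of a user-defined QC metric, calls can be filtered on the Illuminus posterior probability while streaming through a file. This is a minimal sketch: it assumes the header line shown in the example is present in each file, and the default threshold of 0.95 is purely illustrative, not a recommendation from this release.

```python
import csv
import io

def confident_calls(handle, min_score=0.95):
    """Yield (SNP, sample, genotype) for calls with SCORE >= min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if float(row["SCORE"]) >= min_score:
            yield row["SNP"], row["SAMPLE"], row["GENOTYPE"]

# The three example rows from this page, joined with tabs.
example = (
    "SNP\tSAMPLE\tGENOTYPE\tSCORE\n"
    "rs1020382\tID-XXXXXXX\tTC\t1\n"
    "rs12459906\tID-XXXXXXX\tTC\t0.9999\n"
    "rs12151104\tID-XXXXXXX\tAA\t0.9983\n"
)
calls = list(confident_calls(io.StringIO(example), min_score=0.999))
```

Using a generator keeps memory use flat, which matters for files of this size.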

Plink_files (Post-SNP-QC Genotype Data in Plink Format)

(650Y_fwd_[country]_chr[number].ped and 650Y_fwd_[country]_chr[number].map)

Genotypes in PLINK .map and .ped file formats. File formats are described at: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.

Genotypes are reported with respect to the + strand; missing genotypes are set to ‘N’. SNP positions are listed in NCBI Build 36 coordinates.

Only SNPs passing quality control (described above) are included in these files.
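The .ped and .map pair can be loaded directly into PLINK, but for quick inspection a line of each can also be parsed by hand. The sketch below relies on the standard PLINK text layout (four columns per .map line; six leading columns followed by two alleles per SNP in each .ped line). The function names and the example lines are made up for illustration.

```python
def read_map(lines):
    """Return SNP IDs in file order from .map lines (chromosome, id, cM, bp)."""
    return [line.split()[1] for line in lines if line.strip()]

def read_ped_line(line, snp_ids, missing="N"):
    """Return (family id, sample id, {snp: genotype or None}) for one .ped line."""
    fields = line.split()
    family_id, sample_id = fields[0], fields[1]
    alleles = fields[6:]  # skip the six leading pedigree columns
    if len(alleles) != 2 * len(snp_ids):
        raise ValueError("allele count does not match the .map file")
    genotypes = {}
    for i, snp in enumerate(snp_ids):
        a1, a2 = alleles[2 * i], alleles[2 * i + 1]
        genotypes[snp] = None if missing in (a1, a2) else a1 + a2
    return family_id, sample_id, genotypes

# Illustrative lines only; not taken from the released files.
map_lines = ["22 rs5994034 0 15274090", "22 rs2027653 0 15298335"]
ped_line = "GAMTDT0420 TDT204236 0 0 2 1 T C N N"
snp_ids = read_map(map_lines)
family_id, sample_id, genotypes = read_ped_line(ped_line, snp_ids)
```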

Phased (Haplotype Files)

[country]_fwd_sample_release_ids.txt
chr[number]_[country]_fwd_b36_legend.txt
chr[number]_[country]_fwd_phased

Haplotypes are presented in the same format as the HapMap Phase II haplotypes (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2006-07_phaseII/00README.txt). Please note that the designation of allele 0 and allele 1 does not always conform to the HapMap files. Phasing was done using the post SNP-QC data with the trio option of PHASE.

There is one sample file per country and two files per chromosome, per country cohort.

The chr[number]_[country]_fwd_b36_legend.txt files contain a legend giving the rs ID, the base pair position (NCBI Build 36), the allele coded 0 and the allele coded 1 for each of the segregating SNPs, e.g.

RS position 0 1
rs16981694 15647732 A C
rs9606468 15648282 C T

The [country]_fwd_sample_release_ids.txt file contains the ordered list of individuals that correspond to the chr[number]_[country]_fwd_phased files.

  • In the chr[number]_[country]_fwd_phased files, one haplotype is listed per line.
  • Two haplotypes are listed for each sample, with the transmitted haplotype listed first.
  • To maintain consistency in the X chromosome files each male has a placeholder second haplotype represented by a row of dashes.
  • The haplotypes of children are not included as they can be inferred from their parents’ haplotypes.

Parental haplotypes are listed consecutively, for example:

Trio 1 parent 1 transmitted haplotype
Trio 1 parent 1 untransmitted haplotype
Trio 1 parent 2 transmitted haplotype
Trio 1 parent 2 untransmitted haplotype
Trio 2 parent 1 transmitted haplotype
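The three file types can be combined to turn the 0/1 codes back into alleles for each parent. The following is a rough sketch under several assumptions that should be checked against the files themselves: haplotype lines hold space-separated 0/1 codes, the sample file lists one ID per line, and the X-chromosome placeholder consists only of dashes. The function name and the sample IDs `P1` and `P2` are hypothetical.

```python
def decode_haplotypes(sample_ids, legend, phased_lines):
    """Map each sample to (transmitted, untransmitted) allele strings.

    legend is a list of (rs id, position, allele coded 0, allele coded 1).
    A placeholder line made only of dashes is returned as None.
    """
    if len(phased_lines) != 2 * len(sample_ids):
        raise ValueError("expected two haplotype lines per sample")

    def decode(line):
        compact = line.replace(" ", "").strip()
        if compact and set(compact) == {"-"}:
            return None  # male X-chromosome placeholder
        codes = line.split()
        if len(codes) != len(legend):
            raise ValueError("haplotype length does not match the legend")
        return "".join(legend[i][2 + int(c)] for i, c in enumerate(codes))

    return {
        sample: (decode(phased_lines[2 * i]), decode(phased_lines[2 * i + 1]))
        for i, sample in enumerate(sample_ids)
    }

# Legend rows from the example on this page; haplotype lines are made up.
legend = [("rs16981694", 15647732, "A", "C"), ("rs9606468", 15648282, "C", "T")]
phased = ["0 1", "1 1", "0 0", "- -"]
haplotypes = decode_haplotypes(["P1", "P2"], legend, phased)
```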

Supplementary_data

From March 2013, we have added some supplementary data for use with this dataset. A ReadMe file accompanies the supplementary data that describes their contents and how they relate to the main data.

Data sets

Gambia Trios

EGA Study ID: EGAS00000000087

EGA Data Set ID: EGAD00000000019

Method: Illumina 650Y array

658 Gambian trios (1,984 individuals)

  • 650 unique trios (3 individuals/family)
  • 6 quads (4 individuals/family)
    • GAMTDT0142
    • GAMTDT0154
    • GAMTDT0212
    • GAMTDT0492
    • GAMTDT0496
    • GAMTDT0698
  • 2 x ½ siblings (5 individuals/family)
    • GAMTDT0169
    • GAMTDT0274
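As a quick consistency check, the family breakdown above reproduces the headline counts for the Gambian cohort:

```python
# (number of families, individuals per family), taken from the list above
family_sizes = {
    "trios": (650, 3),
    "quads": (6, 4),
    "half-sibling families": (2, 5),
}
n_families = sum(count for count, _ in family_sizes.values())
n_individuals = sum(count * size for count, size in family_sizes.values())
print(n_families, n_individuals)  # 658 1984
```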

Ghana Trios

EGA Study ID: EGAS00000000088

EGA Data Set ID: EGAD00000000020

Method: Illumina 650Y array

608 Ghanaian trios (1,824 individuals)

Malawi Trios

EGA Study ID: Not yet available (see release note below)

EGA Data Set ID: Not yet available

Method: Illumina 650Y array

122 Malawian trios (366 individuals)

Release notes

Malawi trios
10 Oct 2015

Please note that this dataset has been prepared for release by MalariaGEN and will be released as soon as the relevant ethics committee confirms the range of acceptable research uses.