Gambia, Ghana, and Malawi Trios

This release contains SNP genotype data from mother-father-child trios genotyped on the Illumina 650Y array.

These data include parents and a single offspring (so-called trios) from three partner studies within Consortial Project 1.
All children have been diagnosed with malaria in a hospital.
This data release contains a total of 4,174 samples that have passed quality control.

These data have been deposited in the European Genome-phenome Archive (EGA) under EGA Study IDs: EGAS00000000087 (Gambia) and EGAS00000000088 (Ghana).

Data files

Annotations

Sample Support Files
Sample overlap with Gambian Case-Control on Affymetrix 500k platform
Sample QC
SNP QC
Description of data
Intensities (Normalised Signals)
Genotypes
Plink_files
Phased
Supplementary_data

Sample Support Files

(650Y_samples_[country]Trios.txt)

Each tab-delimited Samples File lists the following information for each sample:

Family ID
Sample ID
Paternal ID (Missing parental IDs are set to ‘0’)
Maternal ID (Missing parental IDs are set to ‘0’)
Gender (1 = Male, 2 = Female)
Phenotype (2 = Affected, 1 = Unaffected)
Ethnicity*

*In the Ghanaian and Gambian Samples Files a seventh column lists self-reported ethnicity. Ethnicity is only reported for groups with a substantial number of samples in the data; this includes four groups in Gambia (Mandinka, Wolof, Fula, Jola) and two groups in Ghana (Akan and Northerners). All other ethnic groups and samples with unknown/missing ethnicity have been set to ‘Other’. A child’s ethnicity is assigned according to the mother’s ethnicity. No ethnicity is released for Malawi as the samples do not exhibit population substructure.

The following is an example of the Sample Support file format:

FamilyID	SampleID	PaternalID	MaternalID	Gender	Phenotype	Ethnicity
GAMTDT0420	WGA_T123428	TDT204237	TDT204236	1	2	Jola
GAMTDT0420	TDT204236	0	0	2	1	Jola
GAMTDT0420	TDT204237	0	0	1	1	Jola
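
As a minimal sketch of how a Samples File might be read, the following Python groups rows by family and picks out the children (rows with both parental IDs set). It assumes the file starts with the header row shown in the example above; the function names are illustrative and not part of the release.

```python
import csv
import io
from collections import defaultdict


def read_sample_support(handle):
    """Group the rows of a Samples File by family ID.

    Assumes a tab-delimited file with the header row shown in the example.
    """
    families = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        families[row["FamilyID"]].append(row)
    return dict(families)


def children(members):
    """Return the family members whose parental IDs are both set (not '0')."""
    return [m for m in members
            if m["PaternalID"] != "0" and m["MaternalID"] != "0"]


# The example rows from this page, used in place of a real file.
example = (
    "FamilyID\tSampleID\tPaternalID\tMaternalID\tGender\tPhenotype\tEthnicity\n"
    "GAMTDT0420\tWGA_T123428\tTDT204237\tTDT204236\t1\t2\tJola\n"
    "GAMTDT0420\tTDT204236\t0\t0\t2\t1\tJola\n"
    "GAMTDT0420\tTDT204237\t0\t0\t1\t1\tJola\n"
)
families = read_sample_support(io.StringIO(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper. Because rows are keyed by column name, the Malawi files (which have no Ethnicity column) should be handled the same way.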

Sample overlap with Gambian Case-Control on Affymetrix 500k platform

(GM_CC_affy500k_Trios_Illumina650Y_data_release_overlap.txt)

This file provides details on samples that are common between the Gambian Case-Control experiment (EGAS00000000026) run on the Affymetrix 500k platform and the Gambian Trios experiment (EGAS00000000087) run on the Illumina 650Y platform.

The samples pertinent to both platforms are malaria cases only.
The file details samples present in both data releases that passed all relevant QC stages.

The file identifies the trio of relevance and then gives the sample ID (or IDs) relevant to the genotyping platform.

There are 279 cases common to both experiment datasets:

276 individuals are represented by a single sample on both platforms
3 individuals have singletons on Affymetrix and duplicates on Illumina

The file is tab-delimited and formatted as shown below:

TRIO_ID	GM_CC_Affy500K	GM_CC_Affy500K	GM_TRIOS_Illumina650Y
GAMTDT0002	WTCCC131484		WTCCC131484
GAMTDT0004	WTCCC131481		WTCCC131481
GAMTDT0006	WTCCC131244		WTCCC131244
GAMTDT0149	WTCCC130093	WTCCC130141	WTCCC130093
GAMTDT0160	WTCCC130573	WTCCC130621	WTCCC130621
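
The overlap file repeats a column name in its header, so reading it by position rather than by name is safer. The sketch below does that and flags trios that list more than two sample IDs. The function names are illustrative, and the code deliberately makes no assumption about which platform a duplicate belongs to.

```python
import csv
import io


def read_overlap(handle):
    """Return (trio_id, [sample ID columns]) for each row of the overlap file.

    Columns are kept in file order; empty cells are kept as empty strings.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    records = []
    for fields in reader:
        # Pad rows that are shorter than the header.
        fields = fields + [""] * (len(header) - len(fields))
        records.append((fields[0], fields[1:]))
    return records


def has_duplicate(sample_ids):
    """True if a trio lists more than two non-empty sample IDs."""
    return sum(1 for s in sample_ids if s) > 2


# Two of the example rows from this page.
example_overlap = (
    "TRIO_ID\tGM_CC_Affy500K\tGM_CC_Affy500K\tGM_TRIOS_Illumina650Y\n"
    "GAMTDT0002\tWTCCC131484\t\tWTCCC131484\n"
    "GAMTDT0149\tWTCCC130093\tWTCCC130141\tWTCCC130093\n"
)
records = read_overlap(io.StringIO(example_overlap))
with_duplicates = [trio for trio, ids in records if has_duplicate(ids)]
```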

Sample QC

This data release contains trios where all three members have passed quality control.

We excluded samples which fell into at least one of the following categories:

Missing genotypes at > 5% of autosomal SNPs
Relatedness, defined as ‘samples with 85% – 98% identity-by-state (IBS)’. We excluded all samples in each collection of related individuals except the one with the highest call rate
Duplicates, defined as ‘samples with > 98% IBS’. We excluded all samples from each duplicate collection except the one with the highest call rate
Contaminated samples, defined as ‘excess levels of heterozygosity’
Mis-specified trios defined as ‘Mendelian errors at greater than 2.5% of SNPs passing QC’
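
The IBS thresholds above can be sketched as a simple classifier, together with the "keep the highest call rate" rule. This is only an illustration: the handling of values exactly at 85% and 98% is an assumption, and the pairwise IBS values and call rates are expected to have been computed elsewhere.

```python
def classify_pair(ibs):
    """Classify a pair of samples from their identity-by-state proportion (0 to 1).

    Boundary handling (exactly 0.85 or 0.98) is an assumption of this sketch.
    """
    if ibs > 0.98:
        return "duplicate"
    if ibs >= 0.85:
        return "related"
    return "unrelated"


def sample_to_keep(call_rates):
    """From a group of related or duplicate samples, keep the highest call rate.

    call_rates maps sample ID to call rate.
    """
    return max(call_rates, key=call_rates.get)


# Hypothetical sample IDs and call rates, for illustration only.
kept = sample_to_keep({"sampleA": 0.991, "sampleB": 0.997})
```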

SNP QC

We excluded SNPs that fell into at least one of the following categories:

Missing genotypes at more than 5% of samples
Departure from Hardy-Weinberg equilibrium (HWE, calculated in parents only) at a p value of less than 10^-7
Minor allele frequency of less than 1% in the parents
Greater than 10% Mendelian errors
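
Expressed as code, the four exclusion criteria above amount to a single predicate over per-SNP summary statistics. The sketch below assumes those statistics (missingness, HWE p value, minor allele frequency, and Mendelian error rate) have already been computed, with rates given as proportions; the function and parameter names are illustrative.

```python
def snp_passes_qc(missing_rate, hwe_p, maf, mendel_error_rate):
    """Return True if a SNP survives all four exclusion criteria.

    missing_rate, maf and mendel_error_rate are proportions between 0 and 1;
    hwe_p and maf are assumed to have been calculated in parents only.
    """
    return (
        missing_rate <= 0.05          # not missing in more than 5% of samples
        and hwe_p >= 1e-7             # no strong departure from HWE
        and maf >= 0.01               # minor allele frequency of at least 1%
        and mendel_error_rate <= 0.10 # no more than 10% Mendelian errors
    )
```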

Also:

After sample QC, the genotypes of a trio at a SNP with a Mendelian error were set to missing in all three members of that trio.
~250 SNPs failing visual cluster plot inspection have also been excluded.
As the Malawi genotypes were noticeably noisier than those of the other two cohorts, we excluded SNPs with an Illuminus perturbation score of less than 0.95 from the Malawi dataset (~47K SNPs).
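
To illustrate the Mendelian-error masking described above, the following sketch checks one autosomal SNP in one trio and sets all three genotypes to missing if the child's genotype cannot be explained by one allele from each parent. The two-letter genotype strings and the `NN` missing code are assumptions of this example, and it does not cover the X chromosome.

```python
MISSING = "NN"  # assumed missing-genotype code for this sketch


def is_mendelian_consistent(father, mother, child):
    """Check that the child carries one allele from each parent (autosomes only).

    Genotypes are two-letter strings such as 'TC'. If any genotype is
    missing, the trio cannot be tested and is treated as consistent.
    """
    if MISSING in (father, mother, child):
        return True
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mask_mendelian_error(father, mother, child):
    """Return the trio's genotypes, all set to missing on a Mendelian error."""
    if is_mendelian_consistent(father, mother, child):
        return father, mother, child
    return MISSING, MISSING, MISSING
```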

Description of Data

(MalariaGEN_TriosReleaseDocumentation_2011.pdf)

This file is a PDF version of this web page, without the supplementary information. Details of supplementary fields can be found in the Supplementary_data directory.

Intensities (Normalised signals)

(650Y_Intensities_fwd_[country]_chr[number].txt)

Appropriately normalized signal data were generated from the Illumina intensity (“IDAT”) files via BeadStudio, and these were used as input to the ILLUMINUS genotype calling program. The signal data are tab-delimited plain text with one line per SNP, consisting of the SNP ID, coordinate (NCBI Build 36), the two alleles, and one pair of intensities per sample (one intensity for each allele). Alleles are reported on the ‘+’ strand of the SNP.

The following is an example of the signal file format.

RS	Coord	Allele1	Allele2	ID-XXX1_A	ID-XXX1_B	ID-XXX2_A	ID-XXX2_B	…
rs5994034	15274090	T	C	0.0056	0.420	0.023	0.343	…
rs2027653	15298335	T	C	0.083	0.180	0.090	0.149	…
rs9604967	15492342	T	C	0.091	0.770	0.051	0.508	…

Please note that these files may contain very long lines and are not intended to be human-readable.
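
Since the lines are long, it is usually easier to process the intensity files one line at a time. The sketch below splits a single line into SNP details and a per-sample pair of intensities. It assumes that the per-sample columns come in consecutive A/B pairs and that the sample ID is the column name without its `_A` suffix; the sample IDs in the example are hypothetical.

```python
def parse_intensity_line(header, line):
    """Split one intensity line into SNP details and per-sample (A, B) pairs.

    Assumes four leading columns (ID, coordinate, two alleles) followed by
    consecutive A/B intensity columns for each sample.
    """
    columns = header.rstrip("\n").split("\t")
    fields = line.rstrip("\n").split("\t")
    snp = {
        "id": fields[0],
        "coord": int(fields[1]),
        "alleles": (fields[2], fields[3]),
    }
    intensities = {}
    for i in range(4, len(fields) - 1, 2):
        sample_id = columns[i].rsplit("_", 1)[0]  # strip the trailing _A
        intensities[sample_id] = (float(fields[i]), float(fields[i + 1]))
    return snp, intensities


# First data row from the example above, with hypothetical sample IDs.
header = "RS\tCoord\tAllele1\tAllele2\tS1_A\tS1_B\tS2_A\tS2_B"
line = "rs5994034\t15274090\tT\tC\t0.0056\t0.420\t0.023\t0.343"
snp, intensities = parse_intensity_line(header, line)
```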

Genotypes

(650Y_genotypes_fwd_[country]_chr[number].txt)

Because the Illumina 650Y SNP chip can yield several GB of data per cohort, the genotype data have been partitioned by chromosome. Each file is tab-delimited and contains one genotype per line. The score is the posterior probability of the called genotype, as calculated by the Illuminus program. Regardless of how the SNPs are ordered, all assays are sorted by sample so that the file can be readily separated into sample blocks. All genotypes are reported on the ‘+’ strand of the SNP. The following is an example of the genotype data format:

SNP	SAMPLE	GENOTYPE	SCORE
rs1020382	ID-XXXXXXX	TC	1
rs12459906	ID-XXXXXXX	TC	0.9999
rs12151104	ID-XXXXXXX	AA	0.9983

Data are provided for all SNPs, allowing users to set their own QC metrics.
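
As one example of applying a user-defined QC metric, the sketch below streams a genotype file and keeps only calls whose posterior probability meets a threshold. The threshold of 0.95 is a hypothetical choice for illustration, not a recommendation from this release, and the low-scoring final row in the example is made up.

```python
import csv
import io


def iter_confident_calls(handle, min_score=0.95):
    """Yield (snp, sample, genotype) for calls with SCORE >= min_score.

    Assumes the tab-delimited header SNP, SAMPLE, GENOTYPE, SCORE.
    Rows are streamed, so the whole file is never held in memory.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["SCORE"]) >= min_score:
            yield row["SNP"], row["SAMPLE"], row["GENOTYPE"]


example_genotypes = (
    "SNP\tSAMPLE\tGENOTYPE\tSCORE\n"
    "rs1020382\tID-XXXXXXX\tTC\t1\n"
    "rs12459906\tID-XXXXXXX\tTC\t0.9999\n"
    "rs12151104\tID-XXXXXXX\tAA\t0.9983\n"
    "rs0000000\tID-XXXXXXX\tAA\t0.6100\n"  # made-up low-confidence call
)
calls = list(iter_confident_calls(io.StringIO(example_genotypes)))
```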

Plink_files (Post-SNP-QC Genotype Data in Plink Format)

(650Y_fwd_[country]_chr[number].ped and 650Y_fwd_[country]_chr[number].map)

Genotypes in PLINK .map and .ped file formats. File formats are described at: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.

Genotypes are reported with respect to the + strand; missing genotypes are set to ‘N’. SNP positions are listed in NCBI Build 36 coordinates.

Only SNPs passing quality control (described in the SNP QC section) are included in these files.
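
For readers who want to inspect the files without PLINK itself, the following is a minimal sketch of parsing one line of each file. It assumes the standard layouts: four whitespace-delimited columns in the .map file (chromosome, SNP ID, genetic distance, base-pair position), and six leading columns in the .ped file followed by two allele tokens per SNP. The example lines are illustrative and not taken from the release.

```python
def parse_map_line(line):
    """Parse one .map line: chromosome, SNP ID, genetic distance, position."""
    chrom, snp_id, _genetic_distance, position = line.split()
    return chrom, snp_id, int(position)


def parse_ped_line(line, missing="N"):
    """Parse one .ped line into sample details and a list of genotypes.

    A genotype with either allele equal to the missing code becomes None.
    """
    fields = line.split()
    family_id, sample_id, paternal_id, maternal_id, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = [
        None if missing in (a, b) else a + b
        for a, b in zip(alleles[0::2], alleles[1::2])
    ]
    sample = {
        "family_id": family_id,
        "sample_id": sample_id,
        "paternal_id": paternal_id,
        "maternal_id": maternal_id,
        "sex": sex,
        "phenotype": phenotype,
    }
    return sample, genotypes


# Illustrative lines only (hypothetical SNP ID and genotypes).
marker = parse_map_line("22 rs0000001 0 15647732")
sample, genotypes = parse_ped_line("GAMTDT0420 TDT204236 0 0 2 1 A C N N T T")
```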

Phased (Haplotype Files)

[country]_fwd_sample_release_ids.txt
chr[number]_[country]_fwd_b36_legend.txt
chr[number]_[country]_fwd_phased

Haplotypes are presented in the same format as the HapMap Phase II haplotypes (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2006-07_phaseII/00README.txt). Please note that the designation of allele 0 and allele 1 does not always conform to the HapMap files. Phasing was done on the post-SNP-QC data using the trio option of PHASE.

There is one sample file per country and two files per chromosome, per country cohort.

The chr[number]_[country]_fwd_b36_legend.txt files contain a legend giving the rs ID, the base pair position (NCBI Build 36), the allele coded 0, and the allele coded 1 for each of the segregating SNPs, e.g.

RS	position	0	1
rs16981694	15647732	A	C
rs9606468	15648282	C	T

The [country]_fwd_sample_release_ids.txt file contains the ordered list of individuals that correspond to the chr[number]_[country]_fwd_phased files.

In the chr[number]_[country]_fwd_phased files, one haplotype is listed per line.
Two haplotypes are listed for each sample, with the transmitted haplotype listed first.
To maintain consistency in the X chromosome files, each male has a placeholder second haplotype represented by a row of dashes.
The haplotypes of children are not included, as they can be inferred from their parents’ haplotypes.

Parental haplotypes are listed consecutively, for example:

Trio 1 parent 1 transmitted haplotype
Trio 1 parent 1 untransmitted haplotype
Trio 1 parent 2 transmitted haplotype
Trio 1 parent 2 untransmitted haplotype
Trio 2 parent 1 transmitted haplotype
Supplementary_data

From March 2013, we have added some supplementary data for use with this dataset. A ReadMe file accompanies the supplementary data that describes their contents and how they relate to the main data.

Data sets

Gambia Trios

EGA Study ID: EGAS00000000087

EGA Data Set ID: EGAD00000000019

Method: Illumina 650Y array

658 Gambian trios (1,984 individuals)

650 unique trios (3 individuals/family)

6 quads (4 individuals/family)

GAMTDT0142

GAMTDT0154

GAMTDT0212

GAMTDT0492

GAMTDT0496

GAMTDT0698

2 x ½ siblings (5 individuals/family)

GAMTDT0169

GAMTDT0274

Ghana Trios

EGA Study ID: EGAS00000000088

EGA Data Set ID: EGAD00000000020

Method: Illumina 650Y array

608 Ghanaian trios (1,824 individuals)

Malawi Trios

EGA Study ID: Not yet available (see release note below)

EGA Data Set ID: Not yet available

Method: Illumina 650Y array

122 Malawian trios (366 individuals)
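
As a quick consistency check, the family and individual counts listed above can be added up as follows. The arithmetic uses only the figures on this page and assumes every Ghanaian and Malawian family is a three-person trio.

```python
# Gambia: 650 trios of 3, 6 quads of 4, and 2 half-sibling families of 5.
gambia_families = 650 + 6 + 2
gambia_individuals = 650 * 3 + 6 * 4 + 2 * 5

# Ghana and Malawi: three individuals per trio (assumed).
ghana_individuals = 608 * 3
malawi_individuals = 122 * 3

total_samples = gambia_individuals + ghana_individuals + malawi_individuals
print(gambia_families, gambia_individuals, total_samples)
```

The results agree with the 658 Gambian families, 1,984 Gambian individuals, and 4,174 samples in total stated on this page.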

Release notes

Malawi trios
10 Oct 2015
Please note that this dataset has been prepared for release by MalariaGEN and will be released as soon as the relevant ethics committee confirms the range of acceptable research uses.

Apply for access

Our approach to sharing data

Data package contact

Dr Victoria Simpson