MalariaGEN Gambia, Ghana and Malawi Trios Data Release Documentation
Date of Release and Datasets
|20th January 2011||Gambia Trios|
|20th January 2011||Ghana Trios|
Please note that this dataset has been prepared for release by MalariaGEN. The data will be released as soon as the relevant ethics committee confirms the range of acceptable research purposes for which it may be used.
- This release contains SNP genotype data from mother-father-child trios genotyped on the Illumina 650Y array.
- All children have been diagnosed with malaria in a hospital.
- This data release contains a total of 4,174 samples that have passed quality control.
|EGAS00000000087:||658 Gambian families [1984 individuals]|
|EGAS00000000088:||608 Ghanaian trios [1824 individuals]|
|TBA||122 Malawian trios [366 individuals]|
This release contains the following data files:
- Sample Support Files
- Sample overlap with Gambian Case-Control on Affymetrix 500k platform
- Normalized Signals (Intensity Files)
- Genotype Files
- Sample QC
- SNP QC
- Post SNP-QC Genotype in Plink Format
Sample Support Files:
The tab-delimited Samples Files list the following information for each sample:
- Family ID
- Sample ID
- Paternal ID (Missing parental IDs are set to ‘0’)
- Maternal ID (Missing parental IDs are set to ‘0’)
- Gender (1= Male, 2 = Female)
- Phenotype (2= Affected, 1 = Unaffected)
*In the Ghanaian and Gambian Samples Files a seventh column lists self-reported ethnicity. Ethnicity is only reported for groups with a substantial number of samples in the data; this includes four groups in Gambia (Mandinka, Wolof, Fula, Jola) and two groups in Ghana (Akan and Northerners). All other ethnic groups and samples with unknown/missing ethnicity have been set to ‘Other’. A child’s ethnicity is assigned according to the Mother’s ethnicity. No ethnicity is being released for Malawi as the samples do not exhibit population substructure.
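The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming the file has no header row and that the columns appear in the order listed; the `SampleRecord` type, the `parse_samples` function and the example rows are hypothetical and are not part of the release.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class SampleRecord(NamedTuple):
    family_id: str
    sample_id: str
    paternal_id: Optional[str]  # None when the file holds '0' (missing)
    maternal_id: Optional[str]  # None when the file holds '0' (missing)
    sex: str                    # 'male', 'female' or 'unknown'
    affected: bool              # True when the phenotype code is 2
    ethnicity: Optional[str]    # seventh column; absent for Malawi


SEX_CODES = {"1": "male", "2": "female"}


def parse_samples(handle) -> List[SampleRecord]:
    """Read a tab-delimited Samples File (assumed to have no header row)."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        fam, sid, pat, mat, sex, pheno = row[:6]
        records.append(SampleRecord(
            family_id=fam,
            sample_id=sid,
            paternal_id=None if pat == "0" else pat,
            maternal_id=None if mat == "0" else mat,
            sex=SEX_CODES.get(sex, "unknown"),
            affected=(pheno == "2"),
            ethnicity=row[6] if len(row) > 6 else None,
        ))
    return records


# Made-up rows for illustration only: one affected child and his father.
example = (
    "FAM1\tC1\tF1\tM1\t1\t2\tMandinka\n"
    "FAM1\tF1\t0\t0\t1\t1\tMandinka\n"
)
records = parse_samples(io.StringIO(example))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.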
The following is an example of the Sample Support file format:
Sample overlap with Gambian Case-Control on Affymetrix 500k platform:
This file provides details on samples that are common between the Gambian Case-Control experiment (EGAS00000000026) run on the Affymetrix 500k platform and the Gambian Trios experiment (EGAS00000000087) run on the Illumina 650Y platform.
The samples pertinent to both platforms are malaria cases only.
The file details samples present in both data releases that passed all relevant QC stages.
The file identifies the trio of relevance and then gives the sample ID (or IDs) relevant to the genotyping platform.
There are 279 cases common to both experiment datasets:
- 276 individuals are represented by a single sample on both platforms
- 3 individuals have singletons on Affymetrix and duplicates on Illumina
The file is tab-delimited and formatted as shown below:
Normalized Signals (Intensity Files):
Appropriately normalized signal data were generated from the Illumina intensity (“IDAT”) files via BeadStudio, and these were used as input to the ILLUMINUS genotype calling program. The format of the signal data is tab-delimited plain text; one line per SNP, consisting of ID, coordinate (NCBI Build 36), alleles and one pair of intensities per sample for each of the two alleles. All genotypes have also been configured to the '+' strand of the SNP.
The following is an example of the signal file format.
Please note that these files may contain very long lines and are not intended to be human-readable.
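Because the lines are long, the signal files are best processed one line at a time. The sketch below assumes exactly three leading columns (ID, coordinate, alleles) followed by two intensity values per sample, in sample order; the function name `parse_signal_line`, the caller-supplied `sample_ids` list and the example line are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_signal_line(
    line: str, sample_ids: List[str]
) -> Tuple[str, int, str, Dict[str, Tuple[float, float]]]:
    """Split one SNP line into its ID, coordinate, alleles and intensities.

    Assumes three leading columns and then one (allele 1, allele 2)
    intensity pair per sample, in the same order as sample_ids.
    """
    fields = line.rstrip("\n").split("\t")
    snp_id, coordinate, alleles = fields[0], int(fields[1]), fields[2]
    values = [float(v) for v in fields[3:]]
    if len(values) != 2 * len(sample_ids):
        raise ValueError(
            f"{snp_id}: expected {2 * len(sample_ids)} intensities, "
            f"found {len(values)}"
        )
    intensities = {
        sid: (values[2 * i], values[2 * i + 1])
        for i, sid in enumerate(sample_ids)
    }
    return snp_id, coordinate, alleles, intensities


# Made-up line with two samples, for illustration only.
line = "rs0001\t12345\tAG\t0.91\t0.05\t0.48\t0.52\n"
snp_id, coordinate, alleles, intensities = parse_signal_line(line, ["S1", "S2"])
```

Iterating over an open file handle and calling the function on each line keeps only one SNP in memory at a time.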
Genotype Files:
Because the Illumina 650Y SNP chip can yield several GBs per cohort, the genotype data have been partitioned by chromosome. Each file is presented in tab-delimited format and contains one genotype per line. The score is the posterior probability of the genotype called using the Illuminus program. Regardless of how the SNPs are organized, all assays are sorted according to sample so that the file can be readily separated into sample blocks. It should also be noted that all genotypes have been configured to the '+' strand of the SNP. The following is an example of the genotype data format:
Data are provided for all SNPs allowing the user to set their own QC metrics.
Sample QC:
This data release contains only trios in which all three members have passed quality control.
We excluded samples which fell into at least one of the following categories:
- Missing genotypes at > 5% of autosomal SNPs
- Relatedness defined as ‘samples with 85% – 98% identity-by-state (IBS)’. We excluded all samples in each collection of related individuals except the one with the highest call rate
- Duplicates defined as ‘samples with > 98% IBS’. We excluded all samples from each duplicate collection except the one with the highest call rate
- Contaminated samples defined as ‘excess levels of heterozygosity’
- Mis-specified trios defined as ‘Mendelian errors at greater than 2.5% of SNPs passing QC’
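Two of the sample criteria above are simple enough to sketch in code. The example below is illustrative only and is not the pipeline used for this release: it assumes IBS is expressed as a fraction between 0 and 1, genotypes are two-character strings on autosomal biallelic SNPs with ‘N’ marking a missing allele, and the Mendelian error rate is taken over SNPs typed in all three trio members. All function names are hypothetical.

```python
from typing import Iterable, Tuple


def classify_pair_by_ibs(ibs: float) -> str:
    """Label a pair of samples using the IBS thresholds listed above."""
    if ibs > 0.98:
        return "duplicate"
    if ibs >= 0.85:
        return "related"
    return "unrelated"


def is_mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child could inherit one allele from each parent.

    Valid for autosomal SNPs only; genotypes look like 'AG'.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mendelian_error_rate(trio_genotypes: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of fully typed SNPs at which the trio shows a Mendelian error."""
    errors = 0
    typed = 0
    for father, mother, child in trio_genotypes:
        if "N" in father + mother + child:
            continue  # skip SNPs with a missing call in any member
        typed += 1
        if not is_mendelian_consistent(father, mother, child):
            errors += 1
    return errors / typed if typed else 0.0


def is_misspecified_trio(trio_genotypes, threshold: float = 0.025) -> bool:
    """Apply the 2.5% Mendelian error threshold stated above."""
    return mendelian_error_rate(trio_genotypes) > threshold


# Made-up genotypes (father, mother, child) at four SNPs.
trio = [("AA", "AA", "AG"), ("AG", "AA", "AG"), ("NN", "AA", "AA"), ("AA", "GG", "AG")]
rate = mendelian_error_rate(trio)
```

In this toy trio one of the three fully typed SNPs is inconsistent, so the trio would be flagged; real data would of course involve hundreds of thousands of SNPs.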
SNP QC:
We excluded SNPs that fell into at least one of the following categories:
- Missing genotypes at more than 5% of samples
- Departure from HWE (calculated in parents only) at a p value of less than 10^-7
- Minor allele frequency of less than 1% in the parents
- Greater than 10% Mendelian errors
- After sample QC, the genotypes of a trio at a SNP with a Mendelian error were set to missing in all three members of that trio.
- ~250 SNPs failing visual clusterplot inspection have also been excluded.
- As the Malawi genotypes were noticeably noisier than those of the other two cohorts, we excluded SNPs with an Illuminus perturbation score of less than 0.95 from the Malawi dataset (~47K SNPs).
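For users who wish to apply their own thresholds to per-SNP summary statistics, the four numeric exclusion criteria can be expressed as a small filter. This is a sketch under stated assumptions: the statistics are taken to be precomputed fractions, the HWE threshold is read as 10^-7, and the `SnpStats` class and `snp_failure_reasons` function are hypothetical names. The cluster plot inspection and the Malawi perturbation-score filter are not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpStats:
    snp_id: str
    missing_rate: float       # fraction of samples with no call
    hwe_p_parents: float      # HWE p value computed in parents only
    maf_parents: float        # minor allele frequency in parents
    mendel_error_rate: float  # fraction of trios with a Mendelian error


def snp_failure_reasons(
    s: SnpStats,
    max_missing: float = 0.05,
    min_hwe_p: float = 1e-7,
    min_maf: float = 0.01,
    max_mendel: float = 0.10,
) -> List[str]:
    """Return the criteria a SNP fails; an empty list means it passes."""
    reasons = []
    if s.missing_rate > max_missing:
        reasons.append("missingness")
    if s.hwe_p_parents < min_hwe_p:
        reasons.append("hwe")
    if s.maf_parents < min_maf:
        reasons.append("maf")
    if s.mendel_error_rate > max_mendel:
        reasons.append("mendelian_errors")
    return reasons


# Made-up statistics for two SNPs.
good = SnpStats("rs0001", 0.01, 0.3, 0.22, 0.0)
bad = SnpStats("rs0002", 0.08, 1e-9, 0.005, 0.0)
good_reasons = snp_failure_reasons(good)
bad_reasons = snp_failure_reasons(bad)
```

Returning the list of failed criteria, rather than a single pass/fail flag, makes it easier to see how many SNPs each threshold removes when the defaults are changed.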
Post-SNP-QC Genotype Data in Plink Format:
(650Y_fwd_[country]_chr[number].ped and 650Y_fwd_[country]_chr[number].map)
Genotypes in PLINK .map and .ped file formats. File formats are described at: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.
Genotypes are reported with respect to the + strand; missing genotypes are set to ‘N’. SNP positions are listed in NCBI Build 36 coordinates.
Only SNPs passing quality control (described above) are included in these files.
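The .ped and .map files can be loaded directly by PLINK, but they are also straightforward to read by hand. The sketch below assumes the standard four-column .map layout (chromosome, SNP ID, genetic distance, base-pair position) and the standard .ped layout of six leading columns followed by two alleles per SNP; the function names and example lines are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_map_line(line: str) -> Tuple[str, str, float, int]:
    """Parse one line of a four-column PLINK .map file."""
    chrom, snp_id, genetic_distance, bp = line.split()
    return chrom, snp_id, float(genetic_distance), int(bp)


def parse_ped_line(line: str, missing: str = "N") -> Dict[str, object]:
    """Parse one line of a PLINK .ped file.

    Genotypes are returned as (allele1, allele2) tuples, or None when
    either allele carries the missing code used in this release ('N').
    """
    fields = line.split()
    fam, sid, pat, mat, sex, pheno = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError(f"{sid}: odd number of allele columns")
    genotypes: List[Optional[Tuple[str, str]]] = []
    for i in range(0, len(alleles), 2):
        a1, a2 = alleles[i], alleles[i + 1]
        genotypes.append(None if missing in (a1, a2) else (a1, a2))
    return {
        "family_id": fam,
        "sample_id": sid,
        "paternal_id": pat,
        "maternal_id": mat,
        "sex": sex,
        "phenotype": pheno,
        "genotypes": genotypes,
    }


# Made-up lines for illustration only.
map_entry = parse_map_line("1 rs0001 0 12345")
ped_entry = parse_ped_line("FAM1 C1 F1 M1 1 2 A G N N C C")
```

The i-th genotype in a .ped line corresponds to the i-th line of the matching .map file, so the two files for a chromosome should always be read together.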
Haplotypes are presented in the same format as the HapMap Phase II haplotypes (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2006-07_phaseII/00README.txt). Please note that the designation of allele 0 and allele 1 does not always conform to the HapMap files. Phasing was done using the post SNP-QC data with the trio option of PHASE.
There is one sample file per country and two files per chromosome, per country cohort.
The chr[number]_[country]_fwd_b36_legend.txt files contain a legend detailing the rs id, the base pair position (NCBI Build 36), the allele coded 0 and the allele coded 1 for each of the segregating SNPs e.g.
The [country]_fwd_sample_release_ids.file contains the ordered list of individuals that correspond to the chr[number]_[country]_fwd_phased files.
- In the chr[number]_[country]_fwd_phased files, one haplotype is listed per line.
- Two haplotypes are listed for each sample, with the transmitted haplotype listed first.
- To maintain consistency in the X chromosome files each male has a placeholder second haplotype represented by a row of dashes.
- The haplotypes of children are not included as they can be inferred from their parents’ haplotypes.
- Parental haplotypes are listed consecutively, for example:
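As a rough illustration of the layout described in this list, the sketch below pairs each individual in the ordered sample list with its two haplotype lines. It assumes alleles are written as 0/1 characters, optionally separated by whitespace, and that the male X-chromosome placeholder consists only of dashes; the function name `pair_haplotypes` and the example data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pair_haplotypes(
    sample_ids: List[str], haplotype_lines: List[str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each sample to (transmitted, untransmitted) haplotype strings.

    The second haplotype is None when the line is a dash placeholder,
    as used for males in the X chromosome files.
    """
    # Remove whitespace so each haplotype becomes one string of alleles.
    haplotypes = ["".join(ln.split()) for ln in haplotype_lines if ln.strip()]
    if len(haplotypes) != 2 * len(sample_ids):
        raise ValueError(
            f"expected {2 * len(sample_ids)} haplotype lines, "
            f"found {len(haplotypes)}"
        )
    result = {}
    for i, sid in enumerate(sample_ids):
        transmitted = haplotypes[2 * i]  # listed first for each sample
        second = haplotypes[2 * i + 1]
        untransmitted = None if set(second) == {"-"} else second
        result[sid] = (transmitted, untransmitted)
    return result


# Made-up X chromosome data: a father (placeholder second line) and a mother.
ids = ["F1", "M1"]
lines = ["0 1 1 0", "- - - -", "1 1 0 0", "0 0 1 1"]
paired = pair_haplotypes(ids, lines)
```

Position i in each haplotype string corresponds to the i-th SNP in the matching legend file, which also gives the nucleotides behind the 0 and 1 codes.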