This release contains SNP genotype data from mother-father-child trios genotyped on the Illumina 650Y array.
- These data include parents and a single offspring (so-called trios) from three partner studies within Consortial Project 1.
- All children have been diagnosed with malaria in a hospital.
- This data release contains a total of 4,174 samples that have passed quality control.
These data have been deposited in the European Genotyping Archive (EGA) under EGA Study IDs: EGAS00000000087 (Gambia) and EGAS00000000088 (Ghana).
Data files
Annotations
- Sample Support Files
- Sample overlap with Gambian Case-Control on Affymetrix 500k platform
- Sample QC
- SNP QC
- Description of data
- Intensities (Normalised Signals)
- Genotypes
- Plink_files
- Phased
- Supplementary_data
Sample Support Files
(650Y_samples_[country]Trios.txt)
The tab-delimited Sample Support Files list the following information for each sample:
- Family ID
- Sample ID
- Paternal ID (Missing parental IDs are set to ‘0’)
- Maternal ID (Missing parental IDs are set to ‘0’)
- Gender (1 = Male, 2 = Female)
- Phenotype (2 = Affected, 1 = Unaffected)
- Ethnicity*
*In the Ghanaian and Gambian Samples Files a seventh column lists self-reported ethnicity. Ethnicity is only reported for groups with a substantial number of samples in the data; this includes four groups in Gambia (Mandinka, Wolof, Fula, Jola) and two groups in Ghana (Akan and Northerners). All other ethnic groups and samples with unknown/missing ethnicity have been set to ‘Other’. A child’s ethnicity is assigned according to the Mother’s ethnicity. No ethnicity is being released for Malawi as the samples do not exhibit population substructure.
The following is an example of the Sample Support file format:
FamilyID | SampleID | PaternalID | MaternalID | Gender | Phenotype | Ethnicity |
---|---|---|---|---|---|---|
GAMTDT0420 | WGA_T123428 | TDT204237 | TDT204236 | 1 | 2 | Jola |
GAMTDT0420 | TDT204236 | 0 | 0 | 2 | 1 | Jola |
GAMTDT0420 | TDT204237 | 0 | 0 | 1 | 1 | Jola |
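As an illustration of the layout above, the following is a minimal Python sketch that reads a Sample Support file and picks out the child in each family (the sample whose parental IDs are both non-zero). The function and column names are hypothetical, and it is an assumption that a header line, if present, starts with `FamilyID` and that files without an ethnicity column simply have six fields.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names for the seven fields listed above.
COLUMNS = ["family_id", "sample_id", "paternal_id", "maternal_id",
           "gender", "phenotype", "ethnicity"]

def read_samples(handle):
    """Parse a tab-delimited Sample Support file into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: skip blank lines and a header line if one is present.
        if not fields or fields[0] == "FamilyID":
            continue
        # Assumption: files without ethnicity have six fields, so pad to seven.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

def children_by_family(rows):
    """Map each family ID to the samples whose parental IDs are both non-zero."""
    families = defaultdict(list)
    for row in rows:
        if row["paternal_id"] != "0" and row["maternal_id"] != "0":
            families[row["family_id"]].append(row["sample_id"])
    return dict(families)

# The three example rows from the table above.
example = (
    "GAMTDT0420\tWGA_T123428\tTDT204237\tTDT204236\t1\t2\tJola\n"
    "GAMTDT0420\tTDT204236\t0\t0\t2\t1\tJola\n"
    "GAMTDT0420\tTDT204237\t0\t0\t1\t1\tJola\n"
)
rows = read_samples(io.StringIO(example))
print(children_by_family(rows))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.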
Sample overlap with Gambian Case-Control on Affymetrix 500k platform
(GM_CC_affy500k_Trios_Illumina650Y_data_release_overlap.txt)
This file provides details on samples that are common between the Gambian Case-Control experiment (EGAS00000000026) run on the Affymetrix 500k platform and the Gambian Trios experiment (EGAS00000000087) run on the Illumina 650Y platform.
The samples pertinent to both platforms are malaria cases only.
The file details samples present in both data releases that passed all relevant QC stages.
The file identifies the trio of relevance and then gives the sample ID (or IDs) relevant to the genotyping platform.
There are 279 cases common to both experiment datasets:
- 276 individuals are represented by a single sample on both platforms
- 3 individuals have singletons on Affymetrix and duplicates on Illumina
The file is tab-delimited and formatted as shown below:
TRIO_ID | GM_CC_Affy500K | GM_TRIOS_Illumina650Y | GM_TRIOS_Illumina650Y |
---|---|---|---|
GAMTDT0002 | WTCCC131484 | WTCCC131484 | |
GAMTDT0004 | WTCCC131481 | WTCCC131481 | |
GAMTDT0006 | WTCCC131244 | WTCCC131244 | |
GAMTDT0149 | WTCCC130093 | WTCCC130141 | WTCCC130093 |
GAMTDT0160 | WTCCC130573 | WTCCC130621 | WTCCC130621 |
Sample QC
This data release contains trios where all three members have passed quality control.
We excluded samples which fell into at least one of the following categories:
- Missing genotypes at > 5% of autosomal SNPs
- Relatedness defined as ‘samples with 85% – 98% identity-by-state (IBS)’. We excluded all samples in each collection of related individuals except the one with the highest call rate
- Duplicates defined as ‘samples with > 98% IBS’. We excluded all samples from each duplicate collection except the one with the highest call rate
- Contaminated samples defined as ‘excess levels of heterozygosity’
- Mis-specified trios defined as ‘Mendelian errors at greater than 2.5% of SNPs passing QC’
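The mis-specified trio criterion can be illustrated with a small Python sketch. This is not the pipeline used for the release; it is a simplified autosomal check under the assumption that genotypes are two-letter strings such as `TC` and that missing calls are represented as `None`. The function names are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype can be formed by one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def mendelian_error_rate(trio_genotypes):
    """Fraction of fully called SNPs in a trio that show a Mendelian error.

    trio_genotypes: iterable of (child, father, mother) genotype strings,
    with None marking a missing call (an assumption of this sketch).
    """
    checked = errors = 0
    for child, father, mother in trio_genotypes:
        if None in (child, father, mother):
            continue  # SNPs with any missing call are not counted
        checked += 1
        if not is_mendelian_consistent(child, father, mother):
            errors += 1
    return errors / checked if checked else 0.0

# Made-up genotypes for illustration only.
trio = [("TC", "TT", "CC"), ("TT", "CC", "TC"), (None, "TT", "TT"), ("AA", "AA", "AG")]
rate = mendelian_error_rate(trio)
print(rate > 0.025)  # compare against the 2.5% threshold quoted above
```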
SNP QC
We excluded SNPs that fell into at least one of the following categories:
- Missing genotypes in more than 5% of samples
- Departure from Hardy-Weinberg equilibrium (HWE, calculated in parents only) at a p-value of less than 10⁻⁷
- Minor allele frequency of less than 1% in the parents
- Greater than 10% Mendelian errors
Also:
- After sample QC, the genotypes of a trio at a SNP with a Mendelian error were set to missing in all three members of that trio.
- ~250 SNPs failing visual clusterplot inspection have also been excluded.
- As the Malawi genotypes were noticeably noisier than the other two cohorts we excluded SNPs with an Illuminus perturbation score of less than 0.95 from the Malawi dataset (~47K SNPs).
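For users applying their own SNP filters, the missingness and minor allele frequency thresholds above could be sketched in Python as follows. This is an illustrative simplification rather than the procedure used for the release: it assumes biallelic SNPs, two-letter genotype strings with `None` for missing calls, and it leaves out the HWE, Mendelian error, cluster plot and perturbation score filters. The function name and parameters are hypothetical.

```python
from collections import Counter

def snp_passes_basic_qc(all_genotypes, parent_genotypes,
                        max_missing=0.05, min_maf=0.01):
    """Apply the missingness (all samples) and MAF (parents only) thresholds."""
    n = len(all_genotypes)
    if n == 0:
        return False
    missing_rate = sum(g is None for g in all_genotypes) / n
    if missing_rate > max_missing:
        return False
    # Count alleles among called parental genotypes only.
    allele_counts = Counter(a for g in parent_genotypes if g is not None for a in g)
    if len(allele_counts) < 2:
        return False  # monomorphic (or uncalled) in parents: MAF is 0
    # Assumption: the SNP is biallelic, so the rarer allele gives the MAF.
    maf = min(allele_counts.values()) / sum(allele_counts.values())
    return maf >= min_maf

# Made-up genotypes for illustration only.
parents = ["TC"] * 10 + ["TT"] * 90
print(snp_passes_basic_qc(parents, parents))
```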
Description of Data
(MalariaGEN_TriosReleaseDocumentation_2011.pdf)
This file is a PDF version of this web page, although without the supplementary information. Details of supplementary fields can be found in the Supplementary_data directory.
Intensities (Normalised signals)
(650Y_Intensities_fwd_[country]_chr[number].txt)
Appropriately normalized signal data were generated from the Illumina intensity (“IDAT”) files via BeadStudio, and these were used as input to the ILLUMINUS genotype calling program. The format of the signal data is tab-delimited plain text; one line per SNP, consisting of ID, coordinate (NCBI Build 36), alleles and one pair of intensities per sample for each of the two alleles. All genotypes have also been configured to the ‘+’ strand of the SNP.
The following is an example of the signal file format.
RS | Coord | Allele1 | Allele2 | ID-XXX1_A | ID-XXX1_B | ID-XXX2_A | ID-XXX2_B | … |
---|---|---|---|---|---|---|---|---|
rs5994034 | 15274090 | T | C | 0.0056 | 0.420 | 0.023 | 0.343 | … |
rs2027653 | 15298335 | T | C | 0.083 | 0.180 | 0.090 | 0.149 | … |
rs9604967 | 15492342 | T | C | 0.091 | 0.770 | 0.051 | 0.508 | … |
Please note that these files may contain very long lines and are not intended to be human-readable.
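Because the lines are long, it is usually easier to process them programmatically. The following Python sketch splits one signal line into a pair of intensities per sample. It assumes that the file begins with a header row like the one in the example, with two columns per sample whose names end in a two-character suffix such as `_A`; the function names are hypothetical.

```python
def parse_intensity_header(header_line):
    """Return sample IDs in column order (assumes two columns per sample)."""
    columns = header_line.rstrip("\n").split("\t")[4:]
    # Take the first column of each pair and strip its two-character suffix.
    return [name[:-2] for name in columns[0::2]]

def parse_intensity_line(line, sample_ids):
    """Return (snp_id, coordinate, (allele1, allele2), {sample: (a, b)})."""
    fields = line.rstrip("\n").split("\t")
    snp_id, coord = fields[0], int(fields[1])
    alleles = (fields[2], fields[3])
    values = [float(x) for x in fields[4:]]
    pairs = list(zip(values[0::2], values[1::2]))
    return snp_id, coord, alleles, dict(zip(sample_ids, pairs))

# First data row of the example table above.
header = "RS\tCoord\tAllele1\tAllele2\tID-XXX1_A\tID-XXX1_B\tID-XXX2_A\tID-XXX2_B"
line = "rs5994034\t15274090\tT\tC\t0.0056\t0.420\t0.023\t0.343"
sample_ids = parse_intensity_header(header)
snp_id, coord, alleles, signals = parse_intensity_line(line, sample_ids)
print(snp_id, coord, alleles, signals)
```

Reading the file line by line in this way avoids loading a whole chromosome into memory.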
Genotypes
(650Y_genotypes_fwd_[country]_chr[number].txt)
Because the Illumina 650Y SNP chip can yield several GBs per cohort, the genotype data have been partitioned by chromosome. Each file is presented in tab-delimited format and contains one genotype per line. The score is the posterior probability of the genotype called using the Illuminus program. Regardless of how the SNPs are organized, all assays are sorted according to sample so that the file can be readily separated into sample blocks. It should also be noted that all genotypes have been configured to the ‘+’ strand of the SNP. The following is an example of the genotype data format:
SNP | SAMPLE | GENOTYPE | SCORE |
---|---|---|---|
rs1020382 | ID-XXXXXXX | TC | 1 |
rs12459906 | ID-XXXXXXX | TC | 0.9999 |
rs12151104 | ID-XXXXXXX | AA | 0.9983 |
Data are provided for all SNPs allowing the user to set their own QC metrics.
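As one example of a user-defined QC metric, the sketch below reads the four-column genotype format and keeps only calls whose score meets a chosen threshold. The threshold value and function name are illustrative assumptions, and the sketch assumes that every score field is numeric.

```python
import csv
import io

def read_genotypes(handle, min_score=0.0):
    """Yield (snp, sample, genotype, score) for calls with score >= min_score."""
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: skip malformed lines and a header line if one is present.
        if len(fields) != 4 or fields[0] == "SNP":
            continue
        snp, sample, genotype, score = fields
        score = float(score)
        if score >= min_score:
            yield snp, sample, genotype, score

# The three example rows from the table above.
example = (
    "rs1020382\tID-XXXXXXX\tTC\t1\n"
    "rs12459906\tID-XXXXXXX\tTC\t0.9999\n"
    "rs12151104\tID-XXXXXXX\tAA\t0.9983\n"
)
calls = list(read_genotypes(io.StringIO(example), min_score=0.999))
print([c[0] for c in calls])
```

Because the function is a generator, it can be run over a full chromosome file without holding every call in memory.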
Plink_files (Post-SNP-QC Genotype Data in Plink Format)
(650Y_fwd_[country]_chr[number].ped and 650Y_fwd_[country]_chr[number].map)
Genotypes in PLINK .map and .ped file formats. File formats are described at: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.
Genotypes are reported with respect to the + strand; missing genotypes are set to ‘N’. SNP positions are listed in NCBI Build 36 coordinates.
Only SNPs passing quality control (described above under SNP QC) are included in these files.
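For readers who want to inspect these files without PLINK, the following Python sketch parses one line of each file. It relies on the standard PLINK text layout (six leading columns in .ped followed by two allele columns per SNP, and four columns in .map) and on the ‘N’ missing code described above. The example lines are made up for illustration and the function names are hypothetical.

```python
def parse_map_line(line):
    """Return (chromosome, snp_id, base_pair_position) from a .map line."""
    chrom, snp_id, _genetic_distance, position = line.split()
    return chrom, snp_id, int(position)

def parse_ped_line(line, missing="N"):
    """Return sample details and one genotype per SNP (None if missing)."""
    fields = line.split()
    family_id, sample_id, paternal_id, maternal_id, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = [
        None if missing in (a, b) else a + b
        for a, b in zip(alleles[0::2], alleles[1::2])
    ]
    return {
        "family_id": family_id,
        "sample_id": sample_id,
        "paternal_id": paternal_id,
        "maternal_id": maternal_id,
        "sex": sex,
        "phenotype": phenotype,
        "genotypes": genotypes,
    }

# Illustrative lines only: the chromosome and genotype values are invented.
map_entry = parse_map_line("22 rs5994034 0 15274090")
ped_entry = parse_ped_line("GAMTDT0420 TDT204236 0 0 2 1 T C N N A A")
print(map_entry, ped_entry["genotypes"])
```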
Phased (Haplotype Files)
[country]_fwd_sample_release_ids.txt
chr[number]_[country]_fwd_b36_legend.txt
chr[number]_[country]_fwd_phased
Haplotypes are presented in the same format as the HapMap Phase II haplotypes (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2006-07_phaseII/00README.txt). Please note that the designation of allele 0 and allele 1 does not always conform to the HapMap files. Phasing was done using the post-SNP-QC data with the trio option of PHASE.
There is one sample file per country and two files per chromosome, per country cohort.
The chr[number]_[country]_fwd_b36_legend.txt files contain a legend detailing the rs ID, the base pair position (NCBI Build 36), the allele coded 0 and the allele coded 1 for each of the segregating SNPs, e.g.:
RS | position | 0 | 1 |
---|---|---|---|
rs16981694 | 15647732 | A | C |
rs9606468 | 15648282 | C | T |
The [country]_fwd_sample_release_ids.txt file contains the ordered list of individuals that correspond to the chr[number]_[country]_fwd_phased files.
- In the chr[number]_[country]_fwd_phased files, one haplotype is listed per line.
- Two haplotypes are listed for each sample, with the transmitted haplotype listed first.
- To maintain consistency in the X chromosome files each male has a placeholder second haplotype represented by a row of dashes.
- The haplotypes of children are not included as they can be inferred from their parents’ haplotypes.
Parental haplotypes are listed consecutively, for example:
- Trio 1 parent 1 transmitted haplotype
- Trio 1 parent 1 untransmitted haplotype
- Trio 1 parent 2 transmitted haplotype
- Trio 1 parent 2 untransmitted haplotype
- Trio 2 parent 1 transmitted haplotype
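The ordering described above can be used to pair each parent with its two haplotypes and to translate the 0/1 codes back to alleles with the legend. The Python sketch below is illustrative only: it assumes that the sample ID list contains exactly one entry per pair of haplotype lines, that allele codes within a line may be separated by whitespace, and that a dash marks the male X chromosome placeholder. The function names, sample IDs and haplotype lines are hypothetical.

```python
def read_haplotypes(sample_ids, haplotype_lines):
    """Map each sample ID to its (transmitted, untransmitted) haplotypes."""
    # Assumption: codes may be whitespace-separated, so join them into one string.
    haps = ["".join(line.split()) for line in haplotype_lines if line.strip()]
    if len(haps) != 2 * len(sample_ids):
        raise ValueError("expected two haplotype lines per sample")
    return {
        sample: (haps[2 * i], haps[2 * i + 1])
        for i, sample in enumerate(sample_ids)
    }

def decode(haplotype, legend):
    """Translate 0/1 codes to alleles; a dash placeholder becomes None."""
    return [
        None if code == "-" else pair[int(code)]
        for code, pair in zip(haplotype, legend)
    ]

# Legend rows from the example above: (allele coded 0, allele coded 1).
legend = [("A", "C"), ("C", "T")]
# Made-up sample IDs and haplotype lines for one trio's parents.
haplotypes = read_haplotypes(["PARENT1", "PARENT2"], ["0 1", "1 1", "1 0", "0 0"])
# The child's two haplotypes are the transmitted haplotype of each parent.
child = [decode(haplotypes[p][0], legend) for p in ("PARENT1", "PARENT2")]
print(child)
```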
Supplementary_data
From March 2013, we have added some supplementary data for use with this dataset. A ReadMe file accompanies the supplementary data that describes their contents and how they relate to the main data.
Data sets
Gambia Trios
EGA Study ID: EGAS00000000087
EGA Data Set ID: EGAD00000000019
Method: Illumina 650Y array
658 Gambian trios (1,984 individuals)
- 650 unique trios (3 individuals/family)
- 6 quads (4 individuals/family)
- GAMTDT0142
- GAMTDT0154
- GAMTDT0212
- GAMTDT0492
- GAMTDT0496
- GAMTDT0698
- 2 x ½ siblings (5 individuals/family)
- GAMTDT0169
- GAMTDT0274
Ghana Trios
EGA Study ID: EGAS00000000088
EGA Data Set ID: EGAD00000000020
Method: Illumina 650Y array
608 Ghanaian trios (1,824 individuals)
Malawi Trios
EGA Study ID: Not yet available (see release note below)
EGA Data Set ID: Not yet available
Method: Illumina 650Y array
122 Malawian trios (366 individuals)
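The family counts listed for the three cohorts can be cross-checked against the 4,174 samples quoted at the top of this release. The short Python sketch below does that arithmetic; the variable and function names are illustrative.

```python
# (number of families, individuals per family) as listed for each cohort
gambia = [(650, 3), (6, 4), (2, 5)]
ghana = [(608, 3)]
malawi = [(122, 3)]

def individuals(groups):
    """Total individuals across all family groups in a cohort."""
    return sum(n_families * size for n_families, size in groups)

totals = {
    "Gambia": individuals(gambia),
    "Ghana": individuals(ghana),
    "Malawi": individuals(malawi),
}
print(totals, sum(totals.values()))
# prints {'Gambia': 1984, 'Ghana': 1824, 'Malawi': 366} 4174
```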
Release notes
Malawi trios
10 Oct 2015: Please note that this dataset has been prepared for release by MalariaGEN and will be released as soon as the relevant ethics committee confirms the range of acceptable research uses.