An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples
24 Feb 2021

MalariaGEN et al.

Wellcome Open Research, 2021; 6:42. DOI: 10.12688/wellcomeopenres.16168.1

This page provides information about the Pf6 dataset, which contains genome variation data on 7,000 worldwide samples of Plasmodium falciparum. The key publication is MalariaGEN et al, Wellcome Open Research 2021; 6:42. DOI: 10.12688/wellcomeopenres.16168.1. You can browse summary data using the Pf6 data exploration tool.

Background and previous releases

This dataset is based on the MalariaGEN Plasmodium falciparum Community Project which supported groups around the world to integrate parasite genome sequencing into clinical and epidemiological studies of malaria. It comprises multiple partner studies, each with its own research objectives and led by a local investigator. Genome sequencing is performed centrally, and partner studies are free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing.

Aggregated data from the Community Project were initially released through a companion project called Pf3k, whose goal was to bring together leading analysts from multiple institutions to benchmark and standardise methods of variant discovery and genotype calling. The Pf3k dataset can be explored using an interactive web application.

The open dataset was enlarged in 2016 when multiple partner studies contributed to a consortial publication on 3,488 samples from 23 countries. The variants and genotypes described in this publication used version 3 of the analysis pipeline. Data produced using an earlier version of the data analysis pipeline can be explored using an interactive web application.

About the version 6 data pipeline

In 2018 the Plasmodium falciparum Community Project upgraded to version 6 of its variant discovery and genotype calling pipeline. Details of the methods can be found in the accompanying paper and here. The major change from previous versions is that the version 6 pipeline is based on GATK and utilises findings on genome accessibility generated by the P. falciparum Genetic Crosses Project.

Content of the data release

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: “This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described in ‘An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. MalariaGEN et al, Wellcome Open Research 2021; 6:42. DOI: 10.12688/wellcomeopenres.16168.1.’”

  • Study information: Details of the 49 contributing partner studies, including description, contact information and key people.
  • Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 7,113 samples from 28 countries.
  • Measure of complexity of infections: characterisation of within-host diversity (FWS) for 5,970 QC pass samples.
  • Drug resistance marker genotypes: genotypes at known markers of drug resistance for 7,113 samples, containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3.
  • Inferred resistance status classification: classification of 5,970 QC pass samples into different types of resistance to 10 drugs or combinations of drugs and to RDT detection: chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine-pyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine, hrp2 and hrp3 gene deletions.
  • Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification.
  • Gene differentiation: estimates of global and local differentiation for 5,561 genes.
  • Short variant genotypes: Genotype calls on 6,051,696 SNPs and short indels in 7,113 samples from 29 countries, available both as VCF (ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_vcf/) and zarr (ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6.zarr.zip) files.
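
To illustrate what a marker-to-status heuristic of the kind listed above can look like, here is a minimal sketch for a single locus. It relies only on the fact that 76T in crt is the classic chloroquine resistance mutation. The function name, the output labels, the "-" missing-value code and the treatment of mixed calls as undetermined are all assumptions made for illustration; the heuristics actually used in the release are those described in the release files.

```python
from typing import Optional


def classify_chloroquine(crt_76: Optional[str]) -> str:
    """Map a crt codon-76 amino acid call to a hypothetical resistance label.

    Illustrative rule only, not the release's documented heuristic.
    """
    # Assumption: None, "" or "-" all denote a missing genotype.
    if crt_76 is None or crt_76.strip() in {"", "-"}:
        return "Undetermined"
    # Accept "K", "T", or mixed calls written as "K,T" or "K/T".
    alleles = {a.strip().upper() for a in crt_76.replace("/", ",").split(",")}
    if alleles == {"T"}:
        return "Resistant"
    if alleles == {"K"}:
        return "Sensitive"
    # Mixed infection or an unexpected amino acid.
    return "Undetermined"
```

Calling `classify_chloroquine("T")` returns `"Resistant"`, while a mixed call such as `"K,T"` returns `"Undetermined"`.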

A README file describes in fine detail all the files included in the release, the format and interpretation of each column, and contains some tips and tricks for accessing genotype data in VCF and zarr files.
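
As a rough illustration of what reading genotype calls from a VCF file involves, the sketch below decodes the GT field of one tab-separated VCF data line using only the Python standard library. It assumes the standard VCF column layout (eight fixed columns, then FORMAT, then one column per sample). The function name, the sample names and the example record are invented for illustration and are not taken from the release; the README remains the authoritative guide to the actual files.

```python
def parse_vcf_line(line, sample_names):
    """Return (chrom, pos, {sample: [allele, ...]}) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # FORMAT (column 9) names the colon-separated keys of each sample field.
    gt_index = fields[8].split(":").index("GT")
    # Allele code 0 is REF; codes 1..n index into the comma-separated ALT list.
    alleles = [ref] + alt.split(",")
    calls = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        codes = gt.replace("|", "/").split("/")  # treat phased and unphased alike
        # "." marks a missing call.
        calls[name] = [None if c == "." else alleles[int(c)] for c in codes]
    return chrom, int(pos), calls


# Made-up record with three hypothetical samples.
example = "Pf3D7_01_v3\t1000\t.\tA\tG,T\t50\tPASS\t.\tGT:AD\t0/0:10,0,0\t1/2:0,5,5\t./.:0,0,0"
chrom, pos, calls = parse_vcf_line(example, ["S1", "S2", "S3"])
```

For real analyses, a dedicated VCF or zarr library is likely to be far more efficient than line-by-line parsing; this sketch only shows how the allele codes relate to the REF and ALT columns.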

NOTE: You may need to download a free FTP client to access the FTP links.
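
As an alternative to a graphical FTP client, the FTP links can in principle be read with Python's standard `ftplib` module. The sketch below assumes that the server accepts anonymous logins and is still reachable at the address given above, neither of which has been verified here; the helper names are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path)."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parsed.hostname, parsed.path


def list_ftp_dir(url, timeout=30):
    """List the entries of a remote FTP directory (requires network access)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # assumption: anonymous access is permitted
        return ftp.nlst(path)


host, path = split_ftp_url(
    "ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_vcf/"
)
```

Calling `list_ftp_dir` with the VCF directory link would then return the remote file names, which can be fetched individually with `FTP.retrbinary`.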

Supplementary data

The following supplementary data is available as a single document download: Supplementary data

  • Supplementary Note
    • Analysis of local differentiation score
    • The classic 76T chloroquine resistance mutation in crt is found on multiple haplotypes
    • Sulfadoxine-pyrimethamine resistance is widespread and associated with many haplotypes
    • mdr1 duplications have many different breakpoints
    • Artemisinin, piperaquine, and mefloquine resistance
    • No evidence of resistance to less commonly used antimalarials
  • Supplementary Table 1. Breakdown of analysis set samples by geography
  • Supplementary Table 2. Studies contributing samples
  • Supplementary Table 3. Summary of discovered variant positions
  • Supplementary Table 4. Breakpoints of duplications of gch1
  • Supplementary Table 5. Breakpoints of duplications of mdr1
  • Supplementary Table 6. Breakpoints of duplications of plasmepsin 2-3
  • Supplementary Table 7. Genes ranked by global differentiation score
  • Supplementary Table 8. Genes ranked by local differentiation score
  • Supplementary Table 9. Number of samples used to determine proportions in Table 2
  • Supplementary Table 10. Frequencies of mutations associated with mono- and multi-drug resistance pre- and post-2011
  • Supplementary Table 11. Frequency of crt amino acid 72-76 haplotypes
  • Supplementary Table 12. Frequencies of dhfr (51, 59, 108, 164) and dhps (437, 540, 581, 613) multi-locus haplotypes
  • Supplementary Table 13. Frequency of HRP2 and HRP3 deletions by country
  • Supplementary Table 14. Alleles at six mitochondrial positions used for species identification
  • Supplementary Figure 1. Histogram of local differentiation score for all genes

Publications that have used the P. falciparum Community Project data resource, prior to and including version 6