Pf7: An open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples
16 Jan 2023

MalariaGEN et al.

Wellcome Open Research, 2023; 8:22. DOI: 10.12688/wellcomeopenres.18681.1


About the Plasmodium falciparum version 7 data

This page provides information about the Pf7 dataset, which contains genome variation data on over 20,000 worldwide samples of Plasmodium falciparum. The key publication is MalariaGEN et al., Wellcome Open Research 2023, 8:22 https://doi.org/10.12688/wellcomeopenres.18681.1.

Open the Pf7 app to view summary information about contributing studies, countries, and resistance profiles.

Background and previous releases

This dataset is based on genome variation from the MalariaGEN network, including samples which were previously released through the Pf3k Project, Plasmodium falciparum Community Project and GenRe Mekong Project. It comprises multiple partner studies, each with its own research objectives and led by a local investigator. Genome sequencing is performed centrally, and partner studies are free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing.

This new open dataset is almost three times larger than the last dataset release (Pf6, published 2021), and includes samples from a wider geographic reach. The variants and genotypes described in this publication used version 3 of the analysis pipeline. Data produced using an earlier version of the data analysis pipeline can be explored using an interactive web application.

About the version 7 data pipeline

Details of the methods can be found in the accompanying paper and here.

Content of the data release

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: “This publication uses MalariaGEN data as described in ‘Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples’ MalariaGEN et al, Wellcome Open Research 2023, 8:22 https://doi.org/10.12688/wellcomeopenres.18681.1.”

  • Study information: Details of the 82 contributing partner studies, including description, contact information and key people.
  • Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 20,864 samples from 33 countries.
  • Measure of complexity of infections: characterisation of within-host diversity (FWS) for 16,203 QC pass samples.
  • Drug resistance marker genotypes: genotypes at known markers of drug resistance for 16,203 samples, containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3.
  • Inferred resistance status classification: classification of 16,203 QC pass samples into different types of resistance to 10 drugs or combinations of drugs (chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine-pyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine) and to RDT detection (hrp2 and hrp3 gene deletions).
  • Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification.
  • CRT haplotypes: Full crt gene haplotypes for 16,203 QC pass samples. These are available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_crt_haplotypes.txt
  • CSP C-terminal haplotypes: Full csp C-terminal haplotypes for 16,203 QC pass samples plus 6 lab strains. These are available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt
  • EBA175 calls: eba175 allelic type calls for 16,203 QC pass samples.
  • Reference genome: the version of the 3D7 reference genome fasta file used for mapping. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum.genome.fasta
  • Annotation file: the version of the 3D7 reference annotation gff file used for genome annotations. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff
  • Genetic distances: Genetic distance matrix comparing all 20,864 samples. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_genetic_distance_matrix.npy
  • Short variant genotypes: Genotype calls on 10,145,661 SNPs and short indels in all 20,864 samples from 33 countries, available both as VCF (ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_vcf/) and zarr (ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7.zarr.zip) files.
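
As an illustration of the "Measure of complexity of infections" item above, FWS is commonly defined as 1 - Hw/Hs, where Hw is the heterozygosity within a single infection and Hs is the heterozygosity of the local parasite population. The sketch below is a simplified ratio-of-means version written for this page, not the procedure used in the Pf7 pipeline (see the accompanying paper for that). The function names and the toy allele frequencies are assumptions made for illustration only.

```python
def heterozygosity(allele_freqs):
    """Expected heterozygosity at one site: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)


def mean_heterozygosity(sites):
    """Average heterozygosity over a list of sites (each a list of allele frequencies)."""
    values = [heterozygosity(freqs) for freqs in sites]
    return sum(values) / len(values)


def fws(within_host_sites, population_sites):
    """Simplified FWS = 1 - Hw / Hs, using mean heterozygosity across sites.

    within_host_sites: per-site allele frequencies within one sample
                       (for example, estimated from read counts).
    population_sites:  per-site allele frequencies in the population.
    """
    h_s = mean_heterozygosity(population_sites)
    if h_s == 0:
        raise ValueError("population heterozygosity is zero; FWS is undefined")
    return 1.0 - mean_heterozygosity(within_host_sites) / h_s


# Toy data (hypothetical): two biallelic sites.
population = [[0.5, 0.5], [0.5, 0.5]]
clonal_sample = [[1.0, 0.0], [0.0, 1.0]]   # one allele per site
mixed_sample = [[0.5, 0.5], [0.5, 0.5]]    # as diverse as the population

print(fws(clonal_sample, population))
print(fws(mixed_sample, population))
```

Values close to 1 indicate an infection with little within-host diversity, while lower values suggest a mixed infection.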

A README file describes in fine detail all the files included in the release, the format and interpretation of each column, and contains some tips and tricks for accessing genotype data in VCF and zarr files.
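
To give a rough idea of what reading genotype calls from a VCF involves, the following minimal sketch parses the GT field from a small VCF text using only the Python standard library. The sample names, positions and calls are invented for illustration and do not come from Pf7; the README remains the authoritative description of the released files. For the real files, a dedicated VCF or zarr library is likely to be more practical.

```python
import io

# Hypothetical two-sample VCF fragment (tab-separated), for illustration only.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
    "Pf3D7_01_v3\t1000\t.\tA\tT\t50\tPASS\t.\tGT:AD\t0/0:10,0\t0/1:6,5\n"
    "Pf3D7_01_v3\t2000\t.\tG\tC\t50\tPASS\t.\tGT:AD\t1/1:0,12\t./.:0,0\n"
)


def parse_genotypes(vcf_text):
    """Return (samples, calls); calls maps (chrom, pos, sample) to an allele tuple.

    Missing alleles ('.') are returned as None.
    """
    samples = []
    calls = {}
    for line in io.StringIO(vcf_text):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos = fields[0], int(fields[1])
        gt_index = fields[8].split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            calls[(chrom, pos, sample)] = tuple(
                None if a == "." else int(a) for a in alleles
            )
    return samples, calls


samples, calls = parse_genotypes(VCF_TEXT)
print(samples)
print(calls[("Pf3D7_01_v3", 1000, "sampleB")])
```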

NOTE: You may need to download a free FTP client to access the FTP links.
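
As an alternative to a separate FTP client, the files can in principle be fetched with Python's standard library, which understands ftp:// URLs. The sketch below is a minimal example under that assumption; the helper names resource_url and download are hypothetical, and the base URL is the one used in the links above. The download function is defined but not called, since some files are very large.

```python
import os
from urllib.request import urlretrieve

# Base location of the release files, as given in the links above.
BASE_URL = "ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/"


def resource_url(filename):
    """Build the full FTP URL for a file in the release directory."""
    return BASE_URL + filename


def download(filename, dest_dir="."):
    """Download one release file into dest_dir and return the local path.

    Requires network access to the FTP server.
    """
    dest = os.path.join(dest_dir, os.path.basename(filename))
    urlretrieve(resource_url(filename), dest)
    return dest


print(resource_url("Pf7_crt_haplotypes.txt"))
# To fetch the file, call for example: download("Pf7_crt_haplotypes.txt")
```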

The Pf7 user guide is a useful companion to these data, providing information on how to use the malariagen_data Python package to access data in the cloud using free computer services and Jupyter Notebooks without having to first download the resource locally.
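
A minimal sketch of this cloud-based access might look as follows. It assumes the malariagen_data package is installed and network access is available, and that the sample metadata table has "Country" and "QC pass" columns with boolean QC values; treat the column names as assumptions and check them against the user guide. The helper qc_pass_by_country is a hypothetical name introduced here, and the records used to demonstrate it are invented.

```python
from collections import Counter


def load_pf7_metadata():
    """Load the Pf7 sample metadata as a pandas DataFrame (needs network access)."""
    import malariagen_data  # third-party package: pip install malariagen_data

    pf7 = malariagen_data.Pf7()
    return pf7.sample_metadata()


def qc_pass_by_country(records, country_key="Country", qc_key="QC pass"):
    """Count QC-pass samples per country from an iterable of dict-like records.

    The default column names are assumptions about the metadata table.
    """
    counts = Counter()
    for record in records:
        if record[qc_key]:
            counts[record[country_key]] += 1
    return dict(counts)


# Invented records, used only to show the shape of the computation.
example_records = [
    {"Country": "Ghana", "QC pass": True},
    {"Country": "Ghana", "QC pass": False},
    {"Country": "Kenya", "QC pass": True},
]
print(qc_pass_by_country(example_records))

# With the real data, something like:
#   records = load_pf7_metadata().to_dict("records")
#   qc_pass_by_country(records)
```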

Additional files that have been released since the initial data release are listed below.

  • Deletion and breakpoint locations within the histidine-rich protein genes II and III (HRP-2 and -3): List of deletion and breakpoint locations within the histidine-rich protein genes II and III (HRP-2 and -3) across 16,203 QC-pass samples. All samples with a deletion either have an exact breakpoint (in the case of samples where the deletion is due to telomere healing) or a breakpoint range (in the case of recombination with a different chromosome that has similar sequence).
  • Mean pairwise FST between all major sub-populations
  • BAM and gVCF files for all samples: BAM files are the final sample-level BAMs created as part of the pipeline. These differ in a number of ways from the raw BAM/CRAM files available at the ENA. For example, we have remapped all reads to the latest reference genome, fixed mates and marked duplicates, run GATK’s BQSR, and merged lanes where a sample had more than one lane of sequencing data. gVCF files are the output of GATK’s HaplotypeCaller. These could be useful if you would like to create a joint call set of Pf7 and your own sequencing data.

Supplementary data

The following supplementary data are available as a single document download: Supplementary data

  • Supplementary Table 1. Breakdown of analysis set samples by geography
  • Supplementary Table 2. Studies contributing samples
  • Supplementary Table 3. Summary of discovered variant positions
  • Supplementary Table 4. Numbers of samples used to determine proportions in Table 2.
  • Supplementary Table 5. Newly emerging Dd2 background mutations in crt.
  • Supplementary Table 6. Frequency of HRP2 and HRP3 deletions by country.
  • Supplementary Table 7. Summary of hrp2 and hrp3 deletion breakpoints.
  • Supplementary Figure 1. Breakdown of samples by country.
  • Supplementary Figure 2. Distribution of samples by year of collection.
  • Supplementary Figure 3. Lack of bias in population structure due to use of sWGA.
  • Supplementary Figure 4. Population structure from a neighbour-joining tree.
  • Supplementary Figure 5. Linkage disequilibrium decay in ten major parasite sub-populations.
  • Supplementary Figure 6. Characteristics of the ten major parasite sub-populations.
  • Supplementary Figure 7. Geographic patterns of population differentiation and gene flow.
  • Supplementary Figure 8. Abacus plot of inferred drug resistance frequencies in location/year combinations.
  • Supplementary Figure 9. Increase in frequency of KEL1.
  • Supplementary Figure 10. Variation in C-terminal of csp.
  • Supplementary Figure 11. Proportion of C allele of eba175 in different major sub-populations.

Publications that have used the P. falciparum Community Project data resource, prior to and including version 7