An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples

8 Apr 2022
MalariaGEN et al.
Wellcome Open Research 2022, 7:136 DOI:

This page provides information about the Pv4 dataset, which contains genome variation data on 1,895 worldwide samples of Plasmodium vivax. The key publication is MalariaGEN et al., Wellcome Open Research 2022, 7:136.

Previous releases and background

This dataset builds on the Plasmodium vivax Genome Variation Project, which supported groups around the world in establishing the landscape of evolution in P. vivax populations and in guiding informed control interventions by integrating whole genome sequencing with clinical and epidemiological studies. The project comprises multiple partner studies that share the collective ambition of open genomic data resources, while each has its own research objectives and is led by independent investigators. To address the need for comprehensive, large-scale genetic surveillance of P. vivax populations, we combined centralised sequencing capabilities with a standardised analysis pipeline for variant discovery and genotyping. This produced whole genome sequencing data that partners could use for downstream analysis, in line with MalariaGEN’s guiding principles on equitable data sharing. The work culminated in the open-access P. vivax Genome Variation project May 2016 data release and the associated analyses in Pearson et al, 2016. The full text article is accessible online.

This platform provided the foundation for a second public data resource (v4), which expands on the same model by integrating more samples from partner studies and by including existing sample data generated by the wider scientific community. This is the first large-scale curation of malaria genome variation data across heterogeneous sequencing methodologies and locations. It gives the community access to the largest curated dataset for epidemiological inferences across space and time, and processing all samples with a standardised pipeline minimises the biases that aggregation could otherwise introduce. This combined open resource contains a total of 1,895 samples. Most were provided by VivaxGEN (1,025) and GlaxoSmithKline (GSK) (357), alongside 297 previously published samples from external studies. The data resource collectively represents 14 studies from 27 countries and 88 sampling locations, with samples collected primarily between 2001 and 2017. Following on from the initial open data release, we provide genomic variation data, including SNPs, indels, and tandem duplications. For ease of downstream analysis, we also include information on population structure, per-sample metrics of within-host diversity, and a classification of samples into four types of drug resistance based on a limited set of published genetic markers.
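
The per-sample within-host diversity metric mentioned above is reported in this release as FWS. As a rough illustration only, FWS is commonly defined as 1 - Hw/Hs, where Hw is the heterozygosity observed within a single sample and Hs is the heterozygosity of the local population. The sketch below is a simplified version that assumes biallelic SNPs and takes a ratio of sums across sites; the function names and input values are hypothetical, and the exact procedure used for this release is the one described in the accompanying paper, not this code.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fws_simple(within_freqs, population_freqs):
    """Simplified FWS = 1 - Hw/Hs (assumption: ratio of sums over sites).

    within_freqs: within-sample non-reference allele frequencies per SNP
                  (for example, estimated from read counts).
    population_freqs: population allele frequencies at the same SNPs.
    """
    if len(within_freqs) != len(population_freqs):
        raise ValueError("inputs must cover the same SNPs")
    hw = sum(heterozygosity(p) for p in within_freqs)
    hs = sum(heterozygosity(p) for p in population_freqs)
    if hs == 0:
        raise ValueError("population heterozygosity is zero; FWS undefined")
    return 1.0 - hw / hs


# Hypothetical inputs: a sample carrying one allele at every site has
# no within-host heterozygosity, so FWS is 1.
clonal = fws_simple([0.0, 1.0, 0.0], [0.3, 0.5, 0.2])
# A sample as diverse as the population itself gives FWS of 0.
mixed = fws_simple([0.5, 0.5], [0.5, 0.5])
```

Values close to 1 suggest a single dominant genotype in the sample, while lower values suggest a mixed infection.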

About the data pipeline

Full details of the methods can be found in the accompanying paper. The major changes from the v1 (May 2016 data release) pipeline are that we now a) map to the PvP01 reference genome rather than PvSal1 and b) use a pipeline based on current GATK best practices which is analogous to the Pf6 pipeline.

Contents of the release

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: "This publication uses data from the MalariaGEN Plasmodium vivax Genome Variation Project as described in ‘An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples’. MalariaGEN et al. Wellcome Open Research 2022, 7:136."

  • Study information: Details of the 11 contributing partner studies, and 3 external studies, including description, contact information and key people. 

  • Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 1,895 samples from 27 countries. 

  • Measure of complexity of infections: characterisation of within-host diversity (FWS) for 1,072 QC pass samples. 

  • Drug resistance marker genotypes: genotypes at known markers of drug resistance for 1,895 samples, containing amino acid and copy number genotypes at 3 loci: dhfr, dhps, mdr1. 

  • Inferred resistance status classification: classification of 1,072 QC pass samples into different types of resistance to 4 drugs or combinations of drugs: pyrimethamine, sulfadoxine, mefloquine, and sulfadoxine-pyrimethamine combination. 

  • Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification. 

  • Tandem duplication genotypes: genotypes for tandem duplications discovered in four regions of the genome.

  • Genome regions and Genome regions index: a BED file classifying genomic regions as core genome or as one of several classes of non-core genome, together with a tabix index file for the genome regions file.

  • Short variants genotypes: Genotype calls on 4,571,056 SNPs and short indels in 1,895 samples from 27 countries, available both as VCF and zarr files.  These are available at:

A README file describes in detail all the files included in the release, the format and interpretation of each column, and contains some tips and tricks for accessing the genotype data in VCF and zarr files. 
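
As a complement to the README, the following is a minimal sketch of how genotype calls can be read from a VCF file using only the Python standard library. It relies on the standard VCF column layout (sample columns start at the tenth field, and the FORMAT column names the per-sample subfields). The in-memory example, including the contig name, positions and sample IDs, is made up for illustration and is not taken from the release.

```python
import io

# Tiny made-up VCF for illustration only (not data from the Pv4 release).
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
chr_demo\t101\t.\tA\tG\t50\tPASS\t.\tGT:AD\t0/0:30,0\t1/1:0,25
chr_demo\t250\t.\tC\tT\t12\tLowQual\t.\tGT:AD\t0/1:10,9\t./.:0,0
"""


def read_genotypes(handle, pass_only=True):
    """Yield (chrom, pos, {sample: GT string}) for each variant record."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue
        if pass_only and fields[6] != "PASS":
            continue  # drop records that failed a site filter
        gt_index = fields[8].split(":").index("GT")
        calls = {
            sample: call.split(":")[gt_index]
            for sample, call in zip(samples, fields[9:])
        }
        yield fields[0], int(fields[1]), calls


records = list(read_genotypes(io.StringIO(EXAMPLE_VCF)))
```

For a compressed file on disk, the same function should work with a text-mode handle such as `gzip.open(path, "rt")`, assuming the file is gzip- or bgzip-compressed. For files of the size described here, a dedicated VCF or zarr library is likely to be a better choice than line-by-line parsing.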

The VCF and zarr files in this release can be downloaded from the Wellcome Sanger Institute public FTP site using a freely available FTP client.
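
Once the files are downloaded, the genome regions BED file listed above can be used to decide whether a variant falls in the core genome. The sketch below shows one way to do that lookup with the standard library. It assumes the region label is in the fourth BED column and that regions on a contig do not overlap; the contig name, coordinates and label strings are invented for illustration, so check the README for the actual values. Note that BED coordinates are 0-based and half-open, while VCF positions are 1-based.

```python
import bisect
from collections import defaultdict

# Made-up BED content; names, coordinates and labels are assumptions.
EXAMPLE_BED = """\
chr_demo\t0\t1000\tSubtelomeric
chr_demo\t1000\t9000\tCore
chr_demo\t9000\t10000\tSubtelomeric
"""


def load_regions(text):
    """Return {chrom: (sorted start list, [(start, end, label), ...])}."""
    by_chrom = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and BED header lines
        chrom, start, end, label = line.split("\t")[:4]
        by_chrom[chrom].append((int(start), int(end), label))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([iv[0] for iv in intervals], intervals)
    return index


def region_at(index, chrom, pos_1based):
    """Label of the region covering a 1-based position, or None."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    pos0 = pos_1based - 1  # convert VCF position to BED coordinates
    i = bisect.bisect_right(starts, pos0) - 1
    if i >= 0:
        start, end, label = intervals[i]
        if start <= pos0 < end:
            return label
    return None


regions = load_regions(EXAMPLE_BED)
label = region_at(regions, "chr_demo", 1001)
```

Sorting the intervals once and using binary search keeps each lookup fast, which matters when classifying millions of variant positions.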

Publications that have used the P. vivax Genome Variation project data resource, prior to and including version 4