An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples

This page provides information about the Pv4 dataset, which contains genome variation data on 1,895 worldwide samples of Plasmodium vivax. The key publication is MalariaGEN et al, Wellcome Open Research 2022, 7:136 https://doi.org/10.12688/wellcomeopenres.17795.1

Previous releases and background

This dataset is based on the Plasmodium vivax Genome Variation Project, which supported groups around the world to establish the landscape of evolution in P. vivax populations and to help guide control interventions by integrating whole genome sequencing with clinical and epidemiological studies. The project comprises multiple partner studies, united by collective ambitions for open genomic data resources but each with its own research objectives and led by independent investigators all over the world. To address the need for comprehensive, large-scale genetic surveillance of P. vivax populations, we combined centralised sequencing capabilities with a standardised analysis pipeline for variant discovery and genotyping. This produced whole genome sequencing data that partners could use for downstream analysis, in line with MalariaGEN’s guiding principles on equitable data sharing. The work culminated in the open-access P. vivax Genome Variation Project May 2016 data release and the associated analyses in Pearson et al, 2016. The full text article is accessible online.

This platform provided the foundation for a second public data resource (v4), which expands on this model by integrating more samples from partner studies and by including existing sample data generated by the wider scientific community. This is the first large-scale curation of malaria genome variation data across heterogeneous sequencing methodologies and locations. It gives the community access to the largest curated dataset for epidemiological inferences across space and time, while the standardised pipeline minimises the biases that aggregation could otherwise introduce. The combined open resource contains a total of 1,895 samples. The majority were provided by VivaxGEN (1,025) and GlaxoSmithKline (GSK) (357), alongside 297 previously published samples from external studies. Collectively, the resource represents 14 studies from 27 countries and 88 sampling locations, with samples collected primarily between 2001 and 2017. Following on from the initial open data release, we have provided genomic variation data, including SNPs, indels, and tandem duplications. For ease of downstream analysis, we have also included information on population structure, calculated per-sample metrics of within-host diversity, and classified samples into four different types of drug resistance based on a limited set of published genetic markers.

About the data pipeline

Full details of the methods can be found in the accompanying paper. The major changes from the v1 pipeline (May 2016 data release) are that we now a) map to the PvP01 reference genome rather than PvSal1, and b) use a pipeline based on current GATK best practices, analogous to the Pf6 pipeline.

Contents of the release

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: "This publication uses data from the MalariaGEN Plasmodium vivax Genome Variation Project as described in ‘An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples’. MalariaGEN et al. Wellcome Open Research 2022, 7:136 https://doi.org/10.12688/wellcomeopenres.17795.1"

Study information: Details of the 11 contributing partner studies, and 3 external studies, including description, contact information and key people.
Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 1,895 samples from 27 countries.
Measure of complexity of infections: characterisation of within-host diversity (F_WS) for 1,072 QC pass samples.
Drug resistance marker genotypes: genotypes at known markers of drug resistance for 1,895 samples, containing amino acid and copy number genotypes at 3 loci: dhfr, dhps, mdr1.
Inferred resistance status classification: classification of 1,072 QC pass samples into different types of resistance to 4 drugs or combinations of drugs: pyrimethamine, sulfadoxine, mefloquine, and sulfadoxine-pyrimethamine combination.
Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification.
Tandem duplication genotypes: genotypes for tandem duplications discovered in four regions of the genome.
Genome regions and Genome regions index: a BED file classifying genomic regions as core genome or as one of several classes of non-core genome, together with a tabix index file for the genome regions file.
Short variants genotypes: Genotype calls on 4,571,056 SNPs and short indels in 1,895 samples from 27 countries, available both as VCF and zarr files. These are available at: ftp://ngs.sanger.ac.uk/production/malaria/Resource/30.
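
As a rough illustration of how the genome regions BED file listed above might be used, the sketch below parses BED-style lines and reports which region a variant position falls in. The chromosome name, coordinates, and region labels (Core, Subtelomere) are invented for illustration and are not taken from the release; the README documents the actual labels. BED intervals are 0-based and half-open, whereas VCF positions are 1-based, hence the conversion inside the lookup. The lookup also assumes that regions on a chromosome do not overlap.

```python
import bisect
from collections import defaultdict

# Hypothetical BED content: chrom, start (0-based), end (exclusive), label.
# These rows are invented for illustration and are not taken from the release.
EXAMPLE_BED = """\
PvP01_01_v1\t0\t100000\tSubtelomere
PvP01_01_v1\t100000\t900000\tCore
PvP01_01_v1\t900000\t1000000\tSubtelomere
"""


def load_regions(bed_text):
    """Return {chrom: sorted list of (start, end, label)} from BED text."""
    regions = defaultdict(list)
    for line in bed_text.splitlines():
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, label = line.split("\t")[:4]
        regions[chrom].append((int(start), int(end), label))
    for intervals in regions.values():
        intervals.sort()
    return dict(regions)


def region_label(regions, chrom, pos):
    """Label of the region containing 1-based position `pos`, or None."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # convert VCF-style position to BED coordinates
    # Index of the last interval whose start is <= zero_based.
    idx = bisect.bisect_right(intervals, (zero_based, float("inf"), "")) - 1
    if idx >= 0:
        start, end, label = intervals[idx]
        if start <= zero_based < end:
            return label
    return None


regions = load_regions(EXAMPLE_BED)
print(region_label(regions, "PvP01_01_v1", 100000))  # Subtelomere
print(region_label(regions, "PvP01_01_v1", 100001))  # Core
```

For the real file, the same functions could be fed the text of the downloaded BED file; for large-scale queries, the accompanying tabix index is likely the better route.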

A README file describes in detail all the files included in the release, the format and interpretation of each column, and contains some tips and tricks for accessing the genotype data in VCF and zarr files.

The VCF and zarr files in this release can be downloaded from the Wellcome Sanger Institute public FTP site using a freely available FTP client.
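
As one alternative to a graphical FTP client, the sketch below uses Python's standard ftplib module to list and download files. The directory comes from the FTP URL given in the contents list above. Anonymous login is an assumption about the server, and no file names are assumed: the listing function simply returns whatever the directory contains.

```python
from ftplib import FTP
from pathlib import Path
from urllib.parse import urlparse

# URL as given on this page for the short variant genotype files.
RELEASE_URL = "ftp://ngs.sanger.ac.uk/production/malaria/Resource/30"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, remote_directory)."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parsed.hostname, parsed.path or "/"


def list_release(url=RELEASE_URL):
    """Return the names in the release directory (assumes anonymous login)."""
    host, directory = split_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login; an assumption about the server setup
        ftp.cwd(directory)
        return ftp.nlst()


def download_file(name, dest_dir=".", url=RELEASE_URL):
    """Download one file from the release directory in binary mode."""
    host, directory = split_ftp_url(url)
    dest = Path(dest_dir) / Path(name).name
    with FTP(host) as ftp, open(dest, "wb") as handle:
        ftp.login()
        ftp.cwd(directory)
        ftp.retrbinary(f"RETR {name}", handle.write)
    return dest
```

Calling list_release() first and choosing from the returned names avoids guessing file names; the README describes what each file contains.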

The Pv4 user guide is a useful companion to these data, providing information on how to use the malariagen_data Python package to access data in the cloud using free computer services and Jupyter Notebooks without having to first download the resource locally.
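
The sketch below shows the general shape of cloud access with malariagen_data. It assumes that the package is installed (for example with pip install malariagen_data), that it exposes a Pv4 class with sample_metadata() and variant_calls() methods, and that network access is available. These names should be checked against the Pv4 user guide, which remains the authoritative reference.

```python
def open_pv4():
    """Open Pv4 in the cloud and return (sample metadata, variant calls).

    Assumes the malariagen_data package is installed and exposes a Pv4
    class with sample_metadata() and variant_calls(); check the Pv4 user
    guide for the current interface.
    """
    import malariagen_data  # imported lazily: requires a separate install

    pv4 = malariagen_data.Pv4()
    metadata = pv4.sample_metadata()  # expected: a table with one row per sample
    variants = pv4.variant_calls()    # expected: a lazily loaded genotype dataset
    return metadata, variants
```

In a Jupyter Notebook, a call such as metadata, variants = open_pv4() would be the starting point; nothing is downloaded in full until specific data are requested.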

Publications that have used the P. vivax Genome Variation project data resource, prior to and including version 4

Chaturvedi R, Chhibber-Goel J, Verma I, et al. Geographical spread and structural basis of sulfadoxine-pyrimethamine drug-resistant malaria parasites. Int J Parasitol 2021; 51: 505-525.
Dia A, Jett C, Trevino SG, et al. Single-genome sequencing reveals within-host evolution of human malaria parasites. Cell Host Microbe 2021; 29: 1496-1506.e3.
Diez Benavente E, Manko E, Phelan J, et al. Distinctive genetic structure and selection patterns in Plasmodium vivax from South Asia and East Africa. Nat Commun 2021; 12: 3160.
Ba H, Auburn S, Jacob CG, et al. Multi-locus genotyping reveals established endemicity of a geographically distinct Plasmodium vivax population in Mauritania, West Africa. PLoS Negl Trop Dis 2020; 14: e0008945.
Brashear AM, Fan Q, Hu Y, et al. Population genomics identifies a distinct Plasmodium vivax population on the China-Myanmar border of Southeast Asia. PLoS Negl Trop Dis 2020; 14: e0008506.
Dewasurendra RL, Baniecki ML, Schaffner S, et al. Use of a Plasmodium vivax genetic barcode for genomic surveillance and parasite tracking in Sri Lanka. Malar J 2020; 19: 342.
Diez Benavente E, Campos M, Phelan J, et al. A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria. PLoS Genet 2020; 16: e1008576.
Fola AA, Kattenberg E, Razook Z, et al. SNP barcodes provide higher resolution than microsatellite markers to measure Plasmodium vivax population genetics. Malar J 2020; 19: 375.
Noviyanti R, Miotto O, Barry A, et al. Implementing parasite genotyping into national surveillance frameworks: feedback from control programmes and researchers in the Asia–Pacific region. Malar J 2020; 19: 271.
Auburn S, Getachew S, Pearson RD, et al. Genomic analysis of Plasmodium vivax in southern Ethiopia reveals selective pressures in multiple parasite mechanisms. J Infect Dis 2019; 220: 1738–1749.
He WQ, Karl S, White MT, et al. Antibodies to Plasmodium vivax reticulocyte binding protein 2b are associated with protection against P. vivax malaria in populations living in low malaria transmission regions of Brazil and Thailand. PLoS Negl Trop Dis 2019; 13: e0007596.
Gruszczyk J, Kanjee U, Chan LJ, et al. Transferrin receptor 1 is a reticulocyte-specific receptor for Plasmodium vivax. Science 2018; 359: 48-55.
Loy DE, Plenderleith LJ, Sundararaman SA, et al. Evolutionary history of human Plasmodium vivax revealed by genome-wide analyses of related ape parasites. Proc Natl Acad Sci U S A 2018; 115: E8450.
Mbenda HGN, Zeng W, Bai Y, et al. Genetic diversity of the Plasmodium vivax phosphatidylinositol 3-kinase gene in two regions of the China-Myanmar border. Infect Genet Evol 2018; 61: 45.
de Oliveira TC, Rodrigues PT, Menezes MJ, et al. Genome-wide diversity and differentiation in New World populations of the human malaria parasite Plasmodium vivax. PLoS Negl Trop Dis 2017; 11: e0005824.
Vauterin P, Jeffery B, Miles A, et al. Panoptes: web-based exploration of large scale genome variation data. Bioinformatics 2017; 33: 3243–3249.
Auburn S, Serre D, Pearson RD, et al. Genomic analysis reveals a common breakpoint in amplifications of the Plasmodium vivax multidrug resistance 1 locus in Thailand. J Infect Dis 2016; 214: 1235–1242.
Gruszczyk J, Lim NT, Arnott A, et al. Structurally conserved erythrocyte-binding domain in Plasmodium provides a versatile scaffold for alternate receptor engagement. Proc Natl Acad Sci U S A 2016; 113: E191.
Hostetler JB, Lo E, Kanjee U, et al. Independent Origin and Global Distribution of Distinct Plasmodium vivax Duffy Binding Protein Gene Duplications. PLoS Negl Trop Dis 2016; 10: e0005091.
Pearson RD, Amato R, Auburn S, et al. Genomic analysis of local variation and recent evolution in Plasmodium vivax. Nat Genet 2016; 48: 959-964.