======================================================================
MalariaGEN Plasmodium vivax Genome Variation Project - Pv4 data release
======================================================================

Date: 2022-02-17

======================================================================
Description
======================================================================

Through an analysis of 1,895 parasite samples collected at 88 different locations in Asia, America, Africa and Oceania, we identified more than 4 million variant positions (SNPs and indels). This download includes genotyping data for samples contributed to the MalariaGEN Plasmodium vivax Genome Variation Project and other publicly available sequencing data.

For more information on the Plasmodium vivax Genome Variation Project that generated these data, please visit: https://www.malariagen.net/projects/p-vivax-genome-variation


The methods used to generate the data are described in detail in MalariaGEN, biorxiv (2022), DOI: ???.


======================================================================
Citation information
======================================================================

Publications using these data should acknowledge and cite the source of the data using the following format: "This publication uses data from the MalariaGEN Plasmodium vivax Genome Variation Project as described in An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples, biorxiv, 2022 (DOI: ???)."


======================================================================
Files in the release
======================================================================

Files in the release include:

    - Pv4_samples.txt : sample metadata file in tab-delimited format
    - Pv4_tandem_duplication_genotypes.txt : tandem duplication genotypes file in tab-delimited format
    - Pv4_drug_resistance_marker_genotypes.txt : drug resistance marker genotypes file in tab-delimited format
    - Pv4_inferred_resistance_status_classification.txt : inferred resistance status classification file in tab-delimited format
    - Pv4_fws.txt : Fws values in tab-delimited format
    - Pv4_resistance_classification_20220217.pdf : rules used to create Pv4_inferred_resistance_status_classification.txt in pdf format
    - Pv4_studies.pdf : details on partner studies
    - Pv4_regions.bed.gz defines the core genome used for analysis, and other regions of the genome
    - Pv4_regions.bed.gz.tbi is a tabix index file for Pv4_regions.bed.gz
    - Pv4_vcf : directory containing vcf files, one per chromosome
    - Pv4.zarr.zip : sample genotypes in zipped zarr format


File descriptions:

======================================================================
- Pv4_samples.txt
======================================================================

This file includes sample metadata for all 1,895 samples collected from partners and public databases, and details of sequence read data available at the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena). It contains the following columns:

    - Sample                                Unique ID of each sample (which can be used to link to other sample information and genotypes)
    - Study                                 Code of the partner study which collected this sample
    - Site                                  Location the sample was collected (in the case of returning travellers this is the country visited)
    - First-level administrative division   GADM administrative division in which this site lies. If unknown the name of the country only is given
    - Country                               Country in which the sample was collected (in the case of returning travellers this is the country visited)
    - Lat                                   GPS coordinate of the latitude of the Site
    - Long                                  GPS coordinate of the longitude of the Site
    - Year                                  Year in which the sample was collected
    - ENA                                   ENA run accession(s) for the sequencing read data. In some cases multiple runs of sequencing data were merged
    - All samples same individual           Identifies all the samples in the data set which were collected from the same individual, either at the same or different time points
    - Population                            Population to which we assigned the sample (SAM=South America, AF=Africa, WAS=Western Asia, WSEA=Western SE Asia, ESEA=Eastern SE Asia, MSEA=Maritime SE Asia, OCE=Oceania, unassigned=Sample was not assigned to one of the 7 populations)
    - % callable                            Percentage of the genome which has coverage of at least 5 reads and less than 10% of reads with mapping quality 0
    - QC pass                               Flag indicated whether the sample passed QC (True=passed QC, False=failed QC)
    - Exclusion reason                      Reason samples failed QC
    - Is returning traveller                Flag indicating whether the sample was from a returning traveller (True) or not (False)


======================================================================
- Pv4_tandem_duplication_genotypes.txt
======================================================================

This file contains genotypes for tandem duplications discovered in four regions of the genome for all 1,895 samples, derived from analysis of sequence data. It contains the following columns:

    - Sample                         Unique ID of each sample (which can be used to link to other sample information and genotypes)
    - DBP_cn                         Copy number of dbp. Heterozygous duplications will have a copy number of 1.5. If copy number is undetermined will be shown as -1.0
    - DBP_breakpoints                Tandem duplication breakpoints around dbp. Refer to Supplementary Table ??? for details.
    - PvP01_09_cn                    Copy number of eight genes on chromosome 9. Heterozygous duplications will have a copy number of 1.5. If copy number is undetermined will be shown as -1.0
    - PvP01_09_breakpoints           Tandem duplication breakpoints around eight genes on chromosome 9. Refer to Supplementary Table ??? for details.
    - MDR1_cn                        Copy number of nine genes on chromosome 10 including mdr1. Heterozygous duplications will have a copy number of 1.5. If copy number is undetermined will be shown as -1.0
    - MDR1_breakpoints               Tandem duplication breakpoints around nine genes on chromosome 10 including mdr1. Refer to Supplementary Table ??? for details.
    - PVP01_1468200_cn               Copy number of PVP01_1468200. Heterozygous duplications will have a copy number of 1.5. If copy number is undetermined will be shown as -1.0
    - PVP01_1468200_breakpoints      Tandem duplication breakpoints around dbp. Refer to Supplementary Table ??? for details.


======================================================================
- Pv4_drug_resistance_marker_genotypes.txt
======================================================================

This file contains genotypes at drug resistance markers for all 1,895 samples derived from analysis of sequence data. It contains the following columns:

    - Sample                         Unique ID of each sample (which can be used to link to other sample information and genotypes)
    - dhfr_57[F]                     Amino acid at dhfr position 57. For explanation see below.
    - dhfr_58[R]                     Amino acid at dhfr position 58. For explanation see below.
    - dhfr_61[T]                     Amino acid at dhfr position 61. For explanation see below.
    - dhfr_111[V]                    Amino acid at dhfr position 111. For explanation see below.
    - dhfr_117[N]                    Amino acid at dhfr position 117. For explanation see below.
    - dhfr_173[I]                    Amino acid at dhfr position 173. For explanation see below.
    - dhps_383[A]                    Amino acid at dhps position 383. For explanation see below.
    - cn_mdr1                        Copy number of mdr1. Heterozygous duplications will have a copy number of 1.5. If copy number is undetermined will be shown as -1.0
    - breakpoint_mdr1                Tandem duplication breakpoints around mdr1. Refer to Supplementary Table ??? for details.

Explanation of amino acid columns in dhfr and dhps:

Each value can have a single haplotype if homozygous or two haplotypes separated by a comma if heterozygous
It is possible to have heterozygous calls where both amino acid haplotypes are the same. The heterozygosity here is at the nucleotide level. These could perhaps be considered homozygous alt.
- represents missing (missing genotype in at least one of the positions)
X represents an unphased het followed by another het. Because hets are unphased it is not possible to resolve the two haplotypes. These are perhaps best considered missing.
? represents two different values in the GATK PID field. Again it is not possible to resolve the two haplotypes in such cases. These are perhaps best considered missing.
! represents a frame-shift in the haplotype. These are perhaps best considered missing.


======================================================================
- Pv4_inferred_resistance_status_classification.txt
======================================================================

This file includes sample phenotype data for all 1,895 samples derived from the data in Pv4_drug_resistance_marker_genotypes.txt, using the rules outlined in "Pv4_resistance_classification.pdf". It contains the following columns:

    - Pyrimethamine                  Pyrimethamine resistance status. Resistant/Sensitive/Undetermined
    - Sulfadoxine                    Sulfadoxine resistance status. Resistant/Sensitive/Undetermined
    - Mefloquine                     Mefloquine resistance status. Resistant/Sensitive/Undetermined
    - SP (uncomplicated)             Sulfadoxine-Pyrimethamine treatment resistance status. Samples carrying the dhfr triple mutant, which is strongly associated with SP failure. Resistant/Sensitive/Undetermined


======================================================================
- Pv4_fws.txt
======================================================================

This file includes Fws values for 1,031 analysis set samples that passed QC and were assigned to one of the seven populations. It contains the following columns:

    - Sample                         Unique ID of each sample (which can be used to link to other sample information and genotypes)
    - Fws                            Fws value


======================================================================
- Pv4_resistance_classification_20220217.pdf
======================================================================

This file describes rules used to create Pv4_inferred_resistance_status_classification.txt


======================================================================
- Pv4_vcf
======================================================================

This directory contains vcf files, one per chromosome. Each file is in bgzip format (.vcf.gz) and has an associated tabix index file (.vcf.gz.tbi) and MD5 checksum (.vcf.gz.md5). There are seventeen files in total, fourteen for each of the autosomes (PvP01_01_v1 - PvP01_14_v1), one for the mitochondrial sequence (PvP01_MIT_v1), one for the apicoplast sequence (PvP01_API_v1) and one containing a concatenation of all the short contigs (PvP01_short_contigs which contains Transfer.PvP01_00_1.final - Transfer.PvP01_00_239.final).

The files, once unzipped, are tab-separated text files, but may be too large to open in Excel.

The VCF format is described in https://github.com/samtools/hts-specs

Tools to assist in handling VCF files are freely available from
http://samtools.github.io/bcftools/

The VCF files contains details of 4,571,056 discovered variant genome positions.
These variants were discovered amongst all samples from the release.
3,083,454 of these variant positions are SNPs, with the remainder being either
short insertion/deletions (indels), or a combination of SNPs and indels. It is
important to note that many of these variants are considered low quality. Only
the variants for which the FILTER column is set to PASS should be considered of
reasonable quality. There are 1,303,984 such PASS variants of which 945,649
are SNPs and 358,335 indels.

The FILTER column is based on two types of information. Firstly certain regions
of the genome are considered "non-core". This includes sub-telomeric regions,
centromeres and internal hypervariable regions on chromosomes 4
(PvP01_04_v1:685685-748923, containing 13 SERA genes), 10
(PvP01_10_v1:1327961-1375229, containing 9 MSP3 genes) and 12
(PvP01_12_v1:792292-818496, containing 12 MSP7 genes). All
variants within non-core regions are considered to be low quality, and hence
will not have the FILTER column set to PASS. The regions which are core and
non-core can be found in the file
ftp://ngs.sanger.ac.uk/production/malaria/pf-crosses/1.0/regions-20130225.onebased.txt.

Secondly, variants are filtered out based on a quality score called VQSLOD. All
variants with a VQSLOD score below 0 are filtered out, i.e. will have a value of
Low_VQSLOD in the FILTER column, rather than PASS. The VQSLOD score for each
variant can be found in the INFO field of the VCF file. It is possible to use
the VQSLOD score to define a more or less stringent set of variants (see next
section for further details).

It is also important to note that many variants have more than two alleles. For
example, amongst the 1,303,984 variants, 972,167 are biallelic. The
remaining 331,817 PASS variants have 3 or more alleles. The maximum number of
alternative alleles represented is 6. Note that some positions can in truth have
more than 6 alternative alleles, particularly those at the start of short tandem
repeats. In such cases, some true alternative alleles will be missing.

In addition to alleles representing SNPs and indels, some variants have an
alternative allele denoted by the * symbol. This is used to denote a "spanning
deletion". For samples that have this allele, the base at this position has been
deleted. Note that this is not the same as a missing call - the * denotes that
there are reads spanning across this position, but that the reads have this
position deleted yet map on either side of the deletion. For further details see
https://software.broadinstitute.org/gatk/guide/article?id=6926

In addition to the VQSLOD score mentioned above, The INFO field contains many
other variant-level metrics. The metrics QD, FS, SOR, DP are all measures
related to the quality of the variant. The VQSLOD score is derived from these
four metrics.

AC contains the number of non-reference alleles amongst the samples in the file.
Because the file contains diploid genotype calls, homozygous non-reference calls
will be counted as two non-reference alleles, whereas heterozygous calls will be
counted as one non-reference allele. Where a variant position has more than one
one non-reference allele, counts of each different non-reference allele are
given. AN contains the total number of called alleles, including reference
alleles. A simple non-reference allele frequency can be calculated as AC/AN.
AC and AN values are all specfic to the samples in the study the VCF was created
for.

Various functional annotations are held in the the SNPEFF variables of the INFO
field. Where appropriate, the amino acid change caused by the variant can be
found in SNPEFF_AMINO_ACID_CHANGE. Note that for multi-allelic variants, only
one annotation is given, and therefore this should not be relied on for non-
biallelic variants. SNPEFF_AMINO_ACID_CHANGE also does not take account of
nearby variants, so if two SNPs are present in the same codon, the
amino acid change given is likely to be wrong. Similarly, if two coding indels
are found in the same exon, the SNPEFF annotations are likely to be wrong. This
situation occurs at the CRT locus (see next section for further details).

Coding variants are identified using the CDS flag in the INFO field.

Columns 10 and onwards of the VCF contain the information for each sample.
The first component of this (GT) is always the diploid genotype call as
determined by GATK. A value of 0/0 indicates a homozygous reference call. A
value of 1/1 indicates a homozygous alternative allele call. 0/1 indicates a
heterozygous call. A value of 2 indicates the sample has the second alternative
allele, i.e. the second value in the ALT column. For example 2/2 would mean the
sample is homozygous the the second alternative allele, 0/2 would mean the
sample is heterozygous for the reference and second alternative alleles, and 1/2
would mean the sample is heterozygous for the first and second alternative
alleles. A value of ./. indicates a missing genotype call, usually because there
are no reads mapping at this position in that sample.


Recommendations regarding sets of variants to use in analyses
-------------------------------------------------------------

Variants are filtered using the VQSLOD metric. VQSLOD is log(p/q) where p is the
probability of being true and q is the probability of being false. Theoretically,
when VQSLOD > 0, p is greater than q, and therefore the variant is more likely
true than false. Conversely, when VQSLOD < 0, the variant is theoretically more
like false than true. This is why we have chosen 0 as the threshold to use to
declare that variants have passed the filters: all PASS variants are
theoretically more likely true than false. Of course, for variants where VQSLOD
is only slightly above 0, there is only a slightly greater probability of being
true than of being false. Therefore, for example, many of the variants with
values between 0 and 1 are likely to be false.

Empirically we have found that SNPs tend to be more accurate than indels, coding
variants tend to be more accurate than non-coding variants, and bi-allelic
variants tend to be more accurate than multi-allelic variants. If you require a
very reliable set of variants for genome-wide analysis, and don't mind if you
miss some real variants, we recommend using only bi-allelic coding SNPs in the
core genome with a VQSLOD score > 6. There are 33,020 such stringent SNPs in the
call set. We include a command below to create such a set of variants.

If instead you would like to know of all likely variation within a certain
region, even if this means including a few false variants, we recommend using
all PASS variants. Finally, if you want to ensure you miss as little as possible
of the true variation, at the risk of including large numbers of false positives,
you could ignore the FILTER column and use all variants in the VCF.

In general, we recommend caution in analysing indels. For any given sample, the
majority of differences from the reference genome are likely to be due to indels
in low-complexity non-coding regions, e.g. in length polymorphisms of short
tandem repeats (STRs), such as homopolymer runs or AT repeats. In general, it is
difficult to map short reads reliably in such regions, and this is compounded by
the fact that these regions tend to have high AT content, and in general we
typically have much lower coverage in high AT regions. Indels also tend to be
multi-allelic, making analysis much more challenging than for (typically
bi-allelic) SNPs.

Despite what is written above, it may often be important to analyse indels in
order to determine the true nature of variation at a locus.


Extracting data from the VCF file
-----------------------------

We recommend the use of bcftools. To install bcftools, follow the instructions
at: https://github.com/samtools/bcftools/wiki/HOWTOs

The following are some commands which you might find useful for extracting data
from the vcf.gz files. We've used an example the vcf for chromosome 10
(Pv4_PvP01_10_v1.final.vcf.gz), but similar commands should work on all vcf files.

To extract sample IDs and put into a file, one per line:
bcftools query --list-samples Pv4_PvP01_10_v1.final.vcf.gz > samples.txt

To extract chromosome, position, reference allele, all alternate alleles,
filter value and VQSLOD for all variants into a tab-delimited file:
bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT{0}\t%ALT{1}\t%ALT{2}\t%ALT{3}\t%ALT{4}\t%ALT{5}\t%FILTER\t%VQSLOD\n' \
Pv4_PvP01_10_v1.final.vcf.gz > all_variants.txt

To extract chromosome, position, reference allele, all alternate alleles and
VQSLOD for PASS SNPs only into a tab-delimited file:
bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT{0}\t%ALT{1}\t%ALT{2}\t%ALT{3}\t%ALT{4}\t%ALT{5}\t%VQSLOD\n' \
--include 'FILTER="PASS" && TYPE="snp"' \
Pv4_PvP01_10_v1.final.vcf.gz > pass_snps.txt

To extract chromosome, position, reference allele, alternate allele and VQSLOD
for biallelic PASS SNPs only into a tab-delimited file:
bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\n' \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1' \
Pv4_PvP01_10_v1.final.vcf.gz > biallelic_pass_snps.txt

To extract chromosome, position, reference allele, alternate allele and VQSLOD
for biallelic PASS SNPs that are segregating within a subset of samples into a
tab-delimited file:
bcftools view \
--samples PD0165-C,PD0166-C,PD0167-C \
Pv4_PvP01_10_v1.vcf.gz | \
bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\n' \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1 && AC>0' \
> biallelic_segregating_pass_snps.txt

bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\n' \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1 && AC>0' \
Pv4_PvP01_10_v1.final.vcf.gz > biallelic_segregating_pass_snps.txt

To create a vcf file which contains only PASS bi-allelic coding SNPs with
VQSLOD > 6:
bcftools view \
--include 'FILTER="PASS" && N_ALT=1 && CDS==1 && TYPE="snp" && VQSLOD>6.0' \
--output-type z \
--output-file output_filename.vcf.gz \
Pv4_PvP01_10_v1.final.vcf.gz
bcftools index --tbi output_filename.vcf.gz

To extract diploid genotype calls for biallelic PASS SNPs in gene MDR1 into a
tab-delimited text file, including the chromosome, position, ref and alt
alleles, VQSLOD score and amino acid substituion, and a header containing
sample names:
bcftools query \
-f '%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\t%SNPEFF_AMINO_ACID_CHANGE[\t%GT]\n' \
--regions PvP01_10_v1:478739-483133 \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1' \
--print-header \
Pv4_PvP01_10_v1.final.vcf.gz > mdr1_genotypes.txt

To extract ref allele depths for biallelic PASS SNPs in gene MDR1 into a
tab-delimited text file, including the chromosome, position, ref and alt
alleles, VQSLOD score and amino acid substituion, and a header containing
sample names:
bcftools query \
-f '%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\t%SNPEFF_AMINO_ACID_CHANGE[\t%AD{0}]\n' \
--regions PvP01_10_v1:478739-483133 \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1' \
--print-header \
Pv4_PvP01_10_v1.final.vcf.gz > mdr1_ref_allele_depth.txt

To extract alt allele depths for biallelic PASS SNPs in gene MDR1 into a
tab-delimited text file, including the chromosome, position, ref and alt
alleles, VQSLOD score and amino acid substituion, and a header containing
sample names:
bcftools query \
-f '%CHROM\t%POS\t%REF\t%ALT{0}\t%VQSLOD\t%SNPEFF_AMINO_ACID_CHANGE[\t%AD{1}]\n' \
--regions PvP01_10_v1:478739-483133 \
--include 'FILTER="PASS" && TYPE="snp" && N_ALT=1' \
--print-header \
Pv4_PvP01_10_v1.final.vcf.gz > mdr1_alt_allele_depth.txt


======================================================================
- Pv4.zarr.zip
======================================================================

This file contains the information that is encoded in the VCF files, but in zipped zarr format.

We recommend analysing data using the scikit-allel package with the zarr file. For more details
on using scikit-allel, please see https://scikit-allel.readthedocs.io/en/stable/


======================================================================
Release notes:
======================================================================

Data excluded from release:
Sequence read data on samples collected in Indonesia cannot be made publically available because of national export restrictions.