Pf3k pilot data release 4

Project: Pf3k

Released on 16 Oct 2015

This release contains sample information and accession numbers, analysis BAMs, and de novo variant discovery and genotyping across 2,512 samples collected in 14 countries, as well as five lab strains included for method development and validation.

At the time of release, these data were subject to the Pf3k Pilot Phase Terms of Use. These restrictions were lifted in September 2016, and the dataset is now open access.

4.0 Data

This data set includes sample information and analysis BAMs for 2,517 samples: the 2,512 field samples plus the five lab strains.

The files in this data set include:

  • A table of sample metadata in tab-delimited and Excel file formats. This table includes the accessions for downloading the sequence reads from the European Nucleotide Archive (ENA), the sampling location, the contributing partner study ID and contact person, and mapping metadata including sequence coverage metrics.
  • Analysis BAM files. These files, one per sample, contain alignments of the raw sequence reads to the 3D7v3.1 reference genome.

These data can be downloaded from the Wellcome Trust Sanger Institute public FTP site.
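As an illustration of working with the tab-delimited sample metadata table, the following minimal Python sketch counts samples per value of one column. The column names (`sample`, `country`) and the rows are hypothetical placeholders, not taken from the released file; check the header of the actual table before adapting it.

```python
import csv
import io
from collections import Counter


def count_samples_by_column(handle, column):
    """Count rows per distinct value of `column` in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header: {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# In-memory stand-in for the real table.
# Column names and values are hypothetical placeholders.
example = io.StringIO(
    "sample\tcountry\n"
    "sample_1\tcountry_A\n"
    "sample_2\tcountry_A\n"
    "sample_3\tcountry_B\n"
)

counts = count_samples_by_column(example, "country")
for value, n in counts.most_common():
    print(value, n)
```

To run this against the downloaded table, pass an open text file handle in place of `example`.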

4.1 Data

This data set contains a set of de novo genotypes for the 4.0 sample set. The genotyping, covering both indel and SNP variants, was performed using a pipeline based on GATK best practices. These genotypes should not be taken as a quality-controlled output of the Pf3k project and are provided for public interest and as a basis for future methods development.

For more information, see the README files on the FTP site.

Files in this release include:

These data can be downloaded from the Wellcome Trust Sanger Institute public FTP site.

Known issues

16 Oct 2015
VCF version

The VCF files are tagged as conforming to version 4.1 of the VCF specification, but are in fact version 4.2. This results from a known bug in the version of GATK that was current at the time of release.
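Tools that branch on the declared version may want to check what a file actually states. The sketch below assumes the version tag is the `##fileformat=` line that the VCF specification requires as the first line of the file; the helper names are illustrative and not part of any Pf3k or GATK tooling.

```python
import gzip
import io


def declared_vcf_version(handle):
    """Return the version declared on the first line, e.g. 'VCFv4.1'."""
    first = handle.readline().rstrip("\r\n")
    prefix = "##fileformat="
    if not first.startswith(prefix):
        raise ValueError("first line is not a ##fileformat header")
    return first[len(prefix):]


def open_vcf_text(path):
    """Open a plain or gzip-compatible compressed VCF as text.

    Assumes compressed files carry a .gz extension.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# In-memory stand-in for a real file; the header content is illustrative.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
print(declared_vcf_version(example))
```

For a downloaded file, `declared_vcf_version(open_vcf_text(path))` reports the declared version; for this release it is expected to read 4.1 even though the content follows 4.2.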