Ag1000G phase 1 AR3 data release

Project: Ag1000G

Released on 22 Jul 2015

This data release includes variant calls and associated data for 845 mosquito specimens — 765 wild-caught specimens collected from eight countries across sub-Saharan Africa, and 80 specimens comprising parents and progeny of four crosses. All mosquitoes were sequenced by the Wellcome Trust Sanger Institute’s Malaria programme.

Any use of Project data is subject to the Terms of Use.


This data release comprises variant call data, available in both VCF and HDF5 formats, and other associated data files.

All of the data files included in this release can be downloaded from the Wellcome Trust Sanger Institute public FTP site.

NOTE: Many browsers no longer support links to FTP sites. If you experience difficulties, you may need to change your browser settings.

Release notes

22 Jul 2015
Genome accessibility

This release includes new data on genome accessibility. The “accessibility” directory within the FTP site contains files providing a number of metrics of genome accessibility for each position in the AgamP3 reference genome, derived from alignments of sequence reads from the 765 wild-caught samples to the reference. Also included is a mask specifying which positions are considered accessible and which are not.
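
As a rough illustration of how an accessibility mask might be applied, the sketch below treats the mask as a sequence of true/false flags, one per reference position, and uses it to keep only variant positions that fall in accessible sites. The in-memory representation, the function names and the toy values are assumptions for illustration only; the actual file layout in the "accessibility" directory is not described here.

```python
from typing import List, Sequence


def accessible_fraction(mask: Sequence[bool]) -> float:
    """Return the fraction of positions flagged as accessible."""
    if not mask:
        return 0.0
    return sum(1 for flag in mask if flag) / len(mask)


def keep_accessible(positions: Sequence[int], mask: Sequence[bool]) -> List[int]:
    """Keep the 1-based positions whose mask entry is True.

    Positions outside the range covered by the mask are dropped.
    """
    return [pos for pos in positions if 1 <= pos <= len(mask) and mask[pos - 1]]


# Toy data (hypothetical): an 8-position "chromosome" and four variant sites.
toy_mask = [True, True, False, True, False, False, True, True]
toy_positions = [1, 3, 4, 8]

print(accessible_fraction(toy_mask))
print(keep_accessible(toy_positions, toy_mask))
```

With real data the mask would be loaded from the released files rather than written out by hand, and a whole chromosome arm would be better held in an array type than in a Python list.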

22 Jul 2015
Crosses

Also new in this release are variant calls for four crosses between parents derived from various established colonies, including the Mali and Pimperena colonies. Each cross comprises two parents and around 18 progeny. The “variation/crosses” directory contains variant calls in both VCF and HDF5 formats.

22 Jul 2015
Variant filtering

The raw variant calls for the main phase 1 cohort of 765 wild-caught samples have not changed since the previous phase 1 AR2 release. However, the variant filtering strategy is different: variant filters now make use of the genome accessibility metrics mentioned above. The new filtering strategy is generally more conservative than that of the AR2 release, so some variants that previously passed all filters may now fail one or more filters. Variant calls for the 765 wild-caught samples are in the “variation/main” directory, in both VCF and HDF5 formats.
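
To see how many variants pass or fail under the new filters, one option is to tally the FILTER column of a VCF file. The sketch below relies only on the standard VCF convention that FILTER is the seventh tab-separated column, holding "PASS", "." (filters not applied) or a semicolon-separated list of failed filter names. The filter names "FilterA" and "FilterB" and the toy records are hypothetical and do not correspond to the filters used in this release.

```python
import io
from collections import Counter
from typing import IO


def count_filters(vcf: IO[str]) -> Counter:
    """Tally FILTER values over the records of a VCF text stream."""
    counts: Counter = Counter()
    for line in vcf:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        # A record failing several filters lists them separated by ';'.
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts


# Toy VCF content (hypothetical records and filter names).
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "2L\t100\t.\tA\tT\t50\tPASS\t.\n"
    "2L\t200\t.\tG\tC\t20\tFilterA\t.\n"
    "2L\t300\t.\tC\tT\t10\tFilterA;FilterB\t.\n"
)

filter_counts = count_filters(toy_vcf)
print(filter_counts)
```

For a compressed file, the same function could be given a handle from gzip.open(path, "rt") in place of the in-memory stream.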

22 Jul 2015
Haplotypes

In addition to the unphased genotype calls, this release includes phased haplotypes estimated for both the 765 wild-caught individuals and the parents and progeny of the crosses. Data are available in the “haplotypes” directory in HDF5 and SHAPEIT formats. The directory also includes estimates of phasing error rates across the genome.
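
As an illustration of reading phased data, the sketch below parses one line in the usual SHAPEIT haplotype text layout: five leading columns (chromosome, variant identifier, position, first allele, second allele) followed by two 0/1 columns per individual. It assumes the files in this release follow that standard layout; the record type, function name and example line are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class HapsRecord(NamedTuple):
    chrom: str
    variant_id: str
    pos: int
    alleles: Tuple[str, str]
    haplotypes: List[Tuple[int, int]]  # one (hap1, hap2) pair per individual


def parse_haps_line(line: str) -> HapsRecord:
    """Parse one whitespace-separated line of a SHAPEIT-style haplotype file."""
    fields = line.split()
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError("expected 5 leading columns plus 2 columns per individual")
    calls = [int(value) for value in fields[5:]]
    # Pair consecutive columns: even offsets are the first haplotype,
    # odd offsets the second haplotype of the same individual.
    pairs = list(zip(calls[0::2], calls[1::2]))
    return HapsRecord(fields[0], fields[1], int(fields[2]), (fields[3], fields[4]), pairs)


# Toy line (hypothetical): one variant, two individuals.
record = parse_haps_line("2L snp1 100 A T 0 1 1 1")
print(record.pos, record.alleles, record.haplotypes)
```

Here 0 denotes the first listed allele and 1 the second, so the first toy individual is heterozygous with the first allele on its first haplotype.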