Ag1000G phase 1 AR2 data release
Project: Ag1000G

Released on 8 Dec 2014.


This data release includes variant calls on 765 mosquito specimens collected from eight countries across sub-Saharan Africa and sequenced by the Wellcome Trust Sanger Institute’s Malaria programme.

Any use of Project data is subject to the Terms of Use.

Data sets


This data release comprises variant call data, available as either VCF or HDF5 format files, and other supporting data files, including a table of sample metadata.

All of the data files included in this release can be downloaded from the Wellcome Trust Sanger Institute public FTP site.

The same data files are also available from Amazon S3; see the following URL for a list of file locations:

If you are downloading files, please use the Sanger FTP site where possible. The ag1000g-eu S3 bucket is hosted in the eu-west-1 region, so it is fastest and most cost-efficient when accessed from AWS compute resources hosted in the same region.

NOTE: Many browsers no longer support links to FTP sites. If you are experiencing difficulties, you may need to change your browser settings.


Release notes

Organisation of VCF files
15 Dec 2014

There are two VCF files available for each chromosome arm. One file contains all SNPs discovered (e.g., ag1000g.phase1.AR2.2L.vcf.gz) and the second contains only those SNPs that passed all quality filters (e.g., ag1000g.phase1.AR2.2L.PASS.vcf.gz). For most analyses we recommend working only with PASS variants, so the PASS.vcf.gz files will be more convenient to use.
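If you only have the full file for a chromosome arm, the PASS subset can be recovered by checking the FILTER column of each record. The following is a minimal sketch using only the Python standard library; the function name iter_pass_records is hypothetical, and it assumes the standard VCF column order in which FILTER is the seventh tab-separated field.

```python
import gzip


def iter_pass_records(vcf_path):
    """Yield the tab-split fields of each record whose FILTER column is PASS."""
    # Compressed VCF files (.vcf.gz) can be read as text with the gzip module.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information lines and the column header
            fields = line.rstrip("\n").split("\t")
            # Standard VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
            if len(fields) > 6 and fields[6] == "PASS":
                yield fields
```

For example, sum(1 for _ in iter_pass_records("ag1000g.phase1.AR2.2L.vcf.gz")) would count the PASS records in the full 2L file, assuming it has been downloaded to the working directory.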

Variant filters
15 Dec 2014

A number of annotations have been added to the FILTER column in the VCF files. These annotations indicate quality filters that apply to the given variant. The VCF file headers contain information on the meaning of each of the filters used.
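The filter definitions can be read from the header without loading any variant records. The sketch below is illustrative only: the function name read_filter_descriptions is hypothetical, and it assumes the header uses the usual VCF form ##FILTER=<ID=...,Description="..."> with ID first. It makes no assumption about which filter names appear in this release.

```python
import gzip
import re

# Matches header lines such as: ##FILTER=<ID=SomeName,Description="Some text">
FILTER_LINE = re.compile(r'^##FILTER=<ID=([^,>]+),Description="(.*)">\s*$')


def read_filter_descriptions(vcf_path):
    """Return a dict mapping each FILTER ID in the header to its description."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    descriptions = {}
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines end at the column header
            match = FILTER_LINE.match(line)
            if match:
                descriptions[match.group(1)] = match.group(2)
    return descriptions
```

Because the function stops at the first line that does not start with ##, it reads only the header even for a very large file.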

HDF5 files
6 Oct 2015

The HDF5 files (*.h5) contain data extracted from the VCF files but organised as binary arrays. For many analyses it is more efficient to access variation data via these HDF5 files than it is to process the VCF files directly. If you are familiar with the VCF files then the layout of data within the HDF5 files should be fairly self-explanatory. If you have any questions, please email
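One way to explore the layout of an HDF5 file is to walk its hierarchy and print the shape and data type of every array. The following is a rough sketch, assuming the third-party h5py package is installed; the helper names describe_node and list_datasets are hypothetical, and nothing is assumed about the group or dataset names used in this release.

```python
def describe_node(name, node):
    """Return a one-line summary for a dataset, or None for a group."""
    shape = getattr(node, "shape", None)
    dtype = getattr(node, "dtype", None)
    if shape is None or dtype is None:
        return None  # groups carry no shape or dtype of their own
    return f"{name}\tshape={tuple(shape)}\tdtype={dtype}"


def list_datasets(h5_path):
    """Return a summary line for every dataset found in the HDF5 file."""
    import h5py  # third-party package, imported here so it is only needed when a file is opened

    summaries = []

    def visitor(name, node):
        line = describe_node(name, node)
        if line is not None:
            summaries.append(line)
        # Returning None tells visititems to keep walking the hierarchy.

    with h5py.File(h5_path, "r") as h5:
        h5.visititems(visitor)
    return summaries
```

Printing each line returned by list_datasets for a downloaded *.h5 file gives an overview of the arrays it holds, which can then be compared against the corresponding VCF columns and fields.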

Open access


Our approach to sharing data

Data package contact


To cite these data directly, please use the following citation format:

The Anopheles gambiae 1000 Genomes Consortium (2014): Ag1000G phase 1 AR2 data release. MalariaGEN.