Ag1000G phase 1 AR2 data release

Project: Ag1000G

Released on 8 Dec 2014


This data release comprises variant call data, available in either VCF or HDF5 format, and other supporting data files, including a table of sample metadata.

All of the data files included in this release can be downloaded from the Wellcome Trust Sanger Institute public FTP site.

The same data files are also available from Amazon S3. See the following URL for a list of file locations:

If you are downloading files, please use the Sanger FTP site where possible. The ag1000g-eu S3 bucket is hosted in the eu-west-1 region, so it is fastest and most cost-efficient when the data are accessed from AWS compute resources hosted in the same region.

NOTE: Many browsers no longer support links to FTP sites. If you experience difficulties, you may need to change your browser settings.

Release notes

15 Dec 2014
Organisation of VCF files

There are two VCF files available for each chromosome arm. The first contains all SNPs discovered (e.g., ag1000g.phase1.AR2.2L.vcf.gz) and the second contains only those SNPs that passed all quality filters (e.g., ag1000g.phase1.AR2.2L.PASS.vcf.gz). For most analyses it is recommended to work only with PASS variants, so the PASS.vcf.gz files will be more convenient to use.
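
If you only have the full (unfiltered) file to hand, the relationship between the two files can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used to build the release: it assumes the standard VCF column order, in which FILTER is the seventh tab-separated field, and the function name iter_pass_records is made up for this example.

```python
import gzip


def iter_pass_records(vcf_path):
    """Yield the tab-split fields of each record whose FILTER column is PASS.

    Assumes standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL,
    FILTER, INFO, ...), so FILTER is at index 6.
    """
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and column header lines
            fields = line.rstrip("\n").split("\t")
            if fields[6] == "PASS":
                yield fields
```

For example, sum(1 for _ in iter_pass_records("ag1000g.phase1.AR2.2L.vcf.gz")) would count the PASS records on chromosome arm 2L. For routine work the ready-made PASS.vcf.gz files remain the simpler option.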

15 Dec 2014
Variant filters

A number of annotations have been added to the FILTER column in the VCF files. These annotations indicate quality filters that apply to the given variant. The VCF file headers contain information on the meaning of each of the filters used.
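
One way to read the filter definitions without scrolling through the header by hand is sketched below. It assumes the header follows the usual VCF convention of ##FILTER=<ID=...,Description="..."> meta-information lines, with ID given before Description. The function name read_filter_descriptions is hypothetical.

```python
import gzip
import re

# Matches lines such as: ##FILTER=<ID=SomeFilter,Description="What it means">
FILTER_LINE = re.compile(r'^##FILTER=<ID=([^,>]+),Description="(.*)">\s*$')


def read_filter_descriptions(vcf_path):
    """Return a dict mapping each FILTER ID in the header to its description."""
    descriptions = {}
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines end at the #CHROM header line
            match = FILTER_LINE.match(line)
            if match:
                descriptions[match.group(1)] = match.group(2)
    return descriptions
```

Calling read_filter_descriptions("ag1000g.phase1.AR2.2L.vcf.gz") should then give one entry per filter declared in that file. Only the header is read, so the call returns quickly even for large files.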

15 Dec 2015
HDF5 files

The HDF5 files (*.h5) contain data extracted from the VCF files but organised as binary arrays. For many analyses it is more efficient to access variation data via these HDF5 files than it is to process the VCF files directly. If you are familiar with the VCF files then the layout of data within the HDF5 files should be fairly self-explanatory. If you have any questions, please email ag1000g-public [at]
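
A quick way to see how an HDF5 file is laid out is to list every array it contains, along with its shape and data type. The sketch below assumes the third-party h5py package is installed. It makes no assumptions about the group or dataset names inside the release files, and the function names are made up for this example.

```python
def list_datasets(group, prefix=""):
    """Recursively yield (path, shape, dtype) for every array under a group."""
    for name, node in group.items():
        path = f"{prefix}/{name}"
        if hasattr(node, "shape"):
            # Datasets expose shape and dtype.
            yield path, node.shape, node.dtype
        else:
            # Anything else is treated as a sub-group and descended into.
            yield from list_datasets(node, path)


def describe_file(h5_path):
    """Open an HDF5 file read-only and return a list of its datasets."""
    import h5py  # third-party; imported here so list_datasets has no hard dependency

    with h5py.File(h5_path, "r") as h5_file:
        return list(list_datasets(h5_file))
```

Printing each tuple returned by describe_file("path/to/file.h5") gives an overview of the arrays that can be compared against the columns and fields of the corresponding VCF file.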