Ag1000G phase 1 preview data release
This data release comprises variant call data, available as either VCF or HDF5 format files, and other supporting data files, including a table of sample metadata.
All of the data files included in this release can be downloaded from the Wellcome Trust Sanger Institute public FTP site.
The same data files are also available from Amazon S3; see the following URL for a list of file locations:
If you are downloading files, please use the Sanger FTP site where possible. The ag1000g-eu S3 bucket is hosted in the eu-west-1 region, and so is fastest and most cost-efficient when downloading data into AWS compute resources hosted in the same region.
In the HDF5 format files, where there is a missing genotype call, other call data fields (e.g., GQ, AD, DP) may have incorrect values due to a bug in the format conversion software. This applies only to missing genotype calls; otherwise, the call data fields in the HDF5 format files are correct and correspond to the data in the VCF format files.
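One way to guard against the affected values is to blank out the other call data fields wherever the genotype call is missing. The following is a minimal sketch in plain Python, not part of the release tooling. It assumes the genotype and field arrays have already been read into nested lists (variants by samples) and that a missing allele is encoded as a negative number such as -1. Both the encoding and the function name mask_missing_calls are assumptions made for illustration; check them against the actual HDF5 files.

```python
def mask_missing_calls(genotypes, values, fill=None):
    """Return a copy of `values` with `fill` wherever the genotype call is missing.

    genotypes: list (variants) of lists (samples) of allele-index tuples.
    values:    list (variants) of lists (samples), e.g. GQ or DP.
    A call is treated as missing if any allele index is negative
    (assumed encoding, e.g. -1 for '.').
    """
    masked = []
    for gt_row, val_row in zip(genotypes, values):
        masked.append([
            fill if any(allele < 0 for allele in gt) else value
            for gt, value in zip(gt_row, val_row)
        ])
    return masked


# Two variants, two samples; the second sample has a missing call at variant 0.
genotypes = [[(0, 0), (-1, -1)],
             [(0, 1), (1, 1)]]
gq = [[99, 37],
      [45, 60]]
masked_gq = mask_missing_calls(genotypes, gq)
```

The same function can be applied to each affected field in turn (GQ, DP, and so on), leaving values for non-missing calls untouched.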
Four of the FILTER annotations that are declared in the header of the VCF were not actually applied to the variants due to an error in the VCF processing pipeline. These FILTER annotations are:
##FILTER=<ID=FS,Description="FS > 60">
##FILTER=<ID=MQ,Description="MQ < 40">
##FILTER=<ID=QD,Description="QD < 5">
##FILTER=<ID=ReadPosRankSum,Description="ReadPosRankSum < -8">
If you use these data, it is recommended that you apply these variant filters yourself prior to any analysis. If you use GATK to apply these filters, you must use JEXL expressions with the correct value type. These are all Float fields, so, for example, the correct expression for the FS filter is "FS > 60.0".
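For readers who prefer to apply the four filters outside GATK, the thresholds can be checked directly against the INFO column of each VCF record. The following is a minimal sketch in plain Python. The thresholds are taken from the FILTER declarations listed above; the function names parse_info and failed_filters are hypothetical, and the choice to let a record pass a filter when the annotation is absent is an assumption of this sketch, not something stated in the release.

```python
# Thresholds copied from the FILTER declarations in the VCF header.
# All four annotations are Float fields, so values are compared as floats.
FILTERS = {
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "QD": lambda v: v < 5.0,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def failed_filters(info):
    """Return the names of the four filters that this INFO string fails."""
    fields = parse_info(info)
    failed = []
    for name, fails in FILTERS.items():
        value = fields.get(name)
        # Assumption: a missing annotation does not fail the filter.
        if value is None or value is True or value == ".":
            continue
        if fails(float(value)):
            failed.append(name)
    return failed


example = failed_filters("DP=100;FS=75.2;MQ=50.0;QD=3.1;ReadPosRankSum=-1.5")
```

In this example the record fails FS (75.2 > 60) and QD (3.1 < 5), so a record like this would be excluded before analysis.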
This preview release is a subset of a larger callset which will be released in the near future. The Multiallelic filter was applied to the larger callset, and so some variants annotated in this preview release as Multiallelic will actually only have two segregating alleles.
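To check whether a variant annotated as Multiallelic really has more than two segregating alleles among the samples in this preview release, the alleles observed in the genotype calls can be counted directly. The following is a minimal sketch in plain Python. It assumes genotype calls for one variant are available as tuples of allele indices (0 for the reference allele, 1 and above for alternate alleles) with negative values marking missing alleles. The function names are hypothetical.

```python
def segregating_alleles(genotype_row):
    """Return the sorted allele indices observed in non-missing calls of one variant."""
    observed = set()
    for call in genotype_row:
        for allele in call:
            if allele >= 0:  # assumed: negative index means missing
                observed.add(allele)
    return sorted(observed)


def is_multiallelic_in_subset(genotype_row):
    """True if more than two distinct alleles are observed among the samples."""
    return len(segregating_alleles(genotype_row)) > 2


# Only alleles 0 and 2 are observed here: allele 1 does not segregate
# in this subset, so the variant has just two segregating alleles.
row_a = [(0, 0), (0, 2), (-1, -1), (2, 2)]
# All of alleles 0, 1 and 2 are observed here.
row_b = [(0, 1), (1, 2)]
alleles_a = segregating_alleles(row_a)
alleles_b = segregating_alleles(row_b)
```

A variant like row_a would carry the Multiallelic annotation from the larger callset while having only two segregating alleles in the preview samples.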