Data Formats

 

Flatfiles Readable Ex Silico - version 1.0

These flat-file formats are intended to be both readable by the human eye and easily parsed computationally. Files in these formats are distinguished by the characters 'fs1' in their filenames.

Genotype data

The Affymetrix 500K SNP chip can yield approximately 2 GB of data per cohort, so this platform's genotype data have been partitioned by chromosome and sorted by SNP position.

Each file is tab-delimited and contains one genotype per line. Regardless of how the SNPs are organised, all assays are sorted by sample, so that a file can be readily separated into sample blocks. Note also that all Affymetrix genotypes have been oriented to the '+' strand of the SNP.

The following is a brief example of the genotype data format:

SNP          SAMPLE         GENOTYPE     SCORE
rs1234567    ID-XXXXXXX     CC           0.9262
rs1234568    ID-XXXXXXX     TC           0.8650
rs1234569    ID-XXXXXXX     AA           0.9117
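
Because the files are sorted by sample, they can be split into sample blocks in a single pass. The following Python sketch illustrates one way to do this. It assumes the four columns appear in the order shown above and that a header line starting with "SNP" may or may not be present; the function names read_genotypes and sample_blocks are illustrative only and are not part of any software distributed with the data.

```python
import csv
import io
from itertools import groupby


def read_genotypes(handle):
    """Yield (snp, sample, genotype, score) tuples from a tab-delimited file."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and an optional header line (assumed to start with "SNP").
        if not row or row[0].strip() == "SNP":
            continue
        snp, sample, genotype, score = (field.strip() for field in row[:4])
        yield snp, sample, genotype, float(score)


def sample_blocks(records):
    """Group consecutive records by sample; relies on the file being sorted by sample."""
    for sample, block in groupby(records, key=lambda record: record[1]):
        yield sample, list(block)


# The example rows shown above, joined with tabs.
example = (
    "SNP\tSAMPLE\tGENOTYPE\tSCORE\n"
    "rs1234567\tID-XXXXXXX\tCC\t0.9262\n"
    "rs1234568\tID-XXXXXXX\tTC\t0.8650\n"
    "rs1234569\tID-XXXXXXX\tAA\t0.9117\n"
)

blocks = list(sample_blocks(read_genotypes(io.StringIO(example))))
for sample, genotypes in blocks:
    print(sample, len(genotypes))
```

With a real file, the io.StringIO object would be replaced by an open file handle; since records are read lazily, only one sample block is held in memory at a time.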

Sample support files

We are providing data from two cohorts, together with files describing each sample. These files are tab-delimited and contain each sample's gender, cohort, plate and well number, and ethnic group. They are denoted 'samples' files; for example, Affymetrix_20080506fs1_samples_AFC.txt

The following is a brief example of a sample support file:

SAMPLE         GENDER*     COHORT     PLATE/WELL     ETHNICITY**
ID-XXXXXX1     2           AFC        12701b2        Jola
ID-XXXXXX2     1           AFC        12701c2        Fula
ID-XXXXXX3     2           AFC        12701d2        Others

* Females are denoted 2 and males 1; samples whose gender is undefined on the manifest are denoted 0.
** Ethnic information is available only for the major ethnic groups; all other groups have been pooled and labelled "Others".
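
As an illustration of how a 'samples' file might be loaded and its gender codes decoded, here is a minimal Python sketch. It assumes the five columns appear in the order shown in the example and that any header line starts with "SAMPLE"; the exact header text in the real files is not specified here, so the columns are read by position. The names GENDER_CODES and read_samples are hypothetical.

```python
import csv
import io

# Gender coding as described in the footnote to the example table.
GENDER_CODES = {"0": "undefined", "1": "male", "2": "female"}


def read_samples(handle):
    """Return a dict mapping sample ID to its descriptive fields."""
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and an optional header line (assumed to start with "SAMPLE").
        if not row or row[0].strip() == "SAMPLE":
            continue
        sample, gender, cohort, plate_well, ethnicity = (f.strip() for f in row[:5])
        samples[sample] = {
            "gender": GENDER_CODES[gender],  # raises KeyError on an unexpected code
            "cohort": cohort,
            "plate_well": plate_well,
            "ethnicity": ethnicity,
        }
    return samples


# The example rows shown above, joined with tabs.
example = (
    "SAMPLE\tGENDER\tCOHORT\tPLATE/WELL\tETHNICITY\n"
    "ID-XXXXXX1\t2\tAFC\t12701b2\tJola\n"
    "ID-XXXXXX2\t1\tAFC\t12701c2\tFula\n"
    "ID-XXXXXX3\t2\tAFC\t12701d2\tOthers\n"
)

samples = read_samples(io.StringIO(example))
print(samples["ID-XXXXXX1"]["gender"])
```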

Note that, for some data sets on this site, the chromosome X data have been split into two 'chromosomes', 23 and 24, because the region not homologous with Y (23) needed to be treated differently from the pseudoautosomal region (24).
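
Code that processes these data sets may need to map the numeric codes back to chromosome X. The short Python sketch below shows one possible mapping; it assumes chromosomes are given as integers 1 to 24 with 1 to 22 being autosomes, and the function name describe_chromosome is hypothetical.

```python
def describe_chromosome(code):
    """Map a numeric chromosome code to (chromosome name, region type)."""
    code = int(code)
    if 1 <= code <= 22:
        return (str(code), "autosomal")
    if code == 23:
        # Region of X that is not homologous with Y.
        return ("X", "non-pseudoautosomal")
    if code == 24:
        # Pseudoautosomal region of X.
        return ("X", "pseudoautosomal")
    raise ValueError(f"unexpected chromosome code: {code}")


print(describe_chromosome(23))
print(describe_chromosome("24"))
```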

Normalised signals

Quantile-normalised signal data were generated from the Affymetrix intensity ('CEL') files and used as input to the CHIAMO genotype calling program. Software to perform the normalisation is available (see Available software). The signal data are in tab-delimited plain text; there is one line per SNP, consisting of IDs, position, alleles and, for each sample, a pair of intensities, one for each of the two alleles. All alleles have also been oriented to the '+' strand of the SNP.

The following is a brief example of a signal file:

AFFYID         RSID   pos    AlleleA  AlleleB  1234A1_A  1234A1_B  1234A2_A  ...
SNP_A-0123456  rs001  10000  C        T        0.407238  1.366599  0.347438  ...
SNP_A-0123457  rs002  20000  A        G        0.958866  1.084143  0.148448  ...
SNP_A-0123458  rs003  30000  C        G        1.943426  0.291587  1.610764  ...

Please note that these files may contain very long lines and are not intended to be human-readable.
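
Since the lines are long, signal files are best read one SNP at a time. The Python sketch below shows one way to pair up the per-sample intensities. It assumes that the first five columns are as in the example, that the intensity columns follow in A, B order for each sample, and that sample names can be recovered by stripping the "_A" suffix from the header; the function name read_signals is hypothetical. For brevity, the example input contains only the one sample for which both intensities are shown above.

```python
import io

FIXED_COLUMNS = 5  # AFFYID, RSID, pos, AlleleA, AlleleB


def read_signals(handle):
    """Yield one dict per SNP, with a (A, B) intensity pair for each sample."""
    header = handle.readline().rstrip("\n").split("\t")
    intensity_columns = header[FIXED_COLUMNS:]
    if len(intensity_columns) % 2 != 0:
        raise ValueError("expected an even number of intensity columns")
    # Assumed naming: "<sample>_A" followed by "<sample>_B" for each sample.
    sample_names = [name[:-2] for name in intensity_columns[0::2]]

    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < FIXED_COLUMNS:
            continue  # skip blank or truncated lines
        affy_id, rsid, pos, allele_a, allele_b = fields[:FIXED_COLUMNS]
        values = [float(value) for value in fields[FIXED_COLUMNS:]]
        pairs = zip(values[0::2], values[1::2])
        yield {
            "affy_id": affy_id,
            "rsid": rsid,
            "pos": int(pos),
            "alleles": (allele_a, allele_b),
            "intensities": dict(zip(sample_names, pairs)),
        }


example = (
    "AFFYID\tRSID\tpos\tAlleleA\tAlleleB\t1234A1_A\t1234A1_B\n"
    "SNP_A-0123456\trs001\t10000\tC\tT\t0.407238\t1.366599\n"
    "SNP_A-0123457\trs002\t20000\tA\tG\t0.958866\t1.084143\n"
    "SNP_A-0123458\trs003\t30000\tC\tG\t1.943426\t0.291587\n"
)

rows = list(read_signals(io.StringIO(example)))
for row in rows:
    print(row["rsid"], row["intensities"]["1234A1"])
```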

 
