Giraffe Data

Giraffe Paper Data

These data files were used in preparing the manuscript “Pangenomics enables genotyping known structural variants in 5,202 diverse genomes”, and will be useful for anyone looking to repeat or build on the experiments.

Basic layout

The data is available from https://cgl.gi.ucsc.edu/data/giraffe/. This archive is laid out as follows:

  • calling: Files for the structural variant genotyping analysis
  • construction: Input files used by the giraffe-sv-paper graph construction scripts
    • yeast: Input files related specifically to constructing graphs for the yeast experiments
  • mapping: Files related to mapping experiments with Giraffe and other mappers
    • reads: Reads for mapping
      • real: Real reads for mapping or training the read simulator
        • NA19239, NA19240, HG002, and HG003: Real reads from those human samples
        • yeast: Real reads from yeast for the yeast experiments
      • sim: Simulated reads for mapping accuracy analysis
        • for-NA19239 and for-NA19240: Reads simulated to match each sample’s haplotypes
          • 1000gplons or hgsvc: Reads simulated for the 1000GPlons graph or the HGSVC graph
            • hs38d1: All reads were simulated from graphs using this reference
              • hiseq2500, hiseqxten, and novaseq6000: Illumina sequencing technology whose reads were used to train the error model
                • out_sim_gbwt: All reads were simulated using GBWT haplotypes
                  • sim.gam: File with simulated reads in vg Graph Alignment/Map format
    • graphs: Graphs to map to
      • for-NA19239 and for-NA19240: Sample whose subgraph or sub-GBWT was used to simulate reads for the graph
        • 1000gplons or hgsvc: Data source used to obtain the variants to build the graph
          • hs38d1: All graphs were built up from this linear reference
      • generic: Graphs with no associated sample
        • cactus: Graphs built from Cactus alignments, for the yeast experiments
          • yeast_all: Graphs using a full set of yeast samples
          • yeast_subset: Graphs using a subset of the yeast samples used in the manuscript
        • primary: Stick-shaped graphs containing only the linear “primary” reference contigs, with no variation
          • S288C: Yeast primary reference graph
          • hs38d1: Human primary reference graph
  • products: Contains output files recommended for re-use, if any, organized according to its own README
  • software: Contains code used in the manuscript, organized according to its own README
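
The nesting above corresponds to paths under the archive URL. As a minimal sketch, the shell function below assembles the URL of one simulated-read file. The function name sim_gam_url is hypothetical, and the directory spellings are assumed to match the listing exactly, so check the result against the archive before relying on it.

```shell
#!/bin/sh
# Base URL of the archive, as given in this README.
GIRAFFE_DATA_URL="https://cgl.gi.ucsc.edu/data/giraffe"

# Hypothetical helper: print the URL of a simulated-read GAM file.
# Arguments: sample (e.g. NA19239), graph (1000gplons or hgsvc),
# technology (hiseq2500, hiseqxten, or novaseq6000).
# Assumption: directory names on the server match the layout listing.
sim_gam_url() {
    sample="$1"; graph="$2"; tech="$3"
    printf '%s/mapping/reads/sim/for-%s/%s/hs38d1/%s/out_sim_gbwt/sim.gam\n' \
        "$GIRAFFE_DATA_URL" "$sample" "$graph" "$tech"
}

# Example: reads simulated for NA19239 on the HGSVC graph, NovaSeq 6000 model.
sim_gam_url NA19239 hgsvc novaseq6000
```

The printed URL can then be handed to a downloader such as curl or wget.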

Code

All files should be readable by vg v1.32.0, with commit SHA1 hash 095c529f8c70521ca60cf0435bac1b0b4ffd1f6d. The tool is available on GitHub at https://github.com/vgteam/vg/releases/tag/v1.32.0. A fully static binary for GNU/Linux 3.2.0 and a gzip-compressed source tar file are available.
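
Before working with the files, it can help to confirm which vg release is on the PATH. The sketch below is one way to do that. The helper name check_vg_version is hypothetical, and it assumes that the first line of the output of "vg version" contains the release tag.

```shell
#!/bin/sh
# Release that the archive's files are documented to be readable with.
EXPECTED_VG_VERSION="v1.32.0"

# Hypothetical helper: succeed if the first line printed by `vg version`
# mentions the expected release; otherwise report what was found and fail.
# Assumption: the release tag appears on the first line of that output.
check_vg_version() {
    first_line="$(vg version 2>/dev/null | head -n 1)"
    case "$first_line" in
        *"$EXPECTED_VG_VERSION"*)
            return 0 ;;
        *)
            echo "expected vg $EXPECTED_VG_VERSION, found: ${first_line:-none}" >&2
            return 1 ;;
    esac
}
```

A typical use would be "check_vg_version && echo ok" at the top of an analysis script.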

The code to reproduce the analysis in the paper can be found on the web or in the software folder of this archive.

Read Mapping

For the mapping experiments, we used a series of graphs.

For each graph, the graph itself is available in “.vg” and “.xg” files. The “.xg” files are all in XG format, while the “.vg” files are in either VPKG-encapsulated PackedGraph or VG Protobuf format, depending on the graph.

Graphs have other associated index files.

The “.snarls” and “.trivial.snarls” files contain information on the snarl decomposition of the graph, excluding or including “trivial” snarls with no internal nodes, respectively. These are not needed for mapping, but can be exported to JSON with “vg view -Rj file.snarls”.

The “.gcsa” and “.gcsa.lcp” files contain the full-text indexes needed for mapping with VG-MAP.

The “.dist” files, required for mapping with Giraffe, contain the Giraffe distance indexes.

Graphs can also have one or more sets of “.gbwt”, “.gg”, and “.min” files, which contain the haplotype database, the graph node sequence information, and the minimizer lookup information, respectively. A set of matched files is required for mapping with Giraffe. The “.gg” file can be regenerated from the GBWT and the graph, and obviates the need to pass the full graph to Giraffe.
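
As a rough sketch of how a matched set fits together on the command line, the helpers below map interleaved FASTQ reads with Giraffe and rebuild a “.gg” file. The function names and argument order are hypothetical. The flags are believed to match vg v1.32.0 but should be checked against “vg giraffe --help” and “vg gbwt --help” before use.

```shell
#!/bin/sh
# Hypothetical helper: map interleaved paired-end FASTQ reads with Giraffe.
#   $1: shared prefix of the matched .gbwt/.gg/.min files
#   $2: the graph's .dist file
#   $3: interleaved FASTQ file
# Alignments are written to standard output by vg.
giraffe_map() {
    prefix="$1"; dist="$2"; fastq="$3"
    vg giraffe \
        -H "$prefix.gbwt" \
        -g "$prefix.gg" \
        -m "$prefix.min" \
        -d "$dist" \
        -f "$fastq" -i
}

# Hypothetical helper: regenerate the .gg file from the graph and the GBWT.
#   $1: graph in .xg format
#   $2: shared prefix of the matched set
rebuild_gg() {
    xg="$1"; prefix="$2"
    vg gbwt -x "$xg" -g "$prefix.gg" "$prefix.gbwt"
}
```

For example, "giraffe_map graph.sampled.64 graph.dist reads.fq > mapped.gam" would map against a set whose files share the prefix graph.sampled.64. The file names here are placeholders, not names taken from the archive.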

These matched sets come in several versions. The “raw” set, with no additional extension, contains the original haplotypes articulated in the input VCF (for VCF-based graphs) or the named paths of the input alignment (for alignment-based graphs). The “full” set starts with the raw graph and covers connected components without haplotype data with synthesized haplotypes as described in the manuscript. The “cover.number” and “sampled.number” sets use haplotypes produced to cover graph elements, or by sampling from the raw haplotypes with additional haplotypes for connected components with no data, as described in the manuscript. Each is built to have the included number’s worth of haplotype coverage overall. The “cover” sets, with no associated number, use haplotypes produced to cover graph elements up to an overall coverage of 16. Of these sets, the “sampled.64” set is recommended for general use.
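
The naming scheme for the matched sets can be checked mechanically. The sketch below assumes that a set's files are named by appending the set name and then the file extension to the graph's base name (for example base.sampled.64.gbwt), with the raw set using the base name alone. That concatenation and both helper names are assumptions, so compare them with the actual file names in the archive.

```shell
#!/bin/sh
# Hypothetical helper: print the shared file prefix for one matched set.
#   $1: the graph's base name
#   $2: set name ("raw", "full", "cover", "cover.N", or "sampled.N");
#       defaults to the recommended "sampled.64"
# Assumption: files are named <base>.<set>.<ext>; the raw set is <base>.<ext>.
matched_set_prefix() {
    base="$1"; set_name="${2:-sampled.64}"
    if [ "$set_name" = "raw" ]; then
        printf '%s\n' "$base"
    else
        printf '%s.%s\n' "$base" "$set_name"
    fi
}

# Hypothetical helper: succeed only if all three files of the set exist.
have_matched_set() {
    prefix="$(matched_set_prefix "$1" "${2:-sampled.64}")"
    for ext in gbwt gg min; do
        [ -f "$prefix.$ext" ] || { echo "missing: $prefix.$ext" >&2; return 1; }
    done
}
```

Running a check such as "have_matched_set graph full" before mapping reports the first missing file of the set.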

The available graphs are as follows:

For experiments that rely on surjecting GAM files into BAM files during graph alignment, the following sequence dictionaries are provided:

Additionally, the simulated reads used in the mapping experiments are available. These reads are available in vg’s Graph Alignment/Map (GAM) format, annotated with their true positions along named paths, and can be converted to interleaved FASTQ with the command vg view -aiX file.gam, or to JSON with the command vg view -aj file.gam. All files contain interleaved, paired reads.
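
The two conversions above can be wrapped in small shell helpers, as sketched below. The function names are hypothetical, and the record counter assumes the FASTQ output uses exactly four lines per read, with no line wrapping.

```shell
#!/bin/sh
# Hypothetical helper: convert a GAM file of paired reads to interleaved FASTQ.
gam_to_fastq() {
    vg view -aiX "$1"
}

# Hypothetical helper: convert a GAM file to JSON.
gam_to_json() {
    vg view -aj "$1"
}

# Count FASTQ records on standard input.
# Assumption: each record occupies exactly four lines.
count_fastq_records() {
    awk 'END { print int(NR / 4) }'
}
```

For example, "gam_to_fastq sim.gam | count_fastq_records" prints the number of reads in the file. Because the reads are paired and interleaved, this should be an even number.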

Structural variant genotyping

The files for this analysis are in the calling folder.

While evaluating SV genotyping accuracy, we reproduced the analysis from Hickey et al. on the HGSVC graph and the GIAB graph (although now using GIAB v0.6). The indexes for these pangenomes, including new Giraffe indexes, are available:

There are also indexes for the pangenome containing the combined SV catalogs.

There is a summary of the structural variants genotyped across 2,000 samples from the MESA cohort and their frequencies.

The structural variants found in the high-coverage 1000 Genomes Project dataset are:

Other files, such as VCF files and files related to eQTLs, are provided in the products folder.