Giraffe Paper Data
These data files were used in preparing the manuscript “Pangenomics enables genotyping known structural variants in 5,202 diverse genomes” and will be useful for anyone looking to repeat or build on the experiments.
Basic layout
The data is available from https://cgl.gi.ucsc.edu/data/giraffe/. This archive is laid out as follows:
- calling: Files for the structural variant genotyping analysis
- construction: Input files used by the giraffe-sv-paper graph construction scripts
  - yeast: Input files related specifically to constructing graphs for the yeast experiments
- mapping: Files related to mapping experiments with Giraffe and other mappers
  - reads: Reads for mapping
    - real: Real reads for mapping or training the read simulator
      - NA19239, NA19240, HG002, and HG003: Real reads from those human samples
      - yeast: Real reads from yeast for the yeast experiments
    - sim: Simulated reads for mapping accuracy analysis
      - for-NA19239 and for-NA19240: Reads simulated to match each sample’s haplotypes
        - 1000gplons or hgsvc: Reads simulated for the 1000GPlons graph or the HGSVC graph
          - hs38d1: All reads were simulated from graphs using this reference
            - hiseq2500, hiseqxten, and novaseq6000: Illumina sequencing technology whose reads are used to train the error model
              - out_sim_gbwt: All reads were simulated using GBWT haplotypes
                - sim.gam: File with simulated reads in vg Graph Alignment/Map format
  - graphs: Graphs to map to
    - for-NA19239 and for-NA19240: Sample used for the subgraph or sub-GBWT to simulate reads from for the graph
      - 1000gplons or hgsvc: Data source used to obtain the variants to build the graph
        - hs38d1: All graphs were built up from this linear reference
    - generic: Graphs with no associated sample
      - cactus: Graphs built from Cactus alignments, for the yeast experiments
        - yeast_all: Graphs using a full set of yeast samples
        - yeast_subset: Graphs using a subset of the yeast samples used in the manuscript
      - primary: Stick-shaped graphs containing only the linear “primary” reference contigs, with no variation
        - S288C: Yeast primary reference graph
        - hs38d1: Human primary reference graph
- products: Contains output files recommended for re-use, if any, organized according to its own README
- software: Contains code used in the manuscript, organized according to its own README
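As an illustration of how the layout maps to download locations, here is a minimal Python sketch that builds the presumed URL of one simulated read set. It assumes the directories nest in exactly the order they are listed above, and it does not check that a given combination of sample, graph, and sequencing technology is actually present in the archive. The function name `sim_reads_url` is hypothetical.

```python
from urllib.parse import urljoin

# Base URL of the archive, as given in the text above.
BASE_URL = "https://cgl.gi.ucsc.edu/data/giraffe/"

# Directory names as they appear in the layout listing.
SAMPLES = {"NA19239", "NA19240"}
GRAPHS = {"1000gplons", "hgsvc"}
TECHNOLOGIES = {"hiseq2500", "hiseqxten", "novaseq6000"}


def sim_reads_url(sample, graph, technology):
    """Build the presumed URL of one simulated read set.

    Assumption: directories nest in the order shown in the layout.
    Not every combination is guaranteed to exist in the archive.
    """
    if sample not in SAMPLES:
        raise ValueError(f"unknown sample: {sample}")
    if graph not in GRAPHS:
        raise ValueError(f"unknown graph: {graph}")
    if technology not in TECHNOLOGIES:
        raise ValueError(f"unknown technology: {technology}")
    relative = "/".join([
        "mapping", "reads", "sim", f"for-{sample}", graph,
        "hs38d1", technology, "out_sim_gbwt", "sim.gam",
    ])
    return urljoin(BASE_URL, relative)


print(sim_reads_url("NA19239", "1000gplons", "novaseq6000"))
```

If a URL built this way does not resolve, browsing the archive from the base URL is the reliable way to find the file.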
Code
All files should be readable by vg v1.32.0, with commit SHA1 hash 095c529f8c70521ca60cf0435bac1b0b4ffd1f6d. The tool is available on GitHub at https://github.com/vgteam/vg/releases/tag/v1.32.0. A fully static binary for GNU/Linux 3.2.0 and a gzip-compressed source tar file are available.
The code to reproduce the analysis in the paper can be found on the web or in the archive.
Read Mapping
For the mapping experiments, we used a series of graphs.
For each graph, the graph itself is available in “.vg” and “.xg” files. The “.xg” files are all in XG format, while the “.vg” files are in either VPKG-encapsulated PackedGraph or VG Protobuf format, depending on the graph.
Graphs have other associated index files.
The “.snarls” and “.trivial.snarls” files contain information on the snarl decomposition of the graph, excluding or including “trivial” snarls with no internal nodes, respectively. These are not needed for mapping, but can be exported to JSON with “vg view -Rj file.snarls”.
The “.gcsa” and “.gcsa.lcp” files contain the full-text indexes needed for mapping with VG-MAP.
The “.dist” files, required for mapping with Giraffe, contain the Giraffe distance indexes.
Graphs can also have one or more sets of “.gbwt”, “.gg”, and “.min” files, which contain the haplotype database, the graph node sequence information, and the minimizer lookup information, respectively. A set of matched files is required for mapping with Giraffe. The “.gg” file can be regenerated from the GBWT and the graph, and obviates the need to pass the full graph to Giraffe.
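The file types described so far can be summarized as a small lookup table. The Python sketch below is only a reading aid. The role descriptions paraphrase the text above, and the example file names are hypothetical names following the “HGSVC_hs38d1.*” pattern.

```python
# Role of each file extension, paraphrased from the descriptions above.
FILE_ROLES = {
    ".vg": "graph (VPKG-encapsulated PackedGraph or VG Protobuf)",
    ".xg": "graph (XG format)",
    ".snarls": "snarl decomposition, excluding trivial snarls",
    ".trivial.snarls": "snarl decomposition, including trivial snarls",
    ".gcsa": "full-text index for VG-MAP",
    ".gcsa.lcp": "full-text index for VG-MAP",
    ".dist": "Giraffe distance index",
    ".gbwt": "haplotype database",
    ".gg": "graph node sequence information",
    ".min": "minimizer lookup information",
}


def describe(filename):
    """Return the role of a file based on its extension, or None if unknown."""
    # Try longer suffixes first so ".trivial.snarls" wins over ".snarls".
    for suffix in sorted(FILE_ROLES, key=len, reverse=True):
        if filename.endswith(suffix):
            return FILE_ROLES[suffix]
    return None


# Hypothetical file names, for illustration only.
print(describe("HGSVC_hs38d1.trivial.snarls"))
print(describe("HGSVC_hs38d1.dist"))
```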
These matched sets come in several versions. The “raw” set, with no additional extension, contains the original haplotypes articulated in the input VCF (for VCF-based graphs) or the named paths of the input alignment (for alignment-based graphs). The “full” set starts with the raw graph and covers connected components without haplotype data with synthesized haplotypes as described in the manuscript. The “cover.number” and “sampled.number” sets use haplotypes produced to cover graph elements, or by sampling from the raw haplotypes with additional haplotypes for connected components with no data, as described in the manuscript. Each is built to have the included number’s worth of haplotype coverage overall. The “cover” sets, with no associated number, use haplotypes produced to cover graph elements up to an overall coverage of 16. Of these sets, the “sampled.64” set is recommended for general use.
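To make the naming of the matched sets concrete, the following Python sketch lists the three file names that would make up one set. It assumes the set name is placed between the graph's file prefix and the file extension (for example “prefix.sampled.64.gbwt”). That placement is an inference from the description above, not something the archive listing confirms, so check the actual directory before relying on it. The function name `matched_set` is hypothetical.

```python
import re

# The three file types that make up one matched set for Giraffe.
GIRAFFE_EXTENSIONS = ("gbwt", "gg", "min")

# Allowed set names: "" (raw), "full", "cover", "cover.N", "sampled.N".
SET_NAME = re.compile(r"(?:full|cover(?:\.\d+)?|sampled\.\d+)?")


def matched_set(prefix, set_name=""):
    """Return the presumed file names of one matched set.

    Assumption: the set name sits between the graph prefix and the extension.
    """
    if not SET_NAME.fullmatch(set_name):
        raise ValueError(f"unrecognized set name: {set_name!r}")
    stem = f"{prefix}.{set_name}" if set_name else prefix
    return [f"{stem}.{ext}" for ext in GIRAFFE_EXTENSIONS]


# The set recommended for general use, for the mapping graph.
print(matched_set("1000GPlons_hs38d1_filter", "sampled.64"))
```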
The available graphs are as follows:
- The human 1000 Genomes Project Liftover No Seg Dupes (1000GPlons) graph
  - With simulation samples removed, as used for mapping, in the files named “1000GPlons_hs38d1_filter.*”.
  - Without those samples removed, as used for simulation, in the files named “1000GPlons_hs38d1.*”.
- The human HGSVC graph, in the files named “HGSVC_hs38d1.*”. The simulation samples were not removed from this graph.
- The linear human reference version 38 graph (a negative control for the HGSVC graph), in the files named “primaryhs38d1.*”.
- The yeast graph used as a mapping target, in the files named “yeast_subset.*”.
- The graph used to simulate yeast reads, in the files named “yeast_all.*”.
- The linear yeast Saccharomyces cerevisiae S288c reference graph (a negative control for the yeast graph), in the files named “primaryS288C.*”.
For experiments that rely on surjecting GAM files into BAM files during graph alignment, the following sequence dictionaries are provided:
- No Seg Dupes Reference FASTA Sequence Dictionary
  - Used for alignments to graph files named “1000GPlons_hs38d1_filter.*” or “1000GPlons_hs38d1.*”.
- Full Reference FASTA Sequence Dictionary
  - Used for alignments to linear graph files named “primaryhs38d1.*”.
Additionally, the simulated reads used in the mapping experiments are available. These reads are available in vg’s Graph Alignment/Map (GAM) format, annotated with their true positions along named paths, and can be converted to interleaved FASTQ with the command “vg view -aiX file.gam”, or to JSON with the command “vg view -aj file.gam”. All files contain interleaved, paired reads.
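For batch conversion, the two commands above can be wrapped in a short script. The Python sketch below assumes that `vg` is installed and on the PATH and that “vg view” writes its result to standard output. The helper names are hypothetical, and the file names in the usage comment are placeholders.

```python
import subprocess


def gam_to_fastq_command(gam_path):
    """Command from the text above: GAM to interleaved FASTQ."""
    return ["vg", "view", "-aiX", gam_path]


def gam_to_json_command(gam_path):
    """Command from the text above: GAM to JSON."""
    return ["vg", "view", "-aj", gam_path]


def convert(command, output_path):
    """Run a conversion command and write its standard output to a file.

    Assumes vg is on the PATH; raises CalledProcessError if the command fails.
    """
    with open(output_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


# Example usage (placeholder file names; requires vg to be installed):
# convert(gam_to_fastq_command("sim.gam"), "sim.fq")
# convert(gam_to_json_command("sim.gam"), "sim.json")
```

Keeping the command construction separate from the execution makes it easy to print and inspect a command before running it on a large file.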
- For the 1000GPlons graph, reads simulated to resemble sample NA19239 as sequenced by different Illumina sequencing technologies are available:
- For the HGSVC graph, reads simulated to resemble sample NA19240 as sequenced by different Illumina sequencing technologies are available:
- For the yeast graph, reads simulated to resemble different yeast strains as sequenced by a single Illumina sequencing technology are available:
Structural variant genotyping
The files for this analysis are in the calling folder.
During the evaluation of SV genotyping accuracy, we reproduced the analysis from Hickey et al. on the HGSVC graph and the GIAB graph (now using GIAB v0.6). The indexes for these pangenomes, including new Giraffe indexes, are available:
There are also indexes for the pangenome containing the combined SV catalogs.
A summary of the structural variants genotyped across 2,000 samples from the MESA cohort, with their frequencies, is available:
- Information about each SV site, including frequencies of the most expressed and second most expressed alleles
- Information about each SV allele, grouped by SV site
The structural variants found in the high-coverage 1000 Genomes Project dataset are:
- Information about each SV site, including frequencies of the most expressed and second most expressed alleles, and frequency in the 5 super-populations
- Information about each SV allele, grouped by SV site
- The allele counts for each of the 2,504 unrelated samples across SV sites
- The genotype quality for each of the 2,504 unrelated samples across SV sites
Other files, such as VCF files and files related to eQTLs, are provided in the products folder.