Concepts & Definitions

Input Data

Orthogroup

A set of genes from different species that all descend from a single gene in the species' last common ancestor. Orthogroups are identified by tools such as OrthoFinder or extracted from TOGA2 output.

Example: ENSGALG00010001554 (chicken gene ID used as orthogroup identifier in TOGA2)

TOGA2 BED12

Output format from TOGA2 (Tool to infer Orthologs from Genome Alignments). Contains gene annotations with orthology information embedded in the name field.

Format: transcript#ortholog_gene#scores$fragment
Example: ENSGALT00010003578#ENSGALG00010001554#19155,18478$1
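As a rough illustration, the name field can be split into its parts as follows. This is a minimal sketch that assumes the name always follows the layout shown above; `TogaName` and `parse_toga_name` are hypothetical names, not part of TOGA2, and the scores are kept as opaque integers because their meaning is not described here.

```python
from typing import NamedTuple, Optional, Tuple


class TogaName(NamedTuple):
    """Hypothetical container for the parts of a TOGA2 BED12 name field."""
    transcript: str
    orthogroup: str
    scores: Tuple[int, ...]
    fragment: Optional[int]


def parse_toga_name(name: str) -> TogaName:
    # Split off the "$fragment" suffix first; assume it may be absent.
    body, _, fragment = name.partition("$")
    # The remaining part is assumed to hold exactly three "#"-separated fields.
    transcript, orthogroup, scores = body.split("#")
    return TogaName(
        transcript=transcript,
        orthogroup=orthogroup,
        scores=tuple(int(s) for s in scores.split(",")),
        fragment=int(fragment) if fragment else None,
    )


parsed = parse_toga_name("ENSGALT00010003578#ENSGALG00010001554#19155,18478$1")
print(parsed.orthogroup)
```

The `orthogroup` field is the part that becomes a gene token in the graph.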

Graph Construction

Gene Token

The fundamental unit in our de Bruijn graph. Represents a gene by its orthogroup ID.

Example: ENSGALG00010001554 (gene symbols such as BRCA1 and TP53 stand in for orthogroup IDs in the examples that follow)

k-mer (Gene k-mer)

A sequence of k consecutive gene tokens from a genome. Unlike DNA k-mers, these represent gene order, not nucleotide sequence.

With k=3: [BRCA1, TP53, MYC] is a 3-mer
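Extracting gene k-mers amounts to sliding a window of width k over an ordered list of gene tokens. The sketch below assumes each chromosome is already available as such a list; `gene_kmers` is a hypothetical helper, and the gene order is illustrative rather than a real genomic arrangement.

```python
def gene_kmers(tokens, k):
    """Return every window of k consecutive gene tokens, in genome order."""
    # A list shorter than k yields no k-mers (the range is empty).
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


# Toy gene order, for illustration only.
chromosome = ["BRCA1", "TP53", "MYC", "EGFR"]
kmers = gene_kmers(chromosome, 3)
for kmer in kmers:
    print(kmer)
```

Four tokens with k=3 give two overlapping 3-mers, which share the (k-1)-mer `("TP53", "MYC")`.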

Colored de Bruijn Graph

A graph where:

Nodes: (k-1)-mers of gene tokens

Edges: k-mers, each connecting its (k-1)-mer prefix to its (k-1)-mer suffix

Colors: Each edge is "colored" by the set of genomes that contain it

The "colors" tell us which genomes share a particular gene arrangement. An edge present in all genomes marks a gene arrangement that is conserved across every analyzed genome.
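The structure described above can be sketched as a dictionary from edges to color sets. This is a minimal illustration, not the project's actual data structure: `build_colored_graph` and `endpoints` are hypothetical names, each genome is assumed to be a list of chromosomes, and each chromosome an ordered list of gene tokens.

```python
from collections import defaultdict


def build_colored_graph(genomes, k):
    """Map each edge (a gene k-mer) to the set of genomes that contain it."""
    edge_colors = defaultdict(set)
    for genome, chromosomes in genomes.items():
        for tokens in chromosomes:
            for i in range(len(tokens) - k + 1):
                edge_colors[tuple(tokens[i:i + k])].add(genome)
    return dict(edge_colors)


def endpoints(kmer):
    """An edge joins its (k-1)-mer prefix node to its (k-1)-mer suffix node."""
    return kmer[:-1], kmer[1:]


# Toy input: three genomes with one chromosome each.
genomes = {
    "A": [["X", "Y", "Z"]],
    "B": [["X", "Y", "Z"]],
    "C": [["X", "Y", "W"]],
}
graph = build_colored_graph(genomes, k=2)
for edge, colors in graph.items():
    print(edge, sorted(colors))
```

Nodes are not stored explicitly in this sketch; they can be recovered from any edge with `endpoints`.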

Compaction

Unitig (Synteny Block)

A maximal non-branching path in the colored de Bruijn graph. This means:

1. All edges in the path have the same set of colors (same genomes)

2. Each internal node has exactly one incoming and one outgoing edge, so there is only one way to continue the path

3. The path cannot be extended further while maintaining these properties

If genomes A and B both have genes X-Y-Z in that order, but genome C has X-Y-W, then (taking k=2, so that each edge is a pair of adjacent genes):
- [X-Y] is a unitig shared by A, B, C
- [Y-Z] is a unitig shared by A, B only
- [Y-W] is a unitig unique to C

Key insight: A unitig with the same ID in different genomes represents the exact same conserved gene order.
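The X-Y-Z example can be reproduced with a small compaction routine. This is a simplified sketch under stated assumptions, not the project's implementation: `compact_unitigs` is a hypothetical name, the input is a dictionary from edges (gene k-mers) to color sets, and paths that form a closed cycle with no branch are not reported.

```python
from collections import defaultdict


def compact_unitigs(edge_colors):
    """Merge edges into maximal non-branching paths with identical colors."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for kmer in edge_colors:
        out_edges[kmer[:-1]].append(kmer)
        in_edges[kmer[1:]].append(kmer)

    def successor(kmer):
        """Return the only edge that may follow kmer in a unitig, else None."""
        node = kmer[1:]
        if len(in_edges[node]) != 1 or len(out_edges[node]) != 1:
            return None  # the node branches or is a dead end
        nxt = out_edges[node][0]
        return nxt if edge_colors[nxt] == edge_colors[kmer] else None

    followers = {kmer: successor(kmer) for kmer in edge_colors}
    has_predecessor = {nxt for nxt in followers.values() if nxt is not None}

    unitigs = []
    for start in edge_colors:
        if start in has_predecessor:
            continue  # not the first edge of its unitig
        path, current = [start], followers[start]
        while current is not None:
            path.append(current)
            current = followers[current]
        unitigs.append((path, edge_colors[start]))
    return unitigs


# Edges of the example above with k=2.
example = {
    ("X", "Y"): frozenset("ABC"),
    ("Y", "Z"): frozenset("AB"),
    ("Y", "W"): frozenset("C"),
}
unitigs = compact_unitigs(example)
for path, colors in unitigs:
    print(path, sorted(colors))
```

Node Y has two outgoing edges, so no edge can be merged across it and each of the three edges ends up as its own unitig.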

Branch Point

A node in the graph where paths diverge because different genomes have different gene arrangements at this position. Branch points indicate potential rearrangement sites.
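One way to locate branch points is to count the edges entering and leaving each node. In the sketch below, `branch_points` is a hypothetical helper that takes a dictionary from edges (gene k-mers) to color sets. Treating nodes where paths converge (more than one incoming edge) as branch points too is an assumption made here, not something the definition above requires.

```python
from collections import defaultdict


def branch_points(edge_colors):
    """Return nodes with more than one outgoing or incoming edge."""
    out_degree, in_degree = defaultdict(int), defaultdict(int)
    for kmer in edge_colors:
        out_degree[kmer[:-1]] += 1  # the edge leaves its prefix node
        in_degree[kmer[1:]] += 1    # and enters its suffix node
    nodes = set(out_degree) | set(in_degree)
    return {n for n in nodes if out_degree[n] > 1 or in_degree[n] > 1}


# Toy graph with k=2: genomes A and B continue Y -> Z, genome C continues Y -> W.
edges = {
    ("X", "Y"): {"A", "B", "C"},
    ("Y", "Z"): {"A", "B"},
    ("Y", "W"): {"C"},
}
points = branch_points(edges)
print(points)
```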

Synteny Analysis

Synteny

Conservation of gene order between genomes. Two genomes are syntenic in a region if they share the same sequence of genes.

Shared Block

A unitig (block) that appears in multiple genomes. In the browser, blocks with the same ID in both compared genomes are connected by ribbons.

Universal Block

A unitig present in all analyzed genomes. Represents gene order conserved across the entire dataset.

Singleton Block

A unitig present in only one genome. Represents a gene arrangement unique to that genome.
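The three block categories follow directly from a unitig's color set. A minimal sketch, assuming each block carries the set of genomes it occurs in; `classify_block` is a hypothetical helper that returns the most specific label, so a universal block is reported as "universal" even though it is also shared.

```python
def classify_block(colors, all_genomes):
    """Label a block by how many of the analyzed genomes contain it."""
    colors = set(colors)
    if colors == set(all_genomes):
        return "universal"
    if len(colors) == 1:
        return "singleton"
    return "shared"


all_genomes = ["A", "B", "C"]
labels = {
    "block_1": classify_block({"A", "B", "C"}, all_genomes),
    "block_2": classify_block({"A", "B"}, all_genomes),
    "block_3": classify_block({"C"}, all_genomes),
}
print(labels)
```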

Rearrangements (Planned)

Inversion

A chromosomal rearrangement where a segment is reversed. Will be detected by incorporating strand information into the graph.

Status: Not yet implemented
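One possible way to add strand information, shown here only as a sketch of the idea and not as the planned design, is to make each gene token a (gene, strand) pair and to treat a k-mer and its reverse reading as the same edge. The names `flip`, `reverse_kmer`, and `canonical` are hypothetical.

```python
def flip(token):
    """Switch the strand of a (gene, strand) token."""
    gene, strand = token
    return (gene, "-" if strand == "+" else "+")


def reverse_kmer(kmer):
    """Read a stranded gene k-mer from the opposite DNA strand."""
    return tuple(flip(t) for t in reversed(kmer))


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse reading."""
    return min(kmer, reverse_kmer(kmer))


# Genome 1 has A B C D; in genome 2 the segment B C D is inverted.
forward = (("B", "+"), ("C", "+"), ("D", "+"))
inverted = (("D", "-"), ("C", "-"), ("B", "-"))
print(canonical(forward) == canonical(inverted))
```

Under this scheme the interior of an inverted segment maps to the same edges in both genomes, while the adjacencies at its boundaries differ, for example `A+ B+` in one genome versus `A+ D-` in the other.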

Translocation

Movement of a genomic segment to a different chromosome. Could be detected when blocks that are adjacent on one chromosome in one genome lie on different chromosomes in another.

Status: Not yet implemented
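Although this is not implemented, the detection idea could be sketched as follows. The input layout is an assumption: an ordered list of (block ID, chromosome) pairs for one genome, and a block ID to chromosome mapping for the other. `translocation_candidates` is a hypothetical name, and the pairs it reports are only candidates, since chromosome fusions and fissions would produce the same signal.

```python
def translocation_candidates(ref_blocks, other_chrom):
    """Find adjacent block pairs that are split across chromosomes elsewhere.

    ref_blocks: ordered (block_id, chromosome) pairs of one genome.
    other_chrom: block_id -> chromosome in a second genome.
    """
    candidates = []
    for (b1, c1), (b2, c2) in zip(ref_blocks, ref_blocks[1:]):
        if c1 != c2:
            continue  # consecutive in the list but on different chromosomes
        if b1 not in other_chrom or b2 not in other_chrom:
            continue  # block missing from the second genome
        if other_chrom[b1] != other_chrom[b2]:
            candidates.append((b1, b2))
    return candidates


# Toy data: u2 and u3 are neighbors on chr1 but separated in the other genome.
ref_blocks = [("u1", "chr1"), ("u2", "chr1"), ("u3", "chr1"), ("u4", "chr2")]
other_chrom = {"u1": "chr5", "u2": "chr5", "u3": "chr9", "u4": "chr9"}
found = translocation_candidates(ref_blocks, other_chrom)
print(found)
```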

Visualization

Genome Track

Linear representation of a genome showing synteny blocks as colored rectangles. Blocks are colored by chromosome and sized by gene count.

Ribbon

A curved connection between two genome tracks showing that the same block (same ID) exists in both genomes. Ribbons visualize synteny relationships.
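Ribbon endpoints can be derived by matching block IDs between two tracks. The sketch below assumes each track is an ordered list of block IDs and that a block ID occurs at most once per track; `ribbons` is a hypothetical helper, not the browser's actual code.

```python
def ribbons(track_a, track_b):
    """Pair up blocks that have the same ID in two genome tracks.

    Returns (block_id, index_in_a, index_in_b) tuples in track_a order.
    """
    positions_b = {block_id: i for i, block_id in enumerate(track_b)}
    return [
        (block_id, i, positions_b[block_id])
        for i, block_id in enumerate(track_a)
        if block_id in positions_b
    ]


# Toy tracks: u1 and u3 are shared, u2 and u4 are not.
links = ribbons(["u1", "u2", "u3"], ["u3", "u1", "u4"])
print(links)
```

Ribbons that cross, as u1 and u3 do here, indicate that the shared blocks appear in a different order in the two genomes.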