From: Comparison of gene clustering criteria reveals intrinsic uncertainty in pangenome analyses
OGC | Strengths | Limitations | Recommended use |
---|---|---|---|
Ref-db | Annotation and classification in a single step; the most robust method for pangenomes with incomplete genomes | Loss of all gene families that are not represented in the reference database; discrimination of paralogs is limited by the taxonomic resolution of the reference database | Functional profiling and detection of single-copy core genes (if missing taxon-specific genes is not a problem); analysis of pangenomes with a large fraction of incomplete genomes; cross-species tracking of pangenome contents |
Homology | Faster than any other method | Setting a hard similarity threshold can lead to merging paralogs and/or missing distant orthologs | Quick detection of single-copy core genes; fast de novo clustering of sequences without homologs in reference databases; analysis of very large pangenomes |
Orthology | Evolutionarily consistent clustering of mobile gene families; discrimination of in- and out-paralogs | Computationally costly (panX becomes prohibitive with >200 genomes); poor performance if the pangenome contains incomplete genomes | Unbiased functional profiling; study of genome plasticity and gene flux in high-quality pangenomes; assessment of within-species gene expansions |
Synteny | Accurate discrimination of vertically transmitted gene families; the most sensitive method for single-copy core genes | Fragmentation of mobile gene families into multiple OGCs can bias functional profiles and quantitative estimates of genome plasticity | Identification of vertically transmitted orthologs; generation of high-resolution phylogenetic trees; analysis of pangenomes with up to 20% incomplete genomes |
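
The limitation listed for the homology criterion (a hard similarity threshold can merge paralogs or miss distant orthologs) can be illustrated with a toy example. The following is a minimal sketch, not the procedure of any tool discussed in the paper: it assumes single-linkage clustering on pairwise identity, and the gene names, identity values, and the function `cluster_by_threshold` are all hypothetical.

```python
from itertools import combinations


def cluster_by_threshold(genes, identity, threshold):
    """Single-linkage clustering: link two genes when identity >= threshold.

    Illustrative only; real homology-based tools use their own algorithms.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path compression.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genes, 2):
        score = identity.get((a, b), identity.get((b, a), 0.0))
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), set()).add(g)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy data: geneX in genomes A, B and C are orthologs
# (the copy in C is divergent); A_geneX2 is a paralog within genome A.
genes = ["A_geneX", "B_geneX", "C_geneX", "A_geneX2"]
identity = {
    ("A_geneX", "B_geneX"): 0.95,
    ("A_geneX", "C_geneX"): 0.55,
    ("B_geneX", "C_geneX"): 0.58,
    ("A_geneX", "A_geneX2"): 0.70,
    ("B_geneX", "A_geneX2"): 0.68,
    ("C_geneX", "A_geneX2"): 0.50,
}

strict = cluster_by_threshold(genes, identity, 0.80)  # strict cutoff
loose = cluster_by_threshold(genes, identity, 0.55)   # permissive cutoff

print("threshold 0.80:", strict)
print("threshold 0.55:", loose)
```

With these made-up values, the strict cutoff leaves the divergent ortholog `C_geneX` in its own cluster, while the permissive cutoff recovers it but also pulls the paralog `A_geneX2` into the same cluster. No single threshold separates the two cases, which is the trade-off the table describes.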