Table 2 Practical guidelines to select optimal gene clustering criteria for pangenome analysis

From: Comparison of gene clustering criteria reveals intrinsic uncertainty in pangenome analyses

OGC

Strengths

Limitations

Recommended use

Ref-db

Strengths:

- Annotation and classification at once

- The most robust method in pangenomes with incomplete genomes

Limitations:

- Loss of all gene families that are not represented in the reference database

- Discrimination of paralogs is limited by the taxonomic resolution of the reference database

Recommended use:

- Functional profiling and detection of single-copy core genes (if missing taxon-specific genes is not a problem)

- Analysis of pangenomes with a large fraction of incomplete genomes

- Cross-species tracking of pangenome contents

Homology

Strengths:

- Faster than any other method

Limitations:

- Setting a hard similarity threshold can lead to merging paralogs and/or missing distant orthologs

Recommended use:

- Quick detection of single-copy core genes

- Fast de novo clustering of sequences without homologs in reference databases

- Analysis of very large pangenomes

Orthology

Strengths:

- Evolutionarily consistent clustering of mobile gene families

- Discrimination of in- and out-paralogs

Limitations:

- Computationally costly (panX becomes prohibitive with > 200 genomes)

- Poor performance if the pangenome contains incomplete genomes

Recommended use:

- Unbiased functional profiling

- Study of genome plasticity and gene flux in high-quality pangenomes

- Assessment of within-species gene expansions

Synteny

Strengths:

- Accurate discrimination of vertically transmitted gene families

- The most sensitive method for single-copy core genes

Limitations:

- The fragmentation of mobile gene families into multiple OGCs can bias functional profiles and quantitative estimates of genome plasticity

Recommended use:

- Identification of vertically transmitted orthologs

- Generation of high-resolution phylogenetic trees

- Analysis of pangenomes with up to 20% of incomplete genomes
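
The two numeric limits stated in the table (more than 200 genomes for panX, and up to 20% of incomplete genomes for synteny) can be turned into a quick check on a dataset. The Python sketch below is only an illustration of how the guidelines might be applied, not a procedure from the source. The helper name `table2_caveats`, its signature, and the rule that any incomplete genome flags the orthology criterion are assumptions made for this example. The table gives no numeric cutoff for "a large fraction of incomplete genomes" or "very large pangenomes", so none is encoded.

```python
from typing import Dict, List

# Thresholds quoted in the table; everything else here is illustrative.
PANX_MAX_GENOMES = 200          # "panX becomes prohibitive with > 200 genomes"
SYNTENY_MAX_INCOMPLETE = 0.20   # "up to 20% of incomplete genomes"


def table2_caveats(n_genomes: int, incomplete_fraction: float) -> Dict[str, List[str]]:
    """Map each clustering criterion to the table's caveats triggered by the inputs.

    Hypothetical helper: the name, signature, and the rule that any incomplete
    genome flags the orthology criterion are assumptions, not part of the source.
    """
    if n_genomes < 1 or not 0.0 <= incomplete_fraction <= 1.0:
        raise ValueError("expected n_genomes >= 1 and 0 <= incomplete_fraction <= 1")

    caveats: Dict[str, List[str]] = {
        "Ref-db": [], "Homology": [], "Orthology": [], "Synteny": [],
    }
    # The 200-genome figure is given for panX specifically, not for every
    # orthology-based tool.
    if n_genomes > PANX_MAX_GENOMES:
        caveats["Orthology"].append("computationally costly; panX becomes prohibitive")
    if incomplete_fraction > 0:
        caveats["Orthology"].append("poor performance with incomplete genomes")
    if incomplete_fraction > SYNTENY_MAX_INCOMPLETE:
        caveats["Synteny"].append("more than 20% of genomes are incomplete")
    return caveats


if __name__ == "__main__":
    # Example: 350 genomes, a quarter of them incomplete.
    for criterion, notes in table2_caveats(350, 0.25).items():
        print(criterion, "->", notes or "no size or completeness caveat")
```

An empty list means only that neither numeric limit applies. The qualitative limitations in the table, such as gene families missing from the reference database or paralogs merged by a hard similarity threshold, still have to be weighed against the goal of the analysis.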