
Table 1 Accuracy of amosvalidate mis-assembly signatures and suspicious regions summarized for 16 bacterial genomes assembled with Phrap

From: Genome assembly forensics: finding the elusive mis-assembly

    

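| Species | Len (Mbp) | Ctgs | Errs | Signatures: Num | Signatures: Valid | Signatures: Sens (%) | Regions: Num | Regions: Valid | Regions: Sens (%) |
|---|---|---|---|---|---|---|---|---|---|
| B. anthracis | 5.2 | 87 | 2 | 1,336 | 21 | 100.0 | 127 | 2 | 100.0 |
| B. suis | 3.4 | 120 | 10 | 1,047 | 30 | 80.0 | 158 | 9 | 90.0 |
| C. burnetii | 2.0 | 55 | 22 | 1,375 | 70 | 100.0 | 124 | 19 | 100.0 |
| C. caviae | 1.4 | 270 | 12 | 625 | 16 | 83.3 | 50 | 8 | 66.7 |
| C. jejuni | 1.8 | 53 | 5 | 290 | 11 | 80.0 | 61 | 3 | 60.0 |
| D. ethenogenes | 1.8 | 632 | 12 | 688 | 22 | 91.7 | 88 | 9 | 100.0 |
| F. succinogenes | 4.0 | 455 | 21 | 1,670 | 27 | 95.2 | 266 | 14 | 66.7 |
| L. monocytogenes | 2.9 | 172 | 1 | 1,381 | 5 | 100.0 | 201 | 1 | 100.0 |
| M. capricolum | 1.0 | 17 | 3 | 83 | 0 | 0.0 | 16 | 0 | 0.0 |
| N. sennetsu | 0.9 | 16 | 0 | 91 | 0 | NA | 13 | 0 | NA |
| P. intermedia | 2.7 | 243 | 21 | 1,655 | 57 | 100.0 | 201 | 20 | 100.0 |
| P. syringae | 6.4 | 274 | 64 | 2,841 | 200 | 98.4 | 366 | 55 | 98.4 |
| S. agalactiae | 2.1 | 127 | 21 | 687 | 53 | 95.2 | 112 | 18 | 85.7 |
| S. aureus | 2.8 | 824 | 41 | 1,850 | 69 | 97.6 | 227 | 18 | 75.6 |
| W. pipientis | 3.3 | 2,017 | 31 | 761 | 92 | 100.0 | 132 | 30 | 100.0 |
| X. oryzae | 5.0 | 50 | 151 | 2,569 | 379 | 100.0 | 100 | 69 | 100.0 |
| Totals | 46.8 | 5,412 | 417 | 18,949 | 1,052 | 96.9 | 2,242 | 275 | 92.6 |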

  1. Species name, genome length (Len), number of assembled contigs (Ctgs), and number of alignment-inferred mis-assemblies (Errs) are given in the first four columns. The number of mis-assembly signatures output by amosvalidate (Num) is given in column 5, the number of signatures coinciding with a known mis-assembly (Valid) in column 6, and the percentage of known mis-assemblies identified by one or more signatures (Sens) in column 7. The same values are given in columns 8-10 for the suspicious regions output by amosvalidate. A suspicious region represents at least two different, coinciding lines of evidence, whereas a signature represents a single line of evidence. A signature or region is deemed 'validated' if its location interval overlaps a mis-assembled region identified by dnadiff. Thus, a single signature or region can identify multiple mis-assemblies and, conversely, a single mis-assembly can be identified by multiple signatures or regions.
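
The overlap rule in the note above can be illustrated with a short Python sketch. It is not the amosvalidate or dnadiff implementation: the function names `overlaps` and `summarize`, the inclusive `(start, end)` interval representation, and the use of a single shared coordinate system (rather than per-contig coordinates) are all assumptions made for illustration. It shows how Num, Valid, and Sens could be derived from two lists of intervals, and why Sens is undefined (NA) when a genome has no known mis-assemblies.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: inclusive (start, end) coordinates,
# all on one shared coordinate system for simplicity.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def summarize(
    features: List[Interval], misassemblies: List[Interval]
) -> Tuple[int, int, Optional[float]]:
    """Return (Num, Valid, Sens) for signatures or suspicious regions.

    Num   = number of features reported
    Valid = features overlapping at least one known mis-assembly
    Sens  = percentage of known mis-assemblies overlapped by at least
            one feature, or None (NA) when there are no mis-assemblies
    """
    valid = sum(
        1 for f in features if any(overlaps(f, m) for m in misassemblies)
    )
    found = sum(
        1 for m in misassemblies if any(overlaps(f, m) for f in features)
    )
    sens = (
        round(100.0 * found / len(misassemblies), 1) if misassemblies else None
    )
    return len(features), valid, sens


if __name__ == "__main__":
    # Made-up intervals, not data from the table.
    features = [(100, 200), (150, 400), (900, 950)]
    misassemblies = [(180, 190), (390, 410), (2000, 2100)]
    print(summarize(features, misassemblies))
```

In this made-up example the second feature overlaps two mis-assemblies, and the first mis-assembly is overlapped by two features, so Valid and the number of mis-assemblies found are counted separately, as the note describes.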