HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

Han, Renmin; Qi, Junhai; Xue, Yang; Sun, Xiujuan; Zhang, Fa; Gao, Xin; Li, Guojun

doi:10.1186/s13059-023-03053-1

Method
Open access
Published: 05 October 2023

Renmin Han ORCID: orcid.org/0000-0003-4761-6526¹^na1,
Junhai Qi^1,2^na1,
Yang Xue¹,
Xiujuan Sun³,
Fa Zhang⁴,
Xin Gao⁵ &
…
Guojun Li¹

Genome Biology volume 24, Article number: 222 (2023)

Abstract

DNA barcodes enable Oxford Nanopore sequencing to sequence multiple barcoded DNA samples on a single flow cell. DNA sequences with the same barcode need to be grouped together through demultiplexing. As the number of samples increases, accurate demultiplexing becomes difficult. We introduce HycDemux, which incorporates a GPU-parallelized hybrid clustering algorithm that uses nanopore signals and DNA sequences for accurate data clustering, alongside a voting-based module to finalize the demultiplexing results. Comprehensive experiments demonstrate that our approach outperforms unsupervised tools in short sequence fragment clustering and performs more robustly than current state-of-the-art demultiplexing tools for complex multi-sample sequencing data.

Background

A barcode is a very short nucleotide sequence attached to the 3′ or 5′ end of a DNA sequence to indicate where the sequence comes from. By incorporating a unique barcode into each library of DNA molecules, multiple DNA libraries can be sequenced simultaneously [1]. Usually, such a short nucleotide sequence is a barcode or a specially coded segment within a long read, and its length is shorter than 100 nucleotides. Clustering or classifying the reads into bins based on these short nucleic acid fragments is the first step in high-throughput sequencing techniques such as multiple-sample sequencing and single-cell protocols [2, 3]. In particular, barcoding has recently been introduced to Oxford Nanopore devices to sequence multiple barcoded DNA samples on a single flow cell [4, 5].

Oxford Nanopore sequencing is a rapidly developing technology that enables ultra-long sequencing in real time at low cost. The key innovation of nanopore sequencing is the direct measurement of the electrical current signal (denoted as the raw signal) as a single-stranded DNA molecule passes through the nanopore. These raw signals are translated into nucleic acid bases by base-calling for further analysis [6,7,8]. The translation from raw current signals to reads may introduce significant base-calling errors. For example, for a 40-nt barcode and a base-calling system with a 10% error rate, the probability that a sequenced barcode is completely correct is $0.9^{40}\approx 0.015$, which can badly hamper downstream analyses [9]. In particular, because of the high base-calling error rate, the Unique Molecular Identifier (UMI) technique, in which shorter nucleotide sequences are added to sequencing libraries to identify PCR duplicates, is rarely used in nanopore sequencing [10, 11].
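The probability quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes base-calling errors are independent and occur at a constant per-base rate; the function name `error_free_probability` is introduced here for illustration only.

```python
def error_free_probability(length: int, per_base_error: float) -> float:
    """Probability that every one of `length` bases is called correctly.

    Assumes independent errors with a constant per-base error rate,
    which is a simplification of real base-calling behaviour.
    """
    return (1.0 - per_base_error) ** length


# A 40-nt barcode at a 10% error rate: about 1.5% of reads are error-free.
p = error_free_probability(40, 0.10)
print(f"{p:.4f}")
```

Under the same assumption, shortening the barcode or lowering the error rate raises this probability quickly, which is why the error rate of the sequencing kit matters so much for base-space demultiplexing.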

A number of methods have been devised to group related biological sequences. In 2001, a tool named CD-HIT [12] was proposed for clustering large numbers of sequences, based on pairwise alignment and a greedy strategy. Later, improved versions of CD-HIT [13, 14] were devised to cope with next-generation sequencing data. Inspired by CD-HIT, DNACLUST [15] was proposed for taxonomic profiling. More recently, MeShClust [16] introduced the mean shift algorithm to reduce the parameter dependency of the greedy strategy. In addition, alignment-free similarity measures [17,18,19,20] have been used in sequence clustering [21, 22] by mapping DNA sequences into feature vectors. Clustering tools have also been devised for specific purposes, e.g., Starcode [23] and Bartender [24]. However, in nanopore sequencing, all of these methods can only use the information in the base-called reads.

In contrast, the raw current signal contains much more information than the base-called reads. In practice, the frequency of the electrical current measurements is 7$\sim$9 times higher than the passing speed of the DNA sequence, so the raw current signal carries $\sim$8$\times$ redundant information relative to the base-called read. In addition to signal-level polishing [25], efforts have been made to use the raw signal for targeted sequencing [8, 26, 27], variant identification [28, 29], and methylation detection [30,31,32]. Recently, the raw current signal has also been used in ONT barcode demultiplexing with good results, by training a deep neural network as a classifier of barcode raw signals [33, 34]. Here, demultiplexing is carried out as a supervised machine learning task, with the classifier trained on a large human-labeled dataset. However, the performance of such supervised-learning-based classification depends heavily on the training dataset.

In this paper, we first demonstrate that, for nanopore sequencing, clustering based on signal similarity (dynamic time warping distance) performs much better than base-space clustering under various criteria (Additional file 1: S1), although computing pairwise signal similarity is computationally expensive (Additional file 1: S2). Consequently, we propose HycDemux, which integrates a GPU-parallelized hybrid clustering algorithm and a voting module for the accurate clustering of short sequence fragments and demultiplexing of barcoded samples in nanopore sequencing. Our approach uses both the base-called nucleotide information and the raw current signal: the nucleotides are used to generate the initial clustering and the representative sequences, while the raw signals are used for cluster merging and refinement (Fig. 1A, B, and C give an example). A checking mechanism ensures that good sequences are retained and correctly grouped. We compared our hybrid clustering algorithm with traditional DNA clustering tools and found that our algorithm provides more complete clustering results ($>95.5$%) while ensuring high homogeneity ($>99.7$%). The completeness of our method is about 30% higher than that of traditional clustering tools, providing a strong guarantee for subsequent demultiplexing. The combination of high completeness and high homogeneity implies that our hybrid clustering algorithm can identify the barcodes that are successfully generated in the dataset, which has potential significance for barcode design. To transform clustering results into the final demultiplexed results, we designed a module based on a voting mechanism (Fig. 1D). Comprehensive experiments show that combining our hybrid clustering algorithm with this module leads to more accurate and robust demultiplexing results.
When applied to multi-sample sequencing data generated with Nanopore’s official barcode suites, our method performs comparably to the state-of-the-art method. In particular, we evaluated our algorithm on datasets with different sequencing error rates, corresponding to different nanopore sequencing kits [35,36,37,38]. For complex sequencing data (number of barcodes = 350, sequencing error 10%$\sim$15%), we achieve a demultiplexing accuracy above 90% for each barcode, which is about 30% higher than the state-of-the-art method; on the low-error-rate datasets, it is about 15% higher than the state-of-the-art method. It is important to note that sequencing technology is continually advancing, leading to improved sequencing accuracy. This improvement suggests that our algorithms will yield even better results in future studies. In addition, our algorithm incorporates a GPU-based parallel mechanism, which allows 3.5 GB (gigabytes) of data (nanopore signals + base-called DNA sequences) to be demultiplexed in approximately 1 min.

Results

We have developed a comprehensive pipeline to extract pseudo-barcode regions from raw sequencing data. All of the extracted data are then used for subsequent clustering and demultiplexing. For the extracted pseudo-barcode regions, HycDemux integrates an unsupervised hybrid approach to achieve accurate and efficient clustering, in which a nucleotide-based greedy algorithm is used to obtain initial clusters (initial clustering), and the raw signal information is used to guide the continuous optimization and refinement of the clustering results (cluster merging and cluster refinement). GPU acceleration based on CUDA is used in our hybrid clustering (GPU-accelerated DTW). In addition, HycDemux integrates a module that uses a voting mechanism to determine the final demultiplexing result. This module selects n representatives (5 by default) for each cluster and calculates the DTW distance matrix between these representatives and the standard barcode signals. By identifying the row index of the minimum value in each column of the distance matrix, the module determines which barcode each representative belongs to. As a result, n demultiplexing results are obtained, and the barcode with the highest frequency among them determines the demultiplexing outcome of the cluster. The detailed implementation is explained in the “Materials and methods” section.
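The voting step described above can be sketched as follows. This is an illustrative sketch rather than the HycDemux implementation: it assumes the distance matrix is laid out with one row per standard barcode and one column per cluster representative, the function name `vote_barcode` and the example distances are hypothetical, and the tie-breaking rule (the barcode that received its first vote earliest wins) is an assumption, since tie handling is not described here.

```python
from collections import Counter


def vote_barcode(dist):
    """Assign a cluster to a barcode by majority vote.

    dist[i][j] is the DTW distance between standard barcode i (row)
    and cluster representative j (column). Each representative votes
    for the barcode with the smallest distance in its column.
    """
    n_reps = len(dist[0])
    votes = []
    for j in range(n_reps):
        column = [row[j] for row in dist]
        # Row index of the minimum value in column j (first one on ties).
        votes.append(min(range(len(column)), key=column.__getitem__))
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the barcode that was voted for first (assumption).
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes


# Hypothetical distances: 3 standard barcodes x 5 representatives.
example = [
    [5.0, 1.0, 4.0, 6.0, 2.0],
    [1.2, 3.0, 0.9, 1.1, 3.5],
    [4.0, 2.5, 3.0, 2.0, 1.0],
]
print(vote_barcode(example))  # (1, [1, 0, 1, 1, 2])
```

In the example, two representatives are closest to a different barcode than the majority, yet the cluster is still assigned to barcode 1, which illustrates how voting limits the influence of a few abnormal sequences.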

We derive the demultiplexing results from the hybrid clustering results, so the quality of the clustering directly affects the quality of the demultiplexing. In this section, we first evaluate the performance of HycDemux’s hybrid clustering algorithm; the experimental results show that it generates high-quality clusters, which provides a strong guarantee for subsequent demultiplexing. Afterwards, we show that a voting-based demultiplexing module can derive highly accurate demultiplexed results from the clustered results.

Evaluation of hybrid clustering algorithm

We demultiplex sequences based on the results of the hybrid clustering algorithm, which places high demands on clustering performance. Here, we mainly evaluate clustering performance from two aspects: homogeneity and completeness. High homogeneity means that the sequences in each cluster have the same barcode. However, in the extreme case where each cluster contains only one sequence, the homogeneity is 100%. Therefore, the clustering results also need to be evaluated through completeness. High completeness means that there is a relatively small difference between the number of clusters finally obtained and the actual number of barcodes.

The demultiplexing results are deduced from the clustering results. The main advantage is that, when the clustering has high homogeneity, the demultiplexing result of a sequence is determined by several sequences in the cluster to which it belongs, which makes demultiplexing more robust. Moreover, clustering does not harm the efficiency of demultiplexing. Given n sequences carrying m different barcodes, the final demultiplexing result can in theory be obtained with $n \times m$ alignments. If instead the n sequences are clustered into $n_1$ clusters and k representative elements are picked from each cluster to determine the demultiplexing result, only $n_1 \times k \times m$ alignments are required. When the clustering has high completeness, $n_1 \times k$ is much smaller than n, so the efficiency of demultiplexing is greatly improved. At the same time, clustering results with high homogeneity lead to highly accurate demultiplexing results.
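As a rough illustration of this count, take hypothetical values of the same order as those discussed in this paper (they are assumptions chosen for the example, not measured results): $n = 50{,}000$ sequences, $m = 20$ barcodes, $n_1 = 100$ clusters, and $k = 5$ representatives per cluster. Then

$$n \times m = 50{,}000 \times 20 = 1{,}000{,}000, \qquad n_1 \times k \times m = 100 \times 5 \times 20 = 10{,}000,$$

so the representative-based assignment needs about 100 times fewer alignments. This count covers only the final assignment step and leaves out the cost of the clustering itself.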

We conducted experiments to demonstrate the hybrid clustering algorithm’s ability to produce high homogeneity and completeness results. The experimental process and analysis are presented in detail below.

Simulated datasets

A set of synthetic datasets with different configurations was generated. We first generate a set of random barcodes and then produce a number of copies of these barcodes, together with their raw signals, using DeepSimulator. The configuration of a synthetic dataset is defined by the following three parameters:

The length of barcode (nucleotide sequence length)
The number of clusters within a dataset
The number of sequences for the whole dataset

Finally, we construct 12 simulated datasets. The details of these datasets are provided in Additional file 1: S3.

Real-world datasets

The real-world dataset came from eight R9.4 flow cells and six R9.5 flow cells, all sequenced with the EXP-NBD103 barcoding kit. We randomly selected 130,000 sequences from the dataset provided by [33] and calculated the edit distance between the barcode region in each sequence and the standard barcode region. If the edit distance exceeded 10, we labeled the sequence as “fuzzy,” indicating uncertainty about whether the sequence contains a barcode. Ultimately, we constructed barcode labels for 120,947 sequences, forming a dataset referred to as the amplicon library. In the amplicon library, we use Edlib [39] to locate the fixed region of the barcode and segment the barcode read, and we use Semi-Global Dynamic Time Warping [40] to extract the corresponding raw signal of each barcode.
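The labeling rule above can be sketched in plain Python. This is a minimal sketch under stated assumptions: it uses a textbook dynamic-programming edit distance rather than Edlib, it assigns each read to the nearest standard barcode before applying the threshold of 10 (the nearest-barcode step is an assumption), and the function names and the two reference sequences are hypothetical placeholders, not real kit barcodes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def label_read(barcode_region: str, references: dict, max_distance: int = 10) -> str:
    """Return the nearest reference barcode name, or "fuzzy" if too distant."""
    name, dist = min(
        ((n, edit_distance(barcode_region, seq)) for n, seq in references.items()),
        key=lambda item: item[1],
    )
    return name if dist <= max_distance else "fuzzy"


# Hypothetical 24-nt reference barcodes, for illustration only.
references = {
    "bc01": "ACGTACGTACGTACGTACGTACGT",
    "bc02": "TTGGCCAATTGGCCAATTGGCCAA",
}
print(label_read("ACGTACGTACCTACGTACGTACGT", references))  # bc01
print(label_read("G" * 24, references))                    # fuzzy
```

For datasets of this size, a dedicated alignment library such as Edlib is considerably faster than this pure-Python loop; the sketch is only meant to make the thresholding rule concrete.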

All the aforementioned datasets primarily consist of two main components. The first component comprises sequence fragments that represent the barcodes, while the second component consists of the nanopore signals that correspond to the barcode sequences.

Run scripts

DNACLUST fails to cope with datasets larger than 10,000 sequences. Therefore, we mainly compare our hybrid clustering with CD-HIT, UCLUST, and MeShClust. The command-line options for these three clustering tools are as follows:

1. CD-HIT: ./cd-hit-est -i infile.fasta -o outfile.fasta -c identity
2. MeShClust: ./meshclust infile.fasta --id identity -output outfile.fasta
3. UCLUST: ./usearch -cluster_fast infile.fasta -id identity -clusters output

All experiments were run on an Ubuntu 18.04 system with an Intel(R) Core(TM) i9-10980XE (18 cores), 128 GB of memory, and an NVIDIA RTX3080 card.

Evaluation on synthetic datasets

Six synthetic datasets with different barcode lengths, numbers of clusters, and data sizes are selected to demonstrate the performance of HycDemux. Table 1 describes the details of the six selected datasets.

Table 1 Summaries of the details about the selected synthetic datasets

Table 2 summarizes the experimental results of the different clustering methods on these synthetic datasets, where the indexes AMI, FMI, ACC, HOMO, COMP, and runtime are adopted for performance evaluation. Detailed information on all evaluation metrics can be found in Additional file 1: S1.

Table 2 Performance comparison of the different clustering methods on the datasets $S_1 \sim S_6$. Here, Identity is the parameter of the clustering tools, AMI is the abbreviation of adjusted mutual information, FMI is the abbreviation of Fowlkes-Mallows Index, ACC is the abbreviation of accuracy, HOMO is the abbreviation of homogeneity, and COMP is the abbreviation of completeness

The experimental results show that CD-HIT and UCLUST guarantee high homogeneity (HOMO index $=100\%$) in various situations. The high homogeneity is reasonable because CD-HIT and UCLUST are designed to maintain the consistency of the elements within a cluster. UCLUST is the fastest and CD-HIT the second fastest in clustering speed, because they use non-alignment techniques. When the sequence length is very short, MeShClust performs the worst of the four clustering methods, as shown in Table 2.

From the six synthetic datasets, we can draw the following key conclusions:

CD-HIT and UCLUST are the fastest and guarantee high homogeneity (HOMO more than 98%) of the results.
For barcodes with short sequence length ($S_1$ and $S_4$), the clustering performance of MeShClust is very poor. As the barcode length increases, the clustering performance of MeShClust improves significantly.
The clustering results of $S_1$ to $S_3$ and $S_4$ to $S_6$ demonstrate that the performance of the clustering tools depends on the length of the barcode sequence, and longer sequences yield better clustering results.
HycDemux significantly outperforms the other tools in clustering performance while keeping an acceptable speed. HycDemux achieves a completeness of over 95%, which is more than 30% higher than the other tools, while ensuring high homogeneity (> 99.7%). This ensures accuracy in subsequent demultiplexing and improves overall efficiency.
The speed of HycDemux is affected by the number of clusters in the dataset, as shown by comparing the results of $S_1$ and $S_4$, $S_2$ and $S_5$, and $S_3$ and $S_6$. This is because the number of clusters determines the number of DTW distances that must be calculated in the cluster merging phase.

In general, because of base-calling errors, traditional clustering tools such as CD-HIT, UCLUST, and MeShClust cannot obtain good clustering results when analyzing short nanopore reads. In particular, the clustering completeness of these tools is poor. For a dataset containing 50,000 sequences and 20 clusters, these tools may produce results with more than 1000 clusters. In contrast, HycDemux significantly improves completeness while maintaining clustering speed, resulting in fewer than 100 clusters. The initial clustering of our method guarantees extremely high homogeneity within the clusters, and the cluster merging and refinement guarantee high accuracy and completeness of the clustering, which greatly reduces the influence of base-calling errors. The results show that our method produces very good clustering results that traditional clustering tools cannot match. This greatly benefits subsequent demultiplexing. More benchmarking results are provided in Additional file 1: Table S10 $\sim$ Table S15.

Performance analysis of different stages

As introduced in the previous sections, the hybrid clustering algorithm is composed of three stages, i.e., initial clustering, cluster merging, and cluster refinement. In this section, we analyze the contributions of the different stages of hybrid clustering and their time cost.

First, we analyze how clustering accuracy changes across the stages. As shown in Fig. 2A, the clustering result after the initial clustering is not very good. After the merging phase, the clustering performance is greatly improved. After the refinement phase, the clustering result improves further. The change of the index values in Fig. 2A clearly shows the effectiveness of the three-stage solution in our hybrid clustering algorithm. In particular, the signal-based cluster merging and refinement contribute substantially to the improvement in clustering accuracy.

Then, we analyze the time cost of the different stages. Figure 2B shows the runtime of the three stages for the clustering of $S_3$ as a pie chart. As shown in Fig. 2B, the hybrid clustering algorithm spends the most time in the cluster merging stage. This is reasonable, since a large number of DTW distances are computed in cluster merging.

Speedup of GPU-accelerated DTW

As described in the previous section, numerous DTW distances are calculated in our algorithm, yet the time cost remains relatively small. Here, we show the benefits of GPU acceleration in DTW distance calculation.

To show the overall acceleration effect, we generate a large number of time series as test data, whose details are shown in Table 3. We compare the CUDA implementation of DTW with a single-threaded CPU method and a multi-threaded CPU method. The DTW method implemented in CUDA is mathematically equivalent to the original DTW, which guarantees its correctness. The single-threaded CPU method is a naive DTW algorithm. In the multi-threaded CPU approach, each CPU thread is responsible for calculating the DTW distance of one pair of time series, and each thread still uses the traditional method for the calculation. Figure 3A shows the time spent by the different approaches on a logarithmic scale. As shown in Fig. 3A, the CUDA-accelerated DTW is at least three orders of magnitude faster than the traditional single-threaded DTW, and two orders of magnitude faster than the 30-threaded DTW.

Table 3 Three different kinds of time series used for the comparison of different DTW implementations, where all the time series have a length of 1300. “Random” means time series of random walk. “Simulator” means current signals generated by DeepSimulator. “Amplicon library” means the real data downloaded from [33]. Here, we divide the data into two groups, and the time series of one group are compared with those of the other. For example, in the first random dataset, $200\times 1000$ means that the first group has 200 time series, the second group has 1000 time series, and the number of DTW comparisons is $200 \times 1000 = 200000$

In addition, we evaluate the DTW acceleration ratio for different lengths using simulated nanopore signals generated by DeepSimulator, where the length of the current signal is approximately 8 times that of the corresponding DNA template. Figure 3B shows how the acceleration ratio changes with sequence length: the acceleration ratio for a single DTW calculation ranges from 14$\times$ to 22$\times$ and increases with the sequence length. Furthermore, as introduced in the previous section, a block-wise acceleration strategy is proposed to fully exploit GPU blocks, which enables millions of DTW calculation threads to be launched simultaneously. In Supplementary Table S1, we have shown that it takes about 1100 min to calculate a $2000\times 2000$ DTW distance matrix. By comparison, with the CUDA acceleration strategy, the time cost of the DTW distance matrix calculation is reduced to 4 s.
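The speedup implied by these two timings can be checked with a short calculation, assuming that the 1100 min and 4 s figures refer to the same $2000\times 2000$ distance matrix:

$$\frac{1100 \times 60\ \text{s}}{4\ \text{s}} = \frac{66{,}000}{4} = 16{,}500,$$

that is, roughly four orders of magnitude. Spread over the $2000 \times 2000 = 4 \times 10^{6}$ pairwise comparisons, this corresponds to about 16.5 ms per DTW computation without acceleration and about 1 $\mu$s per computation (amortized) with the CUDA strategy.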

Runtime analysis of the hybrid clustering algorithm

As discussed in the previous section, the hybrid clustering algorithm consists of three stages, and the calculation of DTW distances is GPU-accelerated. Here, we further analyze the overall time cost of the hybrid clustering algorithm.

We simulated a number of datasets with DeepSimulator to test the time cost of our algorithm under different sequence lengths, dataset sizes, and numbers of clusters; the results are summarized in Fig. 4. Figure 4A shows that the runtime of our algorithm is not simply correlated with the sequence length. When the sequence length is 75 nt, the total time cost of our algorithm is smaller than that for a shorter sequence. In addition, the time costs for datasets with 75 nt, 85 nt, and 95 nt sequence lengths are almost the same. In fact, the accuracy of the initial clustering benefits from longer sequences, which shortens the runtime of the subsequent cluster merging and refinement. Figure 4B and C show that the runtime of our algorithm increases linearly with the dataset size and the number of clusters, which ensures an acceptable time cost even when the dataset is relatively large. In practice, our method can complete barcode clustering efficiently.

Evaluation on real-world dataset

The real-world dataset is downloaded from [33] and is composed of 12 classes of nanopore barcodes, with $\sim$40 base pairs for each barcode. In these real-world barcode sequences, the first 8 nucleotides and the last 8 nucleotides are the same across barcodes. Thus, base-calling errors may translate two identical nanopore barcodes into different nucleotide reads, which greatly hampers the correctness of clustering and classification.

Table 4 summarizes the experimental results of the different clustering methods on the real-world dataset. As shown in Table 4, the performance of MeShClust on the real-world dataset is very poor at all identities, failing to guarantee even the homogeneity of the clustering. At each identity, the performance of CD-HIT is slightly better than that of UCLUST, while UCLUST always guarantees a higher HOMO. The performance of HycDemux on the real-world dataset is much better than that of the other classic clustering tools, with 99.47% homogeneity and 83.22% completeness. In terms of clustering efficiency, UCLUST and CD-HIT still maintain a clear advantage, while the hybrid clustering algorithm also completes the clustering of about one hundred thousand sequences within a very short time. By fully utilizing the raw signal information, HycDemux copes well with base-calling errors, outperforming the classic ‘base-space’ clustering tools. In particular, our algorithm finishes the clustering within 15 s, which is also very efficient.

Table 4 Performance comparison of the different clustering methods on the real-world dataset

Evaluation of the demultiplexing in HycDemux

The experiments above have demonstrated that the hybrid clustering algorithm can deliver clustering results with high homogeneity and completeness. Here, we elaborate on how our hybrid clustering algorithm, coupled with a demultiplexing module based on a voting mechanism, attains highly accurate demultiplexing results. We compare our method with the state-of-the-art demultiplexing tool, Guppy, and provide the experimental details below.

Simulated multi-sample sequencing data

We obtained whole genome sequences for 17 Enterotoxigenic Escherichia coli strains [41], 45 historical Shigella strains [42], and 67 Shiga toxin-producing Escherichia coli strains [43] to construct multi-sample sequencing datasets. We constructed multiple multi-sample sequencing libraries by randomly fragmenting genome sequences according to the read length distribution of Oxford Nanopore Sequencing Technology (ONT). Figure 5A illustrates the resulting DNA sequence after library construction. We used a total of 11 multi-sample sequencing datasets (Table 5) to evaluate our algorithm; all datasets were mixed with an additional 1000 sequences that either lacked barcode regions or had incomplete ones (with a missing ratio greater than 0.6). These sequences were classified as negative samples, and their correct barcode label should be “unclassified.” In contrast, sequences containing the complete barcode region were categorized as positive samples. In addition, D1$\sim$D7 carried higher sequencing errors (10$\sim$15%), and DB4$\sim$DB7 carried lower sequencing errors (2$\sim$5%).

Among these datasets, D1$\sim$D3 use the official nanopore barcodes. To evaluate the robustness of the demultiplexing method and its performance on a large number of non-ONT barcodes [44, 45], we increased the number of barcodes and generated simulated multi-sample sequencing data (D4$\sim$D7 and DB4$\sim$DB7). These barcodes were randomly generated while guaranteeing a certain edit distance (edit distance = 14.5 ± 1.8). Each barcode consists of three primary components: an upstream flanking region, a variable region, and a downstream flanking region (Fig. 5A). While the barcode lengths vary among these datasets, the variable regions remain consistent at 24 nt in length.

For the EXP-NBD104 kit (D1), both the upstream and the downstream flanking regions are 8 nt long, resulting in a total barcode length of 40 nt. For the SQK-16S024 kit (D2), the upstream flanking region spans 15 nt, the downstream flanking region covers 20 nt, and the barcode itself is 59 nt in length. Finally, the EXP-PBC096 kit (D3) features an upstream flanking region of 7 nt, a downstream flanking region of 29 nt, and an overall barcode length of 60 nt. For barcodes in D4$\sim$D7 and DB4$\sim$DB7, both the upstream and the downstream flanking regions are 8 nt long, resulting in a total barcode length of 40 nt.

Table 5 All datasets used to evaluate demultiplexing performance. “GB” is an abbreviation for gigabytes

Extract data for demultiplexing

We obtained the barcode sequences (signals) from the multi-sample sequencing dataset (Fig. 5B). However, as errors may occur during the extraction process, we refer to the barcode sequence (signal) as a pseudo-barcode sequence (signal). These extracted pseudo-barcode sequences (signals) are utilized in the hybrid clustering and demultiplexing that follow (Fig. 5C).

Evaluation index

Our analysis covers the demultiplexing accuracy of each individual barcode, using two evaluation metrics: average accuracy and minimum accuracy. We explain both concepts with an example. Consider 10 sequences labeled $read_1, read_2, \ldots, read_{10}$, whose correct barcode labels are 1, 1, 1, 1, 1, 2, 2, 2, 2, 2. Here, the label “1” (or “2”) indicates that the sequence carries the 1st (or 2nd) barcode. Suppose that the barcode labels assigned to the 10 sequences by the demultiplexing algorithm are 1, 2, 1, 1, 2, 1, 1, 2, 1, 2. The accuracy for the first barcode is 3/5, because 3 of the 5 sequences that carry barcode 1 are assigned to barcode 1. Similarly, the accuracy for the second barcode is 2/5, because 2 of the 5 sequences that carry barcode 2 are assigned to barcode 2. In this case, the average accuracy is (3/5 + 2/5)/2 = 0.5, and the minimum accuracy is $\min(3/5, 2/5)=0.4$. It is important to note that, with a large number of barcodes, a demultiplexing result can have a high average accuracy but a low minimum accuracy. This means that the algorithm performs well on the majority of barcodes but may be ineffective for certain barcodes. Therefore, relying solely on the average accuracy might not provide a comprehensive evaluation of demultiplexing effectiveness. Supplementary Fig. S1 further demonstrates the importance of minimum accuracy. By including the minimum accuracy, we can better assess the performance of the demultiplexing algorithm.
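The worked example can be reproduced with a short script. This is a minimal sketch; the function name `per_barcode_accuracy` is introduced here for illustration, and per-barcode accuracy is computed as the fraction of sequences truly carrying a barcode that are assigned to that barcode, which is the reading that matches the 3/5 and 2/5 values in the example.

```python
def per_barcode_accuracy(true_labels, predicted_labels):
    """Fraction of reads of each true barcode that received the right label."""
    totals, correct = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    return {bc: correct.get(bc, 0) / totals[bc] for bc in totals}


truth = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
pred = [1, 2, 1, 1, 2, 1, 1, 2, 1, 2]

acc = per_barcode_accuracy(truth, pred)
average_accuracy = sum(acc.values()) / len(acc)
minimum_accuracy = min(acc.values())

print(acc)               # {1: 0.6, 2: 0.4}
print(average_accuracy)  # 0.5
print(minimum_accuracy)  # 0.4
```

Reporting the minimum alongside the average makes a single poorly handled barcode visible, which the average alone would hide when the number of barcodes is large.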

In addition, we utilize the recall rate as a measure of the model’s performance in correctly identifying positive samples. The formula for calculating the recall rate is: Recall = TP / (TP + FN). Here, TP represents true positive examples (the number of samples correctly predicted as positive by the model), and FN represents false negative examples (the number of samples that are actually positive but are incorrectly predicted as negative by the model).

Performance on all datasets

We conducted extensive experiments on all datasets to showcase the effectiveness of HycDemux, and the main experimental results are presented in Tables 6 and 7. As shown in Table 6, both HycDemux and Guppy achieve nearly perfect accuracy on datasets D1 $\sim$ D3, which have a limited number of carefully designed ONT barcodes (not exceeding 96). As the number of barcodes increases in D4$\sim$D7, Guppy’s average accuracy remains around 0.95 but its minimum accuracy drops below 0.7, indicating that Guppy fails to demultiplex sequences associated with certain barcodes. In contrast, HycDemux maintains a stable performance with a minimum accuracy above 0.9. In terms of recall, HycDemux outperforms Guppy by approximately 3%. Additionally, we observed that HycDemux exhibits fewer instances of “unclassified” labels compared to Guppy, and it aligns more closely with the ground truth value of 1000. This indicates that our demultiplexing algorithm excels at accurately assigning the correct barcode label to each sequence.

Table 6 Performance of HycDemux and Guppy (GPU version) on D1$\sim$D7 with a 10$\sim$15% sequencing error rate


The randomly selected barcodes in D4$\sim$D7 may contain basecalling errors, which could impede Guppy’s demultiplexing accuracy based on DNA sequences. Additionally, as the number of barcodes increases, distinguishing between some barcodes becomes increasingly challenging, thereby making demultiplexing more difficult. HycDemux utilizes both DNA sequence and nanopore signal information to achieve highly homogeneous clustering results and avoid basecalling errors. The voting mechanism is used to obtain demultiplexing results, which prevents abnormal sequences from affecting the accuracy of demultiplexing.

As shown in Table 7, as the sequencing error rate decreases, both HycDemux and Guppy show improved demultiplexing accuracy, which is expected because the sequencing error rate is generally inversely related to an algorithm’s accuracy. In terms of average accuracy and recall, HycDemux demonstrates an advantage of approximately 2% over Guppy. It is also worth noting that even with these improvements, the minimum accuracy of Guppy remains below 0.8, and HycDemux outperforms Guppy by $\sim$15%. This shows that under the current state-of-the-art sequencing accuracy, Guppy still cannot successfully demultiplex some samples, while HycDemux guarantees a demultiplexing accuracy above 0.9.

Compared to Guppy, HycDemux is slightly slower because it involves a large number of DTW distance calculations, but its running time remains of the same order of magnitude. HycDemux therefore still achieves a high level of demultiplexing efficiency. In our test environment, the extraction efficiency of barcode data is approximately 255 reads/s. Based on this estimate, the time required to complete the demultiplexing of D1 is 17 + 47 = 64 s, which means that our method can complete the demultiplexing of 3.5 G of data in $\sim$1 min.

Table 7 Performance of HycDemux and Guppy (GPU version) on DB4$\sim$DB7 with a 2$\sim$5% sequencing error rate


Discussion

We perform demultiplexing based on the clustering results, which offers a significant advantage. Clustering, particularly when clusters are highly homogeneous, determines the demultiplexing outcome of a sequence based on the other sequences within the same cluster. This characteristic enhances the robustness of the demultiplexing process, as all sequences within a cluster contribute to the demultiplexed result. Nanopore sequencing produces two types of data, i.e., the raw current signals and base-called reads. For barcode sequence clustering, the first consideration is which kind of data should be used for clustering. We found that the direct use of raw signal information combined with the DTW algorithm can produce good clustering performance, but the time cost is high (Additional file 1: S2). Using the read information combined with existing clustering tools is fast but cannot produce good clustering completeness. The hybrid clustering algorithm makes use of both types of data. In the initial clustering stage, the read information is used to generate the initial clustering results, and in the cluster merging and refinement stages, the raw signal information is used to continuously refine the initial clustering results. In the experiments on simulated and real datasets, the clustering accuracy of the hybrid clustering algorithm is clearly better than that of various classic clustering tools. Additionally, we have integrated a GPU-based module into our algorithm, specifically designed for computing the DTW distance matrix between time series. This module demonstrates the high efficiency of GPU-powered clustering when dealing with time series datasets. The utilization of GPUs for distance computation and clustering has been crucial, as evidenced by our experiments. By harnessing the power of GPUs, we have effectively applied certain algorithms that are slow but accurate, such as DTW distance computation with a complexity of $O(n^2)$, to big data analysis. This has ensured both the accuracy and efficiency of the analysis process, and has the potential to inspire future work in this area.

Through extensive experiments, we made an interesting observation regarding clustering tools and their clustering accuracy. While some clustering tools may not achieve high clustering accuracy, we found that certain tools utilizing greedy strategies, such as CD-HIT, can ensure near-perfect homogeneity. This observation suggests a new clustering concept: employing a greedy strategy to rapidly obtain highly homogeneous clusters and subsequently merging these clusters in a careful manner to continually improve clustering accuracy. By employing a suitable merging strategy for these initial clusters, we can achieve clustering results with significantly higher accuracy. Additionally, the complexity of clustering is substantially reduced when starting from these initial clusters rather than from the original sequence set. This strategy can be applied to DNA sequence clustering problems once an appropriate cluster merging scheme is established.

Our demultiplexing module is designed based on the hybrid clustering algorithm, which yields highly homogeneous and integrated clustering results. By employing a voting mechanism for demultiplexing each cluster, HycDemux achieves more accurate and stable demultiplexing results.

There is still room for improvement in HycDemux. Currently, we employ a heuristic scheme to extract the pseudo-barcode sequence (signal) by relying on the relationship between the length of the nanopore signal and the length of the DNA sequence. While experimental results have shown its effectiveness, there are cases where we cannot guarantee that the extracted pseudo-barcode signal contains sufficient useful information. To address this concern and prevent it from affecting the final demultiplexing results, we have adopted a simply designed DTW distance threshold (as described in the "Extract barcode information from raw data" section). In future research, we will focus on enhancing our algorithm in this aspect.

In recent years, significant advancements have been made in ONT Direct RNA sequencing. This approach eliminates the need for reverse transcription of RNA into cDNA, thereby mitigating potential issues associated with introducing errors or losing information during reverse transcription. However, it is important to note that sequencing individual RNA molecules often yields data with a relatively high error rate [46, 47]. On the other hand, the combination of RNA molecules and barcodes also enables multi-sample sequencing [34]. Our experiments show that our demultiplexing algorithm can successfully demultiplex multiple samples on datasets with a high error rate, which implies that our algorithm can be applied to the demultiplexing of RNA samples. This is also the focus of our future work.

Furthermore, barcoding is not only applicable to the multi-sample sequencing but also finds significant utility in the realm of single-cell RNA sequencing. By employing the 10X method in conjunction with ONT sequencing, RNA isoforms can be quantified at the individual cell level. The combination of ONT sequencing and the 10X method generates vast amounts of data, encompassing thousands of barcodes. These barcodes originate from a “white list” consisting of millions of barcodes. In downstream analysis, accurately identifying the barcodes within the sequences is the crucial initial step, as sequences with the same barcode are presumed to originate from the same cell. In response to this specific challenge, we aim to develop a more adaptive algorithm building upon our current work.

Conclusion

This paper presents an approach named HycDemux for barcoded sample demultiplexing in nanopore sequencing.

HycDemux initially obtains highly homogeneous clusters using the hybrid clustering algorithm and then employs a voting mechanism module to perform demultiplexing. HycDemux delivers stable performance, particularly when there is a large number of samples. It ensures a demultiplexing accuracy of $>0.9$ per sample, which is approximately 0.3 higher than the accuracy of the state-of-the-art method on the high error rate datasets and 0.15 higher than the state-of-the-art method on the low error rate datasets. Moreover, experiments on datasets with high error rates imply that HycDemux can be applied to direct RNA sequencing problems, especially RNA demultiplexing of multiple samples. In particular, the introduction of GPU acceleration significantly reduces the execution time of signal similarity comparison, which makes the processing of very large amounts of data possible. In addition, the experimental evaluation of GPU-based DTW calculation demonstrates the efficient utilization of GPUs in clustering analysis. This approach ensures both efficiency and accuracy in the measurement process, offering valuable insights and reference for related research endeavors.

Materials and methods

Overview

We have designed a heuristic scheme to extract pseudo-barcode sequences (signals) from raw data for subsequent clustering and demultiplexing. For these pseudo-barcode sequences (signals), we developed an unsupervised hybrid approach, in which a nucleobase-based greedy algorithm is used to obtain initial clusters, and the raw signal information is measured to guide the continuous optimization and refinement of the clustering results. Figure 6 shows the detailed workflow of hybrid clustering.

Given the nanopore sequences, we first use the nucleotide information for initial clustering to generate clusters with high homogeneity (identity $\geqslant$ 95%). This step is based on a greedy clustering strategy and is very fast. Then, we select some sequences in the clusters to determine the threshold for subsequent cluster merging and refinement. Finally, we perform cluster merging and refinement by calculating the DTW distance between the raw signals and each cluster’s representative signals, with GPU-accelerated DTW to ensure efficiency. To address the demultiplexing problem, we designed a module based on the voting mechanism to parse the demultiplexing results from the clustering results. The usage of our method is presented in Additional file 1: S5. In the following, we give the details of each step in the hybrid clustering and demultiplexing; the detailed pseudocode for each step is given in Additional file 1: S3.

Extract barcode information from raw data

Raw data comprises both the native nanopore signals and their corresponding DNA sequences. To successfully demultiplex the raw data, it is crucial to extract the barcode information accurately. To accomplish this, we have devised a heuristic scheme based on the distinctive characteristics of DNA libraries (refer to Fig. 5A).

In our approach, we begin by analyzing the native nanopore signal. We employ the semi-global dynamic time warping algorithm to identify the position of the adapter signal within the signal. The tail position of the adapter signal serves as the starting point for the barcode signal. By leveraging this information, we are able to locate the barcode signal within the nanopore signal accurately. The determination of the barcode signal’s position takes into account the length of the barcode sequence and the sampling rate of the nanopore signal.
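As an illustration of this step, the following minimal sketch locates an adapter inside a longer signal with a semi-global (subsequence) DTW and then cuts out the barcode window. It is written in plain Python for readability and is not the HycDemux implementation; the names `locate_adapter_end`, `barcode_window`, and `samples_per_base` are hypothetical, and we assume that the barcode window length is simply the barcode length multiplied by an average number of signal samples per base.

```python
def locate_adapter_end(signal, adapter):
    """Semi-global DTW: the whole adapter is aligned, but the alignment may
    start and end anywhere inside the longer read signal.
    Returns (end_index, cost); end_index is the last signal sample matched."""
    inf = float("inf")
    m = len(signal)
    prev = [0.0] * (m + 1)                 # free start: row 0 costs nothing
    for a in adapter:
        cur = [inf] * (m + 1)              # column 0 stays unreachable
        for j in range(1, m + 1):
            d = abs(a - signal[j - 1])
            cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    end = min(range(1, m + 1), key=lambda j: prev[j])   # free end
    return end - 1, prev[end]

def barcode_window(signal, adapter, barcode_len, samples_per_base):
    """Barcode signal assumed to start right after the adapter tail."""
    end, _ = locate_adapter_end(signal, adapter)
    start = end + 1
    return signal[start:start + barcode_len * samples_per_base]

# Toy signal: flat region, adapter-like ramp (5, 6, 7), then the "barcode".
toy_signal = [0, 0, 0, 5, 6, 7, 1, 1, 1, 1, 0, 0]
toy_adapter = [5, 6, 7]
adapter_end, adapter_cost = locate_adapter_end(toy_signal, toy_adapter)
window = barcode_window(toy_signal, toy_adapter, barcode_len=2, samples_per_base=2)
```

In practice the signals would be normalized before comparison and the number of samples per base would be derived from the sampling rate and translocation speed; both details are outside the scope of this sketch.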

Similarly, for the DNA sequence, we use Edlib to identify the position of the adapter sequence within the sequence. Subsequently, we determine the position of the barcode sequence based on its length. Assuming that the standard adapter sequence has a length of n, if the edit distance between the standard adapter sequence and the DNA sequence exceeds 0.45 times n (using local alignment), the barcode label of the sequence is deemed ambiguous and marked as “unclassified”.
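The sequence-side rule can be sketched as follows. HycDemux uses Edlib for this alignment; to keep the example self-contained, we substitute a small pure-Python infix edit distance, so the functions `infix_edit_distance` and `extract_pseudo_barcode` are illustrative stand-ins rather than the actual code. We also assume that the barcode immediately follows the best adapter match.

```python
def infix_edit_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text,
    together with the (exclusive) end position of that substring."""
    prev = [0] * (len(text) + 1)                 # a match may start anywhere
    for i, p in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            cur[j] = min(prev[j] + 1,            # pattern character unmatched
                         cur[j - 1] + 1,         # extra character in text
                         prev[j - 1] + (p != t)) # match or substitution
        prev = cur
    end = min(range(len(text) + 1), key=lambda j: prev[j])
    return prev[end], end

def extract_pseudo_barcode(read, adapter, barcode_len, max_ratio=0.45):
    """Return the pseudo-barcode, or None when the read is 'unclassified'."""
    distance, end = infix_edit_distance(adapter, read)
    if distance > max_ratio * len(adapter):
        return None
    return read[end:end + barcode_len]

toy_adapter = "ACGTACGT"
good_read = "TT" + toy_adapter + "GGCC" + "AAAA"
bad_read = "TTTTTTTTTTTT"
barcode = extract_pseudo_barcode(good_read, toy_adapter, barcode_len=4)
rejected = extract_pseudo_barcode(bad_read, toy_adapter, barcode_len=4)
```

Here `barcode` is the four bases following the adapter, while `rejected` is `None` because the best adapter match in `bad_read` needs more than 0.45 × 8 edits.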

All the sequences (signals) that we extract, both from the nanopore signals and DNA sequences, are referred to as pseudo-barcoded sequences (signals). These pseudo-barcoded sequences (signals) are utilized for subsequent clustering and demultiplexing stages.

Initial clustering

We use a nucleobase-based greedy algorithm to generate a homogeneous initial clustering, in a process similar to that of CD-HIT. Figure 7 describes the detailed workflow. First, the sequences are sorted in descending order of sequence length. The longest sequence is taken as the representative sequence of the first cluster. A short word filter [13] is applied to reduce the number of pairwise alignments. Each selected sequence is then compared with the existing representative sequences. If the similarity between the selected sequence and a representative sequence is higher than the threshold, the selected sequence is merged into the cluster of that representative sequence. Otherwise, the selected sequence becomes a new representative sequence. This process is repeated until all the sequences are visited, resulting in a number of clusters and a set of ultra-short sequences that do not belong to any cluster. Finally, a verification mechanism is used to check the homogeneity of the clusters and retrieve the ultra-short sequences that are misclassified.

With a sufficiently high threshold (e.g., identity $\geqslant$ 95%), the initial clustering quickly generates clusters with high homogeneity. These initial clusters can be considered completely correct, which significantly reduces the number of pairwise comparisons in the subsequent signal-similarity-based clustering.
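The greedy incremental scheme can be sketched as below. This is a simplified illustration under our own assumptions: identity is defined here as one minus the edit distance divided by the longer sequence length, and the short word filter and the final verification step are omitted. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def identity(a, b):
    """Assumed identity measure: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def greedy_cluster(sequences, threshold=0.95):
    """Visit sequences from longest to shortest; join the first representative
    that is similar enough, otherwise start a new cluster."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]), reverse=True)
    clusters = []                       # list of (representative index, member indices)
    for idx in order:
        for rep, members in clusters:
            if identity(sequences[idx], sequences[rep]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((idx, [idx]))
    return clusters

toy_reads = ["ACGTACGTACGTACGTACGTA", "ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT"]
toy_clusters = greedy_cluster(toy_reads)
```

Because each sequence is only compared with cluster representatives, the number of alignments grows with the number of clusters rather than with the number of sequence pairs.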

Threshold determination

After obtaining the initial clustering result, we need to further refine it according to the raw signal information. The refinement of the initial clusters depends on the merging threshold, which is critical to the final demultiplexing accuracy. Because of the initial clusters’ high homogeneity, the merging threshold can be determined from the initial clustering result.

Once the initial clustering is complete, we obtain the representative units of good clusters and their corresponding nanopore signals. All clusters are sorted in descending order based on their size, and the clusters ranked top-($0.01 \times |clusters|$) are called good clusters, where |clusters| refers to the number of clusters. The size of the cluster with rank exactly $0.01 \times |clusters|$ is defined as GoodIndex. We calculate the pairwise DTW distances of all these nanopore signals and set the threshold as the average of the maximum and minimum distances divided by a constant value k (default is 4).
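A minimal sketch of the threshold rule follows. We read "the average of the maximum and minimum distances divided by k" as ((max + min) / 2) / k and compute the distances between the representative signals of the good clusters; both readings are our assumptions, as are the function names.

```python
from itertools import combinations

def dtw_distance(a, b):
    """Classic DTW with absolute difference as local cost (rolling row)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]

def select_good_clusters(clusters, fraction=0.01):
    """Top `fraction` of clusters by size, plus the size of the last one (GoodIndex)."""
    ranked = sorted(clusters, key=len, reverse=True)
    n_good = max(1, int(fraction * len(ranked)))
    return ranked[:n_good], len(ranked[n_good - 1])

def merge_threshold(good_signals, k=4):
    """((max + min) / 2) / k over all pairwise DTW distances."""
    distances = [dtw_distance(s, t) for s, t in combinations(good_signals, 2)]
    return (max(distances) + min(distances)) / 2 / k

# Toy data: 300 clusters of sizes 1..300 and three representative signals.
toy_clusters = [list(range(size)) for size in range(1, 301)]
good, good_index = select_good_clusters(toy_clusters)
threshold = merge_threshold([[0, 0], [1, 1], [4, 4]])
```

For the three toy signals the pairwise distances are 2, 8, and 6, so the threshold is (8 + 2) / 2 / 4 = 1.25.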

Cluster merging and refinement

Cluster merging

For a certain multiplex sequencing configuration, it is possible to estimate the minimal set size of a cluster. Here, we define an initial cluster with set size larger than GoodIndex as a good cluster and denote $GoodClusterSet =\left\{ C_{1},C_{2},...,C_{M}\right\}$ as the set of good clusters of the initial clustering result, where M is the number of good clusters and $\{C_{i}\}$ is sorted in the descending order according to their cardinality.

For each $C_{i}$, we randomly select K raw signal sequences within $C_{i}$, and record these signals as $\{sig_{i_{k}}\}_{k=1,2,...,K}$, where K must satisfy

$$\begin{aligned} K <\min \{ \left| C_{i}\right| \}_ {i=1,2,...,M}. \end{aligned}$$

At each iteration, we choose the top unvisited cluster $C_{i}$, i.e., the largest unvisited cluster in GoodClusterSet, as a query to compare with the other clusters ($i=1$ in the first iteration). We compare $\{sig_{i_{k}}\}$ with the sampled K raw signals $\{sig_{m_{k}}\}$ from the remaining clusters ($1<m<M$) by the DTW distance. If for $\forall p,q,\ DTW(sig_{i_{p}},sig_{m_{q}}) <threshold$, $C_{m}$ is merged into $C_{i}$, and $GoodClusterSet = GoodClusterSet \backslash C_{m}$. In each iteration, the selected cluster $C_{i}$ is compared with the remaining clusters $\{C_{m}\}$ and is merged with all the clusters $C_{m}$ that satisfy the DTW distance constraint. We iteratively select the top unvisited cluster and perform the cluster merging until the relationships between all clusters have been checked.
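The merging loop can be illustrated with the sketch below, which runs on the CPU and uses toy data. It is not the GPU implementation; `merge_good_clusters`, the dictionary `signals` (read identifier to signal), and the fixed random seed are hypothetical choices made for the example.

```python
import random

def dtw_distance(a, b):
    """Classic DTW with absolute difference as local cost (rolling row)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]

def merge_good_clusters(good_clusters, signals, threshold, k, seed=0):
    """Repeatedly take the largest unvisited cluster as the query and absorb
    every remaining cluster whose K sampled signals are all within threshold."""
    rng = random.Random(seed)
    remaining = sorted(good_clusters, key=len, reverse=True)
    samples = [rng.sample(c, k) for c in remaining]     # K < smallest cluster size
    merged = []
    while remaining:
        query, query_sample = remaining.pop(0), samples.pop(0)
        keep, keep_samples = [], []
        for cluster, sample in zip(remaining, samples):
            if all(dtw_distance(signals[p], signals[q]) < threshold
                   for p in query_sample for q in sample):
                query = query + cluster                 # C_m is merged into C_i
            else:
                keep.append(cluster)
                keep_samples.append(sample)
        remaining, samples = keep, keep_samples
        merged.append(query)
    return merged

toy_signals = {0: [1, 1, 1], 1: [1, 1, 1], 2: [1, 1, 1],
               3: [1, 1, 1.1], 4: [1, 1, 1.1],
               5: [9, 9, 9], 6: [9, 9, 9]}
toy_good = [[0, 1, 2], [3, 4], [5, 6]]
toy_merged = merge_good_clusters(toy_good, toy_signals, threshold=1.0, k=1)
```

Requiring every sampled pair to fall below the threshold makes merging conservative, which protects the homogeneity of the initial clusters.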

Refinement 1

It should be noted that the GoodClusterSet has been changed during cluster merging. After the final merging, a set of refined clusters can be obtained, i.e., $GoodClusterSet' =\left\{ C_{1}',C_{2}',...,C_{T}'\right\} , T \le M$. The corresponding representative sequence set can be denoted as

$$\begin{aligned} ConSeqSet =\left\{ cseq_{1},cseq_{2},...,cseq_{T} \right\} . \end{aligned}$$

Given a representative sequence, its corresponding nanopore signal can also be obtained. Thus, from the representative sequences, a set of representative signals can be generated, which is denoted as

$$\begin{aligned} ConSigSet =\left\{ csig_{1},csig_{2},...,csig_{T}\right\} . \end{aligned}$$

The representative signals are used as standard references to optimize the initial clustering results. For the sequences that are not in $C_{i},i = 1,2,...,M$, we get the corresponding raw signals of these sequences and calculate the DTW distance between these raw signals and the representative signals in ConSigSet. For a given sequence, if the distance between this sequence’s raw signal and a representative signal is less than the threshold, the sequence is merged into the representative signal’s corresponding cluster.

Refinement 2

After the above steps, there are still some sequences that have not been classified. We get the raw signals of these sequences and process them as follows: first, randomly select a sequence to generate a new cluster $C_{new}$, where the selected sequence is the representative sequence of $C_{new}$, denoted by $seq_{new}$. Then, calculate the DTW distance between the raw signal of $seq_{new}$ and the raw signals of the remaining sequences. If the distance is less than the threshold, add the corresponding sequence to $C_{new}$. Repeat the process until all the sequences are visited.
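The two refinement passes described above can be sketched together as follows. Two simplifications are our own assumptions: when several representative signals are within the threshold we pick the nearest one, and in the second pass we take the first leftover read as the new representative instead of a random one, so that the example is deterministic. `refine` and its arguments are hypothetical names, and the list of representative signals is assumed to be non-empty.

```python
def dtw_distance(a, b):
    """Classic DTW with absolute difference as local cost (rolling row)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]

def refine(unassigned, signals, rep_signals, threshold):
    """Refinement 1: attach reads to an existing cluster via its representative
    signal. Refinement 2: greedily form new clusters from what is left."""
    assigned, leftover = {}, []
    for read in unassigned:
        dists = [dtw_distance(signals[read], rep) for rep in rep_signals]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] < threshold:
            assigned[read] = best          # index into ConSigSet
        else:
            leftover.append(read)
    new_clusters = []
    while leftover:
        seed, rest = leftover[0], leftover[1:]
        cluster, leftover = [seed], []
        for read in rest:
            if dtw_distance(signals[seed], signals[read]) < threshold:
                cluster.append(read)
            else:
                leftover.append(read)
        new_clusters.append(cluster)
    return assigned, new_clusters

toy_reps = [[1, 1, 1], [9, 9, 9]]
toy_signals = {"a": [1, 1, 1.2], "b": [9, 9, 9], "c": [5, 5, 5],
               "d": [5, 5, 5.1], "e": [20, 20, 20]}
toy_assigned, toy_new = refine(["a", "b", "c", "d", "e"], toy_signals, toy_reps, 1.0)
```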

Figure 1 illustrates an example for the merging and refinement process of 30 sequences from two cells. The clustering accuracy is continuously improved with the utilization of all the information residing in the nucleobase sequences and raw signals.

Demultiplexing module based on voting mechanism

We performed demultiplexing on each cluster obtained from the hybrid clustering algorithm (as shown in Fig. 1D). To achieve this, we followed a specific procedure.

Given a cluster set, we initially selected the first k elements (with a default value of 5) corresponding to k pseudo-barcoded signals. For these selected signals, we computed the DTW distance matrix between them and all the standard barcode signals.

Next, we determined the row index of the minimum value in each column of the DTW distance matrix, resulting in a k-dimensional vector. This vector captures the closest match for each pseudo-barcoded signal among the standard barcode signals.

Finally, we calculated the mode of the k-dimensional vector, which represents the most frequent value in the vector. This mode value serves as the final demultiplexing result for the cluster. In other words, it represents the demultiplexing outcome for each sequence within the cluster.
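The voting procedure amounts to a column-wise argmin followed by a mode, as the following sketch shows. The function name `vote_barcode` and the toy signals are illustrative; ties in the mode are broken in favor of the value seen first, which is a property of this sketch rather than something specified in the text.

```python
from collections import Counter

def dtw_distance(a, b):
    """Classic DTW with absolute difference as local cost (rolling row)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]

def vote_barcode(cluster_signals, barcode_signals, k=5):
    """Each of the first k pseudo-barcode signals votes for its nearest
    standard barcode; the most frequent vote labels the whole cluster."""
    votes = []
    for signal in cluster_signals[:k]:
        # One column of the DTW distance matrix (rows = standard barcodes).
        column = [dtw_distance(reference, signal) for reference in barcode_signals]
        votes.append(min(range(len(column)), key=column.__getitem__))
    return Counter(votes).most_common(1)[0][0], votes

toy_barcodes = [[1, 1, 1], [5, 5, 5], [9, 9, 9]]
toy_cluster = [[5, 5, 4.8], [5.1, 5, 5], [9, 9, 9], [5, 5, 5], [4.9, 5, 5]]
label, votes = vote_barcode(toy_cluster, toy_barcodes)
```

In the toy cluster one of the five signals is closest to the wrong barcode, yet the cluster is still labeled correctly, which is the robustness the voting step is meant to provide.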

In addition, we identify sequences with ambiguous barcode labels using a simple, predetermined DTW distance threshold. The first cluster obtained from the clustering results is often of high quality and serves as the basis for generating this threshold. First, 100 sequences are randomly selected from the cluster. Second, the DTW distance matrix is calculated for their corresponding pseudo-barcode signals. Next, the average value of the matrix elements is computed and referred to as “mean.” Finally, the threshold is set to c times the mean, where c defaults to 1.65. For a cluster containing only one sequence, the following criterion is applied: if the minimum DTW distance between the standard barcode signals and the pseudo-barcode signal corresponding to this sequence exceeds the threshold, the barcode label of the sequence is considered ambiguous and marked as “unclassified.”
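A minimal sketch of this rule is given below. We take "the average value of the matrix elements" literally and average over the full square matrix, including its zero diagonal; this reading, the fixed random seed, and the function names are our assumptions.

```python
import random

def dtw_distance(a, b):
    """Classic DTW with absolute difference as local cost (rolling row)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]

def ambiguity_threshold(first_cluster_signals, c=1.65, sample_size=100, seed=0):
    """c times the mean of the DTW matrix of up to `sample_size` sampled signals."""
    rng = random.Random(seed)
    n = min(sample_size, len(first_cluster_signals))
    sample = rng.sample(first_cluster_signals, n)
    matrix = [[dtw_distance(a, b) for b in sample] for a in sample]
    mean = sum(map(sum, matrix)) / (n * n)
    return c * mean

def label_singleton(signal, barcode_signals, threshold):
    """Label a one-sequence cluster, or mark it 'unclassified'."""
    dists = [dtw_distance(reference, signal) for reference in barcode_signals]
    best = min(range(len(dists)), key=dists.__getitem__)
    return "unclassified" if dists[best] > threshold else best

toy_threshold = ambiguity_threshold([[0, 0], [2, 2]])
toy_barcodes = [[0, 0], [10, 10]]
near = label_singleton([0, 0.5], toy_barcodes, toy_threshold)
far = label_singleton([5, 5], toy_barcodes, toy_threshold)
```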

GPU-accelerated DTW

The most computationally expensive part of HycDemux is the calculation of tens of millions to hundreds of millions of DTW distances. Generally, the computational complexity of a DTW algorithm is O(mn) if the algorithm is implemented sequentially, where m and n are the lengths of the compared sequences. However, with the development of graphics processing units (GPUs) for general-purpose processing, CUDA (Compute Unified Device Architecture) has been widely used to accelerate computational biology tasks [48,49,50]. Here, we propose a CUDA-based GPU-accelerated DTW to solve the speed problem, by combining a coarse-grained block-wise acceleration strategy and a fine-grained multi-thread acceleration strategy. Figure 8 describes the outline of our acceleration strategy.

Dependency analysis

The DTW distances between different pairs of signals are completely independent of one another. Thus, a coarse-grained block-wise parallel strategy is devised to calculate these DTW distances simultaneously, one within each CUDA block, as shown in Fig. 8A. The computation of the DTW weight matrix can also be accelerated by CUDA. However, data dependence exists in the calculation of a DTW matrix, i.e., the calculation of position (i, j) in the DTW matrix needs the values in positions $(i -1, j)$, $(i -1, j-1)$, and $(i, j-1)$. The elements on a slash lane (anti-diagonal) of a DTW matrix, in contrast, are independent of each other (Fig. 9). Thus, we change the calculation of the DTW matrix from sequential order into slash-lane order and propose a fine-grained multi-thread parallel strategy to ensure speed and accuracy, as shown in Fig. 8B.
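The change of traversal order can be illustrated on the CPU with the sketch below, which fills the same DTW matrix once row by row and once slash lane by slash lane. It is plain sequential Python meant only to show that the dependencies are respected; it is not the CUDA kernel, and the function names are ours.

```python
def dtw_row_order(a, b):
    """Reference DTW filled row by row."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]

def dtw_slash_lane_order(a, b):
    """Same recurrence, visited lane by lane: every cell with i + j == s
    depends only on lanes s - 1 and s - 2, so a lane could run in parallel."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for s in range(2, n + m + 1):
        for i in range(max(1, s - m), min(n, s - 1) + 1):   # independent cells
            j = s - i
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]

x = [1, 3, 4, 9, 8]
y = [1, 2, 4, 4, 9, 7, 8]
row_value = dtw_row_order(x, y)
lane_value = dtw_slash_lane_order(x, y)
```

Both traversals produce the same distance; only the inner loop of the lane-ordered version is free of dependencies between its iterations.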

Block-wise acceleration

Within a general GPU card with NVIDIA Turing architecture, up to a few million blocks are allowed to execute asynchronously and concurrently. As shown in Fig. 8A, each CUDA block is responsible for calculating one DTW distance. That is, millions of blocks could be initialized to calculate these DTW distances simultaneously, which makes the calculation extremely fast. In contrast, a multi-CPU server may only contain a few dozen cores, allowing the simultaneous calculation of only a few dozen DTW distances.

Multi-thread acceleration

Each DTW matrix is calculated by multiple threads lane by lane. A synchronization strategy is applied to ensure that the values needed by the current position have already been calculated. To control which columns should be calculated at a given time, we use a register variable T that serves as a timer. For all $i\in \lbrace 0,1,2,...,n-1\rbrace$, the ith thread calculates the ith row (counting from 0), so the thread with thread number t processes the $(T-t)$th element of its row at time T. Threads for which $T-t<0$ wait in place until $T-t\geqslant 0$. Figure 8B shows an example with $T=25$.
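The timer-based schedule can be simulated sequentially, as in the following sketch. The outer loop plays the role of the timer T and of the synchronization barrier, and the inner loop stands in for the threads of one block; this is an illustration of the schedule under our reading of the text, not the CUDA kernel itself.

```python
def dtw_timer_schedule(a, b):
    """Thread t owns row t; at time T it fills column T - t of that row,
    or idles when that column index is out of range."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    steps = 0
    for T in range(n + m - 1):             # timer shared by all threads
        for t in range(n):                 # conceptually concurrent threads
            col = T - t
            if 0 <= col < m:               # otherwise the thread waits
                i, j = t + 1, col + 1      # 1-based cell in the padded matrix
                D[i][j] = abs(a[t] - b[col]) + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
        steps += 1                         # barrier before the next time step
    return D[n][m], steps

value, steps = dtw_timer_schedule([1, 2, 3], [1, 2, 2, 3])
```

With n rows and m columns the schedule finishes after n + m - 1 time steps instead of n × m sequential cell updates, which is the source of the fine-grained speedup.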

On-chip storage

Since a CUDA block contains 1024 threads at most but the longest signal length is up to $\sim$1500 (a barcode’s length is up to 145, while the corresponding signal is $8\sim 10$ times as long as the barcode sequence), we extended the algorithm to let a thread compute two DTW matrix rows at a time, which enables a block to process signals of length up to 2048. Considering that barcode sequences are not too long (such as 40 nt), a GPU card of the current Turing architecture can fully store the data and perform the calculations in on-chip memory (shared memory), which avoids the copy cost from global memory and further accelerates the calculation. As shown in Fig. 8C, green cells represent the elements stored in shared memory, and the yellow cells are the elements being calculated. In fact, the maximum amount of shared memory per block is 163 KB on NVIDIA Turing architecture, which allows one thread to process 9 DTW rows whose elements are stored in single-precision float format.

Availability of data and materials

To evaluate our hybrid clustering algorithm, we used both simulated and real datasets. The simulated datasets (S1 to S6) are accessible at https://doi.org/10.5281/zenodo.8256481 [51]. The real dataset [52] is accessible at https://doi.org/10.5281/zenodo.8256500. It was obtained (details are in the Results) from the European Nucleotide Archive (ENA) under accession ERR2767931 [33] and from figshare, which can be accessed at https://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2767931/deepbinner_amplicon_fast5s.tar.gz and https://figshare.com/projects/Deepbinner/34223, respectively. Both the simulated and real datasets contain the extracted barcode signals and sequences, which are essential for the direct evaluation of our hybrid clustering algorithm in HycDemux.

All reads with barcodes (considered as positive samples) in datasets D1 to D7 can be accessed via the following URL: https://doi.org/10.5281/zenodo.8264231 [53]. Additionally, the reads [54] without true barcodes (considered as negative samples) in datasets D1 to D7 are accessible at https://doi.org/10.5281/zenodo.8260510. Moreover, the reads [55] in datasets DB4 to DB7 are accessible at https://doi.org/10.5281/zenodo.8260583.

Negative sample signals of D1 to D7 (DB4 to DB7) can be accessed at https://doi.org/10.5281/zenodo.8260534 [72].

The HycDemux software is available on GitHub at https://github.com/junhaiqi/Hybrid_clustering.git [73] under the GNU General Public License v3.0. Additionally, the source code for HycDemux has been deposited at https://doi.org/10.5281/zenodo.8260659 [74].

References

Church GM, Kieffer-Higgins S. Multiplex DNA sequencing. Science. 1988;240(4849):185–8.
Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012;9(1):72–4.
Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014;11(2):163.
Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genomics. 2017;3(10):e000132.
Wei S, Weiss ZR, Williams Z. Rapid multiplex small DNA sequencing on the MinION nanopore sequencing platform. G3 Genes Genomes Genet. 2018;8(5):1649–57.
Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nat Biotechnol. 2016;34(5):518–24.
Jain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 2016;17(1):1–11.
Han R, Wang S, Gao X. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics. 2020;36(5):1333–43.
Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 2019;20(1):1–10.
Byrne A, Beaudin AE, Olsen HE, Jain M, Cole C, Palmer T, et al. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat Commun. 2017;8(1):1–11.
Lebrigand K, Magnone V, Barbry P, Waldmann R. High throughput error corrected Nanopore single cell transcriptome sequencing. Nat Commun. 2020;11(1):1–8.
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282–3.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Ghodsi M, Liu B, Pop M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinformatics. 2011;12(1):1–11.
James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 2018;46(14):e83–e83.
Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol. 2009;16(12):1615–34.
Lu G, Zhang S, Fang X. An improved string composition method for sequence comparison. BMC Bioinformatics. 2008;9(6):1–8.
Aita T, Husimi Y, Nishigaki K. A mathematical consideration of the word-composition vector method in comparison of biological sequences. BioSystems. 2011;106(2–3):67–75.
Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol. 2011;276(1):174–80.
Wei D, Jiang Q. A DNA sequence distance measure approach for phylogenetic tree construction. In: 2010 IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA). Changsha: IEEE; 2010. p. 204–12.
Wei D, Jiang Q, Wei Y, Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics. 2012;13(1):1–15.
Zorita E, Cusco P, Filion GJ. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015;31(12):1913–9.
Zhao L, Liu Z, Levy SF, Wu S. Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics. 2018;34(5):739–47.
Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015;12(8):733–5.
Loose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 2016;13(9):751–4.
Kovaka S, Fan Y, Ni B, Timp W, Schatz MC. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol. 2021;39(4):431–41.
Szalay T, Golovchenko JA. De novo sequencing and variant calling with nanopores using PoreSeq. Nat Biotechnol. 2015;33:1087–91. https://doi.org/10.1038/nbt.3360.
Giesselmann P, Brändl B, Raimondeau E, Bowen R, Rohrandt C, Tandon R, et al. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat Biotechnol. 2019;37(12):1478–81.
Simpson JT, Workman RE, Zuzarte P, David M, Dursi L, Timp W. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods. 2017;14(4):407–10.
Ni P, Huang N, Zhang Z, Wang DP, Liang F, Miao Y, et al. DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics. 2019;35(22):4586–95.
Tourancheau A, Mead EA, Zhang XS, Fang G. Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing. Nat Methods. 2021;18(5):491–8.
Wick RR, Judd LM, Holt KE. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks. PLoS Comput Biol. 2018;14(11):e1006583.
Smith MA, Ersavas T, Ferguson JM, Liu H, Lucas MC, Begik O, et al. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res. 2020;30(9):1345–53.
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.
Sereika M, Kirkegaard RH, Karst SM, Michaelsen TY, Sørensen EA, Wollenberg RD, et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods. 2022;19(7):823–6.
Sanderson ND, Kapel N, Rodger G, Webster H, Lipworth S, Street TL, et al. Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. Microb Genomics. 2023;9(1):mgen000910.
Ferguson S, McLay T, Andrew RL, Bruhl JJ, Schwessinger B, Borevitz J, et al. Species-specific basecallers improve actual accuracy of nanopore sequencing in plants. Plant Methods. 2022;18(1):1–11.
Šošić M, Šikić M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics. 2017;33(9):1394–5.
Boža V, Brejová B, Vinař T. Improving nanopore reads raw signal alignment. arXiv preprint arXiv:1705.01620. 2017.
Smith P, Lindsey RL, Rowe LA, Batra D, Stripling D, Garcia-Toledo L, et al. High-quality whole-genome sequences for 21 enterotoxigenic Escherichia coli strains generated with PacBio sequencing. Genome Announc. 2018;6(2):e01311-17.
Kim J, Lindsey RL, Garcia-Toledo L, Loparev VN, Rowe LA, Batra D, et al. High-quality whole-genome sequences for 59 historical Shigella strains generated with PacBio sequencing. Genome Announc. 2018;6(15):e00282-18.
Patel PN, Lindsey RL, Garcia-Toledo L, Rowe LA, Batra D, Whitley SW, et al. High-quality whole-genome sequences for 77 Shiga toxin-producing Escherichia coli strains generated with PacBio sequencing. Genome Announc. 2018;6(19):e00391-18.
Ezpeleta J, Garcia Labari I, Villanova GV, Bulacio P, Lavista-Llanos S, Posner V, et al. Robust and scalable barcoding for massively parallel long-read sequencing. Sci Rep. 2022;12(1):7619.
Srivathsan A, Lee L, Katoh K, Hartop E, Kutty SN, Wong J, et al. ONTbarcoder and MinION barcodes aid biodiversity discovery and identification by everyone, for everyone. BMC Biol. 2021;19:1–21.
Jain M, Abu-Shumays R, Olsen HE, Akeson M. Advances in nanopore direct RNA sequencing. Nat Methods. 2022;19(10):1160–4.
Liu-Wei W, van der Toorn W, Bohn P, Hölzer M, Smyth R, von Kleist M. Sequencing accuracy and systematic errors of nanopore direct RNA sequencing. bioRxiv. 2023;2023–03.
Schatz MC, Trapnell C, Delcher AL, Varshney A. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics. 2007;8(1):1–10.
Manavski SA, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics. 2008;9(2):1–9.
Han R, Wan X, Li L, Lawrence A, Yang P, Li Y, et al. Autom-dualx: a toolkit for fully automatic fiducial marker-based alignment of dual-axis tilt series with simultaneous reconstruction. Bioinformatics. 2019;35(2):319–28.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Datasets S1–S6 for evaluating hybrid clustering algorithm. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8256481.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. A real dataset for evaluating hybrid clustering algorithm. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8256500.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. All DNA sequences in datasets D1–D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264231.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Negative sample sequences contained in datasets D1–D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8260510.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Sequences (DB4–DB7) with low sequencing error rate for evaluating HycDemux. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8260583.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Dataset D1 used to evaluate HycDemux. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264226.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Dataset D2 used to evaluate HycDemux. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264249.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The first part of all non-negative sample nanopore signals in dataset D3. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8256994.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The second part of all non-negative sample nanopore signals in dataset D3. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264210.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The third part of all non-negative sample nanopore signals in dataset D3. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8260102.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The fourth part of all non-negative sample nanopore signals in dataset D3. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264137.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.1 part of all non-negative sample nanopore signals in dataset D4–D7 (DB4–DB7). Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266227.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.2 part of all non-negative sample nanopore signals in dataset D4–D7 (DB4–DB7). Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266246.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.3 part of all non-negative sample nanopore signals in dataset D4–D7 (DB4–DB7). Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266248.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.4 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266251.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.5 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266225.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.6 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266223.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.7 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266221.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.8 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8264285.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.9 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266219.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. The NO.10 part of all non-negative sample nanopore signals in dataset D7. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8266213.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Nanopore signals corresponding to all negative sample sequences. Datasets. Nanopore Sequencing Data. 2023. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8260534.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Source code for “HycDemux: A hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing”. 2023. GitHub. https://github.com/junhaiqi/Hybrid_clustering.git.
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. Source code for “HycDemux: A hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing”. 2023. Zenodo. https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.8260659.

Acknowledgements

We are grateful to Prof. Lei Li for discussions and suggestions on the methods and experiments.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2.

Funding

This research was supported by the National Key Research and Development Program of China [2020YFA0712400 and 2021YFF0704300], the National Natural Science Foundation of China Projects Grant [62072280, 11931008, 61771009, 32241027], the Natural Science Foundation of Shandong Province [ZR2023YQ057], the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award Nos. FCC/1/1976-44-01, FCC/1/1976-45-01, REI/1/5234-01-01, REI/1/5414-01-01, and URF/1/4352-01-01, and the open project of BGI-Shenzhen [BGIRSZ20220005].

Author information

Renmin Han and Junhai Qi should be considered joint first authors.

Authors and Affiliations

Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
Renmin Han, Junhai Qi, Yang Xue & Guojun Li
BioMap Research, California, USA
Junhai Qi
High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Xiujuan Sun
School of Medical Technology, Beijing Institute of Technology, Beijing, 100085, China
Fa Zhang
King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955, Saudi Arabia
Xin Gao

Contributions

G.L., X.G., and F.Z. conceived and managed the project. R.H. and J.Q. implemented the algorithm, collected all the datasets, and performed all the analysis. Y.X. and X.S. were involved in the data analysis and testing of the algorithm. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Fa Zhang, Xin Gao or Guojun Li.

Ethics declarations

Ethics approval and consent to participate

All data and samples used in this study were collected and analyzed in compliance with relevant ethical standards. As such, no formal ethical approval was necessary for the conduct of this research.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

S1. Evaluation criteria.
S2. Comparison experiment table of signal-similarity based clustering method and base space-based clustering method.
S3. Comparison tables of hybrid clustering algorithm and three clustering tools.
S4. Pseudo code about hybrid clustering algorithm.
S5. Usage of our method.
Table S1. A performance comparison of various clustering methods was conducted on a simulated dataset containing 50 clusters, with 2000 sequences of approximately 145 bp in length.
Table S2. A performance comparison of various clustering methods was conducted on a simulated dataset containing 100 clusters, with 2000 sequences of approximately 145 bp in length.
Table S3. A performance comparison of various clustering methods was conducted on a simulated dataset containing 20 clusters, with 2000 sequences of approximately 145 bp in length.
Table S4. A performance comparison of various clustering methods was conducted on a simulated dataset containing 100 clusters, with 2000 sequences of approximately 95 bp in length.
Table S5. A performance comparison of various clustering methods was conducted on a simulated dataset containing 50 clusters, with 2000 sequences of approximately 95 bp in length.
Table S6. A performance comparison of various clustering methods was conducted on a simulated dataset containing 20 clusters, with 2000 sequences of approximately 95 bp in length.
Table S7. A performance comparison of various clustering methods was conducted on a simulated dataset containing 100 clusters, with 2000 sequences of approximately 45 bp in length.
Table S8. A performance comparison of various clustering methods was conducted on a simulated dataset containing 50 clusters, with 2000 sequences of approximately 45 bp in length.
Table S9. A performance comparison of various clustering methods was conducted on a simulated dataset containing 20 clusters, with 2000 sequences of approximately 45 bp in length.
Table S10. Comparison of the performances of the three tools and our method on the simulation data set 1.
Table S11. Comparison of the performances of the three tools and our method on the simulation data set 2.
Table S12. Comparison of the performances of the three tools and our method on the simulation data set 3.
Table S13. Comparison of the performances of the three tools and our method on the simulation data set 4.
Table S14. Comparison of the performances of the three tools and our method on the simulation data set 5.
Table S15. Comparison of the performances of the three tools and our method on the simulation data set 6.

Additional file 2.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Han, R., Qi, J., Xue, Y. et al. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing. Genome Biol 24, 222 (2023). https://0-doi-org.brum.beds.ac.uk/10.1186/s13059-023-03053-1

Received: 09 January 2022
Accepted: 08 September 2023
Published: 05 October 2023
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s13059-023-03053-1
