The Smith-Waterman algorithm has become a linchpin in the rapidly expanding world of bioinformatics, the go-to computational model for DNA sequencing and local sequence alignments. With the growth in recent years in genome research, there has been a sharp increase in the amount of data around genes and proteins that needs to be collected and analyzed, and the 36-year-old Smith-Waterman algorithm is a primary way of sequencing the data.
The key to the algorithm is that rather than examining an entire DNA or protein sequence, Smith-Waterman uses a technique called dynamic programming in which the algorithm looks at segments of varying lengths and zeroes in on patterns and similarities in the strings.
At the same time, it is also a compute-intensive procedure. Other algorithms, including BLAST and FASTA, are faster but not as accurate. Smith-Waterman (SW) goes through more possibilities, giving scores to each comparison – positive for matches and negatives for mismatches. The scores are added together, and the alignment with the highest score is the one that’s used. It is a complex algorithm, and there has been a growing interest in finding ways to accelerate the execution time through not only the CPUs used to run the workloads, but also through the use of such accelerators as GPUs and field programmable gate arrays (FPGAs), as well as the previous generation of Intel’s many-core Xeon Phi coprocessors, dubbed “Knights Corner.” Most recently, a group of researchers from the National University of La Plata in Argentina and the Complutense University of Madrid in Spain examined how well Intel’s more recent “Knights Landing” generation of Xeon Phi processors handle the highly complex SW algorithm.
The Knights Landing chips are the latest generation of Intel’s many-core initiative designed to push the X86 architecture deeper into HPC environments and the parallel workloads that are becoming increasingly prevalent. Supercomputer makers over the past decade have turned to GPUs from the likes of Nvidia and AMD to help accelerate the performance of complex workloads while maintaining energy efficiency. The Xeon Phi effort was born after the chip maker in 2009 ended its “Larrabee” project, which was designed to create Intel’s own GPU to compete with those from Nvidia and AMD. Knights Landing chips provide up to 72 cores that are based on a heavily modified Atom “Silvermont” microarchitecture. There are some key differences between Knights Landing and Knights Corner. The new chip family – which launched last year – can be used as a primary processor, whereas the Knights Corner silicon could only be used as a coprocessor that runs along with primary CPUs. Knights Landing also includes an integrated fabric, two controllers for DDR4 memory and eight memory controllers for MCDRAM (Multi-Channel DRAM), a 3D stacked on-package memory. Both memory controllers are completely separate.
In their study titled First Experiences Optimizing Smith-Waterman on Intel’s Knights Landing Processor, the researchers noted other key aspects of the latest Xeon Phi chips.
“Among the main differences of KNL respect to its predecessor, are the incorporation of AVX-512 extensions, a remarkable number of vector units increment and new on-package high-bandwidth memory,” they wrote, noting the previous studies with the Knights Corner chips and adding that “these aspects make necessary the revision of the previous optimization proposals for the SW algorithm.” In addition, they noted that Knights Landing “supports not only old Intel’s multimedia extensions such as and 256-bit AVX, but also modern 512-bit AVX-512.”
Previous studies regarding optimizing the SW algorithm include one from last year that looking at SIMD-vector exploitation capabilities in the newest CPUs that leverage the Parasail library. In addition, the CUDASW++ software, designed for multiple CUDA-enabled GPUs, was the most successful effort for using heterogeneous computing for Smith-Waterman, the authors wrote. There also have been studies focusing on tests where the CPUs and accelerators being used at the same time, as well as research around using FPGAs from Xilinx and Altera, with accelerators from the latter on OpenCL proving to be among the most energy efficient.
The researchers ran multiple tests on a server armed with Intel’s 68-core, 1.4 GHz Xeon Phi 7250 processor, which offers four threads per core, 16 GB of high-bandwidth memory (HBM) and 64 GB of main memory. It was run in Flat memory mode, in which the MCDRAM is used as addressable memory, and in quadrant cluster mode, and leveraged Intel’s ICC compiler. The authors made their evaluations by searching 20 query protein sequences against the Environmental NR database, which they said consist of almost 1.4 billion amino acid residues in more than 6.9 million sequences, 11,944 of which were the maximum length.
Using the GCUPS (billion cell updates per second) measurement common for the SW algorithm, the researchers evaluated different instruction sets, which used different numbers of threads. The AVX2 extensions saw the best performances (340.3 GCUPS), with the AVX-512 (157.8 GCUPS) and SSE4.1 (97.6 GCUPS) following. The performance of the AVX-512 was limited due to the lack of low-range integer operations, they wrote. In addition, in evaluating the performance using HBM, the researchers wrote that “as the entire application fits in the MCDRAM, we can benefit from placing all data in that memory using the numactl utility (without source code modification). In particular, MCDRAM exploitation achieves an average speedup of 1.04X and a maximum speedup of 1.1X.”
The study also compared the researchers’ implementation with the aligner application in the Parasail library, and found that their version outperformed Parasail for all query lengths that were used, running an average of 4.6 times faster, and that there were even larger differences for shorter queries.
Overall, the authors found that using low-range integer vectors was key to getting the best performance, and that different numbers of threads got the best results for each instruction set: 204 threads for SSE4.1, 136 threads for AVX2 and 68 threads for AVX-512. Future tests will include evaluating Knights Landing chips in different memory and cluster modes, including using the Flat mode while running larger genomic databases – such as UniProtKB/TrEMBL – that don’t fit into MCDRAM, as well as comparing the Intel chips to other accelerators not only for performance but also for energy efficiency. They also are interested in using future Knights Landing chips that will include the AVX-512BW set, which enables more SIMD parallelism.