Pathology laboratories are big data environments. However, these big data are often hidden behind expert humans who manually, and with great care, visually parse large, complex, and detailed datasets to provide critical diagnoses. Humans, it turns out, are amazingly detailed and accurate large-data visualization, segmentation, and interpretation devices. Experts are able to zoom in and identify the perhaps five or six tumor glands, out of a large area of stained tissue, that comprise the average cancer-positive needle biopsy.
However, pathology is still an extremely manual and detailed process requiring great skill and accuracy to avoid any potential misdiagnosis. The pathologist effectively views sets of individual slides, each holding sections of a labeled specimen, under a microscope, then creates a pathology report based on what they can physically see. The pathology lab clearly needs, and deserves, some automated help with this critical task.
An average 20X zoom pathology whole slide image (WSI) comprises several gigapixels of data, of which only 300 or so critical pixels may accurately define a positive diagnosis. Any form of computational assistance to augment the pathology lab has the obvious potential not only to accelerate the whole process, but also to help a human more accurately detect potentially anomalous cell types that might otherwise be missed. Faster, more accurate diagnosis of cancer absolutely sounds like something we should be doing.
Thomas Fuchs, director at Memorial Sloan Kettering Cancer Center, professor at Weill Cornell, and often called the "father of computational pathology," certainly agrees. Recently, Fuchs became founder and chief science officer at Paige.ai, which earlier this year closed a $25 million Series A round. The funding was led by Breyer Capital and enables Paige not only to gain access to the AI technology built by Fuchs, but also to the largest pathology slide repository in the world at MSKCC. The center did receive equity as part of the license agreement, but is not itself a cash investor. Combining technology, software, and rich clinical data with a commercialization path is the key here; this is the chocolate and peanut butter of modern clinical AI research.
Finding And Annotating The Big Data
Hunting down enough data to effectively train systems to identify potentially cancerous cells has been non-trivial. However, Fuchs and team managed to gather a dataset of unprecedented size in the field of computational pathology, containing 12,160 individual slides from prostate needle biopsies. Their paper, Terabyte-scale Deep Multiple Instance Learning for Classification and Localization in Pathology, describes the work and challenges in more depth. The dataset used in this terabyte-scale analysis is almost two orders of magnitude larger than most other clinically available digital datasets in the field. To put this in perspective, it has roughly the same number of pixels as 25 individual ImageNet datasets.
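That "25 ImageNets" comparison is easy to sanity-check with back-of-envelope arithmetic. The per-slide and per-image pixel counts below are assumed round-number averages for illustration, not figures from the paper:

```python
# Rough sanity check of the "25 ImageNets" pixel comparison.
# The per-slide and per-image pixel counts are assumed averages.
wsi_pixels = 12_160 * 4e9            # 12,160 slides at ~4 gigapixels per 20X WSI
imagenet_pixels = 14e6 * 0.15e6      # ~14M ImageNet images at ~0.15 megapixels each
ratio = wsi_pixels / imagenet_pixels # lands in the neighborhood of 25
```

Even with generous error bars on those averages, the ratio comes out in the low twenties, which squares with the paper's order-of-magnitude claim.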
It is big data, not yet massive, but certainly heading in a direction that will create issues with computational balance as the archives grow. The slide set of 12,160 needle biopsies was digitally scanned at 20X magnification, with 2,424 slides labeled as positive and 9,736 as negative. That dataset was then randomly split into training (70 percent), validation (15 percent), and testing (15 percent) sets for a fully validated learning and testing process.
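A random split of that shape can be sketched in a few lines of Python. The `split_slides` helper and its seed are illustrative, not taken from the paper's code:

```python
import random

def split_slides(slide_ids, train=0.70, val=0.15, seed=42):
    """Randomly partition slide IDs into train/validation/test sets.
    The remaining fraction (here 15 percent) becomes the test set."""
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

# With 12,160 slides this yields 8,512 / 1,824 / 1,824 slides,
# matching the 1,824-slide held-out test set reported for the work.
train_set, val_set, test_set = split_slides(range(12_160))
```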
Software And Systems Complexity
For the pathology pipeline, an in-house cluster of six Nvidia DGX-1 systems was used, each containing eight Tesla V100 Volta GPU accelerators, with OpenSlide used to access the WSI files on the fly. PyTorch handled the data loading, model building, and training, with further analysis of the results carried out in R. It is important to note that the software components for this type of analysis are each individually non-trivial. So much so that Nvidia recently announced, at the IEEE/CVF Conference on Computer Vision and Pattern Recognition in Salt Lake City, a whole suite of software to make AI pipelines such as the one used by Fuchs even more performant, and just a little bit easier for researchers to manage.

In particular, the setup had the new Apex software from Nvidia, which is essentially a PyTorch extension that provides mixed precision utilities. The utilities are designed to improve training speed while maintaining the accuracy and stability of single precision training. Specifically, Apex offers automatic execution of operations in either FP16 or FP32, with automatic handling of master parameter conversion and automatic loss scaling. Nvidia claims that these functions are all available with four or fewer line changes to existing code. This type of development is critically important as we build complex training systems with ever larger datasets, such as those found in this pathology use case, while also keeping an eye on overall software complexity. Using such high-level language extensions really does sound like the proverbial "easy button" for software developers who have been wrestling with how best to code against these new architectures for performance.
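The loss-scaling idea at the heart of those mixed precision utilities is simple enough to sketch in pure Python. This `DynamicLossScaler` is an illustration of the general technique, not Apex's actual implementation: scale the loss up so small FP16 gradients do not underflow to zero, back off when gradients overflow to infinity, and cautiously grow the scale again after a run of stable steps.

```python
import math

class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling for mixed precision training.
    Illustrative only; Apex implements this inside PyTorch."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss before backprop so FP16 gradients stay
        # above the underflow threshold.
        return loss * self.scale

    def unscale(self, scaled_grads):
        # Any non-finite gradient means the scale was too aggressive:
        # halve it and signal the caller to skip this optimizer step.
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale /= 2.0
            self._good_steps = 0
            return None
        # After a long run of stable steps, try a larger scale again.
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0
            self._good_steps = 0
        # Divide the scale back out before the weight update.
        return [g / self.scale for g in scaled_grads]

scaler = DynamicLossScaler(init_scale=4.0)
scaled = scaler.scale_loss(0.5)          # loss scaled up by 4x
skipped = scaler.unscale([float("inf")]) # overflow: step skipped, scale halved
grads = scaler.unscale([1.0])            # stable step: gradients unscaled
```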
Finally, the Paige.ai configuration includes a data pipeline software stack called Dali, which was also just announced by Nvidia. Dali is a GPU-accelerated data augmentation and image loading library for optimizing data pipelines in deep learning frameworks. By accelerating data augmentation on GPUs, Dali addresses performance bottlenecks, lets researchers scale training performance on image classification models, and reduces code duplication through more consistent high-performance data loading and augmentation across frameworks. Dali relies on a new accelerated Nvidia nvJPEG routine that was also released as open source during the CVPR conference. To round all of that out, Kubernetes on GPUs now finally has a release candidate, which should help wrap all the software pieces into a nice, neat, tidy container, further reducing the stress on the researcher. These subtle software announcements are going to be critical as ever more complex systems are built for research.
Science And Storage
Returning to the science use case: the evaluation centers on the performance of Fuchs' Multiple Instance Learning (MIL) pipeline, which assumes only the overall slide diagnosis is necessary for training, thereby avoiding all the expensive pixel-wise annotations that usually come with supervised learning approaches. Fuchs and team achieved an AUC of 0.98 on the held-out test set of 1,824 slides. This is extremely encouraging, especially for an augmentation system aimed at the ever more overburdened and high-pressure pathology laboratory. The question is what accuracy these methods ought to achieve before they can be brought into clinical workflows. More research will be needed, and larger datasets brought to bear, to build accurate and fault-tolerant augmentation systems; the single largest issue for an augmented diagnosis service is missing a potential positive sample.
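The core MIL intuition can be sketched in a few lines: a slide inherits the score of its most suspicious tile, so only the slide-level label is needed for training. The toy scores below are invented for illustration, and the actual pipeline (a deep network scoring tiles at scale) is far more involved; the AUC helper uses the standard rank-sum identity rather than anything specific to the paper.

```python
def slide_score(tile_scores):
    """MIL aggregation in its simplest form: a slide is as suspicious
    as its most suspicious tile, so no pixel-wise labels are needed."""
    return max(tile_scores)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one strongly suspicious tile makes a slide positive,
# even when every other tile on it looks benign.
slides = [[0.1, 0.9, 0.2], [0.1, 0.2, 0.3], [0.05, 0.1, 0.2], [0.3, 0.95, 0.1]]
labels = [1, 0, 0, 1]
scores = [slide_score(tiles) for tiles in slides]
```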
To that end, and to further accelerate the process, advanced all-flash systems from Pure Storage have been deployed to move multi-terabyte and petabyte data into and out of machine learning clusters. Fuchs presented the work this week at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in Salt Lake City, under the title Computational Pathology at Scale: Changing Clinical Practice One Petabyte at a Time. Once again there is a direct clue in the title: "one petabyte at a time." This team is clearly dealing with a large, and ever growing, number of big files. We have spoken in the past about removing the bottleneck for AI with the combination of Pure Storage flash and Nvidia GPU systems, and it is becoming crystal clear from this pathology use case that even faster combinations of storage, network, and compute will be needed to further accelerate the accurate annotation of cancerous tissues with intelligent systems.
Recently, some moderate concerns were raised in the storage community that high-speed I/O may not actually be needed for machine learning applications. A cursory look at the TensorFlow benchmarks, comparing synthetic with real training data, can give the initial impression that a synthetic dataset, which performs no disk I/O, delivers performance similar to real data, which clearly does hit the disk. This would suggest that disk I/O is not a major bottleneck. Why would this be? For most standard dual-socket CPU boxes with a handful of SSDs, the I/O really isn't a bottleneck; there is simply not enough compute to cause the pipe to clog. The TensorFlow results are absolutely right. However, add high-speed FP16 Tensor Cores and 300 GB/sec NVSwitch into the mix, and that whole game changes fast, as these devices have an insatiable need for super high bandwidth pipes so they can inhale massive datasets quickly and then get to work on them.
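A back-of-envelope calculation makes the point. All of the sizes and bandwidths below are hypothetical round numbers for illustration, not measured figures:

```python
def ingest_seconds(dataset_bytes, bandwidth_bytes_per_s):
    """Time for one full pass over the data if storage were the only
    bottleneck -- a crude way to see when I/O starts to matter."""
    return dataset_bytes / bandwidth_bytes_per_s

# Hypothetical: a ~150 GB ImageNet-scale set versus a 1 TB pathology set,
# read at 2 GB/s (a few local SSDs) versus 50 GB/s (an all-flash array
# feeding a multi-GPU cluster).
imagenet_on_ssd = ingest_seconds(150e9, 2e9)      # ~75 s per epoch
pathology_on_ssd = ingest_seconds(1e12, 2e9)      # ~500 s per epoch
pathology_on_flash = ingest_seconds(1e12, 50e9)   # ~20 s per epoch
```

Once the compute side can chew through an epoch faster than the storage can deliver it, the disks become the clog, which is exactly the regime Tensor Core clusters push you into.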
The TensorFlow benchmarks use a single copy of the admittedly complex 14-million-picture ImageNet set, which itself weighs in at a few hundred gigabytes, as opposed to clinical data that is already at the level of hundreds of terabytes, and soon to be multiple petabytes, each needing to flow quickly through clusters of exaops-capable devices. It is a whole new challenge for accurately understanding and diagnosing just the I/O path, for sure.
Increasing dataset sizes and the associated data rates are also directly correlated with the increased accuracy that may be derived from analyzing ever larger archives of glass slides generated by pathology laboratories. It is the accessibility of rich, detailed archive data that will eventually drive very significant I/O issues. Of concern to folks in computing, there also seems to be significant interest in building ever larger repositories of genetic material and tissue samples, showing yet another now very familiar, terrifying hockey stick growth graph that compute, storage, and networks will all need to keep up with. Couple that with projects such as All of Us, which aims to sequence 1,000,000 Americans, and we have the potential for another wave of data deluges to strike even more areas of the life sciences, in clinical and biomedical research. Drowning in data has never been a more accurate analogy.
We have talked about what the data deluge in the life sciences means for exascale and clouds, and here once again we find another terrifying hockey stick graph, this time for predicted physical slide volume at MSKCC. That slide library is predicted to grow by 1,000,000 slides a year; even the environmentally controlled physical storage infrastructure needed to hold the glass slides alone is non-trivial.
Fuchs hits on the underlying issue in the paper by stating that "the lack of large datasets which are indispensable to learn high capacity classification models has set back the advance of computational pathology." This is the ultimate takeaway, and it is now starting to change. These large tissue archives beget more data, and that larger data begets ever more accurate predictive models and methods for healthcare.
The bottom line here is that ever more high-speed data mobility and advanced accelerated computing and storage, both inside and near to the clinic, are going to be needed as these datasets become more widespread. The computational models and hardware will remain critically important as rapidly evolving, more accurate, real-world AI methods provide better health outcomes and become even more mainstream in our healthcare processes. More data, better decisions is the new mantra.