Apache Kafka Gives Large-Scale Image Processing a Boost
March 13, 2017 Ben Cotton
The digital world is becoming ever more visual. From webcams and drones to closed-circuit television and high-resolution satellites, the number of images created on a daily basis is increasing and in many cases, these images need to be processed in real- or near-real-time.
This is a computationally-demanding task on multiple axes: both computation and memory. Single-machine environments often lack sufficient memory for processing large, high-resolution streams in real time. Multi-machine environments add communication and coordination overhead. Essentially, the issue is that hardware configurations are often optimized on a single axis. This could be computation (enhanced with accelerators like GPGPUs or coprocessors like the Intel Xeon Phi), memory, or storage bandwidth. Real-time processing of streaming video data requires both fast computation and large amounts of in-memory storage. Meeting both needs in a single box can be prohibitively expensive for all but the most well-funded operations.
Where hardware cannot meet the need, software steps up. This is the premise of decades of distributed computing projects. At the Seventh International Conference on Computer Science and Information Technology held in Zurich, Switzerland last month, two researchers from the Department of Electrical Engineering at Korea University presented a paper outlining their software-based approach to solving this problem.
Yoon-Ki Kim and Chang-Sung Jeong developed a distributed system based on Apache Kafka to process video streams in real time. Image frames are first published to Kafka by the camera or a remote receiver. Processing nodes subscribe to the Kafka broker, pulling data to process as capacity is available. This pull model is key to real-time processing, since it allows the processing nodes to keep themselves full instead of relying on communication back and forth with a push-based model. Breaking out multiple channels into their own Kafka topic improves the throughput when accompanied with an increase in node count.
“Although the GPU is well suited for high-speed processing of images, it still has limited memory capacity. Using Kafka to distributed environment allows overcoming of the memory capacity that cannot be accommodated by one node. In particular, since image data can be stored in the file system, it is advantageous to handle large-scale images without data loss.”
Although the focus here is on computation and memory, disk performance plays a role as well. The Hadoop Image Processing Interface (HIPI) is a similar project using Hadoop MapReduce to distribute computation tasks. Kim and Jeong considered HIPI to be unsuitable for real-time processing in part due to the I/O patterns of the Hadoop Distributed File System (HDFS). HDFS uses a random-access pattern, which can lead to disk I/O being a bottleneck. Kafka avoids this issue by using sequential access.
“Apache Kafka is a distributed messaging system for log processing. Kafka uses Zookeeper to organize several distributed nodes for storing data in real time stably. It also stores data in several messaging topics. It provides a messaging queue that allows nodes that need a message to access and process it.”
Using a software system to efficiently parallelize the computation makes real-time processing of large image streams allows for “normal” high-performance computing hardware to be used. This makes it a viable option for many use cases, from the frivolous addition of visual effects on social media and video chat applications, to serious applications like object recognition for defense. But as the problem was created by hardware, so might it be addressed by hardware in the future. Increasing the amount of memory per core (or GPU) or perhaps the future broad availability of processing-in-memory could address the constraints imposed by current hardware. At least until the size and resolution of the images to be processed leapfrogs ahead again.