When it comes to neural network training, Python is the language of choice. But for inference, code needs to be transformed to meet the various hardware performance and device limitations.
This has meant that the various AI inference hardware makers have had to build comprehensive custom software stacks to handle inference on their own devices, limiting their expansion due to user unwillingness to tap into a new programming model just to run inference workloads. All of this could be changing as recent work highlights Python’s suitability to handle inference.
Stanford and Facebook AI researchers have created an abstraction that allows multiple Python interpreters to be used within a single process for scalable datacenter inference, achieved partially through a new type of container format for the models themselves. This approach “simplifies the model deployment story by eliminating the model extraction step, making it easier to integrate existing performance-enhancing Python libraries.”
Right now, taking a trained model into inference involves some complexity, especially for the inference startups that have had to build their own custom stacks for unpacking trained models and getting them to run on their custom hardware. At best, there is manual effort to refactor the models into something less flexible and usable than Python. With the Stanford/Facebook AI approach, the key is using the existing CPython interpreter as the platform for model inference, organized so that several isolated interpreters can run simultaneously. All of this is packaged so that self-contained model artifacts can be built from the existing Python code and weights and moved around as a unit. The ins and outs of the process are available here.
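To make the idea of a self-contained model artifact concrete, here is a minimal sketch using only the Python standard library. It is not the researchers' container format; the archive layout, the file names model.py and weights.json, and the toy predict function are all assumptions made for illustration. It shows only the general shape: code and weights travel together in one file, and the receiving side runs the Python code as-is rather than extracting the model into another representation.

```python
import json
import os
import tempfile
import zipfile

# Hypothetical model code; stands in for an author's existing Python model.
MODEL_SOURCE = '''
def predict(weights, x):
    # Toy linear model: dot product of weights and inputs, plus a bias.
    return sum(w * xi for w, xi in zip(weights["w"], x)) + weights["b"]
'''


def save_artifact(path, source, weights):
    """Write model source and weights into one self-contained archive."""
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("model.py", source)
        archive.writestr("weights.json", json.dumps(weights))


def load_artifact(path):
    """Read the archive back and return (predict function, weights)."""
    with zipfile.ZipFile(path) as archive:
        source = archive.read("model.py").decode("utf-8")
        weights = json.loads(archive.read("weights.json"))
    namespace = {}
    # Execute the packaged code in its own namespace so it does not
    # collide with names already defined by the caller.
    exec(compile(source, "model.py", "exec"), namespace)
    return namespace["predict"], weights


# Round trip: package the toy model, load it again, run one prediction.
with tempfile.TemporaryDirectory() as workdir:
    artifact_path = os.path.join(workdir, "toy_model.zip")
    save_artifact(artifact_path, MODEL_SOURCE, {"w": [0.5, -1.0], "b": 2.0})
    predict, weights = load_artifact(artifact_path)
    print(predict(weights, [1.0, 2.0]))
```

A real artifact would also need to carry the model's Python dependencies and store tensors in a binary format. The sketch leaves both out to keep the round trip visible.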
Part of what is interesting about being able to use basic Python and related tooling for TensorFlow and other frameworks is that the performance of the approach gets better with more threads. This means GPUs and all the other high core-count devices taking aim at the datacenter inference market might be able to carve a neater, more usable path to more users, especially since the one complaint we keep hearing is that the software stacks for AI chip startups are too unmanageable.
As the Stanford/Facebook AI team explains, “for models that spend significant time in the Python interpreter, the use of multiple interpreters enables scaling when the GIL would otherwise create contention. Furthermore, for GPU inference, the ability to scale the number of Python interpreters allows the Python overhead to be amortized across multiple request threads.”
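The GIL contention the team describes can be seen in a small experiment. The sketch below is an illustration, not the researchers' benchmark: python_heavy_step and serve_requests are hypothetical names, and the workload is a plain Python loop standing in for the interpreter-bound part of a model. On a standard CPython build with the GIL, only one thread executes Python bytecode at a time, so adding threads to a single interpreter is not expected to shorten the batch much. That is the bottleneck that running several isolated interpreters is meant to remove. Exact timings depend on the machine and Python build, so none are quoted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def python_heavy_step(n):
    # Pure-Python loop: the thread running it needs the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total


def serve_requests(num_threads, num_requests=8, work=50_000):
    # Handle a fixed batch of hypothetical requests with a thread pool
    # and report both the results and the wall-clock time taken.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(python_heavy_step, [work] * num_requests))
    return results, time.perf_counter() - start


# Same batch, one thread versus four threads in a single interpreter.
for threads in (1, 4):
    results, elapsed = serve_requests(threads)
    print(f"{threads} thread(s): {elapsed:.3f}s for {len(results)} requests")
```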
The one problem with the Python approach is memory capacity: each interpreter needs its own copy of the shared interpreter library, so the copying and loading adds up. The amount of memory isn’t huge for server applications, but the added complexity takes up much-needed memory, which is especially concerning for inference devices whose makers have worked hard to keep memory to a minimum for cost and power reasons.
They add that this approach still has some other kinks, but none are insurmountable. For instance, users might be unable to use some of the all-important third-party C++ extensions (this includes Python bindings like those in Mask R-CNN) because of mismatches in the tables.
Ultimately, the team says Python inference gives model authors flexibility to quickly prototype and deploy models and then focus on the performance of the models when necessary rather than having to invest in upfront effort to extract the model.
“Because Python does not have to be entirely eliminated, it also offers a more piecemeal approach to performance. For instance using Python-based libraries like Halide to accelerate the model while still packaging the model as a Python program. Using Python as the packaging format opens up the possibility of employing bespoke compilers and optimizers without the burden of creating an entire packaging and deployment environment for each technology.”
A much more comprehensive view of the Python-based approach and benchmarks can be found here.