The Internet of Things (IoT) has shown significant growth and promise, with data generated by IoT devices alone expected to reach 73.1 zettabytes by 2025. Moving all of this data from its point of creation to a centralized data center or cloud would often defeat the purpose of the applications that generate it. Thus, edge computing was born. Fast forward to 2024, and edge computing is being paired with recent advances in AI to process data intelligently at the edge, leading to faster processing, reduced latency and improved privacy and security.
In sectors like manufacturing and healthcare, where efficiency and accuracy are key, AI at the edge is changing the game. In manufacturing, where there are approximately 15 billion connected devices globally, the milliseconds lost in sending data to the cloud for processing can mean the difference between detecting a flaw immediately and letting it slip through quality control. In healthcare, the immediacy with which patient data is analyzed can affect the accuracy of diagnoses and the effectiveness of treatment, especially with the rise of decentralized healthcare and wearable devices. By processing data on the spot, AI at the edge avoids the round-trip latency that cloud computing introduces, leading to more timely, informed decisions.
The global market for edge computing technologies is estimated to increase from $46.3 billion in 2022 to $124.7 billion by 2027 at a compound annual growth rate (CAGR) of 21.9 percent from 2022 through 2027. Implementing AI at the edge will result in tangible benefits to all industries, enabling businesses to unlock new possibilities and achieve greater levels of performance.
The Shift To Smaller Models
In the past year, the conversation around AI models has begun to change. Large models with extensive parameter counts have started giving way to smaller, more focused models. This includes both the use of models that are small by design and the use of efficiency techniques, such as quantization, sparsity and pruning, to make large models smaller. These smaller models are easier to deploy and manage, and they can be significantly more cost-effective and explainable, yielding similar performance with a fraction of the computational resources. They can also be used in many task-specific domains. Pre-trained models can be adapted to a specific task through fine-tuning and then optimized for efficient inference, making them strong candidates for the stringent requirements of edge computing.
These smaller models are not only beneficial to the logistical challenges of deploying hardware at the edge, but they also meet the nuanced needs of specific applications. In manufacturing, a small, specialized AI model can continuously monitor the auditory signatures of machines to predict maintenance needs before a breakdown occurs. In healthcare, a similar model can provide continuous, real-time monitoring of patient vitals, alerting medical staff to changes that may indicate an emerging condition.
Mastering Model Optimization And Inferencing Techniques
Optimization at the edge is not just about making AI models smaller; it’s a balancing act to make models as small as possible while still retaining performance.
Techniques such as pruning convert larger models into smaller models by removing unimportant connections and, more recently, entire layers. Pruning aims to create more memory- and energy-efficient systems that retain the performance of their original larger counterparts. Successful pruning techniques include pruning by filter, pruning by channel, and pruning by layer (where the optimal blocks of layers to prune are identified via a similarity search, and model performance is then recovered through parameter-efficient fine-tuning (PEFT) with quantized low-rank adapters (QLoRA)).
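To make the idea concrete, here is a minimal sketch of two pruning styles in plain Python. It assumes weight magnitude is a reasonable proxy for importance, which is a common heuristic but not the only criterion, and the function names and the toy layer are hypothetical. The first function zeroes individual connections; the second drops whole filters (rows) ranked by their L1 norm.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights in a 2-D list."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of connections to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # The counter keeps the result at exactly k zeros when values tie.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


def prune_filters(weights, keep):
    """Keep the `keep` rows (filters) with the largest L1 norm, in original order."""
    norms = [sum(abs(w) for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [weights[i][:] for i in kept]


# Hypothetical 3x3 layer: each row stands in for one filter.
layer = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
]
sparse_layer = prune_by_magnitude(layer, sparsity=0.5)
smaller_layer = prune_filters(layer, keep=2)
print(sparse_layer)
print(smaller_layer)
```

Zeroing individual weights keeps the layer's shape, so it only saves memory and compute when the runtime exploits sparsity. Removing whole filters shrinks the layer itself, which is one reason structured pruning is often attractive on constrained edge hardware. In either case, a fine-tuning pass afterward is usually needed to recover accuracy.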
Another technique used to make models smaller is quantization, a process for reducing model size by lowering the precision of model weights and activations so they have a smaller memory footprint. Storing values as 32-bit or 16-bit floating point numbers requires a great deal of memory, but with quantization, these weights and activations can be converted to 8-bit, 4-bit and occasionally even smaller integers that can run at the edge. For example, depending on the technique, a Llama 2 7B model can be reduced from 13.5 GB to 3.9 GB, and a Llama 13B model can be reduced from 26.1 GB to 7.3 GB by FP16 to INT4 conversion. Quantization can be applied either after training (post-training quantization) or during training (quantization-aware training). To maintain performance, however, mixed-precision techniques or a combination of mixed precision and pruning may need to be considered.
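The sketch below illustrates the mechanics with symmetric, per-tensor INT8 post-training quantization, one simple scheme among many. The helper names, the sample weights and the round 7-billion parameter count are assumptions for illustration only, not the method behind the figures quoted above.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


def storage_gb(num_params, bits_per_param):
    """Naive storage estimate in GB (10**9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.42, -1.27, 0.003, 0.88, -0.15]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # small integers, one byte each
print(max_error)  # rounding error, bounded by about scale / 2

# Hypothetical model with exactly 7 billion parameters.
print(storage_gb(7_000_000_000, 16), storage_gb(7_000_000_000, 4))
```

The naive estimate shows why moving from 16 bits to 4 bits cuts storage to roughly a quarter. Published sizes for real models typically differ somewhat from this estimate, because practical formats store scaling factors alongside the integers and may keep some layers at higher precision.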
Other efficiency techniques, such as low-rank adaptation (LoRA), allow for parameter-efficient fine-tuning, reducing computational cost and memory use and increasing speed while maintaining accuracy. Rather than modifying the entire model, LoRA keeps the original model weights frozen and trains a separate, much smaller set of weights whose contribution is added to the frozen ones. Because large models inherently possess a low-dimensional structure, the change to a weight matrix can be expressed as the product of two low-rank matrices. (The rank of a matrix is the number of linearly independent rows or columns, that is, rows or columns that cannot be written as a combination of the others.) LoRA is now commonly combined with quantization in the form of QLoRA, which can be used to fine-tune quantized models.
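As a rough illustration of the frozen-weights-plus-low-rank-update idea, the following plain Python sketch computes an effective weight matrix W + (alpha / r) * B * A and compares parameter counts. The matrix sizes, the values and the alpha scaling factor are hypothetical choices for the example, and no training loop is shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * (B @ A); W stays frozen, only B and A would be trained."""
    r = len(a)  # rank of the update = number of rows in A
    delta = matmul(b, a)
    s = alpha / r
    return [[wij + s * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


# Hypothetical frozen 4x3 weight matrix and a rank-1 adapter.
W = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.7, 0.1, 0.2], [-0.6, 0.0, 0.9]]
A = [[0.1, -0.5, 0.2]]               # 1 x 3
B_init = [[0.0], [0.0], [0.0], [0.0]]  # 4 x 1, zero at the start of training
B_trained = [[1.0], [0.0], [-1.0], [0.5]]

unchanged = lora_effective_weight(W, B_init, A, alpha=2)
adapted = lora_effective_weight(W, B_trained, A, alpha=2)

print(unchanged == W)  # zero-initialized B leaves the model's behavior unchanged
print(adapted[0])
print(4096 * 4096, trainable_params(4096, 4096, 8))
```

For a 4096 by 4096 layer, a rank-8 adapter trains 65,536 values instead of more than 16 million, which is where the memory and compute savings come from. Starting B at zero means the adapted model initially matches the frozen one exactly.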
Additional techniques to consider include federated learning, matrix decomposition, weight sharing, memory optimization and knowledge distillation. In practice, such optimized models can provide critical insights with minimal delay. For instance, in manufacturing, AI optimized for the edge using these techniques could analyze equipment vibrations to detect early signs of wear and tear, scheduling maintenance before a failure occurs. In healthcare, edge-optimized AI can process real-time videos of patients to alert caregivers when a patient has fallen.
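Of the additional techniques listed above, knowledge distillation is straightforward to sketch. The example below assumes the common formulation in which a small student model is trained to match a larger teacher's temperature-softened output distribution, measured by KL divergence and scaled by the square of the temperature. The logits and the temperature value are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.0]         # hypothetical teacher logits for one input
close_student = [3.5, 1.0, 0.2]   # student that mostly agrees with the teacher
far_student = [0.0, 1.0, 4.0]     # student that disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close, loss_far)
```

In a full training setup this term is usually combined with the ordinary loss on the true labels, so the student learns both from the data and from the teacher's relative confidence across classes.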
Hardware, Software, And Learning Optimizations
Dell Technologies is at the forefront of AI at the edge, optimizing both hardware and software to support AI workload deployments at the edge. With initiatives like NativeEdge, Dell Technologies ensures that AI models at the edge are not just operational but powerful, whether on a CPU or GPU.
Through a nuanced approach, Dell Technologies centralizes the deployment and management of edge infrastructure and applications across geo-distributed locations, helping enterprises securely scale their edge operations using automation, open design, Zero Trust security principles and multicloud connectivity.
Regardless of the intended use case, the key to successful edge and AI implementation lies in the synergy between optimized models and the hardware they run on. The Dell NativeEdge platform exemplifies the integration needed to manage these advanced AI systems effectively. Using NativeEdge, organizations can seamlessly deploy and control their edge AI applications, ensuring that the insights gleaned are timely and actionable.
Putting The AI At The Edge Vision Into Practice
The journey toward implementing AI at the edge is marked by real-world applications that demonstrate its transformative impact. AI at the edge has the potential to provide organizations with valuable, real-time insights to better navigate complex data and opportunities. It represents a significant shift toward a new era of business where organizations thrive on immediacy and adaptability. The time to put these theories into practice is now. Organizations that apply AI to their data where it is most valuable, at the edge, will gain a competitive advantage in the months and years ahead.
To learn more about AI at the edge, visit the Dell Technologies Resource Library.
Ananta Nair is an artificial intelligence engineer at Dell Technologies.
This article was commissioned by Dell Technologies.