What is the key to scaling AI development?
The answer: AI Infrastructure
Emerging opportunities for AI optimization
Companies are increasingly adopting AI applications for a wide range of use cases, such as legal and financial tasks, sales, customer support, medical advice, and more. However, significant infrastructure costs, compute fragmentation, and software-hardware inefficiencies make it critical to address opportunities for optimizing AI development tooling and workflows. Fine-tuning a model through OpenAI, for example, can cost at minimum $3k for 1,000 users, going all the way up to $60k for just one model. This does not even include hardware costs, which can run north of $200k. In addition to the financial costs, the environmental footprint of training and developing AI models is also consequential. One study estimated that training popular AI models can emit five times as much carbon dioxide as the average American car does over its lifetime. More recently, Hugging Face also studied the energy efficiency of models on its platform, comparing tasks, model sizes, and modalities, signaling the growing importance of assessing AI development and addressing overall efficiency. Improving AI infrastructure at both the hardware and software levels, therefore, is key for enterprises to capitalize on AI to solve different business problems, ramp up productivity, and create new sources of differentiation and value at scale.
Why is optimizing AI infrastructure important now?
Significant energy, financial, and environmental costs and computational requirements for developing scalable and powerful AI models
Software and hardware infrastructure inefficiencies, which present trade-offs for different use cases and applications
AI tooling, frameworks, and development toolkits are still evolving, and need to consider optimization across different levels, including models and data, in order to improve model outputs
Scanning the AI infrastructure stack
First, it’s all about the data
Enterprises have been figuring out how to implement best practices in adopting the modern data stack. As data needs are constantly changing at each part of the AI development pipeline, AI infrastructure at different layers must also address changing data volumes, types, and latencies. There are a few key aspects of the data layer that are important to consider for innovation in AI infrastructure:
New storage mechanisms: given the variety of data that AI applications work with and produce, new data storage mechanisms are needed to make these processes more efficient. This includes components such as vector databases, which make vector embedding models and retrieval mechanisms more efficient (e.g. Superlinked), and ensuring that AI outputs can be written into any database.
Data quality innovations: techniques such as Reinforcement Learning from Human Feedback (RLHF), which seeks to improve data quality using human assessments during AI development, are among the promising developments for generating better data, with implications for optimization. Data innovation also includes methods to generate larger volumes of high-quality data at scale, whether through data labeling, modeling and synthetic generation, or methods such as Retrieval Augmented Generation (RAG), which rely on internal knowledge bases (a minimal retrieval sketch follows this list). Finally, analyzing data quality and comparing model performance in the style of Tenyks is another approach to improving AI models and data inputs.
Model focus and size: parallel efforts towards smaller, more tailored models and larger, general-purpose models have different implications for AI infrastructure costs and metrics due to their different data and GPU needs, often making smaller models the more cost-effective choice. But given the push towards larger models by big technology players, who have invested heavily in the ecosystem behind them, large models are unlikely to go away anytime soon, making data-related optimization a promising area for innovation.
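To make the retrieval idea concrete, here is a minimal sketch of grounding model outputs in an internal knowledge base via vector embeddings, in the spirit of the RAG and vector-database components above. The embed() function, the example documents, and the dimensions are all illustrative stand-ins rather than any vendor's API; a production system would use a real encoder and a vector database.

```python
# Minimal RAG-style retrieval over an in-memory vector index.
# embed() is a toy stand-in for a real embedding model.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash the text into a pseudo-random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# 1. Index an internal knowledge base as vectors (illustrative documents).
documents = [
    "Refund requests are processed within 14 days.",
    "Enterprise plans include dedicated GPU capacity.",
    "Support is available 24/7 via chat and email.",
]
index = np.stack([embed(d) for d in documents])

# 2. Retrieve the most relevant documents for a query via cosine similarity.
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# 3. Ground the answer in retrieved context instead of fine-tuning.
query = "How long do refunds take?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to an LLM
```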
Opportunities in Hardware
AI hardware companies are mainly focused on optimizing key metrics including memory, logic, networking, latency, concurrency, throughput, and power consumption, which involve various trade-offs. Inference and training costs comprise the bulk of AI infrastructure spending, and also encompass hardware, performance, the opportunity cost of development time, and power consumption.
Trends and opportunities for AI hardware optimization include:
Specialization for AI workloads: a key challenge is lowering the cost of compute and improving efficiency for AI-specific workloads. Some companies are moving in this direction, such as Google with its TPUs for machine learning tasks, while others, such as Microsoft with its MAIA accelerator, have built chips only for internal use. Therefore, increased external support for commercial chips is essential.
Hardware-software co-design: new ML architectures are emerging to maximize existing hardware capabilities. At the same time, new hardware is being designed to optimize for the best architectures as well as memory organization and storage. While some model architectures, such as transformers or mixture-of-experts models, may be ideal for high-quality results, they are not necessarily the best for hardware efficiency. Microsoft Research, for example, published research on retentive networks as an alternative architecture to transformers to optimize information processing, recurrence, and attention for longer texts. But hardware improvements must also be adaptable to a variety of model architectures: becoming too specialized for a particular architecture risks obsolescence if that architecture falls out of favor.
Multimodal data: current processors tend to be best suited for models that deal with separate modalities of AI such as computer vision, natural language processing, etc. However, AI applications are increasingly dealing with multimodal data, and optimization will play a role as workloads become increasingly complex and require more compute.
Optimized processors: areas for innovation include compute acceleration, data storage and memory, and I/O, such as cutting down on network latencies, bandwidth, time to complete jobs, and resource utilization.
Custom accelerators for inference and training: Amazon has developed the Inferentia chip for inference, offering high throughput and low latency, and Trainium, specifically for training 100b+ parameter AI models. Meta is also working on custom accelerators for inference workloads in order to optimize for low latency and high performance. Startups such as Fractile and MatX are building chips specifically for LLM inference, while others are optimizing for general-purpose ML (Rain.ai). Literal Labs is also developing both software and hardware IP to improve energy efficiency and inference times. Overall, lowering costs for these compute-intensive stages has been the focus of efforts across big tech and startups.
New design approaches: current gaps include flexibility, scalability, and energy use. Emerging approaches include neuromorphic computing (Rigpa.ai, Innatera), which is based on asynchronous information processing, increased parallelism, on-device learning, and more, and optical computing (Lumai, Salience Labs) to improve energy efficiency and speed.
Addressing the software-hardware boundary
Software development toolkits (SDKs) and AI frameworks allow for programming AI accelerators, and are either made for specific hardware or are hardware-agnostic. NVIDIA's CUDA is the most widely used, and was developed to work with NVIDIA's GPU hardware. NVIDIA's investments in this software ecosystem have contributed to the moat around its hardware offerings. Alternatives include OpenCL (maintained by the Khronos Group) and Triton (from OpenAI); however, there are opportunities to develop SDKs that maximize the full capabilities of AI hardware rather than being tied to one specific vendor. Additionally, as accelerators move towards some level of specialization, there is an opportunity for new players to provide optimized, hardware-agnostic platforms that address the software-hardware intersection. AI frameworks are also important here, as they provide an abstraction for developers to work with their AI model while it runs on particular hardware. Libraries such as PyTorch and TensorFlow allow developers to define architectures and invoke algorithms on hardware through interface systems such as CUDA without having to write CUDA code directly, which can be tedious and time-consuming. However, there are still big gaps that companies like Modular are addressing, such as improving the usability and performance of software development toolkits by providing a single platform and SDK (in their case, Mojo).
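To illustrate that abstraction, the short sketch below defines a small network in PyTorch and runs it on a GPU when one is available, with the underlying CUDA kernels invoked by the framework rather than written by hand. The layer sizes and batch are arbitrary examples.

```python
# Frameworks like PyTorch expose a device abstraction; the vendor SDK
# (CUDA here) supplies the low-level kernels behind the scenes.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"  # falls back gracefully

model = nn.Sequential(           # architecture defined in Python, not CUDA C++
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
).to(device)                     # parameters moved to GPU memory if available

x = torch.randn(32, 512, device=device)   # batch of 32 example inputs
with torch.no_grad():
    y = model(x)                 # matrix multiplies dispatched to vendor kernels
print(y.shape, y.device)
```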
Optimizing the AI software stack
Due to the complexities of managing AI hardware, and even of obtaining the best chips in the first place amid skyrocketing demand, many startups are providing these resources for companies. Opportunities for innovation in this area span compute and models, mainly orchestration and deployment.
Orchestration refers to managing the workloads involved in model development, and popular tools include Amazon SageMaker, Databricks, Fiddler, Vertex AI, and Kubeflow. There are still opportunities to efficiently manage the scheduling and scaling of workloads as models receive new data, and to optimize hardware and compute, especially given the scarcity of state-of-the-art chips.
Compute Optimization and Workload Management: Companies can manage hardware resources that are otherwise difficult to allocate efficiently. Run:ai, for example, manages GPU allocation and workload scheduling across cloud and on-prem environments, while other companies such as Modal Labs work with different types of hardware and specific model use cases. This can help with costs, as users are charged for what they use rather than paying for idle GPUs. But all providers manage infrastructure differently: some may process user requests immediately, while others have slight delays. Trends in this area include specialization for CPU optimization (Neural Magic) and GPU optimization (Deci). Other companies such as Krai focus on general hardware optimization with a variety of delivery mechanisms.
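As a rough illustration of the kind of GPU allocation such platforms automate, the toy scheduler below queues jobs until a device with enough free memory is available instead of over-provisioning. The job names, memory figures, and greedy policy are invented for this sketch and do not reflect how any specific vendor implements scheduling.

```python
# Toy GPU-aware workload scheduling: jobs wait in a queue until a GPU with
# enough free memory exists, so expensive hardware is not left idle.
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpu_mem_gb: int

gpus = {"gpu-0": 80, "gpu-1": 80}        # free memory per device (GB), illustrative
queue = deque([Job("finetune-llm", 60), Job("embed-batch", 30), Job("eval", 40)])

def schedule() -> None:
    """Greedily place queued jobs on the first GPU with enough free memory."""
    pending = deque()
    while queue:
        job = queue.popleft()
        placed = False
        for gpu, free in gpus.items():
            if free >= job.gpu_mem_gb:
                gpus[gpu] -= job.gpu_mem_gb
                print(f"{job.name} -> {gpu} ({job.gpu_mem_gb} GB)")
                placed = True
                break
        if not placed:
            pending.append(job)          # wait for capacity instead of over-provisioning
    queue.extend(pending)

schedule()   # finetune-llm -> gpu-0, embed-batch -> gpu-1, eval -> gpu-1
```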
Deployment remains a major challenge, as different metrics need to be optimized for at the hardware level while ensuring that the output of the model is still high quality. Currently, the popular tools in the market include Amazon SageMaker, TensorFlow Serving, ONNX Runtime, Cortex, and Seldon, among others.
Model Optimization/Compression: There are many approaches to making AI development more efficient at the model level. Whether related to fine-tuning and retrieval (PEFT, RAG), reducing model size and parameter precision (quantization, QLoRA), model pruning, or other techniques, there is still room for innovation in preserving model quality and performance while selecting the optimal parameters and "settings". This process will also differ between computer vision models and LLMs, for example. Startups such as Clika and Lepton AI are trying to create smaller models that still perform well, but this is an area open to new approaches.
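One concrete compression technique is post-training dynamic quantization, sketched below with PyTorch: linear-layer weights are stored as int8 and dequantized on the fly, shrinking the model for CPU inference. The toy model and sizes are arbitrary, and this is only one of the approaches mentioned above (pruning, distillation, and QLoRA-style fine-tuning are others).

```python
# Post-training dynamic quantization of Linear layers to int8 in PyTorch.
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # only Linear layers are quantized
)

def size_mb(m: nn.Module) -> float:
    """Serialize the state dict to disk to compare on-disk footprints."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        torch.save(m.state_dict(), f.name)
    size = os.path.getsize(f.name) / 1e6
    os.remove(f.name)
    return size

print(f"fp32: {size_mb(model):.1f} MB, int8: {size_mb(quantized):.1f} MB")
```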
Inference: inference is easily one of the most expensive parts of AI development due to the sheer number of decisions and outputs that must be generated continuously. While many optimization approaches have been explored, there is still potential to achieve significant cost reductions at scale. In addition to model compression, current approaches include real-time inference (with continuous/dynamic batching), batch inference, deploying the model as a web service, and distributed computing (a minimal batching sketch follows the list below). However, these depend on latency and end-user requirements, and will need different levels of infrastructure to support the ongoing input and processing of data and meet end-user needs. Therefore, some companies are focusing specifically on particular model types, whether LLMs (TitanML, Unify) or general models across use cases (Pruna.ai).
There are still other considerations for improving inference, as latency, memory, and performance will vary based on the following application-level factors, some of which are not fully known in advance:
Output generation frequency (e.g. in batches, real-time, scheduled, etc.)
End-user requirements (e.g. how quickly the outputs need to be generated)
Amount of time that predictions need to be stored
Operational/maintenance requirements of the model
Amount of computational power the model requires
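As a rough sketch of the dynamic batching approach mentioned above, the example below buffers incoming requests briefly and runs them through the model together, trading a small amount of latency for higher throughput and accelerator utilization. The run_model() placeholder, thresholds, and request strings are illustrative assumptions rather than any serving framework's actual API.

```python
# Dynamic batching sketch: collect requests until a size or time limit,
# then run them through the model in one batched call.
import time
from queue import Queue, Empty

MAX_BATCH = 8          # flush when this many requests are waiting
MAX_WAIT_S = 0.01      # ...or after 10 ms, whichever comes first

requests = Queue()

def run_model(batch: list[str]) -> list[str]:
    return [f"output for {r}" for r in batch]   # placeholder for a batched forward pass

def serve_once() -> list[str]:
    """Collect up to MAX_BATCH requests within MAX_WAIT_S and run them together."""
    batch, deadline = [], time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH and time.monotonic() < deadline:
        try:
            batch.append(requests.get(timeout=max(0.0, deadline - time.monotonic())))
        except Empty:
            break
    return run_model(batch) if batch else []

for i in range(5):
    requests.put(f"request-{i}")
print(serve_once())    # all five requests handled in a single batched call
```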
Infrastructure as a service/serverless GPUs: many vendors are also offering on-demand GPUs without the need for customers to procure their own hardware. However, some challenges remain, such as generalized model autoscaling and the cold start problem. While it is easier to tailor approaches for particular model architectures, creating a generalized platform to scale ML model deployment up and down and manage larger workloads is still an unsolved area. Additionally, as a customer's AI application scales, it might make sense for them to eventually own their hardware rather than rent it, which is another important consideration. Finally, when serverless systems scale up, they can receive more requests than they can process until new instances finish initializing. Along with factors such as latency and autoscaling, which are affected by changing workloads, optimization becomes critical when dealing with large models and time-sensitive applications.
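The sketch below illustrates why the cold start problem arises in serverless inference: the first request on a fresh worker pays the model-loading cost, while warm requests reuse the cached model. load_model() and handler() are hypothetical names for illustration, not a specific provider's interface.

```python
# Cold vs. warm invocations in a serverless-style inference worker.
import time

_model = None   # module-level cache survives across warm invocations

def load_model():
    time.sleep(2.0)            # stand-in for downloading weights and moving them to a GPU
    return lambda prompt: f"echo: {prompt}"

def handler(prompt: str) -> str:
    global _model
    if _model is None:         # cold start: pay the load cost once per worker
        _model = load_model()
    return _model(prompt)

start = time.perf_counter(); handler("hi")        # cold: ~2 s
print(f"cold start: {time.perf_counter() - start:.2f}s")
start = time.perf_counter(); handler("hi again")  # warm: effectively instant
print(f"warm call:  {time.perf_counter() - start:.4f}s")
```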
˚。⋆ Looking ahead – what are the key opportunities in AI infrastructure?⋆。˚
Earlier this week, it was reported that NVIDIA is in the process of potentially acquiring Run:ai, an AI optimization platform for workload and compute management. The proposed deal, which could be valued at $1 billion USD, underlines the key role of optimization in AI development. The AI infrastructure stack is extensive and fragmented, yet vital for scaling the impact of AI. While the main categories include data, models, software, and hardware, there are opportunities that span these layers, presenting high-potential innovation opportunities. Optimizing AI infrastructure will be essential to making AI more scalable and sustainable, opening it up to a wider range of uses and sectors than previously possible.
Improving power consumption and performance during development and real-time inference, while balancing variables such as latency, throughput, and networking, cannot be ignored as demand for AI applications increases. Although model optimization has been the focus of current efforts, there are still many opportunities to further improve computational and energy efficiency and reduce rising development costs. There is also a lack of standardized pricing tied to these metrics, which makes it harder for customers to assess solutions. For use cases with strict application requirements, pricing could consider production workloads or hourly inference rates, or follow an on-demand model for customers building smaller models or prototypes. Ultimately, platforms should remain customizable for developer needs while addressing the impact of different choices (across accelerators, inference, and model compression, for example) on key metrics and pricing.
The hardware-software boundary also presents new opportunities for optimizing AI infrastructure by providing unified, vendor-agnostic platforms to leverage the latest frameworks, hardware, and model architectures. Due to the variety of hardware choices available for different use cases, software development toolkits and AI development platforms that improve usability and performance and help users maximize the capabilities of their chosen accelerators, servers, and hardware are a promising area for innovation. Given the ever-evolving nature of AI hardware and model architectures, helping developers create high-performing applications that make the most of current infrastructure presents an exciting landscape for innovation.
Further Reading:
Computing Power and the Governance of AI - Sastry et al. (2024)
Do We Really Have Too Much AI Infrastructure?
More on LLM Performance - Databricks
What do you think will be critical for advancing AI development? Are you working in an early-stage deeptech venture? Reach out, leave a comment, and keep in touch!