Absortio

Email → Summary → Bookmark → Email

Serving OPT-175B Language Model with Alpa

https://opt.alpa.ai/#generation Aug 13, 2022 14:28

Excerpt

Alpa is an open-source system for training and serving large-scale neural networks. Alpa aims to automate large-scale distributed training and serving with just a few lines of code. Alpa was initially developed by folks in the Sky Lab at UC Berkeley. Some of the advanced techniques used in Alpa are described in a paper published at OSDI 2022. The Alpa community is growing, with new contributors from Google, Amazon, AnyScale, and more.

Content

Alpa is an open-source system for training and serving large-scale neural networks. Alpa aims to automate large-scale distributed training and serving with just a few lines of code. Alpa was initially developed by folks in the Sky Lab at UC Berkeley. Some of the advanced techniques used in Alpa are described in a paper published at OSDI 2022. The Alpa community is growing, with new contributors from Google, Amazon, AnyScale, and more.

A language model is a probability distribution over sequences of words. It predicts the next word based on all the previous words. It is useful for a variety of AI applications, such as the auto-completion in your email or a chatbot service. For more information, check out the Wikipedia page on language models.
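As a toy illustration of "predicting the next word," the short Python sketch below scores candidate next words with a hand-written probability table; the table, the word pairs, and the function name are invented for the example and are not part of any real model.

# Toy illustration: a language model assigns a probability to each candidate
# next word given the previous words. Real models such as GPT-3 learn these
# probabilities from data; the tiny table here is made up for the example.
bigram_probs = {
    ("write", "an"): 0.4,
    ("write", "the"): 0.3,
    ("write", "code"): 0.2,
    ("write", "slowly"): 0.1,
}

def next_word_distribution(prev_word):
    # Collect P(next | prev) for every candidate next word in the table.
    return {w2: p for (w1, w2), p in bigram_probs.items() if w1 == prev_word}

print(next_word_distribution("write"))
# {'an': 0.4, 'the': 0.3, 'code': 0.2, 'slowly': 0.1}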

GPT-3 is a very large language model, with 175 billion parameters, that uses deep learning to produce human-like text. Many researchers and news articles have described GPT-3 as "one of the most interesting and important AI systems ever produced". GPT-3 is increasingly being used as a backbone in the latest NLP research and applications.

Due to its gigantic size, training and serving GPT-3 are very difficult and expensive, and pose significant challenges to the underlying software systems. The original GPT-3 trained by OpenAI is closed source and offered as a paid service: users have to pay for every token generated.

At a high level, Alpa is more automatic, scalable, and cost-effective than existing systems.

In more detail, if you are an ML developer or data scientist looking for a system that can train or serve large models like GPT-3, Alpa provides state-of-the-art performance while requiring the least amount of systems expertise to set up. Meanwhile, Alpa makes it possible to train or serve large models on older generations of (hence cheaper) GPUs, such as the 40GB A100, V100, T4, M60, etc., which are common in many in-house clusters and more accessible to many people.
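As a rough sketch of the "few lines of code" workflow, the snippet below follows the alpa.parallelize pattern from Alpa's documentation; the tiny linear model, the SGD update, and the data are placeholders invented for this example, and details such as whether alpa.init must be called first depend on your setup.

import alpa
import jax
import jax.numpy as jnp

# Minimal sketch, assuming the alpa.parallelize decorator from the Alpa docs.
# The model and update rule below are toy placeholders, not a real recipe.
@alpa.parallelize
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["x"] @ p["w"] + p["b"]
        return jnp.mean((preds - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    # Alpa compiles this single-device JAX function into a distributed
    # program; sharding and pipelining are chosen automatically.
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((128, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((32, 128)), "y": jnp.ones((32, 1))}
params = train_step(params, batch)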

If you are a system developer aiming to build better training or serving systems, Alpa, as a compiler, offers the most flexibility to try out various ML parallelization methods (inter- and intra-operator parallelism), and the richest coverage of big model architectures (GPT-3, MoE, WideResNet, etc.). Alpa might be a good starting point for your prototyping.
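To make the distinction concrete, here is a conceptual NumPy sketch (not Alpa code) of the two families: inter-operator parallelism places different operators on different devices, while intra-operator parallelism shards a single operator across devices. Device placement is only simulated in the comments.

import numpy as np

x = np.random.rand(8, 512)       # a batch of activations
w1 = np.random.rand(512, 512)    # layer 1 weights
w2 = np.random.rand(512, 512)    # layer 2 weights

# Inter-operator parallelism: different operators (layers) run on different
# devices, forming a pipeline.
h = x @ w1          # would run on device 0
y_inter = h @ w2    # would run on device 1

# Intra-operator parallelism: one operator is sharded across devices. Here
# layer 1's weight matrix is split column-wise between two devices and the
# partial results are concatenated.
w1_dev0, w1_dev1 = np.split(w1, 2, axis=1)
h_sharded = np.concatenate([x @ w1_dev0, x @ w1_dev1], axis=1)
assert np.allclose(h, h_sharded)   # same result as the unsharded computation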

If you are an amateur in ML/NLP/systems, well 😛, you can play with OPT-175B inference for free, while all existing services charge you for each token generated.

It depends on which types of GPUs are used. A hard constraint for now is that the total GPU memory in the cluster needs to be greater than 350GB in order to successfully run model inference. Many existing training or serving systems rely on the latest generations of GPUs with the largest memory capacity, such as the 80GB A100. In contrast, Alpa, thanks to its more powerful backend, can serve OPT-175B with more flexible parallelisms on older generations of GPUs, such as the 40GB A100, V100, T4, M60, etc.

For example, if you choose to use 16GB V100 GPUs, you would need at least 350 / 16 ≈ 22 V100 GPUs to run the service.
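The same back-of-the-envelope arithmetic, written out in Python for the GPU sizes mentioned above; the 350GB figure is the constraint stated earlier, and the list of GPU types is only illustrative.

import math

# Rough GPU count needed for OPT-175B inference: the cluster must provide
# at least ~350 GB of total GPU memory (the constraint stated above).
REQUIRED_GB = 350
gpu_memory_gb = {"80GB A100": 80, "40GB A100": 40, "16GB V100": 16}

for gpu, mem in gpu_memory_gb.items():
    print(f"{gpu}: at least {math.ceil(REQUIRED_GB / mem)} GPUs")
# 80GB A100: at least 5 GPUs
# 40GB A100: at least 9 GPUs
# 16GB V100: at least 22 GPUs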

We are working on a feature to enable serving models even if you do not have enough GPU memory. Stay tuned.

Alpa currently runs on top of a Ray cluster and uses Ray functionality to coordinate distributed processes. However, in contrast to Ray, Alpa is designed as a compiler for large-scale distributed machine learning training and serving with high performance.
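A minimal sketch of how Alpa attaches to a Ray cluster, assuming the alpa.init(cluster="ray") entry point from the Alpa docs; the Ray address is a placeholder for however your cluster is started (e.g. ray start --head on the head node).

import ray
import alpa

# Connect to an already-running Ray cluster, then let Alpa discover the GPUs
# that Ray manages. Both calls are sketched from the Alpa/Ray docs; adjust the
# address and cluster setup to your environment.
ray.init(address="auto")
alpa.init(cluster="ray")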