good luck have fun
Chat with open-source models
Run (almost) any language model
NOTE: Thanks for the launch day interest! Signups are back up!
If you made an account on launch day, you'll need to sign in again with Google (or re-register your email/password), but all of your data, including chat history and API keys, will still be there when you sign back in. Sorry for the chop! We ended up with a lot more traffic than expected.
We use vLLM and a custom-built, autoscaling GPU scheduler to run (almost) any open-source large language model for you: just paste a link to the Hugging Face repo. You can use our chat UI, or our OpenAI-compatible API. We'll let you use up to eight Nvidia A100 80GB GPUs.
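Since the API is OpenAI-compatible, the standard OpenAI client libraries should work once pointed at our base URL. Here's a rough sketch using the official openai Python package; the base URL path and the hf:<org>/<repo> model-name scheme shown are placeholders for illustration, so check your account settings for the exact values:

```python
# Minimal sketch: a chat completion against an OpenAI-compatible endpoint.
# The base_url and the "hf:" model-name prefix are assumptions for
# illustration; substitute the values from your account settings.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                      # generated in the web UI
    base_url="https://glhf.chat/api/openai/v1",  # assumed endpoint path
)

response = client.chat.completions.create(
    model="hf:meta-llama/Meta-Llama-3.1-8B-Instruct",  # any supported HF repo
    messages=[{"role": "user", "content": "Hello! Which model are you?"}],
)
print(response.choices[0].message.content)
```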
Works with any full-weight or 4-bit AWQ repo on Hugging Face that vLLM supports, including:
- Meta Llama 3.1 405b Instruct (and 70b, and 8b)
- Qwen 2 72b
- Mixtral 8x22b
- Gemma 2 27b
- DeepSeek Coder V2 Lite (support for the full model is in the works)
- Phi-3
And many more. We'll run full-weight fine-tunes as well, like those from Nous Research, or uncensored anti-refusal "abliterated" models; see the sketch below.
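Fine-tunes work the same way through the API: you reference the Hugging Face repo path as the model name. A sketch under the same placeholder assumptions as above (the repo shown is just an example community fine-tune):

```python
# Same placeholder assumptions as the sketch above (base_url and "hf:" prefix
# are illustrative, not confirmed endpoints).
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://glhf.chat/api/openai/v1")

# A community fine-tune is addressed by its repo path, just like an
# official release.
response = client.chat.completions.create(
    model="hf:NousResearch/Hermes-3-Llama-3.1-8B",  # example fine-tune repo
    messages=[{"role": "user", "content": "Who trained you?"}],
)
print(response.choices[0].message.content)
```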
For the most popular models, we proxy to always-on inference providers for you automatically. For the more bespoke models, we'll spin up a cluster for you on-demand, and spin it down once you're done using it.
It's free during the beta period, while we work out the kinks and figure out how to price it. Once the beta is over, we expect to significantly beat the pricing of the major cloud GPU vendors due to our ability to run the models multi-tenant.