When you get started with fine-tuning AI models, you typically pull the datasets
and models from somewhere like the Hugging Face Hub. This is generally fine, but
as your use case grows and gets more complicated, you're going to run into two
big risks:
- The things that are critical to your business will be hosted by someone else,
on a platform that doesn't have a public SLA (Service-Level Agreement, a
commitment to uptime with financial penalties when it is violated).
- Your dataset will grow beyond what you can fit into RAM (or even your hard
disk), and you'll have to start sharding it into chunks that are smaller than
RAM.
Most of the material you'll find online deals with the "happy path" of training
AI models, but the real world is not as kind as the happy path. Your data will
be bigger than RAM. You will end up needing to make your own copies of datasets
and models because they will be taken offline without warning. You will need to
be able to move your work between providers because price hikes will happen.
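To make "sharding into chunks smaller than RAM" concrete, here is a minimal sketch using only the Python standard library. It streams records into JSONL files capped at a byte budget and reads them back one record at a time. The names `write_shards`, `read_shards`, and `max_shard_bytes` are hypothetical and chosen for illustration; a real pipeline would more likely use a columnar format and a dataset library, but the idea is the same.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def write_shards(records: Iterable[dict], out_dir: Path, max_shard_bytes: int) -> List[Path]:
    """Stream records into JSONL shards of at most max_shard_bytes each.

    A single record larger than the budget still gets its own shard.
    Only one record is held in memory at a time.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: List[Path] = []
    handle = None
    written = 0
    for record in records:
        line = (json.dumps(record) + "\n").encode("utf-8")
        # Open a new shard if none is open yet, or if this line would
        # push the current (non-empty) shard over the byte budget.
        if handle is None or (written > 0 and written + len(line) > max_shard_bytes):
            if handle is not None:
                handle.close()
            path = out_dir / f"shard-{len(shards):05d}.jsonl"
            handle = path.open("wb")
            shards.append(path)
            written = 0
        handle.write(line)
        written += len(line)
    if handle is not None:
        handle.close()
    return shards


def read_shards(shards: Iterable[Path]) -> Iterator[dict]:
    """Yield records shard by shard without loading a whole shard at once."""
    for path in shards:
        with path.open("rb") as f:
            for line in f:
                yield json.loads(line)
```

Because both functions work on iterators, the full dataset never has to fit in memory. Each shard can also be uploaded, downloaded, or processed independently.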
The unfortunate part is that this is the place where you're left to figure it
out on your own. Let's break down how to do larger scale model training in the
real world with a flow that can expand to any dataset, model, or cloud provider
with minimal changes required. We're going to show you how to use Tigris to
store your datasets and models, and how to use SkyPilot to abstract away the
compute layer so that you can focus on the actual work of
training models. This will help you reduce the risk involved with training AI
models on custom datasets by importing those datasets and models once, and then
always using that copy for training and inference.
A blue tiger surfs the internet waves, object storage in tow. The image has an
ukiyo-e style with flat pastel colors and thick outlines.
Generation details: Generated using Counterfeit v3.0 using a ComfyUI flow
stacking several LoRA adapters as well as four rounds of upscaling and
denoising. Originally a sketch by Xe Iaso.