
· 9 min read
Xe Iaso

A cartoon tiger desperately runs away from a datacentre fire. Image generated using Flux [pro].

The software ecosystem is built on a bedrock of implicit trust. We trust that software won’t have deliberately placed security vulnerabilities and won’t be yanked offline without warning. AI models aren’t exactly software, but they’re distributed using a lot of the same platforms and technology as software, so people assume they’re distributed under the same social contract as software.

The AI ecosystem has a lot of the same distribution and trust challenges as software ecosystems do, but with much larger blobs of data that are harder to introspect. There are fears that something will go wrong with some large model and make a splash even greater than the infamous left-pad incident of 2016. These kinds of attacks seem unthinkable, but they are inevitable.

How can you defend against AI supply-chain attacks? What are the risks? Today I’m going to cover what we can learn from the left-pad incident and how making a copy of the models you depend on can make your products more resilient.

· 19 min read
Xe Iaso

A majestic blue tiger surfing on the back of a killer whale. The image evokes Ukiyo-E style framing. Image generated using Flux [pro].

DeepSeek R1 is a frontier mixture-of-experts reasoning model released by DeepSeek on January 20th, 2025. Along with making the model available via DeepSeek's API, they released the model weights on Hugging Face and a paper about how they got it working.

DeepSeek R1 is a Mixture of Experts model. This means that instead of all of the model weights being trained and used at the same time, the model is broken up into 256 "experts" that each handle different aspects of the response. This doesn't mean that one "expert" is best at philosophy, music, or other subjects; in practice, one expert will end up specializing in the special tokens (begin message, end message, role of interlocutor, etc.), another will specialize in punctuation, some will focus on visual description words or verbs, and some can even focus on proper names or numbers. The main advantage of a Mixture of Experts model is that it gets you much better results with much less compute spent in training and at inference time. There are some minor difficulties involved in making sure that tokens get spread out between the experts in training, but it works out in the end.
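
To make that concrete, here's a toy sketch of top-k expert routing in PyTorch. This illustrates the routing idea only; the sizes and router design here are made up and are far smaller than DeepSeek R1's actual architecture.

import torch
import torch.nn.functional as F

# Toy top-k expert routing. Sizes are illustrative, not DeepSeek R1's real ones.
num_experts, top_k, dim = 8, 2, 64
router = torch.nn.Linear(dim, num_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim) for _ in range(num_experts))

def moe_forward(x):  # x: (tokens, dim)
    scores = F.softmax(router(x), dim=-1)      # score every expert per token
    weights, idx = scores.topk(top_k, dim=-1)  # keep only the top k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
    rows = []
    for t in range(x.shape[0]):
        # only top_k of num_experts actually run for each token
        rows.append(sum(w * experts[int(e)](x[t]) for w, e in zip(weights[t], idx[t])))
    return torch.stack(rows)

print(moe_forward(torch.randn(4, dim)).shape)  # torch.Size([4, 64])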

· One min read
Xe Iaso

A bunch of wrenches on a tool rack.

Recently, Amazon made changes to its S3 client libraries that broke Tigris support. We have made fixes on our end, and you can upgrade to the latest releases of the AWS CLI, AWS SDK for Python (boto3), AWS SDK for JavaScript, AWS SDK for Java, and AWS SDK for PHP.

If you are running into any issues with these updated SDK releases, please reach out via Bluesky, LinkedIn, or X (formerly Twitter).

· 8 min read
Xe Iaso

A majestic blue tiger riding on a sailing ship. The tiger is very large. Image generated using PonyXL.

AI models can get pretty darn large. Larger models seem to perform better than smaller models, but we don’t quite know why. My work MacBook has 64 gigabytes of RAM and I’m able to use nearly all of it when I do AI inference. Somehow these 40+ gigabyte blobs of floating point numbers are able to take a question about the color of the sky and spit out an answer. At some level this is a miracle of technology, but how does it work?

Today I’m going to cover what an AI model really is and the parts that make it up. I’m not going to cover the linear algebra at play or the internals of neural networks. Most people want to start with an off-the-shelf model, anyway.

· 4 min read
Xe Iaso

Hey all. Recently AWS released boto3 version 1.36.0, and in the process they changed how the upload_file call works. This will cause uploads to Tigris with boto3 version 1.36.0 or higher to fail with the following error message:

boto3.exceptions.S3UploadFailedError: Failed to upload ./filename.jpg to mybucket/filename.jpg: An error occurred (MissingContentLength) when calling the PutObject operation: You must provide the Content-Length HTTP header.

In order to work around this, downgrade boto3 to the last release of version 1.35.x:

pip install boto3==1.35.95

Make sure that you persist this in your requirements.txt, pyproject.toml, or whatever you use to do dependency management.

You might also hit this with the JavaScript client at version v3.729.0 or later. In order to fix that, downgrade to version v3.728.0:

npm install @aws-sdk/client-s3@3.728.0
npm install @aws-sdk/s3-request-presigner@3.728.0

Make sure the changes are saved in your package.json file.

We’re fixing this on our end, but we want to take a minute to clarify why this is happening and what it means for Tigris to be S3 compatible.

What does it mean to be S3 compatible?

At some level, an API is just a set of calls that have listed side effects. You upload an object by name and later are able to get that object back when you give the name. The devil is in the details, and like any good API there are a lot of details.

In a perfect world, when you switch to Tigris, you drop it into place and then you don’t need to think anymore. We don’t live in a perfect world, so Tigris has a list of compatible API calls, and if your app only uses those calls you’ll be fine. Most apps are perfectly happy with that set of calls (in fact, most apps use about 5 of them at most). We are adding support for any missing calls as reality demands and time allows. Our goal is that there are no breaking changes when anything else gets released, client or server.
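
To give you an idea of how small that surface is, here's a minimal sketch of those few calls using boto3. The endpoint URL and bucket name are placeholders; substitute your own values from the Tigris docs.

import boto3

# Minimal sketch of the handful of S3 calls most apps use.
# The endpoint URL and bucket name are placeholders.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

s3.put_object(Bucket="mybucket", Key="hello.txt", Body=b"Hello, Tigris!")
print(s3.get_object(Bucket="mybucket", Key="hello.txt")["Body"].read())
print(s3.list_objects_v2(Bucket="mybucket").get("Contents", []))
s3.head_object(Bucket="mybucket", Key="hello.txt")
s3.delete_object(Bucket="mybucket", Key="hello.txt")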

S3’s API was originally meant to be used with Amazon S3. It has since become a cross-cloud standard: any cloud you can think of likely has an S3-compatible object storage system. It’s become the POSIX abstraction for the cloud. Any changes to the API touch a whole host of edge cases that the creators of S3 probably don’t have in mind.

Tigris, DigitalOcean, MinIO, R2, and others were all affected by this change. We found out about this breakage when one of our tests broke in a new and exciting way that confused us. From what we can tell, users of boto3 and the JavaScript client found out about this change by their production code breaking without warning. Even some of AWS’ own example code broke with this change.

I feel bad for the team behind the S3 API changes; they’re probably not getting very much love from the developer community right now. If this were an outage, I’d say #hugops. I’m not sure what to say this time, other than that I hope this post helps you make your code work again.

We take S3 compatibility seriously, so we’re taking this incident seriously too: we’re updating our testing practices to make sure that we have more advance warning should something like this happen again.

We’re updating Tigris so that developers can use this new version of the S3 client. We’ll have that rolled out soon. Follow us on Bluesky @tigrisdata.com or on LinkedIn to keep up to date!

Want to try it out?

Make a global bucket with no egress fees and use it with Python or JavaScript.

· 10 min read
Xe Iaso

Earlier this year I started consolidating some workloads onto my homelab Kubernetes cluster. One of the last ones was a doozy. It's not a compute-hard or a memory-hard workload, it's a storage-hard workload. I needed to move the DJ set recording bot for an online radio station off of its current cloud and onto my homelab, but I still wanted the benefits of the cloud, such as remote backups I don't have to think about.

This bot has been running for a decade and the dataset well predates that: over 675 GiB of DJ sets, including ones that were thought to be lost media. Each of these sets is a 320 kbps MP3 file that is anywhere from 150 to 500 MB, with smaller text files alongside it.

Needless to say, this dataset is very important to me. The community behind this radio station is how I've met some of my closest friends. I want to make sure that it's backed up and available for anyone that wants to go back and listen to the older sets. I want to preserve these files and not just dump them in an Archive bucket or something that would make it hard or slow to access them. I want these to be easily accessible to help preserve the work that goes into live music performances.

Here's how I did it and made it easy with Tigris.

An extreme close-up of a tiger with blue and orange fur. The Kubernetes logo replaces its iris.

· 6 min read
Katie Schilling

What do you do when you need to serve up a completely custom, 7+ billion parameter model with sub-10-second cold start times, without writing a Dockerfile or managing scaling policies yourself? It sounds impossible, but Beam's serverless GPU platform provides performant, scalable AI infrastructure with minimal configuration. Your code already does the AI inference in a function. Just add a decorator to get that function running somewhere in the cloud with whatever GPU you specify. It turns on when you need it and turns off when you don't. This can save you orders of magnitude in cost over running a persistent GPU in the cloud.
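
As a sketch of what that decorator pattern looks like (the decorator name and parameters here are assumptions on our part; check Beam's documentation for the exact API):

from beam import endpoint  # assumed import; see Beam's docs for the real API

# Hypothetical sketch: decorate your existing inference function and Beam
# runs it on a GPU provisioned on demand. Names and parameters are assumptions.
@endpoint(gpu="A10G", memory="16Gi")
def predict(prompt: str) -> str:
    # your existing model inference code goes here; placeholder response below
    return f"echo: {prompt}"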

Tigris tiger watching a beam from a ground satellite. Image generated with Flux [dev] from Black Forest Labs on fal.ai.

· 21 min read
Xe Iaso

When you get started with finetuning AI models, you typically pull the datasets and models from somewhere like the Hugging Face Hub. This is generally fine, but as your use case grows and gets more complicated, you're going to run into two big risks:

  • You're going to depend on the things that are critical to your business being hosted by someone else, on a platform that doesn't have a public SLA (Service-Level Agreement, or commitment to uptime with financial penalties when it is violated).
  • Your dataset will grow beyond what you can fit into RAM (or even on your hard disk), and you'll have to start sharding it into chunks that are smaller than RAM (see the sketch after this list).
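
Here's a minimal sketch of that sharding pattern with boto3: stream shards from object storage one at a time so that only a single shard has to fit in RAM. The endpoint URL, bucket, and prefix are placeholders.

import boto3

# Minimal sketch: stream dataset shards one at a time instead of loading
# the whole dataset. Endpoint, bucket, and prefix are placeholders.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def iter_shards(bucket, prefix):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # only one shard is resident in RAM at any moment
            yield s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()

for shard in iter_shards("my-datasets", "corpus/shard-"):
    print(f"got a {len(shard)}-byte shard")  # your training step goes here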

Most of the stuff you'll find online deals with the "happy path" of training AI models, but the real world is not as kind as that happy path. Your data will be bigger than RAM. You will end up needing to make your own copies of datasets and models because they will be taken offline without warning. You will need to be able to move your work between providers because price hikes will happen.

The unfortunate part is that this is the place where you're left to figure it out on your own. Let's break down how to do larger scale model training in the real world with a flow that can expand to any dataset, model, or cloud provider with minimal changes required. We're going to show you how to use Tigris to store your datasets and models, and how to use SkyPilot to abstract away the compute layer so that you can focus on the actual work of training models. This will help you reduce the risk involved with training AI models on custom datasets by importing those datasets and models once, and then always using that copy for training and inference.

A blue tiger surfs the internet waves, object storage in tow. The image has an ukiyo-e style with flat pastel colors and thick outlines.

Generation details: generated using Counterfeit v3.0 with a ComfyUI flow stacking several LoRA adapters as well as four rounds of upscaling and denoising. Originally a sketch by Xe Iaso.

· 19 min read
Xe Iaso

A nomadic server hunting down wild GPUs in order to save money on its cloud computing bill. Image generated with Flux [dev] from Black Forest Labs on fal.ai.

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens of items that all boil down to permutations of the same basic ingredients: meat, cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are combined in new and interesting ways to give you the Crunchwrap, the Chalupa, the Doritos Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re all really simple things. The best-designed production systems I’ve ever used take the same basic idea: build exciting things out of boring components that are well understood across all facets of the industry (e.g., S3, Postgres, HTTP, JSON, YAML). This all adds up to a pitch deck aimed at disrupting the industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads, or sell you the results of running those workloads for you, but nobody really tells you how to make this yourself. That’s the special Mexican Pizza sauce that you can’t replicate at home no matter how much you want to.

Today, we’ll cover how you, a random nerd who likes reading architectural articles, should design a production-ready AI system so that you can maximize effectiveness per dollar, reduce dependency lock-in, and separate concerns down to their cores. Buckle up, it’s gonna be a ride.

· 5 min read
Katie Schilling

A library with a fractal of bookshelves in all directions, wooden ladders connecting the floor to the shelves. Many blue tigers tend to the books. Image generated with Flux [pro] 1.1 from Black Forest Labs on fal.ai.

When you have a lot of data, maybe even Big Data ™️, you might start to wonder why you're paying so much to keep it all hot and ready. Do you really need that prior version of your model weights from last year to be available instantly? Let's be clear though: we're happy to serve you petabytes of old model weights and datasets… but we'd rather help you save some money on your infrastructure budget.

When you create new objects or buckets, you can select the storage tier to put them in: Standard, Infrequent Access, or Archive. Everything you currently have in Tigris is likely in the Standard storage tier, and when you create new objects with the S3 API without specifying a storage tier, they'll end up in Standard too.

We've updated our pricing with specifics, but you can expect to save $0.016 per GB per month by moving your backups and other old data from the Standard storage tier to the Archive storage tier. If you want to store one terabyte of data in the Archive tier, it will cost you $4 per month (at time of writing). At Infrequent Access rates, that will cost you $10 per month, and at Standard, it'll cost you $20 per month. This is a 5x cost reduction for data that you don't need often and can tolerate waiting an hour or so to be pulled out of Archive.

And, of course, none of our Storage Tiers include egress fees.

Want to try it out?

Make a global bucket with no egress fees

Deciding what tier to use

I'm sure you've heard of folks regretting their decision to archive data that they end up needing in a hurry. Here's a good rule of thumb to decide where objects should go: how much downtime can you tolerate when everything's on fire and you need that data NOW?

If you can tolerate an hour of downtime for that data to get restored from Archive, Archive is fine. If you can't, Infrequent Access is probably the best bet: Tigris returns Infrequent Access objects as rapidly as Standard tier objects.

Your database backups from 3 years ago or the shared drive from a long-completed project are probably not going to be accessed very often (maybe even never), so it makes sense to Archive them just in case. Your database backups from about 10 minutes ago are much more likely to be accessed, so it makes sense to put them into Infrequent Access. That way you can respond instantly to the wrong database being deleted instead of having to wait for an hour for the backups to load from Archive.

Here's how to use object tiers

When you create a new bucket in the Tigris Dashboard, you can select which storage tier you want objects to use by default:

A screenshot of the Tigris Dashboard showing the default storage tier selector for a newly created bucket. The options are: Standard, Infrequent Access, and Archive.

Choose between:

  • Standard: the default storage class. It provides high durability, availability, and performance for frequently accessed data.
  • Infrequent Access: lower-cost storage for data that isn't accessed frequently but requires rapid access when needed.
  • Archive: low-cost storage for long-term archiving of data that is rarely accessed.

Otherwise, you can set it when you upload a file:

# Standard
aws s3 cp --storage-class STANDARD hello.txt s3://your-bucket-name/your-object-name

# Infrequent Access
aws s3 cp --storage-class STANDARD_IA hello.txt s3://your-bucket-name/your-object-name

# Archive
aws s3 cp --storage-class GLACIER hello.txt s3://your-bucket-name/your-object-name
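
Or, if you're uploading with boto3 instead of the CLI, here's a minimal sketch using the same storage class names; the bucket, key, and endpoint URL are placeholders:

import boto3

# Minimal sketch: same storage class names as the CLI flags above.
# Bucket, key, and endpoint URL are placeholders.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

with open("hello.txt", "rb") as f:
    s3.put_object(
        Bucket="your-bucket-name",
        Key="your-object-name",
        Body=f,
        StorageClass="GLACIER",  # or "STANDARD", "STANDARD_IA"
    )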

What's up next

I bet you're thinking, "Wow, this would be really cool to use with a Lifecycle Rule feature so I can better manage my backups and older objects." Us, too! Lifecycle Rules are coming soon.

Convinced? Make a new bucket today and give Tigris a try.