· 10 min read
Xe Iaso

Earlier this year I started consolidating some workloads onto my homelab Kubernetes cluster. One of the last ones was a doozy. It's not a compute-hard or memory-hard workload; it's a storage-hard workload. I needed to move the DJ set recording bot for an online radio station off of its current cloud and onto my homelab, but I still wanted the benefits of the cloud, such as remote backups I don't have to think about.

This bot has been running for a decade and the dataset well predates that: over 675 GiB of DJ sets, including ones that were thought to be lost media. Each of these sets is a 320 kbps MP3 file that is anywhere from 150 to 500 MB, with smaller text files alongside them.

Needless to say, this dataset is very important to me. The community behind this radio station is how I've met some of my closest friends. I want to make sure that it's backed up and available for anyone that wants to go back and listen to the older sets. I want to preserve these files and not just dump them in an Archive bucket or something that would make it hard or slow to access them. I want these to be easily accessible to help preserve the work that goes into live music performances.

Here's how I did it and made it easy with Tigris.

An extreme close-up of a tiger with blue and orange fur. The Kubernetes logo replaces its iris.

· 6 min read
Katie Schilling

What do you do when you need to serve a completely custom, 7+ billion parameter model with sub-10-second cold start times, without writing a Dockerfile or managing scaling policies yourself? It sounds impossible, but Beam's serverless GPU platform provides performant, scalable AI infrastructure with minimal configuration. Your code already does the AI inference in a function. Just add a decorator to get that function running somewhere in the cloud with whatever GPU you specify. It turns on when you need it and turns off when you don't. This can save you orders of magnitude over running a persistent GPU in the cloud.

Tigris tiger watching a beam from a ground satellite. Image generated with Flux [dev] from Black Forest Labs on fal.ai

· 21 min read
Xe Iaso

When you get started with finetuning AI models, you typically pull the datasets and models from somewhere like the Hugging Face Hub. This is generally fine, but as your use case grows and gets more complicated, you're going to run into two big risks:

  • You're going to depend on the things that are critical to your business being hosted by someone else on a platform that doesn't have a public SLA (Service-Level Agreement, or commitment to uptime with financial penalties when it is violated).
  • Your dataset will grow beyond what you can fit into RAM (or even your hard disk), and you'll have to start sharding it into chunks that are smaller than RAM.

Most of the stuff you'll find online deals with the "happy path" of training AI models, but the real world is not quite as kind as this happy path is. Your data will be bigger than RAM. You will end up needing to make your own copies of datasets and models because they will be taken offline without warning. You will need to be able to move your work between providers because price hikes will happen.

The unfortunate part is that this is the place where you're left to figure it out on your own. Let's break down how to do larger scale model training in the real world with a flow that can expand to any dataset, model, or cloud provider with minimal changes required. We're going to show you how to use Tigris to store your datasets and models, and how to use SkyPilot to abstract away the compute layer so that you can focus on the actual work of training models. This will help you reduce the risk involved with training AI models on custom datasets by importing those datasets and models once, and then always using that copy for training and inference.

A blue tiger surfs the internet waves, object storage in tow. The image has an ukiyo-e style with flat pastel colors and thick outlines.

Generation details: Generated using Counterfeit v3.0 with a ComfyUI flow stacking several LoRA adapters as well as four rounds of upscaling and denoising. Originally a sketch by Xe Iaso.

· 19 min read
Xe Iaso

A nomadic server hunting down wild GPUs in order to save money on its cloud computing bill. Image generated with Flux [dev] from Black Forest Labs on fal.ai

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens of items that all boil down to permutations of six basic items: meat, cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are combined in new and interesting ways to give you the Crunchwrap, the Chalupa, the Doritos Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re all really simple things. The best designed production systems I’ve ever used take the same basic idea: build exciting things out of boring components that are well understood across all facets of the industry (e.g., S3, Postgres, HTTP, JSON, and YAML). This adds up to your pitch deck aiming at disrupting the industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads or the results of them inferencing AI workloads for you, but nobody really tells you how to make this yourself. That’s the special Mexican Pizza sauce that you can’t replicate at home no matter how much you want to be able to.

Today, we’ll cover how you, a random nerd that likes reading architectural articles, should design a production-ready AI system so that you can maximize effectiveness per dollar, reduce dependency lock-in, and separate concerns down to their cores. Buckle up, it’s gonna be a ride.

· 5 min read
Katie Schilling

A library with a fractal of bookshelves in all directions, wooden ladders connecting the floor to the shelves. Many blue tigers tend to the books. — Image generated with Flux [pro] 1.1 from Black Forest Labs on fal.ai

When you have a lot of data, maybe even Big Data ™️, you might start to wonder why you’re paying so much to keep it all hot and ready. Do you really need that prior version of your model weights from last year to be available instantly? Let’s be clear though: we’re happy to serve you petabytes of old model weights and datasets… but we’d rather help you save some money on your infrastructure budget.

When you create new objects or buckets, you can select the storage tier to put them in: Standard, Infrequent Access, or Archive. Everything you currently have in Tigris is likely in the Standard storage tier, and when you create new objects with the S3 API and don’t specify a storage tier, they’ll end up in Standard too.

We’ve updated our pricing with specifics, but you can expect to save $0.016 per GB per month by moving your backups and other old data from the Standard storage tier to the Archive storage tier. Storing one terabyte of data in the Archive tier costs $4 per month (at time of writing). At Infrequent Access rates that same terabyte costs $10 per month, and at Standard it costs $20 per month. That makes Archive 5x cheaper than Standard for data that you don’t need often and can tolerate waiting an hour or so for it to be pulled out of Archive.
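
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the per-terabyte prices quoted above (which may have changed since writing); the function names are made up for illustration, and it assumes a decimal terabyte of 1000 GB.

```python
# Monthly price in dollars per terabyte for each tier, as quoted in the text
# (at time of writing; check the current pricing page before relying on these).
PRICE_PER_TB_MONTH = {
    "Standard": 20.00,
    "Infrequent Access": 10.00,
    "Archive": 4.00,
}

GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_cost(tier: str, terabytes: float) -> float:
    """Return the monthly storage cost in dollars for the given tier."""
    return PRICE_PER_TB_MONTH[tier] * terabytes


def savings_per_gb(from_tier: str, to_tier: str) -> float:
    """Return the monthly saving in dollars per gigabyte when moving tiers."""
    delta_per_tb = PRICE_PER_TB_MONTH[from_tier] - PRICE_PER_TB_MONTH[to_tier]
    return delta_per_tb / GB_PER_TB


if __name__ == "__main__":
    for tier in PRICE_PER_TB_MONTH:
        print(f"{tier}: ${monthly_cost(tier, 1):.2f} per month for 1 TB")
    saving = savings_per_gb("Standard", "Archive")
    print(f"Standard -> Archive saves ${saving:.3f} per GB per month")
```

The $20 and $4 figures give a $16 difference per terabyte, which is where the $0.016 per GB saving comes from.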

And, of course, none of our Storage Tiers include egress fees.

Want to try it out?

Make a global bucket with no egress fees

Deciding what tier to use

I’m sure you’ve heard of folks regretting their decision to archive data that they end up needing in a hurry. Here’s a good rule of thumb to decide where objects should go: how much downtime can you tolerate when everything’s on fire and you need that data NOW?

If you can tolerate an hour of downtime for that data to get restored from Archive, Archive is fine. If you can’t, Infrequent Access is probably the best bet: Tigris returns Infrequent Access objects as rapidly as Standard tier objects.

Your database backups from 3 years ago or the shared drive from a long-completed project are probably not going to be accessed very often (maybe even never), so it makes sense to Archive them just in case. Your database backups from about 10 minutes ago are much more likely to be accessed, so it makes sense to put them into Infrequent Access. That way you can respond instantly to the wrong database being deleted instead of having to wait for an hour for the backups to load from Archive.
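
If you want to encode that rule of thumb in a script, it could look something like the sketch below. The function name and the 60 minute threshold are assumptions taken from the "an hour or so" figure in this post, not a guaranteed restore time.

```python
# Rough restore wait for Archive, based on the "an hour or so" figure in the
# text. This is an assumption for illustration, not a guaranteed number.
ARCHIVE_RESTORE_MINUTES = 60


def suggest_tier(frequently_accessed: bool, tolerable_wait_minutes: float) -> str:
    """Suggest a storage tier from access pattern and tolerable restore wait."""
    if frequently_accessed:
        # Hot data stays in the default tier.
        return "Standard"
    if tolerable_wait_minutes >= ARCHIVE_RESTORE_MINUTES:
        # You can afford to wait for a restore when everything is on fire.
        return "Archive"
    # Rarely read, but needed right away when it is read.
    return "Infrequent Access"


if __name__ == "__main__":
    # Three-year-old database backups: waiting a day is fine.
    print(suggest_tier(frequently_accessed=False, tolerable_wait_minutes=24 * 60))
    # Backups from ten minutes ago: no waiting allowed.
    print(suggest_tier(frequently_accessed=False, tolerable_wait_minutes=0))
```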

Here’s how to use object tiers

When you create a new bucket in the Tigris Dashboard, you can select which storage tier you want objects to use by default:

A screenshot of the Tigris Dashboard showing the default storage tier selector for a newly created bucket. The options are: Standard, Infrequent Access, and Archive.

Choose between:

  • Standard: The default storage class. It provides high durability, availability, and performance for frequently accessed data.
  • Infrequent Access: Lower-cost storage for data that isn’t accessed frequently, but requires rapid access when needed.
  • Archive: Low-cost storage for long-term data archiving with infrequent access.

Otherwise, you can set it when you upload a file:

Standard
aws s3 cp --storage-class STANDARD hello.txt s3://your-bucket-name/your-object-name

Infrequent Access
aws s3 cp --storage-class STANDARD_IA hello.txt s3://your-bucket-name/your-object-name

Archive
aws s3 cp --storage-class GLACIER hello.txt s3://your-bucket-name/your-object-name
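
If you script your uploads, the mapping between tier names and `--storage-class` values above can live in one place. The sketch below only builds the command line for the three commands shown; the helper name is hypothetical, and it assumes the `aws` CLI is already installed and configured to talk to Tigris.

```python
import shlex

# Tier name -> value passed to --storage-class, taken from the commands above.
STORAGE_CLASS_FOR_TIER = {
    "Standard": "STANDARD",
    "Infrequent Access": "STANDARD_IA",
    "Archive": "GLACIER",
}


def build_upload_command(tier: str, local_path: str, bucket: str, key: str) -> list[str]:
    """Build the `aws s3 cp` argument list for uploading into the given tier."""
    storage_class = STORAGE_CLASS_FOR_TIER[tier]
    return [
        "aws", "s3", "cp",
        "--storage-class", storage_class,
        local_path,
        f"s3://{bucket}/{key}",
    ]


if __name__ == "__main__":
    command = build_upload_command(
        "Archive", "hello.txt", "your-bucket-name", "your-object-name"
    )
    # Print the command instead of running it; pass the list to
    # subprocess.run(command, check=True) to actually perform the upload.
    print(shlex.join(command))
```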

What’s up next

I bet you're thinking, "Wow, this would be really cool to use with a Lifecycle Rule feature so I can better manage my backups and older objects." Us, too! Lifecycle Rules are coming soon.

Convinced? Make a new bucket today and give Tigris a try.

· 5 min read
Garren

Autumn trees on a dusty road in Magoebaskloof, South Africa. Photo by Garren Smith, iPhone 13 Pro.

Tigris now supports object notifications! Object notifications are how you receive events every time something changes in a bucket. Think of it as your bucket's way of saying "Hey, something happened! Come check it out!", much like the inotify subsystem in Linux. These notifications can be helpful for keeping track of what's going on in your application.

Use Case: Automatic Image Processing

Imagine you're building a photo-sharing app. Every time a user uploads a new picture, you want to automatically generate a thumbnail and maybe even run it through an AI to detect any inappropriate content. With object notifications, this becomes a breeze!

  1. User uploads an image to your Tigris bucket.
  2. Tigris sends a notification to your webhook.
  3. Your server receives the notification and springs into action.
  4. It downloads the new image, creates a thumbnail, and runs it through an AI check.
  5. The processed image and its metadata are saved back to Tigris.

All of this happens automatically, triggered by that initial upload.
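
Steps 3 to 5 could be sketched roughly like this. The payload shape (an `events` list whose entries carry a `key`) is a made-up placeholder, not the real notification format, and the download, thumbnail, moderation, and upload steps are passed in as plain functions so the sketch stays self-contained.

```python
from typing import Callable


def handle_notification(
    payload: dict,
    download: Callable[[str], bytes],
    make_thumbnail: Callable[[bytes], bytes],
    is_appropriate: Callable[[bytes], bool],
    upload: Callable[[str, bytes], None],
) -> list[str]:
    """Process every uploaded image named in a notification payload.

    Returns the keys that were processed. The payload fields used here are
    hypothetical; adapt them to the real notification format.
    """
    processed = []
    for event in payload.get("events", []):
        key = event["key"]  # hypothetical field name
        if key.startswith("thumbnails/"):
            # Skip our own output so saving a thumbnail back to the bucket
            # does not trigger another round of processing.
            continue
        image = download(key)
        if not is_appropriate(image):
            continue  # leave flagged images for a separate review flow
        upload(f"thumbnails/{key}", make_thumbnail(image))
        processed.append(key)
    return processed


if __name__ == "__main__":
    bucket = {"cat.jpg": b"raw image bytes"}  # stand-in for a real bucket
    done = handle_notification(
        {"events": [{"key": "cat.jpg"}]},
        download=bucket.__getitem__,
        make_thumbnail=lambda data: data[:3],  # placeholder "thumbnail"
        is_appropriate=lambda data: True,      # placeholder AI check
        upload=bucket.__setitem__,
    )
    print(done, sorted(bucket))
```

In a real service this function would sit behind whatever web framework receives the webhook request.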

Behind the Scenes: Building Object Notifications

Now, let's pull back the curtain and see how we built this feature and a few tricky situations we had to handle. Grab your hard hat, because we're going on a little tour of Tigris's inner workings!

Tigris isn't just any object store – it's a global object store. This means that objects can be changed in multiple regions around the world. This makes them available in multiple regions, always ready when you need them. But it also means we need a way of keeping track of all the changes for the same object. This is where replication comes in.

Replication: Keeping Everyone in the Loop

To make sure everything stays in sync, we replicate changes to multiple regions. This ensures high availability and improved redundancy of our objects.

The caveat to this is that replication is a background task, and the speed at which an object is replicated from one region to another can be affected by many external factors.

To handle changes that arrive late or out of order, when a region receives a change it compares the Last Modified timestamp in the metadata to determine whether the change is new and needs to be applied or whether the region has already seen a newer change. It discards the change if it is old.
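
As a rough illustration, this "newest Last Modified wins" check could look like the sketch below. The `Change` type, the in-memory dictionary standing in for a region's metadata, and the choice to discard ties are all assumptions made for the example, not a description of Tigris's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str
    last_modified: float  # assumed representation: seconds since the epoch
    data: bytes


def apply_replicated_change(region_state: dict[str, Change], change: Change) -> bool:
    """Apply a replicated change unless the region already has a newer one.

    Returns True if the change was applied, False if it was discarded.
    """
    current = region_state.get(change.key)
    if current is not None and current.last_modified >= change.last_modified:
        return False  # the region has already seen this or a newer change
    region_state[change.key] = change
    return True


if __name__ == "__main__":
    state: dict[str, Change] = {}
    newer = Change("photo.jpg", last_modified=200.0, data=b"v2")
    older = Change("photo.jpg", last_modified=100.0, data=b"v1")
    # The newer change happens to replicate first; the older one is dropped.
    print(apply_replicated_change(state, newer))
    print(apply_replicated_change(state, older))
    print(state["photo.jpg"].data)
```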

The Object Notification Hub

When object notifications are enabled for a bucket, we assign one region to be the object notification hub for that bucket. This region gets the important job of keeping track of all the changes. In that region's FoundationDB we create a special index, which is very similar to a secondary index. We order the changes by FoundationDB Versionstamp (assigned when the change is added to the index) and by the Last Modified timestamp of the object metadata.

The Versionstamp helps the worker keep track of which events it has seen and processed.

Why one region, you may ask? If we didn't do this, we would end up with multiple regions sending the same events to the webhook (hello, friendly DDoS attack), or we would have to build a complex system to coordinate the regions so they don't send duplicate events.

The Background Task: Our Diligent Messenger

In our object notification region, we have a background task running. Think of it as a tireless worker that's always on the lookout for changes. Every so often, it checks the special index we mentioned earlier, collects all the latest changes, and sends them off to the webhook.

The worker also keeps track of the last processed change and retries a few times if the request fails. Finally, it removes old changes that have already been processed from the index.
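
One pass of such a worker might be sketched as follows. This is a simplified model, not the real implementation: the index is an in-memory list, Versionstamps are modelled as plain integers, `send` stands in for the webhook request, and leaving events in place after the retries run out is an assumption.

```python
from typing import Callable

# (versionstamp, event) pairs; real Versionstamps are opaque byte strings,
# modelled here as increasing integers for simplicity.
IndexEntry = tuple[int, str]


def run_worker_pass(
    index: list[IndexEntry],
    cursor: int,
    send: Callable[[list[str]], bool],
    max_attempts: int = 3,
) -> int:
    """Send all changes newer than `cursor` and return the new cursor.

    `send` returns True when the webhook accepted the batch.
    """
    pending = sorted(entry for entry in index if entry[0] > cursor)
    if not pending:
        return cursor
    events = [event for _, event in pending]
    for _ in range(max_attempts):
        if send(events):
            cursor = pending[-1][0]  # remember the last processed change
            # Drop entries that have now been processed from the index.
            index[:] = [entry for entry in index if entry[0] > cursor]
            return cursor
    # Assumption: after giving up, events stay indexed for the next pass.
    return cursor


if __name__ == "__main__":
    change_index = [(1, "put a.jpg"), (2, "delete b.jpg")]
    new_cursor = run_worker_pass(change_index, 0, send=lambda events: True)
    print(new_cursor, change_index)
```

Tracking the cursor separately from the index is what lets a failed pass be retried later without losing or re-sending already processed changes.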

Why We Can't Guarantee Ordered Events

As discussed above, changes replicated from different regions can take different amounts of time to arrive. The problem arises when the worker is ready to send the latest events for an object: it has no way of knowing whether all changes for that object have been replicated to its region. It could in theory contact every region and check, but this would be prohibitively expensive and still would not be a complete guarantee.

This forces us to make the trade-off of sending events out of order. The worker reads the latest list of changes that have been replicated to the region and sends them to the webhook.

Wrapping Up

That's how we built object notifications in Tigris. We took a global system, added some global replication, threw in a change index, and topped it off with a hardworking background task.

The result? A system that keeps you in the loop about what's happening in your buckets, no matter where in the world those changes occur. Whether you're building the next big photo-sharing app or just want to keep tabs on your storage, object notifications have got your back!

We hope this peek behind the scenes was fun and informative. Happy coding!

· 5 min read
Xe Iaso

Docker is the universal package format of the internet. When you deploy software to your computers, chances are you build your app into a container image and deploy it through either Docker or something that understands the same formats that Docker uses. However, this is where they get you: Docker image storage in the cloud is not free. Docker registries also have strict image size limits and will charge you egress fees based on the size of your images.

What if you could host your own registry, though? What if, in doing so, you could actually get a better experience than you get with the hosted registries on the big cloud?

A sea of scattered clouds covers the land beneath. Photo by Xe Iaso, iPhone 15 Pro Max @ 22mm.

· 4 min read
Katie Schilling
Xe Iaso

At Tigris Data, we provide object storage to our users. People put bytes into our servers with a name, and expect that come hell or high water, when they put in the name, they get the exact same bytes back. This is a very high trust position to be in because when people ask themselves things like “Oh, what would happen if my object storage provider is unreliable”, that conversation usually involves phrases like “Maybe we should have gone with The Big Cloud after all”.

Such conversations are rarely good for the business.

A battle rages on in the field, yet the strong oak tree remains unscathed