A majestic blue tiger surfing on the back of a killer whale. The image evokes Ukiyo-E style framing. Image generated using Flux [pro].
DeepSeek R1 is a mixture-of-experts reasoning frontier AI model released by DeepSeek on January 20th, 2025. Along with making the model available through DeepSeek's API, they released the model weights on Hugging Face and a paper about how they got it working.
DeepSeek R1 is a Mixture of Experts model. This means that instead of all of the model weights being trained and used at the same time, the model is broken up into 256 "experts" that each handle different aspects of the response. This doesn't mean that one "expert" is best at philosophy, music, or other subjects; in practice one expert will end up specializing in the special tokens (begin message, end message, role of interlocutor, etc.), another will specialize in punctuation, some will focus on visual description words or verbs, and some can even focus on proper names or numbers. The main advantage of a Mixture of Experts model is that it gets you much better results with much less compute spent in training and at inference time. There are some minor difficulties involved in making sure that tokens get spread out evenly between the experts during training (load balancing), but it works out in the end.
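If you want a feel for the routing mechanics, here's a deliberately tiny sketch of a mixture-of-experts layer in Python. The sizes, the gating scheme, and the expert networks are toy assumptions for illustration; DeepSeek's real router, per-token expert count, and load-balancing tricks are more involved.

```python
# Toy sketch of Mixture-of-Experts routing (illustrative only; not DeepSeek's
# actual implementation). A small gating network scores every expert for each
# token, and only the top-k experts actually run.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # DeepSeek uses 256 routed experts; 8 keeps the demo small
TOP_K = 2         # how many experts each token is routed to
HIDDEN = 16       # toy hidden dimension

# Toy parameters: one gating matrix, plus one tiny weight matrix per expert.
gate_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))
expert_w = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN))

def moe_layer(token: np.ndarray) -> np.ndarray:
    # Score every expert for this token, then keep only the top-k.
    logits = token @ gate_w
    top = np.argsort(logits)[-TOP_K:]
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners

    # Only the chosen experts do any work; the rest sit idle for this token.
    out = np.zeros_like(token)
    for w, idx in zip(weights, top):
        out += w * np.tanh(token @ expert_w[idx])
    return out

token = rng.normal(size=HIDDEN)
print(moe_layer(token))
```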
DeepSeek R1 is a reasoning model. This means that the model generates a bunch of "reasoning tokens" before it gives you an answer. These reasoning tokens are effectively a scratch space where the model breaks down the problem and plans out a solution; then it ends its thinking and starts generating a response. This is a fairly new workflow, originally spelled out in OpenAI's Learning to reason with LLMs post in September 2024, and it's an extension of Chain-of-Thought prompting as pioneered by the Google Brain team in 2022.
Finally, DeepSeek R1 is a frontier-grade model, which means that the results are really good. Specifically, its results are almost as good as the big closed models like Anthropic's Claude, OpenAI's GPT and o-series models, and Google's Gemini models.
So with all that terminology out of the way, what does it look like? It's very similar to other models, but there's an additional reasoning bit. Here's an example:
Note that there are two responses from the model: the text answer you're used to seeing and the chain of thought that the model uses as "scratch space" to work out the best answer.
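If you hit the model over the API, the two parts come back as separate fields. Here's a rough sketch using the OpenAI-compatible Python client; the base URL, model name, and `reasoning_content` field match DeepSeek's public docs as I understand them, so double-check before relying on this.

```python
# Sketch of calling DeepSeek R1 through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder, not a real key
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

msg = resp.choices[0].message
print("chain of thought:", msg.reasoning_content)  # the "scratch space"
print("answer:", msg.content)                      # the reply you'd normally see
```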
Why is DeepSeek R1 so good?
One of the big questions that people have is: if Chinese AI companies have to deal with sanctions that limit their access to GPUs, why did one of the best models come from a Chinese AI company? There's a reason why this is the case, but it's not obvious at the outset.
Most of the time, people don't associate Nintendo with super-advanced high-tech 4K HDR graphics technology. Their consoles are actually fairly underpowered and require you to be clever with your optimizations. This is intentional: their hardware design philosophy encourages lateral thinking with well-understood and cheaply available technology. They buy an off-the-shelf SoC and wring every last bit of performance out of it in ways that no other company takes the time to do. When you target a Nintendo platform, you have hard limits that you have to be clever about in ways that you don't when you target a PlayStation 5 or an Xbox Series X. The PlayStation 5 and Xbox Series X are just so powerful that you can blow past all of the limits with ease.
Most western AI companies can just buy faster GPUs en masse. They have access to the NVIDIA H100 (the most powerful flagship AI GPU) and can purchase more of them depending on their needs. They can build giant datacenters and stuff them to the gills with as many GPUs as they want.
Chinese AI companies aren't so lucky. They have hard limits on how many GPUs they can buy as well as on what those cards can do. I tried to dig up the actual sanctions limits and I wasn't able to find clear guidance in the 300+ pages of legal text. However, DeepSeek says they train their models on a cluster of NVIDIA H800 GPUs. The H800 is a cut-down H100 with roughly half the chip-to-chip interconnect bandwidth, and large-scale AI training is mostly bandwidth bound. This is a near-fatal flaw that meant the DeepSeek team had to get clever.
The DeepSeek V3 paper has a lot more details about how they managed to wring all the performance they could from the cards they had, but here's the cliffsnotes:
- DeepSeek V3 (and R1) both use 256 routed experts, trained with 64-way expert parallelism spanning groups of 8 nodes.
- They created their own training framework, HAI-LLM, which dedicates a slice of each GPU's compute units to cross-node communication so that it overlaps with computation.
- They did the bulk of their training math using 8-bit floating point numbers so they could wring the most throughput out of every GPU-hour. It was previously thought that 8-bit floats were too imprecise to handle AI training workloads and still hit benchmark results like the ones DeepSeek achieved.
- They saved memory wherever possible in order to avoid costly cross-node communication and even more costly CPU to GPU memory copy calls.
- They used Multi-Head Latent Attention to shrink attention memory usage to 5-13% of what would normally be required. This also lets them cache inputs to the DeepSeek API, which is part of how they can serve the model so cheaply.
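To get a feel for why caching a small latent vector instead of full keys and values is such a big deal, here's a back-of-the-envelope calculation. Every dimension below is an assumption I made up for illustration, not DeepSeek's real configuration, so the ratio it prints won't line up exactly with the 5-13% figure (which also depends on what baseline you compare against).

```python
# Rough arithmetic for KV-cache memory per token. Dimensions are illustrative.
LAYERS = 60        # transformer layers (assumed)
HEADS = 128        # attention heads (assumed)
HEAD_DIM = 128     # dimension per head (assumed)
LATENT_DIM = 576   # compressed latent cached per token per layer (assumed)
BYTES = 2          # bf16

# Standard multi-head attention caches a full key and value vector for every
# head, in every layer, for every token in the context.
mha_per_token = LAYERS * 2 * HEADS * HEAD_DIM * BYTES

# Multi-Head Latent Attention caches one small latent vector per layer per
# token and reconstructs the keys and values from it on the fly.
mla_per_token = LAYERS * LATENT_DIM * BYTES

print(f"MHA cache per token: {mha_per_token / 1024:.0f} KiB")
print(f"MLA cache per token: {mla_per_token / 1024:.1f} KiB")
print(f"MLA uses {100 * mla_per_token / mha_per_token:.1f}% of the MHA cache")
```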
If DeepSeek had unrestricted access to advanced GPUs, few of these efficiency advancements would have been necessary. I was almost certain that we've been leaving a bunch of overhead on the table with how we use GPUs, and DeepSeek has confirmed it. I'm kinda surprised that the stock market reacted so violently to this news, though.
I suspect that one of the reasons people think DeepSeek R1 is so good is that they're reading the reasoning / chain-of-thought output. One of the most magical parts of ChatGPT's initial release in late 2022 was that you could talk with a computer and have it talk back in plain English. This anthropomorphized the computer (it became a "friend", or at least authoritative-coded enough that you perceived it as a smart person). The reasoning output of DeepSeek R1 is another place where you can see this kind of subconscious anthropomorphization happening. I suspect that when you see it yap away like:
That you subconsciously see that and start anthropomorphizing it because it looks like the computer is "thinking" in the same way that you would take a complicated problem or thought and break it down into smaller bits. It kinda does make the model look adorable.
Personally, I think that when you evaluate the output of a reasoning model like DeepSeek R1, you should not look at the reasoning before you look at the answer to reduce your own bias.
There's also some controversy around DeepSeek R1 supposedly being trained on the output of OpenAI models. Using synthetic data in AI training is fairly common practice. Training one model on another model's output is known as distillation, and it's the technique DeepSeek used to make the reasoning-distilled versions of Llama and Qwen models outlined in the DeepSeek R1 paper (section 2.4).
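In practice, "training on another model's output" mostly means building a supervised fine-tuning dataset out of a teacher's completions. Here's a loose sketch of that data-collection step; the prompts and file format are made up for the example, and DeepSeek's actual pipeline is the one described in the R1 paper.

```python
# Sketch: sample completions from a teacher model and save them as a
# fine-tuning dataset for a smaller student model.
import json
from openai import OpenAI

teacher = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

prompts = [
    "Prove that the square root of 2 is irrational.",
    "Write a function that checks whether a string is a palindrome.",
]

with open("distilled.jsonl", "w") as f:
    for prompt in prompts:
        resp = teacher.chat.completions.create(
            model="deepseek-reasoner",
            messages=[{"role": "user", "content": prompt}],
        )
        # Each line becomes one supervised fine-tuning example for the student.
        f.write(json.dumps({
            "prompt": prompt,
            "completion": resp.choices[0].message.content,
        }) + "\n")
```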
Let's be real: most AI companies use data generated from OpenAI models, but not in the way that you think. A lot of AI models are trained on webpages scraped from the internet. My own infrastructure has been struggling to cope with the immense load from AI scraping operations, and I've been unable to contact the operators of those bots to either give them the data they want through some external means or get them to stop assaulting my servers into submission.
AI slop has taken over the Internet. Infinite shrimp Jesus can be churned out on demand and drowns out human-authored content. A project that tracks the word frequency people use over the years has stopped collecting data because AI-generated articles have overwhelmed everything else; it seems we no longer have reliable word choice data from 2022 onwards. Even if you're not using OpenAI models directly, if you're scraping the internet for your training set, then you're still using OpenAI model output. Period.
DeepSeek's origins
DeepSeek is a research subsidiary of High-Flyer, a Chinese hedge fund and quantitative trading firm. They are dedicated solely to making AI tools and models better and have no concrete plans for commercialization beyond selling access to hosted copies of their models. This allows them to stay a pure research organization and avoid the complicated maze of AI regulations in Mainland China.
There have been a lot of fears that DeepSeek is some front for the Chinese government in order to collect data through vague means that could have vague impacts on people using their services. Usually people point to the terms of service having concerning phrases like:
By accepting this agreement, you are saying that you allow the forum engine to display your posts. Every time a major company changes their terms of service, it becomes An Issue™️ and people are fearful that they gave the company they're using to post to social media permission to distribute the things they post to social media.
There's probably something to be said about the differences in legal regimes between data hosted in the US and data hosted in Mainland China. Realistically, if you use social media, you should expect the content you create on your device to be distributed somewhere else.
The model weights are open to the public
One of the ways that DeepSeek R1 stands out is that the model weights are available to anyone on Hugging Face. No authentication or accepting of terms and conditions is required. Anyone can use them as long as they comply with the license (MIT).
This is huge. This means that anyone can use a frontier-grade model on any workflow they want with total privacy as long as they have enough hardware to be able to run the model. This lets you have some of the most advanced AI capabilities in places where you would never be able to run OpenAI or Anthropic models.
All you need is anything from 192 gigabytes of RAM in a Mac Studio to 1.2 terabytes of VRAM across multiple servers.
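Grabbing the weights is about as unceremonious as it gets. Something like this should work, assuming you have `huggingface_hub` installed and a few hundred gigabytes of free disk for the full checkpoint:

```python
# Pull the DeepSeek R1 weights from Hugging Face. No login or terms gate,
# just a (very large) download.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",  # the public R1 weights repo
    local_dir="./DeepSeek-R1",
)
```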
The fact that you can have the weights on your computer means that you can fine-tune or abliterate the model so that you can customize it to the tasks you want to do. For example, DeepSeek R1 lacks tool use, which limits its ability to fit into AI agent workflows. I suspect that within a few weeks, we'll see fine-tuning to enable tool use.
The distilled models
Along with the main R1 model, DeepSeek published distilled versions of R1 based on popular AI models. This demonstrates that training one model on the output of another works well, at the risk of model collapse if the process is repeated too many times. Clearly it's effective in the short term, and the outputs of a model are arguably public domain (copyright is usually restricted to human-authored work; see the court case about a photograph taken by a monkey).
DeepSeek R1 is good enough for most people
AI companies have no moat. If a competitor comes in and gives people output that is 95% as good as your model for 10% of the price, it's an easy decision to switch. Nearly every AI company exposes their models through an API compatible with the OpenAI API client, so changing providers is as easy as swapping out your API key and the API endpoint.
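To make that concrete: the exact same client code talks to either provider, and only the endpoint, key, and model name change. The endpoints, model names, and environment variable names below are examples, not a recommendation; check each provider's docs before relying on them.

```python
# Same code, different config: switching providers with an OpenAI-compatible API.
import os
from openai import OpenAI

PROVIDERS = {
    "openai":   {"base_url": "https://api.openai.com/v1", "model": "o1"},
    "deepseek": {"base_url": "https://api.deepseek.com",  "model": "deepseek-reasoner"},
}

choice = PROVIDERS[os.environ.get("LLM_PROVIDER", "deepseek")]

client = OpenAI(api_key=os.environ["LLM_API_KEY"], base_url=choice["base_url"])
resp = client.chat.completions.create(
    model=choice["model"],
    messages=[{"role": "user", "content": "Summarize the CAP theorem in one sentence."}],
)
print(resp.choices[0].message.content)
```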
Most of the time, AI model output is billed per million output tokens (about 750,000 words, or about two to three 3D-printed save icons worth of text). Input tokens are usually a lot cheaper than output tokens, but people tend to focus on the output token cost because it gives you an "upper bound" on cost.
Generating one million output tokens with OpenAI's O1 model costs $60. The same number of tokens from DeepSeek R1 costs $2.19, just 3.6% of the cost of OpenAI O1. DeepSeek R1 doesn't have vision support, tool calling, or structured outputs, but the cost more than makes up for it. Many applications can cope with having to "waste output" or do more complicated filtering on the client side because the model is just that much cheaper.
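Here's the arithmetic on a hypothetical workload, just to make the gap tangible. The monthly usage number is made up for the example; the per-token prices are the list prices quoted above.

```python
# Toy cost estimate at the quoted list prices.
O1_PER_MTOK = 60.00   # USD per million output tokens (OpenAI o1)
R1_PER_MTOK = 2.19    # USD per million output tokens (DeepSeek R1)

monthly_output_tokens = 250_000_000   # hypothetical workload: 250M output tokens/month

o1_bill = monthly_output_tokens / 1_000_000 * O1_PER_MTOK
r1_bill = monthly_output_tokens / 1_000_000 * R1_PER_MTOK

print(f"o1: ${o1_bill:,.2f}/month")   # $15,000.00/month
print(f"R1: ${r1_bill:,.2f}/month")   # $547.50/month
print(f"R1 is {R1_PER_MTOK / O1_PER_MTOK:.2%} of the price")  # 3.65%
```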
Developers have already been using DeepSeek R1 to do some pretty staggering things such as porting SIMD optimizations in llama.cpp from ARM to WebAssembly. It really does seem to be good enough, even for some advanced/specialized use cases.
One of its biggest advantages is that it can run on hardware that you can look at. You don't need to own a datacenter to run a frontier model anymore. For full precision, eliminating quantization loss on the model weights, you'd need to re-mortgage your house (or raise a Series B round) for a few spare 8xH100 servers. But you can get away with much less; in practice, a Mac Studio with 192Gi of memory is all you need. I'll try running it on two A100-80Gi or H100-80Gi cards, but I'm on a work trip and haven't had the time to do it yet.
The DeepSeek app works well enough on my iPhone that I have no real complaints. Sometimes the R1 model times out and I don't get a response, but if I disable reasoning mode then I get responses pretty quickly. It has helped me rewrite mediocre code, do the math on how many tokens can fit on a floppy disk, and given me some amusement messing around with the model in order to see the limits.
In summary
Overall, it's a really good model and I kinda wish I had the hardware I needed to run it in its full form locally. It's probably the best model you can run on local hardware today. It's not perfect, but it's way better than most of the other models in its openness and weight class. It's available to the public through DeepSeek's platform, and more importantly, you can run it yourself should you not want to trust DeepSeek at all.
I like it and I'm mixing it into the list of models that I use for my projects. I would have waited to write this article until I've built some things on top of this model, but y'all've been asking me for my takes on this right now, and who am I to not give the people what they want?
Want to try Tigris?
Make a bucket and store your models, training data, and artifacts across the globe!