In this episode of Unsupervised Learning, host Renee interviews Daniel, the co-founder of Unsloth, an AI training system that fine-tunes language models 30 times faster. They discuss Daniel's beginnings at Nvidia, his passion for making AI accessible and efficient, and his ultimate vision of creating a personal ChatGPT for everyone that operates on local machines. Daniel explains the concept of Retrieval Augmented Generation (RAG) as a knowledge injection system and elaborates on the current uses and future plans for Unsloth. The episode also touches on the issues with representing maths in language models and the misconceptions people have about working with large language models.
Episode 1!!! 🎉
Today we chat about AI Training with (un)Supervised Learning and Daniel from Unsloth.ai
The good stuff: Unsloth
https://www.unsloth.ai
https://ko-fi.com/unsloth
https://github.com/unslothai
Have something to say? Feedback, love notes, or recommending a mate to join the pod @ renee@unsupervisedlearning.co
00:00 Introduction to the Podcast
00:26 Understanding Unsloth: The AI Training System
00:58 Daniel's Journey from NVIDIA to Unsloth
02:15 The Power of OpenAI's Triton Language
02:38 The Magic Behind Unsloth's Fine-Tuning Process
03:42 Community Engagement and Use Cases of Unsloth
05:03 Working with Family in the AI Space
05:35 The Role of Autonomous Agents in AI Development
06:57 Challenges of Using Language Models for Math
09:03 Unsloth's Vision for Democratizing AI
09:56 Misconceptions and Best Practices in Working with LLMs
14:21 Understanding Retrieval Augmented Generation (RAG)
17:29 Staying Updated in the AI Space
18:26 Supporting Unsloth's Open Source Initiative
19:29 Conclusion: The Future of AI with Unsloth
Renee: [00:00:00] Hi, Renee here at Unsupervised Learning, your easy listening podcast for bleeding edge open source tech. Today we're speaking to Daniel, co-founder of Unsloth, the AI training system boasting 30 times faster fine-tuning for large language models. We speak on working with family, his beginnings at NVIDIA, what the hell RAG is anyway, and his favourite places to keep up to date.
Join me for this catch up on Unsupervised Learning.
Are you able to give a quick rundown of Unsloth, how it kind of came to be?
Daniel: So we make fine-tuning of language models 30 times faster.
We have an open-source package with 3,000 GitHub stars. It makes fine-tuning two times faster and reduces memory usage by 50%. So that's the goal of Unsloth: before, fine-tuning could be very slow and use lots of memory. Now, with our free Colab notebooks, you can just click run, or put in whatever dataset you like, and it just fine-tunes.
And finally, at the end, you can export it to GGUF, or export it for vLLM.
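To make that concrete, here's a minimal sketch of what a fine-tuning run with Unsloth's Python API might look like. The model name, dataset, and hyperparameters below are illustrative placeholders; Unsloth's own Colab notebooks are the maintained reference.

```python
# Minimal Unsloth fine-tuning sketch (illustrative only -- model id, dataset,
# and hyperparameters are placeholders; see unsloth.ai's notebooks).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantised base model through Unsloth's patched loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # example model id
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # example dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()

# Export for local runtimes such as llama.cpp / Jan. This helper exists in
# Unsloth, though its arguments may differ between versions.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```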
Renee: That's good to know.
So you [00:01:00] previously worked with NVIDIA. Is that where the idea of writing CUDA and Triton kernels came from, for Unsloth?
Daniel: Yeah, so I used to work at NVIDIA. My goal was to make algorithms on the GPU faster. I made t-SNE 2,000 times faster, and I made randomized SVD faster as well.
That was my old role. The goal of Unsloth, I think it came about last October: there was a competition called the LLM Efficiency Challenge, by NeurIPS. The goal was that you have one day, on one GPU, to train one LLM to attain the highest accuracy. We took a different approach: to get the highest accuracy, you can also train faster, to reach that accuracy sooner.
So we diverged and focused on training faster. And yeah, we did take our NVIDIA experience of writing CUDA kernels to Unsloth. We rewrote all of the kernels in OpenAI's Triton language. We [00:02:00] also did maths manipulations and lots of optimizations in the code.
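For context, Triton lets you write GPU kernels in Python-like code that compiles down to efficient machine code. The toy kernel below is not one of Unsloth's actual kernels, just a hedged sketch of what the language looks like:

```python
# Toy Triton kernel: a fused elementwise multiply-add on the GPU.
# Illustrative only -- Unsloth's real kernels (RoPE, RMSNorm, fused
# cross-entropy, etc.) are considerably more involved.
import torch
import triton
import triton.language as tl

@triton.jit
def fma_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard against the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * y + 1.0, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)               # enough blocks to cover the data
fma_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```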
Renee: Yeah. I was wondering if it's one of those cases where things you don't think will be relevant, or don't realize will be relevant, become relevant later on.
Daniel: Yeah, I think that's a fair point. OpenAI's Triton language wasn't that popular, but now it's getting very, very popular, because you want performance, you want efficiency, you want lower memory usage. So Triton kernels are part of our system, but that's only half of it. The other half is all the maths equations: how do we write the maths equations more efficiently?
Renee: And you're agnostic, so you run NVIDIA, AMD GPUs, and, what's the other one, Intel. For the non-technical people like me, and some other people listening, the value prop, I guess, is that you can fine-tune way faster than traditional methods.
So is it the [00:03:00] method that is different?
Daniel: So what we do is we take the entire backpropagation algorithm in the fine-tuning process, the gradient descent process, and we rewrite it in hard maths. We write all the maths equations down and then optimize every single line into a more optimal form. For example, you can bracket correctly, and that can increase speed. We also reduce memory usage by writing everything in CUDA programming: we rewrite everything in low-level GPU code, and that reduces memory usage and makes it faster. There are 50-something tricks like that in the open-source package, and that's how we make it faster.
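The "bracket correctly" point is easy to see with LoRA-style matrices: X(AB) and (XA)B give the same result but cost wildly different amounts of work. A back-of-the-envelope illustration, with made-up shapes:

```python
# Why bracketing matters: X @ (A @ B) vs (X @ A) @ B.
# Shapes are illustrative, roughly LoRA-like: X is the activations,
# A is tall-skinny, B is short-fat, r is the small LoRA rank.
b, d, r = 4096, 4096, 16   # batch*seq_len, hidden dim, LoRA rank

# (X @ A) @ B: go through the rank-r bottleneck first.
flops_good = 2 * b * d * r + 2 * b * r * d   # ~1.1e9 FLOPs

# X @ (A @ B): materializes a full d x d matrix before touching X.
flops_bad = 2 * d * r * d + 2 * b * d * d    # ~1.4e11 FLOPs

print(flops_bad / flops_good)  # ~128x more work for the same answer
```

Same maths, same output, two orders of magnitude apart in cost; this is the flavour of optimization Daniel is describing.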
Renee: Aside from knowing you through the team at Jan, I also saw you've got heaps of people talking about you online. There was a thread on Black Hat World or something like that, and people were so amazed, like, oh, it's open source now, but they were also saying, explain it to me in layman's terms, because they were excited about it but [00:04:00] didn't know what they could do with it.
So I'm in your Discord, but what are you seeing the community doing with Unsloth?
Daniel: So, for example, a few days ago: the open-source models like Mistral and Llama are only English-based, so people trained them on their own home-country language, like Vietnamese, or Portuguese, or Spanish. You can use our fine-tuning notebook to train on any language you like. That was the fundamental problem with Mistral and Llama, they're just English-based, and you can train Mandarin and other languages very simply. I think that was the most interesting thing I found recently.
Some other people didn't even know how to convert a model to GGUF or llama.cpp formats, or vLLM, for example to use in Jan. We also solved that conversion step. So currently some people are using our notebook just for conversion: they don't even do the fine-tuning step, they just [00:05:00] do the conversion step. That's a very interesting use case. Yeah.
Renee: Yeah. And so your brother, is he the non-technical one of you two, would you say? Like, how deep is he in the technical side of what you're doing?
Daniel: Yeah. So he does all the front end, he does all the website, he does everything else that I don't wanna do. So I just do the algorithms, and he does everything else. He's also a very good ChatGPT user, so we actually use ChatGPT and Bing Chat for some engineering help. So he's that side of the equation, and I just do the algorithms. Yeah.
Renee: Obviously the whole idea of this podcast is for really clever people like you to explain things to me like I'm a child. So, autonomous agents and stuff like that, and I don't wanna dispute the definition of agent, but, 'cause you're just a tiny team, right? And you clearly have a lot that you're doing, a lot that you're shipping. Do you make use of that [00:06:00] in an autonomous way, or is it more that you're consulting something like ChatGPT to then help you with code?
Daniel: Yeah, that's a very good question. If we want to make applications and different things, and we get stuck somewhere, ChatGPT is very good at getting you unstuck, so that's very useful. We haven't actually used it for whole application design, writing everything from scratch. I did use it to try to do matrix differentials, differentiation via ChatGPT, and it was quite bad. Unfortunately, I had to fall back to writing it on paper and doing the differentials by hand. So I did try to use ChatGPT for the maths part; it's just not that good. Generally we use ChatGPT, GPT-4, as just an unstuck mechanism. It can improve productivity a lot.
Renee: How do you represent maths [00:07:00] in a word format? And is that why it has issues, why you can't trust it for maths?
Daniel: So in terms of maths, I think there are three problems with language models.
The first one is the tokenization problem. From what I understand, GPT-4 tokenizes numbers in chunks: individual digits, 0 through 9, but also double digits, 10 to 99, each with its own ID, and likewise some three-digit numbers. Language models are very bad at multiplication, and quite bad at addition, because of this tokenization issue. Llama fixed it by splitting numbers into individual digits: if you have the number 100, it tokenizes it as 1, 0, 0. So that's one of the issues.
The other issue is that the training data is just too small. In maths, each problem is very [00:08:00] specialized. If you wanna take the differential of x squared, that's easy, that's 2x. But if the formula gets very complicated, the model might not have seen it in the training set. Once they have a larger training set, then maybe, by the phenomenon of grokking, they can do some maths. People were talking about Q*: supposedly Q* can solve high-school-level maths problems. That's OpenAI, I think.
And the third one is that language models just aren't designed for maths. So you maybe need a separate component to verify the maths, like a proof machine, or Python to write the maths. So those are the three main problems.
Renee: Hmm. Thank you for the context, by the way. So when you spoke about what Unsloth is currently being used for, your recent partnership, is that part of something bigger, or part of an open-source plan?
Daniel: [00:09:00] So we did do a partnership with Hugging Face on a blog, so we did a blog post with them, and we are also in the TRL docs. We're trying to make LLMs accessible to everyone, so our goal is to support even the very small graphics cards. I think the biggest difference is if you have eight gigabytes of VRAM on your GPU: people before couldn't even train on an 8 GB graphics card, but now we made it fit, it just fits in 8 GB. Our goal is to make AI easily available to everyone, and to reduce energy usage, because training is two times faster. And we just wanna make it more productive for people. So yeah, the goal is to democratize AI for everyone, but only the fine-tuning space for now.
Renee: So what's something that people get wrong [00:10:00] about working with LLMs?
Daniel: Yeah, that's a great question. I think the biggest issue is that people treat an LLM as one system. When you call an LLM, you might get an answer, but I think that's the wrong way to go about it. You're not supposed to treat the LLM as this one agent; you're supposed to use LLMs as if you have ten of them, and build layers on top of each other. One LLM, you can tell it: okay, you are a verifier, you verify someone's answers. And that someone is another LLM, which generates answers. So one LLM does the generation of your answer, one of them does the verification, and one of them does the implementation and execution of your answer.
The biggest misconception is that a language model is this very powerful system that can answer every single question. You need to prompt it correctly and use many of them in tandem; that can be more powerful than just using one system. The biggest issue I see is people using it [00:11:00] as one system, and then it constantly repeats answers, sometimes doesn't give the correct answer, and sometimes forgets what you said before. The way to solve this problem is to use many of them together.
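A minimal sketch of that generate-then-verify layering might look like this; `call_llm` is a hypothetical stand-in for whatever chat API you use, and the prompts are invented for illustration:

```python
# Layering LLMs: one generates an answer, a second verifies it, and the
# critique loops back to the generator. `call_llm` is hypothetical -- wire
# it to OpenAI's API, a local model running in Jan, or anything else.

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("connect this to your chat model of choice")

def answer_with_verification(question: str, max_rounds: int = 3) -> str:
    # Generator LLM drafts an answer.
    draft = call_llm("You answer questions concisely.", question)
    for _ in range(max_rounds):
        # Verifier LLM checks the draft.
        verdict = call_llm(
            "You are a strict verifier. Reply 'OK' if the answer is correct; "
            "otherwise explain the mistake.",
            f"Question: {question}\nAnswer: {draft}",
        )
        if verdict.strip().startswith("OK"):
            break
        # Feed the critique back and regenerate.
        draft = call_llm(
            "Revise your answer using the critique.",
            f"Question: {question}\nPrevious answer: {draft}\nCritique: {verdict}",
        )
    return draft
```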
Renee: CPU, GPU, we're gonna go, right, oh, that's GPU, yeah, right back to the start, Daniel, because a CPU is what?
Daniel: The CPU is the processor in the computer. The GPU is a graphics card, for gaming, for running language models, and for your computer display. You can have just a CPU, but then if you wanna play games it'll be so slow. So you add this extra piece of hardware called the GPU; you shove it in your computer, and it makes graphics faster.
Renee: And H100s, are those...?
Daniel: Oh, those are very expensive. H100s are extremely pricey. But you probably have an NVIDIA GPU in your computer now, like an RTX something. You probably [00:12:00] do have a GPU in your computer.
Renee: I can run some things on Jan. I've got, it's a Mac, it's an M-something,
Daniel: M1. Yeah.
Renee: We can see what we're working with here. The idea is, if AI is going to be accessible and democratized, it means that people running potatoes, and people who have potato brains, need to understand it. That's not a roast, that's just the reality: there's so much technology being developed that's so advanced, but we have no way of communicating it. And that sits in this very kind of, uh,
Daniel: Closed-off space, yeah. All the advancements are very hard to
Renee: access and stuff. Yeah. It's unintentional, but there's so much that happens in tiny little subreddits that the world needs to know about. I don't go in there because I'm scared, but I see it and think, oh, that would be interesting if I knew what that was. What's a GPU? But, uh.
[00:13:00] Yeah. So
Daniel: No, no. We generally run on CPUs. There are two, right? There's a CPU and there's a GPU. Technically a GPU isn't even necessary; it's optional. If you want things to be faster, more responsive, you can buy a GPU, but you can still do everything on a CPU. It's just very slow.
Renee: So fine-tuning is the current focus, and the future is fluid?
Daniel: Our vision is that in the future, everyone will have a graphics card in their computer. It might be very weak, but I think everyone will have a GPU, and if this is the case, we wanna provide a fine-tuning bot on your PC. It reads all of your personal data; it does not upload any data. This is your personal chatbot that you have on your local machine, and it does fine-tuning every single day. So this model will learn about [00:14:00] you, and you can ask it questions however you like. So this will become a personal ChatGPT, a personal ChatGPT for you only. That's kind of our ultimate mission, but that's a future plan. Yeah.
Renee: Yeah. Like a non-crappy Clippy.
Daniel: Exactly right. Yeah, it's a personal ChatGPT for everyone.
Renee: So I guess a lot of the conversation in commercial use cases is always that they love the idea of local, because it's local.
That's how I came to learn about what RAG is, and that was a whole thing for me. So, to my understanding, and this is semi-off-topic, but you're a person who knows, in front of me, so I'm gonna ask you: RAG is a way of verifying the information that you get back from a large language model. Is that right?
Daniel: So yeah, RAG, I guess, is knowledge injection. [00:15:00] Pretend you ask a question: what's today's date? A language model won't know that; it doesn't understand anything about current events. So what you can do is take all of Wikipedia, shove all of Wikipedia in as a library, and allow the language model to search the library for the answer. So if I ask the question again, what is today's date, Wikipedia will say somewhere on the homepage: today's date is, you know, whatever the date is. So you can use RAG, which is retrieval augmented generation, to search a big library for the correct answer. That's RAG.
Renee: Nice. So is it different from, like, ChatGPT or Bard accessing the internet? Or is that just different?
Daniel: Oh, that's also RAG. When I say Wikipedia is like a library, assume Google, the internet, is the library: just replace Wikipedia with Google search, or replace Wikipedia with [00:16:00] something else. If you have a complicated question about a movie, you just replace it with the IMDb database. If you have a law question, replace it with Supreme Court judgments or something. So RAG is just a knowledge injection system; you just need to have a big database. Yeah.
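In code, the "library" is usually a vector index: embed the documents once, embed the question, fetch the nearest passages, and paste them into the prompt. A minimal sketch, assuming a sentence-transformers embedding model and the same hypothetical `call_llm` helper as above:

```python
# Bare-bones RAG: retrieve the closest passages, inject them as context.
# The embedding model name is an example; `call_llm` is a hypothetical
# stand-in for your chat model.
import numpy as np
from sentence_transformers import SentenceTransformer

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("connect this to your chat model of choice")

embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The Wikipedia homepage shows today's featured article and the date.",
    "GGUF is a file format used by llama.cpp for local inference.",
]
# Embed the "library" once, up front.
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q              # cosine similarity (vectors are unit length)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Use only this context:\n{context}\n\nQuestion: {question}"
    return call_llm("You answer strictly from the given context.", prompt)
```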
Renee: Nice. I've got two more questions. The first one is: if you could interview someone in the AI space, who would it be?
Daniel: I think I would like to talk to, maybe, Ilya from OpenAI. I've watched his videos and understood his views. His very interesting take is that language models are enough for AGI. Most people think that if you predict the next word, well, language models are just predicting the next word, right? It's nothing special, it's not intelligence. But his take is: if you wanna predict the next word, [00:17:00] you need to understand everything about the world, the context of everything around it, why this word comes after these few words. His take is very fascinating. If I had a chance, I would talk to him. Yeah.
Renee: Did you go to the Sam Altman chat? I think it was in Sydney, last year.
Daniel: No, I did not. It's funny, I was in Sydney, but I didn't go. It was at UNSW, I think, right? Yeah.
Renee: Where do you keep up with everything? Because I know that when you work in AI, you actually don't sleep; it's not allowed, it's in the rules. Is it an RSS feed? Do you have a favourite YouTube channel, or is it just Discords, Reddit?
Daniel: Yeah, I think Twitter's generally useful for new releases of stuff; Twitter's pretty good. On Reddit, LocalLLaMA is very useful; I think it has all the latest information. YouTube videos are generally good, but a bit delayed; [00:18:00] if you wanna stay at the very edge, they'll be about one week behind. I like Yannic's YouTube videos, they're very helpful, I watch them all the time, so Yannic's my recommendation if you wanna stay up to date with everything. So Twitter, Reddit, and I guess YouTube. I guess this sounds like general advice.
Renee: Yeah. And you're currently bootstrapped. Where can people support you, and how? I know that you have a, is it pronounced "Ko-fee" or "Caw-fee"?
Daniel: I actually dunno. I call it Ko-fi, or is it Kaw-fi? I dunno.
Renee: I always thought it was Ko-fi.
Daniel: Probably Ko-fi. I dunno.
Renee: I'm gonna message them and ask. But you have one, and it's at https://ko-fi.com/unsloth
Daniel: I actually don't know the Ko-fi link off the top of my head; I think it's on our GitHub. So we have an [00:19:00] open-source package, and we're open to any collaborators from the open-source community: if you wanna add new feature requests or pull requests, more than happy. If you wanna support our work, we have a Ko-fi link, so you can donate some money to us and we can implement more features quicker. I think we do have some Colab expenses for testing stuff out, but it's not that much; that's the main expense we have. But yeah, more than happy to accept community help, if that's what you like.
Renee: Thank you very much.
Thanks for joining me on Unsupervised Learning. We were speaking to Daniel from Unsloth.ai. This episode was sponsored by Jan: run any LLM locally on your device, on Mac, Windows, and Linux. See their open-source GitHub repo at github.com/janhq.