William Brown

@willccbb | willcb.com

v0.1 (June 5, 2024)

Introduction

This document aims to serve as a handbook for learning the key concepts underlying modern artificial intelligence systems. Given the speed of recent development in AI, there really isn’t a good textbook-style source for getting up-to-speed on the latest-and-greatest innovations in LLMs or other generative models, yet there is an abundance of great explainer resources (blog posts, videos, etc.) for these topics scattered across the internet. My goal is to organize the “best” of these resources into a textbook-style presentation, which can serve as a roadmap for filling in the prerequisites towards individual AI-related learning goals. My hope is that this will be a “living document”, to be updated as new innovations and paradigms inevitably emerge, and ideally also a document that can benefit from community input and contribution. This guide is aimed at those with a technical background of some kind, who are interested in diving into AI either out of curiosity or for a potential career. I’ll assume that you have some experience with coding and high-school level math, but otherwise will provide pointers for filling in any other prerequisites. Please let me know if there’s anything you think should be added!

The AI Landscape

As of June 2024, it’s been about 18 months since ChatGPT was released by OpenAI and the world started talking a lot more about artificial intelligence. Much has happened since: tech giants like Meta and Google have released large language models of their own, newer organizations like Mistral and Anthropic have proven to be serious contenders as well, innumerable startups have begun building on top of their APIs, everyone is scrambling for powerful Nvidia GPUs, papers appear on ArXiv at a breakneck pace, demos circulate of physical robots and artificial programmers powered by LLMs, and it seems like chatbots are finding their way into all aspects of online life (to varying degrees of success). In parallel to the LLM race, there’s been rapid development in image generation via diffusion models; DALL-E and Midjourney are displaying increasingly impressive results that often stump humans on social media, and with the progress from Sora, Runway, and Pika, it seems like high-quality video generation is right around the corner as well. There are ongoing debates about when “AGI” will arrive, what “AGI” even means, the merits of open vs. closed models, value alignment, superintelligence, existential risk, fake news, and the future of the economy. Many are concerned about jobs being lost to automation, or excited about the progress that automation might drive. And the world keeps moving: chips get faster, data centers get bigger, models get smarter, contexts get longer, abilities are augmented with tools and vision, and it’s not totally clear where this is all headed. If you’re following “AI news” in 2024, it can often feel like there’s some kind of big new breakthrough happening on a near-daily basis. It’s a lot to keep up with, especially if you’re just tuning in.

With progress happening so quickly, a natural inclination by those seeking to “get in on the action” is to pick up the latest-and-greatest available tools (likely GPT-4o, Gemini 1.5 Pro, or Claude 3 Opus as of this writing, depending on who you ask) and try to build a website or application on top of them. There’s certainly a lot of room for this, but these tools will change quickly, and having a solid understanding of the underlying fundamentals will make it much easier to get the most out of your tools, pick up new tools quickly as they’re introduced, and evaluate tradeoffs for things like cost, performance, speed, modularity, and flexibility. Further, innovation isn’t only happening at the application layer, and companies like Hugging Face, Scale AI, and Together AI have gained footholds by focusing on inference, training, and tooling for open-weights models (among other things). Whether you’re looking to get involved in open-source development, work on fundamental research, or leverage LLMs in settings where costs or privacy concerns preclude outside API usage, it helps to understand how these things work under the hood in order to debug or modify them as needed. From a broader career perspective, a lot of current “AI/ML Engineer” roles will value nuts-and-bolts knowledge in addition to high-level frameworks, just as “Data Scientist” roles have typically sought a strong grasp on theory and fundamentals over proficiency in the ML framework du jour. Diving deep is the harder path, but I think it’s a worthwhile one. But with the pace at which innovation has occurred over the past few years, where should you start? Which topics are essential, what order should you learn them in, and which ones can you skim or skip?

The Content Landscape

Textbooks are great for providing a high-level roadmap of fields where the set of “key ideas” is more stable, but as far as I can tell, there really isn’t a publicly available post-ChatGPT “guide to AI” with textbook-style comprehensiveness or organization. It’s not clear that it would even make sense for someone to write a traditional textbook covering the current state of AI right now; many key ideas (e.g. QLoRA, DPO, vLLM) are no more than a year old, and the field will likely have changed dramatically by the time it’d get to print. The oft-referenced Deep Learning book (Goodfellow et al.) is almost a decade old at this point, and has only a cursory mention of language modeling via RNNs. The newer Dive into Deep Learning book includes coverage up to Transformer architectures and fine-tuning for BERT models, but topics like RLHF and RAG (which are “old” by the standards of some of the more bleeding-edge topics we’ll touch on) are missing. The upcoming “Hands-On Large Language Models” book might be nice, but it’s not officially published yet (available online behind a paywall now) and presumably won’t be free when it is. The Stanford CS224n course seems great if you’re a student there, but without a login you’re limited to slide-decks and a reading list consisting mostly of dense academic papers. Microsoft’s “Generative AI for Beginners” guide is fairly solid for getting your hands dirty with popular frameworks, but it’s more focused on applications rather than understanding the fundamentals.

The closest resource I’m aware of to what I have in mind is Maxime Labonne’s LLM Course on Github. It features many interactive code notebooks, as well as links to sources for learning the underlying concepts, several of which overlap with what I’ll be including here. I’d recommend it as a primary companion guide while working through this handbook, especially if you’re interested in applications; this document doesn’t include notebooks, but the scope of topics I’m covering is a bit broader, including some research threads which aren’t quite “standard” as well as multimodal models.

Still, there’s an abundance of other high-quality and accessible content which covers the latest advances in AI — it’s just not all organized. The best resources for quickly learning about new innovations are often one-off blog posts or YouTube videos (as well as Twitter/X threads, Discord servers, and discussions on Reddit and LessWrong). My goal with this document is to give a roadmap for navigating all of this content, organized into a textbook-style presentation without reinventing the wheel on individual explainers. Throughout, I’ll include multiple styles of content where possible (e.g. videos, blogs, and papers), as well as my opinions on goal-dependent knowledge prioritization and notes on “mental models” I found useful when first encountering these topics.

I’m creating this document not as a “generative AI expert”, but rather as someone who’s recently had the experience of ramping up on many of these topics in a short time frame. While I’ve been working in and around AI since 2016 or so (if we count an internship project running evaluations for vision models as the “start”), I only started paying close attention to LLM developments 18 months ago, with the release of ChatGPT. I first started working with open-weights LLMs around 12 months ago. As such, I’ve spent a lot of the past year sifting through blog posts and papers and videos in search of the gems; this document is hopefully a more direct version of that path. It also serves as a distillation of many conversations I’ve had with friends, where we’ve tried to find and share useful intuitions for grokking complex topics in order to expedite each other’s learning. Compiling this has been a great forcing function for filling in gaps in my own understanding as well; I didn’t know how FlashAttention worked until a couple weeks ago, and I still don’t think that I really understand state-space models that well. But I know a lot more than when I started.

Resources

Some of the sources we’ll draw from are:

I’ll often make reference to the original papers for key ideas throughout, but our emphasis will be on expository content which is more concise and conceptual, aimed at students or practitioners rather than experienced AI researchers (although hopefully the prospect of doing AI research will become less daunting as you progress through these sources). Pointers to multiple resources and media formats will be given when possible, along with some discussion on their relative merits.

Preliminaries

Math

Calculus and linear algebra are pretty much unavoidable if you want to understand modern deep learning, which is largely driven by matrix multiplication and backpropagation of gradients. Many technical people end their formal math education around multivariable calculus or introductory linear algebra, and it seems common to be left with a sour taste in your mouth from having to memorize a suite of unintuitive identities or manually invert matrices, which can make the prospect of going deeper seem unappealing. Fortunately, we don’t need to do these calculations ourselves — programming libraries will handle them for us — and it’ll instead be more important to have a working knowledge of concepts such as the following (a short code sketch follows the list):

Gradients and their relation to local minima/maxima

The chain rule for differentiation

Matrices as linear transformations for vectors

Notions of basis/rank/span/independence/etc.
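
To make the first two of these concrete, here’s a minimal PyTorch sketch (any autograd library would do): backward() applies the chain rule for us, and repeatedly stepping against the gradient walks toward the local minimum.

```python
import torch

# Minimize f(x) = (x0 - 2)^2 + (x1 + 1)^2 with gradient descent.
x = torch.zeros(2, requires_grad=True)
for _ in range(100):
    f = (x[0] - 2) ** 2 + (x[1] + 1) ** 2
    f.backward()              # chain rule (backpropagation), automated
    with torch.no_grad():
        x -= 0.1 * x.grad     # the gradient points uphill; step downhill
        x.grad.zero_()
print(x.detach())             # approximately tensor([ 2., -1.])
```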

Good visualizations can really help these ideas sink in, and I don’t think there’s a better source for this than these two YouTube series from 3Blue1Brown:

Essence of Calculus

Essence of Linear Algebra

If your math is rusty, I’d certainly encourage (re)watching these before diving in deeper. To test your understanding, or as a preview of where we’re headed, the shorter Neural networks video series on the same channel is excellent as well, and the latest couple videos in the series give a great overview of Transformer networks for language modeling.

These lecture notes from Waterloo give some useful coverage of multivariable calculus as it relates to optimization, and “Linear Algebra Done Right” by Sheldon Axler is a nice reference text for linear algebra. “Convex Optimization” by Boyd and Vandenberghe shows how these topics lay the foundations for the kinds of optimization problems faced in machine learning, but note that it does get fairly technical, and may not be essential if you’re mostly interested in applications.

Linear programming is certainly worth understanding, and is basically the simplest kind of high-dimensional optimization problem you’ll encounter (but still quite practical); this illustrated video should give you most of the core ideas, and Ryan O’Donnell’s videos (17a-19c in this series, depending on how deep you want to go) are excellent if you want to go deeper into the math. These lectures (#10, #11) from Tim Roughgarden also show some fascinating connections between linear programming and the “online learning” methods we’ll look at later, which will form the conceptual basis for GANs (among many other things).
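
If you’d like to see one in action before diving into those, here’s a tiny sketch using scipy’s off-the-shelf solver on a made-up two-variable problem:

```python
from scipy.optimize import linprog

# Maximize x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective.
result = linprog(
    c=[-1, -2],                 # minimize -(x + 2y)
    A_ub=[[1, 1], [1, 0]],      # x + y <= 4, x <= 3
    b_ub=[4, 3],
    bounds=[(0, None), (0, None)],
)
print(result.x, -result.fun)    # optimum at x=0, y=4, with value 8
```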

Programming

Most machine learning code is written in Python nowadays, and some of the references here will include Python examples for illustrating the discussed topics. If you’re unfamiliar with Python, or programming in general, I’ve heard good things about Replit’s 100 Days of Python course for getting started. Some systems-level topics will also touch on implementations in C++ or CUDA — I’m admittedly not much of an expert in either of these, and will focus more on higher-level abstractions which can be accessed through Python libraries, but I’ll include potentially useful references for these languages in the relevant sections nonetheless.

Organization

This document is organized into several sections and chapters, as listed below and in the sidebar. You are encouraged to jump around to whichever parts seem most useful for your personal learning goals. Overall, I’d recommend first skimming many of the linked resources rather than reading (or watching) word-for-word. This should hopefully at least give you a sense of where your knowledge gaps are in terms of dependencies for any particular learning goals, which will help guide a more focused second pass.

Section I: Foundations of Sequential Prediction

Goal: Recap machine learning basics + survey (non-DL) methods for tasks under the umbrella of “sequential prediction”.

Our focus in this section will be on quickly overviewing classical topics in statistical prediction and reinforcement learning, which we’ll make direct reference to in later sections, as well as highlighting some topics that I think are very useful as conceptual models for understanding LLMs, yet which are often omitted from deep learning crash courses – notably time-series analysis, regret minimization, and Markov models.

Statistical Prediction and Supervised Learning

Before getting to deep learning and large language models, it’ll be useful to have a solid grasp on some foundational concepts in probability theory and machine learning. In particular, it helps to understand:

Random variables, expectations, and variance

Supervised vs. unsupervised learning

Regression vs. classification

Linear models and regularization

Empirical risk minimization

Hypothesis classes and bias-variance tradeoffs

For general probability theory, having a solid understanding of how the Central Limit Theorem works is perhaps a reasonable litmus test for how much you’ll need to know about random variables before tackling some of the later topics we’ll cover. This beautifully-animated 3Blue1Brown video is a great starting point, and there are a couple other good probability videos to check out on the channel if you’d like. This set of course notes from UBC covers the basics of random variables. If you’re into blackboard lectures, I’m a big fan of many of Ryan O’Donnell’s CMU courses on YouTube, and this video on random variables and the Central Limit Theorem (from the excellent “CS Theory Toolkit” course) is a nice overview.

For understanding linear models and other key machine learning principles, the first two chapters of Hastie’s Elements of Statistical Learning (“Introduction” and “Overview of Supervised Learning”) should be enough to get started. Once you’re familiar with the basics, this blog post by anonymous Twitter/X user @ryxcommar does a nice job discussing some common pitfalls and misconceptions related to linear regression. StatQuest on YouTube has a number of videos that might also be helpful.

Introductions to machine learning tend to emphasize linear models, and for good reason. Many phenomena in the real world are modeled quite well by linear equations — the average temperature over the past 7 days is likely a solid guess for the temperature tomorrow, barring any other information about weather pattern forecasts. Linear systems and models are a lot easier to study, interpret, and optimize than their nonlinear counterparts. For more complex and high-dimensional problems with potential nonlinear dependencies between features, it’s often useful to ask:

What’s a linear model for the problem?

Why does the linear model fail?

What’s the best way to add nonlinearity, given the semantic structure of the problem?

In particular, this framing will be helpful for motivating some of the model architectures we’ll look at later (e.g. LSTMs and Transformers); a toy sketch of these questions follows below.

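As a small instance of these three questions (with made-up data): a linear fit on a nonlinear target, its failure measured by training error, and a nonlinearity chosen to match the target’s shape.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)     # nonlinear ground truth

# Q1: a linear model for the problem, fit by least squares (ERM)
X_lin = np.stack([x, np.ones_like(x)], axis=1)
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# Q3: add nonlinearity matched to the problem's structure (odd, smooth)
X_cub = np.stack([x ** 3, x, np.ones_like(x)], axis=1)
w_cub, *_ = np.linalg.lstsq(X_cub, y, rcond=None)

# Q2: the linear model fails because sin(x) bends; compare training MSE
for name, X, w in [("linear", X_lin, w_lin), ("cubic", X_cub, w_cub)]:
    print(name, np.mean((X @ w - y) ** 2))
```
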
Time-Series Analysis

How much do you need to know about time-series analysis in order to understand the mechanics of more complex generative AI methods? Short answer: just a tiny bit for LLMs, a good bit more for diffusion. For modern Transformer-based LLMs, it’ll be useful to know:

The basic setup for sequential prediction problems

The notion of an autoregressive model

There’s not really a coherent way to “visualize” the full mechanics of a multi-billion-parameter model in your head, but much simpler autoregressive models (like ARIMA) can serve as a nice mental model to extrapolate from. When we get to neural state-space models, a working knowledge of linear time-invariant systems and control theory (which have many connections to classical time-series analysis) will be helpful for intuition, but diffusion is really where it’s most essential to dive deeper into stochastic differential equations to get the full picture. But we can table that for now.

This blog post (Forecasting with Stochastic Models) from Towards Data Science is concise and introduces the basic concepts along with some standard autoregressive models and code examples. This set of notes from UAlberta’s “Time Series Analysis” course is nice if you want to go a bit deeper on the math.

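To make “autoregressive” concrete, here’s a minimal sketch (synthetic data, no forecasting library) that fits an AR(2) model by plain least squares; next-token prediction in an LLM is the same setup, with a categorical output instead of a real-valued one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate an AR(2) process: x_t = 0.6*x_{t-1} + 0.3*x_{t-2} + noise
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + 0.1 * rng.normal()

# Fit by least squares: regress x_t on its two lags
X = np.stack([x[1:-1], x[:-2]], axis=1)    # lag-1 and lag-2 columns
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                # approximately [0.6, 0.3]

next_pred = coef @ np.array([x[-1], x[-2]])  # one-step-ahead forecast
```
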
Online Learning and Regret Minimization

It’s debatable how important it is to have a strong grasp on regret minimization, but I think a basic familiarity is useful. The basic setting here is similar to supervised learning, but:

Points arrive one-at-a-time in an arbitrary order

We want low average error across this sequence

If you squint and tilt your head, most of the algorithms designed for these problems look basically like gradient descent, often with delicate choices of regularizers and learning rates required for the math to work out. But there’s a lot of satisfying math here. I have a soft spot for it, as it relates to a lot of the research I worked on during my PhD. I think it’s conceptually fascinating. Like the previous section on time-series analysis, online learning is technically “sequential prediction” but you don’t really need it to understand LLMs. The most direct connection to it that we’ll consider is when we look at GANs in Section VIII. There are many deep connections between regret minimization and equilibria in games, and GANs work basically by having two neural networks play a game against each other. Practical gradient-based optimization algorithms like Adam have their roots in this field as well, following the introduction of the AdaGrad algorithm, which was first analyzed for online and adversarial settings. In terms of other insights, one takeaway I find useful is the following: if you’re doing gradient-based optimization with a sensible learning rate schedule, then the order in which you process data points doesn’t actually matter much. Gradient descent can handle it.

I’d encourage you to at least skim Chapter 1 of “Introduction to Online Convex Optimization” by Elad Hazan to get a feel for the goal of regret minimization. I’ve spent a lot of time with this book and I think it’s excellent.

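For a feel of what these algorithms look like, here’s a sketch of online gradient descent on a stream of squared losses with a conservative 1/√t step size (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # the comparator we hope to match
w = np.zeros(3)
total = 0.0
for t in range(1, 1001):
    x_t = rng.normal(size=3)                    # points arrive one at a time
    y_t = w_true @ x_t + 0.01 * rng.normal()
    err = w @ x_t - y_t
    total += err ** 2                           # loss is paid before updating
    w -= (0.05 / np.sqrt(t)) * 2 * err * x_t    # gradient of (w.x - y)^2
print(w, total / 1000)    # w approaches w_true; average loss stays small
```
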
Reinforcement Learning

Reinforcement Learning (RL) will come up most directly when we look at finetuning methods in Section IV, and may also be a useful mental model for thinking about “agent” applications and some of the “control theory” notions which come up for state-space models. Like a lot of the topics discussed in this document, you can go quite deep down many different RL-related threads if you’d like; as it relates to language modeling and alignment, it’ll be most important to be comfortable with the basic problem setup for Markov decision processes, the notion of policies and trajectories, and a high-level understanding of standard iterative + gradient-based optimization methods for RL (a toy example follows below).

This blog post from Lilian Weng is a great starting point, and is quite dense with important RL ideas despite its relative conciseness. It also touches on connections to AlphaGo and gameplay, which you might find interesting as well. The textbook “Reinforcement Learning: An Introduction” by Sutton and Barto is generally considered the classic reference text for the area, at least for “non-deep” methods. This was my primary guide when I was first learning about RL, and it gives a more in-depth exploration of many of the topics touched on in Lilian’s blog post. If you want to jump ahead to some more neural-flavored content, Andrej Karpathy has a nice blog post on deep RL; this manuscript by Yuxi Li and this textbook by Aske Plaat may be useful for further deep dives. If you like 3Blue1Brown-style animated videos, the series “Reinforcement Learning By the Book” is a great alternative option, and conveys a lot of content from Sutton and Barto, along with some deep RL, using engaging visualizations.

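Here’s a toy sketch of that basic setup: a made-up four-state chain MDP solved with value iteration, yielding a value function and a greedy policy.

```python
import numpy as np

# A 4-state chain MDP: actions 0 (left) and 1 (right); landing on
# state 3 pays reward 1, and state 3 is absorbing under "right".
gamma = 0.9
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == 3)    # (next state, reward)

# Value iteration: repeatedly apply the Bellman optimality backup.
V = np.zeros(4)
for _ in range(100):
    V = np.array([max(step(s, a)[1] + gamma * V[step(s, a)[0]]
                      for a in (0, 1)) for s in range(4)])

# The greedy policy w.r.t. V: always move right, toward the reward.
policy = [max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(4)]
print(np.round(V, 2), policy)    # ~[8.1, 9.0, 10.0, 10.0], [1, 1, 1, 1]
```
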
Markov Models

Running a fixed policy in a Markov decision process yields a Markov chain; processes resembling this kind of setup are fairly abundant, and many branches of machine learning involve modeling systems under Markovian assumptions (i.e. lack of path-dependence, given the current state). This blog post from Aja Hammerly makes a nice case for thinking about language models via Markov processes, and this post from “Essays on Data Science” has examples and code building up towards auto-regressive Hidden Markov Models, which will start to vaguely resemble some of the neural network architectures we’ll look at later on. This blog post from Simeon Carstens gives a nice coverage of Markov chain Monte Carlo methods, which are powerful and widely-used techniques for sampling from implicitly-represented distributions, and are helpful for thinking about probabilistic topics ranging from stochastic gradient descent to diffusion.

Markov models are also at the heart of many Bayesian methods. See this tutorial from Zoubin Ghahramani for a nice overview, the textbook “Pattern Recognition and Machine Learning” for Bayesian angles on many machine learning topics (as well as a more-involved HMM presentation), and this chapter of the Goodfellow et al. “Deep Learning” textbook for some connections to deep learning.

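To tie Markov models back to language modeling, here’s a bigram (first-order Markov) text model over a made-up corpus, a tiny ancestor of next-token prediction in LLMs:

```python
import numpy as np
from collections import defaultdict

# A language model under a Markovian assumption: the next word
# depends only on the current word.
text = "the cat sat on the mat and the dog sat on the rug and the cat ran".split()
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(text, text[1:]):
    counts[a][b] += 1            # transition counts -> P(next | current)

rng = np.random.default_rng(0)
word, out = "the", ["the"]
for _ in range(10):
    if not counts[word]:         # dead end: no observed successor
        break
    words, n = zip(*counts[word].items())
    word = str(rng.choice(words, p=np.array(n) / sum(n)))
    out.append(word)
print(" ".join(out))
```
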
Section VI: Performance Optimizations for Efficient Inference

Goal: Survey architecture choices and lower-level techniques for improving resource utilization (time, compute, memory).

Here we’ll look at a handful of techniques for improving the speed and efficiency of inference from pre-trained Transformer language models, most of which are fairly widely used in practice. It’s worth first reading this short Nvidia blog post for a crash course in several of the topics we’ll look at (and a number of others).

Parameter Quantization

With the rapid increase in parameter counts for leading LLMs and difficulties (both in cost and availability) in acquiring GPUs to run models on, there’s been a growing interest in quantizing LLM weights to use fewer bits each, which can often yield comparable output quality with a 50-75% (or more) reduction in required memory. Typically this shouldn’t be done naively; Tim Dettmers, one of the pioneers of several modern quantization methods (LLM.int8(), QLoRA, bitsandbytes), has a great blog post for understanding quantization principles, and the need for mixed-precision quantization as it relates to emergent features in large-model training. Other popular methods and formats are GGUF (for llama.cpp), AWQ, HQQ, and GPTQ; see this post from TensorOps for an overview, and this post from Maarten Grootendorst for a discussion of their tradeoffs. In addition to enabling inference on smaller machines, quantization is also popular for parameter-efficient training; in QLoRA, most weights are quantized to 4-bit precision and frozen, while active LoRA adapters are trained in 16-bit precision. See this talk from Tim Dettmers, or this blog from Hugging Face for overviews. This blog post from Answer.AI also shows how to combine QLoRA with FSDP for efficient finetuning of 70B+ parameter models on consumer GPUs.

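To see where the memory savings come from (and why outliers make naive approaches struggle), here’s a sketch of the simplest possible scheme, symmetric “absmax” rounding to int8. This is an illustration only, not how LLM.int8() or GPTQ actually work:

```python
import numpy as np

def quantize_absmax(w):
    # Symmetric 8-bit quantization: the largest-magnitude entry maps
    # to 127, everything else rounds to the nearest integer level.
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = q.astype(np.float32) * scale        # dequantize
print(w.nbytes // q.nbytes, "x smaller")    # 4x vs. fp32 (2x vs. fp16)
print("max abs error:", np.abs(w_hat - w).max())   # bounded by scale / 2

# One big outlier inflates the scale and costs everyone else precision;
# this is why mixed-precision, outlier-aware schemes matter.
w[0, 0] = 100.0
q, scale = quantize_absmax(w)
print("error after outlier:", np.abs(q.astype(np.float32) * scale - w).max())
```
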
Speculative Decoding

The basic idea behind speculative decoding is to speed up inference from a larger model by primarily sampling tokens from a much smaller model and occasionally applying corrections (e.g. every N tokens) from the larger model whenever the output distributions diverge. These batched consistency checks tend to be much faster than sampling N tokens directly, and so there can be large overall speedups if the token sequences from the smaller model diverge only periodically. See this blog post from Jay Mody for a walkthrough of the original paper, and this PyTorch article for some evaluation results. There’s a nice video overview from Trelis Research as well.

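Here’s a toy sketch of the greedy variant of this idea. Real implementations verify the draft with a single batched forward pass and use a rejection-sampling rule to match the target distribution exactly; the “models” below are just made-up lookup tables.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
TARGET = rng.normal(size=(VOCAB, VOCAB))             # "large" model: logits[next | cur]
DRAFT = TARGET + 0.4 * rng.normal(size=(VOCAB, VOCAB))  # cheaper, usually agrees

def greedy(table, tok):
    return int(np.argmax(table[tok]))

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens, one cheap step at a time.
    prop, t = [], ctx[-1]
    for _ in range(k):
        t = greedy(DRAFT, t)
        prop.append(t)
    # 2) Target checks every position (in practice: one batched pass).
    target_says = [greedy(TARGET, t) for t in [ctx[-1]] + prop[:-1]]
    # 3) Keep the agreeing prefix; at the first mismatch, emit the
    #    target's token instead and stop. Every kept token is one the
    #    target would have produced itself.
    out = []
    for p, c in zip(prop, target_says):
        out.append(c)
        if p != c:
            break
    return ctx + out

ctx = [0]
for _ in range(6):
    ctx = speculative_step(ctx)
print(ctx)
```
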
FlashAttention

Computing attention matrices tends to be a primary bottleneck in inference and training for Transformers, and FlashAttention has become one of the most widely-used techniques for speeding it up. In contrast to some of the techniques we’ll see in Section VII which approximate attention with a more compact representation (incurring some approximation error as a result), FlashAttention computes exact attention, with its speedup coming from a hardware-aware implementation. It applies a few tricks — namely, tiling and recomputation — to decompose the computation of attention matrices, enabling significantly reduced memory I/O and faster wall-clock performance (even while slightly increasing the required FLOPs).

Resources:

Talk by Tri Dao (author of FlashAttention)

ELI5: FlashAttention by Aleksa Gordić

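The tiling trick can be demystified in a few lines of numpy: the softmax normalizer is accumulated tile-by-tile with a running max, so the full attention matrix never has to be materialized at once. This sketch shows only that idea; the real kernel fuses it with recomputation in the backward pass and careful use of GPU SRAM.

```python
import numpy as np

def attention_tiled(Q, K, V, tile=32):
    n, d = Q.shape
    m = np.full(n, -np.inf)        # running row-max of scores
    l = np.zeros(n)                # running softmax denominator
    O = np.zeros((n, V.shape[1]))  # unnormalized output accumulator
    for s in range(0, K.shape[0], tile):
        S = Q @ K[s:s + tile].T / np.sqrt(d)   # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ V[s:s + tile]
        m = m_new
    return O / l[:, None]

# Agrees with naive (materialize-everything) attention:
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 16)) for _ in range(3))
S = Q @ K.T / np.sqrt(16)
P = np.exp(S - S.max(axis=1, keepdims=True))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.abs(attention_tiled(Q, K, V) - naive).max())   # ~1e-16
```
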
Key-Value Caching and Paged Attention

As noted in the NVIDIA blog referenced above, key-value caching is fairly standard in Transformer implementations, avoiding redundant recomputation of attention matrices. This enables a tradeoff between speed and resource utilization, as these matrices are kept in GPU VRAM. While managing this is fairly straightforward for a single “thread” of inference, a number of complexities arise when considering parallel inference or multiple users for a single hosted model instance. How can you avoid recomputing values for system prompts and few-shot examples? When should you evict cache elements for a user who may or may not want to continue a chat session? PagedAttention and its popular implementation vLLM address this by leveraging ideas from classical paging in operating systems, and have become a standard for self-hosted multi-user inference servers.

Resources:

The KV Cache: Memory Usage in Transformers (video, Efficient NLP)

Fast LLM Serving with vLLM and PagedAttention (video, Anyscale)

vLLM blog post

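For the caching idea itself, here’s a minimal single-head sketch with made-up weights; a real implementation caches per layer and per head, and PagedAttention’s contribution is managing this cache memory in non-contiguous blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max())          # numerically stable softmax
    return (w / w.sum()) @ V

# Decode loop: each new token's key/value vectors are computed once,
# appended to the cache, and reused at every later step.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
x = rng.normal(size=d)               # embedding of the first token
for t in range(8):
    K_cache = np.vstack([K_cache, x @ Wk])   # grows one row per token
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)
    x = out / np.linalg.norm(out)    # crude stand-in for the rest of the layer
print(K_cache.shape)                 # (8, 16): one cached key per token
```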

CPU Offloading

The primary method used for running LLMs either partially or entirely on CPU (vs. GPU) is llama.cpp. See here for a high-level overview; llama.cpp serves as the backend for a number of popular self-hosted LLM tools/frameworks like LMStudio and Ollama. Here’s a blog post with some technical details about CPU performance improvements.

Citation

If you’re making reference to any individual piece of content featured here, please just cite that directly. However, if you wish to cite this as a broad survey, you can use the BibTeX citation below.