Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)

AI4LUV — Sun, 30 Nov 2025 02:45:55 GMT

For more information about Stanford's Artificial Intelligence programs, visit: https://stanford.io/ai

This lecture gives a concise overview of building a ChatGPT-like model, covering both pretraining (language modeling) and post-training: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). For each component, it covers common practices in data collection, algorithms, and evaluation methods. This guest lecture was delivered by Yann Dubois in Stanford’s CS229: Machine Learning course in summer 2024.

Yann Dubois
PhD Student at Stanford
https://yanndubs.github.io/

About the speaker: Yann Dubois is a fourth-year CS PhD student advised by Percy Liang and Tatsu Hashimoto. His research focuses on improving the effectiveness of AI when resources are scarce. Most recently, he has been part of the Alpaca team, working on training and evaluating language models more efficiently using other LLMs.

To view all online courses and programs offered by Stanford, visit: http://online.stanford.edu

Chapters:
00:00 - Introduction
00:10 - Recap on LLMs
00:16 - Definition of LLMs
00:19 - Examples of LLMs
01:16 - Importance of Data
01:20 - Evaluation Metrics
01:33 - Systems Component
01:41 - Importance of Systems
01:47 - LLMs Based on Transformers
01:57 - Focus on Key Topics
02:00 - Transition to Pretraining
03:02 - Overview of Language Modeling
04:17 - Generative Models Explained
05:15 - Autoregressive Models Definition
06:36 - Autoregressive Task Explanation
07:49 - Training Overview
08:48 - Tokenization Importance
10:50 - Tokenization Process
13:30 - Example of Tokenization
16:00 - Evaluation with Perplexity
20:50 - Current Evaluation Methods
24:30 - Academic Benchmark: MMLU