YouTube video intelligence showcase

How GPT, Claude, and Gemini are actually trained and served – Reiner Pope

Reiner Pope explains the mechanics behind how GPT-style models are trained and served, focusing in this excerpt on inference economics. Using a roofline-style analysis of transformer execution on a GPU cluster, he shows how batch size, weight fetches, compute throughput, and KV cache access shape latency and cost. The discussion helps explain why higher-priced fast modes can stream tokens more quickly, and why serving many users together can dramatically improve efficiency.

Dwarkesh Patel · AI · Podcasts · Programming · Business · Batch size and batching · Roofline analysis · Weights and KV cache · 2 hrs 13 min · Apr 29, 2026 · 6 comment sample

Build this with Crawlora

Video intelligence API workflow

Video ID: xmkSf5IS-zw
Available APIs: Transcript, Comments, Metadata

cURL

curl "https://api.crawlora.net/api/v1/youtube/transcript/xmkSf5IS-zw" \
  -H "x-api-key: $CRAWLORA_API_KEY"

Video summary

How batch size, memory bandwidth, and KV cache shape AI inference

In this blackboard-style lecture, Reiner Pope walks through how large language models are served in practice, using transformer inference on a GPU cluster to explain why latency and cost behave the way they do. The excerpt focuses on batch size, memory bandwidth, compute throughput, and KV cache fetches, showing how these factors create trade-offs between speed, throughput, and price. It also frames why different serving modes can offer faster token streaming at higher cost.

Why batching matters

Reiner Pope explains why serving many users together can dramatically improve the economics of model inference.

A simple cluster-level model

The lecture uses a roofline-style analysis of transformer inference on a GPU cluster, separating compute time from memory fetch time.

Weights, context, and KV cache

The discussion breaks down weight fetches, active parameters, and KV cache access to show how latency and cost scale.

Why faster modes cost more

The excerpt connects these mechanics to real-world API pricing and latency tiers, such as faster modes offered at a higher price.

Topics

Batch size and batching

How batch size changes both latency and cost per token in model serving.

Roofline analysis

Why memory bandwidth and compute throughput set practical limits on inference speed.

Weights and KV cache

How weight fetches and KV cache access factor into decode-time performance.

Audience comments snapshot

Audience comments: praise for the long-form technical format and production

The sampled comments focus less on the model-training content itself and more on appreciation for the interview format: viewers praise the willingness to go deep on technical material, call the episode a useful public service, and describe the setup as a potential gamechanger. Several comments also mention the high production quality, and one asks for future guests like Karpathy. The only concrete user-generated resource mentioned is a set of flashcards and practice problems shared to help others retain the discussion.

Sampled comments: 6
Visible likes: 2143
Public replies: 26

Comment themes

Appreciation for serious technical depth

Comments frame the episode as unusually substantive for a mainstream podcast and appreciate that the conversation stays technical rather than simplified.

Praise for the interview/lecture format

The audience reacts strongly to the format itself, treating it as a model for future episodes and even for lecture-style media more broadly.

Production and presentation quality

Several commenters notice and value the polished visual and audio presentation, suggesting production quality contributes to the experience.

Audience signals

Strong support for the deep technical format

Multiple comments explicitly endorse the longer, more technical interview style and want it continued with similar guests.

Production quality stands out

Viewers repeatedly compliment the production quality, including microphones, lighting, room setup, and camera work.

Learning support and retention tools

One comment shares flashcards and practice problems for the episode, suggesting the discussion inspired study aids.

Interest in similar high-caliber guest appearances

A comment requests a future appearance by Karpathy in the same setup, indicating interest in more guests like Reiner Pope.

Representative public comments

@DwarkeshPatel · 2026-04-30

Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too! https://reiner-flashcards.vercel.app

175 likes · 7 replies

@BLAISEDAHL96 · 2026-04-30

Yep, this definitely needs to be the format moving forward with guests that care to instruct something. This is a great public service.

881 likes · 6 replies

@fernandodutra3788 · 2026-04-30

So nice to see a mainstream podcast (1.3M subs) spend 2+h discussing technical state of the art. Not just “let’s explain what 1+1 is so the audience can follow”, but actually just going with the flow. Appreciated!

245 likes · 1 reply

@WillzMaster85 · 2026-04-30

this new format will be a complete gamechanger; a simple yet genius move dwarkesh, good work!

336 likes · 0 replies

@avnotes · 2026-05-02

Petition to bring Sir Karpathy in this setup!

435 likes · 12 replies

@NimTheHuman · 2026-05-02

This is making me realize that most recordings of lectures are seriously deprived of the production quality they deserve (e.g, high quality mics, thoughtful lightning, aesthetic room setups, intentional camera angle shifts like the one at 7:18). If more university-style lectures adopted this format and quality, huma...

71 likes · 0 replies

Build with YouTube comments data

Use Crawlora's YouTube comments API with the video and transcript endpoints to collect viewer language, thread activity, and audience signals.

Build this workflow

1. Fetch video metadata

Start with the video endpoint to capture ID, channel, publish date, duration, and source context.

2. Fetch transcript

Pull timestamped transcript data for summarization, search, citation, and RAG preparation.

3. Fetch public comments

Collect visible audience comments to identify themes, objections, questions, and engagement signals.

4. Store, analyze, report

Persist structured JSON, run analysis, and publish dashboards, alerts, or research reports.
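The transcript step of the workflow above can be sketched as a small shell helper. The base URL, path, and `x-api-key` header come from the cURL example on this page. The function names and the output file are illustrative assumptions, and the response schema is not shown here, so the sketch just saves the raw response body. The metadata and comments endpoints are left out because their paths do not appear on this page.

```shell
# Sketch only: URL and header are copied from the page's cURL example;
# the function names and output path are hypothetical.
CRAWLORA_BASE="https://api.crawlora.net/api/v1"

# Build the transcript URL for a given YouTube video ID.
transcript_url() {
  printf '%s/youtube/transcript/%s\n' "$CRAWLORA_BASE" "$1"
}

# Fetch the transcript and write the raw response body to a file.
# --fail makes curl exit non-zero on HTTP errors instead of saving an error page.
fetch_transcript() {
  video_id="$1"
  out_file="$2"
  curl --fail --silent --show-error \
    -H "x-api-key: ${CRAWLORA_API_KEY:?set CRAWLORA_API_KEY first}" \
    -o "$out_file" \
    "$(transcript_url "$video_id")"
}
```

A call might look like `fetch_transcript xmkSf5IS-zw transcript.json`, after which the saved file can be handed to whatever storage or analysis step comes next.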

Public transcript excerpt

Transcript

Timestamped public transcript passages group captions into readable sections, making the video easier to scan, cite, and summarize.

4:56

We're modeling compute performance. I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much, and maybe there will be some terms we ignored. On the memory side, what do we need to do with memory? We need to fetch all of the weights, so there is some time to fetch the total number of parameters, not just the active parameters. There's weight fetch time, and then in addition, there's a KV cache fetch time. This actually depends on batch size.
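The passage above can be written as a roofline-style lower bound for one decode step. This is a sketch under stated assumptions, not the lecture's own notation: $B$ is the batch size, $N_{\text{active}}$ and $N_{\text{total}}$ are the active and total parameter counts, $F$ is the cluster's compute throughput in FLOP/s, $W$ is its memory bandwidth in bytes/s, $b_w$ is bytes per weight, $L$ is the context length in tokens per sequence, and $k$ is the KV cache size in bytes per token. The factor of 2 FLOPs per active parameter per token is the usual matrix-multiply approximation and is not stated in the excerpt.

```latex
% Lower bound on the time for one decode step (one new token per sequence).
% All symbols are illustrative assumptions, defined in the paragraph above.
\begin{align*}
t_{\text{compute}} &\approx \frac{2\, B\, N_{\text{active}}}{F} \\
t_{\text{weights}} &= \frac{N_{\text{total}}\, b_w}{W} \\
t_{\text{KV}} &= \frac{B\, L\, k}{W} \\
t_{\text{step}} &\ge \max\!\left(t_{\text{compute}},\; t_{\text{weights}} + t_{\text{KV}}\right) \\
\frac{t_{\text{step}}}{B} &\ge \max\!\left(\frac{2\, N_{\text{active}}}{F},\; \frac{N_{\text{total}}\, b_w}{B\, W} + \frac{L\, k}{W}\right)
\end{align*}
```

The last line divides the step time by $B$ to get time per generated token, a rough proxy for cost. Under these assumptions only the weight fetch term shrinks as $1/B$, which is one way to see why serving many users together improves efficiency, while the KV cache term stays constant per token because that traffic grows with the batch.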

Build with YouTube transcript data

Use Crawlora's YouTube transcript API to fetch fresh timestamped transcript data for your own server-side workflows.

Related Crawlora APIs & guides

Build YouTube data workflows with Crawlora

This showcase is built from Crawlora's public YouTube data APIs. Use the same endpoints and guides to build your own transcript, comment, and creator-intelligence workflows.

More AI video examples

Browse structured transcript and comment showcases in AI.

More Podcasts video examples

Browse structured transcript and comment showcases in Podcasts.

YouTube API

Transcript, comments, and video metadata endpoints that return normalized JSON.

YouTube transcript extraction

Build searchable, RAG-ready transcript pipelines from public videos.

YouTube creator intelligence

Monitor creators, audiences, and content trends across channels.

Podcast & audio intelligence

Turn long-form audio and podcasts into structured, analyzable data.

Related showcases

More structured YouTube examples

Chip design from the bottom up – Reiner Pope

Dwarkesh Patel and Reiner Pope build AI chip design from the ground up, starting with logic gates and multiply-accumulate operations before moving into adders, precision tradeoffs, and why low-bit arithmetic is so powerful for neural nets.

Logic gates to chip primitives · Matrix multiplication as the core workload

What rebuilding AlphaGo teaches us about self-play, RL, and the future of LLMs

Eric Jang explains AlphaGo from the ground up, using Go’s rules, endgame scoring, and search complexity to show why deep learning made the problem tractable. The episode connects those ideas to self-play, reinforcement learning, and broader lessons for future AI systems.

Go fundamentals · AlphaGo’s significance

Jensen Huang on Nvidia’s Moat, Supply Chain Bottlenecks, and Whether AI Software Gets Commoditized

Jensen Huang argues that Nvidia’s moat is not just software, but the hard-to-replicate system that turns electrons into valuable tokens across a broad AI ecosystem. He also discusses supply chain constraints, upstream investments, and how Nvidia plans years ahead to scale through bottlenecks.

Nvidia’s value creation · Supply chain and ecosystem
