Fast Inference from Transformers via Speculative Decoding Transformer Models - Search Videos

As AI labs race to train and deploy new frontier models, existing models become more affordable with better tokenomics. ✨ "Everybody's trying to get to the next frontier. And every time they get to the next frontier, the last generation AI tokens, the cost starts to decline about a factor of 10x every year," said NVIDIA CEO Jensen Huang in a recent keynote. Model optimization techniques such as speculative decoding and multi-token prediction, combined with inference serving platforms like NVIDIA

5.7K views · 1 month ago

Facebook · NVIDIA AI

How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100

Faster LLMs: Accelerate Inference with Speculative Decoding

4M views · 101K reactions | Megatron : Transformers Via : h__super | Gundam World : The Legion | Facebook

4.1M views · 2 weeks ago

Facebook · Gundam World : The Legion

Unlocking AI Speed: How KV Caching and MLA Make Transformers 20x Faster

YouTube · Skill Advancement

DEER: Diffusion Drafting for Faster LLMs

28 views · 2 months ago

YouTube · AI Research Roundup

Modern LLM Inference: Architecture, Quantization, and Serving Infrastructure | Uplatz

11 views · 2 months ago

26. Transformer Inference Process: How LLMs Predict the Next Word (…

78 views · 2 weeks ago

YouTube · Neuro Splash (Telugu)

9- Inference Optimization

YouTube · GenoPlan

How to DOUBLE the LM Studio AI Inference Speed with These HIDD…

561 views · 2 weeks ago

YouTube · AsapGuide

DFlash: Faster LLM Inference via Block Diffusion

30 views · 2 weeks ago

YouTube · AI Research Roundup

Make Large Language Models 4× Faster! Jacobi Forcing for Causal …

YouTube · AITech_Trends

The Agentic AI Infrastructure Playbook | VentureBeat AI Impact …

101 views · 2 weeks ago

Demo for Real-time Multi-edge Collaborative Inference System

YouTube · Hansong Zhou

AI Frontiers: 101 ML Papers from Nov 21, 2025 - Efficiency, Safety …

11 views · 2 months ago

YouTube · AI Frontiers

Inference Office Hours with SGLang: Performance Optimizations for LL…

1K views · 2 weeks ago

YouTube · NVIDIA Developer

The Transformer Secret: How AI Understands Language (Explained)

YouTube · CollapsedLatents

TiDAR: The Future of AI Speed & Quality (One Step, 5x Faster) #Sho…

YouTube · CollapsedLatents

EP5: Speculative Decoding with Nadav Timor

YouTube · The Information Bottleneck

How AI Replies So Fast! ⚡ Speculative Decoding

130 views · 1 month ago

YouTube · Mr. Doubty – Short. Smart. Techy

What is Speculative Sampling? | Boosting LLM inference speed

3.8K views · Nov 20, 2024

YouTube · AssemblyAI

Leave No Context Behind: Efficient Infinite Context Transformers wit…

59.7K views · Apr 24, 2024

YouTube · Yannic Kilcher

Sparse is Enough in Scaling Transformers (aka Terraformer) | …

24.1K views · Dec 2, 2021

YouTube · Yannic Kilcher

Transformer models: Encoder-Decoders

103K views · Jun 14, 2021

YouTube · HuggingFace

Vision Transformer for Image Classification

142.1K views · May 5, 2021

YouTube · Shusen Wang

Speculative Decoding Explained

7.6K views · Dec 21, 2023

YouTube · Trelis Research

How chatgpt works

22.1K views · Feb 9, 2023

YouTube · Lucidate

Accelerating AI Model Performance (APAC)

330 views · 3 months ago

YouTube · Microsoft Reactor

MiMo-V2-Flash Technical Report

44 views · 2 months ago

YouTube · AI Papers Podcast Daily

ChatGPT-5 Architecture Explained

17.2K views · 6 months ago

YouTube · ResDevEng
