Study Notes
Paged Attention
Overcoming vLLM memory fragmentation with OS virtual-memory paging
Flash Attention
Breaking through the memory wall with IO-Aware Tiling and Online Softmax
Speculative Decoding
A temporal optimization governing memory I/O call frequency
Token-level Sparsity
Overcoming KV Cache capacity limits in edge environments
[Research Note] Heterogeneous Routing
LLM Inference Routing Strategy in Edge Environments
[Research Note] Strategies for Overcoming the Memory Wall
Token-level sparsity and speculative decoding
[Research Note] Activation-level Sparsity for Accelerating LLM Inference
Activation-level Sparsity Analysis and Hardware-Friendly Predictor Design
[Research Note] 1.58-bit Ultra-low-bit Quantization (BitNet)
Mathematical implementation and QAT design
Deep Dive into Distribution-Aware HAR
Research & Implementation Monograph
Why DDPM Is a Probabilistic Generative Model
Understanding Diffusion Models
An Overview of RLHF
Basic Transformer Architecture