Study Notes
Paged Attention
Overcoming vLLM memory fragmentation with OS virtual-memory paging
Flash Attention
Breaking through the memory wall with IO-Aware Tiling and Online Softmax
Speculative Decoding
A temporal optimization governing memory I/O call frequency
Token-level Sparsity
Overcoming KV Cache capacity limits in edge environments
[Research Note] Heterogeneous Routing
LLM Inference Routing Strategy in Edge Environments
[Research Note] Strategies for Overcoming the Memory Wall
Token-level sparsity and speculative decoding
[Research Note] Activation-level Sparsity for Accelerating LLM Inference
Activation-level Sparsity Analysis and Hardware-Friendly Predictor Design
[Research Note] 1.58-bit Ultra-low-bit Quantization (BitNet)
Mathematical implementation and QAT design
Deep Dive into Distribution-Aware HAR
Research & Implementation Monograph
Why DDPM Is a Probabilistic Generative Model
Understanding Diffusion Models
An Overview of RLHF
Basic Transformer Architecture