Uni-HAR Phase 2

On-device Optimization

📄 [Phase 2] Uni-HAR: On-device Optimization Log

From 221 GFLOPs Bottleneck to MPU-Deployable Ultra-Lightweight Inference


1. The Bottleneck: The Cost of Intelligence

Phase 1 succeeded in building Uni-HAR, a general-purpose, robust HAR model based on probability-distribution representations. Maximizing the model's intelligence (representation power), however, came at a price: we ran straight into the wall of real-world hardware limits.

Profiling the compute of the undergraduate-capstone-era (Phase 0) 1D-CNN against the current (Phase 1) Uni-HAR model with the fvcore library made the contrast stark.

[Phase 0] Legacy 1D-CNN model (pose-only, single modality)

Total FLOPs: 0.5099 MFLOPs
-> Ultra-lightweight, but predictions are unstable and fine-grained motions cannot be distinguished.

[Phase 1] Uni-HAR model (multimodal, factorized Transformer)

Total FLOPs: 221,567,666,944 FLOPs
Total GFLOPs: 221.5677 GFLOPs (Parameters: 14.2M)

[Major Component FLOPs Breakdown]
- Image Encoder: 218.23 GFLOPs (98.5%)
- Pose Backbone: 3.32 GFLOPs (1.5%)

-> See the detailed profiling results

๋ฌธ์ œ ์ธ์‹: 221 GFLOPs๋Š” ์„œ๋ฒ„๊ธ‰ GPU์—์„œ๋Š” ๋ฌด๋ฆฌ๊ฐ€ ์—†์ง€๋งŒ, ํƒ€๊ฒŸ ํ™˜๊ฒฝ์ธ ์ œํ•œ๋œ Edge Device(Jetson, MPU ๋“ฑ)์—์„œ๋Š” Out-of-Memory(OOM)์™€ ์‹ค์‹œ๊ฐ„ ์ฒ˜๋ฆฌ ๋ถˆ๊ฐ€(Frame Drop) ๋ฅผ ์œ ๋ฐœํ•˜๋Š” ์น˜๋ช…์ ์ธ ์ˆ˜์น˜์˜€์Šต๋‹ˆ๋‹ค.

์ด์— 1์ฐจ ๋ชจ๋ธ์˜ ๋†’์€ ์ง€๋Šฅ(Representation Power)์€ ์œ ์ง€ํ•œ ์ฑ„, ์ด๋ฅผ ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค์— ์˜ฌ๋ฆฌ๊ธฐ ์œ„ํ•ด ๊ธฐ์กด ๋ชจ๋ธ์˜ forward ๊ตฌ์กฐ๋ฅผ ํ•ด์ฒด(Decoupling)ํ•˜๊ณ  **์„ธ ๊ฐ€์ง€ ํ•˜๋“œ์›จ์–ด ์นœํ™”์  ์ตœ์ ํ™”(Hardware-Aware Optimization)**๋ฅผ ์ ์šฉํ•œ StreamingUniHAR ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ƒˆ๋กญ๊ฒŒ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.


2. Optimization Step 1: Feature Caching (Stream-Aware Inference)

"Transplanting the LLM KV-cache idea into the vision domain"

98.5% of the total compute (218 GFLOPs) came from the Image Encoder (ResNet18). With 120-frame sliding-window inference, every 1-frame slide pointlessly recomputed the RGB features of the previous 119 frames (a fully coupled forward pass).

Solution (architecture decoupling): We transplanted the KV-cache concept, which autoregressive LLMs use to avoid recomputing past tokens, into the vision pipeline, splitting the monolithic model into a per-frame Feature Extractor and a Temporal Aggregator.

# Core logic of StreamingUniHAR: queue-rolling cache
# Shift the queue one slot and append the newest feature at the end (only 1 frame is computed)
self.pose_raw_cache = torch.roll(self.pose_raw_cache, shifts=-1, dims=1)
self.pose_raw_cache[:, -1:, :, :] = curr_pose

self.img_feat_cache = torch.roll(self.img_feat_cache, shifts=-1, dims=1)
self.img_feat_cache[:, -1:, :] = curr_img_feat

  • Result: Instead of recomputing the whole window at once, the Image Encoder now runs only on the single incoming frame, smoothing the per-frame compute from 221 GFLOPs down to roughly 1.8 GFLOPs (218.23 GFLOPs ÷ 120 frames ≈ 1.8 GFLOPs).
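The rolling cache above can be exercised on its own. The sketch below is a minimal, self-contained illustration (the window size and feature dimension are made-up values, not the model's real configuration) of how torch.roll maintains a fixed-size queue of per-frame features:

```python
import torch

# Minimal sketch of the queue-rolling feature cache.
# window=4 and dim=2 are illustrative, not the real configuration.
class FeatureCache:
    def __init__(self, batch, window, dim):
        # Pre-allocated buffer holding the most recent `window` features
        self.buf = torch.zeros(batch, window, dim)

    def push(self, feat):
        # Shift every slot one step toward the front, then write the
        # newest feature into the last slot (only this frame was encoded)
        self.buf = torch.roll(self.buf, shifts=-1, dims=1)
        self.buf[:, -1, :] = feat
        return self.buf

cache = FeatureCache(batch=1, window=4, dim=2)
for i in range(5):
    window = cache.push(torch.full((1, 2), float(i)))
# After 5 pushes into a 4-slot queue, frame 0 has been evicted:
# window[0, :, 0] is [1., 2., 3., 4.]
```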

3. Optimization Step 2: Temporal Sparsity (Dynamic Skipping)

"Let the light modality gate the heavy modality's compute (cross-modal sparsity)"

Caching eliminated the Image Encoder's redundant compute, but an edge camera keeps running even when the subject is standing still. Analyzing HAR datasets confirmed that such motionless (redundant) segments are quite common.

Solution: We designed a structure in which the cheap modality (pose, 1.5% of compute) decides whether to skip the expensive one (the RGB CNN, 98.5%). If the frame-to-frame change (Δ) of the pose feature vector falls below a threshold, the heavy CNN extraction is skipped entirely (zero compute) and the previous frame's feature is copied forward.

# [Optimization] Temporal sparsity check
# Compute the change (delta) between the current and previous pose
delta = torch.norm(curr_pose - self.last_pose, p=2, dim=-1).mean().item()

if delta < self.threshold and self.last_img_feat is not None:
    # Motion below the threshold -> skip the heavy CNN extraction (zero compute)
    curr_img_feat = self.last_img_feat
    is_sparse = True
else:
    # Motion detected -> run the CNN on the current frame only
    is_sparse = False
    curr_img_4d = curr_img.view(B, 3, curr_img.shape[-2], curr_img.shape[-1])
    curr_img_feat = self.image_encoder(curr_img_4d).unsqueeze(1)

  • Result: In a typical inference scenario, RGB encoding is completely suppressed during segments with no behavioral change, cutting total compute by a further 30-40%.
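The effect of the delta gate can be checked on a synthetic pose stream. In this hedged sketch, the threshold, the pose shape (17 joints × 2 coordinates), and the motion pattern are all illustrative assumptions, not the tuned production values:

```python
import torch

# Synthetic stream: 10 motionless frames, then 10 moving frames.
# threshold=0.05 and the (17, 2) pose shape are assumed values.
threshold = 0.05
poses = [torch.zeros(17, 2) for _ in range(10)]
poses += [torch.full((17, 2), float(i)) for i in range(1, 11)]

last_pose, skipped = poses[0], 0
for curr_pose in poses[1:]:
    delta = torch.norm(curr_pose - last_pose, p=2, dim=-1).mean().item()
    if delta < threshold:
        skipped += 1   # still: reuse the cached RGB feature, no CNN call
    last_pose = curr_pose
# All 9 transitions inside the motionless segment are skipped;
# every moving transition (delta = sqrt(2)) would trigger the CNN.
```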

4. Optimization Step 3: 1.58-bit QAT (BitNet for Vision)

"Extreme quantization and adder-only arithmetic for MPU deployment"

With the image-encoder bottleneck resolved, the last target was the Pose Backbone (a factorized spatio-temporal Transformer) at 3.3 GFLOPs. At the MPU (microprocessor unit) level, FP32 MAC (multiply-accumulate) matrix multiplication itself runs into power and bandwidth limits.

Solution:
We are experimenting with applying the extreme quantization scheme of BitNet b1.58, a recent LLM-compression technique, to a Transformer-based vision model.

class WeightQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, weight):
        scale = weight.abs().mean().clamp(min=1e-8)
        quantized = torch.round(weight / scale).clamp(-1, 1)  # map to {-1, 0, 1}
        return quantized * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the quantizer as identity so gradients pass through
        return grad_output

class BitLinear(nn.Module):
    """1.58-bit layer that drops in for nn.Linear"""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        q_weight = WeightQuantSTE.apply(self.weight)
        return F.linear(x, q_weight, self.bias)

  • QAT (quantization-aware training): To backpropagate through the non-differentiable quantization function, we implemented a custom autograd function based on the STE (Straight-Through Estimator) and swapped the MLP's nn.Linear layers for BitLinear.
  • Hardware benefit: At the ALU (arithmetic logic unit) level, ternary weights replace expensive multipliers with simple adders. Compressing parameter memory also minimizes DRAM accesses, laying the groundwork for computation that completes entirely within SRAM.
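The absmean ternary mapping used in WeightQuantSTE can be verified in isolation. The weight values below are made up for illustration; the point is that round-and-clamp after absmean scaling lands every weight on one of exactly three levels:

```python
import torch

# Illustrative weights (not from the trained model)
w = torch.tensor([[0.9, -0.4, 0.05], [1.2, -0.01, -0.7]])

# absmean scaling, then round-and-clamp into {-1, 0, 1}
scale = w.abs().mean().clamp(min=1e-8)
q_int = torch.round(w / scale).clamp(-1, 1)
q = q_int * scale   # dequantized ternary weights

levels = set(q_int.flatten().tolist())
# levels == {-1.0, 0.0, 1.0}: each weight carries ~1.58 bits (log2(3))
```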

5. Conclusion & Next Steps

Phase 2์˜ ์ตœ์ ํ™” ํŒŒ์ดํ”„๋ผ์ธ ์„ค๊ณ„๋ฅผ ํ†ตํ•ด, Uni-HAR๋Š” '์—ฐ๊ตฌ์‹ค์˜ ๋ฌด๊ฑฐ์šด ๋ชจ๋ธ'์—์„œ 'ํ˜„์žฅ์˜ ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค(MPU)์—์„œ ์‹ค์‹œ๊ฐ„ ๋™์ž‘ ๊ฐ€๋Šฅํ•œ ์—”์ง„' ์œผ๋กœ ๋ณ€๋ชจํ•  ์ค€๋น„๋ฅผ ๋งˆ์ณค์Šต๋‹ˆ๋‹ค.

[Summary of Transformation]

  • Phase 0: Light but dumb. (0.5 MFLOPs, 1D-CNN)
  • Phase 1: Smart but heavy. (221 GFLOPs, OOM on MPU)
  • Phase 2: Stayed smart while getting light. (99% of per-frame compute removed via feature caching & sparsity; Transformer compute streamlined with 1.58-bit QAT)

[Future Work: Heterogeneous Dispatching] The next topic is automatic workload routing across heterogeneous hardware (CPU/GPU/NPU).
Light control logic (sparsity decisions, buffer rolling) will be handled by the CPU, while feature extraction and QAT's parallel matrix math will be dispatched to the GPU/NPU of boards such as Jetson, continuing toward a unified engine that drives the chipset's wasted compute resources to zero.
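As a rough sketch of this dispatching idea (the routing rule and device names here are assumptions, not the final design), the decision can be as simple as branching on workload weight:

```python
import torch

# Hypothetical dispatch rule: light control logic stays on the CPU;
# heavy tensor work goes to an accelerator when one is available.
def pick_device(heavy: bool) -> torch.device:
    if heavy and torch.cuda.is_available():
        return torch.device("cuda")   # e.g. Jetson GPU for CNN / QAT matmuls
    return torch.device("cpu")        # sparsity checks, buffer rolling

# The control path always resolves to CPU; the heavy path upgrades
# to CUDA only on hardware that actually has it.
light_dev = pick_device(heavy=False)
heavy_dev = pick_device(heavy=True)
```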