Deep Dive into Distribution-Aware HAR

Research & Implementation Monograph

📄 "Deep Dive into Distribution-Aware HAR: From Mathematical Design to Implementation"

Compositional Action Recognition via Factorized Spatio-Temporal Representation

Abstract:

  • This report proposes a methodology that models human action not as a fixed class label but as a probabilistic composition of atomic actions.
  • To this end, we design a Factorized Spatio-Temporal Transformer that disentangles spatial and temporal features, and learn noise-robust representations via Information-Theoretic Multimodal Fusion.
  • In addition, a Soft-Mapping Training Strategy encourages the model to output a meaningful state distribution even for unseen actions.

1. Introduction

"Real-world dynamic data (human actions, volumetric data, etc.) does not have hard boundaries.
For example, between 'walking' and 'running' there exist countless intermediate states.
Most existing deep learning models, however, assume one-hot labels and ignore the data's inherent ambiguity, which leads to prediction failures on unseen actions."

  • Problem: $P(Y \mid X) = \text{One-hot}$ ignores the continuity and overlap of actions.
  • Solution: Map actions onto a manifold in a latent space $Z$ and express the output $Y$ as a linear combination of basis actions.
    • e.g., $Y_{\text{fast walk}} \approx \alpha \cdot Y_{\text{walk}} + \beta \cdot Y_{\text{run}}$
  • Goal: "This work aims to model actions not as simple classifications but as probability distributions, enabling compositional understanding of unseen actions."

2. Methodology: Designing Inductive Bias

2.1 Short-term Pose Encoder

2.1.1 Input Representation

๊ฐ pose๋Š”:

[xโˆˆRTร—Jร—C][ x \in \mathbb{R}^{T \times J \times C} ]

  • (T): frame
  • (J): number of joints
  • (C = 3) (x, y, normalized depth)

๋ชจ๋ธ์—์„œ๋Š” depth๋ฅผ ์•ˆ์ •ํ™”์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด tanh๋ฅผ ์ทจํ•œ๋‹ค:

[zโˆ—=tanhโก(z)][ z^* = \tanh(z) ]
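A minimal PyTorch sketch of this preprocessing; the assumption that depth is the last of the $C = 3$ channels is ours, not stated above:

```python
import torch

def normalize_depth(pose: torch.Tensor) -> torch.Tensor:
    """Apply z* = tanh(z) to the depth channel of a (T, J, C) pose tensor.
    Assumes channel order (x, y, depth); tanh bounds depth to (-1, 1)."""
    pose = pose.clone()                      # avoid mutating the caller's tensor
    pose[..., 2] = torch.tanh(pose[..., 2])  # z* = tanh(z)
    return pose

# Example: T=16 frames, J=25 joints, C=3 channels
x = normalize_depth(torch.randn(16, 25, 3))
```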


2.1.2 Joint Embedding

๊ฐ joint๋Š” joint ID embedding (e_j) ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๊ตฌ์กฐ์  ์ •๋ณด๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค.

[ht,j=Wxt,j+ej][ h_{t,j} = W x_{t,j} + e_j ]
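A sketch of this embedding step; the module and parameter names are illustrative, only the formula $h_{t,j} = W x_{t,j} + e_j$ comes from the text:

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """h_{t,j} = W x_{t,j} + e_j: linear projection plus a per-joint ID embedding."""
    def __init__(self, in_ch: int = 3, dim: int = 64, num_joints: int = 25):
        super().__init__()
        self.proj = nn.Linear(in_ch, dim)              # W
        self.joint_id = nn.Embedding(num_joints, dim)  # e_j

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, C) -> h: (B, T, J, D); e_j broadcasts over batch and time
        ids = torch.arange(x.size(2), device=x.device)
        return self.proj(x) + self.joint_id(ids)
```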


2.1.3 Temporal Inductive Bias: Why Learnable PE?
Positional Encoding (PE) is essential for injecting order information into a Transformer. This work adopts a learnable PE instead of the common sinusoidal PE.

  • Limitations of sinusoidal PE (why it is ill-suited to HAR)
    As in the original Transformer, sinusoidal PE encodes "absolute position" with fixed frequencies:
    $\text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d})$
    $\text{PE}(pos, 2i+1) = \cos(pos / 10000^{2i/d})$

However, HAR has the following characteristics:

  • (1) Temporal length is very short (8–30 frames)
    → With a fixed-frequency representation, few distinct phase changes occur within such a window, so frequency-based position discrimination offers almost no benefit.

  • (2) Datasets are small and highly domain-specific
    Sinusoidal PE has zero freedom to learn task-specific temporal patterns
    → the risk of underfitting is actually greater.

  • (3) Mismatch with the short-term modeling architecture
    PoseFormerFactorized uses

    • a TemporalBlock (each joint's time series)
    • a SpatialBlock (joint-set attention within each frame)

    as short-term filters
    → "relative local variation" matters more
    (i.e., the absolute position scale becomes meaningless and a learnable bias is advantageous).

Sinusoidal PE therefore conflicts with the inductive bias of HAR.


2.1.4 Advantages of Learnable PE (why it fits HAR better)

In HAR we instead use a learnable parameter

$\mathbf{P}_{learn} \in \mathbb{R}^{1 \times T \times 1 \times D}$

broadcast along the time dimension and added to the input.

Summary of advantages:

| Aspect                         | Learnable PE     | Sinusoidal PE |
| ------------------------------ | ---------------- | ------------- |
| Short HAR temporal length      | Easy to optimize | No benefit    |
| Domain-specific patterns       | Learnable        | Not learnable |
| Short-term block compatibility | High             | Low           |
| Small data regime              | Favorable        | Unfavorable   |

Theoretical grounding (soft position bias)

The learnable PE solves the following optimization problem:

$\mathbf{P}_{learn}^* = \arg\min_{\mathbf{P}} \mathcal{L}(f(X + \mathbf{P}))$

In effect, this acts as a "latent shift bias along the time axis",
endowing each joint's embedding space with task-specific temporal variation.

This has two effects:

  • it provides an auxiliary signal that makes the local motion derivative

    $\Delta_t = X_{t+1} - X_t$

    easier to distinguish;

  • the temporal block's self-attention, operating on

    $Q(X + P),\ K(X + P),\ V(X + P)$,

    separates temporal differences more effectively.

In conclusion, given HAR's data scale and structural characteristics,
the learnable PE is more expressive and optimization-friendly.
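A minimal sketch of this learnable temporal position bias; hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

class LearnableTemporalPE(nn.Module):
    """P_learn in R^{1 x T x 1 x D}, added to the input and broadcast over
    the batch and joint dimensions (a soft position bias along time)."""
    def __init__(self, num_frames: int = 30, dim: int = 64):
        super().__init__()
        self.pe = nn.Parameter(torch.zeros(1, num_frames, 1, dim))
        nn.init.trunc_normal_(self.pe, std=0.02)  # small random init

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, J, D); slicing supports clips shorter than num_frames
        return h + self.pe[:, : h.size(1)]
```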


2.2 Factorized Spatio-Temporal Encoder (Short-term)

Factorized Architecture: Implementation Details

1. Spatial-Temporal Disentanglement
The model is designed to disentangle "pose (Geometry)" and "motion (Dynamics)" into independent latent factors.

Temporal Block (Per-Joint):

  • Tensor view: $(B, T, J, D) \rightarrow (B \cdot J, T, D)$
  • Attends independently over each joint's temporal trajectory to extract pose-invariant motion.

Spatial Block (Per-Frame):

  • Tensor view: $(B, T, J, D) \rightarrow (B \cdot T, J, D)$
  • Attends over the relations between joints at each time step to extract time-invariant geometry (both blocks are sketched below).
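A sketch of the two factorized blocks, using stock `nn.TransformerEncoderLayer` modules as stand-ins for the actual TemporalBlock/SpatialBlock internals, which the text does not specify:

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """One factorized block: temporal attention per joint, then spatial
    attention per frame. Sketches the tensor-view logic; sizes illustrative."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, T, J, D = h.shape
        # Temporal view (B*J, T, D): each joint's trajectory attends over time
        h = h.permute(0, 2, 1, 3).reshape(B * J, T, D)
        h = self.temporal(h)
        # Spatial view (B*T, J, D): joints within each frame attend to each other
        h = h.reshape(B, J, T, D).permute(0, 2, 1, 3).reshape(B * T, J, D)
        h = self.spatial(h)
        return h.reshape(B, T, J, D)
```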

2. Complexity Proof (Mathematical Derivation)

This factorized design offers not only representational benefits but also dramatic computational efficiency.

  • Step 1 (Full Attention): compute $QK^T$ over $N = T \cdot J$ tokens.
    $\mathcal{O}_{Full} = \mathcal{O}((TJ)^2 \cdot d) = \mathcal{O}(T^2 J^2 d)$

  • Step 2 (Factorized Attention): attend separately along the temporal and spatial axes.
    $\mathcal{O}_{Fact} = \underbrace{\mathcal{O}(J \cdot T^2 d)}_{\text{Temporal}} + \underbrace{\mathcal{O}(T \cdot J^2 d)}_{\text{Spatial}}$

  • Step 3 (Efficiency Ratio): $\text{Ratio} = \frac{T^2 J d + J^2 T d}{T^2 J^2 d} = \frac{1}{J} + \frac{1}{T}$

Insight: For $T = 30,\ J = 25$, the computation drops by a factor of roughly 13.6 ($1/(1/25 + 1/30) \approx 13.6$). This is the key property that allows deep models to be stacked at full resolution on memory-constrained 3D volumetric data (CT, MRI).
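A quick numeric check of this ratio ($d$ cancels out):

```python
# Efficiency ratio of factorized vs. full attention for T=30, J=25
T, J, d = 30, 25, 64
full = (T * J) ** 2 * d               # O(T^2 J^2 d)
fact = J * T**2 * d + T * J**2 * d    # O(J T^2 d) + O(T J^2 d)
print(full / fact)                    # 13.636... = 1 / (1/J + 1/T)
```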

2.3 Long-term Temporal Modeling

Short-term windows carry only local features. By aggregating them with a Transformer encoder, the model captures long-term causality and context.
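A sketch of this aggregation stage, under our assumption that each short-term window has already been pooled into a single $D$-dimensional embedding:

```python
import torch
import torch.nn as nn

class LongTermAggregator(nn.Module):
    """Integrates short-term window embeddings (B, N_windows, D) with a
    Transformer encoder so attention spans the whole clip."""
    def __init__(self, dim: int = 64, heads: int = 4, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        ctx = self.encoder(windows)   # cross-window (long-term) attention
        return ctx.mean(dim=1)        # pooled clip-level representation
```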

2.4 Multimodal Context Injection (Image Encoder)

A 2D CNN supplements the environmental context that pose alone misses, compressing pixel-level information to the semantic level before combining it with the pose features.
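A sketch of such an image encoder; the choice of ResNet-18 is our assumption, since the text only says "2D CNN":

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ContextEncoder(nn.Module):
    """2D CNN context encoder: pools pixel-level features into a per-frame
    semantic vector (backbone choice and dimensions are illustrative)."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        self.proj = nn.Linear(512, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> (B, out_dim)
        feat = self.backbone(frames).flatten(1)
        return self.proj(feat)
```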


3. Multimodal Fusion: Information-Theoretic Fusion Analysis

3.1 Robustness of Concat-LN in Noisy Alignment

(1) The Alignment Problem

Real-world data suffers from pose-estimation jitter, frame drops, motion artifacts, and the like, so the spatio-temporal alignment between $X_{pose}$ and $X_{rgb}$ is imperfect.

(2) Proof via Fano's Inequality

This work shows that Concat-LN (concatenation + LayerNorm) is information-theoretically more robust than cross-attention under noisy alignment.

  • Fano's Inequality: the lower bound on the misclassification probability is governed by the conditional entropy $H(Y|X)$: $P_e \ge \frac{H(Y|X) - 1}{\log |Y|}$

  • Case 1: Cross-Attention (hard alignment assumption). Cross-attention assumes an exact correspondence between query ($t$) and key ($t$). When alignment noise occurs, the attention weights diffuse and uncertainty (entropy) increases.
    $H(Y \mid X_{\text{cross}}) = H(Y \mid X_{pose}, X_{rgb}) + \Delta_{\text{noise}}$

  • Case 2: Concat-LN (joint distribution approximation). In Concat-LN, an MLP nonlinearly approximates the joint distribution of the two modalities. Instead of enforcing strict $t \leftrightarrow t$ alignment, it focuses on preserving the total mutual information $I(Y; X_{rgb} \mid X_{pose})$.
    $\therefore H(Y \mid X_{\text{concat}}) < H(Y \mid X_{\text{cross}}) \quad \text{(under noisy alignment)}$

Conclusion: on real-world data with imperfect alignment, Concat-LN is therefore superior in terms of generalization.
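A sketch of such a Concat-LN fusion head (dimensions illustrative); note that no $t \leftrightarrow t$ correspondence is assumed anywhere:

```python
import torch
import torch.nn as nn

class ConcatLNFusion(nn.Module):
    """Concat-LN fusion: concatenate pose and RGB features, LayerNorm, then MLP."""
    def __init__(self, pose_dim: int = 64, rgb_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(pose_dim + rgb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + rgb_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, f_pose: torch.Tensor, f_rgb: torch.Tensor) -> torch.Tensor:
        # The MLP approximates the joint distribution of the two modalities
        # directly, rather than relying on per-timestep alignment.
        z = torch.cat([f_pose, f_rgb], dim=-1)
        return self.mlp(self.norm(z))
```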


4. Optimization Dynamics (Training Strategy)

Rather than simple transfer learning, we designed a three-stage training strategy that preserves the representation manifold.

4.1 Manifold Learning (Stage 1 & 2)

  • Masked Joint Modeling (MJM): analogous to BERT's MLM, joints are masked and reconstructed so that the model learns the intrinsic geometry of the data (sketched after this list).
  • Contrastive Learning: similar actions are mapped close together and dissimilar actions far apart, building a robust metric space.
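A sketch of the MJM objective, under our assumption of a model that maps a masked (B, T, J, C) pose tensor back to reconstructed coordinates:

```python
import torch

def masked_joint_modeling_loss(model, x: torch.Tensor, mask_ratio: float = 0.15):
    """MJM sketch: mask random joints, reconstruct them, score with MSE.
    `model`: (B, T, J, C) -> (B, T, J, C) reconstruction (hypothetical)."""
    B, T, J, C = x.shape
    mask = torch.rand(B, T, J, 1, device=x.device) < mask_ratio
    x_masked = x.masked_fill(mask, 0.0)        # zero out masked joints
    recon = model(x_masked)
    # As in BERT-style MLM, the loss covers only the masked positions
    diff = (recon - x) ** 2 * mask             # mask broadcasts over channels
    return diff.sum() / (mask.sum() * C).clamp(min=1)
```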

4.2 Regularization for Open-Set (Stage 3)

  • Problem: fine-tuning with a plain Cross-Entropy (CE) loss collapses the pretrained manifold into a task-specific one.
  • Solution (Soft-Mapping): combining backbone freezing with label smoothing encourages the model to represent actions absent from training (unseen actions) as probabilistic combinations (interpolations) of the existing basis actions. This also flattens the optimization landscape, helping avoid poor local minima (sketched after this list).
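A sketch of the Stage-3 recipe (freeze + label smoothing); the smoothing value 0.1 is illustrative:

```python
import torch.nn as nn

def build_stage3(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the pretrained backbone and train only a linear head with a
    label-smoothed cross-entropy, preserving the pretrained manifold."""
    for p in backbone.parameters():
        p.requires_grad = False  # backbone freezing
    head = nn.Linear(feat_dim, num_classes)
    # Label smoothing keeps probability mass on non-target classes, so the
    # output can express unseen actions as mixtures of basis actions.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return head, criterion
```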

5. Discussion: Generalizability & Representation Power

The Factorized Spatio-Temporal Representation proposed in this work generalizes beyond the HAR domain.

5.1 Isomorphism to Volumetric Data

๋ณธ ๋ชจ๋ธ์˜ ํ•ต์‹ฌ ๊ตฌ์กฐ์ธ ์‹œ๊ฐ„(tt)๊ณผ ๊ณต๊ฐ„(ss)์˜ ๋ถ„ํ•ด๋Š”, 3D Volumetric Data (e.g., CT, MRI, Point Cloud) ๋ถ„์„๊ณผ ์ˆ˜ํ•™์ ์œผ๋กœ ๋™ํ˜•(Isomorphic) ์ด๋‹ค.

  • Time Sequence (TT) โ†”\leftrightarrow Depth / Slice Sequence (DD)
  • ๋ณธ ์—ฐ๊ตฌ์˜ ๋ฐฉ๋ฒ•๋ก ์€ 3D ๋ฐ์ดํ„ฐ์˜ Anisotropic Resolution (๋น„๋“ฑ๋ฐฉ์„ฑ ํ•ด์ƒ๋„) ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ , Slice ๊ฐ„์˜ ์—ฐ์†์„ฑ(Consistency)์„ ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋ฒ”์šฉ์ ์ธ ์†”๋ฃจ์…˜์œผ๋กœ ํ™•์žฅ ๊ฐ€๋Šฅํ•˜๋‹ค.

5.2 Probabilistic Modeling for Safety-Critical Domains

The distribution-aware output pursued in this work enables not only prediction but also confidence estimation and out-of-distribution (OOD) detection. This makes it a core enabling technology for trustworthy AI in high-stakes domains where error costs are very high, such as medicine, autonomous driving, and robotics.