📄 "Deep Dive into Distribution-Aware HAR: From Mathematical Design to Implementation"

Compositional Action Recognition via Factorized Spatio-Temporal Representation

Abstract:

This report proposes a methodology that models human action not as a fixed class label but as a probabilistic composition of atomic actions.

To this end, we design a Factorized Spatio-Temporal Transformer that learns disentangled spatial and temporal features, and we use Information-Theoretic Multimodal Fusion to learn noise-robust representations.

In addition, a Soft-Mapping Training Strategy encourages the model to output a meaningful state distribution even for unseen actions.

1. Introduction

"Real-world dynamic data (human action, volumetric data, and so on) has no hard boundaries.
For example, countless intermediate states exist between 'walking' and 'running'.
However, many existing deep learning models assume one-hot labels and therefore ignore the inherent ambiguity of the data, which leads to prediction failures on unseen actions."

Problem: Assuming $P(Y|X) = \text{One-hot}$ ignores the continuity and overlap of actions.
Solution: Map actions onto a manifold in the latent space $Z$ and express the output $Y$ as a linear combination of basis actions.
- e.g., $Y_{\text{fast walk}} \approx \alpha \cdot Y_{\text{walk}} + \beta \cdot Y_{\text{run}}$
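
A minimal Python sketch of this composition. It assumes, beyond what the text states, that the weights are non-negative and sum to 1 so that the mixture remains a valid probability distribution. The function name `compose` and the 0.7 / 0.3 weights are illustrative only.

```python
def compose(basis, weights):
    """Mix basis action distributions using convex weights (assumed: >= 0, sum to 1)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    num_classes = len(next(iter(basis.values())))
    return [
        sum(w * basis[name][k] for name, w in weights.items())
        for k in range(num_classes)
    ]


# Hypothetical basis actions as one-hot distributions over [walk, run].
basis = {"walk": [1.0, 0.0], "run": [0.0, 1.0]}

# alpha = 0.7 and beta = 0.3 are made-up values for a "fast walk".
fast_walk = compose(basis, {"walk": 0.7, "run": 0.3})
```

The result is a soft target over the basis classes rather than a single hard label.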

Goal: "This work aims to model actions as probability distributions rather than as a plain classification target, enabling a compositional understanding of unseen actions."

2. Methodology: Designing Inductive Bias

2.1 Short-term Pose Encoder

2.1.1 Input Representation

Each pose sequence is a tensor:

$x \in \mathbb{R}^{T \times J \times C}$

$T$: number of frames
$J$: number of joints
$C = 3$: channels (x, y, normalized depth)

To stabilize the depth channel, the model applies tanh to it:

$z^* = \tanh(z)$
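
As a small illustration, the depth squashing can be sketched in plain Python on a nested list of shape (T, J, 3). The helper name `squash_depth` and the sample values are hypothetical; the report's implementation presumably operates on tensors.

```python
import math


def squash_depth(pose):
    """Apply tanh to the depth channel of a (T, J, 3) nested list; x and y are unchanged."""
    return [[[x, y, math.tanh(z)] for (x, y, z) in frame] for frame in pose]


# T=1 frame, J=2 joints, C=3 channels, with made-up values.
pose = [[[0.1, 0.2, 0.0], [0.3, 0.4, 5.0]]]
squashed = squash_depth(pose)
```

An outlier depth such as 5.0 is mapped into the open interval (-1, 1), while depths near zero are left almost unchanged.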

2.1.2 Joint Embedding

A joint-ID embedding $e_j$ is added to each joint to inject structural information.

$h_{t,j} = W x_{t,j} + e_j$
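
The embedding step can be sketched as follows, assuming $W$ has shape (D, C) and the joint-ID table has shape (J, D). The name `embed_joints` and all numeric values are illustrative, not taken from the report's code.

```python
def embed_joints(pose, W, E):
    """pose: (T, J, C), W: (D, C), E: (J, D). Returns h with shape (T, J, D)."""
    return [
        [
            [sum(W[d][c] * x[c] for c in range(len(x))) + E[j][d] for d in range(len(W))]
            for j, x in enumerate(frame)
        ]
        for frame in pose
    ]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # toy projection, D=2, C=3
E = [[0.0, 0.0], [10.0, 10.0]]              # toy joint-ID embeddings, J=2
pose = [[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]]  # two joints with identical coordinates
h = embed_joints(pose, W, E)
```

Two joints with identical coordinates still receive different embeddings, which is the structural information that $e_j$ contributes.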

2.1.3 Temporal Inductive Bias: Why Learnable PE?
A Transformer needs positional encoding (PE) to inject order information. This work adopts a learnable PE instead of the usual sinusoidal PE.

Limitations of sinusoidal PE (why it is a poor fit for HAR)
Sinusoidal PE, as originally intended in the Transformer, encodes the "absolute position" at fixed frequencies:
$\text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d})$
$\text{PE}(pos, 2i+1) = \cos(pos / 10000^{2i/d})$
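
For reference, a plain-Python sketch of this encoding. Evaluating it on a short clip shows how little the low-frequency channels move over 30 positions. The clip length of 30 and `d_model = 8` are arbitrary choices for illustration.

```python
import math


def sinusoidal_pe(num_positions, d_model):
    """Return a (num_positions, d_model) table of fixed sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # `even` plays the role of 2i in the formula.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


pe = sinusoidal_pe(num_positions=30, d_model=8)
slowest_sin_at_last_frame = pe[29][6]  # lowest-frequency sine channel
```

With these settings the slowest sine channel stays below 0.03 at the last frame, so it carries almost no positional contrast across the clip.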

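However, HAR has the following characteristics:

(1) The temporal length is very short (8 to 30 frames)
→ Over so few positions the fixed-frequency representation goes through little phase variation, so frequency-based position discrimination offers almost no benefit.
(2) Datasets are small and strongly domain-specific
Sinusoidal PE has zero degrees of freedom for learning task-specific temporal patterns
→ so the risk of underfitting is, if anything, higher.
(3) Mismatch with the short-term modeling structure
PoseFormerFactorized consists of
- TemporalBlock (attention over each joint's time series)
- SpatialBlock (attention over the joint set of each frame)
which act as short-term filters
→ "relative local variation" matters more
(that is, the absolute position scale becomes meaningless and a learnable bias is advantageous)

Sinusoidal PE therefore conflicts with the inductive bias of HAR.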

2.1.4 Advantages of Learnable PE (why it suits HAR better)

For HAR, a learnable parameter

$\mathbf{P}_{learn} \in \mathbb{R}^{1 \times T \times 1 \times D}$

is added along the time axis, broadcasting over the batch and joint dimensions.
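
The broadcasting pattern can be sketched in plain Python. Here `P` is a (T, D) table standing in for the $1 \times T \times 1 \times D$ parameter, and its values are fixed by hand purely for illustration; in training they would be updated by gradient descent together with the other weights.

```python
def add_positional(x, P):
    """x: (B, T, J, D) nested list, P: (T, D). Adds P[t] to every joint of frame t."""
    return [
        [
            [[v + P[t][d] for d, v in enumerate(token)] for token in frame]
            for t, frame in enumerate(clip)
        ]
        for clip in x
    ]


# B=2 clips, T=2 frames, J=2 joints, D=1 feature, all zeros.
x = [[[[0.0], [0.0]], [[0.0], [0.0]]] for _ in range(2)]
P = [[1.0], [2.0]]  # hypothetical per-frame offsets
out = add_positional(x, P)
```

Every clip and every joint receives the same offset for a given frame, which is what broadcasting over the batch and joint axes means here.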

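Summary of advantages:

| Aspect | Learnable PE | Sinusoidal PE |
| --- | --- | --- |
| HAR temporal length | Easy to optimize | Little benefit |
| Domain-specific patterns | Learnable | Not learnable |
| Compatibility with short-term blocks | High | Low |
| Small dataset size | Favorable | Unfavorable |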

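Theoretical rationale (soft position bias)

The learnable PE is the solution of the following optimization problem, learned jointly with the network weights:

$\mathbf{P}_{learn}^* = \arg\min_{\mathbf{P}} \mathcal{L}(f(X + \mathbf{P}))$

In effect it acts as a "latent shift bias along the time axis", giving each joint's embedding space task-specific temporal variation.

This has two effects:

- It provides an auxiliary signal that makes the local motion derivative $\Delta_t = X_{t+1} - X_t$ easier to distinguish.
- It lets the temporal block's self-attention, computed from $Q(X + P), K(X + P), V(X+P)$, separate temporal differences more effectively.

In conclusion, given the data scale and structural characteristics of HAR, learnable PE is more expressive and more optimization-friendly.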

2.2 Factorized Spatio-Temporal Encoder (Short-term)

Factorized Architecture: Implementation Details

1. Spatial-Temporal Disentanglement
The model is designed to learn "pose (Geometry)" and "motion (Dynamics)" as separate, disentangled latent factors.

Temporal Block (Per-Joint):

Tensor View: $(B, T, J, D) \rightarrow (B \cdot J, T, D)$
Attention is applied independently to each joint's temporal trajectory to extract pose-invariant motion.

Spatial Block (Per-Frame):

Tensor View: $(B, T, J, D) \rightarrow (B \cdot T, J, D)$
Attention is applied to the relations among joints at each time step to extract time-invariant geometry.
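
The two tensor views can be sketched with nested lists. In a tensor library this would be a permute followed by a reshape; the helper names `temporal_view` and `spatial_view` are hypothetical, and each token is labelled `10 * t + j` so the regrouping is easy to follow.

```python
def temporal_view(x):
    """(B, T, J, D) -> (B*J, T, D): one time series per joint."""
    return [
        [clip[t][j] for t in range(len(clip))]
        for clip in x
        for j in range(len(clip[0]))
    ]


def spatial_view(x):
    """(B, T, J, D) -> (B*T, J, D): one joint set per frame."""
    return [frame for clip in x for frame in clip]


# B=1, T=2, J=3, D=1; token value encodes (t, j) as 10 * t + j.
x = [[[[10 * t + j] for j in range(3)] for t in range(2)]]
per_joint = temporal_view(x)   # 3 sequences of length 2
per_frame = spatial_view(x)    # 2 sets of 3 joints
```

Attention applied to `per_joint` mixes information only across time, and attention applied to `per_frame` mixes information only across joints.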

2. Complexity Proof (Mathematical Derivation)

This factorized design brings not only a modeling benefit but also a substantial gain in computational efficiency.

Step 1 (Full Attention): $QK^T$ is computed over $N=T \cdot J$ tokens.
$\mathcal{O}_{Full} = \mathcal{O}((TJ)^2 \cdot d) = \mathcal{O}(T^2 J^2 d)$
Step 2 (Factorized Attention): Attention is computed separately along the temporal and spatial axes.
$\mathcal{O}_{Fact} = \underbrace{\mathcal{O}(J \cdot T^2 d)}_{\text{Temporal}} + \underbrace{\mathcal{O}(T \cdot J^2 d)}_{\text{Spatial}}$
Step 3 (Efficiency Ratio): $\text{Ratio} = \frac{T^2 J d + J^2 T d}{T^2 J^2 d} = \frac{1}{J} + \frac{1}{T}$

Insight: For $T=30, J=25$, the ratio is $\frac{1}{25} + \frac{1}{30} \approx 0.073$, so the attention cost drops by a factor of about 13.6. This is a key technique that allows deep models to be stacked while preserving resolution when processing memory-constrained 3D Volumetric Data (CT, MRI).
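
The ratio can be checked numerically with a few lines of Python. The embedding width `d = 64` is an arbitrary assumption; it cancels out of the ratio.

```python
def attention_cost(T, J, d):
    """Return (full, factorized) attention cost counts from the derivation above."""
    full = (T * J) ** 2 * d
    factorized = J * T**2 * d + T * J**2 * d
    return full, factorized


full, factorized = attention_cost(T=30, J=25, d=64)
ratio = factorized / full      # equals 1/J + 1/T
speedup = full / factorized    # equals T*J / (T + J), roughly 13.6 here
```

The speedup equals $TJ / (T + J)$, so it grows as either the clip length or the joint count increases.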

2.3 Long-term Temporal Modeling

Short-term windows carry only local features. Aggregating them again with a Transformer Encoder lets the model capture long-term causality and context.

2.4 Multimodal Context Injection (Image Encoder)

A 2D CNN is used to supplement the environmental context that pose misses. It compresses pixel-level information to the semantic level, which is then combined with the pose feature.

3. Multimodal Fusion: Information-Theoretic Fusion Analysis

3.1 Robustness of Concat-LN in Noisy Alignment

(1) The Alignment Problem

In real-world data, pose estimation jitter, frame drops, motion artifacts, and similar effects mean that the spatio-temporal alignment between $X_{pose}$ and $X_{rgb}$ is not perfect.

(2) Proof via Fano's Inequality

This work argues that Concat-LN (Concatenation + LayerNorm) is, in information-theoretic terms, more robust than Cross-Attention under noisy alignment.
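
As a reading aid, the Concat-LN operation itself can be sketched as follows. This is a simplified version under stated assumptions: the LayerNorm has no learnable gain or bias, the MLP that would follow is omitted, and the feature values are made up.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def concat_ln(pose_feat, rgb_feat):
    """Concatenate the two modality features, then apply LayerNorm jointly."""
    return layer_norm(list(pose_feat) + list(rgb_feat))


# Hypothetical features on very different scales.
fused = concat_ln([1.0, 2.0], [100.0, 200.0])
```

No frame-to-frame correspondence between the two inputs is computed anywhere, which is the property the argument below relies on.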

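Fano's Inequality: The misclassification probability $P_e$ is bounded below in terms of the conditional entropy $H(Y|X)$ (entropy and logarithm in bits): $P_e \ge \frac{H(Y|X) - 1}{\log |Y|}$
Case 1: Cross-Attention (Hard Alignment Assumption) Cross-Attention assumes an exact correspondence between Query( $t$ ) and Key( $t$ ). Under alignment noise the attention weights diffuse (spread out), which increases uncertainty (entropy). Here $\Delta_{\text{noise}} > 0$ denotes the entropy added by misalignment.
$H(Y \mid X_{\text{cross}}) = H(Y \mid X_{pose}, X_{rgb}) + \Delta_{\text{noise}}$
Case 2: Concat-LN (Joint Distribution Approximation) In Concat-LN, an MLP nonlinearly approximates the **joint distribution** of the two modalities. Rather than enforcing strict $t \leftrightarrow t$ alignment, it focuses on preserving the total mutual information $I(Y; X_{rgb} \mid X_{pose})$.
$\therefore H(Y \mid X_{\text{concat}}) < H(Y \mid X_{\text{cross}}) \quad \text{(under noisy alignment)}$

Conclusion: A smaller conditional entropy gives a lower Fano floor on $P_e$, so on real-world data with imperfect alignment Concat-LN is superior in terms of generalization.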
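
To make the bound concrete, the sketch below evaluates the Fano lower bound for two conditional entropies. The class count of 64 and the entropy values of 2 and 3 bits are hypothetical numbers chosen for illustration, not measurements from this work.

```python
import math


def fano_lower_bound(cond_entropy_bits, num_classes):
    """Lower bound on error probability: (H(Y|X) - 1) / log2(|Y|), clipped at 0."""
    bound = (cond_entropy_bits - 1.0) / math.log2(num_classes)
    return max(0.0, bound)


num_classes = 64                                   # assumed label-set size
bound_low_entropy = fano_lower_bound(2.0, num_classes)   # e.g. less alignment noise
bound_high_entropy = fano_lower_bound(3.0, num_classes)  # e.g. more alignment noise
```

With these numbers, one extra bit of conditional entropy doubles the error floor from 1/6 to 1/3. Note that the bound becomes vacuous (zero) once $H(Y|X)$ is at most 1 bit.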

4. Optimization Dynamics (Training Strategy)

Rather than plain transfer learning, we designed a three-stage training strategy intended to preserve the representation manifold.

4.1 Manifold Learning (Stage 1 & 2)

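Masked Joint Modeling (MJM): Similar to BERT's MLM, joints are masked and then reconstructed, so the model learns the intrinsic geometry of the data.
Contrastive Learning: Similar actions are mapped close together and dissimilar actions far apart, building a robust metric space.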

4.2 Regularization for Open-Set (Stage 3)

Problem: Fine-tuning with a plain Cross-Entropy (CE) loss can collapse the pretrained manifold into a task-specific one.
Solution (Soft-Mapping): Backbone freezing is combined with label smoothing so that the model can express even actions absent from the training data (unseen actions) as a probabilistic combination (interpolation) of the existing basis actions. This flattens the optimization landscape and helps avoid poor local minima.
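
The label-smoothing half of this strategy can be sketched as follows, using the common uniform-smoothing form. The value `epsilon = 0.1` and the four-class setup are assumptions for illustration, and backbone freezing is not shown.

```python
import math


def smooth_labels(target_index, num_classes, epsilon):
    """Uniform label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off_value = epsilon / num_classes
    return [
        off_value + (1.0 - epsilon if k == target_index else 0.0)
        for k in range(num_classes)
    ]


def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)


soft_target = smooth_labels(target_index=0, num_classes=4, epsilon=0.1)
overconfident = [0.997, 0.001, 0.001, 0.001]
calibrated = soft_target  # a prediction that matches the soft target
loss_overconfident = cross_entropy(soft_target, overconfident)
loss_calibrated = cross_entropy(soft_target, calibrated)
```

Against the smoothed target, a near one-hot prediction incurs a higher loss than a prediction that keeps some mass on the other classes, which discourages the classifier head from collapsing onto hard labels.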

5. Discussion: Generalizability & Representation Power

The Factorized Spatio-Temporal Representation proposed in this work is general and not limited to the HAR domain.

5.1 Isomorphism to Volumetric Data

The decomposition into time ( $t$ ) and space ( $s$ ), which is the core structure of the model, is mathematically isomorphic to the analysis of 3D Volumetric Data (e.g., CT, MRI, Point Cloud).

Time Sequence ( $T$ ) $\leftrightarrow$ Depth / Slice Sequence ( $D$ )
The methodology can be extended into a general-purpose solution that addresses the anisotropic resolution problem of 3D data and efficiently learns the consistency between slices.

5.2 Probabilistic Modeling for Safety-Critical Domains

The Distribution-Aware Output pursued in this work goes beyond plain prediction and enables measurement of model confidence and OOD (Out-of-Distribution) detection. This is a foundational technology for building Trustworthy AI in high-stakes domains where errors are very costly, such as medicine, autonomous driving, and robotics.
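
One simple way to turn an output distribution into a confidence signal is to threshold its entropy. This is a generic heuristic shown for illustration, not a method described in this report, and the threshold of 1.5 bits is a hypothetical value that would need to be tuned on held-out data.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy of a predicted distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def is_ood(probs, threshold_bits):
    """Flag an input as possibly out-of-distribution when entropy exceeds the threshold."""
    return predictive_entropy(probs) > threshold_bits


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
flag_confident = is_ood(confident, threshold_bits=1.5)
flag_uncertain = is_ood(uncertain, threshold_bits=1.5)
```

A peaked distribution passes as in-distribution, while a flat distribution over four classes reaches the maximum entropy of 2 bits and is flagged.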
