Yuning AI

Technical Blog

Field notes on model architecture, inference systems, and agent execution.

A running collection of bilingual writing on transformer principles, GPU inference structure, runtime harnesses, and other implementation-critical topics.

Model Principles

Principles of the Transformer Architecture

A practical reading of self-attention, token mixing, and the structural reasons the Transformer became the base of modern language models.

April 22, 2026 · 7 min read

Transformer · Attention · LLM Foundations

Architecture Research

Deeper Research on Transformer Architecture

Look past the headline design and examine scaling, positional schemes, efficiency variants, and the research directions that made the Transformer more useful in practice.

April 21, 2026 · 8 min read

Scaling · Positional Encoding · Architecture Research

Inference Infrastructure

NVIDIA H100 Architecture and Why Inference Needs LPX

A system-level look at H100: compute units, memory hierarchy, and why inference workloads depend on a carefully engineered low-precision execution path.

April 20, 2026 · 8 min read

NVIDIA H100 · Inference · LPX

Agent Engineering

Research on Harness for Agent Systems

Why agent systems need a harness layer: execution control, tool isolation, observability, retries, and the boundary between model reasoning and production software.

April 19, 2026 · 7 min read

Harness · Agents · Runtime