Activation Trace Mode¶
Note
This page covers Extension 0 (module forward hooks), which is the implemented tracing mechanism. Extensions 1-3 are design sketches for future work and are not yet implemented.
Overview¶
Activation trace mode is a zero-overhead-when-off, env-gated mechanism for
dumping per-layer activation statistics during FastVideo inference. Its primary
use case is parity debugging across model ports: enable tracing on both
FastVideo and the upstream reference implementation, then diff the resulting
JSONL files to find the first divergent layer.
The mechanism is intentionally narrow. It doesn't replace general logging, profiling, or function tracing. It answers one question: "at which layer do FastVideo and the reference model first produce different numbers?"
When to use¶
- Investigating numerical drift between FastVideo and an upstream reference.
- Debugging mid-pipeline divergence (e.g., one block produces wrong output while earlier blocks match).
- Validating that a refactor preserves bf16 noise-floor behavior across many layers.
When NOT to use¶
| Goal | Use instead |
|---|---|
| General logging | init_logger(__name__) |
| Per-stage timing | FASTVIDEO_STAGE_LOGGING |
| Profiling kernel timings | FASTVIDEO_TORCH_PROFILER_DIR (see Profiling) |
| Function-call tracing | FASTVIDEO_TRACE_FUNCTION (heavy) |
Quickstart¶
```bash
FASTVIDEO_TRACE_ACTIVATIONS=1 \
FASTVIDEO_TRACE_LAYERS="^block\.layers\.[0-9]+$" \
FASTVIDEO_TRACE_STATS="abs_mean,sum,max,shape" \
FASTVIDEO_TRACE_OUTPUT="/tmp/fv_trace.jsonl" \
python examples/inference/basic/basic_magi_human.py
```

Each line in /tmp/fv_trace.jsonl is a JSON record:

```json
{"module": "block.layers.0", "tensor": "out", "step": 0, "abs_mean": 1.234, "sum": -5.678, "max": 9.012, "shape": [1, 4096, 5120]}
```
Configuration¶
| Env var | Default | Description |
|---|---|---|
| `FASTVIDEO_TRACE_ACTIVATIONS` | `False` | Master toggle. When unset or set to 0, there is zero overhead in the production hot path. |
| `FASTVIDEO_TRACE_LAYERS` | `""` (all) | Python regex filter applied to `model.named_modules()` names. An empty string matches all modules. |
| `FASTVIDEO_TRACE_STATS` | `"abs_mean,sum"` | Comma-separated stats to compute. Available: `abs_mean`, `sum`, `min`, `max`, `mean`, `std`, `shape`, `dtype`. |
| `FASTVIDEO_TRACE_OUTPUT` | `"/tmp/fv_trace_<pid>.jsonl"` | Output file path. `<pid>` is replaced with the process ID at runtime. |
| `FASTVIDEO_TRACE_STEPS` | `""` (all) | Comma-separated denoising step indices to capture. An empty string captures all steps. |
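
As a rough illustration of how these variables fit together, the sketch below parses them from a plain mapping. `TraceConfig` and `parse_trace_config` are hypothetical names, not FastVideo APIs. The toggle follows the not-equal-to-"0" test described under Troubleshooting, and the defaults are copied from the table; the real parsing in `fastvideo/envs.py` may differ.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional, Set, Tuple

DEFAULT_STATS = "abs_mean,sum"
DEFAULT_OUTPUT = "/tmp/fv_trace_<pid>.jsonl"


@dataclass(frozen=True)
class TraceConfig:
    """Hypothetical container for the parsed settings."""
    layers: re.Pattern          # regex applied to module names
    stats: Tuple[str, ...]      # stat names, in the order given
    output: str                 # output path with <pid> substituted
    steps: Optional[Set[int]]   # None means "capture every step"


def parse_trace_config(env: Mapping[str, str]) -> Optional[TraceConfig]:
    """Return None when tracing is off, else the parsed settings."""
    toggle = env.get("FASTVIDEO_TRACE_ACTIVATIONS")
    if toggle is None or toggle == "0":
        return None  # off: the caller would register no hooks

    # An empty pattern matches every name under re.search.
    layers = re.compile(env.get("FASTVIDEO_TRACE_LAYERS", ""))
    raw_stats = env.get("FASTVIDEO_TRACE_STATS", DEFAULT_STATS)
    stats = tuple(s.strip() for s in raw_stats.split(",") if s.strip())
    output = env.get("FASTVIDEO_TRACE_OUTPUT", DEFAULT_OUTPUT)
    output = output.replace("<pid>", str(os.getpid()))
    raw_steps = env.get("FASTVIDEO_TRACE_STEPS", "")
    # An empty set is falsy, so "no steps listed" becomes None (all steps).
    steps = {int(s) for s in raw_steps.split(",") if s.strip()} or None
    return TraceConfig(layers, stats, output, steps)
```

Calling `parse_trace_config(os.environ)` once at startup and branching on `None` keeps the off path down to a single lookup.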
Workflow: parity-debug a model port¶
- Set up a tightly controlled comparison: a parity test or a small standalone script that loads both the FastVideo model and the upstream reference with identical inputs and seeds.
- Run the FastVideo side with tracing on, using the environment variables shown in the Quickstart.
- Run the upstream side. The upstream repo needs separate instrumentation. See "Hooking the upstream side" below.
- Sort both files by `(module, step)` if needed, then diff them.
- The first divergent line identifies the first layer where FastVideo and the upstream produce different outputs. Start debugging there.
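
A plain textual diff flags every last-digit difference, so a tolerance-aware comparison can be easier to read. The following is a minimal sketch, assuming both traces use the record layout from the Quickstart and identical module names. `load_trace` and `first_divergence` are hypothetical helpers, and the `rel_tol` default is a placeholder to tune against your bf16 noise floor.

```python
import json
import math
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str, int]  # (module, tensor, step)


def load_trace(lines: Iterable[str]) -> Dict[Key, dict]:
    """Index JSONL records by (module, tensor, step), keeping file order."""
    records: Dict[Key, dict] = {}
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            records[(rec["module"], rec["tensor"], rec["step"])] = rec
    return records


def first_divergence(ours: Dict[Key, dict], ref: Dict[Key, dict],
                     stats=("abs_mean", "sum"),
                     rel_tol: float = 1e-2) -> Optional[Tuple[Key, str]]:
    """Return the first (key, stat) whose values differ beyond rel_tol."""
    for key, rec in ours.items():  # dicts preserve insertion (file) order
        other = ref.get(key)
        if other is None:
            return key, "<missing in reference>"
        for stat in stats:
            if not math.isclose(rec[stat], other[stat], rel_tol=rel_tol):
                return key, stat
    return None
```

Pass two open file handles to `load_trace` and feed the results to `first_divergence`. Because the walk follows the FastVideo file's order, the reported key is the earliest divergent record in execution order.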
Performance impact¶
When the master toggle is off, overhead is nil. When on, the cost is proportional to how broadly the layer regex matches.
What runs when tracing is on¶
For every match against model.named_modules(), FastVideo registers an
ActivationStatHook that runs after the module's forward returns:
- Walks the (possibly nested) output `tuple`/`list`/`dict` and extracts every `torch.Tensor` leaf.
- Computes each enabled stat on the tensor's `.detach().float()` view. The cast is required for bf16 inputs because some reductions are not stable in bf16.
- Writes one JSON record per tensor to the line-buffered JSONL sink.
Cost is roughly O(num_matched_modules × num_output_tensors × num_stats × tensor_numel)
per forward, dominated by tensor_numel for value-stats (abs_mean, sum,
min, max, mean, std). The shape and dtype stats are O(1).
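
The cost formula can be turned into a back-of-the-envelope estimate. The helper below is a hypothetical sketch, not part of FastVideo: it counts the elements read by value-stat reductions in one forward pass and treats `shape` and `dtype` as free.

```python
from math import prod
from typing import Sequence

VALUE_STATS = {"abs_mean", "sum", "min", "max", "mean", "std"}


def elements_reduced_per_forward(num_modules: int,
                                 tensors_per_module: int,
                                 stats: Sequence[str],
                                 shape: Sequence[int]) -> int:
    """Elements read by stat reductions in one forward pass (rough model)."""
    value_stats = sum(1 for s in stats if s in VALUE_STATS)
    return num_modules * tensors_per_module * value_stats * prod(shape)


# 40 matched blocks, one output tensor each, shaped like the Quickstart record.
full = elements_reduced_per_forward(
    40, 1, ["abs_mean", "sum", "max", "shape"], [1, 4096, 5120])
layout_only = elements_reduced_per_forward(
    40, 1, ["shape", "dtype"], [1, 4096, 5120])
print(full, layout_only)
```

Under these assumptions the first configuration reads about 2.5 billion elements per forward, while the layout-only configuration reads none.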
Cost-shaping knobs¶
The default config (empty FASTVIDEO_TRACE_LAYERS is treated as .*, default
two stats, all steps) is intentionally blunt — useful only for a one-shot
smoke run. For real debugging, scope down:
| Knob | Effect |
|---|---|
| Tighten `FASTVIDEO_TRACE_LAYERS` to a regex matching <50 modules | Linear reduction in hook count |
| Drop unused stats from `FASTVIDEO_TRACE_STATS` (`mean`, `std`, `min`, `max`) if you only need divergence detection | One full reduction per stat saved per matched module per forward |
| Set `FASTVIDEO_TRACE_STEPS="0,15,31"` for a 32-step run | ~10x reduction vs all-steps (the hook still fires but exits early when the step doesn't match) |
| Use `shape` + `dtype` only on layers where you only care about layout | Skips tensor reductions entirely on those layers |
Disk¶
Output is line-buffered (open(..., buffering=1)), so every record flushes
on write. A typical 32-step run that traces 40 DiT blocks with 4 stats
writes roughly 5K records — about 1 MB of JSONL. Point
FASTVIDEO_TRACE_OUTPUT at a fast local disk for parity runs; slow network
mounts will dominate runtime once tracing is on.
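
To make the flushing behavior concrete, here is a minimal sketch of a line-buffered JSONL sink. `LineBufferedJsonlSink` is a hypothetical stand-in, not the actual `JsonlSink` class, and opening in append mode is an assumption.

```python
import json
import os
import tempfile


class LineBufferedJsonlSink:
    """Hypothetical stand-in for a line-buffered JSONL writer."""

    def __init__(self, path: str) -> None:
        parent = os.path.dirname(os.path.abspath(path))
        os.makedirs(parent, exist_ok=True)  # raises OSError if not creatable
        # buffering=1 selects line buffering in text mode.
        # Append mode is an assumption of this sketch.
        self._file = open(path, "a", buffering=1, encoding="utf-8")

    def write(self, record: dict) -> None:
        # One record per line; the trailing newline triggers the flush.
        self._file.write(json.dumps(record) + "\n")

    def close(self) -> None:
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "nested", "fv_trace.jsonl")
sink = LineBufferedJsonlSink(path)
sink.write({"module": "block.layers.0", "tensor": "out", "step": 0,
            "shape": [1, 4096, 5120]})
# The record is already readable on disk before close() is called.
with open(path, encoding="utf-8") as f:
    first = json.loads(f.readline())
sink.close()
```

Flushing on every record means a crashed run still leaves a usable partial trace, at the price of one small write per record.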
Troubleshooting¶
"I set the env var but no JSONL file appears"¶
Three things to check, in order:
- Toggle semantics. `FASTVIDEO_TRACE_ACTIVATIONS` uses a strict not-equal-to-`"0"` test. Setting it to `1`, `true`, or even `""` enables tracing. Only an unset variable or `FASTVIDEO_TRACE_ACTIVATIONS=0` disables it.
- Module exposure. The hook attaches to `pipeline.modules.get("transformer")` at the end of `post_init`. If your pipeline does not expose a module under that key (e.g. a non-standard custom pipeline whose DiT is reachable only via `pipeline.modules["sr_transformer"]`), the trace is silently a no-op. Check the `Activation trace attached to N modules` log line at startup; if `N=0`, either the regex didn't match anything or the expected module isn't exposed.
- Output path failure. The parent of `FASTVIDEO_TRACE_OUTPUT` is auto-created. If creation fails (permissions, read-only mount), `JsonlSink.__init__` raises at startup; look for an `OSError` early in the log.
"My regex isn't filtering the way I expect"¶
FASTVIDEO_TRACE_LAYERS is compiled with re.compile(spec) and matched
with pattern.search(name). Two consequences:
- `search`, not `fullmatch`. `block.layers` matches `transformer.block.layers.0.attn`. Anchor with `^...$` if you want exact matches.
- Module names use Python dot notation (`transformer.block.layers.0`), not slashes. The `.` in your regex is a metacharacter; escape it as `\.` if you want a literal dot.
Run once with FASTVIDEO_LOGGING_LEVEL=DEBUG to see the
Activation trace attached to N modules (pattern=...) line. N is the
ground truth for how many modules survived your regex.
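
The difference between an unanchored and an anchored pattern is easy to check offline. The sketch below uses a made-up list of module names (`transformer.final_norm` in particular is hypothetical) and the same `re.compile` plus `search` combination described above.

```python
import re

names = [
    "transformer.block.layers.0",
    "transformer.block.layers.0.attn",
    "transformer.block.layers.10",
    "transformer.final_norm",
]


def matched(spec: str) -> list:
    """Return the names that survive a search-based filter."""
    pattern = re.compile(spec)
    return [n for n in names if pattern.search(n)]


# Unanchored: search() finds the substring anywhere, so submodules match too.
loose = matched(r"block.layers")
# Anchored with escaped dots: only the block modules themselves match.
strict = matched(r"^transformer\.block\.layers\.[0-9]+$")
print(loose)
print(strict)
```

Here `loose` keeps the first three names, including the `attn` submodule, while `strict` keeps only the two block-level modules.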
"Stats are NaN or <error: ...> for some layers"¶
Some outputs hit numerical edge cases:
- `mean`/`std` on a 0-dim tensor returns NaN.
- `abs_mean` on an empty tensor or one full of `inf` returns NaN.
- Non-tensor outputs (e.g. a Python `bool` from a verification gate) are silently skipped; the hook only walks `torch.Tensor` leaves.
Dropping mean/std and using abs_mean/max is more robust. For
layout-only debugging, shape+dtype never fail.
"Tracing slows the run by 5x"¶
You're probably matching too broadly. An empty FASTVIDEO_TRACE_LAYERS is
treated as .* and matches every named module — for a 15B-param DiT
that's hundreds of submodules, each running stat reductions on every
forward. Tighten to a single block depth: ^transformer\.blocks\.\d+$
typically matches a few dozen modules, which is a manageable trace.
"Trace records are missing tensors I expect"¶
The hook walks tuple / list / dict outputs recursively but does
not unpack custom dataclasses or named tuples — those are silently
skipped. If your module returns
BlockOutput(hidden_states=..., attn_logits=...), no records are
emitted. Workaround: either return a plain dict ({"hidden_states": ..., "attn_logits": ...})
or attach the hook to a deeper module that already returns a raw tensor.
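
The traversal rule can be sketched without PyTorch by treating `float` as a stand-in for `torch.Tensor`. `iter_leaves` and the dotted leaf names are hypothetical, and the exact-type check is an assumption chosen to reproduce the behavior described above; the real hook may be implemented differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Iterator, Tuple

LEAF = float  # stand-in for torch.Tensor in this sketch


def iter_leaves(out: Any, name: str = "out") -> Iterator[Tuple[str, Any]]:
    """Yield (name, leaf) pairs from plain tuple/list/dict containers only."""
    if isinstance(out, LEAF):
        yield name, out
    elif type(out) in (tuple, list):  # exact types: named tuples are skipped
        for i, item in enumerate(out):
            yield from iter_leaves(item, f"{name}.{i}")
    elif type(out) is dict:
        for key, item in out.items():
            yield from iter_leaves(item, f"{name}.{key}")
    # Anything else (dataclass, named tuple, bool, None) is silently ignored.


@dataclass
class BlockOutput:
    hidden_states: float
    attn_logits: float


result = BlockOutput(hidden_states=1.0, attn_logits=2.0)
skipped = list(iter_leaves(result))
# Workaround: expose the fields as a plain dict before the hook sees them.
as_dict = {f.name: getattr(result, f.name) for f in fields(result)}
found = list(iter_leaves(as_dict))
```

With the dataclass, `skipped` is empty; after conversion to a plain dict, `found` contains one entry per field.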
"I want tracing inside a torch.compile'd region"¶
Module forward hooks run on the eager wrapper. If the entire module is compiled, the hook sees only the wrapped op's output, not internal FX nodes. This is by design — Extension 1 (FX backend rewrite) in Future extensions is the planned path when inside-graph granularity is required.
Architecture (Extension 0: module forward hooks)¶
At pipeline initialization, attach_activation_trace() reads the env vars once.
If FASTVIDEO_TRACE_ACTIVATIONS is unset or set to 0, the function returns
immediately and no hooks are registered. If tracing is on, it walks
model.named_modules(), filters by the layer regex, and registers an
ActivationStatHook on each matching module.
During the forward pass, each hook fires after its module completes, computes the requested stats on the output tensor, and appends a JSON record to the output file.
```text
ComposedPipelineBase
└─ attach_activation_trace()
   ├─ reads env vars (once at startup)
   ├─ if off: returns None immediately
   └─ if on: walks named_modules()
      └─ registers ActivationStatHook on matching modules
         └─ on each forward: compute stats → append JSONL
```
Zero-overhead-when-off guarantee¶
- The env var check happens once at startup inside `attach_activation_trace()`.
- If the env var is unset or set to 0, the function returns `None` immediately.
- No hooks are registered. No branches are added to the production forward path.
- The only cost when tracing is off is one env var lookup at pipeline initialization, which takes under a microsecond.
Hooking the upstream side¶
The upstream reference repo isn't part of FastVideo, so it can't read FastVideo env vars directly. Two options:
Option 1: Inline patch in your local clone of the upstream repo. Add
register_forward_hook calls in the same shape as ActivationStatHook. Clean
up afterward with git stash or git checkout HEAD -- <file>.
Option 2: Wrapper script. Write a small Python harness that imports the
upstream model, walks its named_modules(), and attaches hooks externally.
This is the same pattern used in
tests/local_tests/transformers/_debug_magi_human_block_parity.py.
The add-model-08-trace skill at .agents/skills/add-model-08-trace/
provides a script template for this purpose.
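
For orientation, a wrapper-style harness along the lines of Option 2 might look like the sketch below. It is not the skill's template: `attach_stat_hooks` is a hypothetical helper, the `nn.Sequential` model is a tiny stand-in for the upstream model, and records are collected in a list rather than written to a file. It also omits the `step` field, which a real harness would have to track itself.

```python
import json
import re

import torch
from torch import nn


def attach_stat_hooks(model: nn.Module, layer_regex: str, sink: list) -> list:
    """Register forward hooks on matching modules; return the hook handles."""
    pattern = re.compile(layer_regex)
    handles = []
    for name, module in model.named_modules():
        if not pattern.search(name):
            continue

        def hook(_module, _inputs, output, name=name):
            if not isinstance(output, torch.Tensor):
                return  # this sketch handles bare tensor outputs only
            x = output.detach().float()  # cast so bf16 reductions are stable
            sink.append(json.dumps({
                "module": name,
                "tensor": "out",
                "abs_mean": x.abs().mean().item(),
                "sum": x.sum().item(),
                "shape": list(x.shape),
            }))

        handles.append(module.register_forward_hook(hook))
    return handles


# Tiny stand-in for the upstream model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
records: list = []
handles = attach_stat_hooks(model, r"^[02]$", records)
with torch.no_grad():
    model(torch.ones(1, 4))
for h in handles:
    h.remove()  # detach hooks so later runs are untraced
```

Keeping the record keys identical to the FastVideo trace (`module`, `tensor`, stat names) is what makes the two files directly comparable.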
Future extensions (design only, not yet implemented)¶
Extension 1: FX/Dynamo backend graph rewrite¶
Granularity: per-FX-node (every matmul, every add).
Mechanism: a torch.compile backend that takes the captured GraphModule
and inserts logger nodes after each op. Compiles into a separate artifact from
the production graph.
Off semantics: zero overhead. The production compile path is untouched.
When to add: if you need to trace inside a torch.compile'd graph and
Extension 0 is too coarse.
Build cost: roughly 1-2 days. Reference:
torchao.quantization.pt2e._numeric_debugger.
Extension 2: AST source injection at import time¶
Granularity: per-line (between any two Python statements).
Mechanism: an importlib loader hook rewrites Python source AST at module
import time, inserting if TRACE: dump(...) statements. The decision is made
once at import.
Off semantics: zero overhead. If the env var is off at import time, source is loaded as-is.
When to add: if you need per-line granularity that even FX-node-level can't provide. This is almost never the right choice.
Build cost: roughly 1 week. Brittle and hard to debug.
Extension 3: __torch_dispatch__ / TorchDispatchMode¶
Granularity: per-op (every dispatcher call: matmul, add, view, etc.).
Mechanism: a TorchDispatchMode context manager that intercepts all ops at
the dispatcher level.
Off semantics: zero overhead. PyTorch's dispatcher only invokes mode hooks when a mode is active.
When on: significant overhead. Every op pays a Python callback cost. Triton kernels bypass it.
When to add: useful for quantization or dtype debugging where module-level granularity isn't enough.
Build cost: roughly 1 day. Reference:
torch.utils._python_dispatch.TorchDispatchMode.
Comparison with similar tools¶
| Tool | Pattern | FastVideo equivalent |
|---|---|---|
| SGLang `--debug-tensor-dump-output-folder` | env-gated forward hooks at startup | Extension 0 (this) |
| TransformerEngine `DumpTensors` | config-driven selective dumps | Extension 0 (env-driven) |
| HuggingFace `output_hidden_states=True` | source-level boolean gating | Not used; Extension 0 avoids model code edits |
| torchao numeric debugger | FX pass + node-level loggers | Extension 1 (future) |
| W&B `wandb.watch()` | runtime forward hooks (always on once registered) | Extension 0 has a similar mechanism, but gated off by default |
Implementation references¶
- Module: `fastvideo/hooks/activation_trace.py`
- Env vars: `fastvideo/envs.py` (`FASTVIDEO_TRACE_ACTIVATIONS` and friends)
- Pipeline integration: `fastvideo/pipelines/composed_pipeline_base.py`
- Tests: `fastvideo/tests/hooks/test_activation_trace.py`
- Companion skill (for ad-hoc port investigations): `.agents/skills/add-model-08-trace/`
Changelog¶
| Date | Change |
|---|---|
| 2026-05-01 | Initial Extension 0 (module forward hooks) implementation. Extensions 1-3 designed but not implemented. |