Kernel Debugging Playbook
cannbot-skills: CANNBot is a family of agents that improve development efficiency for CANN; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills
Use this playbook when an existing kernel is wrong, unstable, warning-heavy, or unclear. Debug in layers. Do not jump between random fixes.
Goal
Find the first broken assumption. Fix the model, then fix the kernel. Do not keep stacking patches on top of an unclear design.
Fast-path: match your symptom first
Most bug reports match one of the patterns below. Try these before running the full layer-by-layer review further down.
Symptom-to-check map
- Wrong everywhere → check formula, transpose/layout, cast order, `shape_bindings`
- Only large shapes fail → check tile budgets, split mode, estimator choice, counter ownership across nested loops
- Only tail tiles fail → check `valid_*` handling, half-row vec writeback split, GM boundary slicing
- Autosync warnings or weird pipeline stalls → check same-side vs cross-side misunderstanding, event family grouping, counter reuse across different lifetimes, unsupported instruction not covered by autosync pairing
- Local event timeout / already-set (`_tmp_*valid_*`, `_tmp_*ready_*`) → classify the event failure first, dump autosync-expanded instructions, then compare the failing family against a stable kernel before changing the DSL
- Simulator passes, generated path looks suspicious → check parser lowering, codegen handlers, explicit event or mutex placement, assumptions hidden by simulator convenience
- V2 `wait_vec` / `wait_cube` timeout → see the V2 timeout section below; almost always the other lane's actor crashed silently
- Kernel only fails when run alongside other tests → see the V2 parallel-process section below
Event pairing workflow for local event failures
Use this when V2 reports a lane-local event problem such as:
- `event_wait timeout: {'name': '_tmp_sevent_valid_fix_0', ...}`
- `event_set on already-set flag: _tmp_sevent_valid_l1_0 ...`
Debugging sequence:
- Classify the failure from the runtime message.
  - `event_wait timeout` usually means a missing `event_set` for the same family.
  - `event_set on already-set flag` usually means a duplicate `event_set` before a matching `event_wait` consumed the token.
- Read the counters literally.
  - On the simulator path, `preset=True` events start with one published token.
  - If a timeout reports `set_count == wait_count`, the preset token was consumed and the next producer-side `event_set` never happened.
- Build the kernel instructions before inspecting split/autosync output.
  - Call the `@kernel` once with placeholder `GMTensor(...)` arguments so `kernel.instructions` is populated.
- Dump the autosync-expanded lane instructions.
  - Use `split_instructions(...)` plus `insert_auto_sync(...)`, then inspect only the failing side (`cube` or `vec`).
  - Prefer printing just one family at a time: `l1`, `l0`, `fix`, `ubin`, or `ubout`.
- Turn the event stream into an action sequence.
  - Record only `event_wait` / `event_set` for the failing event name(s).
  - Healthy reuse should look like alternating publish/consume rounds; repeated `wait` or repeated `set` without the opposite action in between is the broken edge.
- Compare against a stable baseline kernel.
  - Dump the same family from a nearby working kernel and diff the action sequence.
  - This is often faster than reasoning from the fused kernel body.
- Check nested autosync ownership next.
  - If the failing edge sits around nested `start_loop` / `start_if` regions, inspect parent/child mixed-scope handling before touching the kernel.
  - In particular, confirm whether parent and child are really the same autosync family, not just the same pipe pair.
- Add a parser regression before rerunning the real kernel.
  - Put the minimal reproducer in `testcases/parser/sync/test_autosync_event_metadata.py`.
  - Fix the split/autosync behavior there first, then rerun the full simulator kernel.
When this workflow points to parser behavior, jump to:
agent/references/constraints/autosync.md
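The "turn the event stream into an action sequence" step can be checked mechanically. The following is a minimal sketch, not repository code: `find_broken_edge` and the `(action, event_name)` tuple format are hypothetical, and it assumes the dumped instruction order can be read as the publish/consume order for one event name. It models the event as a single-token flag, with a preset event starting with one published token.

```python
from typing import Iterable, Optional, Tuple

Action = Tuple[str, str]  # (action, event_name); hypothetical trace format


def find_broken_edge(actions: Iterable[Action], name: str,
                     preset: bool = False) -> Optional[Tuple[int, str]]:
    """Return (index, reason) for the first broken edge of `name`, or None.

    `event_set` publishes a token and `event_wait` consumes it. A preset
    event starts with one token already published.
    """
    token_available = preset
    for index, (action, event_name) in enumerate(actions):
        if event_name != name:
            continue  # inspect one event name at a time
        if action == "event_set":
            if token_available:
                return index, "event_set on already-set flag"
            token_available = True
        elif action == "event_wait":
            if not token_available:
                return index, "event_wait with no published token"
            token_available = False
    return None


# Alternating rounds on a preset event are healthy.
healthy = [("event_wait", "e"), ("event_set", "e"),
           ("event_wait", "e"), ("event_set", "e")]
# Two waits in a row: the preset token is consumed and never republished.
missing_set = [("event_wait", "e"), ("event_wait", "e")]
# Two sets in a row on a non-preset event.
duplicate_set = [("event_set", "e"), ("event_set", "e")]

print(find_broken_edge(healthy, "e", preset=True))
print(find_broken_edge(missing_set, "e", preset=True))
print(find_broken_edge(duplicate_set, "e"))
```

The reported index points at the first action that repeats without its opposite in between, which is the edge to compare against the baseline kernel.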
V2 simulator: `CvMutex` / `VcMutex` timeout (`wait_vec` / `wait_cube`)
When V2 reports a sync timeout such as:
`wait_vec timeout: {'scope': 'intra_core', 'name': 'vec_to_cube_0_0', 'target_phase': 3, 'current_phase': 2, 'consumed_phase': 2}`
The timeout almost always means the other lane's actor thread crashed, not that the sync logic itself is wrong. The crashing thread silently terminates, so the expected `vec_ready` or `cube_ready` signal is never published, and the waiting side eventually times out.
Debugging sequence:
- Capture the real error on the other lane first. Patch `CoreRuntime.start` to wrap each `ControlActor.start()` in a try/except that prints the lane name and exception. The first non-timeout error is the real root cause.
- Common root causes behind the silent crash:
  - Float8 indexing: PyTorch does not support `tensor[indices]` for `float8_e5m2` / `float8_e4m3fn`. Any `_gather_1d`, `_scatter_1d`, or fancy indexing on a float8 register or UB tensor raises `"index_cpu" not implemented for 'Float8_e5m2'`. Fix by viewing as `torch.uint8` before indexing.
  - Non-contiguous UB views in burst copy: `ub_to_gm_pad` / `ub_to_l1_nz` use `.view(torch.uint8)` on the source. A column slice (stride > 1) makes `.view()` fail. Fix with `_linear_view_from_pointer()`.
  - Micro op not implemented: a `@vf()` body calls an op that `MicroRuntime` does not dispatch (`NotImplementedError`). The vec lane dies and its `free()` never fires.
- After fixing the vec/cube error, the sync timeout resolves on its own.
- Do not tune sync timeouts or phase counters to work around these failures. The counters are correct; the lane just never ran to completion.
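The "capture the real error" step amounts to making a worker thread report its exception instead of dying silently. The sketch below shows that pattern with plain `threading`; it does not patch `CoreRuntime` or `ControlActor`, whose internals are not shown in this playbook, and `start_lanes_with_error_capture` plus the two toy lane bodies are hypothetical names used only for illustration.

```python
import threading
import traceback
from typing import Callable, Dict


def start_lanes_with_error_capture(
        lanes: Dict[str, Callable[[], None]]) -> Dict[str, Exception]:
    """Run each lane body in its own thread and record any exception by lane name."""
    errors: Dict[str, Exception] = {}
    lock = threading.Lock()

    def guarded(name: str, body: Callable[[], None]) -> None:
        try:
            body()
        except Exception as exc:
            # Without this handler the thread ends quietly and the other
            # lane only ever sees a sync timeout.
            with lock:
                errors[name] = exc
            print(f"[lane {name}] crashed: {exc!r}")
            traceback.print_exc()

    threads = [threading.Thread(target=guarded, args=(name, body), name=name)
               for name, body in lanes.items()]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


def vec_lane() -> None:
    # Stand-in for a vec lane hitting an undispatched micro op.
    raise NotImplementedError("micro op not dispatched")


def cube_lane() -> None:
    # Stand-in for a lane that finishes normally.
    pass


errors = start_lanes_with_error_capture({"vec": vec_lane, "cube": cube_lane})
print(sorted(errors))
```

In a real run the first non-timeout entry in `errors` is the one to fix; the timeout on the waiting lane is only the downstream symptom.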
V2 simulator: do not run multiple simulator processes in parallel
Running multiple V2 simulator processes concurrently can produce silent data corruption. Root cause: per-lane `PipeWorker` threads are exposed to intra-process races under heavy CPU thread contention. Primarily affects kernels using NZ layout ops (`ub_to_l1_nz`, `deinterleave`, `reg_to_ub`) or complex `@vf` functions.
- `simulator="legacy"` is still accepted but routes to the same V2 runtime; there is no sequential fallback to switch to.
- Always run kernel simulator tests sequentially, not in parallel with `&` or batch scripts.
- If a kernel produces incorrect results only when run alongside other tests, re-run it alone before investigating.
V2 simulator launch rule: use a real script entry and keep `PYTHONPATH`
When launching helper comparisons or ad-hoc debugging runs, do not start the simulator from `stdin` entry points such as:
- `python - <<'PY'`
- `cat script.py | python`
V2 uses child processes plus worker threads. On the process-spawn path, Python must be able to re-import the parent `__main__` module from a real file. `stdin` entry points show up as `<stdin>`, so child startup fails with errors such as:
- `FileNotFoundError: ... '/path/to/repo/<stdin>'`
- follow-on `EOFError` while `multiprocessing.Manager()` starts
Practical rule:
- put the repro in a real `.py` file and run that file
- include the repository root in `PYTHONPATH` whenever the script imports local modules from outside the repo root or from a temp directory
Typical safe form:
PYTHONPATH=/abs/path/to/repo python /tmp/repro.py
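A repro script can also fail fast when it is launched the wrong way. This is a minimal sketch under the assumption that a bad entry point shows up either as a missing `__main__.__file__` or as a path ending in `<stdin>`, as in the error above; `is_real_script_entry` and `main_module_file` are hypothetical helpers, not part of the simulator.

```python
import os
import sys
from typing import Optional


def is_real_script_entry(main_file: Optional[str]) -> bool:
    """True when `main_file` names an existing file a spawned child can re-import."""
    if not main_file:
        return False  # no file at all, e.g. an interactive session
    if os.path.basename(main_file).startswith("<"):
        return False  # pseudo-files such as "<stdin>"
    return os.path.isfile(main_file)


def main_module_file() -> Optional[str]:
    """Path recorded for the current __main__ module, or None if it has none."""
    return getattr(sys.modules.get("__main__"), "__file__", None)


if __name__ == "__main__":
    if not is_real_script_entry(main_module_file()):
        raise SystemExit("run this repro from a real .py file, not from stdin")
```

Placing the check at the top of the `__main__` guard stops the run before any child process is spawned, so the failure message names the launch problem directly.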
Layer-by-layer review
Use this order when the fast-path sections above did not match or did not fix the bug:
- contract and cast order
- layout and shape bindings
- tile and capacity assumptions
- tail handling
- sync and ownership
- counters and lifetime separation
- precision boundaries
- parser/simulator/codegen implementation path
1. Re-check the exact contract
Verify the kernel against the real PyTorch formula. Common failure modes: wrong cast order, wrong transpose interpretation, wrong reshape meaning, accidental semantic drift. If the reference is still fuzzy, stop here and clarify it before changing the DSL code.
2. Re-check layout and shape binding assumptions
Verify tensor logical shapes, transpose site, `shape_bindings`, repeated scalar dimension mapping.
Common signs: output shape is right but values are wrong everywhere; only some shapes fail; changing `M`, `N`, or `K` flips behavior unpredictably.
Repository reminder: if repeated scalar dimensions are ambiguous, try explicit `shape_bindings` before deeper kernel surgery.
3. Re-check tile and capacity assumptions
When the kernel is tiled, verify `TILE_M`, `TILE_N`, `TILE_K`, `m_split`, `n_split`, `splitk` / `splitn`, `L0A` / `L0B` / `L0C` byte budgets.
Repository reminders: keep `splitk` and `splitn` at `>= 32`; choose `splitk` when K-side staging is too large; choose `splitn` when N-side staging or output tile is too large; do not author non-zero `L0C` row offsets on matmul destinations. For the exact per-device caps and DBuff formulas, see `agent/references/facts-authoring.md` and `agent/references/facts-device-runtime.md`.
If tile search is non-trivial, use `agent/scripts/estimate_matmul_datamove.py` instead of eyeballing it. Drill into `agent/references/constraints/tiling.md` for reasoning.
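For a quick sanity check before reaching for the estimator script, the byte budgets can be written down explicitly. The sketch below is only illustrative: the byte formulas (operand tile = rows × columns × element size, `L0C` held in a 4-byte accumulator, no double-buffering factor) are assumptions, the cap values in the example are placeholders rather than real device caps, and `l0_tile_bytes` / `over_budget` are hypothetical helpers. Take the real caps and DBuff formulas from the referenced fact sheets.

```python
from typing import Dict, List


def l0_tile_bytes(tile_m: int, tile_n: int, tile_k: int,
                  in_bytes: int = 2, acc_bytes: int = 4) -> Dict[str, int]:
    """Assumed per-tile byte usage for the three L0 buffers (single buffer each)."""
    return {
        "L0A": tile_m * tile_k * in_bytes,   # left operand tile
        "L0B": tile_k * tile_n * in_bytes,   # right operand tile
        "L0C": tile_m * tile_n * acc_bytes,  # accumulator tile
    }


def over_budget(usage: Dict[str, int], caps: Dict[str, int]) -> List[str]:
    """Names of the buffers whose usage exceeds the supplied cap."""
    return [name for name in usage if usage[name] > caps[name]]


usage = l0_tile_bytes(tile_m=128, tile_n=128, tile_k=64)
# Placeholder caps for illustration only; NOT real device limits.
example_caps = {"L0A": 16384, "L0B": 16384, "L0C": 32768}
print(usage)
print(over_budget(usage, example_caps))
```

A buffer that shows up in `over_budget` tells you which side to split: K-side operands point toward `splitk`, the output tile toward `splitn`.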
4. Re-check tail handling
Look at GM boundaries first, not local tensor sizes. Rule: local buffers stay full-tile sized; only GM read/write boundaries use `valid_m`, `valid_n`, `valid_k`.
For cube -> vec writeback, verify the standard half-row split:
half_rows = CeilDiv(valid_m, 2)
row_begin = GetSubBlockIdx() * half_rows
row_end = Min(row_begin + half_rows, valid_m)
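The same split can be checked on the host with plain integers. This is a small Python model of the three lines above, with `half_row_range` as a hypothetical helper and the sub-block index passed in explicitly instead of read from `GetSubBlockIdx()`.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def half_row_range(valid_m: int, sub_block_idx: int) -> Tuple[int, int]:
    """Row range [row_begin, row_end) owned by one vec sub-block."""
    half_rows = ceil_div(valid_m, 2)
    row_begin = sub_block_idx * half_rows
    row_end = min(row_begin + half_rows, valid_m)
    return row_begin, row_end


for valid_m in (8, 5, 1):
    print(valid_m, half_row_range(valid_m, 0), half_row_range(valid_m, 1))
# 8 (0, 4) (4, 8)
# 5 (0, 3) (3, 5)
# 1 (0, 1) (1, 1)
```

The `valid_m = 1` case is worth keeping in a test: sub-block 1 gets an empty range, so its writeback must handle zero rows.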
For a2 workspace-mediated cube -> vec tails: keep workspace writes and reads on stable tile shapes (`ws[..., 0:TILE_M, 0:TILE_N]` on cube; `ws[..., row_begin:row_begin + row_count, 0:TILE_N]` on vec). Apply `valid_n` with vec-side masking and final GM write boundaries, not by cropping the workspace column span first.
Symptoms of tail bugs: aligned cases pass but odd sizes fail; only the last tile is wrong; one vec subblock is correct and the other is garbage.
Drill: `agent/references/constraints/tail-safety.md`. For normalized online softmax with running `row_max` / `row_sum`, also `agent/references/constraints/online-softmax-tail.md`.
5. Re-check sync ownership
Assume ownership is wrong until proven otherwise.
`auto_sync()` only manages same-side ordering and does not replace cross-side ownership transfer. Cube -> vec handoff needs `CvMutex`; vec -> cube handoff needs `VcMutex`. Exact mutex signatures per device live in `agent/references/facts-device-runtime.md`.
If the issue smells like pipeline ordering: inspect where the producer finishes, where the consumer starts, whether `lock` / `ready` / `wait` / `free` surround the real ownership edge, and keep the critical section narrow.
Drill: `agent/references/constraints/autosync.md`.
6. Re-check counters and lifetimes
Many broken kernels are actually lifetime bugs. Verify which loop owns each buffer family, whether different lifetimes accidentally share one counter, whether the same slot lineage is expressed consistently.
Rules: buffers with different lifetimes must use different counters; same-lifetime paired buffers may share one; reusing one counter across different loop-owned lifetimes can silently break autosync grouping and slot reasoning.
Drill: `agent/references/constraints/counters.md`.
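The counter rule can be audited from a simple table of buffers. The sketch below approximates "lifetime" by the loop that owns each buffer, which is an assumption made for illustration; `shared_counter_conflicts` and all buffer, counter, and loop names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shared_counter_conflicts(
        buffers: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Find counters shared by buffers owned by different loops.

    `buffers` maps buffer name -> (counter name, owning loop).
    Returns counter name -> sorted buffer names for each conflicting counter.
    """
    owners: Dict[str, Set[str]] = defaultdict(set)
    users: Dict[str, List[str]] = defaultdict(list)
    for buffer_name, (counter, owning_loop) in buffers.items():
        owners[counter].add(owning_loop)
        users[counter].append(buffer_name)
    return {counter: sorted(users[counter])
            for counter in owners if len(owners[counter]) > 1}


example_buffers = {
    "l1a": ("cnt_k", "k_loop"),     # paired buffers, same lifetime: sharing is fine
    "l1b": ("cnt_k", "k_loop"),
    "ub_out": ("cnt_k", "n_loop"),  # different owning loop on the same counter
    "l0c": ("cnt_n", "n_loop"),
}
print(shared_counter_conflicts(example_buffers))
```

Any counter reported here is a candidate for splitting into one counter per owning loop before looking at autosync output again.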
7. Re-check precision boundaries
Verify where values change dtype. Common failures: casting too early, reducing in the wrong dtype, writing packed or quantized data too early, comparing against a reference with a different cast order.
Rule: keep matmul accumulation in `float`; downcast later unless the design proves otherwise.
Drill: `agent/references/constraints/precision.md`.
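The effect of reducing in the wrong dtype is easy to reproduce without any device. The sketch below uses only the standard library: `struct`'s `'e'` format rounds a value to IEEE 754 half precision, and a Python float stands in for the wide accumulator (an assumption; the kernel rule refers to `float` accumulation on device). `to_half` and `accumulate` are hypothetical helpers.

```python
import struct
from typing import Iterable


def to_half(x: float) -> float:
    """Round x to IEEE 754 half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values: Iterable[float], round_each_step: bool) -> float:
    """Sum values, optionally rounding the accumulator to half after every add."""
    acc = 0.0
    for value in values:
        acc += value
        if round_each_step:
            acc = to_half(acc)  # accumulator itself lives in half precision
    return acc


values = [to_half(0.001)] * 10000  # inputs are already half-precision values

wide = accumulate(values, round_each_step=False)   # wide accumulator, downcast later
narrow = accumulate(values, round_each_step=True)  # accumulator kept in half

print("wide accumulator, downcast once:", to_half(wide))
print("half accumulator:", narrow)
```

With the half accumulator the sum stops growing once each addend falls below half of the accumulator's spacing, so the result lands far below the true sum even though every input was representable. Comparing a kernel against a reference that accumulates differently shows the same kind of gap.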
8. Inspect the real implementation path
If a rule is still unclear, inspect the actual implementation path instead of theorizing. Device family mapping (`950` → C310, `b*` → C220) and common target files (`easyasc/stub_functions/`, `easyasc/parser/`, `easyasc/parser/asc_autosync.py`, `easyasc/kernelbase/kernelbase.py`, `easyasc/simulator_v2/`, `easyasc/shortcuts/matmul.py`) are in `agent/references/code-paths.md`.
Good debugging question: which exact instruction gets emitted, how the parser lowers it, how the simulator executes it, whether the kernel assumption matches that path.
When the simulator itself produces an unexpected error: investigate the simulator path first; inspect the exact simulator stage, runtime view, and lowered instruction that failed; do not assume the upper-layer kernel is wrong just because the simulator failed first.
If simulator behavior still looks inconsistent with the intended model after real inspection: stop blind upper-layer edits, summarize the concrete simulator finding, pause and discuss with the user.
Build a minimal reproducer
When the full kernel is noisy, isolate one mechanism: one matmul, one handoff, one vec postprocess, one autosync chain, one tail tile. A minimal reproducer is usually faster than staring at a fused kernel.
Shrink-down order: keep the original failing shape, remove later stages until only the first wrong stage remains, inside that stage keep only one subformula (`odo`, `rowmax`, one GM bridge), shrink again if needed to one instruction and one view shape.
Treat warnings as real signals
Do not accept a passing result with unresolved warnings. Especially for `auto_sync`, warnings usually mean the lifetime model is off. If a warning persists after real inspection, stop blind iteration: either redesign the stage boundary or ask the user for clarification.
Fallback references
- `agent/references/code-paths.md`
- `agent/references/simulator-v2.md`
- `doc/11_architecture_for_contributors.md`
Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.