CANNBot Kernel Debugging Guide

Kernel Debugging Playbook

cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Use this playbook when an existing kernel is wrong, unstable, warning-heavy, or unclear. Debug in layers. Do not jump between random fixes.

Goal

Find the first broken assumption. Fix the model, then fix the kernel. Do not keep stacking patches on top of an unclear design.

Fast-path: match your symptom first

Most bug reports match one of the patterns below. Try these before running the full layer-by-layer review further down.

Symptom-to-check map

- Wrong everywhere → check formula, transpose/layout, cast order, `shape_bindings`
- Only large shapes fail → check tile budgets, split mode, estimator choice, counter ownership across nested loops
- Only tail tiles fail → check `valid_*` handling, half-row vec writeback split, GM boundary slicing
- Autosync warnings or weird pipeline stalls → check same-side vs cross-side misunderstanding, event family grouping, counter reuse across different lifetimes, unsupported instruction not covered by autosync pairing
- Local event timeout / already-set (`_tmp_*valid_*`, `_tmp_*ready_*`) → classify the event failure first, dump autosync-expanded instructions, then compare the failing family against a stable kernel before changing the DSL
- Simulator passes, generated path looks suspicious → check parser lowering, codegen handlers, explicit event or mutex placement, assumptions hidden by simulator convenience
- V2 `wait_vec`/`wait_cube` timeout → see the V2 timeout section below; almost always the other lane's actor crashed silently
- Kernel only fails when run alongside other tests → see the V2 parallel-process section below

Event pairing workflow for local event failures

Use this when V2 reports a lane-local event problem such as:

event_wait timeout: {'name': '_tmp_sevent_valid_fix_0', ...}
event_set on already-set flag: _tmp_sevent_valid_l1_0 ...

Debugging sequence:

1. Classify the failure from the runtime message.
   - `event_wait timeout` usually means a missing `event_set` for the same family.
   - `event_set on already-set flag` usually means a duplicate `event_set` before a matching `event_wait` consumed the token.
2. Read the counters literally.
   - On the simulator path, `preset=True` events start with one published token.
   - If a timeout reports `set_count == wait_count`, the preset token was consumed and the next producer-side `event_set` never happened.
3. Build the kernel instructions before inspecting split/autosync output.
   - Call the `@kernel` once with placeholder `GMTensor(...)` arguments so `kernel.instructions` is populated.
4. Dump the autosync-expanded lane instructions.
   - Use `split_instructions(...)` plus `insert_auto_sync(...)`, then inspect only the failing side (`cube` or `vec`).
   - Prefer printing just one family at a time: `l1`, `l0`, `fix`, `ubin`, or `ubout`.
5. Turn the event stream into an action sequence.
   - Record only `event_wait`/`event_set` for the failing event name(s).
   - Healthy reuse should look like alternating publish/consume rounds; repeated `wait` or repeated `set` without the opposite action in between is the broken edge.
6. Compare against a stable baseline kernel.
   - Dump the same family from a nearby working kernel and diff the action sequence.
   - This is often faster than reasoning from the fused kernel body.
7. Check nested autosync ownership next.
   - If the failing edge sits around nested `start_loop`/`start_if` regions, inspect parent/child mixed-scope handling before touching the kernel.
   - In particular, confirm whether parent and child are really the same autosync family, not just the same pipe pair.
8. Add a parser regression before rerunning the real kernel.
   - Put the minimal reproducer in `testcases/parser/sync/test_autosync_event_metadata.py`.
   - Fix the split/autosync behavior there first, then rerun the full simulator kernel.
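
The "turn the event stream into an action sequence" step can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the repository's tooling: the `(opname, event_name)` tuples, the `other_op` placeholder, and both helper names are hypothetical, and the check assumes a single-token model where program order equals execution order for the one event being inspected.

```python
def actions_for(instructions, event_name):
    """Keep only the set/wait actions for one event name, in program order.

    `instructions` is a hypothetical list of (opname, event_name) tuples;
    the real instruction objects will look different.
    """
    wanted = {"event_set": "set", "event_wait": "wait"}
    return [wanted[op] for op, name in instructions
            if name == event_name and op in wanted]


def find_broken_edge(actions, preset=False):
    """Return (index, reason) for the first action that breaks alternation, else None."""
    token_published = preset  # preset=True starts with one published token
    for index, action in enumerate(actions):
        if action == "set":
            if token_published:
                return index, "event_set on already-set flag"
            token_published = True
        elif action == "wait":
            if not token_published:
                return index, "event_wait timeout"
            token_published = False
        else:
            raise ValueError(f"unknown action: {action!r}")
    return None


# Hypothetical trace: the second round is missing its producer-side event_set.
trace = [
    ("event_wait", "_tmp_sevent_valid_fix_0"),
    ("other_op", None),  # placeholder for any non-event instruction
    ("event_set", "_tmp_sevent_valid_fix_0"),
    ("event_wait", "_tmp_sevent_valid_fix_0"),
    ("event_wait", "_tmp_sevent_valid_fix_0"),
]
actions = actions_for(trace, "_tmp_sevent_valid_fix_0")
print(actions)
print(find_broken_edge(actions, preset=True))
```

Running the same two helpers on the stable baseline kernel's trace gives a second action list to diff against, which is the comparison step 6 asks for.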

When this workflow points to parser behavior, jump to:

agent/references/constraints/autosync.md

V2 simulator: `CvMutex`/`VcMutex` timeout (`wait_vec`/`wait_cube`)

When V2 reports a sync timeout such as:

wait_vec timeout: {'scope': 'intra_core', 'name': 'vec_to_cube_0_0', 'target_phase': 3, 'current_phase': 2, 'consumed_phase': 2}

The timeout almost always means the other lane's actor thread crashed, not that the sync logic itself is wrong. The crashing thread silently terminates, so the expected `vec_ready` or `cube_ready` signal is never published, and the waiting side eventually times out.

Debugging sequence:

1. Capture the real error on the other lane first. Patch `CoreRuntime.start` to wrap each `ControlActor.start()` in a try/except that prints the lane name and exception. The first non-timeout error is the real root cause.
2. Common root causes behind the silent crash:
   - Float8 indexing: PyTorch does not support `tensor[indices]` for `float8_e5m2`/`float8_e4m3fn`. Any `_gather_1d`, `_scatter_1d`, or fancy indexing on a float8 register or UB tensor raises `"index_cpu" not implemented for 'Float8_e5m2'`. Fix by viewing as `torch.uint8` before indexing.
   - Non-contiguous UB views in burst copy: `ub_to_gm_pad`/`ub_to_l1_nz` use `.view(torch.uint8)` on the source. A column slice (stride > 1) makes `.view()` fail. Fix with `_linear_view_from_pointer()`.
   - Micro op not implemented: a `@vf()` body calls an op that `MicroRuntime` does not dispatch (`NotImplementedError`). The vec lane dies and its `free()` never fires.
3. After fixing the vec/cube error, the sync timeout resolves on its own.
4. Do not tune sync timeouts or phase counters to work around these failures: the counters are correct; the lane just never ran to completion.
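
The "capture the real error on the other lane" step can be illustrated with plain `threading`. This is a sketch only: it does not use the real `CoreRuntime` or `ControlActor` classes, and `start_lane`, `first_real_error`, and the two toy lanes are hypothetical names. It shows the pattern of recording each lane's exception and then picking the first one that is not a timeout.

```python
import threading


def start_lane(name, target, errors):
    """Run `target` on a thread and record (lane, exception) instead of dying silently."""
    def guarded():
        try:
            target()
        except Exception as exc:  # deliberately broad: we want every lane failure
            errors.append((name, exc))
            print(f"[lane {name}] crashed: {exc!r}")

    thread = threading.Thread(target=guarded, name=name, daemon=True)
    thread.start()
    return thread


def first_real_error(errors):
    """Return the first recorded error that is not a timeout, or None."""
    for lane, exc in errors:
        if not isinstance(exc, TimeoutError):
            return lane, exc
    return None


# Toy lanes: vec crashes before publishing, so cube can only time out.
vec_ready = threading.Event()


def vec_lane():
    raise NotImplementedError("micro op not dispatched (toy example)")


def cube_lane():
    if not vec_ready.wait(timeout=0.2):
        raise TimeoutError("wait_vec timeout (toy example)")


errors = []
threads = [start_lane("vec", vec_lane, errors),
           start_lane("cube", cube_lane, errors)]
for thread in threads:
    thread.join()

root_cause = first_real_error(errors)
print("root cause:", root_cause)
```

Both lanes report an error here, but only the vec-side `NotImplementedError` is actionable; the cube-side timeout is the downstream symptom.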

V2 simulator: do not run multiple simulator processes in parallel

Running multiple V2 simulator processes concurrently can produce silent data corruption. Root cause: per-lane `PipeWorker` threads are exposed to intra-process races under heavy CPU thread contention. Primarily affects kernels using NZ layout ops (`ub_to_l1_nz`, `deinterleave`, `reg_to_ub`) or complex `@vf` functions.

- `simulator="legacy"` is still accepted but routes to the same V2 runtime; there is no sequential fallback to switch to.
- Always run kernel simulator tests sequentially, not in parallel with `&` or batch scripts.
- If a kernel produces incorrect results only when run alongside other tests, re-run it alone before investigating.
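
One way to enforce the sequential rule is a small driver that starts each test command only after the previous one has exited. The sketch below uses only the standard library; `run_sequentially` is a hypothetical helper, and the demo commands are stand-ins for real simulator test scripts.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run each command to completion before starting the next one.

    Returns a list of (command, returncode) pairs in execution order.
    """
    results = []
    for command in commands:
        # subprocess.run blocks until the child exits, so nothing overlaps.
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((command, completed.returncode))
    return results


# Stand-in commands; in practice these would be real kernel test scripts.
demo_commands = [
    [sys.executable, "-c", "print('first test')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
]
demo_results = run_sequentially(demo_commands)
for command, returncode in demo_results:
    print(returncode, command[-1])
```

A failing entry in the result list is the one to re-run alone before any deeper investigation.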

V2 simulator launch rule: use a real script entry and keep `PYTHONPATH`

When launching helper comparisons or ad-hoc debugging runs, do not start the simulator from `stdin` entry points such as:

python - <<'PY'
cat script.py | python

V2 uses child processes plus worker threads. On the process-spawn path, Python must be able to re-import the parent `__main__` module from a real file. `stdin` entry points show up as `<stdin>`, so child startup fails with errors such as:

- `FileNotFoundError: ... '/path/to/repo/<stdin>'`
- follow-on `EOFError` while `multiprocessing.Manager()` starts

Practical rule:

- put the repro in a real `.py` file and run that file
- include the repository root in `PYTHONPATH` whenever the script imports the repository's local modules while running from outside the repo root or from a temp directory

Typical safe form:

PYTHONPATH=/abs/path/to/repo python /tmp/repro.py
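
A repro script can guard against the `stdin` launch mistake before it starts the simulator. The check below is a hedged sketch: `has_real_script_entry` is a hypothetical helper, and it only tests whether `__main__.__file__` points at an existing file, which is the property the spawn path described above relies on.

```python
import os
import sys
import tempfile
import types


def has_real_script_entry(main_module=None):
    """Return True when the entry module was loaded from a real file on disk."""
    if main_module is None:
        main_module = sys.modules["__main__"]
    path = getattr(main_module, "__file__", None)
    return bool(path) and os.path.isfile(path)


# Simulate the two launch styles with stand-in modules.
fake_stdin = types.ModuleType("__main__")
fake_stdin.__file__ = "<stdin>"  # what a stdin entry point can look like
stdin_ok = has_real_script_entry(fake_stdin)

with tempfile.NamedTemporaryFile(suffix=".py") as handle:
    fake_script = types.ModuleType("__main__")
    fake_script.__file__ = handle.name  # a real file, like /tmp/repro.py
    script_ok = has_real_script_entry(fake_script)

print("stdin entry ok:", stdin_ok)
print("script entry ok:", script_ok)
```

Calling the helper with no argument at the top of a repro and exiting early when it returns `False` turns a confusing child-process error into an immediate, readable one.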

Layer-by-layer review

Use this order when the fast-path sections above did not match or did not fix the bug:

1. contract and cast order
2. layout and shape bindings
3. tile and capacity assumptions
4. tail handling
5. sync and ownership
6. counters and lifetime separation
7. precision boundaries
8. parser/simulator/codegen implementation path

1. Re-check the exact contract

Verify the kernel against the real PyTorch formula. Common failure modes: wrong cast order, wrong transpose interpretation, wrong reshape meaning, accidental semantic drift. If the reference is still fuzzy, stop here and clarify it before changing the DSL code.

2. Re-check layout and shape binding assumptions

Verify tensor logical shapes, transpose site, `shape_bindings`, and repeated scalar dimension mapping.

Common signs: output shape is right but values are wrong everywhere; only some shapes fail; changing `M`, `N`, or `K` flips behavior unpredictably.

Repository reminder: if repeated scalar dimensions are ambiguous, try explicit `shape_bindings` before deeper kernel surgery.

3. Re-check tile and capacity assumptions

When the kernel is tiled, verify `TILE_M`, `TILE_N`, `TILE_K`, `m_split`, `n_split`, `splitk`/`splitn`, and the `L0A`/`L0B`/`L0C` byte budgets.

Repository reminders: keep `splitk` and `splitn` at `>= 32`; choose `splitk` when K-side staging is too large; choose `splitn` when N-side staging or output tile is too large; do not author non-zero `L0C` row offsets on matmul destinations. For the exact per-device caps and DBuff formulas, see `agent/references/facts-authoring.md` and `agent/references/facts-device-runtime.md`.

If tile search is non-trivial, use `agent/scripts/estimate_matmul_datamove.py` instead of eyeballing it. Drill into `agent/references/constraints/tiling.md` for reasoning.
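
For a quick sanity check before reaching for the estimator script, the byte budgets can be computed by hand. The sketch below is an illustration, not the repository's formula: it assumes `L0A` holds a `TILE_M x TILE_K` operand tile, `L0B` a `TILE_K x TILE_N` operand tile, and `L0C` a `TILE_M x TILE_N` accumulator tile, with 2-byte inputs and 4-byte accumulators. The caps are placeholders, not real device values, and double-buffering factors are ignored.

```python
def l0_tile_bytes(tile_m, tile_n, tile_k, in_bytes=2, acc_bytes=4):
    """Bytes used by one matmul tile in each L0 buffer (assumed layout, see lead-in)."""
    return {
        "L0A": tile_m * tile_k * in_bytes,   # left operand tile
        "L0B": tile_k * tile_n * in_bytes,   # right operand tile
        "L0C": tile_m * tile_n * acc_bytes,  # accumulator tile
    }


def over_budget(usage, caps):
    """Return the buffer names whose usage exceeds the given cap."""
    return [name for name, used in usage.items() if used > caps[name]]


# PLACEHOLDER caps for illustration only; take real values from the facts files.
PLACEHOLDER_CAPS = {"L0A": 65536, "L0B": 65536, "L0C": 65536}

usage = l0_tile_bytes(tile_m=128, tile_n=256, tile_k=64)
print(usage)
print("over budget:", over_budget(usage, PLACEHOLDER_CAPS))
```

With these placeholder numbers only the accumulator tile overflows, which is the kind of situation where splitting the N side is the first thing to try.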

4. Re-check tail handling

Look at GM boundaries first, not local tensor sizes. Rule: local buffers stay full-tile sized; only GM read/write boundaries use `valid_m`, `valid_n`, `valid_k`.
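
The rule can be modeled with plain Python lists standing in for GM and local buffers. This is a toy sketch with hypothetical names, not DSL code: the local buffer is always allocated at full tile size, and `valid_m`/`valid_n` appear only where data crosses the GM boundary.

```python
def copy_with_tails(src, tile_m, tile_n):
    """Copy a 2D list tile by tile, clamping only at the GM read/write boundary."""
    rows, cols = len(src), len(src[0])
    dst = [[None] * cols for _ in range(rows)]
    for m0 in range(0, rows, tile_m):
        valid_m = min(tile_m, rows - m0)
        for n0 in range(0, cols, tile_n):
            valid_n = min(tile_n, cols - n0)
            # Local buffer stays full-tile sized even on tail tiles.
            local = [[0] * tile_n for _ in range(tile_m)]
            # GM read: only the valid region is loaded.
            for i in range(valid_m):
                for j in range(valid_n):
                    local[i][j] = src[m0 + i][n0 + j]
            # GM write: only the valid region is stored.
            for i in range(valid_m):
                dst[m0 + i][n0:n0 + valid_n] = local[i][:valid_n]
    return dst


# Odd shape on purpose: 5 x 7 with 4 x 4 tiles exercises both tails.
source = [[r * 10 + c for c in range(7)] for r in range(5)]
result = copy_with_tails(source, tile_m=4, tile_n=4)
print(result == source)
```

If the clamp were applied to the local buffer shape instead, every tile would need its own layout, which is the kind of drift this rule is meant to prevent.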

For cube -> vec writeback, verify the standard half-row split:

half_rows = CeilDiv(valid_m, 2)
row_begin = GetSubBlockIdx() * half_rows
row_end = Min(row_begin + half_rows, valid_m)
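
The same split written as runnable Python makes the odd-size cases easy to check by hand. `half_row_range` is a hypothetical helper that takes the sub-block index as an argument in place of `GetSubBlockIdx()`.

```python
def half_row_range(valid_m, sub_block_idx):
    """Return (row_begin, row_end) for one vec sub-block; the range may be empty."""
    half_rows = (valid_m + 1) // 2              # CeilDiv(valid_m, 2)
    row_begin = sub_block_idx * half_rows
    row_end = min(row_begin + half_rows, valid_m)
    return row_begin, row_end


for valid_m in (8, 5, 1):
    ranges = [half_row_range(valid_m, idx) for idx in (0, 1)]
    print(valid_m, ranges)
```

For `valid_m = 1` the second sub-block gets an empty range, so its writeback must tolerate zero rows rather than assume at least one.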

For a2 workspace-mediated cube -> vec tails: keep workspace writes and reads on stable tile shapes (`ws[..., 0:TILE_M, 0:TILE_N]` on cube; `ws[..., row_begin:row_begin + row_count, 0:TILE_N]` on vec). Apply `valid_n` with vec-side masking and final GM write boundaries, not by cropping the workspace column span first.

Symptoms of tail bugs: aligned cases pass but odd sizes fail; only the last tile is wrong; one vec subblock is correct and the other is garbage.

Drill: `agent/references/constraints/tail-safety.md`. For normalized online softmax with running `row_max`/`row_sum`, also `agent/references/constraints/online-softmax-tail.md`.

5. Re-check sync ownership

Assume ownership is wrong until proven otherwise.

`auto_sync()` only manages same-side ordering and does not replace cross-side ownership transfer. Cube -> vec handoff needs `CvMutex`; vec -> cube handoff needs `VcMutex`. Exact mutex signatures per device live in `agent/references/facts-device-runtime.md`.

If the issue smells like pipeline ordering: inspect where the producer finishes, where the consumer starts, and whether `lock`/`ready`/`wait`/`free` surround the real ownership edge. Keep the critical section narrow.

Drill: `agent/references/constraints/autosync.md`.
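
The ownership edge can be pictured with a toy single-slot handoff built on `threading.Semaphore`. This is a conceptual sketch only: `ToyHandoff` is hypothetical and is not the real `CvMutex`/`VcMutex` API. It borrows the `lock`/`ready`/`wait`/`free` names purely to show which call belongs to which side.

```python
import threading


class ToyHandoff:
    """Toy single-slot ownership transfer (hypothetical; not the real mutex API)."""

    def __init__(self):
        self._slot_free = threading.Semaphore(1)   # producer may write first
        self._slot_ready = threading.Semaphore(0)  # nothing published yet

    def lock(self, timeout=5.0):    # producer: take ownership of the slot
        if not self._slot_free.acquire(timeout=timeout):
            raise TimeoutError("lock timeout")

    def ready(self):                # producer: publish the slot
        self._slot_ready.release()

    def wait(self, timeout=5.0):    # consumer: wait for published data
        if not self._slot_ready.acquire(timeout=timeout):
            raise TimeoutError("wait timeout")

    def free(self):                 # consumer: hand the slot back
        self._slot_free.release()


handoff = ToyHandoff()
slot = [None]
received = []


def producer(count):
    for value in range(count):
        handoff.lock()
        slot[0] = value * value     # critical section: only the write
        handoff.ready()


def consumer(count):
    for _ in range(count):
        handoff.wait()
        received.append(slot[0])    # critical section: only the read
        handoff.free()


workers = [threading.Thread(target=producer, args=(4,)),
           threading.Thread(target=consumer, args=(4,))]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(received)
```

Dropping either `ready()` or `free()` from this toy makes the other side time out, which mirrors the "other lane never published" failure pattern rather than a bug in the waiting side.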

6. Re-check counters and lifetimes

Many broken kernels are actually lifetime bugs. Verify which loop owns each buffer family, whether different lifetimes accidentally share one counter, whether the same slot lineage is expressed consistently.

Rules: buffers with different lifetimes must use different counters; same-lifetime paired buffers may share one; reusing one counter across different loop-owned lifetimes can silently break autosync grouping and slot reasoning.

Drill: `agent/references/constraints/counters.md`.
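
A toy model shows why a shared counter hurts. The sketch assumes a double-buffered slot is chosen as `counter % 2` at allocation time; that selection rule and the function name are illustrative assumptions, not the DSL's actual semantics.

```python
def outer_buffer_slots(outer_iters, inner_iters, shared):
    """Return the slot picked for the outer-loop buffer on each outer iteration."""
    counters = {"outer": 0, "inner": 0}
    # With shared=True the inner-loop buffer bumps the outer buffer's counter.
    inner_key = "outer" if shared else "inner"
    slots = []
    for _ in range(outer_iters):
        slots.append(counters["outer"] % 2)   # slot for the outer-owned buffer
        counters["outer"] += 1
        for _ in range(inner_iters):
            counters[inner_key] += 1          # allocation for the inner-owned buffer
    return slots


separate = outer_buffer_slots(outer_iters=4, inner_iters=3, shared=False)
shared = outer_buffer_slots(outer_iters=4, inner_iters=3, shared=True)
print("separate counters:", separate)
print("shared counter:   ", shared)
```

With separate counters the outer buffer alternates between its two slots; with the shared counter it lands on the same slot every time, so the double buffering it was supposed to provide is silently lost.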

7. Re-check precision boundaries

Verify where values change dtype. Common failures: casting too early, reducing in the wrong dtype, writing packed or quantized data too early, comparing against a reference with a different cast order.

Rule: keep matmul accumulation in `float`; downcast later unless the design proves otherwise.

Drill: `agent/references/constraints/precision.md`.
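
The cost of casting too early can be shown without any framework, using `struct`'s half-precision format to emulate a round-to-half after each operation. This is an illustrative sketch: Python floats stand in for the wide accumulator, and `to_half` is a hypothetical helper.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


values = [to_half(0.1)] * 4096

# Cast too early: round the running sum to half after every add.
early_cast = 0.0
for value in values:
    early_cast = to_half(early_cast + value)

# Cast late: accumulate in a wide type, round to half once at the end.
late_cast = to_half(sum(values))

print("early cast:", early_cast)
print("late cast: ", late_cast)
```

Once the running half-precision sum grows large enough, each small addend rounds away and the total stops increasing, so the early-cast result ends up well below the late-cast one. A reference that uses a different cast order will disagree with the kernel for the same reason.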

8. Inspect the real implementation path

If a rule is still unclear, inspect the actual implementation path instead of theorizing. Device family mapping (`950` → C310, `b*` → C220) and common target files (`easyasc/stub_functions/`, `easyasc/parser/`, `easyasc/parser/asc_autosync.py`, `easyasc/kernelbase/kernelbase.py`, `easyasc/simulator_v2/`, `easyasc/shortcuts/matmul.py`) are in `agent/references/code-paths.md`.

Good debugging question: which exact instruction gets emitted, how the parser lowers it, how the simulator executes it, whether the kernel assumption matches that path.

When the simulator itself produces an unexpected error: investigate the simulator path first; inspect the exact simulator stage, runtime view, and lowered instruction that failed; do not assume the upper-layer kernel is wrong just because the simulator failed first.

If simulator behavior still looks inconsistent with the intended model after real inspection: stop blind upper-layer edits, summarize the concrete simulator finding, pause and discuss with the user.

Build a minimal reproducer

When the full kernel is noisy, isolate one mechanism: one matmul, one handoff, one vec postprocess, one autosync chain, one tail tile. A minimal reproducer is usually faster than staring at a fused kernel.

Shrink-down order: keep the original failing shape, remove later stages until only the first wrong stage remains, inside that stage keep only one subformula (`odo`, `rowmax`, one GM bridge), shrink again if needed to one instruction and one view shape.

Treat warnings as real signals

Do not accept a passing result with unresolved warnings. Especially for `auto_sync`, warnings usually mean the lifetime model is off. If a warning persists after real inspection, stop blind iteration: either redesign the stage boundary or ask the user for clarification.

Fallback references

agent/references/code-paths.md
agent/references/simulator-v2.md
doc/11_architecture_for_contributors.md

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
