news 2026/5/9 12:05:31

CANNBot Kernel Debugging Guide


张小明

Front-end Development Engineer


Kernel Debugging Playbook

cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Use this playbook when an existing kernel is wrong, unstable, warning-heavy, or unclear. Debug in layers. Do not jump between random fixes.

Goal

Find the first broken assumption. Fix the model, then fix the kernel. Do not keep stacking patches on top of an unclear design.

Fast-path: match your symptom first

Most bug reports match one of the patterns below. Try these before running the full layer-by-layer review further down.

Symptom-to-check map

  • Wrong everywhere → check formula, transpose/layout, cast order, shape_bindings
  • Only large shapes fail → check tile budgets, split mode, estimator choice, counter ownership across nested loops
  • Only tail tiles fail → check valid_* handling, half-row vec writeback split, GM boundary slicing
  • Autosync warnings or weird pipeline stalls → check same-side vs cross-side misunderstanding, event family grouping, counter reuse across different lifetimes, unsupported instruction not covered by autosync pairing
  • Local event timeout / already-set (_tmp_*valid_*, _tmp_*ready_*) → classify the event failure first, dump autosync-expanded instructions, then compare the failing family against a stable kernel before changing the DSL
  • Simulator passes, generated path looks suspicious → check parser lowering, codegen handlers, explicit event or mutex placement, assumptions hidden by simulator convenience
  • V2 wait_vec/wait_cube timeout → see the V2 timeout section below; almost always the other lane's actor crashed silently
  • Kernel only fails when run alongside other tests → see the V2 parallel-process section below

Event pairing workflow for local event failures

Use this when V2 reports a lane-local event problem such as:

  • event_wait timeout: {'name': '_tmp_sevent_valid_fix_0', ...}
  • event_set on already-set flag: _tmp_sevent_valid_l1_0 ...

Debugging sequence:

  1. Classify the failure from the runtime message.
    • event_wait timeout usually means a missing event_set for the same family.
    • event_set on already-set flag usually means a duplicate event_set before a matching event_wait consumed the token.
  2. Read the counters literally.
    • On the simulator path, preset=True events start with one published token.
    • If a timeout reports set_count == wait_count, the preset token was consumed and the next producer-side event_set never happened.
  3. Build the kernel instructions before inspecting split/autosync output.
    • Call the @kernel once with placeholder GMTensor(...) arguments so kernel.instructions is populated.
  4. Dump the autosync-expanded lane instructions.
    • Use split_instructions(...) plus insert_auto_sync(...), then inspect only the failing side (cube or vec).
    • Prefer printing just one family at a time: l1, l0, fix, ubin, or ubout.
  5. Turn the event stream into an action sequence.
    • Record only event_wait/event_set for the failing event name(s).
    • Healthy reuse should look like alternating publish/consume rounds; repeated wait or repeated set without the opposite action in between is the broken edge.
  6. Compare against a stable baseline kernel.
    • Dump the same family from a nearby working kernel and diff the action sequence.
    • This is often faster than reasoning from the fused kernel body.
  7. Check nested autosync ownership next.
    • If the failing edge sits around nested start_loop/start_if regions, inspect parent/child mixed-scope handling before touching the kernel.
    • In particular, confirm whether parent and child are really the same autosync family, not just the same pipe pair.
  8. Add a parser regression before rerunning the real kernel.
    • Put the minimal reproducer in testcases/parser/sync/test_autosync_event_metadata.py.
    • Fix the split/autosync behavior there first, then rerun the full simulator kernel.
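Step 5 of the workflow above can be sketched as a small checker. This is a minimal illustration, not part of the repository: it assumes you have already reduced the dumped instructions to (operation, event name) pairs by hand, and the function name find_broken_edges is hypothetical. It also treats the stream as one ordered sequence, which ignores real pipe concurrency, so use it only to spot the first suspicious edge.

```python
from typing import Iterable, List, Tuple


def find_broken_edges(
    actions: Iterable[Tuple[str, str]],
    event_name: str,
    preset: bool = False,
) -> List[Tuple[int, str]]:
    """Return (index, reason) for each set/wait that breaks alternation.

    `actions` is a hand-extracted list of (op, name) pairs, where op is
    "event_set" or "event_wait". This record format is an assumption.
    """
    # preset=True events start with one published token.
    tokens = 1 if preset else 0
    problems = []
    for index, (op, name) in enumerate(actions):
        if name != event_name:
            continue  # only follow the failing event family
        if op == "event_set":
            if tokens == 1:
                problems.append((index, "event_set on already-set flag"))
            tokens = 1
        elif op == "event_wait":
            if tokens == 0:
                problems.append((index, "event_wait with no published token"))
            tokens = 0
    return problems


healthy = [
    ("event_wait", "_tmp_sevent_valid_fix_0"),
    ("event_set", "_tmp_sevent_valid_fix_0"),
    ("event_wait", "_tmp_sevent_valid_fix_0"),
    ("event_set", "_tmp_sevent_valid_fix_0"),
]
missing_set = [
    ("event_wait", "_tmp_sevent_valid_fix_0"),
    ("event_wait", "_tmp_sevent_valid_fix_0"),
]
print(find_broken_edges(healthy, "_tmp_sevent_valid_fix_0", preset=True))
print(find_broken_edges(missing_set, "_tmp_sevent_valid_fix_0", preset=True))
```

The second stream consumes the preset token and then waits again with no set in between, which is the pattern the workflow associates with an event_wait timeout.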

When this workflow points to parser behavior, jump to:

  • agent/references/constraints/autosync.md

V2 simulator: CvMutex/VcMutex timeout (wait_vec/wait_cube)

When V2 reports a sync timeout such as:

wait_vec timeout: {'scope': 'intra_core', 'name': 'vec_to_cube_0_0', 'target_phase': 3, 'current_phase': 2, 'consumed_phase': 2}

The timeout almost always means the other lane's actor thread crashed, not that the sync logic itself is wrong. The crashing thread silently terminates, so the expected vec_ready or cube_ready signal is never published, and the waiting side eventually times out.
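When triaging many logs, it can help to pull the fields out of the timeout message mechanically. The sketch below assumes the message always has the shape shown above (a kind, the literal text " timeout: ", then a Python-style dict); parse_sync_timeout and the derived phases_behind field are hypothetical helpers, not simulator APIs.

```python
import ast


def parse_sync_timeout(message: str) -> dict:
    """Split 'wait_vec timeout: {...}' into its kind and payload fields."""
    kind, sep, payload = message.partition(" timeout: ")
    if not sep:
        raise ValueError("not a sync timeout message")
    info = ast.literal_eval(payload)  # payload is a plain dict literal
    info["kind"] = kind
    # Literal reading of the counters: how far the awaited phase is ahead.
    info["phases_behind"] = info["target_phase"] - info["current_phase"]
    return info


msg = (
    "wait_vec timeout: {'scope': 'intra_core', 'name': 'vec_to_cube_0_0', "
    "'target_phase': 3, 'current_phase': 2, 'consumed_phase': 2}"
)
info = parse_sync_timeout(msg)
print(info["kind"], info["name"], info["phases_behind"])
```

For the sample message, the waiting side wanted phase 3 while only phase 2 had been seen, which is consistent with a signal that was never published.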

Debugging sequence:

  1. Capture the real error on the other lane first. Patch CoreRuntime.start to wrap each ControlActor.start() in a try/except that prints the lane name and exception. The first non-timeout error is the real root cause.
  2. Common root causes behind the silent crash:
    • Float8 indexing: PyTorch does not support tensor[indices] for float8_e5m2/float8_e4m3fn. Any _gather_1d, _scatter_1d, or fancy indexing on a float8 register or UB tensor raises "index_cpu" not implemented for 'Float8_e5m2'. Fix by viewing as torch.uint8 before indexing.
    • Non-contiguous UB views in burst copy: ub_to_gm_pad/ub_to_l1_nz use .view(torch.uint8) on the source. A column slice (stride > 1) makes .view() fail. Fix with _linear_view_from_pointer().
    • Micro op not implemented: a @vf() body calls an op that MicroRuntime does not dispatch (NotImplementedError). The vec lane dies and its free() never fires.
  3. After fixing the vec/cube error, the sync timeout resolves on its own.
  4. Do not tune sync timeouts or phase counters to work around these failures: the counters are correct; the lane just never ran to completion.
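The idea behind step 1 can be sketched with plain threads. How ControlActor runs its lane internally is not shown here, so this example uses threading.Thread as a stand-in; run_lane_loudly and vec_body are hypothetical names. The point is only the pattern: catch everything inside the lane body, print the lane name, and record the exception so it cannot vanish.

```python
import threading
import traceback


def run_lane_loudly(lane_name, target, errors):
    """Build a thread whose crash is printed and recorded instead of lost."""

    def wrapped():
        try:
            target()
        except Exception as exc:  # debugging aid: catch every lane failure
            print(f"[{lane_name}] crashed: {exc!r}")
            traceback.print_exc()
            errors.append((lane_name, exc))

    return threading.Thread(target=wrapped, name=lane_name)


def vec_body():
    # Stand-in for a real lane failure such as an undispatched micro op.
    raise NotImplementedError("micro op not dispatched")


errors = []
lane = run_lane_loudly("vec", vec_body, errors)
lane.start()
lane.join()
print(errors[0][0], type(errors[0][1]).__name__)
```

With a wrapper like this in place, the first recorded entry is the error to fix; the later timeout on the other lane is only a consequence.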

V2 simulator: do not run multiple simulator processes in parallel

Running multiple V2 simulator processes concurrently can produce silent data corruption. Root cause: per-lane PipeWorker threads are exposed to intra-process races under heavy CPU thread contention. Primarily affects kernels using NZ layout ops (ub_to_l1_nz, deinterleave, reg_to_ub) or complex @vf functions.

  • simulator="legacy" is still accepted but routes to the same V2 runtime, so there is no sequential fallback to switch to.
  • Always run kernel simulator tests sequentially, not in parallel with&or batch scripts.
  • If a kernel produces incorrect results only when run alongside other tests, re-run it alone before investigating.
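One way to enforce the sequential rule is to drive the test scripts from a small runner instead of a shell loop with &. The sketch below is an illustration under assumptions: run_sequentially is a hypothetical helper, and the script paths are whatever simulator tests you would otherwise have launched in parallel.

```python
import subprocess
import sys
from typing import List, Tuple


def run_sequentially(scripts: List[str]) -> List[Tuple[str, int]]:
    """Run each script to completion before starting the next one."""
    results = []
    for script in scripts:
        # subprocess.run blocks until the child exits, so runs never overlap.
        completed = subprocess.run([sys.executable, script])
        results.append((script, completed.returncode))
    return results
```

Calling run_sequentially(["/tmp/test_a.py", "/tmp/test_b.py"]) (placeholder paths) returns each script with its exit code, which makes it easy to re-run a single failing script alone afterwards.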

V2 simulator launch rule: use a real script entry and keep PYTHONPATH

When launching helper comparisons or ad-hoc debugging runs, do not start the simulator from stdin entry points such as:

  • python - <<'PY'
  • cat script.py | python

V2 uses child processes plus worker threads. On the process-spawn path, Python must be able to re-import the parent __main__ module from a real file. stdin entry points show up as <stdin>, so child startup fails with errors such as:

  • FileNotFoundError: ... '/path/to/repo/<stdin>'
  • follow-on EOFError while multiprocessing.Manager() starts

Practical rule:

  • put the repro in a real .py file and run that file
  • include the repository root in PYTHONPATH whenever the script imports local modules from outside the repo root or from a temp directory

Typical safe form:

  • PYTHONPATH=/abs/path/to/repo python /tmp/repro.py
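If the repro is launched from another Python tool rather than a shell, the same rule can be expressed as a small launcher. This is a sketch under assumptions: run_repro is a hypothetical helper, and it only covers the two points above (a real file on disk, and the repository root prepended to PYTHONPATH).

```python
import os
import subprocess
import sys


def run_repro(script_path: str, repo_root: str) -> subprocess.CompletedProcess:
    """Run a repro from a real .py file with repo_root on PYTHONPATH."""
    if not os.path.isfile(script_path):
        raise FileNotFoundError(f"repro must be a real .py file: {script_path}")
    env = dict(os.environ)
    root = os.path.abspath(repo_root)
    existing = env.get("PYTHONPATH")
    # Prepend the repo root so local modules resolve from any working directory.
    env["PYTHONPATH"] = root if not existing else root + os.pathsep + existing
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
```

Because the child is started from a file path, a spawned worker can re-import the parent module, which is exactly what a stdin entry point cannot offer.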

Layer-by-layer review

Use this order when the fast-path sections above did not match or did not fix the bug:

  1. contract and cast order
  2. layout and shape bindings
  3. tile and capacity assumptions
  4. tail handling
  5. sync and ownership
  6. counters and lifetime separation
  7. precision boundaries
  8. parser/simulator/codegen implementation path

1. Re-check the exact contract

Verify the kernel against the real PyTorch formula. Common failure modes: wrong cast order, wrong transpose interpretation, wrong reshape meaning, accidental semantic drift. If the reference is still fuzzy, stop here and clarify it before changing the DSL code.

2. Re-check layout and shape binding assumptions

Verify tensor logical shapes, transpose site, shape_bindings, repeated scalar dimension mapping.

Common signs: output shape is right but values are wrong everywhere; only some shapes fail; changing M, N, or K flips behavior unpredictably.

Repository reminder: if repeated scalar dimensions are ambiguous, try explicit shape_bindings before deeper kernel surgery.

3. Re-check tile and capacity assumptions

When the kernel is tiled, verify TILE_M, TILE_N, TILE_K, m_split, n_split, splitk/splitn, L0A/L0B/L0C byte budgets.

Repository reminders: keep splitk and splitn at >= 32; choose splitk when K-side staging is too large; choose splitn when N-side staging or output tile is too large; do not author non-zero L0C row offsets on matmul destinations. For the exact per-device caps and DBuff formulas, see agent/references/facts-authoring.md and agent/references/facts-device-runtime.md.
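A quick byte-budget check can be sketched as follows. The capacities used in the example (64 KiB for L0A and L0B, 128 KiB for L0C) are placeholders chosen for illustration, not the per-device caps from the referenced facts files, and the sketch ignores the DBuff formulas. L0Caps, l0_usage, and over_budget are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class L0Caps:
    """Per-buffer capacities in bytes; fill in real values for your device."""
    l0a_bytes: int
    l0b_bytes: int
    l0c_bytes: int


def l0_usage(tile_m: int, tile_n: int, tile_k: int,
             in_bytes: int = 2, acc_bytes: int = 4) -> Dict[str, int]:
    """Bytes needed by one tile: A is MxK, B is KxN, C is MxN."""
    return {
        "L0A": tile_m * tile_k * in_bytes,
        "L0B": tile_k * tile_n * in_bytes,
        "L0C": tile_m * tile_n * acc_bytes,
    }


def over_budget(usage: Dict[str, int], caps: L0Caps) -> List[str]:
    """Return the buffers whose usage exceeds the given caps."""
    limits = {"L0A": caps.l0a_bytes, "L0B": caps.l0b_bytes, "L0C": caps.l0c_bytes}
    return [name for name in ("L0A", "L0B", "L0C") if usage[name] > limits[name]]


# Placeholder caps for illustration only.
caps = L0Caps(l0a_bytes=64 * 1024, l0b_bytes=64 * 1024, l0c_bytes=128 * 1024)
print(over_budget(l0_usage(128, 128, 128), caps))
print(over_budget(l0_usage(256, 128, 256), caps))
```

With these placeholder caps, the second tile overflows only L0A (256 x 256 two-byte elements), which is the K-side staging case where the reminders above suggest splitk.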

If tile search is non-trivial, use agent/scripts/estimate_matmul_datamove.py instead of eyeballing it. Drill into agent/references/constraints/tiling.md for reasoning.

4. Re-check tail handling

Look at GM boundaries first, not local tensor sizes. Rule: local buffers stay full-tile sized; only GM read/write boundaries use valid_m, valid_n, valid_k.
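The rule can be illustrated in one dimension with plain Python lists standing in for GM and a local buffer. This is a toy model, not DSL code: tile_bounds and copy_tile_from_gm are hypothetical helpers, and the zero padding value is an assumption.

```python
from typing import Iterator, List, Tuple


def tile_bounds(total: int, tile: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, valid) per tile; valid shrinks only on the tail tile."""
    for start in range(0, total, tile):
        yield start, min(tile, total - start)


def copy_tile_from_gm(gm_row: List[int], start: int, valid: int,
                      tile: int, pad: int = 0) -> List[int]:
    """Read `valid` elements from GM into a buffer that stays `tile` long."""
    local = [pad] * tile                          # local buffer is full-tile sized
    local[:valid] = gm_row[start:start + valid]   # only the GM read uses valid
    return local


gm_row = list(range(10))
print(list(tile_bounds(len(gm_row), 4)))
print(copy_tile_from_gm(gm_row, start=8, valid=2, tile=4))
```

For a length-10 row and tile size 4, the last tile has valid = 2, yet the local buffer is still four elements long; only the slice into GM is shortened.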

For cube -> vec writeback, verify the standard half-row split:

  • half_rows = CeilDiv(valid_m, 2)
  • row_begin = GetSubBlockIdx() * half_rows
  • row_end = Min(row_begin + half_rows, valid_m)
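The three formulas above translate directly into Python. The sketch assumes two vec sub-blocks, so the sub-block index is 0 or 1; half_row_range is a hypothetical name used only for this illustration.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def half_row_range(valid_m: int, sub_block_idx: int) -> Tuple[int, int]:
    """Rows [row_begin, row_end) owned by one vec sub-block."""
    half_rows = ceil_div(valid_m, 2)
    row_begin = sub_block_idx * half_rows
    row_end = min(row_begin + half_rows, valid_m)
    return row_begin, row_end


for valid_m in (8, 7, 1):
    print(valid_m, half_row_range(valid_m, 0), half_row_range(valid_m, 1))
```

Note the valid_m = 1 case: sub-block 1 gets the empty range (1, 1), so its writeback must tolerate a row count of zero. That is one way to end up with one correct sub-block and one garbage sub-block.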

For a2 workspace-mediated cube -> vec tails: keep workspace writes and reads on stable tile shapes (ws[..., 0:TILE_M, 0:TILE_N] on cube; ws[..., row_begin:row_begin + row_count, 0:TILE_N] on vec). Apply valid_n with vec-side masking and final GM write boundaries, not by cropping the workspace column span first.

Symptoms of tail bugs: aligned cases pass but odd sizes fail; only the last tile is wrong; one vec subblock is correct and the other is garbage.

Drill: agent/references/constraints/tail-safety.md. For normalized online softmax with running row_max/row_sum, also agent/references/constraints/online-softmax-tail.md.

5. Re-check sync ownership

Assume ownership is wrong until proven otherwise.

auto_sync() only manages same-side ordering and does not replace cross-side ownership transfer. Cube -> vec handoff needs CvMutex; vec -> cube handoff needs VcMutex. Exact mutex signatures per device live in agent/references/facts-device-runtime.md.

If the issue smells like pipeline ordering: inspect where the producer finishes, where the consumer starts, whether lock/ready/wait/free surround the real ownership edge, and keep the critical section narrow.

Drill: agent/references/constraints/autosync.md.

6. Re-check counters and lifetimes

Many broken kernels are actually lifetime bugs. Verify which loop owns each buffer family, whether different lifetimes accidentally share one counter, whether the same slot lineage is expressed consistently.

Rules: buffers with different lifetimes must use different counters; same-lifetime paired buffers may share one; reusing one counter across different loop-owned lifetimes can silently break autosync grouping and slot reasoning.
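A toy model can make the counter rule concrete. It assumes a double-buffered family picks its slot as counter % 2, which is an illustrative assumption rather than a documented DSL rule, and outer_buffer_slots is a hypothetical function. The model tracks which slot an outer-loop buffer gets when an inner-loop buffer either has its own counter or shares the outer one.

```python
from typing import List


def outer_buffer_slots(shared_counter: bool, outer_iters: int = 4,
                       inner_iters: int = 1) -> List[int]:
    """Slots chosen for the outer-loop buffer under a toy counter % 2 rule."""
    outer_cnt = 0
    inner_cnt = 0
    slots = []
    for _ in range(outer_iters):
        slots.append(outer_cnt % 2)   # outer-loop buffer picks its slot
        outer_cnt += 1
        for _ in range(inner_iters):
            if shared_counter:
                outer_cnt += 1        # inner buffer advances the same counter
            else:
                inner_cnt += 1        # inner buffer advances its own counter
    return slots


print(outer_buffer_slots(shared_counter=False))
print(outer_buffer_slots(shared_counter=True))
```

With separate counters the outer buffer alternates slots as intended; with a shared counter and one inner iteration it lands on slot 0 every time, so the ping-pong pattern silently disappears even though nothing crashes.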

Drill: agent/references/constraints/counters.md.

7. Re-check precision boundaries

Verify where values change dtype. Common failures: casting too early, reducing in the wrong dtype, writing packed or quantized data too early, comparing against a reference with a different cast order.

Rule: keep matmul accumulation in float; downcast later unless the design proves otherwise.
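The effect of downcasting the accumulator too early can be reproduced with the standard library alone. In this sketch Python's float (a double) stands in for the wider accumulator, and half precision is emulated by round-tripping through struct's "e" format; to_f16 and accumulate are hypothetical helper names.

```python
import struct
from typing import Iterable


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values: Iterable[float], early_downcast: bool) -> float:
    """Sum values, optionally storing the running sum in half precision."""
    acc = 0.0
    for value in values:
        acc = acc + value
        if early_downcast:
            acc = to_f16(acc)   # accumulator itself is kept in fp16
    return to_f16(acc)          # final downcast for the output


values = [2048.0, 1.0, 1.0, 1.0, 1.0]
print(accumulate(values, early_downcast=True))
print(accumulate(values, early_downcast=False))
```

Half precision has a spacing of 2 between 2048 and 4096, so each +1 is rounded away when the running sum is stored in fp16 and the result stays at 2048.0. Accumulating in the wider type and downcasting once gives 2052.0.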

Drill: agent/references/constraints/precision.md.

8. Inspect the real implementation path

If a rule is still unclear, inspect the actual implementation path instead of theorizing. Device family mapping (950 → C310, b* → C220) and common target files (easyasc/stub_functions/, easyasc/parser/, easyasc/parser/asc_autosync.py, easyasc/kernelbase/kernelbase.py, easyasc/simulator_v2/, easyasc/shortcuts/matmul.py) are in agent/references/code-paths.md.

Good debugging question: which exact instruction gets emitted, how the parser lowers it, how the simulator executes it, whether the kernel assumption matches that path.

When the simulator itself produces an unexpected error: investigate the simulator path first; inspect the exact simulator stage, runtime view, and lowered instruction that failed; do not assume the upper-layer kernel is wrong just because the simulator failed first.

If simulator behavior still looks inconsistent with the intended model after real inspection: stop blind upper-layer edits, summarize the concrete simulator finding, pause and discuss with the user.

Build a minimal reproducer

When the full kernel is noisy, isolate one mechanism: one matmul, one handoff, one vec postprocess, one autosync chain, one tail tile. A minimal reproducer is usually faster than staring at a fused kernel.

Shrink-down order: keep the original failing shape, remove later stages until only the first wrong stage remains, inside that stage keep only one subformula (odo, rowmax, one GM bridge), shrink again if needed to one instruction and one view shape.

Treat warnings as real signals

Do not accept a passing result with unresolved warnings. Especially for auto_sync, warnings usually mean the lifetime model is off. If a warning persists after real inspection, stop blind iteration: either redesign the stage boundary or ask the user for clarification.

Fallback references

  • agent/references/code-paths.md
  • agent/references/simulator-v2.md
  • doc/11_architecture_for_contributors.md


Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
