Datamove Constraints

cannbot-skills: CANNBot is a series of agents for improving development efficiency on CANN; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when a kernel needs to move data between GM, UB, L1, or L0 using non-trivial transfer patterns.

Goal

Choose the right datamove recipe so that:

- the publish path matches the downstream consumer's expected layout
- unaligned widths are handled by padding rather than by shrinking local tensors
- strided gathers avoid unnecessary host-side `permute` or `expand`
- internal workspace bridges stay explicit when on-chip reuse does not fit

1. ND publish (`ub_to_l1_nd2nz`)

Best for straightforward vec preprocess + cube consume.

- write subblock rows into UB, then publish with explicit `m_dst`/`n_dst`/`m_src`/`n_src`
- keep row mapping consistent with `GetSubBlockIdx()`
- in general vec preprocess, split into two half ranges for the two vector sides:
  - `half_rows = CeilDiv(total_rows, 2)`
  - vector side 0 handles `[0:half_rows]`
  - vector side 1 handles `[half_rows:total_rows]`
  - publish each half independently to the matching L1 row slice
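The half-range split above can be sketched in plain Python. This is only an illustration of the index arithmetic, not kernel code: `ceil_div` and `half_row_range` are hypothetical helper names, and `sub_block_idx` is assumed to take the values 0 and 1 in the way `GetSubBlockIdx()` is used in the text.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def half_row_range(total_rows: int, sub_block_idx: int) -> Tuple[int, int]:
    """Return the [start, stop) row range owned by one vector side.

    Assumption: sub_block_idx is 0 or 1 (hypothetical stand-in for the
    value returned by GetSubBlockIdx()).
    """
    half_rows = ceil_div(total_rows, 2)
    if sub_block_idx == 0:
        return 0, half_rows
    return half_rows, total_rows


if __name__ == "__main__":
    for side in (0, 1):
        print(side, half_row_range(7, side))
```

With an odd `total_rows`, side 0 takes the extra row, and with `total_rows == 1` side 1 receives an empty range, so each side's publish step should tolerate a zero-row slice.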

Files to study:

agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py

2. NZ publish (`ub.nz()`)

Use when input is already packed for NZ path. Common flow:

- do vec compute in ND register form
- pack to NZ-friendly UB layout (`deinterleave`, `reg_to_ub`)
- publish with `l1 <<= ub.nz()`

Files to study:

agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py

3. Unaligned width handling

For unaligned GM widths, allocate UB second dim to aligned width and rely on padded transfer behavior. Do not shrink the UB tensor shape to the logical width.

For narrow a5 vec-only row kernels, a useful specialization is:

- keep the logical host contract as `[rows, H]`
- when `H < 64`, still stage the chunk in UB as `[rows, 64]`
- use `gm_to_ub_pad(..., burst_len_element=H, dst_stride=(64 - H) / C0)` to zero-pad each row on load
- run the same `@vf()` row logic against `row_stride = 64`
- write back with `ub_to_gm_pad(..., burst_len_element=H, src_stride=(64 - H) / C0)` so only the logical columns return to GM
- this is a good fit when the vec math is row-recursive and you want one shared `@vf()` body for both wide rows and narrow `H < 64` rows
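The data flow of this specialization can be emulated on flat Python lists to check the intended result on the host. This is a reference sketch under stated assumptions, not the DSL API: `load_padded`, `store_logical`, and `row_cumsum_inplace` are hypothetical names, and the running sum is only a stand-in for a row-recursive `@vf()` body.

```python
STAGE_W = 64  # staged UB row width used in the text


def load_padded(gm_flat, rows, h, stage_w=STAGE_W):
    """Emulate a zero-padding load: copy h logical columns per row into a
    stage_w-wide staging row and fill the remaining columns with zeros."""
    assert 0 < h <= stage_w
    ub = []
    for r in range(rows):
        row = list(gm_flat[r * h:(r + 1) * h])
        ub.extend(row + [0.0] * (stage_w - h))
    return ub


def row_cumsum_inplace(ub_flat, rows, row_stride=STAGE_W):
    """Stand-in for a shared row body: running sum along each staged row."""
    for r in range(rows):
        base = r * row_stride
        for c in range(1, row_stride):
            ub_flat[base + c] += ub_flat[base + c - 1]


def store_logical(ub_flat, rows, h, stage_w=STAGE_W):
    """Emulate the padded writeback: only the first h columns of each
    staged row return to the flat GM buffer."""
    gm = []
    for r in range(rows):
        gm.extend(ub_flat[r * stage_w:r * stage_w + h])
    return gm


if __name__ == "__main__":
    gm_in = [1, 2, 3, 4, 5, 6]           # logical shape [2, 3]
    ub = load_padded(gm_in, rows=2, h=3)
    row_cumsum_inplace(ub, rows=2)
    print(store_logical(ub, rows=2, h=3))
```

Because the padding columns are zero and sit to the right of the logical columns, a left-to-right recurrence over the full 64-wide row leaves the first `H` results unchanged, which is why the same row body can serve both wide and narrow rows.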

Practical limit:

- for float32, this padding shortcut is cleanest when the row-width gap is expressible in `C0=8` units
- it does not solve the wider-column tail case by itself when `H >= 64` but `H % 64 != 0`
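A small host-side guard can make this limit explicit before a kernel variant is chosen. The sketch below is a conservative check that assumes float32 with `C0 = 8` and a 64-wide stage, as in the text. `pad_stride_blocks` is a hypothetical helper, and how the real transfer behaves for gaps that are not a multiple of `C0` is not stated here, so such cases are simply rejected.

```python
def pad_stride_blocks(h, stage_w=64, c0=8):
    """Return (stage_w - h) / c0 as an int when the padding shortcut
    applies cleanly, otherwise None.

    Assumptions: float32 elements, so c0 = 8 elements per block, and a
    64-wide staged row. Rows wider than stage_w are outside this shortcut.
    """
    if h <= 0 or h > stage_w:
        return None  # wider rows need separate tail handling
    gap = stage_w - h
    if gap % c0 != 0:
        return None  # gap is not expressible in whole C0 units
    return gap // c0


if __name__ == "__main__":
    for h in (48, 60, 64, 100):
        print(h, pad_stride_blocks(h))
```

For example, `H = 48` leaves a gap of 16 elements, which is two `C0` blocks, while `H = 60` leaves a gap of 4 elements and falls outside the clean case.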

Files to study:

agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py
agent/example/kernels/a5/chunk_row_cumsum.py

4. Strided GM gather without host `permute`

When logical rows are separated by a fixed stride in flattened GM, use `gm_to_ub_pad` directly:

- set `n_burst` to the number of logical rows
- set `burst_len_element` to the contiguous row width
- set `src_stride_element` to `full_row_step - burst_len_element`
- keep `dst_stride=0` when the UB row shape already matches the aligned burst footprint

This is the main way to preserve a reshape-only host contract for attention-style layouts such as `key:[B,S,H,D]` and `prob:[BH,S]`.
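The parameter choice can be checked with a host-side emulation of a fixed-stride gather over a flat buffer. This sketch assumes that `src_stride_element` counts the elements skipped between the end of one burst and the start of the next, which is how the formula `full_row_step - burst_len_element` reads. `gather_params` and `emulate_gather` are hypothetical names, not part of the DSL.

```python
def gather_params(n_rows, row_width, full_row_step):
    """Derive the burst parameters named in the text for a fixed-stride gather."""
    assert full_row_step >= row_width
    return {
        "n_burst": n_rows,
        "burst_len_element": row_width,
        "src_stride_element": full_row_step - row_width,
    }


def emulate_gather(gm_flat, start, n_burst, burst_len_element, src_stride_element):
    """Assumed reference semantics: read burst_len_element items, then skip
    src_stride_element items, repeated n_burst times from offset start."""
    out = []
    pos = start
    for _ in range(n_burst):
        out.extend(gm_flat[pos:pos + burst_len_element])
        pos += burst_len_element + src_stride_element
    return out


if __name__ == "__main__":
    # key laid out as [B, S, H, D] = [1, 3, 2, 4], flattened row-major.
    B, S, H, D = 1, 3, 2, 4
    key_flat = list(range(B * S * H * D))
    params = gather_params(n_rows=S, row_width=D, full_row_step=H * D)
    # Gather all S rows of head h = 1 in batch b = 0 without a host permute.
    start = 0 * S * H * D + 1 * D
    print(params)
    print(emulate_gather(key_flat, start, **params))
```

Here the rows of one head are `H * D` elements apart in flattened GM, so the skip between bursts is `H * D - D`, and the host only needs to pass the original tensor with a reshape.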

5. Internal workspace bridge for single-kernel fusion

If one kernel stage produces data on `MTE3` and a later stage must reread it through `MTE2`, materialize that intermediate in GM workspace instead of trying to keep it purely local.

Stable attention pattern:

- keep `qk_tmp:[BH,S]` as float workspace for the three-pass softmax
- store `p.half()` into `prob_tmp:[BH,S]` workspace
- add an explicit stage boundary before reloading `prob_tmp`
- perform the final value scaling from that half workspace so the `p.half().float()` contract stays exact
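Why reloading from the half workspace keeps the `p.half().float()` contract exact can be illustrated with the standard library alone. The sketch below uses Python's `struct` format `e` (IEEE 754 binary16) as a stand-in for the device-side cast, and `to_half` is a hypothetical helper. Python floats are doubles, but widening a half value to float32 or to double is exact in both cases, so the comparison still holds.

```python
import struct


def to_half(x: float) -> float:
    """Round x to IEEE 754 binary16, then widen it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    p = 0.1
    stored = to_half(p)         # what the prob_tmp workspace would hold
    reloaded = to_half(stored)  # a second round trip changes nothing
    print(stored == reloaded)   # rounding to half is idempotent
    print(stored == p)          # the unrounded value is different
    # Scaling from the half workspace matches the reference contract,
    # while scaling from the unrounded probability does not.
    value = 3.0
    print(stored * value == to_half(p) * value)
    print(p * value == to_half(p) * value)
```

The point of the workspace is that the scaled operand is the already-rounded probability, so the final stage reproduces the reference result instead of a value computed from the higher-precision intermediate.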

For the final vec-only `prob_tmp -> value -> out` stage:

- keep the whole nested reload/compute/writeback chain inside one outer `auto_sync()`
- make DBuff slot ownership explicit through the ready/valid handshake rule
- verify both simulator execution and generated C++ declarations before removing manual barriers

If the delayed reuse fits in one tile of on-chip lifetime, prefer an on-chip lookahead bridge:

- keep the stage-1 operand needed again by stage-2 resident in `L1`/`TBuff`
- publish the vec-produced fp8 probability tile directly into an `L1` slot for the second cube matmul
- buffer per-tile rescale state in the same delayed slot family as the later consumer

Do not republish a freshly packed fp8 UB tile straight to L1 when exact downstream reuse matters; the packed UB layout can differ from the ND view expected by the later cube path.

Files to study: