news 2026/5/9 20:50:46

CANN Data Movement Constraints

张小明

Front-end Development Engineer

Datamove Constraints

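[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development aimed at improving development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills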

Read this file when a kernel needs to move data between GM, UB, L1, or L0 using non-trivial transfer patterns.

Goal

Choose the right datamove recipe so that:

  • the publish path matches the downstream consumer's expected layout
  • unaligned widths are handled by padding rather than by shrinking local tensors
  • strided gathers avoid unnecessary host-side permute or expand
  • internal workspace bridges stay explicit when on-chip reuse does not fit

1. ND publish (ub_to_l1_nd2nz)

Best for straightforward vec preprocess + cube consume.

  • write subblock rows into UB, then publish with explicit m_dst / n_dst / m_src / n_src
  • keep row mapping consistent with GetSubBlockIdx()
  • in general vec preprocess, split into two half ranges for the two vector sides:
    • half_rows = CeilDiv(total_rows, 2)
    • vector side 0 handles [0:half_rows]
    • vector side 1 handles [half_rows:total_rows]
    • publish each half independently to the matching L1 row slice
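The half-range split above comes down to a small piece of index arithmetic. The sketch below models only that arithmetic in plain Python; `half_row_range` and its `sub_block_idx` argument are hypothetical names standing in for whatever value `GetSubBlockIdx()` returns inside a kernel, and nothing here calls a real CANN API.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def half_row_range(total_rows: int, sub_block_idx: int) -> tuple[int, int]:
    """Return the [start, stop) row range owned by one vector side.

    sub_block_idx is a stand-in for the value GetSubBlockIdx() would
    return (assumed to be 0 or 1); this helper is illustrative only.
    """
    if sub_block_idx not in (0, 1):
        raise ValueError("expected sub_block_idx 0 or 1")
    half_rows = ceil_div(total_rows, 2)
    if sub_block_idx == 0:
        return 0, half_rows
    return half_rows, total_rows
```

With an odd `total_rows`, side 0 takes the extra row, and with `total_rows = 1` side 1 gets an empty range, so the publish for that side would need to tolerate zero rows.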

Files to study:

  • agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py

2. NZ publish (ub.nz())

Use when the input is already packed for the NZ path. Common flow:

  • do vec compute in ND register form
  • pack to NZ-friendly UB layout (deinterleave, reg_to_ub)
  • publish with l1 <<= ub.nz()

Files to study:

  • agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py

3. Unaligned width handling

For unaligned GM widths, allocate the second UB dimension at the aligned width and rely on the padded transfer behavior. Do not shrink the UB tensor shape to the logical width.

For narrow a5 vec-only row kernels, a useful specialization is:

  • keep the logical host contract as [rows, H]
  • when H < 64, still stage the chunk in UB as [rows, 64]
  • use gm_to_ub_pad(..., burst_len_element=H, dst_stride=(64 - H) / C0) to zero-pad each row on load
  • run the same @vf() row logic against row_stride = 64
  • write back with ub_to_gm_pad(..., burst_len_element=H, src_stride=(64 - H) / C0) so only the logical columns return to GM
  • this is a good fit when the vec math is row-recursive and you want one shared @vf() body for both wide rows and narrow H < 64 rows
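The effect of the padded load and write-back can be modeled on the host with plain lists. This is only a behavioral sketch under the assumption that the padded transfer zero-fills the tail of each staged row; `load_rows_padded`, `store_rows_unpadded`, and `row_cumsum_in_place` are hypothetical helpers, not the `gm_to_ub_pad` / `ub_to_gm_pad` calls themselves, and the running sum is just a stand-in for a row-recursive body.

```python
PADDED_WIDTH = 64  # UB staging width used for narrow rows in the text


def load_rows_padded(gm, rows, h, padded_width=PADDED_WIDTH):
    """Model a padded load: copy h values per row, zero-fill the rest."""
    if not 0 < h <= padded_width:
        raise ValueError("h must be in (0, padded_width]")
    ub = []
    for r in range(rows):
        row = list(gm[r * h:(r + 1) * h])
        ub.extend(row + [0.0] * (padded_width - h))
    return ub


def row_cumsum_in_place(ub, rows, row_stride):
    """Stand-in row body: running sum along each staged row."""
    for r in range(rows):
        base = r * row_stride
        for c in range(1, row_stride):
            ub[base + c] += ub[base + c - 1]


def store_rows_unpadded(ub, rows, h, padded_width=PADDED_WIDTH):
    """Model the padded write-back: keep only the h logical columns."""
    gm = []
    for r in range(rows):
        start = r * padded_width
        gm.extend(ub[start:start + h])
    return gm
```

Because the body runs against the full 64-wide stride, the padded columns do get written, but they never travel back to GM, so the logical `[rows, H]` contract is unchanged.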

Practical limit:

  • for float32, this padding shortcut is cleanest when the row-width gap is expressible in C0=8 units
  • it does not solve the wider-column tail case by itself when H >= 64 but H % 64 != 0
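A quick way to check whether a given width fits the shortcut is to compute the `(64 - H) / C0` stride up front. The helper below is a hypothetical sketch; it treats a gap that is not a whole number of C0 units as an error purely for illustration, whereas the text only says the shortcut is cleanest in that case.

```python
def pad_stride_blocks(h: int, padded_width: int = 64, c0: int = 8) -> int:
    """Return (padded_width - h) / c0 as a whole number of C0 blocks.

    Raises ValueError when h does not fit in the staged row or when the
    gap is not a multiple of c0 (c0=8 corresponds to float32 in the text).
    """
    if not 0 < h <= padded_width:
        raise ValueError("h must be in (0, padded_width]")
    gap = padded_width - h
    if gap % c0 != 0:
        raise ValueError("row-width gap is not a whole number of C0 units")
    return gap // c0
```

For example, `pad_stride_blocks(48)` gives a stride of 2 blocks, while `H = 50` or `H = 72` is rejected and would need a different tail strategy.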

Files to study:

  • agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py
  • agent/example/kernels/a5/chunk_row_cumsum.py

4. Strided GM gather without host permute

When logical rows are separated by a fixed stride in flattened GM, use gm_to_ub_pad directly:

  • set n_burst to the number of logical rows
  • set burst_len_element to the contiguous row width
  • set src_stride_element to full_row_step - burst_len_element
  • keep dst_stride=0 when the UB row shape already matches the aligned burst footprint

This is the main way to preserve a reshape-only host contract for attention-style layouts such as key: [B,S,H,D] and prob: [BH,S].
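As a worked example of the parameter choices above, the sketch below derives the burst settings for reading one head's rows, key[b, :, h, :], out of flat [B,S,H,D] storage and then emulates the gather on a Python list. The helper names and the `src_offset` field are assumptions made for illustration; only `n_burst`, `burst_len_element`, and `src_stride_element` come from the text, and `src_stride_element` is taken to be the gap between the end of one burst and the start of the next.

```python
def head_gather_params(b, h, seq_len, num_heads, head_dim):
    """Burst parameters for key[b, :, h, :] in flat [B,S,H,D] storage."""
    full_row_step = num_heads * head_dim  # elements between consecutive rows
    return {
        "src_offset": (b * seq_len * num_heads + h) * head_dim,
        "n_burst": seq_len,
        "burst_len_element": head_dim,
        "src_stride_element": full_row_step - head_dim,
    }


def emulate_strided_gather(flat, src_offset, n_burst,
                           burst_len_element, src_stride_element):
    """Host-side model: copy n_burst bursts, skipping the stride gap."""
    out = []
    pos = src_offset
    for _ in range(n_burst):
        out.extend(flat[pos:pos + burst_len_element])
        pos += burst_len_element + src_stride_element
    return out
```

The gathered result is a dense [S, D] block for one (b, h) pair, which is what a host-side permute to [B,H,S,D] would otherwise have produced.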

5. Internal workspace bridge for single-kernel fusion

If one kernel stage produces data on MTE3 and a later stage must reread it through MTE2, materialize that intermediate in GM workspace instead of trying to keep it purely local.

Stable attention pattern:

  • keep qk_tmp: [BH,S] as float workspace for the three-pass softmax
  • store p.half() into prob_tmp: [BH,S] workspace
  • add an explicit stage boundary before reloading prob_tmp
  • perform the final value scaling from that half workspace so the p.half().float() contract stays exact
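The reason the scaling has to start from the half workspace is that rounding to half precision changes the value, so scaling the unrounded probability gives a different result. The sketch below shows this with the standard library's binary16 support; `to_half` and `scale_from_half_workspace` are hypothetical names, and Python floats are doubles here rather than float32, so this models the rounding step only.

```python
import struct


def to_half(x: float) -> float:
    """Round x to IEEE 754 binary16 and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scale_from_half_workspace(p: float, value: float) -> float:
    """Scale value by the probability as it would be reloaded from half storage."""
    return to_half(p) * value


p, value = 0.1, 3.0
via_workspace = scale_from_half_workspace(p, value)  # follows p.half().float()
shortcut = p * value                                 # skips the half rounding
mismatch = via_workspace != shortcut                 # True for this input
```

Rounding is idempotent, so reloading the stored half value any number of times yields the same operand, which is what keeps the contract exact across the stage boundary.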

For the final vec-only prob_tmp -> value -> out stage:

  • keep the whole nested reload/compute/writeback chain inside one outer auto_sync()
  • make DBuff slot ownership explicit through the ready/valid handshake rule
  • verify both simulator execution and generated C++ declarations before removing manual barriers

If the delayed reuse fits in one tile of on-chip lifetime, prefer an on-chip lookahead bridge:

  • keep the stage-1 operand needed again by stage-2 resident in L1/TBuff
  • publish the vec-produced fp8 probability tile directly into an L1 slot for the second cube matmul
  • buffer per-tile rescale state in the same delayed slot family as the later consumer

Do not republish a freshly packed fp8 UB tile straight to L1 when exact downstream reuse matters; the packed UB layout can differ from the ND view expected by the later cube path.

Files to study:

  • agent/example/kernels/a5/test_mla_entire.py

Authorship statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
