news 2026/5/8 20:41:47

Interview Prep: Tokenizer Training


张小明

Front-end Development Engineer


1 Code
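The script below trains a byte-level BPE tokenizer with Hugging Face's `tokenizers` library. To make the core idea concrete first, here is a self-contained toy sketch of what a BPE trainer does under the hood: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new vocabulary symbol. The toy corpus and all names here are illustrative only, not part of the original script.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into single characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):  # learn 3 merge rules
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)  # -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The real `trainers.BpeTrainer` used below applies the same idea at the byte level, with heavy optimization, a vocabulary-size budget, and special-token handling.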

```python
# Note: Re-training the tokenizer ("vocabulary") is not recommended. MiniMind already
# includes one; this script is for learning and reference only. Models trained with
# different vocabularies produce completely inconsistent outputs and reduce model
# reusability in the community.
import os
import json
from tokenizers import decoders, models, pre_tokenizers, trainers, Tokenizer

DATA_PATH = '../dataset/pretrain_hq.jsonl'
TOKENIZER_DIR = '../model_learn_tokenizer/'
VOCAB_SIZE = 6400


def get_texts(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= 10000:
                break  # experimental: use only the first 10000 lines for a quick test
            data = json.loads(line)
            yield data['text']


def train_tokenizer(data_path, tokenizer_dir, vocab_size):
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
        show_progress=True,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
    )
    texts = get_texts(data_path)
    tokenizer.train_from_iterator(texts, trainer=trainer)
    tokenizer.decoder = decoders.ByteLevel()
    assert tokenizer.token_to_id("<|endoftext|>") == 0
    assert tokenizer.token_to_id("<|im_start|>") == 1
    assert tokenizer.token_to_id("<|im_end|>") == 2
    os.makedirs(tokenizer_dir, exist_ok=True)
    tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
    tokenizer.model.save(tokenizer_dir)
    config = {
        "add_bos_token": False,
        "add_eos_token": False,
        "add_prefix_space": False,
        "added_tokens_decoder": {
            "0": {"content": "<|endoftext|>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False, "special": True},
            "1": {"content": "<|im_start|>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False, "special": True},
            "2": {"content": "<|im_end|>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False, "special": True}
        },
        "additional_special_tokens": [],
        "bos_token": "<|im_start|>",
        "clean_up_tokenization_spaces": False,
        "eos_token": "<|im_end|>",
        "legacy": True,
        "model_max_length": 32768,
        "pad_token": "<|endoftext|>",
        "sp_model_kwargs": {},
        "spaces_between_special_tokens": False,
        "tokenizer_class": "PreTrainedTokenizerFast",
        "unk_token": "<|endoftext|>",
        "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' -%}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else -%}\n {{- '<|im_start|>system\\nYou are a helpful assistant<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}"
    }
    with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
    print("Tokenizer training completed.")


def eval_tokenizer(tokenizer_dir):
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    messages = [
        {"role": "system", "content": "你是一个优秀的聊天机器人,总是给我正确的回应!"},
        {"role": "user", "content": '你来自哪里?'},
        {"role": "assistant", "content": '我来自地球'}
    ]
    new_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    print('-' * 100)
    print(new_prompt)
    print('-' * 100)
    print('tokenizer vocab size:', len(tokenizer))
    model_inputs = tokenizer(new_prompt)
    print('encoded length:', len(model_inputs['input_ids']))
    response = tokenizer.decode(model_inputs['input_ids'], skip_special_tokens=False)
    print('decode round-trip consistent:', response == new_prompt, "\n")
    print('-' * 100)
    print('streaming decode (byte buffering) test:')
    input_ids = model_inputs['input_ids']
    token_cache = []
    for tid in input_ids:
        token_cache.append(tid)
        current_decode = tokenizer.decode(token_cache)
        # Only emit once the buffer decodes to complete characters (no U+FFFD).
        if current_decode and '\ufffd' not in current_decode:
            display_ids = token_cache[0] if len(token_cache) == 1 else token_cache
            raw_tokens = [tokenizer.convert_ids_to_tokens(int(t)) for t in token_cache]
            print(f'Token ID: {str(display_ids):15} -> Raw: {str(raw_tokens):20} -> Decode Str: {current_decode}')
            token_cache = []


if __name__ == '__main__':
    train_tokenizer(DATA_PATH, TOKENIZER_DIR, VOCAB_SIZE)
    eval_tokenizer(TOKENIZER_DIR)
```
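The streaming-decode loop in `eval_tokenizer` buffers token ids until the decoded string contains no U+FFFD replacement character. The reason: byte-level BPE tokens can end in the middle of a multi-byte UTF-8 character, so decoding a partial buffer yields `\ufffd`. A stdlib-only sketch of the same buffering idea, with raw byte chunks standing in for tokens (the chunk boundaries are contrived for illustration):

```python
# "你" encodes to 3 UTF-8 bytes; the first chunk deliberately splits it mid-character.
chunks = ["你".encode("utf-8")[:1], "你".encode("utf-8")[1:], "好".encode("utf-8")]

buffer = b""
emitted = []
for chunk in chunks:
    buffer += chunk
    text = buffer.decode("utf-8", errors="replace")
    if "\ufffd" not in text:  # buffer holds only complete characters: safe to emit
        emitted.append(text)
        buffer = b""
print(emitted)  # -> ['你', '好']
```

After the first chunk the buffer decodes to `\ufffd`, so nothing is emitted; once the remaining bytes of the character arrive, the whole character is flushed. This is exactly why the script's loop keeps appending ids to `token_cache` until the decode comes back clean.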