System: Ubuntu 22.04
Notes: The Tsinghua team's GitHub repo says this is already supported, but it is a side feature and the docs are entirely in English, so I worked out the steps myself. Only 24 GB of VRAM is needed.
Why deploy locally: With API-mode DeepSeek on platforms such as Volcengine and SiliconFlow, a 128k context (roughly 230,000 Chinese characters) errors out outright when reading papers, novels, or archives. This modified KTransformers build, however, supports locally deployed DeepSeek-R1-671B in every quantized variant, including 1.58bit, q2-m, q3-k, q4-k, q6, and q8, all with full 128k-context reading and analysis. That is the whole point of running it locally.
Version:
ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl
Steps:
1. Activate the kt conda environment:
conda activate kt
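The wheel below targets CPython 3.11 (the cp311 tag), so it is worth confirming the environment's Python version before going further; a minimal check, assuming a standard conda setup:
python --version # should print Python 3.11.x; the cp311 wheel will not install on other versions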
2. Uninstall the previous ktransformers version:
pip uninstall ktransformers -y
3. Download the latest ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl into /home/<username>/:
wget "https://cdn3.easylink.cc/659c815e-00d8-40b0-baf0-156a5814c9a5_ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl?e=1742568571&token=J_WyMIdhZtwb0E0QHWRqEfQrd51VSMLff19QxaxP:Q_DVh06hcFMVM5DsPGu76vQjg8s=" -O ~/ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl # ktransformers_new
Or download the wheel from the ktransformers releases page into /home/<username>/: https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.3post1/ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl
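Either way, a quick check that the wheel actually landed in the home directory (filename assumed to match the release above):
ls -lh ~/ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl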
4. Install the new ktransformers wheel:
pip install ~/ktransformers-0.2.3.post1+cu124torch24avx2-cp311-cp311-linux_x86_64.whl
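To confirm pip picked up the new build rather than a cached one, a simple version check:
pip show ktransformers | grep -i version # should report 0.2.3.post1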
5. Install flashinfer:
pip3 install flashinfer-python
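As far as I know the flashinfer-python package is imported as flashinfer, so a quick import test should catch a broken install:
python -c "import flashinfer" && echo "flashinfer OK"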
6. Modify the DeepSeek-V3-Chat rule file to enable matrix absorption for long context and reduce VRAM usage. First, go to the directory holding the rule files:
cd ~/miniconda3/envs/kt/lib/python3.11/site-packages/ktransformers/optimize/optimize_rules
7. Confirm the file is there:
ls
and check that DeepSeek-V3-Chat.yaml is listed.
8. Open it:
nano DeepSeek-V3-Chat.yaml
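Before saving any changes, a backup of the original rule file in the same directory makes it easy to roll back a bad edit:
cp DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat.yaml.bak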
9. Find the following block:
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: False # change this to True to enable long context(prefill may slower).
10. Change it to:
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True # change this to True to enable long context(prefill may slower).
      chunk_prefill_size: 4096 # smaller chunked-prefill size to further reduce memory usage
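After saving, a quick grep confirms both edits took effect (YAML is indentation-sensitive, so also make sure chunk_prefill_size lines up with the other kwargs):
grep -nE "absorb_for_prefill|chunk_prefill_size" DeepSeek-V3-Chat.yaml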
11. If the server reports an error at startup, keep editing DeepSeek-V3-Chat.yaml and find the following block:
- match:
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGateDeepSeekV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
12. Change it to use the generic KMoEGate operator:
- match:
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
13. Launch command for API mode:
export HF_ENDPOINT="https://hf-mirror.com"
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ktransformers \
--model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /home/dministrator/models/DeepSeek-R1-Q4_K_M \
--max_new_tokens 8192 \
--total_context 131072 \
--cache_lens 131072 \
--cpu_infer 31 \
--cache_q4 true \
--temperature 0.9 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 10002
# Adjust cpu_infer to your own CPU core count, and point gguf_path at your own GGUF model directory.
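Once the server is up, a quick smoke test against the OpenAI-compatible chat endpoint the ktransformers API server exposes (the model field here is a placeholder; the server may ignore it):
curl http://127.0.0.1:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}'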