B200 FP8 inference FYI
Getting FP8 inference working on 2x B200 GPUs takes a fair amount of trial and error, so here is a cheatsheet.
docker run --name sglang --rm --gpus all -it \
  -p 8000:30000 \
  -v /mnt/s2:/mnt/s1 \
  -e HF_HOME=/mnt/s1 \
  -e NCCL_DEBUG=INFO \
  -e CUDA_LAUNCH_BLOCKING=1 \
  --ipc=host \
  lmsysorg/sglang:b200-cu129 \
  python -m sglang.launch_server \
    --model-path stepfun-ai/step3-fp8 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 2 \
    --attention-backend triton \
    --sampling-backend pytorch \
    --decode-attention-backend triton \
    --mm-attention-backend sdpa \
    --enable-multimodal \
    --mem-fraction-static 0.875 \
    --cuda-graph-max-bs 1 \
    --max-prefill-tokens 2048 \
    --max-running-requests 2 \
    --allow-auto-truncate \
    --disable-overlap-schedule \
    --moe-runner-backend triton \
    --dtype float16
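
Once the container is up, the server listens on container port 30000, which the -p flag above maps to host port 8000. A quick smoke test against SGLang's OpenAI-compatible chat endpoint could look like the following (the model name is an assumption; check GET /v1/models for the actual served name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepfun-ai/step3-fp8",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'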