Update README.md

README.md CHANGED

---
language:
- en
pipeline_tag: text-generation
tags:
- instruct
- chat
---

## Requirements
We advise you to clone [`llama.cpp`](https://github.com/ggerganov/llama.cpp) and install it following the official guide. We follow the latest version of llama.cpp.
In the following demonstration, we assume that you are running commands under the `llama.cpp` repository.
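
As a concrete starting point, here is a minimal sketch of that setup; the official llama.cpp build guide is authoritative, and the exact commands depend on your platform and llama.cpp version:

```bash
# Clone llama.cpp and build it (see the official guide for platform-specific options).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# With a CMake build, the llama-* binaries (llama-cli, llama-server, llama-gguf-split)
# are placed under build/bin/.
```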

## How to use
Cloning the repo may be inefficient, so you can manually download the GGUF file that you need or use `huggingface-cli` (`pip install huggingface_hub`) as shown below:
```shell
huggingface-cli download Qwen/Qwen2-72B-Instruct-GGUF qwen2-72b-instruct-q4_0.gguf --local-dir . --local-dir-use-symlinks False
```

However, for large files, we split them into multiple segments due to the 50 GB limit on a single uploaded file.
Specifically, the split files share a common prefix, with a suffix indicating the segment index. For example, the `q5_k_m` GGUF files are:

```
qwen2-72b-instruct-q5_k_m-00001-of-00002.gguf
qwen2-72b-instruct-q5_k_m-00002-of-00002.gguf
```

They share the prefix `qwen2-72b-instruct-q5_k_m`, but each has its own indexing suffix, e.g. `-00001-of-00002`.
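
Both segments need to be present locally before they can be merged; as a sketch, they can be fetched with the same `huggingface-cli` command used above, once per file:

```bash
# Download both q5_k_m segments into the current directory.
huggingface-cli download Qwen/Qwen2-72B-Instruct-GGUF qwen2-72b-instruct-q5_k_m-00001-of-00002.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download Qwen/Qwen2-72B-Instruct-GGUF qwen2-72b-instruct-q5_k_m-00002-of-00002.gguf --local-dir . --local-dir-use-symlinks False
```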

To use the split GGUF files, you need to merge them first with the `llama-gguf-split` command, as shown below:

```bash
./llama-gguf-split --merge qwen2-72b-instruct-q5_k_m-00001-of-00002.gguf qwen2-72b-instruct-q5_k_m.gguf
```

With the upgrade of the llama.cpp APIs, `llama-gguf-split` is equivalent to the previous `gguf-split`.
As for the arguments of this command, the first is the path to the first split GGUF file and the second is the path to the merged output GGUF file.

To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (the previous `server`).
We recommend using `llama-server`, as it is simple and compatible with the OpenAI API. For example:

```bash
./llama-server -m qwen2-72b-instruct-q4_0.gguf
```
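
The same command works for a merged split quantization; for instance, with the `q5_k_m` file produced by the merge step above:

```bash
./llama-server -m qwen2-72b-instruct-q5_k_m.gguf
```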

Then it is easy to access the deployed service with the OpenAI API:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "tell me something about michael jordan"}
    ]
)
print(completion.choices[0].message.content)
```
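
If you prefer not to use the Python client, the same OpenAI-compatible endpoint can be queried directly, for example with `curl` (a sketch assuming the default `llama-server` port 8080 and its `/v1/chat/completions` route):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "tell me something about michael jordan"}
        ]
      }'
```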

If you choose to use `llama-cli`, note that `-cml` for the ChatML template has been removed; instead, you should use `--in-prefix` and `--in-suffix` to achieve the same effect.

```bash
./llama-cli -m qwen2-72b-instruct-q4_0.gguf -n 512 -co -i -if -f prompts/chat-with-qwen.txt --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n"
```

## Citation

If you find our work helpful, feel free to give us a cite.

```bibtex
@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}
```