akhaliq (HF Staff) committed
Commit 882079e · verified · 1 Parent(s): 5925a3a

Upload app.py with huggingface_hub

Files changed (1)
  1. app.py +332 -0
app.py ADDED
@@ -0,0 +1,332 @@
+ I'll create a chat application for the UserLM-8b model with a clean interface and proper GPU optimization. The generation function is wrapped with the @spaces.GPU decorator so that GPU hardware is only requested while a response is being generated; the model itself is loaded lazily on the first request.
+
+ ```python
+ import gradio as gr
+ import spaces
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import time
+ from typing import List, Tuple
+
+ # Model configuration
+ MODEL_PATH = "microsoft/UserLM-8b"
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Global variables for model and tokenizer
+ model = None
+ tokenizer = None
+
+ def load_model():
+     """Load the model and tokenizer."""
+     global model, tokenizer
+
+     print(f"Loading model {MODEL_PATH}...")
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_PATH,
+         trust_remote_code=True,
+         torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
+         low_cpu_mem_usage=True
+     ).to(DEVICE)
+     print(f"Model loaded successfully on {DEVICE}")
+     return model, tokenizer
+
+ @spaces.GPU(duration=120)
+ def generate_response(
+     message: str,
+     chat_history: List[Tuple[str, str]],
+     system_prompt: str,
+     temperature: float,
+     top_p: float,
+     max_new_tokens: int,
+ ) -> str:
+     """Generate a response from the model."""
+     global model, tokenizer
+
+     # Load model if not already loaded
+     if model is None or tokenizer is None:
+         model, tokenizer = load_model()
+
+     # Build conversation history
+     messages = []
+
+     # Add system prompt if provided
+     if system_prompt.strip():
+         messages.append({"role": "system", "content": system_prompt})
+
+     # Add chat history
+     for user_msg, assistant_msg in chat_history:
+         messages.append({"role": "user", "content": user_msg})
+         if assistant_msg:
+             messages.append({"role": "assistant", "content": assistant_msg})
+
+     # Add current message
+     messages.append({"role": "user", "content": message})
+
+     # Tokenize input
+     inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(DEVICE)
+
+     # Define special tokens
+     end_token = "<|eot_id|>"
+     end_token_id = tokenizer.encode(end_token, add_special_tokens=False)
+
+     end_conv_token = "<|endconversation|>"
+     end_conv_token_id = tokenizer.encode(end_conv_token, add_special_tokens=False)
+
+     # Generate response
+     with torch.no_grad():
+         outputs = model.generate(
+             input_ids=inputs,
+             do_sample=True,
+             top_p=top_p,
+             temperature=temperature,
+             max_new_tokens=max_new_tokens,
+             eos_token_id=end_token_id,
+             pad_token_id=tokenizer.eos_token_id,
+             bad_words_ids=[[token_id] for token_id in end_conv_token_id]
+         )
+
+     # Decode response
+     response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
+     return response
+
+ def respond(
+     message: str,
+     chat_history: List[Tuple[str, str]],
+     system_prompt: str,
+     temperature: float,
+     top_p: float,
+     max_new_tokens: int,
+ ):
+     """Stream response to the chatbot."""
+     # Generate complete response
+     bot_message = generate_response(
+         message,
+         chat_history,
+         system_prompt,
+         temperature,
+         top_p,
+         max_new_tokens
+     )
+
+     # Add to chat history
+     chat_history.append((message, bot_message))
+
+     # Stream the response character by character for better UX
+     partial_message = ""
+     for char in bot_message:
+         partial_message += char
+         time.sleep(0.01)  # Small delay for streaming effect
+         yield chat_history[:-1] + [(message, partial_message)]
+
+     yield chat_history
+
+ def clear_conversation():
+     """Clear the conversation history."""
+     return [], None
+
+ # Create the Gradio interface
+ with gr.Blocks(title="UserLM-8b Chat", theme=gr.themes.Soft()) as demo:
+     gr.Markdown(
+         """
+         # 🤖 UserLM-8b Chat Interface
+
+         Chat with Microsoft's UserLM-8b model. This model is designed to simulate user behavior and generate responses as if from a user perspective.
+
+         [Built with anycoder](https://huggingface.co/spaces/akhaliq/anycoder)
+         """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=3):
+             chatbot = gr.Chatbot(
+                 height=500,
+                 show_copy_button=True,
+                 bubble_full_width=False,
+                 avatar_images=None,  # expects image paths/URLs; an emoji string is not a valid image path
+                 render_markdown=True,
+             )
+
+             with gr.Row():
+                 msg = gr.Textbox(
+                     label="Message",
+                     placeholder="Type your message here and press Enter...",
+                     lines=2,
+                     scale=4,
+                     autofocus=True,
+                 )
+                 submit_btn = gr.Button("Send", variant="primary", scale=1)
+
+             with gr.Row():
+                 clear_btn = gr.ClearButton(
+                     [chatbot, msg],
+                     value="🗑️ Clear Chat"
+                 )
+                 retry_btn = gr.Button("🔄 Retry Last")
+                 undo_btn = gr.Button("↩️ Undo Last")
+
+         with gr.Column(scale=1):
+             gr.Markdown("### ⚙️ Settings")
+
+             system_prompt = gr.Textbox(
+                 label="System Prompt",
+                 placeholder="Set the behavior of the model...",
+                 value="You are a user who wants to implement a special type of sequence. The sequence sums up the two previous numbers in the sequence and adds 1 to the result. The first two numbers in the sequence are 1 and 1.",
+                 lines=4,
+             )
+
+             temperature = gr.Slider(
+                 minimum=0.1,
+                 maximum=2.0,
+                 value=1.0,
+                 step=0.1,
+                 label="Temperature",
+                 info="Higher values make output more random"
+             )
+
+             top_p = gr.Slider(
+                 minimum=0.1,
+                 maximum=1.0,
+                 value=0.8,
+                 step=0.05,
+                 label="Top-p (nucleus sampling)",
+                 info="Lower values focus on more likely tokens"
+             )
+
+             max_new_tokens = gr.Slider(
+                 minimum=10,
+                 maximum=512,
+                 value=100,
+                 step=10,
+                 label="Max New Tokens",
+                 info="Maximum number of tokens to generate"
+             )
+
+             gr.Markdown(
+                 """
+                 ### 📊 Model Info
+                 - **Model**: microsoft/UserLM-8b
+                 - **Parameters**: 8 billion
+                 - **Device**: """ + DEVICE.upper() + """
+                 - **Precision**: FP16 (CUDA) / FP32 (CPU)
+                 """
+             )
+
+     # Store conversation history
+     chat_history = gr.State([])
+
+     # Event handlers
+     def user_submit(message, history):
+         return "", history + [(message, None)]
+
+     def bot_respond(history, system, temp, top_p, max_tokens):
+         if not history or history[-1][1] is not None:
+             # Nothing pending: yield the history unchanged (a bare return from a generator would emit no update)
+             yield history
+             return
+
+         message = history[-1][0]
+         history_without_last = history[:-1]
+
+         for new_history in respond(message, history_without_last, system, temp, top_p, max_tokens):
+             yield new_history
+
+     def retry_last(history, system, temp, top_p, max_tokens):
+         if not history:
+             # Empty chat: yield unchanged (this is a generator, so a bare return would produce no output)
+             yield history
+             return
+
+         # Remove last exchange and regenerate
+         last_user_msg = history[-1][0]
+         history = history[:-1]
+
+         for new_history in respond(last_user_msg, history, system, temp, top_p, max_tokens):
+             yield new_history
+
+     def undo_last(history):
+         if history:
+             return history[:-1]
+         return history
+
+     # Connect events
+     msg.submit(
+         user_submit,
+         [msg, chatbot],
+         [msg, chatbot],
+         queue=False
+     ).then(
+         bot_respond,
+         [chatbot, system_prompt, temperature, top_p, max_new_tokens],
+         chatbot
+     )
+
+     submit_btn.click(
+         user_submit,
+         [msg, chatbot],
+         [msg, chatbot],
+         queue=False
+     ).then(
+         bot_respond,
+         [chatbot, system_prompt, temperature, top_p, max_new_tokens],
+         chatbot
+     )
+
+     retry_btn.click(
+         retry_last,
+         [chatbot, system_prompt, temperature, top_p, max_new_tokens],
+         chatbot
+     )
+
+     undo_btn.click(
+         undo_last,
+         chatbot,
+         chatbot
+     )
+
+     # Show a startup notice (the model itself is loaded lazily on the first message)
+     demo.load(
+         fn=lambda: gr.Info("Model loading... This may take a moment on first run."),
+         inputs=None,
+         outputs=None
+     )
+
+     # Examples
+     gr.Examples(
+         examples=[
+             ["Can you help me understand how this sequence works?"],
+             ["What would be the next 5 numbers in the sequence?"],
+             ["Let's implement this sequence in Python together."],
+             ["Can you explain the pattern: 1, 1, 3, 5, 9, 15...?"],
+         ],
+         inputs=msg,
+         label="Example Messages",
+     )
+
+ if __name__ == "__main__":
+     demo.launch(
+         share=False,
+         show_error=True,
+         server_name="0.0.0.0",
+         server_port=7860,
+     )
+ ```
+
+ This chat application provides:
+
+ ## Key Features:
+
+ 1. **Clean Chat Interface**: A modern, responsive chat UI with message bubbles
+ 2. **Streaming Responses**: Character-by-character streaming for better UX
+ 3. **Customizable Settings**: Temperature, top-p, and max token controls
+ 4. **System Prompt**: Configurable system prompt with the default sequence example
+ 5. **Chat Management**: Clear, retry, and undo functionality (all three operate on the tuple-style history sketched below)
+ 6. **GPU Optimization**: Automatic GPU detection and FP16 precision on CUDA
+ 7. **Example Messages**: Pre-defined examples to get started quickly
+ 8. **Model Info Display**: Shows the current device and model configuration
+
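+ For reference, the handlers above all pass around the tuple-style history that `gr.Chatbot` displays: a plain list of (user, assistant) pairs. A minimal illustration of how one exchange evolves (the messages here are made up):
+
+ ```python
+ history = []                                        # empty chat
+ history.append(("What's the 5th term?", None))      # user_submit: user turn added, reply pending
+ history[-1] = (history[-1][0], "It should be 9.")   # respond/bot_respond: reply filled in
+ history = history[:-1]                              # undo_last: drop the latest exchange
+ ```
+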
+ ## Technical Highlights:
+
+ - **Lazy Loading**: The model loads only when the first message is sent
+ - **Memory Efficient**: Uses `low_cpu_mem_usage=True` and appropriate precision
+ - **Proper Token Handling**: Implements the special tokens from your example (a quick way to verify them is sketched below)
+ - **State Management**: Maintains conversation history properly
+ - **CPU Fallback**: Falls back to CPU automatically if CUDA is unavailable
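+
+ The token handling above assumes that `<|eot_id|>` and `<|endconversation|>` exist in the UserLM-8b tokenizer's vocabulary. A quick sanity check you can run separately to confirm the ids that feed `eos_token_id` and `bad_words_ids`:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("microsoft/UserLM-8b", trust_remote_code=True)
+ for marker in ("<|eot_id|>", "<|endconversation|>"):
+     # add_special_tokens=False so only the marker's own id(s) are printed
+     print(marker, "->", tok.encode(marker, add_special_tokens=False))
+ ```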
+
+ The app preserves your original model loading and generation logic while wrapping it in a user-friendly Gradio interface. Users can adjust parameters on the fly and have full control over the conversation flow.
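+
+ To run this outside a managed Space, the file needs `gradio`, `spaces`, `torch`, and `transformers` installed (these are the imports at the top of app.py). A minimal, unpinned `requirements.txt` sketch (exact versions are not specified here, and `accelerate` is an assumption since `low_cpu_mem_usage=True` typically relies on it):
+
+ ```
+ gradio
+ spaces
+ torch
+ transformers
+ accelerate
+ ```
+
+ With the dependencies installed, `python app.py` starts the server on port 7860 (per the `demo.launch` call), reachable at http://localhost:7860.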