prompt = f"""Please annotate each character in the image, provide the image's tag information in JSON format, and give a detailed description of the scene.

【Character Identification and Annotation】
1. Create a bounding box (bbox) for each character, formatted as [bottom-left x, bottom-left y, top-right x, top-right y].
2. Each bounding box must enclose the entire character as tightly as possible: neither too large nor too small.
3. Character names are not yet known, so use the placeholders $character_1$, $character_2$, etc.

【Overall Image Analysis】
1. After locating all characters, provide an overall description of the image with two sections: tags and caption.
2. Tags section:
   - Reorganize the original tags provided at the end of this prompt.
   - Group the tags by character using structured XML. For each character ($character_1$, $character_2$, ...), cover:
     * gender tag (1girl/1boy)
     * facial features, hair color, hair style, eye color, skin tone, apparent age, etc.
     * clothing type, color, style, accessories, footwear, etc.
     * height, build, physical characteristics, etc.
     * facial expression, emotional state, mood, etc.
     * current pose, movement, gesture, activity, etc.
     * interactions with other characters, objects, or the environment
     * precise position in the image (center, left, right, foreground, background, etc.)
   - Use structured XML for the general tags as well:
     * overall character count (1girl, 2girls, 3girls, 1boy, multiple boys, etc.)
     * artist name, art style attribution, creator information
     * indoor, outdoor, landscape, cityscape, etc.
     * room, forest, city, beach, school, etc.
     * from above, from below, side view, close-up, etc.
     * dark, bright, moody, cheerful, romantic, etc.
     * natural light, artificial light, sunset, candlelight, etc.
     * high resolution, masterpiece, best quality, etc.
     * furniture, decorations, tools, vehicles, etc.
     * any other scene-related tags not covered above
   - Omit any of the tags above that do not apply to this image.
3. Caption section: an extremely detailed description of the scene, covering every character's name (placeholder), gender, appearance, clothing, actions, expression, precise position, and position relative to the others, as well as the background, environment, atmosphere, lighting, objects, perspective, and artistic style. Be thorough and vivid; write at least 200 words, in English.

Return the overall image analysis as JSON in this shape:
{{
  "tags": "XML-formatted character and general tags, as described above",
  "caption": "Extremely detailed description of the scene (at least 200 words, in English)"
}}

{tags}
"""
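The prompt asks for JSON whose analysis part carries a `"tags"` string of XML and a `"caption"` of at least 200 words, but it does not pin down the outer nesting. The sketch below is one way to check a model response against those requirements; it is a hypothetical helper, not part of the original pipeline. It assumes the response contains a single JSON object (possibly surrounded by prose), searches that object for the first dict holding both `"tags"` and `"caption"`, and treats the tags as an XML fragment that may have several top-level elements. The names `extract_json`, `find_analysis`, and `check_analysis` are illustrative.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, List, Optional


def extract_json(text: str) -> Any:
    """Parse the first JSON object in a model response.

    Assumption: the object starts at the first '{' in the text; any prose
    before it or after it is ignored.
    """
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode parses one JSON value starting at `start` and reports where it
    # ended, so trailing text after the object does not cause an error.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


def find_analysis(obj: Any) -> Optional[dict]:
    """Depth-first search for a dict containing both "tags" and "caption".

    The prompt does not fix how deeply this object is nested, so search for it.
    """
    if isinstance(obj, dict):
        if "tags" in obj and "caption" in obj:
            return obj
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_analysis(child)
        if found is not None:
            return found
    return None


def check_analysis(analysis: dict, min_words: int = 200) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []
    caption = str(analysis.get("caption", ""))
    n_words = len(caption.split())
    if n_words < min_words:
        problems.append(f"caption has {n_words} words, expected >= {min_words}")
    try:
        # Wrap in a synthetic root so several top-level elements still parse.
        ET.fromstring(f"<root>{analysis.get('tags', '')}</root>")
    except ET.ParseError as exc:
        problems.append(f"tags are not well-formed XML: {exc}")
    return problems


if __name__ == "__main__":
    # Hypothetical response; the "analysis" wrapper key is invented for illustration.
    response = 'Result: {"analysis": {"tags": "<character_1>1girl</character_1>", "caption": "A girl stands."}}'
    analysis = find_analysis(extract_json(response))
    print(check_analysis(analysis) if analysis else "no tags/caption object found")
```

Searching for the object rather than indexing a fixed path keeps the checker usable whichever wrapper the model chooses. The word count uses whitespace splitting, so it is only an approximation of the prompt's "at least 200 words".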
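The bbox format names the bottom-left and top-right corners but does not say where the origin is or which way y grows. Libraries such as PIL and OpenCV put the origin at the top-left with y increasing downward. The sketch below assumes the model reports boxes in a bottom-left-origin, y-up frame, and converts them to the top-left-origin `(left, upper, right, lower)` order that Pillow's `Image.crop` expects. The function name `bbox_to_top_left_origin` is hypothetical.

```python
from typing import List, Sequence


def bbox_to_top_left_origin(bbox: Sequence[float], width: float, height: float) -> List[float]:
    """Convert [bottom-left x, bottom-left y, top-right x, top-right y] to
    [x_min, y_min, x_max, y_max] with the origin at the image's top-left corner.

    Assumption: the input is measured from the image's bottom-left corner with
    y growing upward, so the box's top edge is at y2 and its bottom edge at y1.
    """
    if len(bbox) != 4:
        raise ValueError(f"expected 4 numbers, got {len(bbox)}")
    x1, y1, x2, y2 = bbox
    # Reject degenerate boxes and boxes that fall outside the image.
    if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
        raise ValueError(f"bbox {list(bbox)} is degenerate or outside a {width}x{height} image")
    # Flipping the y axis maps the top edge y2 to height - y2 and the bottom edge y1 to height - y1.
    return [x1, height - y2, x2, height - y1]


if __name__ == "__main__":
    # A 100x200 box on a 640x480 image, under the y-up assumption above.
    print(bbox_to_top_left_origin([10, 20, 110, 220], width=640, height=480))
```

If a model consistently returns boxes whose "bottom-left y" is larger than their "top-right y", it is probably already using top-left-origin coordinates. The ordering check rejects such boxes, which makes a convention mismatch visible instead of silently producing flipped crops.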