| --- |
| license: mit |
| language: |
| - multilingual |
| tags: |
| - nlp |
| base_model: OpenGVLab/InternVL2_5-2B |
| pipeline_tag: text-generation |
| inference: true |
| --- |
| |
| # NuExtract-2-2B [experimental version] by NuMind 🔥 |
|
|
| NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. It supports both multimodal inputs and is multilingual. |
|
|
| NB: This is an experimental version that will be superseeded by NuExtract 2.0 |
|
|
| We provide several versions of different sizes, all based on the InternVL2.5 family. |
| | Model Size | Model Name | Base Model | Huggingface Link | |
| |------------|------------|------------|------------------| |
| | 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) | |
| | 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) | |
| | 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) | |
|
|
| ## Overview |
|
|
| To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object, specifying field names and their expected type. |
|
|
| Support types include: |
| * `verbatim-string` - instructs the model to extract text that is present verbatim in the input. |
| * `string` - a generic string field that can incorporate paraphrasing/abstraction. |
| * `integer` - a whole number. |
| * `number` - a whole or decimal number. |
| * `date-time` - ISO formatted date. |
| * Array of any of the above types (e.g. `["string"]`) |
| * `enum` - a choice from set of possible answers (represented in template as an array of options, e.g. `["yes", "no", "maybe"]`). |
| * `multi-label` - an enum that can have multiple possible answers (represented in template as a double-wrapped array, e.g. `[["A", "B", "C"]]`). |
|
|
| If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels). |
|
|
| The following is an example template: |
| ```json |
| { |
| "first_name": "verbatim-string", |
| "last_name": "verbatim-string", |
| "description": "string", |
| "age": "integer", |
| "gpa": "number", |
| "birth_date": "date-time", |
| "nationality": ["France", "England", "Japan", "USA", "China"], |
| "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]] |
| } |
| ``` |
| An example output: |
| ```json |
| { |
| "first_name": "Susan", |
| "last_name": "Smith", |
| "description": "A student studying computer science.", |
| "age": 20, |
| "gpa": 3.7, |
| "birth_date": "2005-03-01", |
| "nationality": "England", |
| "languages_spoken": ["English", "French"] |
| } |
| ``` |
|
|
| ⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to many extraction tasks. |
|
|
| ## Inference |
|
|
| Use the following code to handle loading and preprocessing of input data: |
|
|
| ```python |
| import torch |
| import torchvision.transforms as T |
| from PIL import Image |
| from torchvision.transforms.functional import InterpolationMode |
| |
| IMAGENET_MEAN = (0.485, 0.456, 0.406) |
| IMAGENET_STD = (0.229, 0.224, 0.225) |
| |
| def build_transform(input_size): |
| MEAN, STD = IMAGENET_MEAN, IMAGENET_STD |
| transform = T.Compose([ |
| T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), |
| T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), |
| T.ToTensor(), |
| T.Normalize(mean=MEAN, std=STD) |
| ]) |
| return transform |
| |
| def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): |
| best_ratio_diff = float('inf') |
| best_ratio = (1, 1) |
| area = width * height |
| for ratio in target_ratios: |
| target_aspect_ratio = ratio[0] / ratio[1] |
| ratio_diff = abs(aspect_ratio - target_aspect_ratio) |
| if ratio_diff < best_ratio_diff: |
| best_ratio_diff = ratio_diff |
| best_ratio = ratio |
| elif ratio_diff == best_ratio_diff: |
| if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: |
| best_ratio = ratio |
| return best_ratio |
| |
| def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False): |
| orig_width, orig_height = image.size |
| aspect_ratio = orig_width / orig_height |
| |
| # calculate the existing image aspect ratio |
| target_ratios = set( |
| (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if |
| i * j <= max_num and i * j >= min_num) |
| target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) |
| |
| # find the closest aspect ratio to the target |
| target_aspect_ratio = find_closest_aspect_ratio( |
| aspect_ratio, target_ratios, orig_width, orig_height, image_size) |
| |
| # calculate the target width and height |
| target_width = image_size * target_aspect_ratio[0] |
| target_height = image_size * target_aspect_ratio[1] |
| blocks = target_aspect_ratio[0] * target_aspect_ratio[1] |
| |
| # resize the image |
| resized_img = image.resize((target_width, target_height)) |
| processed_images = [] |
| for i in range(blocks): |
| box = ( |
| (i % (target_width // image_size)) * image_size, |
| (i // (target_width // image_size)) * image_size, |
| ((i % (target_width // image_size)) + 1) * image_size, |
| ((i // (target_width // image_size)) + 1) * image_size |
| ) |
| # split the image |
| split_img = resized_img.crop(box) |
| processed_images.append(split_img) |
| assert len(processed_images) == blocks |
| if use_thumbnail and len(processed_images) != 1: |
| thumbnail_img = image.resize((image_size, image_size)) |
| processed_images.append(thumbnail_img) |
| return processed_images |
| |
| def load_image(image_file, input_size=448, max_num=12): |
| image = Image.open(image_file).convert('RGB') |
| transform = build_transform(input_size=input_size) |
| images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) |
| pixel_values = [transform(image) for image in images] |
| pixel_values = torch.stack(pixel_values) |
| return pixel_values |
| |
| def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16): |
| """ |
| Prepares multi-modal input components (supports multiple images per prompt). |
| |
| Args: |
| messages: List of input messages/prompts (strings or dicts with 'role' and 'content') |
| image_paths: List where each element is either None (for text-only) or a list of image paths |
| tokenizer: The tokenizer to use for applying chat templates |
| device: Device to place tensors on ('cuda', 'cpu', etc.) |
| dtype: Data type for image tensors (default: torch.bfloat16) |
| |
| Returns: |
| dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model |
| """ |
| # Make sure image_paths list is at least as long as messages |
| if len(image_paths) < len(messages): |
| # Pad with None for text-only messages |
| image_paths = image_paths + [None] * (len(messages) - len(image_paths)) |
| |
| # Process images and collect patch information |
| loaded_images = [] |
| num_patches_list = [] |
| for paths in image_paths: |
| if paths and isinstance(paths, list) and len(paths) > 0: |
| # Load each image in this prompt |
| prompt_images = [] |
| prompt_patches = [] |
| |
| for path in paths: |
| # Load the image |
| img = load_image(path).to(dtype=dtype, device=device) |
| |
| # Ensure img has correct shape [patches, C, H, W] |
| if len(img.shape) == 3: # [C, H, W] -> [1, C, H, W] |
| img = img.unsqueeze(0) |
| |
| prompt_images.append(img) |
| # Record the number of patches for this image |
| prompt_patches.append(img.shape[0]) |
| |
| loaded_images.append(prompt_images) |
| num_patches_list.append(prompt_patches) |
| else: |
| # Text-only prompt |
| loaded_images.append(None) |
| num_patches_list.append([]) |
| |
| # Create the concatenated pixel_values_list |
| pixel_values_list = [] |
| for prompt_images in loaded_images: |
| if prompt_images: |
| # Concatenate all images for this prompt |
| pixel_values_list.append(torch.cat(prompt_images, dim=0)) |
| else: |
| # Text-only prompt |
| pixel_values_list.append(None) |
| |
| # Format messages for the model |
| if all(isinstance(m, str) for m in messages): |
| # Simple string messages: convert to chat format |
| batch_messages = [ |
| [{"role": "user", "content": message}] |
| for message in messages |
| ] |
| else: |
| # Assume messages are already in the right format |
| batch_messages = messages |
| |
| # Apply chat template |
| prompts = tokenizer.apply_chat_template( |
| batch_messages, |
| tokenize=False, |
| add_generation_prompt=True |
| ) |
| |
| return { |
| 'prompts': prompts, |
| 'pixel_values_list': pixel_values_list, |
| 'num_patches_list': num_patches_list |
| } |
| |
| def construct_message(text, template, examples=None): |
| """ |
| Construct the individual NuExtract message texts, prior to chat template formatting. |
| """ |
| # add few-shot examples if needed |
| if examples is not None and len(examples) > 0: |
| icl = "# Examples:\n" |
| for row in examples: |
| icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n" |
| else: |
| icl = "" |
| |
| return f"""# Template:\n{template}\n{icl}# Context:\n{text}""" |
| ``` |
|
|
| To handle inference: |
|
|
| ```python |
| IMG_START_TOKEN='<img>' |
| IMG_END_TOKEN='</img>' |
| IMG_CONTEXT_TOKEN='<IMG_CONTEXT>' |
| |
| def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None): |
| """ |
| Generate responses for a batch of NuExtract inputs. |
| Support for multiple and varying numbers of images per prompt. |
| |
| Args: |
| model: The vision-language model |
| tokenizer: The tokenizer for the model |
| pixel_values_list: List of tensor batches, one per prompt |
| Each batch has shape [num_images, channels, height, width] or None for text-only prompts |
| prompts: List of text prompts |
| generation_config: Configuration for text generation |
| num_patches_list: List of lists, each containing patch counts for images in a prompt |
| |
| Returns: |
| List of generated responses |
| """ |
| img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN) |
| model.img_context_token_id = img_context_token_id |
| |
| # Replace all image placeholders with appropriate tokens |
| modified_prompts = [] |
| total_image_files = 0 |
| total_patches = 0 |
| image_containing_prompts = [] |
| for idx, prompt in enumerate(prompts): |
| # check if this prompt has images |
| has_images = (pixel_values_list and |
| idx < len(pixel_values_list) and |
| pixel_values_list[idx] is not None and |
| isinstance(pixel_values_list[idx], torch.Tensor) and |
| pixel_values_list[idx].shape[0] > 0) |
| |
| if has_images: |
| # prompt with image placeholders |
| image_containing_prompts.append(idx) |
| modified_prompt = prompt |
| |
| patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else [] |
| num_images = len(patches) |
| total_image_files += num_images |
| total_patches += sum(patches) |
| |
| # replace each <image> placeholder with image tokens |
| for i, num_patches in enumerate(patches): |
| image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN |
| modified_prompt = modified_prompt.replace('<image>', image_tokens, 1) |
| else: |
| # text-only prompt |
| modified_prompt = prompt |
| |
| modified_prompts.append(modified_prompt) |
| |
| # process all prompts in a single batch |
| tokenizer.padding_side = 'left' |
| model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True) |
| input_ids = model_inputs['input_ids'].to(model.device) |
| attention_mask = model_inputs['attention_mask'].to(model.device) |
| |
| eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>\n".strip()) |
| generation_config['eos_token_id'] = eos_token_id |
| |
| # prepare pixel values |
| flattened_pixel_values = None |
| if image_containing_prompts: |
| # collect and concatenate all image tensors |
| all_pixel_values = [] |
| for idx in image_containing_prompts: |
| all_pixel_values.append(pixel_values_list[idx]) |
| |
| flattened_pixel_values = torch.cat(all_pixel_values, dim=0) |
| print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches") |
| else: |
| print(f"Processing text-only batch with {len(prompts)} prompts") |
| |
| # generate outputs |
| outputs = model.generate( |
| pixel_values=flattened_pixel_values, # will be None for text-only prompts |
| input_ids=input_ids, |
| attention_mask=attention_mask, |
| **generation_config |
| ) |
| |
| # Decode responses |
| responses = tokenizer.batch_decode(outputs, skip_special_tokens=True) |
| |
| return responses |
| ``` |
|
|
| To load the model: |
|
|
| ```python |
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_name = "" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left') |
| model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, |
| torch_dtype=torch.bfloat16, |
| attn_implementation="flash_attention_2" # we recommend using flash attention |
| ).to("cuda") |
| ``` |
|
|
| Simple 0-shot text-only example: |
| ```python |
| template = """{"names": ["verbatim-string"]}""" |
| text = "John went to the restaurant with Mary. James went to the cinema." |
| |
| input_messages = [construct_message(text, template)] |
| |
| input_content = prepare_inputs( |
| messages=input_messages, |
| image_paths=[], |
| tokenizer=tokenizer, |
| ) |
| |
| generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048} |
| |
| with torch.no_grad(): |
| result = nuextract_generate( |
| model=model, |
| tokenizer=tokenizer, |
| prompts=input_content['prompts'], |
| pixel_values_list=input_content['pixel_values_list'], |
| num_patches_list=input_content['num_patches_list'], |
| generation_config=generation_config |
| ) |
| for y in result: |
| print(y) |
| # {"names": ["John", "Mary", "James"]} |
| ``` |
|
|
| Text-only input with an in-context example: |
| ```python |
| template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""" |
| text = "John went to the restaurant with Mary. James went to the cinema." |
| examples = [ |
| { |
| "input": "Stephen is the manager at Susan's store.", |
| "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}""" |
| } |
| ] |
| |
| input_messages = [construct_message(text, template, examples)] |
| |
| input_content = prepare_inputs( |
| messages=input_messages, |
| image_paths=[], |
| tokenizer=tokenizer, |
| ) |
| |
| generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048} |
| |
| with torch.no_grad(): |
| result = nuextract_generate( |
| model=model, |
| tokenizer=tokenizer, |
| prompts=input_content['prompts'], |
| pixel_values_list=input_content['pixel_values_list'], |
| num_patches_list=input_content['num_patches_list'], |
| generation_config=generation_config |
| ) |
| for y in result: |
| print(y) |
| # {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]} |
| ``` |
|
|
| Example with image input and an in-context example. Image inputs should use `<image>` placeholder instead of text and image paths should be provided in a list in order of appearance in the prompt (in this example `0.jpg` will be for the in-context example and `1.jpg` for the true input). |
| ```python |
| template = """{"store": "verbatim-string"}""" |
| text = "<image>" |
| examples = [ |
| { |
| "input": "<image>", |
| "output": """{"store": "Walmart"}""" |
| } |
| ] |
| |
| input_messages = [construct_message(text, template, examples)] |
| |
| images = [ |
| ["0.jpg", "1.jpg"] |
| ] |
| |
| input_content = prepare_inputs( |
| messages=input_messages, |
| image_paths=images, |
| tokenizer=tokenizer, |
| ) |
| |
| generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048} |
| |
| with torch.no_grad(): |
| result = nuextract_generate( |
| model=model, |
| tokenizer=tokenizer, |
| prompts=input_content['prompts'], |
| pixel_values_list=input_content['pixel_values_list'], |
| num_patches_list=input_content['num_patches_list'], |
| generation_config=generation_config |
| ) |
| for y in result: |
| print(y) |
| # {"store": "Trader Joe's"} |
| ``` |
|
|
| Multi-modal batched input: |
| ```python |
| inputs = [ |
| # image input with no ICL examples |
| { |
| "text": "<image>", |
| "template": """{"store_name": "verbatim-string"}""", |
| "examples": None, |
| }, |
| # image input with 1 ICL example |
| { |
| "text": "<image>", |
| "template": """{"store_name": "verbatim-string"}""", |
| "examples": [ |
| { |
| "input": "<image>", |
| "output": """{"store_name": "Walmart"}""", |
| } |
| ], |
| }, |
| # text input with no ICL examples |
| { |
| "text": "John went to the restaurant with Mary. James went to the cinema.", |
| "template": """{"names": ["verbatim-string"]}""", |
| "examples": None, |
| }, |
| # text input with ICL example |
| { |
| "text": "John went to the restaurant with Mary. James went to the cinema.", |
| "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""", |
| "examples": [ |
| { |
| "input": "Stephen is the manager at Susan's store.", |
| "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}""" |
| } |
| ], |
| }, |
| ] |
| |
| input_messages = [ |
| construct_message( |
| x["text"], |
| x["template"], |
| x["examples"] |
| ) for x in inputs |
| ] |
| |
| images = [ |
| ["0.jpg"], |
| ["0.jpg", "1.jpg"], |
| None, |
| None |
| ] |
| |
| input_content = prepare_inputs( |
| messages=input_messages, |
| image_paths=images, |
| tokenizer=tokenizer, |
| ) |
| |
| generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048} |
| |
| with torch.no_grad(): |
| result = nuextract_generate( |
| model=model, |
| tokenizer=tokenizer, |
| prompts=input_content['prompts'], |
| pixel_values_list=input_content['pixel_values_list'], |
| num_patches_list=input_content['num_patches_list'], |
| generation_config=generation_config |
| ) |
| for y in result: |
| print(y) |
| # {"store_name": "WAL*MART"} |
| # {"store_name": "Trader Joe's"} |
| # {"names": ["John", "Mary", "James"]} |
| # {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]} |
| ``` |
|
|
| ## Template Generation |
| If you want to convert existing schema files you have in other formats (e.g. XML, YAML, etc.) or start from an example, NuExtract 2 models can automatically generate this for you. |
|
|
| E.g. convert XML into a NuExtract template: |
| ```python |
| def generate_template(description): |
| input_messages = [description] |
| input_content = prepare_inputs( |
| messages=input_messages, |
| image_paths=[], |
| tokenizer=tokenizer, |
| ) |
| generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256} |
| with torch.no_grad(): |
| result = nuextract_generate( |
| model=model, |
| tokenizer=tokenizer, |
| prompts=input_content['prompts'], |
| pixel_values_list=input_content['pixel_values_list'], |
| num_patches_list=input_content['num_patches_list'], |
| generation_config=generation_config |
| ) |
| return result[0] |
| xml_template = """<SportResult> |
| <Date></Date> |
| <Sport></Sport> |
| <Venue></Venue> |
| <HomeTeam></HomeTeam> |
| <AwayTeam></AwayTeam> |
| <HomeScore></HomeScore> |
| <AwayScore></AwayScore> |
| <TopScorer></TopScorer> |
| </SportResult>""" |
| result = generate_template(xml_template) |
| |
| print(result) |
| # { |
| # "SportResult": { |
| # "Date": "date-time", |
| # "Sport": "verbatim-string", |
| # "Venue": "verbatim-string", |
| # "HomeTeam": "verbatim-string", |
| # "AwayTeam": "verbatim-string", |
| # "HomeScore": "integer", |
| # "AwayScore": "integer", |
| # "TopScorer": "verbatim-string" |
| # } |
| # } |
| ``` |
|
|
| E.g. generate a template from natural language description: |
| ```python |
| text = """Give me relevant info about startup companies mentioned.""" |
| result = generate_template(text) |
| |
| print(result) |
| # { |
| # "Startup_Companies": [ |
| # { |
| # "Name": "verbatim-string", |
| # "Products": [ |
| # "string" |
| # ], |
| # "Location": "verbatim-string", |
| # "Company_Type": [ |
| # "Technology", |
| # "Finance", |
| # "Health", |
| # "Education", |
| # "Other" |
| # ] |
| # } |
| # ] |
| # } |
| ``` |