	---
	license: mit
	language:
	- multilingual
	tags:
	- nlp
	base_model: OpenGVLab/InternVL2_5-2B
	pipeline_tag: text-generation
	inference: true
	---

	# NuExtract-2-2B [experimental version] by NuMind 🔥

	NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. The models accept multimodal inputs (text and images) and are multilingual.

	NB: This is an experimental version that will be superseded by NuExtract 2.0.

	We provide several versions of different sizes, all based on the InternVL2.5 family.

	| Model Size | Model Name | Base Model | Hugging Face Link |
	|------------|------------|------------|-------------------|
	| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
	| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
	| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |

	## Overview

	To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object, specifying field names and their expected type.

	Supported types include:
	* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
	* `string` - a generic string field that can incorporate paraphrasing/abstraction.
	* `integer` - a whole number.
	* `number` - a whole or decimal number.
	* `date-time` - ISO formatted date.
	* Array of any of the above types (e.g. `["string"]`)
	* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
	* `multi-label` - an enum that can have multiple possible answers (represented in template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

	If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).

	The following is an example template:
	```json
	{
	    "first_name": "verbatim-string",
	    "last_name": "verbatim-string",
	    "description": "string",
	    "age": "integer",
	    "gpa": "number",
	    "birth_date": "date-time",
	    "nationality": ["France", "England", "Japan", "USA", "China"],
	    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
	}
	```
	An example output:
	```json
	{
	    "first_name": "Susan",
	    "last_name": "Smith",
	    "description": "A student studying computer science.",
	    "age": 20,
	    "gpa": 3.7,
	    "birth_date": "2005-03-01",
	    "nationality": "England",
	    "languages_spoken": ["English", "French"]
	}
	```
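
The type rules above can be checked mechanically once the model output has been parsed. The following is a minimal sketch, not part of NuExtract: `SCALAR_CHECKS` and `matches` are hypothetical helper names, `date-time` values are only checked to be strings, and nested objects are assumed to come back as objects rather than `null`.

```python
import json

# Hypothetical per-type checks; bool is excluded because it is a subclass of int in Python.
SCALAR_CHECKS = {
    "verbatim-string": lambda v: isinstance(v, str),
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date-time": lambda v: isinstance(v, str),  # format is not validated here
}

def matches(spec, value):
    """Return True if `value` is compatible with the template fragment `spec`."""
    if isinstance(spec, dict):  # nested object: check every field
        return isinstance(value, dict) and all(
            matches(sub, value.get(key)) for key, sub in spec.items())
    if isinstance(spec, list):
        if len(spec) == 1 and isinstance(spec[0], list):  # multi-label
            return isinstance(value, list) and all(v in spec[0] for v in value)
        if len(spec) == 1 and (isinstance(spec[0], dict) or spec[0] in SCALAR_CHECKS):
            # array of a supported type (or of objects)
            return isinstance(value, list) and all(matches(spec[0], v) for v in value)
        return value is None or value in spec  # enum
    return value is None or SCALAR_CHECKS[spec](value)  # scalar type

template = json.loads("""{
    "first_name": "verbatim-string", "age": "integer", "gpa": "number",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}""")
output = {"first_name": "Susan", "age": 20, "gpa": 3.7,
          "nationality": "England", "languages_spoken": ["English", "French"]}
print(matches(template, output))
```

A check like this is mainly useful as a guard before writing extracted records to a database; it says nothing about whether the extracted values are correct.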

	⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to many extraction tasks.

	## Inference

	Use the following code to handle loading and preprocessing of input data:

	```python
	import torch
	import torchvision.transforms as T
	from PIL import Image
	from torchvision.transforms.functional import InterpolationMode

	IMAGENET_MEAN = (0.485, 0.456, 0.406)
	IMAGENET_STD = (0.229, 0.224, 0.225)

	def build_transform(input_size):
	    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
	    transform = T.Compose([
	        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
	        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
	        T.ToTensor(),
	        T.Normalize(mean=MEAN, std=STD)
	    ])
	    return transform

	def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
	    best_ratio_diff = float('inf')
	    best_ratio = (1, 1)
	    area = width * height
	    for ratio in target_ratios:
	        target_aspect_ratio = ratio[0] / ratio[1]
	        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
	        if ratio_diff < best_ratio_diff:
	            best_ratio_diff = ratio_diff
	            best_ratio = ratio
	        elif ratio_diff == best_ratio_diff:
	            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
	                best_ratio = ratio
	    return best_ratio

	def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
	    orig_width, orig_height = image.size
	    aspect_ratio = orig_width / orig_height

	    # calculate the existing image aspect ratio
	    target_ratios = set(
	        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
	        i * j <= max_num and i * j >= min_num)
	    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

	    # find the closest aspect ratio to the target
	    target_aspect_ratio = find_closest_aspect_ratio(
	        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

	    # calculate the target width and height
	    target_width = image_size * target_aspect_ratio[0]
	    target_height = image_size * target_aspect_ratio[1]
	    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

	    # resize the image
	    resized_img = image.resize((target_width, target_height))
	    processed_images = []
	    for i in range(blocks):
	        box = (
	            (i % (target_width // image_size)) * image_size,
	            (i // (target_width // image_size)) * image_size,
	            ((i % (target_width // image_size)) + 1) * image_size,
	            ((i // (target_width // image_size)) + 1) * image_size
	        )
	        # split the image
	        split_img = resized_img.crop(box)
	        processed_images.append(split_img)
	    assert len(processed_images) == blocks
	    if use_thumbnail and len(processed_images) != 1:
	        thumbnail_img = image.resize((image_size, image_size))
	        processed_images.append(thumbnail_img)
	    return processed_images

	def load_image(image_file, input_size=448, max_num=12):
	    image = Image.open(image_file).convert('RGB')
	    transform = build_transform(input_size=input_size)
	    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
	    pixel_values = [transform(image) for image in images]
	    pixel_values = torch.stack(pixel_values)
	    return pixel_values

	def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
	    """
	    Prepares multi-modal input components (supports multiple images per prompt).

	    Args:
	        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
	        image_paths: List where each element is either None (for text-only) or a list of image paths
	        tokenizer: The tokenizer to use for applying chat templates
	        device: Device to place tensors on ('cuda', 'cpu', etc.)
	        dtype: Data type for image tensors (default: torch.bfloat16)

	    Returns:
	        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
	    """
	    # Make sure image_paths list is at least as long as messages
	    if len(image_paths) < len(messages):
	        # Pad with None for text-only messages
	        image_paths = image_paths + [None] * (len(messages) - len(image_paths))

	    # Process images and collect patch information
	    loaded_images = []
	    num_patches_list = []
	    for paths in image_paths:
	        if paths and isinstance(paths, list) and len(paths) > 0:
	            # Load each image in this prompt
	            prompt_images = []
	            prompt_patches = []

	            for path in paths:
	                # Load the image
	                img = load_image(path).to(dtype=dtype, device=device)

	                # Ensure img has correct shape [patches, C, H, W]
	                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
	                    img = img.unsqueeze(0)

	                prompt_images.append(img)
	                # Record the number of patches for this image
	                prompt_patches.append(img.shape[0])

	            loaded_images.append(prompt_images)
	            num_patches_list.append(prompt_patches)
	        else:
	            # Text-only prompt
	            loaded_images.append(None)
	            num_patches_list.append([])

	    # Create the concatenated pixel_values_list
	    pixel_values_list = []
	    for prompt_images in loaded_images:
	        if prompt_images:
	            # Concatenate all images for this prompt
	            pixel_values_list.append(torch.cat(prompt_images, dim=0))
	        else:
	            # Text-only prompt
	            pixel_values_list.append(None)

	    # Format messages for the model
	    if all(isinstance(m, str) for m in messages):
	        # Simple string messages: convert to chat format
	        batch_messages = [
	            [{"role": "user", "content": message}]
	            for message in messages
	        ]
	    else:
	        # Assume messages are already in the right format
	        batch_messages = messages

	    # Apply chat template
	    prompts = tokenizer.apply_chat_template(
	        batch_messages,
	        tokenize=False,
	        add_generation_prompt=True
	    )

	    return {
	        'prompts': prompts,
	        'pixel_values_list': pixel_values_list,
	        'num_patches_list': num_patches_list
	    }

	def construct_message(text, template, examples=None):
	    """
	    Construct the individual NuExtract message texts, prior to chat template formatting.
	    """
	    # add few-shot examples if needed
	    if examples is not None and len(examples) > 0:
	        icl = "# Examples:\n"
	        for row in examples:
	            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
	    else:
	        icl = ""

	    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
	```
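
To see how many patches an image will produce before loading any tensors, the tiling rule can be reproduced in plain Python. This is a sketch that mirrors the selection logic of `find_closest_aspect_ratio` and `dynamic_preprocess` above; `candidate_grids`, `pick_grid` and `count_patches` are hypothetical names introduced only for illustration.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {(i, j)
             for i in range(1, max_num + 1)
             for j in range(1, max_num + 1)
             if min_num <= i * j <= max_num}
    return sorted(grids, key=lambda g: g[0] * g[1])

def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    best_diff, best = float("inf"), (1, 1)
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            # on a tie, prefer the larger grid if the image has enough pixels
            best = (cols, rows)
    return best

def count_patches(width, height, use_thumbnail=True, **kwargs):
    """Number of tiles, plus one thumbnail when the image is split into several tiles."""
    cols, rows = pick_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + (1 if use_thumbnail and tiles != 1 else 0)

print(pick_grid(1600, 800), count_patches(1600, 800))  # (4, 2) 9
print(pick_grid(448, 448), count_patches(448, 448))    # (1, 1) 1
```

Because each patch is later expanded into a run of image tokens, lowering `max_num` is one way to keep prompts with many images within the context window.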

	To handle inference:

	```python
	IMG_START_TOKEN='<img>'
	IMG_END_TOKEN='</img>'
	IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'

	def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
	    """
	    Generate responses for a batch of NuExtract inputs.
	    Support for multiple and varying numbers of images per prompt.

	    Args:
	        model: The vision-language model
	        tokenizer: The tokenizer for the model
	        pixel_values_list: List of tensor batches, one per prompt
	            Each batch has shape [num_images, channels, height, width] or None for text-only prompts
	        prompts: List of text prompts
	        generation_config: Configuration for text generation
	        num_patches_list: List of lists, each containing patch counts for images in a prompt

	    Returns:
	        List of generated responses
	    """
	    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
	    model.img_context_token_id = img_context_token_id

	    # Replace all image placeholders with appropriate tokens
	    modified_prompts = []
	    total_image_files = 0
	    total_patches = 0
	    image_containing_prompts = []
	    for idx, prompt in enumerate(prompts):
	        # check if this prompt has images
	        has_images = (pixel_values_list and
	                      idx < len(pixel_values_list) and
	                      pixel_values_list[idx] is not None and
	                      isinstance(pixel_values_list[idx], torch.Tensor) and
	                      pixel_values_list[idx].shape[0] > 0)

	        if has_images:
	            # prompt with image placeholders
	            image_containing_prompts.append(idx)
	            modified_prompt = prompt

	            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
	            num_images = len(patches)
	            total_image_files += num_images
	            total_patches += sum(patches)

	            # replace each <image> placeholder with image tokens
	            for i, num_patches in enumerate(patches):
	                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
	                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
	        else:
	            # text-only prompt
	            modified_prompt = prompt

	        modified_prompts.append(modified_prompt)

	    # process all prompts in a single batch
	    tokenizer.padding_side = 'left'
	    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
	    input_ids = model_inputs['input_ids'].to(model.device)
	    attention_mask = model_inputs['attention_mask'].to(model.device)

	    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>\n".strip())
	    generation_config['eos_token_id'] = eos_token_id

	    # prepare pixel values
	    flattened_pixel_values = None
	    if image_containing_prompts:
	        # collect and concatenate all image tensors
	        all_pixel_values = []
	        for idx in image_containing_prompts:
	            all_pixel_values.append(pixel_values_list[idx])

	        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
	        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
	    else:
	        print(f"Processing text-only batch with {len(prompts)} prompts")

	    # generate outputs
	    outputs = model.generate(
	        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
	        input_ids=input_ids,
	        attention_mask=attention_mask,
	        **generation_config
	    )

	    # Decode responses
	    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

	    return responses
	```
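
The placeholder substitution inside `nuextract_generate` can be tried in isolation, without a model. The sketch below uses a hypothetical helper, `expand_image_placeholders`, and a toy `num_image_token=2`; in the real function this value is read from `model.num_image_token`.

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"

def expand_image_placeholders(prompt, patches, num_image_token):
    """Replace each '<image>' in order with the token run for the matching image."""
    for num_patches in patches:
        image_tokens = (IMG_START_TOKEN
                        + IMG_CONTEXT_TOKEN * (num_image_token * num_patches)
                        + IMG_END_TOKEN)
        # count=1 so that the i-th placeholder receives the i-th image's tokens
        prompt = prompt.replace("<image>", image_tokens, 1)
    return prompt

# Two images: the first has 1 patch, the second has 2 patches.
expanded = expand_image_placeholders("A: <image>\nB: <image>", patches=[1, 2], num_image_token=2)
print(expanded)
```

Each image contributes `num_image_token * num_patches` context tokens, so the first placeholder expands to 2 tokens and the second to 4 in this toy setting.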

	To load the model:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = ""

	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",  # we recommend using flash attention
	).to("cuda")
	```

	Simple 0-shot text-only example:
	```python
	template = """{"names": ["verbatim-string"]}"""
	text = "John went to the restaurant with Mary. James went to the cinema."

	input_messages = [construct_message(text, template)]

	input_content = prepare_inputs(
	    messages=input_messages,
	    image_paths=[],
	    tokenizer=tokenizer,
	)

	generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

	with torch.no_grad():
	    result = nuextract_generate(
	        model=model,
	        tokenizer=tokenizer,
	        prompts=input_content['prompts'],
	        pixel_values_list=input_content['pixel_values_list'],
	        num_patches_list=input_content['num_patches_list'],
	        generation_config=generation_config
	    )
	for y in result:
	    print(y)
	# {"names": ["John", "Mary", "James"]}
	```
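
Each response is returned as a string, so downstream code will usually want to parse it. A minimal sketch, assuming the response is expected to be a single JSON document; `parse_extraction` is a hypothetical helper and returning `None` on failure is just one possible policy.

```python
import json

def parse_extraction(raw):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        # e.g. output truncated because max_new_tokens was reached
        return None

parsed = parse_extraction('{"names": ["John", "Mary", "James"]}')
print(parsed["names"])
```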

	Text-only input with an in-context example:
	```python
	template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
	text = "John went to the restaurant with Mary. James went to the cinema."
	examples = [
	    {
	        "input": "Stephen is the manager at Susan's store.",
	        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
	    }
	]

	input_messages = [construct_message(text, template, examples)]

	input_content = prepare_inputs(
	    messages=input_messages,
	    image_paths=[],
	    tokenizer=tokenizer,
	)

	generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

	with torch.no_grad():
	    result = nuextract_generate(
	        model=model,
	        tokenizer=tokenizer,
	        prompts=input_content['prompts'],
	        pixel_values_list=input_content['pixel_values_list'],
	        num_patches_list=input_content['num_patches_list'],
	        generation_config=generation_config
	    )
	for y in result:
	    print(y)
	# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
	```

	Example with image input and an in-context example. Image inputs should use the `<image>` placeholder in place of text, and image paths should be provided as a list in order of appearance in the prompt (in this example, `0.jpg` belongs to the in-context example and `1.jpg` to the true input).
	```python
	template = """{"store": "verbatim-string"}"""
	text = "<image>"
	examples = [
	    {
	        "input": "<image>",
	        "output": """{"store": "Walmart"}"""
	    }
	]

	input_messages = [construct_message(text, template, examples)]

	images = [
	    ["0.jpg", "1.jpg"]
	]

	input_content = prepare_inputs(
	    messages=input_messages,
	    image_paths=images,
	    tokenizer=tokenizer,
	)

	generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

	with torch.no_grad():
	    result = nuextract_generate(
	        model=model,
	        tokenizer=tokenizer,
	        prompts=input_content['prompts'],
	        pixel_values_list=input_content['pixel_values_list'],
	        num_patches_list=input_content['num_patches_list'],
	        generation_config=generation_config
	    )
	for y in result:
	    print(y)
	# {"store": "Trader Joe's"}
	```
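
Since placeholders are matched to image paths purely by position, a mismatch between the two is easy to introduce. The following sketch shows one way to check this before calling `prepare_inputs`; `check_image_counts` is a hypothetical helper and assumes the messages are plain strings as produced by `construct_message`.

```python
def check_image_counts(messages, image_paths):
    """Raise ValueError if a message's '<image>' count differs from its number of paths."""
    for idx, message in enumerate(messages):
        paths = image_paths[idx] if idx < len(image_paths) else None
        expected = message.count("<image>")
        provided = len(paths) if paths else 0
        if expected != provided:
            raise ValueError(
                f"message {idx}: {expected} <image> placeholder(s) "
                f"but {provided} image path(s)")

# One in-context example image plus the true input image.
message = "# Template:\n{}\n# Examples:\n## Input:\n<image>\n## Output:\n{}\n# Context:\n<image>"
check_image_counts([message], [["0.jpg", "1.jpg"]])  # passes silently
```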

	Multi-modal batched input:
	```python
	inputs = [
	    # image input with no ICL examples
	    {
	        "text": "<image>",
	        "template": """{"store_name": "verbatim-string"}""",
	        "examples": None,
	    },
	    # image input with 1 ICL example
	    {
	        "text": "<image>",
	        "template": """{"store_name": "verbatim-string"}""",
	        "examples": [
	            {
	                "input": "<image>",
	                "output": """{"store_name": "Walmart"}""",
	            }
	        ],
	    },
	    # text input with no ICL examples
	    {
	        "text": "John went to the restaurant with Mary. James went to the cinema.",
	        "template": """{"names": ["verbatim-string"]}""",
	        "examples": None,
	    },
	    # text input with ICL example
	    {
	        "text": "John went to the restaurant with Mary. James went to the cinema.",
	        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
	        "examples": [
	            {
	                "input": "Stephen is the manager at Susan's store.",
	                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
	            }
	        ],
	    },
	]

	input_messages = [
	    construct_message(
	        x["text"],
	        x["template"],
	        x["examples"]
	    ) for x in inputs
	]

	images = [
	    ["0.jpg"],
	    ["0.jpg", "1.jpg"],
	    None,
	    None
	]

	input_content = prepare_inputs(
	    messages=input_messages,
	    image_paths=images,
	    tokenizer=tokenizer,
	)

	generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

	with torch.no_grad():
	    result = nuextract_generate(
	        model=model,
	        tokenizer=tokenizer,
	        prompts=input_content['prompts'],
	        pixel_values_list=input_content['pixel_values_list'],
	        num_patches_list=input_content['num_patches_list'],
	        generation_config=generation_config
	    )
	for y in result:
	    print(y)
	# {"store_name": "WAL*MART"}
	# {"store_name": "Trader Joe's"}
	# {"names": ["John", "Mary", "James"]}
	# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
	```

	## Template Generation
	If you want to convert existing schema files from other formats (e.g. XML or YAML), or start from an example or a plain-language description, NuExtract 2 models can generate the template for you.

	E.g. convert XML into a NuExtract template:
	```python
	def generate_template(description):
	    input_messages = [description]
	    input_content = prepare_inputs(
	        messages=input_messages,
	        image_paths=[],
	        tokenizer=tokenizer,
	    )
	    generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256}
	    with torch.no_grad():
	        result = nuextract_generate(
	            model=model,
	            tokenizer=tokenizer,
	            prompts=input_content['prompts'],
	            pixel_values_list=input_content['pixel_values_list'],
	            num_patches_list=input_content['num_patches_list'],
	            generation_config=generation_config
	        )
	    return result[0]

	xml_template = """<SportResult>
	<Date></Date>
	<Sport></Sport>
	<Venue></Venue>
	<HomeTeam></HomeTeam>
	<AwayTeam></AwayTeam>
	<HomeScore></HomeScore>
	<AwayScore></AwayScore>
	<TopScorer></TopScorer>
	</SportResult>"""
	result = generate_template(xml_template)

	print(result)
	# {
	#     "SportResult": {
	#         "Date": "date-time",
	#         "Sport": "verbatim-string",
	#         "Venue": "verbatim-string",
	#         "HomeTeam": "verbatim-string",
	#         "AwayTeam": "verbatim-string",
	#         "HomeScore": "integer",
	#         "AwayScore": "integer",
	#         "TopScorer": "verbatim-string"
	#     }
	# }
	```

	E.g. generate a template from natural language description:
	```python
	text = """Give me relevant info about startup companies mentioned."""
	result = generate_template(text)

	print(result)
	# {
	#     "Startup_Companies": [
	#         {
	#             "Name": "verbatim-string",
	#             "Products": [
	#                 "string"
	#             ],
	#             "Location": "verbatim-string",
	#             "Company_Type": [
	#                 "Technology",
	#                 "Finance",
	#                 "Health",
	#                 "Education",
	#                 "Other"
	#             ]
	#         }
	#     ]
	# }
	```
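
Because template generation samples with a non-zero temperature, it can be worth checking a generated template before using it for extraction. The sketch below is not part of NuExtract: `SUPPORTED_TYPES` and `template_problems` are hypothetical names, and the check only covers the types listed in the Overview section.

```python
import json

SUPPORTED_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}

def template_problems(spec, path="$"):
    """Return a list of human-readable problems found in a template fragment."""
    if isinstance(spec, dict):  # nested object: check every field
        problems = []
        for key, sub in spec.items():
            problems += template_problems(sub, f"{path}.{key}")
        return problems
    if isinstance(spec, list):
        if not spec:
            return [f"{path}: empty array"]
        if len(spec) == 1 and isinstance(spec[0], list):    # multi-label
            options = spec[0]
        elif len(spec) == 1 and isinstance(spec[0], dict):  # array of objects
            return template_problems(spec[0], f"{path}[]")
        elif len(spec) == 1 and spec[0] in SUPPORTED_TYPES:  # array of a type
            return []
        else:                                                # enum
            options = spec
        return [f"{path}: non-string option {o!r}" for o in options if not isinstance(o, str)]
    if spec in SUPPORTED_TYPES:
        return []
    return [f"{path}: unsupported type {spec!r}"]

generated = json.loads("""{"Startup_Companies": [{
    "Name": "verbatim-string",
    "Products": ["string"],
    "Location": "verbatim-string",
    "Company_Type": ["Technology", "Finance", "Health", "Education", "Other"]
}]}""")
print(template_problems(generated))        # []
print(template_problems({"age": "int"}))   # ["$.age: unsupported type 'int'"]
```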