new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 12

Is Underwater Image Enhancement All Object Detectors Need?

Underwater object detection is a crucial and challenging problem in marine engineering and aquatic robot. The difficulty is partly because of the degradation of underwater images caused by light selective absorption and scattering. Intuitively, enhancing underwater images can benefit high-level applications like underwater object detection. However, it is still unclear whether all object detectors need underwater image enhancement as pre-processing. We therefore pose the questions "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?". With these two questions, we conduct extensive studies. Specifically, we use 18 state-of-the-art underwater image enhancement algorithms, covering traditional, CNN-based, and GAN-based algorithms, to pre-process underwater object detection data. Then, we retrain 7 popular deep learning-based object detectors using the corresponding results enhanced by different algorithms, obtaining 126 underwater object detection models. Coupled with 7 object detection models retrained using raw underwater images, we employ these 133 models to comprehensively analyze the effect of underwater image enhancement on underwater object detection. We expect this study can provide sufficient exploration to answer the aforementioned questions and draw more attention of the community to the joint problem of underwater image enhancement and underwater object detection. The pre-trained models and results are publicly available and will be regularly updated. Project page: https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/uw_enhancement_affect_detection.

  • 7 authors
·
Nov 30, 2023

LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond

Low-light image enhancement (LLIE) is essential for numerous computer vision tasks, including object detection, tracking, segmentation, and scene understanding. Despite substantial research on improving low-quality images captured in underexposed conditions, clear vision remains critical for autonomous vehicles, which often struggle with low-light scenarios, signifying the need for continuous research. However, paired datasets for LLIE are scarce, particularly for street scenes, limiting the development of robust LLIE methods. Despite using advanced transformers and/or diffusion-based models, current LLIE methods struggle in real-world low-light conditions and lack training on street-scene datasets, limiting their effectiveness for autonomous vehicles. To bridge these gaps, we introduce a new dataset LoLI-Street (Low-Light Images of Streets) with 33k paired low-light and well-exposed images from street scenes in developed cities, covering 19k object classes for object detection. LoLI-Street dataset also features 1,000 real low-light test images for testing LLIE models under real-life conditions. Furthermore, we propose a transformer and diffusion-based LLIE model named "TriFuse". Leveraging the LoLI-Street dataset, we train and evaluate our TriFuse and SOTA models to benchmark on our dataset. Comparing various models, our dataset's generalization feasibility is evident in testing across different mainstream datasets by significantly enhancing images and object detection for practical applications in autonomous driving and surveillance systems. The complete code and dataset is available on https://github.com/tanvirnwu/TriFuse.

  • 6 authors
·
Oct 13, 2024

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework can work with both real-paired datasets, SOTA noise models, and different backbone networks. Note that, the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted.

  • 7 authors
·
Jul 15, 2023

DiffLLE: Diffusion-guided Domain Calibration for Unsupervised Low-light Image Enhancement

Existing unsupervised low-light image enhancement methods lack enough effectiveness and generalization in practical applications. We suppose this is because of the absence of explicit supervision and the inherent gap between real-world scenarios and the training data domain. In this paper, we develop Diffusion-based domain calibration to realize more robust and effective unsupervised Low-Light Enhancement, called DiffLLE. Since the diffusion model performs impressive denoising capability and has been trained on massive clean images, we adopt it to bridge the gap between the real low-light domain and training degradation domain, while providing efficient priors of real-world content for unsupervised models. Specifically, we adopt a naive unsupervised enhancement algorithm to realize preliminary restoration and design two zero-shot plug-and-play modules based on diffusion model to improve generalization and effectiveness. The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve, which makes the enhancement model perform robustly even in sophisticated wild degradation. Due to the limited enhancement effect of the unsupervised model, we further develop the Fine-grained Target domain Distillation (FTD) module to find a more visual-friendly solution space. It exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrinks the preliminary restored results from a coarse normal-light domain to a finer high-quality clean field, addressing the lack of strong explicit supervision for unsupervised methods. Benefiting from these, our approach even outperforms some supervised methods by using only a simple unsupervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE.

  • 6 authors
·
Aug 17, 2023 1

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code are available at https://github.com/GeWu-Lab/Patch-Matters

  • 5 authors
·
Apr 9

Super-resolving Real-world Image Illumination Enhancement: A New Dataset and A Conditional Diffusion Model

Most existing super-resolution methods and datasets have been developed to improve the image quality in well-lighted conditions. However, these methods do not work well in real-world low-light conditions as the images captured in such conditions lose most important information and contain significant unknown noises. To solve this problem, we propose a SRRIIE dataset with an efficient conditional diffusion probabilistic models-based method. The proposed dataset contains 4800 paired low-high quality images. To ensure that the dataset are able to model the real-world image degradation in low-illumination environments, we capture images using an ILDC camera and an optical zoom lens with exposure levels ranging from -6 EV to 0 EV and ISO levels ranging from 50 to 12800. We comprehensively evaluate with various reconstruction and perceptual metrics and demonstrate the practicabilities of the SRRIIE dataset for deep learning-based methods. We show that most existing methods are less effective in preserving the structures and sharpness of restored images from complicated noises. To overcome this problem, we revise the condition for Raw sensor data and propose a novel time-melding condition for diffusion probabilistic model. Comprehensive quantitative and qualitative experimental results on the real-world benchmark datasets demonstrate the feasibility and effectivenesses of the proposed conditional diffusion probabilistic model on Raw sensor data. Code and dataset will be available at https://github.com/Yaofang-Liu/Super-Resolving

  • 7 authors
·
Oct 16, 2024

RAP-SR: RestorAtion Prior Enhancement in Diffusion Models for Realistic Image Super-Resolution

Benefiting from their powerful generative capabilities, pretrained diffusion models have garnered significant attention for real-world image super-resolution (Real-SR). Existing diffusion-based SR approaches typically utilize semantic information from degraded images and restoration prompts to activate prior for producing realistic high-resolution images. However, general-purpose pretrained diffusion models, not designed for restoration tasks, often have suboptimal prior, and manually defined prompts may fail to fully exploit the generated potential. To address these limitations, we introduce RAP-SR, a novel restoration prior enhancement approach in pretrained diffusion models for Real-SR. First, we develop the High-Fidelity Aesthetic Image Dataset (HFAID), curated through a Quality-Driven Aesthetic Image Selection Pipeline (QDAISP). Our dataset not only surpasses existing ones in fidelity but also excels in aesthetic quality. Second, we propose the Restoration Priors Enhancement Framework, which includes Restoration Priors Refinement (RPR) and Restoration-Oriented Prompt Optimization (ROPO) modules. RPR refines the restoration prior using the HFAID, while ROPO optimizes the unique restoration identifier, improving the quality of the resulting images. RAP-SR effectively bridges the gap between general-purpose models and the demands of Real-SR by enhancing restoration prior. Leveraging the plug-and-play nature of RAP-SR, our approach can be seamlessly integrated into existing diffusion-based SR methods, boosting their performance. Extensive experiments demonstrate its broad applicability and state-of-the-art results. Codes and datasets will be available upon acceptance.

  • 6 authors
·
Dec 9, 2024

Burstormer: Burst Image Restoration and Enhancement Transformer

On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. However, individual frames in a burst are misaligned due to inevitable motions and contain multiple degradations. The challenge is to properly align the successive image shots and merge their complimentary information to achieve high-quality outputs. Towards this direction, we propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement. In comparison to existing works, our approach exploits multi-scale local and non-local features to achieve improved alignment and feature fusion. Our key idea is to enable inter-frame communication in the burst neighborhoods for information aggregation and progressive fusion while modeling the burst-wide context. However, the input burst frames need to be properly aligned before fusing their information. Therefore, we propose an enhanced deformable alignment module for aligning burst features with regards to the reference frame. Unlike existing methods, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism, which facilitates handling complex motions. After multi-level alignment and enrichment, we re-emphasize on inter-frame communication within burst using a cyclic burst sampling module. Finally, the inter-frame information is aggregated using the proposed burst feature fusion module followed by progressive upsampling. Our Burstormer outperforms state-of-the-art methods on burst super-resolution, burst denoising and burst low-light enhancement. Our codes and pretrained models are available at https:// github.com/akshaydudhane16/Burstormer

  • 5 authors
·
Apr 3, 2023

Latent Diffusion Model for Medical Image Standardization and Enhancement

Computed tomography (CT) serves as an effective tool for lung cancer screening, diagnosis, treatment, and prognosis, providing a rich source of features to quantify temporal and spatial tumor changes. Nonetheless, the diversity of CT scanners and customized acquisition protocols can introduce significant inconsistencies in texture features, even when assessing the same patient. This variability poses a fundamental challenge for subsequent research that relies on consistent image features. Existing CT image standardization models predominantly utilize GAN-based supervised or semi-supervised learning, but their performance remains limited. We present DiffusionCT, an innovative score-based DDPM model that operates in the latent space to transform disparate non-standard distributions into a standardized form. The architecture comprises a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the bottleneck position. First, the encoder-decoder is trained independently, without embedding DDPM, to capture the latent representation of the input data. Second, the latent DDPM model is trained while keeping the encoder-decoder parameters fixed. Finally, the decoder uses the transformed latent representation to generate a standardized CT image, providing a more consistent basis for downstream analysis. Empirical tests on patient CT images indicate notable improvements in image standardization using DiffusionCT. Additionally, the model significantly reduces image noise in SPAD images, further validating the effectiveness of DiffusionCT for advanced imaging tasks.

  • 7 authors
·
Oct 8, 2023

Generative Diffusion Prior for Unified Image Restoration and Enhancement

Existing image restoration methods mostly leverage the posterior distribution of natural images. However, they often assume known degradation and also require supervised training, which restricts their adaptation to complex real applications. In this work, we propose the Generative Diffusion Prior (GDP) to effectively model the posterior distributions in an unsupervised sampling manner. GDP utilizes a pre-train denoising diffusion generative model (DDPM) for solving linear inverse, non-linear, or blind problems. Specifically, GDP systematically explores a protocol of conditional guidance, which is verified more practical than the commonly used guidance way. Furthermore, GDP is strength at optimizing the parameters of degradation model during the denoising process, achieving blind image restoration. Besides, we devise hierarchical guidance and patch-based methods, enabling the GDP to generate images of arbitrary resolutions. Experimentally, we demonstrate GDP's versatility on several image datasets for linear problems, such as super-resolution, deblurring, inpainting, and colorization, as well as non-linear and blind issues, such as low-light enhancement and HDR image recovery. GDP outperforms the current leading unsupervised methods on the diverse benchmarks in reconstruction quality and perceptual quality. Moreover, GDP also generalizes well for natural images or synthesized images with arbitrary sizes from various tasks out of the distribution of the ImageNet training set.

  • 8 authors
·
Apr 3, 2023

Learning Enriched Features for Real Image Restoration and Enhancement

With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography, medical imaging, and remote sensing. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named as MIRNet, achieves state-of-the-art results for a variety of image processing tasks, including image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNet.

  • 7 authors
·
Mar 15, 2020

AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation

Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) have demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal fusion strategies for multi-LoRA fusion strategies. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) weight encoding-base LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on original training data, and (2) fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvement in image generation perfermance, thereby facilitating scalable and data-efficient enhancement of foundational models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.

  • 7 authors
·
Aug 4

FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models

Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades, with the complexity from highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, while are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, however still lacking in controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.

  • 7 authors
·
Oct 18, 2024

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

  • 25 authors
·
Nov 18 5

DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models

Despite remarkable advancements, current Text-to-Image (T2I) models struggle with complex, long-form textual instructions, frequently failing to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I, which reveal deficiencies in handling composition, specific text, and fine textures. To address this, we propose DeCoT (Decomposition-CoT), a novel framework that leverages Large Language Models (LLMs) to significantly enhance T2I models' understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, where an LLM breaks down raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which transforms these units into a hierarchical or optimized single prompt tailored for existing T2I models. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects like "Text" and "Composition". Quantitative results, validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B), show that DeCoT, when integrated with Infinity-8B, achieves an average score of 3.52, outperforming the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Furthermore, human evaluations corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.

  • 4 authors
·
Aug 17

Enhancing Image Rescaling using Dual Latent Variables in Invertible Neural Network

Normalizing flow models have been used successfully for generative image super-resolution (SR) by approximating complex distribution of natural images to simple tractable distribution in latent space through Invertible Neural Networks (INN). These models can generate multiple realistic SR images from one low-resolution (LR) input using randomly sampled points in the latent space, simulating the ill-posed nature of image upscaling where multiple high-resolution (HR) images correspond to the same LR. Lately, the invertible process in INN has also been used successfully by bidirectional image rescaling models like IRN and HCFlow for joint optimization of downscaling and inverse upscaling, resulting in significant improvements in upscaled image quality. While they are optimized for image downscaling too, the ill-posed nature of image downscaling, where one HR image could be downsized to multiple LR images depending on different interpolation kernels and resampling methods, is not considered. A new downscaling latent variable, in addition to the original one representing uncertainties in image upscaling, is introduced to model variations in the image downscaling process. This dual latent variable enhancement is applicable to different image rescaling models and it is shown in extensive experiments that it can improve image upscaling accuracy consistently without sacrificing image quality in downscaled LR images. It is also shown to be effective in enhancing other INN-based models for image restoration applications like image hiding.

  • 4 authors
·
Jul 24, 2022

Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation

Large-scale generative models have achieved remarkable advancements in various visual tasks, yet their application to shadow removal in images remains challenging. These models often generate diverse, realistic details without adequate focus on fidelity, failing to meet the crucial requirements of shadow removal, which necessitates precise preservation of image content. In contrast to prior approaches that aimed to regenerate shadow-free images from scratch, this paper utilizes diffusion models to generate and refine image residuals. This strategy fully uses the inherent detailed information within shadowed images, resulting in a more efficient and faithful reconstruction of shadow-free content. Additionally, to revent the accumulation of errors during the generation process, a crosstimestep self-enhancement training strategy is proposed. This strategy leverages the network itself to augment the training data, not only increasing the volume of data but also enabling the network to dynamically correct its generation trajectory, ensuring a more accurate and robust output. In addition, to address the loss of original details in the process of image encoding and decoding of large generative models, a content-preserved encoder-decoder structure is designed with a control mechanism and multi-scale skip connections to achieve high-fidelity shadow-free image reconstruction. Experimental results demonstrate that the proposed method can reproduce high-quality results based on a large latent diffusion prior and faithfully preserve the original contents in shadow regions.

  • 6 authors
·
Dec 3, 2024

Self-NPO: Data-Free Diffusion Model Enhancement via Truncated Diffusion Fine-Tuning

Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes through CFG. However, prior NPO approaches rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we propose Self-NPO, specifically truncated diffusion fine-tuning, a data-free approach of negative preference optimization by directly learning from the model itself, eliminating the need for manual data labeling or reward model training. This data-free approach is highly efficient (less than 1% training cost of Diffusion-NPO) and achieves comparable performance to Diffusion-NPO in a data-free manner. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences. Code is available at https://github.com/G-U-N/Diffusion-NPO.

  • 7 authors
·
May 16

A Large-scale Empirical Study on Improving the Fairness of Deep Learning Models

Fairness has been a critical issue that affects the adoption of deep learning models in real practice. To improve model fairness, many existing methods have been proposed and evaluated to be effective in their own contexts. However, there is still no systematic evaluation among them for a comprehensive comparison under the same context, which makes it hard to understand the performance distinction among them, hindering the research progress and practical adoption of them. To fill this gap, this paper endeavours to conduct the first large-scale empirical study to comprehensively compare the performance of existing state-of-the-art fairness improving techniques. Specifically, we target the widely-used application scenario of image classification, and utilized three different datasets and five commonly-used performance metrics to assess in total 13 methods from diverse categories. Our findings reveal substantial variations in the performance of each method across different datasets and sensitive attributes, indicating over-fitting on specific datasets by many existing methods. Furthermore, different fairness evaluation metrics, due to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing methods and in-processing methods outperform post-processing methods, with pre-processing methods exhibiting the best performance. Our empirical study offers comprehensive recommendations for enhancing fairness in deep learning models. We approach the problem from multiple dimensions, aiming to provide a uniform evaluation platform and inspire researchers to explore more effective fairness solutions via a set of implications.

  • 4 authors
·
Jan 8, 2024

From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

  • 11 authors
·
Jul 11

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce VisualToolBench, a visual tool-use reasoning benchmark that rigorously evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think-with-images paradigm. VisualToolBench comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, VisualToolBench offers critical insights for advancing visual intelligence in MLLMs.

  • 11 authors
·
Oct 14

HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling

Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave's performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.

  • 4 authors
·
Jun 25 6

ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction

The emergence of diffusion models has significantly advanced image synthesis. The recent studies of model interaction and self-corrective reasoning approach in large language models offer new insights for enhancing text-to-image models. Inspired by these studies, we propose a novel method called ArtAug for enhancing text-to-image models in this paper. To the best of our knowledge, ArtAug is the first one that improves image synthesis models via model interactions with understanding models. In the interactions, we leverage human preferences implicitly learned by image understanding models to provide fine-grained suggestions for image synthesis models. The interactions can modify the image content to make it aesthetically pleasing, such as adjusting exposure, changing shooting angles, and adding atmospheric effects. The enhancements brought by the interaction are iteratively fused into the synthesis model itself through an additional enhancement module. This enables the synthesis model to directly produce aesthetically pleasing images without any extra computational cost. In the experiments, we train the ArtAug enhancement module on existing text-to-image models. Various evaluation metrics consistently demonstrate that ArtAug enhances the generative capabilities of text-to-image models without incurring additional computational costs. The source code and models will be released publicly.

  • 7 authors
·
Dec 17, 2024

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for object differences identifying, followed by a Difference Captions Generator for detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements of performance scores over SOTA models that trained with larger-scale datasets, in numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. Besides, we investigate alternative methods for generating image difference data through "object removal" and conduct thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights on synthesis of such contrastive dataset. To encourage further research and advance the field of multimodal data synthesis and enhancement of MLLMs' fundamental capabilities for image understanding, we release our codes and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.

  • 5 authors
·
Aug 8, 2024 2

LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as the challenging task within this domain. Existing approaches often rely on training a fusion matrix of multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we identify this straightforward method faces two major challenges: 1) concept confusion, where the model struggles to preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through concept injection constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, concept isolation constraints are introduced, refining the self-attention computation. Furthermore, latent re-initialization is proposed to effectively stimulate concept-specific latent within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer's performance compared to standard baselines, especially when eliminating the image-based conditions like canny edge or pose estimations. Code is released at https://github.com/Young98CN/LoRA_Composer

  • 11 authors
·
Mar 18, 2024

Early Exit or Not: Resource-Efficient Blind Quality Enhancement for Compressed Images

Lossy image compression is pervasively conducted to save communication bandwidth, resulting in undesirable compression artifacts. Recently, extensive approaches have been proposed to reduce image compression artifacts at the decoder side; however, they require a series of architecture-identical models to process images with different quality, which are inefficient and resource-consuming. Besides, it is common in practice that compressed images are with unknown quality and it is intractable for existing approaches to select a suitable model for blind quality enhancement. In this paper, we propose a resource-efficient blind quality enhancement (RBQE) approach for compressed images. Specifically, our approach blindly and progressively enhances the quality of compressed images through a dynamic deep neural network (DNN), in which an early-exit strategy is embedded. Then, our approach can automatically decide to terminate or continue enhancement according to the assessed quality of enhanced images. Consequently, slight artifacts can be removed in a simpler and faster process, while the severe artifacts can be further removed in a more elaborate process. Extensive experiments demonstrate that our RBQE approach achieves state-of-the-art performance in terms of both blind quality enhancement and resource efficiency. The code is available at https://github.com/RyanXingQL/RBQE.

  • 4 authors
·
Jun 30, 2020

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better perform fine-grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns of VLMs' inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.

  • 10 authors
·
Oct 21, 2024

MAXIM: Multi-Axis MLP for Image Processing

Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and `fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at https://github.com/google-research/maxim.

  • 7 authors
·
Jan 9, 2022

Fine-Grained Perturbation Guidance via Attention Head Selection

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.

Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network

This paper presents a novel network structure with illumination-aware gamma correction and complete image modelling to solve the low-light image enhancement problem. Low-light environments usually lead to less informative large-scale dark areas, directly learning deep representations from low-light images is insensitive to recovering normal illumination. We propose to integrate the effectiveness of gamma correction with the strong modelling capacities of deep networks, which enables the correction factor gamma to be learned in a coarse to elaborate manner via adaptively perceiving the deviated illumination. Because exponential operation introduces high computational complexity, we propose to use Taylor Series to approximate gamma correction, accelerating the training and inference speed. Dark areas usually occupy large scales in low-light images, common local modelling structures, e.g., CNN, SwinIR, are thus insufficient to recover accurate illumination across whole low-light images. We propose a novel Transformer block to completely simulate the dependencies of all pixels across images via a local-to-global hierarchical attention mechanism, so that dark areas could be inferred by borrowing the information from far informative regions in a highly effective manner. Extensive experiments on several benchmark datasets demonstrate that our approach outperforms state-of-the-art methods.

  • 5 authors
·
Aug 16, 2023

AllWeatherNet:Unified Image Enhancement for Autonomous Driving under Adverse Weather and Lowlight-conditions

Adverse conditions like snow, rain, nighttime, and fog, pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather-Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels: scene, object, and texture, by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination-aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather-Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model's generalization ability by applying it to unseen domains without re-training, achieving up to 3.9% mIoU improvement. Code can be accessed at: https://github.com/Jumponthemoon/AllWeatherNet.

  • 7 authors
·
Sep 3, 2024

InstaRevive: One-Step Image Enhancement via Dynamic Score Matching

Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at https://github.com/EternalEvan/InstaRevive.

  • 9 authors
·
Apr 21

Imaging foundation model for universal enhancement of non-ideal measurement CT

Non-ideal measurement computed tomography (NICT), which sacrifices optimal imaging standards for new advantages in CT imaging, is expanding the clinical application scope of CT images. However, with the reduction of imaging standards, the image quality has also been reduced, extremely limiting the clinical acceptability. Although numerous studies have demonstrated the feasibility of deep learning for the NICT enhancement in specific scenarios, their high data cost and limited generalizability have become large obstacles. The recent research on the foundation model has brought new opportunities for building a universal NICT enhancement model - bridging the image quality degradation with minimal data cost. However, owing to the challenges in the collection of large pre-training datasets and the compatibility of data variation, no success has been reported. In this paper, we propose a multi-scale integrated Transformer AMPlifier (TAMP), the first imaging foundation model for universal NICT enhancement. It has been pre-trained on a large-scale physical-driven simulation dataset with 3.6 million NICT-ICT image pairs, and is able to directly generalize to the NICT enhancement tasks with various non-ideal settings and body regions. Via the adaptation with few data, it can further achieve professional performance in real-world specific scenarios. Our extensive experiments have demonstrated that the proposed TAMP has significant potential for promoting the exploration and application of NICT and serving a wider range of medical scenarios.

  • 7 authors
·
Oct 2, 2024

Troublemaker Learning for Low-Light Image Enhancement

Low-light image enhancement (LLIE) restores the color and brightness of underexposed images. Supervised methods suffer from high costs in collecting low/normal-light image pairs. Unsupervised methods invest substantial effort in crafting complex loss functions. We address these two challenges through the proposed TroubleMaker Learning (TML) strategy, which employs normal-light images as inputs for training. TML is simple: we first dim the input and then increase its brightness. TML is based on two core components. First, the troublemaker model (TM) constructs pseudo low-light images from normal images to relieve the cost of pairwise data. Second, the predicting model (PM) enhances the brightness of pseudo low-light images. Additionally, we incorporate an enhancing model (EM) to further improve the visual performance of PM outputs. Moreover, in LLIE tasks, characterizing global element correlations is important because more information on the same object can be captured. CNN cannot achieve this well, and self-attention has high time complexity. Accordingly, we propose Global Dynamic Convolution (GDC) with O(n) time complexity, which essentially imitates the partial calculation process of self-attention to formulate elementwise correlations. Based on the GDC module, we build the UGDC model. Extensive quantitative and qualitative experiments demonstrate that UGDC trained with TML can achieve competitive performance against state-of-the-art approaches on public datasets. The code is available at https://github.com/Rainbowman0/TML_LLIE.

  • 8 authors
·
Feb 6, 2024

NoiSER: Noise is All You Need for Low-Light Image Enhancement

In this paper, we present an embarrassingly simple yet effective solution to a seemingly impossible mission, low-light image enhancement (LLIE) without access to any task-related data. The proposed solution, Noise SElf-Regression (NoiSER), simply learns a convolutional neural network equipped with a instance-normalization layer by taking a random noise image, N(0,sigma^2) for each pixel, as both input and output for each training pair, and then the low-light image is fed to the learned network for predicting the normal-light image. Technically, an intuitive explanation for its effectiveness is as follows: 1) the self-regression reconstructs the contrast between adjacent pixels of the input image, 2) the instance-normalization layers may naturally remediate the overall magnitude/lighting of the input image, and 3) the N(0,sigma^2) assumption for each pixel enforces the output image to follow the well-known gray-world hypothesis Gary-world_Hypothesis when the image size is big enough, namely, the averages of three RGB components of an image converge to the same value. Compared to existing SOTA LLIE methods with access to different task-related data, NoiSER is surprisingly highly competitive in enhancement quality, yet with a much smaller model size, and much lower training and inference cost. With only sim 1K parameters, NoiSER realizes about 1 minute for training and 1.2 ms for inference with 600x400 resolution on RTX 2080 Ti. As a bonus, NoiSER possesses automated over-exposure suppression ability and shows excellent performance on over-exposed photos.

  • 6 authors
·
Nov 9, 2022

CCDWT-GAN: Generative Adversarial Networks Based on Color Channel Using Discrete Wavelet Transform for Document Image Binarization

To efficiently extract textual information from color degraded document images is a significant research area. The prolonged imperfect preservation of ancient documents has led to various types of degradation, such as page staining, paper yellowing, and ink bleeding. These types of degradation badly impact the image processing for features extraction. This paper introduces a novelty method employing generative adversarial networks based on color channel using discrete wavelet transform (CCDWT-GAN). The proposed method involves three stages: image preprocessing, image enhancement, and image binarization. In the initial step, we apply discrete wavelet transform (DWT) to retain the low-low (LL) subband image, thereby enhancing image quality. Subsequently, we divide the original input image into four single-channel colors (red, green, blue, and gray) to separately train adversarial networks. For the extraction of global and local features, we utilize the output image from the image enhancement stage and the entire input image to train adversarial networks independently, and then combine these two results as the final output. To validate the positive impact of the image enhancement and binarization stages on model performance, we conduct an ablation study. This work compares the performance of the proposed method with other state-of-the-art (SOTA) methods on DIBCO and H-DIBCO ((Handwritten) Document Image Binarization Competition) datasets. The experimental results demonstrate that CCDWT-GAN achieves a top two performance on multiple benchmark datasets. Notably, on DIBCO 2013 and 2016 dataset, our method achieves F-measure (FM) values of 95.24 and 91.46, respectively.

  • 6 authors
·
May 27, 2023

Empowering Low-Light Image Enhancer through Customized Learnable Priors

Deep neural networks have achieved remarkable progress in enhancing low-light images by improving their brightness and eliminating noise. However, most existing methods construct end-to-end mapping networks heuristically, neglecting the intrinsic prior of image enhancement task and lacking transparency and interpretability. Although some unfolding solutions have been proposed to relieve these issues, they rely on proximal operator networks that deliver ambiguous and implicit priors. In this work, we propose a paradigm for low-light image enhancement that explores the potential of customized learnable priors to improve the transparency of the deep unfolding paradigm. Motivated by the powerful feature representation capability of Masked Autoencoder (MAE), we customize MAE-based illumination and noise priors and redevelop them from two perspectives: 1) structure flow: we train the MAE from a normal-light image to its illumination properties and then embed it into the proximal operator design of the unfolding architecture; and m2) optimization flow: we train MAE from a normal-light image to its gradient representation and then employ it as a regularization term to constrain noise in the model output. These designs improve the interpretability and representation capability of the model.Extensive experiments on multiple low-light image enhancement datasets demonstrate the superiority of our proposed paradigm over state-of-the-art methods. Code is available at https://github.com/zheng980629/CUE.

  • 7 authors
·
Sep 5, 2023

Bridging the Gap Between Computational Photography and Visual Recognition

What is the current state-of-the-art for image restoration and enhancement applied to degraded images acquired under less than ideal circumstances? Can the application of such algorithms as a pre-processing step to improve image interpretability for manual analysis or automatic visual recognition to classify scene content? While there have been important advances in the area of computational photography to restore or enhance the visual quality of an image, the capabilities of such techniques have not always translated in a useful way to visual recognition tasks. Consequently, there is a pressing need for the development of algorithms that are designed for the joint problem of improving visual appearance and recognition, which will be an enabling factor for the deployment of visual recognition tools in many real-world scenarios. To address this, we introduce the UG^2 dataset as a large-scale benchmark composed of video imagery captured under challenging conditions, and two enhancement tasks designed to test algorithmic impact on visual quality and automatic object recognition. Furthermore, we propose a set of metrics to evaluate the joint improvement of such tasks as well as individual algorithmic advances, including a novel psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance. We introduce six new algorithms for image restoration or enhancement, which were created as part of the IARPA sponsored UG^2 Challenge workshop held at CVPR 2018. Under the proposed evaluation regime, we present an in-depth analysis of these algorithms and a host of deep learning-based and classic baseline approaches. From the observed results, it is evident that we are in the early days of building a bridge between computational photography and visual recognition, leaving many opportunities for innovation in this area.

  • 24 authors
·
Jan 27, 2019

UNICE: Training A Universal Image Contrast Enhancer

Existing image contrast enhancement methods are typically designed for specific tasks such as under-/over-exposure correction, low-light and backlit image enhancement, etc. The learned models, however, exhibit poor generalization performance across different tasks, even across different datasets of a specific task. It is important to explore whether we can learn a universal and generalized model for various contrast enhancement tasks. In this work, we observe that the common key factor of these tasks lies in the need of exposure and contrast adjustment, which can be well-addressed if high-dynamic range (HDR) inputs are available. We hence collect 46,928 HDR raw images from public sources, and render 328,496 sRGB images to build multi-exposure sequences (MES) and the corresponding pseudo sRGB ground-truths via multi-exposure fusion. Consequently, we train a network to generate an MES from a single sRGB image, followed by training another network to fuse the generated MES into an enhanced image. Our proposed method, namely UNiversal Image Contrast Enhancer (UNICE), is free of costly human labeling. However, it demonstrates significantly stronger generalization performance than existing image contrast enhancement methods across and within different tasks, even outperforming manually created ground-truths in multiple no-reference image quality metrics. The dataset, code and model are available at https://github.com/BeyondHeaven/UNICE.

  • 2 authors
·
Jul 22

SMMix: Self-Motivated Image Mixing for Vision Transformers

CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs). However, the inconsistency between the mixed images and the corresponding labels harms its efficacy. Existing CutMix variants tackle this problem by generating more consistent mixed images or more precise mixed labels, but inevitably introduce heavy training overhead or require extra information, undermining ease of use. To this end, we propose an novel and effective Self-Motivated image Mixing method (SMMix), which motivates both image and label enhancement by the model under training itself. Specifically, we propose a max-min attention region mixing approach that enriches the attention-focused objects in the mixed images. Then, we introduce a fine-grained label assignment technique that co-trains the output tokens of mixed images with fine-grained supervision. Moreover, we devise a novel feature consistency constraint to align features from mixed and unmixed images. Due to the subtle designs of the self-motivated paradigm, our SMMix is significant in its smaller training overhead and better performance than other CutMix variants. In particular, SMMix improves the accuracy of DeiT-T/S/B, CaiT-XXS-24/36, and PVT-T/S/M/L by more than +1% on ImageNet-1k. The generalization capability of our method is also demonstrated on downstream tasks and out-of-distribution datasets. Our project is anonymously available at https://github.com/ChenMnZ/SMMix.

  • 6 authors
·
Dec 25, 2022

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.

Deep Optimal Transport: A Practical Algorithm for Photo-realistic Image Restoration

We propose an image restoration algorithm that can control the perceptual quality and/or the mean square error (MSE) of any pre-trained model, trading one over the other at test time. Our algorithm is few-shot: Given about a dozen images restored by the model, it can significantly improve the perceptual quality and/or the MSE of the model for newly restored images without further training. Our approach is motivated by a recent theoretical result that links between the minimum MSE (MMSE) predictor and the predictor that minimizes the MSE under a perfect perceptual quality constraint. Specifically, it has been shown that the latter can be obtained by optimally transporting the output of the former, such that its distribution matches the source data. Thus, to improve the perceptual quality of a predictor that was originally trained to minimize MSE, we approximate the optimal transport by a linear transformation in the latent space of a variational auto-encoder, which we compute in closed-form using empirical means and covariances. Going beyond the theory, we find that applying the same procedure on models that were initially trained to achieve high perceptual quality, typically improves their perceptual quality even further. And by interpolating the results with the original output of the model, we can improve their MSE on the expense of perceptual quality. We illustrate our method on a variety of degradations applied to general content images of arbitrary dimensions.

  • 4 authors
·
Jun 4, 2023

Textual Prompt Guided Image Restoration

Image restoration has always been a cutting-edge topic in the academic and industrial fields of computer vision. Since degradation signals are often random and diverse, "all-in-one" models that can do blind image restoration have been concerned in recent years. Early works require training specialized headers and tails to handle each degradation of concern, which are manually cumbersome. Recent works focus on learning visual prompts from data distribution to identify degradation type. However, the prompts employed in most of models are non-text, lacking sufficient emphasis on the importance of human-in-the-loop. In this paper, an effective textual prompt guided image restoration model has been proposed. In this model, task-specific BERT is fine-tuned to accurately understand user's instructions and generating textual prompt guidance. Depth-wise multi-head transposed attentions and gated convolution modules are designed to bridge the gap between textual prompts and visual features. The proposed model has innovatively introduced semantic prompts into low-level visual domain. It highlights the potential to provide a natural, precise, and controllable way to perform image restoration tasks. Extensive experiments have been done on public denoising, dehazing and deraining datasets. The experiment results demonstrate that, compared with popular state-of-the-art methods, the proposed model can obtain much more superior performance, achieving accurate recognition and removal of degradation without increasing model's complexity. Related source codes and data will be publicly available on github site https://github.com/MoTong-AI-studio/TextPromptIR.

  • 6 authors
·
Dec 11, 2023

Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.

  • 2 authors
·
Oct 16 2

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity due to the quadratic nature of self-attention operations concerning token length. While larger patch sizes enable attention computation efficiency, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page: https://qihao067.github.io/projects/DiMR

  • 6 authors
·
Jun 13, 2024 1

Iterative Prompt Learning for Unsupervised Backlit Image Enhancement

We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel-level image enhancement. We show that the open-world CLIP prior not only aids in distinguishing between backlit and well-lit images, but also in perceiving heterogeneous regions with different luminance, facilitating the optimization of the enhancement network. Unlike high-level and image manipulation tasks, directly applying CLIP to enhancement tasks is non-trivial, owing to the difficulty in finding accurate prompts. To solve this issue, we devise a prompt learning framework that first learns an initial prompt pair by constraining the text-image similarity between the prompt (negative/positive sample) and the corresponding image (backlit image/well-lit image) in the CLIP latent space. Then, we train the enhancement network based on the text-image similarity between the enhanced result and the initial prompt pair. To further improve the accuracy of the initial prompt pair, we iteratively fine-tune the prompt learning framework to reduce the distribution gaps between the backlit images, enhanced results, and well-lit images via rank learning, boosting the enhancement performance. Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability, without requiring any paired data.

  • 5 authors
·
Mar 30, 2023

Multi-dimensional Visual Prompt Enhanced Image Restoration via Mamba-Transformer Aggregation

Recent efforts on image restoration have focused on developing "all-in-one" models that can handle different degradation types and levels within single model. However, most of mainstream Transformer-based ones confronted with dilemma between model capabilities and computation burdens, since self-attention mechanism quadratically increase in computational complexity with respect to image size, and has inadequacies in capturing long-range dependencies. Most of Mamba-related ones solely scanned feature map in spatial dimension for global modeling, failing to fully utilize information in channel dimension. To address aforementioned problems, this paper has proposed to fully utilize complementary advantages from Mamba and Transformer without sacrificing computation efficiency. Specifically, the selective scanning mechanism of Mamba is employed to focus on spatial modeling, enabling capture long-range spatial dependencies under linear complexity. The self-attention mechanism of Transformer is applied to focus on channel modeling, avoiding high computation burdens that are in quadratic growth with image's spatial dimensions. Moreover, to enrich informative prompts for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers, benefiting for revealing underlying characteristic of various degradations from both spatial and channel perspectives, therefore, enhancing the capabilities of "all-in-one" model to solve various restoration tasks. Extensive experiment results on several image restoration benchmark tasks such as image denoising, dehazing, and deraining, have demonstrated that the proposed method can achieve new state-of-the-art performance, compared with many popular mainstream methods. Related source codes and pre-trained parameters will be public on github https://github.com/12138-chr/MTAIR.

  • 5 authors
·
Dec 20, 2024

RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.

  • 3 authors
·
Apr 2, 2024