Comprehensive Relighting:
Generalizable and Consistent Monocular Human Relighting and Harmonization


We introduce Comprehensive Relighting, a generalizable and consistent model for relighting and harmonization that controls the lighting properties of a single image or video of humans with arbitrary body parts. Given target lighting conditions, e.g., spherical-harmonics coefficients, background scenes, or their combination (insets), our model performs consistent and harmonized relighting.
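To make the lighting condition concrete, below is a minimal sketch of the second-order spherical-harmonics (SH) representation referenced above: nine coefficients per color channel, evaluated against a surface normal to produce diffuse shading. The basis constants follow Ramamoorthi and Hanrahan (2001); the shapes and example values are illustrative assumptions, not the paper's code.

import numpy as np

def sh_basis(normal):
    # Evaluate the 9 second-order SH basis functions at a unit normal.
    x, y, z = normal
    return np.array([
        0.282095,                    # l=0
        0.488603 * y,                # l=1, m=-1
        0.488603 * z,                # l=1, m=0
        0.488603 * x,                # l=1, m=1
        1.092548 * x * y,            # l=2, m=-2
        1.092548 * y * z,            # l=2, m=-1
        0.315392 * (3 * z**2 - 1),   # l=2, m=0
        1.092548 * x * z,            # l=2, m=1
        0.546274 * (x**2 - y**2),    # l=2, m=2
    ])

# One row of 9 target lighting coefficients per RGB channel (hypothetical values).
sh_coeffs = 0.1 * np.random.randn(3, 9)
sh_coeffs[:, 0] += 0.8                # bias the ambient (l=0) term so shading stays positive

normal = np.array([0.0, 0.0, 1.0])    # camera-facing surface normal
shading = sh_coeffs @ sh_basis(normal)  # per-channel diffuse shading (RGB)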

Abstract

This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting in an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, which restricts existing image-based relighting models to specific scenarios (e.g., faces or static humans). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns lighting cycle consistency from many real-world videos without any ground truth. At inference time, the temporal lighting module is combined with the diffusion model through a spatio-temporal feature blending algorithm, without extra training; we then apply a novel guided refinement as post-processing to preserve the high-frequency details of the input image. In experiments, Comprehensive Relighting shows strong generalizability and temporal lighting coherence, outperforming existing image-based human relighting and harmonization methods.
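As a rough illustration of the unsupervised temporal lighting model, the sketch below penalizes a lighting feature that is stepped forward through a clip and then back again for drifting from its starting point; real-world videos supervise this cycle without any relighting ground truth. The module layout, feature dimension, and recurrent cells are assumptions for exposition, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalLightingModule(nn.Module):
    # Hypothetical temporal module: one recurrent cell steps the lighting
    # feature forward in time, another steps it backward.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.forward_step = nn.GRUCell(feat_dim, feat_dim)
        self.backward_step = nn.GRUCell(feat_dim, feat_dim)

    def cycle_loss(self, frame_feats):
        # frame_feats: (T, B, feat_dim) per-frame lighting features of a clip.
        h = frame_feats[0]
        for t in range(1, frame_feats.shape[0]):            # walk forward in time
            h = self.forward_step(frame_feats[t], h)
        for t in range(frame_feats.shape[0] - 2, -1, -1):   # then walk back
            h = self.backward_step(frame_feats[t], h)
        # The round trip should land back on the first frame's feature.
        return F.mse_loss(h, frame_feats[0])

module = TemporalLightingModule()
clip = torch.randn(8, 4, 256)    # 8 frames, batch of 4 clips
loss = module.cycle_loss(clip)
loss.backward()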

Our Goal

Static Image Relighting and Harmonization

Video Relighting and Harmonization

We introduce a generalizable human relighting model that can control the lighting in an image or video of humans with arbitrary body parts, producing results that are well harmonized with the conditioning scene, as shown.

Overview

(a) Given an input image of humans together with coarse lighting and a background image, our diffusion model generates relit images that are harmonized with the background scene. (b) External temporal modules learn temporal cycle consistency from many real-world videos to construct temporal lighting features. (c) At inference time, we blend the features from the lighting and temporal modules spatially and temporally to enable coherent and generalizable human relighting.
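A minimal sketch of the training-free blending in (c), assuming the relevant quantities are per-frame feature maps: the temporal branch is smoothed along the time axis, then mixed into the diffusion features only on the human region. The human mask, the blending weight alpha, and the [0.25, 0.5, 0.25] temporal kernel are illustrative assumptions, not the paper's exact scheme.

import torch

def blend_features(diffusion_feats, temporal_feats, human_mask, alpha=0.5):
    # diffusion_feats, temporal_feats: (T, C, H, W) intermediate features.
    # human_mask: (T, 1, H, W) soft mask, 1 on the human region.
    # Temporal blending: smooth the temporal branch along the time axis
    # with a [0.25, 0.5, 0.25] kernel (replicate-padded at the clip ends).
    smoothed = 0.5 * temporal_feats
    smoothed[1:] += 0.25 * temporal_feats[:-1]
    smoothed[:-1] += 0.25 * temporal_feats[1:]
    smoothed[0] += 0.25 * temporal_feats[0]
    smoothed[-1] += 0.25 * temporal_feats[-1]
    # Spatial blending: mix features on the human region only, and keep
    # the diffusion features untouched on the background.
    mixed = (1.0 - alpha) * diffusion_feats + alpha * smoothed
    return human_mask * mixed + (1.0 - human_mask) * diffusion_feats

T, C, H, W = 8, 64, 32, 32
out = blend_features(torch.randn(T, C, H, W), torch.randn(T, C, H, W),
                     torch.rand(T, 1, H, W))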

Static Image with Rotating Light

Portrait and Half-body Scenario

Full-body Scenario

Multi-person Scenario

Video Relighting