
I combined FLUX Fill with ControlNet for structured inpainting
I've been experimenting with FLUX.1-Fill-dev lately and kept running into the same wall: the Fill model is great for mask-based edits, but there's no built-in way to feed it a ControlNet signal (depth, canny, pose, etc.) at the same time.
So I built one.
The idea is simple:
FLUX Fill handles the mask-based edit, while ControlNet guides the structure using inputs like depth, canny, pose, tile, blur, gray, or low-quality conditioning. This makes the inpainting more controlled, especially when you want the generated object or edit to follow a specific structure or composition.
Since FLUX.1-Fill-dev was not originally trained jointly with ControlNet, this is more of an experimental/community implementation. In practice, it works well for structured inpainting, but results depend a lot on the mask quality, control image alignment, and conditioning strength.
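Since mask quality matters so much, here is a minimal sketch of how a clean binary mask can be built with plain Pillow — white marks the region the Fill model regenerates, black is preserved. The helper name, box coordinates, and paths are just illustrative, not part of the pipeline:

```python
import os
from PIL import Image, ImageDraw, ImageFilter

def make_rect_mask(size, box, feather=0):
    """Build an inpainting mask: white rectangle = edit region, black = keep."""
    mask = Image.new("L", size, 0)   # start fully black (preserve everything)
    draw = ImageDraw.Draw(mask)
    draw.rectangle(box, fill=255)    # white rectangle = area to regenerate
    if feather > 0:
        # soften the mask edge so the inpainted region blends into the image
        mask = mask.filter(ImageFilter.GaussianBlur(feather))
    return mask

os.makedirs("imgs", exist_ok=True)
mask = make_rect_mask((1024, 1024), (256, 400, 768, 900), feather=8)
mask.save("imgs/mask.png")
```

A slight feather on the mask edge usually blends better than a hard cutoff, at the cost of a thin transition band around the edit.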
Links
- Personal repo: https://github.com/pratim4dasude/pipline_flux_fill_controlnet_Inpaint
- Pipeline file (Diffusers community): https://github.com/huggingface/diffusers/blob/main/examples/community/pipline_flux_fill_controlnet_Inpaint.py
- Community Pipelines README (FLUX Fill ControlNet section): https://github.com/huggingface/diffusers/tree/main/examples/community#flux-fill-controlnet-pipeline
- FLUX Pipelines docs: https://huggingface.co/docs/diffusers/api/pipelines/flux
- ControlNet in Diffusers docs: https://huggingface.co/docs/diffusers/api/pipelines/controlnet_flux
Code example
```python
import torch
from diffusers import FluxControlNetModel
from diffusers.utils import load_image

from pipline_flux_fill_controlnet_Inpaint import FluxControlNetFillInpaintPipeline

dtype = torch.bfloat16
device = "cuda"

# Load the ControlNet and attach it to the FLUX Fill pipeline
controlnet = FluxControlNetModel.from_pretrained(
    "Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0",
    torch_dtype=dtype,
)

fill_pipe = FluxControlNetFillInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev",
    controlnet=controlnet,
    torch_dtype=dtype,
).to(device)

img = load_image("imgs/background.jpg")    # base image to edit
mask = load_image("imgs/mask.png")         # white = region to regenerate
ctrl = load_image("imgs/dog_depth_2.png")  # control image (here: a depth map)

result = fill_pipe(
    prompt="a dog on a bench",
    image=img,
    mask_image=mask,
    control_image=ctrl,
    control_mode=[2],  # canny=0, tile=1, depth=2, blur=3, pose=4
    controlnet_conditioning_scale=0.9,
    control_guidance_start=0.0,
    control_guidance_end=0.8,
    height=1024,
    width=1024,
    strength=1.0,
    guidance_scale=50.0,
    num_inference_steps=60,
    max_sequence_length=512,
)

result.images[0].save("output.jpg")
```
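One post-processing step worth adding: composite the generated result back onto the original through the mask, since the VAE encode/decode round trip can subtly alter pixels even outside the edit region. A minimal Pillow sketch — `paste_back` is just an illustrative helper, not part of the pipeline:

```python
from PIL import Image

def paste_back(original, generated, mask):
    """Keep `generated` where the mask is white, `original` elsewhere.

    This guarantees pixels outside the mask stay bit-identical to the
    input image, which running the image through the VAE alone does not.
    """
    return Image.composite(generated, original, mask.convert("L"))

# Usage (hypothetical, following the example above):
# final = paste_back(img, result.images[0], mask)
# final.save("output.jpg")
```

With a feathered mask, `Image.composite` also gives a smooth blend in the transition band instead of a hard seam.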
If you find this useful, a GitHub star ⭐ would really help support the project.