Meta-CoT: Enhancing Granularity and Generalization in Image Editing

CVPR 2026

1 Tsinghua University    2 Hunyuan, Tencent

* Equal Contribution    † Corresponding Author

Gallery: interactive before/after examples of Meta-CoT edits, spanning object deletion, addition, replacement, and extraction; style transfer (collage, cross-stitch, origami, red-brick); color, material, and attribute changes; counting edits; camera motion (pan, tilt, rotation, viewpoint change); and reasoning-based edits (e.g., "Draw what it will look like after being solved").

Meta-CoT operates like a structured reasoning engine: it decomposes editing tasks into fundamental meta-operations, reasons step by step through a triplet framework, and aligns CoT with editing output through reinforcement learning.


Figure 1. Overview of Meta-CoT. We propose a two-level decomposition paradigm: (1) Triplet Decomposition into task, target, and understanding; (2) Meta-task Decomposition into five fundamental meta-tasks for generalization.

Abstract

Unified multi-modal understanding and generation models have shown improved image-editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which form of CoT and which training strategy can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties. (1) Decomposability. We observe that any editing intention can be represented as a triplet of (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing the editing operation over all targets. This decomposition sharpens the model's understanding of editing operations and guides it to learn each element of the triplet during training, substantially improving editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency (CEC) Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks.

Key Features

Three core innovations that enable granular understanding and broad generalization in image editing.

Triplet Decomposition

Decomposes any editing intention into a structured triplet: (task, target, required understanding) — enabling fine-grained reasoning over every editing element.

Meta-task Generalization

Five fundamental meta-tasks (add, delete, replace, camera motion, position change) are sufficient to generalize across 21+ diverse editing tasks.

CEC Reward

CoT-Editing Consistency Reward uses a VLM to align the model's reasoning chain with its actual editing output via reinforcement learning.

+15.8%
21-Task Benchmark
vs. no Meta-CoT
+13.0%
ImgEdit Benchmark
vs. BAGEL (w/ think)
21
Editing Tasks
Comprehensive coverage
5
Meta-tasks
Sufficient for generalization

Method

Triplet Decomposition

We observe that any editing intention can be decomposed into a triplet: (task, target, required understanding ability). This insight drives our first-level decomposition; a minimal sketch follows the list below.

  • Task: Identifies the editing operation type (e.g., replacement, addition, camera motion)
  • Target: Traverses all objects and regions that need to be edited
  • Understanding: Determines the visual understanding capability required
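To make the triplet concrete, here is a minimal Python sketch of the data structure and the per-target traversal. The names and stub logic are hypothetical: in Meta-CoT the decomposition is produced by the model's CoT, not by hand-written code.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """One decomposed editing intention: (task, target, understanding)."""
    task: str            # meta-operation type, e.g. "replace" or "add"
    target: str          # the object or region the edit applies to
    understanding: str   # visual ability required, e.g. "instance grounding"

def decompose(instruction: str) -> list[EditTriplet]:
    """Illustrative stub of the first-level decomposition.

    For "Change all yellow tomato slices to vibrant lime green", the CoT
    traverses every matching target and emits one triplet per object.
    """
    if "tomato" in instruction:
        return [
            EditTriplet(
                task="replace",
                target=f"yellow tomato slice #{i}",
                understanding="color recognition + instance grounding",
            )
            for i in range(1, 4)  # one triplet per detected slice
        ]
    raise NotImplementedError("stub covers only the worked example")
```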

Meta-task Decomposition

In the second decomposition level, we break down all editing tasks into five fundamental meta-tasks:

  • Addition — Adding new objects or elements
  • Deletion — Removing existing objects
  • Replacement — Swapping objects or attributes
  • Camera Motion — Changing viewpoint or perspective
  • Position Change — Moving objects spatially

Training on just these five meta-tasks achieves performance comparable to full-data training on 21 diverse editing tasks.
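As an illustration (not the paper's implementation), the reduction of diverse instructions to meta-task sequences can be sketched as a simple mapping; in Meta-CoT this reasoning is generated by the model itself:

```python
# The five meta-tasks named in the paper.
META_TASKS = ("addition", "deletion", "replacement",
              "camera_motion", "position_change")

# Illustrative reductions of diverse edits to meta-task sequences;
# the instructions are drawn from the gallery above.
EXAMPLES = {
    "Add 3 seagulls to the beach":                  ["addition"],
    "Remove the Christmas tree on the left":        ["deletion"],
    "Replace the guitar with a Coca-Cola bottle":   ["replacement"],
    "Pan the camera to the left":                   ["camera_motion"],
    "Rotate the top shoe 180 degrees horizontally": ["position_change"],
    # A composite edit reduces to a sequence of meta-tasks:
    "Change 3 strawberries to 1 strawberry":        ["deletion", "deletion"],
}

def reduce_to_meta_tasks(instruction: str) -> list[str]:
    """Hypothetical second-level decomposition into meta-task steps."""
    steps = EXAMPLES.get(instruction, [])
    assert all(step in META_TASKS for step in steps)
    return steps
```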


CoT-Editing Consistency Reward

A key challenge is ensuring that the model's editing behavior actually aligns with its CoT reasoning. We propose the CEC Reward (a sketch of the reward computation follows the list below):

  • Leverages a VLM (Qwen2.5-VL) to measure consistency between CoT and editing output
  • Rewards edits that faithfully follow the reasoning chain
  • Integrated into a Flow-GRPO reinforcement learning framework
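A minimal sketch of the reward computation, assuming a hypothetical `vlm_judge` wrapper around Qwen2.5-VL; the actual judging prompt and scoring scale are not specified here:

```python
def cec_reward(cot_text: str, edited_image, vlm_judge) -> float:
    """Sketch of the CoT-Editing Consistency reward.

    `vlm_judge` is a hypothetical wrapper around a VLM such as
    Qwen2.5-VL; the prompt and 0-10 scale below are assumptions.
    """
    prompt = (
        "Here is a chain-of-thought describing an intended image edit:\n"
        f"{cot_text}\n"
        "On a scale of 0 to 10, how faithfully does the attached edited "
        "image realize every step of this reasoning? Answer with a number."
    )
    score = vlm_judge.score(image=edited_image, prompt=prompt)  # hypothetical API
    return float(score) / 10.0  # normalize to [0, 1] for RL
```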

Training Pipeline


Stage 1: SFT on reasoning and editing. Stage 2: Flow-GRPO with CEC reward.
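In Stage 2, GRPO-style training samples several edits for the same prompt, scores each with the CEC reward, and normalizes rewards within the group to obtain advantages. A minimal sketch of this group-relative advantage, assuming the common GRPO formulation (the paper's exact normalization may differ):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: z-score each rollout's reward within the
    group of samples drawn for the same editing prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four edited samples for one prompt, scored by the CEC reward:
advantages = group_relative_advantages(np.array([0.9, 0.4, 0.7, 0.2]))
# Samples scoring above the group mean receive positive advantage and
# are reinforced; below-mean samples are penalized.
```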

Data Construction Pipeline


Automated pipeline for Meta-CoT training data with triplet annotation.
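As a rough illustration of what the pipeline produces, one training record might look like the following; the field names are hypothetical, since the paper's data format is not specified here:

```python
# Hypothetical shape of one triplet-annotated training record.
record = {
    "instruction": "Replace the guitar in front of the fox with a Coca-Cola bottle.",
    "triplet": {
        "task": "replacement",
        "target": "guitar",
        "understanding": "object grounding",
    },
    "cot": "Step 1: locate the guitar in front of the fox. Step 2: ...",
    "source_image": "path/to/input.png",
    "edited_image": "path/to/edited.png",
}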

Qualitative Results

Qualitative Comparison with CoT


Quantitative Results

Table 2. Overall Scores on the 21-Task Benchmark (GPT-4.1 evaluated)

| Method | Background | Color | Material | Action | Human Attr. | Style | Add | Remove | Replace | Text | Tone | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BAGEL (w/o think) | 6.172 | 6.668 | 6.024 | 4.182 | 4.933 | 6.460 | 5.102 | 5.814 | 5.226 | 5.612 | 6.213 | 5.673 |
| BAGEL (w/ think) | 6.686 | 6.469 | 5.969 | 3.997 | 4.404 | 5.783 | 4.764 | 5.221 | 4.963 | 5.530 | 4.591 | 5.307 |
| Train Editing Only | 6.743 | 6.537 | 6.052 | 4.155 | 4.710 | 6.060 | 4.960 | 5.600 | 5.170 | 5.620 | 5.310 | 5.538 |
| SFT (Meta-CoT) | 7.173 | 7.201 | 6.493 | 4.470 | 5.384 | 6.920 | 5.730 | 6.340 | 6.110 | 5.560 | 7.080 | 6.224 |
| Meta-CoT + RL (Ours) | 7.342 | 7.365 | 6.711 | 4.812 | 5.653 | 7.105 | 5.924 | 6.528 | 6.287 | 5.442 | 7.396 | 6.415 |

Table 3. System Comparison on ImgEdit (GPT-4.1 evaluated)

| Method | Add | Adjust | Extract | Replace | Remove | Background | Style | Hybrid | Action | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| FLUX Kontext [Pro] | 4.25 | 4.15 | 2.35 | 4.57 | 4.00 | 1.81 | 3.19 | 2.68 | 2.76 | 4.26 |
| GPT Image 1 [High] | 4.61 | 4.33 | 2.90 | 4.93 | 4.20 | 2.77 | 3.55 | 3.76 | 3.82 | 4.57 |
| Step1X-Edit | 3.08 | 3.21 | 2.24 | 3.84 | 3.05 | 2.04 | 2.85 | 1.91 | 2.65 | 3.14 |
| SeedEdit | 3.16 | 3.19 | 2.64 | 4.19 | 2.96 | 2.52 | 2.85 | 1.56 | 2.98 | 3.32 |
| BAGEL (w/o think) | 3.08 | 3.21 | 1.75 | 3.76 | 2.70 | 1.44 | 2.38 | 1.20 | 1.46 | 3.20 |
| BAGEL (w/ think) | 3.16 | 2.83 | 2.24 | 3.84 | 2.96 | 2.04 | 2.85 | 1.56 | 1.91 | 3.39 |
| Meta-CoT + RL (Ours) | 4.38 | 4.81 | 3.82 | 4.69 | 4.27 | 3.06 | 4.63 | 2.64 | 3.68 | 3.83 |

Human Evaluation

Pairwise human preference study comparing Meta-CoT against baseline methods.


Citation


        