한줄 요약: 

짧은 요약(Abstract) :    



우리는 대규모 사전 학습된 텍스트-이미지 변환 모델에 공간적 조건 제어를 추가하기 위한 신경망 아키텍처인 ControlNet을 소개합니다. ControlNet은 실제 사용 준비가 완료된 대규모 변환 모델을 잠그고, 수십억 개의 이미지로 사전 학습된 깊고 강력한 인코딩 레이어를 강력한 백본으로 재사용하여 다양한 조건 제어를 학습합니다. 신경망 아키텍처는 '제로 컨볼루션' (제로로 초기화된 컨볼루션 레이어)으로 연결되어, 파라미터가 점진적으로 증가하고 미세 조정에 해로운 노이즈가 영향을 미치지 않도록 합니다. 우리는 단일 또는 다중 조건을 사용하여 Stable Diffusion과 함께 가장자리, 깊이, 분할, 인간 포즈 등 다양한 조건 제어를 테스트합니다. ControlNet의 훈련은 작은 (<50k) 및 큰 (>1m) 데이터 세트에서 모두 강력하다는 것을 보여줍니다. 광범위한 결과는 ControlNet이 이미지 변환 모델을 제어하는 더 넓은 응용 프로그램을 가능하게 할 수 있음을 보여줍니다.


We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.






* Useful sentences :

단어정리

Methodology

ControlNet은 대규모 사전 학습된 레이어를 재사용하여 특정 조건을 학습하는 깊고 강력한 인코더를 구축합니다. 원본 모델과 학습 가능한 복사본은 “제로 컨볼루션” 레이어를 통해 연결되며, 이는 훈련 중에 유해한 노이즈를 제거합니다. 다양한 실험을 통해 ControlNet이 단일 또는 다중 조건, 프롬프트 유무에 상관없이 Stable Diffusion을 효과적으로 제어할 수 있음을 검증했습니다. 다양한 조건 데이터 세트에 대한 결과는 ControlNet 구조가 더 넓은 범위의 조건에 적용될 가능성이 있으며 관련 응용 프로그램을 촉진할 수 있음을 보여줍니다.

이미지 확산 모델은 이미지의 노이즈를 점진적으로 제거하고 학습 도메인에서 샘플을 생성하는 것을 학습합니다. 이 제거 과정은 픽셀 공간이나 학습 데이터에서 인코딩된 잠재 공간에서 발생할 수 있습니다. Stable Diffusion은 이 공간에서 작업하면 학습 과정이 안정화됨을 보여준 잠재 이미지를 학습 도메인으로 사용합니다. 특히, Stable Diffusion은 VQ-GAN과 유사한 전처리 방법을 사용하여 512×512 픽셀 공간 이미지를 더 작은 64×64 잠재 이미지로 변환합니다. ControlNet을 Stable Diffusion에 추가하기 위해, 우리는 먼저 각 입력 조건 이미지를 512×512 입력 크기에서 Stable Diffusion의 크기와 일치하는 64×64 특징 공간 벡터로 변환합니다. 특히, 우리는 네 개의 컨볼루션 레이어로 구성된 작은 네트워크 E(·)를 사용하여 이미지 공간 조건 ci를 특징 공간 조건 벡터 cf로 인코딩합니다.

cf = E(ci). (4)

훈련 과정에서, 우리는 랜덤하게 50%의 텍스트 프롬프트 ct를 빈 문자열로 대체합니다. 이 접근 방식은 ControlNet이 프롬프트 대신 입력 조건 이미지를 통해 직접 의미를 인식할 수 있는 능력을 증가시킵니다.

추가 조건이 ControlNet의 노이즈 제거 확산 과정에 어떻게 영향을 미치는지 여러 가지 방법으로 제어할 수 있습니다.

ControlNet reuses the large-scale pretrained layers of source models to build a deep and strong encoder to learn specific conditions. The original model and trainable copy are connected via “zero convolution” layers that eliminate harmful noise during training. Extensive experiments verify that ControlNet can effectively control Stable Diffusion with single or multiple conditions, with or without prompts. Results on diverse conditioning datasets show that the ControlNet structure is likely to be applicable to a wider range of conditions and facilitate relevant applications.

Image diffusion models learn to progressively denoise images and generate samples from the training domain. The denoising process can occur in pixel space or in a latent space encoded from training data. Stable Diffusion uses latent images as the training domain as working in this space has been shown to stabilize the training process. Specifically, Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert 512×512 pixel-space images into smaller 64×64 latent images. To add ControlNet to Stable Diffusion, we first convert each input conditioning image from an input size of 512×512 into a 64×64 feature space vector that matches the size of Stable Diffusion. In particular, we use a tiny network E(·) of four convolution layers to encode an image-space condition ci into a feature space conditioning vector cf.

cf = E(ci). (4)

During the training process, we randomly replace 50% of text prompts ct with empty strings. This approach increases ControlNet’s ability to directly recognize semantics in the input conditioning images as a replacement for the prompt.

We can further control how the extra conditions of ControlNet affect the denoising diffusion process in several ways.

Results

질적 결과 (Qualitative Results)

다양한 프롬프트 설정에서 생성된 이미지를 Figure 1에서 보여줍니다. Figure 7에서는 프롬프트 없이 다양한 조건에서 ControlNet의 결과를 보여주며, ControlNet이 다양한 입력 조건 이미지를 해석하는 능력을 보여줍니다.

정량적 평가 (Quantitative Evaluation)

ControlNet의 성능을 평가하기 위해, 20개의 보지 않은 손으로 그린 스케치를 샘플링하고, 각 스케치를 5가지 방법(PITI, Sketch-Guided Diffusion (SGD), ControlNet-lite, ControlNet)으로 할당했습니다. 12명의 사용자가 “표시된 이미지의 품질”과 “스케치에 대한 충실도”에 대해 개별적으로 평가했습니다. 이로 인해 결과 품질과 조건 충실도에 대한 100개의 랭킹을 얻었으며, 사용자 선호도 지표로 Average Human Ranking (AHR)을 사용했습니다. 평균 랭킹은 Table 1에 나와 있습니다.

산업 모델과의 비교 (Comparison to Industrial Models)

Stable Diffusion V2 Depth-to-Image (SDv2-D2I) 모델과 비교하여, ControlNet은 NVIDIA RTX 3090Ti GPU 하나로 훈련하여도 경쟁력 있는 결과를 달성할 수 있음을 보여주었습니다. 12명의 사용자가 두 모델을 구별하도록 하였을 때, 평균 정확도는 0.52±0.17로 두 방법이 거의 구별할 수 없다는 것을 나타냅니다.

조건 재구성 및 FID 점수 (Condition Reconstruction and FID Score)

ADE20K 데이터 세트를 사용하여 조건 충실도를 평가했습니다. 최첨단 분할 방법인 OneFormer는 0.58의 IoU를 달성했습니다. 우리는 다양한 방법을 사용하여 ADE20K 분할을 조건으로 이미지를 생성하고, FID, CLIP 텍스트-이미지 점수, CLIP 미적 점수를 평가했습니다. Table 3에서 결과를 확인할 수 있습니다.

다른 방법과의 비교 (Comparison to Previous Methods)

Figure 9에서는 PITI, Sketch-Guided Diffusion, Taming Transformers와 우리의 방법을 시각적으로 비교합니다. ControlNet은 다양한 조건 이미지를 견고하게 처리하고, 선명하고 깨끗한 결과를 달성함을 관찰할 수 있습니다.

Qualitative Results

Figure 1 shows the generated images in several prompt settings. Figure 7 shows our results with various conditions without prompts, where the ControlNet robustly interprets content semantics in diverse input conditioning images.

Quantitative Evaluation

To evaluate the performance of ControlNet, we sampled 20 unseen hand-drawn sketches and assigned each sketch to 5 methods: PITI, Sketch-Guided Diffusion (SGD), ControlNet-lite, and ControlNet. We invited 12 users to rank these 20 groups of 5 results individually in terms of “the quality of displayed images” and “the fidelity to the sketch”. This resulted in 100 rankings for result quality and 100 for condition fidelity. We used the Average Human Ranking (AHR) as a preference metric. The average rankings are shown in Table 1.

Comparison to Industrial Models

Comparing to the industrial model Stable Diffusion V2 Depth-to-Image (SDv2-D2I), ControlNet showed that it can achieve competitive results even when trained on a single NVIDIA RTX 3090Ti GPU. When 12 users were asked to distinguish between the images generated by the two models, the average precision was 0.52±0.17, indicating that the results are almost indistinguishable.

Condition Reconstruction and FID Score

Using the ADE20K dataset, we evaluated the conditioning fidelity. The state-of-the-art segmentation method OneFormer achieved an IoU of 0.58. We generated images conditioned on ADE20K segmentations using different methods and evaluated FID, CLIP text-image scores, and CLIP aesthetic scores. The results are presented in Table 3.

Comparison to Previous Methods

Figure 9 presents a visual comparison between our method and PITI, Sketch-Guided Diffusion, and Taming Transformers. We observe that ControlNet robustly handles diverse conditioning images and achieves sharp and clean results.

네, 질적 평가(qualitative evaluation)는 주로 사람의 주관적인 판단에 의해 이루어집니다. 질적 평가는 특정 방법이나 모델이 얼마나 잘 작동하는지를 평가하기 위해 시각적으로 생성된 이미지를 검토하는 과정을 포함합니다. 이 과정에서 평가자들은 생성된 이미지의 시각적 품질, 조건 충실도, 그리고 기대하는 결과와의 일치도를 평가합니다.

예시로 설명

질적 평가 과정:
- 평가자들에게 여러 가지 설정에서 생성된 이미지를 보여줍니다.
- 예를 들어, 동일한 텍스트 프롬프트와 조건 이미지를 사용하여 다양한 모델(PITI, Sketch-Guided Diffusion, ControlNet 등)로 생성된 이미지를 나란히 보여줍니다.
- 평가자들은 각 이미지가 얼마나 잘 생성되었는지, 얼마나 선명한지, 텍스트 프롬프트와 조건을 얼마나 잘 반영하는지 등을 평가합니다.
평가 기준:
- 이미지 품질: 이미지는 얼마나 선명하고 깨끗한가? 색상과 세부 사항은 적절한가?
- 조건 충실도: 생성된 이미지는 주어진 조건(예: 엣지 맵, 포즈, 깊이 맵 등)을 얼마나 잘 반영하는가?
- 텍스트 프롬프트 일치도: 텍스트 프롬프트에 묘사된 내용을 이미지가 얼마나 잘 표현하는가?
평가 방법:
- 평가자들은 각 이미지를 1에서 5까지의 척도로 평가합니다.
- 여러 평가자의 점수를 평균내어 각 모델의 성능을 비교합니다.

Qualitative Evaluation

Qualitative evaluation primarily involves subjective judgments by human evaluators. It includes visually inspecting the generated images to assess how well a particular method or model performs. Evaluators review the visual quality, conditional fidelity, and alignment with the expected outcome.

Example Explanation

Qualitative Evaluation Process:
- Present evaluators with images generated under various settings.
- For example, show images generated by different models (PITI, Sketch-Guided Diffusion, ControlNet, etc.) using the same text prompt and conditioning image side by side.
- Evaluators assess how well each image is generated, how sharp it is, and how well it adheres to the text prompt and conditioning.
Evaluation Criteria:
- Image Quality: How sharp and clear is the image? Are the colors and details appropriate?
- Conditional Fidelity: How well does the generated image reflect the given conditions (e.g., edge map, pose, depth map)?
- Text Prompt Alignment: How well does the image represent the content described in the text prompt?
Evaluation Method:
- Evaluators rate each image on a scale from 1 to 5.
- The scores from multiple evaluators are averaged to compare the performance of each model.

예시

Stable Diffusion 및 ControlNet의 작동 방식을 예시를 통해 설명

기본 이미지 변환 과정

노이즈 추가 단계: 초기 이미지를 점진적으로 노이즈로 변환합니다. 예를 들어, 깨끗한 이미지가 있다면, 이 이미지에 점차적으로 노이즈를 추가하여 점점 더 노이즈가 많은 이미지로 만듭니다.
노이즈 제거 단계: 노이즈가 많은 이미지에서 점차적으로 노이즈를 제거하여 원래의 이미지를 복원합니다. 이 과정은 U-Net 구조를 사용하여 이루어집니다.

텍스트를 반영하는 방법

Stable Diffusion은 단순히 이미지에서 노이즈를 제거하는 것뿐만 아니라, 텍스트 정보를 반영하여 이미지가 특정 텍스트 설명과 일치하도록 만듭니다. 이를 위해 CLIP (Contrastive Language–Image Pre-Training) 모델을 사용합니다.

텍스트 인코딩: 사용자가 입력한 텍스트 프롬프트(예: “a beautiful sunset”)를 CLIP 모델을 사용하여 잠재 벡터로 변환합니다. 이 잠재 벡터는 텍스트의 의미를 숫자로 표현한 것입니다.
노이즈 제거 과정에서 텍스트 반영: U-Net 구조에서 노이즈를 제거할 때, 이 텍스트 잠재 벡터를 사용하여 이미지가 텍스트 설명과 일치하도록 조정합니다. 구체적으로는, 텍스트 잠재 벡터가 이미지의 특정 부분을 강조하거나 변경하는 방식으로 반영됩니다.

ControlNet의 역할

ControlNet은 이 과정에 추가적인 조건을 반영합니다. 예를 들어, 사용자가 이미지의 특정 부분을 강조하고 싶다면, 추가적인 입력 이미지(예: 엣지 맵, 깊이 맵, 분할 맵 등)를 제공하여 이러한 조건을 반영합니다.

예를 들어, 사용자가 “a beautiful sunset”이라는 텍스트 프롬프트와 함께 특정 형태의 엣지 맵을 제공한다고 가정해보겠습니다. ControlNet은 이 엣지 맵을 학습하여 텍스트 프롬프트와 함께 이미지를 생성하는 데 반영합니다. 이 과정에서 ControlNet은 추가적인 제어 조건(엣지 맵)을 반영하여 더욱 정확하고 세밀한 이미지를 생성할 수 있습니다.

예시

텍스트 프롬프트: “a beautiful sunset over the mountains”
조건 입력: 산의 윤곽을 나타내는 엣지 맵

과정은 다음과 같습니다:

CLIP 모델을 사용하여 텍스트 프롬프트 “a beautiful sunset over the mountains”을 잠재 벡터로 변환합니다.
ControlNet을 사용하여 엣지 맵을 잠재 벡터로 변환합니다.
U-Net 구조에서 노이즈 제거 과정을 진행하며, 텍스트 잠재 벡터와 엣지 맵 잠재 벡터를 동시에 반영합니다.
최종적으로, 텍스트 설명과 엣지 맵 조건을 모두 만족하는 이미지를 생성합니다.

결론

텍스트 프롬프트는 이미지 생성 과정에서 중요한 조건으로 작용하며, ControlNet은 추가적인 조건을 반영하여 더 정교한 이미지를 생성할 수 있도록 돕습니다. 이 과정은 CLIP 모델과 U-Net 구조를 결합하여 이루어지며, 다양한 입력 조건을 반영하여 원하는 이미지를 생성할 수 있습니다.

Basic Image Transformation Process

Noise Addition Stage: The initial image is progressively transformed into a noisy image. For example, starting with a clean image, noise is gradually added, making the image progressively noisier.
Noise Removal Stage: The noisy image is then progressively denoised to recover the original image. This process is carried out using a U-Net structure.

How Text is Incorporated

Stable Diffusion not only denoises the image but also incorporates text information to ensure that the generated image matches the text description. This is achieved using the CLIP (Contrastive Language–Image Pre-Training) model.

Text Encoding: The user-provided text prompt (e.g., “a beautiful sunset”) is converted into a latent vector using the CLIP model. This latent vector is a numerical representation of the text’s meaning.
Incorporating Text During Noise Removal: During the noise removal process with the U-Net structure, the text latent vector is used to adjust the image so that it aligns with the text description. Specifically, the text latent vector influences certain aspects of the image to emphasize or modify them according to the text.

Role of ControlNet

ControlNet adds additional conditioning to this process. For instance, if the user wants to emphasize specific parts of the image, they can provide additional input images (e.g., edge maps, depth maps, segmentation maps) that serve as conditions.

For example, let’s say the user provides the text prompt “a beautiful sunset” along with an edge map showing the outline of mountains. ControlNet learns from this edge map and incorporates it along with the text prompt to generate the image. This allows ControlNet to integrate additional control conditions (the edge map) to create a more precise and detailed image.

Example

Text Prompt: “a beautiful sunset over the mountains”
Condition Input: An edge map outlining the mountains

The process would be:

Use the CLIP model to convert the text prompt “a beautiful sunset over the mountains” into a latent vector.
Use ControlNet to convert the edge map into a latent vector.
During the noise removal process with the U-Net structure, incorporate both the text latent vector and the edge map latent vector.
The final image generated will satisfy both the text description and the edge map condition.

Conclusion

The text prompt acts as a crucial condition during the image generation process, and ControlNet facilitates the incorporation of additional conditions to create more sophisticated images. This process involves combining the CLIP model and the U-Net structure to reflect various input conditions and generate the desired image.

요약

이 논문은 대규모 사전 학습된 텍스트-이미지 변환 모델에 공간적 조건 제어를 추가하는 ControlNet이라는 신경망 아키텍처를 소개합니다. ControlNet은 ‘제로 컨볼루션’ 레이어를 사용하여 모델을 안정적으로 미세 조정할 수 있게 합니다. 실험 결과, ControlNet은 다양한 조건과 프롬프트에서 이미지 생성을 효과적으로 제어할 수 있음을 보여주었습니다. 질적 및 정량적 평가에서 ControlNet은 다른 모델보다 우수한 성능을 보였으며, 산업 모델과 비교했을 때도 경쟁력 있는 결과를 달성했습니다. ControlNet은 더 넓은 범위의 조건에 적용될 가능성이 있으며, 다양한 응용 프로그램에서 사용될 수 있습니다.

This paper introduces ControlNet, a neural network architecture that adds spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet uses “zero convolutions” to ensure stable fine-tuning of the model. Experimental results demonstrate that ControlNet effectively controls image generation under various conditions and prompts. Both qualitative and quantitative evaluations show that ControlNet outperforms other models and achieves competitive results compared to industrial models. ControlNet is likely to be applicable to a wider range of conditions and can be utilized in various applications.

기타

사실, 저는 Stable Diffusion 모델에 대해 잘 알지 못했고 이해하지도 못했습니다. 또한, 이 모델의 성능에 놀랐습니다. 추가로, ControlNet은 제어 요인을 사용하여 생성된 이미지를 조절했는데, 이는 정말 놀랍고 충격적이었습니다.

Actually, I didn’t know well and didn’t understand the Stable Diffusion model. I was also surprised by the performance of this model. Additionally, ControlNet used control factors to adjust the generated images, which is really surprising and stunning.

ControlNet은 대규모 사전 학습된 텍스트-이미지 변환 모델에 ‘제로 컨볼루션’을 사용하여 안정적으로 조건 제어를 추가합니다. 이를 통해 다양한 입력 조건에 따라 생성된 이미지를 정밀하게 조절할 수 있습니다.

ControlNet adds stable conditional control to large, pretrained text-to-image models using “zero convolutions.” This allows precise adjustment of generated images based on various input conditions.

ControlNet은 ‘제로 컨볼루션’을 사용하여 텍스트-이미지 변환 모델에 조건 제어를 추가합니다.

ControlNet adds conditional control to text-to-image models using “zero convolutions.”

제로 컨볼루션은 가중치와 바이어스가 모두 0으로 초기화된 컨볼루션 레이어입니다. 학습 초기 단계에서 이 레이어들은 모델에 추가적인 노이즈를 발생시키지 않으며, 점진적으로 파라미터를 학습해 나가면서 모델이 안정적으로 훈련될 수 있도록 돕습니다.

Zero convolutions are convolution layers where both weights and biases are initialized to zero. In the early stages of training, these layers do not introduce additional noise to the model and progressively learn parameters, helping the model train stably.

제로 컨볼루션을 사용함으로써 ControlNet은 대규모 사전 학습된 텍스트-이미지 변환 모델에 안정적으로 조건 제어를 추가할 수 있었습니다. 이를 통해 다양한 입력 조건을 반영하여 생성된 이미지를 정밀하게 조절할 수 있었으며, 더 넓은 범위의 조건에 적용 가능한 강력한 제어 능력을 확보했습니다.

By using zero convolutions, ControlNet was able to add stable conditional control to large, pretrained text-to-image models. This allowed for precise adjustment of generated images based on various input conditions, resulting in a robust control capability applicable to a wider range of conditions.

ControlNet adds conditional control to text-to-image models using zero convolutions which are convolution layers where both weights and biases are initialized to zero. By using zero convolutions, ControlNet allowed for precise adjustment of generated images based on various input conditions, resulting in a robust control capability.

refer format:

Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. “Adding Conditional Control to Text-to-Image Diffusion Models.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3836–3847. October 2023.

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. “Adding Conditional Control to Text-to-Image Diffusion Models.” Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Human-Computer Interaction (cs.HC), Multimedia (cs.MM), arXiv:2302.05543, 2023. DOI: https://doi.org/10.48550/arXiv.2302.05543.

@article{zhang2023controlnet, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Anyi Rao and Maneesh Agrawala}, journal={arXiv preprint arXiv:2302.05543}, year={2023}, note={Codes and Supplementary Material: \url{https://doi.org/10.48550/arXiv.2302.05543}}, url={https://doi.org/10.48550/arXiv.2302.05543}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV}, }