One-line summary:

A study of how robust vision-language models are to typographic attacks.  
It proposes an evaluation dataset and a mitigation (adding extra visual information to the prompt).



Short summary (Abstract):


Large Vision-Language Models (LVLMs) combine vision encoders and Large Language Models (LLMs) to achieve remarkable performance in multimodal tasks. However, they are vulnerable to typographic attacks. This study makes three key contributions:

1. It verifies the susceptibility of current commercial and open-source LVLMs to typographic attacks, uncovering their widespread existence.
2. It proposes the most comprehensive and large-scale **Typographic Dataset** for evaluating typographic vulnerabilities under diverse multimodal tasks and typographic factors.
3. Based on the evaluations, it investigates why typographic attacks affect LVLMs and identifies three key insights.

The study also demonstrates how performance degradation due to typographic attacks can be reduced from 42.07% to 13.90%, paving the way for enhancing LVLM security.

Methodology

In this study, the vulnerability of Large Vision-Language Models (LVLMs) to typographic attacks was investigated using two representative models: LLaVA (Large Language and Vision Assistant) and InstructBLIP.

Typographic Dataset (TypoD):
- The dataset was derived from existing datasets such as ImageNet, DAQUAR, CountBench, and A-OKVQA.
- It was constructed in two main stages:
  - Factor Exploring Stage: A dataset was generated by exploring various typographic factors such as font size, opacity, color attributes, and spatial positioning.
  - Factor Fixing Stage: Based on the most effective typographic factors, TypoD-Base (1,570 images) and TypoD-Large (20,000 images) were created.
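
The Factor Exploring Stage can be pictured as a grid search over the four typographic factors. The sketch below is only an illustration under stated assumptions: the `TypoFactors` class, the helper names, and every concrete value (font sizes, opacities, colors, positions) are hypothetical and are not taken from the paper. It enumerates factor combinations and shows the standard alpha-blending formula that an opacity setting usually implies.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TypoFactors:
    """One typographic configuration (all values used below are hypothetical)."""
    font_size: int   # text height in pixels
    opacity: float   # 0.0 = invisible, 1.0 = fully opaque
    color: str       # named color of the overlaid text
    position: str    # coarse location of the text in the image


def factor_grid(font_sizes, opacities, colors, positions):
    """Enumerate every combination of the four factors."""
    return [TypoFactors(*combo)
            for combo in product(font_sizes, opacities, colors, positions)]


def blend_channel(background: int, text: int, opacity: float) -> int:
    """Alpha-blend one 0-255 color channel of the text over the background."""
    return round((1.0 - opacity) * background + opacity * text)


grid = factor_grid(
    font_sizes=[8, 16, 24],
    opacities=[0.25, 0.5, 1.0],
    colors=["white", "red"],
    positions=["top", "center", "bottom"],
)
print(len(grid))  # 3 * 3 * 2 * 3 combinations
print(grid[0])
```

Each `TypoFactors` entry would then drive one rendering of the misleading text onto a source image; the rendering step itself is omitted here.
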
Evaluation Methods:
- The performance of LVLMs was assessed by comparing their accuracy on normal images and images containing typographic attacks.
- Grad-CAM (Gradient-weighted Class Activation Mapping) was used to visualize the attention regions of the models and observe how typographic text influenced their visual attention.
Findings:
- Typographic text disperses the visual attention of the models, leading to performance degradation in LVLMs.
- Including extra visual information in the text prompt makes the models more resistant to typographic attacks.
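
The last finding is about prompt construction, which a short sketch can make concrete. The function name `build_prompt`, the prompt wording, and the example hints are all hypothetical; the paper's actual prompts are not reproduced in this summary, so treat this only as one way the idea could look.

```python
from typing import Optional, Sequence


def build_prompt(question: str,
                 options: Sequence[str],
                 visual_hints: Optional[Sequence[str]] = None) -> str:
    """Build a multiple-choice prompt, optionally enriched with visual cues.

    The wording is hypothetical; it only illustrates putting extra
    visual information into the text prompt.
    """
    lines = [question, "Options: " + ", ".join(options)]
    if visual_hints:
        lines.append("Consider these visual aspects of the image: "
                     + "; ".join(visual_hints))
    return "\n".join(lines)


simple = build_prompt("What object is shown in the image?", ["cat", "dog"])
informative = build_prompt(
    "What object is shown in the image?",
    ["cat", "dog"],
    visual_hints=["shape", "color", "texture"],
)
print(simple)
print("---")
print(informative)
```

The same question is asked in both cases; only the amount of visual context in the text changes, which makes it straightforward to compare the two prompt styles on identical images.
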

Results

This study evaluated the performance of Large Vision-Language Models (LVLMs), specifically LLaVA and InstructBLIP, on the Typographic Dataset (TypoD) to quantify the impact of typographic attacks.

Compared Models:
- The primary tested models were LLaVA-v1.5 and InstructBLIP.
- CLIP, a vision-language model (VLM), was also evaluated as a baseline to further verify the typographic vulnerability of LVLMs.
Test Dataset:
- Experiments were conducted on TypoD-Base (1,570 images) and TypoD-Large (20,000 images).
- The tests included four multimodal tasks (object recognition, visual attribute detection, enumeration, and commonsense reasoning) to compare performance on normal images and those containing typographic attacks.
Metrics and Results:
- Performance Degradation (GAP):
  - On typographic datasets, LLaVA-v1.5 exhibited an average performance drop of 42.3%, while InstructBLIP showed a smaller drop of 27.1%.
  - InstructBLIP demonstrated higher robustness compared to LLaVA-v1.5.
- Improvements via Prompt Engineering:
  - By including additional visual information in LLaVA’s prompts, the performance degradation (GAP) was reduced from 42.3% to 18.2%.
  - Grad-CAM visualizations confirmed that prompts with additional information redirected the model’s attention from typographic text to the visual elements of the image.
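
As a minimal sketch of how a GAP figure like those above could be computed, assume GAP is simply accuracy on normal images minus accuracy on attacked images, in percentage points. That definition, the function names, and the toy predictions are assumptions made for illustration; none of the data below comes from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def gap(clean_acc: float, typo_acc: float) -> float:
    """Accuracy lost under attack, in percentage points (assumed definition)."""
    return clean_acc - typo_acc


# Toy data, not results from the paper.
labels = ["cat", "cat", "bird", "car"]
clean_preds = ["cat", "cat", "bird", "car"]   # predictions on normal images
typo_preds = ["dog", "cat", "bird", "bus"]    # predictions on attacked images

clean_acc = accuracy(clean_preds, labels)
typo_acc = accuracy(typo_preds, labels)
print(clean_acc, typo_acc, gap(clean_acc, typo_acc))
```
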

Examples

Test Data Examples:
- Object Recognition: Used ImageNet dataset with 1,000 object categories. For example, a cat image with the typographic text “dog” added to test if the model misclassifies it as “dog.”
- Visual Attribute Detection: Used DAQUAR dataset with questions about object colors. For instance, a red ball image with the text “blue” added to evaluate if the model misidentifies the color.
- Enumeration: Used CountBench dataset to test object counting. For example, an image with 3 objects, where the number “5” is overlaid to induce miscounting.
Model Comparison:
- Object Recognition:
  - LLaVA-v1.5: Achieved 97.8% accuracy on normal images but dropped to 35.6% on typographic attack images.
  - InstructBLIP: Maintained 97.8% accuracy on normal images and achieved 66.4% on typographic attack images, outperforming LLaVA by 30.8 percentage points.
- Visual Attribute Detection:
  - LLaVA scored 59.5% accuracy on typographic attack images, while InstructBLIP reached 72.0% under the same conditions.
- LLaVA Improvement via Prompt Engineering:
  - With default prompts, LLaVA showed a performance degradation (GAP) of 42.3%, which was reduced to 18.2% after including additional information in the prompts.
Key Differentiation:
- When the typographic text “dog” was added to a cat image:
  - LLaVA: Misclassified the image as “dog” due to attention diversion.
  - InstructBLIP: Correctly identified it as “cat” by distinguishing the image content from the text through better prompt utilization.
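
Using the object-recognition accuracies quoted above, and assuming GAP is simply clean accuracy minus attacked accuracy, the per-model drops work out as follows (all values in percentage points):

```latex
\begin{align*}
\mathrm{GAP}_{\text{LLaVA-v1.5}} &= 97.8 - 35.6 = 62.2 \\
\mathrm{GAP}_{\text{InstructBLIP}} &= 97.8 - 66.4 = 31.4 \\
\mathrm{GAP}_{\text{LLaVA-v1.5}} - \mathrm{GAP}_{\text{InstructBLIP}} &= 62.2 - 31.4 = 30.8
\end{align*}
```

Because both models start from the same clean accuracy, the difference in GAP equals the difference in attacked accuracy (66.4 - 35.6 = 30.8).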

Summary

This study evaluated the vulnerability of LVLMs to typographic attacks using LLaVA and InstructBLIP. The researchers tested object recognition, attribute detection, and enumeration on images derived from ImageNet, DAQUAR, and CountBench, and found significant performance drops caused by typographic text. For instance, on object recognition under attack (e.g., a cat image overlaid with the word “dog”), LLaVA’s accuracy fell to 35.6%, while InstructBLIP retained 66.4%. Moreover, adding visual information to LLaVA’s prompts reduced the performance degradation (GAP) from 42.3% to 18.2%. This indicates that prompt engineering can effectively improve the robustness of LVLMs against typographic attacks.

Other

Figure 1: Examples of Typographic Attacks
- Demonstrates how major LVLM models such as GPT-4V, Google Bard, LLaVA-v1.5, and MiniGPT-4 are affected by typographic text. For instance, adding the word “dog” to a cat image misleads the model into classifying it as “dog.”
Figure 2: Attention Distraction from Typographic Attacks
- Visualizes how LVLMs are impacted by typographic attacks in multimodal tasks (e.g., object recognition, attribute detection). It highlights the effects of text insertion positions and colors on model performance.
Table 1: Scale of the TypoD Dataset
- Summarizes the number of data instances and expanded factors (font size, opacity, color, position) across four tasks (object recognition, attribute detection, enumeration, and commonsense reasoning) in TypoD-Base and TypoD-Large datasets.
Figure 4: Grad-CAM Visualization
- Illustrates attention regions of CLIP and LLaVA in response to typographic text, showing how attention shifts between visual content and text. More specific text inputs redirect attention from typographic text to visual content.
Figure 5: Prompt Effects
- Compares changes in attention regions (Grad-CAM) and sequence attention maps when LLaVA is provided with simple versus detailed prompts. Detailed prompts significantly improve model performance.
Figure 6: Performance Comparison by Factors
- Bar chart showing how typographic factors such as font size, opacity, color, and position impact accuracy in object recognition tasks. Larger font sizes and higher opacity lead to more pronounced accuracy drops.
Table 2: TypoD Evaluation Results
- Compares accuracy and performance degradation (GAP) of LLaVA and InstructBLIP on normal and typographic attack images. InstructBLIP consistently achieves higher accuracy.
Table 3: LLaVA Performance with Different Prompts
- Shows the impact of prompt design (simple vs. detailed) on LLaVA’s performance, demonstrating that detailed prompts substantially reduce the GAP.

Reference format:

@inproceedings{cheng2025typographic,
  title={Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models},
  author={Cheng, Hao and Xiao, Erjia and Gu, Jindong and Yang, Le and Duan, Jinhao and Zhang, Jize and Cao, Jiahang and Xu, Kaidi and Xu, Renjing},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV) 2024},
  volume={15117},
  pages={179--196},
  year={2025},
  publisher={Springer},
  doi={10.1007/978-3-031-73202-7_11}
}

Chicago Style Citation (Paragraph):

Cheng, Hao, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. “Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models.” In Proceedings of the European Conference on Computer Vision (ECCV) 2024, vol. 15117, 179–196. Springer, 2025. https://doi.org/10.1007/978-3-031-73202-7_11.