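One-line summary:

GAPO improves control over LLM outputs by integrating a GAN (Generative Adversarial Network) with PPO (Proximal Policy Optimization).

It uses an encoder-only reward model to learn the relationship between prompts and responses.

GAN-style adversarial training automatically generates training samples of varying difficulty, so the model can progressively learn more complex constraints.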




Abstract:





Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO’s superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO’s unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs.


Methodology

GAPO (Generative Adversarial Policy Optimization) is a novel methodology that integrates Generative Adversarial Networks (GAN) with Proximal Policy Optimization (PPO) frameworks to understand and adapt to complex constraints. This approach provides a precise mechanism for controlling the outputs of large language models (LLMs).

Model Architecture: GAPO employs an encoder-only reward model to learn the relationships between prompts and responses. This encoder-only architecture is designed to better capture the nuances of prompts. Through adversarial training with GAN, it automatically generates training samples of varying difficulty, allowing the model to progressively learn complex constraints.
Training Data: GAPO utilizes existing preferential data to train the model. This data helps the model generate responses that comply with specific constraints. During the training process, GAPO combines data generated by the model with existing preferential data to train the reward model, enhancing the performance of the generator.
Special Techniques: GAPO consists of two main training phases. The first phase is the ‘warmup phase,’ where the reward model is trained using existing preferential data. The second phase is the ‘adversarial training phase,’ where the generator and reward model are trained alternately. In this process, the generator is updated based on feedback from the reward model, while the reward model evaluates the quality of the generated responses, leading to gradual improvements.
Performance Evaluation: GAPO has demonstrated superior performance across multiple benchmarks compared to existing methods (PPO, DPO, KTO, etc.). It particularly excels in handling fine-grained constraints, indicating that the model possesses a stronger capability to understand and adhere to constraints.

In this way, GAPO provides a more robust and effective solution for controlling the outputs of large language models.
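
The two phases described above can be sketched as a training schedule. This is a minimal sketch, not the authors' implementation: `run_gapo_schedule` and its callback parameters are hypothetical names, the PPO update is left as a stub, and labelling generated responses as rejected examples for the reward model is an assumption borrowed from standard GAN discriminator training.

```python
from typing import Callable, List, Sequence, Tuple

# (prompt, response, label) where label 1 = accepted and 0 = rejected.
Pair = Tuple[str, str, int]


def run_gapo_schedule(
    preference_data: Sequence[Pair],
    prompts: Sequence[str],
    generate: Callable[[str], str],
    train_reward_model: Callable[[Sequence[Pair]], None],
    update_generator: Callable[[Sequence[Tuple[str, str]]], None],
    warmup_epochs: int = 1,
    adversarial_rounds: int = 2,
) -> List[str]:
    """Run the warmup phase, then alternate generator and reward-model updates."""
    log: List[str] = []

    # Phase 1 (warmup): the reward model sees only the existing preference data.
    for _ in range(warmup_epochs):
        train_reward_model(preference_data)
        log.append("warmup:reward_model")

    # Phase 2 (adversarial): generator and reward model are trained in turn.
    for _ in range(adversarial_rounds):
        samples = [(p, generate(p)) for p in prompts]

        # Generator step: a PPO-style update driven by reward-model feedback
        # (stubbed here as a callback).
        update_generator(samples)
        log.append("adversarial:generator")

        # Reward-model step: existing preference data plus generated samples.
        # Assumption: generated responses are labelled 0 (rejected).
        generated = [(p, r, 0) for p, r in samples]
        train_reward_model(list(preference_data) + generated)
        log.append("adversarial:reward_model")

    return log
```

Passing the model updates in as callbacks keeps the sketch independent of any particular training library; only the ordering of the steps is meant to mirror the description.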

Results

This paper introduces a novel framework called GAPO (Generative Adversarial Policy Optimization) and discusses how it improves the control of outputs from large language models (LLMs). GAPO is designed to progressively learn and adapt to increasingly complex constraints by integrating Generative Adversarial Networks (GAN) with Proximal Policy Optimization (PPO).

Experimental Results

GAPO demonstrated superior performance across multiple benchmarks compared to existing methods (PPO, DPO, KTO, etc.). Notably, in scenarios requiring fine-grained constraint handling, GAPO significantly outperformed existing methods. For instance, on the IFEval benchmark, GAPO achieved a performance score of 95.4%, which is 6.0 percentage points higher than PPO’s score of 89.4%. Other methods like DPO, SimPO, and ORPO scored far lower, between 5% and 33%, and struggled particularly with complex constraints.

Test Data

GAPO was evaluated using two main datasets. The first is the IFEval dataset, designed as a benchmark to assess the instruction-following capabilities of LLMs. The second is the Product Description Dataset (PDD), which includes product categories and property-value pairs, requiring the model to generate descriptions based on the provided information.

Metrics

The performance of the models was evaluated using various metrics. In the IFEval dataset, prompt-level accuracy and instruction-level accuracy were measured, while in the PDD dataset, the generated descriptions were assessed based on whether they included all provided attribute information and did not introduce any extraneous information. GAPO achieved high scores on all these metrics, demonstrating superior performance compared to existing methods.
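
The two IFEval accuracies mentioned above differ only in what they count. As a minimal sketch (the function name `ifeval_accuracies` and the toy `results` data are illustrative assumptions, not part of the benchmark's tooling): prompt-level accuracy counts a prompt as correct only if every instruction in it is followed, while instruction-level accuracy counts each instruction separately.

```python
from typing import List, Tuple


def ifeval_accuracies(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (prompt_level, instruction_level) accuracy.

    results[i][j] is True when instruction j of prompt i was followed.
    """
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("every prompt needs at least one instruction result")

    # A prompt is correct only if all of its instructions were followed.
    prompt_level = sum(all(r) for r in results) / len(results)

    # Each instruction counts on its own, regardless of which prompt it is in.
    total_instructions = sum(len(r) for r in results)
    instruction_level = sum(sum(r) for r in results) / total_instructions

    return prompt_level, instruction_level


# Three prompts with 2, 2, and 1 verifiable instructions respectively.
results = [[True, True], [True, False], [True]]
prompt_acc, instruction_acc = ifeval_accuracies(results)
print(f"prompt-level: {prompt_acc:.3f}, instruction-level: {instruction_acc:.3f}")
```

In this toy case two of three prompts are fully satisfied, while four of five instructions are followed, so the instruction-level figure is the higher of the two.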

Comparison

Compared to existing methods, GAPO exhibited better performance, especially in understanding and adhering to detailed constraints. For example, when using 6,600 training samples, GAPO achieved a PDD performance score of 95.4%, which is 12.5 percentage points higher than its counterparts. These results indicate that GAPO has a greater ability to effectively learn and apply constraints.

Examples

This paper introduces a novel framework called GAPO (Generative Adversarial Policy Optimization). GAPO is designed to help large language models (LLMs) generate text while adhering to given constraints. The framework combines Generative Adversarial Networks (GAN) and Proximal Policy Optimization (PPO) to progressively learn and adapt to increasingly complex constraints.

Training Data and Test Data

Training Data:
- Product Description Dataset (PDD): This dataset consists of 201 product categories and 93,616 property-value pairs. The model is tasked with generating coherent product descriptions based on the provided property-value pairs, adhering to two main constraints: (1) it must incorporate all given facts, and (2) it must not include any additional information not present in the source data.
- Example:
  - Input:
    - Product Name: “Pet Backpack”
    - Word Count Requirement: “Approximately 50 words”
    - Emotion: “Pride”
    - Factual Information: “Material: Thickened PU leather, Suitable Pet Type: Small parrots, Weight: 750 grams, Color Options: Sunlight yellow, Safety Measures: High-strength nylon straps, Dimensions: 30 x 25 x 40 cm, Strap Design: Anti-slip coating”
  - Output: “This pet backpack, made of thickened PU leather in a cheerful sunlight yellow, is perfect for small parrots. Weighing just 750 grams, it offers ample space with dimensions of 30 x 25 x 40 cm. The high-strength nylon straps and anti-slip coated straps ensure a safe travel experience.”
Test Data:
- IFEval Dataset: This dataset is a benchmark specifically designed to evaluate the instruction-following capabilities of LLMs. It contains 541 prompts built from 25 distinct types of verifiable instructions. These prompts are verifiable programmatically, eliminating potential subjective assessment biases.
- Example:
  - Input: “Generate a description of a pet backpack, ensuring it is approximately 50 words long, incorporates all provided factual information, and does not include any additional factual information.”
  - Output: “This pet backpack is crafted from thickened PU leather in a vibrant sunlight yellow, making it a safe and convenient choice for small parrots. Weighing just 750 grams and measuring 30 x 25 x 40 cm, it features high-strength nylon straps and anti-slip coated straps for secure travel.”

In this manner, GAPO is designed to enhance the ability to generate text while adhering to various constraints.
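
The constraints in the pet backpack example can be checked roughly in code. The sketch below is an illustration only and is not the paper's evaluation procedure: the helper names, the 10% word-count tolerance, and case-insensitive substring matching are all assumptions. It covers the word-count requirement and the "include every given fact" constraint, and does not attempt the "no additional information" constraint.

```python
from typing import Dict, List


def word_count_ok(text: str, target: int, tolerance: float = 0.1) -> bool:
    """True if the whitespace word count is within tolerance of the target."""
    return abs(len(text.split()) - target) <= tolerance * target


def missing_facts(text: str, facts: Dict[str, str]) -> List[str]:
    """Names of facts whose value does not appear verbatim (case-insensitive)."""
    lowered = text.lower()
    return [name for name, value in facts.items() if value.lower() not in lowered]


facts = {
    "Material": "Thickened PU leather",
    "Suitable Pet Type": "Small parrots",
    "Weight": "750 grams",
    "Color Options": "Sunlight yellow",
    "Safety Measures": "High-strength nylon straps",
    "Dimensions": "30 x 25 x 40 cm",
    "Strap Design": "Anti-slip coating",
}

description = (
    "This pet backpack, made of thickened PU leather in a cheerful sunlight "
    "yellow, is perfect for small parrots. Weighing just 750 grams, it offers "
    "ample space with dimensions of 30 x 25 x 40 cm. The high-strength nylon "
    "straps and anti-slip coated straps ensure a safe travel experience."
)

print(word_count_ok(description, target=50))  # True: the text has 48 words
print(missing_facts(description, facts))      # ['Strap Design']
```

The exact-match check flags "Strap Design" because the description paraphrases "Anti-slip coating" as "anti-slip coated", which shows why literal string matching is only a crude proxy for fact coverage.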

Summary

GAPO is a framework that integrates GAN-style adversarial training with PPO to progressively learn complex constraints. In the reported experiments it is strongest on fine-grained constraint handling, outperforming PPO, DPO, and KTO in stability and constraint adherence. For instance, GAPO achieved 95.4% on PDD using 6,600 training samples.

Other

Diagrams and Figures
- Figure 1: A diagram illustrating the procedural differences between Preferential Response and Preferential Prompt, emphasizing the distinct utilization of prompts and responses. This figure shows how GAPO’s approach can collect fine-grained preference data through simple prompt modifications.
- Figure 2: A diagram explaining the two distinct tuning phases of the GAPO framework, where the initial phase involves training the reward model using existing preference data, followed by an adversarial training phase where the generator and reward model interact.
- Figure 3: A graph showing the analysis of correlative factors influencing GAPO’s performance. This analysis was conducted using 300 randomly sampled instances from the PDD test set and the complete IFEval test set.
- Figure 4: A graph depicting the evolution of reward models during adversarial training, showing how all models assign near-zero scores to generated samples in the initial phase, followed by distinct learning trajectories.
- Figure 5: A case study illustrating model performance under different training baselines, highlighting how GAPO maintains meticulous control over word count and emotional articulation compared to alternative approaches.
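
The contrast drawn in Figure 1 can be illustrated with a small data-construction sketch. It assumes, as one plausible reading of "simple prompt modifications", that a rejected example is made by altering one fact in the prompt so that an unchanged response no longer satisfies it. The function names and the prompt template are hypothetical.

```python
from typing import Dict, Tuple

# (prompt, response, label) where label 1 = accepted and 0 = rejected.
Example = Tuple[str, str, int]


def render_prompt(product: str, facts: Dict[str, str]) -> str:
    """Hypothetical prompt template listing the facts to be covered."""
    fact_text = ", ".join(f"{key}: {value}" for key, value in facts.items())
    return f"Write a description of {product}. Facts: {fact_text}"


def make_preferential_prompt_pair(
    product: str,
    facts: Dict[str, str],
    response: str,
    key: str,
    new_value: str,
) -> Tuple[Example, Example]:
    """Keep the response fixed and vary the prompt by changing one fact."""
    if key not in facts:
        raise KeyError(key)
    altered = dict(facts)
    altered[key] = new_value

    accepted = (render_prompt(product, facts), response, 1)
    # The same response now contradicts the altered prompt, so it is rejected.
    rejected = (render_prompt(product, altered), response, 0)
    return accepted, rejected


accepted, rejected = make_preferential_prompt_pair(
    product="a pet backpack",
    facts={"Weight": "750 grams", "Color Options": "Sunlight yellow"},
    response="This sunlight yellow pet backpack weighs just 750 grams.",
    key="Weight",
    new_value="900 grams",
)
print(accepted[0])
print(rejected[0])
```

A Preferential Response pair would instead hold the prompt fixed and supply two different responses, one chosen and one rejected.
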
Tables
- Table 2: A summary of the components of the PDD dataset, including the number of training and testing samples and tokens. This table illustrates the size and diversity of the dataset.
- Table 4: A comparison of various model performances on the IFEval benchmark, confirming GAPO’s superior performance compared to PPO.
- Table 5: A comparison of model performance on the PDD dataset, showing that GAPO achieved higher scores than other methods.
- Table 6: A comparative analysis of using Preferential Prompt versus Preferential Response, indicating that GAPO performs well in both approaches.
Appendix
- Appendix A: Detailed descriptions of the construction and evaluation methodologies for the IFEval and PDD datasets. It specifies quality assurance and fairness assessment criteria to enhance the reliability of the research.
- Appendix B: Describes the evaluation methodology for model outputs, detailing procedures to ensure consistency and reliability across assessments.

Insights

GAPO provides enhanced control and stability compared to existing methods, particularly excelling in handling complex constraint requirements.
The quality and diversity of the dataset directly impact model performance, and GAPO effectively leverages these elements to achieve superior results.
The adversarial training process, where the reward model and generator interact, helps the model progressively understand and handle increasingly complex constraints.

Reference format:

BibTeX format

@inproceedings{Gu2025GAPO,
  author    = {Zhouhong Gu and Xingzhou Chen and Xiaoran Shi and Tao Wang and Suhang Zheng and Tianyu Li and Hongwei Feng and Yanghua Xiao},
  title     = {GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages     = {282--296},
  year      = {2025},
  month     = {July},
  publisher = {Association for Computational Linguistics},
  url       = {https://github.com/MikeGu721/GAPO}
}

Chicago style

Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, and Yanghua Xiao. “GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization.” In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 282–296. Association for Computational Linguistics, July 2025. https://github.com/MikeGu721/GAPO.