One-line summary:



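Short summary (Abstract):

This paper explores bringing "System 2 Thinking", the slow and deliberate mode of human reasoning, to AI models. Existing approaches are either limited to specific domains such as text, math, or coding, or require additional training or auxiliary models. The authors ask whether System 2 Thinking can emerge from purely unsupervised learning, and propose a new model, **Energy-Based Transformers (EBTs)**, as their answer.

An EBT assigns an energy (an unnormalized score, where lower energy means the prediction is more compatible with the input) to each pair of input and candidate prediction, and produces its final prediction by minimizing that energy. This lets the model allocate computation flexibly per problem across input types such as text and images, and "think" longer on harder problems.

In the experiments, EBTs showed a faster scaling rate during training than the Transformer++ baseline, and performance improved further when System 2 style inference (that is, iterative computation) was used. In particular, EBTs generalized better on out-of-distribution data, meaning data that departs from the training distribution.

In conclusion, the authors present EBTs as a promising new paradigm that scales in both learning and thinking.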



This paper investigates whether System 2 Thinking—a deliberate, reasoning-based computation style—can emerge from purely unsupervised learning, without additional supervision or task-specific fine-tuning. The authors introduce Energy-Based Transformers (EBTs), a new type of Energy-Based Model (EBM) that learns to assign unnormalized energy values to input–prediction pairs. Predictions are generated by minimizing this energy via gradient descent, enabling the model to “think longer” and allocate computation dynamically.
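
The energy-minimization step described above can be written compactly. The notation here is an assumption chosen for illustration, not taken from this summary: $x$ is the input (context), $\hat{y}_i$ is the candidate prediction after $i$ refinement steps, $E_\theta$ is the learned energy function, $\alpha$ is a step size, and $N$ is the number of "thinking" steps.

```latex
\hat{y}_0 \sim \mathcal{N}(0, I), \qquad
\hat{y}_{i+1} = \hat{y}_i - \alpha \, \nabla_{\hat{y}} E_\theta(x, \hat{y}_i), \qquad i = 0, \dots, N-1
```

Under this reading, "thinking longer" corresponds to a larger $N$, and the final energy $E_\theta(x, \hat{y}_N)$ serves as the model's own score for how compatible the prediction is with the input.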

Unlike prior methods limited by modality or task, EBTs generalize across both discrete (text) and continuous (visual) modalities. They scale better than Transformer++ models—up to 35% faster across data, model size, and computation. Inference-time “System 2 Thinking” (i.e., extra reasoning steps) boosts EBT performance by 29% more than Transformer++ on language tasks and yields better image denoising results than diffusion models with 99% fewer forward passes.

Importantly, EBTs show stronger generalization on out-of-distribution data despite similar or worse pretraining performance, suggesting they are better thinkers and learners. This positions EBTs as a promising paradigm for scaling both learning and reasoning in future AI systems.



* Useful sentences :

Vocabulary

Methodology

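This paper proposes a new model architecture called the **Energy-Based Transformer (EBT)**. It is an **Energy-Based Model (EBM)** designed to overcome three limitations of conventional Transformers: static computation, no way to reason about uncertainty, and no verification of predictions.

Key Components

Model Architecture:
- Decoder-only EBT: Works autoregressively in one direction, like GPT, and suits tasks such as text generation.
- Bidirectional EBT: Exchanges information in both directions, like BERT or Diffusion Transformers, and suits tasks such as image restoration.
- Both variants assign an energy value to each input–prediction pair and use it to judge their compatibility.
Training Objective:
- Starting from an initial prediction (for example, random noise), the prediction is updated iteratively by gradient descent to minimize the energy.
- The loss functions are the same ones used for conventional Transformers:
  - Text: Cross-Entropy
  - Image: Mean Squared Error
Training and Stabilization Techniques:
- Replay Buffer: Reuses earlier prediction trajectories to stabilize energy learning.
- Langevin Dynamics: Adds noise to widen the region that is explored.
- Random step size and number of steps: Improves generalization by training on diverse optimization paths.
Inference (Thinking Process):
- Thinking Longer: Run more gradient descent steps.
- Self-Verification: Generate several candidate predictions and select the one with the lowest energy.
Training Data and Tasks:
- Text: the 100B sample of the RedPajamaV2 dataset (autoregressive language modeling)
- Video: Something-Something V2 (next-frame prediction)
- Image: COCO 2014 (image denoising)

This approach showed better scalability and generalization than Transformer++, and its performance improved further when the model was allowed to "think" at inference time.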
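
The "Thinking Longer" idea above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic `energy`, its hand-written gradient, and the function name `think` are all hypothetical stand-ins for a learned Transformer energy function and automatic differentiation.

```python
import random

def energy(x, y):
    """Toy energy: low when candidate y is compatible with context x (here, y == 2 * x)."""
    return (y - 2.0 * x) ** 2

def energy_grad(x, y):
    """Analytic gradient of the toy energy with respect to the candidate prediction y."""
    return 2.0 * (y - 2.0 * x)

def think(x, steps, step_size=0.1, seed=0):
    """Refine a random initial prediction by gradient descent on the energy."""
    rng = random.Random(seed)
    y = rng.gauss(0.0, 1.0)          # start from random noise
    trajectory = [energy(x, y)]      # energy before any refinement
    for _ in range(steps):
        y -= step_size * energy_grad(x, y)
        trajectory.append(energy(x, y))
    return y, trajectory

# More refinement steps move the prediction closer to the energy minimum (y = 6).
y_short, _ = think(x=3.0, steps=2)
y_long, traj = think(x=3.0, steps=30)
print(abs(y_short - 6.0), abs(y_long - 6.0))
```

In this toy setting the energy recorded in `traj` never increases from one step to the next, which mirrors the summary's description of predictions converging as the model spends more steps.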

🇺🇸 English Version

This paper proposes Energy-Based Transformers (EBTs), a novel model class that integrates the principles of Energy-Based Models (EBMs) into Transformer architectures to enable System 2 Thinking—deliberate, iterative reasoning through energy minimization.

Key Components

Model Architecture:
- Decoder-only EBT: Inspired by GPT, used for autoregressive tasks like next-token prediction.
- Bidirectional EBT: Similar to BERT or Diffusion Transformers, used for denoising or bidirectional tasks.
- Both assign a scalar energy value to input–prediction pairs, acting as verifiers of compatibility.
Training Objective:
- Training simulates a thinking process: starting from a random prediction, the model updates it iteratively using gradient descent to minimize energy.
- Objective functions:
  - Cross-entropy for language modeling
  - Mean squared error for image denoising
Training Stability Techniques:
- Replay Buffer: Stores past trajectories to stabilize learning.
- Langevin Dynamics: Adds Gaussian noise during optimization to encourage exploration.
- Random step size and number of steps: Enhances generalization by diversifying update paths.
Inference Strategy (Thinking Process):
- Thinking Longer: More optimization steps during inference.
- Self-Verification: Generates multiple candidates and selects the one with lowest energy.
Training Data and Tasks:
- Text: RedPajamaV2 (100B sample), used for autoregressive language modeling.
- Video: Something-Something V2, used for next-frame prediction.
- Image: COCO 2014, used for image denoising.

This method demonstrates better scaling trends than Transformer++, and its inference-time thinking mechanism further enhances generalization—especially for out-of-distribution (OOD) data.
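
Self-Verification, together with the noise and randomized step size listed above, can be sketched as follows. This is a hedged toy example: the two-well `energy`, the helper names `refine` and `self_verify`, and all numeric settings are assumptions made for illustration, and the sketch applies noise and random step sizes inside a refinement loop rather than training any model.

```python
import random

def energy(x, y):
    """Toy energy with a deep minimum near y = x and a shallow local minimum near y = -x."""
    return (y * y - x * x) ** 2 + 0.1 * (y - x) ** 2

def energy_grad(x, y):
    """Analytic gradient of the toy energy with respect to y."""
    return 4.0 * y * (y * y - x * x) + 0.2 * (y - x)

def refine(x, rng, steps=200, noise_std=0.01):
    """One refinement run: noisy gradient descent from a random start."""
    y = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        step_size = rng.uniform(0.01, 0.05)    # randomized step size
        y -= step_size * energy_grad(x, y)
        y += noise_std * rng.gauss(0.0, 1.0)   # Langevin-style Gaussian noise
    return y

def self_verify(x, num_candidates=16, seed=0):
    """Generate several candidates and keep the one the energy function scores lowest."""
    rng = random.Random(seed)
    candidates = [refine(x, rng) for _ in range(num_candidates)]
    energies = [energy(x, y) for y in candidates]
    best = min(range(num_candidates), key=energies.__getitem__)
    return candidates[best], energies

best_y, energies = self_verify(x=1.0)
print(best_y, min(energies), max(energies))
```

Some runs settle in the shallow well and some in the deep one; picking the minimum-energy candidate is what lets the same function act as both generator and verifier in this sketch.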

Results

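The paper validates the Energy-Based Transformer (EBT) against strong existing models across several **modalities (text, image, video)** and **tasks**. The evaluation focuses on learning scalability, thinking scalability, and out-of-distribution (OOD) generalization.

Main Comparison Models

Transformer++: Standard GPT-style model (autoregressive)
Diffusion Transformer (DiT): Compared on image restoration
Others: RNNs and prior EBMs (for background comparison)

Test Datasets and Tasks

| Modality | Task | Dataset |
| --- | --- | --- |
| Text | Language modeling, reasoning | RedPajamaV2 (100B), GSM8K, SQuAD, BigBench Math QA, Dyck Languages |
| Video | Next-frame prediction | Something-Something V2 |
| Image | Image denoising, classification | COCO 2014, ImageNet-1k |

Summary of Key Results

Learning Scalability: EBTs scale up to 35% faster than Transformer++ along every axis tested: data size, parameter count, FLOPs, and model depth.
Thinking Scalability (System 2 Thinking): Performance improves the more the model "thinks" at inference time.
- An additional 29% improvement on language modeling (Thinking Longer + Self-Verification)
- Unlike Transformer++, EBTs improve per-token performance when given more computation
OOD Generalization
- Performance is maintained or improved on data from a different distribution than the training data.
- Example: on data with complex grammar such as Dyck Languages, performance increases linearly with more thinking
Image Restoration and Classification
- EBTs achieve better image denoising (measured by PSNR and MSE) than DiT with 99% less computation
- On ImageNet classification, **Top-1 accuracy of 5.32% and Top-5 accuracy of 13.2%**, roughly 10x higher than DiT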

In the Results section, the authors evaluate the performance of Energy-Based Transformers (EBTs) across text, video, and image tasks, focusing on learning scalability, System 2 Thinking, and out-of-distribution (OOD) generalization.

Baseline / Competing Models

Transformer++: Strong GPT-style baseline for language modeling.
Diffusion Transformer (DiT): Compared on image denoising.
Others like RNNs or prior EBMs are mentioned for architectural background.

Test Datasets & Tasks

| Modality | Task | Dataset |
| --- | --- | --- |
| Text | Language modeling, reasoning | RedPajamaV2, GSM8K, SQuAD, BigBench Math QA, Dyck Languages |
| Video | Next-frame prediction | Something-Something V2 |
| Image | Denoising, classification | COCO 2014, ImageNet-1k |

Key Results

Learning Scalability: EBTs outperform Transformer++ across all axes (data, parameters, FLOPs, depth), achieving up to 35% faster scaling.
System 2 Thinking (Inference-Time Reasoning)
- Improves performance significantly during inference:
  - Up to +29% gain on text tasks using Thinking Longer + Self-Verification.
- Unlike Transformer++, EBTs improve per-token performance with more computation.
Out-of-Distribution Generalization
- EBTs generalize better to OOD datasets, such as BigBench Dyck Languages.
- Roughly linear trend: the further the data shifts from the training distribution, the larger the gain from “thinking”.
Image Denoising and Classification
- EBTs outperform DiTs in PSNR and MSE using 99% fewer forward passes.
- On ImageNet-1k, EBTs achieve Top-1 accuracy: 5.32%, Top-5: 13.2%, which is nearly 10× higher than DiTs.
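
The denoising results above are reported in PSNR and MSE, which are directly related. The sketch below shows the standard definition, PSNR = 10 · log10(MAX² / MSE); the flat pixel lists and the `max_value=1.0` range are assumptions for illustration and say nothing about the paper's actual evaluation code.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equally sized sequences of pixel values."""
    return sum((p - q) ** 2 for p, q in zip(reference, estimate)) / len(reference)

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher is better, infinite for identical inputs."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

clean = [0.0, 0.5, 1.0, 0.25]
rough = [0.1, 0.6, 0.9, 0.35]       # every pixel off by 0.1
better = [0.05, 0.55, 0.95, 0.30]   # every pixel off by 0.05
print(psnr(clean, rough), psnr(clean, better))
```

Because PSNR is a decreasing function of MSE, a model that lowers MSE on the same images necessarily raises PSNR, which is why the two metrics move together in the comparison with DiT.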

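Examples

Example 1. Training Data Example (Autoregressive Language Modeling)

Dataset: RedPajamaV2 (100B sample)
Format: English document text, e.g. "The quick brown fox jumps over the lazy dog."
EBT behavior: To predict the next token "jumps", the EBT starts from a random distribution and gradually converges to "jumps" while minimizing the energy.
Key observation: Easy words (e.g. “the”, “is”) converge to low energy quickly, while hard-to-predict words (“fox”, “problem”) converge slowly and keep a high energy value → the model learns uncertainty.

Example 2. Test Data Example (Out-of-Distribution Reasoning Task)

Dataset: BigBench Dyck Languages (predicting grammatically nested brackets)
Problem example: input string "([[]([])])"
Task: The model determines whether the sequence is grammatically valid or predicts the next bracket.
EBT results:
- Transformer++ works well on ordinary sequences but degrades on nested structures
- EBT achieves better performance through System 2 Thinking (more thinking steps, verification among candidates)
- Accuracy improves as the number of thinking steps increases

Example 3. Image/Video Task Example (Video Frame Prediction)

Dataset: Something-Something V2
Scenario: In the early frames a person is holding a piece of clothing → the model predicts whether the clothing will be more clearly visible in the next frame.
EBT behavior:
- Early frames have high energy → high uncertainty
- As the clothing comes into view, the energy drops → the model becomes more confident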

Example 1. Training Data Example (Autoregressive Language Modeling)

Dataset: RedPajamaV2 (100B sample)
Input: A sentence such as "The quick brown fox jumps over the lazy dog."
EBT behavior: To predict the next token "jumps" after "The quick brown fox", the model starts from a random distribution and iteratively minimizes the energy until it converges to the correct token.
Key observation: Simple words like "the" or "is" quickly reach low energy, while harder tokens like "fox" or "problem" converge more slowly and stay at higher energy → reflects uncertainty modeling.

Example 2. Test Data Example (Out-of-Distribution Reasoning Task)

Dataset: BigBench Dyck Languages
Input example: A sequence like "([[]([])])"
Task: Determine whether the nested brackets are grammatically correct or predict the next bracket.
EBT results:
- Transformer++ performs poorly on deeply nested structures.
- EBT improves performance using System 2 Thinking (e.g., multiple steps + self-verification).
- More thinking steps = better accuracy.
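
To make the Dyck task concrete, the ground-truth rule the model has to capture can be written as a short stack-based checker. This is a reference sketch of the task itself, not of how an EBT solves it; the bracket set in `PAIRS` and the function names are assumptions for illustration.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def closing_suffix(prefix):
    """Return the brackets needed to close `prefix`, or None if `prefix` is already invalid."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(PAIRS[ch])          # remember which closer we now expect
        elif not stack or stack.pop() != ch:
            return None                      # unexpected or mismatched closer
    return "".join(reversed(stack))

def is_balanced(sequence):
    """A sequence is a valid Dyck word when nothing is left to close."""
    return closing_suffix(sequence) == ""

print(is_balanced("([[]([])])"))   # the example sequence above
print(closing_suffix("([[](["))    # what remains to be predicted for this prefix
```

The "predict the next bracket" variant corresponds to emitting the first character of `closing_suffix(prefix)`; deeper nesting means a longer stack to track, which is consistent with the summary's note that baselines struggle on deeply nested inputs.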

Example 3. Video Task Example (Next Frame Prediction)

Dataset: Something-Something V2
Scenario: A person begins lifting a blue garment into view.
EBT behavior:
- Early frames have high energy (high uncertainty).
- As the object becomes visible, energy decreases (model becomes more confident).
Key takeaway: EBT learns to model uncertainty naturally without supervision in continuous visual scenes.

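Summary

Energy-Based Transformers (EBTs) learn the compatibility (energy) of input–prediction pairs and implement a "thinking process" (System 2 Thinking) at inference time by minimizing it. In the experiments, EBTs outperformed Transformer++ in learning scalability and inference performance, and generalized more strongly on OOD data in particular. For example, on the BigBench Dyck task, which involves judging nested brackets, EBT accuracy improved as the number of thinking steps increased.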

Energy-Based Transformers (EBTs) learn to assign energy to input–prediction pairs and perform reasoning by minimizing this energy through iterative inference. Experiments show that EBTs outperform Transformer++ in both training scalability and inference-time performance, especially on out-of-distribution data. For example, on the BigBench Dyck task involving nested brackets, EBT accuracy improves as the number of thinking steps increases.

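Miscellaneous

Figure 1: Autoregressive Architecture Comparison

Content: Compares the architectures of the standard AR Transformer, RNN, Diffusion Transformer, and EBT
Insight:
- EBTs resemble diffusion models, but predict energy directly rather than noise and have an explicit verification capability
- Existing models support only some of dynamic computation, uncertainty reasoning, and verification, while EBTs support all three

Figure 2 & 3: Visualizing the Thinking Process

Content:
- Figure 2: The prediction is updated at every step so that the energy decreases
- Figure 3: Visualizes the prediction converging via gradient descent on the energy landscape
Insight:
- Gives an intuitive picture of the idea that Thinking = Optimization
- When uncertainty is high, convergence is slow and the energy stays high → more inference steps are needed

Table 1: Cognitive Facets by Architecture

Content: Whether each model supports Cognitive Facets 1 to 3 (dynamic computation, uncertainty reasoning, verification)
Insight:
- Standard Transformers, RNNs, and diffusion models support only some of them
- Only EBTs satisfy all three

Table 2: Performance by Thinking Technique (Ablation)

Content: Compares performance when Langevin Dynamics, Replay Buffer, Random Step, and so on are removed
Insight:
- Using all regularization techniques together gives the best performance
- Removing the random step size loses almost all of the performance benefit → exploring diverse paths matters

Figure 4–5: Learning Scalability

Content: Compares learning speed across data size, batch size, depth, FLOPs, parameter count, and so on
Insight:
- EBTs have up to a 35% faster scaling rate than Transformer++
- The gap is especially large along **depth**, which favors reasoning

Figure 6–7: Thinking Performance and OOD Generalization

Content: Performance improves as thinking steps increase, with larger gains the more OOD the data is
Insight:
- System 2 Thinking is effective for OOD generalization
- Accuracy improves linearly as the number of thinking steps increases

Figure 8 & 11: Uncertainty Visualization (Text, Video)

Content: Easy words (“the”) converge in energy quickly, while hard words (“problem”) keep a high energy
Insight:
- EBTs naturally learn the degree of uncertainty at the token/frame level

Figure 9 & Table 4: Video/Image Experiments

Content: Compares EBT and DiT on image denoising
Insight:
- EBTs beat DiT on PSNR and MSE with 99% fewer forward passes
- ImageNet classification accuracy is also roughly 10x higher → strong for image representation learning as well

Figure 10: OOD Image Restoration Quality

Content: Visually, images restored by EBT are sharper than those from DiT
Insight:
- Higher-quality images can be produced with less computation

Figure 12: PSNR vs Thinking Steps

Content: In image denoising, PSNR improves as the number of thinking steps increases
Insight:
- Thinking longer at inference time improves performance → confirms the effect of System 2 Thinking

Appendix (Sections C, D, E, etc.)

Section C: Definition of System 2 Thinking and summary of the algorithms
Section D: Experimental details such as hyperparameters and FLOP calculation
Section E: Comparative analysis of diffusion models and EBTs
Section F: Proposed additional facets of System 2 Thinking
Section H: Beginner-oriented overview of EBMs with pseudocode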

Figure 1: Architecture Comparison

Compares AR Transformers, RNNs, Diffusion Transformers, and EBTs
Insight: Only EBTs support all three cognitive facets—dynamic computation, uncertainty modeling, and prediction verification.

Figures 2–3: Thinking Process

Visualizes prediction refinement via energy minimization
Insight: Models iterate predictions like humans think—uncertainty slows convergence.

Table 1: Cognitive Facets by Architecture

Shows which architectures support dynamic compute, uncertainty, and verification
Insight: EBTs uniquely support all three.

Table 2: Ablation Study on Thinking

Removes Langevin, Replay Buffer, etc., and compares performance
Insight: Full combination performs best; random step size is especially critical.

Figures 4–5: Learning Scalability

EBT outpaces Transformer++ in data, batch, depth, FLOPs, and parameter scaling
Insight: EBTs scale up to 35% faster, especially in deep models → better for reasoning.
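
One common way to quantify a "scaling rate" is the slope of loss against a resource (data, FLOPs, parameters) on log-log axes. The sketch below fits that slope by least squares. It is an illustration only: the loss curves are synthetic, and reading "35% faster" as a ratio of fitted exponents is an assumption about how the comparison is defined, not something stated in this summary.

```python
import math

def fit_power_law(resources, losses):
    """Least-squares fit of loss = a * resource ** (-b) in log-log space; returns (a, b)."""
    xs = [math.log(r) for r in resources]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, made-up curves: both follow exact power laws with different exponents.
resources = [1e3, 1e4, 1e5, 1e6]
baseline = [5.0 * r ** -0.10 for r in resources]
faster = [5.0 * r ** -0.135 for r in resources]

_, b_base = fit_power_law(resources, baseline)
_, b_fast = fit_power_law(resources, faster)
relative_gain = b_fast / b_base - 1.0
print(b_base, b_fast, relative_gain)
```

A steeper exponent means the gap between the two curves widens as the resource grows, which is the sense in which a faster scaling rate matters more at larger scale.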

Figures 6–7: Thinking & Generalization

More inference steps → better performance, especially on OOD data
Insight: System 2 Thinking improves generalization, showing linear performance gains.

Figures 8 & 11: Uncertainty Visualization

EBT assigns higher energy to uncertain tokens/frames
Insight: Learns token/frame-level uncertainty without supervision.

Figure 9 & Table 4: Image/Video Results

EBT beats DiT in PSNR, MSE, and classification with 99% fewer forward passes
Insight: Stronger representations and generalization in continuous domains.

Figure 10: OOD Image Quality

EBT produces clearer restored images than DiT
Insight: Better visual quality with far less compute.

Figure 12: PSNR vs Inference Steps

More steps → higher PSNR
Insight: Confirms that thinking longer improves prediction quality.

Appendix

Section C: Defines System 2 Thinking and formal algorithms
Section D: Hyperparameters, FLOP calculations
Section E: EBT vs Diffusion models
Section F: Additional cognitive facets
Section H: EBM introduction and pseudocode

Reference format:

@article{gladstone2025energy, title={Energy-Based Transformers are Scalable Learners and Thinkers}, author={Gladstone, Alexi and Nanduru, Ganesh and Islam, Md Mofijul and Han, Peixuan and Ha, Hyeonjeong and Chadha, Aman and Du, Yilun and Ji, Heng and Li, Jundong and Iqbal, Tariq}, journal={arXiv preprint arXiv:2507.02092}, year={2025}, url={https://arxiv.org/abs/2507.02092} }

Gladstone, Alexi, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, and Tariq Iqbal. 2025. “Energy-Based Transformers Are Scalable Learners and Thinkers.” arXiv preprint arXiv:2507.02092. https://arxiv.org/abs/2507.02092.