한줄 요약: 



짧은 요약(Abstract) :    


Grok 4는 xAI가 개발한 차세대 언어 모델로, 도구 사용 능력과 실시간 검색 기능을 통합하여 다양한 태스크에 적응 가능한 멀티에이전트 추론 시스템으로 설계되었다. 본 모델은 200,000개의 GPU로 구성된 Colossus 슈퍼클러스터를 활용하여 사전학습 수준에서 대규모 강화학습을 수행함으로써, 수학, 과학, 추론, 비즈니스 시뮬레이션 등 복잡한 문제 영역에서 성능을 대폭 향상시켰다. Grok 4 Heavy는 Humanity’s Last Exam, AIME, GPQA 등 주요 벤치마크에서 기존 대형 모델(GPT-4o, Gemini 2.5 Pro, Claude Opus 4)을 상회하는 결과를 기록하며, 실시간 상호작용형 지능형 에이전트로서의 가능성을 보여준다.



Grok 4 is a next-generation language model developed by xAI, designed as a multi-agent reasoning system with integrated tool use and real-time web search capabilities. Leveraging the Colossus supercluster composed of 200,000 GPUs, the model scales reinforcement learning to pretraining levels, achieving substantial improvements across complex tasks such as mathematics, science, reasoning, and business simulations. Grok 4 Heavy surpasses leading models like GPT-4o, Gemini 2.5 Pro, and Claude Opus 4 on benchmarks including Humanity’s Last Exam, AIME, and GPQA, demonstrating its potential as an interactive, real-world intelligent agent.




* Useful sentences :

단어정리

Methodology

1. Colossus GPU 클러스터 기반의 강화학습 (RL) 대규모 확장

Grok 4는 xAI의 Colossus 200,000 GPU 클러스터를 활용하여 대규모로 강화학습을 실행하였습니다.
이전 버전인 Grok 3에서도 Reasoning RL을 도입했지만, Grok 4는 사전학습(pretraining) 수준의 스케일에서 RL을 수행한 것이 특징입니다.
이를 통해 **추론 능력(reasoning)**을 강화하고, 상호작용형 시스템에 적합한 학습을 가능하게 했습니다.

2. 새로운 인프라와 알고리즘을 통한 학습 효율 향상

학습 효율성을 약 6배 향상시킨 새로운 알고리즘 설계 및 인프라 혁신이 이루어졌습니다.
예를 들어, 효율적인 샘플링, 강화학습용 트레이닝 루프 최적화 등이 포함됩니다.

3. 강화학습을 위한 사용자 피드백 기반의 데이터 수집

Grok 4는 실제 사용자 상호작용에서 얻은 피드백과 행동을 활용하여 학습 데이터를 수집하였고, 이를 통해 도구(tool) 사용 능력과 다중 에이전트 reasoning 성능을 강화하였습니다.

4. 다중 도메인 대응을 위한 프롬프트 기반 학습

수학·코딩 중심이었던 이전 학습에서 벗어나, 다양한 분야의 태스크에 대응하도록 프롬프트 학습이 개선되었습니다.

Reinforcement Learning at Scale with Colossus (200k GPUs): Grok 4 was trained using a massive GPU cluster to run reinforcement learning (RL) at pretraining scale. Unlike Grok 3, which applied RL mainly for reasoning capabilities post-pretraining, Grok 4 integrates RL directly into training at scale, allowing for richer reasoning and tool usage capabilities.
6× Training Efficiency via Infrastructure and Algorithm Innovations: Grok 4’s training efficiency was improved sixfold through innovations across the stack, including new algorithmic optimizations, data sampling methods, and reinforcement learning pipelines.
User Feedback–Driven Data Collection: The training incorporated feedback and behavioral data from real user interactions, enabling the model to better use tools and engage in multi-agent reasoning scenarios.
Prompt Diversity Across Domains: Training objectives expanded beyond mathematics and coding, supporting a broader array of tasks across scientific, conversational, and open-ended domains.

Results

Humanity’s Last Exam (HLE)
- Grok 4 (도구 없음): 25.4%
- Grok 4 Heavy: 44.4%
- 비교 대상: GPT-4o (21%), Gemini 2.5 Pro (21.6%) → 기존 모델보다 20% 이상 향상된 정확도
AIME (미국 수학 경시대회)
- Grok 3: 52.2%
- Grok 4 Heavy: 100% 정답률 → 수학적 추론 및 계산 능력에서 완벽한 향상
ARC-AGI-2 (추론 퍼즐)
- Grok 4: 16.2%
- Claude Opus 4: 8–8.6% → 추상적 패턴 유추 능력에서 2배 이상 향상
GPQA (대학원 수준 물리 문제)
- Grok 3: 75.4%
- Grok 4 Heavy: 87%
VendingBench (비즈니스 시뮬레이션)
- Grok 4 Heavy: $4,694 수익 / 4,569개 제품 판매
- GPT-4: $1,843 / 1,363개
- Claude Opus 4: $2,077 / 1,412개
- 인간 참가자 평균: $844 / 344개 → 실제 시뮬레이션 환경에서 가장 높은 수익성과 전략 수행 능력 입증

Humanity’s Last Exam (HLE)
- Grok 4 (no tools): 25.4%
- Grok 4 Heavy: 44.4%
- Baselines: GPT-4o (21%), Gemini 2.5 Pro (21.6%) → Over 20% higher accuracy than leading models
AIME (American Invitational Math Exam)
- Grok 3: 52.2%
- Grok 4 Heavy: 100% accuracy → Perfect mathematical reasoning and computation
ARC-AGI-2 (abstract reasoning puzzles)
- Grok 4: 16.2%
- Claude Opus 4: ~8–8.6% → 2× improvement in complex pattern inference
GPQA (graduate-level physics QA)
- Grok 3: 75.4%
- Grok 4 Heavy: 87%
VendingBench (business strategy simulation)
- Grok 4 Heavy: $4,694 profit / 4,569 units sold
- GPT-4: $1,843 / 1,363 units
- Claude Opus 4: $2,077 / 1,412 units
- Human average: $844 / 344 units → Demonstrates highest strategic performance in simulation-based decision tasks

예제

예시 1: [Humanity’s Last Exam (HLE)]

입력: 인간 수준의 종합 평가 문제들 (수학, 물리, 역사, 윤리, 사회적 추론 등 다양한 분야)
출력: 주관식 또는 다지선다 문제에 대한 정확한 정답 또는 논리적인 해설
결과: Grok 4 Heavy는 도구 사용 없이도 44.4% 정확도를 달성, GPT-4o 및 Gemini 2.5 Pro보다 높은 성능

예시 2: [AIME (American Invitational Mathematics Examination)]

입력: 고등학교 수준의 수학 문제 (예: “Find the number of positive integers less than 1000 that are divisible by 7 or 11 but not both.”)
출력: 정확한 숫자 정답과 풀이 과정
결과: Grok 4 Heavy는 모든 문제 100% 정답, 이전 Grok 3은 52.2%만 맞힘

예시 3: [ARC-AGI-2 퍼즐 추론 태스크]

입력: 도형 배열이나 추상적인 패턴에서 규칙을 유추하는 문제 (일종의 Raven’s Matrix와 유사)
출력: 누락된 패턴을 채우는 정답 선택
결과: Grok 4는 16.2% 정확도로, Claude Opus 4보다 거의 2배 높은 성능

예시 4: [Vending-Bench 비즈니스 시뮬레이션 태스크]

입력: 제시된 마케팅/재고 데이터에 기반해 하루 동안의 자동판매기 운영 전략 결정 (가격 설정, 프로모션 등)
출력: 재무 성과 (수익, 판매량 등) 예측 및 실행 전략
결과: Grok 4는 $4,694의 순이익과 4,569개 제품 판매로 GPT-4, Claude, Gemini, 인간 참가자들보다 모두 우수한 결과

Example 1: [Humanity’s Last Exam (HLE)]

Input: Open-ended multi-domain evaluation questions across math, physics, ethics, history, and social reasoning
Output: Written answers or multiple-choice selections with logical explanations
Result: Grok 4 Heavy scored 44.4% accuracy without tool usage—outperforming GPT‑4o and Gemini 2.5 Pro

Example 2: [AIME (American Math Competition)]

Input: Advanced high school math problem Example: “Find the number of positive integers < 1000 divisible by 7 or 11 but not both.”
Output: Correct answer and step-by-step solution
Result: Grok 4 Heavy achieved 100% accuracy, compared to 52.2% by Grok 3

Example 3: [ARC-AGI-2 Reasoning Task]

Input: Abstract pattern recognition puzzle (e.g., inferring missing images in a matrix)
Output: Selection of the correct completion
Result: Grok 4 scored 16.2%, nearly double Claude Opus 4’s accuracy (~8%)

Example 4: [Vending-Bench Business Simulation]

Input: Market parameters and product info (e.g., daily cost, weather, customer trends)
Output: Operational decisions and pricing strategy
Result: Grok 4 earned $4,694 in net profit and sold 4,569 units, far exceeding all other models and even human participants

요약

Grok 4는 20만 개 GPU 클러스터 Colossus를 활용해 강화학습을 사전학습 수준으로 확장하고, 툴 사용 및 검색 능력을 포함한 멀티에이전트 reasoning을 학습했습니다. Grok 4 Heavy는 HLE 벤치마크에서 44.4%, AIME 수학 경시에서 100% 정확도를 기록하며 GPT-4o, Claude Opus 4, Gemini 2.5 Pro보다 뛰어난 성능을 보였습니다. 예시로, 고난도 수학 문제 풀이, 시뮬레이션 기반 전략 게임, 추론 퍼즐 등에서 도구 없이도 정확한 해답과 전략을 생성해냈습니다.

Grok 4 scales reinforcement learning to pretraining levels using the 200k-GPU Colossus cluster and learns tool usage and multi-agent reasoning across tasks. Grok 4 Heavy outperforms GPT‑4o, Claude Opus 4, and Gemini 2.5 Pro, achieving 44.4% on Humanity’s Last Exam and 100% on the AIME math benchmark. Example tasks include solving complex math problems, reasoning through abstract puzzles, and optimizing strategies in simulation environments—all without external tools.

기타

1. 성능 비교 테이블 (표 형식)

여러 벤치마크(HLE, ARC-AGI-2, AIME, GPQA, VendingBench 등)에 대한 정량적 결과를 나열한 표가 존재합니다.
예를 들어, Grok 4 Heavy는 VendingBench에서 $4,694 순이익, 4,569개 판매로 GPT-4o, Claude, Gemini, 인간 참가자를 모두 능가했습니다.
Grok 4는 HLE 25.4% (도구 없이), **Grok 4 Heavy는 44.4%**로 GPT-4o 대비 +20% 이상 향상된 점수를 기록했습니다.
인사이트: 도구 사용 없이도 Grok 4가 상당한 reasoning 능력을 갖추고 있으며, Heavy 버전은 멀티에이전트 및 복합 연산에 강점을 보임.

2. 성능 변화 그래프 (공식 발표 영상 및 외부 리뷰에서 확인됨)

강화학습과 툴 사용 기능을 도입한 이후 성능이 어떻게 향상되었는지를 시각적으로 보여주는 그래프가 포함되어 있음.
특히 수학 영역에서는 Grok 3 → Grok 4로 넘어오며 **AIME 정확도 52.2% → 100%**까지 상승한 추세가 그래프로 제시됨.
결론: RL 스케일링 및 툴 통합이 학습 효과에 중요한 영향을 미친 것으로 해석됨.

3. 데모 영상 (YouTube 링크 포함)

xAI 공식 영상에서는 Grok 4가 실시간으로 사용자 질문에 대답하고, 검색/계산/요약 등 복합 task를 수행하는 장면을 시연함.
예시: “이전 미국 대통령 5명과 그들의 주요 업적 정리해줘” → Grok이 직접 요약하고 비교표로 정리.
인사이트: 단순한 언어 생성이 아니라 툴 연동 기반의 액션형 AI로 설계되었음을 보여줌.

4. 마케팅 이미지와 슬로건 (“Smartest Model in the World”)

“세계에서 가장 지능적인 모델”이라는 문구가 강조되어 있으며, Gemini 및 GPT-4 대비 성능 우위가 마케팅 중심에 배치됨.
실제로 HLE, ARC-AGI-2, GPQA, VendingBench 등의 수치 비교가 홍보 도표로 활용됨.
의의: 단순한 스펙 경쟁이 아니라 실전 과제에서의 “행동 기반 지능”을 주요 차별점으로 내세움.

1. Performance Comparison Tables

The official release includes tabular summaries comparing Grok 4 and Grok 4 Heavy to Gemini 2.5 Pro, GPT‑4o, and Claude Opus 4 across multiple benchmarks (e.g., HLE, AIME, GPQA, VendingBench).
Notably, Grok 4 Heavy achieved $4,694 in profit and 4,569 units sold in VendingBench, outperforming all other models and humans.
Insight: Even without tool usage, Grok 4 demonstrates advanced reasoning, while the Heavy variant excels in multi-agent and simulation tasks.

2. Progress Graphs

Charts demonstrate performance gains from Grok 3 to Grok 4 due to scaled reinforcement learning and integrated tool usage.
For example, Grok 3 scored 52.2% on AIME, while Grok 4 Heavy reached 100%, visually highlighting dramatic improvement.
Conclusion: RL at scale, combined with behavioral data and real-time tools, significantly boosts performance across complex domains.

3. Demonstration Video (YouTube)

The launch video shows Grok 4 performing multi-step tasks like answering complex questions, performing live search, and generating structured summaries.
Example: “List five past US presidents and their key policies” → Grok generates a comparative summary chart in real time.
Insight: Grok 4 acts not only as a language model but as a tool-integrated action agent.

4. Marketing Visuals & Messaging

Slogans like “The Smartest Model in the World” frame Grok 4’s identity, emphasizing superiority over GPT and Gemini in practical reasoning.
Marketing charts showcase benchmark scores to substantiate the claim, especially on zero-shot tool-less tasks.
Significance: xAI positions Grok not just as a language generator but as a real-world AI agent capable of tool interaction and decision-making.

refer format:

@misc{xai2025grok4, author = , title = {Grok 4: The Smartest Model in the World}, year = {2025}, howpublished = {\url{https://x.ai/news/grok-4}}, note = {Accessed: 2025-07-12} }

xAI. 2025. “Grok 4: The Smartest Model in the World.” xAI News. July 11, 2025. https://x.ai/news/grok-4.