One-line summary:


Pre-training a BERT model on mRNA sequences


Short summary (Abstract):






mRNA-based vaccines and therapeutics are rapidly gaining importance across a wide range of conditions. A critical challenge in designing such mRNAs is **sequence optimization**. Because most amino acids can be encoded by several synonymous codons, even a small protein can be encoded by a very large number of different mRNA sequences, and the choice of sequence can strongly affect properties such as expression, stability, and immunogenicity.

To address this, the authors developed **CodonBERT**, a large language model (LLM) specifically for mRNA. Unlike previous models, CodonBERT uses **codons** as the input unit, allowing it to learn richer biological representations. It was trained on more than 10 million mRNA sequences from a wide variety of organisms, enabling it to capture important biological concepts.

CodonBERT can also be extended to various mRNA property prediction tasks and outperforms previous methods, including on a newly created flu vaccine dataset.
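
To make the size of the design space concrete, the following minimal sketch counts how many coding sequences translate to the same peptide under the standard genetic code. The peptide `MFPVKMVEKS` and the function name `count_synonymous_mrnas` are illustrative assumptions, not taken from the paper.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# ("*" denotes the stop signal). The counts add up to all 64 codons.
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3, "*": 3,
    "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2,
    "H": 2, "K": 2, "F": 2, "Y": 2,
    "M": 1, "W": 1,
}

def count_synonymous_mrnas(protein: str) -> int:
    """Count the coding sequences that translate to `protein` (stop codon included)."""
    return prod(DEGENERACY[aa] for aa in protein + "*")

peptide = "MFPVKMVEKS"  # hypothetical 10-residue peptide, for illustration only
print(count_synonymous_mrnas(peptide))  # 18432
```

Since the count is a product over residues, it grows roughly exponentially with protein length, which is why exhaustive search over candidate sequences quickly becomes impractical.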

---



* Useful sentences :

Vocabulary

Methodology

1. Pre-training Data

Data Collection: mRNA sequences were collected from the NCBI database, including sequences from human viruses, mammals, and bacteria (mainly E. coli).
Total Data Size: Over 10 million mRNA coding sequences (CDS) were used for training.
Evaluation Data: A novel dataset was generated by synthesizing mRNA sequences encoding the Influenza H3N2 hemagglutinin protein, followed by in vitro protein expression experiments using HeLa cells.

2. Model Architecture (Backbone)

CodonBERT extends the BERT architecture, tailored specifically for mRNA sequences.
Unlike earlier RNA language models that operate on individual nucleotides, CodonBERT processes inputs at the codon (three-nucleotide) level.
Each input codon is represented as the sum of three embeddings:
- Codon Embedding (for the codon itself)
- Position Embedding (for positional information)
- Segment Embedding (for distinguishing sequence pairs)
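
In BERT-style models, these three embeddings are added element-wise at each position before entering the encoder. The sketch below illustrates that step in plain Python with a toy width of 8 instead of 768. The table sizes (69 codon rows, 16 positions) and the helper names are assumptions made for the example, not values from the paper.

```python
import random

HIDDEN_SIZE = 8  # toy width for readability; the summary reports 768

def make_table(num_rows: int, rng: random.Random) -> list[list[float]]:
    """Create a randomly initialised embedding table (num_rows x HIDDEN_SIZE)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(num_rows)]

rng = random.Random(0)
codon_table = make_table(69, rng)     # assumed size: 64 codons + 5 special tokens
position_table = make_table(16, rng)  # assumed toy maximum sequence length
segment_table = make_table(2, rng)    # segment 0 and segment 1

def embed(token_ids, segment_ids):
    """Sum codon, position and segment embeddings for each input position."""
    output = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vectors = (codon_table[token_id], position_table[position], segment_table[segment_id])
        output.append([sum(values) for values in zip(*vectors)])
    return output

embedded = embed(token_ids=[5, 17, 42], segment_ids=[0, 0, 0])
print(len(embedded), len(embedded[0]))  # 3 8
```

In a trained model the tables hold learned parameters rather than random values; only the summation pattern is the point here.
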
The model consists of:
- 12 Transformer encoder blocks
- 12 self-attention heads per block
- Hidden size of 768
- Residual connections and feed-forward networks after each self-attention layer.
Output: Produces contextualized codon representations, followed by a classification layer for task-specific outputs.
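
The reported hyperparameters can be collected in a small configuration object. This is only a sketch: the class name `EncoderConfig` is hypothetical, and the even split of the hidden size across heads is the usual multi-head attention convention rather than something stated in the summary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Hyperparameters as reported in the summary above."""
    num_layers: int = 12
    num_heads: int = 12
    hidden_size: int = 768

    @property
    def head_dim(self) -> int:
        """Width of each attention head when the hidden size is split evenly."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

config = EncoderConfig()
print(config.head_dim)  # 64
```

These values match the BERT-base configuration, so each of the 12 heads would attend over a 64-dimensional slice of the 768-dimensional hidden state.
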

3. Pre-training Tasks
CodonBERT was pre-trained on two tasks simultaneously:

Masked Language Modeling (MLM): Randomly masks 15% of codons and learns to predict them.
Homologous Sequence Prediction (HSP): Predicts whether two mRNA sequences are homologous (i.e., evolutionarily related).
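
The masking step of MLM can be sketched as follows. This is a simplified illustration, not the authors' implementation: it always substitutes a `[MASK]` token (the name is assumed from BERT convention), masks a fixed count of positions, and omits the random-replacement and keep-unchanged variants used in the original BERT recipe.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed token name, following BERT convention

def mask_codons(codons, mask_rate=0.15, seed=0):
    """Mask a random subset of codons; return the masked input and the targets."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(codons) * mask_rate))
    masked_positions = set(rng.sample(range(len(codons)), num_to_mask))
    masked_input = [MASK_TOKEN if i in masked_positions else c for i, c in enumerate(codons)]
    targets = {i: codons[i] for i in sorted(masked_positions)}
    return masked_input, targets

# Illustrative 20-codon sequence (not from the paper's data).
codons = "AUG UUC CCG GUA AAG AUG GUG GAG AAA UCA GCU GGC ACC CUG UGG GAU CAC AAC UAC UAA".split()
masked_input, targets = mask_codons(codons)
print(len(targets))  # 3 of the 20 codons are masked
```

The model is then trained to recover each entry of `targets` from the surrounding unmasked codons.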

The final loss function is the sum of the MLM loss and HSP loss.
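
Written out, the combined objective is a plain sum. The forms of the two terms below are the standard cross-entropy losses for these task types and are assumptions here, since the summary only states that the two losses are added: $M$ is the set of masked positions, $\tilde{x}$ the masked input, $y \in \{0, 1\}$ the homology label, and $\hat{y}$ the predicted probability of homology.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{HSP}},
\qquad
\mathcal{L}_{\text{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right),
\qquad
\mathcal{L}_{\text{HSP}} = -\left[\, y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right]
```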

Results

1. Test Datasets
CodonBERT was evaluated on a wide range of downstream tasks using the following datasets:

Flu Vaccines — Expression prediction
mRFP Expression — Expression prediction
Fungal Expression — Expression prediction
E. coli Proteins — Protein expression classification
mRNA Stability — Stability prediction
Tc-Riboswitch — Switching prediction
SARS-CoV-2 Vaccine — Degradation prediction

※ All datasets were split into train/validation/test sets with a 70/15/15 ratio.
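
A 70/15/15 split can be produced as in the sketch below. The uniform random shuffle and the fixed seed are assumptions; the summary does not say how the authors assigned sequences to each partition.

```python
import random

def split_dataset(items, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle items and split them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_val = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

For sequence data, a purely random split can place near-identical sequences in both train and test sets, so a similarity-aware split may be preferable in practice.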

2. Baseline Competitor Models
CodonBERT was compared against:

Nucleotide-based models:
- Plain TextCNN (nucleotide level)
- RNABERT
- RNA-FM
Codon-based models:
- TF-IDF + Random Forest
- TextCNN (codon version)
- Codon2vec

3. Metrics

Regression tasks:
- Spearman’s rank correlation was used as the main evaluation metric.
Classification tasks:
- Accuracy was used.
(MSE loss and Cross Entropy loss were also reported but were secondary.)
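
Both metrics are simple to compute. The following self-contained sketch implements Spearman's rank correlation as the Pearson correlation of average ranks (so ties are handled) together with accuracy. The input numbers are made up for illustration; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(predicted, measured):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))

def accuracy(predicted_labels, true_labels):
    """Fraction of labels predicted correctly."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

print(spearman([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]))  # 0.8
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))                  # 0.75
```

Spearman's correlation rewards getting the ordering of sequences right rather than the exact values, which suits the goal of ranking candidate mRNA designs.
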

4. Key Findings

Codon-based models generally outperformed nucleotide-based models.
CodonBERT achieved the best performance on 4 of the 7 tasks and ranked second on most of the remaining tasks.
It achieved the highest performance among all compared models on tasks such as expression-level prediction and protein expression classification.
However, CodonBERT showed slightly lower performance on mRNA stability prediction, likely because codon-level modeling does not capture fine-grained structural features critical for stability.

Examples

1. Training and Test Dataset Samples

Flu Vaccines dataset:
- mRNA sequence length: 1698–1704 nucleotides
- Target (label): Protein expression level (continuous value, regression task)
E. coli Protein Expression dataset:
- mRNA sequence length: 171–3000 nucleotides
- Target (label): Binary classification (whether the sequence leads to protein expression)
mRNA Stability dataset:
- Fragment lengths: 30–1497 nucleotides
- Target (label): Stability/degradation score (continuous value, regression)

(※ Full dataset descriptions are summarized in Supplementary Table S1.)

2. Task Input and Output Examples

Input:
- A single mRNA sequence tokenized at the codon level (three nucleotides per token)
  - Example: “AUG UUC CCG GUA AAG AUG GUG GAG AAA UCA UAA” (AUG as start codon)
- Pairs of sequences can also be input for tasks like homologous sequence prediction.
Output:
- For regression tasks: A predicted continuous value (e.g., expression level, stability score)
- For classification tasks: A predicted class label (e.g., expressed vs non-expressed)
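
The input format can be sketched in a few lines. The function names are hypothetical, and the `[CLS]`/`[SEP]` special tokens and 0/1 segment ids follow the usual BERT convention, assumed here rather than confirmed by the summary.

```python
def tokenize_codons(mrna: str) -> list[str]:
    """Split a coding sequence into non-overlapping three-nucleotide tokens."""
    sequence = mrna.replace(" ", "").upper().replace("T", "U")
    if len(sequence) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

def encode_pair(first: str, second: str):
    """Build a BERT-style pair input with segment ids (0 = first, 1 = second)."""
    first_tokens = ["[CLS]"] + tokenize_codons(first) + ["[SEP]"]
    second_tokens = tokenize_codons(second) + ["[SEP]"]
    tokens = first_tokens + second_tokens
    segment_ids = [0] * len(first_tokens) + [1] * len(second_tokens)
    return tokens, segment_ids

codons = tokenize_codons("AUGUUCCCGGUAAAGAUGGUGGAGAAAUCAUAA")
print(codons)
# ['AUG', 'UUC', 'CCG', 'GUA', 'AAG', 'AUG', 'GUG', 'GAG', 'AAA', 'UCA', 'UAA']

tokens, segment_ids = encode_pair("AUGUUCUAA", "AUGUUUUAA")
print(tokens)       # ['[CLS]', 'AUG', 'UUC', 'UAA', '[SEP]', 'AUG', 'UUU', 'UAA', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The two short sequences in the pair example differ only by a synonymous codon (UUC and UUU both encode phenylalanine), which is the kind of variation a codon-level vocabulary exposes directly to the model.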

3. Example of Model Workflow

For the Flu Vaccines task:

Input:
- An mRNA sequence encoding an influenza antigen.
Model Processing:
- Tokenize by codon → Generate codon embeddings → Pass through 12-layer Transformer → Obtain contextualized representations → Predict expression level via regression head.
Output:
- A predicted expression level value.

Or for the E. coli proteins dataset:

Input:
- An mRNA sequence derived from E. coli.
Output:
- Classification into either “protein expressed” or “not expressed.”

Summary

Methods: CodonBERT is a BERT-based large language model pre-trained on over 10 million mRNA coding sequences from diverse organisms. It takes codon-level inputs and jointly optimizes masked language modeling and homologous sequence prediction during pre-training.

Conclusion: CodonBERT outperformed previous codon- and nucleotide-based models on most of the mRNA property prediction tasks evaluated, performing especially well on expression-level prediction and protein expression classification.

Examples: It was applied to tasks on real mRNA sequences, such as predicting expression levels for flu vaccine mRNAs, predicting mRNA stability scores, and classifying protein expression outcomes for E. coli sequences.

Miscellaneous

1. Key Tables

Table 1:
- Comparison of CodonBERT against previous methods (TextCNN, RNABERT, RNA-FM, Codon2vec) across 7 downstream tasks including expression and stability prediction.
- Evaluated using Spearman’s rank correlation (for regression) and Accuracy (for classification).
- CodonBERT ranked first or second on almost all tasks.
Table S1:
- Summarizes datasets used (e.g., Flu Vaccines, E. coli) with number of mRNA samples, sequence length range, and task types (regression/classification).
Table S2:
- Details test set MSE Loss, Spearman Correlation, and Accuracy comparisons across models.

2. Key Figures

Figure 1:
- CodonBERT model architecture showing the input processing pipeline, Transformer layers, and output generation.
Figure 2:
- UMAP 2D visualizations of learned codon and sequence embeddings — showing codon similarities and evolutionary clusters.
Figure S1 (Supplementary):
- Learning curves (loss, accuracy) during CodonBERT pre-training.
Figure S3:
- Schematics comparing CodonBERT and other baseline methods (e.g., TF-IDF, Codon2vec).
Figure S4:
- Ranking plots across the 7 tasks for different models.

3. Appendix (Supplementary Information)

Additional visualizations of codon and sequence embeddings using UMAP.
Detailed experimental procedure for generating new Flu vaccine data (e.g., HeLa cell transfection, ELISA detection).

Reference format:

@inproceedings{li2023codonbert, title={CodonBERT: Large Language Models for mRNA Design and Optimization}, author={Li, Sizhen and Moayedpour, Saeed and Li, Ruijiang and Bailey, Michael and Riahi, Saleh and Kogler-Anele, Lorenzo and Miladi, Milad and Miner, Jacob and Zheng, Dinghai and Wang, Jun and Balsubramani, Akshay and Tran, Khang and Zacharia, Minnie and Wu, Monica and Gu, Xiaobo and Clinton, Ryan and Asquith, Carla and Skaleski, Joseph and Boeglin, Lianne and Chivukula, Sudha and Dias, Anusha and Ulloa Montoya, Fernando and Agarwal, Vikram and Bar-Joseph, Ziv and Jager, Sven}, booktitle={Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)}, year={2023}, organization={NeurIPS} }

Li, Sizhen, Saeed Moayedpour, Ruijiang Li, Michael Bailey, Saleh Riahi, Lorenzo Kogler-Anele, Milad Miladi, Jacob Miner, Dinghai Zheng, Jun Wang, Akshay Balsubramani, Khang Tran, Minnie Zacharia, Monica Wu, Xiaobo Gu, Ryan Clinton, Carla Asquith, Joseph Skaleski, Lianne Boeglin, Sudha Chivukula, Anusha Dias, Fernando Ulloa Montoya, Vikram Agarwal, Ziv Bar-Joseph, and Sven Jager. “CodonBERT: Large Language Models for mRNA Design and Optimization.” In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.