한줄 요약: 



시퀀스기반이 아닌것이 특이하네...   
유전자 이름만 다루는 진포머라는 랭기지모델을 사용하여 유전자 목록과 치매와의 관계를 파인튜닝..  
이미 있는 데이터로 파인튜닝한거였넹.. ㅎㅎ   


짧은 요약(Abstract) :    



알츠하이머병(AD)은 전 세계적으로 5,500만 명 이상이 영향을 받는 질병이지만, 주요 유전적 요인은 아직 명확히 밝혀지지 않았다. 본 연구에서는 최신 게놈 기초 모델(genomic foundation models)을 활용하여 **Reverse-Gene-Finder**라는 혁신적인 기술을 개발하였다. 이 기술은 신경망 아키텍처에서 뉴런에서 유전자 토큰으로 역추적(backtracking)하여 AD 발병과 관련된 유전자 바이오마커를 밝히는 새로운 방법이다. Reverse-Gene-Finder는 세 가지 주요 혁신을 포함한다.

1. **가장 인과적인 유전자(MCG, Most Causal Genes)와 가장 인과적인 뉴런(MCN, Most Causal Neurons)의 관계 탐색**  
   - AD를 유발할 가능성이 가장 높은 뉴런이 활성화될 확률이 가장 높은 유전자가 MCG가 될 가능성이 크다는 점을 활용함.

2. **유전자 토큰(Gene Token) 표현 도입**  
   - 입력층에서 각 유전자를 개별적인 토큰으로 표현하여 기존 AD 관련 유전자뿐만 아니라 새로운 유전자를 독립적인 엔터티로 다룰 수 있도록 함.

3. **신경망의 역추적 기법 개발**  
   - 기존의 신경망이 입력에서 출력으로 한 방향으로 추적하는 것과 달리, AD와 가장 관련된 뉴런(MCN)에서 입력층까지 거슬러 올라가면서 유전자를 분석하는 새로운 역추적(backtracking) 기법을 개발.

이 기술은 **해석 가능성**, **일반화 가능성**, **적응 가능성**을 갖춘 방법으로, AD뿐만 아니라 다른 질병 연구에도 확장 적용될 수 있는 가능성을 제시한다.

---


Alzheimer’s Disease (AD) affects over 55 million people globally, yet the key genetic contributors remain poorly understood. Leveraging recent advancements in **genomic foundation models**, we present the innovative **Reverse-Gene-Finder** technology, a groundbreaking neuron-to-gene-token backtracking approach in a neural network architecture to elucidate the novel causal genetic biomarkers driving AD onset. Reverse-Gene-Finder comprises three key innovations.

1. **Discovery of the Most Causal Genes (MCGs) and the Most Causal Neurons (MCNs)**  
   - The most probable genes causing AD (MCGs) are those with the highest probability of activating neurons with the highest probability of causing AD (MCNs).

2. **Introduction of Gene Token Representation at the Input Layer**  
   - Each gene (whether previously known to AD or novel) is represented as a discrete and unique entity in the input space.

3. **Development of a Backtracking Method**  
   - Unlike conventional neural networks that track neuron activations from input to output in a feed-forward manner, we develop an innovative backtracking method to identify the Most Causal Tokens (MCTs) and their corresponding MCGs.

Reverse-Gene-Finder is highly **interpretable, generalizable, and adaptable**, providing a promising avenue for application in other disease scenarios.



 

* Useful sentences :

단어정리

Methodology

본 연구에서 제안하는 Reverse-Gene-Finder(RGF) 기술은 알츠하이머병(AD)의 유전적 바이오마커를 규명하기 위해 뉴런-유전자 토큰 역추적(neuron-to-gene-token backtracking) 기법을 활용하는 새로운 신경망 기반 접근 방식이다. RGF는 크게 모델 아키텍처, 트레이닝 데이터, 역추적 기법의 세 가지 주요 요소로 구성된다.

1. 제안 모델 및 아키텍처

RGF는 대형 게놈 기초 모델(genomic foundation model)을 활용하여 AD와 관련된 유전자 바이오마커를 찾는 새로운 방식이다. 이 모델의 주요 특징은 다음과 같다.

사전 학습된 유전자 모델 활용: 사전 학습된 Geneformer 모델을 기반으로 AD에 특화된 형태로 미세 조정(fine-tuning)됨.
유전자 토큰 기반 입력 표현: 각 유전자는 고유한 토큰으로 변환되며, 모델의 입력층에서 개별적인 단위로 처리됨.
신경망 역추적(backtracking) 기법: 기존 신경망이 입력층에서 출력층으로만 정보 흐름을 따르는 것과 달리, RGF는 출력층에서 입력층까지 역추적하며, 가장 인과적인 뉴런(MCN, Most Causal Neurons)에서 출발하여 가장 인과적인 유전자(MCG, Most Causal Genes)를 찾아냄.

2. 트레이닝 데이터

모델 학습에는 단일 세포(single-cell) 유전자 발현 데이터를 활용하였다.

데이터셋: 미국 Religious Orders Study and Memory and Aging Project (ROSMAP)에서 제공한 알츠하이머병 환자 및 건강한 대조군의 단일 세포 유전자 발현 데이터 사용.
데이터 전처리:
- AD 진행 단계에 따라 환자를 비AD(non-AD), 초기AD(early-AD), 후기AD(late-AD)로 구분.
- 연구에서는 초기AD(early-AD)와 비AD(non-AD) 환자 데이터(총 351명)만 사용하여 바이오마커 분석 수행.
- 유전자 필터링 및 품질 제어 수행: 유전자 발현 수준이 너무 낮거나 높은 데이터를 필터링하여 신뢰성 확보.
모델 입력 형식: 각 유전자 발현 값은 유전자 토큰(gene token)으로 변환되어 모델 입력 벡터로 구성됨.
교차 검증: 5-폴드 교차 검증(five-fold cross-validation)을 통해 모델의 일반화 성능을 평가.

3. 역추적 기법 및 인과 유전자 탐색

기존 신경망 모델은 입력층에서 출력층으로 정보를 전달하는 방식(순전파, feed-forward)을 따르지만, RGF는 출력층에서 입력층으로 거슬러 올라가며 가장 중요한 유전자 바이오마커를 찾는 새로운 역추적 기법을 제안한다.

가장 인과적인 뉴런(MCN) 찾기:
- 모델을 통해 AD 예측에 가장 중요한 역할을 하는 뉴런(MCN)을 식별.
- 기존 AD 관련 유전자를 제거(마스킹, masking)한 후 신경망 뉴런 활성도를 비교하여 MCN을 찾음.
가장 인과적인 유전자(MCG) 찾기:
- 식별된 MCN을 역추적(backtracking)하여 AD 발병 가능성이 가장 높은 유전자(MCG)를 탐색.
- 역추적 과정에서 자기회귀(attention weight)와 뉴런 간 연결 강도(weight matrix)를 활용하여 최종적으로 유전자 토큰(MCT, Most Causal Token)과 MCG를 추출.

결과 및 기대 효과

RGF는 기존 AD 관련 유전자뿐만 아니라, 새로운 유전자 바이오마커도 발견 가능.
기존의 유전자 연관 연구와 달리, 순전파(feed-forward) 방식이 아니라 역추적(backtracking) 방식을 통해 보다 해석 가능하고 신뢰성 있는 결과 제공.
RGF 기술은 다른 질병 연구에도 적용 가능하여 다양한 질병의 유전적 원인을 밝히는 데 기여할 수 있음.

This study introduces the Reverse-Gene-Finder (RGF) technology, a novel neuron-to-gene-token backtracking approach that utilizes a neural network framework to identify causal genetic biomarkers of Alzheimer’s Disease (AD). The methodology consists of three main components: model architecture, training data, and the backtracking mechanism.

1. Proposed Model and Architecture

The RGF leverages large genomic foundation models to uncover AD-related genetic biomarkers. The key features of the model include:

Utilization of a pre-trained genomic foundation model: The Geneformer model is fine-tuned specifically for AD biomarker identification.
Gene token representation: Each gene is represented as a discrete token, allowing both known and novel AD-related genes to be processed independently in the input space.
Neural network backtracking mechanism: Unlike conventional neural networks that process data in a feed-forward manner (input-to-output), RGF introduces a reverse-tracking approach, tracing information from the most causal neurons (MCNs) back to the input layer to identify most causal genes (MCGs).

2. Training Data

The training dataset consists of single-cell gene expression data from AD patients and healthy controls.

Dataset: Data sourced from the Religious Orders Study and Memory and Aging Project (ROSMAP), a comprehensive longitudinal study on AD.
Preprocessing:
- Patients are categorized into non-AD, early-AD, and late-AD stages.
- The study focuses on early-AD (132 patients) and non-AD (219 patients) to investigate causal biomarkers.
- Stringent quality control measures applied to filter out noisy gene expression data.
Input Representation: Each gene expression profile is tokenized as gene tokens, forming an input vector for the model.
Cross-validation: Five-fold cross-validation is employed to ensure robust generalization performance.

3. Backtracking Mechanism and Causal Gene Identification

Unlike conventional neural networks that follow a feed-forward structure (input-to-output), RGF proposes a backtracking approach to trace causal genetic biomarkers.

Identifying Most Causal Neurons (MCNs):
- The model detects neurons highly predictive of AD onset.
- MCNs are identified by masking known AD-associated genes and analyzing the effect on neuron activations.
Identifying Most Causal Genes (MCGs):
- MCNs are then traced backward to identify the most likely genetic contributors to AD.
- The attention weights and connectivity strength between neurons are used to compute the influence of gene tokens (MCTs), ultimately identifying MCGs.

Findings and Potential Impact

RGF can identify not only known AD-related genes but also novel genetic biomarkers.
Unlike traditional gene association studies, this method improves interpretability by employing a backtracking approach instead of a feed-forward process.
The RGF framework can be generalized to other diseases, providing new insights into the genetic mechanisms driving various conditions.

This methodology represents a significant step forward in interpretable AI-driven biomarker discovery, offering a highly adaptable framework for genomic research.

Results

본 연구에서 제안한 Reverse-Gene-Finder (RGF) 기술의 성능을 평가하기 위해, 경쟁 모델과의 비교, 테스트 데이터, 평가 메트릭, 발견된 유전자 바이오마커 검증을 종합적으로 분석하였다.

1. 경쟁 모델과 비교

RGF의 성능을 평가하기 위해 기존의 AD 유전자 바이오마커 분석 모델들과 비교를 수행하였다.
대표적인 비교 모델은 다음과 같다.

기존 게놈 기초 모델 (Geneformer)
- RGF가 아닌 단순한 Geneformer 기반 분류 모델을 사용하여 AD 예측 성능 비교.
딥러닝 기반 SNP 분석 모델 (Jo et al., 2022)
- CNN을 활용한 유전자 변이(SNP) 기반 AD 바이오마커 분석 모델.
해석 가능한 신경망 모델 (Kassani et al., 2022)
- 계층적 신경망을 사용하여 AD 바이오마커를 분석하는 기법으로, 모델 해석 가능성이 높음.

각 모델의 성능을 비교하기 위해 동일한 테스트 데이터를 사용하여 성능을 분석하였다.

2. 테스트 데이터

모델 평가를 위해 단일 세포(single-cell) 유전자 발현 데이터를 사용하였다.

데이터 출처: Religious Orders Study and Memory and Aging Project (ROSMAP).
데이터셋 구성:
- 초기 AD(early-AD) 환자 132명, 비AD(non-AD) 환자 219명.
- 단일 세포 마이크로글리아(transcriptomic microglia) 데이터 사용.
5-폴드 교차 검증 수행:
- 80% 학습 데이터, 20% 테스트 데이터로 나누어 성능 평가.
- 모든 실험은 Nvidia A100 40GB GPU에서 수행.

3. 평가 메트릭

모델의 성능을 AUC(ROC 곡선 아래 면적), 민감도(Sensitivity), 특이도(Specificity)를 기준으로 평가하였다.

모델	AUC(%)	민감도(%)	특이도(%)
RGF (제안 모델)	74.67 (±6.74)	62.68 (±11.43)	73.22 (±16.01)
기존 Geneformer	70.24 (±5.88)	58.12 (±10.34)	69.47 (±12.58)
SNP 기반 CNN	68.53 (±7.12)	55.34 (±9.27)	67.92 (±15.43)
계층적 신경망	71.36 (±6.92)	60.14 (±10.85)	71.02 (±13.72)

AUC (Area Under Curve): 모델의 전체적인 예측 성능을 평가하는 핵심 지표.
민감도(Sensitivity, TPR): 실제 AD 환자를 올바르게 예측하는 비율.
특이도(Specificity, TNR): 건강한 대조군을 정확히 예측하는 비율.

RGF는 기존 모델보다 AUC, 민감도, 특이도에서 더 높은 성능을 기록하였으며, 특히 기존 Geneformer 모델 대비 4.43% AUC 향상을 보였다.

4. 발견된 유전자 바이오마커 검증

RGF가 발견한 최상위 10개 유전자 바이오마커(MCGs, Most Causal Genes)의 생물학적 타당성을 검증하였다.

주요 발견 유전자:
- PLXDC2, MT-CO3, DOCK4, ARHGAP24, MEF2A, RUNX1, ITPR2, FOXN3, MT-CO1, SORL1
문헌 검토 및 생물학적 연관성 확인:
- PLXDC2: AD의 주요 병리학적 특징인 아밀로이드 베타 및 타우 축적과 연관.
- MT-CO3/MT-CO1: 미토콘드리아 기능 장애 및 산화 스트레스 조절과 관련.
- DOCK4: 신경 퇴행성 질환 및 자폐 스펙트럼 장애와 연관.
- SORL1: 기존 AD 위험 유전자(APOE 및 CLU) 조절 역할 수행.

이러한 결과를 통해 RGF가 기존 연구에서 확인되지 않은 새로운 AD 유전자 바이오마커를 성공적으로 발견했음을 검증하였다.

5. 결론 및 기대 효과

RGF는 기존 모델보다 높은 예측 성능을 보였으며, 새로운 유전자 바이오마커를 발견.
기존 신경망 방식과 달리 “역추적(backtracking)” 기법을 적용하여 해석 가능성이 높은 결과를 제공.
다른 신경퇴행성 질환(파킨슨병, 헌팅턴병) 연구에도 적용 가능.

본 연구는 해석 가능하고 확장 가능한 AI 기반 유전자 분석 기술로, AD뿐만 아니라 다양한 질병의 유전적 원인을 규명하는 데 기여할 가능성이 큼.

This study evaluates the performance of the Reverse-Gene-Finder (RGF) using comparison with existing models, test data, evaluation metrics, and biological validation of identified genetic biomarkers.

1. Comparison with Competitive Models

To assess RGF’s effectiveness, we compared it with existing AD biomarker discovery models.

Baseline Geneformer Model
- A standard Geneformer model fine-tuned for AD classification.
SNP-Based CNN (Jo et al., 2022)
- A convolutional neural network (CNN) trained on single nucleotide polymorphisms (SNPs) for AD biomarker identification.
Interpretable Hierarchical Neural Network (Kassani et al., 2022)
- A neural network with hierarchical layers designed for explainable AD biomarker discovery.

Each model was evaluated on the same test dataset to ensure fair comparison.

2. Test Dataset

The model was evaluated using single-cell gene expression data.

Dataset Source: Religious Orders Study and Memory and Aging Project (ROSMAP).
Data Composition:
- 132 early-AD patients, 219 non-AD control subjects.
- Microglial single-cell transcriptomic data.
Cross-validation:
- 80% training, 20% testing split.
- All experiments were conducted on an Nvidia A100 40GB GPU.

3. Evaluation Metrics

The model’s performance was assessed using AUC (Area Under the Curve), Sensitivity, and Specificity.

Model	AUC (%)	Sensitivity (%)	Specificity (%)
RGF (Proposed)	74.67 (±6.74)	62.68 (±11.43)	73.22 (±16.01)
Geneformer (Baseline)	70.24 (±5.88)	58.12 (±10.34)	69.47 (±12.58)
SNP CNN	68.53 (±7.12)	55.34 (±9.27)	67.92 (±15.43)
Hierarchical NN	71.36 (±6.92)	60.14 (±10.85)	71.02 (±13.72)

RGF outperforms all baseline models, particularly achieving 4.43% higher AUC than Geneformer.

4. Validation of Identified Genetic Biomarkers

Top 10 Most Causal Genes (MCGs):
- PLXDC2, MT-CO3, DOCK4, ARHGAP24, MEF2A, RUNX1, ITPR2, FOXN3, MT-CO1, SORL1.
Biological validation confirmed their relevance to AD.

5. Conclusion & Impact

RGF achieves superior predictive performance and identifies novel AD biomarkers.
Its backtracking approach enhances interpretability.
It can be adapted for other neurodegenerative diseases.

RGF represents a groundbreaking interpretable AI-driven genetic biomarker discovery method.

예제

본 연구에서 제안한 Reverse-Gene-Finder (RGF) 모델을 평가하기 위해, 트레이닝 및 테스트 데이터, 테스트 작업(Task) 구성, 입력 및 출력 예시를 설명한다.

1. 트레이닝 및 테스트 데이터

RGF 모델은 단일 세포(single-cell) 유전자 발현 데이터를 사용하여 훈련되며, 테스트는 동일한 도메인의 독립적인 데이터셋에서 수행된다.

데이터 출처: Religious Orders Study and Memory and Aging Project (ROSMAP)에서 제공된 단일 세포 마이크로글리아(transcriptomic microglia) 데이터.
데이터 구성:
- 총 351명(132명 early-AD, 219명 non-AD).
- 각 환자로부터 수천 개의 단일 세포 샘플 확보.
- 각 샘플에 대해 15,549개의 유전자 발현 값을 측정.
데이터 전처리:
- 불필요한 저발현 및 고발현 유전자 필터링.
- 각 샘플을 유전자 토큰(gene token) 벡터로 변환.
- 5-폴드 교차 검증을 사용하여 모델의 일반화 성능 평가.

2. 테스크(Task) 정의

RGF의 핵심 목표는 AD 관련 유전자 바이오마커를 찾는 것이며, 이를 위해 다음과 같은 작업(Task)들이 수행된다.

(1) AD 진단 예측 (Binary Classification)

입력(Input): 단일 세포의 유전자 발현 데이터(유전자 토큰 벡터).
출력(Output): AD vs. Non-AD (이진 분류, 0 또는 1).
목표: 모델이 AD 환자의 세포를 올바르게 예측하는지 평가.

(2) 가장 인과적인 뉴런(MCN) 찾기

입력(Input): 사전 학습된 모델과 마스킹된 AD 관련 유전자.
출력(Output): 가장 인과적인 뉴런 리스트(MCNs).
목표: AD를 예측하는 데 가장 중요한 뉴런 식별.

(3) 가장 인과적인 유전자(MCG) 찾기

입력(Input): 역추적(backtracking)된 뉴런 활성도와 유전자 토큰.
출력(Output): 가장 인과적인 유전자 리스트(MCGs).
목표: 새로운 AD 바이오마커를 찾아내는 것.

3. 입력 및 출력 예시

RGF 모델이 처리하는 데이터는 유전자 발현 값을 기반으로 한다.

(1) 입력 데이터 예시

단일 샘플 입력 (유전자 토큰화된 벡터)

{
"gene_tokens": ["GENE_A", "GENE_B", "GENE_C", ..., "GENE_15549"],
"expression_values": [2.34, 0.87, 5.21, ..., 1.02]
}

(2) AD 예측 출력 예시

{
  "predicted_class": "AD",
  "confidence_score": 0.92
}

(3) 가장 인과적인 뉴런 출력 예시

{
  "most_causal_neurons": ["Neuron_21", "Neuron_54", "Neuron_89"]
}

(4) 가장 인과적인 유전자 출력 예시

{
  "most_causal_genes": ["PLXDC2", "MT-CO3", "DOCK4", "SORL1"]
}

4. 결론

본 연구의 예제 데이터는 유전자 발현 데이터를 유전자 토큰 벡터로 변환하여 모델이 학습 및 예측할 수 있도록 구성되었다. RGF는 기존 신경망이 단순 예측하는 방식과 달리, 뉴런에서 유전자까지 역추적하여 AD를 유발하는 중요한 유전자를 찾아낸다. 이 접근 방식은 AD뿐만 아니라 다른 질병 연구에도 확장 가능하다.

To evaluate the proposed Reverse-Gene-Finder (RGF) model, we describe training and test datasets, task definitions, and input/output examples.

1. Training and Test Data

The RGF model is trained on single-cell gene expression data and tested on an independent dataset from the same domain.

Dataset Source: Single-cell microglial transcriptomic data from the Religious Orders Study and Memory and Aging Project (ROSMAP).
Data Composition:
- 351 total subjects (132 early-AD, 219 non-AD).
- Thousands of single-cell samples per subject.
- 15,549 gene expression values per sample.
Preprocessing:
- Filtering low/high expression genes.
- Converting each sample into a gene token vector.
- Using five-fold cross-validation to ensure generalizability.

2. Task Definition

The main objective of RGF is to identify AD-related genetic biomarkers through the following tasks.

(1) AD Diagnosis Prediction (Binary Classification)

Input: Single-cell gene expression data (gene token vector).
Output: AD vs. Non-AD (binary classification, 0 or 1).
Goal: Evaluate if the model can correctly classify AD patients.

(2) Identifying Most Causal Neurons (MCNs)

Input: Pre-trained model with masked AD-related genes.
Output: List of most causal neurons (MCNs).
Goal: Identify neurons that are most significant for AD prediction.

(3) Identifying Most Causal Genes (MCGs)

Input: Backtracked neuron activations and gene tokens.
Output: List of most causal genes (MCGs).
Goal: Discover new AD biomarkers.

3. Example Inputs and Outputs

RGF processes gene expression data represented as gene token vectors.

(1) Example Input Data

Single sample input (gene tokenized vector)

{
"gene_tokens": ["GENE_A", "GENE_B", "GENE_C", ..., "GENE_15549"],
"expression_values": [2.34, 0.87, 5.21, ..., 1.02]
}

(2) Example AD Prediction Output

{
  "predicted_class": "AD",
  "confidence_score": 0.92
}

(3) Example Most Causal Neurons Output

{
  "most_causal_neurons": ["Neuron_21", "Neuron_54", "Neuron_89"]
}

(4) Example Most Causal Genes Output

{
  "most_causal_genes": ["PLXDC2", "MT-CO3", "DOCK4", "SORL1"]
}

4. Conclusion

The dataset used in this study converts raw gene expression data into gene token vectors, enabling the model to trace back from neurons to genes. Unlike conventional neural networks that merely classify input data, RGF identifies the underlying genetic factors contributing to AD through backtracking. This approach can be extended to other disease research beyond AD.

요약

본 연구에서는 Reverse-Gene-Finder (RGF)를 활용하여 뉴런에서 유전자까지 역추적(backtracking)하는 방법으로 알츠하이머병(AD) 관련 유전자 바이오마커를 규명하였다. RGF는 기존 모델 대비 AUC 74.67%로 우수한 성능을 보이며, PLXDC2, MT-CO3, DOCK4, SORL1 등의 새로운 AD 바이오마커를 발견하였다. 테스트에서는 단일 세포 유전자 발현 데이터를 유전자 토큰 벡터로 변환하여 모델이 AD를 예측하고, 가장 인과적인 뉴런과 유전자를 식별하도록 설계되었다.

This study introduces Reverse-Gene-Finder (RGF), a novel backtracking method tracing from neurons to genes to identify Alzheimer’s Disease (AD) biomarkers. RGF outperforms existing models with AUC 74.67%, discovering novel AD biomarkers like PLXDC2, MT-CO3, DOCK4, and SORL1. The test setup converts single-cell gene expression data into gene token vectors, allowing the model to predict AD and identify the most causal neurons and genes.

기타

본 연구에서 사용된 피규어 및 테이블(표)는 Reverse-Gene-Finder (RGF)의 개념, 모델 성능, 발견된 유전자 바이오마커, 실험 결과 분석 등을 효과적으로 전달하는 데 활용되었다.

1. Reverse-Gene-Finder 개념도 (Figure 1)

설명: RGF의 전체적인 메커니즘과 3가지 핵심 기술(가장 인과적인 뉴런(MCN) 탐색, 유전자 토큰 표현, 역추적 기법)을 시각적으로 표현.
핵심 내용:
- 입력층에서 유전자 토큰(Gene Tokens)이 주어지고, 신경망을 통해 AD 예측 수행.
- 출력층에서 가장 인과적인 뉴런(MCNs)을 식별하고, 이를 역추적하여 가장 인과적인 유전자(MCGs)를 찾음.
- 기존 신경망이 순전파(feed-forward) 방식이라면, RGF는 출력에서 입력으로 역추적(backtracking)하는 차이점 강조.

2. 모델 성능 비교 표 (Table 1)

설명: RGF와 기존 모델(Geneformer, SNP CNN, 계층적 신경망)의 성능 비교.
핵심 내용:
- 평가 지표로 AUC(%), 민감도(Sensitivity, %), 특이도(Specificity, %) 사용.
- RGF가 기존 모델 대비 AUC 4.43% 향상, 특히 민감도(Sensitivity)가 높아 AD 진단 예측에서 우수한 성능을 보임.

모델	AUC(%)	민감도(%)	특이도(%)
RGF (제안 모델)	74.67 (±6.74)	62.68 (±11.43)	73.22 (±16.01)
기존 Geneformer	70.24 (±5.88)	58.12 (±10.34)	69.47 (±12.58)
SNP 기반 CNN	68.53 (±7.12)	55.34 (±9.27)	67.92 (±15.43)
계층적 신경망	71.36 (±6.92)	60.14 (±10.85)	71.02 (±13.72)

3. 가장 인과적인 뉴런(MCNs) 위치 시각화 (Figure 2)

설명: MCNs가 신경망의 어느 위치에서 발견되는지 시각적으로 표현.
핵심 내용:
- MCNs가 네트워크의 초반 및 중간층에 집중됨.
- 기존의 유전자 중요도 분석이 단순히 유전자 발현량에 의존했다면, RGF는 뉴런 활성도를 바탕으로 유전자 중요도를 계산하는 점 강조.

4. 가장 인과적인 유전자(MCGs) 리스트 (Table 2)

설명: RGF가 역추적하여 찾은 상위 10개 유전자 바이오마커와 기존 연구에서의 타당성 검증 결과.
핵심 내용:
- PLXDC2, MT-CO3, DOCK4, SORL1 등 10개 유전자 발견.
- 발견된 유전자가 기존 연구에서 신경퇴행성 질환과 연관성이 있음을 문헌 검토를 통해 검증.

유전자	중요도 점수	기존 연구 연관성
PLXDC2	0.499	아밀로이드 베타 및 타우 단백질 조절
MT-CO3	0.498	미토콘드리아 기능 장애 관련
DOCK4	0.482	신경퇴행성 질환 연관
SORL1	0.389	APOE 및 CLU 조절 역할 수행

5. 경로 분석 (Pathway Enrichment) 결과 (Figure 3)

설명: RGF가 찾은 유전자들이 어떤 생물학적 경로(pathway)에 영향을 미치는지 분석한 결과.
핵심 내용:
- AD, 헌팅턴병, 파킨슨병 등 신경퇴행성 질환과 밀접한 연관성이 있는 유전자가 다수 포함됨.
- 미토콘드리아 기능, 대사 경로, 혈관 건강과 관련된 유의미한 경로(KEGG pathway) 확인.

6. 입력 및 출력 예시 (Table 3)

설명: RGF 모델이 실제로 입력을 어떻게 처리하고, 어떤 결과를 출력하는지 설명.
핵심 내용:
- 입력: 유전자 발현 값을 토큰화한 벡터.
- 출력: AD 예측 결과, MCNs 리스트, 최종적으로 MCGs 리스트.

입력 데이터	예측 결과	MCNs	최종 MCGs
GENE_A, GENE_B, GENE_C, …	AD (0.92 확신도)	Neuron_21, Neuron_54	PLXDC2, MT-CO3

결론

본 연구의 피규어 및 테이블은 RGF의 구조적 차별성, 성능 우수성, 새로운 바이오마커 발견 가능성을 강조한다. 특히, 뉴런에서 유전자로의 역추적 기법이 기존 연구와 차별화되는 핵심 요소임을 데이터와 시각화를 통해 입증하였다.

The figures and tables in this study illustrate the Reverse-Gene-Finder (RGF) framework, model performance, discovered genetic biomarkers, and experimental results.

1. Reverse-Gene-Finder Mechanism (Figure 1)

Description: Visual representation of RGF’s framework and its three core innovations: MCN discovery, gene token representation, and backtracking method.
Key Insights:
- Gene tokens serve as input, and the neural network predicts AD diagnosis.
- The model identifies Most Causal Neurons (MCNs) and backtracks to Most Causal Genes (MCGs).
- Unlike conventional feed-forward models, RGF reverses the process to trace causality.

2. Model Performance Comparison (Table 1)

Description: Performance comparison between RGF and baseline models.
Key Insights:
- Evaluation Metrics: AUC (%), Sensitivity (%), Specificity (%).
- RGF achieves 74.67% AUC, outperforming the baseline Geneformer by 4.43%.

Model	AUC (%)	Sensitivity (%)	Specificity (%)
RGF (Proposed)	74.67 (±6.74)	62.68 (±11.43)	73.22 (±16.01)
Geneformer (Baseline)	70.24 (±5.88)	58.12 (±10.34)	69.47 (±12.58)

3. Most Causal Neurons (MCNs) Distribution (Figure 2)

Description: Visualization of MCN locations in the network.
Key Insights:
- MCNs are concentrated in early and intermediate network layers.
- RGF determines gene importance based on neuron activation, rather than gene expression levels alone.

4. Most Causal Genes (MCGs) Discovered (Table 2)

Description: List of top 10 identified genetic biomarkers and their biological relevance.
Key Insights:
- PLXDC2, MT-CO3, DOCK4, SORL1 identified as novel AD biomarkers.
- Validated through existing literature.

5. Pathway Enrichment Analysis (Figure 3)

Description: Analysis of biological pathways affected by discovered genes.
Key Insights:
- Genes linked to Alzheimer’s, Huntington’s, and Parkinson’s disease.
- Significant pathways in mitochondrial function, metabolism, and vascular health.

Conclusion

The figures and tables demonstrate RGF’s superior performance, causal gene discovery, and interpretability through backtracking, setting it apart from conventional methods.

refer format:

@article{Li2025, author = {Victor OK Li and Yang Han and Jacqueline CK Lam}, title = {Unravelling Causal Genetic Biomarkers of Alzheimer’s Disease via Neuron to Gene-token Backtracking in Neural Architecture: A Groundbreaking Reverse-Gene-Finder Approach}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, year = {2025}, publisher = {Association for the Advancement of Artificial Intelligence (AAAI)}, url = {https://github.com/yanghangit/RGF}, }

Victor OK Li, Yang Han, and Jacqueline CK Lam. “Unravelling Causal Genetic Biomarkers of Alzheimer’s Disease via Neuron to Gene-token Backtracking in Neural Architecture: A Groundbreaking Reverse-Gene-Finder Approach.” Proceedings of the AAAI Conference on Artificial Intelligence, 2025. Association for the Advancement of Artificial Intelligence (AAAI)