[2024] Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
February 23, 2024
Abstract (short summary):
* LLMs have demonstrated the ability to answer many kinds of questions across diverse domains
* They encapsulate a vast amount of factual information within their pre-trained weights
* However, this knowledge is inherently limited and depends heavily on the characteristics of the training data
* Using external datasets to inject new information, or to improve the model's command of information it already has, is therefore a significant challenge
* The core of this paper is a comparison of two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG)
* The authors evaluate both approaches on knowledge-intensive tasks across a range of topics
* The evaluation shows that while unsupervised fine-tuning provides some improvement, RAG consistently outperforms it, both for knowledge encountered during training and for entirely new knowledge
* In particular, they find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training can alleviate this problem

Useful sentences:
* Limitations of LLMs: their knowledge is static, never updated over time, and non-specific, lacking fine-grained expertise in particular domains
* Adapting LLMs to specific domains and updating their knowledge has recently become increasingly common, and various models have been proposed to improve factual knowledge and capabilities in fields such as healthcare, finance, and law
* This work focuses on evaluating a model's knowledge and its ability to memorize, understand, and retrieve factual data
* Given a knowledge base in the form of a text corpus, the authors try to understand the best way to teach this knowledge to a pre-trained model
* One way to add knowledge to a pre-trained model is fine-tuning: continuing the model's training process and adapting it using task-specific data
* Fine-tuning is very effective at improving the overall quality of the model, but it does not necessarily teach the model new knowledge
* Retrieval-augmented generation (RAG) is a technique that extends LLM capabilities, especially on knowledge-intensive tasks, by using external knowledge sources
* Given an auxiliary knowledge base B_Q and a pre-trained embedding model M_e, an embedding is created for each document b in B_Q and stored in a vector store
* Given a new query q, its embedding M_e(q) is used to retrieve the top-K nearest neighbors b_q = {b_k}_{k=1}^K closest to q under dot-product ranking (see the sketch after this list)
* q is then updated by concatenating b_q with q, giving q̃ = b_q ∥ q, and the model's output M(q̃) is returned
* The authors collected relevant chunks from Wikipedia and then created a new multiple-choice dataset with the help of GPT-4
* This dataset consists of very specific, high-quality multiple-choice questions
* The experimental framework uses the LM-Evaluation-Harness repository to evaluate LLM performance on the selected knowledge-intensive tasks
* This platform ensures a standardized evaluation framework and allows consistent comparison across models, methods, and datasets
* For inference evaluation, three models were selected: Llama2-7B, Mistral-7B, and Orca2-7B
* These models represent the most popular open-source base and instruction-tuned models
* On Anatomy (0-shot), Mistral-7B with RAG performed best, with an accuracy of 0.681
* On Astronomy (0-shot), Orca2-7B with RAG performed best, with an accuracy of 0.750
* On College Biology (0-shot), Mistral-7B with fine-tuning + RAG performed best, with an accuracy of 0.764
* On College Chemistry (0-shot), Mistral-7B with RAG performed best, with an accuracy of 0.500
* On Prehistory (0-shot), Mistral-7B with RAG performed best, with an accuracy of 0.750
* On Current Events, Orca2-7B with RAG performed best, with an accuracy of 0.876
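A minimal sketch of the retrieval step described above, assuming a sentence-transformers encoder stands in for the embedding model M_e (the model name `all-MiniLM-L6-v2` is an assumption; this summary does not say which embedding model the authors used), with dot-product ranking and the b_q ∥ q concatenation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for M_e (assumption)

def build_vector_store(docs: list[str]) -> np.ndarray:
    """Embed every document b in the auxiliary knowledge base B_Q."""
    return np.asarray(encoder.encode(docs))

def retrieve(query: str, docs: list[str], doc_embs: np.ndarray, k: int = 3) -> list[str]:
    """Return the top-K documents b_q closest to q under dot-product ranking."""
    q_emb = encoder.encode([query])[0]
    scores = doc_embs @ q_emb             # dot products M_e(b) · M_e(q)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the K highest scores
    return [docs[i] for i in top_k]

def augment_query(query: str, docs: list[str], doc_embs: np.ndarray, k: int = 3) -> str:
    """Form q̃ = b_q ∥ q: prepend the retrieved context to the query."""
    context = "\n".join(retrieve(query, docs, doc_embs, k))
    return f"{context}\n{query}"          # the generator then returns M(q̃)
```

For example, `augment_query("What orbits the Earth?", docs, build_vector_store(docs), k=1)` would prepend the single closest document to the question before it is passed to the generator.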
Useful sentences 2:
* RAG consistently outperformed just fine-tuning
* Using RAG with the base model as the generator was better than only fine-tuning
* RAG was particularly effective for the current events task due to the direct match between the questions and the auxiliary dataset
* Fine-tuning wasn't competitive with RAG
* However, fine-tuning with multiple paraphrases provided a significant improvement over the baseline
* Combining RAG with fine-tuning didn't perform as well as RAG alone
* For tasks with new information, such as current events not seen during pre-training, standard fine-tuning did not improve and even degraded Llama2's performance
* They explored data augmentation using paraphrases to improve fine-tuning results
* Data augmentation is a well-established method for enhancing language model performance
* They used generative models for the augmentations, an approach that has previously been used successfully to improve classification models
* The approach showed a direct correlation between the number of paraphrases used and model accuracy
* The accuracy of all models tested increased monotonically with the number of paraphrases used, suggesting a positive impact of paraphrase augmentation on the model's ability to understand and generalize new knowledge
* An interesting phenomenon observed was a significant drop in training loss after each epoch, consistent with LLMs memorizing data during training and overfitting
* Their hypothesis is that to teach pre-trained LLMs new knowledge, the information must be repeated in numerous ways (see the sketch at the end of this note)

Idea?:
* The MMLU benchmark and tasks like it seem quite useful
* Paraphrase generation also looks worth putting to real use; the authors report it actually raised performance
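Following up on the paraphrase-augmentation idea above, a minimal sketch of building an augmented fine-tuning set. The `paraphrase` helper is hypothetical, backed here by a GPT-4-class chat model (the paper used GPT-4 for generation, but this prompt and helper are assumptions, not the authors' exact method):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def paraphrase(fact: str, n: int) -> list[str]:
    """Hypothetical helper: ask a GPT-4-class model for n rewordings of a fact."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite the following fact {n} times, "
                       f"each in different words, one per line:\n{fact}",
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()][:n]

def build_augmented_dataset(facts: list[str], n_paraphrases: int,
                            path: str = "train.jsonl") -> None:
    """Write each fact plus its paraphrases to a JSONL fine-tuning file."""
    with open(path, "w") as f:
        for fact in facts:
            # Each fact lands in the training file several times, phrased
            # differently, which is the repetition the authors hypothesize
            # is needed for fine-tuning to instill new knowledge.
            for text in [fact, *paraphrase(fact, n_paraphrases)]:
                f.write(json.dumps({"text": text}) + "\n")
```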