[Paper Review] PARROT: MULTILINGUAL VISUAL INSTRUCTION TUNING

민사민서 2024. 10. 31. 23:48

https://github.com/AIDC-AI/Parrot

GitHub - AIDC-AI/Parrot: 🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.

Abstract & Introduction

Existing MLLMs are trained with Supervised Fine-Tuning (SFT)

They rely mainly on a pretrained LLM and a vision encoder
The focus is on aligning the vision encoder with the LLM to give the LLM multimodal abilities
- the emphasis is on bridging the modality gap by aligning visual features with language embedding tokens
- usually done with a Q-Former or an MLP projector
After alignment training, models can show multilingual erosion
- the model loses its ability to understand, process, or generate in non-English languages
- e.g. LLaVA usually responds in English, regardless of the input language

This happens because the visual tokens are not sufficiently aligned with textual tokens in other languages

In this paper, the authors

provide textual guidance to drive visual token alignment at the language level
convert visual tokens into language-specific embeddings using a Mixture-of-Experts (MoE) module
- compute cross attention between the class token of the visual features and the textual token embeddings
- pass the result through the MoE router to get the activation probability distribution over the language experts
- convert the English-biased visual tokens into language-specific embeddings

They also collected and released a multilingual multimodal benchmark with 6 languages, 15 categories, and 12,000 questions. Unfortunately Korean is not included (and the dataset construction method isn't spelled out)

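Limitations of existing MLLM benchmarks

Outdated benchmarks (the problems are relatively easy, so they no longer suit the performance of current models)
Non-standardized evaluation (LLaVA-Bench and others rely on GPT-4 for scoring, which hurts reproducibility; some also fail to provide consistent test samples across languages)
Limited languages (often restricted to English/Chinese)

What a good benchmark should have:

- Languages with significant differences = pick distinct, non-overlapping languages so that diverse language families are covered

- Medium-difficulty problems = the goal should be to evaluate multilingual understanding, processing, and generation rather than logical reasoning

- Multilingual and multimodal tasks = the data should not be closely tied to English (e.g. avoid reasoning over code written in English), and questions that emphasize a meaningful correlation between image and text are essential

- Content consistency across languages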

Methods

Might there be an inherent mismatch between the visual tokens H_v (after the vision encoder + projection layer) and the text tokens H_t?

So a multilingual MoE module is added after the projection layer

Two-step training

- Disable the MoE module, freeze the LLM + vision encoder, and train only the projector (pre-training)

- Train the LLM + MoE + projector (instruction tuning)

Looking at the code, parrot_arch.py passes the input word embeddings and the image embeddings (after the projection layer) to the MoE model

moe_output = self.get_moe_model()(cur_input_embeds_1[question_attention_mask], image_features[cur_image_idx])

# MoE module forward using cur_input_embeds and cur_image_features
moe_output = self.get_moe_model()(cur_input_embeds[torch.cat(cur_question_attention_mask)],
                                                image_features[cur_image_idx])

The MoE is nothing grand: it is just a combination of cross attention + experts (linear layers)

def forward(self, input_embeds, image_features):
    # for a single image and a single text
    if not self.use_moe:
        return image_features

    assert self.config.mm_vision_select_feature == 'cls_patch'
    cls_token = image_features[0:1, :]  # 1, hidden
    sequence_length, hidden_dim = image_features.shape

    # cross attention for image and text
    scores = torch.matmul(cls_token, input_embeds.transpose(-2, -1)) / torch.sqrt(torch.tensor(cls_token.size(-1)))
    attention_weight = F.softmax(scores, dim=-1)
    embeds_for_gating = torch.matmul(attention_weight, input_embeds)

    router_logits = self.gate(embeds_for_gating)
    routing_weights = F.softmax(router_logits, dim=-1, dtype=image_features.dtype)  # 1, expert

    experts_output = torch.stack([expert(image_features) for expert in self.experts], dim=0)

    final_hidden_states = torch.sum(routing_weights[0].unsqueeze(-1).unsqueeze(-1) * experts_output, dim=0)

    if self.use_moe_residual:
        final_hidden_states += self.moe_weight * image_features

    return final_hidden_states

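Reading the code:

- Cross attention is computed with the CLS token of the image features as Q and the input word embeddings as K and V, which gives embeds_for_gating

- That goes through the MoE router (the gate layer followed by a softmax) to get the routing weights

- The image features go through every expert (linear layers), and the routing-weighted sum of the expert outputs is the final embedding (plus moe_weight * image_features when use_moe_residual is set)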
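
To trace these three steps with concrete numbers, here is a dependency-free toy version in plain Python. It is a sketch, not the repo's module: each expert is assumed to be a single bias-free linear map given as a matrix, the gate is likewise a plain matrix, and the names moe_forward, gate, and experts are made up for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a flat list
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # (rows x cols) matrix times a length-cols vector
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def moe_forward(text_embeds, image_features, gate, experts, moe_weight=None):
    """text_embeds: T x d, image_features: N x d (row 0 is the CLS token),
    gate: E x d, experts: list of E matrices of shape d x d (assumed linear)."""
    cls = image_features[0]
    d = len(cls)

    # Cross attention: CLS token is the query, text embeddings are keys and values
    scores = [sum(c * t for c, t in zip(cls, tok)) / math.sqrt(d) for tok in text_embeds]
    attn = softmax(scores)
    gating_input = [sum(a * tok[j] for a, tok in zip(attn, text_embeds)) for j in range(d)]

    # Router: one logit per expert, turned into a probability distribution
    routing = softmax(matvec(gate, gating_input))

    # Dense mixture: every expert transforms every visual token
    output = []
    for token in image_features:
        expert_outs = [matvec(w, token) for w in experts]
        mixed = [sum(p * eo[j] for p, eo in zip(routing, expert_outs)) for j in range(d)]
        if moe_weight is not None:  # optional residual, as in use_moe_residual
            mixed = [m + moe_weight * x for m, x in zip(mixed, token)]
        output.append(mixed)
    return output, routing


# Tiny made-up inputs: 2 text tokens, 2 visual tokens, hidden size 2, 2 experts
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.5, 0.5]]
gate = [[1.0, 0.0], [0.0, 1.0]]
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
out, routing = moe_forward(text, image, gate, experts)
print(routing, out)
```

Note that the text only influences the routing weights; the experts themselves are applied to the visual tokens, which is why the output keeps the same shape as image_features.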

Theoretical explanation of the Methods

Parrot's key features:

- It adapts the English-biased visual features obtained from a vision encoder such as CLIP so that they suit other languages

- It improves multilingual capability by providing language-specific visual tokens

It proceeds in two steps

1. text guidance to drive visual token alignment

Cross attention between H_v, the embedding tokens after the vision encoder + projector, and H_t, the text input embedded via the word embedding table

2. apply Mixture-of-Experts module

But why are they called experts? Not sure...

- router = a linear layer that generates a probability distribution over the set of experts [e1, e2, … e_E]

- Each expert is an MLP designed to convert English-biased embeddings into language-specific embeddings

- same dimension output

(3) select and activate the most relevant language experts

(4) obtain the language-specific embeddings

(5) employ MoE reweighting to convert visual embeddings with less variance in original visual-semantic information
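
Written as equations, the forward pass quoted earlier looks roughly like this. It is a reading of that code rather than the paper's exact notation: h_cls (the first row of H_v), T (number of text tokens), N (number of visual tokens), d (hidden size), E (number of experts), and alpha are labels introduced here. alpha stands for moe_weight, and the last term applies only when use_moe_residual is set.

```latex
\begin{align}
a &= \mathrm{softmax}\!\left(\frac{h_{\mathrm{cls}} H_t^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{1 \times T} \\
g &= a\, H_t \in \mathbb{R}^{1 \times d} \\
p &= \mathrm{softmax}\big(\mathrm{gate}(g)\big) \in \mathbb{R}^{1 \times E} \\
\hat{H}_v &= \sum_{i=1}^{E} p_i\, e_i(H_v) + \alpha H_v \in \mathbb{R}^{N \times d}
\end{align}
```

Here a is the attention over text tokens, g is the text summary fed to the router, p is the routing distribution over experts, and e_i is the i-th expert.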

Training

Goal: improve multilingual capability while using as little multilingual data as possible

Stage 1: Modality Alignment

keep both the vision encoder and the LLM weights frozen, focusing solely on optimizing the projectors to align the visual features H_v with the pretrained LLM word embedding

Stage 2: Instruction Tuning for Multilingual Alignment

We still keep the vision encoder weights frozen while continuing to train the projector, MoE, and LLM
rapidly learn to align visual representations across multiple languages by using a small amount of multilingual image-text data
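
The freeze schedule of the two stages can be written down as a small table in code. This is only a sketch of what the notes above describe: the module labels (vision_encoder, projector, moe, llm) and the names StageConfig, STAGES, and frozen_modules are illustrative, not identifiers from the Parrot repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    trainable: frozenset  # module labels whose weights are updated
    use_moe: bool         # whether the MoE module is active in the forward pass


# Illustrative labels for the four parts of the model
ALL_MODULES = frozenset({"vision_encoder", "projector", "moe", "llm"})

STAGES = {
    # Stage 1 (modality alignment): only the projector is trained, MoE is disabled
    1: StageConfig(trainable=frozenset({"projector"}), use_moe=False),
    # Stage 2 (instruction tuning): projector, MoE, and LLM are trained
    2: StageConfig(trainable=frozenset({"projector", "moe", "llm"}), use_moe=True),
}


def frozen_modules(stage):
    # Everything that is not trainable in this stage stays frozen
    return ALL_MODULES - STAGES[stage].trainable


for stage, cfg in sorted(STAGES.items()):
    print(stage, sorted(cfg.trainable), sorted(frozen_modules(stage)), cfg.use_moe)
```

In a PyTorch training script this kind of table would typically be applied by turning off gradients on the parameters of the frozen modules. The vision encoder stays frozen in both stages.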

For reference, it is built on the LLaVA architecture

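The hyperparameters used for training are as follows (the table was an image in the original post and is not reproduced here)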

Experiments

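Reportedly, the results are good

In short, it adds one extra stage (the MoE module) to the Vision Encoder + Projector + Backbone LLM structure used in existing models such as LLaVA-OneVision and Qwen-VL

That is, a cross attention + MLP layers stage is added to give the projected embeddings more multilingual ability

Whether it also works well for Korean

Whether the overhead is small enough

Things like that would be worth checking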
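
For the overhead question, a rough parameter count is an easy first check. The sketch below assumes each expert is a two-layer MLP with an intermediate width expert_hidden_dim and that the gate is a single linear layer. The function names and the example sizes are hypothetical, not taken from the Parrot repo.

```python
def linear_params(in_dim, out_dim, bias=True):
    # Weight matrix plus optional bias vector of one linear layer
    return in_dim * out_dim + (out_dim if bias else 0)


def moe_extra_params(hidden_dim, num_experts, expert_hidden_dim, bias=True):
    # Gate: hidden_dim -> one logit per expert
    gate = linear_params(hidden_dim, num_experts, bias)
    # Assumed expert shape: hidden_dim -> expert_hidden_dim -> hidden_dim
    expert = (linear_params(hidden_dim, expert_hidden_dim, bias)
              + linear_params(expert_hidden_dim, hidden_dim, bias))
    return gate + num_experts * expert


# Hypothetical sizes, only to get an order of magnitude
extra = moe_extra_params(hidden_dim=4096, num_experts=6, expert_hidden_dim=4096)
print(f"{extra / 1e6:.1f}M extra parameters")
```

Parameter count is only part of the picture: in the forward quoted above, every expert runs on all visual tokens, so the added compute also grows with the number of experts.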
