
LLaVA-OneVision (open-source VLM)

민사민서 2024. 8. 15. 17:40

I heard that LLaVA-OneVision, the successor to LLaVA-NeXT, has been released.

https://github.com/LLaVA-VL/LLaVA-NeXT

 


 

I know very little about LLMs, but I decided to read the paper anyway.

 

LLaVA-related paper reviews

Post 1 on LLaVA

Post 2 on LLaVA

Post 1 on LLaVA-NeXT

Post 2 on LLaVA-NeXT

 

Goals

- Aims to fill the gap by demonstrating state-of-the-art performance across a broad range of tasks

- Showcases interesting emerging capabilities through cross-scenario task transfer and composition

 

Architecture

  • LLM: Qwen-2 is chosen as the language backbone
  • Vision Encoder: SigLIP
  • Projector: a 2-layer MLP (see the sketch below)
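
A minimal sketch of what that projector could look like, assuming the usual LLaVA-style MLP-with-GELU design. The dimensions (1152 for SigLIP-SO400M patch features, 3584 for a Qwen2-7B-scale hidden size) and the class name are my own illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP that maps vision-encoder features into the LLM embedding space.

    Dimensions are illustrative assumptions: 1152-d SigLIP-SO400M patch features,
    3584-d hidden size for a Qwen2-7B-scale LLM.
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),                      # LLaVA-style "mlp2x_gelu" projector
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from SigLIP
        # returns visual tokens aligned with the LLM's word-embedding space
        return self.mlp(patch_features)

# e.g. 729 patch tokens (27 x 27) from SigLIP-SO400M at 384px input
tokens = VisionProjector()(torch.randn(1, 729, 1152))
print(tokens.shape)  # torch.Size([1, 729, 3584])
```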

Data

The authors consider continuous exposure of the model to new, high-quality data for further knowledge acquisition to be important.

More than 99% of the data is synthetic.

  • Re-Captioned Detailed Description Data
  • Document / OCR Data
  • Chinese and Language Data

Single-image data (for multimodal capabilities) + OneVision data (a mixture of video, image, and multi-image data)

https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data

The dataset is available on Hugging Face.
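
A quick, hedged way to poke at it with the `datasets` library; I'm not asserting any particular subset or field names, so the sketch just lists the available configs and streams one sample:

```python
from datasets import get_dataset_config_names, load_dataset

# The repo is split into many named subsets; list them instead of guessing names.
configs = get_dataset_config_names("lmms-lab/LLaVA-OneVision-Data")
print(configs[:5])

# Stream one subset (whichever comes first) to peek at a sample without
# downloading everything; assumes each subset exposes a "train" split.
ds = load_dataset("lmms-lab/LLaVA-OneVision-Data", configs[0],
                  split="train", streaming=True)
print(next(iter(ds)).keys())
```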

 

Three-stage training strategy

  • Stage-1: Language-Image Alignment. The goal is to align the visual features well with the word embedding space of the LLM.
  • Stage-1.5: High-Quality Knowledge Learning. To strike a balance between compute efficiency and injecting new knowledge into LMMs, the authors recommend considering high-quality knowledge for LMM learning.
  • Stage-2: Visual Instruction Tuning. To teach the LMM to solve a diverse set of visual tasks with preferred responses, the instruction data is organized into different groups, and the model is trained on these groups in order.
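
Here is how the three stages might translate into code, under the common LLaVA-style assumption that Stage-1 updates only the projector while Stage-1.5 and Stage-2 fine-tune the whole model; the module names (`vision_encoder`, `projector`, `llm`) are hypothetical:

```python
import torch.nn as nn

# Hypothetical per-stage trainability plan (True = parameters are updated).
STAGES = {
    "stage1_alignment":   {"vision_encoder": False, "projector": True, "llm": False},
    "stage1.5_knowledge": {"vision_encoder": True,  "projector": True, "llm": True},
    "stage2_instruction": {"vision_encoder": True,  "projector": True, "llm": True},
}

def set_trainable(model: nn.Module, stage: str) -> None:
    """Freeze/unfreeze the assumed submodules according to the stage plan."""
    plan = STAGES[stage]
    for name in ("vision_encoder", "projector", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = plan[name]
```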

Significance

The largest model, LLaVA-OneVision-72B, delivers performance between GPT-4V and GPT-4o on most benchmarks.

A relatively larger gap remains in complex tasks such as visual chat scenarios ⇒ Qwen-2 is the backbone for now, but what happens once a stronger LLM comes out and the training dataset is strengthened further?

 

Benchmark evaluation

There are too many benchmarks to go through here, so the tables in the paper are the best reference, but briefly:

(On single-image benchmarks)

- It surpasses GPT-4V on most of them and approaches the performance level of GPT-4o

- It matches or outperforms open-source models of similar parameter size

 

- On multi-image benchmarks it outperforms existing multi-image LMMs across the board, and in certain domains it even beats GPT-4V

- On video benchmarks it reportedly outperforms previous open-source models, even larger ones

 

Possible uses?

So it can interpret inputs in combination...
Hmm... could it do better than AppAgent or other open-source agents?
It also seems quite capable of relating videos and images to each other.

 

I tried the demo first; Korean isn't quite there yet (probably because it was trained on Chinese language data).

 

OCR and image understanding are definitely really strong, though.
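
If I do end up dropping it into a project, local inference would look roughly like this. The class and checkpoint names below are how I remember the Hugging Face `transformers` integration, so treat the model id, the image path, and the prompt as placeholder assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed community checkpoint name; swap in whichever size you actually want.
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("sample_receipt.png")  # placeholder local image
conversation = [{
    "role": "user",
    "content": [{"type": "image"},
                {"type": "text", "text": "Read all the text in this image."}],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```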

 

Time to try it in a project.