
LLaVA-OneVision (open-source VLM)


I heard that LLaVA-OneVision, the successor to LLaVA-NeXT, has been released.

https://github.com/LLaVA-VL/LLaVA-NeXT

 


 

I know almost nothing about LLMs, but I decided to read the paper anyway.

 

Reviews of related LLaVA papers

LLaVA post 1

LLaVA post 2

LLaVA-NeXT post 1

LLaVA-NeXT post 2

 

Goals

- Aims to fill the gap by demonstrating state-of-the-art performance across a broad range of tasks

- Showcases interesting emerging capabilities through cross-scenario task transfer and composition

 

Architecture

  • LLM: Qwen-2 ("We choose Qwen-2")
  • Vision Encoder: SigLIP
  • Projector: a 2-layer MLP
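
To make the three components concrete, here is a minimal conceptual sketch of how they connect: SigLIP produces patch features, the 2-layer MLP projects them into the LLM's embedding space, and the projected visual tokens go into Qwen-2 alongside the text tokens. The dimensions and module names are my own illustration, not the official implementation.

```python
# Conceptual sketch only: SigLIP-style patch features -> 2-layer MLP projector -> LLM space.
# The dimensions (1152 for SigLIP, 3584 for Qwen2-7B) are illustrative assumptions.
import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's word-embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.mlp(image_features)  # (batch, num_patches, llm_dim)

# The projected visual tokens are concatenated with the text token embeddings
# and consumed by the LLM (Qwen-2) as one sequence.
projector = TwoLayerProjector()
dummy_patches = torch.randn(1, 729, 1152)   # e.g. 27x27 patches for a 384px SigLIP input
print(projector(dummy_patches).shape)       # torch.Size([1, 729, 3584])
```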

Data

The authors consider "continuous exposure of the model to new, high-quality data for further knowledge acquisition" to be important.

More than 99% of the data is synthetic.

  • re-Captioned Detailed Description Data
  • Document / OCR Data
  • Chinese and Language Data

Single-image data (for multimodal capabilities) + OneVision data (a mixture of video, image, and multi-image data)

https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data

The dataset is available on Hugging Face.
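
If you just want to poke around in the data, here is a quick sketch using the `datasets` library. The repo is split into many named subsets, so you list the config names first and then load one; the field names printed at the end are what I'd expect (an image plus conversation-style annotations), not something I've verified.

```python
# Peek at the OneVision training data on Hugging Face; the repo is organized into many
# named subsets, so list the config names first and then load one of them.
from datasets import get_dataset_config_names, load_dataset

repo = "lmms-lab/LLaVA-OneVision-Data"
configs = get_dataset_config_names(repo)
print(len(configs), configs[:5])          # pick a subset name from this list

subset = load_dataset(repo, configs[0], split="train")
print(subset[0].keys())                   # typically an image plus conversation-style fields
```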

 

Three-stage training strategy

  • Stage-1: Language-Image Alignment. The goal is to align the visual features well with the word embedding space of the LLM.
  • Stage-1.5: High-Quality Knowledge Learning. To strike a balance between compute efficiency and injecting new knowledge into LMMs, the authors recommend considering high-quality knowledge for LMM learning.
  • Stage-2: Visual Instruction Tuning. To teach the LMM to solve a diverse set of visual tasks with preferred responses, the instruction data is organized into different groups, and the model is scheduled to train on these groups in order.
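
As a rough sketch of what changes between stages, my reading of the paper is that Stage-1 updates only the projector, while Stage-1.5 and Stage-2 train the full model; treat this freezing policy as an assumption rather than the official recipe.

```python
# Toy illustration of a staged freeze/unfreeze schedule. The module names and the exact
# freezing policy are assumptions for illustration, not the official training code.
import torch.nn as nn

class ToyLMM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(16, 16)     # stands in for SigLIP
        self.projector = nn.Linear(16, 32)  # stands in for the 2-layer MLP
        self.llm = nn.Linear(32, 32)        # stands in for Qwen-2

def configure_stage(model: nn.Module, stage: str) -> None:
    """Freeze or unfreeze parameter groups depending on the training stage."""
    full_model = stage in ("stage-1.5", "stage-2")
    for name, p in model.named_parameters():
        if name.startswith("projector"):
            p.requires_grad = True          # projector is trained in every stage
        else:
            p.requires_grad = full_model    # vision encoder + LLM only in later stages

model = ToyLMM()
for stage in ("stage-1", "stage-1.5", "stage-2"):
    configure_stage(model, stage)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{stage}: {n_trainable} trainable parameters")
```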

Significance

Their largest model, LLaVA-OneVision-72B, yields performance between GPT-4V and GPT-4o on most benchmarks.

A relatively large gap remains on complex tasks such as visual chat scenarios ⇒ it currently uses Qwen-2 as the backbone, so what happens once a stronger LLM comes out and the training dataset is strengthened further?

 

Benchmark evaluation

There are too many benchmarks to list, so it's probably best to check the tables in the paper for details.

(On single-image benchmarks)

- It surpasses GPT-4V on most of them and approaches the performance level of GPT-4o

- It performs on par with or better than open-source models of similar parameter size

 

- On multi-image benchmarks, it outperforms existing multi-image LMMs on all of them, and even beats GPT-4V in certain domains

- On video benchmarks, it reportedly outperforms previous open-source models, including larger ones

 

Possible uses?

So it can interpret things in a combined, multi-modal way...
Hmm, I wonder if it would do better than AppAgent or other open-source agents.
It also seems quite capable of relating video and images.

 

I tried the demo first; Korean isn't quite there yet (probably because it was trained on Chinese language data).

 

OCR and image understanding are definitely very good.
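
If you'd rather run it locally than through the demo, here is a minimal single-image inference sketch assuming the Hugging Face `transformers` integration; the model id `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` and the image path are placeholders I chose, and the LLaVA-NeXT repo also ships its own inference scripts.

```python
# Minimal single-image inference sketch via Hugging Face transformers (assumed integration;
# the model id and image path are placeholders, not values from this post).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("receipt.png")  # any local image, e.g. for an OCR-style question
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Read all the text in this image."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```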

 

I should try using it in a project.