I heard that LLaVA-OneVision, the successor to LLaVA-NeXT, has been released.
https://github.com/LLaVA-VL/LLaVA-NeXT
I know almost nothing about LLMs, but I decided to read the paper.
Reviews of LLaVA-related papers
Goal
- Aims to fill the gap by demonstrating state-of-the-art performance across a broad range of tasks
- Showcases interesting emerging capabilities through cross-scenario task transfer and composition
Architecture
- LLM: Qwen-2 ("We choose Qwen-2")
- Vision Encoder: SigLIP ("We consider the SigLIP")
- Projector: 2-layer MLP
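A minimal sketch of how these three pieces connect, written by me rather than taken from the repo; the 729-token count and the hidden sizes (1152 for SigLIP-SO400M-style features, 3584 for a Qwen2-7B-sized embedding space) are assumptions used only as placeholders:

```python
import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """2-layer MLP that maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.mlp(patch_features)  # (batch, num_patches, llm_dim) "visual tokens"

# The projected visual tokens are concatenated with text token embeddings and fed to the LLM.
projector = TwoLayerProjector()
dummy_patches = torch.randn(1, 729, 1152)  # assumed SigLIP-style patch grid
visual_tokens = projector(dummy_patches)
print(visual_tokens.shape)  # torch.Size([1, 729, 3584])
```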
Data
The authors consider continuous exposure of the model to new, high-quality data to be important for further knowledge acquisition.
More than 99% of the data is synthetic.
- Re-Captioned Detailed Description Data
- Document / OCR Data
- Chinese and Language Data
Single-image data (for multimodal capabilities) + OneVision data (a mixture of video, image, and multi-image data)
https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
The dataset is available on Hugging Face.
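A quick sketch of pulling one subset with the datasets library; I'm assuming the repo is split into named configs (which is why the config names are listed first) and that the field names vary by subset:

```python
from datasets import get_dataset_config_names, load_dataset

# List the named subsets of the mixture, then stream one of them
# (streaming avoids downloading the whole mixture up front).
configs = get_dataset_config_names("lmms-lab/LLaVA-OneVision-Data")
print(configs[:5])

ds = load_dataset("lmms-lab/LLaVA-OneVision-Data", configs[0], split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the fields (image / conversation-style annotations)
```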
Three-stage training strategy
- Stage-1: Language-Image Alignment. The goal is to align the visual features well with the word embedding space of the LLM.
- Stage-1.5: High-Quality Knowledge Learning. To strike a balance between compute efficiency and injecting new knowledge into LMMs, they recommend considering high-quality knowledge for LMM learning.
- Stage-2: Visual Instruction Tuning. To teach the LMM to solve a diverse set of visual tasks with preferred responses, the instruction data is organized into different groups, and the model is trained on these groups in order.
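A toy sketch of what that schedule means for which parameters get gradients. The module names (vision_encoder / projector / llm) are hypothetical stand-ins, and the freezing recipe follows my reading of the paper (Stage-1 updates only the projector; Stage-1.5 and Stage-2 train the full model), not the released training code:

```python
import torch.nn as nn

class DummyLMM(nn.Module):
    """Stand-in for the real model; attribute names here are hypothetical."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # placeholder for SigLIP
        self.projector = nn.Linear(8, 8)       # placeholder for the 2-layer MLP
        self.llm = nn.Linear(8, 8)             # placeholder for Qwen-2

def configure_stage(model: nn.Module, stage: str) -> None:
    """Freeze everything, then unfreeze what each stage trains."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == "stage1":                   # Language-Image Alignment: projector only
        trainable = [model.projector]
    elif stage in ("stage1.5", "stage2"):   # knowledge learning / instruction tuning: full model
        trainable = [model.vision_encoder, model.projector, model.llm]
    else:
        raise ValueError(f"unknown stage: {stage}")
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True

model = DummyLMM()
configure_stage(model, "stage1")
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # only projector params
```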
Significance
The largest model, LLaVA-OneVision-72B, yields performance between GPT-4V and GPT-4o on most benchmarks.
A relatively larger gap remains in complex tasks such as visual chat scenarios ⇒ the current backbone is Qwen-2, so what happens once a stronger LLM comes out and the training dataset is strengthened further?
Benchmark evaluation
There are too many benchmarks to go through in detail, so the tables in the paper are worth checking, but roughly:
(On single-image benchmarks)
- It surpasses GPT-4V on most benchmarks and approaches the performance level of GPT-4o
- It performs on par with or better than open-source models of similar parameter size
- On multi-image benchmarks, it outperforms existing multi-image LMMs across all benchmarks, and beats GPT-4V in certain domains
- On video benchmarks, it reportedly does better than previous open-source models, even larger ones
Possible utilization?
I tried the demo; Korean isn't handled well yet (probably because the training data covers Chinese rather than Korean).
OCR and image understanding are genuinely impressive, though.
I should try it in a project.
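For later reference, a rough inference sketch using the Hugging Face transformers integration. The checkpoint id below is an assumption (check the llava-hf org on the Hub for exact names), and it requires a transformers version recent enough to ship LlavaOnevisionForConditionalGeneration; the repo README has the official inference code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Checkpoint id is an assumption -- see the llava-hf Hub org for the exact names.
model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Chat-template style prompt with one image placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What text appears in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.png")  # any local test image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```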