openai whisper - STT model


민사민서 2024. 8. 15. 15:03

https://github.com/openai/whisper

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision


An STT model made by OpenAI. ChatGPT's voice recognition feature is reportedly built on it too.

As for Korean, my main interest, the large model's word error rate is only about 5%.

That said, I want something light and fast on CPU, so I will use faster-whisper (https://github.com/SYSTRAN/faster-whisper).
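A rough sketch of what CPU inference with faster-whisper could look like. The model size `"small"`, the `int8` compute type, and the helper names `format_segment` / `transcribe_on_cpu` are my own placeholder choices for illustration, not something taken from the paper or this post.

```python
from typing import Iterator


def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcribed segment as '[start -> end] text'."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_on_cpu(audio_path: str, model_size: str = "small") -> Iterator[str]:
    """Transcribe a Korean audio file on CPU.

    Assumptions: the faster-whisper package is installed, and
    `model_size` / `audio_path` are placeholder values.
    """
    # Imported lazily so format_segment works without the package installed.
    from faster_whisper import WhisperModel

    # int8 quantization keeps memory use and latency low on CPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="ko", beam_size=5)
    for seg in segments:  # segments is a lazy generator
        yield format_segment(seg.start, seg.end, seg.text)
```

Usage would be something like `for line in transcribe_on_cpu("sample.wav"): print(line)`, where `sample.wav` is a hypothetical file.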

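Limitations of existing speech recognition models

wav2vec

- pre-trained with unsupervised learning on a large speech corpus of 60,000 hours

- fine-tuning is then done as supervised learning on a small amount of labeled data

=> To actually do STT you have to fine-tune. The model may only perform well in a specific domain, the fine-tuning dataset may dominate the final performance, and fine-tuning itself is hard.

=> So a general and robust model is needed?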

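So what about Whisper?

In CV (computer vision), training models on large amounts of weakly supervised data turned out to work well.

Unlike wav2vec, supervised learning on a small number of datasets also turned out to perform decently.

=> Whisper does weakly supervised learning on 680,000 hours of training data

=> For reference, Korean accounts for about 8,000 hours

It uses an encoder-decoder Transformer (the standard Transformer architecture is largely unchanged; instead the interface was changed and the dataset was scaled up).

All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed from it.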
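To get a feel for the input size, here is a small sketch of the arithmetic. The 16 kHz rate and 80 Mel channels come from the text above; the 10 ms stride (`HOP_LENGTH = 160`) and the fixed 30-second window are the values I remember from the paper, so treat them as assumptions. The frame count is an approximation that ignores STFT edge handling.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Hz, from the text above
N_MELS = 80            # Mel channels, from the text above
CHUNK_SECONDS = 30     # assumption: fixed input window length
HOP_LENGTH = 160       # assumption: 10 ms stride at 16 kHz
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim(samples: List[float], length: int = N_SAMPLES) -> List[float]:
    """Zero-pad or cut a waveform so it is exactly `length` samples long."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def spectrogram_shape(n_samples: int) -> Tuple[int, int]:
    """Approximate (channels, frames) shape of the log-Mel input."""
    return (N_MELS, n_samples // HOP_LENGTH)
```

Under these assumptions one 30-second chunk is 480,000 samples and the encoder sees an input of roughly 80 x 3000.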

To achieve multitasking?

A single model handles the entire speech processing pipeline

voice activity detection + speaker diarization + inverse text normalization + core recognition part

Task specification is done through a unified format

=> a simple format to specify all tasks and conditioning information as a sequence of input tokens to the decoder

What format?

For reference, Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once
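Because of the 30-second limit, longer audio has to be split before it reaches the model. A minimal sketch with fixed, non-overlapping windows follows; `chunk_bounds` is a hypothetical helper, and as far as I understand the paper's long-form decoding actually shifts the window based on predicted timestamps, so this is a simplification.

```python
from typing import List, Tuple


def chunk_bounds(total_seconds: float, chunk_seconds: float = 30.0) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive windows of at most chunk_seconds."""
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds
```

For example, a 75-second recording yields three windows, the last one only 15 seconds long (it would be zero-padded to 30 seconds before feature extraction).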

During training, part of the transcript's text history is also included, so that the model learns on its own to use longer-range text context to resolve ambiguous audio

- with some probability we add the transcript text preceding the current audio segment to the decoder’s context
- indicate the beginning of prediction with a <|startoftranscript|> token

The token format is structured as follows

- predict the language being spoken (with unique token for 99 languages) ⇒ VoxLingua107 model
- model is trained to predict a <|nospeech|> token indicating if there’s no speech
- specify whether to predict timestamps or not by including a <|notimestamps|> token
  - predict time relative to the current audio segment (nearest 20ms)
  - add additional tokens to our vocabulary for each of these
- add a <|endoftranscript|> token
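The token format above can be sketched as a small prompt builder. This works on token strings rather than real vocabulary ids. The `<|transcribe|>` / `<|translate|>` task tokens are described in the paper; the `<|ko|>` style of language token, the `<|12.34|>` style of timestamp token, and the helper names are assumptions for illustration.

```python
from typing import List, Optional


def timestamp_token(seconds: float) -> str:
    """Quantize a time offset to the nearest 20 ms and render it as a token."""
    idx = round(seconds * 50)  # 50 steps per second = 20 ms resolution
    return f"<|{idx / 50:.2f}|>"


def build_decoder_prefix(
    language: str = "ko",
    task: str = "transcribe",
    timestamps: bool = False,
    previous_text: Optional[str] = None,
) -> List[str]:
    """Assemble the tokens fed to the decoder before it starts predicting text."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens: List[str] = []
    if previous_text:
        # Optional transcript history used as longer-range context.
        tokens.append(previous_text)
    tokens.append("<|startoftranscript|>")
    tokens.append(f"<|{language}|>")   # assumed language-token spelling
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens
```

The decoder would then continue from this prefix with text tokens (interleaved with timestamp tokens when enabled) until it emits `<|endoftranscript|>`.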

How was the model trained?

- fp16, dynamic loss scaling, AdamW, gradient norm clipping, batch size of 256 segments

- For reference, it was trained for only a few epochs, so overfitting is reportedly not a big concern

- Interestingly, no data augmentation or regularization was used; generalization and robustness are meant to come purely from the diversity contained within such a large dataset
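Of the training ingredients listed above, gradient norm clipping is the easiest to show in isolation. This is a plain-Python sketch over a flat list of gradient values; the `max_norm=1.0` default is an illustrative assumption, and a real training loop would use the framework's built-in utility instead.

```python
import math
from typing import List


def clip_grad_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

The direction of the update is preserved; only its length is capped, which helps keep fp16 training from blowing up on an occasional bad batch.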

evaluation?

Using WER (word error rate) as-is was problematic

- it is based on string edit distance, so it penalizes everything, including innocuous differences in transcript style

- Whisper is a zero-shot model that has never seen any specific dataset's transcript format, which made this problem even worse

So extensive standardization of text is applied before the WER calculation, and the metric is measured afterwards

⇒ The details are in Appendix C of the paper
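A small sketch of why normalization matters for WER. The `simple_normalize` function here only lowercases and strips punctuation; it is a stand-in I made up for illustration and is far simpler than the normalizer described in Appendix C.

```python
import re
from typing import List


def simple_normalize(text: str) -> str:
    """Toy normalizer: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str, normalize: bool = True) -> float:
    """Word error rate = word edit distance / number of reference words."""
    if normalize:
        reference = simple_normalize(reference)
        hypothesis = simple_normalize(hypothesis)
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

With `reference="Hello, world!"` and `hypothesis="hello world"`, the raw WER is 1.0 even though the transcript is correct, while the normalized WER is 0.0.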

Multi-lingual Speech Recognition

Language Identification

Robustness to Additive Noise

Long-form Transcription

Comparison with Human Performance

The paper covers various evaluation metrics and methods for the topics above, so it is worth reading later if needed

Limitations?

As the model gets smaller, it becomes less steady and reliable

- perception-related errors such as confusing similar-sounding words

- combination of failure modes of seq2seq models, language models, and text-audio alignment

- getting stuck in repeat loops, not transcribing the first or last few words of an audio segment

- hallucination where the model will output a transcript entirely unrelated to the actual audio
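Repeat loops in particular are cheap to detect after the fact, because highly repetitive text compresses unusually well. Below is a sketch using zlib; the 2.4 threshold matches the default I recall from the openai/whisper decoding options, but treat it as an assumption to tune, and `looks_like_repeat_loop` is a hypothetical helper name.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_like_repeat_loop(text: str, threshold: float = 2.4) -> bool:
    """Flag output whose compression ratio suggests the decoder got stuck."""
    return compression_ratio(text) > threshold
```

A flagged segment could be re-decoded with different settings or simply dropped.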

pre-training dataset is currently very English-heavy

- most languages have less than 1000 hours of training data

lack of unsupervised pre-training or self-teaching methods

- The paper recommends adding these and then fine-tuning

Significance

- without the need for the self-supervision and self-training techniques that have been a mainstay of recent large-scale speech recognition work

- demonstrate how simply training on a large and diverse supervised dataset and focusing on zero-shot transfer can significantly improve the robustness of a speech recognition system

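I actually tried it, and hallucination is severe when the microphone quality is poor.

When recognition fails the model should just say so, but the problem seems to come from it trying to generate some output no matter what.

Inference is fast, so it should be fine once hallucination is improved, since minor misrecognitions can be corrected when the text passes through an LLM.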
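One way to reduce this in practice might be to drop segments the model itself is unsure about. The sketch below assumes each decoded segment exposes an average log-probability and a no-speech probability (faster-whisper segments carry `avg_logprob` and `no_speech_prob` fields). The `Segment` class is a hypothetical stand-in, and the thresholds are assumptions that would need tuning for a noisy microphone.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in for one decoded segment."""
    text: str
    avg_logprob: float      # mean token log-probability of the segment
    no_speech_prob: float   # probability assigned to the no-speech token


def drop_unreliable(
    segments: Iterable[Segment],
    no_speech_threshold: float = 0.6,   # assumed threshold
    logprob_threshold: float = -1.0,    # assumed threshold
) -> List[Segment]:
    """Keep segments unless they look like silence AND were decoded with low confidence."""
    kept: List[Segment] = []
    for seg in segments:
        likely_silence = seg.no_speech_prob > no_speech_threshold
        low_confidence = seg.avg_logprob < logprob_threshold
        if likely_silence and low_confidence:
            continue  # probably hallucinated text over non-speech audio
        kept.append(seg)
    return kept
```

Requiring both conditions is the conservative choice; with a poor microphone it may be worth dropping on either condition alone, at the cost of losing some real speech.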
