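Years ago I had the idea of adding subtitles to game videos that come without any.
The conditions weren't there at the time, but by now they seem to be.

Whisper is OpenAI's open-source speech recognition software.
There is a .NET version of it, and with a few small modifications to that version it can transcribe the speech in a game video into an srt subtitle file.
After that, run the srt file through an online batch translator, make a few minor adjustments, and the Chinese localization is done.

The address is as follows: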
https://github.com/sandrohanea/whisper.net

Building is best done with VS2022; otherwise you will run into a lot of .NET SDK version problems.

Once it builds, there are a few points to note:

<0> Change the model file to the large model, ggml-large.bin; this model gives better results.
Of course, it also takes more time: converting a batch of files will probably take several hours, or even tens of hours.

<1> Language must be set to "english".

    // Original: language taken from opt.Language
    /* var builder = factory.CreateBuilder()
        .WithLanguage(opt.Language);*/
    // Changed: language fixed to English
    var builder = factory.CreateBuilder()
        .WithLanguage("english");
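
<2> By default only WAV input seems to be supported, and it has to be at a 16 kHz sample rate; files need to be converted to this format first, otherwise an error occurs.

<3> By default only the conversion of a single sample wav file is provided; this needs to be changed to a batch mode
(iterating over all the files in a directory).

<4> The output needs a little tidying so that it conforms to the srt format.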
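
Points <2> and <3> can be handled in one pre-processing pass. What follows is only a rough sketch under assumptions, not code from the whisper.net samples: it assumes ffmpeg is installed and on the PATH (the post does not say which converter was used), it downmixes to mono as a conservative choice, and the names `BuildFfmpegArgs` and `ConvertFolder` as well as the source/target folder layout are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Usage (hypothetical): dotnet run -- <sourceDir> <targetDir>
if (args.Length == 2)
    ConvertFolder(args[0], args[1]);

// Build ffmpeg arguments that turn any audio/video file into
// 16 kHz, mono, 16-bit PCM WAV. -vn drops the video stream, -y overwrites.
static string BuildFfmpegArgs(string input, string output) =>
    $"-y -i \"{input}\" -vn -ar 16000 -ac 1 -c:a pcm_s16le \"{output}\"";

// Walk every file in sourceDir and write a converted .wav into targetDir.
static void ConvertFolder(string sourceDir, string targetDir)
{
    Directory.CreateDirectory(targetDir);
    foreach (var input in Directory.EnumerateFiles(sourceDir))
    {
        var output = Path.Combine(
            targetDir, Path.GetFileNameWithoutExtension(input) + ".wav");

        var startInfo = new ProcessStartInfo("ffmpeg", BuildFfmpegArgs(input, output))
        {
            UseShellExecute = false
        };

        using var process = Process.Start(startInfo);
        process?.WaitForExit();

        // Report failures (e.g. non-media files) and keep going with the rest.
        if (process is null || process.ExitCode != 0)
            Console.Error.WriteLine($"ffmpeg failed for {input}");
    }
}
```

The same `Directory.EnumerateFiles` loop is one way to feed the converted wav files to the recognizer one after another instead of the single sample file. Using a separate target folder keeps the converted files from being picked up as input again.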

Below is the console output for one wav file (the opening cinematic of 幽魂):

    whisper_init_from_file_no_state: loading model from 'ggml-large.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab = 51865
    whisper_model_load: n_audio_ctx = 1500
    whisper_model_load: n_audio_state = 1280
    whisper_model_load: n_audio_head = 20
    whisper_model_load: n_audio_layer = 32
    whisper_model_load: n_text_ctx = 448
    whisper_model_load: n_text_state = 1280
    whisper_model_load: n_text_head = 20
    whisper_model_load: n_text_layer = 32
    whisper_model_load: n_mels = 80
    whisper_model_load: ftype = 1
    whisper_model_load: qntvr = 0
    whisper_model_load: type = 5
    whisper_model_load: mem required = 3557.00 MB (+ 71.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx = 2951.27 MB
    whisper_model_load: model size = 2950.66 MB
    whisper_init_state: kv self size = 70.00 MB
    whisper_init_state: kv cross size = 234.38 MB
    New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    New Segment: 00:00:03.6600000 ==> 00:00:05.9000000 : (exhaling)
    New Segment: 00:00:05.9000000 ==> 00:00:08.6600000 : (birds chirping)
    New Segment: 00:00:08.6600000 ==> 00:00:35.1200000 : (gun firing)
    New Segment: 00:00:36.1200000 ==> 00:00:38.5400000 : (gun firing)
    New Segment: 00:00:39.0600000 ==> 00:00:41.4800000 : (gun firing)
    New Segment: 00:00:41.4800000 ==> 00:00:49.4000000 : (tires screeching)
    New Segment: 00:00:49.4000000 ==> 00:00:58.5800000 : (glass shattering)
    New Segment: 00:00:58.5800000 ==> 00:01:07.7400000 : (singing in foreign language)
    New Segment: 00:01:07.7400000 ==> 00:01:11.5800000 : (singing in foreign language)
    New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)
    New Segment: 00:01:17 ==> 00:01:24.8400000 : (singing in foreign language)
    New Segment: 00:01:24.8400000 ==> 00:01:28.6400000 : (panting)
    New Segment: 00:01:36.7800000 ==> 00:01:39.2000000 : (gun firing)
    New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.
    New Segment: 00:01:43.4600000 ==> 00:01:45.6200000 : - Oh God.
    New Segment: 00:01:45.6200000 ==> 00:01:48.2000000 : - What's the matter sweetheart?
    New Segment: 00:01:48.2000000 ==> 00:01:50.4200000 : Oh.
    New Segment: 00:01:50.4200000 ==> 00:01:53.4600000 : - Oh it's horrible.
    New Segment: 00:01:53.4600000 ==> 00:01:55.3000000 : - Shh.
    New Segment: 00:01:55.3000000 ==> 00:02:02.3400000 : It was just a bad dream.
    New Segment: 00:02:05.4200000 ==> 00:02:09.8800000 : - You don't ever have to be afraid of anything.
    New Segment: 00:02:09.8800000 ==> 00:02:12.8000000 : I'll always be here to protect you.
    New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)
    New Segment: 00:02:16.4800000 ==> 00:02:19.0600000 : (gentle music)
    New Segment: 00:02:19.0600000 ==> 00:02:21.6400000 : (gentle music)
    New Segment: 00:02:21.6400000 ==> 00:02:24.2200000 : (gentle music)
    New Segment: 00:02:24.5400000 ==> 00:02:27.1200000 : (gentle music)
    New Segment: 00:02:27.1200000 ==> 00:02:29.7000000 : (gentle music)
    New Segment: 00:02:29.7000000 ==> 00:02:33.1800000 : [Music]
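
Point <4> is mostly a matter of timestamp formatting: each srt entry is a running index, a line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the text, and a blank line. Here is a rough sketch that turns "New Segment" console lines like the ones above into srt text. The helper names `ToSrtTime` and `ToSrt` are made up for illustration, and post-processing the captured console text (rather than changing how the sample prints each segment) is an assumption of this sketch, not necessarily how it was actually done.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Usage (hypothetical): dotnet run -- console.txt output.srt
if (args.Length == 2)
    File.WriteAllText(args[1], ToSrt(File.ReadLines(args[0])));

// Format a TimeSpan the way srt expects it: HH:MM:SS,mmm
static string ToSrtTime(TimeSpan t) =>
    $"{(int)t.TotalHours:00}:{t.Minutes:00}:{t.Seconds:00},{t.Milliseconds:000}";

// Convert "New Segment: <start> ==> <end> : <text>" lines into srt text.
// Lines that do not match (model loading messages etc.) are skipped.
static string ToSrt(IEnumerable<string> consoleLines)
{
    var pattern = new Regex(@"New Segment:\s*(\S+)\s*==>\s*(\S+)\s*:\s*(.*)$");
    var srt = new StringBuilder();
    var index = 1;

    foreach (var line in consoleLines)
    {
        var match = pattern.Match(line);
        if (!match.Success)
            continue;

        // TimeSpan.Parse accepts both "00:01:17" and "00:00:02.7600000".
        var start = TimeSpan.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
        var end = TimeSpan.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);
        var text = match.Groups[3].Value.Trim();

        srt.Append(index++).Append('\n');
        srt.Append(ToSrtTime(start)).Append(" --> ").Append(ToSrtTime(end)).Append('\n');
        srt.Append(text).Append("\n\n");
    }

    return srt.ToString();
}
```

For the first segment of the output above, this gives:

    1
    00:00:00,000 --> 00:00:02,760
    (birds chirping)

Sound-effect entries such as (gun firing) or (gentle music) are kept as they are; whether to drop them before translation is a matter of taste.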