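These are notes on the points I found worth attention while building a simple LSTM + CTC model in Keras.
I won't repeat the theory behind LSTM and CTC here; the two articles below are useful references.
人人都能看懂的LSTM (LSTM That Everyone Can Understand)
一文读懂CRNN+CTC文字识别 (Understanding CRNN+CTC Text Recognition in One Article)
Sometimes the theory seems roughly clear after reading, yet it is still hard to know where to start in practice. Having a clear picture of the input and output shape of every layer in the network makes the implementation much easier.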
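LSTM layer
lstm = LSTM(units=40, return_sequences=True)(input)  # input: Input(shape=(time_steps, step_length))
Input shape: (batch_size, time_steps, step_length)
Output shape: (batch_size, time_steps, units)
Here time_steps can be the number of frames of the MFCC features extracted from the audio, and step_length is the number of features in one MFCC frame.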
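To make the mapping from audio features to (batch_size, time_steps, step_length) concrete, here is a minimal sketch with NumPy. The arrays are random stand-ins for real MFCC and delta features, and the sizes (120 frames, 13 coefficients per frame) are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes: 120 frames, 13 cepstral coefficients per frame.
n_frames, n_mfcc = 120, 13

rng = np.random.default_rng(0)
mfcc_feat = rng.normal(size=(n_frames, n_mfcc))  # stand-in for real MFCCs
delta1 = rng.normal(size=(n_frames, n_mfcc))     # stand-in for the first delta block
delta2 = rng.normal(size=(n_frames, n_mfcc))     # stand-in for the second delta block

# Stack the three blocks along the feature axis: still one row per frame.
feature = np.hstack((mfcc_feat, delta1, delta2))

# Add a leading batch axis to match (batch_size, time_steps, step_length).
batch = feature[np.newaxis, :]

time_steps, step_length = batch.shape[1], batch.shape[2]
print(batch.shape)
```

With these assumed sizes, time_steps is the frame count and step_length is three times the number of coefficients per frame.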
Dense layer
dense = Dense(n_classes, activation='softmax')(lstm)
Input shape: (batch_size, time_steps, units)
Output shape: (batch_size, time_steps, n_classes)
Here n_classes is the number of output symbols (phonemes or characters), e.g. 26 English letters + 1 space + 1 blank.
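The shape bookkeeping above can be written down as a small helper. This is only a sketch in plain Python, not Keras code: the function names are hypothetical, and the bidirectional case assumes forward and backward outputs are concatenated on the last axis.

```python
def recurrent_output_shape(input_shape, units, return_sequences=True, bidirectional=False):
    """Shape produced by an LSTM/GRU layer for a (batch, time_steps, step_length) input."""
    batch_size, time_steps, _ = input_shape
    # With concatenation, a bidirectional wrapper doubles the last axis.
    width = units * 2 if bidirectional else units
    if return_sequences:
        return (batch_size, time_steps, width)
    # Without return_sequences only the last time step is kept.
    return (batch_size, width)


def dense_output_shape(input_shape, n_classes):
    """A Dense layer acts on the last axis only, so the leading axes are preserved."""
    return tuple(input_shape[:-1]) + (n_classes,)


n_classes = 26 + 1 + 1  # letters + space + blank
lstm_shape = recurrent_output_shape((8, 120, 39), units=40)
dense_shape = dense_output_shape(lstm_shape, n_classes)
print(lstm_shape, dense_shape)
```

The point to notice is that return_sequences=True keeps the time axis, which CTC needs because it scores one softmax distribution per time step.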
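CTC loss
Keras ships its CTC loss function as K.ctc_batch_cost, which has to be wrapped in a Lambda layer.
import keras.backend as K
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')([dense, label_true, input_length, label_length])
input_length has shape (batch_size, 1); each element is the time_steps of the corresponding training sample.
label_length has shape (batch_size, 1); each element is the max_string_length of the training data, i.e. the length of the sample's label sequence (the two coincide here because training uses a single sample).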
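When a batch holds several samples whose labels differ in length, the labels have to be padded to a common width while label_length records each real length. The sketch below shows one way to build the three arrays with NumPy. The helper name build_ctc_lengths and the padding value 0 are my own assumptions; the padded positions should be ignored by the loss as long as label_length is correct.

```python
import numpy as np


def build_ctc_lengths(label_seqs, time_steps, pad_value=0):
    """Pad label sequences and build the (batch_size, 1) length arrays for CTC."""
    batch_size = len(label_seqs)
    max_label_length = max(len(seq) for seq in label_seqs)

    # Labels padded on the right to a common width.
    labels = np.full((batch_size, max_label_length), pad_value, dtype=np.int64)
    for i, seq in enumerate(label_seqs):
        labels[i, :len(seq)] = seq

    # Every sample here has the same number of frames.
    input_length = np.full((batch_size, 1), time_steps, dtype=np.int64)
    # Real (unpadded) length of each label sequence.
    label_length = np.array([[len(seq)] for seq in label_seqs], dtype=np.int64)
    return labels, input_length, label_length


labels, input_length, label_length = build_ctc_lengths([[8, 9], [3, 1, 20]], time_steps=120)
print(labels.shape, input_length.shape, label_length.shape)
```

CTC also needs at least as many time steps as label symbols for each sample, so it is worth checking input_length against label_length before training.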
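We need to build two models, base_model and model.
base_model takes dense as its output and is used for prediction once training is done.
model takes loss as its output and is used to train the parameters.
The model below uses GRU, which is used in the same way as LSTM.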
input = Input(shape=(time_steps, step_length))
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)
label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)
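First, model.compile has to be given our own CTC loss function.
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
A Keras loss function takes the model output y_pred and the true labels y_true. Since the output of our model is already the CTC loss, y_pred is returned directly as the loss.
fittedModel = model.fit([input, labels, input_length, label_length], np.ones(1), batch_size=1, epochs=100, verbose=2)
Because the true labels already take part in the layer computation as a model input, the y passed to model.fit can hold arbitrary values; its first dimension only has to match the number of samples (here 1, the same as batch_size).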
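For intuition about the number that the loss layer outputs, here is a small NumPy sketch of the CTC negative log-likelihood for a single sample, computed with the forward algorithm. It is an illustration under my own assumptions, not the Keras implementation: the function name is hypothetical, the blank index is passed in explicitly, and the sums are done in probability space, which would underflow on long sequences.

```python
import numpy as np


def ctc_neg_log_likelihood(y_pred, labels, blank):
    """CTC loss of one sample.

    y_pred: (time_steps, n_classes) softmax outputs
    labels: list of class indices, without blanks
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    T, S = y_pred.shape[0], len(ext)

    # alpha[t, s]: total probability of all alignments that reach ext[s] at time t
    alpha = np.zeros((T, S))
    alpha[0, 0] = y_pred[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y_pred[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance by one symbol
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * y_pred[t, ext[s]]

    # Valid alignments end on the last label or on the trailing blank.
    prob = alpha[T - 1, S - 1]
    if S > 1:
        prob += alpha[T - 1, S - 2]
    return -np.log(prob)


# Two time steps, two classes (index 0 = 'a', index 1 = blank), target "a".
# The valid alignments are "aa", "a-" and "-a": 0.18 + 0.42 + 0.12 = 0.72.
toy_pred = np.array([[0.6, 0.4], [0.3, 0.7]])
toy_loss = ctc_neg_log_likelihood(toy_pred, [0], blank=1)
print(toy_loss)
```

Summing over every alignment is what lets the model train without frame-level labels, which is why only the label sequence and the two length arrays are fed in.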
After training model, use base_model for prediction.
y_pred = base_model.predict(input_test)
Use ctc_decode to decode y_pred.
decode = K.get_value(K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)[0][0])
Here decode holds class indices; map each index back to its actual class.
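What greedy decoding does can be sketched in a few lines of NumPy: take the most likely class at each time step, collapse consecutive repeats, then drop the blanks. The function names are hypothetical. The index convention (0 for space, 1 to 26 for 'a' to 'z', last index for blank) follows the label encoding used in this post, with the blank position being my assumption about how the CTC functions treat the last class.

```python
import numpy as np


def greedy_ctc_decode(y_pred, blank):
    """y_pred: (time_steps, n_classes) softmax outputs -> list of class indices."""
    best_path = np.argmax(y_pred, axis=1)
    decoded = []
    previous = None
    for idx in best_path:
        # Keep a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank:
            decoded.append(int(idx))
        previous = idx
    return decoded


def indices_to_text(indices):
    """Map 0 to a space and 1..26 to 'a'..'z'."""
    return ''.join(' ' if i == 0 else chr(i + ord('a') - 1) for i in indices)


n_classes = 26 + 1 + 1
blank = n_classes - 1

# One-hot "predictions" for a hand-written frame sequence.
frame_classes = [8, 8, blank, 9, blank, 0, 0, 1, 1, blank, 1]
y_pred_toy = np.eye(n_classes)[frame_classes]

text = indices_to_text(greedy_ctc_decode(y_pred_toy, blank))
print(text)  # prints "hi aa"
```

The blank between the last two frames of class 1 is what keeps the repeated letter from being merged into one.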
from keras.models import Model
from keras.layers import GRU, Dense, Bidirectional, Input, Lambda
from python_speech_features import *
import keras.backend as K
import numpy as np
import scipy.io.wavfile as wav


def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)


def get_audio_feature(audio_path):
    fs, audio = wav.read(audio_path)
    print(fs)
    print(audio.shape)
    # Extract MFCC features
    wav_feature = mfcc(audio, fs, nfft=int(0.025 * fs), winfunc=np.hamming)
    # Delta features
    d_mfcc_feat1 = delta(wav_feature, 1)
    d_mfcc_feat2 = delta(wav_feature, 2)
    feature = np.hstack((wav_feature, d_mfcc_feat1, d_mfcc_feat2))
    return feature


def get_audio_label(filepath):
    SPACE_TOKEN = '<space>'
    SPACE_INDEX = 0
    FIRST_INDEX = ord('a') - 1
    with open(filepath, 'r') as f:
        line = f.readlines()[0].strip()
    # Replace every space with two spaces
    targets = line.replace(' ', '  ')
    # Split on single spaces; two consecutive spaces produce ''
    targets = targets.split(' ')
    # Turn each '' into the space token
    targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
    print(targets)
    # Convert the tokens to numbers
    targets = np.hstack([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
                         for x in targets])
    return targets


def decode_ctc(out):
    batch_size, decode_len = out.shape[0], out.shape[1]
    for i in range(batch_size):
        pre = ''.join([' ' if x == 0 else chr(x + ord('a') - 1) for x in out[i]])
        print(pre)


feature = get_audio_feature('001.wav')
feature = feature[np.newaxis, :]
print(feature.shape)
labels = get_audio_label('...')  # transcript file path; the file name is truncated in the source
labels = labels[np.newaxis, :]
print(labels.shape)
max_label_length = labels.shape[1]
il = np.ones(1) * feature.shape[1]
print(il.shape)
ll = np.ones(1) * max_label_length
print(ll.shape)

time_step, step_length = feature.shape[1], feature.shape[2]
n_classes = 26 + 1 + 1

input = Input(shape=(time_step, step_length))
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)

label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)

model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
# model.summary()

fittedModel = model.fit([feature, labels, il, ll], np.ones(1), batch_size=1, epochs=100, verbose=2)
model.save('lstm_ctc.h5')

base_model.load_weights('lstm_ctc.h5')
y_pred = base_model.predict(feature)
decode = K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)
out = K.get_value(decode[0][0])
decode_ctc(out)
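The label encoding in get_audio_label can be checked without any audio or transcript file. Below is a minimal sketch of a more direct encoder that should produce the same indices for lowercase text with single spaces. The function name text_to_labels is hypothetical, and rejecting other characters is my own choice.

```python
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # so that 'a' maps to 1 and 'z' to 26


def text_to_labels(text):
    """Encode lowercase letters and spaces as the class indices used for training."""
    labels = []
    for ch in text:
        if ch == ' ':
            labels.append(SPACE_INDEX)
        elif 'a' <= ch <= 'z':
            labels.append(ord(ch) - FIRST_INDEX)
        else:
            # Anything else has no class in this 28-symbol alphabet.
            raise ValueError('unsupported character: %r' % ch)
    return labels


encoded = text_to_labels('hi a')
print(encoded)  # prints [8, 9, 0, 1]
```

Index 27 is never produced here, which leaves it free for the CTC blank.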
Part of the code was adapted from other sources.
Published: 2024-02-01 07:37:50
Link: https://www.4u4v.net/it/170674427234937.html