This post walks through a simple but effective baseline for video captioning; the GitHub repo is https://github.com/ramakanth-pasunuru/video_captioning_rl. The baseline is built on PyTorch and consists of four main modules: data preprocessing, data loading, network construction, and model optimization. Although the code uses an LSTM, the overall pipeline is the same if you swap in the now-mainstream Transformer.
Data preprocessing

This stage covers video feature extraction and vocabulary construction. For the video side, two kinds of features are usually extracted: a per-frame appearance feature and a motion feature (per-frame object features can be added as well). This baseline uses a pretrained ResNet-152 for appearance features and a ResNeXt-101 for motion features; if needed, you can extract object features yourself with a detection network. See the GitHub repo for the feature downloads.

For the text side, preprocessing counts word frequencies in the corpus and filters them, e.g. dropping words that occur fewer than 2 times, and adds start, end, and unknown tokens that mark the beginning of a sentence, its end, and out-of-vocabulary words. The key intuition to build is that each word maps to exactly one integer id. For example, if the word "love" maps to id 5 and the vocabulary size is 100, then ids run from 0 to 99, the model outputs a 100-dimensional vector, and after softmax we want the 6th component (counting from 0) to be the largest. See one-hot encoding for the details. The code below builds the word-to-id mapping.
```python
# Special-token constants (string values assumed here; see the repo for the exact definitions).
PAD_TOKEN = '[PAD]'
UNKNOWN_TOKEN = '[UNK]'
START_DECODING = '[START]'
STOP_DECODING = '[STOP]'

class Vocab(object):
    def __init__(self, vocab_file, max_size):
        """Creates a vocab of up to max_size words, reading from the vocab_file.
        If max_size is 0, reads the entire vocab file.

        Args:
          vocab_file: path to the vocab file, which is assumed to contain
            "<word> <frequency>" on each line, sorted with most frequent word
            first. This code doesn't actually use the frequencies, though.
          max_size: integer. The maximum size of the resulting Vocabulary.
        """
        self._word_to_id = {}
        self._id_to_word = {}
        self._count = 0
        # Reserve the first ids for the special tokens.
        for w in [PAD_TOKEN, START_DECODING, STOP_DECODING, UNKNOWN_TOKEN]:
            self._word_to_id[w] = self._count
            self._id_to_word[self._count] = w
            self._count += 1
        # (The loop that reads vocab_file and fills in the real words is omitted in this excerpt.)

    def word2id(self, word):
        """Returns the id (integer) of a word (string). Returns [UNK] id if word is OOV."""
        if word not in self._word_to_id:
            return self._word_to_id[UNKNOWN_TOKEN]
        return self._word_to_id[word]

    def id2word(self, word_id):
        """Returns the word (string) corresponding to an id (integer)."""
        if word_id not in self._id_to_word:
            raise ValueError('Id not found in vocab: %d' % word_id)
        return self._id_to_word[word_id]
```
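To tie the word-to-id mapping back to the one-hot intuition above, here is a small hedged sketch; the vocabulary size and the id of "love" are the made-up numbers from the text:

```python
import torch

vocab_size = 100          # toy vocabulary: ids 0..99
love_id = 5               # assume the word "love" was assigned id 5

# One-hot encoding: a 100-dim vector whose component at index 5 is 1.
one_hot = torch.zeros(vocab_size)
one_hot[love_id] = 1.0

# During training the model emits 100 logits per step; after softmax we
# want the probability mass to peak at index 5 when the target is "love".
logits = torch.randn(vocab_size)
probs = torch.softmax(logits, dim=0)
```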
Data loading

Consider a practical scenario: how do you load a huge file from disk into memory? You need a function that reads only part of the file at a time while remembering where it left off, and a generator implements exactly this behavior.
```python
def file_reader(fp, block_size=1024 * 8):
    """Generator function: read the file contents in chunks."""
    while True:
        chunk = fp.read(block_size)
        if not chunk:
            break
        yield chunk
```
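A hedged usage sketch of the generator ('data.bin' is a placeholder path):

```python
# Hypothetical usage: stream a large file without loading it all at once.
with open('data.bin', 'rb') as fp:
    total = 0
    for chunk in file_reader(fp):
        total += len(chunk)   # process each 8 KB chunk here
    print(f'read {total} bytes')
```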
During training, data is fed to the network in batches, so the data-loading code leans on iterators and generators. A common data-loading skeleton looks like this:
```python
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, path):
        pass

    def __getitem__(self, item):
        pass

    def __len__(self):
        pass

dataset = MyDataset(path)
dataloader = DataLoader(dataset, ...)

# Idiom 1: a plain for loop over the loader.
for data in dataloader:
    ...

# Idiom 2: manual iteration, which is what the for loop does under the hood.
iterr = iter(dataloader)
while True:
    try:
        next(iterr)
    except StopIteration:
        break
```
The DataLoader source code is fairly involved and, in my view, not worth studying line by line; understanding the process is enough. The following summary of how data is loaded from a DataLoader is quoted from https://zhuanlan.zhihu.com/p/30934236:
1. Calling the dataloader's __iter__() method produces a DataLoaderIter.
2. The DataLoaderIter's __next__() is called repeatedly to obtain batches. Concretely, it calls the dataset's __getitem__() many times (in multiple workers when num_workers > 0), then packs the samples into a batch with collate_fn. Shuffling and the sampler also come into play in between, which we won't go into here.
3. When the data is exhausted, next() raises a StopIteration exception, the for loop ends, and the dataloader is spent.
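The three steps above can be traced with a minimal runnable toy (all names and sizes are invented for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Ten (feature, label) pairs, just to trace the DataLoader flow."""
    def __init__(self):
        self.x = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        self.y = torch.arange(10)

    def __getitem__(self, idx):      # called repeatedly by the iterator
        return self.x[idx], self.y[idx]

    def __len__(self):               # lets DataLoader know when to stop
        return len(self.x)

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)
it = iter(loader)                    # step 1: __iter__ produces the iterator
while True:
    try:
        x, y = next(it)              # step 2: collate_fn stacks samples into a batch
        print(x.shape, y.shape)      # torch.Size([4, 1]) ... the last batch has 2
    except StopIteration:            # step 3: data exhausted, loop ends
        break
```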
You can also write a data loader by hand. For example, MSRVTTBatcher's get_batcher builds a generator with the yield keyword, so batches can be pulled iteratively with next() (a minimal sketch follows the class skeleton below).
```python
import numpy as np

class MSRVTTBatcher(object):
    def __init__(self, hps, mode, vocab):
        ...  # (attribute setup omitted in this excerpt)

    def _process_data(self):
        """This module extracts data from videos and caption files and creates batches."""
        ...
        if self._mode == 'train':
            np.random.shuffle(data)
        else:
            # For val/test, keep one entry per video while preserving order.
            data, _ = zip(*data)
            data = sorted(set(data), key=data.index)
        return data, data_dict

    def sort_based_on_caption_lengths(self, video_batch, video_len_batch, video_id,
                                      caption_batch, caption_len_batch, original_caption):
        ...

    def get_batcher(self):
        """This module processes data and creates batches for train/val/test.
        It also acts as a generator."""
        ...
        yield batch
```
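The sketch promised above: a hypothetical, stripped-down batcher showing how yield turns a function into a generator that next() can drive (simple_batcher and its arguments are invented for illustration):

```python
import numpy as np

def simple_batcher(data, batch_size):
    """Hypothetical stand-in for get_batcher: shuffle once, then yield
    fixed-size batches so the caller can pull them with next()."""
    indices = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

batcher = simple_batcher(list(range(10)), batch_size=4)
print(next(batcher))   # first batch of 4 shuffled items
print(next(batcher))   # next batch; StopIteration once data is exhausted
```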
Network construction

A neural network is, at bottom, matrix arithmetic: each formula in the paper corresponds one-to-one with code. A typical network definition and invocation looks like the following. Note that calling the network never seems to invoke forward explicitly. This is because executing model(data) actually executes model.__call__(data); since Model inherits from nn.Module, the parent class's magic method __call__ runs, and it calls forward along with a number of hook functions.
```python
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
model(data)
```
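To make the __call__-to-forward dispatch concrete, here is a minimal hedged sketch; the input shape is made up to fit the two conv layers, and the hook just shows what __call__ runs in addition to forward:

```python
import torch

model = Model()
data = torch.randn(1, 1, 28, 28)   # toy input matching conv1's in_channels=1

# Both lines run the same computation, but only the first goes through
# nn.Module.__call__, which also fires any registered hooks.
out1 = model(data)
out2 = model.forward(data)

# A forward hook demonstrates the extra machinery inside __call__.
hook = model.conv1.register_forward_hook(
    lambda module, inputs, output: print('conv1 output:', output.shape))
model(data)    # prints: conv1 output: torch.Size([1, 20, 24, 24])
hook.remove()
```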
For a captioning task, as mentioned in the theory part, the model generally splits into an encoder and a decoder, and the code mirrors this: a single class usually orchestrates both.
```python
class Seq2seqAttention(nn.Module):
    def __init__(self, args):
        super(Seq2seqAttention, self).__init__()
        self.args = args
        self.enable_cuda = args.cuda
        self.vid_dim = args.vid_dim
        self.embed_size = args.embed
        self.hidden_dim = args.hid
        self.vocab_size = args.max_vocab_size
        self.num_layers = args.num_layers
        self.birnn = args.birnn

        self.encoder = EncoderFrames(self.args)
        self.decoder = DecoderRNN(self.args)

    def forward(self, frames, flengths, captions, lengths):
        video_features = self.encoder(frames, flengths)
        outputs = self.decoder(video_features, flengths, captions, lengths)
        return outputs
```
Note in particular that the encoder uses pack_padded_sequence, a trick that keeps the RNN from processing padded timesteps and thus reduces padding noise (a standalone toy example follows the encoder code below).
```python
class EncoderFrames(nn.Module):
    def __init__(self, args):
        super(EncoderFrames, self).__init__()
        self.vid_dim = args.vid_dim
        self.embed_size = args.embed
        self.hidden_dim = args.hid
        self.enable_cuda = args.cuda
        self.num_layers = args.num_layers
        self.args = args
        if args.birnn:
            self.birnn = 2
        else:
            self.birnn = 1
        # Project raw video features down to the embedding size.
        self.linear = nn.Linear(self.vid_dim, self.embed_size, bias=False)
        self.rnn = nn.LSTM(self.embed_size, self.hidden_dim, self.num_layers,
                           batch_first=True, bidirectional=self.args.birnn,
                           dropout=args.dropout)
        self.dropout = nn.Dropout(args.dropout)
        self.init_weights()

    def init_weights(self):
        self.rnn.weight_hh_l0.data.uniform_(-0.08, 0.08)
        self.rnn.weight_ih_l0.data.uniform_(-0.08, 0.08)
        self.rnn.bias_ih_l0.data.fill_(0)
        self.rnn.bias_hh_l0.data.fill_(0)
        self.linear.weight.data.uniform_(-0.08, 0.08)

    def init_hidden(self, batch_size):
        if self.birnn:
            return (Variable(torch.zeros(self.birnn * self.num_layers, batch_size, self.hidden_dim)),
                    Variable(torch.zeros(self.birnn * self.num_layers, batch_size, self.hidden_dim)))

    def forward(self, frames, flengths):
        """Handles variable size frames
           frame_embed: video features
           flengths: frame lengths
        """
        batch_size = flengths.shape[0]
        self.init_rnn = self.init_hidden(batch_size)
        if self.enable_cuda:
            self.init_rnn = self.init_rnn[0].cuda(), self.init_rnn[1].cuda()

        # Sort the batch by descending length, as pack_padded_sequence requires.
        if batch_size > 1:
            flengths, idx_sort = np.sort(flengths)[::-1], np.argsort(-flengths)
            if self.enable_cuda:
                frames = frames.index_select(0, Variable(torch.cuda.LongTensor(idx_sort)))
            else:
                frames = frames.index_select(0, Variable(torch.LongTensor(idx_sort)))

        frames = self.linear(frames)
        frame_packed = nn.utils.rnn.pack_padded_sequence(frames, flengths, batch_first=True)
        outputs, (ht, ct) = self.rnn(frame_packed, self.init_rnn)
        outputs, _ = pad_packed_sequence(outputs, batch_first=True)

        # Restore the original batch order after the RNN pass.
        if batch_size > 1:
            idx_unsort = np.argsort(idx_sort)
            if self.enable_cuda:
                outputs = outputs.index_select(0, Variable(torch.cuda.LongTensor(idx_unsort)))
            else:
                outputs = outputs.index_select(0, Variable(torch.LongTensor(idx_unsort)))

        return outputs
```
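As the standalone illustration promised above, here is a hedged toy example of packing a padded batch before an LSTM (all sizes invented):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 3 and 2, padded to length 3 (batch_first=True).
batch = torch.randn(2, 3, 8)
lengths = torch.tensor([3, 2])   # must be sorted in descending order

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed = pack_padded_sequence(batch, lengths, batch_first=True)
out_packed, _ = rnn(packed)          # the LSTM never sees the pad timestep
out, out_lens = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape, out_lens)           # torch.Size([2, 3, 16]) tensor([3, 2])
```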
One more note: the repo's beam search and reinforcement learning code is also very readable, which makes it useful both for studying the underlying ideas and for building later improvements on top of.
Model optimization

This part looks much the same across projects. Remember to call model.train() during training and model.eval() during evaluation, and pay attention to the inputs of the cross-entropy loss: the decoder's logits and the packed target word ids (see the shape example after the training loop below).
```python
def train_model(self):
    total_loss = 0
    model = self.model
    model.train()   # enable training-mode behavior (dropout etc.)
    batcher = self.train_data.get_batcher()
    for step in range(0, self.train_data.num_steps):
        batch = next(batcher)
        video_features = batch.get('video_batch')
        flengths = batch.get('video_len_batch')
        captions = batch.get('caption_batch')
        clengths = batch.get('caption_len_batch')
        video_features = to_var(self.args, video_features)
        captions = to_var(self.args, captions)

        outputs = self.model(video_features, flengths, captions, clengths)
        # Pack the padded captions so the loss ignores pad positions.
        targets = pack_padded_sequence(captions, clengths, batch_first=True)[0]
        loss = self.ce(outputs, targets)

        self.optim.zero_grad()
        loss.backward()
        # clip_grad_norm is the pre-1.0 PyTorch name; newer versions use clip_grad_norm_.
        t.nn.utils.clip_grad_norm(model.parameters(), self.args.grad_clip)
        self.optim.step()

        total_loss += loss.data
        # loss.data[0] is old-PyTorch style; on recent versions use loss.item().
        pbar.set_description(f"train_model| loss: {loss.data[0]:5.3f}")
        step += 1
        self.step += 1
```
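To make "the inputs of the cross-entropy loss" concrete, here is a hedged sketch of the shapes involved (the vocabulary size and caption lengths are invented):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

vocab_size = 100
# Two captions of lengths 4 and 3, padded to length 4.
captions = torch.randint(0, vocab_size, (2, 4))
lengths = [4, 3]

# Packing drops the pad positions: 4 + 3 = 7 real timesteps.
targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

# The decoder is expected to emit one logit vector per real timestep.
outputs = torch.randn(7, vocab_size)

loss = nn.CrossEntropyLoss()(outputs, targets)   # logits vs. class indices
print(targets.shape, loss.item())                # torch.Size([7]) ...
```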