This post walks through a simple but effective baseline for video captioning; the GitHub repo is https://github.com/ramakanth-pasunuru/video_captioning_rl. The baseline is built on PyTorch and consists of four main modules: data preprocessing, data loading, network construction, and model optimization. Although the code uses an LSTM, the overall pipeline is the same if you swap in the now-mainstream Transformer.
Data preprocessing

This stage covers video feature extraction and vocabulary construction. For the video side, two kinds of features are usually extracted: a per-frame appearance feature and a motion feature (per-frame object features can be added as well). This baseline uses a pretrained ResNet-152 for appearance features and a ResNeXt-101 for motion features; if needed, you can extract object features yourself with a detection network. See the GitHub repo for the feature downloads.

For the text side, preprocessing counts word frequencies in the corpus and filters them, e.g. dropping words that occur fewer than 2 times, and adds start, end, and unknown tokens that mark the beginning of a sentence, its end, and out-of-vocabulary words. The key intuition to build is that each word maps to exactly one integer id. For example, if the word "love" maps to id 5 and the vocabulary size is 100, then ids run from 0 to 99, the model outputs a 100-dimensional vector, and after softmax we want the 6th component (counting from 0) to be the largest. See one-hot encoding for the details. The code below builds the word-to-id mapping.
```python
# Special-token constants (string values assumed here; see the repo for the exact definitions).
PAD_TOKEN = '[PAD]'
UNKNOWN_TOKEN = '[UNK]'
START_DECODING = '[START]'
STOP_DECODING = '[STOP]'

class Vocab(object):
    def __init__(self, vocab_file, max_size):
        """Creates a vocab of up to max_size words, reading from the vocab_file.
        If max_size is 0, reads the entire vocab file.

        Args:
          vocab_file: path to the vocab file, which is assumed to contain
            "<word> <frequency>" on each line, sorted with most frequent word
            first. This code doesn't actually use the frequencies, though.
          max_size: integer. The maximum size of the resulting Vocabulary.
        """
        self._word_to_id = {}
        self._id_to_word = {}
        self._count = 0
        # Reserve the first ids for the special tokens.
        for w in [PAD_TOKEN, START_DECODING, STOP_DECODING, UNKNOWN_TOKEN]:
            self._word_to_id[w] = self._count
            self._id_to_word[self._count] = w
            self._count += 1
        # (The loop that reads vocab_file and fills in the real words is omitted in this excerpt.)

    def word2id(self, word):
        """Returns the id (integer) of a word (string). Returns [UNK] id if word is OOV."""
        if word not in self._word_to_id:
            return self._word_to_id[UNKNOWN_TOKEN]
        return self._word_to_id[word]

    def id2word(self, word_id):
        """Returns the word (string) corresponding to an id (integer)."""
        if word_id not in self._id_to_word:
            raise ValueError('Id not found in vocab: %d' % word_id)
        return self._id_to_word[word_id]
```
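To tie the word-to-id mapping back to the one-hot intuition above, here is a small hedged sketch; the vocabulary size and the id of "love" are the made-up numbers from the text:

```python
import torch

vocab_size = 100          # toy vocabulary: ids 0..99
love_id = 5               # assume the word "love" was assigned id 5

# One-hot encoding: a 100-dim vector whose component at index 5 is 1.
one_hot = torch.zeros(vocab_size)
one_hot[love_id] = 1.0

# During training the model emits 100 logits per step; after softmax we
# want the probability mass to peak at index 5 when the target is "love".
logits = torch.randn(vocab_size)
probs = torch.softmax(logits, dim=0)
```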
Data loading

Consider a practical scenario: how do you load a huge file from disk into memory? You need a function that reads only part of the file at a time while remembering where it left off, and a generator implements exactly this behavior.
```python
def file_reader(fp, block_size=1024 * 8):
    """Generator function: read the file contents in chunks."""
    while True:
        chunk = fp.read(block_size)
        if not chunk:
            break
        yield chunk
```
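A hedged usage sketch of the generator ('data.bin' is a placeholder path):

```python
# Hypothetical usage: stream a large file without loading it all at once.
with open('data.bin', 'rb') as fp:
    total = 0
    for chunk in file_reader(fp):
        total += len(chunk)   # process each 8 KB chunk here
    print(f'read {total} bytes')
```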
During training, data is fed to the network in batches, so the data-loading code leans on iterators and generators. A common data-loading skeleton looks like this:
```python
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, path):
        pass

    def __getitem__(self, item):
        pass

    def __len__(self):
        pass

dataset = MyDataset(path)
dataloader = DataLoader(dataset, ...)

# Idiom 1: a plain for loop over the loader.
for data in dataloader:
    ...

# Idiom 2: manual iteration, which is what the for loop does under the hood.
iterr = iter(dataloader)
while True:
    try:
        next(iterr)
    except StopIteration:
        break
```
The DataLoader source code is fairly involved and, in my view, not worth studying line by line; understanding the process is enough. The following summary of how data is loaded from a DataLoader is quoted from https://zhuanlan.zhihu.com/p/30934236:
1. Calling the dataloader's __iter__() method produces a DataLoaderIter.
2. The DataLoaderIter's __next__() is called repeatedly to obtain batches. Concretely, it calls the dataset's __getitem__() many times (in multiple workers when num_workers > 0), then packs the samples into a batch with collate_fn. Shuffling and the sampler also come into play in between, which we won't go into here.
3. When the data is exhausted, next() raises a StopIteration exception, the for loop ends, and the dataloader is spent.
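The three steps above can be traced with a minimal runnable toy (all names and sizes are invented for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Ten (feature, label) pairs, just to trace the DataLoader flow."""
    def __init__(self):
        self.x = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        self.y = torch.arange(10)

    def __getitem__(self, idx):      # called repeatedly by the iterator
        return self.x[idx], self.y[idx]

    def __len__(self):               # lets DataLoader know when to stop
        return len(self.x)

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)
it = iter(loader)                    # step 1: __iter__ produces the iterator
while True:
    try:
        x, y = next(it)              # step 2: collate_fn stacks samples into a batch
        print(x.shape, y.shape)      # torch.Size([4, 1]) ... the last batch has 2
    except StopIteration:            # step 3: data exhausted, loop ends
        break
```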
You can also write a data loader by hand. For example, MSRVTTBatcher's get_batcher builds a generator with the yield keyword, so batches can be pulled iteratively with next() (a minimal sketch follows the class skeleton below).
```python
import numpy as np

class MSRVTTBatcher(object):
    def __init__(self, hps, mode, vocab):
        ...  # (attribute setup omitted in this excerpt)

    def _process_data(self):
        """This module extracts data from videos and caption files and creates batches."""
        ...
        if self._mode == 'train':
            np.random.shuffle(data)
        else:
            # For val/test, keep one entry per video while preserving order.
            data, _ = zip(*data)
            data = sorted(set(data), key=data.index)
        return data, data_dict

    def sort_based_on_caption_lengths(self, video_batch, video_len_batch, video_id,
                                      caption_batch, caption_len_batch, original_caption):
        ...

    def get_batcher(self):
        """This module processes data and creates batches for train/val/test.
        It also acts as a generator."""
        ...
        yield batch
```
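The sketch promised above: a hypothetical, stripped-down batcher showing how yield turns a function into a generator that next() can drive (simple_batcher and its arguments are invented for illustration):

```python
import numpy as np

def simple_batcher(data, batch_size):
    """Hypothetical stand-in for get_batcher: shuffle once, then yield
    fixed-size batches so the caller can pull them with next()."""
    indices = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

batcher = simple_batcher(list(range(10)), batch_size=4)
print(next(batcher))   # first batch of 4 shuffled items
print(next(batcher))   # next batch; StopIteration once data is exhausted
```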
Network construction

A neural network is, at bottom, matrix arithmetic: each formula in the paper corresponds one-to-one with code. A typical network definition and invocation looks like the following. Note that calling the network never seems to invoke forward explicitly. This is because executing model(data) actually executes model.__call__(data); since Model inherits from nn.Module, the parent class's magic method __call__ runs, and it calls forward along with a number of hook functions.
```python
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
model(data)
```
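To make the __call__-to-forward dispatch concrete, here is a minimal hedged sketch; the input shape is made up to fit the two conv layers, and the hook just shows what __call__ runs in addition to forward:

```python
import torch

model = Model()
data = torch.randn(1, 1, 28, 28)   # toy input matching conv1's in_channels=1

# Both lines run the same computation, but only the first goes through
# nn.Module.__call__, which also fires any registered hooks.
out1 = model(data)
out2 = model.forward(data)

# A forward hook demonstrates the extra machinery inside __call__.
hook = model.conv1.register_forward_hook(
    lambda module, inputs, output: print('conv1 output:', output.shape))
model(data)    # prints: conv1 output: torch.Size([1, 20, 24, 24])
hook.remove()
```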
For a captioning task, as mentioned in the theory part, the model generally splits into an encoder and a decoder, and the code mirrors this: a single class usually orchestrates both.
```python
class Seq2seqAttention(nn.Module):
    def __init__(self, args):
        super(Seq2seqAttention, self).__init__()
        self.args = args
        self.enable_cuda = args.cuda
        self.vid_dim = args.vid_dim
        self.embed_size = args.embed
        self.hidden_dim = args.hid
        self.vocab_size = args.max_vocab_size
        self.num_layers = args.num_layers
        self.birnn = args.birnn

        self.encoder = EncoderFrames(self.args)
        self.decoder = DecoderRNN(self.args)

    def forward(self, frames, flengths, captions, lengths):
        video_features = self.encoder(frames, flengths)
        outputs = self.decoder(video_features, flengths, captions, lengths)
        return outputs
```
Note in particular that the encoder uses pack_padded_sequence, a trick that keeps the RNN from processing padded timesteps and thus reduces padding noise (a standalone toy example follows the encoder code below).
```python
class EncoderFrames(nn.Module):
    def __init__(self, args):
        super(EncoderFrames, self).__init__()
        self.vid_dim = args.vid_dim
        self.embed_size = args.embed
        self.hidden_dim = args.hid
        self.enable_cuda = args.cuda
        self.num_layers = args.num_layers
        self.args = args
        if args.birnn:
            self.birnn = 2
        else:
            self.birnn = 1
        # Project raw video features down to the embedding size.
        self.linear = nn.Linear(self.vid_dim, self.embed_size, bias=False)
        self.rnn = nn.LSTM(self.embed_size, self.hidden_dim, self.num_layers,
                           batch_first=True, bidirectional=self.args.birnn,
                           dropout=args.dropout)
        self.dropout = nn.Dropout(args.dropout)
        self.init_weights()

    def init_weights(self):
        self.rnn.weight_hh_l0.data.uniform_(-0.08, 0.08)
        self.rnn.weight_ih_l0.data.uniform_(-0.08, 0.08)
        self.rnn.bias_ih_l0.data.fill_(0)
        self.rnn.bias_hh_l0.data.fill_(0)
        self.linear.weight.data.uniform_(-0.08, 0.08)

    def init_hidden(self, batch_size):
        if self.birnn:
            return (Variable(torch.zeros(self.birnn * self.num_layers, batch_size, self.hidden_dim)),
                    Variable(torch.zeros(self.birnn * self.num_layers, batch_size, self.hidden_dim)))

    def forward(self, frames, flengths):
        """Handles variable size frames
           frame_embed: video features
           flengths: frame lengths
        """
        batch_size = flengths.shape[0]
        self.init_rnn = self.init_hidden(batch_size)
        if self.enable_cuda:
            self.init_rnn = self.init_rnn[0].cuda(), self.init_rnn[1].cuda()

        # Sort the batch by descending length, as pack_padded_sequence requires.
        if batch_size > 1:
            flengths, idx_sort = np.sort(flengths)[::-1], np.argsort(-flengths)
            if self.enable_cuda:
                frames = frames.index_select(0, Variable(torch.cuda.LongTensor(idx_sort)))
            else:
                frames = frames.index_select(0, Variable(torch.LongTensor(idx_sort)))

        frames = self.linear(frames)
        frame_packed = nn.utils.rnn.pack_padded_sequence(frames, flengths, batch_first=True)
        outputs, (ht, ct) = self.rnn(frame_packed, self.init_rnn)
        outputs, _ = pad_packed_sequence(outputs, batch_first=True)

        # Restore the original batch order after the RNN pass.
        if batch_size > 1:
            idx_unsort = np.argsort(idx_sort)
            if self.enable_cuda:
                outputs = outputs.index_select(0, Variable(torch.cuda.LongTensor(idx_unsort)))
            else:
                outputs = outputs.index_select(0, Variable(torch.LongTensor(idx_unsort)))

        return outputs
```
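As the standalone illustration promised above, here is a hedged toy example of packing a padded batch before an LSTM (all sizes invented):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 3 and 2, padded to length 3 (batch_first=True).
batch = torch.randn(2, 3, 8)
lengths = torch.tensor([3, 2])   # must be sorted in descending order

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed = pack_padded_sequence(batch, lengths, batch_first=True)
out_packed, _ = rnn(packed)          # the LSTM never sees the pad timestep
out, out_lens = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape, out_lens)           # torch.Size([2, 3, 16]) tensor([3, 2])
```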
One more note: the repo's beam search and reinforcement learning code is also very readable, which makes it useful both for studying the underlying ideas and for building later improvements on top of.
Model optimization

This part looks much the same across projects. Remember to call model.train() during training and model.eval() during evaluation, and pay attention to the inputs of the cross-entropy loss: the decoder's logits and the packed target word ids (see the shape example after the training loop below).
```python
def train_model(self):
    total_loss = 0
    model = self.model
    model.train()   # enable training-mode behavior (dropout etc.)
    batcher = self.train_data.get_batcher()
    for step in range(0, self.train_data.num_steps):
        batch = next(batcher)
        video_features = batch.get('video_batch')
        flengths = batch.get('video_len_batch')
        captions = batch.get('caption_batch')
        clengths = batch.get('caption_len_batch')
        video_features = to_var(self.args, video_features)
        captions = to_var(self.args, captions)

        outputs = self.model(video_features, flengths, captions, clengths)
        # Pack the padded captions so the loss ignores pad positions.
        targets = pack_padded_sequence(captions, clengths, batch_first=True)[0]
        loss = self.ce(outputs, targets)

        self.optim.zero_grad()
        loss.backward()
        # clip_grad_norm is the pre-1.0 PyTorch name; newer versions use clip_grad_norm_.
        t.nn.utils.clip_grad_norm(model.parameters(), self.args.grad_clip)
        self.optim.step()

        total_loss += loss.data
        # loss.data[0] is old-PyTorch style; on recent versions use loss.item().
        pbar.set_description(f"train_model| loss: {loss.data[0]:5.3f}")
        step += 1
        self.step += 1
```
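To make "the inputs of the cross-entropy loss" concrete, here is a hedged sketch of the shapes involved (the vocabulary size and caption lengths are invented):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

vocab_size = 100
# Two captions of lengths 4 and 3, padded to length 4.
captions = torch.randint(0, vocab_size, (2, 4))
lengths = [4, 3]

# Packing drops the pad positions: 4 + 3 = 7 real timesteps.
targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

# The decoder is expected to emit one logit vector per real timestep.
outputs = torch.randn(7, vocab_size)

loss = nn.CrossEntropyLoss()(outputs, targets)   # logits vs. class indices
print(targets.shape, loss.item())                # torch.Size([7]) ...
```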