M2: Meshed-Memory Transformer for Image Captioning

Meshed-Memory Transformer (M2) is a state-of-the-art framework for image captioning. In 2017, Google Brain published a paper called “Attention is all you need” [1], which changed how researchers process visual and textual information, construct multi-modal models, and build sequence-to-sequence structures. The paper ended the dominance of RNNs and LSTMs in handling sequential data. It also gave AI researchers a flexible structure that can be adapted to a wide range of problems.

Since the Transformer was introduced, many variants have appeared in quick succession. In early 2020, Marcella Cornia and her colleagues at Unimore developed a model built on the Transformer of Ashish Vaswani et al. [1]. They described it in their paper on M2, the Meshed-Memory Transformer [7], a state-of-the-art model for generating descriptions of images.

Before jumping into the paper, let’s recap the knowledge behind the model: the Transformer.


Transformer Architecture 

The Transformer [1] follows the sequence-to-sequence design, with two main parts: an Encoder and a Decoder. Each part is a stack of layers built from Multi-Head Attention (MHA), Feed-Forward (FF), and layer-normalization (LN) sublayers. The authors of the Transformer also introduced Positional Encoding and masked MHA to handle the sequential nature of the inputs: the former injects the order of the tokens, and the latter stops the decoder from attending to future positions.
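To make the attention building block concrete, here is a minimal single-head sketch of scaled dot-product attention with an optional causal mask, written in PyTorch. The function names scaled_dot_product_attention and causal_mask are my own, and the tensor sizes are arbitrary; a full MHA layer would add learned projections and several heads on top of this.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(q k^T / sqrt(d_k)) v, with an optional boolean mask.

    q: (batch, n_q, d_k), k: (batch, n_k, d_k), v: (batch, n_k, d_v).
    mask: broadcastable to (batch, n_q, n_k); True marks positions to hide.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Hidden positions get -inf so that softmax assigns them zero weight
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


def causal_mask(n):
    """Upper-triangular mask: position i may not attend to positions j > i."""
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, sequence length, feature size)
out, weights = scaled_dot_product_attention(x, x, x, mask=causal_mask(5))
print(out.shape)      # (2, 5, 16)
print(weights[0, 0])  # the first token can only attend to itself
```

With the causal mask, every row of the attention weights still sums to one, but all weight above the diagonal is zero. This is what masked MHA relies on in the decoder.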

I will not go into the details of the Transformer, because Jay Alammar has already explained them well in his blog post [2]. If you want to get your hands dirty with some code, I recommend starting from the PyTorch implementation [3]. You can find many other reference implementations on GitHub too; however, be careful, because each one was written for a specific problem. Even a repository with a thousand stars that was built for language translation may not fit an image captioning task. I learned this the hard way, and I want you to avoid the same pain.


Fig. 1: Transformer’s structure [Vaswani et al., 2017]


  • Transformer from language to vision 

As mentioned above, the Transformer was introduced in 2017 as an innovative solution for language translation. Deep learning communities around the world were soon enthralled by this powerful model. Many data scientists have since adapted the original structure to build models for image captioning, and they found that these models worked like a charm.

The Transformer has been applied to image captioning in several ways. Some works replaced the Encoder block with an image feature extractor such as ResNet-101 or Faster R-CNN, as Xinxin Zhu et al. did in “Captioning Transformer with Stacked Attention Modules” in 2018 [4]. Another variant was published in “Image Caption Generation with Adaptive Transformer” by Wei Zhang et al., 2019 [5]. Zhang kept the original structure of the Transformer but replaced the input of the Encoder with image features and converted vanilla attention to additive attention. “Entangled Transformer For Image Captioning” by Guang Li et al. at ICCV 2019 [6] is another exciting model that is worth a look.
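As a rough illustration of the "image features as Encoder input" idea, the sketch below projects region features to the model width and passes them through PyTorch's built-in nn.TransformerEncoder. It is not the code of any of the papers above: the feature size, model width, number of layers, and the random stand-in features are all assumptions chosen for the example.

```python
import torch
from torch import nn

d_feat, d_model = 2048, 256  # assumed sizes: CNN feature size and model width

# Project each region feature to the model width, then encode the set of regions
project = nn.Linear(d_feat, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

regions = torch.randn(4, 49, d_feat)  # random stand-in for 7x7 grid features of 4 images
memory = encoder(project(regions))    # (4, 49, d_model), one encoded vector per region
print(memory.shape)
```

A caption decoder would then attend to memory in the same way a translation decoder attends to the encoded source sentence.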

Now, I would like to share with you the reason why M2 is the state-of-the-art framework for Image Captioning.


M2: Meshed-Memory Transformer 

Fig. 2: Meshed-Memory Transformer architecture [Cornia et al., CVPR 2020]


The authors of M2 presented two changes that improved the performance of the model: a Memory-Augmented Encoder and a Meshed Decoder.


  • Memory Augmented Encoder

As Marcella Cornia and her colleagues point out, self-attention, the heart of the Transformer, models pairwise relationships inside the input set. Because it depends only on those pairwise similarities, it is limited in modeling a priori knowledge about relationships between image regions. For instance, given one region encoding a dog and another encoding a cat, it would be difficult to infer the concept of chasing without any a priori knowledge. To handle this issue, the authors proposed the Memory-Augmented Encoder, which extends the set of keys and values in the Encoder with additional “slots” that encode a priori information. This information does not depend on the input set X; it is encoded in plain learnable vectors, which are concatenated to the keys and values and can be directly updated via SGD.

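The attention module of the Encoder, i.e. the standard self-attention operator

$$S(X) = \mathrm{Attention}(W_q X, W_k X, W_v X)$$

was adjusted to

$$M_{mem}(X) = \mathrm{Attention}(W_q X, K, V), \quad K = [W_k X, M_k], \quad V = [W_v X, M_v]$$

$M_k$ and $M_v$ are the learnable matrices that hold the a priori knowledge, and [. , .] indicates concatenation. Through the attention mechanism, these learnable keys and values can retrieve learned knowledge that is not embedded in X. The authors left the queries unaltered in order to avoid hallucinations.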

Other modules in the Encoder are the same as in the original Transformer’s Encoder.
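The memory-augmented attention can be sketched in a few lines of PyTorch. This is a simplified single-head version for illustration only, not the authors' implementation: the class name MemoryAugmentedAttention, the number of memory slots, and the initialisation scale are my own assumptions.

```python
import math
import torch
from torch import nn


class MemoryAugmentedAttention(nn.Module):
    """Single-head self-attention whose keys and values are extended with
    learnable memory slots (illustrative sketch, not the authors' code)."""

    def __init__(self, d_model=512, n_memory=40):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Memory keys and values: plain learnable matrices, independent of the input.
        # The initialisation scale is an assumption.
        self.m_k = nn.Parameter(torch.randn(n_memory, d_model) / math.sqrt(d_model))
        self.m_v = nn.Parameter(torch.randn(n_memory, d_model) / math.sqrt(d_model))

    def forward(self, x):
        # x: (batch, n_regions, d_model)
        batch = x.size(0)
        q = self.w_q(x)  # queries come from the input only
        # Concatenate the memory slots to the projected keys and values
        k = torch.cat([self.w_k(x), self.m_k.unsqueeze(0).expand(batch, -1, -1)], dim=1)
        v = torch.cat([self.w_v(x), self.m_v.unsqueeze(0).expand(batch, -1, -1)], dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = torch.softmax(scores, dim=-1)  # (batch, n_regions, n_regions + n_memory)
        return weights @ v, weights


torch.manual_seed(0)
attn = MemoryAugmentedAttention(d_model=64, n_memory=8)
x = torch.randn(2, 49, 64)
out, weights = attn(x)
print(out.shape, weights.shape)  # (2, 49, 64) and (2, 49, 57)
```

Each region now distributes its attention over the 49 regions plus the 8 memory slots, and because the slots are nn.Parameter objects, they receive gradients and are updated by the optimizer like any other weight.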


  • Meshed Decoder

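Unlike the original decoder block in [1], which only performs cross-attention between the last encoding layer and the decoding layers, M2 connects each decoding layer to all encoding layers in a meshed fashion. The contributions of the encoding layers are modulated and then summed. The meshed attention operator is defined as

$$M_{mesh}(\tilde{X}, Y) = \sum_{i=1}^{N} \alpha_i \odot C(\tilde{X}^i, Y)$$

where $\tilde{X} = (\tilde{X}^1, \dots, \tilde{X}^N)$ are the outputs of the N encoding layers and Y is the input of the decoding layer. C( . , .) stands for the cross-attention computed using keys and values from the encoder and queries from the decoder. The calculation of C is

$$C(\tilde{X}^i, Y) = \mathrm{Attention}(W_q Y, W_k \tilde{X}^i, W_v \tilde{X}^i)$$

$\alpha_i$ is a matrix of weights that measures the relevance between the result of the cross-attention computed with each encoding layer and the input query.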
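The meshed operator can be sketched as follows. In the paper [7], the weights for encoding layer i are computed as a sigmoid of a linear map applied to the concatenation of the query input and the cross-attention result. The code below follows that idea in a simplified single-head form; the class name MeshedCrossAttention, the helper attention, and all sizes are my own assumptions, and this is not the authors' implementation.

```python
import math
import torch
from torch import nn


def attention(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


class MeshedCrossAttention(nn.Module):
    """Gated sum of cross-attentions over all encoder layers (illustrative sketch)."""

    def __init__(self, d_model=512, n_enc_layers=3):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # One gate per encoder layer, applied to the concatenation [Y, C(X^i, Y)]
        self.gates = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in range(n_enc_layers)]
        )

    def forward(self, enc_outputs, y):
        # enc_outputs: list of tensors (batch, n_regions, d_model), one per encoder layer
        # y: (batch, n_words, d_model)
        q = self.w_q(y)
        out = torch.zeros_like(y)
        for x_i, gate in zip(enc_outputs, self.gates):
            c_i = attention(q, self.w_k(x_i), self.w_v(x_i))            # cross-attention
            alpha_i = torch.sigmoid(gate(torch.cat([y, c_i], dim=-1)))  # relevance weights in (0, 1)
            out = out + alpha_i * c_i                                   # element-wise modulation, then sum
        return out


torch.manual_seed(0)
mesh = MeshedCrossAttention(d_model=32, n_enc_layers=3)
enc_outputs = [torch.randn(2, 49, 32) for _ in range(3)]
y = torch.randn(2, 7, 32)
out = mesh(enc_outputs, y)
print(out.shape)  # (2, 7, 32)
```

Because the gates are learned, the decoder can decide, per word and per feature, how much to rely on low-level versus high-level encoder layers.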


The points above are the main contributions of the M2 Meshed-Memory Transformer. The paper [7] also presents several interesting implementation details. I strongly recommend reading it carefully and then trying out the source code provided by the authors on GitHub.


Image Regions Extraction

In the source code provided on GitHub, the authors use detection features computed with the code provided by [8]. I wrote a short module that extracts grid-region features instead, based on the idea of the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”. The PyTorch code is:

import torch
import torchvision
from torch import nn


class Resnet_50(nn.Module):
    """ResNet-50 backbone that extracts grid-region features from images."""

    def __init__(self, encoded_image_size=14, embed_dim=512):
        super(Resnet_50, self).__init__()

        self.embed_dim = embed_dim

        # ImageNet-pretrained ResNet-50 (newer torchvision releases prefer the `weights=` argument)
        resnet = torchvision.models.resnet50(pretrained=True)

        # Remove linear and pool layers (since we're not doing classification)
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        # Resize feature maps to a fixed size to allow input images of variable size
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        # Fine Tuning
        self.fine_tune()

    def forward(self, images):
        """
        Forward propagation.
        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images, a tensor of dimensions (batch_size, encoded_image_size ** 2, 2048)
        """
        # Output of the last convolutional block
        out = self.resnet(images)  # (batch_size, 2048, image_size/32, image_size/32)
        out = self.adaptive_pool(out)  # (batch_size, 2048, encoded_image_size, encoded_image_size)

        # Flatten the spatial grid so that each cell becomes one region feature
        out = torch.flatten(out, start_dim=2, end_dim=3)  # (batch_size, 2048, encoded_image_size ** 2)
        out = out.permute(0, 2, 1)  # (batch_size, encoded_image_size ** 2, 2048)

        return out

    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for convolutional blocks 2 to 4 of the encoder.
        :param fine_tune: Allow?
        """
        # Freeze the whole backbone first
        for p in self.resnet.parameters():
            p.requires_grad = False

        # If fine-tuning, only fine-tune convolutional blocks 2 through 4
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune


[1]. “Attention is all you need”, Ashish Vaswani et al., 2017

[2]. “Illustrated Transformer” by Jay Alammar

[3]. Implementation of the Transformer in PyTorch

[4]. “Captioning Transformer with Stacked Attention Modules”, Xinxin Zhu, Appl. Sci. 2018

[5]. “Image Caption Generation with Adaptive Transformer”, Wei Zhang, 2019

[6]. “Entangled Transformer For Image Captioning”, Guang Li, ICCV 2019

[7]. “M2: Meshed-Memory Transformer for Image Captioning”, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara, CVPR 2020

[8]. “Bottom-up and top-down attention for image captioning and visual question answering”, P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018

[9]. The transformer was also summarized here:


Author: Tony (Senior AI Engineer, Pixta Vietnam)
