A Study of ConvNeXt Architectures for Enhanced Image Captioning

  • Leo Ramos*
  • Edmundo Casas
  • Cristian Romero
  • Francklin Rivas-Echeverria
  • Manuel Eugenio Morocho-Cayamcela

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

36 Citations (Scopus)

Abstract

This study explores the effectiveness of the ConvNeXt model, an advanced computer vision architecture, in the task of image captioning. We integrated ConvNeXt with a Long Short-Term Memory network that includes a visual attention module, focusing on assessing its performance across different scenarios. Experiments were conducted with various ConvNeXt versions for feature extraction, with different learning rates during training, and with and without teacher forcing. The MS COCO 2014 dataset was employed, with top-5 accuracy and BLEU metrics used to evaluate performance. The implementation of ConvNeXt in image captioning systems reveals notable performance enhancements. In terms of BLEU-4 scores, ConvNeXt outperformed existing benchmarks by 43.04% for models using soft attention and by 39.04% for those with hard-attention mechanisms. Furthermore, ConvNeXt surpassed models based on vision transformers and data-efficient image transformers by 4.57% and 0.93%, respectively, in BLEU-4 scores. When compared with systems using encoders such as ResNet-101, ResNet-152, VGG-16, ResNeXt-101, and MobileNet V3, ConvNeXt achieved higher top-5 accuracy improvements of 6.44%, 6.46%, 6.47%, 6.39%, and 6.68%, and reduced loss by 18.46%, 18.44%, 18.46%, 18.24%, and 18.72%, respectively.
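The pipeline the abstract describes pairs a ConvNeXt encoder with an LSTM decoder whose visual attention module softly weights spatial image features at each decoding step. Below is a minimal sketch of one such soft-attention step; the dimensions, projection matrices, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """One soft-attention step: score each spatial feature against the
    decoder hidden state, softmax the scores, and return the attention
    weights plus the weighted context vector fed to the LSTM."""
    # features: (num_regions, feat_dim) -- e.g. a flattened encoder feature map
    # hidden:   (hid_dim,)              -- current LSTM hidden state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (num_regions,)
    scores = scores - scores.max()                       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()        # attention distribution
    context = alpha @ features                           # (feat_dim,)
    return context, alpha

# Hypothetical sizes: 49 regions (7x7 feature map), 768-d features, 512-d LSTM.
rng = np.random.default_rng(0)
num_regions, feat_dim, hid_dim, att_dim = 49, 768, 512, 256
features = rng.standard_normal((num_regions, feat_dim))
hidden = rng.standard_normal(hid_dim)
W_f = rng.standard_normal((feat_dim, att_dim)) * 0.01
W_h = rng.standard_normal((hid_dim, att_dim)) * 0.01
v = rng.standard_normal(att_dim) * 0.01

context, alpha = soft_attention(features, hidden, W_f, W_h, v)
print(alpha.sum())    # attention weights sum to 1 over regions
print(context.shape)  # (768,)
```

Soft attention takes this expected (weighted-average) context, whereas hard attention samples a single region from the distribution `alpha`; the teacher-forcing comparison in the study concerns whether the decoder receives the ground-truth previous word or its own prediction at each step during training.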

Original language: English
Pages (from-to): 13711-13728
Number of pages: 18
Journal: IEEE Access
Volume: 12
DOI
State: Published - 2024
Published externally

Bibliographical note

Publisher Copyright:
© 2013 IEEE.
