
A Mask-RCNN based object detection and captioning framework for industrial videos

Journal: International Journal of Advanced Technology and Engineering Exploration (IJATEE) (Vol.8, No. 84)


Page : 1466-1478

Keywords: Object detection; Mask-RCNN; Video captioning; Video analysis; Image captioning


Abstract

Video analysis of surveillance footage is a tiresome and burdensome activity for a human. Automating the analysis of surveillance videos, specifically industrial videos, could be very useful for productivity analysis, assessing the availability of raw materials and finished goods, fault detection, report generation, etc. To accomplish this task, we propose a video captioning and reporting method. In video captioning, summaries are generated in understandable language that describe the video; these descriptions are produced by understanding the events and objects present in it. The method presented in this paper constructs a captioned video summary comprising frames and their descriptions. First, frames are extracted from the video by uniform sampling, which reduces the task of video captioning to image captioning. Then, a Mask Region-based Convolutional Neural Network (Mask-RCNN) is used to detect objects such as raw materials, products, and humans in the sampled frames. Next, a template-based sentence generation method is applied to obtain the image captions. Finally, a report is generated outlining the products present and details relating to production, such as the duration for which a product is present, the number of products detected, and the presence of an operator at the workstation. This framework can greatly help in bookkeeping, performing day-wise work analysis, keeping track of employees in a labor-intensive industry or factory, remote monitoring, etc., thereby reducing the human effort of video analysis. On the object classes of the created dataset, we obtained an average confidence score of 0.8975 and an average accuracy of 95.62%. Moreover, as the captions are template-based, the generated sentences are grammatically and semantically correct.
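The template-based captioning and report-generation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-frame detection format, the sentence template, and the report fields are all assumptions standing in for the paper's actual Mask-RCNN output and templates.

```python
from collections import Counter

# Hypothetical per-frame detections, in the shape a Mask-RCNN detector
# might produce after sampling one frame every 30 frames:
# {sampled_frame_index: [(class_label, confidence_score), ...]}
frame_detections = {
    0:  [("operator", 0.97), ("product", 0.91)],
    30: [("product", 0.93), ("product", 0.88)],
    60: [("raw_material", 0.90)],
}

def caption_frame(detections):
    """Fill a fixed sentence template with the detected object counts."""
    if not detections:
        return "No objects are visible in this frame."
    counts = Counter(label for label, _ in detections)
    parts = [f"{n} {label}(s)" for label, n in sorted(counts.items())]
    return "The frame contains " + ", ".join(parts) + "."

def generate_report(all_detections, fps=30, sample_interval=30):
    """Aggregate per-frame detections into production details
    (field names here are illustrative assumptions)."""
    seconds_per_sample = sample_interval / fps
    product_frames = sum(
        1 for dets in all_detections.values()
        if any(label == "product" for label, _ in dets))
    operator_frames = sum(
        1 for dets in all_detections.values()
        if any(label == "operator" for label, _ in dets))
    return {
        "product_present_seconds": product_frames * seconds_per_sample,
        "operator_present_frames": operator_frames,
        "max_products_in_frame": max(
            (sum(1 for label, _ in dets if label == "product")
             for dets in all_detections.values()),
            default=0),
    }

for idx, dets in frame_detections.items():
    print(f"frame {idx}: {caption_frame(dets)}")
print(generate_report(frame_detections))
```

Because the captions come from a fixed template filled with detected class counts, they are grammatically well-formed by construction, which matches the abstract's claim about template-based generation.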

Last modified: 2022-01-10 22:05:30