Robots are a useful testbed for applying intelligent systems in the current era of Industry 4.0. With such systems, robots can interact with humans through perception sensors such as cameras, and this interaction is expected to help robots provide more reliable and efficient service. In this research, the robot collects data from a camera and processes it with a Convolutional Neural Network (CNN), an approach chosen for the adaptability of CNNs in recognizing visual input. The robot used is a humanoid model named Robolater, commonly known as the Integrated Service Robot. A humanoid model was chosen to enhance human-robot interaction, with the aim of achieving better efficiency, reliability, and quality. The research begins with the implementation of the hardware and software that allow the robot to recognize human movements through the camera sensor. The robot is then trained to recognize hand gestures with a CNN, a supervised deep learning method that recognizes movements from visual input. At this stage, the robot is trained with various weights, backbones, and detectors. The results show that the F-T Last Half technique performs more stably than the other techniques, especially at larger input scales (640×644). The model using this technique achieved a mAP of 91.6%, a precision of 84.6%, and a recall of 80.6%.