It has been reported that China has a high incidence of cancer, with colorectal cancer in particular ranking among the highest in the world. Symptoms include indigestion, abdominal pain, nausea, vomiting, loss of appetite, weight loss, and bloody stools. In 2020 alone, colorectal cancer ranked third in incidence and second in mortality worldwide. That same year, colorectal cancer was the third most common cancer in China, with approximately 555,000 new cases, representing a 7.4% increase over the previous year1. Colorectal cancer generally progresses through four stages, from early to advanced. In the first stage, known as early colorectal cancer, patients have a survival rate of over 90% with timely treatment. The second and third stages are considered intermediate colorectal cancer, with survival rates ranging from 50 to 70%. By the fourth stage, or advanced colorectal cancer, the survival rate drops to just 10–20%. Therefore, early screening for colorectal cancer is extremely important2.

Some common gastrointestinal diseases, as shown in Fig. 1, can undergo malignant transformation to varying degrees, eventually leading to colorectal cancer. Colitis is an inflammatory bowel disease3 characterized by inflammation of the colonic mucosa and intestinal wall. It can be classified into two types based on symptoms: ulcerative colitis and Crohn's disease. Ulcerative colitis is an autoimmune disease that typically begins in the rectal area of the colon, with inflammation confined to the mucosal layer of the intestine and gradually spreading to the muscular layer. Clinical manifestations include diarrhea, abdominal pain, bloody stools, and discomfort during defecation. Ulcerative colitis may also be associated with other conditions such as anal fistulas and perianal abscesses. Figure 1e shows an actual image of colitis.

Fig. 1 Real images of common digestive diseases. (a) Arti polyps. (b) Non-tipped polyps. (c) Tipped polyps. (d) Multiple polyps. (e) Colitis.

A polyp is an abnormal growth on the surface of body tissues. Based on its pathology, it can be classified into adenomatous and non-adenomatous polyps4,5,6,7. Most colorectal polyps resemble mushrooms or cauliflower, featuring short stalks or stems that connect to the normal mucosal lining of the intestine. Other polyps have a flatter shape (flat polyps) or appear like carpets (sessile polyps). Figure 1a–d shows different shapes of polyps in colonoscopy images. Polyps are benign tumors, but over time some polyps tend to become malignant; if left untreated, they may develop into colorectal cancer. Early diagnosis of colorectal cancer relies primarily on colonoscopy, an effective method that allows direct observation of lesions and pathological analysis. However, its diagnostic accuracy depends heavily on the physician's expertise and operational skill, and fatigue or inexperience can lead to missed or incorrect diagnoses. Computer vision-based assisted diagnostic technologies leverage deep learning algorithms to analyze colonoscopy images in real time, enhancing the sensitivity and accuracy of lesion detection, precisely localizing affected areas, reducing the workload of physicians, and significantly improving the detection of small or flat polyps.
Compared to traditional methods, these technologies provide an efficient and reliable solution for early colorectal cancer screening, paving the way for the intelligent development of medical diagnostics.

To address these challenges, Angermann et al.8 proposed a real-time polyp detection method that uses active learning algorithms to improve detection accuracy and efficiency. This method employs frame-based features to identify polyps in videos and utilizes an active learning framework built on classifiers to progressively enhance classifier performance, allowing more accurate identification and localization of polyps. However, the sensitivity and detection accuracy of this method are relatively low. Misawa et al. utilized CNNs to identify lesions in colonoscopy images9. The dataset comprised videos from 73 patients, consisting of 73 colonoscopy video segments. However, the experimental results were less than ideal: although the sensitivity reached 90.0%, the specificity and accuracy were relatively low, at 63.3% and 76.5%, respectively. In reference10, Peter Klare from Germany investigated a novel computer-assisted polyp detection system. The study recruited 30 participants from a German hospital and collected 280 colonoscopy images of polyp detection events to evaluate the system's performance and accuracy. The results indicated that the automated polyp detection system demonstrated high accuracy and robustness in clinical applications, with strong performance in detecting small polyps, but its detection performance still needs improvement in complex scenarios to meet more challenging practical medical needs. Liu et al.11 utilized SSD (Single Shot MultiBox Detector) for the localization of polyps in images; SSD offers high precision and speed but has limited capability in recognizing small objects. Nisha et al.12 designed a Dual-Path Convolutional Neural Network (DP-CNN) to classify polyps and normal bowel tissue in colonoscopy images. After training and testing on the CVC-ColonDB dataset (which contains 380 images), the system achieved an accuracy of 99.6% and a recall of 99.2%, although the dataset is relatively small. Nogueira-Rodríguez et al.13 designed a deep learning model for real-time polyp detection based on the YOLOv3 (You Only Look Once) architecture, supplemented with a post-processing step using object tracking algorithms. The model demonstrated high prediction performance for sessile and pedunculated polyps but lower performance for flat polyps.

The primary aim of this work is to enhance the detection of colonic polyps using an improved YOLOv5s algorithm, thereby advancing early colorectal cancer diagnosis and providing a benchmark for computer-assisted diagnostic systems. The novelty lies in the integration of the SE (Squeeze-and-Excitation) attention mechanism to upgrade the C3 (Cross Stage Partial Networks) module into C3SE (Cross Stage Partial Networks with Squeeze-and-Excitation), and the use of BiFPN (Bi-directional Feature Pyramid Network) to optimize multi-scale feature fusion. Motivated by the need to address challenges such as missed detection of small targets and poor performance in complex scenarios, the study validates these improvements on a newly constructed colonic polyp image dataset.
Experimental results demonstrate significant gains in mAP, accuracy, and recall, highlighting the feasibility of this approach for advancing intelligent medical diagnostics.

The improved YOLOv5s algorithms

YOLOv5s network structure and algorithm principle

As is well known, the YOLO family of algorithms has been widely applied to numerous target detection tasks in medical imaging14. YOLOv5 is one algorithm in the YOLO family15, with various target detection network models for different image input sizes and datasets16. The network architecture of YOLOv5 is depicted in Fig. 2, which shows its four fundamental components: input, backbone, neck, and head17.

Fig. 2 YOLOv5 network structure diagram.

At the input stage, Mosaic data augmentation combines four input images into one to enhance dataset diversity, while adaptive image scaling adjusts input dimensions for improved detection of both small and large objects. Anchor box dimensions are optimized using k-means clustering, enhancing detection accuracy, as shown in Fig. 2. The backbone network, based on CSPDarknet53 (Fig. 3), employs CSP modules and the Focus structure to optimize feature extraction, reduce computation, and retain semantic information. The neck network integrates Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) architectures, enabling multi-scale object detection by fusing features across different layers. Finally, the head network processes feature maps from the neck to produce multi-scale predictions for small, medium, and large objects using C3 modules at various levels (Fig. 2), supporting applications such as lesion detection in medical imaging.

Fig. 3 CSPDarknet53 structure diagram.

SE attention mechanism module

The attention mechanism is a weighted summation process that calculates the final output by aggregating the weights assigned to various input components. This mechanism allows a model to enhance its performance by focusing more on critical components while processing the input data. As a result, attention mechanisms have become essential in neural network design18. For example, in target detection tasks, the attention mechanism helps the model concentrate on the regions most relevant for detection while automatically ignoring irrelevant ones.

The SE (Squeeze-and-Excitation) attention mechanism19 is a popular lightweight approach widely used in convolutional neural networks. Its core procedure involves compression (squeeze) and excitation operations to determine the importance of each channel, thereby enabling the network to capture more refined features20, as shown in Fig. 4.

Fig. 4 Compression and excitation blocks.

The SE attention mechanism is implemented in two phases.

Squeeze

The squeeze operation applies global average pooling to the feature map of each channel, producing a \(1 \times 1 \times C\) vector in which each element summarizes the global information of one channel. This process is expressed in Eq. (1):

$$Z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i,j)$$ (1)

Excitation

The excitation operation enables the model to learn a weight for each feature channel. It consists of two fully connected layers: the first layer has \(C \times SERatio\) neurons, while the second layer contains \(C\) neurons.
Given an input with dimensions \(1 \times 1 \times C\), the output remains \(1 \times 1 \times C\) after passing through both fully connected layers. Here, SERatio represents the ratio of the output feature vector size of the squeeze layer to the number of channels in the input feature map. The excitation operation is expressed in Eq. (2):

$$s = F_{ex}(z, W) = \sigma\left(g(z, W)\right) = \sigma\left(W_2\,\delta(W_1 z)\right)$$ (2)

The SE attention mechanism can be applied to either a residual block or a convolutional layer. Implementing the SE attention mechanism allows the model to dynamically prioritize the importance of individual channels, thereby improving its generalization capability. Additionally, because it introduces only a small number of parameters, the SE attention mechanism can be easily integrated into existing convolutional neural networks.
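To make the squeeze and excitation steps concrete, the following is a minimal PyTorch-style sketch of an SE block of the kind described above. The class name, the reduction ratio of 1/16, the use of Linear layers for the two fully connected stages, and the demonstration tensor shape are illustrative assumptions, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: global pooling -> two FC layers -> channel re-weighting.
    The reduction ratio (SERatio = 1/16 here) is an assumed, commonly used value."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # Eq. (1): global average pooling per channel
        self.excitation = nn.Sequential(                 # Eq. (2): W1 -> ReLU -> W2 -> Sigmoid
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)                   # 1 x 1 x C channel descriptor
        s = self.excitation(z).view(b, c, 1, 1)          # per-channel weights in (0, 1)
        return x * s                                     # re-scale each channel of the input

if __name__ == "__main__":
    feat = torch.randn(2, 64, 80, 80)                    # hypothetical backbone feature map
    print(SEBlock(64)(feat).shape)                       # torch.Size([2, 64, 80, 80])
```

In a C3SE module of the kind used later in this paper, a block of this sort would typically be attached to the output of the C3 structure so that channel importance is learned jointly with the backbone features.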
Multi-scale feature fusion network

The multi-scale feature fusion network is a neural network architecture designed to improve performance in computer vision tasks by integrating feature information across different scales. A multi-scale feature fusion network typically uses layers of varying depths to extract features at distinct scales, and these extracted features are then combined through a fusion operation. The specific fusion method, which may include addition, multiplication, or concatenation, depends on the task requirements and desired performance. To further boost performance, multi-scale feature fusion networks often incorporate additional modules, such as attention mechanisms and aggregation operations.

Multi-scale feature fusion networks have the advantage of fully utilizing feature information at various scales, enhancing the model's receptive field and expressive capability, which leads to improved performance across different computer vision applications. Figure 5 illustrates three standard design patterns for multi-scale feature fusion networks: (a) the FPN (Feature Pyramid Network)21, (b) the PANet (Path Aggregation Network)22, and (c) the BiFPN (Bidirectional Feature Pyramid Network)23. FPN and PANet are used in the neck network of the YOLOv5 algorithm, as discussed in the previous section. In 2020, Tan et al.23 proposed the weighted Bidirectional Feature Pyramid Network (BiFPN) for the EfficientDet network, as shown in Fig. 5c. BiFPN and PANet differ in the following ways: (1) Network structure: BiFPN builds upon the FPN and introduces multiple bidirectional connections to achieve multi-level feature map fusion, whereas PANet performs path aggregation of feature maps at various scales to achieve multi-scale feature map fusion. (2) Feature fusion: BiFPN establishes bidirectional connections between different layers, allowing information to flow freely and adaptively optimizing feature fusion across levels by learning connection weights; PANet, in contrast, aggregates feature information across scales through path aggregation of feature maps. (3) Training method: BiFPN requires training of the connection weights to enhance the network's adaptability to target detection tasks, whereas PANet does not require additional training and aggregates feature maps of varying scales directly.

Fig. 5 Different forms of multi-scale feature fusion networks.

Improvements based on the attention mechanism (SENet-yolov5s network design)

Figure 6 illustrates the structure and parameters of the backbone network as defined in the YOLOv5s.yaml file. To mitigate the impact of background noise and other interfering factors, the attention mechanism module is introduced, with selected C3 layers upgraded to C3SE. A key consideration in developing the fusion network is selecting the appropriate layer at which to integrate the attention mechanism module so as to optimize detection performance.

Fig. 6 Backbone network structure.

The backbone of YOLOv5s contains four C3 layers. To assess the impact of integrating attention mechanisms at different locations, a comparative experiment was conducted in which various C3 layers were converted into C3SE. The performance of each modified network was then compared with the original to determine whether attention-mechanism fusion at different C3 layers consistently enhances detection performance and to identify the configuration that yields the best results.

BiFPN-based model improvement

Although top-down and bottom-up feature fusion is performed in the neck of the YOLOv5 network, these fusion methods do not consider the relative importance of different input features. BiFPN not only introduces a better feature aggregation technique but also addresses the fact that input features of different resolutions contribute unequally to the final fusion. To resolve this, BiFPN incorporates a weighting scheme, based on an attention mechanism, that adaptively learns the weight of each input feature layer and adjusts the aggregation process to enhance detection performance. This weighting network takes each input feature layer as input and outputs the corresponding importance weight for that layer. These weights are then fused with the features through fast normalization, with the fusion formula as follows:

$$O = \sum_{i} \frac{\omega_i}{\epsilon + \sum_{j} \omega_j} \cdot I_i$$ (3)

In this formula, \(O\) represents the fused feature map, \(\omega\) denotes the learnable weights, \(I\) is the original feature map, and \(\epsilon\) is a very small value (0.0001) added to avoid numerical instability.
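The following PyTorch-style sketch illustrates the fast-normalized weighted fusion of Eq. (3). The module name, the two-input example, and the use of ReLU to keep the learnable weights non-negative are illustrative assumptions rather than the exact module used in this work.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast-normalized weighted fusion, Eq. (3): O = sum_i( w_i / (eps + sum_j w_j) * I_i ).
    One learnable, non-negative weight per input feature map; eps avoids division by zero."""
    def __init__(self, num_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))    # trainable weight per feature layer
        self.eps = eps

    def forward(self, inputs):
        # inputs: list of feature maps with identical shapes (resized beforehand if necessary)
        w = torch.relu(self.w)                           # keep weights non-negative
        norm_w = w / (self.eps + w.sum())                # normalization term of Eq. (3)
        return sum(norm_w[i] * inputs[i] for i in range(len(inputs)))

if __name__ == "__main__":
    p3 = torch.randn(1, 256, 80, 80)                     # hypothetical neck feature maps
    p4_resized = torch.randn(1, 256, 80, 80)
    print(WeightedFusion(2)([p3, p4_resized]).shape)     # torch.Size([1, 256, 80, 80])
```

In a full BiFPN node, this fusion would be followed by a convolution, corresponding to the Conv term in Eqs. (4)–(7) below.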
The concept of BiFPN is applied to enhance the neck of YOLOv5, focusing primarily on the following two aspects:

(1) For feature weight fusion, this work introduces a new module, BiFPN_Concat, to replace the previous Concat operation; its detailed structure is shown in Fig. 7. The module defines a trainable weight parameter ω for each feature layer to be fused, and the weight parameters are normalized to obtain the corresponding normalized weight for each feature map. Each feature layer is then multiplied by its normalized weight, and the results are summed to generate the new fused feature layer output, completing the feature fusion process. By learning the weight parameter of each feature layer, the contribution of each layer can be adaptively adjusted to enhance the algorithm's ability to detect colorectal polyps.

Fig. 7 Introduction to the BiFPN_Concat module.

(2) Network connection: Since YOLO has three detection layers, a BiFPN with three nodes is introduced into the algorithm and used only once. At this stage, the outputs of the three detection heads, \(P3\), \(P4\), and \(P5\), are:

$$P_3^{out} = Conv\left(\frac{\omega_1 \cdot P_3^{in} + \omega_2 \cdot Resize\left(P_4^{td}\right)}{\omega_1 + \omega_2 + \epsilon}\right)$$ (4)

$$P_4^{out} = Conv\left(\frac{\omega_3 \cdot P_4^{in} + \omega_4 \cdot P_4^{td} + \omega_5 \cdot Resize\left(P_3^{out}\right)}{\omega_3 + \omega_4 + \omega_5 + \epsilon}\right)$$ (5)

$$P_5^{out} = Conv\left(\frac{\omega_6 \cdot P_5^{in} + \omega_7 \cdot Resize\left(P_4^{out}\right)}{\omega_6 + \omega_7 + \epsilon}\right)$$ (6)

where \(P_4^{td}\) is calculated by the formula:

$$P_4^{td} = Conv\left(\frac{\omega_8 \cdot P_4^{in} + \omega_9 \cdot Resize\left(P_5^{in}\right)}{\omega_8 + \omega_9 + \epsilon}\right)$$ (7)

Here, \(\omega\) represents a trainable weight parameter, \(Resize\) denotes the up-sampling or down-sampling operation, and \(Conv\) refers to the convolution process. \(P_3^{td}\), \(P_4^{td}\), and \(P_5^{td}\) are the third, fourth, and fifth feature maps of the image, respectively, following the bottom-up pathway. \(P_3^{out}\), \(P_4^{out}\), and \(P_5^{out}\) are the output feature maps of \(P3\), \(P4\), and \(P5\).

The neck structure of the YOLOv5 P3 detection head is now connected to the feature map of layer 4. The P4 detection head is linked to the feature maps of layers 18, 6, and 13, while the P5 detection head is connected to the feature map of layer 9. Through additional layers of feature aggregation, the model can concentrate on critical feature levels by assigning different training weights to various feature map levels, resulting in more accurate feature representations.

Experimental design

Experimental data sets

The images used in this study are sourced from publicly available databases and were randomly extracted from the following datasets: Kvasir-SEG (https://datasets.simula.no/downloads/kvasir-sessile.zip), CVC-ClinicDB (https://www.dropbox.com/s/p5qe9eotetjnbmq/CVC-ClinicDB.rar?dl=0), and LDPolypVideo (https://github.com/dashishi/LDPolypVideo-Benchmark), all of which are freely accessible for research purposes. The datasets were carefully selected and consist of real image frames extracted from thousands of digestive endoscopy videos. Figure 8a shows a frame as originally acquired. However, because the acquired images have varying pixel resolutions and the left side contains extraneous information and a black border, each image is cropped and scaled to 640 × 640 resolution, as seen in Fig. 8b.

Fig. 8 Images of the actual stomach and intestine.

For the image dataset used to model intestinal diseases, the LabelImg software is used to annotate the images, as shown in Fig. 9. Figure 10 illustrates how the annotation tags specify the image's width and height; a sketch of converting such an annotation into YOLO-format labels is given below.

Fig. 9 Image annotation.

Fig. 10 XML file content.
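As a concrete illustration of the annotation step, the following minimal sketch converts a LabelImg-produced Pascal VOC XML file (of the kind shown in Fig. 10) into a YOLO-format label file. The file paths and the single "polyp" class are hypothetical, and the XML field names assumed here (size/width, size/height, object/bndbox) follow LabelImg's default Pascal VOC output rather than the exact files of this study.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

CLASS_NAMES = ["polyp"]  # hypothetical single-class list for this dataset

def voc_to_yolo(xml_path: str, out_dir: str) -> None:
    """Convert one LabelImg (Pascal VOC) XML annotation to a YOLO-format .txt label file."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)      # image width stored in the XML tag
    img_h = float(root.find("size/height").text)     # image height stored in the XML tag

    lines = []
    for obj in root.findall("object"):
        cls_id = CLASS_NAMES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1]
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        bw = (xmax - xmin) / img_w
        bh = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")

    out_file = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out_file.write_text("\n".join(lines))

# Example with hypothetical paths:
# voc_to_yolo("annotations/polyp_0001.xml", "labels/train")
```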
The image dataset is made up of 609 images of diseased intestines and 2000 images of normal intestines. The total of 2609 images is separated into training and validation sets in an 8:2 ratio. In addition, 195 images from the collected colonoscopy videos are set aside as the test set, comprising 175 images of diseased intestines and 20 images of healthy intestines. As shown in Fig. 11, the distribution of the x and y coordinates of the ground-truth boxes (GT boxes) indicates that the GT boxes are mainly concentrated in the middle of the image, while the distribution of GT box widths and heights shows that the colonic polyp targets in the dataset vary widely in size, from extremely large targets to tiny ones. Building a high-precision detection model is therefore very difficult.

Fig. 11 Dataset distribution and bounding box characteristics for colonic polyp detection.

Indicators for model evaluation

The models are assessed using the evaluation metrics listed below, which are commonly used in medical image models.

Confusion matrix

The confusion matrix is a standard method for evaluating classification performance. It is an N × N matrix, with N denoting the number of classes. For a binary classification problem, the confusion matrix is shown in Table 1, where true positive (TP) is the number of positive cases that the model correctly predicts as positive, false negative (FN) is the number of positive cases that the model incorrectly predicts as negative, false positive (FP) is the number of negative cases that the model incorrectly predicts as positive, and true negative (TN) is the number of negative cases that the model correctly predicts as negative. The indicators of accuracy, recall, and precision can be calculated from the confusion matrix.

Table 1 Confusion matrix.

Accuracy is defined as the proportion of correctly predicted samples out of the total number of samples; in general, higher accuracy indicates better model performance. It is defined as:

$$A=\frac{TP+TN}{TP+TN+FP+FN}$$ (8)

Recall, also known as sensitivity, is defined as the ratio of correctly predicted positive cases to the total number of positive cases:

$$R=\frac{TP}{TP+FN}$$ (9)

Precision is defined as the proportion of cases predicted as positive by the model that are actually positive:

$$T=\frac{TP}{TP+FP}$$ (10)

mAP (mean average precision)

This paper uses the mean average precision (mAP), giga floating-point operations per second (GFLOPs), and frames per second (FPS) to evaluate the model's performance. mAP is used to assess the accuracy of the model, and its calculation formula is as follows:

$$mAP=\frac{\sum P_{A}}{N}$$ (11)

where \(P_A\) represents the area under the precision–recall curve for one class, and \(N\) denotes the total number of detection classes. mAP@0.5 indicates the average precision (AP) for each class calculated at an IoU threshold of 0.5, followed by averaging the AP values across all classes. mAP@0.5:0.95 refers to the computation of mAP at IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, with the final mAP being the average of these values.
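To make Eqs. (8)–(10) concrete, the following short Python sketch computes accuracy, recall, and precision directly from confusion-matrix counts; the example counts are hypothetical and not taken from this study.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, recall, and precision from confusion-matrix counts, per Eqs. (8)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (8)
    recall = tp / (tp + fn)                      # Eq. (9), also called sensitivity
    precision = tp / (tp + fp)                   # Eq. (10)
    return {"accuracy": accuracy, "recall": recall, "precision": precision}

# Hypothetical counts, for illustration only:
print(confusion_metrics(tp=170, fp=8, fn=5, tn=12))
```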
Experimental results and analysis

Because the original YOLOv5s has four C3 layers in its backbone, a comparison experiment was designed to measure the performance change, relative to the original network, when different C3 layers are upgraded to C3SE, that is, when the attention mechanism is fused at different C3 locations.

The results listed in Table 2 show that the model achieves the highest indexes when the first and second C3 layers are upgraded to C3SE: mAP@0.5 is 1% higher, precision is 2.39% higher, and recall is 0.34% higher than those of the original YOLOv5s model, and the detection performance of the YOLOv5s-1st-2nd-C3SE model exceeds that of the other variants. The introduction of the SE module can therefore increase the model's feature extraction ability to some extent, improving its ability to detect polyps.

Table 2 Comparative experimental data.

The next step, in addition to the YOLOv5s-1st-2nd-C3SE model (the YOLOv5s with the first and second C3 layers upgraded to C3SE), is to build a new model by integrating BiFPN into YOLOv5s as described in the BiFPN-based model improvement subsection. The YOLOv5s-SEBiFPN model is then created by coupling the YOLOv5s-1st-2nd-C3SE model with this improvement. Finally, the test experiment is executed in the same way as above, also covering the Faster R-CNN, SSD, and original YOLOv5s models. The performance indicator results are displayed in Table 3.

Table 3 Comparison of the results of the performance indicators of each model.

With the training conditions of the seven models kept consistent, the indicators listed in Table 3 reveal that the improved YOLOv5s model outperforms the original YOLOv5s model in mAP, accuracy, and recall, with increments of 0.7%, 0.5%, and 1.1%, respectively. The coupled improvement model, YOLOv5s-SEBiFPN, raises the performance indicators further, with mAP, accuracy, and recall increasing by at least 1.6%, 1.3%, and 1.5%, respectively.

Another experiment is performed using k-fold cross-validation to assess the variability of each model's performance. As shown in Table 4, YOLOv5s + SEBifpn demonstrates exceptional stability, particularly in mAP, with a variation of only 0.4%, and in precision, with a variation of merely 0.5%, reflecting its consistency in accuracy and recall. Compared with the other models, YOLOv5s + SEBifpn exhibits the highest stability in mAP, with superior precision and robust recall. Although its FPS is slightly lower (ranging from 24.9 to 26.0 Hz), its accuracy and stability make it particularly suitable for scenarios requiring rigorous accuracy and robustness, such as intestinal detection.

Table 4 Comparison of performance indicators after 5-fold cross-validation for each model.
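For clarity about the cross-validation protocol, a minimal sketch of how a 5-fold split over the training images might be generated is shown below; the use of scikit-learn's KFold and the directory layout are assumptions for illustration, not the exact tooling of this study.

```python
from pathlib import Path
from sklearn.model_selection import KFold

# Hypothetical directory of training images; each fold reuses the same model configuration.
image_paths = sorted(Path("datasets/polyp/images").glob("*.jpg"))

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_files = [image_paths[i] for i in train_idx]
    val_files = [image_paths[i] for i in val_idx]
    # One model per fold would be trained on train_files and evaluated on val_files;
    # the per-fold metrics are then summarized, as reported in Table 4.
    print(f"fold {fold}: {len(train_files)} train / {len(val_files)} val images")
```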
Figure 12 presents results of the YOLOv5s and YOLOv5s-SEBiFPN models on the test set to visualize the detected targets, with the red border label giving the confidence of the prediction box's category, as defined in Eq. (12).

Fig. 12 YOLOv5s + SEBifpn algorithm vs. YOLOv5s algorithm visualisations.

$$P_r(class_i \mid object) \cdot P_r(object) \cdot IoU = P_r(class_i) \cdot IoU$$ (12)

where \(P_r(class_i \mid object)\) is the probability of a specific class given that an object is present, and \(P_r(object) \cdot IoU\) is \(P_r(object)\) multiplied by the IoU (Intersection over Union).

In Fig. 12, sub-images (a)–(d) are typical results from the original YOLOv5s model, and sub-images (e)–(h) are the corresponding results from the YOLOv5s + SEBiFPN model. Comparing sub-images (c) with (g) and (d) with (h) shows that the confidence level of the YOLOv5s-SEBiFPN model increases. Sub-images (a) and (e) show that the YOLOv5s-SEBiFPN model not only improves the confidence level but also enhances the detection of tiny and medium-sized polyps. However, the polyp obscured by the intestinal wall in (b) remains undetected, reflecting the difficulty of detecting polyps in complex and diverse scenes.

In Fig. 13, the YOLOv5s-1st-2nd-C3SE algorithm is employed to identify images (a) and (b) on the upper side, and the improved YOLOv5s + SEBifpn algorithm is utilized to identify images (d) and (e) on the lower side. It is observed that the YOLOv5s-1st-2nd-C3SE algorithm cannot precisely identify certain flat-shaped polyps. This can occur because flat-shaped objects may exhibit varying dimensions or angles within the image, which the attention mechanism may find difficult to detect accurately. In contrast, the YOLOv5s-1st-2nd-C3SE algorithm exhibits superior performance over YOLOv5s + Bifpn: for instance, the polyp in Fig. 13c remains undetected by YOLOv5s + Bifpn, whereas that in Fig. 13f is detected by YOLOv5s-1st-2nd-C3SE. This discrepancy indicates that the YOLOv5s-1st-2nd-C3SE algorithm handles flat polyps better than the YOLOv5s + Bifpn algorithm.

Fig. 13 YOLOv5s + SEBifpn, YOLOv5s-1st-2nd-C3SE, and YOLOv5s + Bifpn visualisation comparison.

Figure 13 thus contrasts the SE-, BiFPN-, and SE + BiFPN-modified YOLOv5s models using examples drawn from their undetected results. Sub-images (a) and (b) represent the weak performance of the YOLOv5s-1st-2nd-C3SE model, while sub-images (d) and (e) correspond to the YOLOv5s + SEBiFPN model. Sub-image (c) is from the undetected set of the YOLOv5s + BiFPN model, with the same undetected result as in sub-images (a) and (b), whereas sub-image (f) is from the YOLOv5s + SEBiFPN model. This indicates that the YOLOv5s + SEBiFPN algorithm may be slightly superior to the YOLOv5s + BiFPN model when dealing with flat polyps; some flat-shaped targets may exhibit varying dimensions or angles in the images, making them difficult for the attention mechanism to detect accurately.

Finally, Fig. 14 illustrates typical recognition images from the YOLOv5s + SEBiFPN model. In each sub-image of Fig. 14, prediction boxes, confidence scores, and classes are tagged, except for the normal intestinal images. Figure 15 displays the number of images at each confidence level in the test set (195 images); for instance, about 118 images that tested positive for intestinal polyps have a confidence score of 0.9, accounting for 60.8% of the total, while images with confidence scores between 0.7 and 0.8 account for 25.3%.
Fig. 14 YOLOv5s + SEBifpn algorithm detection visualisation.

Fig. 15 Distribution of confidence levels comparison.

As shown in Figs. 16 and 17, the enhanced YOLOv5s + SEBiFPN algorithm exhibits a more stable decline of the loss function over 200 training epochs, together with a smoother, less abrupt ascent of the mAP and recall curves, making its improved training stability clear.

Fig. 16 Training results of the YOLOv5s + SEBifpn algorithm.

Fig. 17 Training results of the YOLOv5s algorithm.
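For context, a 200-epoch training run of the kind summarized in Figs. 16 and 17 might be launched as in the following sketch. It assumes the publicly available YOLOv5 codebase with its standard train.py entry point; the dataset and model YAML file names are hypothetical placeholders, not the exact files of this study.

```python
import subprocess

# Hypothetical invocation of the public YOLOv5 training script with a modified model definition
# (e.g., a YAML describing the C3SE backbone and BiFPN-style neck) and a custom polyp dataset YAML.
subprocess.run(
    [
        "python", "train.py",
        "--img", "640",                    # input resolution used in this work
        "--epochs", "200",                 # training length shown in Figs. 16 and 17
        "--batch-size", "16",              # assumed batch size, for illustration only
        "--data", "polyp.yaml",            # hypothetical dataset description file
        "--cfg", "yolov5s-sebifpn.yaml",   # hypothetical model YAML with the proposed changes
        "--weights", "yolov5s.pt",         # standard pretrained YOLOv5s weights as a starting point
    ],
    check=True,
)
```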