A comprehensive analysis of YOLO architectures for tomato leaf disease identification


Introduction

Tomato (Solanum lycopersicum) is one of the most widely cultivated and consumed fruits globally1. Its regular consumption has been linked to reduced risks of various health conditions due to its nutritional benefits2, including high levels of vitamin C, potassium, and antioxidants such as lycopene3. In addition, tomato plays an essential economic role by supporting millions of smallholder farmers and contributing significantly to global agricultural economies4. However, its cultivation is vulnerable to a wide range of plant diseases caused by bacteria, fungi, and viruses5, which can reduce yields, degrade fruit quality, and increase production costs6,7,8. These challenges can ultimately disrupt supply chains and affect consumer prices.

In tomato plants, diseases typically manifest through visible symptoms on the leaves, such as discoloration, spots, or deformities9,10. Traditionally, farmers have relied on manual inspection to monitor these diseases7, a method that is time-consuming, labor-intensive, and prone to human error7,11. Manual inspection often fails to adequately cover large agricultural areas, leading to delayed detection and reduced treatment effectiveness8. To address these limitations, recent advances in Artificial Intelligence (AI) and Computer Vision (CV) have introduced automated approaches for crop monitoring2,4. By leveraging machine learning algorithms and image processing, CV systems enable rapid, accurate identification of disease symptoms12, improving efficiency and enabling timely disease management at scale5,12.

Within CV, object detection has emerged as a key technique for leaf disease detection. It enables the identification and localization of multiple objects within an image13, allowing for the rapid processing of large datasets and providing real-time insights for disease management14,15. Among object detection frameworks, You Only Look Once (YOLO) stands out for its high speed and accuracy16,17. Unlike multi-stage approaches such as Mask R-CNN, YOLO processes the entire image in a single pass18, making it well suited for real-time agricultural applications. Over successive versions, YOLO has consistently improved in precision, robustness, and efficiency, solidifying its position as one of the leading object detection frameworks.

Building on this context, this work conducts a comprehensive comparison of the latest YOLO architectures, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12, for tomato leaf disease detection. These architectures represent the most recent official releases and reflect state-of-the-art advancements in object detection. Our analysis focuses on evaluating their effectiveness, efficiency, and practicality in agricultural scenarios, aiming to identify the best-performing model and provide actionable insights for real-world deployment. To the best of our knowledge, this is the first comprehensive study to benchmark these five YOLO versions in the context of tomato leaf disease detection, offering a timely contribution for both agricultural practitioners and AI researchers.
The contributions of this work are summarized as follows:

- Providing a technical overview of the latest YOLO architectures, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12, highlighting their advancements and key features in the field of object detection.
- Conducting a comprehensive performance analysis to evaluate the effectiveness, efficiency, and accuracy of these architectures in the context of tomato leaf disease detection.
- Establishing a benchmark for future research to support the development of improved object detection models and architectures tailored for agricultural applications.
- Offering practical insights for agriculture by demonstrating the applicability of advanced computer vision techniques and their potential for integration into AI-driven crop health monitoring systems.

Related work

The task of leaf disease detection has been widely studied in recent years due to its importance in precision agriculture and crop management. YOLO-based architectures have gained significant attention due to their balance between accuracy and inference speed, making them well-suited for real-time agricultural applications. Several studies have explored the use of YOLO models for detecting leaf diseases, demonstrating their effectiveness in identifying different pathological conditions in crops. Given the impact of early disease detection on yield optimization and resource management, research in this area remains highly relevant, driving continuous efforts to refine and evaluate detection models.

To begin, the study in19 evaluates YOLOv5, YOLOX, Scaled YOLOv4, and SSD using a dataset of 3154 images from corn, potato, and tomato plants, covering eight disease classes. Results show that YOLOv5 outperforms the others in training speed and mAP. Similarly,20 compares YOLOv5 and YOLOv6 on the PlantDoc dataset, which contains 2569 images across 13 plant species and 17 diseases. YOLOv5 achieves a higher mAP@50, while YOLOv6 surpasses it in mAP@50:95, with YOLOv5 being 2.96 times faster in training.

The study in21 compares YOLOv5, YOLOv7, and YOLOv8 for citrus leaf disease detection using a dataset of 2684 images across three disease classes: Anthracnose, Melanose, and Brown Spot. YOLOv8 achieves the highest performance, reaching 91.6% mAP@50:95 and significantly outperforming YOLOv5 and YOLOv7. Similarly,22 evaluates YOLOv7, YOLOv8, and Faster R-CNN on custom datasets for mango (1522 images, three classes) and guava (647 images, four classes). YOLOv8 again surpasses the other models in F1-score and mAP, demonstrating strong adaptability to small datasets.

Expanding YOLO evaluations,23 analyzes YOLOv5, YOLOv8, and YOLOv9 for leaf disease detection using a dataset of 8,858 images, including healthy and diseased leaves from apples, cucumbers, tomatoes, and grapes. YOLOv9 achieved the best performance, with precision, recall, and F1-scores exceeding 90%. Similarly,24 uses the PlantDoc dataset to compare YOLOv5, YOLOv6, YOLOv8, and YOLOv9, where YOLOv9 again led in precision, recall, and mAP, while YOLOv5 attained the highest F1-score, demonstrating strong competitiveness against newer models.

Further expanding the evaluations,25 analyzes YOLOv8, YOLOv9, and YOLOv10 for detecting Fusarium disease in banana leaves using a binary-class dataset of 450 images. YOLOv9 achieved the best mAP@50:95, highlighting its localization capability.
Similarly,26 evaluates YOLOv5, YOLOv8, YOLOv9, and YOLOv10 for melon leaf disease detection using a dataset of 600 images across five disease classes, where YOLOv9 again outperformed all models, achieving the highest mAP, precision, and recall, while YOLOv5 showed the lowest effectiveness.

Table 1 Summary of related work on leaf disease detection using YOLO architectures.

As seen in the reviewed studies and summarized in Table 1, the comparative evaluation of YOLO architectures for leaf disease detection has attracted considerable attention. While multiple works have demonstrated the effectiveness of YOLO models, they differ widely in scope, datasets, and target crops. A common trend is the predominance of evaluations involving earlier YOLO versions, with limited studies addressing newer architectures such as YOLOv9, YOLOv10, and beyond. Thus, despite progress, a comprehensive assessment of the latest models remains lacking.

Another notable gap in the reviewed studies is the limited focus on tomato crops, despite their global agricultural and economic significance. Many works combine multiple plant species with small sample sizes per species, raising concerns about evaluation comprehensiveness. Additionally, the small overall dataset sizes call into question the robustness of performance assessments, as limited samples may not reflect real-world variability. These gaps highlight the need for a systematic comparison of the latest YOLO versions using a dedicated, high-quality tomato leaf disease dataset.

Materials and methods

Dataset description

The dataset used in this work is the Tomato-Village dataset4. It comprises a collection of images of tomato leaf diseases for three tasks: multiclass classification, multilabel classification, and object detection. The images were manually captured in three districts of Rajasthan, India. For the object detection variant, the images were taken directly in the fields, as shown in Fig. 1, where a single image can contain multiple leaves with different diseases. Moreover, the images include leaves from plants of varying ages, captured at different times and under diverse lighting conditions. Additionally, data augmentation techniques were applied to increase the variability of the dataset; specifically, the techniques used included RandomRotate90, RandomBrightnessContrast, RGBShift, Rotate up to 90 degrees, RandomCrop with 30% pixel reduction, HorizontalFlip, and VerticalFlip. All images were manually annotated, resulting in a total of 14,368 images and 161,223 annotations.

Fig. 1 Sample images from the used dataset.

The object detection variant includes six tomato leaf diseases: late blight, leaf miner, magnesium deficiency, nitrogen deficiency, potassium deficiency, and spotted wilt virus. The 'healthy' leaf class is not included in this variant, as any leaf without a detected disease is implicitly healthy; the class is, however, available in the other dataset variants. Table 2 provides information about the diseases contained in this dataset, and Fig. 2 shows graphical examples of them. The dataset is freely available on GitHub and is prepared in both Pascal VOC XML and YOLO formats. The dataset is originally divided only into training and validation subsets, so for this work we performed our own split to additionally obtain a held-out evaluation subset. This division was done in an 80-10-10 proportion, resulting in 11,494, 1,437, and 1,437 images for training, validation, and evaluation, respectively.
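For illustration, a minimal sketch of such an 80-10-10 split is given below; the directory layout and file naming are assumptions for illustration, not the exact script used in this work.

```python
import random
import shutil
from pathlib import Path

# Hypothetical layout: images/ and YOLO-format labels/ with matching stems.
root = Path("tomato_village")          # assumed dataset root
images = sorted((root / "images").glob("*.jpg"))
random.seed(42)                        # fixed seed for a reproducible split
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.8 * n)],
    "val":   images[int(0.8 * n): int(0.9 * n)],
    "test":  images[int(0.9 * n):],
}

for split, files in splits.items():
    for img in files:
        label = root / "labels" / (img.stem + ".txt")
        for src, sub in ((img, "images"), (label, "labels")):
            dst = root / split / sub
            dst.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dst / src.name)
```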
Furthermore, we modified the class names to standardize their capitalization. This ensures consistency across all annotations, facilitating more accurate and efficient data processing.

Table 2 Tomato leaf diseases and their descriptions contained in the used dataset.

Fig. 2 Sample images of the tomato leaf diseases contained in the used dataset.

Overview of architectures under study

YOLOv8

YOLOv8 was released in 2023 by Ultralytics, the developers of YOLOv5. This architecture follows the same structural paradigm as its predecessors, consisting of a backbone, neck, and head, as shown in Fig. 3. YOLOv8 shares several characteristics with YOLOv5, including the use of CSPDarknet53 as the backbone, the incorporation of Spatial Pyramid Pooling Fast (SPPF) to handle features at different scales, and the use of the Path Aggregation Network (PANet)32 to improve the flow of information through the network. However, it includes some key modifications that enhance its capabilities.

Fig. 3 YOLOv8 architecture33.

The first significant modification in YOLOv8 is the adoption of an anchor-free paradigm, which allows the model to predict the centers of objects more directly. This reduces the number of bounding box predictions, speeding up convergence34. Another relevant modification is the replacement of the C3 module from YOLOv5 with a new C2F module35. C2F is inspired by the Efficient Layer Aggregation Network (ELAN)36 module and concatenates the outputs of all bottleneck modules to expand the receptive field, which in turn enables the model to learn features more effectively.

Additionally, the training of YOLOv8 incorporates techniques to improve the model's performance and generalization. These include mosaic image augmentation, which randomly selects four images from the training dataset and combines them into a composite image37. This is achieved through cropping, concatenation, and scaling to fit the required input dimensions. The technique automatically increases the dataset's variability, resulting in a trained model with enhanced adaptability.
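To make the composition step concrete, the following is a heavily simplified sketch of mosaic augmentation, assuming a fixed 2×2 layout and omitting the bounding-box remapping that the actual Ultralytics implementation performs around a random mosaic center.

```python
import random
import cv2
import numpy as np

def simple_mosaic(image_paths, size=640):
    """Combine four random training images into one size x size mosaic.

    Simplified: each image is resized to fill one quadrant; the real
    implementation crops around a random center and adjusts the
    bounding-box labels accordingly (omitted here).
    """
    chosen = random.sample(image_paths, 4)
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # (y, x) corners
    for path, (y, x) in zip(chosen, offsets):
        img = cv2.imread(str(path))
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```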
The head of YOLOv8 remains consistent with previous versions, integrating decoupled heads to process detection and classification independently. YOLOv8 comes in five variants (n, s, m, l, and x) to meet different demands and computational resources, with details provided in Table 3.

Table 3 Details of the YOLOv8 variants.

YOLOv9

YOLOv938 is one of the latest versions released in the YOLO family. Although its features have not yet been explored in depth, it is known to include three key additions: reversible functions, Programmable Gradient Information (PGI), and the Generalized Efficient Layer Aggregation Network (GELAN). These modifications are designed to mitigate information loss in deep networks, where data is significantly compressed in bottlenecks, risking the loss of important information that, consequently, is not transmitted to subsequent layers.

To address this problem, YOLOv9 first implements reversible functions, which have the unique property that their inverse does not result in information loss. Integrating these functions into the architecture ensures the retention of the maximum amount of input information and enables its transmission to all layers. This allows the network to update gradients more accurately, thereby improving the model's capability.

To enhance the capabilities of reversible functions, PGI is incorporated to support both deep and lightweight networks. As illustrated in Fig. 4, PGI features a main branch for efficient inference with minimal computational overhead, and an auxiliary branch based on reversible concepts for precise gradient generation and parameter updates. Additionally, a multi-level auxiliary information module enables effective gradient information sharing across layers. These enhancements collectively improve the model's learning, inference, and localization capabilities.

Fig. 4 Comparison of PGI, implemented in YOLOv9, and related methods.

GELAN is incorporated to complement the capabilities of the reversible functions and PGI. This component, shown in Fig. 5, is designed based on ELAN and the Cross Stage Partial Network (CSPNet)39, combining the gradient path planning of CSPNet with the inference speed of ELAN. This allows YOLOv9 to achieve fast inference without negatively impacting its accuracy. Moreover, the structure of GELAN allows multiple blocks to be stacked, enabling YOLOv9 to handle a variety of scenarios and complexities effectively. Table 4 shows the five variants of YOLOv9.

Fig. 5 GELAN module structure, used in YOLOv9.

Table 4 Details of the YOLOv9 variants.

YOLOv10

A few weeks after YOLOv9, YOLOv1040 was introduced to address a major limitation of YOLO architectures: the reliance on Non-Maximum Suppression (NMS) for post-processing. NMS, used to remove duplicate detections41, slows inference and hampers true end-to-end performance due to increased latency. Additionally, YOLOv10 includes design improvements and optimizations of components that had previously been overlooked.

Firstly, YOLOv10 adopts an NMS-free design to reduce latency42. During training, YOLO traditionally assigns multiple positives per instance, enhancing performance but requiring NMS for post-processing. In contrast, one-to-one assignment removes the need for NMS but decreases accuracy and convergence speed. To leverage both approaches, YOLOv10 introduces consistent dual assignments by adding a one-to-one head alongside the standard YOLO head, as shown in Fig. 6. Both heads are optimized during training, benefiting from one-to-many learning, but during inference only the one-to-one head is used for faster predictions. A consistent matching metric ensures alignment between both heads, selecting the best positive sample for each.

Fig. 6 Consistent dual assignment process for NMS-free training incorporated in YOLOv10.
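As context for this design choice, below is a minimal NumPy sketch of the classical greedy NMS procedure that YOLOv10 removes; it keeps the highest-scoring box and suppresses overlapping candidates (single class, illustrative IoU threshold).

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of the kept boxes.
    """
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thr]  # discard near-duplicates
    return keep
```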
Transitioning to component modifications, YOLOv10 introduces both efficiency- and accuracy-driven changes. A lightweight classification head was developed using two depth-wise separable convolutions and a 1×1 convolution to reduce computation. For downsampling, Spatial-Channel Decoupled Downsampling separates spatial and channel transformations: point-wise convolution handles channel modulation while depth-wise convolution manages spatial resolution, minimizing cost and preserving information. Additionally, intrinsic rank analysis identified redundancies in deeper stages, leading to the Compact Inverted Block (CIB) design, shown in Fig. 7a, which combines depth-wise and point-wise convolutions for greater efficiency without sacrificing performance43.

In the second set of modifications, YOLOv10 integrates large-kernel depth-wise convolutions to expand the receptive field and enhance model capability. However, to avoid performance degradation in small object detection and increased computational cost, these convolutions are applied selectively to deeper stages of smaller model scales. Additionally, Partial Self-Attention (PSA), shown in Fig. 7b, is introduced to improve global modeling with low computational overhead. PSA divides the feature channels and applies multi-head self-attention only to a subset, enhancing global feature representation without significantly increasing complexity.

Fig. 7 CIB and PSA modules introduced in YOLOv10.

All these modifications enable YOLOv10 to achieve performance on par with the best available object detection architectures, with the added advantage of lower latency. This makes it a highly effective option for end-to-end deployment in applications where speed and accuracy are crucial, such as surveillance, autonomous driving, and real-time analysis. YOLOv10 is available in six variants (n, s, m, b, l, and x), with details provided in Table 5.

Table 5 Details of the YOLOv10 variants.

YOLOv11

YOLOv11 (YOLO11) is a recent advancement in the YOLO series, developed by Ultralytics, the creators of YOLOv5 and YOLOv8. Positioned as the direct successor to YOLOv8, it builds upon and enhances its foundational architecture. At the time of writing, there is no official article describing the architecture in depth; however, some key details and innovations have already been disclosed. The design adheres to the traditional YOLO framework, composed of three key components: the backbone, the neck, and the head, as illustrated in Fig. 8.

Fig. 8 High-level diagram of the YOLOv11 architecture.

The backbone is composed of alternating convolutional blocks and C3k2 blocks. Convolutional blocks combine 2D convolutions, batch normalization, and the Sigmoid Linear Unit (SiLU) activation function. The C3k2 blocks, shown in Fig. 9, evolve from the CSP bottleneck, enhancing feature extraction and information flow by splitting feature maps and applying efficient 3×3 convolutions (a minimal sketch of this split-and-concatenate pattern is given at the end of this subsection). Inside each C3k2 block, the C3K module, similar to the C2F structure but without splitting, is designed to balance accuracy and speed during feature extraction.

Fig. 9 Structure of a C3k2 block, and comparison with C2F.

In the neck, YOLOv11 retains the YOLOv8 base structure, including the SPPF, while incorporating C3k2 and C2PSA blocks. The C3k2 block was previously described. C2PSA, specifically designed for the neck, integrates Partial Spatial Attention (PSA) to enhance focus on critical image regions, aiding the detection of small or occluded objects. PSA modules apply an attention layer, concatenate input and attention features, and process the result through feed-forward networks and convolutions, followed by a final concatenation, as shown in Fig. 10. Finally, the head remains consistent with those used in previous versions, featuring a multi-scale design to detect objects at three different levels of detail. YOLOv11 is available in five variants, whose characteristics are detailed in Table 6.

Fig. 10 Structure of the C2PSA block used in YOLOv11.

Table 6 Details of the YOLOv11 variants.
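The following minimal PyTorch sketch illustrates the split-and-concatenate pattern shared by C2F and C3k2; it is a simplification of the actual Ultralytics modules, and the module and parameter names here are ours, not the official ones.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 conv-BN-SiLU stages with a residual shortcut."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU(),
        )
    def forward(self, x):
        return x + self.block(x)

class CSPSplitConcat(nn.Module):
    """Split features, refine one branch, concatenate every intermediate output."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_out, 1)       # transition to 2 * hidden width
        self.hidden = c_out // 2
        self.m = nn.ModuleList(Bottleneck(self.hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((n + 2) * self.hidden, c_out, 1)  # fuse all parts

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)          # channel split
        ys = [a, b]
        for m in self.m:
            ys.append(m(ys[-1]))                    # each bottleneck feeds the next
        return self.cv2(torch.cat(ys, dim=1))       # widen the receptive field

# Quick shape check
y = CSPSplitConcat(64, 128)(torch.randn(1, 64, 80, 80))  # -> (1, 128, 80, 80)
```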
YOLOv12

At the time of this work, YOLOv1244 represents the latest YOLO release, focusing on balancing inference speed and detection accuracy. Unlike previous CNN-centric versions, YOLOv12 introduces an attention-centric design, leveraging the superior modeling capabilities of attention mechanisms, which were traditionally avoided due to latency concerns. Through architectural innovations, it demonstrates that attention-based models can achieve real-time performance comparable to CNNs. YOLOv12 maintains the classic three-part structure (backbone, neck, and head), with significant modifications to integrate attention without compromising speed.

Regarding the backbone, YOLOv12 maintains a hierarchical architecture for progressive multi-scale feature extraction, with early stages inherited from YOLOv11 and optimized for efficiency. Its main innovation is the Residual Efficient Layer Aggregation Network (R-ELAN), shown in Fig. 11, which replaces the traditional ELAN. R-ELAN introduces a residual shortcut from the input to the output, combined with a scaling factor, effectively mitigating gradient blockage and improving convergence, particularly in large-scale models. Additionally, the aggregation strategy has been redesigned: instead of splitting the input and processing multiple paths in parallel as in ELAN, R-ELAN first adjusts the channel dimensions through a transition layer. After this adjustment, the features are processed sequentially and then concatenated, forming a stable and computationally efficient bottleneck structure.

Fig. 11 Comparison of R-ELAN (introduced in YOLOv12) with prior architectural blocks including GELAN, ELAN, C3K2, and CSPNet.

To further optimize the backbone, YOLOv12 integrates 7×7 separable convolutions to maintain a wide spatial receptive field while reducing parameters and memory usage. This design eliminates the need for explicit positional encoding, enabling spatial awareness without the high computational cost of traditional large-kernel operations. Additionally, lightweight convolutional blocks based on multiple small-kernel operations are employed; by decomposing computation and increasing parallelization, the model enhances processing speed while preserving rich feature representation. Finally, unlike YOLOv8 to YOLOv11, YOLOv12 omits the triple-block stacking in the final backbone stages and instead uses a single R-ELAN block, minimizing structural redundancy and improving training stability without sacrificing representational capacity.

As for the neck, YOLOv12 continues the modular feature fusion strategy used in previous YOLO versions but adapts it to better accommodate the integration of attention mechanisms. A central innovation in this component is Area Attention, a lightweight and efficient local attention mechanism tailored specifically for real-time object detection. Unlike conventional attention modules that rely on window partitioning (e.g., Swin Transformer) or complex grid patterns (e.g., axial or criss-cross attention), Area Attention segments the feature map into equal vertical or horizontal areas, as shown in Fig. 12, using a simple reshape operation, eliminating the overhead of explicit partitioning while maintaining a relatively large receptive field.

Fig. 12 Comparison between the Area Attention mechanism used in YOLOv12 and other attention mechanisms33.
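As an illustration, the sketch below conveys the core idea of Area Attention under simplifying assumptions: the map is divided into equal horizontal strips via a reshape, and standard multi-head self-attention is applied independently within each strip (the official module is more elaborate).

```python
import torch
import torch.nn as nn

class AreaAttention(nn.Module):
    """Self-attention applied independently within equal horizontal areas.

    A plain reshape partitions the (H, W) grid into `areas` strips, so no
    explicit window-partitioning logic is needed. Requires H % areas == 0.
    """
    def __init__(self, channels, areas=4, heads=4):
        super().__init__()
        self.areas = areas
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        strip = h // self.areas
        # (B * areas, strip * W, C): each strip becomes its own token sequence
        t = (x.view(b, c, self.areas, strip, w)
              .permute(0, 2, 3, 4, 1)
              .reshape(b * self.areas, strip * w, c))
        t, _ = self.attn(t, t, t)                   # attention stays local to a strip
        return (t.reshape(b, self.areas, strip, w, c)
                 .permute(0, 4, 1, 2, 3)
                 .reshape(b, c, h, w))

out = AreaAttention(64)(torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)
```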
To further accelerate feature processing, YOLOv12 incorporates FlashAttention in the neck. FlashAttention addresses a key limitation of traditional attention mechanisms: inefficient memory access caused by irregular transfers between high-speed SRAM and high-bandwidth memory, leading to high latency. By restructuring the attention computation into efficient memory I/O operations, FlashAttention reduces bandwidth usage and wall-clock time. Combined with Area Attention, it enables localized attention at high speed, allowing YOLOv12 to achieve real-time inference even at higher resolutions. Together, these components enhance feature discrimination in cluttered scenes, improve focus on critical regions, and retain fine-grained spatial information with minimal computational overhead.

Moreover, YOLOv12 modifies the MLP ratio in its attention blocks to improve computational efficiency. Traditionally set at 4:1 in standard vision transformer designs, this ratio determines the relative width of the intermediate feed-forward layer compared to the input dimension. YOLOv12 reduces it to 1.2:1 in the smaller model scales (n, s, m) and to 2:1 in the larger ones (l, x), effectively rebalancing the computational load between the attention mechanism and the subsequent feed-forward processing. This change not only reduces the overall parameter count and memory usage but also accelerates inference while preserving representational capacity.

Finally, the prediction head in YOLOv12 maintains the foundational design of earlier YOLO models, employing convolutional layers to predict class probabilities, bounding box coordinates, and objectness scores. Some refinements have been introduced to streamline the prediction pathways, enhancing efficiency and supporting consistent multi-scale detection45. Additionally, the head integrates Area Attention, improving spatial awareness and contributing to faster, more precise predictions. As with previous versions, YOLOv12 is available in five model variants, detailed in Table 7.

Table 7 Details of the YOLOv12 variants.

Performance metrics

Precision

Precision serves as an indicator of the reliability of the positive identifications made by an object detection model. It measures the proportion of correctly identified positives against the total number of items labeled as positive46,47. The formula to calculate precision is given in Eq. 1:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (1)$$

where TP represents the count of correctly identified positive instances, and FP represents the instances erroneously classified as positive. A high precision score indicates that the model is effectively minimizing false positives.

Recall

Recall measures the model's ability to identify all relevant instances within a dataset46. It quantifies the proportion of actual positives that are correctly detected. Mathematically, recall is defined as shown in Eq. 2:

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad (2)$$

where FN denotes the false negatives. High recall highlights the model's capability to capture as many positives as possible.

Mean average precision

Mean average precision (mAP) is a metric used to evaluate the overall accuracy of object detection models across various threshold settings46. It aggregates the precision scores across different recall levels, providing a single figure that summarizes the model's performance. mAP is calculated by averaging the area under the precision-recall curve for each class and then computing the mean of these averages across all classes, as shown in Eq. 3:

$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i, \qquad (3)$$

where AP_i is the average precision for class i, and N is the number of classes. The AP calculation, shown in Eq. 4, involves integrating precision over the range of possible recall levels, which can vary depending on the detection threshold applied48:

$$\text{AP} = \int_0^1 \text{precision}(r)\, dr. \qquad (4)$$

The mAP metric can be computed at different Intersection over Union (IoU) thresholds to provide more granular insights into the model's performance; two common variants are mAP@50 and mAP@50:9549. Achieving a high mAP score indicates that a detection model not only accurately identifies and localizes objects across different classes but also consistently upholds this accuracy under varying conditions of detection stringency50.
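To illustrate how these quantities are computed in practice, the following sketch derives cumulative precision and recall from ranked detections and integrates AP with all-point interpolation; the input format is an assumption for illustration, and mAP would then be the mean of the per-class AP values.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP from per-detection confidences and TP/FP flags.

    scores: (N,) confidences; is_tp: (N,) booleans (IoU-matched to ground
    truth at a chosen threshold, e.g. 0.5); n_gt: number of ground-truth boxes.
    """
    order = np.argsort(-scores)                 # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_gt                          # Eq. (2), cumulative form
    precision = tp / (tp + fp)                  # Eq. (1), cumulative form
    # All-point interpolation: make precision monotonically non-increasing
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over recall (Eq. (4))
    return np.sum(np.diff(recall, prepend=0.0) * precision)

scores = np.array([0.9, 0.8, 0.7, 0.6])
is_tp = np.array([True, True, False, True])
print(average_precision(scores, is_tp, n_gt=5))   # 0.55 for this toy input
```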
Training and implementation details

The aim of this study is to evaluate the models within a common setting so that the results are comparable and relevant. Training was conducted using the default hyperparameter configurations. This design choice prioritizes fairness and comparability over individual model optimization, ensuring that all architectures are evaluated under consistent and realistic training conditions. The most relevant parameters include a learning rate of 0.01, the SGD optimizer with momentum 0.937, a weight decay of 0.0005, and an input image resolution of 640×640 pixels. The only adjustment applied across all models was setting the batch size to 64 to match the available GPU memory. The training process spanned 100 epochs, a standard setting applied uniformly across all architectures to maintain consistency and facilitate direct comparisons.

During the evaluation process, the best-performing checkpoint from each training session was selected based on validation performance. The comparison and analysis were carried out in two stages: first, by contrasting the results of variants within the same version, and second, by conducting a broader comparison across different architectures. This approach allowed us to identify the strengths and weaknesses of each model under similar conditions. Regarding the implementation code, we used the official open-source notebooks provided by the authors of each YOLO version. These notebooks include complete implementations for training, evaluation, and inference, and are publicly available in their respective GitHub repositories. The training environment was a high-performance computing setup based on the PyTorch framework, using four Nvidia A100 SXM4 40GB GPUs, 128 CPU cores, and 256 GB of memory.
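As an illustration, a training run matching these settings with the Ultralytics API would look roughly as follows; the dataset YAML path is an assumption, and the non-Ultralytics versions (e.g., YOLOv9) were trained through their own official repositories.

```python
from ultralytics import YOLO

# Default hyperparameters (lr0=0.01, SGD momentum=0.937, weight_decay=0.0005)
# are used implicitly; only batch size and epochs are set explicitly.
model = YOLO("yolov8n.pt")            # pretrained nano weights
model.train(
    data="tomato_village.yaml",       # assumed dataset config (paths + 6 classes)
    epochs=100,
    imgsz=640,
    batch=64,
    device=[0, 1, 2, 3],              # four A100 GPUs
)
metrics = model.val()                 # evaluate the best checkpoint
```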
Results and discussion

YOLOv8

To begin with, the training times of the YOLOv8 variants, shown in Table 8, are analysed. As expected, times increase with model complexity. Notably, YOLOv8n and YOLOv8s show similar durations (0.602 and 0.641 hours, respectively), suggesting their suitability for scenarios with limited computational resources and the need for rapid, cost-effective deployment. A similar trend is observed with YOLOv8m and YOLOv8l, with training times of 1.072 and 1.169 hours, respectively. Although more complex than the nano and small variants, they exhibit comparable efficiency, making them viable options when balancing training time and model complexity for tomato leaf disease detection. Finally, YOLOv8x records the highest training time at 1.664 hours, as expected given its greater architectural complexity. However, the increase remains acceptable considering it has 65 million more parameters than the nano variant. This highlights the efficiency of YOLOv8, making even its most complex models viable for tomato leaf disease detection across varying computational environments.

Moving on to performance evaluation (Table 8), an incremental improvement is observed from the lightest to the most complex models. Notably, the mAP@50:95 increases from 0.538 (nano) to 0.776 (x-large), reflecting enhanced detection and localization accuracy as model complexity grows. A similar pattern is seen in precision, recall, and mAP@50 scores. These improvements suggest that while more complex models require additional training time, they deliver significantly better precision across multiple diseases, justifying the extra computational resources for practical deployment.

Table 8 Results of each version of YOLOv8 on the evaluation set (1,438 images), plus training time (100 epochs).

An important observation is that the most significant performance gains occur in the first three YOLOv8 variants: nano, small, and medium. Beyond the medium model, the improvements offered by the large and x-large versions are smaller and more incremental. This suggests that, for the task and dataset in this work, heavier variants reach a point of diminishing returns. Consequently, the nano, small, and medium models offer a more efficient trade-off between performance and computational cost, making them practical options for tomato leaf disease detection without requiring the most complex variants.

When analyzing each disease separately, the heavier YOLOv8 variants generally show better performance. For instance, YOLOv8x achieves the highest precision, mAP@50, and mAP@50:95 for late blight, leaf miner, and magnesium deficiency, while YOLOv8l records the best recall for late blight. Notably, YOLOv8x consistently reports the highest mAP@50:95 across all diseases, highlighting its effectiveness in capturing fine details. However, lighter models sometimes outperform heavier ones. YOLOv8m, for example, achieves the highest precision (0.947) and mAP@50 (0.905) for nitrogen deficiency, surpassing the large and x-large variants. Additionally, YOLOv8m also reports the best recall for spotted wilt virus, demonstrating that intermediate models can offer a more balanced and effective performance in certain cases.

Regarding inference speed, the YOLOv8n and YOLOv8s variants prove to be the fastest, with times of 2.2 ms and 2.4 ms, respectively. The YOLOv8m variant has an inference time of 3.9 ms, while the large variant adds an extra millisecond, reaching 4.9 ms. Finally, the heaviest variant, YOLOv8x, reports 6.8 ms, making it the slowest of all. This speed is still acceptable, however, and does not rule out YOLOv8x as a viable option, especially considering that precision and detailed detection capability are crucial for tomato leaf disease detection.
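For reference, qualitative tests like those discussed next can be produced with a few lines of the Ultralytics API; the checkpoint and image paths below are illustrative.

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # best checkpoint (assumed path)
results = model.predict("field_leaf.jpg", imgsz=640, conf=0.25)

for box in results[0].boxes:                        # one Results object per image
    cls_name = results[0].names[int(box.cls)]       # predicted disease class
    print(f"{cls_name}: {float(box.conf):.2f}", box.xyxy.tolist())

results[0].save("prediction.jpg")                   # image with drawn boxes
```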
Further insights are provided by the tests shown in Fig. 13. In the first test (Fig. 13a), lighter models like YOLOv8n and YOLOv8s fail to correctly detect magnesium deficiency and often confuse sunlight reflections with disease symptoms. In contrast, heavier models accurately identify magnesium deficiency with higher confidence scores, with YOLOv8x exceeding 90%. In the second test (Fig. 13b), YOLOv8n again performs poorly, detecting only half of the instances with low confidence. From YOLOv8m onwards, models detect most instances, with YOLOv8l and YOLOv8x achieving confidence scores above 80%. However, the medium and large variants occasionally exhibit duplicate detections, suggesting slight issues with instance overlap.

Fig. 13 Inference tests of YOLOv8.

YOLOv9

The training times of YOLOv9 (Table 9) reveal that all variants require over two hours. YOLOv9-T, YOLOv9-S, and YOLOv9-M report very similar durations (2.290, 2.452, and 2.497 hours) despite differences in parameter count. YOLOv9-C shows a moderate increase (2.905 hours), while YOLOv9-E, the most complex variant, approaches four hours. Overall, YOLOv9 exhibits considerably long training times, potentially limiting its practicality under resource constraints.

Moving to performance evaluation (Table 9), heavier variants demonstrate clear superiority over lighter ones. YOLOv9-E achieves the best overall results, approaching 90% in precision and mAP@50, with notable scores of 0.795 and 0.590 in recall and mAP@50:95, respectively. YOLOv9-C follows, with precision and mAP@50 above 80%, and competitive recall and mAP@50:95 values. In contrast, YOLOv9-M and YOLOv9-S report performance below 80% in precision, recall, and mAP@50, and under 50% in mAP@50:95. YOLOv9-T performs the worst across all metrics. These findings highlight that, although lighter variants train faster, their effectiveness in disease detection is significantly inferior.

Table 9 Results of each version of YOLOv9 on the evaluation set (1,438 images), plus training time (100 epochs).

Analyzing per-disease performance, a pattern similar to the overall results emerges. YOLOv9-E consistently outperforms all other variants, notably achieving 0.916 mAP@50 for magnesium deficiency, the only case exceeding 90% across diseases and metrics. YOLOv9-C and YOLOv9-M show competitive but variable performances: YOLOv9-C excels in precision for diseases like leaf miner and magnesium deficiency, while YOLOv9-M shows higher recall for late blight and nitrogen deficiency. Both maintain mAP@50:95 scores frequently above 50%. Meanwhile, YOLOv9-S declines, surpassing 80% mAP@50 only for late blight, magnesium deficiency, and potassium deficiency. Finally, YOLOv9-T shows the weakest performance, achieving acceptable precision only for spotted wilt virus (0.745) and reporting very low mAP@50:95, below 30% for diseases like leaf miner and nitrogen deficiency.

Regarding inference speed, clear variations emerge as model complexity increases. The T variant is the fastest at 2.8 ms, while the S variant rises to 3.6 ms. YOLOv9-M shows a notable jump to over 7 ms, suggesting limitations for real-time applications. This trend continues with YOLOv9-C at 8.9 ms and YOLOv9-E reaching 11.5 ms. Although the heavier variants offer better precision and detection, their higher inference times raise concerns about practical deployment in tomato leaf disease detection.

Finally, Fig. 14 shows the graphical test results. In the first test (Fig. 14a), all models struggle to fully match the ground-truth detections.
YOLOv9-S and YOLOv9-M incorrectly detect magnesium deficiency in unaffected areas, and all variants falsely detect leaf miner instances in the leaf center. Heavier models show improved confidence scores, though still below 90%. In the second test (Fig. 14b), overall performance improves, but issues persist, such as YOLOv9-C incorrectly detecting magnesium deficiency and some late blight instances. Again, heavier variants achieve higher confidence scores, occasionally exceeding 80% but never surpassing 90%. These findings highlight that, despite better precision, the heavier models still face challenges in detection fine-tuning.

Fig. 14 Inference tests of YOLOv9.

YOLOv10

Beginning with the training times of the YOLOv10 models (Table 10), the differences between variants are not excessive despite variations in parameter count. YOLOv10-N reports the shortest time at 0.830 hours, making it ideal for resource-limited scenarios. YOLOv10-S increases slightly to 0.926 hours, while YOLOv10-M and YOLOv10-B add less than half an hour more. The heavier YOLOv10-L and YOLOv10-X variants show the highest times, 1.643 and 2.008 hours, respectively. Notably, the overall difference between YOLOv10-N and YOLOv10-X is just over an hour, suggesting that even the heaviest models maintain reasonable training times, supporting their practical use for leaf disease detection.

Turning to the performance results (Table 10), overall performance improves with increasing model complexity. YOLOv10-X stands out with the highest metrics across the board. However, lighter models also achieve strong results; YOLOv10-M, for example, records a precision of 0.938, just slightly behind YOLOv10-X's 0.942. Even YOLOv10-N demonstrates notable performance, with precision and mAP@50 scores exceeding 80%. These results indicate that all YOLOv10 variants deliver reliable performance, offering flexibility to adapt to different computational resources and application requirements.

Table 10 Results of each version of YOLOv10 on the evaluation set (1,438 images), plus training time (100 epochs).

Analyzing the performance for each disease, the overall trend persists, with all YOLOv10 variants showing strong results. The major differences arise when comparing lighter models like YOLOv10-N and YOLOv10-S with heavier ones like YOLOv10-L and YOLOv10-X. YOLOv10-X achieves notable precision for magnesium deficiency (0.951) and spotted wilt virus (0.954), while YOLOv10-L surpasses it with 0.965 precision and 0.971 mAP@50 for potassium deficiency. Mid-complexity models like YOLOv10-B and YOLOv10-M also perform well, often exceeding 90% in key metrics; for example, YOLOv10-M achieves 0.957 precision for potassium deficiency. Although YOLOv10-N and YOLOv10-S show lower overall performance, they still present solid results, with YOLOv10-S often reaching above 90% and YOLOv10-N surpassing 80%. However, YOLOv10-N records the lowest figure among all architectures (0.587 recall for leaf miner). Overall, the YOLOv10 family proves highly effective for tomato leaf disease detection.

Regarding inference speeds, a corresponding rise in latency is observed as model complexity increases. YOLOv10-N and YOLOv10-S achieve speeds of 2.1 ms and 2.3 ms, respectively, while YOLOv10-M reaches 3.4 ms. YOLOv10-B and YOLOv10-L follow with 4.2 ms and 4.9 ms, remaining within an acceptable range. YOLOv10-X records the highest latency at 6.7 ms, but this value is still reasonable for practical applications.
Overall, despite higher inference times in the heavier models, all variants maintain a viable balance between precision and processing speed for tomato leaf disease detection.

Turning to the graphical tests in Fig. 15, the results align with the numerical observations. In the first test (Fig. 15a), all models detect most instances, with YOLOv10-S and YOLOv10-B achieving perfect detection; YOLOv10-B also shows slightly higher confidence scores. YOLOv10-L and YOLOv10-X mistakenly detect an extra leaf miner instance, while YOLOv10-M misses one spotted wilt virus. Confidence scores generally improve with heavier models, nearing or exceeding 90%. In the second test (Fig. 15b), YOLOv10-M, YOLOv10-B, and YOLOv10-X match the ground truth exactly, with YOLOv10-X achieving the best confidence scores, often surpassing 90%. YOLOv10-N misses one detection, while YOLOv10-S and YOLOv10-L introduce false positives. Overall, all variants perform well, with relatively few errors and high confidence levels.

Fig. 15 Inference tests of YOLOv10.

YOLOv11

Beginning with training times (Table 11), a consistent increase is observed as model complexity grows. YOLOv11n, the lightest variant, reports the shortest time at 0.570 hours, followed by YOLOv11s at 0.679 hours, a modest increase. YOLOv11m requires 0.962 hours, approximately 42% more than YOLOv11s. The heavier variants, YOLOv11l and YOLOv11x, demand 1.227 and 1.635 hours, respectively. Notably, YOLOv11x's training time is nearly three times that of YOLOv11n, reflecting the computational cost associated with larger models. Even so, the most robust variant, YOLOv11x, requires just over 1.5 hours to train, which is remarkably efficient considering its complexity. The relatively low training times of the YOLOv11 family, combined with its scalability across variants, make these models suitable for diverse use cases, ranging from resource-constrained environments to high-performance systems requiring robust accuracy and speed. They therefore present themselves as strong candidates for practical applications and deployments.

When analyzing the performance results (Table 11), an incremental improvement is observed with increasing model complexity. YOLOv11x emerges as the best-performing variant, with a precision of 0.940, recall of 0.884, mAP@50 of 0.936, and mAP@50:95 of 0.790. In contrast, YOLOv11n reports the lowest values, with 0.835 precision, 0.773 recall, 0.840 mAP@50, and 0.565 mAP@50:95. YOLOv11l ranks second, achieving 0.931 precision and 0.932 mAP@50, very close to YOLOv11x. Meanwhile, YOLOv11s and YOLOv11m show intermediate but competitive results, with YOLOv11s notably achieving 0.913 precision and 0.906 mAP@50, making it a strong candidate for resource-limited applications. Overall, the YOLOv11 models offer excellent performance paired with efficient training times, establishing them as robust and versatile solutions for tomato leaf disease detection.

Analyzing the inference times, lighter models are faster, as expected. YOLOv11n is the quickest at 2.1 ms per image, followed by YOLOv11s at 3.7 ms, YOLOv11m at 4.9 ms, YOLOv11l at 5.8 ms, and YOLOv11x at 7.9 ms. Despite being the slowest, YOLOv11x maintains an inference time suitable for most applications.

Table 11 Results of each version of YOLOv11 on the evaluation set (1,438 images), plus training time (100 epochs).

Regarding per-class performance, the more complex models, YOLOv11l and YOLOv11x, deliver the best results.
YOLOv11l achieves the highest scores for potassium deficiency, while YOLOv11x excels across most classes, surpassing 95% precision in magnesium and nitrogen deficiencies and leading in late blight metrics. Lighter models also perform notably. YOLOv11s achieves precision above 90% in five of the six classes, and YOLOv11m shows strong results such as 0.945 precision for magnesium deficiency and 0.944 mAP@50 for late blight. Even YOLOv11n consistently exceeds 80% across categories. Overall, the YOLOv11 family balances complexity and accuracy effectively, offering robust solutions for tomato leaf disease detection across different resource constraints.

Moving forward, Fig. 16 presents the inference tests using the YOLOv11 models. In the first test (Fig. 16a), all models perform satisfactorily, with well-localized bounding boxes and high confidence scores. However, none achieve perfect alignment with the ground truth: YOLOv11n, YOLOv11m, YOLOv11l, and YOLOv11x correctly predict eight out of nine instances, while YOLOv11s identifies only six. Confidence scores are generally high, often exceeding 50% and reaching above 90% in the more complex models. The second test (Fig. 16b) is more challenging. YOLOv11n detects seven out of nine instances but shows overlapping boxes and lower confidence; YOLOv11s underperforms with fewer correct detections. YOLOv11m improves with eight correct predictions and stronger confidence scores. YOLOv11x detects extra instances with misplacements, while YOLOv11l achieves near-perfect detection with high confidence, reinforcing its reliability.

Fig. 16 Inference tests of YOLOv11.

YOLOv12

Starting with the training times (Table 12), they increase progressively with model scale, from 0.640 hours for YOLOv12n to 2.094 hours for YOLOv12x. This growth reflects the rise in parameters and complexity across variants. Notably, the increase is not linear: while YOLOv12m requires around 1.17 hours, YOLOv12x nearly doubles that time, highlighting inefficiencies in scaling. This can be critical in environments with limited computational resources or frequent retraining needs.

Turning to performance (Table 12), the YOLOv12 family reflects the benefits of its attention-centric design. Precision rises from 0.860 (YOLOv12n) to 0.947 (YOLOv12x), indicating improved reliability and fewer false positives as capacity increases, largely due to Area Attention. Recall also improves, from 0.790 to 0.873, although the gain is more moderate, suggesting that while larger models detect more true instances, attention mechanisms alone may not fully compensate for the limited capacity of the smaller variants.

The most significant improvement appears in mAP@50:95, rising from 0.604 (YOLOv12n) to 0.783 (YOLOv12x), a relative gain of nearly 18%. This highlights the effectiveness of additions like R-ELAN and position-aware convolutions in refining localization. Notably, the jump from YOLOv12s to YOLOv12m (+0.049) shows that mid-scale models already benefit considerably from attention mechanisms. Meanwhile, mAP@50 improves more gradually, from 0.864 to 0.933, indicating that all models perform well at coarse localization, with larger models primarily enhancing fine-grained accuracy. However, this performance gain comes at the cost of increased inference time, rising from 1.5 ms (YOLOv12n) to 10.5 ms (YOLOv12x). While expected, this trade-off is critical for deployment, particularly in real-time or edge-computing scenarios.
Notably, YOLOv12m offers a strong balance, achieving 0.907 mAP@50 and 0.717 mAP@50:95 at just 4.9 ms, positioning it as a competitive mid-scale option for accuracy and efficiency.

Table 12 Results of each version of YOLOv12 on the evaluation set (1,438 images).

When examining per-class performance across YOLOv12 variants, a clear improvement in classification and localization is observed as model complexity increases. YOLOv12n, though modest overall, achieves respectable precision for potassium deficiency (0.871) and spotted wilt virus (0.873), but struggles with leaf miner, where recall drops to 0.666. YOLOv12s shows substantial improvement, with all classes reaching at least 0.847 mAP@50, and particularly strong results in magnesium deficiency (0.920) and leaf miner (0.905), highlighting enhanced sensitivity to subtle disease patterns.

YOLOv12m further strengthens this trend, becoming the first variant where every class achieves over 0.875 mAP@50. Complex diseases like magnesium deficiency and spotted wilt virus surpass 0.920 mAP@50, reflecting the benefits of deeper attention layers and R-ELAN aggregation in improving spatial discrimination. In YOLOv12l, class-wise performance remains highly balanced. Although the overall mAP gain over YOLOv12m is modest, precision and recall stay consistently high across all classes. Magnesium deficiency reaches 0.938 mAP@50, and potassium deficiency achieves the highest mAP@50:95 at 0.788, highlighting YOLOv12l's ability to accurately detect both large, uniform symptoms and fine, scattered patterns, an outcome attributed to its deeper R-ELAN structures and integrated attention modules.

Finally, YOLOv12x maintains consistently high class-wise performance, though not uniformly. Potassium deficiency leads with 0.955 mAP@50 and 0.835 mAP@50:95, the highest across all models, while magnesium deficiency and spotted wilt virus also achieve strong metrics. In contrast, nitrogen deficiency lags slightly in mAP@50:95 (0.764) despite high precision, indicating less precise bounding box localization. Similarly, leaf miner exhibits lower recall and localization scores, likely due to its thin, linear patterns. Overall, YOLOv12x delivers top performance with high accuracy and confidence, although fine-grained symptom detection remains challenging even at maximum model capacity.

Regarding the visual tests, Fig. 17a shows the first qualitative comparison across YOLOv12 variants. Overall, the models effectively control redundant detections, indicating stable confidence thresholds and well-calibrated post-processing. However, detection accuracy varies. From YOLOv12n through YOLOv12l, distinguishing spotted wilt virus from magnesium deficiency remains problematic, leading to incomplete identification of magnesium cases. Only YOLOv12x correctly detects both, likely due to deeper attention layers and enhanced spatial resolution. All models detect the single instance of leaf miner, although YOLOv12n and YOLOv12x misclassify an additional instance, suggesting that both underparameterized and overparameterized models may confuse thin or vein-like background textures.

The second visual test, presented in Fig. 17b, again shows consistent bounding box behavior across YOLOv12 variants, with no excessive overlapping, suggesting stable confidence calibration. However, this test proves more challenging, with the lighter variants (nano, small, and medium) failing to detect leaf miner instances near image borders or within occluded regions.
Conversely, these models reliably detect late blight cases. YOLOv12l improves leaf miner detection but struggles with late blight at the edges, indicating sensitivity to spatial context. Surprisingly, YOLOv12x exhibits the most errors, missing leaf miner in bright areas and showing lower confidence in peripheral late blight detections. This highlights that increased model complexity does not guarantee better robustness in edge-case scenarios.

Fig. 17 Inference tests of YOLOv12.

General remarks

To conclude the analysis of the architectures, Figs. 18 and 19 compare all models based on overall performance. The first key observation is that the YOLOv10 models are the lightest, with even the x-large version staying below 30 million parameters. YOLOv9 increases in complexity, especially YOLOv9-E, which nears 60 million parameters. Similarly, YOLOv8's large variant reaches almost 44 million parameters, while YOLOv8x is the most complex model, approaching 70 million. YOLOv12 follows a comparable scaling trend, with its nano variant matching the lightest models and its x-large model growing to just under 60 million parameters, placing it above the YOLOv9, YOLOv10, and YOLOv11 x-large variants but still below YOLOv8x.

Analyzing Fig. 18, which compares performance relative to parameter count, the YOLOv10 models stand out by delivering strong precision even in the lighter variants, with the medium model already outperforming all YOLOv8 and YOLOv9 counterparts. YOLOv10-M, B, L, and X consistently surpass 90% precision, showcasing the effectiveness of their NMS-free dual-head design. YOLOv8 models follow closely, maintaining precision mostly above 85%, while the YOLOv9 variants lag, with competitive precision only in the C and E versions. YOLOv11 further elevates precision, with YOLOv11x nearing 95% and smaller variants matching or exceeding YOLOv8 models of similar complexity. YOLOv12 shows precision comparable to YOLOv11, especially in its nano, small, and medium versions. However, the large variant drops slightly, falling below YOLOv8l, YOLOv10-L, and YOLOv11l levels, nearing 90%. Despite this, YOLOv12x achieves the highest precision among all models, outperforming even YOLOv8x while using fewer parameters, confirming the strength of its attention-centric design.

Fig. 18 Performance comparison between all architectures regarding their number of parameters.

Analyzing recall, the gap between YOLOv10 and YOLOv8 narrows, with only YOLOv10-X outperforming all YOLOv8 variants. YOLOv8l and YOLOv8x show competitive recall values, equaling or slightly surpassing most YOLOv10 and YOLOv9 models. Lighter variants from both families perform similarly, without major differences. YOLOv11 again sets a new benchmark, with its medium, large, and x-large versions all exceeding 85% recall, and even its nano model outperforming its counterparts. YOLOv12 shows intermediate recall, generally falling between YOLOv11 and YOLOv8, consistently surpassing YOLOv8 and maintaining a clear lead over YOLOv9, which remains the weakest. Among the nano variants, YOLOv12n achieves the best recall, nearing 80%. However, across all models, no variant exceeds 90% recall, highlighting a common limitation.

Moving on to mAP@50, YOLOv8 and YOLOv10 show similar performance, surpassing 80% in the lighter variants and 90% in the heavier ones. YOLOv11 again stands out, with its medium, large, and x-large models outperforming all others, and even its nano and small variants exceeding their counterparts from YOLOv8, YOLOv9, and YOLOv10.
YOLOv9 remains the weakest, with only its E variant exceeding 85%. YOLOv12 positions itself between YOLOv10 and YOLOv8, generally outperforming the latter but trailing the former. Although YOLOv12 does not surpass YOLOv11, its medium and large variants offer comparable results. Notably, YOLOv12n emerges as the best-performing nano model across all versions, confirming its strength in lightweight detection tasks.

The mAP@50:95 results confirm YOLOv11's superiority, with its variants consistently leading and several models approaching the 80% threshold, which no other version achieves. The prior advantage of YOLOv10-X over YOLOv8 disappears here, as YOLOv8l and YOLOv8x outperform their YOLOv10 counterparts. YOLOv9 continues to underperform, with only its heaviest variants nearing 60%. In this stricter evaluation, YOLOv12 improves its relative position, consistently surpassing YOLOv8 and YOLOv10 and establishing itself as the second-best architecture after YOLOv11. YOLOv12n again excels among the nano models, surpassing 60%, while YOLOv12x ranks just behind YOLOv11x as the second-best performer overall.

Notably, the comparatively weaker performance of YOLOv9 can be attributed to architectural and training-related factors. In particular, the integration of PGI, with its reversible auxiliary branch, and the GELAN backbone introduces higher training demands. While these innovations aim to maintain high accuracy and stable learning, they may require longer convergence periods. Given the fixed training schedule of 100 epochs used across all models, combined with the diversity of disease classes and the density of instances in the dataset, YOLOv9 may not have had sufficient training time to fully optimize its parameters. This could explain why its performance lags behind later architectures such as YOLOv10 and YOLOv11, which adopt more efficient and lightweight components better suited to faster convergence.

Turning to Fig. 19, which compares efficiency in terms of training time and latency, the YOLOv9 models clearly require the most resources, with significantly higher training durations, making them the least practical for deployment. YOLOv10 improves efficiency, with all variants training in under 2.1 hours. YOLOv8 performs even better, completing training in under 1.7 hours across all versions, positioning it among the most efficient. YOLOv11 continues this trend with slightly longer times, especially in its larger models, but still within an excellent range. YOLOv12 ranks third, with all variants training in under 2.5 hours, slightly slower than YOLOv11 and YOLOv8 yet far more efficient than YOLOv9.

Fig. 19 Efficiency comparison between all architectures.

Regarding latency, the YOLOv8 and YOLOv10 models display similar behavior, with YOLOv8 offering the best balance between speed and detection performance. YOLOv10 maintains comparable latency but slightly lags in accuracy, making its overall efficiency less favorable. YOLOv11 achieves the highest accuracy but at a modest cost in latency compared to YOLOv8, a relevant factor for strict real-time applications. YOLOv9 performs the worst, with even its better variants, YOLOv9-C and YOLOv9-E, reporting 8.9 and 11.5 ms, respectively, limiting their practical use. YOLOv12 follows a latency profile similar to YOLOv10 and YOLOv11, with YOLOv12n standing out as the fastest nano variant. The small to large YOLOv12 models maintain comparable efficiency, while YOLOv12x exhibits a notable latency increase, ranking as the slowest model after YOLOv9-E.
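For context, per-image latencies of this kind are typically measured with warm-up iterations and explicit GPU synchronization, roughly as sketched below (batch size 1, a CUDA device, and the listed weight file are assumptions).

```python
import time
import torch
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                 # any variant under comparison
dummy = torch.zeros((1, 3, 640, 640))      # single 640x640 input

for _ in range(20):                        # warm-up: stabilize clocks and caches
    model.predict(dummy, verbose=False)

torch.cuda.synchronize()
start = time.perf_counter()
runs = 200
for _ in range(runs):
    model.predict(dummy, verbose=False)
torch.cuda.synchronize()                   # wait for queued GPU work to finish
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms / image")
```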
Overall, for tomato leaf disease detection, YOLOv11 offers the best trade-off between model complexity, training time, and practical deployment. It consistently delivers top-tier accuracy while maintaining reasonable training and inference times. YOLOv8, YOLOv10, and YOLOv12 also provide strong alternatives, each excelling in a different trade-off combination; YOLOv10 and YOLOv12 in particular stand out for their efficient lightweight variants. Conversely, YOLOv9 underperforms despite its complexity, showing longer training times and limited gains in accuracy. These findings highlight that selecting an architecture depends not only on raw performance but also on balancing computational cost and real-world deployment needs in agricultural scenarios.

When training infrastructure is limited, opting for models such as YOLOv12n or YOLOv10-N provides significant time savings and enables faster retraining cycles, even if it comes at a small cost in detection accuracy. On the other hand, in environments where longer training times are acceptable and hardware resources are sufficient, models like YOLOv11x are preferable due to their superior generalization and precision. Ultimately, this study demonstrates that model choice should account for both resource availability and the precision demands of the target agricultural application, balancing efficiency with accuracy across varying deployment scenarios.

Limitations

Although this study makes a thorough effort to provide a comprehensive comparison, several limitations must be recognized. Firstly, the dataset used covers only six diseases, which may not fully represent the variety of diseases that can affect tomato leaves under different regional and environmental conditions. Additionally, the training and evaluation were conducted on high-performance hardware, which is not readily available in all settings, potentially affecting the replicability of the results in environments with more limited computational resources. Lastly, the focus of this work is solely on YOLO architectures, without considering other object detection models that may be of interest to certain readers.

Conclusions and future works

This work compares YOLO architectures for the task of tomato leaf disease detection. Specifically, it evaluates the latest versions of YOLO: YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. For this purpose, the Tomato-Village dataset is used, comprising 14,368 images of tomato leaves across six disease types: late blight, leaf miner, magnesium deficiency, nitrogen deficiency, potassium deficiency, and spotted wilt virus. Training was conducted using all available variants of the architectures, maintaining default parameters to ensure a solid comparative analysis. The results reveal significant differences in performance, highlighting the strengths and weaknesses of each architecture.

YOLOv11 emerged as the top-performing architecture, achieving the highest precision, recall, and mAP scores while maintaining competitive training times and reasonable latency, positioning it as the most attractive option for tomato leaf disease detection. YOLOv10 and YOLOv12 follow closely, offering a strong balance between accuracy, speed, and efficiency.
YOLOv8, while slightly behind YOLOv10, YOLOv11, and YOLOv12, still delivers notable performance and, in some cases, rivals heavier models. In contrast, YOLOv9 shows the weakest results, with the longest training times, poorest metrics, and highest latency. Notably, YOLOv12n stands out as the best nano variant across all architectures, offering exceptional speed and robust detection capabilities, making it ideal for severely resource-constrained scenarios.

Future research could evaluate these models under varying environmental conditions and on different plant disease datasets to assess their robustness and generalization. It would also be valuable to test their deployment in real-time agricultural monitoring systems and to compare them with other object detection techniques to identify the most accurate and efficient solutions. Finally, future work could focus on developing practical platforms based on these architectures to enable early disease detection and improve crop management strategies.