Introduction

Road traffic injury remains a stubborn public health crisis in the United States. In 2022 alone, 42,795 people were killed on U.S. roads, one of the highest per capita fatality rates in the developed world [1]. Despite decades of countermeasures, the fatality curve continues to rise, especially in the United States (as shown in Fig. 1a), underscoring an urgent need for new data-driven techniques that can uncover the mechanisms of crashes and inform decisive policy action. Expected crash prediction models (hereafter referred to as crash prediction) offer a principled way to learn from historical data and isolate the factors that most strongly elevate risk [2].

Fig. 1: Overview of the proposed SafeTraffic Copilot.
a The U.S. faces one of the highest crash risks among developed countries, with a rising trend. However, analyzing and addressing this issue is challenging due to the heterogeneous factors involved in crash events, including traffic conditions, human behavior, environmental impacts, and driver characteristics. To tackle this, we propose SafeTraffic Copilot, a framework designed for two key tasks: (1) predicting crash outcomes and (2) attributing crash factors for conditional risk analysis. By addressing questions such as why crashes occur and how to mitigate crash risks, SafeTraffic Copilot seeks to deliver optimal policies for safety improvement. b The SafeTraffic Copilot workflow incorporates multi-modal data, including driver behavior, vehicle details, infrastructure, and environmental conditions, represented through textual reports, satellite imagery, and other formats. Leveraging an AI-expert cooperative method, the crash data is transformed into textual prompts, resulting in the SafeTraffic Event dataset comprising 66,205 cases. SafeTraffic LLM is created with accurate and trustworthy forecasting abilities for further analysis.
Building on this pipeline, SafeTraffic Attribution operates across three dimensions: (1) event-level risk analysis to identify feature contributions, (2) conditional risk analysis to assess state-level risks under varying conditions, and (3) data collection guidance to optimize the data acquisition process. The results of SafeTraffic Attribution provide actionable insights to enhance data analysis and collection, fostering a more comprehensive understanding of crash data and events.

Current approaches to crash prediction are broadly categorized into macroscopic, statistical-level analyses and microscopic, event-level investigations. Macroscopic models offer a general understanding of safety performance, identifying high-risk areas and temporal trends, but they lack the granularity to explain the specific circumstances of a crash: the who, what, and why [3,4,5,6]. Microscopic models, often employing machine learning, aim to predict crash consequences under specific traffic conditions, but they have struggled with precision and generalization [4,7,8,9,10]. A fundamental challenge lies in effectively integrating the multi-modal data associated with a crash event, spanning textual narratives, numerical data, images, and driver histories, and interpreting the complex interplay of contributing factors; this limits their utility in designing effective safety policies.

The recent emergence of foundation models, especially Large Language Models (LLMs), presents a transformative opportunity to mitigate these enduring challenges by leveraging their advanced capabilities in processing and reasoning over complex, multi-modal information [11,12,13,14,15]. These models can synthesize and interpret vast, unstructured data, such as the narrative descriptions in crash reports, and align them with structured data like roadway characteristics and driver histories, offering a more holistic understanding than was previously possible.
However, adapting these powerful generative models for the discriminative task of crash outcome prediction introduces its own set of technical hurdles. The primary challenge is methodological: generative LLMs with extensive output vocabularies must be re-engineered to reliably predict outcomes within a set of well-defined, finite categories (e.g., crash severity levels) [16]. This adaptation raises major concerns about the trustworthiness and calibration of their predictions, which are crucial for high-stakes applications like public safety [17]. Furthermore, the inherent "black box" nature of these models poses a major obstacle to achieving the interpretability required for targeted safety improvements [18,19,20]. While initial studies have explored LLMs for traffic safety, they have been limited to prompt engineering and have not addressed the crucial interpretability gap, which is essential for robust decision support and for answering the critical why and how questions of crash causation [21,22,23].

In this research, we introduce SafeTraffic Copilot, an LLM-driven framework that shifts the paradigm from aggregate-level statistics to granular, event-level crash prediction and understanding (see Fig. 1b). By reframing crash prediction as a text-based reasoning task, SafeTraffic Copilot is designed to address the key challenges of data integration, model generalization, and feature attribution. The framework consists of three integrated components: the SafeTraffic Event dataset, for unifying multi-modal crash data; SafeTraffic LLM, for accurate outcome prediction; and SafeTraffic Attribution, for conditional risk analysis.
This approach allows us not only to forecast the when, where, who, and what of a crash but also to provide a deep, interpretable understanding of why it occurred and how similar risks can be mitigated, offering a unified approach for targeted and effective data-driven safety interventions.

Results

This study shifts traffic safety analysis from aggregate-level to event-level crash prediction by developing SafeTraffic Copilot, a customized LLM framework that integrates multi-modal crash data into a broader semantic context to forecast consequences and attribute features with interpretability. Using the SafeTraffic Event dataset (66,205 textual prompts; over 14 million words) and framing outcome prediction as token generation, SafeTraffic LLM delivers a 33.3% to 45.8% average F1 improvement over competitive baselines across multiple crash-consequence tasks. By embedding traffic-safety priors and explicitly targeting the number of injuries, crash severity, and crash type via special tokens, SafeTraffic Copilot yields accurate and trustworthy predictions: accuracy increases with confidence, exceeding 70% when the confidence score is above 60%, with 95% precision for fatal-crash predictions at the same threshold. Our proposed textual feature-attribution module provides event- and state-level insight: it simultaneously uncovers what drives risk in a specific crash, enabling conditional intervention, and what information drives model quality at scale, guiding strategic data-collection policies for long-term accuracy. Specifically, the proposed sentence-level Shapley scores identify high-risk scenarios (such as "alcohol-impaired," "work-zone," and "inappropriate behaviors") for actionable "what-if" analysis. For example, combining alcohol impairment with a work-zone setting nearly doubles the likelihood of a severe crash compared with sober driving under identical conditions.
Further, summing Shapley contributions across the entire fine-tuning dataset ranks data fields by their marginal impact on prediction accuracy and confidence score, thereby guiding first responders toward the details that matter and streamlining continuous model updates. For example, unit information (driver and vehicle attributes) accounts for more than 40% of the contribution to crash severity prediction, highlighting these fields as priority entries for future crash documentation.

Conditional expected crash prediction

In our research, we define crash prediction as expected (conditional) crash prediction, a definition consistent with transportation agencies and established literature [4,24,25,26]. Formally, expected crash prediction is the task of estimating expected crash characteristics, including crash type, severity, number of injuries, and their likelihood of occurrence, under specified conditions. These estimations are based on expected traffic conditions and relevant contextual information, such as roadway attributes, environmental conditions, traffic volumes, and driver behaviors.

The prediction targets consist of three variables, each with an associated confidence score: Number of Injury, Severity, and Crash Type (see Fig. 2) [27,28]. Specifically, the Number of Injury task predicts the number of people injured in the given crash event. We define the number of injuries as the total number of non-fatal injuries, including possible, minor, and serious injuries, obtained by subtracting fatalities from the total number of injured people in a crash. The Number of Injury task is treated as a classification task with four categories: zero, one, two, and three or more, where crashes involving more than two injured people are grouped into a single category due to the limited number of such cases. The Severity task assesses the level of injury severity in a crash, classified into five levels from no apparent injury to fatal.
The Type task predicts the type of crash, such as a rear-end collision or a collision with an object, with 14 crash type categories in the Washington dataset and 16 in the Illinois dataset. Detailed information on the defined targets and confidence scores is available in "SafeTraffic Event dataset construction" and "Expected crash prediction confidence score calculation" in the "Methods" section.

Fig. 2: SafeTraffic Copilot crash outcomes prediction pipeline.
Multi-modal crash data is collected and organized into textual prompts through an AI-expert cooperative process. The Highway Safety Information System (HSIS) crash data, satellite images, and infrastructure data are used to extract general and infrastructure information, including the crash time, location, road level, and so on. The vehicle data and person data are converted into the event information and the unit information, including vehicle movements, driver characteristics (e.g., age, gender, alcohol use), vehicle attributes (e.g., manufacture year), and so on. The SafeTraffic Event dataset is created with three prediction targets: Number of Injury, Severity, and Type. The Number of Injury task predicts the number of people injured in the crash event, the Severity task estimates the severity level of the crash, such as no apparent injury or fatal, and the Type task classifies the type of crash, such as single vehicle with object or angle impacts right (the crash event outcome classifications are provided in Supplementary Table 4 and Supplementary Table 5). The SafeTraffic LLM is fine-tuned using the SafeTraffic Event dataset.
To reframe crash outcome prediction from a classification task into a language inference task, SafeTraffic LLM is fine-tuned by adding the prediction targets as special tokens in its vocabulary and adjusting parameters using Low-Rank Adaptation (LoRA) [37], a lightweight fine-tuning technique that injects trainable rank-decomposed matrices into each layer without updating the full model.

Original multi-modal crash data

Our cleaned dataset comprises crash data from Washington State in 2022, totaling 16,188 records, and from Illinois in 2022, totaling 42,715 records, after excluding cases with missing key attributes related to vehicle or crash object status. Primary sources include the Highway Safety Information System (HSIS) crash data [29], the state crash reports, and satellite images [30]. HSIS is a multistate database that contains crash, roadway inventory, and traffic volume data for a select group of states. The HSIS crash data contains four major components: crash data, infrastructure data, vehicle data, and person data. Crash data provides detailed descriptions of crashes, such as location, time, and injury severity. Infrastructure data includes information about road layouts and traffic characteristics, such as road level and speed limits. Vehicle data contains details such as the manufacturing year and reported defects of the involved vehicles, while person data captures demographic and other relevant details about drivers and passengers, such as age and gender. Satellite images complement the HSIS data by providing additional visual context, including information about lanes, intersections, and other roadway attributes. Further information on raw data formats and types is available in "Raw data" in the "Methods" section.

In addition to the Washington and Illinois datasets, we also collect and process 2250 crash cases from Maine, 2250 from Ohio, and 2802 from North Carolina to evaluate the model's zero-shot generalization.
Their raw data sources align with those used for Illinois and Washington. Using the Illinois State prompt template (whose data format is closest to theirs), we converted the records into prompt format, keeping shared features and setting unmatched ones to null. Differences in data representation (e.g., driver alcohol levels recorded as a numeric value in Illinois versus a textual descriptor in Maine) introduce unseen values and distributions, enabling evaluation of the model's zero-shot generalization.

Developing SafeTraffic LLM for predicting crashes

To leverage the multi-modal crash data described above for crash prediction, we developed the SafeTraffic Copilot crash outcomes prediction pipeline, which transforms crash outcome prediction into a text-based reasoning task. To achieve this, the raw crash data is organized into the textual SafeTraffic Event dataset, which is then used to fine-tune the SafeTraffic LLM. Figure 2 presents an overview of the proposed pipeline.

The SafeTraffic Event dataset is created through an AI-expert cooperative textualization process, organizing multi-modal raw data for effective crash prediction. Detailed information about the raw data feature engineering and the textualization process is available in "SafeTraffic Event dataset construction" in the "Methods" section. As shown in Fig. 2, the constructed prompts are divided into five parts: one system prompt and four content parts, with each content part containing approximately 100 words.
These parts include:
- System prompt: provides an introduction and task-specific instructions.
- General information: includes general information about the time and location of the prediction region and the roadway category.
- Infrastructure information: describes road infrastructure, encompassing static features like the number of lanes and speed limits, as well as dynamic elements such as work zones, lighting, and road surface conditions.
- Event information: contains detailed descriptions of crash events, such as the number of vehicles involved and their directions of movement.
- Unit information: provides vehicle and individual details relevant for crash prediction, such as airbag status and the driver's age.

To organize and merge this information for each crash event, we perform feature engineering and textualization, structure the textualized data as input, and process labels corresponding to the three targeted crash prediction tasks from real-world reports. Complete prompt examples are presented in Fig. 3. Ultimately, after filtering out data items with missing information, the SafeTraffic Event dataset merges the complementary information from multi-modal data sources and contains 66,205 crash records with approximately 14.5 million words. These records are split into training, validation, and test sets in a 7:1.5:1.5 ratio for the Washington and Illinois datasets, while the datasets from the three other states are used exclusively for cross-region, training-free generalization evaluation.

Fig. 3: Example prompt structure and content in the SafeTraffic Event dataset.
An example prompt of an expected crash prediction event from the dataset collected in Washington State.

With the SafeTraffic Event dataset, we can adapt LLMs for expected crash prediction.
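As a hedged illustration, assembling one such prompt and deriving its Number of Injury label might be sketched as follows. The field names, wording, and record keys below are hypothetical placeholders, not the actual HSIS schema or the exact SafeTraffic Event template:

```python
def injury_label(total_injured: int, fatalities: int) -> str:
    """Bucket non-fatal injuries (total injured minus fatalities) into the
    four Number of Injury categories; 3+ injuries share one category."""
    nonfatal = max(total_injured - fatalities, 0)
    return {0: "zero", 1: "one", 2: "two"}.get(nonfatal, "three or more")

def build_prompt(rec: dict) -> str:
    """Assemble the system prompt and the four content parts.
    Keys of `rec` are illustrative, not the actual HSIS field names."""
    parts = [
        "System prompt: You are a traffic-safety assistant; predict the crash outcome.",
        f"General information: the crash occurred at {rec['time']} on a "
        f"{rec['road_level']} road in {rec['location']}.",
        f"Infrastructure information: {rec['lanes']} lanes, speed limit "
        f"{rec['speed_limit']} mph, {rec['surface']} road surface.",
        f"Event information: {rec['n_vehicles']} vehicle(s) involved, "
        f"traveling {rec['movement']}.",
        f"Unit information: driver aged {rec['driver_age']}, airbag {rec['airbag']}.",
    ]
    return "\n".join(parts)
```

For instance, `injury_label(3, 1)` yields "two", since one of the three injured people in the report was a fatality and the label counts non-fatal injuries only.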
Although vanilla LLMs like Llama 3 [13] possess broad general knowledge and strong text-reasoning capabilities, they demonstrate limited effectiveness on crash prediction tasks without fine-tuning (see Supplementary Section 3.1). To address this, we developed SafeTraffic LLM, a specialized model fine-tuned on the processed SafeTraffic Event dataset. This fine-tuning process enhances the LLM's comprehension of crash events and enables accurate outcome prediction. Specifically, additional special tokens are introduced into the LLM vocabulary as prediction targets (Number of Injury, Severity, and Crash Type), and the model is fine-tuned to generate these tokens during prediction. Details of the fine-tuning are provided in "SafeTraffic LLM" in the "Methods" section.

Prediction performance and trustworthiness

We evaluate the performance of SafeTraffic LLM and compare it with other baselines (see "Adopted baselines" in the "Methods" section). The fine-tuning process is based on two vanilla LLMs of different sizes: Llama 3.1 8B and Llama 3.1 70B. Accuracy, precision, and F1-score are used as the evaluation metrics; detailed information is available in "Evaluation metrics" in the "Methods" section.

SafeTraffic LLM provides the most accurate and reliable prediction results across all crash types, severity levels, and injury counts, even in zero-shot scenarios. Table 1 compares the performance of SafeTraffic LLM and the adopted baselines. The results show that SafeTraffic LLM outperforms all baselines in each task setting, with an average F1-score improvement of 33.3%–45.8% across multiple tasks. SafeTraffic LLM performs well on both the Washington and Illinois datasets, demonstrating its stability across diverse geographical regions. Moreover, as shown in the confusion matrices in Fig. 4a, b, beyond aggregated metrics, SafeTraffic LLM demonstrates a more balanced prediction distribution and achieves higher accuracy across individual categories.
In contrast, as shown in Fig. 4c, d, existing machine learning models tend to predict the dominant categories (e.g., zero for the Number of Injury prediction task and no apparent injury for the Severity prediction task; the complete confusion matrices are shown in Supplementary Fig. 7).

Table 1 Performance comparison of the three expected crash prediction tasks

Fig. 4: SafeTraffic LLM provides predictions with trustworthiness.
The confusion matrices generated by the SafeTraffic LLM for both the a Washington and b Illinois datasets clearly demonstrate improved prediction results (we select the best results for each task based on the F1-score). In contrast, baseline models tend to predict the most frequent category across both the c Washington and d Illinois datasets (we show baseline models with the best F1-score; the performance of other baseline models can be found in Supplementary Fig. 7). Meanwhile, SafeTraffic LLM produces trustworthy predictions for both the e Washington and f Illinois datasets: higher confidence levels in the model's predictions correspond to an increased likelihood of accuracy. Furthermore, g the SafeTraffic LLM achieves higher precision for fatal-crash predictions. h Fatal-crash predictions also exhibit higher confidence in the Illinois dataset. The Washington dataset is not shown due to limited fatal cases. i For fatal crashes, the SafeTraffic LLM achieves near-perfect precision (97.61%) when the confidence score exceeds 0.6, indicating that the SafeTraffic LLM is highly accurate and trustworthy for fatal crashes. j The 3-year temporal comparison of monthly prediction accuracy across tasks (2019–2021). The SafeTraffic LLM used here was fine-tuned on the 2022 Washington dataset and evaluated on the 2019–2021 Washington datasets to assess its temporal generalization capability. The central line represents the median; the box spans the 25th to 75th percentiles; whiskers extend to 1.5 × IQR.
k The prediction performance at the county level, aggregated in a box plot. SafeTraffic LLM demonstrates stable performance at the county level for fine-tuning tasks in Illinois (IL) and Washington (WA), as well as zero-shot tasks in Maine (ME), North Carolina (NC), and Ohio (OH). States evaluated in zero-shot settings are highlighted with a gray background. Source data are provided as a Source data file.

SafeTraffic LLM provides trustworthy crash predictions, where a higher confidence score corresponds to higher accuracy. SafeTraffic LLM tailors LLMs for discriminative crash outcome prediction tasks, generating predictions accompanied by confidence scores that represent the probabilities associated with specific special tokens (see "Expected crash prediction confidence score calculation" in the "Methods" section for the calculation details of the confidence score). Figure 4e, f illustrates the trend of accuracy in relation to the confidence scores of SafeTraffic LLM's predictions for the Washington and Illinois datasets. The results indicate that our model achieves greater accuracy at higher confidence levels. For instance, for the Number of Injury prediction task on the Washington dataset, when the model's confidence score exceeds 0.40, the accuracy rises above 0.65, and with confidence scores over 0.60, the accuracy surpasses 0.80. This relationship is even more pronounced for fatal-crash predictions (see Fig. 4g–i). The strong positive correlation between confidence scores and accuracy demonstrates the quantifiable trustworthiness of the SafeTraffic Copilot. By providing reliable confidence scores alongside predictions, the framework empowers informed decision-making in real-world applications.

SafeTraffic LLM exhibits reliable spatial and temporal generalization capabilities. For spatial generalization, we evaluated SafeTraffic LLM by training on the Illinois dataset and testing on three unseen states: Maine, North Carolina, and Ohio.
Without any additional fine-tuning, SafeTraffic LLM achieved average F1-scores of 0.576 in North Carolina, 0.613 in Maine, and 0.593 in Ohio, comparable to its performance in Illinois (see Table 1). Beyond state-level evaluation, we also assessed generalization at the county level. As shown in Fig. 4k, most counties exhibit an accuracy variation within 10%–20%, demonstrating the model's stable performance across regions. For temporal generalization, we evaluate the model fine-tuned on 2022 Washington data using data from 2019 to 2021 in the same state. As shown in Fig. 4j, the model maintains stable performance across months for all three tasks (Number of Injury, Severity, and Type prediction), with over 75% of the months falling within a ±10% accuracy range. Aggregated yearly performance is presented in Supplementary Section 3.3.

SafeTraffic Attribution framework

Understanding how SafeTraffic LLM generates accurate predictions and how various components of the input prompt influence the outcomes is fundamental to enabling evidence-based decision-making. In our analysis, we focus exclusively on severe crashes (i.e., fatal and serious injury crashes) to identify the contributing factors behind these events. As discussed above, the SafeTraffic LLM's confidence score strongly correlates with its predictive accuracy for severe crashes. Consequently, the confidence score associated with severe crash predictions (hereafter referred to as the confidence score) can be used as an indicator of crash risk level: a higher confidence score corresponds to greater prediction accuracy for severe crashes, which in turn reflects a higher likelihood that the crash is severe (rather than minor or with no apparent injury) in the real world.
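A confidence score of this kind can be obtained by restricting the model's next-token logits to the special outcome tokens and renormalizing with a softmax. The sketch below is a generic illustration with hypothetical token names, not the exact formula given in the Methods:

```python
import math

def outcome_confidence(next_token_logits: dict, outcome_tokens: list):
    """Renormalize the logits of the special outcome tokens with a softmax
    and return (predicted outcome token, its probability = confidence)."""
    restricted = {t: next_token_logits[t] for t in outcome_tokens}
    m = max(restricted.values())                      # subtract max for stability
    exps = {t: math.exp(v - m) for t, v in restricted.items()}
    z = sum(exps.values())
    pred = max(restricted, key=restricted.get)        # highest-logit outcome token
    return pred, exps[pred] / z
```

Because the probability mass is shared among the candidate outcome tokens only, the returned confidence is comparable across events, which is what makes thresholding (e.g., acting only when confidence exceeds 0.6) meaningful.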
Notably, the SafeTraffic LLM's confidence scores tend to be lower than their corresponding accuracy values, indicating that the confidence score is a conservative estimate of risk.

Within the SafeTraffic Attribution framework, a sentence-based feature-contribution calculation method is proposed to identify how each sentence contributes to the LLM's outputs based on Shapley theory, which is recognized as a systematic and equitable method for attributing the contribution of each feature to a model's output [31,32], thereby revealing crash-related factors at the event level (see "SafeTraffic Attribution" in the "Methods" section for details). In essence, each feature's contribution represents its share of responsibility for the model's confidence in a particular prediction; the sum of all feature contributions equals the confidence score itself. Figure 5 illustrates sentence-level feature contributions for the severity of individual crash events, using one crash from Washington and one from Illinois as examples. In the Washington crash example (Fig. 5a), Driver Behavior (e.g., reckless driving or speeding) is the primary factor contributing to serious injury crashes, with a feature contribution of 0.258. Person Info (e.g., no seatbelt use) also shows a substantial impact, with a feature contribution of 0.149. By contrast, Dynamic Info (daylight and dry roads) lowers the probability of a crash with serious injuries, with a negative feature contribution of −0.009. In the Illinois example (Fig. 5b), an elevated BAC (Blood Alcohol Content, with a feature contribution of 0.284) and the presence of a Work Zone (feature contribution of 0.462) notably increase the likelihood of fatal-crash outcomes. Additional sentence-level feature-attribution analyses can be found in Supplementary Sections 4.1 and 4.2.
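Sentence-level Shapley contributions of this kind are commonly approximated by Monte Carlo permutation sampling. The generic sketch below is not the exact Methods algorithm: `confidence_fn` is a hypothetical stand-in for querying the model with only a subset of sentences present, and the estimate inherits Shapley's efficiency property, i.e., the contributions sum to the confidence gap between the full and the empty prompt:

```python
import random

def shapley_attributions(n_sentences, confidence_fn, n_perm=200, seed=0):
    """Monte Carlo estimate of each sentence's Shapley contribution to the
    model's confidence. confidence_fn(subset) returns the confidence score
    when only the sentence indices in `subset` are kept in the prompt."""
    rng = random.Random(seed)
    phi = [0.0] * n_sentences
    for _ in range(n_perm):
        order = list(range(n_sentences))
        rng.shuffle(order)                 # random insertion order
        included = frozenset()
        prev = confidence_fn(included)
        for i in order:                    # add sentences one by one
            included = included | {i}
            cur = confidence_fn(included)
            phi[i] += cur - prev           # marginal contribution of sentence i
            prev = cur
    return [p / n_perm for p in phi]
```

For an additive toy confidence function the estimate is exact; for a real LLM, `confidence_fn` would re-run inference with the excluded sentences masked or removed, which is the expensive step that sampling keeps tractable.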
The following sections utilize the SafeTraffic Attribution framework to examine feature importance from two perspectives: (1) at the inference stage, to identify key factors influencing crash predictions under various conditions and high-risk scenarios, and (2) at the fine-tuning stage, to identify which data are most critical for model learning.

Fig. 5: Single-case feature-attribution results for the Severity task.
The left part displays the full prompt from a Washington and b Illinois, with different colors representing various semantic text sequences. The right part illustrates the feature contribution assigned to each text sequence. Positive contributions signify a supportive role in the model's prediction, whereas negative contributions indicate a detracting influence. The absolute value of these contributions represents the importance of each sequence to the model's output.

Factor attribution at the inference stage for conditional risk analysis

Conditional analysis evaluates crash outcomes across various scenarios, such as driving with or without alcohol consumption, to quantify the risk factors associated with each scenario. Severe crashes (serious injury and fatal crashes) were prioritized in the conditional analysis due to their critical importance for traffic safety. These crashes, particularly fatal ones, were predicted accurately and reliably by SafeTraffic LLM (see Fig. 4g–i). Five key contributing factors were identified for this conditional analysis: Driver BAC (BAC = 0 mg/dL or not offered/BAC