How to Analyze Ball-by-Ball Cricket Data in R (cricketdata)


[This article was first published on Blog - R Programming Books, and kindly contributed to R-bloggers].

Cricket analytics is no longer limited to season averages and simple leaderboards. With modern ball-by-ball datasets, we can quantify tempo, isolate phase-specific skills, evaluate matchups, and model outcomes under uncertainty. R is a strong environment for this work because it combines data wrangling, visualization, statistical modeling, and reproducible reporting in one place.
What you’ll learn in this post:

- How cricket data is typically structured (match, innings, ball-by-ball)
- How to engineer metrics for batting and bowling that respect cricket context
- How to perform phase analysis (Powerplay / Middle / Death) and matchup analysis
- How to build a baseline win probability model in R
- How to extend the workflow for IPL insights and role-based evaluation
- How to keep your analysis reproducible using Quarto/R Markdown

On this page

1) Workflow
2) Data structures and ingestion
3) Cleaning and cricket-specific preprocessing
4) Phase labeling and innings context
5) Batting analytics: metrics that explain style
6) Bowling analytics: economy, wickets, and pressure
7) Matchups: bowler vs batter
8) Visualizations that make sense to cricket fans
9) Win probability modeling (baseline + upgrades)
10) IPL insights: roles, venues, and player value
11) Reproducible reporting in R
12) FAQ
13) Next steps

1) A Practical Workflow for Cricket Analytics in R

A professional cricket analytics workflow is easiest to maintain when you separate the work into layers: (1) data, (2) context, (3) features, (4) metrics, (5) models, and (6) communication. This structure reduces confusion and keeps analyses reproducible across tournaments and seasons.
| Layer | What you do | Typical outputs |
|---|---|---|
| Data | Load ball-by-ball + match metadata; standardize columns | Cleaned tables with stable IDs |
| Context | Add format, venue, innings state, chase information | Phase labels, required run rate, wickets in hand |
| Features | Create derived variables at ball and player level | Dots, boundaries, pressure flags, matchup summaries |
| Metrics | Aggregate in ways that reflect roles and phases | Role-aware leaderboards, split tables |
| Models | Predict outcomes or estimate player value | Win probability, outcome prediction, uncertainty |
| Communication | Publish results as charts, tables, dashboards, reports | Quarto/Markdown reports and consistent outputs |

2) Data Structures and Ingestion

Cricket data typically appears at three levels:

- Match-level: teams, venue, toss, winner, margin, date
- Innings-level: runs, wickets, overs, target, result context
- Ball-by-ball: batter, bowler, runs, extras, wickets, over/ball index

Ball-by-ball data is the most valuable layer because it captures the decisions and the state transitions that drive outcomes. If you want phase metrics, win probability, or matchup analysis, ball-by-ball is the foundation.

2.1 Install and load packages

    install.packages(c(
      "cricketdata", "dplyr", "tidyr", "stringr", "lubridate",
      "purrr", "ggplot2", "slider", "broom"
    ))

    library(cricketdata)
    library(dplyr)
    library(tidyr)
    library(stringr)
    library(lubridate)
    library(purrr)
    library(ggplot2)
    library(slider)
    library(broom)

2.2 Keep your data model explicit

It helps to define (and document) the expected schema for your ball-by-ball table. At minimum, you want: match_id, innings, over, ball_in_over, batter, bowler, batter_runs, extras_runs, total_runs, and a wicket indicator such as is_wicket.

Practical rule: treat your ball-by-ball table as the single source of truth. Build everything else (tables, charts, model datasets) from it, not from hand-edited exports.

3) Cleaning and Cricket-Specific Preprocessing

Cricket data cleaning is rarely about generic missingness.
Most issues are cricket-specific: inconsistent player names, extras affecting “balls faced” and “balls bowled”, run-out attribution, and multiple encodings of dismissal types.

3.1 Standardize names and IDs

    clean_name <- function(x) {
      x %>%
        str_replace_all("[’`]", "'") %>%  # normalize curly quotes and backticks
        str_squish()                       # trim ends and collapse inner whitespace
    }

    # Example usage:
    # balls <- balls %>%
    #   mutate(
    #     batter = clean_name(batter),
    #     bowler = clean_name(bowler),
    #     non_striker = clean_name(non_striker),
    #     team_batting = clean_name(team_batting),
    #     team_bowling = clean_name(team_bowling)
    #   )

3.2 Legal balls vs extras

A common mistake is using every row as a “ball” for strike rate or bowling strike rate. Wides and no-balls are not legal deliveries and do not count toward the over, although a no-ball still counts as a ball faced by the batter; how each is encoded differs by dataset. A robust approach is to create a legal_ball flag.

    # Template: adjust to your dataset columns
    # balls <- balls %>%
    #   mutate(
    #     total_runs = batter_runs + extras_runs,
    #     # Wides only: this matches balls faced by the batter. For balls bowled,
    #     # also exclude whatever code your dataset uses for no-balls.
    #     legal_ball = if_else(extras_type %in% c("wides"), 0L, 1L)
    #   )

3.3 Wickets: be explicit about what counts

Many analyses treat “bowler wickets” differently from total wickets. For example, run outs are not credited to the bowler. You can create separate fields:

- is_wicket: any wicket fell on the ball
- is_bowler_wicket: wicket credited to bowler (exclude run outs)

    # balls <- balls %>%
    #   mutate(
    #     is_wicket = as.integer(!is.na(dismissal_kind)),
    #     is_bowler_wicket = as.integer(is_wicket == 1 & dismissal_kind != "run out")
    #   )

4) Phase Labeling and Innings Context

“Phase-aware” analysis is one of the biggest upgrades you can make in limited-overs cricket. A batter who dominates the powerplay may not be a strong death-overs hitter; likewise, a death-overs specialist bowler should not be judged by powerplay economy alone.

4.1 Phase labeling (T20 example)

    # Overs are 0-indexed here (0-19)
    label_phase_t20 <- function(over) {
      case_when(
        over >= 0 & over < 6 ~ "Powerplay",
        over >= 6 & over < 16 ~ "Middle",
        over >= 16 & over < 20 ~ "Death",
        TRUE ~ NA_character_
      )
    }

4.2 Chase context (innings 2 example)

To model win probability during a chase, you need game state features. A baseline set includes: runs needed, balls left, and wickets in hand.
From these, you can compute required run rate.

    # NOTE: adjust "max_balls" to format (e.g., 120 for T20, 300 for ODI)
    # Requires a per-match `target` column (first-innings total + 1).
    # max_balls <- 120
    # chase <- balls %>%
    #   group_by(match_id, innings) %>%
    #   arrange(over, ball_in_over, .by_group = TRUE) %>%
    #   mutate(
    #     cum_runs = cumsum(total_runs),
    #     cum_wkts = cumsum(is_wicket),
    #     legal_balls = cumsum(legal_ball),
    #     balls_left = pmax(max_balls - legal_balls, 0),
    #     wkts_in_hand = 10 - cum_wkts,
    #     runs_needed = pmax(target - cum_runs, 0),
    #     req_rr = if_else(balls_left > 0, 6 * runs_needed / balls_left, NA_real_)
    #   ) %>%
    #   ungroup()

5) Batting Analytics: Metrics That Explain Style

Batting analysis becomes more informative when you separate “output” from “method.” Totals (runs) are output. Style shows up in dots, boundaries, rotation, and risk. Below are metrics that are both interpretable and useful.

5.1 Core batting metrics (phase-aware)

- Strike rate (SR): runs per 100 legal balls faced
- Dot-ball %: dots per legal balls faced
- Boundary %: (4s + 6s) per legal balls faced
- Singles/rotation rate: % balls with 1 run off the bat
- Dismissal rate: outs per 100 legal balls faced

    # batting_phase <- balls %>%
    #   group_by(batter, phase) %>%
    #   summarise(
    #     balls_faced = sum(legal_ball),
    #     runs = sum(batter_runs),
    #     dots = sum(legal_ball == 1 & batter_runs == 0),
    #     ones = sum(legal_ball == 1 & batter_runs == 1),
    #     fours = sum(batter_runs == 4),
    #     sixes = sum(batter_runs == 6),
    #     outs = sum(is_wicket == 1 & player_dismissed == batter),
    #     sr = 100 * runs / pmax(balls_faced, 1),
    #     dot_pct = 100 * dots / pmax(balls_faced, 1),
    #     boundary_pct = 100 * (fours + sixes) / pmax(balls_faced, 1),
    #     rotation_pct = 100 * ones / pmax(balls_faced, 1),
    #     out_rate = 100 * outs / pmax(balls_faced, 1),
    #     .groups = "drop"
    #   )

5.2 Intent vs risk (simple but powerful)

A practical comparison for T20 batters is a two-dimensional view: strike rate versus dismissal rate. You can do this by phase, and optionally add minimum sample thresholds (e.g., at least 100 legal balls in that phase).
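As a small runnable illustration of the strike-rate-versus-dismissal-rate comparison described above, the sketch below uses a made-up three-batter data set. The data, the `batter_out` column, and the `min_balls` threshold of 5 are assumptions chosen so the numbers are easy to check by hand; a real analysis would use something closer to the 100-ball threshold.

```r
library(dplyr)

# Hypothetical toy data: one row per delivery, all in the powerplay.
# Batter A faces one wide (legal_ball = 0), batter C faces only two balls.
toy_balls <- tibble(
  batter      = c(rep("A", 6), rep("B", 6), rep("C", 2)),
  phase       = "Powerplay",
  legal_ball  = c(1L, 1L, 1L, 0L, 1L, 1L,  1L, 1L, 1L, 1L, 1L, 1L,  1L, 1L),
  batter_runs = c(4L, 0L, 1L, 0L, 6L, 1L,  0L, 1L, 0L, 2L, 0L, 1L,  6L, 6L),
  batter_out  = c(0L, 0L, 0L, 0L, 0L, 1L,  0L, 0L, 1L, 0L, 0L, 1L,  0L, 0L)
)

# Assumed minimum sample size for this toy example only.
min_balls <- 5L

intent_risk <- toy_balls %>%
  group_by(batter, phase) %>%
  summarise(
    balls_faced = sum(legal_ball),
    runs        = sum(batter_runs),
    outs        = sum(batter_out),
    .groups     = "drop"
  ) %>%
  mutate(
    sr       = 100 * runs / balls_faced,  # intent: runs per 100 legal balls
    out_rate = 100 * outs / balls_faced   # risk: dismissals per 100 legal balls
  ) %>%
  filter(balls_faced >= min_balls)        # drop small samples

print(intent_risk)
```

Batter C scores 12 off 2 balls without being dismissed, which would look elite on a chart; the threshold removes that row, leaving A (SR 240, 20 dismissals per 100 balls) and B (SR about 66.7, about 33.3 dismissals per 100 balls).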
    # batting_filtered <- batting_phase %>% filter(balls_faced >= 100)
    # ggplot(batting_filtered, aes(x = out_rate, y = sr)) +
    #   geom_point() +
    #   facet_wrap(~phase) +
    #   labs(
    #     x = "Dismissals per 100 balls",
    #     y = "Strike rate",
    #     title = "Intent vs Risk by Phase"
    #   )

Interpretation tip: a player with high SR and low out rate is rare and typically elite. Players cluster by role: powerplay aggressors, middle-over stabilizers, and death-over finishers.

5.3 A “pressure” proxy you can compute quickly

Pressure is hard to define perfectly, but you can build useful proxies using innings state. One simple approach in a chase: treat pressure as higher when req_rr exceeds a threshold.

    # chase <- chase %>%
    #   mutate(pressure = as.integer(req_rr >= 10))  # example threshold

6) Bowling Analytics: Economy, Wickets, and Pressure

Bowling value is multi-dimensional. Economy tells you how well runs were contained, but wickets create discontinuities in the innings. Modern analysis usually studies both together, often by phase.

6.1 Core bowling metrics (phase-aware)

- Economy: runs conceded per over (use total runs)
- Bowling strike rate: legal balls per wicket (exclude run outs)
- Dot-ball %: dot deliveries per legal balls
- Boundary conceded %: % balls conceding 4 or 6

    # bowling_phase <- balls %>%
    #   group_by(bowler, phase) %>%
    #   summarise(
    #     balls = sum(legal_ball),
    #     overs = balls / 6,
    #     runs_conceded = sum(total_runs),
    #     wickets = sum(is_bowler_wicket),
    #     dots = sum(legal_ball == 1 & total_runs == 0),
    #     boundaries = sum(batter_runs %in% c(4, 6)),
    #     econ = runs_conceded / pmax(overs, 0.1),
    #     # undefined (NA) when the bowler has no wickets, rather than balls / 1
    #     bowl_sr = if_else(wickets > 0, balls / wickets, NA_real_),
    #     dot_pct = 100 * dots / pmax(balls, 1),
    #     boundary_pct = 100 * boundaries / pmax(balls, 1),
    #     .groups = "drop"
    #   )

6.2 Death bowling: separating skill from exposure

Death overs are higher variance by nature: batters swing harder and boundaries are more frequent. To evaluate death bowlers fairly, compare them to phase baselines (league/season averages for the death phase).
That helps you see whether a bowler is genuinely strong at the death or simply facing harsher conditions.

    # phase_baseline <- balls %>%
    #   group_by(phase) %>%
    #   summarise(
    #     baseline_econ = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
    #     .groups = "drop"
    #   )
    # bowling_adj <- bowling_phase %>%
    #   left_join(phase_baseline, by = "phase") %>%
    #   mutate(econ_above_baseline = econ - baseline_econ)

7) Matchups: Bowler vs Batter (and How Not to Overfit)

Matchups are popular because they feel actionable: “Does bowler A match up well against batter B?” The risk is that many matchups are based on small samples. The solution is to: (1) enforce minimum balls, (2) report uncertainty, and (3) consider shrinkage if you operationalize results.

7.1 Simple matchup table

    # matchups <- balls %>%
    #   group_by(bowler, batter) %>%
    #   summarise(
    #     balls = sum(legal_ball),
    #     runs = sum(batter_runs),
    #     outs = sum(is_wicket == 1 & player_dismissed == batter),
    #     sr = 100 * runs / pmax(balls, 1),
    #     out_rate = 100 * outs / pmax(balls, 1),
    #     .groups = "drop"
    #   ) %>%
    #   filter(balls >= 30) %>%
    #   arrange(desc(out_rate))

7.2 Add confidence intervals (quick approximation)

As a lightweight option, treat out events as binomial and compute approximate intervals for out rate. This is not perfect, but it is better than treating a 2-out sample the same as a 20-out sample.

    # Normal-approximation (Wald) interval, clipped to [0, 1]
    # matchups_ci <- matchups %>%
    #   mutate(
    #     p = outs / pmax(balls, 1),
    #     se = sqrt(p * (1 - p) / pmax(balls, 1)),
    #     lo = pmax(p - 1.96 * se, 0),
    #     hi = pmin(p + 1.96 * se, 1),
    #     out_rate_lo = 100 * lo,
    #     out_rate_hi = 100 * hi
    #   )

8) Visualizations That Make Sense to Cricket Fans

The best cricket charts are those that map directly to the mental model of the game.
Here are a few workhorses:

- Run rate by over to reveal acceleration and collapse patterns
- Worm charts (cumulative runs) to compare innings trajectories
- Wicket timeline to explain how innings shape changes
- Phase leaderboards to compare roles (powerplay vs death)

8.1 Run rate by over

    # over_summary <- balls %>%
    #   group_by(match_id, innings, over) %>%
    #   summarise(
    #     runs = sum(total_runs),
    #     legal_balls = sum(legal_ball),
    #     .groups = "drop"
    #   ) %>%
    #   mutate(rr = 6 * runs / pmax(legal_balls, 1))
    # # group = match_id keeps each match as its own line within a facet
    # ggplot(over_summary, aes(x = over, y = rr, group = match_id)) +
    #   geom_line() +
    #   facet_wrap(~innings) +
    #   labs(x = "Over", y = "Run rate", title = "Run Rate by Over")

8.2 Worm chart (cumulative runs)

    # worm <- balls %>%
    #   group_by(match_id, innings) %>%
    #   arrange(over, ball_in_over, .by_group = TRUE) %>%
    #   mutate(cum_runs = cumsum(total_runs),
    #          legal_balls = cumsum(legal_ball)) %>%
    #   ungroup()
    # ggplot(worm, aes(x = legal_balls, y = cum_runs, group = innings)) +
    #   geom_line() +
    #   facet_wrap(~match_id) +
    #   labs(x = "Legal balls", y = "Cumulative runs", title = "Worm Chart")

9) Win Probability Modeling in R (Baseline + Upgrades)

Win probability models answer a common fan and analyst question: “Given the current state, how likely is the chasing team to win?” A simple and surprisingly effective baseline uses a logistic regression on chase state variables.

9.1 Baseline logistic regression

    # wp_data <- chase %>%
    #   filter(balls_left > 0) %>%
    #   mutate(
    #     win = as.integer(chasing_team_won)  # adapt to your encoding
    #   ) %>%
    #   select(win, runs_needed, balls_left, wkts_in_hand, req_rr) %>%
    #   filter(is.finite(req_rr))
    # wp_model [...]

[...]

    # [...] = 120) %>%
    #   arrange(econ)

10.2 Venue adjustment (separating skill from conditions)

Some venues inflate scoring; others suppress it. A simple adjustment is to compute a venue baseline run rate and then measure player performance relative to that baseline.
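To make the adjustment concrete, here is a minimal sketch on invented data: two venues, two batters, and every row a legal ball. The venue names, batter labels, and run values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: 6 legal balls at each of two venues.
toy_balls <- tibble(
  venue      = c(rep("High", 6), rep("Low", 6)),
  batter     = c("A", "A", "A", "B", "B", "B",  "A", "A", "A", "B", "B", "B"),
  legal_ball = 1L,
  total_runs = c(4L, 6L, 2L, 1L, 1L, 4L,  1L, 0L, 2L, 0L, 1L, 2L)
)

# Venue baseline: runs per over across all batters at that venue.
venue_rr <- toy_balls %>%
  group_by(venue) %>%
  summarise(
    venue_run_rate = 6 * sum(total_runs) / sum(legal_ball),
    .groups = "drop"
  )

# Batter run rate at each venue, expressed relative to the venue baseline.
batter_venue <- toy_balls %>%
  group_by(batter, venue) %>%
  summarise(
    batter_run_rate = 6 * sum(total_runs) / sum(legal_ball),
    balls = sum(legal_ball),
    .groups = "drop"
  ) %>%
  left_join(venue_rr, by = "venue") %>%
  mutate(adj_run_rate = batter_run_rate - venue_run_rate)

print(batter_venue)
```

On this toy data the high-scoring venue has a baseline of 18 runs per over and the low-scoring one 6. Batter A's raw rate of 24 at the first venue becomes +6 after adjustment, while both batters sit exactly at baseline (0) at the second.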
    # venue_rr <- balls %>%
    #   group_by(venue) %>%
    #   summarise(
    #     venue_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
    #     .groups = "drop"
    #   )
    # batter_venue <- balls %>%
    #   group_by(batter, venue) %>%
    #   summarise(
    #     batter_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
    #     balls = sum(legal_ball),
    #     .groups = "drop"
    #   ) %>%
    #   left_join(venue_rr, by = "venue") %>%
    #   mutate(adj_run_rate = batter_run_rate - venue_run_rate)

10.3 Player “value” as expected contribution

If you want to move toward value modeling, a practical approach is to estimate expected runs per ball (batting) and expected runs conceded per ball (bowling) in context (phase, venue, matchup). You can then compare players under similar conditions.

11) Reproducible Reporting in R (Quarto / R Markdown)

Reproducibility is a competitive advantage in analytics. It ensures your results can be refreshed with new matches, audited, and reused. Quarto (or R Markdown) lets you publish analysis as a single document that includes narrative, code, and output.

11.1 A clean project structure

    cricket-analytics/
      data/
        raw/
        cleaned/
      R/
        cleaning.R
        phases.R
        metrics.R
        plots.R
      reports/
        weekly-report.qmd
        match-preview.qmd
      models/
      output/
      README.md

11.2 A minimal Quarto report pattern

    # weekly-report.qmd
    # ---
    # title: "Weekly Cricket Analytics Report"
    # format: html
    # ---
    # ```{r}
    # source("R/metrics.R")
    # balls