Flowcharts that belong in the analysis pipeline

[This article was first published on R – G-Forge, and kindly contributed to R-bloggers].

Flowcharts should be beautiful. Just like this CC photo from Wasif Malik.

Thanks to Alan Haynes and his excellent suggestions, I have spent some time improving the flowchart component of the Gmisc package. The result is not meant to be another decorative diagram tool. It is meant for the kind of figures researchers keep redrawing by hand: CONSORT diagrams, cohort derivation charts, screening flows, data-cleaning audit trails, and the small but important maps that explain how a study population came to be.

I like tools such as Excalidraw for thinking. They are fast, expressive, and excellent for conversations. But when a figure enters a manuscript, the needs change. Counts must be updated. Exclusions must match the analysis script. Treatment arms should align. Follow-up losses should be traceable. The figure should survive reviewer round three without becoming a manual editing project.

That is the space where flowchart() in Gmisc is useful: the diagram becomes part of the research workflow.

The figure above is the kind of chart I want Gmisc to make feel natural. It is still a grid graphic in R, but it has the visual grammar of a manuscript figure: grouped arms, side exclusions, count badges, phase labels, and arrows that do not need nudging after every text change.

Every figure in this post is generated by code, and the code is included below each image. They all share the same two-line preamble:

    library(Gmisc)
    library(grid)

To save any of them to a file, wrap the call in a graphics device, e.g.

    png("01-consort-color.png", width = 9, height = 7, units = "in", res = 180, bg = "white")
    # ... the flowchart code ...
    dev.off()

The CONSORT figure above is produced by:

    options(boxGrobTxtPadding = unit(3, "mm"))
    box_fill [...] move(subelement = "excluded", x = 1 - exclusion_margin, just = "right") |>
      align(
        axis = "y", subelement = "excluded",
        references = list("assessed", "randomised")
      ) |>
      connect("assessed", "excluded", type = "L", lty_gp = side_gp, arrow_size = 3, smooth = TRUE) |>
      connect("randomised", "arms", type = "N", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
      connect("assessed", "randomised", type = "v", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
      connect("arms", "lost", type = "L", lty_gp = side_gp, arrow_size = 3, smooth = TRUE) |>
      connect("arms", "analysis", type = "v", lty_gp = con_gp, arrow_size = 3) |>
      print()

A figure that can change with the analysis

The biggest advantage of drawing a flowchart in code is not that code is elegant. It is that research figures are rarely finished when we think they are.

The inclusion count changes after a database refresh. A reviewer asks for a sensitivity analysis. Someone notices that two exclusion categories should be split. The statistician reruns the cohort definition. If the diagram is hand-drawn, every one of those changes creates a small risk of mismatch between the paper and the actual analysis.

If the chart is generated, it can sit beside the code that produced the numbers.

    flowchart(...) |>
      spread(axis = "y") |>
      spread(subelement = "arms", axis = "x") |>
      connect("randomised", "arms", type = "N")

That is the mental model: define boxes, arrange boxes, connect boxes. The final result can still be polished, but it remains reproducible.

Cohort derivation from data people already have

Most clinical researchers do not start with a perfect trial flow. They start with a registry extract, an EHR table, a REDCap project, an Excel sheet from a collaborator, or a combination of all of them.

That workflow deserves a clear figure too.

This kind of diagram is useful because it does not only show who was included.
It shows how the study base was assembled: what sources were linked, where exclusions entered, and which analytic populations came out at the end.

I find this especially helpful for observational studies. A table can report baseline characteristics, but a flowchart explains the construction of the cohort. It gives the reader a quick answer to: “What happened between the raw data and the model?”

    source_gp [...] connect("linked", "cohort", type = "v", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
      connect("linked", "exclusions",
        type = "side", lty_gp = excl_gp, arrow_size = 3,
        side = "right", end_side = "left",
        side_route = "outside", side_offset = exclusion_line_offset,
        label = "Excluded\nn = 6,397",
        label_gp = gpar(col = "#AD1457", cex = 0.8)
      ) |>
      connect("cohort", "outputs", type = "N", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
      print()

The audit trail is part of the story

Another common workflow is less glamorous but just as important: data validation.

Many research projects have a small data-engineering pipeline even when nobody calls it that. Data arrive through forms, imports, manual entry, and collaborator spreadsheets. Then someone checks missing fields, duplicates, impossible dates, inconsistent IDs, and outliers.

That process is often hidden in prose. A compact flowchart can make it visible without turning the methods section into a systems manual. It is also a useful project-management figure: the same chart can be shown to clinicians, data managers, statisticians, and co-authors.

Note how the box shapes carry meaning here: ellipses, databases, documents, tapes, and diamonds all come from dedicated box*Grob() helpers:

    input_gp [...] connect("validation", "clean", type = "vertical_axis", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
      print()

Follow-up is rarely just down the page

Longitudinal studies often need to distinguish between people who are lost, censored, withdrawn, dead, or still contributing information up to a time point.
A simple downward flow can imply that everyone leaving a box disappears from the analysis, which is not always true.

Dotted return arrows are useful for this. They can show that a participant left direct follow-up but still contributes information to the final analysis up to censoring. That is a visual detail, but it communicates an analytical idea.

This is where small flowchart improvements matter. Not because the reader cares about the drawing API, but because the figure can express the study design more faithfully.

    options(boxGrobTxtPadding = unit(1, "mm"))
    main_gp [...] connect("groups1", "groups2", type = "vertical", lty_gp = con_gp, arrow_size = 3) |>
      connect("groups2", "analysis", type = "vertical", lty_gp = con_gp, arrow_size = 3) |>
      connect(list("ex1$1", "ex2$1"), "analysis$1",
        type = "side", lty_gp = dotted_gp, arrow_size = 3,
        side = "left", end_side = "left",
        side_route = "outside", side_offset = fan_in_offset
      ) |>
      connect(list("ex1$2", "ex2$2"), "analysis$2",
        type = "side", lty_gp = dotted_gp, arrow_size = 3,
        side = "right", end_side = "right",
        side_route = "outside", side_offset = fan_in_offset
      ) |>
      print()

Why this belongs in Gmisc

Gmisc has always collected the small tools I found myself needing around medical statistics: descriptive tables, transition plots, and grid-based figures. Flowcharts fit that same pattern.
They are not a statistical model, but they are part of how research is communicated.

The new flowchart work in 3.4.0 is therefore aimed at the practical problems:

- making CONSORT-like diagrams less painful to draw
- keeping grouped stages aligned and readable
- making arrows behave predictably
- supporting side paths, return paths, and repeated box patterns
- producing figures that can be regenerated when the study changes

The vignette contains the full API and examples:

    vignette("Grid-based_flowcharts", package = "Gmisc")

The blog figures in this post are intentionally close to things researchers already have in their workflow: trial enrollment, registry construction, data validation, and follow-up accounting. My hope is that they make the flowchart tools feel less like a drawing utility and more like a small extension of the analysis itself.
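
One way to keep a diagram's counts tied to the analysis is to derive every box label from the same data frame the analysis uses. The sketch below uses only base R and does not call Gmisc. The `trial` data frame, its columns, and the `fmt_n()` helper are invented for illustration and are not part of the package.

```r
# Hypothetical data: one row per person assessed for eligibility.
# Column names and values are assumptions made for this sketch.
trial <- data.frame(
  id       = 1:12,
  eligible = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  arm      = factor(c("A", "B", NA, "A", "B", "A", NA, "B", "A", "B", "A", "B"),
                    levels = c("A", "B")),
  followed = c(TRUE, TRUE, NA, FALSE, TRUE, TRUE, NA, TRUE, TRUE, FALSE, TRUE, TRUE)
)

# The analysis population: everyone who passed screening.
randomised <- trial[trial$eligible, ]

# Every number that will appear in the figure, computed in one place.
counts <- list(
  assessed   = nrow(trial),
  excluded   = sum(!trial$eligible),
  randomised = nrow(randomised),
  arms       = table(randomised$arm),
  lost       = table(randomised$arm[!randomised$followed])
)

# Fail early if the boxes would no longer add up.
stopifnot(
  counts$assessed == counts$excluded + counts$randomised,
  sum(counts$arms) == counts$randomised
)

# Hypothetical helper: format a count with a thousands separator.
fmt_n <- function(n) format(n, big.mark = ",", trim = TRUE)

# Box text built from the counts, never typed by hand.
labels <- list(
  assessed   = paste0("Assessed for eligibility\nn = ", fmt_n(counts$assessed)),
  excluded   = paste0("Excluded\nn = ", fmt_n(counts$excluded)),
  randomised = paste0("Randomised\nn = ", fmt_n(counts$randomised)),
  arms       = paste0("Arm ", names(counts$arms), "\nn = ",
                      fmt_n(as.integer(counts$arms)))
)
```

The strings in `labels` could then serve as the box text when the chart is built, so a database refresh updates the figure and the analysis together. The `stopifnot()` call is a cheap guard against the mismatch between diagram and analysis that hand-edited figures invite.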