Speeding up tidySummarizedExperiment through query optimisation and the plyxp backend

This article was first published on tidyomicsBlog, and kindly contributed to R-bloggers.

Contributors: Michael Love, Justin Landis, Pierre-Paul Axisa

The generality of tidySummarizedExperiment makes it easy to interface with several tidyverse packages (e.g. dplyr, tidyr, ggplot2, purrr, plotly). This is possible thanks to its approach of converting SummarizedExperiment objects to tibbles, performing operations, and converting back to the original format.

This conversion process introduces substantial overhead when working with large-scale datasets. Each operation requires multiple data transformations: the conversion to tibble format creates memory copies of the entire dataset, and is followed by the reverse conversion back to SummarizedExperiment. For datasets containing hundreds of samples and tens of thousands of genes, these repeated conversions can consume a large amount of memory and add significant computational overhead to even simple operations such as filtering or grouping.

With the new tidySummarizedExperiment release (v1.19.7), we have introduced optimisations that address these performance limitations. They are powered by:

- A check of the query domain (assay, colData, rowData), followed by a specialised operation for that domain.
- Use of plyxp for complex domain-specific queries.

plyxp is a tidyomics package developed by Justin Landis, and first released as part of Bioconductor 3.20 in October 2024. It uses data-masking functionality from the rlang package to perform efficient operations on SummarizedExperiment objects.

Motivation and design principles

This benchmark supports ongoing work to improve the performance of tidySummarizedExperiment.
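The query-domain check listed among the optimisations above can be illustrated with a minimal base-R sketch. This is not the analyser used by tidySummarizedExperiment: the helper classify_scope and the toy column names are hypothetical, and the sketch simply assumes that the scope of an expression can be inferred from the variable names it references.

```r
# Hypothetical helper (not part of tidySummarizedExperiment or plyxp):
# classify an unevaluated expression by the slot(s) whose names it references.
classify_scope <- function(expr, col_names, row_names, assay_names) {
  vars <- all.vars(expr)  # variable symbols used in the expression
  hits <- c(
    colData = any(vars %in% col_names),
    rowData = any(vars %in% row_names),
    assay   = any(vars %in% assay_names)
  )
  touched <- names(hits)[hits]
  if (length(touched) == 0) {
    "unknown"   # no known column referenced
  } else if (length(touched) == 1) {
    touched     # a single slot: candidate for a specialised path
  } else {
    "mixed"     # several slots: candidate for a more general path
  }
}

# Toy names standing in for the column names of colData and rowData and for
# the assay names of a SummarizedExperiment (assumed values for illustration)
col_names   <- c("condition", "batch")
row_names   <- c("gene_symbol", "gene_length")
assay_names <- c("counts")

classify_scope(quote(toupper(condition)), col_names, row_names, assay_names)
classify_scope(quote(counts / gene_length), col_names, row_names, assay_names)
```

Under these assumptions, the first expression references only a sample-level column, so it could be applied to colData directly, while the second touches both an assay and rowData and would need a mixed-scope path.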
In this benchmark, we show up to 30x improvement in operations such as mutate().

The current optimisation is grounded in three principles:

1. Decompose operation series: break mutate(a=..., b=..., c=...) into single operations for simpler handling and clearer routing. Reference implementation in R/mutate.R (decomposition step) at L146.
2. Analyse scope: infer whether each expression targets colData, rowData, assays, or a mix (noting that the current analyser is likely over-engineered and could be simplified). See L149.
3. Route mixed operations via plyxp: when an expression touches multiple slots, prefer the plyxp path for correctness and performance. See L155.

These design choices aim to preserve dimnames, avoid unnecessary tibble round-trips, and provide predictable performance across simple and mixed-slot scenarios.

Example of code optimisation

This was the mutate() method before optimisation. The previous implementation relied on as_tibble() |> dplyr::mutate() |> update_SE_from_tibble(.data).

The function update_SE_from_tibble interprets the input tibble and converts it back to a SummarizedExperiment. Although this step provides great generality and flexibility, it is particularly expensive because it must infer whether columns are sample-wise or feature-wise.

Pre-optimisation source:

    mutate.SummarizedExperiment [...] % unlist()

      if (is_sample_feature_deprecated_used(.data, .cols)) {
        # Record deprecated usage into metadata for backward compatibility
        .data [...] gt(0)

      if (tst) {
        columns [...] paste(collapse=", ")
        stop(
          "tidySummarizedExperiment says:",
          " you are trying to rename a column that is view only",
          columns,
          "(it is not present in the colData).",
          " If you want to mutate a view-only column,",
          " make a copy and mutate that one."
        )
      }

      # If Ranges column not in query, prefer faster tibble conversion
      # Skip expanding GRanges columns when not referenced
      skip_GRanges [...] not()

      # Round-trip: SE -> tibble -> dplyr::mutate -> SE
      .data |>
        as_tibble(skip_GRanges=skip_GRanges) |>
        dplyr::mutate(...) |>
        update_SE_from_tibble(.data)
    }

The new implementation captures all easy cases, such as sample-only and feature-only metadata mutate(). If mutate() is a mixed operation that can be factored into sample-wise and feature-wise operations, it is handled by plyxp. Otherwise, the general solution is used.

Key components to compare:

- The pre-optimisation code always uses a tibble round-trip (as_tibble() |> dplyr::mutate() |> update_SE_from_tibble()).
- The optimised code first analyses scope (colData, rowData, assay, or mixed) and dispatches to specialised paths.
- The fallback (mutate_via_tibble) still exists for complex cases, preserving generality.

Post-optimisation source:

    mutate.SummarizedExperiment