Major Update for revss Package for R – v3.1.0


This article was first published on R Archives | Strange Attractors, and kindly contributed to R-bloggers.

This is a big one! The revss package for R, which provides robust estimation for small samples, has received a major, breaking update. The entire calculation engine was rewritten, new functionality was added, and massive Monte Carlo analyses were run to calculate bias reduction factors. It should be at version 3.1.0, even though CRAN is showing 3.0.0. What happened and why? Where to even start?

The Rabbit Holes

Bias Factors?

Fast forward half a decade: in 2025, someone raised a GitHub issue with the package, correctly pointing out that R&V were using a bias-adjusted form of MAD in their work, with factors calculated by Croux & Rousseeuw (CR) in 1992. I figured it would be an easy fix: just add the factors, right? Wrong! Because then I thought, why limit it to MAD? Why not extend the adjustment to the following estimates:

- Mean Absolute Deviation from the Mean (MADM)
- Mean Absolute Deviation from the Median (MADMd / ADM)
- Median Absolute Deviation from the Mean (MdADM)
- Median Absolute Deviation from the Median (MdADMd / MAD)
- Robust Scale Estimate of CV with unknown location
- Robust Scale Estimate of CV with known location

However, CR only published their factors for MAD. Moreover, they published three-digit factors, and I felt there should be accurate four-digit factors to match how the stats package uses 1.4826 in its implementation of MAD. So straight down the rabbit hole I went!

Simulation Studies

Ideally, simulating small samples from a known distribution enough times generates a level of confidence that the bias for small samples can be accurately estimated, and then balanced out. In my case, I decided to look at all the sample sizes from Table 1 of CR.
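For concreteness, here is what the raw (uncorrected) versions of the first four estimators look like in a few lines of base R. This is an illustration, not the revss internals, and the function names are mine; the last line shows the 1.4826 consistency constant that stats::mad applies:

```r
# Illustrative base-R versions of the raw scale estimators
# (NOT the revss internals; small-sample bias factors are applied separately).
madm   <- function(x) mean(abs(x - mean(x)))       # Mean Abs. Dev. from the Mean
adm    <- function(x) mean(abs(x - median(x)))     # Mean Abs. Dev. from the Median
mdadm  <- function(x) median(abs(x - mean(x)))     # Median Abs. Dev. from the Mean
mad_md <- function(x) median(abs(x - median(x)))   # Median Abs. Dev. from the Median

# stats::mad() multiplies the last one by 1.4826 for consistency
# with the standard deviation at the normal distribution:
x <- c(2, 3, 5, 7, 11)
all.equal(mad(x), 1.4826 * mad_md(x))
```

The two robust CV estimates are then ratios of a robust scale estimate to a (known or robustly estimated) location.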
I would follow another idea of CR, which was to calculate specific factors for each of the smallest sample sizes, and a single asymptotic factor which would be used to calculate the correction for larger samples. I am currently writing a fuller treatment of the study, the factors, the results, and the package, which I hope to submit to the Journal of Statistical Software in the near future.

Monte Carlo Standard Errors

I first started with small-scale simulation studies to estimate the standard errors of the six metrics above. This would inform how many simulations would be needed, as the Monte Carlo Standard Error (MCSE) is defined as MCSE = s / sqrt(R), where s is the standard error of the parameter and R is the number of independent samples. For accuracy, one would want, at a minimum, the MCSE to be smaller than the targeted precision. So for four-digit accuracy, the minimum acceptable MCSE would be strictly less than 1e-4. Better would be for it to be less than half that amount, so that the estimate can be expected to "round" correctly. Of course, for every twofold reduction in MCSE you need four times the simulations, due to the square root.

How Many Simulations?!

The small-scale studies, repeated many times, indicated that to achieve MCSEs below that threshold, I would need something on the order of 660 million simulations for the individual small-sample factors and 66 million simulations for sample sizes between 10 and 100. The way the package was written in pure R, that would take days, if not weeks, to complete. That was not an option. So, the rabbit hole deepens; it's refactoring time!

Complete Engine Rewrite

I left the R engine intact and built workhorse functions first in C and then in Fortran with a bridge to C, using the original R as a check. I didn't keep the timings, unfortunately, but it was fastest to move all the heavy lifting to Fortran and write thin C wrappers to facilitate the movement of the SEXPs. That was even faster than doing the work directly in C. I have experience with this process, and it was not difficult to get the functionality ported.
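The MCSE arithmetic above can be sketched in a few lines of R. The standard-error value here is a stand-in, not one of the actual estimates from my studies; the point is the square-root scaling:

```r
# MCSE = s / sqrt(R): halving the MCSE requires quadrupling
# the number of independent simulations R.
mcse <- function(s, R) s / sqrt(R)

s <- 1.2                 # stand-in for a metric's estimated standard error
mcse(s, 1e6)             # baseline MCSE at one million simulations
mcse(s, 4e6)             # 4x the simulations gives exactly half the MCSE

# Simulations needed to reach a target MCSE:
R_needed <- function(s, target) ceiling((s / target)^2)
R_needed(s, 5e-5)        # target of half of 1e-4, for correct rounding
```

Plugging in realistic standard errors for robust scale estimates is what drives the counts into the hundreds of millions.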
Once I brought the unit test coverage back to 100% and was certain that the mechanism was correct, it was time for the micro-optimizations.

The Curmudgeon Uses the LLMs

Anyone following me for the past decade knows that I am an avowed skeptic of the proliferation of LLMs/AI. However, I have coded in R for over 20 years and in other languages supporting R packages for over 10 years. I think I can consider myself competent at creating R packages with compiled code. I'm also decent at optimizing code for speed. I'm no Donald Knuth, or even Tomas Kalibera, but I do know how to envision data structures and make naïve data flows more efficient. So, I decided to make use of the more capable models, confident that I could spot the hallucinations, especially with a 100% coverage test harness. I used ChatGPT, Claude, and Grok. For the first two I used free accounts, but I can use the full Grok given my premium subscription to X.

Force Multipliers

I can, and probably should, write a completely separate blog post about these experiences, but in short, I was very impressed. The models do the job of scraping all the standard and non-standard sites: Stack Overflow, Rosetta Code, R-devel, and the like. More than that, their pattern matching has developed to the point where they can provide meaningful suggestions. Yes, I found mistakes with each model; yes, they hallucinated; and yes, they can get all obsequious when you catch them. And no, they still aren't great at keeping the structure of the pseudocode in memory. That being said, there is no question that I learned a lot, and I will certainly apply some of these enhancements to my other packages. In my opinion, Grok seemed to respond the best overall, probably because I had access to its most advanced model. Also, Claude consistently outperformed ChatGPT for the work I needed. Your mileage may vary.

Numbers, Numbers, and Even More Numbers!
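The check-against-the-original workflow is simple to sketch: keep the pure-R version as an oracle and assert that the fast path matches it. The "fast" function below is a stand-in for the compiled code (in the real package that would be a .Call() into the C wrapper around the Fortran kernel), so this is only the shape of the harness:

```r
# Pure-R reference implementation ("oracle") for the MAD core.
ref_mad_core <- function(x) median(abs(x - median(x)))

# Stand-in for the optimized compiled implementation; restructured
# (sorts once up front) but must agree with the oracle exactly.
fast_mad_core <- function(x) {
  xs <- sort(x)
  m  <- median(xs)
  median(abs(xs - m))
}

x <- rnorm(101)
stopifnot(isTRUE(all.equal(ref_mad_core(x), fast_mad_core(x))))
```

With a harness like this at 100% coverage, any incorrect "optimization" surfaces immediately, which is what makes aggressive micro-optimization safe.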
Once the sample size climbs above 10, the CLT starts peeking in again. This time, an acceptably small MCSE for all estimates was possible using only 16.5M simulations per sample size. Of course, this was for 6 metrics and 91 different sample sizes, so it's still over 9B values. I used the same mean and error estimation algorithms as before.

Publish and Perish?

Armed with these simulations, I was able to generate more precise bias reduction factors for all six scale metrics used in the revss package. Finally! Using these factors, I rewrote the functions admn, madn, robLoc, and robScale. I also allow use of the original CR factors for madn. See the package itself and the upcoming paper for more detail.

After these weeks of research and development, I settled on the engine and statistics of the package refactor. At the time, there were no other packages which depended on, imported, or even suggested revss, so I felt comfortable updating the major version and breaking the API. Once again, I spent hours bringing the test harness to 100% coverage and writing documentation. I roll my own Rd files; I prefer them to roxygen, or anything from devtools for that matter. This ties back to my philosophy on dependency hell.

After repeating the change/build/test/check/covr cycle dozens of times, I thought I was ready for publication. I will put it down to my persistent insomnia and possible tiredness-induced euphoria, but I pushed to CRAN without changing the defaults to use the new factors! That was a complete brain-freeze on my part. Thankfully, CRAN allowed me to push an update the next day. However, the updated version is still not live as of the publishing of this post.

Current State

So right now, the version on CRAN is the flawed one: 3.0.0. It has conflicting information between its documentation and its functionality, stray unneeded functions, and some other embarrassing shortcomings. The proper release is 3.1.0, once it gets pushed to CRAN.
For now, it lives on the master branch of its GitHub repository. The last commit is not tagged 3.1.0 because I only tag after CRAN acceptance; I learned that the hard way over the years! Once the CRAN version reads 3.1.0, it should be safe to download from CRAN as well. I will probably drop a blog note when that happens.

As always, I very much appreciate feedback, constructive criticism, and examples where my packages have helped others. Please reach out and let me know! Happy estimating!

References

Croux, C., and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale," in Computational Statistics, eds. Y. Dodge and J. Whittaker, Heidelberg: Physica-Verlag HD, pp. 411–428.

Rousseeuw, P. J., and Verboven, S. (2002), "Robust estimation in very small samples," Computational Statistics & Data Analysis, 40, 741–758. https://doi.org/10.1016/S0167-9473(02)00078-6.