Reproducible Analytical Pipelines

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers.]

“Here’s the new data. Could you summarise it like Alice did last year, and send me a report?”

The civil service and public bodies in the UK publish lots of datasets. These datasets can be really helpful when experimenting with data visualisation and presentation tools. As data consumers, what we rarely see is the amount of work that goes into preparing those datasets, or how they are used to make decisions about, or understand trends within, the country. That work has to be coordinated across multiple people, each with different skills.

Much like teams do, software and data evolve over time. The raw data that feeds into the above datasets, and any products that are built upon them (reports, applications and so on), may only be collected and processed every few years – and a lot can change in a few years. So, teams within those departments need a way to reliably generate those datasets and data products from newly-collected raw data, one that is robust (or at least flexible) enough to accommodate changes in:

- data quality,
- the structure/schema of the raw data,
- personnel within the team and departmental restructuring,
- software tooling,
- output data format or usage.

It is becoming more common for this kind of data processing to be handled by a Reproducible Analytical Pipeline (RAP). A RAP is a largely automated process written in code. One aim of using RAPs here is to reduce the amount of manual and ad-hoc input into the data processing, so that the same input data generates the same downstream products, and so that the process works successfully and predictably when given new data. By placing the
processing decisions in code, RAPs make data processing more easily auditable and more transparent. The UK Civil Service and the NHS have guidelines on their aims for RAPs and how to create these pipelines.

Now, you might not be working for one of those institutions, and the data processing and analysis that you perform might not be public facing or subject to a national audit. But if you’re doing data science or data processing as part of your job, the ideas surrounding RAPs may help you work more efficiently.

Let’s start with the basics:

- where does your data come from?
- where does it go to?
- what is your main tool when working with it?
- and who else either depends upon, or is also responsible for, your work?

The RAP guidelines for the UK Civil Service promote the use of open-source tools, version control, and automation. Which tools should you choose, what should you automate, and who needs to know about or approve what you are doing?

If you’ve inherited an Excel workbook with last year’s data embedded inside it and you need to process this year’s data, you may not know enough about the processes that occurred before last year’s data was copied into the spreadsheet, or about any manual tweaks that happened after it was imported (how were missing values handled, and so on). You could automate the early, data-ingestion, stages.

If you’ve inherited some SQL scripts that make database queries, and you have to copy-paste the resulting values into a report, you could automate the report-generation step.

If you have a collection of analysis steps or scripts that have to be called in a particular order, or where you have to manually edit the scripts (fixing the file paths, for example) for them to work with a new raw-data release, you could think about how to orchestrate running those scripts, or how to configure the project so that it requires less manual intervention to run next time. Editing code and calling commands in a programming environment are manual processes, too.

You may not be able to automate everything at
once. So try to make strategic wins on those areas of your data workflow that are the least clear, or that involve the most manual input.

The push towards automation requires programming skills, and a choice of programming language. In data science this typically means SQL plus either R or Python. Which you choose for a project depends on the skills across your team and the infrastructure that is available to you. Don’t use your favourite language, or a language you want to experiment with, if no one else on the team can review your code or take over the project from you.

One of the best resources that I found while researching this blog post was the book “Building reproducible analytical pipelines with R” by Bruno Rodrigues. That book covers many of the topics mentioned above: how to set up a project with version control, how to generate automated reports, and how to orchestrate multiple analytical processes together. It is a very R-focussed book, but the ideas hold whether you work in Python or another language.

Reproducibility in data science has a long-standing counterpart in science more generally. If you write a scientific paper, the data upon which it is based and the data-processing steps involved should be made available. But they should be created in such a way that they can be reused. If someone wants to regenerate your results, and they can download your data and code, the code should be written in such a way that this is guaranteed. Just releasing a script on GitHub isn’t enough – the precise version of any used scripts and project-specific data should be tagged; the programming environment should be matched as closely as possible (for example, matching the version of R or Python used, and using the same versions of any installed packages); any supporting data sources should be pinned to specific versions; and so on.

For us though, RAPs are more about ensuring that data processing is predictable and transparent, and that processes can be reused at a subsequent date and with updated
data.

Your team may need to level up their programming skills, or their knowledge of your programming environment, to take advantage of improved automation. But doing so will reduce the amount of repetitive manual work, simplify on-boarding new team members, and make maintenance easier. Also, automating stuff is really fun.
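To make the orchestration idea above concrete (ordered steps, no hand-edited paths mid-script, and identical output for identical input), here is a minimal sketch in Python, one of the two languages the post mentions. Every function, column name, and value below is invented for illustration, not taken from any real pipeline:

```python
import csv
import io

def ingest(raw_text):
    """Parse raw CSV text into a list of {column: value} dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def clean(rows):
    """Drop rows with a missing 'value' field and coerce the rest to float."""
    return [
        {**row, "value": float(row["value"])}
        for row in rows
        if row.get("value") not in (None, "", "NA")
    ]

def summarise(rows):
    """Reduce the cleaned rows to one summary record."""
    values = [row["value"] for row in rows]
    return {"n": len(values), "mean": sum(values) / len(values)}

def run_pipeline(raw_text):
    """Run every step in a fixed, explicit order.

    Because each step is a pure function of its input, the same raw
    text always produces the same summary: no manual tweaks in between.
    """
    result = raw_text
    for step in (ingest, clean, summarise):
        result = step(result)
    return result

raw = "id,value\n1,10\n2,NA\n3,20\n"
print(run_pipeline(raw))  # {'n': 2, 'mean': 15.0}
```

The point is not the three toy steps but the shape: the run order lives in code rather than in someone’s memory, and rerunning with next year’s raw data means changing the input, not editing the scripts.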