Seasonal adjustment by @ellis2013nz

Wait 5 sec.

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.A reasonably straightforward post today. I wanted to look at monthly tourism numbers in Samoa. In fact I started to do this for Pacific islands in general, but the data wrangling challenges were sufficient that I only got as far as Samoa for now. There’s interest in these at the moment because they would be a relatively timely indicator of possible economic damage from the fuel crisis related to the Iran war. Visitor numbers, inflation, and merchandise trade are part of the very select number of monthly published economic statistics in this part of the world (for a select subset of countries).There’s two things happening in this post: getting hold of the Samoa’s visitor arrivals numbers; and seasonally adjusting them.The latter is more interesting to me than the former, but unfortunately the former took most of the time.Data wranglingHere’s what the visitor numbers look like, once we’ve got them all into one data frame:The Samoa Bureau of Statistics has a nice Excel workbook up to May 2023, but from that date onwards the data is only available as far as I can see in PDF reports. Luckily these are all available as links from a single page, but there seems to be either a skill issue on my part or some kind of block on the website that stops any systematic download of them all, so I had to download all the PDFs by hand one at a time.Then Claude helped me write a parser to find the visitor arrivals number in each PDF. Actually, Claude wasn’t any good at actually finding the right number, but it did give me a pattern I could adapt, much as in the old days I would have used Stack Overflow. There are a lot of numbers in each PDF and we need the right one—visitor arrivals, not total arrivals (yes, that’s an em dash, but I write them all by hand). In this case it turns out that the trick is that the PDFs all include the sentence “Overall visitor numbers for the month under review stood at [number]”, and luckily there is no other use of the word “stood” in the document.The other fiddly thing was extracting the actual date each PDF referred to. Then everything needed to be tested. In the end, it would have been quicker to just manually type the 35 numbers I needed. But here’s the code that does all the data wrangling.library(tidyverse)library(rvest)library(httr2)library(lubridate)library(pdftools) library(readxl)library(seasonal)library(tsibble)library(fable)library(feasts)library(ggtext)library(scales)#' Extract date from messy filenamesextract_date mutate( stem = tools::file_path_sans_ext(basename(path)), # Remove any leading word followed by _ or - (e.g. "Migration_") stem = str_remove(stem, "^([A-Za-z]+[_-])+(?=[A-Z])"), month_str = str_extract(stem, "[A-Za-z]{3,}"), # Normalise non-standard abbreviations month_str = str_replace(month_str, "^Sept$", "Sep"), year_str = str_extract(stem, "\\d{4}|\\d{2}"), year = if_else(nchar(year_str) == 2, as.integer(paste0("20", year_str)), as.integer(year_str)), date = as.Date(paste(year, month_str, "01"), format = "%Y %B %d") |> coalesce(as.Date(paste(year, month_str, "01"), format = "%Y %b %d")) ) |> pull(date)}test components() |> autoplot() Note that in this decomposition, the original ‘visitors’ series and the trend are on the original scale, but the ‘seasonal’ and ‘irregular’ components are expressed as multipliers. So a seasonal value of 1.2 means in a given month the value is 20% higher as a result of the seasonality than otherwise.And here is my final, presentation version of the data:Produced with this code:comp_data components() |> # the 'trend' that comes straight from the decomposition does # not adjust ofr the Covid coefficient and looks pretty weird # so it is more intuitive to present it after adjustment for # Covid, which we need to calculate by multiplying (because of # the log transform that SEATS used autoamtically): mutate(covid = as.numeric(covid_reg)[1:nrow(samoa_visitors)], trend_adj = trend * exp(covid * coef(fit_ts)[1])) comp_data|> ggplot(aes(x = date_month, y = season_adjust)) + geom_line(linewidth = 1.3) + geom_line(aes(y = trend_adj), colour = "steelblue", alpha = 0.9, linewidth = 1.2) + geom_point(aes(y = visitors), colour = "grey70") + scale_y_continuous(label = comma) + labs(x = "", y ="Visitor arrivals", title = "Visitor arrivals per month to Samoa", subtitle = "Original, seasonally adjusted and trend (adjusted for Covid period).", caption = "Source: data from Samoa Bureau of Statistics. Seasonal adjustment by freerangestats.info.") + theme(plot.subtitle = element_markdown())After all that, what insight do we have? Well, not a lot really, other than the blindingly obvious trends of the devastation to the industry of Covid and the slow and noisy growth trend since then. We are at least well placed to make commentary on the impact of the fuel crisis on tourism, and can say that there isn’t any evident yet. If and when we do see the impact, we’ll be able to talk about in terms of trends and random variation, after having removed the seasonal element. So that’s useful.Well, that’s all. Perhaps I’ll get around in a later post to adding the other Pacific island countries with monthly tourism data—Fiji, Vanuatu, Cook Islands, French Polynesia being the key ones I’m aware of.To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.Continue reading: Seasonal adjustment by @ellis2013nz