T test in R

Wait 5 sec.

[This article was first published on coding-the-past, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.In this post, you will learn what a T Test is and how to perform it in R. First, you’ll see a simple function that lets you perform the test with just one line of code. Then, we will explore the intuition behind the test, building it step by step with data about the Titanic passengers. Enjoy the reading!1. What is a T-Test?A t-test is a statistical procedure used to check whether the difference between two groups is significant or just due to chance. In this post, we’ll look at data from Titanic passengers, dividing them into males and females. Suppose we want to test the hypothesis that men and women had the same average age. If our data shows that women were, on average, 2 years younger than men, we need to ask: is this a real difference, or could it have happened randomly? The t-test helps us answer this question.2. Why is a T-Test important?A t-test is important when we want to draw conclusions about a population based on a sample. For example, imagine we are studying the demographics of ship passengers at the beginning of the twentieth century and want to use the Titanic sample to generalize findings to a broader population of passengers.Of course, such inferences may be biased, since Titanic passengers might not perfectly represent all ship passengers of that era. Nevertheless, the sample can still provide valuable insights, as long as the context of both the sample and the population is carefully considered and clearly explained.3. The Titanic passengersWe are going to use the titanic R library to access data about Titanic passengers. Specifically, we will work with a subset of passengers contained in the titanic_train dataset. Below, you will find the code to load the data, calculate the mean and standard deviation of age for males and females, and show how many passengers are men and women. content_copy Copy library(titanic) data('titanic_train')df % select(Sex, Age) %>% na.omit()df %>% group_by(Sex) %>% summarize(mean(Age), sd(Age), n()) .container-titanic { display: flex; justify-content: center; margin: 0rem 0;}.card { background: #ffffff; border-radius: 12px; box-shadow: 0 2px 6px rgba(0,0,0,0.08); padding: 1.5rem; max-width: 600px; width: 100%;}.card table { width: 100%; border-collapse: collapse; font-family: "Georgia", serif; font-size: 0.95rem; color: #333;}.card th { text-align: left; border-bottom: 2px solid #ccc; padding: 0.5rem 0.75rem; font-weight: 600; color: #222;}.card td { padding: 0.5rem 0.75rem; border-bottom: 1px solid #eee;}.card tbody tr:last-child td { border-bottom: none;}.sex-pill { display: inline-block; padding: 0.2rem 0.6rem; border-radius: 9999px; font-size: 0.85rem; font-weight: 500; color: #fff;}.sex-pill.female { background-color: #C84848;}.sex-pill.male { background-color: #183f6cff;}Sexmean(Age)sd(Age)nfemale27.914.1261male30.714.7453We can see that there is a difference of 2.8 years between the average age of men and women on the Titanic. Below, you can also check the distribution of ages. content_copy Copy ggplot()+ geom_density(aes(x=df$Age, color = df$Sex), size = 0.7)+ scale_color_discrete("")+ xlab("Age")+ ylab("Density") It seems indeed that the distributions are very similar. In this case, our best option is to carry a T Test out to see if they are really so similar.4. T test in RA T test can be performed in R in a very easy way. There is a function called t.test, whose first argument is a formula, in our case, we would like to know how age varies across different genders. Thomas Leeper wrote a very clear explanation about formulas in this page. Important for us is that the formula is composed by a dependent variable on the left (Age), followed by “~” and one or more independent variables on the right (Sex). The second argument is simply the dataframe with the data we want to test. This test assumes the two samples are independent and that age is approximately normally distributed, which we confirmed by the density plot above. content_copy Copy t.test(Age ~ Sex, data = df) How to interpret these results? The p-value of 0.0118 means that if there were truly no difference in the average age between male and female passengers (i.e., if the null hypothesis were true), there would be only a 1.18% chance of observing a difference as large as the one we found or larger. Since this p-value is less than 0.05, we reject the null hypothesis at the 95% confidence level, suggesting that a real difference exists. However, if we had chosen a 99% confidence level, we would not reject the null hypothesis, because the p-value is greater than 0.01. Our confidence interval tells us that if we took many samples like the one we have, in 95% percent of the times, we would obtain a difference between averages between -0.62 and -5. This confidence interval does not include 0 and therefore we reject the null hypothesis and accept the hypothesis that there is a difference between the average age of men and women.5. T test with BootstrapA T test with bootstrap is a good way of understanding the concepts needed to interpret the results of the T test above. Everything relies on the Central Limit Theorem according to which if I draw many samples of a population and calculate the mean of each sample, then the distribution of all these means will:(i) follow a normal distribution;(ii) the mean of the sample means will approximate the population mean;(iii) the standard deviation of this distribution will be called standard error.In our example, we have one sample of passengers. Imagine we could collect many of those samples. If we could do that, then the means of all samples would approximate the population parameter. Bootstrap is a technique to virtually create as many samples as we want from our unique sample. In our example, we have 712 ages after eliminating NAs. We could resample 712 observations from these values allowing them to repeat. That is the basic idea behind bootstrapping.In order to do that procedure, we will create a function that will resample our data frame. The first line of code uses slice_sample to randomly select n rows of our dataframe allowing for the same row to be chosen more than one time. Note that n is the number of rows of the dataframe. After that, we use dplyr to calculate the mean by gender. Note that we are actually interested in the difference between the male mean and the female mean. That’s what the two last lines of code do. content_copy Copy diff_means % summarize(mean_age = mean(Age, na.rm = TRUE)) male_mean % filter(Sex == "male") %>% pull(mean_age) female_mean % filter(Sex == "female") %>% pull(mean_age) return(male_mean - female_mean)} Now we can use the replicate function to execute our function for n times. For our purpose 1000 times is enough. Note that replicate works like a for loop. Before we do that, however, let us make a small adjustment so that we can also calculate our p-value. The p-value assumes the null hypothesis is true. Therefore, before resampling our data, let us make the difference between means be 0. For that, let us subtract the difference observed, 2.81, from the ages of all males. content_copy Copy df_null % mutate(Age = ifelse(Sex=="male", Age-2.81, Age)) set.seed(1308)diffs =2.81)/1000sum(diffs