Today I want to show you some simple code for running a multi-sample ANOVA test, followed by t-tests, with Python’s powerful scipy package. ANOVA is handy when you want to compare more than 2 samples to see whether their differences (if any) are statistically significant. The test is widely used in A/B testing, comparisons of automobile gas mileage, HITL (human-in-the-loop) task efficiency testing, or any scenario where you need to understand whether different levels of a factor (i.e. different groups or different experimental designs) have an impact on the response variable.

Note that a statistically significant ANOVA result (i.e. p-value < 0.05) only lets you conclude that the group means are not all the same; it doesn’t tell you which groups differ. For example, if you are comparing the sample means of Group A, Group B, Group C and Group D and the ANOVA result is statistically significant, you can only say that at least one of these 4 group means differs from the others. You don’t know whether the difference is between Group A and Group B, Group B and Group C, Group C and Group D, and so on. To find out whether the means of 2 individual groups are different, you should run a t-test on that pair.

For this article, we will be using 4 sample groups. The hypothetical use case is as follows: a data scientist wants to develop a reliable bounding box tool for a self-driving car project. To do that, she wants to make sure she has enough reliable training data. To produce reliable training data, the scientist decides to use human-in-the-loop, with online contributors helping her draw bounding boxes around the appropriate objects in images. Because of the vast number of images that need to be boxed, the scientist must find an efficient way to design the bounding box task so that people can draw the boxes quickly without sacrificing quality. She wants to test 2 job designs (one called the control design and the other called the testing design, where the testing design is the one the scientist believes to be more efficient), and to allocate both experienced contributors and rookie contributors to each of these 2 jobs. Here, experienced contributors are simply contributors who have participated in bounding box tasks before, whereas rookie contributors have never participated in them. The goal is to see whether the average time spent drawing the boxes differs across the 4 groups and, if so, whether the difference exists between the experienced control and experienced testing groups, between the rookie control and rookie testing groups, or both.

Looking at the dataset called ‘anova_test.csv’, you can see that a total of 159 contributors participated in the HITL jobs: 13 experienced and 63 rookie contributors participated in the control design, while 28 experienced and 55 rookie contributors participated in the testing design. The complete dataset, including the average time each contributor spent, can be found in the link called ‘Dataset’ below. The Python script used for the analysis can be found in the link called ‘Python script’ below.
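
If you want to reproduce the group counts yourself, a minimal sketch could look like the one below. The column names (experience, design, avg_time) and the six inline rows are assumptions made up for illustration; the real ‘anova_test.csv’ may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'anova_test.csv'; the real column names may differ.
csv_text = """contributor_id,experience,design,avg_time
1,experienced,control,41.2
2,experienced,testing,47.9
3,rookie,control,63.5
4,rookie,testing,52.1
5,rookie,control,58.7
6,experienced,testing,45.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Combine the two factors into a single label for the 4 groups.
df["group"] = df["experience"] + "_" + df["design"]

# Number of contributors in each group.
group_sizes = df.groupby("group").size().to_dict()

# One array of average times per group, ready to pass to scipy later.
samples = {name: g["avg_time"].to_numpy() for name, g in df.groupby("group")}

print(group_sizes)
```

With the real file you would replace the inline text with pd.read_csv('anova_test.csv') and adjust the column names to match.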

Data sources:

Python script: https://github.com/Stanleyrr/Data-Science-Portfolio/blob/master/Statistical_Analysis/ANOVA_for_Python/anova_github.py

Once you run the Python script I provided, you should first see in the output console the five-number summary (minimum, Q1, median, Q3 and maximum) of each group. Then, you should see the 4 bell curve charts below, which show the distribution of average time spent for each of the 4 groups:
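
For reference, a five-number summary can be computed with numpy along these lines. The helper name five_number_summary and the sample values are made up for illustration, and the linked script may well compute it differently.

```python
import numpy as np


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a 1-D sample."""
    arr = np.asarray(values, dtype=float)
    # numpy's default percentile method uses linear interpolation.
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return float(arr.min()), float(q1), float(median), float(q3), float(arr.max())


# Made-up average times, not taken from the article's dataset.
times = [30, 35, 40, 45, 50, 55, 60, 65, 70]
summary = five_number_summary(times)
print(summary)
```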

As you can see, all groups except the experienced control group have roughly bell-shaped distributions that are approximately normal. The experienced control group is clearly skewed towards the right.
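
One way to back up a visual impression like this with numbers is to compute the sample skewness and run a Shapiro-Wilk normality test. The sketch below uses synthetic data rather than the article’s dataset, and it assumes "skewed towards the right" means a long right tail, in which case the sample skewness would be positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples (not the article's data):
# one roughly normal, one with a long right tail.
symmetric = rng.normal(loc=50, scale=8, size=500)
right_skewed = rng.exponential(scale=10, size=500) + 30

# Skewness near 0 suggests symmetry; clearly positive suggests a long right tail.
skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

# Shapiro-Wilk test: a small p-value is evidence against normality.
w_stat, p_normality = stats.shapiro(right_skewed)

print(f"skewness (symmetric sample): {skew_symmetric:.2f}")
print(f"skewness (right-skewed sample): {skew_right:.2f}")
print(f"Shapiro-Wilk p-value (right-skewed sample): {p_normality:.3g}")
```

Keep in mind that with a sample as small as 13 contributors, both measures are noisy, so they complement the charts rather than replace them.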

You should also see a boxplot that contains the statistical summary of the 4 groups as seen below:

Looking at this boxplot, it seems that the experienced control group and experienced testing group differ, as do the rookie control group and rookie testing group, because their medians and quartiles are quite far apart. So is our assumption correct? Let’s examine the ANOVA and t-test output:

First of all, the ANOVA result shows that the means of the 4 groups are not all the same, which is what we expected. To find out which of the groups have different means, we look at the t-test results.
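
As a rough sketch of what the ANOVA step involves, the example below runs scipy.stats.f_oneway on 4 synthetic groups and reproduces the same F statistic by hand. The group sizes follow the article, but the means, spread and resulting numbers are made up and are not the article’s results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic average times; sizes follow the article, values are made up.
groups = {
    "experienced_control": rng.normal(40, 8, size=13),
    "experienced_testing": rng.normal(48, 8, size=28),
    "rookie_control": rng.normal(60, 8, size=63),
    "rookie_testing": rng.normal(52, 8, size=55),
}

# One-way ANOVA across all 4 groups.
f_stat, p_value = stats.f_oneway(*groups.values())

# The same F statistic by hand: between-group vs. within-group mean squares.
all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k, n = len(groups), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"F = {f_stat:.3f}, p = {p_value:.3g}")
print(f"F computed by hand = {f_manual:.3f}")
```

A large F means the variation between the group means is large relative to the variation within the groups, which is what drives the p-value down.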

For the experienced group, there is one surprise in the t-test. The t-stat for the experienced control group and experienced testing group is negative, meaning that contributors in the experienced control group on average actually spent less time completing the bounding boxes than those in the experienced testing group. Also, because the p-value is 0.0128 (less than 0.05), this difference is statistically significant. This contradicts what the scientist believed, as she thought that contributors in the testing group would be faster than contributors in the control group. Having said that, we need to keep in mind that there are only 13 experienced contributors in the control group and 28 in the testing group, both below the rule-of-thumb sample size of 30 that is often cited in connection with the central limit theorem. In addition, the experienced control group’s distribution is not even approximately normal, as shown earlier. So you should definitely take the result with a pinch of salt until the scientist is able to increase the sample size and conduct a separate study.
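
To illustrate how the sign of the t-stat relates to the group means, here is a small sketch with made-up numbers (not the article’s data). It assumes the control group is passed as the first argument to scipy.stats.ttest_ind. It also shows Welch’s variant (equal_var=False), which does not assume equal variances and is sometimes preferred when group sizes differ; whether the linked script uses it is not something this sketch assumes.

```python
import numpy as np
from scipy import stats

# Made-up average times for two small groups.
control = np.array([38.0, 41.0, 35.0, 44.0, 40.0, 37.0])
testing = np.array([47.0, 52.0, 45.0, 50.0, 49.0, 46.0, 51.0, 48.0])

# Student's t-test (equal variances assumed), control passed first.
t_stat, p_value = stats.ttest_ind(control, testing)

# Welch's t-test: no equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(control, testing, equal_var=False)

# Swapping the argument order flips the sign of the statistic.
t_swapped, _ = stats.ttest_ind(testing, control)

print(f"mean control = {control.mean():.2f}, mean testing = {testing.mean():.2f}")
print(f"Student: t = {t_stat:.3f}, p = {p_value:.3g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3g}")
```

A negative t-stat here simply reflects that the first sample has the smaller mean, so the sign always has to be read together with the order in which the groups were passed.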

For the rookie group, the result is largely in line with expectation: contributors in the testing group are faster than contributors in the control group, and the p-value is much less than 0.05. The sample sizes of these 2 groups (63 and 55) are also sufficient.

So that’s it for the ANOVA with Python lesson. Again, all the information related to the statistical output can be found through the link above that takes you to my Python script. I hope you find this article helpful and enjoyable. As always, feel free to leave comments and suggestions below 🙂