Statistical Analysis on Data Frames with Python and R

In this article I will talk about the Python and R code used to strip null values and convert csv data tables into data frames so that practical statistical analysis can run smoothly. The key statistical methods discussed are the Mann-Whitney-Wilcoxon Rank Sum test and one-way ANOVA. While there are plenty of articles on the web about how to run the Rank Sum test and ANOVA with R, very few take a thorough approach that walks readers from the start (loading csv data into R and Python) to the finish (running the actual statistical analysis). This article tries to fill that gap in a detailed and easy-to-comprehend fashion.

*************************************************************************

Let’s start with Python: 

CSV file: data_original

The following commands read the “data_original.csv” file from the local drive and convert all blank cells to “NaN” (null values):

from pandas import *

dataF = read_csv("data_original.csv", na_values=[" "])

print(dataF)

Output:

[screenshot: the full data frame printed by print(dataF), with the blank cells shown as NaN]

To get summary statistics of each column, run the following command:

dataF.describe()

Output:

[screenshot: summary statistics (count, mean, std, min, quartiles, max) for each column]

To return only the column of interest (e.g. “1st_set” column):

print(dataF["1st_set"])

To return data from a specific row (say row 16) in the “2nd_set” column, type the following command*:

print(dataF["2nd_set"][15])

Output*:

28.0

*Python row indices start at 0, so row 16 corresponds to index 15. In other words, subtract 1 from the row number you want to extract.
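As a side note, here is a minimal sketch contrasting the label-based indexing used above with position-based indexing via iloc (same dataF frame and column names as in the example):

# Both lines select the 16th row (position 15) of the "2nd_set" column.
print(dataF["2nd_set"][15])        # label-based: works here because the default index is 0, 1, 2, ...
print(dataF["2nd_set"].iloc[15])   # position-based: always the 16th value, regardless of index labels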

To obtain rows that meet certain criteria (e.g. “2nd_set” column values greater than 27):

print(dataF[dataF["2nd_set"] > 27])

Output:

[screenshot: the rows where the “2nd_set” value is greater than 27]

To return row(s) where the value in “3rd_set” is NaN, run the following commands:

import numpy as np

print(dataF[np.isnan(dataF["3rd_set"])])

Output:

[screenshot: the rows where “3rd_set” is NaN]

It is best to drop rows that contain NaN values, because calculations like SEM (standard error of the mean) will not work if there are NaN values in the column. This can be verified by trying the following commands:

from scipy import stats

print(stats.sem(dataF["3rd_set"]))

Output:

nan
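As a side note, newer versions of scipy can skip the missing values directly; this is an optional shortcut rather than the approach used in the rest of the article:

print(stats.sem(dataF["3rd_set"], nan_policy="omit"))   # ignores NaN entries when computing the SEM

Below, however, we will handle the missing values at the data-frame level.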

To correct this nan output problem, you can use the “dropna” function. Called on the whole data frame, it drops every row that has an NA in any of its entries (see the short sketch below). If you only want to remove NAs from a specific column, say the “2nd_set” column, run this command:

dataF["2nd_set"].dropna()
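For comparison, here is a minimal sketch of the whole-data-frame version, which drops every row containing at least one NA:

print(dataF.shape)           # original number of rows and columns
print(dataF.dropna().shape)  # fewer rows once every row with an NA is dropped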

Alternatively, you can replace all NA values in the “2nd_set” column with a number such as 0:

dataF.fillna(0.0)["2nd_set"]

Depending on how you handle NA values, you will get different means. Look at the commands and outputs below:

dataF["2nd_set"].dropna().mean()

Output:

26.526315789473685

dataF.fillna(0.0)["2nd_set"].mean()

Output:

25.627118644067796
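The second mean is lower because fillna(0.0) keeps the rows with missing values but counts them as 0, which drags the average down. A quick sketch to see how many values are affected (using the same dataF frame):

print(dataF["2nd_set"].isnull().sum())   # number of missing values that fillna(0.0) replaces with 0
print(len(dataF["2nd_set"]))             # total number of rows in the column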

To get the mean and variance of “3rd_set” when “1st_set” equals 25, run the following 2 commands:

dataF[dataF["1st_set"] == 25]["3rd_set"].mean()

Output:

26.500000000000001

dataF.fillna(0.0)[dataF["1st_set"] == 25]["3rd_set"].var()

Output:

2.7000000000000002

To get standard error of the mean for “3rd_set” column when “1st_set” equals 25, run these commands:

from scipy import stats

stats.sem(dataF[dataF["1st_set"] == 25]["3rd_set"])

Output:

0.73484692283495345

Mann-Whitney-Wilcoxon (MWW) Rank Sum test:

The MWW Rank Sum test is used to determine whether two sets of data are significantly different from each other. Unlike the t-test, the Rank Sum test works reliably even when the data are not normally distributed, so it can provide a more accurate assessment of such data.

Run the following commands to determine whether there is a significant difference between each pair of the 3 columns:

sample1 = dataF["1st_set"].dropna()

sample2 = dataF["2nd_set"].dropna()

sample3 = dataF["3rd_set"].dropna()

stats.ranksums(sample2, sample3)

Output:

RanksumsResult(statistic=0.20137831588646385, pvalue=0.84040276931163582)

stats.ranksums(sample1, sample2)

Output:

RanksumsResult(statistic=-2.7353887907911338, pvalue=0.0062306635867612672)

stats.ranksums(sample1, sample3)

Output:

RanksumsResult(statistic=-2.6531578587360127, pvalue=0.0079742567612623265)

Since the Rank Sum test of the “2nd_set” and “3rd_set” columns returns a P-value greater than 0.05, we fail to reject the null hypothesis that the two columns come from the same distribution, and we conclude that columns 2 and 3 are not significantly different. On the other hand, the Rank Sum tests between column 1 and column 2 and between column 1 and column 3 both yield P-values much smaller than 0.05, so we conclude that column 1 is significantly different from column 2. The same can be said for column 1 and column 3.
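If you prefer to run all three pairwise comparisons in one go, here is a minimal sketch (the 0.05 threshold and the pair labels are simply illustrative choices):

from itertools import combinations
from scipy import stats

samples = {"1st_set": sample1, "2nd_set": sample2, "3rd_set": sample3}
for (name_a, a), (name_b, b) in combinations(samples.items(), 2):
    stat, p = stats.ranksums(a, b)
    verdict = "significantly different" if p < 0.05 else "not significantly different"
    print(name_a, "vs", name_b, ": p =", round(p, 4), "->", verdict)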

ANOVA (one-way analysis of variance):

The MWW Rank Sum method discussed above applies to 2 sample comparisons only. What if you want to compare more than 2 samples (e.g. compare “1st_set”, “2nd_set” and “3rd_set” columns together)? In that case, you want to use ANOVA. Simply run the “f_oneway” function as shown below and you’ll get the associated P-value.

stats.f_oneway(sample1, sample2, sample3)

Output:

F_onewayResult(statistic=10.925639289300042, pvalue=3.4368375004095858e-05)

Because the P-value is much less than 0.05, we can conclude that at least 2 of the 3 columns are significantly different from each other.

*************************************************************************

Now we can switch gears and walk through the same analysis using R code:

First, read the csv file from the local drive:

dataF <- read.csv("C:/Users/Shaolun/Desktop/Wordpress blogs/Basic stats/data_original.csv", header=TRUE)

Read the data table as a data frame:

dataF <- as.data.frame(dataF)

Omit all rows that contain NA values: 

newdataF <- na.omit(dataF)

If you want to omit NA values in just the “2nd_set” column, run the following commands to save the result to a new variable and calculate its mean:

second_column <- na.omit(dataF[["X2nd_set"]])

mean(second_column)

Output:

[1] 26.52632

The result is exactly the same as what we got using Python!

You can also compute a conditional mean in R based on criteria (e.g. the mean of the “3rd_set” column where the “1st_set” column equals 25, as we did in the Python code). Run the following code:

newdataF <- na.omit(dataF)

mean(newdataF$X3rd_set[newdataF$X1st_set==25]) 

Output:

26.5

This result agrees with what we got using Python earlier.

Standard Error of the Mean:

To calculate the standard error of the mean of the “3rd_set” column, with NA values removed, run the following code:

install.packages("plotrix")

library(plotrix)

std.error(newdataF$X3rd_set)

Output: 

[1] 0.1507861

MWW Rank Sum test:

first_column <- na.omit(dataF[["X1st_set"]])

second_column <- na.omit(dataF[["X2nd_set"]])

third_column <- na.omit(dataF[["X3rd_set"]])

wilcox.test(second_column,third_column)

Output:

Wilcoxon rank sum test with continuity correction

data:  second_column and third_column
W = 1689, p-value = 0.8372

wilcox.test(first_column,second_column) 

Output:

Wilcoxon rank sum test with continuity correction

data:  first_column and second_column
W = 1164, p-value = 0.005678

wilcox.test(first_column,third_column)

Output:

Wilcoxon rank sum test with continuity correction

data:  first_column and third_column
W = 1201.5, p-value = 0.007381

As you can see, the Rank Sum test results generated by R agree with the results generated by Python. The small differences in the P-values come mainly from the fact that R’s wilcox.test applies a continuity correction to the normal approximation (as noted in its output), while scipy’s ranksums does not, rather than from rounding.
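If you would like the Python results to line up more closely with R, scipy also offers mannwhitneyu, which supports a continuity correction. Here is a minimal sketch using the sample1/sample2/sample3 variables defined in the Python section (the alternative argument requires a reasonably recent scipy version):

from scipy.stats import mannwhitneyu

# Two-sided Mann-Whitney U test with continuity correction, closer in spirit to R's wilcox.test
print(mannwhitneyu(sample2, sample3, use_continuity=True, alternative="two-sided"))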

ANOVA (one-way analysis of variance):

Now let us run ANOVA using R. One caveat is that R’s aov() function expects the data in “long” format: a single column of values plus a factor column indicating which sample each value came from (this also conveniently handles the fact that our three samples have different lengths). We can stack the 3 samples vertically with the “data.frame” function as shown below:

Data <- data.frame(Y = c(first_column, second_column, third_column),
                   Column = factor(rep(c("first_column", "second_column", "third_column"),
                                       times = c(length(first_column), length(second_column), length(third_column)))))

Data

Output:

[screenshot: the stacked data frame, with a Y column of values and a Column factor labelling each sample]

test1 <- aov(Y~Column, data=Data)

anova(test1)

Output:

[screenshot: the ANOVA table produced by anova(test1), with Pr(>F) = 3.437e-05]

As expected, the P-value of 3.437e-05 is almost identical to the P-value obtained from Python, which means that at least 2 of the 3 samples are significantly different from each other.

That is all for this blog. I hope it is able to help readers who have trouble converting csv data into data frames and writing statistical analysis commands in R and Python. As usual, feel free to leave comments down below if you have any thoughts or questions. 
