One-Way ANOVA for More Than Two Groups

Download the CSV file for this tutorial from the following link:

Salary.csv

Suppose you want to test the null hypothesis that the number of years of work experience before pursuing a Master of Business Administration (MBA) has no effect on the salary offered in the first job after graduation. The alternative hypothesis is that the salaries differ significantly. To test this, we set up three groups of students:

Group A: Students with no prior work experience (or less than 1 year) before joining the MBA program.

Group B: Students with between 1 and 3 years of work experience.

Group C: Students with more than 3 years of work experience.

For sample collection, we will select only the top 10 Indian colleges, institutes or universities that offer an MBA/PGDM specialization. If we wish, we can further restrict the sample to only Finance, Marketing or Human Resources majors.

Although a larger sample would yield more convincing statistical results, for this example and for ease of calculation we select 6 students in each group, which gives 6 x 3 = 18 observations in the sample. (Note: The data are not linked to an individual institute and may or may not be representative of the overall offers at those institutes/universities.)

Table 1 below gives the details of the observations:

 

Table 1
Student   Group   Experience (in Months)   Salary Offered (in Indian Rupees, ‘000)
1         A       6                        24
2         A       9                        26
3         A       3                        18
4         A       8                        24
5         A       6                        20
6         A       2                        18
7         B       14                       20
8         B       16                       22
9         B       30                       28
10        B       18                       25
11        B       24                       30
12        B       28                       32
13        C       38                       40
14        C       46                       55
15        C       52                       60
16        C       84                       90
17        C       75                       88
18        C       60                       75

 

Because we want to test the hypothesis on the salary data, Table 1 can be rearranged by group:

Salary Offered (in Indian Rupees, ‘000)
Group A   Group B   Group C
24        20        40
26        22        55
18        28        60
24        25        90
20        30        88
18        32        75

 

Now, someone (e.g. a prospective student) might expect that there is no difference in the salaries offered to different students in their first job after the MBA, and so she sets up a (null) hypothesis:

H0: The mean salary is the same in every group, i.e. MA = MB = MC (where, e.g., MA represents the mean of Group A).

The alternative hypothesis (Ha) is that the means are not all equal, i.e. at least one of the three means differs from the others.

Let's work on the data in R.

# Read the CSV data into R. This assumes you have already downloaded the file into R's working directory;
# otherwise download it from the link given here:
# https://www.dropbox.com/s/57tqbf7373tuxmn/salary.csv

salary <- read.csv("salary.csv", header = TRUE)

names(salary)   # view the dataset's variables
str(salary)     # view the dataset's structure

# Relabel the levels of the group factor
salary$Group <- factor(salary$Group, labels = c("Group A", "Group B", "Group C"))
salary

# First, we draw a box plot of the salary offered for the three groups A, B and C.

library(ggplot2)

ggplot(salary, aes(x = Group, y = Salary)) +
  geom_boxplot(fill = "grey80", colour = "blue") +
  scale_x_discrete() +
  xlab("Group of Students") +
  ylab("Salary offered, in Indian Rupees, '000")

[Figure: box plot of salary offered (in Indian Rupees, ‘000) by group of students]


# Does it give you some idea? Well, it suggests that there are differences between the groups.
# But to test the hypothesis and check the other assumptions, let's move on to ANOVA.

# To investigate the differences among the groups, we fit the one-way ANOVA model using R's lm() function.
# The lm() syntax is: lm(response ~ predictor, data = dataset)
# For more, see the help page: ?lm

diffgrps<- lm(Salary ~ Group, data = salary)

# Here, the results of the model fitted to the data are saved in an object.
# This object helps us assess the goodness of fit and check the model assumptions.
# Printing the object on its own shows only the call and the coefficients,
# so we use summary() to produce the standard summary:

summary(diffgrps)

# which produces the following summary

# Call:
# lm(formula = Salary ~ Group, data = salary)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -28.0000  -4.0417   0.3333   4.2083  22.0000
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)    21.667      4.851   4.466 0.000453 ***
# GroupGroup B    4.500      6.861   0.656 0.521822
# GroupGroup C   46.333      6.861   6.753 6.49e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 11.88 on 15 degrees of freedom
# Multiple R-squared:  0.7872, Adjusted R-squared:  0.7588
# F-statistic: 27.74 on 2 and 15 DF,  p-value: 9.126e-06

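The model output shows strong evidence of a difference in average salary: the overall F-statistic is 27.74 on 2 and 15 degrees of freedom (p-value 9.126e-06). The coefficients are measured relative to Group A, the reference level. Group B does not differ significantly from Group A (p = 0.52), whereas Group C does (p = 6.49e-06).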
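One way to read the coefficient table is to compare it with the raw group means. The following is a minimal sketch that types in the salaries from Table 1 instead of reading salary.csv; the names salary_tbl, group_means and diff_from_A are illustrative choices, not part of the original script.

```r
# Salaries (in Indian Rupees, '000) typed in from Table 1.
# 'salary_tbl' is a hypothetical name used only for this illustration.
salary_tbl <- data.frame(
  Group  = factor(rep(c("A", "B", "C"), each = 6)),
  Salary = c(24, 26, 18, 24, 20, 18,
             20, 22, 28, 25, 30, 32,
             40, 55, 60, 90, 88, 75)
)

# Sample mean of each group, as a plain named numeric vector
group_means <- sapply(split(salary_tbl$Salary, salary_tbl$Group), mean)
round(group_means, 3)

# Differences from the reference group (Group A)
diff_from_A <- group_means - group_means[["A"]]
round(diff_from_A, 3)
```

The Group A mean (about 21.667) matches the intercept, and the differences for Groups B and C (4.500 and about 46.333) match the other two estimates in the summary.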

#An analysis of variance for this model can be performed using the following anova command in R:

anvgrp<- anova(diffgrps)

anvgrp

# Analysis of Variance Table
#
# Response: Salary
#           Df Sum Sq Mean Sq F value    Pr(>F)
# Group      2 7834.1  3917.1  27.739 9.126e-06 ***
# Residuals 15 2118.2   141.2
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

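The ANOVA table confirms that the group means differ: the p-value (9.126e-06) is far below the usual 0.05 significance level, and indeed below 0.001, so we reject the null hypothesis. Note that the F-test only tells us that at least one group mean differs from the others; it does not tell us which one.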
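To see where the numbers in the ANOVA table come from, the sums of squares and the F value can be reproduced by hand. This is only a sketch under the assumption that the data are typed in from Table 1; all variable names (salary_tbl, ss_between, ss_within and so on) are chosen here for illustration.

```r
# Salaries (in Indian Rupees, '000) typed in from Table 1 (hypothetical object name).
salary_tbl <- data.frame(
  Group  = factor(rep(c("A", "B", "C"), each = 6)),
  Salary = c(24, 26, 18, 24, 20, 18,
             20, 22, 28, 25, 30, 32,
             40, 55, 60, 90, 88, 75)
)

groups      <- split(salary_tbl$Salary, salary_tbl$Group)
group_means <- sapply(groups, mean)
group_sizes <- sapply(groups, length)
grand_mean  <- mean(salary_tbl$Salary)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group (residual) sum of squares: spread of observations around their own group mean
ss_within <- sum((salary_tbl$Salary - ave(salary_tbl$Salary, salary_tbl$Group))^2)

df_between <- length(groups) - 1                 # 3 groups - 1 = 2
df_within  <- nrow(salary_tbl) - length(groups)  # 18 observations - 3 groups = 15

f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ss_between = ss_between, ss_within = ss_within, F = f_value), 3)
p_value
```

The sums of squares (about 7834.1 and 2118.2) and the F value (about 27.739) agree with the Group and Residuals rows of the table above.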

# Using the function confint() we can calculate 95% confidence intervals for the model coefficients:

confint(diffgrps)

#                  2.5 %   97.5 %
# (Intercept)   11.32635 32.00698
# GroupGroup B -10.12342 19.12342
# GroupGroup C  31.70992 60.95675
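These intervals are each estimate plus or minus a t quantile times its standard error. The sketch below rebuilds them by hand and compares the result with confint(). It assumes the data are typed in from Table 1 with plain group labels A, B and C, so the coefficient names come out as GroupB and GroupC rather than GroupGroup B and GroupGroup C; the object names salary_tbl, fit and manual_ci are illustrative.

```r
# Salaries (in Indian Rupees, '000) typed in from Table 1 (hypothetical object name).
salary_tbl <- data.frame(
  Group  = factor(rep(c("A", "B", "C"), each = 6)),
  Salary = c(24, 26, 18, 24, 20, 18,
             20, 22, 28, 25, 30, 32,
             40, 55, 60, 90, 88, 75)
)

fit <- lm(Salary ~ Group, data = salary_tbl)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% interval: estimate +/- t(0.975, residual df) * standard error
t_crit    <- qt(0.975, df = df.residual(fit))
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

round(manual_ci, 5)

# Largest absolute difference from confint(); should be numerically zero
max(abs(manual_ci - confint(fit)))
```

The interval for Group B contains 0, which is consistent with its non-significant coefficient, while the interval for Group C lies entirely above 0.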

 
