
A data frame in R is quite different from a matrix.

A matrix holds a single type of data, usually numeric, whereas each column of a data frame can hold a different type (numeric, character, date and so on). This closely resembles the datasets you may have seen in MS Excel, SPSS, or other statistical and spreadsheet software, which is why the data frame is the most commonly used data structure in R.
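To make the difference concrete, here is a minimal sketch. The names m and df and the two sample values are illustrative assumptions, not part of the dataset used later.

```r
# A matrix silently coerces everything to one type (here: character).
m <- cbind(age = c(15, 71), gender = c("Male", "Female"))
class(m[, "age"])    # "character": the numbers were converted to text

# A data frame keeps each column's own type.
df <- data.frame(age = c(15, 71), gender = c("Male", "Female"))
class(df$age)        # "numeric"
class(df$gender)     # "character" in R >= 4.0, "factor" in older versions
```

The coercion in the matrix happens without any warning, which is one reason mixed data is better kept in a data frame.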

Before we work with data frames in R, let us take an example and understand some basics. Look at the table below:

PID  ADate       PGender  PAge  DiabetesType  PStatus
1    11/15/2013  Male     15    Type 1        Poor
2    11/17/2013  Female   71    Type 2        Improved
3    12/21/2013  Male     52    Type 1        Excellent
4    12/28/2013  Female   33    Type 2        Improved

In the table above, data on diabetic patients admitted to a hospital on various dates is arranged in columns. PID is the unique patient identification number, ADate is the date the patient was admitted to the hospital, PGender and PAge are the gender and age of the patient respectively, DiabetesType is the type of diabetes the patient has, and PStatus is the overall health condition of the patient after treatment.

The table above holds several kinds of data (numeric, date, and text) in different columns, so it cannot be stored as a matrix. The right choice here is a data frame. A data frame in R is created using the function data.frame():

> mydataframe <- data.frame(column1, column2, column3, ..., columnN)

In the data.frame() call above, column1, column2, and so on up to columnN are column vectors of any type (dates, numeric, logical, text, etc.). All of them must have the same length.
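As a quick illustration, the four-patient table shown earlier could be entered by hand roughly as follows. The name SmallPatientData is a hypothetical choice for this sketch, and the format string passed to as.Date() is an assumption based on the month/day/year dates in the table.

```r
# Each argument is one column vector; all columns have four values.
SmallPatientData <- data.frame(
  PID          = 1:4,
  ADate        = as.Date(c("11/15/2013", "11/17/2013",
                           "12/21/2013", "12/28/2013"),
                         format = "%m/%d/%Y"),
  PGender      = c("Male", "Female", "Male", "Female"),
  PAge         = c(15, 71, 52, 33),
  DiabetesType = c("Type 1", "Type 2", "Type 1", "Type 2"),
  PStatus      = c("Poor", "Improved", "Excellent", "Improved")
)

str(SmallPatientData)   # one line per column, showing its type
```

Calling str() is a convenient way to confirm that the date column was stored as a Date rather than as text.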

Now we will learn how to store such data in an R data frame. Let us take a dataset of 100 patients of both genders, aged 15 to 75 years. For simplicity, the data are generated randomly.

Let us first store PID (patient IDs 1 to 100) as:

> PID <- c(1:100)

Then we generate one hundred patient admission dates:

> SeqDate <- seq(as.Date("2013/1/1"), as.Date("2013/12/31"), "day")

> ADate1 <- sample(SeqDate, 100, replace=TRUE)

> ADate <- sort(ADate1, decreasing = FALSE)

Next, the gender of each patient:

> PGender <- sample(c("Male", "Female"), 100, replace=TRUE)

Now we generate a column of 100 ages between 15 and 75 years:

> PAge <- sample(15:75, 100, replace=TRUE)

Similarly, we generate a column with diabetes types:

> DiabetesType <- sample(c("Type 1", "Type 2"), 100, replace=TRUE)

And we generate the last column, holding each patient's status after treatment:

> PStatus <- sample(c("Poor", "Improved", "Excellent"), 100, replace=TRUE)

Finally, all the columns are combined into one data frame:

> MyPatientData <- data.frame(PID, ADate, PGender, PAge, DiabetesType, PStatus)

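The first ten rows should look something like this (the values are random, so yours will differ, and R prints dates as YYYY-MM-DD):

> head(MyPatientData, 10)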

PID  ADate      PGender  PAge  DiabetesType  PStatus
1    1/2/2013   Male     64    Type 2        Improved
2    1/3/2013   Male     54    Type 1        Poor
3    1/13/2013  Male     34    Type 2        Poor
4    1/14/2013  Female   65    Type 2        Poor
5    1/15/2013  Male     39    Type 1        Poor
6    1/25/2013  Female   61    Type 1        Excellent
7    1/28/2013  Male     71    Type 1        Poor
8    1/31/2013  Female   49    Type 2        Poor
9    2/3/2013   Male     66    Type 2        Excellent
10   2/6/2013   Female   37    Type 1        Poor
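Because every column is drawn with sample(), each run produces different values. If you want the same data on every run, one option is to fix the random seed first. The sketch below rebuilds the data frame in a single call. The seed value 2013 and the helper variable n are arbitrary assumptions for illustration, and the values it produces will not match the listing above.

```r
set.seed(2013)  # arbitrary seed (an assumption); any fixed integer works

n <- 100
SeqDate <- seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")

MyPatientData <- data.frame(
  PID          = 1:n,
  ADate        = sort(sample(SeqDate, n, replace = TRUE)),
  PGender      = sample(c("Male", "Female"), n, replace = TRUE),
  PAge         = sample(15:75, n, replace = TRUE),
  DiabetesType = sample(c("Type 1", "Type 2"), n, replace = TRUE),
  PStatus      = sample(c("Poor", "Improved", "Excellent"), n, replace = TRUE)
)

str(MyPatientData)        # column names and types
head(MyPatientData, 10)   # first ten rows
```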

Now let us try a few operations on this data frame.

To list specific rows or columns of a data frame, index it as follows:

1. To display columns 1 (PID) and 3 (PGender):

> MyPatientData[, c(1,3)]

PID  PGender
1    Male
2    Male
3    Male
4    Female
5    Male
6    Female
7    Male
8    Female
9    Male
10   Female

2. To display all columns for a particular row, for example row 13 (which here is also patient ID 13):

> MyPatientData[13,]

    PID  ADate      PGender  PAge  DiabetesType  PStatus
13  13   2/24/2013  Female   69    Type 1        Poor

3. You can also select columns by name:

> MyPatientData[c("PGender", "PStatus")]

    PGender  PStatus
1   Male     Improved
2   Male     Poor
3   Male     Poor
4   Female   Poor
5   Male     Poor
6   Female   Excellent
7   Male     Poor
8   Female   Poor
9   Male     Excellent
10  Female   Poor
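Besides numeric and name-based indexing, a single column can be pulled out with $ or [[ ]], and rows can be picked with a logical condition. The sketch below uses a small hand-made data frame so it runs on its own. The name patients and its four rows are illustrative assumptions taken from the example table at the top.

```r
patients <- data.frame(
  PID     = 1:4,
  PGender = c("Male", "Female", "Male", "Female"),
  PAge    = c(15, 71, 52, 33),
  PStatus = c("Poor", "Improved", "Excellent", "Improved")
)

patients$PAge        # one column, returned as a plain vector
patients[["PAge"]]   # same result, with the column name given as a string

# Rows where the condition is TRUE, all columns
over50 <- patients[patients$PAge > 50, ]

# Filter rows and pick columns in one step
femaleStatus <- patients[patients$PGender == "Female", c("PID", "PStatus")]
```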

4. To cross-tabulate two columns, for example diabetes type against patient status:

> table(MyPatientData$DiabetesType, MyPatientData$PStatus)

        Excellent  Improved  Poor
Type 1  14         13        27
Type 2  16         6         24
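A contingency table like this can be extended with totals or converted to proportions. The sketch below uses two short hand-made vectors so it runs on its own. The values and the name counts are illustrative assumptions, not the 100-patient data.

```r
DiabetesType <- c("Type 1", "Type 2", "Type 1", "Type 2")
PStatus      <- c("Poor", "Improved", "Excellent", "Improved")

counts <- table(DiabetesType, PStatus)   # cross-tabulation of the two vectors

addmargins(counts)                # adds row and column totals
prop.table(counts, margin = 1)    # proportions within each row (each row sums to 1)
```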

5. To display the dataset in ascending order of age:

> sortnames <- c("PAge")

> MyPatientData[do.call("order", MyPatientData[sortnames]), ]

This will produce:

PID  ADate       PGender  PAge  DiabetesType  PStatus
94   12/11/2013  Male     15    Type 1        Excellent
97   12/18/2013  Male     15    Type 2        Excellent
93   12/9/2013   Male     16    Type 1        Poor
24   4/6/2013    Male     18    Type 2        Excellent
6    1/21/2013   Male     19    Type 1        Poor
26   4/22/2013   Male     19    Type 1        Excellent
38   5/31/2013   Female   19    Type 1        Improved
46   7/15/2013   Male     19    Type 2        Improved
71   10/5/2013   Female   20    Type 2        Poor
27   4/24/2013   Female   21    Type 1        Improved

The same approach works for multiple columns, for example sorting first on PAge and then on PGender:

> sortnames <- c("PAge", "PGender")

> MyPatientData[do.call("order", MyPatientData[sortnames]), ]

PID  ADate       PGender  PAge  DiabetesType  PStatus
94   12/11/2013  Male     15    Type 1        Excellent
97   12/18/2013  Male     15    Type 2        Excellent
93   12/9/2013   Male     16    Type 1        Poor
24   4/6/2013    Male     18    Type 2        Excellent
38   5/31/2013   Female   19    Type 1        Improved
6    1/21/2013   Male     19    Type 1        Poor
26   4/22/2013   Male     19    Type 1        Excellent
46   7/15/2013   Male     19    Type 2        Improved
71   10/5/2013   Female   20    Type 2        Poor
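When the sort columns are known in advance, calling order() directly is a common alternative to the do.call() form. The sketch below uses a small hand-made data frame so it runs on its own. The name patients and its four rows are illustrative assumptions.

```r
patients <- data.frame(
  PID     = 1:4,
  PGender = c("Male", "Female", "Male", "Female"),
  PAge    = c(15, 71, 52, 33)
)

# Ascending by age
byAge <- patients[order(patients$PAge), ]

# Descending by age: negate the numeric key
byAgeDesc <- patients[order(-patients$PAge), ]

# First by gender, then by age within each gender;
# with() lets us refer to the columns without repeating the data frame name
byGenderAge <- patients[with(patients, order(PGender, PAge)), ]
```

The do.call() form is still handy when the column names are held in a character vector, as with sortnames above.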

 

I hope you found this tutorial on data frames and some basic operations on them interesting and easy to follow.

If you have any queries, please feel free to contact me.

Good luck, and happy learning with R!

 MANOJ KUMAR
