In this post, I will provide a simple script for merging a set of files in a directory into a single, large dataset (or dataframe) using R.

Step 1: Setting the Working Directory to the Folder That Contains the Files to Be Merged

setwd("C:/target_dir/")

Step 2: Retrieving a List of Files in a Directory

For this, the list.files() function can be used.

This function simply fetches the names of all the files in the current working directory into a character vector (e.g. my_file_list below).

my_file_list <- list.files()

If you want it to list the files from a different directory, just pass the path of that directory to the list.files() function.

For example, if you want the files in the folder C:/myfiles/, you could use the following code:

my_file_list <- list.files("C:/myfiles/")
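
One caveat worth noting: by default list.files() returns bare file names, not full paths, so files listed from another folder cannot be opened by name unless that folder is also the working directory. Below is a minimal sketch of one way around this. It assumes the files are SPSS files with a .sav extension (the extension is my assumption), and it uses a temporary folder with empty placeholder files as a stand-in for C:/myfiles/:

```r
# Stand-in for C:/myfiles/: a temporary folder with empty placeholder files
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("wave1.sav", "wave2.sav", "notes.txt"))))

# Bare names only; these resolve only if demo_dir is the working directory
bare_names <- list.files(demo_dir)

# pattern keeps only names ending in .sav; full.names = TRUE prepends the folder
my_file_list <- list.files(demo_dir, pattern = "\\.sav$", full.names = TRUE)

print(basename(my_file_list))
```

Filtering with pattern also keeps stray files (notes, scripts, and so on) out of the merge.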

Step 3: Merging the Files into a Single Dataframe

The final step is to iterate through the list of files in the currently set working directory and fetch the data from these files to form a single, large dataframe.

When the script is run, it reads the first file in my_file_list and uses it to create the main dataframe (mydataset), into which the data from all the other files in the folder are merged.

This is controlled by an !exists conditional:

  • If mydataset doesn’t exist (!exists is true), it is created from the file being read.
  • If mydataset already exists, the file is read into a temporary dataframe called my_temp_dataset, which is then appended to mydataset using rbind(). Note that rbind() requires every file to have the same variables.
  • After each append, the temporary dataframe is removed using the rm(my_temp_dataset) command.
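
One thing to watch with this approach: exists("mydataset") is also TRUE when mydataset is left over from an earlier run in the same R session, in which case the loop appends to the old data instead of starting afresh. Here is a minimal sketch of a guard that could be placed before the loop (the leftover dataframe is made up for illustration):

```r
# Simulate a leftover object from an earlier run (illustrative data only)
mydataset <- data.frame(id = 1:3)

# Guard: drop any previous mydataset so the loop starts from scratch
if (exists("mydataset")) {
  rm(mydataset)
}

print(exists("mydataset"))
```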

Here’s the final code:

library(foreign)  # provides read.spss()

for (myfile in my_file_list) {
  if (!exists("mydataset")) {
    # if the dataframe mydataset doesn't exist, create it from the first file
    mydataset <- read.spss(myfile, to.data.frame = TRUE)
  } else {
    # otherwise, read the next file into a temporary dataframe and append it;
    # using else avoids reading the first file twice
    my_temp_dataset <- read.spss(myfile, to.data.frame = TRUE)
    mydataset <- rbind(mydataset, my_temp_dataset)
    rm(my_temp_dataset)
  }
}
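
Growing a dataframe with rbind() inside a loop copies the accumulated data on every pass, which can get slow with many files. A common alternative, sketched here as a suggestion, is to read every file into a list and bind the rows once. The helper name merge_files and its reader argument are hypothetical. The demo uses small CSV files and read.csv as a stand-in so it runs without any SPSS data, and it assumes all files share the same columns:

```r
# Hypothetical helper: read each file with `reader`, then bind all rows once
merge_files <- function(files, reader, ...) {
  pieces <- lapply(files, reader, ...)  # one dataframe per file
  do.call(rbind, pieces)                # single rbind over the whole list
}

# Demo with CSV stand-ins written to a temporary folder
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(10, 20)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(30, 40)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
mydataset <- merge_files(csv_files, read.csv)

print(nrow(mydataset))
```

With the SPSS files from this post, the equivalent call would be merge_files(my_file_list, foreign::read.spss, to.data.frame = TRUE).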

That’s it!
