Automate the opening of Excel files in R
How to automate the reading and opening of Excel file (or csv or other) in R
Goal: Open a large number of files in R
It can happen that you have a folder full of data files (Excel or other) and you need to import several or even all of them into R.
Doing it by hand can be long, time-consuming and error-prone… so let R works.
The process
The process is quite simple:
_ With the function list.files()
of the base
package, R can list all the files in a folder.
_ Choice of files to import: either all or a selection.
_ Import files by associating them with their original names.
Creation of dummy data sets
Before starting, you must create a folder containing data sets that can be re-imported.
I decided to create dummy .xlsx
Excel files but this is valid for any format .csv
…
openxlsx::write.xlsx(airquality, file = "airquality.xlsx")
openxlsx::write.xlsx(iris, file = "iris2.xlsx")
openxlsx::write.xlsx(Orange, file = "orange.xlsx")
openxlsx::write.xlsx(mtcars, file = "mtcars3.xlsx")
Importing files
Now the folder is created, I will be able to import them.
I start by re-importing all of them.
library(tidyverse)
files_excel <- list.files(path = here::here("content", "post"), full.names=TRUE) %>%
str_subset(".xlsx") %>%
map(openxlsx::read.xlsx) %>%
set_names(list.files(path = here::here("content", "post"), full.names=TRUE) %>%
str_subset(".xlsx") %>%
str_extract("\\w+.xlsx") %>%
str_remove(".xlsx"))
str(files_excel)
## List of 4
## $ airquality:'data.frame': 153 obs. of 6 variables:
## ..$ Ozone : num [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
## ..$ Solar.R: num [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
## ..$ Wind : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## ..$ Temp : num [1:153] 67 72 74 62 56 66 65 59 61 69 ...
## ..$ Month : num [1:153] 5 5 5 5 5 5 5 5 5 5 ...
## ..$ Day : num [1:153] 1 2 3 4 5 6 7 8 9 10 ...
## $ iris2 :'data.frame': 150 obs. of 5 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
## $ mtcars3 :'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ orange :'data.frame': 35 obs. of 3 variables:
## ..$ Tree : chr [1:35] "1" "1" "1" "1" ...
## ..$ age : num [1:35] 118 484 664 1004 1231 ...
## ..$ circumference: num [1:35] 30 58 87 115 120 142 145 33 69 111 ...
So I now have 4 data files in my environment that I can use.
Of course, it is possible to do the same with .csv
files by using the read.csv()
function and putting “.csv” instead of “.xlsx”.
Maybe you noticed that I use a very simple regular expression which allows me to select only the files whose format interests me.
A “regular expression” is a string of characters encoded in a language that will allow you to find strings that obey the condition. Well, it’s not very clear, so I’ll take an example.
When you take a picture with your phone or camera, it will give it a name composed of a fixed base often “img” and a variable part often linked to the date.
So, if you want to find all the photos taken, you can search all the files with the name “img” or “img2018”.
For you the names of linked files can be obvious to find, it is a place or a machine name, etc…. but for R it is not so obvious, you have to teach him the format you are looking for.
For example, to be able to find files in the form “Paris20-03-2014”, “Foljuif10-04-2012”, the regular expression would be“([A-z]+)\w+ ([0-9]{2}[:punct:][0-9]{2}[:punct:][0-9]{4})”.
I therefore explain to R that I am looking for files with at the beginning an alphabet string followed by a number in the format 00-00-00-0000.
I’m not a regular expression specialist. If you know more about this, I recommend the french article by Lise Vaudor that I read every time I have a doubt!
Well, now that this is a little clearer (I hope so), I’m going to use a very simple regular expression that will allow me to retrieve only Excel files whose name has a number at the end.
files_excel_bis <- list.files(path = here::here("content", "post"), full.names=TRUE) %>%
str_subset("[0-9].xlsx") %>%
map(openxlsx::read.xlsx) %>%
set_names(list.files(path = here::here("content", "post"), full.names=TRUE) %>%
str_subset("[0-9].xlsx") %>%
str_extract("\\w+.xlsx") %>%
str_remove("[0-9].xlsx"))
str(files_excel_bis)
## List of 2
## $ iris :'data.frame': 150 obs. of 5 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
## $ mtcars:'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
Here we are, hoping that this article will be a little useful to you!
If you have other methods or comments, feel free to contact me.
Good coding!
Share this post
Twitter
Google+
Facebook
Reddit
LinkedIn
StumbleUpon
Pinterest
Email