Start with the data

Check the distribution of the data and the relationships between factors

April 12, 2018 Marie Vaugoyeau


The distribution of each variable
The multi-test of correlations

After verifying that there are no typos in the data and before starting the actual statistical analysis, I check the distribution of the data and the relationships between factors.

For this blog article, I use the airquality dataset that ships with R.

The distribution of each variable

To see how the values of each variable are distributed over time, I use ggplot2.

library(tidyverse)
D<-airquality

# First, reshape the data into tidy (long) format and build an approximate time axis in days
D_tidy <- D %>%
  tidyr::gather(VAR, value, -Month, -Day) %>%
  mutate(Time = Month * 30.5 + Day)

# One panel per variable, with its own y scale; the line colour shows the month
ggplot(D_tidy) +
  geom_line(aes(x = Time, y = value, color = Month)) +
  scale_color_distiller("Month", palette = "Spectral") +
  facet_wrap(~VAR, scales = "free_y") +
  theme_minimal() +
  ylab("")

We can see that there are missing values in Ozone, but this is not an issue here.
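To check this numerically rather than by eye, the missing values can be counted per variable. This is a minimal sketch in base R; the name na_counts is only illustrative and does not come from the original code.

```r
# Count the missing values in each column of the built-in airquality data
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

cor.test, used in the next section, drops incomplete pairs of observations before computing the correlation, which is why these gaps do not block the analysis.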

The multi-test of correlations

To see which variables vary together, I test the correlations with cor.test.
I use the following R code, modified from Zuur et al. (2010).

D<-airquality

par(bty="n") # I prefer plots without a box around them, but this is optional

# I rewrote panel.smooth to change the shape of the points and the thickness of the red line
panel.smooth2 <- function(x, y, col = par("col"), bg = NA, pch = par("pch"),
    cex = 1, col.smooth = "red", span = 2/3, iter = 3, ...)
{
  # pch = 20 (small filled dots) deliberately overrides the pch argument
  points(x, y, pch = 20, col = col, bg = bg, cex = cex)
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))
    lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
          col = col.smooth, lwd = 2, ...)
}

# I modified panel.cor to show the estimate only when cor.test is significant
panel.cor <- function(x, y, digits = 1, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r1 <- cor.test(x, y)                                # test on complete pairs only
  r <- abs(cor(x, y, use = "pairwise.complete.obs"))  # used to scale the text size

  txt <- format(c(r1$estimate, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  cex <- if (missing(cex.cor)) 0.9 / strwidth(txt) else cex.cor
  # Leave the panel blank unless the p-value is below 0.005
  if (r1$p.value < 0.005) text(0.5, 0.5, txt, cex = cex * r)
}

pairs(D, lower.panel = panel.smooth2, upper.panel=panel.cor, cex.labels=1.3)

My main modification is to leave the panel blank when the correlation test is not significant.

In the upper triangle, each number is the estimated correlation between the row variable and the column variable, and its size is proportional to the absolute value of the correlation.
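The same information can also be collected in a table instead of a plot. The sketch below is one possible way to do it, not code from Zuur et al. (2010): it runs cor.test on every pair of columns and keeps the estimate only when the p-value is below the 0.005 threshold used in panel.cor. The names alpha, vars and sig_cor are hypothetical.

```r
D <- airquality
alpha <- 0.005                      # assumed: same threshold as in panel.cor
vars <- names(D)

# Square matrix of estimates; NA marks the diagonal and non-significant pairs
sig_cor <- matrix(NA_real_, length(vars), length(vars),
                  dimnames = list(vars, vars))

for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i == j) next
    # Pearson correlation by default; incomplete pairs are dropped
    test <- cor.test(D[[i]], D[[j]])
    if (test$p.value < alpha) sig_cor[i, j] <- unname(test$estimate)
  }
}

print(round(sig_cor, 2))
```

Unlike the plot, the table keeps the sign and full precision of each estimate, which is handy when the values are needed later in the analysis.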

And you, how do you test multiple correlations?
