Learning to code in R
One of the major tools that I didn't anticipate being so important in research is statistical computing. Once you have done the experiment and collected the data, you are left with a pile of numbers. The numbers are organized in a spreadsheet, and then you have to do something with them. There are many approaches to data analysis, and many software packages available for it. Excel is the most common and basic program for storing and manipulating spreadsheets, and you can do a lot with spreadsheets in Excel. Once you get into more complicated statistics, there are other programs that are better suited, such as SAS, SPSS, Stata, Minitab, etc. These programs have a graphical user interface, a point-and-click setup in which you select options from dropdown menus and sequential pop-up boxes. I'm sure these work well, though I have little experience with any of them.
In my program (and I believe this is widely true in research), writing code is a more popular way to work with data. Common programming languages for statistical analysis and engineering include MATLAB, Python, and R. In addition to being powerful and flexible, code is valuable because it gives you reproducible workflows. When you use point-and-click software, you can lose track of the steps you took, especially if there are a lot of them. Maybe you want to run the analysis again but change one thing: you have to start over from the beginning. Maybe someone wants to do the same kind of analysis you did, but you didn't document every step, so they can't verify your results. In code, every step is explicit and documented. You can always rerun code to retry a test with different data or change a setting. You can also share code.
R is the most common language used in my field, ecology. From the R Project website: “R is a language and environment for statistical computing and graphics.” I knew nothing about R when I started my degree at Rutgers 4.5 years ago, and I had very little experience with coding before that. Until now, I have been dabbling in R and getting familiar with it, but the time has come to sift through my three years of data and figure out what it all means. I am getting up to speed with using R on my own data now, and it's fun to do, like figuring out a series of puzzles.
Because I am in R mode in my own research, I also went into R mode for the class I TA. I am a TA in Freshwater Ecology, and we are remote again this semester. I thought learning a bit of R would be a good topic for remote learning, so I developed a series of lab exercises to graph data related to freshwater ecology in R. I am going to share the first one with you, so you can get a small idea of how R works and what it can do. The software is open source and free to download. You can also run R in your browser: go to https://jupyter.org/try and scroll to “Try Jupyter with R.” (There is example code already in the “notebook” that will open, but you can scroll down and paste code into the empty cell at the bottom.)
To run this code, copy and paste each section into the source pane or a new cell and hit the “Run” button. The code requires two add-on libraries, known as “packages”, to be installed: curl and ggplot2. If you are running R on your own computer and have never installed them, you need to install them first by removing the # from the start of each of the two lines below and running them. If you are using the web browser, you can skip this step.
#install.packages("curl")
#install.packages("ggplot2")
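If you are not sure whether the packages are already on your computer, one possible approach is to check first and install only what is missing. This is a minimal sketch, not part of the original lab; the names needed and to_install are made up for illustration, and it assumes your R installation already has a download mirror configured (RStudio sets one up by default).

```r
# Packages this lab uses
needed <- c("curl", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not installed yet
to_install <- needed[!installed]

# Install the missing ones (this does nothing if the list is empty)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this a second time should skip the installation, because both packages will already be found.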
Then the “library()” calls load those packages for the current session.
library(curl)
library(ggplot2)
Next is a tiny first program in R. We make an object called “myString” and then display it:
myString <- "Hello, World!"
print(myString)
## [1] "Hello, World!"
Look below the code window for the results of the code. What did this return for you? Was it: [1] “Hello, World!”? If so, sounds like it worked.
Next, we will load a dataset from the internet. The data has been made available by @allisonhorst, who makes amazing statistics and R cartoons and other content. We will name it “data_import”.
data_import <- read.csv(curl("https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/other-stats-artwork/shark_raw.csv"))
To see what is in the data that you loaded, you can use head(), which shows the column labels and the first six rows of data.
head(data_import)
## x y
## 1 10 314
## 2 10 314
## 3 11 314
## 4 11 314
## 5 11 314
## 6 11 314
We see that this dataset is composed of two columns of data, labeled x and y. Those labels are very generic, and it's hard to tell what's going on just by looking at the data_import table. Let’s plot the data to visualize it. There are two numerical variables, so a scatterplot is appropriate. We’ll use ggplot2, a popular package for graphing. The first line says which dataset to use and which variables to use for x and y in the plot. The second line says what kind of plot to make (bar chart, line chart, box and whisker, etc.). We are making a scatterplot, so we are using point geometry:
ggplot(data_import, aes(x=x, y=y)) +
geom_point()
Hello shark! You must be a Glyphis river shark, one of the few types of shark that spend most of their time in freshwater. This is not real ecological data; it is just a fun example of how to make a graph in R. In this example, the shark outline is made up of 2705 separate points that look connected when we view the graph at this scale.
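If you want to count the points in a dataset yourself, base R has a few built-in functions for that. The sketch below uses a tiny made-up table called toy (hypothetical values, not the shark data) so that it runs on its own; with the shark data loaded, you would put data_import in place of toy.

```r
# A tiny stand-in table with the same column names as data_import
# (values are made up for illustration)
toy <- data.frame(x = c(10, 10, 11, 11, 11),
                  y = c(314, 314, 314, 314, 314))

nrow(toy)     # number of rows, which is the number of points plotted
str(toy)      # column names and the type of data in each column
summary(toy)  # minimum, maximum, mean, and quartiles for each column
```

Here nrow(toy) returns 5. Running nrow(data_import) on the shark data should give the number of points in the outline.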
Let's change the color of the points to be more shark-like.
ggplot(data_import, aes(x=x, y=y)) +
geom_point(color = "gray")
You can continue to add things to your scatterplot to change its appearance, including axis labels, colors, shapes, legends, and more. The overall appearance of the plot is controlled by the theme. Let’s change a theme option to make the shark swim in dark blue water.
ggplot(data_import, aes(x=x, y=y)) +
geom_point(color = "gray") +
theme(panel.background = element_rect(fill = 'darkblue'))
In this example, color is defined by names that are recognized by R. Many standard color names are defined and will work, and you can look at more color options here: https://www.r-graph-gallery.com/ggplot2-color.html.
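Colors can also be given as hex codes, and axis labels and a title can be added with labs(). Here is a small sketch of both, using a made-up circle of points called toy instead of the shark so that it runs without downloading anything; the label text and the object names are just placeholders.

```r
library(ggplot2)

# Made-up stand-in data: 100 points arranged in a circle
angle <- seq(0, 2 * pi, length.out = 100)
toy <- data.frame(x = cos(angle), y = sin(angle))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "#808080") +                            # hex code for a medium gray
  labs(title = "Toy outline", x = "x position", y = "y position") +
  theme(panel.background = element_rect(fill = "#00008B"))   # hex code for darkblue

p  # typing the name of the plot object displays it
```

To try this on the shark, swap data_import in for toy and keep the rest of the layers the same.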