---
title: "LAB 1"
author: "Michael Kummer"
output: pdf_document
---

# Introduction to R

**What is R?**

"R is a language and environment for statistical computing and graphics." (http://www.r-project.org/about.html)

- Download R: http://cran.r-project.org/
- Install.

**What is RStudio?**

"RStudio is a free and open source integrated development environment (IDE) for R, a programming language for statistical computing and graphics." (http://en.wikipedia.org/wiki/RStudio)

- Download RStudio: http://www.rstudio.com/
- Install.
- Open RStudio (not R): If you have a Windows PC and RStudio does not appear on your desktop, search for it in the Start menu.

During this course we will always work in RStudio, as it is more user-friendly.

## RStudio screen

The RStudio screen is divided into four panels, some of which are subdivided into tabs. The main panels are:

**1. Console**

- Here you can type commands and see their output.

**2. Environment and history**

- Environment: This tab stores the things you create during your R session.
- History: This is a record of your activity; you can save it for later use.

**3. Files, plots, packages and help**

- Files: This tab shows you the files and folders in your workspace (or work folder).
- Plots: This tab shows all the plots/graphics you produce.
- Packages: This tab shows a list of the add-ons you have available and indicates whether they are on or off.
- Help: This tab gives you a search box where you can search for additional information.

**4. The R script(s) and data view**

- The R script is where you keep a record of your work.

## Packages

Think of R packages as smartphone applications. A large number of packages are available, serving different purposes. To install a package, type:

*install.packages("write.name.of.package.here")*

You need to be connected to the internet in order to install packages.
When installing a package, do not worry about the messages in red that appear in your console unless they explicitly say "error". You only need to install each package once.

Once you have installed a package, you can turn it on by typing:

*library(write.name.of.package.here)*

The Packages tab shows which packages you have available. If a package is checked, it is active; if it is not checked, you will need to select it in order to use it.

## Working Directory

At any moment, if you want to check what your current working directory is, you can type:

*getwd()*

During an R session, if you want to work from a folder other than your default folder, you can type:

*setwd("write/path/to/folder/here")*

You can check which files you have in your working directory by running the following command:

*list.files()*

# Getting Started

## Using the Console as a calculator

You can use the Console just like a calculator (a very sophisticated calculator). Try performing some calculations:

```{r}
1+1
3*4
24/6
(2*10) - (3*4)
2^3
8^(1/3)
```

R has built-in mathematical functions, for example:

```{r}
sqrt(25) # square root
log(1) # natural log
```

You can store the results of your calculations in your work environment by giving them names. To name any kind of object, you use "<-". Let's try:

```{r}
my.sum <- 10 + 10 # save result
my.sum # display result
```

Once you store an object in your environment, you can interact with it directly by using its name:

```{r}
my.sum/10
```

## R scripts

When working with R, you should keep a record of your code so that you can keep track of what you have done and re-use it later. In this course we will always work in R scripts and **not** in the Console.

### Create a new R script

- You can open a new R script by clicking on the blank sheet with a green plus sign in the top left corner of your screen and selecting the option "R script".
- You can then save your script, for instance as "Lab1-notes.R".

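
As a rough illustration of what such a record can look like, the sketch below shows possible contents for a first script. The file name comes from the example above; the choice of commands is only an assumption, reusing calculations from earlier in this lab. Lines starting with "#" are comments that R ignores.

```{r}
# Lab1-notes.R: a hypothetical first script (contents are only an example)

# Check which folder R is currently working from
getwd()

# Store the result of a calculation under a name
my.sum <- 10 + 10

# Re-use the stored object
my.sum / 10
```

Keeping short comments like these next to each command makes the script easier to re-read later.
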
## Working on an R script

- To send a command line from the R script to the Console, put the cursor on that line and press Ctrl+Enter (PC users) or Cmd+Enter (Mac users).
- You can also select several command lines and run them all at once.
- You can write comments in your R script (text that the program will ignore) by putting the symbol "#" at the beginning of a line or phrase, for instance:

```{r}
# This is a comment; it will be ignored by the program.
```

- To re-run a command you have already typed in your script, just put the cursor on it and run it again; you do not need to retype it.
- At the end, don't forget to save your script before you close RStudio.
- R is **case sensitive** and **spelling sensitive**.

# Loading Data into RStudio

On the course's page you will find an example dataset. Please do the following:

- Download the dataset to your computer.
- Either save the file directly in your TopicsDig folder or copy it from your downloads folder to your TopicsDig folder.
- Type the following in your R script:

```{r}
setwd("C:/TopicsDig/Labs") # change this path to your own folder
load("C:/TopicsDig/Labs/datasets/ceosal2.RData")
```

You can see that two new objects appeared in your work environment:

- data: This is a table with data on CEO salaries.
- desc: This is a table with the description of each variable in data.

When loading files into R, you need to use a different function depending on the file's extension. The "load" function we just used opens files of the type ".RData". If you want to load a ".csv" file, you should use the function "read.csv" instead of "load".

# Exploring Data in R

During this class we will work with a data format called "data.table".
In order to convert your data into this format you will need to do the following:

Install the data.table package:

*install.packages("data.table")*

Activate the data.table package:

```{r}
library(data.table)
```

Now we will convert our dataset "data" into the data.table format. For our own convenience, we will include the prefix "dt." in the name of all objects of the data.table type. It is also useful to give our objects or datasets names that reflect their contents. In this case, a good name for our dataset would be, for instance, "dt.ceo.salaries":

```{r}
dt.ceo.salaries <- data.table(data)
```

You can see that a new object "dt.ceo.salaries" appeared in your work environment. This is a duplicate of the dataset "data". As we don't need the original dataset, we can delete it by typing the following:

```{r}
rm(data)
```

"rm" stands for "remove"; it deletes objects from your work environment. You cannot undo this action.

There are many options for exploring your data in R. You can get the names of the variables in your dataset by typing:

```{r}
names(dt.ceo.salaries)
# or
colnames(dt.ceo.salaries)
```

You can count the number of columns in your dataset:

```{r}
ncol(dt.ceo.salaries)
```

And count the number of observations (rows):

```{r}
nrow(dt.ceo.salaries)
```

You can look at the first rows in your data:

```{r}
head(dt.ceo.salaries)
```

Or the last rows:

```{r}
tail(dt.ceo.salaries)
```

You can also view your entire dataset by typing:

```{r}
View(dt.ceo.salaries)
```

As R is sensitive to capitalization and spelling, the above command will not work if you type "view" instead of "View". Whenever your code is not working, first check that you got your spelling and capitalization right.

You can also use some of the data.table features to explore your data.
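
Before turning to those features, the exploration functions shown so far can be tried on a small made-up table if you do not have the course dataset at hand. This is only a sketch: the table "dt.toy" and its values are hypothetical and are not taken from the CEO salary data.

```{r}
library(data.table)

# Hypothetical toy data: four made-up CEOs (not taken from ceosal2)
dt.toy <- data.table(age    = c(52, 41, 60, 38),
                     salary = c(900, 450, 1200, 300))

names(dt.toy)    # variable names
ncol(dt.toy)     # number of columns
nrow(dt.toy)     # number of rows
head(dt.toy, 2)  # first two rows
tail(dt.toy, 2)  # last two rows
```

The same commands work unchanged on "dt.ceo.salaries"; only the table name differs.
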
In data.tables, when you write the table's name followed by square brackets with a comma in the middle, the space before the comma refers to the table's rows, and the space after the comma refers to the table's columns. Leaving the space before the comma blank selects all rows, leaving the space after the comma blank selects all columns, and leaving both blank selects everything.

```{r}
dt.ceo.salaries[1, ] # shows first row and all columns
dt.ceo.salaries[ , salary] # shows all rows of variable "salary"
dt.ceo.salaries[1, salary] # shows first row of variable "salary"
dt.ceo.salaries[1:10, list(salary, age)] # shows first ten rows of the variables "salary" and "age"
```

Ordering the data:

```{r}
dt.ceo.salaries[order(age)] # order ascending (default)
dt.ceo.salaries[order(-age)] # order descending
```

Subsetting the data:

```{r}
dt.ceo.salaries[age<=45,] # select only CEOs aged 45 or younger
dt.young.ceo.salaries <- dt.ceo.salaries[age<=45,] # creates a new data table
```

Subsetting the data using multiple conditions (use the symbol "&" for *and* and the symbol "|" for *or*):

```{r}
dt.ceo.salaries[age<=45 & grad==1,]
```

Adding a new variable to the data.table:

```{r}
dt.ceo.salaries[, log_salary:=log(salary)]
dt.ceo.salaries[, age_squared:=age^2]
```

Deleting a variable from the data.table:

```{r}
dt.ceo.salaries[, log_salary:=NULL]
```

# References

http://dss.princeton.edu/training/RStudio101.pdf

# Additional Resources

- Data table cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf

# Acknowledgements and Thanks:

This lab is based on material by M Godinho de Matos, R Belo and F Reis. Gratefully acknowledged!