The Structure of Data
Data can come to us in many forms. If you collect data yourself, you may start out with numbers written on scraps of paper. Or you may get a computer file filled with numbers and words of various sorts, each representing the value of some sampled object on some variable of interest.
Regardless of how the data start out, it is necessary to organize and format data so that they are easy to analyze using statistical software. There is no one way to organize data, but there is a way that is most common, and that is what we recommend you use.
Statistician Hadley Wickham came up with the concept of what he calls “Tidy Data.” Tidy data is a way of organizing data into rectangular tables, with rows and columns, according to the following principles:
1. Each column is a variable
2. Each row is an observation (what we have been calling a case, or an object to which a measure is attached)
3. Each type of observation (or case) is kept in a different table (more on this below)
Rectangular tables of this sort are represented in R using a data frame. The columns are the variables; this is where the results of measures are kept. The rows are the cases sampled. Data frames provide a way to save information such as column headings (i.e., variable names) in the same table as the actual data values.
Principle 3 above simply states that the types of observations that form the rows cannot be mixed within a single table. So, for example, you wouldn’t have rows of college students intermixed with rows of cars or countries or couples. If you have a mix of observation types (e.g., students, families, countries), they each go in a different table.
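As a minimal sketch of these principles, the code below builds two small data frames in R. The data frame names (students, countries) and every value in them are made up for illustration; they are not from any data set used in this course.

```r
# Made-up example data: each row is one student (an observation),
# and each column is one variable measured on that student.
students <- data.frame(
  Name   = c("Ana", "Ben", "Chloe"),
  Sex    = c("female", "male", "female"),
  Weight = c(120, 150, 135)
)

# Countries are a different type of observation, so they go in their own table
# rather than being mixed in with the rows of students.
countries <- data.frame(
  Country   = c("Freedonia", "Sylvania"),  # fictional countries
  Happiness = c(5.5, 6.1)
)

nrow(students)   # number of observations (rows)
names(students)  # variable names (column headings), stored with the data
```

Keeping students and countries in separate data frames means every row in a given table can be described by the same set of variables.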
One challenge for students is to keep track of the difference between an observation (e.g., housekeepers or countries), a variable (e.g., Weight or Sex or Happiness, represented in columns), and the values a variable can take (e.g., 120, male, 5.5).
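One way to keep these three ideas apart is to pull each one out of a data frame separately. The sketch below assumes a small made-up data frame called people; the name and the values are illustrative only.

```r
# Made-up data frame: three people (observations) measured on two variables.
people <- data.frame(
  Sex    = c("male", "female", "female"),
  Weight = c(120, 135, 150)
)

people[1, ]       # one observation: the entire first row
people$Weight     # one variable: the entire Weight column
people$Weight[1]  # one value: the Weight recorded for the first observation
```

An observation is a whole row, a variable is a whole column, and a value is the single cell where one row and one column meet.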
In this course we will be providing most of the data you analyze in a tidy format. You’ve already been using this format for a bit as we explore data. But now we are making it explicit: it’s important to think of data as rows and columns, with rows as observations and columns as variables. In the future, you may have to transform a non-tidy data set into a tidy one.
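To give a rough idea of what such a transformation might look like, here is a sketch using only base R. It assumes a made-up table in which the column headings (control, treatment) are really values of a variable, and each cell is the score of a different participant. The names untidy, tidy, Condition, and Score are hypothetical.

```r
# Made-up, non-tidy layout: one column per group, so the variable
# "Condition" is hidden in the column headings.
untidy <- data.frame(
  control   = c(3, 5, 4),
  treatment = c(6, 7, 5)
)

# Tidy layout: one row per participant, with Condition and Score
# as explicit variables (columns).
tidy <- data.frame(
  Condition = rep(c("control", "treatment"), each = nrow(untidy)),
  Score     = c(untidy$control, untidy$treatment)
)

tidy
```

In practice, tools such as pivot_longer() from the tidyr package are commonly used for this kind of reshaping; the manual version here is only meant to show what changes between the two layouts.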
Getting Data Into DataCamp
In this course, we have pre-loaded most of the data sets we use into DataCamp. But you may want to import your own data into DataCamp. In this section we will teach you one simple way to do that.
The easiest way to get your data into DataCamp is through Google Sheets. Here’s a step-by-step guide:
- Get your data into tidy format - rows and columns.
- Copy/Paste (or enter) your data into a Google Sheet.
- Once in the Google Sheet, go to the File menu and select Publish to the Web. Where it says Web Page in the drop down menu, change it to Comma-separated values (.csv) (see picture). Then click the Publish button.
- Copy the shareable link (highlighted in blue) to the clipboard.
- Open your DataCamp sandbox and run this code:
DataFrameName <- read.csv("https://url.com", header=TRUE)
Be sure to replace the url (between the quotes) with your shareable link, and replace DataFrameName with a name of your choice.
Note: the header=TRUE argument indicates that the first row of the data file contains the variable names. If it doesn’t, use header=FALSE instead; simply omitting the argument will not work, because read.csv() assumes header=TRUE by default.
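To see what the header argument does without needing a published Google Sheet, the sketch below reads CSV text directly from a string using the text argument of read.csv(). The variable names and values are made up for illustration.

```r
# Made-up CSV content whose first row holds the variable names.
csv_with_names <- "Height,Weight
60,120
65,150"
with_header <- read.csv(text = csv_with_names, header = TRUE)
names(with_header)  # the names come from the first row of the file

# Made-up CSV content with no row of variable names.
csv_without_names <- "60,120
65,150"
no_header <- read.csv(text = csv_without_names, header = FALSE)
names(no_header)    # R supplies default names (V1, V2)
```

With a real shareable link, the call has the same shape, with the link (in quotes) as the first argument in place of text.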
Give It a Try
Okay, let’s see if you can import some data from a study by DeLoache, Miller & Rosengren (1997) into DataCamp. (Click here if you want to download the research article.)
First, get the data file. We’ve saved the data in a .csv file. Click here to download the data to your browser.
Save the data to your computer as a .csv file (using the File / Save As command), and then import it into a Google Sheet (using File / Import in Google Sheets).
Next, use the DataCamp window below, and the instructions above, to see if you can get the data into DataCamp. Call the data frame deloach1997.
Once you have imported the data and created the data frame, try running str() to see what the data frame contains. It should have 32 observations and 4 variables: Age, Gender, Condition, Retrievals.