Introduction to Statistics: A Modeling Approach

The Structure of Data

Data can come to us in many forms. If you collect data yourself, you may start out with numbers written on scraps of paper. Or you may get a computer file filled with numbers and words of various sorts, each representing the value of some sampled object on some variable of interest.

Regardless of how the data start out, it is necessary to organize and format data so that they are easy to analyze using statistical software. There is no one way to organize data, but there is a way that is most common, and that is what we recommend you use.

Statistician Hadley Wickham came up with the concept of what he calls “Tidy Data.” Tidy data is a way of organizing data into rectangular tables, with rows and columns, according to the following principles:

  1. Each column is a variable

  2. Each row is an observation (what we have been calling a case, or an object to which a measure is attached)

  3. Each type of observation (or case) is kept in a different table (more on this below)

Rectangular tables of this sort are represented in R using a data frame. The columns are the variables; this is where the results of measures are kept. The rows are the cases sampled. Data frames provide a way to save information such as column headings (i.e., variable names) in the same table as the actual data values.

Principle 3 above simply states that the types of observations that form the rows cannot be mixed within a single table. So, for example, you wouldn’t have rows of college students intermixed with rows of cars or countries or couples. If you have a mix of observation types (e.g., students, families, countries), they each go in a different table.
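As a minimal sketch of these principles in R, the code below builds two small data frames, one per type of observation. The names students and countries, and every value in them, are made up for illustration.

```r
# Hypothetical table 1: each row is one student, each column one variable.
students <- data.frame(
  Name   = c("Ana", "Ben", "Chloe"),
  Sex    = c("female", "male", "female"),
  Weight = c(120, 155, 132)
)

# Hypothetical table 2: countries are a different type of observation,
# so they get a table of their own rather than extra rows in students.
countries <- data.frame(
  Country   = c("Chile", "Norway"),
  Happiness = c(6.2, 7.5)
)

# The variable names are stored in the same object as the data values.
names(students)
names(countries)
```

Keeping the two tables separate means every row in a given data frame can be read the same way: one row, one student (or one country).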

One challenge for students is to keep track of the difference between an observation (e.g., housekeepers or countries), a variable (e.g., Weight or Sex or Happiness, represented in columns), and the values a variable can take (e.g., 120, male, 5.5).
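One way to keep the three ideas apart is to pull each one out of a data frame in R. The sketch below uses a hypothetical data frame called people, with invented values.

```r
# Hypothetical data frame; all values are invented for illustration.
people <- data.frame(
  Sex       = c("male", "female", "female"),
  Weight    = c(120, 135, 150),
  Happiness = c(5.5, 6.0, 4.5)
)

nrow(people)       # how many observations (rows) there are
names(people)      # the variables (columns)
people[1, ]        # one observation: every variable for the first case
people$Weight      # one variable: its values across all cases
people$Weight[1]   # one value: the first case's Weight
```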

In this course we will be providing most of the data you analyze in a tidy format. You’ve already been using this format for a bit as we explore data. But now we are making it explicit: it’s important to think of data as rows and columns, with rows as observations and columns as variables. In the future, you may have to transform a non-tidy data set into a tidy one.
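To give a sense of what such a transformation can look like, here is a small sketch in base R. Whether a table counts as tidy depends on what you treat as the observation; here we assume each test score is one observation. The data frame names wide and tidy, and all the values, are hypothetical.

```r
# Hypothetical non-tidy table: one row per student, but the scores
# are spread over two columns (one per testing occasion).
wide <- data.frame(
  Student = c("Ana", "Ben"),
  Pre     = c(10, 12),
  Post    = c(14, 15)
)

# Tidy version: each row is one score (a student on one occasion),
# and Occasion and Score are each a single column.
tidy <- data.frame(
  Student  = rep(wide$Student, times = 2),
  Occasion = rep(c("Pre", "Post"), each = nrow(wide)),
  Score    = c(wide$Pre, wide$Post)
)

tidy
```

Packages in the tidyverse offer functions for this kind of reshaping, but the base R version above shows the idea without any extra tools.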

Getting Data Into DataCamp

In this course, we have pre-loaded most of the data sets we use into DataCamp. But you may want to import your own data into DataCamp. In this section we will teach you one simple way to do that.

The easiest way to get your data into DataCamp is through Google Sheets. Here’s a step-by-step guide:

  1. Get your data into tidy format - rows and columns.
  2. Copy/Paste (or enter) your data into a Google Sheet.
  3. Once in the Google Sheet, go to the File menu and select Publish to the Web. Where it says Web Page in the drop-down menu, change it to Comma-separated values (.csv). Then click the Publish button.

  4. Copy the shareable link (highlighted in blue) to the clipboard.
  5. Open your DataCamp sandbox and run this code:
DataFrameName <- read.csv("https://url.com", header=TRUE)

Be sure to replace the url (between the quotes) with your shareable link, and replace DataFrameName with a name of your choice.

Note: the header=TRUE argument indicates that the first row of the data file contains the variable names. This is also the default for read.csv(), so if your file has no header row, omitting the argument is not enough; write header=FALSE instead.
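To see what the header argument does, the sketch below reads the same tiny CSV twice. The CSV content is made up, and it is passed in through the text argument of read.csv() rather than a URL so that the example runs on its own.

```r
# A tiny made-up CSV, supplied as text so no URL is needed.
csv_text <- "Age,Score\n30,5\n41,7"

# header = TRUE: the first row supplies the variable names.
with_names <- read.csv(text = csv_text, header = TRUE)

# header = FALSE: the first row is read as data, and R assigns
# default variable names (V1, V2, ...).
no_names <- read.csv(text = csv_text, header = FALSE)

names(with_names)
nrow(with_names)
names(no_names)
nrow(no_names)
```

Comparing the row counts is a quick check: if a header row was mistakenly read as data, the data frame has one more row than expected.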

Give It a Try

Okay, let’s see if you can upload some data from a study by DeLoache, Miller & Rosengren (1997) into DataCamp. (Click here if you want to download the research article.)

First, get the data file. We’ve saved the data in a .csv file. Click here to download the data to your browser.

Save the data to your computer as a .csv file (using the File / Save As command), and then import it into a Google Sheet (using File / Import in Google Sheets).

Next, use the DataCamp window below, and the instructions above, to see if you can get the data into DataCamp. Call the data frame deloach1997.

Once you have imported the data and created the data frame, try running str() to see what the data frame contains. It should have 32 observations, and 4 variables: Age, Gender, Condition, Retrievals.
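The same kind of check can be sketched on a stand-in data frame. The data frame toy below borrows the four variable names listed above, but its values are invented placeholders and are not the study's data.

```r
# Stand-in data frame with the variable names described in the text.
# The values are invented placeholders, NOT the study's data.
toy <- data.frame(
  Age        = c(30, 32),
  Gender     = c("F", "M"),
  Condition  = c("A", "B"),
  Retrievals = c(1, 3)
)

str(toy)     # compact summary: dimensions, variable names, and types
nrow(toy)    # number of observations
ncol(toy)    # number of variables
names(toy)   # variable names
```

Running the same four functions on your imported data frame lets you compare the counts and names against what you expect.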

[Interactive DataCamp window: ch2-15]
