Yesterday I learned that the first step of mastering statistics is to master the art of exploring data.
Exploring Data Types
Categorical data sorts observations into groups; the values are labels. In the R programming language, they are called factors. To explore categorical data, you generally use a bar chart or pie chart. The distribution of a categorical variable is described by counts, frequencies, or percentages.
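As a rough illustration of describing a categorical variable by counts and percentages, here is a minimal Python sketch. The `eye_colors` sample is made up for this example, and the text bars are only a stand-in for a real bar chart.

```python
from collections import Counter

# Hypothetical sample: eye colors recorded for ten people.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

counts = Counter(eye_colors)      # how many times each label appears
total = sum(counts.values())      # number of observations
percentages = {label: 100 * n / total for label, n in counts.items()}

# Print one row per category, most common first, with a crude text bar.
for label, n in counts.most_common():
    print(f"{label:<6} {n:>2}  {percentages[label]:5.1f}%  {'#' * n}")
```

The same counts could be passed to a plotting library to draw the bar chart or pie chart.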
For quantitative data, you would use a histogram, line chart, or stem plot (only if the data set is small).
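To show why a stem plot suits small data sets, here is a small sketch that builds one by hand. The `scores` list and the `stem_plot` helper are hypothetical, and the helper assumes non-negative whole numbers, using the tens part as the stem and the ones digit as the leaf.

```python
from collections import defaultdict

# Hypothetical sample: exam scores for a small class.
scores = [62, 71, 75, 78, 80, 83, 84, 84, 89, 91, 95]

def stem_plot(values):
    """Map each stem (value // 10) to its sorted leaves (value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

plot = stem_plot(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every value stays visible in the plot, which is readable for a dozen numbers but quickly becomes unwieldy for hundreds.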
Exploratory Data Analysis (EDA) workflow
- Study each individual variable
- Study the relationships between two variables
- Create graphs of the distribution of variables
- Last, add numerical summaries of specific aspects of data
Four things describe the distribution of a variable: shape, center, spread, and outliers.
Measures of center: mean and median
Measures of spread: ranges between quantiles, such as the interquartile range
Together, spread and center give useful information about the distribution of the data.
Shape: the data can follow a normal distribution (a bell curve) or be skewed to the right or left
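A quick sketch of these numerical summaries, using Python's built-in `statistics` module. The `data` list is invented for illustration, and comparing the mean with the median is only a rough hint about skew, not a formal test.

```python
import statistics

# Hypothetical sample with one large value pulling the right tail out.
data = [2, 3, 3, 4, 5, 5, 6, 7, 30]

# Center
mean = statistics.mean(data)
median = statistics.median(data)

# Spread: quartiles split the sorted data into four parts;
# the interquartile range (IQR) is the width of the middle half.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Shape (rough heuristic): a mean well above the median suggests
# a right skew, a mean well below it suggests a left skew.
if mean > median:
    skew_hint = "right"
elif mean < median:
    skew_hint = "left"
else:
    skew_hint = "roughly symmetric"

print(f"mean={mean:.2f} median={median} IQR={iqr} skew hint: {skew_hint}")
```

Here the single large value drags the mean above the median, while the median and IQR barely react to it.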
Tools for finding outliers
Standard deviation: measures how far data points typically lie from the mean. A point that sits many standard deviations away from the mean is a candidate outlier.
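One way to put this into practice is to express each point's distance from the mean in units of standard deviation (a z-score) and flag the large ones. The sketch below uses a made-up `data` list, and the cutoff of 2 standard deviations is an assumption chosen for this example, not a fixed rule.

```python
import statistics

# Hypothetical sample containing one suspiciously large value.
data = [2, 3, 3, 4, 5, 5, 6, 7, 30]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Assumed cutoff: flag points more than 2 standard deviations from the mean.
threshold = 2
outliers = [x for x in data if abs(x - mean) / sd > threshold]

print(f"mean={mean:.2f} sd={sd:.2f} outliers={outliers}")
```

Keep in mind that an extreme value inflates the standard deviation itself, so in small samples a stricter cutoff may fail to flag it.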