### Syllabus: Data Science 101

This five-day intensive training will turn employees into savvy data analysts with a solid foundation for tackling data cleaning, visualization, and basic modeling. Students will become comfortable with R, an open-source tool that is widely used by professional statisticians and analysts. R is designed for analyzing data powerfully and effectively, creating predictive models, and building beautiful visualizations. This online component contains assessments during the training session, as well as additional resources, trainings, and support beyond the classroom.

###### By the end of this course, students will be able to:

- Understand how data science can be used effectively in industry
- Program proficiently in R
- Visualize findings effectively
- Build basic models and find patterns in data

###### Assessment:

- **Concept reviews:** quizzes that cover the most important concepts and ideas in each lesson. They encourage holistic understanding and use varied question types (e.g., drag and drop, fill-in-the-blanks, matching).
- **Exercises:** additional videos that cover the coding functions from the instructional video in more depth. They are project-based and include coding templates so students can strengthen their skills outside of the course.

###### Materials provided:

- Accompanying workbooks to use as reference materials
- R code templates to implement as frameworks
- Data sets used in the instructional videos and exercises

### Course Outline

**1. Data science fundamentals:**

What is data science?

A data scientist’s approach

Commercial applications of data science

**2. Introduction to R programming:**

Installing R and RStudio

Introduction to RStudio

Performing basic calculations in R

Loading data into R

Working with multiple data types

Data wrangling and cleaning in R

**3. Basic visualizations:**

Basic plotting in R

Basic plotting in ggplot2

Customizing graphs and adjusting formats

**4. Introduction to clustering and unsupervised machine learning:**

What is unsupervised machine learning?

Commercial applications of data mining

**5. Implementation of clustering:**

k-means clustering on multi-dimensional data

Evaluating the quality of clustering

Determining the right number of clusters to use

**6. Clustering multiple data:**

Working with binary data – cosine distance

Clustering binary data – spherical k-means

Assessing quality of spherical k-means clustering

Interpreting clusters of binary data and making recommendations

Pitfalls of clustering

**7. Introduction to regression and supervised machine learning:**

Commercial applications of regression

**8. Basic statistics and regression modeling:**

Linear relationships: slope, y-intercept, variable interactions

Variance and standard deviation

Covariance and correlation

Normal distribution and bell curves

**9. Regression model evaluation:**

Distribution of errors: Q-Q plot, heteroscedasticity

Multiple regression

R squared and adjusted R squared

p-values and t-test

F-test and F-distribution

**10. Introduction to classification:**

k-Nearest Neighbors

Decision trees: Gini impurity and information gain

Introduction to random forests

Confusion matrices, misclassification rates

Baseline errors and lift

**11. Pitfalls and best practices of data science:**

Understanding the limits of your data

Checking data validity

Ethical considerations

Best practices for data analysts