1:1 Coaching
24/7 Support
Cloud Labs
High Success Rate
Globally Renowned Trainer
Real-time code analysis and feedback
Course Description
This four-day workshop covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data, and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines.
They dive into the foundations of the Spark architecture and execution model needed to configure, monitor, and tune Spark applications effectively. Participants also learn how Spark integrates with key components of the Cloudera platform, such as HDFS, YARN, Hive, Impala, and Hue, as well as with their favorite Python and R packages.
Learning Objectives
- How to use Apache Spark to run data science and machine learning workflows at scale
- How to use Spark SQL and DataFrames to work with structured data
- How to use MLlib, Spark’s machine learning library
- How to use PySpark, Spark’s Python API
- How to use sparklyr, a dplyr-compatible R interface to Spark
- How to use Cloudera Data Science Workbench (CDSW)
- How to use other Cloudera platform components including HDFS, Hive, Impala, and Hue
Certification Curriculum
Data Science Overview
- What Data Scientists Do
- What Process Data Scientists Use
- What Tools Data Scientists Use
Cloudera Data Science Workbench (CDSW)
- Introduction to Cloudera Data Science Workbench
- How Cloudera Data Science Workbench Works
- How to Use Cloudera Data Science Workbench
- Entering Code
- Getting Help
- Accessing the Linux Command Line
- Working with Python Packages
- Formatting Session Output
Case Study
- DuoCar
- How DuoCar Works
- DuoCar Datasets
- DuoCar Business Goals
- DuoCar Data Science Platform
- DuoCar Cloudera EDH Cluster
- HDFS
- Apache Spark
- Apache Hive
- Apache Impala
- Hue
- YARN
- DuoCar Cluster Architecture
Apache Spark
- Apache Spark
- How Spark Works
- The Spark Stack
- Spark SQL
- DataFrames
- File Formats in Apache Spark
- Text File Formats
- Parquet File Format
Summarizing and Grouping DataFrames
- Summarizing Data with Aggregate Functions
- Grouping Data
- Pivoting Data
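A minimal PySpark sketch of this module's operations. The DuoCar dataset path and column names (distance, rider_id, service) are illustrative assumptions, not the course's actual schema.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
rides = spark.read.parquet("/duocar/rides/")  # hypothetical dataset path

# Summarizing data with aggregate functions
rides.select(F.count("*"), F.mean("distance"), F.stddev("distance")).show()

# Grouping data
rides.groupBy("rider_id").agg(
    F.count("*").alias("num_rides"),
    F.avg("distance").alias("avg_distance"),
).show()

# Pivoting data: one column per service level
rides.groupBy("rider_id").pivot("service").agg(F.avg("distance")).show()
```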
Window Functions
- Introduction to Window Functions
- Creating a Window Specification
- Aggregating over a Window Specification
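A sketch of the window-function workflow, continuing with the hypothetical rides DataFrame above; the rider_id, date_time, and distance columns are assumptions.
```python
from pyspark.sql import Window
import pyspark.sql.functions as F

# Create a window specification: one window per rider, ordered by time
w = Window.partitionBy("rider_id").orderBy("date_time")

# Aggregate over the window specification
rides_numbered = (
    rides
    .withColumn("ride_number", F.row_number().over(w))          # sequence within each rider
    .withColumn("running_distance", F.sum("distance").over(w))  # running total per rider
)
rides_numbered.show(5)
```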
Exploring DataFrames
- Possible Workflows for Big Data
- Exploring a Single Variable
- Exploring a Categorical Variable
- Exploring a Continuous Variable
- Exploring a Pair of Variables
- Categorical-Categorical Pair
- Categorical-Continuous Pair
- Continuous-Continuous Pair
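Each of these exploration patterns maps to a short DataFrame expression. A sketch, again assuming hypothetical DuoCar columns (service, star_rating, distance, duration):
```python
import pyspark.sql.functions as F

# Single categorical variable: frequency counts
rides.groupBy("service").count().orderBy(F.desc("count")).show()

# Single continuous variable: summary statistics
rides.describe("distance").show()

# Categorical-categorical pair: contingency table
rides.crosstab("service", "star_rating").show()

# Continuous-continuous pair: correlation
print(rides.stat.corr("distance", "duration"))
```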
Apache Spark Job Execution
- DataFrame Operations
- Input Splits
- Narrow Operations
- Wide Operations
- Stages and Tasks
- Shuffle
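These execution concepts can be seen on a concrete query with `explain()`: wide operations such as `groupBy` appear as an Exchange (shuffle) node that marks a stage boundary. A sketch under the same assumed schema:
```python
import pyspark.sql.functions as F

narrow = rides.filter(F.col("distance") > 10)  # narrow: no data movement between partitions
wide = narrow.groupBy("rider_id").count()      # wide: requires a shuffle

# The physical plan shows an Exchange node where the job splits into stages
wide.explain()
```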
Processing Text and Training and Evaluating Topic Models
- Introduction to Topic Models
- Scenario
- Extracting and Transforming Features
- Parsing Text Data
- Removing Common (Stop) Words
- Counting the Frequency of Words
- Specifying a Topic Model
- Training a Topic Model Using Latent Dirichlet Allocation (LDA)
- Assessing the Topic Model Fit
- Examining a Topic Model
- Applying a Topic Model
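A condensed sketch of this module's flow, assuming a hypothetical reviews DataFrame with a review text column:
```python
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

# Parse text data into words
tokenized = Tokenizer(inputCol="review", outputCol="words").transform(reviews)

# Remove common (stop) words
filtered = StopWordsRemover(inputCol="words", outputCol="relevant_words").transform(tokenized)

# Count the frequency of words
vectorizer_model = CountVectorizer(inputCol="relevant_words", outputCol="word_counts").fit(filtered)
vectorized = vectorizer_model.transform(filtered)

# Specify and train a topic model using LDA
lda_model = LDA(featuresCol="word_counts", k=2, maxIter=10).fit(vectorized)

print(lda_model.logLikelihood(vectorized))  # assess the model fit
lda_model.describeTopics().show()           # examine the topics
lda_model.transform(vectorized).show(5)     # apply: per-document topic mixture
```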
Training and Evaluating Recommender Models
- Introduction to Recommender Models
- Scenario
- Preparing Data for a Recommender Model
- Specifying a Recommender Model
- Training a Recommender Model Using Alternating Least Squares
- Examining a Recommender Model
- Applying a Recommender Model
- Evaluating a Recommender Model
- Generating Recommendations
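A sketch of this workflow with MLlib's ALS implementation; the rider/driver/rating column names and the train/test DataFrames are assumptions:
```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Specify and train a recommender model using Alternating Least Squares
als = ALS(userCol="rider_id", itemCol="driver_id", ratingCol="star_rating",
          coldStartStrategy="drop")  # drop users/items unseen during training
als_model = als.fit(train)

print(als_model.rank)  # examine the model: number of latent factors

# Apply and evaluate the model on the test set
predictions = als_model.transform(test)
rmse = RegressionEvaluator(labelCol="star_rating", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print(rmse)

# Generate top-5 recommendations per rider
als_model.recommendForAllUsers(5).show(5, truncate=False)
```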
Spark Interface Languages
- PySpark
- Data Science with PySpark
- sparklyr
- dplyr and sparklyr
- Comparison of PySpark and sparklyr
- How sparklyr Works with dplyr
- sparklyr DataFrame and MLlib Functions
- When to Use PySpark and sparklyr
Running a Spark Application from CDSW
- Overview
- Starting a Spark Application
- Reading Data into a Spark SQL DataFrame
- Examining the Schema of a DataFrame
- Computing the Number of Rows and Columns of a DataFrame
- Examining Rows of a DataFrame
- Stopping a Spark Application
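The full lifecycle in one sketch; the application name, file path, and read options are illustrative:
```python
from pyspark.sql import SparkSession

# Start a Spark application
spark = SparkSession.builder \
    .appName("duocar-exploration") \
    .getOrCreate()

# Read data into a Spark SQL DataFrame
rides = spark.read.csv("/duocar/raw/rides/", header=True, inferSchema=True)

rides.printSchema()                       # examine the schema
print(rides.count(), len(rides.columns))  # number of rows and columns
rides.show(5)                             # examine a few rows

# Stop the Spark application
spark.stop()
```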
Inspecting a Spark SQL DataFrame
- Overview
- Inspecting a DataFrame
- Inspecting a DataFrame Column
- Inspecting a Primary Key Variable
- Inspecting a Categorical Variable
- Inspecting a Numerical Variable
- Inspecting a Date and Time Variable
Transforming DataFrames
- Spark SQL DataFrames
- Working with Columns
- Selecting Columns
- Dropping Columns
- Specifying Columns
- Adding Columns
- Changing the Column Name
- Changing the Column Type
- Working with Rows
- Ordering Rows
- Selecting a Fixed Number of Rows
- Selecting Distinct Rows
- Filtering Rows
- Sampling Rows
- Working with Missing Values
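A compact sketch covering both the column and row operations above; all column names are illustrative:
```python
import pyspark.sql.functions as F

transformed = (
    rides
    .select("rider_id", "distance", "date_time")               # select columns
    .withColumn("distance_km", F.col("distance") / 1000.0)     # add a column
    .withColumnRenamed("date_time", "ride_time")               # change a column name
    .withColumn("rider_id", F.col("rider_id").cast("string"))  # change a column type
    .orderBy(F.desc("distance_km"))                            # order rows
    .limit(1000)                                               # fixed number of rows
    .distinct()                                                # distinct rows
    .filter(F.col("distance_km") > 1.0)                        # filter rows
    .na.drop(subset=["rider_id"])                              # drop rows with missing values
)

# Sampling rows
sample = rides.sample(withReplacement=False, fraction=0.01, seed=12345)
```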
Monitoring, Tuning, and Configuring Spark Applications
- Monitoring Spark Applications
- Persisting DataFrames
- Partitioning DataFrames
- Configuring the Spark Environment
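A sketch of the tuning levers this module covers; the property values shown are illustrative, not recommendations:
```python
from pyspark.sql import SparkSession

# Configure the Spark environment at application startup
spark = (
    SparkSession.builder
    .appName("duocar-tuned")
    .config("spark.executor.memory", "4g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

rides = spark.read.parquet("/duocar/clean/rides/")

rides.persist()                           # cache the DataFrame across multiple actions
print(rides.rdd.getNumPartitions())       # inspect the current partitioning
rides16 = rides.repartition(16)           # change the number of partitions

# Monitoring: the Spark UI (linked from the CDSW session) shows jobs, stages, and storage
rides.unpersist()
```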
Machine Learning Overview
- Machine Learning
- Underfitting and Overfitting
- Model Validation
- Hyperparameters
- Supervised and Unsupervised Learning
- Machine Learning Algorithms
- Machine Learning Libraries
- Apache Spark MLlib
Training and Evaluating Regression Models
- Introduction to Regression Models
- Scenario
- Preparing the Regression Data
- Assembling the Feature Vector
- Creating a Train and Test Set
- Specifying a Linear Regression Model
- Training a Linear Regression Model
- Examining the Model Parameters
- Examining Various Model Performance Measures
- Examining Various Model Diagnostics
- Applying the Linear Regression Model to the Test Data
- Evaluating the Linear Regression Model on the Test Data
- Plotting the Linear Regression Model
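The regression workflow as a sketch, with ride duration modeled on distance as an assumed example:
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assemble the feature vector and create train and test sets
assembler = VectorAssembler(inputCols=["distance"], outputCol="features")
assembled = assembler.transform(rides)
train, test = assembled.randomSplit([0.7, 0.3], seed=12345)

# Specify and train a linear regression model
lr = LinearRegression(featuresCol="features", labelCol="duration")
lr_model = lr.fit(train)

# Examine the model parameters and performance measures
print(lr_model.intercept, lr_model.coefficients)
print(lr_model.summary.r2, lr_model.summary.rootMeanSquaredError)

# Apply and evaluate the model on the test data
predictions = lr_model.transform(test)
evaluator = RegressionEvaluator(labelCol="duration", metricName="r2")
print(evaluator.evaluate(predictions))
```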
Working with Machine Learning Pipelines
- Specifying Pipeline Stages
- Specifying a Pipeline
- Training a Pipeline Model
- Querying a Pipeline Model
- Applying a Pipeline Model
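A sketch chaining the pieces from earlier modules into a single pipeline; the stages and columns are illustrative:
```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

# Specify the pipeline stages
indexer = StringIndexer(inputCol="service", outputCol="service_ix")
assembler = VectorAssembler(inputCols=["distance", "service_ix"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="duration")

# Specify and train the pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])
pipeline_model = pipeline.fit(train)

print(pipeline_model.stages)                  # query the fitted stages
predictions = pipeline_model.transform(test)  # apply the pipeline model
```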
Deploying Machine Learning Pipelines
- Saving and Loading Pipelines and Pipeline Models in Python
- Loading Pipelines and Pipeline Models in Scala
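Saving and loading are symmetric, and a model saved from Python can be loaded from Scala with the same `PipelineModel.load` call. A sketch, continuing from the pipeline above (the paths are illustrative):
```python
from pyspark.ml import PipelineModel

# Save the pipeline and the fitted pipeline model from Python
pipeline.write().overwrite().save("models/duocar_pipeline")
pipeline_model.write().overwrite().save("models/duocar_pipeline_model")

# Load the fitted pipeline model back (the equivalent call exists in Scala)
reloaded = PipelineModel.load("models/duocar_pipeline_model")
predictions = reloaded.transform(test)
```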
Transforming DataFrame Columns
- Spark SQL Data Types
- Working with Numerical Columns
- Working with String Columns
- Working with Date and Timestamp Columns
- Working with Boolean Columns
Complex Types
- Complex Collection Data Types
- Arrays
- Maps
- Structs
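A sketch of constructing and accessing the three collection types; the latitude/longitude columns are assumptions:
```python
import pyspark.sql.functions as F

enriched = (
    rides
    .withColumn("coords", F.array("origin_lat", "origin_lon"))             # array
    .withColumn("origin", F.struct("origin_lat", "origin_lon"))            # struct
    .withColumn("tags", F.create_map(F.lit("service"), F.col("service")))  # map
)

enriched.select(
    F.col("coords")[0].alias("lat"),        # index into an array
    F.col("origin.origin_lon"),             # access a struct field
    F.col("tags")["service"].alias("svc"),  # look up a map key
).show(5)
```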
User-Defined Functions
- User-Defined Functions
- Defining a Python Function
- Registering a Python Function as a User-Defined Function
- Applying a User-Defined Function
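The three steps map directly onto PySpark's `udf` API; a sketch with a hypothetical unit-conversion function:
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Define a Python function
def meters_to_miles(m):
    return m / 1609.344 if m is not None else None

# Register the Python function as a user-defined function
meters_to_miles_udf = udf(meters_to_miles, DoubleType())

# Apply the user-defined function to a DataFrame column
rides.withColumn("distance_miles", meters_to_miles_udf("distance")).show(5)

# Alternatively, register it for use in SQL expressions
spark.udf.register("meters_to_miles", meters_to_miles, DoubleType())
```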
Reading and Writing Data
- Reading and Writing Data
- Working with Delimited Text Files
- Working with Text Files
- Working with Parquet Files
- Working with Hive Tables
- Working with Object Stores
- Working with pandas DataFrames
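A sketch of the main read/write paths; the file locations and table names are illustrative:
```python
# Delimited text files
riders = spark.read.csv("/duocar/raw/riders/", sep=",", header=True, inferSchema=True)

# Parquet files
riders.write.mode("overwrite").parquet("/duocar/clean/riders/")

# Hive tables
spark.sql("SELECT * FROM duocar.rides").show(5)
riders.write.mode("overwrite").saveAsTable("duocar.riders_clean")

# pandas DataFrames (collects to the driver; use on small data only)
pdf = riders.limit(1000).toPandas()
sdf = spark.createDataFrame(pdf)
```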
Combining and Splitting DataFrames
- Joining DataFrames
- Cross Join
- Inner Join
- Left Semi Join
- Left Anti Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Applying Set Operations to DataFrames
- Splitting a DataFrame
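A sketch of the join types, set operations, and splitting; the riders and reviews DataFrames and their key columns are assumptions:
```python
# Joins: pass the join type as the third argument
inner = rides.join(riders, rides.rider_id == riders.id, "inner")
left = rides.join(riders, rides.rider_id == riders.id, "left_outer")
semi = riders.join(rides, riders.id == rides.rider_id, "left_semi")  # riders with rides
anti = riders.join(rides, riders.id == rides.rider_id, "left_anti")  # riders without rides
full = rides.join(riders, rides.rider_id == riders.id, "full_outer")

# Set operations
all_ids = rides.select("rider_id").union(reviews.select("rider_id")).distinct()

# Splitting a DataFrame
train, test = rides.randomSplit([0.7, 0.3], seed=12345)
```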
Training and Evaluating Classification Models
- Introduction to Classification Models
- Scenario
- Preprocessing the Modeling Data
- Generating a Label
- Extracting, Transforming, and Selecting Features
- Creating Train and Test Sets
- Specifying a Logistic Regression Model
- Training the Logistic Regression Model
- Examining the Logistic Regression Model
- Evaluating Model Performance on the Test Set
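A sketch of the classification workflow; the label definition (five-star ride or not) and feature column are assumed examples:
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import pyspark.sql.functions as F

# Generate a binary label
labeled = rides.withColumn("high_rating", (F.col("star_rating") == 5).cast("double"))

# Extract features and create train and test sets
assembler = VectorAssembler(inputCols=["distance"], outputCol="features")
train, test = assembler.transform(labeled).randomSplit([0.7, 0.3], seed=12345)

# Specify, train, and examine a logistic regression model
log_reg = LogisticRegression(featuresCol="features", labelCol="high_rating")
log_reg_model = log_reg.fit(train)
print(log_reg_model.intercept, log_reg_model.coefficients)

# Evaluate model performance on the test set
evaluator = BinaryClassificationEvaluator(labelCol="high_rating", metricName="areaUnderROC")
print(evaluator.evaluate(log_reg_model.transform(test)))
```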
Tuning Algorithm Hyperparameters Using Grid Search
- Requirements for Hyperparameter Tuning
- Specifying the Estimator
- Specifying the Hyperparameter Grid
- Specifying the Evaluator
- Tuning Hyperparameters Using Holdout Cross-Validation
- Tuning Hyperparameters Using K-Fold Cross-Validation
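A sketch of both tuning strategies, continuing from the classification example above (the estimator `log_reg`, the `evaluator`, and the grid values are assumptions):
```python
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator

# Specify the hyperparameter grid
grid = ParamGridBuilder() \
    .addGrid(log_reg.regParam, [0.0, 0.01, 0.1]) \
    .addGrid(log_reg.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

# Holdout cross-validation: a single train/validation split
tvs = TrainValidationSplit(estimator=log_reg, estimatorParamMaps=grid,
                           evaluator=evaluator, trainRatio=0.75)
tvs_model = tvs.fit(train)

# K-fold cross-validation: k train/validation splits
cv = CrossValidator(estimator=log_reg, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)
best_model = cv_model.bestModel
```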
Training and Evaluating Clustering Models
- Introduction to Clustering
- Scenario
- Preprocessing the Data
- Extracting, Transforming, and Selecting Features
- Specifying a Gaussian Mixture Model
- Training a Gaussian Mixture Model
- Examining the Gaussian Mixture Model
- Plotting the Clusters
- Exploring the Cluster Profiles
- Saving and Loading the Gaussian Mixture Model
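A sketch of the clustering workflow; the feature columns, cluster count, and save path are illustrative:
```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import GaussianMixture

# Extract and select features
assembler = VectorAssembler(inputCols=["distance", "duration"], outputCol="features")
assembled = assembler.transform(rides)

# Specify and train a Gaussian mixture model
gmm = GaussianMixture(featuresCol="features", k=3, seed=12345)
gmm_model = gmm.fit(assembled)

# Examine the model: mixing weights and per-cluster mean/covariance
print(gmm_model.weights)
gmm_model.gaussiansDF.show(truncate=False)

# Explore cluster assignments, then save the model
clustered = gmm_model.transform(assembled)  # adds prediction and probability columns
gmm_model.write().overwrite().save("models/duocar_gmm")
```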
Overview of sparklyr
- Connecting to Spark
- Reading Data
- Inspecting Data
- Transforming Data Using dplyr Verbs
- Using SQL Queries
- Spark DataFrames Functions
- Visualizing Data from Spark
- Machine Learning with MLlib
Introduction to Additional CDSW Features
- Collaboration
- Jobs
- Experiments
- Models
- Applications