Notes for 'Python for Machine Learning'

"Coursera"

Posted by XL on September 6, 2019

Notes

This post records all my notes for the online course - Python for Machine Learning - on Coursera.

New notes are added above the older ones, so the newest week appears first.

Week 4

DBSCAN Clustering

  • Density-Based Spatial Clustering of Applications with Noise

  • Density-based clustering locates regions of high density and separates outliers.

  • Each point in DBSCAN is classified as either a:
    • core point
    • border point
    • outlier point
  • Advantages of DBSCAN (see the sketch below):
    • Finds arbitrarily shaped clusters
    • Robust to outliers
    • Does not require specifying the number of clusters
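
A minimal sketch of DBSCAN with scikit-learn; the two-moons toy data and the `eps`/`min_samples` values are my own illustrative choices, not course code:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interlocking half-moons: a shape K-Means handles poorly but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: neighbors required to be a core point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # cluster index per point; -1 marks outlier (noise) points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, outliers: {np.sum(labels == -1)}")
```

Core points can be recovered from `db.core_sample_indices_`; any labeled point that is not a core point is a border point.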

Hierarchical Clustering

  • Agglomerative (bottom-up; the more commonly used approach; see the sketch below)

  • Divisive
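
A minimal sketch of agglomerative clustering with SciPy; the toy blobs and the Ward linkage are my own assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two toy blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2))])

# Agglomerative: start with every point as its own cluster, merge bottom-up
Z = linkage(X, method="ward")

# Cut the merge tree to obtain a flat labeling with 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

If matplotlib is installed, `scipy.cluster.hierarchy.dendrogram(Z)` draws the merge tree.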

K-Means Clustering

  • Partitioning Clustering

  • K-Means divides the data into non-overlapping subsets (clusters) without any cluster-internal structure, such that (see the sketch below):

    • Examples within a cluster are very similar
    • Examples across different clusters are very different
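
A minimal K-Means sketch with scikit-learn; the blob data and the choice of k=3 are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init restarts guard against bad random centroid initializations
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)  # one centroid per cluster
print(km.inertia_)          # within-cluster sum of squared distances (lower = tighter)
```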

Clustering

Intro to Clustering:

  • Clustering for segmentation
    • (the course shows an example first.)
  • What is clustering?
    • A cluster is a group of objects that are similar to other objects in the cluster and dissimilar to data points in other clusters.
  • Clustering vs. classification
    • (a similar-sounding but different term that can easily confuse people.)
    • classification: labeled dataset
    • clustering: unlabeled dataset
  • Clustering applications:
    • Retail/marketing
      • identifying buying patterns of customers
      • recommending new books or movies to new customers
    • Banking
      • fraud detection in credit card use
      • identifying clusters of customers (e.g. loyal)
    • Insurance
      • fraud detection in claims analysis
      • insurance risk of customers
    • Publication
      • Auto-categorizing news based on their content
      • Recommending similar news articles
    • Medicine
      • Characterizing patient behavior
    • Biology
      • Clustering genetic markers to identify family ties
  • Why clustering?
    • Exploratory data analysis
    • Summary generation
    • Outlier detection
    • Finding duplicates
    • Pre-processing step
  • Clustering algorithms
    • Partition-based Clustering
      • Relatively efficient
      • K-Means, K-Median, Fuzzy c-Means
    • Hierarchical Clustering
      • Produces trees of clusters
      • Agglomerative, Divisive
    • Density-based Clustering
      • Produces arbitrarily shaped clusters
      • DBSCAN

Week 3

  • Classification algorithms in ML (a quick example follows this list):
    • Decision Trees (ID3, C4.5, C5.0)
    • Naive Bayes
    • Linear Discriminant Analysis
    • K-Nearest Neighbors (KNN)
    • Logistic Regression
    • Neural Networks
    • Support Vector Machines (SVM)
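
A hedged sketch trying one of the listed classifiers, a decision tree, on scikit-learn's built-in iris data; the dataset, split, and hyperparameters are my own choices, not course code:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# "entropy" uses the information-gain criterion, as in ID3/C4.5
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))
```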

Week 2

  • Different types of errors (see the NumPy sketch after this list):
    • \( R^2 = 1 - RSE \)

    • \( RSE = \frac{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}{\sum_{j=1}^{n}(y_j-\bar{y})^2} \)

    • \( RAE = \frac{\sum_{j=1}^{n}|y_j-\hat{y}_j|}{\sum_{j=1}^{n}|y_j-\bar{y}|} \)

    • \( RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{n}(y_j-\hat{y}_j)^2} \)

    • \( MSE = \frac{1}{n}\sum_{j=1}^{n}(y_j-\hat{y}_j)^2 \)

    • \( MAE = \frac{1}{n}\sum_{j=1}^{n}|y_j-\hat{y}_j| \)
  • Types of Regression models:
    • Simple Regression
      • Simple Linear Regression
      • Simple Non-linear Regression
    • Multiple Regression:
      • Multiple Linear Regression
      • Multiple Non-linear Regression
  • Applications of Regression
    • Sales forecasting
    • Satisfaction analysis
    • Price estimation
    • Employment income
  • Regression Algorithms
    • Ordinal regression
    • Poisson regression
    • Fast forest quantile regression
    • Linear, Polynomial, Lasso, Stepwise, Ridge regression
    • Bayesian linear regression
    • Neural network regression
    • Decision forest regression
    • Boosted decision tree regression
    • KNN (K-nearest neighbors)
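
A minimal NumPy sketch computing the error metrics listed above on a toy linear fit; the synthetic data and the `np.polyfit` fit are my own illustrative choices:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 50)

# Simple linear regression via least squares
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
y_bar = y.mean()

mae  = np.mean(np.abs(y - y_hat))                             # MAE
mse  = np.mean((y - y_hat) ** 2)                              # MSE
rmse = np.sqrt(mse)                                           # RMSE
rae  = np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y_bar))  # RAE
rse  = np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)    # RSE
r2   = 1 - rse                                                # R^2

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  RAE={rae:.3f}  R^2={r2:.3f}")
```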

Week 1

| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Classification: classifies labeled data | Clustering: finds patterns and groupings from unlabeled data |
| Regression: predicts trends using previous labeled data | |
| Has more evaluation methods | Has fewer evaluation methods than supervised learning |
| Controlled environment | Less controlled environment |
  • Unsupervised Learning
    • The model works on its own to discover information.
    • Unlabelled data.
    • Techniques include:
      • Dimension reduction
      • Density estimation
      • Market basket analysis
      • Clustering (one of the most popular)
    • Clustering: grouping data points or objects that are somehow similar; it is used for:
      • Discovering structure
      • Summarization
      • Anomaly detection
  • Supervised Learning
    • We “teach the model”, then with that knowledge, it can predict unknown or future instances.
    • Labelled data.
    • Classification: the process of predicting discrete class labels or categories.
    • Regression: the process of predicting continuous values.
  • Python libraries for machine learning (upper level marked in bold)