Notes
This post records all my notes for the online course Python for Machine Learning on Coursera.com.
New notes are added on top of the older ones.
Week 4
DBSCAN Clustering
- Density-Based Spatial Clustering of Applications with Noise
- Density-based clustering locates regions of high density and separates outliers from them.
- Each point in DBSCAN is one of:
  - core point
  - border point
  - outlier point
- Advantages of DBSCAN:
  - Finds arbitrarily shaped clusters
  - Robust to outliers
  - Does not require specifying the number of clusters
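As a quick illustration of the three point types, here is a minimal DBSCAN sketch with scikit-learn; the toy data and the `eps`/`min_samples` values are my own illustration, not from the course.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of 3 points each, plus one far-away point.
X = np.array([[1, 1], [1.1, 1], [1, 1.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [20, 20]])

# min_samples counts the point itself, so each 3-point group is dense
# enough to form a cluster; the lone point gets label -1 (outlier).
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

Note that DBSCAN never asks for the number of clusters; it falls out of the density parameters.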
Hierarchical Clustering
- Agglomerative (used more often)
- Divisive
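A minimal agglomerative (bottom-up) sketch with scikit-learn; the data, `n_clusters`, and linkage choice are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three visually obvious pairs of nearby points.
X = np.array([[1, 1], [1.2, 0.9],
              [5, 5], [5.1, 4.8],
              [9, 1], [8.8, 1.2]])

# Agglomerative clustering starts with each point as its own cluster
# and repeatedly merges the closest pair of clusters.
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_)  # nearby points end up sharing a label
```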
K-Means Clustering
- Partitioning clustering
- K-means divides the data into non-overlapping subsets (clusters) without any cluster-internal structure:
  - Examples within a cluster are very similar
  - Examples across different clusters are very different
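A short K-means sketch with scikit-learn; the toy data and `k=2` are illustrative choices, not from the course.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of three points each.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K-means must be told the number of clusters (k) up front, unlike DBSCAN.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index for each sample
print(km.cluster_centers_)  # centroid of each cluster
```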
Clustering
Intro to Clustering:
- Clustering for segmentation
- (showing an example first.)
- What is clustering?
- A group of objects that are similar to other objects in the cluster, and dissimilar to data points in other clusters.
- Clustering vs. classification
  - (a similar but different pair of terms, which may easily confuse people.)
- classification: labeled dataset
- clustering: unlabeled dataset
- Clustering applications:
- Retail/marketing
- identifying buying patterns of customers
- recommending new books or movies to new customers
- Banking
- fraud detection in credit card use
- identifying clusters of customers (e.g. loyal)
- Insurance
- fraud detection in claims analysis
- insurance risk of customers
- Publication
- Auto-categorizing news based on their content
- Recommending similar news articles
- Medicine
- Characterizing patient behavior
- Biology
- Clustering genetic markers to identify family ties
- Why clustering?
- Exploratory data analysis
- Summary generation
- Outlier detection
- Finding duplicates
- Pre-processing step
- Clustering algorithms
  - Partition-based clustering
    - Relatively efficient
    - K-Means, K-Median, Fuzzy c-Means
  - Hierarchical clustering
    - Produces trees of clusters
    - Agglomerative, Divisive
  - Density-based clustering
    - Produces arbitrarily shaped clusters
    - DBSCAN
Week 3
- Classification algorithms in ML:
- Decision Trees (ID3, C4.5, C5.0)
- Naive Bayes
- Linear Discriminant Analysis
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
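As a quick sketch, two of the listed classifiers can be fit on the classic iris dataset with scikit-learn; the split and hyperparameters here are my own illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Labeled data -> supervised classification.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=4)

for model in (DecisionTreeClassifier(criterion="entropy", max_depth=4),
              KNeighborsClassifier(n_neighbors=4)):
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)  # accuracy on held-out data
    print(type(model).__name__, round(acc, 3))
```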
Week 2
- Different types of errors:
  - \( R^2 = 1 - RSE \)
  - \( RSE = \frac{\sum_{j=1}^{n}(y_j-\hat{y}_j)^2}{\sum_{j=1}^{n}(y_j-\bar{y})^2} \)
  - \( RAE = \frac{\sum_{j=1}^{n}|y_j-\hat{y}_j|}{\sum_{j=1}^{n}|y_j-\bar{y}|} \)
  - \( RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{n}(y_j-\hat{y}_j)^2} \)
  - \( MSE = \frac{1}{n}\sum_{j=1}^{n}(y_j-\hat{y}_j)^2 \)
  - \( MAE = \frac{1}{n}\sum_{j=1}^{n}|y_j-\hat{y}_j| \)
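These error metrics are easy to compute by hand with NumPy; the predictions below are made up just to check the definitions.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # true values
y_hat = np.array([2.5, 5.0, 7.5, 10.0])   # predicted values

mae = np.mean(np.abs(y - y_hat))                           # MAE
mse = np.mean((y - y_hat) ** 2)                            # MSE
rmse = np.sqrt(mse)                                        # RMSE
rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # RSE
r2 = 1 - rse                                               # R^2
print(mae, mse, rmse, r2)  # -> 0.5 0.375 0.6123... 0.925
```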
- Types of Regression models:
- Simple Regression
- Simple Linear Regression
- Simple Non-linear Regression
- Multiple Regression:
- Multiple Linear Regression
- Multiple Non-linear Regression
- Applications of Regression
- Sales forecasting
- Satisfaction analysis
- Price estimation
- Employment income
- Regression Algorithms
- Ordinal regression
- Poisson regression
- Fast forest quantile regression
- Linear, Polynomial, Lasso, Stepwise, Ridge regression
- Bayesian linear regression
- Neural network regression
- Decision forest regression
- Boosted decision tree regression
- KNN (K-nearest neighbors)
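Of the algorithms above, simple linear regression is the easiest to sketch; the synthetic data below (true slope 3.0, intercept 2.0) is my own illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=50)

# Ordinary least squares fit; coefficients should land near 3.0 and 2.0.
reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)
```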
Week 1
| Supervised Learning | Unsupervised Learning |
|---|---|
| Classification: classifies labeled data | Clustering: finds patterns and groupings from unlabeled data |
| Regression: predicts trends using previously labeled data | |
| Has more evaluation methods | Has fewer evaluation methods than supervised learning |
| Controlled environment | Less controlled environment |
- Unsupervised Learning
- The model works on its own to discover information.
- Unlabelled data.
- Techniques include:
- Dimension reduction
- Density estimation
- Market basket analysis
- Clustering (one of the most popular)
- Clustering: grouping of data points or objects that are somehow similar by:
- Discovering structure
- Summarization
- Anomaly detection
- Supervised Learning
- We “teach the model”, then with that knowledge, it can predict unknown or future instances.
- Labelled data.
- Classification: the process of predicting discrete class labels or categories.
- Regression: the process of predicting continuous values.
- Python libraries for machine learning (upper level marked in bold)