
Machine Learning: From Primer to Pro


In this tutorial, we will learn about Machine Learning with Scikit-Learn (sklearn). Our plan is as follows:

  1. Briefly touch on the fundamentals of ML
  2. Explore the ML world:
    • Split a dataset into training and testing sets
    • Train a simple ML model on a training set
    • Evaluate the model's performance on the testing set
    • Make predictions on new data
  3. In the process, we will revisit Pandas:
    • Import the necessary libraries and load the dataset
    • Perform exploratory data analysis to understand the structure and characteristics of the dataset
    • Preprocess the data to handle missing values, outliers, and any other issues
    • Summarise findings and present results in a clear and concise manner

By the end of this tutorial, we will have built an end-to-end mini Machine Learning project.
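To give a feel for where we are headed, here is a minimal sketch of the split/train/evaluate/predict workflow with scikit-learn. It uses the built-in iris dataset as a stand-in for the dataset we will work with later, and the model choice (logistic regression) is just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a toy dataset: features X and labels y
X, y = load_iris(return_X_y=True)

# 1. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Train a simple model on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 3. Evaluate the model's performance on the testing set
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")

# 4. Make a prediction on new data (one unseen flower measurement)
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```

Every step of the plan maps to one or two lines here; the rest of the tutorial fills in the data exploration and preprocessing that real datasets demand.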

Part I: The FUNdamentals of Machine Learning

See what I did there? For extra fun, have a brief look at the History of AI.

What is ML?

Machine Learning (ML) is a branch of Artificial Intelligence (AI) in which algorithms extract useful information from data automatically.

We often use Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) interchangeably. But there are some valuable differences, and the way to remember them is by considering the following conceptual flow:

Data → Representation (Features) → Model → Output

In the traditional Machine Learning setting, we extract relevant features and then learn the mapping between the features and the output.

In the context of Deep Learning, humans don't worry about constructing the features; instead, we let the DL model learn the best representation of the data that will achieve our goal.

Let's say we want to classify cats from a dataset of images. In a traditional ML setting, we take those images, inspect them, and develop features like colour, size, and ear-to-nose ratio. Then, we use these features to map them to a meaningful output (cat class, dog class, human class, etc.).

In a DL setting, we feed the raw images directly to the model and let it learn its own features for the task. Regarding models, when we talk about DL, we talk about neural networks.

When we talk traditional ML, we talk tree-based methods, gradient boosting (e.g., XGBoost), and support-vector machines (check out the brilliant library Scikit-Learn). Recently, with the advent of computing resources such as GPUs, DL has become the de facto approach to solving most ML tasks, using, for example, the PyTorch library.
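The cat-classification story above can be sketched in code. The feature values below (average colour, size, ear-to-nose ratio) are entirely made up for illustration; the point is that in traditional ML we hand the model pre-computed features, here via a support-vector machine from scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical hand-crafted features: [average_colour, size, ear_to_nose_ratio]
X = np.array([
    [0.2, 3.0, 1.5],   # a cat
    [0.3, 3.2, 1.4],   # another cat
    [0.7, 8.0, 0.6],   # a dog
    [0.8, 9.1, 0.5],   # another dog
])
y = np.array(["cat", "cat", "dog", "dog"])

# Learn the mapping from features to class labels
clf = SVC(kernel="linear")
clf.fit(X, y)

# A new feature vector close to the cat examples
print(clf.predict([[0.25, 3.1, 1.45]]))
```

In a DL setting, by contrast, the raw pixel values themselves would be the input, and the network would learn which features matter.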

In this notebook, we will employ a more traditional ML setting.


Machine Learning (ML) is essentially about computers learning from data, making tasks easier for humans.

In our daily lives, we want to solve tasks like:

  • Classification: This is typically easy for humans. For instance, I can spot a mapo tofu dish from a kilometre away.
  • Prediction: The difficulty can vary depending on the number of dimensions involved.
  • Imagination: This falls into the domain of Generative AI. Recent developments in natural language processing (e.g., ChatGPT) and text-to-image translation (e.g., DALL-E-2) have enabled machines to help us imagine new worlds.

ML involves three key concepts: Data, Model, and Learning. Our primary focus will be on data.

To paraphrase Mitchell (1997), we say that a model (M) learns from data (D) if its performance on the task (T) improves after considering the data. The ultimate goal is to find robust models that perform well on unseen data (known as the ability to generalise).

Learning involves optimising the model's parameters to achieve our objective.

Example: Image Classification

Let's consider a scenario where:

  • Data (D): A dataset of cat and dog images
  • Model (M): An untrained classification model
  • Task (T): Classifying an image as a cat, dog, human, or robot

Before any learning occurs, if we ask our model to classify a given image as either a cat, dog, human, or robot, it won't provide a meaningful answer. The probability of a correct guess is merely 25% (1 in 4 choices).

However, once we train the model by allowing it to "see" and learn from the dataset, its performance in the classification task is likely to improve significantly.

This example illustrates how machine learning models can enhance their performance on specific tasks through exposure to relevant data, demonstrating the core principle of learning from experience in ML.
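This "before vs after learning" contrast can be made concrete with a rough numerical sketch. The dataset (scikit-learn's built-in digits, a 10-class problem) and the model choices are my own stand-ins, not part of the original example: a dummy classifier that guesses uniformly at random plays the role of the untrained model, and a trained classifier shows the improvement after seeing the data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)  # 10 classes of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Before learning": uniform random guessing over the 10 classes
guesser = DummyClassifier(strategy="uniform", random_state=0)
guesser.fit(X_train, y_train)
print(f"Guessing accuracy: {guesser.score(X_test, y_test):.2f}")  # near chance level (1/10)

# "After learning": performance on the task improves after considering the data
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)
print(f"Trained accuracy: {model.score(X_test, y_test):.2f}")
```

The gap between the two scores is exactly Mitchell's definition in action: performance on the task T improves after the model considers the data D.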

Flavours of ML

Supervised Learning:
  • Input: Data X & labels y
  • Goal: Learn how to map X to y
  • Examples: Classification, Regression

Unsupervised Learning:
  • Input: Just data X, no labels
  • Goal: Learn the underlying structure of data X
  • Examples: Dimensionality Reduction, Clustering

We also have other flavours, like semi-supervised learning, which combines both worlds. For today, we'll keep it vanilla and chocolate.
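The two flavours can be contrasted on the same data in a few lines. The dataset and model choices here (iris, a k-nearest-neighbours classifier, and k-means clustering) are illustrative stand-ins: the classifier is handed the labels y, while k-means sees only X.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn the mapping X -> y using the labels
clf = KNeighborsClassifier()
clf.fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels

# Unsupervised: learn structure from X alone; y is never used
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print(km.labels_[:3])  # cluster assignments (arbitrary IDs, not label names)
```

Note that the cluster IDs returned by k-means carry no inherent meaning; matching them to the "true" classes is a separate step, which is precisely what makes the unsupervised setting harder to evaluate.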

Today, we discuss supervised learning, first gaining some "philosophical" insight into Machine Learning by way of logistic regression.

Human: Linear regression, so much linear regression?
ChatAGI: No, logistic.

The section below closely follows the tutorial Adam Miller gave at DSFP Session 14, Day 1; I suggest you check it out.

In this part, we will create a scatter plot of the observed data points (y_obs) against the x values. Additionally, we will plot the true linear relationship line in magenta for comparison. The true line represents the relationship y = 3.1x + 13.

import numpy as np
import matplotlib.pyplot as plt

# Set the seed for reproducibility
np.random.seed(42)

# Generate data
n = 50
x = np.random.uniform(0, 100, n)  # Choose 50 random numbers between 0 and 100
y_true = 3.1 * x + 13
y_obs = y_true + np.random.normal(0, 15, n)  # Add some noise

# Plot the data
fig, ax = plt.subplots()
ax.plot(x, y_obs, 'o', mfc="None")  # Plot observed data points with open circles
ax.plot([0, 100], [13, 323], 'm')  # Plot the true line in magenta (y = 13 at x=0, y = 323 at x=100)
ax.set_xlabel('x', fontsize=14)  # Set the x-axis label
ax.set_ylabel('y', fontsize=14)  # Set the y-axis label
ax.set_title('Linear relationship with noise', fontsize=16)  # Set the plot title
fig.tight_layout()  # Adjust the layout to fit everything nicely
plt.show()  # Display the plot
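As a quick sanity check (not part of the original tutorial), we can fit a straight line to the noisy observations and see how well least squares recovers the true slope of 3.1 and intercept of 13. A simple way is NumPy's `polyfit` with a degree-1 polynomial.

```python
import numpy as np

# Regenerate the same noisy data as above
np.random.seed(42)
n = 50
x = np.random.uniform(0, 100, n)
y_obs = 3.1 * x + 13 + np.random.normal(0, 15, n)

# Least-squares fit of a degree-1 polynomial: y = slope * x + intercept
slope, intercept = np.polyfit(x, y_obs, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")  # close to 3.1 and 13
```

With 50 points and noise of standard deviation 15, the fitted parameters land close to, but not exactly on, the true values; this gap between the fitted and true line is exactly what generalisation in ML is about.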