Statistics: machine learning

So what is machine learning? Depending on who you ask, it’s either the future or the end of the world, but my favorite definition is from a friend:

“Machine learning is just fancy statistics on big datasets.”

From the humble linear regression to sophisticated LLMs, machine learning (ML) requires a computer to “learn” a pattern from data and then use that pattern to make predictions. Since learning is a requirement, ML is also AI.* There are a lot of different algorithms that determine how your computer learns those patterns,** but here are some useful things to keep in mind.

Machine learning is:

1. Not magic

Yes, I know ChatGPT sometimes feels disturbingly sentient. But the truth is, at its core, ML is just pattern recognition. If your data doesn’t contain a strong pattern or your algorithm is looking for the wrong type of pattern, your results will be useless. Practically speaking, this means that machine learning is not necessarily superior to more basic statistical tests. Always check how well your model performed (aka, how well your algorithm learned its pattern) before trusting any ML results.
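If you want to sanity-check this yourself, here is a minimal sketch (scikit-learn on toy data, not any particular pipeline) that compares a model against a do-nothing baseline before trusting its predictions:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for a real feature table: 100 samples, 50 features.
X, y = make_classification(n_samples=100, n_features=50, n_informative=5, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = RandomForestClassifier(random_state=0)

print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("model accuracy:   ", cross_val_score(model, X, y, cv=5).mean())
# If the model barely beats the baseline, the "pattern" it learned isn't worth much.
```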

2. Prone to overfitting

If you’ve attempted ML before, there’s a good chance you’ve seen a figure like this:

Examples of underfitting, overfitting, and good fitting of machine learning models

This describes a common issue in machine learning: your model is either too general or too specific. Ideally, you want your algorithm to learn a pattern that both describes your current data well and can be applied to future data. If the model is underfit, it won’t describe current or future data well. If it is overfit, it will describe current data perfectly and future data horribly. You want that happy medium between accurate and reproducible. This can be difficult to achieve, but I’ll explain some ways we’ve attempted it below.
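As a rough illustration (scikit-learn on toy data, not a real metabolomics model), an overfit model looks perfect on the data it was trained on but stumbles on data it has never seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real feature table.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unlimited-depth decision tree can memorize the training set.
overfit = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
print("training accuracy:", overfit.score(X_train, y_train))  # near-perfect
print("testing accuracy: ", overfit.score(X_test, y_test))    # noticeably worse
```

A large gap between those two scores is the classic signature of overfitting.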

3. Better with more (good quality) data

Let’s imagine you’re using ML to identify photos as dogs or cats. For your algorithm to learn the differences, you’ll need to give it photos already categorized as cats or dogs (your training dataset) and then test whether it correctly identifies an uncategorized photo (your testing dataset). If the training dataset is too small, your algorithm won’t have enough information to learn a useful pattern. But what’s the issue with the right part of this figure?

Examples of classifying images as cats or dogs using too little data or skewed data

The problem is that the data are skewed. Using these photos, your algorithm will learn that dogs are vertically oriented with green backgrounds and cats are horizontal with a neutral background. As a result, it will classify your unknown photo as a dog. While it’s easy to spot skewed data in cases like this, it can be difficult to identify in a large experiment. Thus, most ML workflows randomly split your data into training and testing datasets to minimize this issue.
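A minimal sketch of that random split (scikit-learn on stand-in data; the variable names are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 "photos" with 20 features each, labeled cat (0) or dog (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # hold out 25% of photos for testing
    stratify=y,       # keep the cat/dog ratio the same in both sets
    random_state=0,   # reproducible random shuffle
)
```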

How does this work in Ometa Flow?

Our statistics workflow uses two machine learning algorithms: PLS-DA and Random Forest.

PLS-DA is a supervised version of PCA. While PCA plots samples without any knowledge of your metadata, PLS-DA uses metadata to group samples and identify important metabolites. Random Forest builds many decision trees, each from a random subset of your data, and combines their results to create a more robust final model.
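For readers who want to experiment on their own, here is a minimal sketch of both approaches using scikit-learn on toy data (this is not Ometa Flow’s implementation; PLS-DA is commonly approximated by running PLS regression on numerically encoded group labels):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy feature table: 80 samples x 200 "metabolites", two groups.
X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)

# PLS-DA: regress the 0/1 group labels onto a few latent components, then threshold.
plsda = PLSRegression(n_components=2).fit(X, y)
plsda_pred = (plsda.predict(X).ravel() > 0.5).astype(int)

# Random Forest: an ensemble of decision trees, each grown on a random subset of the data.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_pred = rf.predict(X)
```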

Generally, these models are used in two ways:

  1. To classify unknown samples into one or more groups. For example, let’s say you want to diagnose prostate cancer from a patient’s urine sample. You would train the model on cancer/healthy samples and then ask that model to classify your unknown sample.***

  2. To identify metabolites that drive separation between groups. In this case, you would use your model to identify potential biomarkers for prostate cancer. These could then be measured in standalone assays or investigated further to understand their biological mechanisms.
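As a rough sketch of that second use case (again scikit-learn on toy data, not our pipeline), a trained Random Forest exposes feature importances that can be used to rank candidate biomarkers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy feature table: 80 samples x 200 "metabolites", two groups (e.g., cancer vs. healthy).
X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank metabolites by how much they contribute to the model's decisions.
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print("candidate biomarker indices:", top10)
```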

To reduce the chances of overfitting, we perform repeated double cross-validation for all our machine learning models.**** This means instead of running an ML algorithm once, we run it hundreds of times on different random splits of the data and compare the results. While this doesn’t completely eliminate the risk of overfitting, it dramatically reduces the chance that a single skewed split drives your results and provides more reliable estimates of model performance.
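The idea can be sketched with scikit-learn’s nested, repeated cross-validation (a simplified stand-in for our actual repeated double cross-validation pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)

# Inner loop: tune hyperparameters on the training portion of each outer split.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": ["sqrt", 0.1, 0.3]},
    cv=5,
)

# Outer loop: evaluate the tuned model on held-out data, repeated with fresh random splits.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

scores = cross_val_score(inner, X, y, cv=outer)  # 50 independent performance estimates
print(f"misclassification rate: {100 * (1 - scores.mean()):.1f}% ± {100 * scores.std():.1f}%")
```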

However, all of this validation doesn’t mean your model is automatically going to be great. That’s why you should always check your model fit using our ML dashboard:

Ometa Flow's machine learning dashboard showing model performance for models trained on 6 or 50 samples

So what are we looking at here? 

First, the blue and gray lines show the results from each individual test, while the red line shows the average result. While you’re mostly going to care about that red line, the variance in your blue and gray lines can tell you something about the reliability of your models: low variance around the average indicates that your modeling worked well most of the time.

Second, the x-axis shows the number of features (metabolites) used to create the model. In general, the more metabolites you use to create your model, the more accurate it will be. Thus, you want to see that red line decrease towards zero as you increase the number of metabolites. Whether that line drops quickly or slowly tells you whether a handful of metabolites drive the differences between your groups (steep downward slope) or many metabolites contribute relatively equally (shallow downward slope).

Third, the y-axis shows the misclassification rate for your model: how often it assigns a test sample to the incorrect group. You want that number to be close to 0%. A misclassification rate above 15% is suspect, and anything above 40% is useless.
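As a rough way to reproduce that kind of curve yourself (scikit-learn on toy data; the dashboard’s actual computation is different), you can estimate the misclassification rate while limiting the model to its top-ranked metabolites:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)

# Rank metabolites once on the full data (for illustration only; in a rigorous analysis
# the ranking should happen inside the cross-validation loop to avoid data leakage).
ranked = np.argsort(RandomForestClassifier(random_state=0).fit(X, y).feature_importances_)[::-1]

for n in (2, 5, 10, 25, 50):
    acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, ranked[:n]], y, cv=5).mean()
    print(f"{n:>3} metabolites -> {100 * (1 - acc):.1f}% misclassified")
```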

As you can see, the two models above performed very differently. The model on the left was trained with too little data, resulting in high misclassification rates. The model on the right contained enough data to train an excellent model that could achieve 0% misclassification using as few as 10 metabolites. 

How much data you need to train a good model depends greatly on your experiment. Thus, whether you’re looking at 10 or 1000 samples, it’s always best to check your model performance before using any of the results from an ML algorithm. If your model didn’t perform well, your results are no different than using a random number generator to identify important metabolites. If it did perform well, you have a robust short list of metabolites that drive differences between your groups.

Happy modeling!

 

*Useful to keep in mind when your boss’s boss suddenly needs “more AI”

**The more technical term for patterns here would be models; feel free to mentally swap these terms throughout the blog

***If you’re seriously considering something like this, please include other types of diseases in your model! I can’t tell you how many of these models fail because they get trained to identify general inflammation (a hallmark of cancer and many other diseases) instead of a specific disease.

****If you’re a fan of random forests, you’ll know this algorithm already incorporates some cross-validation. However, we still perform repeated double cross-validation because we’re just that into robust ML models.

Have more questions? Contact us.
