Statistics: normalization

Cartoon of data distributions, from the normal distribution to the made-up panda distribution

If you’re like me, you probably want to start analyzing your data as soon as (or slightly before) it’s off the mass spectrometer. But before you start warming up your ANOVAs and GLMs, it’s important to make sure your data is ready for statistics. One of the most important steps in this process is minimizing non-biological variation in your data.

Biological variation refers to the real differences between your samples. These are your biomarkers, your upregulated pathways, your priority candidates.

Non-biological variation refers to differences in your samples that have nothing to do with the samples themselves. These are differences caused by sample degradation, sample processing, pipetting errors, chromatography issues, injection volume, column condition, or instrument variability. These differences may look like biological differences, but they have nothing to do with what you’re investigating. The last thing you want to do is mistake this non-biological variation for the real thing.

There are many ways to minimize non-biological variation. For instance, freezing samples immediately after collection can help lower the risk of metabolite degradation, and regularly calibrating your mass spectrometer can keep signal consistent. While these are all important, the truth is that despite your best efforts, life will always get in the way of your perfectly-designed experiment. Your collaborator will get sick on the day of collection. Your pipettes won’t be perfectly calibrated. The shipping company doesn’t deliver over the weekend.* 

Some of these issues might force you to reconsider (or redo) your study. However, even the most perfect perfectionist among us can’t eliminate all non-biological variation. 

This is where normalization comes in. Normalization attempts to minimize non-biological variation after the data has already been collected. There are many ways to normalize untargeted metabolomics data, ranging from the simple to the complex. One of the most popular methods is Total Ion Current (TIC) normalization. In this method, every metabolite intensity in a sample is divided by the total (summed) signal for that sample. This corrects for fluctuations in overall signal intensity across samples and ensures that all your metabolite intensities fall within the same range (between 0 and 1). There are many variations of this method, including median normalization (dividing by the median intensity instead of the total) and reference normalization (dividing by an internal standard or reference sample). These techniques are often paired with log transformation to create a more normal distribution (see below).
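If it helps to see this in code, here's a minimal sketch in Python (using pandas and NumPy) of TIC normalization, median normalization, and a log transformation on a made-up feature table. The sample and feature names are hypothetical, and real pipelines involve more bookkeeping, but the arithmetic really is this simple:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table (made-up values): rows are samples, columns are
# metabolite features, and each cell is a raw peak intensity.
features = pd.DataFrame(
    {"feature_1": [1.2e6, 8.0e5, 2.1e6],
     "feature_2": [3.4e5, 2.9e5, 6.0e5],
     "feature_3": [9.0e4, 1.1e5, 1.8e5]},
    index=["sample_A", "sample_B", "sample_C"],
)

# TIC normalization: divide every intensity by its sample's total signal,
# so each sample sums to 1 and all values fall between 0 and 1.
tic_normalized = features.div(features.sum(axis=1), axis=0)

# Median normalization: divide by each sample's median intensity instead of its total.
median_normalized = features.div(features.median(axis=1), axis=0)

# Log transformation is often paired with normalization to tame the heavy
# right skew of intensity data; the small offset avoids taking log(0).
log_tic = np.log10(tic_normalized + 1e-6)
```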

 

Normal vs. Normalized

The more you dive into statistics, the more you’ll hear about “normal” data. Normal data follows a normal distribution: it’s centered at the mean, with about 68% of the data points falling within one standard deviation of the mean. It’s the bell curve that got us all through organic chemistry. Some normalization and transformation steps (like the log transformation mentioned above) push your data toward a normal distribution, but others do not. While there’s nothing inherently wrong with non-normal data, many statistical techniques assume that your data is normal. So, if you’re using a statistical test that requires normal data, double check that your data is actually normal.
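One quick way to check is the Shapiro-Wilk test (here via SciPy). This sketch uses simulated intensities rather than real data, so treat it as an illustration of the check, not a recipe:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated intensities for one metabolite across 30 samples (made-up data):
# raw intensities tend to be right-skewed, far from a bell curve.
raw = rng.lognormal(mean=10, sigma=1, size=30)

# The Shapiro-Wilk null hypothesis is that the data are normally distributed,
# so a small p-value suggests the data are NOT normal.
_, p_raw = stats.shapiro(raw)
_, p_log = stats.shapiro(np.log10(raw))

print(f"raw intensities:   p = {p_raw:.3g}")  # likely small -> not normal
print(f"log10 intensities: p = {p_log:.3g}")  # likely larger -> closer to normal
```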

 

Beyond these relatively simple normalization techniques, there is a slew of other algorithms available. Certain techniques require a specific type of data structure (e.g., pooled samples, internal reference standards, group references), so if you have your heart set on a specific technique, make sure you’ve planned your experiment accordingly. You can read more about different normalization techniques in the resources below, but for now we’ll highlight a few that were borrowed from our analytical neighbors: next-generation sequencing.

Next-generation sequencing has been dealing with issues surrounding data normalization for decades. Several popular normalization techniques were borrowed from related fields: variance stabilizing normalization (VSN) was originally developed for microarray data, and probabilistic quotient normalization (PQN) has its roots in NMR metabolomics. More recently, the microbiome field has popularized the centered log ratio (CLR) transformation and introduced its robust cousin, the robust centered log ratio (rCLR). While all of these techniques can be applied to metabolomics data, it’s worth noting that most of these transformations assume your data is compositional. In compositional data, each value is interdependent and the values sum to a whole. It’s like slices of a pizza: make one slice bigger, and the rest of the slices get smaller. MS/MS metabolomics data isn’t necessarily compositional (the signal for one metabolite shouldn’t affect all the others), but you can treat it as compositional after performing TIC normalization. This lets you apply the same normalization techniques as you would for sequencing data, which can facilitate comparisons between the two.
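As a rough sketch (not the exact implementation used by any particular tool, and with function names chosen purely for illustration), here is what CLR and rCLR look like in Python for a single TIC-normalized sample:

```python
import numpy as np

def clr(sample):
    """Centered log ratio: log of each value relative to the geometric mean.
    Assumes strictly positive values (no zeros)."""
    log_x = np.log(np.asarray(sample, dtype=float))
    return log_x - log_x.mean()

def rclr(sample):
    """Robust CLR: the geometric mean is computed over observed (non-zero)
    values only, and zeros are left as NaN rather than imputed."""
    sample = np.asarray(sample, dtype=float)
    out = np.full(sample.shape, np.nan)
    observed = sample > 0
    log_x = np.log(sample[observed])
    out[observed] = log_x - log_x.mean()
    return out

# Treating a TIC-normalized sample as compositional (its values sum to 1):
tic_sample = np.array([0.6, 0.3, 0.1, 0.0])
print(clr(tic_sample[tic_sample > 0]))  # CLR needs strictly positive values
print(rclr(tic_sample))                 # rCLR tolerates the zero
```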

 

Do I handle non-biological variation before or after my samples are run?

There are two stages where you can minimize non-biological variation: the wet lab stage (sample collection, handling, and instrumental analysis) and the dry lab stage (normalization). So which is more important? Ask this enough, and you’ll find passionate people on both ends of the spectrum, arguing that you shouldn’t wear earrings while running the instrument or that all normalization problems can be solved by a computer. And while both camps make important points, I prefer the middle ground. Yes, mass spectrometers are exquisitely sensitive, but they’re never run in a perfectly-controlled environment. And while it’s theoretically possible to normalize poorly-processed samples, it’s a lot easier if the sample quality is already decent. 

So the short answer is both. It’s always both. 

 

So how does this work in Ometa Flow?

Even with all these options, no single normalization method works every time. Ometa Flow’s statistics processing workflow performs both median normalization (similar to TIC but more robust to outliers) and rCLR normalization. In most cases, we’ll provide statistical results using both sets of normalized data.** Hopefully these results will be similar. If not, that’s a good indication that you need to do some more work to determine which method is most appropriate for your experiment.
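As an illustration of that sanity check (this is not the workflow’s actual code, and the numbers are simulated), you could run the same test on both sets of normalized data and compare the conclusions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up values for one metabolite in two groups, under two different
# normalizations (simulated numbers, not workflow output).
median_norm = {"control": rng.normal(5.0, 0.5, 20), "treated": rng.normal(5.6, 0.5, 20)}
rclr_norm = {"control": rng.normal(0.1, 0.4, 20), "treated": rng.normal(0.6, 0.4, 20)}

# Run the same test on both normalizations and compare the results.
_, p_median = stats.ttest_ind(median_norm["control"], median_norm["treated"])
_, p_rclr = stats.ttest_ind(rclr_norm["control"], rclr_norm["treated"])

print(f"median-normalized p = {p_median:.3g}")
print(f"rCLR-normalized   p = {p_rclr:.3g}")
# If the two tests tell very different stories, dig into why before trusting either one.
```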

 

*Trigger warning

**There are differences in how these two methods handle missing values that can make them more or less appropriate for certain statistical tests. Missing values will be the subject of their own blog post, but if you’re curious, check out the workflow documentation to learn more.

Further Reading

Have more questions? Contact us.
