Statistics: the basics
You’ve run an untargeted metabolomics experiment. You’ve done feature finding and library annotation and molecular networking. You have a table with the relative abundances of thousands of metabolites across hundreds of samples. You’re feeling pretty good about yourself.
Now what?
The problem is that no reasonable human being (or company or organization) has the time to follow up on every single metabolite in that table. In fact, that would be an enormous waste of time.
You need a way to quickly identify the best candidates for further study. You need statistics.
Done well, statistics can take a feature table and give you a prioritized list of candidates in minutes. Done poorly, you can end up in a black hole of “significant” differences which are anything but. Our next few blog posts will cover different parts of statistical analysis, focusing on how to use and interpret statistical tests as a non-statistician. But for now, here are a few general principles to consider whenever you’re performing or interpreting statistics:
Garbage in, garbage out: if your sample processing, data quality, or feature finding was done poorly, no amount of fancy statistical algorithms will save you.
One size doesn’t fit all: there are a lot of statistical tests, and none of them work every time. Instead, statistical tests are tailored to specific types of data and analysis problems. Just because a specific test worked well for you in the past doesn’t mean it’s the best solution now.
All models are wrong, but some are useful: even the most appropriate statistical test for your data will be wrong some fraction of the time. While statistics can be an incredibly powerful tool, always treat your results with skepticism and validate when you can.
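To see why skepticism matters, consider what happens when you run thousands of tests at once, as any feature-table analysis does. Here is a minimal sketch using simulated data (pure random noise, so there are no real differences to find): a plain t-test per feature still flags dozens of features as “significant,” and a multiple-testing correction such as Benjamini-Hochberg reins that in. The data and thresholds here are illustrative, not drawn from a real experiment.

```python
import numpy as np
from scipy import stats

# Simulate a feature table of pure noise: 1000 features, two groups
# of 10 samples each, with no true differences between groups.
rng = np.random.default_rng(0)
n_features = 1000
group_a = rng.normal(size=(n_features, 10))
group_b = rng.normal(size=(n_features, 10))

# One t-test per feature; expect roughly 5% to fall below p < 0.05
# by chance alone, even though every feature is noise.
pvals = np.array([stats.ttest_ind(a, b).pvalue
                  for a, b in zip(group_a, group_b)])
n_raw = int((pvals < 0.05).sum())
print(f"raw 'significant' features at p < 0.05: {n_raw}")

# Benjamini-Hochberg correction: sort the p-values and compare the
# i-th smallest against (i / m) * alpha to control the false
# discovery rate across all m tests.
alpha = 0.05
sorted_pvals = np.sort(pvals)
thresholds = np.arange(1, n_features + 1) * alpha / n_features
passed = sorted_pvals <= thresholds
n_fdr = int(passed.nonzero()[0].max() + 1) if passed.any() else 0
print(f"features surviving FDR correction: {n_fdr}")
```

The raw count lands near 50 false positives; after correction, few or none survive. The same logic is why a long list of uncorrected p-values from a metabolomics experiment should never be taken at face value.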
Do I need peak areas?
As we’ve discussed previously, there are two levels of abundance information you can glean from an untargeted metabolomics experiment. The first is presence/absence: was this metabolite detected in my sample? This information is relatively easy to extract and is output in the results of any molecular networking analysis. The second is relative abundance: how much of this metabolite is present in my sample? This information allows you to identify metabolites that are up- or down-regulated, but requires feature finding, which is more complex and error-prone. While there are a handful of statistical tests that work with presence/absence data, the vast majority are designed to handle relative abundance. Thus, if you want to perform statistical analysis on untargeted metabolomics data, you should plan to perform feature finding/peak integration beforehand.
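The two data levels call for different tests. As a rough sketch with made-up numbers (the detection counts and peak areas below are hypothetical, chosen only to illustrate the shape of each analysis): presence/absence data boils down to detection counts, which suit a test like Fisher’s exact, while relative abundance gives you continuous peak areas, which suit a t-test or a nonparametric alternative.

```python
import numpy as np
from scipy import stats

# Presence/absence: was the metabolite detected in each sample?
# A 2x2 contingency table of detection counts (hypothetical data):
#                  detected  not detected
contingency = [[18, 2],    # treatment group (n = 20)
               [9, 11]]    # control group   (n = 20)
odds_ratio, p_presence = stats.fisher_exact(contingency)
print(f"presence/absence (Fisher's exact) p-value: {p_presence:.4f}")

# Relative abundance: how much was detected? Peak areas are often
# roughly log-normal, so log-transform before a t-test. (Simulated
# areas here; a Mann-Whitney U test is a common nonparametric option.)
rng = np.random.default_rng(1)
treatment_areas = rng.lognormal(mean=10.0, sigma=0.5, size=20)
control_areas = rng.lognormal(mean=9.5, sigma=0.5, size=20)
t_stat, p_abundance = stats.ttest_ind(np.log(treatment_areas),
                                      np.log(control_areas))
print(f"relative abundance (t-test) p-value: {p_abundance:.4f}")
```

The presence/absence route only needs detection calls, but it throws away magnitude information; the relative-abundance route is more sensitive, which is why most tests assume it, and why feature finding is worth the effort.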
In the meantime, if you’re a client, check out the beta version of our new statistics processing workflow, available in our newest deployment!
Want to learn more? Contact us.