File Conversion
So far, we’ve focused on what is often the biggest hurdle to getting the most out of your untargeted metabolomics data: metabolite annotation. But one of the most headache-inducing steps might actually occur before you start any analysis at all: file conversion.
According to this Wikipedia hero, the field of mass spectrometry uses at least 13 open file formats and over 35 proprietary file formats. There’s .mzXML, .mzML, .mgf, .mzAPI, .mz5, .imzML, .mzDB, and .mzMLb. There’s .D, .YEP, .dat, .DAT, .wiff, .wiff2. There’s .RAW, .RAW, and .raw (don’t get these confused, they’re not intercompatible!). There’s even a file format literally titled Yet Another Format for Mass Spectrometry (YAFMS). Some file formats aren’t files at all, but folders containing multiple files per run. Fun, isn’t it?
The good news is that there is an open, internationally recognized standard file format for mass spectrometry: .mzML. The bad news is that the vast majority of mass spectrometers will not output files in this format. Instead, almost all mass spectrometers write data in a proprietary file format, which (as the name suggests) is specific to the company that created it. Often, these files are built to slot seamlessly into the company’s proprietary analysis software, but they don’t tend to play well with anyone else’s. So what happens when you want to do something different?
Why all the MLs?
Most open mass spectrometry file formats are written in Extensible Markup Language, or XML. One of the first widely used MS file formats to use XML was .mzXML, which you will still see around today. Since then, several international organizations collaborated to create .mzML, which has been the standard open file format since its introduction in 2008. While there are a couple of other open file formats for specific applications (such as .imzML for imaging mass spectrometry), .mzML remains the standard for storing and analyzing your mass spectrometry data.
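Because .mzML is just XML, you can peek inside one with nothing more than a standard XML parser. Here is a minimal sketch in Python that counts spectra by MS level. The embedded sample is a heavily trimmed, made-up fragment (real files carry far more metadata plus encoded peak arrays), and the function name is ours, not part of any library. For real work, a dedicated mzML reader is a better choice.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# XML namespace used by mzML documents.
MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Controlled-vocabulary accession for the "ms level" term.
MS_LEVEL_ACCESSION = "MS:1000511"

# Illustrative fragment only: trimmed to the bare structure and not schema-valid.
SAMPLE_MZML = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""


def count_ms_levels(xml_text):
    """Return a dict mapping MS level -> number of spectra at that level."""
    root = ET.fromstring(xml_text)
    levels = Counter()
    for spectrum in root.iterfind(".//mz:spectrum", MZML_NS):
        # Only look at cvParams directly under the spectrum element.
        for param in spectrum.iterfind("mz:cvParam", MZML_NS):
            if param.get("accession") == MS_LEVEL_ACCESSION:
                levels[int(param.get("value"))] += 1
    return dict(levels)


if __name__ == "__main__":
    print(count_ms_levels(SAMPLE_MZML))
```

The point is less the counting than the structure: every spectrum is an element, and its properties are tagged with terms from a controlled vocabulary, which is what makes the format so easy for third-party tools to read.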
There are some analysis tools that can use proprietary formats, but most tools (including those at Ometa) only take files in open formats like .mzML. Thus, you’ll likely need to convert your files to .mzML before analysis. The file conversion process ranges from the relatively simple to the arcane, and depends greatly on which proprietary file format you begin with. Thankfully, there are free, open-source tools that will convert most proprietary formats to .mzML. The one we recommend is msConvert from ProteoWizard.*
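If you go the command-line route, a conversion call might look something like the sketch below, which only builds the argument list so you can inspect it before running anything. The `--mzML`, `-o`, and `--filter` options belong to msconvert, but the file names, the helper function, and the particular peak-picking filter string are our illustrative assumptions. Check `msconvert --help` for the options your version supports.

```python
def build_msconvert_command(raw_file, out_dir, peak_picking=True):
    """Build an msconvert argument list (hypothetical helper, not part of ProteoWizard).

    raw_file and out_dir are plain path strings; nothing is executed here.
    """
    cmd = ["msconvert", str(raw_file), "--mzML", "-o", str(out_dir)]
    if peak_picking:
        # Assumed filter string: centroid all MS levels using the vendor's algorithm.
        cmd += ["--filter", "peakPicking vendor msLevel=1-"]
    return cmd


if __name__ == "__main__":
    # "sample.raw" and "converted" are placeholder names.
    print(build_msconvert_command("sample.raw", "converted"))
```

On a machine with msConvert installed, the resulting list can be handed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list rather than one long string sidesteps quoting problems with the filter text.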
So why doesn’t everyone just output files as .mzML?
The .mzML format is open, accessible, and reliable, and it can accommodate a wide range of mass spectrometry data: proteomics and metabolomics experiments, different mass spectrometers, ion mobility, MSn fragmentation, and so on. So why doesn’t everyone just use it to begin with? There are a small number of cases where the .mzML format doesn’t work (the most common of which is imaging mass spectrometry, which uses .imzML). Otherwise, until one of us becomes president of the universe and orders an end to this nonsense, we’ll probably just have to keep living with it.
Once you’ve converted your data to .mzML, you can now access a whole new world of analysis tools that were built to work with this open format.**
MGF?
One last open file format you’ll likely encounter is the Mascot Generic Format (MGF). This is a simple, text-based format for storing mass spectra. While it doesn’t have the flexibility or the metadata storage capabilities of .mzML, it is still a popular choice for summarizing results from an analysis.
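To give a sense of how simple MGF is, here is a minimal parser sketch in Python. Each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=value` header lines followed by m/z and intensity pairs. The sample values and the function name are made up for illustration, and the sketch assumes every peak line has at least two whitespace-separated numbers. Real-world files have more variations than this handles.

```python
# Made-up example spectrum in MGF layout.
SAMPLE_MGF = """BEGIN IONS
TITLE=example spectrum
PEPMASS=180.0634
CHARGE=1+
89.0239 1200.0
119.0344 530.5
END IONS
"""


def parse_mgf(text):
    """Parse MGF text into a list of {"params": dict, "peaks": list} records."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or PEPMASS=...; values stay as strings.
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z, intensity, and possibly extra columns we ignore.
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


if __name__ == "__main__":
    for spectrum in parse_mgf(SAMPLE_MGF):
        print(spectrum["params"]["TITLE"], len(spectrum["peaks"]), "peaks")
```

Compare this with .mzML: there is no controlled vocabulary and very little room for instrument metadata, which is exactly why MGF is handy for passing spectra around and less suited to archiving raw data.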
*ProteoWizard’s msConvert tool works for most proprietary data and is accessible through the command line or a GUI, but converting vendor formats generally requires Windows (I know). Other programs have been written for specific use cases. Ometa Flow provides detailed documentation on how to convert most file formats to .mzML for our users, and even provides additional workflows to fix known file conversion issues. If you’re a client and have a question about how to convert your data, please let us know.
**Including all those powerful and easy-to-use tools available through Ometa Flow!
Want to learn more?
Martens, L. et al. mzML—a Community Standard for Mass Spectrometry Data. Mol. Cell. Proteomics 10, R110.000133 (2011).
Have more questions? Contact us.