Detecting signatures underlying the composition of biological data.
Nucleic acids research
Biological compositional data is inherently multidimensional and therefore difficult to visualize and interpret. To allow for the automatic decomposition of large compositional data and to capture gradients in co-occurring features, called signatures, we developed a new software package, 'cvaNMF'. Our benchmarks on synthetic data show the effectiveness of cross-validation and our novel signature-similarity method in identifying a suitable decomposition using non-negative matrix factorization (NMF). This software provides a complete set of tools to identify and visualize biologically informative signatures, which we demonstrate in a wide range of microbial and cellular datasets: 'enterosignatures' detected in gut metagenomes differentiated human hosts with diverse diseases; five 'terrasignatures' from rhizosphere metagenomes differentiated root- or soil-associated microbiomes, while being refined enough to infer geographic distances between plants. Large-scale data from >13 000 metagenomes representing 25 biomes were decomposed into environmental and host-associated microbiomes based on five newly discovered signatures. Finally, analysis of the cell composition of non-small cell lung cancer samples allowed separation of cancerous and inflamed tissues based on four cell-type signatures.
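To make the idea of choosing an NMF rank by cross-validation concrete, the following is a minimal sketch and not the cvaNMF implementation or its API. It assumes scikit-learn's `NMF` and a simplified bi-cross-validation scheme: a block of rows and columns is held out, NMF is fitted on the remaining block, and the held-out block is predicted through an unconstrained pseudoinverse step (so predictions may contain small negative values). The function names `make_synthetic` and `bcv_error`, the fold count, and the synthetic data settings are all hypothetical choices made for illustration; the package's own cross-validation and its signature-similarity method may differ.

```python
import numpy as np
from sklearn.decomposition import NMF


def make_synthetic(n_samples=60, n_features=40, rank=3, seed=0):
    """Synthetic compositional table: each row (sample) sums to 1."""
    rng = np.random.default_rng(seed)
    weights = rng.gamma(shape=1.0, scale=1.0, size=(n_samples, rank))
    signatures = rng.gamma(shape=0.5, scale=1.0, size=(rank, n_features))
    noise = rng.uniform(0.0, 0.01, size=(n_samples, n_features))
    table = weights @ signatures + noise
    return table / table.sum(axis=1, keepdims=True)


def bcv_error(table, rank, n_folds=3, seed=0):
    """Mean squared error on held-out blocks for a given NMF rank."""
    rng = np.random.default_rng(seed)
    # Assign every row and every column to one of n_folds groups.
    row_fold = rng.permutation(table.shape[0]) % n_folds
    col_fold = rng.permutation(table.shape[1]) % n_folds
    errors = []
    for i in range(n_folds):
        for j in range(n_folds):
            rows_out, cols_out = row_fold == i, col_fold == j
            a = table[np.ix_(rows_out, cols_out)]    # held-out block
            b = table[np.ix_(rows_out, ~cols_out)]   # held-out rows, kept columns
            c = table[np.ix_(~rows_out, cols_out)]   # kept rows, held-out columns
            d = table[np.ix_(~rows_out, ~cols_out)]  # training block
            model = NMF(n_components=rank, init="nndsvda",
                        max_iter=1000, random_state=seed)
            w_d = model.fit_transform(d)  # sample weights on the training block
            h_d = model.components_       # signatures on the kept columns
            # Predict the held-out block from the fitted factors
            # (least-squares step without a non-negativity constraint).
            a_hat = (b @ np.linalg.pinv(h_d)) @ (np.linalg.pinv(w_d) @ c)
            errors.append(np.mean((a - a_hat) ** 2))
    return float(np.mean(errors))


table = make_synthetic()
errors = {rank: bcv_error(table, rank) for rank in range(1, 7)}
best_rank = min(errors, key=errors.get)
print(errors)
print("rank with lowest held-out error:", best_rank)
```

In this sketch the rank with the lowest held-out error is taken as a candidate number of signatures. On real data the error curve is often flat near its minimum, so it is safer to treat the result as a shortlist of ranks to inspect than as a single definitive answer.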

