Abstract
Over the years, several CAD systems have been developed to help radiologists in the hard task of detecting signs of cancer in mammograms. In these CAD systems, mass segmentation plays a central role in the decision process. In the literature, mass segmentation has typically been evaluated in an intra-sensor scenario, where the methodology is designed and evaluated on similar data. However, in practice, acquisition systems and PACS from multiple vendors abound, and current works fail to take the resulting differences in mammogram data into account in the performance evaluation. This work argues that a comprehensive assessment of mass segmentation methods requires design and evaluation on datasets with different properties. To provide a more realistic evaluation, this work proposes: a) improvements to a state-of-the-art method based on tailored features and a graph model; b) a head-to-head comparison of the improved model with recently proposed methodologies based on deep learning and structured prediction on four reference databases, performing a cross-sensor evaluation. The results obtained support the assertion that performance estimates reported in the literature are optimistically biased when methods are evaluated on data gathered from exactly the same sensor and/or acquisition protocol as the data used to design them.