ABSTRACT
Spreadsheet models are frequently used by scientists to analyze research data. These models are typically described in a paper or a report, which serves as the single source of information on the underlying research project. Because the calculation workflow in these models is not made explicit, readers cannot fully understand how the research results are calculated or trace them back to the underlying spreadsheets. This paper proposes a methodology for semi-automatically deriving the calculation workflow underlying a set of spreadsheets. The starting point of our methodology is the cell dependency graph, which represents all spreadsheet cells and the connections between them. We automatically aggregate all cells in the graph that represent instances and duplicates of the same quantities, based on an analysis of the formula syntax. Subsequently, we use a set of heuristics, incorporating knowledge of spreadsheet design, computational procedures and the domain, to select those quantities that are relevant for understanding the calculation workflow. We explain and illustrate our methodology by applying it to three sets of spreadsheets from existing research projects in the domains of environmental and life science. Results from these case studies show that our constructed calculation models approximate the ground truth calculation workflows, both in terms of content and size, but are not a perfect match.
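The first two steps named in the abstract (building the cell dependency graph, then aggregating cells that are instances of the same quantity) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy sheet, the function names, and the choice to compare formulas after rewriting references in a relative R1C1-like form are all assumptions made for illustration. The regular expression handles only plain A1-style references, so absolute references, sheet names, and function names that end in digits (such as LOG10) are out of scope.

```python
import re
from collections import defaultdict

# Plain A1-style references only (assumption: no "$", no sheet prefixes).
CELL_REF = re.compile(r"\b([A-Z]+)([0-9]+)\b")


def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def dependency_graph(formulas):
    """Map each formula cell to the set of cells its formula references."""
    return {
        cell: {col + row for col, row in CELL_REF.findall(formula)}
        for cell, formula in formulas.items()
    }


def relative_form(cell, formula):
    """Rewrite every reference as an offset from the cell holding the formula."""
    m = CELL_REF.fullmatch(cell)
    home_col, home_row = col_to_num(m.group(1)), int(m.group(2))

    def to_offset(ref):
        d_row = int(ref.group(2)) - home_row
        d_col = col_to_num(ref.group(1)) - home_col
        return f"R[{d_row}]C[{d_col}]"

    return CELL_REF.sub(to_offset, formula)


def aggregate(formulas):
    """Group cells whose formulas are identical in relative form."""
    groups = defaultdict(list)
    for cell, formula in formulas.items():
        groups[relative_form(cell, formula)].append(cell)
    return dict(groups)


# Hypothetical toy sheet: three rows computing the same product, plus one total.
formulas = {
    "C2": "=A2*B2",
    "C3": "=A3*B3",
    "C4": "=A4*B4",
    "D2": "=C2+C3+C4",
}

graph = dependency_graph(formulas)
groups = aggregate(formulas)
```

In this toy sheet, C2, C3 and C4 share one relative formula and collapse into a single group, while D2 forms its own group. The heuristic selection step described in the abstract relies on design and domain knowledge and is not sketched here.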