
7  Co-Change Graphs

Co-change visualization is a method for computing clustering layouts from the change history of a system. Intuitively, we want layouts in which two artifacts are placed close together if they were often changed together, and far apart if they were rarely changed together. We model the system’s change history as the so-called co-change graph, which is described in Section 3.1. Then, a standard graph-layout algorithm can be applied to compute a layout (cf. [BN05a] for details). The precondition for achieving good layouts is to use an energy model that fulfills certain clustering properties (cf. the discussion in Section 6.2).

The motivation for using the co-change graph is threefold: First, frequently co-changed artifacts are likely to be logically coupled, and grouping them together in one subsystem restricts the scope of changes to the local context. Second, unlike call graphs and other syntax-based models, the co-change graph is not limited to program source code; it also includes artifacts such as test data, shell scripts, SQL scripts, examples, documentation, and subsystems written in different programming languages. Third, the co-change graph can be extracted efficiently and inexpensively from version control repositories.

The (weighted) co-change graph for a given version-control repository is an undirected graph G = (V, E, w). The set V of vertices represents the artifacts of the system (e.g., files, classes, methods, packages) and change transactions (e.g., commits in CVS). An edge {c, a} is contained in the set E of edges if artifact a was changed by change transaction c (also called ’commit’). The weight w({c, a}) of an edge is interpreted as the importance of the edge. For an unweighted graph, the weight is 1 for all edges. A detailed discussion on edge weights for co-change graphs is given in the technical report [BN05b].
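The definition above can be illustrated with a minimal Python sketch. It is not CCVisu's implementation: the function names `build_cochange_graph` and `degrees`, the input format (a mapping from transaction ids to changed artifacts), and the sample data are assumptions made for illustration. The sketch builds the unweighted variant, in which every edge has weight 1.

```python
from collections import defaultdict

def build_cochange_graph(transactions):
    """Build an unweighted co-change graph G = (V, E, w).

    `transactions` (hypothetical input format) maps a change-transaction id
    to the list of artifacts changed by that transaction.
    Vertices are tagged with their kind so that a transaction id and an
    artifact name can never collide.
    """
    vertices = set()
    edges = set()
    weight = {}
    for commit, artifacts in transactions.items():
        c = ("transaction", commit)
        vertices.add(c)
        for artifact in artifacts:
            a = ("artifact", artifact)
            vertices.add(a)
            edge = frozenset((c, a))   # undirected edge {c, a}
            edges.add(edge)
            weight[edge] = 1.0         # unweighted graph: weight 1 for all edges
    return vertices, edges, weight

def degrees(edges):
    """Return the degree of every vertex that occurs in at least one edge."""
    deg = defaultdict(int)
    for edge in edges:
        for vertex in edge:
            deg[vertex] += 1
    return dict(deg)

# Hypothetical change history with two transactions.
history = {
    "c1": ["parser.c", "parser.h"],
    "c2": ["parser.c", "README"],
}
V, E, w = build_cochange_graph(history)
print(len(V), "vertices,", len(E), "edges")
```

Note that the graph is bipartite: artifacts are never connected directly, only through the transactions that changed them. Two artifacts that are often changed together share many transaction neighbors, which is what pulls them together in the layout.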

Information provided by the visualization. The layouts produced by the tool CCVisu provide information on two levels:

The visualization can, for example, provide some guidance for answering concrete questions like the following:

High level: What are the subsystems of the system, according to common changes? If a decomposition into subsystems is available, does it match the subsystems suggested by the co-change visualization? (If not, what are the reasons?) If we want to restructure the system, what do the clusters in the co-change layout suggest? Are there files that should be reassigned to other subsystems to which they are close in the layout?

Low level: Which SQL query files correspond to which module of the system? Which test input file is related to which code file? Which configuration file corresponds to which module files? If we change a certain file, which files should we understand because of potential impact? If we want to understand a certain code file, which documentation file should we read? If we want to test a certain part of the program, which example files and test cases are closely related to the source file of that part?

Example Visualization. We have applied the CCVisu method to the well-known software project Mozilla, in particular to the mailnews component without the base package. The co-change graph was extracted from a CVS log file with 270 000 lines (13 MB). In this example, the artifacts of the co-change graph are files. The graph consists of 1 804 artifact vertices, 9 950 vertices for change transactions, and 30 938 edges (changes). Figure 4 shows a screenshot of the layout, which was computed within 5 min on a 1.7 GHz Pentium machine, using only 100 iterations of the minimizer.

Figure 4: Co-change visualization of Mozilla’s mailnews component

The vertices for the change transactions and the edges are omitted for readability. The artifact vertices are drawn in different colors, in order to compare the grouping suggested by the layout with the authoritative decomposition given in the documentation. We considered 8 major subsystems of the mailnews component and assigned colors to them: AddrBook (blue), Compose (magenta), IMAP (pink), MAPI (yellow), MIME (red), Import (cyan), DB (orange), and Extensions (gray). The rest (minor components, build utils, etc.) is labeled as Misc (green) in the figure. (The subsystem labels are also annotated in gray boxes, to improve readability for gray-scale printouts.) Now we can check whether CCVisu has positioned the 1 804 files in groups that agree with the authoritative decomposition: Some of the subsystems are clearly separated from the rest (Extensions, IMAP, DB, MAPI, AddrBook), some do not form separate clusters but almost all files of the same subsystem are close together (Import, MIME, Compose), and Misc is not grouped at all (as expected).

Comparison of two Energy Models. Table 2 compares layouts of co-change graphs created with the edge-repulsion LinLog energy model (described in Section 6.2.3) and with the standard force model of Fruchterman and Reingold (cf. [1] or Section 6.2.1). The figures show that the Fruchterman-Reingold model separates clusters less clearly (because it enforces uniform edge lengths) and has a strong bias towards placing vertices with high degree (i.e., vertices that participated in many change transactions and are drawn large in the figures) in the center (because it models vertex repulsion instead of edge repulsion). This is typical for state-of-the-art force and energy models, because they are not primarily designed for clustering.
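To make the difference between the two models concrete, the following Python sketch evaluates both energies for a given layout. It uses the forms commonly given in the literature for the unweighted case: the edge-repulsion LinLog energy sums the edge lengths and subtracts deg(u)·deg(v)·ln(distance) over all vertex pairs, while the Fruchterman-Reingold energy uses cubed edge lengths divided by 3 and an unscaled ln(distance) repulsion. Sections 6.2.1 and 6.2.3 remain the authoritative definitions; the function names and the sample layout are assumptions for illustration, and this is not CCVisu's minimizer code.

```python
import math
from itertools import combinations

def _vertex_degrees(pos, edges):
    """Count, for every vertex in the layout, the edges incident to it."""
    deg = {v: 0 for v in pos}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def edge_repulsion_linlog_energy(pos, edges):
    """Linear attraction along edges, degree-weighted logarithmic repulsion.

    `pos` maps each vertex to a coordinate tuple; all positions must be
    distinct, because the logarithm of a zero distance is undefined.
    """
    deg = _vertex_degrees(pos, edges)
    attraction = sum(math.dist(pos[u], pos[v]) for u, v in edges)
    repulsion = sum(
        deg[u] * deg[v] * math.log(math.dist(pos[u], pos[v]))
        for u, v in combinations(pos, 2)
    )
    return attraction - repulsion

def fruchterman_reingold_energy(pos, edges):
    """Cubic attraction along edges, unweighted logarithmic vertex repulsion."""
    attraction = sum(math.dist(pos[u], pos[v]) ** 3 / 3 for u, v in edges)
    repulsion = sum(
        math.log(math.dist(pos[u], pos[v]))
        for u, v in combinations(pos, 2)
    )
    return attraction - repulsion

# Hypothetical layout: two artifacts attached to one change transaction.
layout = {"a1": (0.0, 0.0), "c": (1.0, 0.0), "a2": (2.0, 0.0)}
graph_edges = [("a1", "c"), ("c", "a2")]
print(edge_repulsion_linlog_energy(layout, graph_edges))
print(fruchterman_reingold_energy(layout, graph_edges))
```

The cubic attraction term penalizes long edges heavily, which is what drives the Fruchterman-Reingold model towards uniform edge lengths. In the LinLog variant, the repulsion between two vertices scales with the product of their degrees, so high-degree vertices are not drawn towards the center.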

SW system    | CC graph     | Edge-Repulsion LinLog                    | Fruchterman-Reingold
CrocoPat 2.1 | crocopat.rsf | crocopat.png, crocopat.svg, crocopat.wrl | crocopat-FR.png, crocopat-FR.svg, crocopat-FR.wrl
Rabbit 2.1   | rabbit.rsf   | rabbit.png, rabbit.svg, rabbit.wrl       | rabbit-FR.png, rabbit-FR.svg, rabbit-FR.wrl
Blast 1.1    | blast.rsf    | blast.png, blast.svg, blast.wrl          | blast-FR.png, blast-FR.svg, blast-FR.wrl

Table 2: Comparison of energy models

For each software system, the table provides the co-change graph (RSF), the layouts created with the edge-repulsion LinLog model, and the layouts created with the Fruchterman-Reingold model. For a detailed explanation of the formats and how to get viewers, we refer to Section 4. The PNG files are static pictures that can be viewed with standard web browsers. The WRL (VRML) files can be viewed with a VRML viewer (cf. Section 4.3.1). The advantage of the VRML files is that the names of graph vertices can be selectively shown and that one can navigate through the layout.

The artifact vertices are drawn as circles in the figures. The vertices for change transactions and the edges are omitted. The area of each circle is proportional to the degree of the corresponding vertex. The circles are colored according to the authoritative decomposition: different subsystems correspond to different colors. (In the Blast visualizations, some colors are difficult to distinguish because of the large number of different colors.)
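Because the area, not the radius, is proportional to the degree, the radius grows with the square root of the degree. A short Python sketch of this relation follows; the function name `circle_radius` and the parameter `area_per_degree` are hypothetical and do not correspond to a CCVisu option.

```python
import math

def circle_radius(degree, area_per_degree=1.0):
    """Radius of a circle whose area equals degree * area_per_degree.

    `area_per_degree` is a hypothetical scaling constant chosen by the
    renderer; only the proportionality matters for the comparison of vertices.
    """
    area = degree * area_per_degree
    return math.sqrt(area / math.pi)

# A vertex with four times the degree is drawn with twice the radius.
print(circle_radius(4) / circle_radius(1))
```

Scaling the area rather than the radius keeps the visual impression of a circle in line with the number of change transactions the artifact participated in.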

A discussion and interpretation of the layouts are given in the technical report [BN05b].

© Dirk Beyer