This page serves as an introduction to the GGobi data visualization system, via an analysis of the Cars93 data.
GGobi is freely available for Windows, Mac, and Linux at www.ggobi.org. A package that lets you run GGobi from inside the R computing environment is also available from the same site. I have found that the R interface to GGobi sometimes crashes, so I will use the stand-alone version for this demo.
Getting started with the cars data
The data may be found in a number of places, including the MASS library in R. Once you start GGobi, the data can be opened. Go to File > Open and navigate to the file. I have created an XML file (cars1.xml) with additional information, such as observations being named according to make and model.
Once you open the data, you will see two windows:
The window labelled "newcars: Scatterplot (current)" is a graphics window. Its behaviour is controlled by the "GGobi" window. The GGobi window will show controls appropriate for the current graphics window. In this case you are looking at a scatterplot. With the GGobi window you can:
- Choose what goes on the X and Y axes (by clicking the X/Y buttons in front of each variable name)
- Cycle between the different variables
- Interact with the plot via the Interaction menu, for example:
  - get the label of a point using the Identify menu option,
  - highlight points using the Brush menu option, or
  - go back to a plain plot by choosing XY Plot from the menu.
I've assigned different colours to the six different types of cars (small, compact, midsize, large, van, sporty). If you look at a scatterplot of Type vs. Weight, it should be evident which is which (in case it's not obvious, 1...4 are the smallest to largest, 5 is van, 6 is sporty). You can change the colours or plotting symbols for points via Interaction > Brush, with the Persistent option either on or off (on means the points will stay that colour even after you move the mouse away from them). Right-clicking in the graph and dragging will change the size of your rectangular "brush".
Working with multiple windows
The real power of GGobi is the ability to have multiple windows with different views of the data open at the same time. Any brushing or identifying of points done in one window is then visible in all windows. For example, try opening another window (Display > New Scatterplot Display in the GGobi control window), and put a plot of Horsepower vs. Weight in one window and MPG.City vs. Price in the other. Now brush the points with unusually large horsepower values. Notice that the plotting symbols change in both plots (you may want to choose a distinctive plotting symbol/colour for brushing).
Automatic Brushing
A handy way to brush data according to a particular variable is to use the Tools > Automatic Brushing option.
Sometimes it makes more sense to transform variables, either because of the context of the problem, or because patterns in the data are clearer after transformation. The Tools > Variable Transformation menu lets you select a variable and then apply a sequence of transformations. Stage 0 covers simple things like adjusting the range. Stage 1 offers transforms such as log, inverse, and Box-Cox. The Box-Cox transform is a family of power transformations that includes the log, square root, and inverse as special cases. You can read about it on the net, or in a book on regression. Stage 2 transformations are things like converting data values to ranks or normal scores, which can be useful when there are outliers or skewed distributions.
In this example, we'll take the inverse of MPG.city, plotted below against weight. You can see that this transform makes the relationship between weight and MPG nearly linear.
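The stage 1 and stage 2 transforms described above can be sketched in a few lines of Python. This is only an illustration that assumes the usual textbook definition of the Box-Cox family; the function names and the fuel-economy values are made up for the example, and GGobi's own parameterization may differ.

```python
import numpy as np

def box_cox(x, lam):
    """Textbook Box-Cox power transform of strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)              # limiting case as lam -> 0
    return (x ** lam - 1.0) / lam

def ranks(x):
    """Ranks 1..n, with ties broken by order of appearance."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

# Made-up miles-per-gallon values, NOT taken from the Cars93 data.
mpg = np.array([18.0, 22.0, 25.0, 31.0, 46.0])

inv_like = box_cox(mpg, -1.0)   # equals 1 - 1/mpg: the inverse, up to a linear rescaling
root_like = box_cox(mpg, 0.5)   # equals 2 * (sqrt(mpg) - 1)
mpg_ranks = ranks(mpg)
```

With lam = -1 the result is a linear function of 1/x that keeps the original ordering of the values, which is why the plain inverse and the Box-Cox version show the same shape of relationship in a scatterplot.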
Display > New Scatterplot Display, then View > 1D Plot
It's also possible to look at the distribution of one variable at a time. Selecting 1D Plot on the View menu will show you a smoothed histogram (also called an Average Shifted Histogram, or ASH) of each variable, such as price below.
In an ASH, a bin width for the histogram is chosen first. A whole family of histograms can be obtained with the same bin width, by starting the bins at different values of the variable. Averaging all such histograms together gives an average shifted histogram. In GGobi, an ASH is plotted only at the data points. Bin width controls smoothness, with smaller bins giving a more jagged appearance to the distribution. You can think of an ASH as an estimate of the density of the variable. In this case, price has most values between 10 and 20 thousand dollars, with just two cars over 40K. The other 1D plot, texturing, basically jitters points to give a sense of their density.
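The averaging idea can be written out directly. The following is a minimal sketch, not GGobi's implementation: it builds a fixed number of shifted histograms (the parameter n_shifts is an assumption of this example), evaluates each one at the data points only, and averages the results. The price values are invented for illustration.

```python
import numpy as np

def ash_at_points(x, bin_width, n_shifts=10):
    """Average shifted histogram density estimate, evaluated at the data points."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    density = np.zeros(n)
    for j in range(n_shifts):
        # Each histogram uses the same bin width but a different starting point.
        origin = x.min() - bin_width + j * bin_width / n_shifts
        bins = np.floor((x - origin) / bin_width).astype(int)
        counts = np.bincount(bins)
        # Histogram density at each point: count in its bin / (n * bin width).
        density += counts[bins] / (n * bin_width)
    return density / n_shifts

# Made-up prices in thousands of dollars, NOT the Cars93 values.
prices = np.array([10.0, 11.0, 12.0, 13.0, 40.0])
dens = ash_at_points(prices, bin_width=5.0)
```

The isolated value at 40 is alone in its bin for every shift, so its estimated density stays at the minimum possible value, while the clustered values near 10 to 13 share bins and get a higher estimate.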
Display > New Barchart
A different kind of univariate plot is the barchart. For numeric variables, a histogram will be produced. Any colouring of points (e.g. by "Type" for our data) is represented in the plot by sub-bars within the histogram. This can be convenient for studying the relationship between one variable and a categorical variable. For example, here's a barplot of "Weight", with colour by "Type":
We can see that the weight of the car divides up many of the vehicles according to "Type".
Visualizing higher-dimensional data
Display > New Scatterplot Matrix
Down the diagonal are 1-D ASH plots, and the off-diagonal panels are pairwise scatterplots. As with all other GGobi plots, this plot is linked to the others, so brushing in one plot will highlight the same points in this plot as well.
Parallel coordinate plot
Display > New Parallel Coordinates Display
A parallel coordinates display attempts to show you the multivariate structure of the data by plotting all variables simultaneously. It does this by plotting each variable on its own vertical axis, and then joining the points on the axes that belong to the same observation with a line. With a lot of observations, it looks like spaghetti rather than anything useful.
However, these plots can be useful for comparing a few observations (here, cars) to the rest of the data. These observations could be identified in another plotting window, and the parallel coordinate plot then used to gain more information about them. For example, I selected the Mazda RX7 from a plot of Weight vs. Horsepower because of its unusually large horsepower for its weight, and the Volkswagen Eurovan because of its high weight and low horsepower. In the plot below, the Eurovan is green and the RX7 is yellow. These two vehicles are most different in terms of Horsepower, Weight, and RPM. The RX7 has the highest RPM among all the vehicles.
Tools > Sphering (PCA)
We've seen PCA as a way of reducing the dimensionality of the data, by projecting observations onto vectors defined by the principal component coefficients, so that the variance of the projected observations is maximized. Plots of the principal components themselves (i.e., the scores, which are the data projected onto the PCs) can sometimes reveal structure in the data.
The "Sphere variables" menu looks like this:
You select the variables to sphere (here we use all except Type, since it's more of a label than a numeric variable), and then choose either the correlation or the covariance matrix. Clicking on Update scree plot will display the variances of the PCs. You can then decide how many PCs to include. Clicking on Apply Sphering, and add PCs to data will add the principal components to the data as new variables.
Note that the PCs added to the data in GGobi are slightly different from those we have been working with. In particular, each principal component is rescaled to have variance 1 (i.e., GGobi divides through by the square root of the corresponding eigenvalue, which is the standard deviation of that PC). Since the PCs are already uncorrelated, this makes their covariance matrix equal to the identity matrix. The corresponding ellipsoid of constant statistical distance is now a sphere, hence the name "sphering".
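To see why the rescaled PCs have an identity covariance matrix, here is a small numerical sketch. It assumes the standard recipe (centre, rotate onto the eigenvectors of the covariance or correlation matrix, then divide each score by the square root of its eigenvalue, i.e. the PC's standard deviation). The function name and the synthetic data are inventions of this example, not GGobi code.

```python
import numpy as np

def sphere(X, use_correlation=False):
    """Return sphered principal component scores and the PC variances."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)
    if use_correlation:
        # Standardizing first is equivalent to using the correlation matrix.
        Z = Z / Z.std(axis=0, ddof=1)
    S = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                       # ordinary PC scores
    return scores / np.sqrt(eigvals), eigvals  # rescale each PC to variance 1

# Synthetic correlated data standing in for the numeric car variables.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 3)) + 2.0 * np.eye(3)
X = rng.normal(size=(200, 3)) @ mixing

sphered, pc_variances = sphere(X)
cov_after = np.cov(sphered, rowvar=False)      # numerically the identity matrix
```

The values in pc_variances are what a scree plot displays, and cov_after confirms that the sphered scores are uncorrelated with unit variance.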
All the plots above can be used with the principal components, including multivariate ones. As we'll see, the sphered data are also useful for "Tours" below.
Rotation and Tours: Sequences of projections
The easiest dynamic plot to understand is the Rotation plot (select a scatterplot display, which is one of the regular graph windows, and then in the GGobi menu select View > Rotation). For any three selected variables, it produces a sequence of two-dimensional plots. The sequence is constructed so as to give the illusion of a three-dimensional object. This is carried out by parameterizing the transformation as an orthogonal projection defined by two angles that vary slowly.
The points can be manipulated by hand: simply click and drag the animation window.
For example, if you look at the three variables Length, Width, and Weight, you should quickly notice that the variables are highly correlated, all lying close to a single line segment. It is sometimes difficult to interpret this plot. One helpful trick is to brush some observations you know are extreme in the variables you are plotting. This will help you remember which way the axes are pointing.
The axes are plotted in the lower left corner of the plotting window. These are the projections of the variable axes onto the two screen directions, indicating how much each variable contributes to the picture on the screen. In the example below, the horizontal direction is a combination of -width, -length, and -weight, with length receiving slightly less weight. The vertical axis is a contrast between weight and the average of length and width (since length and width have roughly equal vertical components).
A Tour can be considered a multivariate generalization of the three-dimensional rotation plot. The same projection operations can be used to project higher-dimensional data onto a two-dimensional surface (i.e., the computer screen). In a Tour, a starting plane (i.e., an initial projection of the data) is chosen at random. Then another projection of the data is chosen at random. The angle between the two planes is calculated, followed by a sequence of projections that rotate from one plane to the other. This sequence is then animated, and the whole process is repeated several times.
The visual effect of this is somewhat like a rotating scatterplot, but the points no longer appear to form a rigid object, as they did when three dimensions were animated.
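A rough sketch of one leg of a tour is given below. It is a simplification and an assumption of this example: instead of rotating through the angle between the two planes, it blends the two projection matrices linearly and re-orthonormalizes at every step, which gives a smooth path between the same two end points but is not claimed to be the path GGobi uses. All names and the synthetic data are made up.

```python
import numpy as np

def random_frame(p, rng):
    """Random p x 2 matrix with orthonormal columns (a 2D projection)."""
    Q, _ = np.linalg.qr(rng.normal(size=(p, 2)))
    return Q

def interpolate_frames(F0, F1, n_steps):
    """Yield orthonormal frames moving from F0 to F1 (simplified path)."""
    for t in np.linspace(0.0, 1.0, n_steps):
        Q, R = np.linalg.qr((1.0 - t) * F0 + t * F1)
        # Flip column signs so the path starts exactly at F0 and ends at F1.
        # (Degenerate if the blend loses rank, e.g. F1 == -F0.)
        yield Q * np.sign(np.diag(R))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))        # synthetic 5-variable data
F0 = random_frame(5, rng)           # starting plane
F1 = random_frame(5, rng)           # target plane
frames = list(interpolate_frames(F0, F1, n_steps=20))
views = [X @ F for F in frames]     # each view is a 50 x 2 scatterplot
```

Animating the successive arrays in views gives the moving scatterplot. Repeating the process with F1 as the new start and a fresh random target continues the tour.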
The idea behind the tour is that we may see interesting projections of the data if we look long enough. So you need to have your finger on the "pause" button. The joint behaviour of points can also be informative: points that are close in the high-dimensional data will be close in many projections, and as the projections change, the points are likely to move in similar directions.
There are three variations on the idea of a tour:
- 1D tours, which show a sequence of density estimates of a single linear combination of the data
- 2D tours, which show a sequence of scatterplots of two (orthogonal) linear combinations of the data
- 2x1D tours, which are like 2D tours, but you can choose whether each variable is put into the vertical axis, the horizontal axis, neither, or both.
Guided tours: Projection pursuit
When watching a tour, it will become evident that many projections are uninteresting, and that waiting for an interesting projection can take a while. The idea of projection pursuit is to develop a numeric criterion for the interestingness of a plot, and then use numerical methods to optimize this criterion (typically to maximize it, assuming that larger values indicate more interestingness).
This leads to two questions: "What is an interesting projection?" and "How do the optimization algorithms work?"
What is an interesting projection?
There are different criteria:
- Presence of gaps or high peaks in the distribution of the projected data may indicate several distinct, identifiable sub-populations in the data.
- Long-tailed or skewed distributions may also be interesting.
- Maximal variance is another criterion.
We've encountered the last criterion, maximal variance, previously, since it makes up the basis of principal component analysis. PCA has the advantage that the optimal projection is available as the solution to an eigenvalue problem, which is typically well defined and fast to calculate.
Generally, we want to scale our data before carrying out projection pursuit, since different scales (e.g., weight between 2000 and 5000, MPG between 20 and 50) may have a negative impact on the criteria and the optimum found.
It turns out that measures of the non-normality of the data are popular interestingness measures. One justification for this is that a projection can be thought of as a weighted average of a set of variables. As the number of variables increases, the Central Limit Theorem begins to operate, with the weighted average looking more and more like a normal distribution as dimensionality increases. Thus, when looking for unusual projections, a measure of non-normality (multi-modality, or skewness) can be useful in finding an interesting projection.
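The Central Limit Theorem argument can be checked numerically. The sketch below uses a deliberately simple non-normality score, the absolute skewness plus the absolute excess kurtosis. That score is a stand-in chosen for this example and is not one of GGobi's projection pursuit indices. The data are synthetic, strongly skewed variables.

```python
import numpy as np

def non_normality(y):
    """|skewness| + |excess kurtosis|; close to 0 for normal-looking data."""
    z = (y - y.mean()) / y.std()
    skewness = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0
    return abs(skewness) + abs(excess_kurtosis)

rng = np.random.default_rng(1)
n = 5000
results = {}
for p in (1, 4, 16, 64):
    X = rng.exponential(size=(n, p))        # p independent, skewed variables
    weights = np.full(p, 1.0 / np.sqrt(p))  # equal-weight projection direction
    results[p] = non_normality(X @ weights)

normal_score = non_normality(rng.normal(size=n))
```

The score in results should shrink as p grows: an equally weighted projection of many skewed variables looks increasingly normal, so a projection that still scores high in many dimensions is genuinely unusual.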
How do the optimization algorithms work?
We seek a projection that maximizes an index of "interestingness". This projection is typically a function of several variables that define the projection. For instance, when projecting three dimensions onto two, we can define the projection in terms of two angles of rotation. Similarly, to project p dimensions onto 2, we need p-1 angles. Thus the optimization problem becomes one of optimizing a function of p-1 variables.
To get an idea of how this works, let's look at a 2D projection of 3D data:
Select a scatterplot, then choose View > 2D Tour, then click on the variables "MPG.City", "Horsepower", and "RPM" in the GGobi window.
You should see a rotating scatterplot (2D Tour with 3 variables is equivalent to a rotating scatterplot). Next we'll activate the optimizer window by clicking on Projection pursuit ... in the GGobi window.
The plot shows the current and recent values of the "interestingness index". The index can be chosen from the pull-down menu in the lower left (values include "Holes", "Central Mass", and others). The "Optimize" button turns the optimizer on and off. With it turned on, GGobi will try to find a projection that maximizes the index. When you click on the Optimize button, the rotating scatterplot will arrive at a configuration something like this:
The axes in the lower left corner indicate we're looking at a plot of "Horsepower" versus "MPG.City", and the projection is one that separates out three small cars with very high efficiency at one extreme, and three sporty plus two midsize cars with high horsepower at the other extreme. If you watch the optimizer as it finds the optimum, you can see that the data lie on a curved sheet in these two variables, with RPM mostly spreading out the data along the middle of the sheet.
If you click on Scramble in the GGobi window, a new random projection of the data will be chosen, and the optimizer will start again.
What's the optimizer actually doing?
The objective functions are chosen so that they are differentiable with respect to the angles that define the projection. Then conventional derivative-based optimizers (things like Newton's method) can be used to find an optimum of the function. The current version of GGobi alternates a derivative-based optimizer with a stochastic search optimizer, simulated annealing.
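To make the optimization step concrete, here is a toy projection pursuit for a single (1D) projection. It is a sketch under stated assumptions and not GGobi's optimizer: it uses a plain random search that keeps a candidate direction only when the index improves, with a slowly shrinking step size, and the index is a simple kurtosis-based stand-in. The data are synthetic, with one deliberately bimodal variable.

```python
import numpy as np

def kurtosis_index(y):
    """Absolute excess kurtosis: large for bimodal or heavy-tailed projections."""
    z = (y - y.mean()) / y.std()
    return abs(np.mean(z ** 4) - 3.0)

def pursue(X, index, n_iter=2000, step=0.5, seed=0):
    """Random-search maximization of index(X @ a) over unit vectors a."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=X.shape[1])
    a /= np.linalg.norm(a)
    best = index(X @ a)
    for _ in range(n_iter):
        candidate = a + step * rng.normal(size=a.shape)
        candidate /= np.linalg.norm(candidate)   # stay on the unit sphere
        value = index(X @ candidate)
        if value > best:
            a, best = candidate, value           # keep improvements only
        else:
            step *= 0.999                        # search more locally over time
    return a, best

# Synthetic data: column 0 has two clusters, columns 1 and 2 are plain noise.
rng = np.random.default_rng(42)
n = 600
clusters = np.where(rng.random(n) < 0.5, -3.0, 3.0) + rng.normal(size=n)
X = np.column_stack([clusters, rng.normal(size=n), rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)         # scale before pursuit

direction, best_value = pursue(X, kurtosis_index)
```

The returned direction should put nearly all of its weight on the bimodal column. Calling pursue again with a different seed plays the role of starting from a new random projection.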
Finding good projections in higher dimensions
Generally it is recommended that the sphered data (i.e., data projected onto the principal components, and then scaled so that each PC has variance 1) be used. This way, the different variances and correlations of the original data will not affect the projections.
A common strategy is to run the optimizer several times, using the Scramble feature to start from a new random projection of the data. You will find that in higher dimensional problems, the same optimum is not always found (as opposed to the 3D example with Horsepower, RPM and MPG, where the optimizer repeatedly found a projection of horsepower and mpg).
Brushing points in the optimal projection, and examining plots of the original variables can give insight into what the projection means. For example, using the central mass index, I found a projection with three "stars" that stick out, one corresponding to vans, one to the three most efficient cars, and one with 5 cars that are powerful for their length.
- Using scatterplots, identify which variable(s) have some discreteness to their values (i.e., the values only take certain numbers such as 10, 20, 30, instead of 11, 19, 28).
- In a parallel coordinate plot, what does it mean if many of the lines joining one variable to the next form an "X"? That is, half of the lines connect a small value to a large value, and the other half connect a large value to a small value?
- Identify pairs of variables that best separate the six "Type"s of car. What plot(s) did you use?