Wednesday 14 March 2012

The Joel Test: what is the quality of software teams?

In 2000, Joel Spolsky published his "own, highly irresponsible, sloppy test to rate the quality of a software team". In November 2011, the respondents of the r/programming study put Joel's list of questions to the empirical test.

I am assuming that Joel compiled his list of questions based on his extensive experience with software teams, so some expectations pop up in my head. First, there should be a common theme running through these questions insofar as they relate to the quality of a software development team or organization. One way to look at this is to run the responses through factor analysis, where such a common theme, if present, should manifest itself as a single dominant factor. Second, if a common theme is indeed present, one would expect it to correlate positively with a more scientific "team quality factor" and with other researched factors, such as “extended experience” and “skill”.

Now, I am guessing that some people here might be familiar with factor analysis, but that many may not know what it is at all. I’m going to describe the technical analysis steps I did and, at the same time, explain what they and the results mean. I only did a rough analysis using a fairly standard approach.

The analysis was conducted using principal component analysis (PCA). Although PCA is, in a strict sense, not factor analysis (it is a data reduction technique), it is a very common way to proceed, especially in the initial stages.

Using the statistical analysis program SPSS, I first looked at the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy for the 12 questions that constitute The Joel Test. The KMO value indicates how suitable the data sample is for factor analysis in the first place. The KMO value of .723 is not great (it ranges from 0 to 1, where below .5 is unacceptable), but it is not terrible either.
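For readers without SPSS, here is a minimal Python sketch of how an overall KMO value can be computed from the item correlations and the partial correlations. It is not the SPSS routine, the function name kmo is my own, and the data below is synthetic stand-in data, not the survey responses.

```python
import numpy as np

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure for an (observations x items) array."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlation of each item pair, controlling for all other items.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diagonal] ** 2)
    p2 = np.sum(partial[off_diagonal] ** 2)
    return r2 / (r2 + p2)

# Synthetic data (an assumption for illustration): 12 yes/no items
# that all depend on one latent trait plus independent noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=(1000, 1))
answers = (trait + rng.normal(size=(1000, 12)) > 0).astype(float)

kmo_value = kmo(answers)
print(round(kmo_value, 3))
```

The value approaches 1 when the partial correlations are small relative to the plain correlations, that is, when the items share variance that a few common factors could account for.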

Second, the scree plot indicates that there is a clearly discernible theme (represented by principal component 1) present in the data. Break points in the plot separate the components worth keeping from the rest: where the curve flattens toward horizontal, the remaining components are too weak to be considered. There is a clear break (or "elbow") around component number 2 or 3.
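A scree plot is simply the eigenvalues of the item correlation matrix plotted in descending order. The sketch below prints those values instead of plotting them; it uses synthetic one-factor data (an assumption, not the survey), so the elbow here falls right after component 1.

```python
import numpy as np

def scree_values(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]   # eigvalsh returns ascending order

# Synthetic stand-in data: 12 yes/no items driven by one latent trait.
rng = np.random.default_rng(1)
trait = rng.normal(size=(1000, 1))
answers = (trait + rng.normal(size=(1000, 12)) > 0).astype(float)

eigenvalues = scree_values(answers)
explained = eigenvalues / eigenvalues.sum()   # share of total variance per component
drops = -np.diff(eigenvalues)                 # fall-off from each component to the next

for k, (ev, share) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"component {k:2d}: eigenvalue {ev:.2f}, {share:.1%} of variance")
```

The largest entry in drops marks the steepest part of the curve; the components after the curve flattens are the ones one would discard.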

Third, using a Varimax rotated solution with two factors extracted, one can see that only 31.5% of the total variance is explained by the two factors. The first factor explains only about 20%, which is somewhat low (about 40-45% would have been better). Still, this means that there are two reasonably strong signals in the data. All data has a lot of noise, and the whole point of factor analysis is to find signals in that noise and to cluster those signals into entities that might have a meaning; here “team quality”, or similar. The reason one “rotates” the factors is that the mathematics simply extracts two factors (signals) from the data and places them somewhat arbitrarily. Think of the two factors as a coordinate system that suddenly appears in a two-dimensional representation of the data. Rotating this coordinate system means turning the axes so as to minimize the distance between the data points and the axes, so that the axes represent the signals in the data as well as possible.
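To make the idea of rotation concrete, here is a minimal sketch of a Varimax rotation using a commonly used SVD-based iteration. The loading values are hypothetical, chosen only so that the effect is visible, and this is not necessarily how SPSS implements the rotation.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate an (items x factors) loading matrix toward simple structure."""
    n_items, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD-based update step for the varimax criterion.
        target = rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / n_items)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings for six items on two components.
unrotated = np.array([
    [0.6,  0.4],
    [0.7,  0.3],
    [0.6,  0.5],
    [0.5, -0.5],
    [0.6, -0.4],
    [0.4, -0.6],
])

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each item's communality (the row sum of its squared loadings) is unchanged; only the way that variance is divided between the two axes changes, so that each item ends up close to one axis.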

Fourth, it is possible to show how the different questions load on the two extracted factors, bearing in mind that the factors are not orthogonal. Loading expresses how much the data for each question contributes to that factor. That the factors are not orthogonal means that they share parts of the same signals in the data, so that there is a “shared meaning” between the two factors (just like a person’s height and weight are not completely independent factors of a person’s physique: they covary, but it is still meaningful to characterize body build by those two non-orthogonal factors). The rotated component matrix shows these results. One should immediately note that the questions "Do you fix bugs before writing new code?" and "Do programmers have quiet working conditions?" load weakly negatively on the first component. At the same time, "Do you have a bug database?" loads strongly on the first component (.72) and not at all on the second (-.02).

It is possible to plot the loadings of the questions in a two-dimensional space. Principal component 1 should be analogous to the quality of a software development team. Both FixBeforeCode and QuietWorkCond have problematic relations to this factor. These questions form a small cluster by themselves with a medium loading on the second principal component. There is also a small cluster consisting of HaveSchedule, HaveSpec, BestTools and UsabilityTesting, which have a medium-strong relation to principal component 2 and a weak relation to principal component 1. Finally, the remaining questions form a cluster that is clearly associated with the team quality factor, and is marginally or not at all associated with the second (unknown) factor.

Why is this so? I would love to hear your views on this. What should the second factor stand for, do you think? Further, it might be useful to do factor analysis on subsamples based on location. For example, U.S. and European respondents might display quite different results when the two populations are analyzed independently.

Fifth, to be fair to Joel, I should mention that even though some of the questions had a negative loading on the team quality factor, he has some empirical support for his claim that the twelve questions as a whole give some meaningful results. It is possible to specify that only one principal component should be extracted, thereby reducing the responses for all 12 questions to a single factor. The component matrix then shows that all questions have shared variance with this common factor. (Variance is noise, but shared variance means a signal is detected.) Although only about 20% of the total variance is explained (roughly two thirds of the 31.5% explained by two factors), all questions have positive loadings (FixBeforeCode and QuietWorkCond still have low loadings, though).

Specifying that one wants only one factor gives a “one-factor” solution. One can do this if one is sure (from a theory or past experience perspective) that only one factor should be present and the empirical analysis supports this.
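As a rough illustration of what extracting a single component means, the sketch below takes the largest eigenvalue of a small correlation matrix and turns its eigenvector into loadings. The 4x4 matrix is hypothetical (three related items and one weakly related item); it is not taken from the survey.

```python
import numpy as np

# Hypothetical correlation matrix for four items (not the survey data):
# three items hang together, the fourth is only weakly related to them.
corr = np.array([
    [1.00, 0.40, 0.30, 0.05],
    [0.40, 1.00, 0.35, 0.10],
    [0.30, 0.35, 1.00, 0.05],
    [0.05, 0.10, 0.05, 1.00],
])

eigenvalues, eigenvectors = np.linalg.eigh(corr)   # eigenvalues in ascending order
first = eigenvectors[:, -1]                        # direction of the largest eigenvalue
if first.sum() < 0:                                # the sign of an eigenvector is arbitrary
    first = -first

loadings = first * np.sqrt(eigenvalues[-1])        # correlation of each item with the component
share = eigenvalues[-1] / corr.shape[0]            # proportion of total variance explained

print(np.round(loadings, 2), f"{share:.1%}")
```

In this toy case every item loads positively on the single component, and the weakly related fourth item has the lowest loading, which mirrors the pattern described for FixBeforeCode and QuietWorkCond.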

The first PCs of the single-factor and two-factor solutions are highly correlated (.87), with nearly 76% shared variance. The first PC of the single-factor solution is also highly correlated with the second PC of the two-factor solution (.65), with over 42% shared variance.
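The shared-variance percentages follow from squaring the correlation coefficient. A short check, using the three-decimal correlations from the post's correlation table (.871 and .649); the variable and function names are my own:

```python
# Correlations taken from the post's correlation table.
r_single_vs_first = 0.871    # first PC (single factor) vs first PC (two-factor)
r_single_vs_second = 0.649   # first PC (single factor) vs second PC (two-factor)

def shared_variance(r):
    """Proportion of variance two variables share: the squared correlation."""
    return r ** 2

print(f"{shared_variance(r_single_vs_first):.1%}")    # 75.9%
print(f"{shared_variance(r_single_vs_second):.1%}")   # 42.1%
```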

Finally, it is possible to cross-check whether the extracted factors are positively correlated with other factors one would expect them to be positively correlated with. I have used non-parametric correlations for the variables (Spearman's rho) and parametric correlations for the factors (Pearson's r) in the table below. For both first principal components (i.e., for the one-factor and two-factor solutions), the correlations with general programming skill, total months of experience, and time spent per week as paid work are all positive, at .158 or higher. Further, both first principal components are negatively correlated with time spent per week as education or courses. Moreover, both show a slight tendency to correlate positively with the percentage of paid time actually spent programming.
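For those unfamiliar with the difference between the two coefficients: Spearman's rho is Pearson's r computed on ranks, which makes it suitable for ordinal or skewed survey variables. The sketch below shows this with hypothetical numbers (a 1-5 skill rating against a made-up factor score); none of these values come from the survey.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: self-rated skill (1-5) and a continuous factor score.
skill = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
score = [-1.2, -0.8, -0.1, -0.3, 0.2, 0.1, 0.4, 1.5, 0.9, 2.6]

rho = spearman_rho(skill, score)
r = pearson_r(skill, score)
print(round(rho, 3), round(r, 3))
```

Since rho depends only on rank order, it does not change if a variable is stretched by any increasing transformation, whereas r does.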

Overall, there seems to be a tendency that Joel's 12 questions capture one or (most likely) two factors that are relatively consistent with expectations. But more work is needed.

                                                                first PC        first PC        second PC
                                                                (single factor  (two-factor     (two-factor
                                                                solution)       solution)       solution)

first PC (single factor solution)              r                1.000           .871            .649
                                               Sig. (2-tailed)  .               .000            .000
                                               N                1234            1234            1234

first PC (two-factor solution)                 r                                1.000           .207
                                               Sig. (2-tailed)                  .               .000
                                               N                                1234            1234

general programming skill                      rho              .230            .229            .100
                                               Sig. (2-tailed)  .000            .000            .000
                                               N                1234            1234            1234

total months of programming experience         rho              .158            .168            .043
                                               Sig. (2-tailed)  .000            .000            .132
                                               N                1227            1227            1227

time spent per week as paid work               rho              .256            .291            .075
                                               Sig. (2-tailed)  .000            .000            .009
                                               N                1234            1234            1234

time spent per week as education or courses    rho              -.148           -.196           .002
                                               Sig. (2-tailed)  .000            .000            .942
                                               N                1125            1125            1125

time spent per week as unpaid work (e.g., OS)  rho              .019            -.011           .057
                                               Sig. (2-tailed)  .529            .704            .055
                                               N                1147            1147            1147

percentage of paid time actually programming   rho              .098            .073            .058
                                               Sig. (2-tailed)  .001            .016            .056
                                               N                1081            1081            1081

