The many faces of the perspective projection matrix

One of the first things I stumbled upon at the beginning of my adventure with graphics programming was the variety of matrices and view spaces. I remember it took me a while to wrap my head around the different naming conventions (is clip space the same as screen space or...?) and how each and every projection worked from a theoretical standpoint. With the Internet around it’s so much easier to figure things out, but there’s one thing that I remember baffling me: the relation between the different forms of the perspective projection matrix.

The most popular (and dare I say: the only one?) representation of a projection matrix that you can find in decent graphics and math books today is the API-agnostic form:

[ 2n/(r - l),          0, -(r + l)/(r - l),            0 ]
[          0, 2n/(t - b), -(t + b)/(t - b),            0 ]
[          0,          0,  (f + n)/(f - n), -2fn/(f - n) ]
[          0,          0,                1,            0 ]

where: 
r, l, t, b - respective planes of the view frustum (right, left, top, bottom)
f - far plane
n - near plane
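
To make the matrix a bit more tangible, here is a minimal Python sketch that builds it and pushes a point through it. The function names (frustum_matrix, project) are made up for illustration, and the sketch assumes column vectors and a camera looking down +z, which is what the 1 in the bottom row suggests.

```python
def frustum_matrix(l, r, b, t, n, f):
    """Build the 4x4 matrix shown above as a list of rows.

    Assumes column vectors and a camera looking down +z.
    """
    return [
        [2.0 * n / (r - l), 0.0, -(r + l) / (r - l), 0.0],
        [0.0, 2.0 * n / (t - b), -(t + b) / (t - b), 0.0],
        [0.0, 0.0, (f + n) / (f - n), -2.0 * f * n / (f - n)],
        [0.0, 0.0, 1.0, 0.0],
    ]


def project(m, point):
    """Multiply m by (x, y, z, 1) and apply the perspective divide."""
    x, y, z = point
    v = (x, y, z, 1.0)
    cx, cy, cz, cw = (sum(row[i] * v[i] for i in range(4)) for row in m)
    return (cx / cw, cy / cw, cz / cw)


# A symmetric frustum with n = 1 and f = 10.
m = frustum_matrix(-1.0, 1.0, -1.0, 1.0, 1.0, 10.0)
# The top-right corner of the near plane lands at roughly (1, 1, -1).
print(project(m, (1.0, 1.0, 1.0)))
# The bottom-left corner of the far plane lands at roughly (-1, -1, 1).
print(project(m, (-10.0, -10.0, 10.0)))
```

The two printed points are opposite corners of the frustum, and they end up at opposite corners of the canonical view volume, which is the whole point of the matrix.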

This matrix comes from transforming the view frustum (a truncated pyramid) into the canonical view volume (a cube, here spanning -1 to 1 on each axis) – a process slightly more complicated than a regular orthographic projection and requiring a bit of math work (which I will skip here, as you can find plentiful reference material on the Internet). Let’s assume for a second that it all makes sense to you: you understand how each element of the matrix came to be and how the entire thing works (no really, read the full math derivation and try to understand – it’ll help!). If you’re a beginner in graphics programming, one of the first example implementations of the perspective projection matrix you come across will probably use this form instead (assuming we’re talking OpenGL):

[ (1/r)cot(fov/2),          0,                0,            0 ]
[               0, cot(fov/2),                0,            0 ]
[               0,          0, -(f + n)/(f - n), -2fn/(f - n) ]
[               0,          0,               -1,            0 ]

where: 
fov - vertical field of view
r   - aspect ratio, width / height (not the right plane from the first matrix)
f   - far plane
n   - near plane

(if you happen to see tan() being used instead of cot(), remember that one function is the reciprocal of the other, so the final forms may differ slightly in the trig part)
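
As a rough sketch, the OpenGL-style matrix above could be built like this in Python. The function name perspective_matrix is hypothetical, and the sketch assumes fov is the vertical field of view in radians and aspect is width divided by height.

```python
import math


def perspective_matrix(fov, aspect, n, f):
    """Build the OpenGL-style matrix shown above as a list of rows.

    Assumes fov is the vertical field of view in radians and
    aspect is width / height.
    """
    c = 1.0 / math.tan(fov / 2.0)  # cot(fov/2) written with tan()
    return [
        [c / aspect, 0.0, 0.0, 0.0],
        [0.0, c, 0.0, 0.0],
        [0.0, 0.0, -(f + n) / (f - n), -2.0 * f * n / (f - n)],
        [0.0, 0.0, -1.0, 0.0],
    ]


# A 90 degree field of view on a 2:1 viewport, n = 1, f = 10.
for row in perspective_matrix(math.pi / 2.0, 2.0, 1.0, 10.0):
    print(row)
```

Note how cot() never appears in the code: dividing by tan() does the same job, which is why implementations you find in the wild may look slightly different from the matrix on paper.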

Wait… what?

There’s very little explanation available out there concerning the relation between these two forms. The math behind it is there, you can still find it and get a grip of how the matrix works – but how do two different representations result in the same output? Also, which one is the “better” one that I should use? The key is simply to understand that you arrive at both solutions using different input parameters:

– The first matrix is derived from the n and f planes plus the actual dimensions of the view frustum, defined by the r, l, t and b planes (here, both the aspect ratio and the field of view can be extracted from the matrix for the given size of the frustum).
– The second matrix takes the n and f planes too, but instead of the frustum dimensions it uses the desired aspect ratio and the desired field of view. This implies a frustum that is symmetric about the view axis.
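
For the curious, the link between the two forms can be sketched in a few lines of math. This assumes a symmetric frustum and a vertical field of view, and it writes the aspect ratio as a to avoid clashing with the right plane r. The last line accounts for OpenGL looking down -z, which negates the third column.

```latex
\begin{aligned}
l = -r,\; b = -t \;&\Rightarrow\; r + l = 0,\quad t + b = 0,\quad r - l = 2r,\quad t - b = 2t \\
t = n\tan\frac{\mathit{fov}}{2},\; r = a\,t \;&\Rightarrow\;
  \frac{2n}{t - b} = \frac{n}{t} = \cot\frac{\mathit{fov}}{2},\qquad
  \frac{2n}{r - l} = \frac{n}{r} = \frac{1}{a}\cot\frac{\mathit{fov}}{2} \\
z \to -z \;&\Rightarrow\; \frac{f + n}{f - n} \to -\frac{f + n}{f - n},\qquad 1 \to -1
\end{aligned}
```

The off-center terms (r + l)/(r - l) and (t + b)/(t - b) vanish for a symmetric frustum, which is why they show up as zeros in the second matrix.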

The second form is therefore easier to use and takes less code, which makes it the more commonplace one in real life. This is especially visible in FPS games, where we want smooth and fast control over the player’s fov. Bottom line? Being able to express the same thing in different ways is a powerful tool, but also one that can easily confuse everyone using it 🙂