WEBVTT
Kind: captions
Language: en
00:00:06.320 --> 00:00:11.780
Now we're going to look at a more mathematical
way to describe the projection process, projection
00:00:11.780 --> 00:00:15.160
of a point from the real world into the image plane.
00:00:15.160 --> 00:00:19.160
We're going to use a different projection
model to what we used last time.
00:00:19.160 --> 00:00:23.220
We're going to use a model that's referred
to as the "central projection" model.
00:00:23.220 --> 00:00:29.960
The key elements of this model are the camera’s
coordinate frame we denote by C. The image
00:00:29.960 --> 00:00:35.980
plane is parallel to the camera's x and y
axes and positioned at a distance f in the
00:00:35.980 --> 00:00:37.920
positive z direction.
00:00:37.920 --> 00:00:41.020
f is equivalent to the focal length of the lens.
00:00:41.020 --> 00:00:45.060
Now in order to project the point, what we
do is we cast a ray from the point in the
00:00:45.060 --> 00:00:50.540
world through the image plane to the origin
of the camera.
00:00:50.540 --> 00:00:54.460
With the central projection model, you'll
note that the image is non-inverted.
00:00:54.460 --> 00:01:00.920
We can write an equation for the point P in
homogeneous coordinates: we multiply the world
00:01:00.920 --> 00:01:07.540
coordinates, X, Y, Z, by a three by four matrix
in order to get the homogeneous coordinates
00:01:07.540 --> 00:01:10.500
of the projected point on the image plane.
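Written out, and assuming the image plane sits at distance f along the camera's z axis as described above, the equation might look like this. The tilde marks homogeneous quantities and the trailing 1 is the usual homogeneous padding of the world point.

```latex
\tilde{p} =
\begin{pmatrix} \tilde{x} \\ \tilde{y} \\ \tilde{z} \end{pmatrix}
=
\begin{pmatrix}
f & 0 & 0 & 0 \\
0 & f & 0 & 0 \\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
```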
00:01:10.500 --> 00:01:14.000
Let's look at this equation in a little bit
more detail.
00:01:14.000 --> 00:01:19.020
It's quite straightforward to write an expression
for x tilde, y tilde, z tilde in terms of
00:01:19.020 --> 00:01:23.240
the focal length and the world coordinate,
X, Y, Z.
00:01:23.240 --> 00:01:27.540
We can transform the homogeneous coordinates
to Cartesian coordinates using the rule that
00:01:27.540 --> 00:01:31.820
we talked about in the last section and with
a little bit of rearrangement, we can bring
00:01:31.820 --> 00:01:37.560
the equation into this form and this is exactly
the same form as we derived in the last lecture
00:01:37.560 --> 00:01:40.040
by looking at similar triangles.
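As a minimal sketch of those two steps, the matrix multiplication and then the conversion to Cartesian form can be written in a few lines of Python. The function name project_central and the numbers are hypothetical, chosen only for illustration, and the point is assumed to be expressed in the camera's own frame.

```python
def project_central(point, f):
    """Project a camera-frame point (X, Y, Z) with the central projection model."""
    X, Y, Z = point
    # 3x4 projection matrix: scales X and Y by f and copies Z into the third row.
    P = [
        [f, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    world_h = [X, Y, Z, 1.0]  # homogeneous world coordinates
    # Matrix-vector product gives the homogeneous image-plane point.
    x_t, y_t, z_t = (sum(a * b for a, b in zip(row, world_h)) for row in P)
    # Homogeneous -> Cartesian: divide the first two elements by the third.
    return x_t / z_t, y_t / z_t


# Hypothetical numbers: a 15 mm lens and a point 3 m in front of the camera.
f = 0.015
x, y = project_central((0.3, 0.6, 3.0), f)
```

The result agrees with the similar-triangles form, x = fX/Z and y = fY/Z, and the division by Z only happens in the last line.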
00:01:40.040 --> 00:01:45.020
What's really convenient and useful about
this homogeneous representation of the image
00:01:45.020 --> 00:01:48.560
formation process is that it is completely linear.
00:01:48.560 --> 00:01:55.060
We don't have this explicit division by Z,
the distance between the camera and the object.
00:01:55.060 --> 00:01:59.240
It's implicit in the way we write the equations
in homogeneous form.
00:01:59.240 --> 00:02:03.580
Let's look at this equation again and we can
factor this matrix into two.
00:02:03.580 --> 00:02:08.780
The matrix on the right has elements that
are either 0 or 1 or f, the focal length of the lens.
00:02:08.780 --> 00:02:12.220
So this matrix performs the scaling and zooming.
00:02:12.220 --> 00:02:14.400
It’s a function of the focal length of our lens.
00:02:14.400 --> 00:02:19.180
The matrix on the left has got an interesting
shape, it’s only a three by four and this
00:02:19.180 --> 00:02:24.200
matrix performs the dimensionality reduction,
crunches points from three dimensions down
00:02:24.200 --> 00:02:25.220
into two.
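A small sketch can check this factoring numerically. Reading the description literally gives a three by four matrix on the left and a four by four scaling matrix on the right. A common textbook ordering instead puts a three by three scaling matrix on the left of the three by four. Both give the same product. The names and the focal length below are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


f = 0.015  # hypothetical focal length in metres

# The unfactored 3x4 projection matrix.
P = [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, 1, 0]]

# 3x4 matrix of zeros and ones: drops one dimension.
reduce_3x4 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

# Literal reading: 3x4 on the left, 4x4 scaling on the right.
scale_4x4 = [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
literal = matmul(reduce_3x4, scale_4x4)

# Common textbook ordering: 3x3 scaling on the left, 3x4 on the right.
scale_3x3 = [[f, 0, 0], [0, f, 0], [0, 0, 1]]
textbook = matmul(scale_3x3, reduce_3x4)
```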
00:02:25.220 --> 00:02:28.380
So far, we have considered the image plane to
be continuous.
00:02:28.380 --> 00:02:31.260
In reality, the image plane is quantized.
00:02:31.260 --> 00:02:36.520
It consists of a massive array of light sensing
elements which correspond to the pixels in
00:02:36.520 --> 00:02:38.100
the output image.
00:02:38.100 --> 00:02:44.300
The dimension of each pixel in this grid,
I’m going to denote by the Greek letter rho.
00:02:44.300 --> 00:02:48.640
So the pixels are ρu wide and they’re ρv high.
00:02:48.640 --> 00:02:53.440
Pixels are really, really small so the width
and height of a pixel are often on the order
00:02:53.440 --> 00:02:57.420
of around 10 microns, maybe a bit bigger,
maybe a bit smaller.
00:02:57.420 --> 00:03:02.800
What we need to do now is to convert the coordinate
P, which we computed previously, and that
00:03:02.800 --> 00:03:07.660
was in units of meters with respect to the
origin of the image plane.
00:03:07.660 --> 00:03:10.540
We need to convert it to units of pixels.
00:03:10.540 --> 00:03:16.180
Pixel coordinates are measured from the top-left
corner of the image so we need to do a scaling
00:03:16.180 --> 00:03:19.820
and we need to do a shifting and that’s
a simple linear operation.
00:03:19.820 --> 00:03:25.320
So if we have the Cartesian x and y coordinates
of the point P on the image plane, we can
00:03:25.320 --> 00:03:31.660
convert that to the equivalent pixel coordinate
which we denote by the coordinates u and v
00:03:31.660 --> 00:03:34.720
and we can represent that again in homogeneous form.
00:03:34.720 --> 00:03:39.760
Here we multiply by a matrix, the elements
of the matrix are the dimensions of the pixel,
00:03:39.760 --> 00:03:44.540
ρu and ρv, and the coordinates of what’s
called the principal point.
00:03:44.540 --> 00:03:50.800
The principal point is the pixel coordinate
where the z axis of the camera origin frame
00:03:50.800 --> 00:03:52.660
pierces the image plane.
00:03:52.660 --> 00:03:57.920
The homogeneous pixel coordinates can be converted
to the more familiar Cartesian pixel coordinates
00:03:57.920 --> 00:04:01.260
u and v by the transformation rule that we
covered earlier.
00:04:01.260 --> 00:04:05.460
Essentially, we take the first and second
elements of the homogeneous vector and divide
00:04:05.460 --> 00:04:08.460
each by the third element of the homogeneous vector.
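Here is a rough sketch of that scale and shift, assuming the usual form of the matrix in which the pixel dimensions enter as reciprocals, 1/ρu and 1/ρv, and the principal point (u0, v0) sits in the last column. The names to_pixels, rho_u, rho_v, u0 and v0 are hypothetical, as are the numbers.

```python
def to_pixels(x, y, rho_u, rho_v, u0, v0):
    """Convert image-plane coordinates in metres to pixel coordinates (u, v)."""
    # Scale by 1/rho to go from metres to pixels, then shift by the principal point.
    K_pix = [
        [1.0 / rho_u, 0.0, u0],
        [0.0, 1.0 / rho_v, v0],
        [0.0, 0.0, 1.0],
    ]
    p_h = [x, y, 1.0]  # homogeneous image-plane point
    u_t, v_t, w_t = (sum(a * b for a, b in zip(row, p_h)) for row in K_pix)
    # Homogeneous -> Cartesian: divide the first two elements by the third.
    return u_t / w_t, v_t / w_t


# Hypothetical sensor: 10 micron square pixels, principal point at (640, 512).
u, v = to_pixels(0.0015, 0.003, 10e-6, 10e-6, 640.0, 512.0)

# A point on the optical axis lands on the principal point.
centre = to_pixels(0.0, 0.0, 10e-6, 10e-6, 640.0, 512.0)
```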
00:04:08.460 --> 00:04:13.320
Now, we can put all these pieces together
and we can write the complete camera model
00:04:13.320 --> 00:04:16.240
in terms of three matrices.
00:04:16.240 --> 00:04:21.920
The product of the first two matrices is typically
denoted by the symbol K and we refer to these
00:04:21.920 --> 00:04:23.900
as the intrinsic parameters.
00:04:23.900 --> 00:04:27.540
All the numbers in these two matrices are
functions of the camera itself.
00:04:27.540 --> 00:04:30.660
It doesn't matter where the camera is, or
where it’s pointing, they’re only a function
00:04:30.660 --> 00:04:31.820
of the camera.
00:04:31.820 --> 00:04:36.920
These numbers include the height and width
of the pixels on the image plane, the coordinates
00:04:36.920 --> 00:04:40.280
of the principal point, and the focal length
of the lens.
00:04:40.280 --> 00:04:45.620
The third matrix describes the extrinsic parameters
and these describe where the camera is,
00:04:45.620 --> 00:04:48.300
but they don’t say anything about the type of camera.
00:04:48.300 --> 00:04:54.260
The elements in this matrix are a function
of the relative pose of the camera with respect
00:04:54.260 --> 00:04:56.320
to the world origin frame.
00:04:56.320 --> 00:05:02.640
In fact, it is the inverse of xi C. The product
of all of these matrices together is referred
00:05:02.640 --> 00:05:08.980
to as the camera matrix and it’s often given
the symbol C.
00:05:08.980 --> 00:05:15.220
So this single matrix, this single three by four
matrix, is all we need to describe the mapping
00:05:15.220 --> 00:05:21.320
from a world coordinate, X, Y and Z, through
to a homogeneous representation of the pixel
00:05:21.320 --> 00:05:23.820
coordinate on the image plane.
00:05:23.820 --> 00:05:28.740
That homogeneous image plane coordinate can
be converted to the familiar Cartesian image
00:05:28.740 --> 00:05:32.280
plane coordinate using this transformation
rule here.
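Putting the pieces together, one possible sketch of the whole chain is shown below. It assumes the camera pose is given as a four by four rigid-body transform, which is inverted to get the extrinsic part. All function names and numbers are hypothetical and only meant to illustrate the structure.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def invert_pose(T):
    """Invert a 4x4 rigid-body transform [R t; 0 1] as [R^T  -R^T t; 0 1]."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [list(col) for col in zip(*R)]  # transpose of the rotation
    t_inv = [-sum(r * ti for r, ti in zip(row, t)) for row in Rt]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def camera_matrix(f, rho_u, rho_v, u0, v0, T_cam):
    """Build the 3x4 camera matrix from intrinsic values and the camera pose."""
    pixel = [[1.0 / rho_u, 0.0, u0], [0.0, 1.0 / rho_v, v0], [0.0, 0.0, 1.0]]
    focal = [[f, 0.0, 0.0, 0.0], [0.0, f, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    # Intrinsic part (first two matrices) times the inverse of the camera pose.
    return matmul(matmul(pixel, focal), invert_pose(T_cam))


def project(C, point):
    """Map a world point (X, Y, Z) to Cartesian pixel coordinates (u, v)."""
    X, Y, Z = point
    u_t, v_t, w_t = (r[0] * X + r[1] * Y + r[2] * Z + r[3] for r in C)
    return u_t / w_t, v_t / w_t


# Hypothetical pose: camera shifted 0.5 m along the world x axis, no rotation.
T_cam = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
C = camera_matrix(0.015, 10e-6, 10e-6, 640.0, 512.0, T_cam)
u, v = project(C, (0.8, 0.6, 3.0))
```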
00:05:32.280 --> 00:05:37.580
So this is a very simple and concise way of
performing perspective projection.
00:05:37.580 --> 00:05:42.600
Let’s consider now what happens when I introduce
a non-zero scale factor lambda.
00:05:42.600 --> 00:05:49.460
The homogeneous coordinate elements u tilde,
v tilde, and w tilde will all be scaled by lambda.
00:05:49.460 --> 00:05:55.440
When I convert them to Cartesian form, the
lambda term appears in both the numerator
00:05:55.440 --> 00:05:58.920
and the denominator and cancels, so the result will be unchanged.
00:05:58.920 --> 00:06:03.780
This is a particular advantage of writing
the relationship in homogeneous form.
00:06:03.780 --> 00:06:06.720
It gives us what’s called scale invariance.
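In symbols, and using the tilde notation from earlier for the homogeneous elements, the cancellation can be written as follows for any non-zero lambda.

```latex
u = \frac{\lambda\tilde{u}}{\lambda\tilde{w}} = \frac{\tilde{u}}{\tilde{w}},
\qquad
v = \frac{\lambda\tilde{v}}{\lambda\tilde{w}} = \frac{\tilde{v}}{\tilde{w}},
\qquad \lambda \neq 0
```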
00:06:06.720 --> 00:06:10.780
Because we can multiply the camera matrix
by an arbitrary scale factor, it means we
00:06:10.780 --> 00:06:16.540
can write the camera matrix in a slightly
simplified form, which we refer to as a normalised
00:06:16.540 --> 00:06:17.840
camera matrix.
00:06:17.840 --> 00:06:22.960
We do that by choosing one particular element
of that matrix to have a value of one and
00:06:22.960 --> 00:06:27.760
typically we choose the bottom-right element
and set it to one.
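A minimal sketch of that normalisation is below. It assumes the bottom-right element is non-zero, otherwise a different element would have to be chosen. The matrix values and the names normalise and project are hypothetical.

```python
def normalise(C):
    """Scale a 3x4 camera matrix so that its bottom-right element equals one."""
    scale = C[2][3]
    if scale == 0:
        raise ValueError("bottom-right element is zero; choose another element")
    return [[c / scale for c in row] for row in C]


def project(C, point):
    """Map a world point (X, Y, Z) to Cartesian pixel coordinates (u, v)."""
    X, Y, Z = point
    u_t, v_t, w_t = (r[0] * X + r[1] * Y + r[2] * Z + r[3] for r in C)
    return u_t / w_t, v_t / w_t


# Hypothetical camera matrix, values chosen for illustration only.
C = [
    [1500.0, 0.0, 640.0, 200.0],
    [0.0, 1500.0, 512.0, 100.0],
    [0.0, 0.0, 1.0, 2.0],
]
C_norm = normalise(C)

# The same world point projects to the same pixel with either matrix.
before = project(C, (0.3, 0.6, 3.0))
after = project(C_norm, (0.3, 0.6, 3.0))
```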
00:06:27.760 --> 00:06:33.060
This normalised camera matrix still contains
all of the information to completely describe
00:06:33.060 --> 00:06:34.880
the image formation process.
00:06:34.880 --> 00:06:39.520
It contains the focal length of the lens,
it contains the dimensions of the pixels,
00:06:39.520 --> 00:06:45.020
it contains the coordinate of the principal
point, and it contains the position and orientation
00:06:45.020 --> 00:06:47.920
of the camera in three-dimensional space.
00:06:47.920 --> 00:06:52.940
And finally, we can convert the homogeneous
pixel coordinates to the more familiar Cartesian
00:06:52.940 --> 00:06:56.320
pixel coordinates, which we denote by u and v.