Camera Recording Tips for Autonomous and Advanced Driving Assistance Systems
When acquiring video data from a camera for the development of autonomous vehicles (AV) or advanced driving assistance systems (ADAS), there are several considerations.
Consider the basic functions used in recording camera data, shown in the flow chart in Figure 1.
Figure 1: Flow chart of camera functions.
The optical system (lens, etc.) captures light from the scene or surroundings. The light is converted into an electrical charge by a sensor, and this information is then digitized. The digitized image might be sent to a computer for further processing. As seen on the right side of the figure, some cameras may also output information in addition to the image data. This can include ‘detections’ of traffic lanes, signs, cars, pedestrians, etc. These are referred to as “smart” cameras.
The optical recording infrastructure used in ADAS systems typically has requirements that are unique to the application. These include:
- No compression: what is recorded and used for training or validation of algorithms should be of the highest fidelity possible. This is important when the data is reused in later development tasks, like bench testing of alternative sensors.
- Highly accurate timestamping is needed to understand the correlation between different signals and for accurate offline replay of data.
- Handling of large data streams. The number of cameras, total picture size, frame rate, color depth, etc., all drive very large data volumes and high measurement bandwidth requirements.
Cameras are passive sensors; they capture light that is present in the scene and do not emit their own. This makes them less resilient to low-light environments (e.g. night driving) or light fluctuations (e.g. tunnel entrance or exit). However, it also means cameras do not interfere with each other when multiple devices are used simultaneously. Components of a camera are shown in Figure 2.
Figure 2: Exploded view of camera components.
The optics are critical as they focus the light onto the sensor. They define which part of the image is in focus and directly affect the field of view of the sensor. A large field of view shows more of the environment, but also tends to increase the amount of distortion at the edges of the image. Cameras with such fisheye lenses are typically used at shorter ranges and have found widespread application in parking systems. On the other hand, narrow field-of-view cameras give a zoomed-in view and are typically used for target or lane line detection. ADAS camera lenses used to date are fixed, and do not zoom in or out on the fly.
This article will be split into the following sections:
1. Camera Types
1.1 Basic Cameras
1.2 Image Signal Processing (ISP)
1.3 Smart Cameras
2. Image Quality
2.1 Resolution
2.4 Bit Depth
3. File Size
3.1 File Format
3.2 Frames per Second
3.3 Data Rate
4. Other Considerations
4.1 Time Stamping
4.2 Tapping
1. Camera Types
A regular camera only captures and provides a digital image, with no image processing. Cameras used in ADAS applications normally have various levels of intelligence integrated as part of the package. This image signal processing (ISP) can handle various levels or types of calculations, including data compression, detection and classification, etc.
1.1 Basic Cameras
Camera systems with no processing or intelligence typically only contain the imager and a serializer. The imager converts the light focused by the lens onto the sensor into digital data, and the serializer handles the communication to external devices. This RAW image data is transmitted via high-speed serial communication (GMSL, FPD-Link, etc.), and all of the meta/configuration data is sent over Inter-Integrated Circuit (I2C) protocols.
The latter includes the exposure settings, which means that the Electronic Control Unit (ECU) of the vehicle that is receiving the camera data is responsible for changing those settings when entering a tunnel or adapting to other varying conditions. This load on the ECU can add significant delays in the control loop.
1.2 Image Signal Processing (ISP)
Some cameras include built-in image signal processing (ISP) hardware that can offload some of the computation from the Electronic Control Unit (ECU) of the vehicle. These sensor packages with ISP normally include auto exposure adjustments with the ability to react quickly to changing light conditions. In most ADAS applications, the video processing also includes debayerization. This is a step that converts the RAW pixels (as directly acquired by the camera image sensor), where only one color is available per pixel, to a full image, where the three colors are available per pixel. This increases the data rates at the output of the sensor since more color information is broadcast.
1.3 Smart Cameras
A smart camera or intelligent camera extends basic ISP to include detection and classification capabilities, potentially with integrated ECU communication and control functions. It is capable of extracting application-specific information from the captured images and can generate event descriptions or make decisions that are used in an intelligent and automated system. A smart camera versus a regular camera is shown in Figure 3.
Figure 3: Camera (left) and smart camera (right).
Integrated detection and classification hardware can detect lane lines, traffic signs, cars, pedestrians, etc. This information is normally output over CANBUS in the form of object lists, and the actual image can be made available as an output when developing or debugging the system.
Cameras with integrated ECU and control functions combine all the previous steps with an ADAS function like lane assist, lane warning, etc. Output can be a steering signal, or a warning signal sent over CANBUS. Early trends in the industry were to use a 100% integrated solution built into the camera package. More recent developments are moving back to a centralized ECU that collects the raw data from different sensors, where decisions are made based on the combination of inputs. This allows for a single sensor to be used for different tasks as ADAS functions get more and more complex and integrated (e.g. highway pilot).
2. Image Quality
There are several camera settings or capabilities that affect the quality of the final video image.
2.1 Resolution
A video is composed of a series of images. The image resolution refers to the number of pixels used to represent an individual image. A pixel is the smallest addressable part of an image, as shown in Figure 4.
Figure 4: Image consisting of 41 pixels in width, and 24 pixels in height. Each square is one pixel and can only have one color.
A pixel has a single color and a fixed size. To have a finer resolution image, more pixels must be used to represent the image in the same screen area.
Common video and screen resolutions are shown in Figure 5.
Figure 5: Different image resolutions. The pixel width and height for each resolution are shown.
Resolution is usually defined by two numbers – the width and height in pixels.
For example, the term “4K Ultra HD” is 3840 pixels in width and 2160 pixels in height. This would be expressed as 3840 x 2160.
In Figure 5, the following resolutions are shown:
- 4K Ultra HD – 3840 x 2160 - 8,294,400 pixels total
- Quad HD – 2560 x 1440 - 3,686,400 pixels
- Full HD – 1920 x 1080 - 2,073,600 pixels
- HD – 1280 x 720 - 921,600 pixels
Higher resolutions also make image and video files much larger: doubling both the width and the height increases the total number of pixels by a factor of four.
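The pixel totals above follow directly from multiplying width by height. The short sketch below is only an illustration (the function and dictionary names are made up for this example); it recomputes the totals for three of the listed resolutions and checks the factor of four between Full HD and 4K Ultra HD.

```python
# Width and height in pixels for some of the resolutions listed above.
RESOLUTIONS = {
    "4K Ultra HD": (3840, 2160),
    "Quad HD": (2560, 1440),
    "Full HD": (1920, 1080),
}


def pixel_count(width, height):
    """Total number of pixels in one image."""
    return width * height


for name, (w, h) in RESOLUTIONS.items():
    print(f"{name}: {w} x {h} = {pixel_count(w, h):,} pixels")

# Doubling both dimensions (Full HD to 4K Ultra HD) quadruples the pixel count.
ratio = pixel_count(3840, 2160) / pixel_count(1920, 1080)
print(ratio)  # 4.0
```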
Digital image sensors typically only detect the intensity of light focused through the camera lens. Color is determined via filtering done before the pixel array. There are many different types of color filters used with image sensing arrays, but most are based on either the primary colors Red, Green, Blue (RGB) or secondary colors Cyan, Magenta, Yellow, and Green (CMYG).
These filters are applied per pixel as shown in Figure 6. In the example, red pixels (R) are adjacent to blue pixels (B) and green pixels (G).
Figure 6: Top, Left - An image sensor has filters for specific colors (Red, Green, Blue) that result in three different images (Bottom, Right), each of a single color. Debayering is the process of consolidating the three images into a single full color image.
The color sensitive pixels in the image sensor are at different physical locations. The result is three different images, each in a separate color, that are offset from each other.
Debayering is the process of assembling the three images into one full color image. Because the different color pixels are offset, interpolation is needed. Some color sensing and debayering schemes include:
- RCCB (red, clear, clear, blue): Similar to the Bayer sensor except the green pixels are clear, providing more low-light sensitivity and less noise.
- RCCC (red, clear, clear, clear): A nearly monochrome sensor for maximum sensitivity, which keeps only the red channel needed for regions of interest such as traffic lights and rear lights.
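As a rough illustration of the idea, the sketch below debayers a tiny RAW mosaic in the simplest possible way: each 2 x 2 cell is collapsed into one full color pixel. It assumes an RGGB cell layout (red top left, blue bottom right), which is an assumption for this example and not a statement about any particular sensor. Real ISPs interpolate the missing colors at full resolution; this half-resolution version only shows how single-color pixels are combined.

```python
def debayer_rggb_half(raw):
    """Collapse each 2x2 RGGB cell of a RAW mosaic into one (R, G, B) pixel.

    raw: list of rows of intensity values, with even width and height.
    Returns a half-resolution image as rows of (R, G, B) tuples.
    """
    rgb = []
    for y in range(0, len(raw), 2):
        row = []
        for x in range(0, len(raw[y]), 2):
            r = raw[y][x]           # top-left pixel sits behind a red filter
            g1 = raw[y][x + 1]      # top-right pixel sits behind a green filter
            g2 = raw[y + 1][x]      # bottom-left pixel sits behind a green filter
            b = raw[y + 1][x + 1]   # bottom-right pixel sits behind a blue filter
            row.append((r, (g1 + g2) / 2, b))  # average the two green samples
        rgb.append(row)
    return rgb


# A 2 x 4 RAW mosaic becomes a 1 x 2 full color image.
raw = [
    [200, 100, 180, 90],
    [110, 50, 70, 40],
]
print(debayer_rggb_half(raw))  # [[(200, 105.0, 50), (180, 80.0, 40)]]
```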
The effects of debayering are shown in Figure 7:
Figure 7: The image with debayering (left) shows a person in the shadow of a building who is difficult to see in the image on the right. The lane lines are also clearer in the debayered image.
The image on the left has been debayered. The image details (people in shadow, lines on street) are clearer than the image on the right that has not been debayered.
Cameras are passive sensors; they capture light that is present in the scene and do not emit their own. This makes them less resilient to low-light environments (e.g. night driving) or light fluctuations (e.g. tunnel entrance or exit).
The exposure dictates what details can be seen in an image as shown in Figure 8:
Figure 8: Image showing under exposure (left), correct exposure (middle), and over exposure (right).
Ideally, an image is neither under nor over exposed:
- Underexposed: Image is too dark. Details will be lost in the shadows and the darkest areas of the image.
- Overexposed: Image is too light. Details will be lost in the highlights and the brightest parts of the image.
If the amount of light exceeds the sensor capability, the whites are saturated or overexposed. The sensor only sees pure white but no details. With an underexposed image, parts of the image become pure black, and no details can be seen. This usually starts to appear where objects cast shadows in the image.
Exposure in most cameras is controlled by adjusting the exposure time and aperture (the opening of the lens diaphragm) based on feedback from an internal light meter. The idea is to match the amount of light that reaches the camera to be within the minimum and maximum that the light sensor can handle. Note that cameras used in autonomous vehicles often do not have an adjustable aperture.
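One simple way to spot the under and over exposure described above is to count how many pixels sit at the very bottom or very top of the sensor range. The sketch below is a minimal illustration, not how any particular camera meters light; the function names and the 25% limit are hypothetical choices for this example.

```python
def clipped_fractions(pixels, bit_depth=8):
    """Return (dark_fraction, bright_fraction) for a list of intensity values.

    A pixel counts as dark when it is 0 (pure black) and as bright when it
    equals the maximum value for the given bit depth (pure white).
    """
    max_value = 2 ** bit_depth - 1
    dark = sum(1 for p in pixels if p == 0)
    bright = sum(1 for p in pixels if p == max_value)
    return dark / len(pixels), bright / len(pixels)


def classify_exposure(pixels, bit_depth=8, limit=0.25):
    """Label an image by the share of clipped pixels.

    limit is a hypothetical threshold chosen for illustration only.
    Saturated whites are checked first, so an image that clips at both
    ends is reported as overexposed.
    """
    dark, bright = clipped_fractions(pixels, bit_depth)
    if bright > limit:
        return "overexposed"
    if dark > limit:
        return "underexposed"
    return "ok"


print(classify_exposure([0, 0, 0, 10, 20, 30, 40, 255]))          # underexposed
print(classify_exposure([255, 255, 255, 200, 180, 90, 60, 30]))   # overexposed
print(classify_exposure([12, 40, 90, 120, 160, 200, 230, 250]))   # ok
```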
A camera equipped with a fisheye lens also has areas of varying exposure as seen in Figure 9.
Figure 9: The Region of Interest (ROI) in image from a camera with fisheye lens is well exposed in center, but darker in corners.
The light exposure at the corners of the fisheye image is different than in the center due to distortion caused by the lens. In the case of a fisheye lens, the exposure is typically optimized for the central area or region of interest (ROI).
2.4 Bit Depth
The bit depth dictates the number of colors available in an image. For example, 8-bit color provides two raised to the eighth power, or 256, colors.
While 256 colors sound like a lot, eight-bit color causes images to have banded colors instead of fine gradients as shown in Figure 10.
Figure 10: The same image shown in 10-bit color (left) versus 8-bit color (right). Notice that the sky around the sun has color bands in the 8-bit image that are not present in the 10-bit image.
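Using a higher number of bits (for example 10) reduces banding but increases the resulting file size. 10-bit color has 1024 colors in total, four times as many as 8-bit color. The uncompressed file size grows in proportion to the number of bits stored per pixel (a factor of 10/8, or 1.25, if the samples are tightly packed) when all other settings are kept constant.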
What is the best number of bits? It is estimated that the human eye can perceive around 10 million colors. 24-bit color has about 16.7 million colors, which exceeds 10 million. Because reprocessing an image loses color levels (an effect referred to as posterization), it is often preferable to exceed the 10 million colors required.
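The color counts quoted in this section are powers of two, which the sketch below recomputes. It also shows how storage scales with bit depth under the assumption that samples are tightly packed; many formats instead pad 10-bit samples into 16-bit words, in which case the size ratio is larger. The function names are illustrative only.

```python
def color_levels(bits):
    """Number of distinct values that can be represented with the given bits."""
    return 2 ** bits


def relative_storage(bits_a, bits_b):
    """Storage ratio of bits_a versus bits_b, assuming tightly packed samples."""
    return bits_a / bits_b


print(color_levels(8))    # 256
print(color_levels(10))   # 1024
print(color_levels(24))   # 16777216, about 16.7 million

# Four times as many values, but only 1.25 times the bits per sample.
print(color_levels(10) // color_levels(8))  # 4
print(relative_storage(10, 8))              # 1.25
```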
3. File Size
When collecting video data, the resulting file size is a function of the file format, image resolution, frames per second, and recording duration.
3.1 File Format
When capturing the uncompressed images from a camera, there are specific file formats used to ensure no data is lost:
- RAW file format: RAW refers to a native digital camera file and can be any format.
- YUV file format: YUV is a file extension for a raster graphics file often associated with the Color Space Pixel Format. YUV files contain bitmap image data stored in the YUV format, which splits color across Y, U, and V values. It stores the brightness (luminance) as the Y value, and the color (chrominance) as U and V values.
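To make the split into brightness and color values concrete, the sketch below converts one RGB pixel to Y, U, and V. It assumes the classic analog BT.601 weighting; cameras and file formats may use other coefficients, offsets, or value ranges, so treat the numbers as an illustration rather than the definition of any specific YUV file layout.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to (Y, U, V) using BT.601 weights (an assumption).

    Y carries the brightness (luminance); U and V carry the color
    (chrominance) as scaled blue and red differences from Y.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


# A neutral (white) pixel has full brightness and almost no chrominance.
print(rgb_to_yuv(255, 255, 255))

# A pure red pixel has low brightness, negative U, and strongly positive V.
print(rgb_to_yuv(255, 0, 0))
```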
Some cameras have an image signal processor (ISP). RAW formats may include video data before the ISP, and/or processed data streamed from the ISP.
File formats like MPEG-4 (*.mp4) are generally not used in ADAS or AV applications because the image is compressed.
3.2 Frames per Second
Major motion pictures show about 24 frames (or images) per second on the screen. This gives the illusion of motion.
Today, 30 frames per second is common for recorded video intended for human viewing. When the video will be replayed at a slower speed, or processed by a computer that reacts faster than the human eye, 60 frames per second is often used.
A comparison of the same motion (red line moving from left to right) captured at different frame rates is shown in Figure 11.
Figure 11: Same motion captured at 60 frames per second (top), 30 frames per second (middle), and 15 frames per second (bottom).
A 60 frames per second video is twice as large as a 30 frames per second video when all other settings are kept the same.
3.3 Data Rate
Data rate is the number of Megabytes (MB or Mbyte) per second (s) that are acquired by a video camera. Some example data rates for different scenarios are shown in Figure 12 below:
Figure 12: Example data rate calculations (MB/s) for video data. File format (RAW or YUV), image size (based on pixel resolution), and frames per second (FPS).
Typical data rates of the serial protocols of cameras are:
- FPD-Link is 320 Mbyte per second maximum
- GMSL is 600 Mbyte per second maximum
The 10Gb port of a Simcenter SCAPTOR recorder is 660 Mbyte per second max.
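The data rate of uncompressed video can be estimated from the image size, the bits stored per pixel, and the frame rate, then compared with the link maximums listed above. The sketch below is a rough estimate under stated assumptions: 16 bits per pixel is a hypothetical value chosen for the example, 1 MB is taken as 1,000,000 bytes, and protocol overhead is ignored, so the results need not match the figures shown in Figure 12.

```python
# Maximum rates in MB/s, taken from the values listed above.
LINK_LIMITS_MB_PER_S = {"FPD-Link": 320, "GMSL": 600, "SCAPTOR 10Gb port": 660}


def data_rate_mb_per_s(width, height, bits_per_pixel, fps):
    """Uncompressed video data rate in MB/s (1 MB = 1,000,000 bytes)."""
    bytes_per_frame = width * height * bits_per_pixel / 8
    return bytes_per_frame * fps / 1e6


def links_that_fit(rate_mb_per_s):
    """Names of the links whose maximum rate covers the given data rate."""
    return [name for name, limit in LINK_LIMITS_MB_PER_S.items()
            if rate_mb_per_s <= limit]


# Assumed example settings: 16 bits per pixel at 30 frames per second.
full_hd = data_rate_mb_per_s(1920, 1080, 16, 30)
uhd = data_rate_mb_per_s(3840, 2160, 16, 30)

print(round(full_hd, 3), links_that_fit(full_hd))
print(round(uhd, 3), links_that_fit(uhd))
```

With these assumed settings, the Full HD stream is about 124 MB/s and fits every listed link, while the 4K Ultra HD stream is about 498 MB/s and exceeds the FPD-Link maximum.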
A sample calculation for the amount of data generated in an eight-hour recording is shown in Figure 13.
Figure 13: An eight-hour recording from a single camera produces about 10 Terabytes of data.
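The arithmetic behind such an estimate is the data rate multiplied by the recording time. The sketch below is an illustration only: the 350 MB/s rate is a hypothetical value picked because it lands near 10 Terabytes over eight hours, not a number taken from Figure 13, and 1 Terabyte is taken as 1,000,000 MB.

```python
def recording_size_tb(rate_mb_per_s, hours):
    """Total data in Terabytes for a constant data rate (1 TB = 1,000,000 MB)."""
    seconds = hours * 3600
    return rate_mb_per_s * seconds / 1e6


def hours_until_full(capacity_tb, rate_mb_per_s):
    """Recording time in hours before a drive of the given capacity is full."""
    return capacity_tb * 1e6 / (rate_mb_per_s * 3600)


# Hypothetical single-camera rate of 350 MB/s recorded for eight hours.
print(recording_size_tb(350, 8))  # 10.08

# Time to fill a 64 Terabyte drive at the same assumed rate.
print(round(hours_until_full(64, 350), 1))
```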
To accommodate the data rates and file sizes required to record high-resolution video, the Simcenter SCAPTOR has a 64 Terabyte removable drive and 10 Gb ethernet connections.
4. Other Considerations
When acquiring camera data, there are more concerns than just image quality and file size. The camera data is acquired in conjunction with other signals in the vehicle. Some of the considerations include:
4.1 Time Stamping
In autonomous vehicle applications, camera data is acquired in parallel with data from other sensors. The Electronic Control Unit (ECU) makes decisions in fractions of a second. If there are timing delays between the information from the cameras and other sensors, this will cause problems.
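To illustrate why timing matters, the sketch below pairs each camera frame with the nearest sample from another sensor and flags pairs whose time offset exceeds a tolerance. It is a minimal sketch under assumptions: timestamps are integers in microseconds on a shared clock, the sensor timestamps are sorted, and the function names and the 5 ms tolerance are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the value in sorted timestamps that is closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    # Prefer the earlier sample when both are equally close.
    return i if after - t < t - before else i - 1


def match_frames(frame_times, sensor_times, tolerance):
    """Pair each frame with its nearest sensor sample.

    Returns a list of (frame_time, sensor_time, within_tolerance) tuples.
    """
    matches = []
    for ft in frame_times:
        st = sensor_times[nearest_index(sensor_times, ft)]
        matches.append((ft, st, abs(st - ft) <= tolerance))
    return matches


# Timestamps in microseconds; the third frame has no sensor sample within 5 ms.
frames = [0, 33000, 67000]
sensor = [1000, 30000, 80000]
for frame_time, sensor_time, ok in match_frames(frames, sensor, tolerance=5000):
    print(frame_time, sensor_time, ok)
```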
Therefore, precision time protocols need to be used when storing and archiving the camera data that is acquired and used for playback and checking sensors. Standards such as IEEE-1588 can be used for precise time stamping of data carried in packets on ethernet. Dedicated devices such as the Simcenter SCAPTOR MDILink are used to convert serial data from cameras to ethernet packets to allow precise time stamping with other signals.
4.2 Tapping
When recording the output of a camera, “tapping” may be necessary. If a camera is attached to an Electronic Control Unit (ECU) in an autonomous vehicle, it might be desirable to record the output of the camera without interfering with the functioning of the ECU, as shown in Figure 14.
Figure 14: In this vehicle, the two front cameras (orange) are connected directly to Electronic Control Units (blue) making it difficult to connect to the recording unit (black).
The solution is “tapping” as shown in Figure 15.
Figure 15: With “tapping” the cameras can be recorded while still attached to the ECUs of the vehicle.
The Simcenter SCAPTOR MDILink has both receiving and transmission ports so it can record a video signal while not interfering. This is known as tapping and is shown in Figure 16.
Figure 16: The Simcenter SCAPTOR MDILink has a tapping mode (left) which allows an ECU to function while still outputting to the recording device (right).