## Self-Driving Cars

### Table of Contents

1. Autonomous Vehicles

1.1 Overview

1.2 History

2. Autonomous Emergency Braking

3. Sensors

3.1 Vision Systems

3.2 Radar and Lidar

3.3 Sonar

4. Algorithms

4.1 Segmentation

4.1.1 Edge Detection

4.1.2 Edge Linking and Boundary Detection

4.2 Representation and Description

4.3 Object Recognition

5. References

6. Figures

### 1. Autonomous Vehicles

### 1.1 Overview

The dream of autonomous vehicles is here, and these robotic cars react faster than humans, have 360-degree perception, and don’t get distracted, sleepy, or intoxicated.^{1} A staggering 37,423 people died in car accidents in the US in 2008. This could have been drastically reduced if the technology was available to the public.^{1} Test cars successfully drove 1,000 miles without human intervention, and more than 140,000 miles with occasional human control.^{1 }Along with the driver who can intervene if necessary, there is a technician in the passenger seat to monitor the navigation systems.^{1}

The vehicle captures its environment with a camera, a sensor that uses electromagnetic energy outside the visual spectrum, or sound. It analyzes the images with a microprocessor to detect and avoid obstacles in real-time. The images are segmented so that the car can react optimally. The main obstacles are other vehicles, curbs, guardrails, cones, pedestrians, bicycles, and animals. Google’s self-driving car recognizes most obstacles with its many sensor types but sometimes objects such as harmless trash cause the vehicle to veer unnecessarily. There are some barriers to detection, including temporary traffic lights, but they should be resolved by 2020. Additionally, the car’s lidar (light radar) sensor has difficulty spotting some potholes, discerning when a police officer signals the car to stop, and may spot “invisible” obstacles, such as exhaust. Centering the car on the road, which is often undulating, is a good starting point for autonomous driving. However, the technology must also be aware of the obstacles on the road to be safe for the masses. An example of an obstacle is a car that suddenly brakes. Sensors onboard the autonomous vehicle must interpret the rapidly slowing car ahead and react in real-time to avoid an accident.

### 1.2 History

Sebastian Thrun is the brainchild of the self-driving car. He is director of the Stanford Artificial Intelligence Laboratory, a Google engineer, and co-inventor of the Street View mapping surface.^{1} In 2005, he led a team of Stanford students and faculty in designing the Stanley robot car and won the second Grand Challenge event of the DARPA competition. This netted a $2M Pentagon prize for autonomous driving over 132 miles in the desert.^{1} Less than two years later, another event proved that autonomous vehicles could drive safely in urban settings.^{1} The project team hired 15 engineers and more than a dozen people to sit in the driver’s seat.^{1}

Google’s self-driving car concept began in 2008 and was presented without pedals and a steering wheel in May 2014. A fully functioning prototype was released in December 2014 and was test-driven on San Francisco Bay Area roads in early 2015. In 2012, Google’s founder Sergey Brin announced the Google Self-Driving car will be available to the public in 2017. In 2014, project director Chris Urmson this schedule, who indicated a possible release from 2017 to 2020. It incorporates $150,000 in image acquisition and processing equipment, including a $70,000 lidar system (Velodyne 64-beam laser rangefinder) mounted on the top.

The Google car is equipped with three radar sensors in the front and one in the back, a scanning lidar on the top, which creates a detailed and highly accurate 3D digital map of the environment to estimate the location of the car, and a position sensor on the left rear wheel to accurately measure movements and position on the map. The advantages of using radar, or lidar, compared to cameras are important to understand. The cameras provide high-resolution, often color images that the driver can easily interpret. Radar has the advantage of a greater range and is primarily used to detect large objects and collect data in inclement weather without image distortion. Both types of sensors will be necessary for the foreseeable future.

The lidar sensors create multiple high-dynamic-range (HDR) images combined in burst mode. With image recognition software, objects can be identified up to 600 feet away without blind spots. The car follows a route programmed into the GPS navigation system and drives the speed limit included in its database for each road.^{1} A rear-view mirror camera detects traffic lights and moving objects.^{1} This technology can be installed in existing cruise control cameras. Faster image processors have enabled smarter algorithms that differentiate between pedestrians and vehicles, and color cameras can detect when the vehicle ahead brakes. For safety, the car also makes announcements, such as “approaching a crosswalk” or “turn ahead” to alert the driver in the unlikely event he should have to take control of the car.^{1}

The Tesla Model S autopilot car uses a camera and radar in the front, and 12 long-range UV sensors. A narrow laser beam maps physical features in the environment with resolutions up to 30 cm/px, up to 75m away. The hardware cost $4,250 and was added to all new cars in October 2014. This package includes software that will engage in auto-steering or “autopilot”, but unlike the Google car, the driver is still advised to pay attention to the road to assist if necessary. Additional capabilities include autonomous lane changes when the turn signal is activated, and warning the driver when he is speeding. The car’s cameras read the speed limit signs, and if the driver is speeding, an announcement is made. Additionally, the vehicle can automatically slow down when the speed limit is exceeded.

### 2. Autonomous Emergency Braking

Autonomous Emergency Braking (AEB), or Emergency Braking System (EBS), is used to help stop the vehicle faster. Automatic cruise control (ACC) radar sensors are used as part of the collision mitigation brake system (CMBS). If the gap with the car in front becomes too small, the car automatically brakes and accelerates again to maintain a safe distance. AEB uses radar, lidar, ultrasonic, infrared, or optical sensors to identify potentially hazardous targets.^{2,3} One configuration uses a camera to identify an object and then uses radar to detect its relative location.^{2} The effectiveness of the vehicle avoiding a crash is between 10% – 72%, with fatal rear-end crashes reduced by 36% – 44%.^{2}

When Mercedes introduced the technology in the late 1990s, they claimed it helped shorten stopping distance by 45% (with only a 10% decrease in stopping distance for skilled drivers).^{2} In 1996, AEB was introduced in the Mercedes-Benz S-Class and SL-Class models, and in 1998 Mercedes made the feature standard on all its vehicles. Since then, several companies, including Acura, Audi, BMW, Infiniti, Land Rover, Rolls Royce, and Volvo, have offered their versions. The European Commission (EC) estimates that brake assist could save 1,100 lives per year. According to the US Insurance Institute for Highway Safety (IIHS), there are an astounding 400,000 25-year-olds who die in car accidents per year (>1,000 people per day). Worldwide, the number of deaths from car accidents is a staggering 1.3M deaths per year (or 3,287 deaths per day).

### 3. Sensors

In the localization process, a global positioning system (GPS), inertial measurement system (IMS), and a wheel encoder are used. Sensors pinpoint where the car is on the road. Sensor fusion integrates the data from all the different sensors. By combining multiple sensors, the quality of the data is improved, but it is not always reliable. For instance, believing every obstacle report increases the vehicle’s visibility. However, the drawback to reduced “blindness” is that “ghost/invisible/phantom” obstacles are potentially introduced. The best approach to limit confusion is to have each sensor be better at a particular problem and trust that sensor over that particular examination regime. The main types of sensors used for autonomous driving are cameras, radar, and lidar vision systems. Other sensors are ladar (laser radar), thermal, long-wave infrared (uses emitted light instead of reflected light), hyperspectral (consists of a camera that works in many color bands), ultraviolet, and sonar (sound waves).

### 3.1 Vision Systems

The advantages of video are its reliability, low cost, and low power consumption.^{4} However, computer vision is not good enough today to detect all the important features in an image with the reliability necessary for safe driving. The main reason is that they require many central processing units (CPUs) or custom chips to interpret an image. Cameras must deal with lighting variations and shadows and may require illumination at night when headlights do not provide sufficient lighting. Cameras use optics to focus the incoming light on an image chip, which converts the light into a digital signal. A microprocessor takes the digital data as input, processes it according to the software written in memory with sequential digital logic, and calculates the distance and velocities of other vehicles. To process the data in real-time, the CPU should transmit and receive image data at a fast data rate, but the faster the processing speed, the higher the costs. The image sensors in a camera are generally CMOS, but CCD types are also used. A CMOS sensor has many advantages, such as smaller size, reduced power consumption, faster speed, and advanced feature capabilities.

Monocular or stereo-based (binocular) vision systems, on-board the autonomous vehicle, estimate in real-time the position and orientation of the vehicle related to the 3D road plane parameters.^{5} Binocular systems use stereopsis to determine 3D motion and parallax. In visual odometry, the ego-motion problem refers to estimating a car’s relative motion from camera images comparing road lines, street signs, etc.^{5 }The first task to consider in solving the complex problem of autonomous driving is to ensure the car stays centered in its lane and travels at the correct speed. Traveling at a constant speed with cruise control is not a new technology, in fact, it has been available in cars since 1900. The next logical technological step is to include steering cruise control. The monotonous nature of typical highway driving is reduced, and the journey is safer, as the car can hold its speed and position if the driver is distracted or falls asleep.

One configuration is a button that the driver hits when he is in the desired lane at the desired speed on the highway. The button activates the advanced cruise control, which allows the car to hold its lane while maintaining a constant speed.^{6} The lane-keeping assist system (LKAS) uses a camera next to the rearview mirror to monitor road markings, feed the data to a computer, and apply the correct steering torque to the wheel to keep the car centered in its lane. One way to improve the lane holding provided by the LKAS is to incorporate lane changing assist.^{6} The lane changing assist uses cameras to detect cars in nearby lanes as soon as the driver activates the turn signal. If no obstacle was detected, the car will change lanes by applying slight steering adjustments that can be easily overpowered by the driver if necessary.

Lateral control vehicle systems allow a vehicle to follow a lead vehicle by sensing the lead vehicle either with infrared, ultrasound, magnetic markers, etc.^{6} The lateral movement of the vehicle is controlled by a lateral control algorithm.^{6} The image processing algorithms are either based on a pixel intensity profile or vanishing points in the image plane.^{6} The control algorithm is based on a simple classical P-control and a control scheme based on H-.^{6} Automatic lateral control is a two-step process: the first step is to obtain the position of the vehicle with respect to the road from the video image, and the second step uses this information as input to a control algorithm.^{6} Classical controllers (i.e. P, PI, PID), fuzzy logic, neural networks, etc., in conjunction with the algorithms, generate the steering angle of the vehicle necessary to maintain its desired position on the road (figure 1).^{6} For lateral vehicle control, many existing mathematical models are adapted to the specific hardware in use.^{6} Parameter estimation is used for hard-to-measure parameters in the system.^{6} Attempts can be made to design a modern robust controller based on the model obtained by parameter estimation.^{6}

### 3.2 Radar and Lidar

Radar, an acronym for “radio detection and ranging”, uses radio waves as an energy source, which are longer in wavelength than visible light waves. It is better suited to identify large obstacles, such as vehicles or pedestrians. Radar can see through fog, but unfortunately lidar, an acronym for “light detection and ranging”, can detect exhaust smoke. Lidar produces a 3D image by scanning the environment at 10 rev/sec. Behind the emblem on the front of the car, there are radar sensors that constantly monitor the distance of the car in front of it. Radar is used for AEB, which allows the car to recognize if a collision is imminent. The system first warns the driver with an audio message and visual pictogram on the dashboard. If no action is taken, the seat belt pretensioners gently tug the driver, giving a physical warning. If there is still no action, the system will apply brakes.

### 3.3 Sonar

Sonar, an acronym for “sound navigation and ranging”, existed before radar and uses sound propagation to identify obstacles. The autonomous cars use active sonar, which sends a pulse of sound (ping) and listens for reflections (echoes) of the pulse. The pulse is usually generated electronically using a sonar projector, consisting of a signal generator, power amplifier, and electro-acoustic transducer/array. A beamformer is usually used to concentrate the acoustic power into a beam, which can then be swept to cover the search angles of interest. Another option is to use multiple beams. Both sonar and radar sensors have a limited field of view, but the sonar range is only six feet compared to the 200-foot range of the radar. Sonar’s acoustic frequencies range from low (infrasonic: 0.001 – 20 Hz) to high (ultrasonic: 20 kHz – several GHz).

### 4. Algorithms

Along with sensors that have improved performance and are less expensive, the study and construction of algorithms that learn and make predictions from sensor data are important to the ongoing progress in autonomous driving. The ability of the car to adapt to new circumstances is a necessary condition. In this effort, it is important to identify and extrapolate patterns: a field of research known as machine learning. Pattern recognition in algorithms is analogous to primate vision, and how neurons in the inferior temporal cortex allow object recognition. One way to assess the effectiveness of machine learning is to give algorithms more experience and measure how much performance in tasks improves (the assumption is that performance improves with more experience). Machine learning predictions are made based on known data, generalizations based on a limited set of data, and performance evaluated based on reproducing known knowledge.

Conversely, knowledge discovery and data mining (KDD) is the discovery of unknown properties in the data. It’s the next step, after solving the task of autonomous driving with machine learning, to make the car’s algorithms smarter, and thus give it the ability to learn without having to provide extensive evidence. Computational learning theory and computational statistics are two underlying areas of machine learning. Machine vision enables some detection of other vehicles, pedestrians, road edges, and road markings. It can also identify objects independent of scale or rotation. However, when a more complex, non-localized analysis of an image takes place, the term computer vision is more appropriate. When it comes to computer vision, rather than machine vision, it is assumed that object recognition is improved.

After surveying the stereo algorithms of the ARPA Unmanned Ground Vehicle program, the SRI algorithm (developed by INRIA) was discovered. The SRI algorithm attempts to produce high-quality matches using hierarchal techniques, but with existing hardware, this method falls short of real-time speed requirements.^{7} The correspondence of selected points at the frame rate was developed by TELEOS.^{7} The selection of these points is critical for the still open problem of obstacle detection.^{7 }Trinocular stereo is a variation that uses three cameras instead of two to simplify matching. Unfortunately, speed is an issue, and algorithms are 5-10 times too slow.^{7}

Two road approximation techniques commonly used for stereo-based driver assistance applications are v-disparity space and Euclidean space.^{7} The v-disparity space approach uses prior knowledge of the approximated road scene and 3D reconstruction software to create a disparity image.^{7} The disparity image and values are used to compute the v-disparity representation, and in the process avoid the need to compute 3D data points, whereas the Euclidean space computes the 3D data.^{7} A plane for the road in the Euclidean space becomes a straight line in the v-disparity space, and subsequently, the plane fitting becomes segment fitting.^{7} Instead of segment fitting using the least-squares method, the Hough transform (HT) with a voting scheme is proposed to extract the largest subset of points that define a straight line.^{7} The results showed that working in the Euclidean space, using a filtering technique, achieves better results over all road surface contours than those obtained in v-disparity space using the HT.^{7} Nevertheless, the extraction of planar representations with the HT in v-disparity space is faster and better suited for highway driving where the road is flat.^{7} The Euclidean space approach is two-stage, consisting of structuring 3D data points to perform real-time processing, followed by performing a fitting technique to find the best surface to approximate those data points.^{7}

Due to its simplicity, the method of least squares is the most common fitting method.^{7} It minimizes the sum of the squares of the offsets, instead of taking the absolute values of the offsets.^{7} This allows the residuals to be treated as continuously differentiable quantities.^{7} A 2D (i.e. lateral view YZ plane) representation of the 3D data points is generated, and then a RANSAC-based least-squares approach is used to fit a plane to the road candidate points.^{7} The sensor used for data acquisition was a commercial stereo vision system called Bumblebee (ptgrey).^{7}

### 4.1 Segmentation

Image segmentation, also known as classification, takes an input image and groups it into its constituent regions or objects. The output image is representative of the attributes extracted from the input image. In 1925, Max Wertheimer, a renowned psychologist who was one of the three founders of Gestalt psychology, listed several key factors that led to the perceptual organizing of an image, including similarity, proximity, and good continuation.^{8} There are many ways to partition an image into subsets, and there may not be a single correct answer, but a Bayesian approach is preferred.^{8} A Bayesian approach means the level of detail to which the subdivision is carried out depends on the problem to be solved. Image segmentation should be performed downwards from the big picture, much like a painter who first marks the major areas.^{8}

The segmentation of the environment around a vehicle is a non-trivial task. The partitioning is inherently hierarchical, and thus more like a tree structure than a flat one.^{8} Low-level cues, such as brightness, color, texture, and motion attributes, are combined with mid- and high-level cues to either confirm the low-level group or select one for further attention.^{8} Improving the probability of accurate segmentation is critical, as the eventual success or failure of computerized analysis procedures hinges on this accuracy.^{8} Selecting the types of sensors to enhance the objects of interest and diminish the contribution of irrelevant objects is important. The result of segmentation is a modified representation of an image into something more meaningful and easier to analyze. Segmentation algorithms are typically based on two properties of intensity values: discontinuity and similarity, both of which are defined by a set of predefined criteria.^{9} Discontinuities locate objects or boundaries, such as points, lines, and edges, using derivatives when the signal is continuous, or difference methods when it is discrete.

### 4.1.1 Edge Detection

Edges, where two regions meet, are detected by applying first or second-order derivatives (i.e. spatial-frequency filters: directional, high-pass, medium-pass, and low-pass).^{9} The magnitude of the first derivative can be used to detect the presence of an edge, and the sign of the second derivative can be used to determine whether an edge pixel lies on the light or dark side. In addition, the zero-crossings can locate the center of thick edges.^{9} Detecting the location and orientation of traffic lines relative to the car is a method of keeping the car centered in its lane. The edges of the road lines are often blurred by focusing mechanisms and electronic components that cause noise. Fortunately, image smoothing reduces noise because the derivatives are sensitive to it.^{9} If the edge transition is more than one pixel across, modeling the line as a ramp or roof edge is more appropriate than the step edge.^{9 }The edge points become any point in the edge profile, and the set of connected points is an edge segment.^{9}

The models allow mathematical expressions for edges in the development of algorithms.^{9} The performance of the algorithm is the difference between the actual edges and the models used.^{9 }The brightness of the line depends on the line thickness, filter size, and threshold value.^{9} When line detection is discussed, the assumption is that lines are thin in terms of the size of the detector (Laplacian filter).^{9} The line width you are trying to detect should be similar to the filter used, such as using a 3X3 filter to detect a 1-pixel line width.^{9} Filter types applied to an image are typically either horizontal, +45 degrees, vertical, or -45 degrees depending on the line orientation in question.^{9} The four different filters vary in the direction of the larger coefficients (the preferred direction has values of 2) within the array of filter values (-1) that make up the filter.^{9} The filter performs a sum of products of the mask coefficients and intensity values in the region, resulting in the response of the mask at the center point of the region.^{9} To find whether a particular region is more associated with a certain line orientation, apply each of the four filters (horizontal, vertical, and the two diagonals designated as k= 1, 2, 3, and 4) individually to a given point in the image to yield |R_{k}|.^{9} If |R_{k}| > |R_{j}| for all j ≠ k then the line is in the direction of mask k.^{9}

Alternatively, if we want to find all lines in the image with a certain orientation, run the mask through the image and threshold its output to yield an easily discernible binary image.^{9} Finding the edge strength and direction at a certain location in the image means finding the gradient vector, and from that determining the magnitude and angle respectively (the edge is perpendicular to the direction of the gradient vector).^{9} To identify edges, one applies either a 1D-mask (vector) or a 2D-mask.^{9} 2D-masks include: the 2X2 Roberts cross-gradient operator, the more accurate 3X3 operators- Prewitt (mask coefficients: -1, 0, 1), and the improved noise-suppression Sobel (mask coefficients: -2, -1, 0, 1, 2).^{9} The mask coefficients of the Prewitt and Sobel operators differ in the diagonal edge detection compared to horizontal and vertical edge detection.^{9} After you find the gradient in the x and y directions, the magnitude is calculated by using the method of squares and square roots, or the less computationally burdensome method of absolute values.^{9} The downside of using absolute values for the gradient magnitude image (edge map) is that it is not invariant to rotation.^{9}

Improving the gradient images is achieved by smoothing the image first so that only the principal edges are highlighted. After smoothing, thresholding is used to sharpen the lines and improve connectivity.^{9} The Marr-Hildreth edge detector contains a more sophisticated analysis that requires operators of different sizes, as they believed intensity changes were not independent of the image scale.^{9} The Marr-Hildreth algorithm consists of convolving the Laplacian of a Gaussian (LoG) with an input image and finding the zero crossings to determine the locations of the edges.^{9} An important feature of zero crossings is that the resulting edges are 1 pixel thick; this simplifies subsequent processes, such as edge linking.^{9} The LoG is a Laplacian operator (isotropic second derivative) applied to a blurring 2D Gaussian function.^{9} If the accuracy of zero crossings is inadequate, a technique proposed by Huertas and Medioni (1986) allows subpixel accuracy.^{9} The more complex Canny edge detection algorithm is used when increased performance is required, such as thin edges without breaks.^{9}

### 4.1.2 Edge Linking and Boundary Detection

Edge linking usually follows edge detection. The algorithms are designed to assemble edge pixels into meaningful edges and or region boundaries.^{9 }The three processing approaches commonly used are local, regional, and global.^{9} Local processing establishes a similarity of edge pixels (magnitude and direction of the gradient vector) in a small neighborhood.^{9 }Two pixels in a neighborhood are linked if they satisfy both magnitude and direction criteria.^{9} As the neighborhood moves throughout the image, the linked points are recorded (often assigned different intensity values to keep track of them).^{9} Regional processing often uses a functional approximation in which a 2D curve is fit to the known points.^{9} In the case of a set of points that represent an open curve, the endpoints represent the vertices of the polygon and are connected with a straight line.^{9 }Then we calculate the perpendicular distance of all other points in the curve to this line, and select the line that yielded the maximum distance.^{9} If this distance exceeds a specified threshold, the corresponding point is declared a vertex.^{9}

In local and regional processing, it is partially known whether pixels belong to individual objects.^{9} However, this knowledge is not known in global processing, and all pixels are candidates for linking, and must therefore be accepted or eliminated based on predefined global properties.^{9} The simplest, but the computationally prohibitive approach is to find all lines determined by every pair of points, and then find all subsets of points close to particular lines.^{9} Using the alternative HT approach, a parameter space consists of many intersecting lines that connect two points.^{9} The principle lines in this plane could be found by identifying all points in the parameter space where large numbers of parameter-space lines intersect.^{9 }Methods used to segment an image are edge detection, thresholding, region growing, region splitting, morphology, and motion cues.^{9} Using more than one method is generally preferred, such as combining edge detection with thresholding.^{9} Edge detection requires post-processing, such as edge linking, while global thresholding offers speed.^{9}

Region growing is a procedure that groups pixels or subregions into larger regions using seed points, and the growth is based on predefined criteria.^{9} Region splitting and merging are different in that it subdivides an image into arbitrary, disjoint regions, and then either merges or splits the regions to try to satisfy the conditions of segmentation.^{9} Starting with the entire image or region, we subdivide it based on a selected predicate that is either true or false.^{9} If the predicate for this region of examination is false, we divide it into four disjoint quadrants (quadtree representation).^{9} The images corresponding to the nodes of a quadtree are called quadregions or quadimages.^{9} Finally, the adjacent regions are merged if the predicate of the union of these two images is true.^{9} Specify a minimum quadregion size beyond which no further split is carried out.^{9}

Morphological watersheds represent many properties that have already been discussed and often produce more stable results, including connected segmentation boundaries.^{9} The concept is based on the visualization of an image in three dimensions: two spatial coordinates versus intensity.^{9} This topographical interpretation then has three types of points: points belonging to a regional minimum, points where if a drop of water was placed, it would fall with certainty to a single minimum (referred to as a catchment basin or watershed), and points where water would equally fall to more than one of these minimums (these points form crest lines on the topographic surface (referred to as divide lines or watershed lines).^{9} Next is the construction of dams by applying watershed segmentation algorithms.^{9} Imagine holes are punched into the bottom of the catchment basins, and water can flood in.^{9} The water will eventually spill from one basin to another unless a dam is built between them.^{9}

Motion is a powerful cue used by humans and many other animals to extract objects or regions of interest from a background of irrelevant details.^{9} One of the simplest ways to detect changes between two image frames is to compare the two images (pixel by pixel).^{9} By subtracting the two images, stationary elements cancel leaving only the moving elements.^{9} Accumulative difference images are achieved by comparing a sequence of image frames and incrementing a counter every time a pixel changes intensity from each subsequent image.^{9} Frequency domain techniques are also an option for detecting motion by using Fourier transforms.^{9}

Some approaches to creating specific algorithms are decision tree, association learning rule, clustering, and Bayesian networks. The decision tree maps an observation about an item to a conclusion about its targeted value. The association learning rule is used to discover interesting relationships between variables in a large database. Clustering is the assignment of observations into subsets or clusters based on similarity. Bayesian networks are a statistical approach that updates the probability of a hypothesis as more evidence is gained. Two basic questions arise: what is the criterion you want to optimize, and is there an efficient algorithm?^{8} After finding an attractive criterion, an effective algorithm must be sought to find the criterion’s minimum.^{8} Greedy or gradient descent approaches fail to find global optima for high-dimensional, non-linear problems.^{8} In the 1980s, the high-dimensional problems became solvable with the introduction of Markov random fields (MRF) and variational formulations.^{8} MRF, like Bayesian networks, allowed future events to be predicted from present events, in which chains of states form over two or more dimensions.

### 4.2 Representation and Description

After segmenting the image into regions (super pixels), you decide whether to represent an image by external (boundary) or internal characteristics (pixels that comprise the region).^{9} Next, you decide on a description of the region.^{9} Some feature descriptors that can be used to describe a boundary are the length, the orientation of the line that connects its extreme points, and the number of concavities.^{9} Often a connected sequence of straight line segments (Freeman chain codes) is used to represent a boundary as a simplified version of itself.^{9} This approach traverses and resamples the boundary by assigning the region boundary to the closest grid node.^{9} The set of node points is then connected with straight lines that correspond to the numbers 0 to 3 (for 4-code) or 0 to 7 (for 8-code), depending on the orientation of the segment (90-degree or 45-degree increments respectively).^{9} A minimum-perimeter polygon (MPP) can also approximate a digital boundary, an approach that encloses the boundary by a set of concatenated cells known as a cellular complex.^{9} The final boundary shape is straight edged and constrained by the cell’s inner and outer walls.^{9} Our algorithm must focus only these vertices by recognizing that only convex vertices of the inner wall and concave vertices of the outer wall can be vertices of the MPP.^{9}

Segment splitting is another method for simplifying a boundary, which successively divides a segment or region into two parts until a specified criterion is met.^{9} It seeks inflection points by connecting the two extreme points of the shape with a line and finding the longest perpendicular line that extends from this centroid line and the boundary edge.^{9} This perpendicular line extends above and below the centroid line to find the vertices of the polygon.^{9} If the perpendicular line exceeds a threshold value, further splitting occurs.^{9} A variation of this method is to plot the distance from the centroid to the boundary as a function of the angle (0 – 360 degrees), which produces a signature, or 1D functional representation.^{9} Scaling the image will lead to a change in the amplitudes of the corresponding signature.^{9} One way to normalize the signature is to scale all functions so that they span the range of 0 to 1, but this simple method of normalization, which depends on only two values (minimum and maximum), can be a significant source of error if the shape is noisy.^{9} Yet distance versus angle is not the only way to generate a signature.^{9} Another method, called slope density function, creates a histogram of tangent-angles.^{9} Where deep valleys represent points of rapidly varying angles (i.e. sharp inflections), while the peaks are sections of the boundary with constant tangent angles (i.e. straight or nearly straight segments).^{9}

Decomposing a boundary into segments is often useful to simplify the description process for irregularly shaped objects that contain one or more significant concavities.^{9} Following the contour of an object (set S) and marking the points at which a transition is made into or out of the convex deficiency (D) yields points that delineate the ends of the boundary segments.^{9} Another method to represent the shape of a region is to obtain a skeleton of the region through a thinning (i.e. skeletonizing) algorithm.^{9} The medial axis transformation (MAT) is a line drawn through the region with more than one border point (i.e. border points on the opposing side of the MAT line).^{9 }This creates a representation of the region that is simplified and visually pleasing.^{9} The disadvantage of the skeleton representation is that it is expensive in terms of computation, as the implementation potentially involves calculating the distance from every region point (i.e. interior point) to every point on the boundary.^{9}

### 4.3 Object Recognition

The result of segmenting, representing, and describing an image is an image composed of regions or artifacts that are more meaningful and easier to recognize as objects or patterns.^{9} The approaches to recognizing patterns are centered on learning techniques and are classified either as decision-theoretic (i.e. quantitative descriptors such as length, area, or texture) or structural (i.e. relational descriptors for recursive relationships involving primitive elements).^{9} The key concept to consider for the decision-theoretic (i.e. vector approach) is that selecting the descriptors on which to base each component of a pattern vector has a profound influence on the performance of object recognition.^{9} For quantitative descriptions, vectors are used to arrange patterns, while structural descriptions use strings and trees.^{9}

Pattern recognition refers to automatically assigning patterns to their respective classes with as little human intervention as possible.^{9 }Class membership is often a function of not only quantitative measures about each feature but also of the spatial relationships between the features.^{9} Primitive elements that are repeated can be sampled and expressed as a pattern vector.^{9} However, if primitive elements are based on relatively simple connectivity, string descriptions are a better approach to adequately generating patterns of objects.^{9} The most powerful are tree descriptions used for most hierarchical ordering schemes.^{9}

Decision-theoretic approaches to recognition are based on the use of decision (or discrimination) functions and the matching of an unknown pattern with the closest class (i.e. prototype pattern vector).^{9} The minimum distance classifier calculates the distance between the unknown pattern vector and each prototype vector.^{9} The decision is based on the prototype vector, which is the Euclidean distance from the unknown vector in question.^{9} Matching can be done with a template for spatial correlation.^{9} The best match is the location with the maximum normalized correlation coefficient.^{9}

High classification speeds can be achieved with analog circuits composed of resistor banks.^{9} Taking a probabilistic approach to pattern recognition is often beneficial due to the randomity in pattern classes.^{9} Classification using the optimal statistical classifier approach leads to the lowest probability of committing a classification error.^{9} The probability that a particular pattern comes from a certain class is a function of the probability density function of the patterns from that class and the probability of occurrence of the class a priori (note: all classes may be equally likely to occur).^{9}

The conditional average risk (or conditional average loss in decision theory terminology) is used to determine the average loss in assigning a pattern to a particular class if it is not from that class.^{9} The Bayes classifier minimizes the total average loss.^{9} The loss for a correct decision is generally assigned a value of 0 and the loss for an incorrect decision is usually assigned the nonzero value (of say 1).^{9} Sample patterns are used to estimate the statistical parameters of each pattern class.^{9} A sample pattern is called a training program, and a set of patterns from each class is called a training set.^{9} The process by which a training set is used to obtain decision functions is called learning or training.^{9}

The mean vector specifies the minimum distance class of each class.^{9} The Bayes classifier for Gaussian populations is specified by the mean vector and covariance matrix of each class.^{9} The following models use a multitude of elemental nonlinear computing elements (called neurons), organized as networks: neural networks (neural nets), neurocomputers, parallel distributed processing (PDP) models, neuromorphic systems, layered self-adaptive networks, and connectionist models.^{9} Interest in neural networks began in the early 1940s by McCulloch and Pitts (1943).^{9} Stochastic algorithms are used on training sets of patterns, but improved statistic properties are achieved through learning from the use of neural systems that adaptively develop the coefficients of decision functions.^{9} The mathematical models are generated by learning through reinforcement or association (Hebb 1949).^{9} The neurons change binary states upon learning.^{9}

Learning machines, called perceptrons, appeared in the mid-1950s and early 1960s (Rosenblatt 1959, 1962).^{9} They used mathematical proofs to show that linearly separable (by a hyperplane) training sets would converge to a solution in a finite number of iterative steps.^{9} The solution took the form of coefficients of hyperplanes capable of separating the classes represented by patterns of the training set.^{9} Although perceptrons proved to be a well-founded theoretic model, the training algorithms were unfortunately inadequate for most pattern recognition tasks of practical significance.^{9} The development of new training algorithms for multilayer perceptrons, by Rumelhart, Hinton, and Williams (1986), has improved their outlook considerably.^{9} Their method, called the generalized delta rule for learning by back-propagation, has established multilayer perceptron-like machines as one of the principal models of neural networks currently in use.^{9} The perceptron learns a linear decision function that dichotomizes linearly separable training sets.^{9}

The linear decision function uses weights to modify the components of pattern vectors; the weights are analogous to synapses in the human neural system.^{9} The function is the sum of all weighted pattern components and is called the activation function.^{9} If the function is more than 0, the perceptron output value thresholds to 1, which indicates the pattern was recognized as belonging to that class.^{9} The orientation of the hyperplane is established by the weights of the function, except for the last weight, which is proportional to the perpendicular distance from the origin to the hyperplane.^{9} The fixed increment correction rule is a simple iterative algorithm to obtain a solution weight vector for two linearly separable training sets.^{9} This method uses a correction increment if the pattern is misclassified.^{9} The algorithm is converged in a finite number of steps (assuming the two or more classes are separable) when the entire training set for the classes is cycled through the machine without errors.^{9}

In practice, linearly separable pattern classes are the exception, which is why a significant research effort in the 1960s and 1970s was spent developing techniques to handle non-separable pattern classes.^{9} However, many of these advances in the handling of non-separable pattern classes have become obsolete due to advances in training neural networks.^{9} One of the early and still relevant methods for training perceptrons is the original delta rule, also known as the Widrow-Hoff or the least-mean square (LMS) delta rule.^{9} This method minimizes the mean square error between the patterns (i.e. actual and desired response) at each training step.^{9} If the pattern classes are separable, the solution given by the delta correction algorithm may or may not produce a separating hyperplane.^{9} Nevertheless, this method converges under both separable and non-separable cases.^{9}

The methods discussed so far can be extended to more than two classes and nonlinear decisions.^{9} Multilayer feedforward neural networks consist of layers of perceptron computing elements in which the output of each neuron in one layer feeds into the input of each neuron in the next layer.^{9} The number of neurons in the input layer is often the dimensionality of the pattern vectors, and the number of neurons in the output layer is the number of pattern classes to which the neural network was trained.^{9} The number of neurons in an intermediate layer can be chosen heuristically, an average of the first and last layers, or arbitrary and refined by tests.^{9} If a pattern vector belongs to a class, the network output for that class is “high,” while all other outputs are “low.”.^{9} There is the option of replacing the hard-limiting activation function with a soft-limiting “sigmoid” function.^{9}

Different types of functions can be used for different layers, or even for different nodes within the same layer of a neural network.^{9} Back-propagation training develops a rule like the delta rule, which allows the adjustment of the weights in each of the layers in a way that seeks a minimum of an error function.^{9} This method propagates the error (error term and weights) at the output layer back into the network.^{9} The process starts with an arbitrary set of weights across the network.^{9} Then two basic phases are involved in applying the generalized delta rule at any iterative step.^{9 }In the first phase, a training vector is presented to the network and propagates through the layers to calculate the output for each node.^{9} Then the outputs of the nodes in the output layer are compared with their desired responses to generate the error terms.^{9} In the second phase, a backward pass through the network occurs, in which the appropriate error signal is passed on to each node, and the corresponding weight changes are made (taking into account any bias weights that may have been present).^{9}

A common practice is to track the network error and errors associated with individual patterns.^{9} In a successful training session, the network error decreases with the number of iterations, and there is convergence to a stable set of weights that only shows small fluctuations with additional training.^{9} If a pattern has been classified correctly during training, the response of the node in the output layer associated with the pattern class should be high (> 0.95), while all other nodes should have low outputs (< 0.05).^{9} After training, the feedback paths are disconnected and the patterns are classified based on the parameters established during training.^{9} The input pattern is allowed to propagate through the various layers and is classified as belonging to the class of the highest output node. If more than one output node is high (or if all nodes are low), either choose the highest-value node or declare a misclassification.^{9} The next stage of training occurs when the system has learned the shapes from all tested classes, as each training pattern produces an output layer node high and all other nodes low.^{9}

The second part of training is a similar process, except that the patterns are now noisy samples.^{9} By running progressively noisier test sample images for each class being tested, the weight factors are adjusted and the performance improves (measured by the decrease in the probability of misclassification).^{9} The larger the number of samples, the better learning from each (noisier) test set.^{9} If the nature of the noise is known, small incremental additions of noise to the test set are possible, which is ideal for improving the convergence and stability properties of a neural network during learning.^{9}

A single-layer perceptron implements a hyperplane decision surface.^{9} A multilayer network can implement complex decision surfaces composed of intersecting hyperplanes.^{9 }Let’s first consider a two-input network with two layers.^{9} The outputs of the two nodes in the first layer are either high or low (denoted by a 1 or 0 respectively).^{9} The two nodes in the first layer feed a single node in the second layer, and therefore the possible combinations that feed each of the second layer nodes are either (1,1), (1,0), (0,1), and (0,0).^{9} The output nodes from the first layer can classify an input pattern as belonging to one of the two regions (or classes) by performing a logical AND operation between the two nodes in the first layer.^{9} The output node in the second layer responds with a 1 only when both outputs from the first layer that feed it are 1 as a function of the AND logic performed.^{9}

Finally, lines that delineate two regions (which could break up the classes) are drawn by putting the node on the positive side of the line when it is high or on the negative side when it is low. A single linear surface may not be sufficient, but a line with a vertex (i.e. two lines connected or two separate lines) may be necessary.^{9} If the number of nodes in the first layer was increased to three, the network would implement a decision boundary consisting of the intersection of three lines.^{9} The next logical step is to increase the number of layers to three.^{9} In this case, the nodes of the first layer implement lines, the nodes of the second layer then perform AND operations to form regions from the different lines, and the nodes of the third layer assign class membership to the different regions.^{9} So far, patterns have been dealt with quantitatively, largely ignoring any structural relationships inherent in the shape of a pattern.^{9}

Strings are the most practical approach to structural pattern recognition.^{9} A procedure analogous to the minimum distance concept for pattern vectors can be formulated to compare region boundaries described in terms of shape numbers.^{9} String matching is another option if the region boundaries compared are coded into strings. Based on the number of matches between the strings, their similarity can be quantified.^{9} The next step beyond extracting attributes from images and using object recognition is image analysis methods, whose appropriate development requires concepts from machine intelligence.^{9} Machine intelligence and areas dependent on it, such as scene analysis and computer vision, are still in their early stages of practical development.^{9} Today, heuristic approaches characterize image analysis problems (i.e. learning to arrive at an approximate solution rather than an exact closed-form solution).^{9}

### 5. References

1. Markoff, J. “Google Cars Drive Themselves, In Traffic”. *The New York Times*. Oct. 2009.

2. Doecke S.D., Anderson R.W.G., Mackenzie J.R.R. “The Potential of Autonomous Emergency Braking Systems to Mitigate Passenger Vehicle Crashes”. *Ponte G. Centre for Automotive Safety Research*. 2012.

3. Hulsof, W., Knight, I., Edwards, A., et. al. “Autonomous Emergency Braking Test Results”. 0168, p. 1-13.

4. Templeton, B. “Cameras or Lasers?”

5. Sappa, A., Herrero, R., Dornaika, F., et. al. “Road Approximation in Euclidean and v-Disparity Space: A Comparative Study”. *Computer Vision Center, Edifici O Campus UAB*.

8. Shi, J., Malik, J. “Normalized Cuts and Image Segmentation”.

9. R. Gonzalez, R. Woods. *Digital Image Processing, 3rd ed*. Pearson Education Inc. 2008.