You don’t need to look very hard to find artificial intelligence in self-driving cars. From the deep convolutional neural networks that can pick out pedestrians and read street signs to the algorithms that tell a Waymo van when it is safe to pull into an intersection, advanced machine learning is ubiquitous. That’s why it’s so surprising that today’s sensors are still so dumb.
Don’t get me wrong, today’s sensors offer amazing performance by traditional metrics. High resolution digital cameras are becoming dirt cheap and are an engineer’s dream in terms of size and reliability. Radars are increasing their range and resolution all the time. LiDAR, while still expensive, offers an incredibly rich 3D view of the world, unlocking all sorts of autonomous applications.
But for all these sensors, communication is mostly a one-way street. Once you position the camera, it sends you a picture of wherever it’s pointing every 33 milliseconds until you tell it to stop. Top-of-the-line spinning LiDARs will collect data in preset directions and transmit a stream of their results. It’s a similar story with today’s radars as well.
By contrast, think about the way a human driver takes in information about their environment. Some of the time you’re likely scanning the road, looking for things that might enter your path. As you approach an intersection, you might look off to each side to see if it’s safe to proceed. If you see a child running alongside the road, you’re likely to focus your attention on them, in case you suddenly need to stop. A truly intelligent self-driving car needs to do more than take in information from preset scan patterns; it must also be able to focus its information gathering on the most relevant areas of its environment.
While building intelligence into the sensors themselves is a challenging problem, the potential upside is enormous. By collecting data in the most efficient way possible, we can improve performance while reducing computational and material costs, something which is sorely needed to bring level 4 and 5 autonomous driving to the masses.
Teach the Machine
Machine learning research, and work on artificial neural networks, has always invited comparisons with human cognition. It’s therefore not surprising that a concept as central to the human experience as attention has been receiving increasing interest in recent years. A recent paper (Wang et al. 2017, Residual Attention Network) used stacked residual attention blocks to achieve state-of-the-art performance on a standard object recognition benchmark.
What makes this feat truly impressive is that their network had fewer than half as many layers as the next best method. Traditional convolutional neural networks treat every pixel equally, regardless of its content. By contrast, in this network each attention block performed two tasks: Deciding where to look and determining what is there. This architecture allowed the network to focus on only the most important elements of each image, giving it an edge over its rivals.
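To make the two-task idea concrete, here is a minimal Python sketch of a soft attention mask modulating a set of features. It assumes the combination rule (1 + M) * T, which is my reading of the attention residual learning in the paper; the function names and toy numbers are hypothetical, and a real block would learn both branches with convolutions rather than take them as inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def residual_attention(trunk_features, mask_logits):
    """Scale each trunk feature T by (1 + M), where M = sigmoid(mask logit).

    The trunk answers "what is there"; the mask answers "where to look".
    Because of the added 1, a mask near zero leaves a feature unchanged
    instead of erasing it.
    """
    return [(1.0 + sigmoid(m)) * t for t, m in zip(trunk_features, mask_logits)]


# Hypothetical values: three features and the mask's opinion of each.
trunk = [0.5, 2.0, -1.0]
logits = [-4.0, 0.0, 4.0]
attended = residual_attention(trunk, logits)
print(attended)
```

Features the mask considers important are boosted by up to a factor of two, while the rest pass through nearly untouched.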
Show & Tell
Another very impressive approach was demonstrated by “Show, Attend, and Tell” (Xu et al. 2016). In this work, the researchers set out to create a network which could generate a grammatically correct caption describing the contents of an image. To accomplish this, they combined the standard convolutional approach for image recognition with a long short-term memory approach typically used for language processing tasks.
The result is impressive not only for the quality of the captions it generates, but for the insights it grants into how the network generates them. For each word, the algorithm outputs the region of the image it was paying attention to when it generated that word. One persistent criticism of deep learning is that it typically functions like a black box: Effective but enigmatic. Work like “Show, Attend, and Tell” goes a long way towards pulling back the curtain.
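The per-word attention output can be pictured with a small Python sketch. It assumes soft attention, where each word gets a set of weights over image regions that sum to one. The scores below are hand-written and hypothetical; in the real network they would be computed from the LSTM state and the convolutional features of each region.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_report(region_scores):
    """For each word, return (index of most attended region, all weights)."""
    report = {}
    for word, scores in region_scores.items():
        weights = softmax(scores)
        report[word] = (weights.index(max(weights)), weights)
    return report


# Hypothetical scores over a 2x2 grid of image regions, flattened row by row.
region_scores = {
    "bird": [3.0, 0.5, 0.2, 0.1],
    "water": [0.1, 0.3, 2.5, 2.0],
}
report = attention_report(region_scores)
for word, (region, weights) in report.items():
    print(word, "-> region", region, [round(w, 2) for w in weights])
```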
Networking the Neurons
There has also been research using artificial neural networks to choose when to use each of the sensors at its disposal. In a recent paper (Braun et al. 2017), researchers developed a framework for what they call “sensor transformation attention networks”: A system of neural networks which bring different types of sensors into a common framework. Most pertinent to this discussion is their algorithm’s ability to assess the level of noise from each sensor and ignore ones it determines to be unreliable.
Consider the task of transcribing spoken digits from a movie. If the audio is perfectly clear but the video is grainy, the best performance might be achieved by feeding the audio to an LSTM neural network, ignoring the video entirely. However, if the audio is badly distorted and the video is clear, performance may be improved by trying to read the lips of the speaker using a convolutional neural network. Knowing which senses to trust is an important step towards building algorithms which pay attention like humans do.
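The spoken-digit example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each sensor is assumed to have already produced a probability for every digit, and the reliability scores are supplied by hand, whereas the paper's attention modules learn them from the sensor data. All names here are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_sensors(predictions, attention_scores):
    """Merge per-sensor digit probabilities, weighted by attention.

    predictions: sensor name -> list of 10 digit probabilities
    attention_scores: sensor name -> reliability score (higher = more trusted)
    """
    names = list(predictions)
    weights = softmax([attention_scores[n] for n in names])
    merged = [
        sum(w * predictions[n][digit] for w, n in zip(weights, names))
        for digit in range(10)
    ]
    return dict(zip(names, weights)), merged


def peaked(digit, confidence=0.91):
    """Toy distribution that puts most of its mass on one digit."""
    rest = (1.0 - confidence) / 9
    return [confidence if d == digit else rest for d in range(10)]


# Clear audio says "7"; grainy video says "1" but gets a low reliability score.
predictions = {"audio": peaked(7), "video": peaked(1)}
scores = {"audio": 2.0, "video": -2.0}
weights, merged = fuse_sensors(predictions, scores)
best_digit = merged.index(max(merged))
print(weights, best_digit)
```

Swapping the two scores makes the video's answer win instead, which is the behavior you want when the audio is the distorted channel.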
More Work Is Necessary
As impressive as these results are, there is something incomplete about the way they model attention. When a human is paying attention to an object, they track it with their eyes. This is because our visual acuity is greatest at the center of our field of view and diminishes near the edges. It’s easy to imagine a sensor which functions the same way, spending more of its time scanning critical regions of the outside world, while performing quick and coarse measurements of the uninteresting surroundings.
One very important work on this problem was published in 2010 (an eternity ago in machine learning time) by Larochelle and Hinton. Motivated by the function of the human eye, they created a model in which the network would select which regions of an input image it wanted to examine. These regions would be passed on with high resolution, while information about the surrounding regions would be blurred out. By combining these ‘foveal glimpses’, the network could be observed to scan an image in much the same way a human eye would.
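A foveal glimpse is easy to mock up. The Python sketch below keeps full resolution in a small window around a fixation point and replaces everything else with coarse block averages. It illustrates only the glimpse itself, not the learning model from the paper, and the block-averaging blur, function names, and 8x8 test image are all assumptions made for illustration.

```python
def block_average(image, block):
    """Coarsen an image by replacing each block x block tile with its mean."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            r1, c1 = min(r0 + block, rows), min(c0 + block, cols)
            tile = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            mean = sum(tile) / len(tile)
            for r in range(r0, r1):
                for c in range(c0, c1):
                    out[r][c] = mean
    return out


def foveal_glimpse(image, center, radius, block=4):
    """Sharp pixels within `radius` of `center`, coarse averages elsewhere."""
    glimpse = block_average(image, block)
    cr, cc = center
    for r in range(max(0, cr - radius), min(len(image), cr + radius + 1)):
        for c in range(max(0, cc - radius), min(len(image[0]), cc + radius + 1)):
            glimpse[r][c] = image[r][c]
    return glimpse


# Hypothetical 8x8 image whose pixel value encodes its position.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
glimpse = foveal_glimpse(image, center=(2, 2), radius=1, block=4)
for row in glimpse:
    print(row)
```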
While most of the above work has focused on camera images, the greatest opportunity for bringing attention to autonomous driving sensors is in solid-state analog radar. Conventional digital beamforming radar blasts a wide signal into the environment and then tries to identify targets through careful analysis of the echoes that come back.
By contrast, solid-state analog radar focuses all its energy into a very narrow beam which scans its environment much like LiDAR. Unlike LiDAR, however, radar typically uses a series of modulated pulses to measure the position and velocity of objects in its field of view. This approach gives range, angular resolution, and signal-to-noise figures which are unsurpassed among radars. It also presents several challenges and opportunities.
Timing Is Everything
One major challenge is determining the proper pulse sequence to use. The parameters of this sequence affect the maximum measurement range, maximum measurable velocity, and the resolution of both. These limits are set by the laws of physics, and increasing one invariably affects the others.
In a crowded city center, it might be much better to focus on the maximum resolution possible, since very distant or very fast objects aren’t likely to be relevant at slow speeds. By contrast, during freeway driving the radar will almost certainly want to be operating at maximum range to give the driver advance warning of obstacles ahead, with a high enough maximum measurable velocity to pick up oncoming cars. Maximizing the utility of such a radar requires algorithms which can be aware of their context and make decisions on how best to investigate their environment.
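The trade-offs can be made tangible with the textbook relations for a chirped (FMCW-style) waveform. The article does not say which waveform any particular radar uses, so treat the following Python sketch as an illustration under that assumption. The 77 GHz carrier, the 10 MHz receiver limit, and both parameter sets are hypothetical values chosen only to show the direction of each trade-off.

```python
from dataclasses import dataclass

C = 299_792_458.0  # speed of light in m/s


@dataclass
class ChirpSequence:
    """Hypothetical FMCW-style pulse sequence; every field is illustrative."""
    carrier_hz: float    # carrier frequency
    bandwidth_hz: float  # frequency swept by each chirp
    chirp_s: float       # chirp duration, assumed equal to the repetition interval
    n_chirps: int        # chirps per measurement
    max_beat_hz: float   # highest beat frequency the receiver can digitize

    def range_resolution_m(self) -> float:
        return C / (2 * self.bandwidth_hz)

    def max_range_m(self) -> float:
        slope = self.bandwidth_hz / self.chirp_s  # sweep rate in Hz per second
        return C * self.max_beat_hz / (2 * slope)

    def max_velocity_mps(self) -> float:
        wavelength = C / self.carrier_hz
        return wavelength / (4 * self.chirp_s)

    def velocity_resolution_mps(self) -> float:
        wavelength = C / self.carrier_hz
        return wavelength / (2 * self.n_chirps * self.chirp_s)


# City: wide bandwidth for fine resolution. Freeway: long reach, fast targets.
city = ChirpSequence(77e9, 1e9, 50e-6, 128, 10e6)
freeway = ChirpSequence(77e9, 100e6, 15e-6, 128, 10e6)

for name, seq in (("city", city), ("freeway", freeway)):
    print(
        f"{name}: max range {seq.max_range_m():.0f} m, "
        f"range resolution {seq.range_resolution_m():.2f} m, "
        f"max velocity {seq.max_velocity_mps():.1f} m/s, "
        f"velocity resolution {seq.velocity_resolution_mps():.2f} m/s"
    )
```

In this model, widening the bandwidth sharpens range resolution but shortens the maximum range for a fixed receiver, while shortening the chirp raises the maximum measurable velocity at the cost of coarser velocity resolution.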
Another challenge with radar of this type is that conducting a scan takes time, usually on the order of a few milliseconds. While that might sound fast, it means that looking in every direction with high resolution takes too long to be practical for autonomous driving. Much like the attention networks described above, it is essential that such a system be able to prioritize different areas of its environment based on previous scans.
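One simple way to picture this prioritization is a greedy scheduler: score every beam direction, then scan as many of the top-scoring ones as the time budget allows. The Python sketch below is a hypothetical illustration, not a description of any shipping system. The scoring rule (previous detection strength plus a small bonus for directions that have not been visited recently), the 5 ms dwell time, and the 20 ms budget are all assumptions.

```python
def plan_scan(detection_score, frames_since_visit, dwell_ms, budget_ms,
              staleness_weight=0.1):
    """Choose which beam directions to scan in the next frame.

    detection_score: direction -> interest level from previous scans
    frames_since_visit: direction -> frames since that direction was scanned
    The staleness bonus keeps quiet directions from being ignored forever.
    """
    priority = {
        d: detection_score[d] + staleness_weight * frames_since_visit[d]
        for d in detection_score
    }
    ranked = sorted(priority, key=priority.get, reverse=True)
    n_beams = int(budget_ms // dwell_ms)  # how many dwells fit in the budget
    return ranked[:n_beams]


# Hypothetical azimuth angles in degrees, with scores from the last frame.
detection_score = {-60: 0.0, -40: 0.1, -20: 0.3, 0: 0.9, 20: 0.7, 40: 0.0, 60: 0.0}
frames_since_visit = {-60: 8, -40: 1, -20: 0, 0: 0, 20: 0, 40: 2, 60: 1}

plan = plan_scan(detection_score, frames_since_visit, dwell_ms=5, budget_ms=20)
print(plan)
```

With seven candidate directions and room for only four dwells, the strong returns straight ahead are revisited, and the long-neglected direction at -60 degrees earns a look even though nothing was detected there last time.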
Working On It
The automotive radar startup Metawave is working to develop both the hardware and software to solve these problems. Their metamaterial-based analog beamforming radar can gather information at ranges no other sensor can reach, but only if it can direct its attention to the most pressing targets. While their work for now is focused on the radar domain, these techniques could also unlock new levels of performance in solid-state LiDAR and even cameras.
As with any emerging technology, it is difficult to predict where the field will be in five years without sounding foolish a lot of the time. Nevertheless, I feel confident that the concepts of attention which are emerging in pure research will become indispensable in enabling level 4 and 5 autonomous driving. This is doubly true for mass-produced self-driving cars, where cost-sensitive manufacturers will look for any excuse to reduce hardware costs by using more efficient algorithms.
About the author
Matt Harrison, PhD, is Metawave’s first AI Engineer, charged with developing the architecture of the AI engine that powers radar for autonomous driving. Matt holds a doctoral degree in Theoretical Physics and has extensive hands-on experience constructing deep neural networks.
Recently, he led projects implementing deep CNNs, c-GANs, dense neural networks, and clustering algorithms. Beyond building and training Metawave’s deep neural networks, Matt is also building 3D radar models and simulation tools.