Object detection is one of the most fundamental tasks in computer vision. Image segmentation, object tracking, keypoint detection, and similar tasks usually build on it. Object detection is also closely related to image classification and image segmentation, so let’s briefly look at the differences and connections between them.

Image classification: Image classification assigns one label to the whole input image, typically assuming a single dominant object. The label can be a broad category such as person or animal, or a fine-grained one such as a particular kind of animal. Because this image-level task is relatively simple and easy to understand, it was the first to be developed and put to use.

Object detection: Object detection deals with input images that contain many objects from multiple categories. The images we shoot or see usually contain objects of several categories, which makes the task more complicated. The goal is to find the position of each object in the image and determine its category.

Image segmentation: Image segmentation takes the same kind of input as object detection. The difference is that it uses the pixel as the basic unit and determines the category of each pixel, that is, it performs pixel-level classification. Image segmentation and object detection are closely related, and many models and methods borrow from each other.

First, the basic concept of object detection

Object detection means classifying all objects of interest in an image and determining the position coordinates of each one.

As shown in the figure below, three targets are detected in the image (a dog, a bicycle, and a truck), and the location of each one is identified.

Object detection is not limited to particular categories. To detect whether an image contains a target we care about, we train the network on images labeled with those categories in advance, so that the model learns the features of the known targets and can then identify the category and location of targets in other images.

Second, the development history of object detection

Object detection started out with traditional algorithms based on hand-crafted features. These algorithms usually work in three stages: region selection, feature extraction, and feature classification.

With the growth of computing power in recent years, deep learning has been widely adopted, and object detection based on deep learning has become a popular detection method.

After years of research on detection algorithms and continuous improvement of network models, many excellent models have emerged.

These models fall mainly into two types: two-stage (2-stage) detection models, such as the R-CNN series, and single-stage (1-stage) detection models.

Since Faster R-CNN introduced the anchor mechanism, many later algorithms have adopted it. Models can therefore also be divided by whether they use anchors: anchor-based and anchor-free.

1. Two stage and One stage

1) Two stage

Common two-stage object detection algorithms include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, and R-FCN.

Among two-stage detectors, Faster R-CNN is a typical representative.

The algorithm first extracts features with a backbone network. The resulting feature map is passed through the Region Proposal Network (RPN) to generate candidate regions (region proposals, RP for short: candidate boxes that may contain a target). Regions of interest (ROI) are then produced from the feature map and the proposals, and these are used for the subsequent classification and for the regression of the position coordinates.

2) One stage

Common one-stage object detection algorithms include OverFeat, YOLOv1 through YOLOv7, SSD, and RetinaNet.

According to its authors, YOLOv7, the most recent of these at the time of writing, surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS.

A one-stage detection model has no RPN. Instead, it extracts features with a convolutional network and predicts the category and location of each target in a single step.

Therefore, once pre-trained weights for the feature extraction network are available, the entire one-stage model can be trained end to end directly.

All in all, the one-stage detector greatly simplifies the model structure, speeds up inference, and simplifies training.


At present, object detection based on deep learning has gradually developed into anchor-based methods, anchor-free methods, and methods that fuse the two. The difference between anchor-based and anchor-free methods is whether anchors are used to extract candidate target boxes.

First, what is an anchor?

An anchor, also called an anchor box, is one of a preset set of bounding boxes of different sizes and aspect ratios. During training, the network learns the offset of the actual bounding box relative to these preset boxes.

In layman’s terms, preset boxes are placed in advance where a target may exist, and fine adjustments are then made starting from these boxes. In essence, anchors address the problem of label assignment.

Anchors act as a set of prior boxes. Each box is generated from the following parts:

(1) Use the points of the feature map extracted by the network to set the position of the box;

(2) Use the size of the anchor to set the size of the box;

(3) Use the aspect ratio of the anchor to set the shape of the box.
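
The three parts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_anchors`, the stride of 16, and the chosen sizes and ratios are hypothetical, and real detectors differ in details such as where the center of a feature-map cell is placed.

```python
import math

def make_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return anchors as (x1, y1, x2, y2) tuples in image pixels."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # (1) the feature-map point gives the box center in the image
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:        # (2) size sets the box area (size * size)
                for ratio in ratios:  # (3) ratio = height / width sets the shape
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Assumed example values: a 2x2 feature map with stride 16.
anchors = make_anchors(feat_h=2, feat_w=2, stride=16, sizes=[32, 64], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 * 2 positions * 2 sizes * 3 ratios = 24 anchors
```

Every feature-map point receives the same set of sizes and ratios, so the number of anchors grows quickly with the resolution of the feature map.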


In recent years, anchors have been widely used in object detection. Many models use the anchor mechanism, including Faster R-CNN, SSD, YOLOv2 through YOLOv5, and YOLOv7.

Algorithms of this kind work in three steps:

(1) Preset a large number of anchors (2D/3D) in the image or point cloud space;

(2) Regress the four offsets of the target relative to the anchor;

(3) Recover the precise target position from the corresponding anchor and the regressed offsets.
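
Steps (2) and (3) can be illustrated with a short Python sketch. It assumes one common parameterization, the one used by the R-CNN family, in which center offsets are scaled by the anchor size and width and height are regressed in log space. Other detectors define the four offsets differently, and the names `encode` and `decode` are illustrative only.

```python
import math

def encode(anchor, box):
    """Step (2): offsets (tx, ty, tw, th) of `box` relative to `anchor`.

    Both boxes are given as (cx, cy, w, h).
    """
    ax, ay, aw, ah = anchor
    bx, by, bw, bh = box
    return ((bx - ax) / aw, (by - ay) / ah, math.log(bw / aw), math.log(bh / ah))

def decode(anchor, offsets):
    """Step (3): recover the box from the anchor and the regressed offsets."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

anchor = (50.0, 50.0, 40.0, 20.0)   # assumed preset anchor
target = (54.0, 48.0, 50.0, 30.0)   # assumed ground-truth box
offsets = encode(anchor, target)    # what the network is trained to predict
recovered = decode(anchor, offsets) # what post-processing computes at test time
print(offsets)
print(recovered)
```

During training the network learns to output values like `offsets`. At test time only `decode` is needed, applied to the predicted offsets.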

The following describes a one-stage, anchor-based object detection model.

Obtaining a model mainly involves two parts: training and testing.

The main purpose of training is to learn the parameters of the detection network from the training data set, which contains a large number of images together with their annotations (object positions and categories).

Training phase: The main steps are data preprocessing, the detection network, label matching, and loss calculation.

Test phase: The trained model is used to predict on the input image, and the detection result is obtained after post-processing.

(I) Training process

(II) Test process

Object detection outputs the category name and the rectangular box position of each target. Inside the network, a category is usually represented by a number, such as 0 for Dog and 1 for Cat. The position of an object is usually represented by a rectangular bounding box, which is determined by four values, for example the coordinates of its top-left and bottom-right corners.
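
A single detection result of this kind can be sketched as follows. The dictionary layout, the score value, and the helper names are assumptions made for illustration. The two box encodings shown, corner form `(x1, y1, x2, y2)` and center form `(cx, cy, w, h)`, are both in common use.

```python
CLASS_NAMES = ["Dog", "Cat"]  # index = class id, matching the 0 = Dog, 1 = Cat example

def xyxy_to_cxcywh(box):
    """Corner form (x1, y1, x2, y2) -> center form (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def cxcywh_to_xyxy(box):
    """Center form (cx, cy, w, h) -> corner form (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Hypothetical detection: class id, confidence score, and a corner-form box.
detection = {"class_id": 0, "score": 0.92, "box": (40.0, 60.0, 200.0, 180.0)}

name = CLASS_NAMES[detection["class_id"]]
center_form = xyxy_to_cxcywh(detection["box"])
print(name, center_form)  # Dog (120.0, 120.0, 160.0, 120.0)
```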

Non-maximum suppression (NMS)


In the prediction stage, the model generates many anchor boxes for the image and predicts a category and a position offset for each. This produces many redundant prediction boxes: some do not fully contain the target, and a single target may yield several similar boxes. We therefore apply non-maximum suppression (NMS) to obtain the box that best matches the real target.

NMS first computes the IoU (intersection over union) between prediction boxes, then removes boxes whose overlap with a higher-scoring box exceeds a set threshold. In the end, each target of each category is left with the single prediction box that has the highest score.
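
The procedure can be sketched as a greedy loop in plain Python. This is a minimal illustration rather than an optimized implementation. The threshold of 0.5 and the example boxes are assumptions, and in practice the loop is run separately for each category.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two overlapping boxes on one target and one box on a separate target.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed. The third box does not overlap either of them and is kept.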

The figure shows the definition of IoU and a schematic of the detected target boxes before and after NMS.


Anchor-free algorithms are represented by CornerNet, ExtremeNet, CenterNet, FCOS, and others.

Anchor-free object detection algorithms follow one of two approaches:

(1) Methods based on the joint expression of multiple keypoints

(2) Methods based on single center point prediction

Methods based on multiple keypoints constrain the search space by locating several keypoints of the target object. For example, Grid R-CNN uses an RPN to find candidate regions and extracts a feature map for each ROI.

The feature map is passed through fully convolutional layers that output a probability heatmap, which is used to locate the grid points of the bounding box aligned with the target. The grid points are then combined through feature map fusion to determine the bounding box of the target.

Methods based on single center point prediction locate the center point of the target object and then predict the distance from the center to the boundary. For example, CenterNet treats each target as a point: the center of the target box represents the target, the center point offset and the width and height (size) are predicted to recover the actual box of the object, and a heatmap carries the classification information.

Each category has its own heatmap. On each heatmap, if the center point of a target lies at a certain coordinate, a keypoint (represented by a Gaussian circle) is generated at that coordinate, as shown in the figure below.
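
The per-category heatmap with a Gaussian keypoint can be sketched as below. The fixed `sigma`, the 8x8 heatmap size, and the function name `draw_center` are assumptions made for illustration. CenterNet itself chooses the Gaussian radius from the object size.

```python
import math

def draw_center(heatmap, cx, cy, sigma):
    """Splat a Gaussian peak centered at (cx, cy) onto one category's heatmap."""
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            # Keep the stronger response where two objects overlap.
            row[x] = max(row[x], value)

num_classes, height, width = 2, 8, 8
heatmaps = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]

# One object of category 0 whose center falls at x=3, y=4 on the heatmap.
draw_center(heatmaps[0], cx=3, cy=4, sigma=1.0)

# The peak of the category-0 heatmap sits at the object center.
peak = max((v, x, y) for y, row in enumerate(heatmaps[0]) for x, v in enumerate(row))
print(peak)  # (1.0, 3, 4)
```

At test time the process runs in reverse: local peaks of each heatmap are taken as object centers, and the predicted offset and size at each peak give the final box.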

It can be seen from the above that the main difference between anchor-based and anchor-free methods lies in how positive and negative samples are defined and how regression is done. In anchor-free methods, the grid cell in which the object falls is the positive sample, and the rest are negative samples. Anchor-based methods compute the IoU between each anchor box and the ground-truth box, and an anchor whose IoU exceeds a threshold is treated as a positive sample.
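
The two ways of defining positive samples can be contrasted in a small Python sketch. It follows the simplified description above. The threshold of 0.5, the stride of 32, and the function names are assumptions, and real detectors add refinements such as an ignore range between two thresholds.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def anchor_based_labels(anchors, gt_box, pos_threshold=0.5):
    """1 for anchors whose IoU with the ground truth reaches the threshold, else 0."""
    return [1 if iou(a, gt_box) >= pos_threshold else 0 for a in anchors]

def anchor_free_labels(grid_h, grid_w, stride, gt_box):
    """1 for the grid cell that contains the ground-truth center, else 0."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    pos_row, pos_col = int(cy // stride), int(cx // stride)
    return [[1 if (r, c) == (pos_row, pos_col) else 0 for c in range(grid_w)]
            for r in range(grid_h)]

gt = (30.0, 30.0, 90.0, 90.0)  # assumed ground-truth box
anchors = [(32.0, 32.0, 96.0, 96.0), (0.0, 0.0, 64.0, 64.0), (100.0, 100.0, 164.0, 164.0)]

print(anchor_based_labels(anchors, gt))    # [1, 0, 0]
print(anchor_free_labels(4, 4, 32, gt))    # only cell (row 1, col 1) is positive
```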

In the regression part, anchor-free methods regress from a point, while anchor-based methods regress the offset between the anchor box and the ground truth.
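
Point-based regression can be sketched as follows, assuming the FCOS-style target of four distances from a point to the left, top, right, and bottom sides of the box. The function names and the example values are illustrative only.

```python
def point_targets(point, gt_box):
    """Distances (left, top, right, bottom) from a point to the sides of the box."""
    px, py = point
    x1, y1, x2, y2 = gt_box
    return (px - x1, py - y1, x2 - px, y2 - py)

def decode_point(point, distances):
    """Recover the box (x1, y1, x2, y2) from a point and its four distances."""
    px, py = point
    left, top, right, bottom = distances
    return (px - left, py - top, px + right, py + bottom)

gt = (30.0, 40.0, 90.0, 100.0)  # assumed ground-truth box
point = (48.0, 48.0)            # assumed location inside the box

targets = point_targets(point, gt)
box = decode_point(point, targets)
print(targets)  # (18.0, 8.0, 42.0, 52.0)
print(box)      # (30.0, 40.0, 90.0, 100.0)
```

No preset box appears anywhere in this computation. The point itself is the only reference, which is what distinguishes it from regressing offsets relative to an anchor box.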

This has also led to methods that fuse anchor-based and anchor-free branches, such as FSAF, SFace, and GA-RPN.

Third, application scenarios of object detection in vehicles

Object detection is used in many aspects of our lives. With the rapid development of autonomous driving, object detection algorithms have been widely applied in this field as well.

The application scenarios include pedestrian and vehicle detection on the road, face detection for driver fatigue monitoring, detection of items left behind in the smart cockpit, occupant position detection, and so on.

1. Pedestrian and vehicle detection outside the vehicle

Pedestrians and vehicles on the road are detected so that road conditions can be monitored in real time.

2. In-cabin driver face detection

The driver’s face box is detected and serves as the basis for real-time monitoring of the driver’s state.

3. Detection of items left behind in the rear of the cabin

Items left in the cabin after the occupants get out are detected, so that the driver can be reminded to check the cabin after parking.
