License: arXiv.org perpetual non-exclusive license
arXiv:2401.06542v1 [cs.CV] 12 Jan 2024

Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Guoxin Zhang, Lei Yang, Li Wang, Caiyan Jia

This work was supported in part by the National Key R&D Program of China (2018AAA0100302) and by the STI 2030-Major Projects under Grant 2021ZD0201404. (Corresponding author: Caiyan Jia.)
Ziying Song, Lin Liu, Feiyang Jia, and Caiyan Jia are with the School of Computer and Information Technology, Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China (e-mail: 22110110@bjtu.edu.cn, liulin010811@gmail.com, jfy539@yeah.net, cyjia@bjtu.edu.cn).
Yadan Luo is with the School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, QLD 4072, Australia (e-mail: uqyluo@uq.edu.au).
Guoxin Zhang is with the School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China (e-mail: zhangguoxincs@gmail.com).
Lei Yang is with the State Key Laboratory of Automotive Safety and Energy, and the School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: yanglei20@mails.tsinghua.edu.cn).
Li Wang is with the School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China (e-mail: wangli_bit@bit.edu.cn).
Abstract

In the realm of modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. Key to this system are 3D object detection methods that utilize vehicle-mounted sensors such as LiDAR and cameras to identify the size, category, and location of nearby objects. Despite the surge in 3D object detection methods aimed at enhancing detection precision and efficiency, the literature lacks a systematic examination of their resilience against environmental variations, noise, and weather changes. This study emphasizes the importance of robustness, alongside accuracy and latency, in evaluating perception systems under practical scenarios. Our work presents an extensive survey of camera-based, LiDAR-based, and multimodal 3D object detection algorithms, thoroughly evaluating their trade-off between accuracy, latency, and robustness, particularly on datasets like KITTI-C and nuScenes-C to ensure fair comparisons. Among these, multimodal 3D detection approaches exhibit superior robustness, and a novel taxonomy is introduced to reorganize their literature for enhanced clarity. This survey aims to offer a more practical perspective on the current capabilities and constraints of 3D object detection algorithms in real-world applications, thus steering future research towards robustness-centric advancements.

Index Terms:
Autonomous Driving, 3D Object Detection, Point clouds

I Introduction

AUTONOMOUS driving systems, fundamental to the future of transportation, heavily rely on advanced perception, decision-making, and control technologies. These systems employ a range of sensors [1], such as cameras, LiDAR, and radar, as depicted in Fig. 1, to effectively perceive the surrounding environment. This capability is crucial for recognizing road signs, detecting and tracking vehicles, and predicting pedestrian behavior, enabling safe operation amidst complex traffic conditions.

The primary task of perception is to accurately understand the surrounding environment and minimize collision risks. This is where 3D object detection methods become essential. These approaches enable the autonomous systems to accurately identify objects in the vicinity, including their position, shape, and category [2]. Such detailed environmental perception enhances the system’s ability to comprehend the driving context and make more informed decisions.

Figure 1: An illustration of 3D object detection in autonomous driving scenarios with different sensors.

The advancement of autonomous driving technologies has sparked a wave of research in 3D object detection, leading to the development of diverse and innovative methods. These approaches are typically categorized based on their input types, including camera-based [3–100], point cloud-based [101–220], and multimodal methods [107, 136, 221–251]. The current landscape of 3D object detection methods is prolific, necessitating a comprehensive summarization to offer insights for the research community. While comprehensive, prior surveys [2, 252] often overlook the safety aspects of autonomous driving perception, particularly in terms of the system's robustness against varying test data after deployment.

In real-world testing scenarios, the conditions encountered can greatly differ from those during training. Environmental variability, sensor discrepancies or noise, and spatial misalignment can cause a shift in the input sensory data distribution, leading to a significant drop in detector performance [233, 230, 253, 254]. We identify and discuss three major factors critical for assessing detection robustness: 1) Environmental Variability: The detection algorithm needs to perform well under different environmental conditions, including variations in lighting, weather, and seasonal changes. The algorithm should exhibit adaptability, ensuring that it does not fail due to changes in the environment. 2) Sensor Noise: This includes handling noise introduced by sensor malfunctions or artifacts, such as motion blur in camera images. The algorithm must be able to effectively manage hardware noise, ensuring accurate processing of input data. 3) Misalignment: In real-world scenarios, sensor calibration errors can complicate the synchronization of multimodal input data, causing misalignment due to external factors (e.g., uneven road surfaces) or internal factors (e.g., system clock misalignment). The algorithm should be fault-tolerant and may incorporate an elastic alignment to mitigate the impact of misalignment on detection performance.
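The three factors above can be emulated on clean LiDAR data for quick experiments. The following is a minimal sketch and not the procedure of any cited benchmark: the function names, the uniform point dropout used as a crude stand-in for weather effects, and the yaw-only calibration error are all illustrative assumptions.

```python
import math
import random
from typing import List, Tuple

# (x, y, z) in metres in a hypothetical LiDAR frame.
Point = Tuple[float, float, float]


def drop_points(points: List[Point], drop_ratio: float, rng: random.Random) -> List[Point]:
    """Environmental variability (crude stand-in): randomly discard returns."""
    return [p for p in points if rng.random() >= drop_ratio]


def add_gaussian_noise(points: List[Point], sigma: float, rng: random.Random) -> List[Point]:
    """Sensor noise: jitter every coordinate with zero-mean Gaussian noise."""
    return [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma), z + rng.gauss(0.0, sigma))
        for x, y, z in points
    ]


def misalign_yaw(points: List[Point], yaw_error_deg: float) -> List[Point]:
    """Misalignment: rotate the cloud about the z-axis, as a yaw calibration error would."""
    angle = math.radians(yaw_error_deg)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(10.0, 0.0, -1.5), (20.0, 5.0, -1.4), (5.0, -2.0, -1.6)]
    corrupted = misalign_yaw(
        add_gaussian_noise(drop_points(cloud, 0.2, rng), 0.02, rng), 1.0
    )
    print(len(cloud), "->", len(corrupted), "points")
```

Chaining the three functions, as in the usage example, gives a single corrupted sample; realistic benchmarks use physically grounded simulation rather than these simplified perturbations.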

To ensure safe operation in varying test environments, assessing the robustness of 3D object detection algorithms is essential. They must maintain efficient, accurate, and reliable performance across diverse scenarios. In this survey, we conduct extensive experimental comparisons among existing algorithms. Centered around ‘Accuracy, Latency, Robustness’, we delve into existing solutions, offering guidance for practical deployment in autonomous driving.

  • Accuracy: Current research often prioritizes accuracy as a key performance metric. However, a deeper understanding of these methods’ performance in complex environments and extreme weather conditions is needed to ensure real-world reliability. A more detailed analysis of false positives and false negatives is necessary for improvement.

  • Latency: Real-time capability is vital for autonomous driving. The latency of a 3D object detection method impacts the system’s ability to make timely decisions, particularly in emergencies.

  • Robustness: Robustness refers to the system’s stability under various conditions, including weather, lighting, sensory and alignment changes. Many existing evaluations may not fully consider the diversity of real-world scenarios, necessitating a more comprehensive adaptability assessment.

Through an in-depth analysis of extensive experimental results, with a focus on ‘Accuracy, Latency, Robustness’, we have identified significant advantages of multimodal 3D detection in safety perception. By integrating information from diverse sensors or data sources, multimodal methods provide a richer and more diverse perception capability for autonomous driving systems, thereby enhancing the understanding of and response to the surrounding environment. Our research provides practical guidance for the future deployment of autonomous driving technology. By discussing these key areas, we aim to align the technology more closely with real-world needs and enhance its societal benefits.

The structure of this paper is organized as follows. First, we introduce the datasets and evaluation metrics for 3D object detection, with a particular focus on robustness, in Section II. The subsequent sections systematically examine existing 3D object detection methods, including Camera-only (Section III), LiDAR-only (Section IV), and multimodal approaches (Section V). The paper concludes with a comprehensive summary of our findings in Section VI.

II Datasets

TABLE I: Advantages and limitations of different modalities.
Type | Sensor | Hardware Cost ($) | Advantages | Limitations
Image | Camera | $10^{2}$ ~ $10^{3}$ | + The dense data format incorporates additional color and texture information. | - Missing depth information; the camera is affected by light, weather, etc.
Point cloud | LiDAR | $10^{4}$ ~ $10^{5}$ | + Accurate depth information; less affected by light. + Larger field of view. | - High computational cost for sparse and disordered point cloud data; no color information.
Multimodal | Camera, LiDAR | $10^{4}$ ~ $10^{5}$ | + Simultaneous color and depth information. | - Fusion methods can produce noise interference.

TABLE II: Public datasets for 3D object detection in autonomous driving.
Dataset | Year | Sensors | Data Size: Frames | Data Size: Annotations | Diversity: Scenes | Diversity: Categories | Scene Types
KITTI [255] | 2012 | Camera, LiDAR | 15K | 200K | 50 | 3 | Daytime and sunny days only
nuScenes [256] | 2019 | Camera, LiDAR, Radar | 40K | 1.4M | 1000 | 10 | Night and rainy days
Lyft L5 [257] | 2019 | Camera, LiDAR | 46K | 1.3M | 366 | 9 | Daytime and sunny days only
H3D [258] | 2019 | LiDAR | 27K | 1.1M | 160 | 8 | Daytime and sunny days only
Apollo [259] | 2019 | Camera, LiDAR | 140K | - | 103 | 27 | Night and rainy days
Argoverse [260] | 2019 | Camera, LiDAR | 46K | 993K | 366 | 9 | Night and rainy days
A*3D [261] | 2019 | Camera, LiDAR | 39K | 230K | - | 7 | Night and severely obscured data
Waymo [262] | 2020 | Camera, LiDAR | 230K | 12M | 1150 | 3 | Night and rainy days
A2D2 [263] | 2020 | Camera, LiDAR | 12.5K | 43K | - | 38 | Daytime and sunny days only
PandaSet [264] | 2020 | Camera, LiDAR | 14K | - | 179 | 28 | Daytime and sunny days only
KITTI-360 [265] | 2020 | Camera, LiDAR | 80K | 68K | 11 | 19 | Daytime and sunny days only
Cirrus [266] | 2020 | Camera, LiDAR | 6285 | - | 12 | 8 | Daytime and sunny days only
ONCE [267] | 2021 | Camera, LiDAR | 15K | 417K | - | 5 | Night and rainy days
OpenLane [268] | 2022 | Camera, LiDAR | 200K | - | 1000 | 14 | Daytime and sunny days only

Currently, autonomous driving systems primarily rely on sensors such as cameras, LiDAR, and radar, generating data in two modalities: point clouds and images. Based on these data types, existing public benchmarks predominantly take three forms: Camera-only, LiDAR-only, and multimodal. Table I delineates the advantages and disadvantages of each of these three forms. Many reviews [269, 252, 270, 271, 272, 273, 274, 275] provide a comprehensive overview of clean autonomous driving datasets, as shown in Table II. The most notable ones include KITTI [255], nuScenes [256], and Waymo [262].

Pioneering work on clean autonomous driving datasets has provided rich resources for 3D object detection. As autonomous driving technology transitions from breakthrough stages to practical implementation, we systematically review the currently available datasets related to the robustness of 3D detection, with a focus on noisy scenarios. Many studies collect new datasets to evaluate model robustness under different conditions. Early research explored camera-only approaches under adverse conditions [276, 277], with datasets that are notably small in scale and applicable only to camera-only visual tasks rather than multimodal sensor stacks that include LiDAR. Subsequently, a series of multimodal datasets [278, 279, 280, 281] focused on noise concerns. For instance, the GROUNDED dataset [278] focuses on ground-penetrating radar localization under varying weather conditions. Additionally, the ApolloScape open dataset [280] incorporates LiDAR, camera, and GPS data, encompassing cloudy and rainy conditions and brightly lit scenarios.

Because collecting extensive noisy datasets from the real world is prohibitively expensive, which makes large-scale real-world datasets impractical, many studies have shifted their focus to synthetic datasets. ImageNet-C [282] is a seminal work in corruption robustness research, benchmarking classical image classification models against prevalent corruptions and perturbations. This line of research has subsequently extended to robustness datasets tailored for 3D object detection in autonomous driving. Additionally, there are adversarial attacks [283, 284, 285] designed for studying the robustness of 3D object detection. However, such attacks are worst-case perturbations rather than natural corruptions, and they are less prevalent in autonomous driving scenarios. To better emulate the distribution of noisy data in the real world, several studies [254, 286, 287, 253, 288, 289] have developed toolkits for robustness benchmarks. These toolkits enable the simulation of various scenarios using clean autonomous driving datasets, such as KITTI [255], nuScenes [256], and Waymo [262]. Among them, Dong et al. [254] systematically designed 27 common corruptions in 3D object detection to benchmark the corruption robustness of existing detectors. By applying these corruptions comprehensively on public datasets, they established three corruption-robustness benchmarks: KITTI-C, nuScenes-C, and Waymo-C. Dong et al. [254] denote model performance on the original validation set as $\mathrm{AP}_{\text{clean}}$. For each corruption type $c$ at each severity $s$, they adopt the same metric to measure model performance as $\mathrm{AP}_{c,s}$. The corruption robustness of a model is calculated by averaging over all corruption types and severities as

$$\mathrm{AP}_{\text{cor}} = \frac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\frac{1}{5}\sum_{s=1}^{5}\mathrm{AP}_{c,s}. \tag{1}$$

where $\mathcal{C}$ is the set of corruptions in the evaluation. Note that for different kinds of 3D object detectors, the set of corruptions can differ (e.g., [254] does not evaluate camera noise for LiDAR-only models). Thus, the results of $\mathrm{AP}_{\text{cor}}$ are not directly comparable between different kinds of models, and [254] performs a fine-grained analysis under each corruption. [254] also calculates the relative corruption error (RCE) by measuring the percentage of performance drop as

$$\mathrm{RCE}_{c,s} = \frac{\mathrm{AP}_{\text{clean}} - \mathrm{AP}_{c,s}}{\mathrm{AP}_{\text{clean}}}; \quad \mathrm{RCE} = \frac{\mathrm{AP}_{\text{clean}} - \mathrm{AP}_{\text{cor}}}{\mathrm{AP}_{\text{clean}}}. \tag{2}$$
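As a worked illustration of Eqs. (1) and (2), the sketch below averages per-severity AP values over corruption types and then computes the relative drop. The function names and all AP numbers are made up for illustration and are not results from [254].

```python
from typing import Dict, List


def ap_cor(ap_per_corruption: Dict[str, List[float]]) -> float:
    """Eq. (1): mean over corruption types of the mean AP over severities."""
    per_type_means = [sum(values) / len(values) for values in ap_per_corruption.values()]
    return sum(per_type_means) / len(per_type_means)


def rce(ap_clean: float, ap_corrupted: float) -> float:
    """Eq. (2): relative corruption error as a fraction of the clean AP."""
    return (ap_clean - ap_corrupted) / ap_clean


if __name__ == "__main__":
    # Hypothetical AP values at severities 1..5 for two corruption types.
    scores = {
        "fog": [70.0, 65.0, 60.0, 55.0, 50.0],
        "gaussian_noise": [78.0, 76.0, 74.0, 72.0, 70.0],
    }
    ap_clean = 80.0
    robust_ap = ap_cor(scores)
    print(f"AP_cor = {robust_ap:.2f}")
    print(f"RCE    = {100.0 * rce(ap_clean, robust_ap):.2f}%")
```

Multiplying the returned fraction by 100 gives the percentage form described in the text. The same `rce` function covers both the per-corruption case (pass a single $\mathrm{AP}_{c,s}$) and the aggregate case (pass $\mathrm{AP}_{\text{cor}}$).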

Unlike KITTI-C and Waymo-C, nuScenes-C primarily assesses performance using the mean Average Precision (mAP) and nuScenes Detection Score (NDS) computed across ten object categories. The mAP is determined using the 2D center distance on the ground plane instead of the 3D Intersection over Union (IoU). The NDS metric consolidates mAP with other aspects, such as scale and orientation, into a unified score. Analogous to KITTI-C, Ref. [254] denotes the model’s performance on the validation set as $\text{mAP}_{\text{clean}}$ and $\text{NDS}_{\text{clean}}$, respectively. The corruption robustness metrics, $\text{mAP}_{\text{cor}}$ and $\text{NDS}_{\text{cor}}$, are evaluated by averaging over all corruption types and severities. Additionally, Ref. [254] calculates the Relative Corruption Error (RCE) under both mAP and NDS metrics, similar to the formulation in Eq. (2).
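The NDS aggregation mentioned above can be made concrete. The sketch below follows the definition published with the nuScenes benchmark, in which mAP receives a weight of 5 and each of five true-positive error metrics (translation, scale, orientation, velocity, attribute) is clipped to [0, 1] and converted to a score. The function names, dictionary keys, and the single matching threshold shown are illustrative assumptions, and the inputs are made-up numbers.

```python
import math
from typing import Dict, Tuple


def is_center_match(pred_xy: Tuple[float, float], gt_xy: Tuple[float, float],
                    threshold_m: float) -> bool:
    """Match a prediction to a ground-truth box by 2D center distance on the ground plane."""
    return math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1]) < threshold_m


def nds(map_score: float, tp_errors: Dict[str, float]) -> float:
    """NDS = (5 * mAP + sum over five TP metrics of (1 - min(1, error))) / 10."""
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error metrics")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * map_score + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    errors = {"mATE": 0.30, "mASE": 0.25, "mAOE": 0.40, "mAVE": 1.20, "mAAE": 0.20}
    print(f"NDS = {nds(0.60, errors):.3f}")
    print(is_center_match((10.0, 2.0), (10.5, 2.5), threshold_m=1.0))
```

Because errors above 1 are clipped, a very poor velocity estimate (as in the usage example) contributes a score of zero rather than a negative value.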

Additionally, some studies [283, 286, 290] examine robustness in single-modal contexts. For instance, Ref. [286] proposes a LiDAR-only benchmark that utilizes physical-aware simulation methods to simulate degraded point clouds under various real-world common corruptions. This benchmark, tailored for point cloud detectors, includes 1,122,150 examples across 7,481 scenes, covering 25 common corruption types with six severity levels. Moreover, Ref. [286] devises novel evaluation metrics, including $\text{CE}_{\text{AP}}(\%)$ and mCE. Ref. [286] calculates the corruption error (CE) to assess performance degradation based on Overall Accuracy (OA) by:

$$\mathrm{CE}_{\mathrm{c},\mathrm{s}}^{\mathrm{m}} = \mathrm{OA}_{\mathrm{clean}}^{\mathrm{m}} - \mathrm{OA}_{\mathrm{c},\mathrm{s}}^{\mathrm{m}}, \tag{3}$$

where $\mathrm{OA}_{\mathrm{c},\mathrm{s}}^{\mathrm{m}}$ is the overall accuracy of detector $\mathrm{m}$ under corruption $\mathrm{c}$ of severity level $\mathrm{s}$ (excluding “clean”, i.e., severity level 0), and the subscript clean denotes the clean data. For each detector $\mathrm{m}$, the mean CE ($\mathrm{mCE}$) is calculated by:

$$\mathrm{mCE}^{\mathrm{m}} = \frac{\sum_{\mathrm{s}=1}^{5}\sum_{\mathrm{c}=1}^{25}\mathrm{CE}_{\mathrm{c},\mathrm{s}}^{\mathrm{m}}}{5\mathrm{C}}. \tag{4}$$
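Eqs. (3) and (4) amount to a difference followed by a mean over all corrupted settings. The sketch below is a minimal illustration with hypothetical names and made-up accuracy values; it also checks that the example count quoted above equals 7,481 scenes times 25 corruption types times six severity levels (with level 0 being clean).

```python
from typing import Dict, List

NUM_SEVERITIES = 5  # corrupted levels 1..5; level 0 is the clean data


def corruption_error(oa_clean: float, oa_corrupted: float) -> float:
    """Eq. (3): drop in overall accuracy for one corruption at one severity."""
    return oa_clean - oa_corrupted


def mean_corruption_error(oa_clean: float, oa: Dict[str, List[float]]) -> float:
    """Eq. (4): sum of CE over severities and corruptions, divided by 5 * C."""
    for name, values in oa.items():
        if len(values) != NUM_SEVERITIES:
            raise ValueError(f"{name}: expected {NUM_SEVERITIES} severity levels")
    total = sum(corruption_error(oa_clean, v) for values in oa.values() for v in values)
    return total / (NUM_SEVERITIES * len(oa))


# Consistency check of the benchmark size stated in the text.
assert 7481 * 25 * 6 == 1_122_150

if __name__ == "__main__":
    # Hypothetical overall accuracies for two corruption types only.
    accuracies = {"snow": [70.0] * 5, "beam_missing": [60.0] * 5}
    print(mean_corruption_error(80.0, accuracies))
```

In the full benchmark the dictionary would hold all 25 corruption types, so the denominator becomes 5 × 25 as in Eq. (4).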

III Camera-based 3D Object Detection

Figure 2: Camera-only methods pipeline.

In this section, we introduce the camera-based 3D object detection methods. Compared to LiDAR-based methods, the camera solution is more cost-effective and the generated image requires no complex preprocessing. Therefore, it is favored by many automotive manufacturers, particularly in the context of multi-view applications such as BEV (bird’s-eye view) systems. Generally, as shown in Fig. 2, camera-based methods can be categorized into three types: (1) monocular, (2) stereo-based, and (3) multi-view (bird’s-eye view). Due to the excellent cost-effectiveness of camera-based methods, numerous reviews and investigations have summarized and explored them. However, the majority of existing reviews on 3D object detection are limited to specific methodologies, with a predominant focus on accuracy. This survey revisits the fundamental considerations of safety-perception deployment, redefining the discourse around existing categorizations and exploring ‘Accuracy, Latency, and Robustness’ as the core dimensions for an in-depth analysis of current methodologies. The objective is to provide additional insights to guide existing technologies.

TABLE III: Camera-based 3D object detection methods.
Input Type | Key Point | Methods
Monocular | Prior-guided: Direct regression using geometric prior knowledge. | [291, 292, 293, 294, 295, 296, 297, 17, 49, 298, 299, 15, 300, 5, 43, 16, 68, 6, 301, 18, 39, 47, 20, 296]
Monocular | Camera-only: Uses the RGB image information captured by the monocular camera. | [21, 15, 302, 66, 57, 39, 299, 65, 67, 48, 5, 6, 53, 52, 303, 7, 304, 305, 301, 68]
Monocular | Depth-assisted: Extracting depth information via camera parallax. | [306, 10, 69, 37, 3, 4, 307, 45, 36, 62]
Stereo | 2D-detection-based: Integrate 2D information about the target into the image. | [13, 308, 309, 310, 311, 312, 24, 313]
Stereo | Pseudo-LiDAR-based: Incorporate additional information from pseudo-LiDAR to simulate LiDAR depth. | [4, 314, 157, 14]
Multi-view | Depth-based: Convert 2D spatial features into 3D spatial features through depth estimation. | [25, 315, 316, 92, 93, 95, 96, 98, 100, 317, 318, 319, 320, 321, 322, 323, 324]
Multi-view | Query-based: Influenced by the transformer technology stack, there is a trend to explicitly or implicitly query Bird’s Eye View (BEV) features. | [99, 87, 97, 26, 89, 90, 94, 166, 319, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 91, 335, 336, 337]

III-A Monocular 3D object detection

Monocular 3D object detection refers to performing 3D object detection using only one camera, with the aim of inferring the 3D position, size, and orientation of target objects from a single image [274]. In recent years, monocular 3D object detection has gained increasing attention due to its advantages of low cost, low power consumption, and ease of deployment in real-world applications. However, because monocular images contain insufficient 3D information, monocular methods face many challenges, such as accurately localizing the 3D position and handling occluded scenes. Overcoming these challenges relies on leveraging depth information to supplement the missing 3D information in monocular images. Typically, most approaches employ depth estimation tasks to acquire depth information from images. However, monocular depth estimation is an ill-posed and highly challenging task, prompting researchers to dedicate significant efforts to optimizing the accuracy and stability of depth estimation.

III-A1 Prior-guided monocular 3D object detection

In recent years, prior-guided monocular methods [291, 292, 293, 294, 295, 296, 297, 17, 49, 298, 299, 15, 300, 5, 43, 16, 68, 6, 301, 18, 39, 47, 20, 296] have continuously explored how to utilize the hidden prior knowledge of object shapes and scene geometry in images to address the challenges of monocular 3D object detection. This effective integration of prior knowledge is crucial for mitigating the uncertainty and ill-posed nature of monocular 3D object detection problems. By introducing pre-trained subnetworks or auxiliary tasks, prior knowledge can provide additional information or constraints to assist in the accurate localization of 3D objects and enhance detection precision and robustness.

Widely adopted prior knowledge in 3D objects includes object shapes [17, 338, 339, 296, 297, 293], geometric consistency [340, 300, 43, 5, 43, 16, 6], temporal constraints [47, 302], and segmentation information [296, 20]. Object shape provides insights into the appearance and structure of the object, aiding in more accurate inference of the spatial position and pose of the object. Geometric consistency knowledge assists the model in better understanding the relative positional relationships between objects in the scene, thereby improving detection consistency and robustness. Temporal constraints consider the continuity and stability of the object across different frames, providing vital clues for object detection. Additionally, leveraging segmentation information enables the model to better comprehend semantic information in the images, facilitating precise localization and identification of objects.

As a result, current works are dedicated to further exploring and utilizing prior knowledge to enhance the performance and robustness of monocular 3D object detection by integrating prior knowledge with deep learning approaches, thus driving continuous development and innovation in this field. The early algorithm Mono3D [50] first assumes the 3D object is on a fixed ground plane, and then uses the prior 3D shape of the vehicle to reconstruct the bounding box in 3D space. In subsequent work, Deep MANTA [292] uses keypoints and 3D CAD models to predict 3D objects. Pose-RCNN [341] learns viewpoint-specific subclass information from 3D CAD models to capture shape, viewpoint information, and potential occlusion patterns of objects. MonoPSR [49] generates 3D candidate frames for each object in the scene by using the fundamental relationships of the pinhole camera model and a well-established 2D object detector.
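To illustrate why a ground-plane prior helps resolve the depth ambiguity, the sketch below intersects a pixel's viewing ray with a flat ground plane. It is a simplified geometric example and not the Mono3D pipeline: it assumes a pinhole camera whose optical axis is parallel to the ground, a camera frame with y pointing down, and hypothetical intrinsics and camera height.

```python
from typing import Optional, Tuple


def ground_point_from_pixel(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    camera_height_m: float,
) -> Optional[Tuple[float, float, float]]:
    """Back-project pixel (u, v) onto the plane y = camera_height_m (camera frame).

    Returns None when the pixel lies on or above the horizon row cy, because
    the viewing ray then never reaches the ground in front of the camera.
    """
    dv = v - cy
    if dv <= 0:
        return None
    z = camera_height_m * fy / dv  # depth at which the ray meets the ground
    x = (u - cx) * z / fx          # lateral offset at that depth
    return (x, camera_height_m, z)


if __name__ == "__main__":
    # Hypothetical intrinsics and mounting height.
    point = ground_point_from_pixel(640.0, 460.0, 720.0, 720.0, 610.0, 175.0, 1.65)
    print(point)
```

A pixel at the bottom edge of an object that touches the road thus yields a metric 3D location from a single image, which is the kind of constraint that prior-guided methods exploit.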

With a deeper understanding and application of prior knowledge, it is believed that significant progress will be achieved in monocular 3D object detection in the future, bringing breakthroughs and opportunities to the fields of computer vision and intelligent systems.

III-A2 Camera-only monocular 3D object detection

Camera-only monocular 3D object detection [21, 15, 302, 66, 57, 39, 299, 65, 67, 48, 5, 6, 53, 52, 303, 7, 304, 305, 301, 68] utilizes images captured by a single camera to detect and localize 3D objects. Camera-only monocular methods employ convolutional neural networks (CNNs) to directly regress 3D bounding box parameters from the images, enabling the estimation of the spatial dimensions and poses of objects in three dimensions. Drawing inspiration from the architectural design of 2D detection networks, this direct regression method can be trained in an end-to-end manner, facilitating holistic learning and inference for 3D objects. The unique challenge of monocular 3D object detection lies in inferring the 3D position, dimensions, and orientation of objects solely from a single image, without relying on additional depth maps or point cloud data. Consequently, the direct regression approach demonstrates practicality and broad applicability. By learning features from the images, convolutional neural networks can predict the 3D information of the objects. Through end-to-end training, the network gradually optimizes its parameters to enhance the accurate extraction of 3D information. This direct regression method streamlines the entire detection process and reduces reliance on supplementary information, thereby improving the algorithm’s robustness and generalization capability. Nevertheless, monocular 3D object detection still presents challenges, such as occlusion, viewpoint variations, and changes in lighting conditions, which may impact the accuracy of 3D detection. The representative work SMOKE [21] abandons the regression of 2D bounding boxes and predicts the 3D box for each detected target by combining the estimation of individual keypoints with the regression of 3D variables.
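Keypoint-plus-regression designs rely on the pinhole camera relation to turn a predicted image keypoint and a regressed depth into a 3D location. The sketch below shows only this generic relation, not the exact parameterization used by SMOKE [21]; the function names and intrinsic values are hypothetical.

```python
from typing import Tuple


def project(point: Tuple[float, float, float],
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Recover the camera-frame 3D point from a keypoint (u, v) and a regressed depth."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


if __name__ == "__main__":
    # Hypothetical intrinsics and a hypothetical object center 20 m ahead.
    fx, fy, cx, cy = 700.0, 700.0, 600.0, 180.0
    center = (2.0, -1.0, 20.0)
    u, v = project(center, fx, fy, cx, cy)
    print("keypoint:", (u, v))
    print("recovered center:", backproject(u, v, center[2], fx, fy, cx, cy))
```

The round trip makes the ill-posedness explicit: the keypoint alone fixes only a viewing ray, and any error in the regressed depth moves the recovered center along that ray.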

III-A3 Depth-assisted monocular 3D object detection

Depth estimation plays a crucial role in depth-assisted monocular 3D object detection. To achieve more accurate monocular detection results, numerous studies [306, 10, 69, 37, 3, 4, 307, 45, 36, 62] leverage pre-trained auxiliary depth estimation networks. Specifically, the process begins by transforming monocular images into depth images using pre-trained depth estimators, such as MonoDepth[342]. Subsequently, two primary methodologies are employed to handle depth images and monocular images.

Remarkable progress has been made in Pseudo-LiDAR detectors that use a pre-trained depth estimation network to generate Pseudo-LiDAR representations[60, 306, 61, 62]. However, there is a huge performance gap between Pseudo-LiDAR and LiDAR-based detectors because of the errors in image-to-LiDAR generation. Thus, Hong et al.[59] attempted to transfer deeper structural information from point clouds to assist monocular image detection. By leveraging the mean-teacher framework, they aligned the outputs of the LiDAR-based teacher model and the camera-based student model at both the feature-level and response-level, aiming to achieve cross-modal knowledge transfer. Such a depth-assisted monocular 3D object detection, by effectively integrating depth information, not only enhances detection accuracy but also extends the applicability of monocular vision to tasks involving 3D scene understanding.

III-B Stereo-based 3D object detection

Stereo-based 3D object detection is designed to identify and localize 3D objects using a pair of stereo images. Leveraging the inherent capability of stereo cameras to capture dual perspectives, stereo-based methods excel in acquiring highly accurate depth information through stereo matching and calibration, a feature that sets them apart from monocular camera setups. Despite these advantages, stereo-based methods still face a considerable performance gap when compared to LiDAR-based counterparts. Furthermore, the realm of 3D object detection from stereo images remains relatively underexplored, with only limited research endeavors dedicated to this domain. Specifically, these approaches involve the utilization of image pairs captured from distinct viewpoints to estimate the 3D spatial depth of each object.

III-B1 2D-detection based methods

Traditional 2D object detection frameworks can be modified to address stereo detection problems. Stereo R-CNN [23] employs an image-based 2D detector to predict 2D proposals, generating left and right regions of interest (RoIs) for the corresponding left and right images. Subsequently, in the second stage, it directly estimates the parameters of 3D objects based on the previously generated RoIs. This paradigm has been widely adopted by follow-up works [13, 308, 309, 310, 311, 312, 24, 313].

III-B2 Pseudo-LiDAR based methods

The disparity map predicted from stereo images can be transformed into a depth map and further converted into pseudo-LiDAR points. Consequently, similar to monocular detection methods, pseudo-LiDAR representations can also be employed in stereo-based 3D object detection approaches. These methods aim to enhance the disparity estimation in stereo matching to achieve more accurate depth predictions. Regarding the contribution of depth in 3D detection, Wang et al. [3] pioneered the Pseudo-LiDAR representation. This representation is generated from an image with a depth map, requiring the model to perform a depth estimation task to assist in detection. Subsequent work has followed this paradigm and made optimizations by introducing additional color information to augment the pseudo point cloud [45], auxiliary tasks (instance segmentation [36], foreground and background segmentation [343] and domain adaptation [40]) and a coordinate transform scheme [8, 306]. It is worth noting the insightful work proposed by Ma et al. called PatchNet [306]. Specifically, the authors challenge the conventional idea of leveraging the pseudo-LiDAR representation for monocular 3D object detection. By encoding 3D coordinates for each pixel, PatchNet can attain a comparable monocular detection result without the pseudo-LiDAR representation. This observation indicates that the power of the pseudo-LiDAR representation stems from the coordinate transformation rather than the point cloud representation itself.
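The conversion from a disparity map to pseudo-LiDAR points described above can be sketched as follows. This is a minimal illustration rather than the implementation of any cited method: it assumes a rectified stereo pair and a pinhole camera, and the function name and the parameters `fx`, `fy`, `cx`, `cy` (intrinsics in pixels) and `baseline` (in meters) are hypothetical. Depth is obtained with the standard relation depth = fx * baseline / disparity, and each pixel is then back-projected into the camera frame.

```python
import numpy as np

def disparity_to_pseudo_lidar(disparity, fx, fy, cx, cy, baseline, min_disp=1e-3):
    """Convert an (H, W) disparity map in pixels to (N, 3) points in the camera frame.

    Assumes a rectified stereo pair and a pinhole camera (illustrative only).
    Pixels whose disparity is not above min_disp are dropped.
    """
    h, w = disparity.shape
    valid = disparity > min_disp

    # Standard stereo relation: depth = focal_length * baseline / disparity.
    depth = np.zeros((h, w), dtype=np.float64)
    depth[valid] = fx * baseline / disparity[valid]

    # Pixel grid: u runs along image columns, v along image rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project every pixel with the pinhole model.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    return points[valid]  # (N, 3), row-major pixel order
```

The resulting array has the same layout as a LiDAR sweep without intensity, which is what allows a LiDAR-based detector to consume it.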

III-C Multi-view 3D object detection

Recently, multi-view 3D object detection has demonstrated superior accuracy and robustness compared to the aforementioned monocular and stereo 3D object detection approaches. In contrast to LiDAR-based 3D object detection, the latest panoramic Bird’s Eye View (BEV) approaches eliminate the need for high-precision maps, elevating the detection from 2D to 3D. This advancement has led to significant developments in multi-view 3D object detection. Beyond what is covered in previous reviews [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275], there has been extensive research on effectively leveraging multi-view images for 3D object detection. In multi-camera 3D object detection, a key challenge lies in recognizing the same object across different images and aggregating object features from multiple view inputs. The common practice is to map the multi-view images uniformly into the Bird’s Eye View (BEV) space. Therefore, multi-view 3D object detection, also referred to as BEV-camera-only 3D object detection, revolves around the core challenge of unifying 2D views into the BEV space. Based on the spatial transformation used, these methods can be divided into two main categories. The first is depth-based methods [25, 315, 316, 92, 93, 95, 96, 98, 100, 317, 318, 319, 320, 321, 322, 323, 324], represented by LSS [316], which perform a 2D-to-3D transformation. The second is query-based methods [99, 87, 97, 26, 89, 90, 94, 166, 319, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 91, 335, 336, 337], represented by DETR3D [88], which query from 3D to 2D.

III-C1 Depth-based Multi-view methods

The direct transformation from 2D to BEV space poses a significant challenge. As shown in Fig. 3, LSS [316] was the first to propose a depth-based method, utilizing 3D space as an intermediary. This approach involves initially predicting the grid depth distribution of 2D features and then elevating these features to voxel space. This method holds promise for more effectively achieving the transformation from 2D to BEV space. Following LSS [316], CaDDN [52] adopted a similar depth representation approach. It employed a network structure akin to LSS, primarily for predicting categorical depth distribution. By compressing voxel-space features to BEV space, it performed the final 3D detection. It is worth noting that CaDDN is not part of multi-view 3D object detection, but rather single-view 3D object detection, which has influenced subsequent research on depth. The main distinction between LSS [316] and CaDDN [52] lies in CaDDN’s use of actual ground truth depth values to supervise its prediction of categorical depth distribution, resulting in an outstanding depth network capable of more accurately extracting 3D information from 2D space. This line of research has sparked a series of subsequent studies, such as BEVDet [315], its temporal version BEVDet4D [95], and BEVDepth [25]. These studies are of great significance in advancing the transformation from 2D to 3D space and enabling more accurate object detection in the BEV space, providing valuable insights and directions for the relevant field’s development. Furthermore, some studies have addressed the issue of insufficient depth solely by encoding height information. These studies have found that with increasing distance, the depth disparity between the car and the ground rapidly diminishes [317, 100].
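The "lift" step described above can be sketched as an outer product between a per-pixel categorical depth distribution and the per-pixel image feature. This is a simplified illustration under stated assumptions, not the LSS [316] implementation: the function names and tensor layouts are hypothetical, and the subsequent pooling of the lifted features into BEV cells is omitted.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_features(depth_logits, context):
    """Lift 2D image features into a camera frustum.

    depth_logits: (D, H, W) unnormalized scores over D discrete depth bins.
    context:      (C, H, W) image features.
    Returns:      (C, D, H, W) features, one copy of each pixel feature per
                  depth bin, weighted by the predicted depth probability.
    """
    depth_prob = softmax(depth_logits, axis=0)  # (D, H, W), sums to 1 over D
    return context[:, None, :, :] * depth_prob[None, :, :, :]
```

Because the depth probabilities sum to one at every pixel, summing the lifted tensor over the depth axis recovers the original 2D feature map. A sharply peaked distribution places the feature at a single depth, while an uncertain one spreads it along the camera ray.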

Figure 3: LSS [316] “lifts” 2D space to 3D space through depth distribution.

III-C2 Query-based Multi-view methods

Under the influence of Transformer [351, 352, 353, 354] technology, query-based Multi-view methods retrieve 2D spatial features from 3D space. Inspired by Tesla’s perception system, DETR3D[88] introduces 3D object queries to address the aggregation of multi-view features. It accomplishes this by clipping image features from different perspectives and projecting them into 2D space using learned 3D reference points, thus obtaining image features in the Bird’s Eye View (BEV) space. Query-based Multi-view methods, contrary to Depth-based Multi-view methods, acquire sparse BEV features by employing a reverse querying technique, fundamentally impacting subsequent query-based developments [99, 87, 97, 26, 89, 90, 94, 166, 319, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 91, 335, 336, 337]. However, due to the potential inaccuracies associated with explicit 3D reference points, PETR [89], as shown in Fig. 4, influenced by DETR [355] and DETR3D [88], adopts an implicit positional encoding method for constructing the BEV space, influencing subsequent works[90, 336]. Meanwhile, some methods [225, 356] do not explicitly construct Bird’s Eye View (BEV) features.
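The 3D-to-2D querying idea can be illustrated by projecting 3D reference points into one camera image and gathering the image features at the projected locations. The sketch below is an assumption-laden simplification and not the DETR3D [88] implementation: the function name is hypothetical, a single camera with a known 3×4 projection matrix is assumed, and a nearest-neighbour lookup stands in for the interpolation and multi-camera aggregation a real system would use.

```python
import numpy as np

def sample_features_at_reference_points(ref_points, proj_matrix, feature_map):
    """Gather image features at the projections of 3D reference points.

    ref_points:  (N, 3) 3D points in the frame expected by proj_matrix.
    proj_matrix: (3, 4) matrix mapping homogeneous 3D points to homogeneous pixels.
    feature_map: (C, H, W) image features at pixel resolution.
    Returns (features, valid): (N, C) features, zero where invalid, and an (N,) mask.
    """
    n = ref_points.shape[0]
    c, h, w = feature_map.shape

    homo = np.concatenate([ref_points, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = homo @ proj_matrix.T                                    # (N, 3)

    depth = cam[:, 2]
    in_front = depth > 1e-5                      # points behind the camera are invalid
    safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
    u = cam[:, 0] / safe_depth                   # pixel column
    v = cam[:, 1] / safe_depth                   # pixel row

    col = np.round(u).astype(int)
    row = np.round(v).astype(int)
    valid = in_front & (col >= 0) & (col < w) & (row >= 0) & (row < h)

    feats = np.zeros((n, c), dtype=feature_map.dtype)
    feats[valid] = feature_map[:, row[valid], col[valid]].T       # (K, C)
    return feats, valid
```

Only the locations hit by a reference point are read, which is why the features obtained this way are sparse, in contrast to the dense frustum produced by depth-based lifting.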

Figure 4: Comparison of DETR[355], DETR3D[88], and PETR[89].

III-D Analysis: Accuracy, Latency, Robustness

Currently, 3D object detection solutions based on Bird’s Eye View (BEV) perception are rapidly advancing. Despite the existence of numerous reviews [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275], a comprehensive review of this field is still lacking. It is noteworthy that Shanghai AI Lab and SenseTime Research have provided a thorough review [357] of the technical roadmap for BEV solutions. However, unlike existing reviews [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275], which primarily focus on the technical roadmap and the current state of the art, we also consider crucial aspects such as safety perception in autonomous driving. Following an analysis of the technical roadmap and the current state of development for camera-based solutions, we intend to base our discussion on the foundational principles of ‘Accuracy, Latency, Robustness.’ We will integrate the perspectives of safety perception to guide the practical implementation of safety perception in autonomous driving.

Figure 5: (a) $AP_{\text{3D}}$ comparison of monocular-based [21, 54, 305, 18, 55, 56, 76, 11, 75, 301] methods and Stereo-based [358, 310, 359, 24, 360, 13, 361, 362, 28, 363] methods on the KITTI moderate dataset. (b) The mAP (left) and NDS (right) comparison of monocular-based [364, 7, 67, 69, 65] methods and Multi-view [88, 89, 90, 26, 365, 330, 25, 329, 336, 337] methods on the nuScenes test dataset.
Figure 6: (a) The FPS and $AP_{\text{3D}}$ comparison of monocular-based [21, 305, 56, 76, 301, 18, 7, 45, 15, 67] methods and Stereo-based [358, 24, 13, 366, 362, 28, 363, 23, 309, 367] methods on the KITTI test dataset. (b) The FPS and NDS comparison of monocular-based [67, 65] methods and Multi-view [330, 88, 89, 90, 315, 26, 95, 25, 331, 321, 368] methods on the nuScenes test dataset. (c) The FPS and $AP_{\text{3D}}$ comparison of view-based methods [175, 173, 176], Point-based methods [369, 110, 105, 370, 186], PV-based methods [251, 371, 168, 165] and Voxel-based methods [109, 118, 372, 115, 373, 151, 139, 374, 134, 121, 147, 101] on the KITTI test dataset. (d) The FPS and NDS comparison of Point-based methods [105], PV-based methods [167] and Voxel-based methods [118, 364, 125, 232, 201, 136, 183, 154, 220, 112, 230, 375] on the nuScenes test dataset. (e) The FPS and $AP_{\text{3D}}$ comparison of Point-Projection-based methods [222, 226, 376], Feature-Projection-based methods [232, 377], Auto-Projection-based methods [369, 233, 247, 243, 378], Decision-Projection-based methods [379, 380, 381, 196, 382] and Query-Learning-based methods [245] on the KITTI test dataset. (f) The FPS and NDS comparison of Point-Projection-based methods [383], Feature-Projection-based methods [129], Auto-Projection-based methods [233, 384], Query-Learning-based methods [230, 237] and Unified-Feature-based methods [385, 231, 386, 375, 228, 356, 387, 225, 223, 136] on the nuScenes test dataset.

III-D1 Accuracy

Accuracy is a focal point of interest in the majority of research articles and reviews and is indeed of paramount importance. While accuracy can be reflected through AP (average precision), considering AP alone for comparison may not provide a comprehensive view, as different methodologies may exhibit substantial differences due to differing paradigms. As shown in Fig. 5 (a), we selected 10 representative methods (including classic and latest research) for comparison, and it is evident that there are significant metric disparities between monocular 3D object detection [21, 54, 305, 18, 55, 56, 76, 11, 75, 301] and stereo-based 3D object detection [358, 310, 359, 24, 360, 13, 361, 362, 28, 363]. The current scenario indicates that the accuracy of monocular 3D object detection is far lower than that of stereo-based 3D object detection. Stereo-based 3D object detection leverages the capture of images from two different perspectives of the same scene to obtain depth information. The greater the baseline between cameras, the wider the range of depth information captured. As shown in Fig. 5 (b), before 2021 there were monocular 3D object detection methods [364, 7, 67, 69, 65] on the nuScenes dataset, but no related research on stereo-based 3D object detection. Starting from 2021, monocular methods have gradually been supplanted by multi-view (bird’s-eye-view perception) 3D object detection methods [88, 89, 90, 26, 365, 330, 25, 329, 336, 337], leading to a significant improvement in mAP. The emergence of the novel bird’s-eye-view paradigm and the increase in sensor quantity have had a substantial impact on mAP. It can be observed that initially the disparity between DD3D [69] and DETR3D [88] was not prominent, but with the continuous enhancement of multi-view 3D object detection, particularly with the advent of novel works such as Far3D [337], the gap has widened.
In other words, at present, Camera-only 3D object detection methods on multi-camera datasets like nuScenes [256] are predominantly based on bird’s-eye-view perception. If we consider accuracy solely from this single dimension, the increase in sensor quantity has led to a significant improvement in accuracy metrics (including mAP, NDS, AP, etc.).

III-D2 Latency

Latency holds paramount importance in the realm of autonomous driving. It refers to the time required for a system to react to input signals, encompassing the entire process from sensor data acquisition to system decision-making and execution of actions. In autonomous driving, stringent requirements are imposed on latency, as any form of delay can lead to severe consequences. The following aspects underscore the importance of latency in autonomous driving.

  • Real-time responsiveness: Autonomous driving systems need to demonstrate exceptional real-time responsiveness. Timely decision-making and actions are crucial for collision avoidance, adapting to traffic changes, and ensuring vehicle safety.

  • Safety: High latency may result in the system’s inability to timely detect and respond to potential hazardous situations. Timely responses are a key factor in ensuring driving safety.

  • User Experience: For passengers and other road users, a smooth and coherent driving experience is crucial. High latency may lead to an uncomfortable driving experience or even induce anxiety.

  • Interactivity: Autonomous vehicles need to interact with other vehicles, pedestrians, and infrastructure. Low latency ensures timely communication and coordination, thereby enhancing the overall efficiency of the transportation system.

  • Emergency Response: In emergency situations, such as the sudden appearance of obstacles or rapid changes in traffic conditions, the system needs to react quickly to mitigate potential dangers.

In the field of 3D object detection, latency (reported here as Frames Per Second, FPS) and accuracy are critical metrics for assessing the performance of algorithms. As shown in Fig. 6 (a), the chart for monocular and stereo 3D object detection illustrates the relationship between Average Precision (AP) at the moderate difficulty level of the KITTI dataset and FPS. Fig. 6 (b) shows the relationship between the nuScenes Detection Score (NDS) and FPS for monocular and multi-view 3D object detection. The FPS values were obtained using an NVIDIA A100 graphics card, while the performance metrics AP and NDS are taken from the original papers.

Specifically, monocular-based 3D object detection, relying on data from a single camera, typically has lower computational requirements, thus achieving a higher FPS. However, due to the absence of depth information, its accuracy is often inferior to that of stereo or multi-view systems. Stereo-based 3D object detection, utilizing disparity information from images captured by dual cameras, enhances the accuracy of depth estimation but also introduces greater computational complexity, which may reduce FPS. Multi-view detection merges data from several cameras to provide richer scene information, which further improves accuracy. This method requires more extensive data processing, hence demanding greater computational power and algorithmic optimization to sustain a reasonable FPS level. Notably, no stereo-based 3D object detection methods are represented on nuScenes, with the monocular method FCOS3D [65] being particularly emblematic as it was introduced in 2021. With time and optimization, multi-view 3D object detection has rapidly developed in terms of both accuracy and latency.

In conclusion, for the realization of safe autonomous driving, 3D object detection algorithms must balance latency and accuracy. Monocular detection is fast but lacks precision; conversely, stereo and multi-view methods are accurate but slower. Future research should not only maintain high precision but also place greater emphasis on increasing FPS and reducing latency to meet the dual requirements of real-time responsiveness and safety in autonomous driving.

III-D3 Robustness

Robustness constitutes a pivotal factor in the safety perception of autonomous driving, a topic that deserves significant attention but has previously been overlooked in comprehensive reviews. The current meticulously designed clean datasets and benchmarks, such as KITTI [255], nuScenes [256], and Waymo [262], do not commonly address this aspect. Presently, research works [253, 254, 388, 287, 288, 389, 390] such as RoboBEV [253] and Robo3D [288] incorporate robustness considerations into 3D object detection, exemplified by factors such as sensor failures, as illustrated in Fig. 7. They assess robustness by introducing disturbances into datasets relevant to 3D object detection. This includes the introduction of various types of noise, such as variations in weather conditions, sensor malfunctions, motion disturbances, and object-related perturbations, aimed at unraveling the distinct impacts of different noise sources on the model. Typically, most papers investigating robustness conduct evaluations by introducing noise to the validation sets of clean datasets, such as KITTI [255], nuScenes [256], and Waymo [262]. Additionally, we highlight findings from Ref. [254], using KITTI-C [254] and nuScenes-C [254] as examples to illustrate the results of Camera-Only 3D object detection methods. Tables IV and V provide an overall comparison, revealing that, in general, Camera-Only methods are less robust than LiDAR-Only and multimodal fusion methods. They are highly susceptible to various types of noise. In KITTI-C, three representative works, SMOKE [21], PGD [67], and ImVoxelNet [391], show consistently lower overall performance and reduced robustness to noise. In nuScenes-C, noteworthy methods such as DETR3D [88] and BEVFormer [26] exhibit greater robustness compared to FCOS3D [65] and PGD [67], suggesting that as the number of sensors increases, overall robustness improves.
In conclusion, future Camera-Only methods need to consider not only cost factors and Accuracy metrics (mAP, NDS, etc.) but also factors related to safety perception and robustness. Our analysis aims to provide valuable insights for the safety of future autonomous driving systems.
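The RCE rows reported in Tables IV and V can be read with the following sketch, which assumes that RCE is the relative drop from the clean score to the corruption-averaged score, expressed as a percentage. This definition is inferred from the reported numbers and is not quoted from Ref. [254]; the function name is hypothetical.

```python
def relative_corruption_error(ap_clean, ap_cor):
    """Relative drop (in percent) from the clean score to the corrupted score.

    Assumed definition: RCE = 100 * (AP_clean - AP_cor) / AP_clean.
    A lower value means the detector loses less accuracy under corruption.
    """
    if ap_clean <= 0:
        raise ValueError("ap_clean must be positive")
    return 100.0 * (ap_clean - ap_cor) / ap_clean


# SECOND on KITTI-C: AP_clean = 81.59, average AP_cor = 70.45.
second_rce = round(relative_corruption_error(81.59, 70.45), 2)

# FCOS3D on nuScenes-C: clean mAP = 23.86, average corrupted mAP = 10.26.
fcos3d_rce = round(relative_corruption_error(23.86, 10.26), 2)
```

Under this assumption, the two examples give 13.65 and 57.00, matching the corresponding table entries. Because the metric is relative, a method with a very low clean score, such as the Camera-Only detectors on KITTI-C, can show a large RCE even when the absolute drop is only a few points.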

Figure 7: Corruption Examples in the RoboBEV[253] Benchmark: Simulating Camera Malfunction.
TABLE IV: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of $R_{40}$ at moderate difficulty. The best one is highlighted in bold. ‘RCE’ denotes Relative Corruption Error from Ref.[254].
Method groups: LiDAR-Only (SECOND, PointPillars, PointRCNN, PV-RCNN, Part-A$^2$, 3DSSD); Camera-Only (SMOKE, PGD, ImVoxelNet); Multimodal (EPNet, Focals Conv, LoGoNet, VirConv-S).

| Category | Corruption | SECOND† | PointPillars† | PointRCNN† | PV-RCNN† | Part-A$^2$† | 3DSSD† | SMOKE† | PGD† | ImVoxelNet† | EPNet† | Focals Conv† | LoGoNet* | VirConv-S* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | None ($\text{AP}_{\text{clean}}$) | 81.59 | 78.41 | 80.57 | 84.39 | 82.45 | 80.03 | 7.09 | 8.10 | 11.49 | 82.72 | 85.88 | 86.07 | 91.95 |
| Weather | Snow | 52.34 | 36.47 | 50.36 | 52.35 | 42.70 | 27.12 | 2.47 | 0.63 | 0.22 | 34.58 | 34.77 | 51.45 | 51.17 |
| | Rain | 52.55 | 36.18 | 51.27 | 51.58 | 41.63 | 26.28 | 3.94 | 3.06 | 1.24 | 36.27 | 41.30 | 55.80 | 50.57 |
| | Fog | 74.10 | 64.28 | 72.14 | 79.47 | 71.61 | 45.89 | 5.63 | 0.87 | 1.34 | 44.35 | 44.55 | 67.53 | 75.63 |
| | Sunlight | 78.32 | 62.28 | 62.78 | 79.91 | 76.45 | 26.09 | 6.00 | 7.07 | 10.08 | 69.65 | 80.97 | 75.54 | 63.62 |
| Sensor | Density | 80.18 | 76.49 | 80.35 | 82.79 | 80.53 | 77.65 | - | - | - | 82.09 | 84.95 | 83.68 | 80.70 |
| | Cutout | 73.59 | 70.28 | 73.94 | 76.09 | 76.08 | 73.05 | - | - | - | 76.10 | 78.06 | 77.17 | 75.18 |
| | Crosstalk | 80.24 | 70.85 | 71.53 | 82.34 | 79.95 | 46.49 | - | - | - | 82.10 | 85.82 | 82.00 | 75.67 |
| | Gaussian (L) | 64.90 | 74.68 | 61.20 | 65.11 | 60.73 | 59.14 | - | - | - | 60.88 | 82.14 | 61.85 | 63.16 |
| | Uniform (L) | 79.18 | 77.31 | 76.39 | 81.16 | 77.77 | 74.91 | - | - | - | 79.24 | 85.81 | 82.94 | 70.74 |
| | Impulse (L) | 81.43 | 78.17 | 79.78 | 82.81 | 80.80 | 78.28 | - | - | - | 81.63 | 85.01 | 84.66 | 80.50 |
| | Gaussian (C) | - | - | - | - | - | - | 1.56 | 1.71 | 2.43 | 80.64 | 80.97 | 84.29 | 82.55 |
| | Uniform (C) | - | - | - | - | - | - | 2.67 | 3.29 | 4.85 | 81.61 | 83.38 | 84.45 | 82.56 |
| | Impulse (C) | - | - | - | - | - | - | 1.83 | 1.14 | 2.13 | 81.18 | 80.83 | 84.20 | 82.54 |
| Motion | Moving Obj. | 52.69 | 50.15 | 50.54 | 54.60 | 79.57 | 77.96 | 1.67 | 2.64 | 5.93 | 55.78 | 49.14 | 14.44 | 32.28 |
| | Motion Blur | - | - | - | - | - | - | 3.51 | 3.36 | 4.19 | 74.71 | 81.08 | 84.52 | 82.58 |
| Object | Local Density | 75.10 | 69.56 | 74.24 | 77.63 | 79.57 | 77.96 | - | - | - | 76.73 | 80.84 | 78.63 | 78.73 |
| | Local Cutout | 68.29 | 61.80 | 67.94 | 72.29 | 75.06 | 73.22 | - | - | - | 69.92 | 76.64 | 64.88 | 71.01 |
| | Local Gaussian | 72.31 | 76.58 | 69.82 | 70.44 | 77.44 | 75.11 | - | - | - | 75.76 | 82.02 | 55.66 | 72.85 |
| | Local Uniform | 80.17 | 78.04 | 77.67 | 82.09 | 80.77 | 78.64 | - | - | - | 81.71 | 84.69 | 79.94 | 79.61 |
| | Local Impulse | 81.56 | 78.43 | 80.26 | 84.03 | 82.25 | 79.53 | - | - | - | 82.21 | 85.78 | 84.29 | 82.07 |
| | Shear | 41.64 | 39.63 | 39.80 | 47.72 | 37.08 | 26.56 | 1.68 | 2.99 | 1.33 | 41.43 | 45.77 | - | - |
| | Scale | 73.11 | 70.29 | 71.50 | 76.81 | 75.90 | 75.02 | 0.13 | 0.15 | 0.33 | 69.05 | 69.48 | - | - |
| | Rotation | 76.84 | 72.70 | 75.57 | 79.93 | 75.50 | 76.98 | 1.11 | 2.14 | 2.57 | 74.62 | 77.76 | - | - |
| Alignment | Spatial | - | - | - | - | - | - | - | - | - | 35.14 | 43.01 | - | - |
| | Average ($\text{AP}_{\text{cor}}$) | 70.45 | 65.48 | 67.74 | 72.59 | 69.92 | 60.55 | 2.68 | 2.42 | 3.05 | 67.81 | 71.87 | 80.93 | 85.66 |
| | RCE (%) $\downarrow$ | 13.65 | 16.49 | 15.92 | 13.98 | 15.20 | 24.34 | 62.20 | 70.12 | 73.46 | 22.03 | 18.02 | 5.97 | 6.84 |

  • †: Results from Ref. [254].

  • * denotes the result of our re-implementation.

TABLE V: Comparison with SOTA methods on nuScenes-C validation set with mAP. ‘D.I.’ refers to DeepInteraction [237]. The best one is highlighted in bold. ‘RCE’ denotes Relative Corruption Error from Ref.[254].
Method groups: LiDAR-Only (PointPillars, SSN, CenterPoint); Camera-Only (FCOS3D, PGD, DETR3D, BEVFormer); Multimodal (FUTR3D, TransFusion, BEVFusion, D.I.).

| Category | Corruption | PointPillars† | SSN† | CenterPoint† | FCOS3D† | PGD† | DETR3D† | BEVFormer† | FUTR3D† | TransFusion† | BEVFusion† | D.I.* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | None ($\text{AP}_{\text{clean}}$) | 27.69 | 46.65 | 59.28 | 23.86 | 23.19 | 34.71 | 41.65 | 64.17 | 66.38 | 68.45 | 69.90 |
| Weather | Snow | 27.57 | 46.38 | 55.90 | 2.01 | 2.30 | 5.08 | 5.73 | 52.73 | 63.30 | 62.84 | 62.36 |
| | Rain | 27.71 | 46.50 | 56.08 | 13.00 | 13.51 | 20.39 | 24.97 | 58.40 | 65.35 | 66.13 | 66.48 |
| | Fog | 24.49 | 41.64 | 43.78 | 13.53 | 12.83 | 27.89 | 32.76 | 53.19 | 53.67 | 54.10 | 54.79 |
| | Sunlight | 23.71 | 40.28 | 54.20 | 17.20 | 22.77 | 34.66 | 41.68 | 57.70 | 55.14 | 64.42 | 64.93 |
| Sensor | Density | 27.27 | 46.14 | 58.60 | - | - | - | - | 63.72 | 65.77 | 67.79 | 68.15 |
| | Cutout | 24.14 | 40.95 | 56.28 | - | - | - | - | 62.25 | 63.66 | 66.18 | 66.23 |
| | Crosstalk | 25.92 | 44.08 | 56.64 | - | - | - | - | 62.66 | 64.67 | 67.32 | 68.12 |
| | FOV lost | 8.87 | 15.40 | 20.84 | - | - | - | - | 26.32 | 24.63 | 27.17 | 42.66 |
| | Gaussian (L) | 19.41 | 39.16 | 45.79 | - | - | - | - | 58.94 | 55.10 | 60.64 | 57.46 |
| | Uniform (L) | 25.60 | 45.00 | 56.12 | - | - | - | - | 63.21 | 64.72 | 66.81 | 67.42 |
| | Impulse (L) | 26.44 | 45.58 | 57.67 | - | - | - | - | 63.43 | 65.51 | 67.54 | 67.41 |
| | Gaussian (C) | - | - | - | 3.96 | 4.33 | 14.86 | 15.04 | 54.96 | 64.52 | 64.44 | 66.52 |
| | Uniform (C) | - | - | - | 8.12 | 8.48 | 21.49 | 23.00 | 57.61 | 65.26 | 65.81 | 65.90 |
| | Impulse (C) | - | - | - | 3.55 | 3.78 | 14.32 | 13.99 | 55.16 | 64.37 | 64.30 | 65.65 |
| Motion | Compensation | 3.85 | 10.39 | 11.02 | - | - | - | - | 31.87 | 9.01 | 27.57 | 39.95 |
| | Moving Obj. | 19.38 | 35.11 | 44.30 | 10.36 | 10.47 | 16.63 | 20.22 | 45.43 | 51.01 | 51.63 | - |
| | Motion Blur | - | - | - | 10.19 | 9.64 | 11.06 | 19.79 | 55.99 | 64.39 | 64.74 | 65.45 |
| Object | Local Density | 26.70 | 45.42 | 57.55 | - | - | - | - | 63.60 | 65.65 | 67.42 | 67.71 |
| | Local Cutout | 17.97 | 32.16 | 48.36 | - | - | - | - | 61.85 | 63.33 | 63.41 | 65.19 |
| | Local Gaussian | 25.93 | 43.71 | 51.13 | - | - | - | - | 62.94 | 63.76 | 64.34 | 64.75 |
| | Local Uniform | 27.69 | 46.87 | 57.87 | - | - | - | - | 64.09 | 66.20 | 67.58 | 66.44 |
| | Local Impulse | 27.67 | 46.88 | 58.49 | - | - | - | - | 64.02 | 66.29 | 67.91 | 67.86 |
| | Shear | 26.34 | 43.28 | 49.57 | 17.20 | 16.66 | 17.46 | 24.71 | 55.42 | 62.32 | 60.72 | - |
| | Scale | 27.29 | 45.98 | 51.13 | 6.75 | 6.57 | 12.02 | 17.64 | 55.42 | 62.32 | 60.72 | - |
| | Rotation | 27.80 | 46.93 | 54.68 | 17.21 | 16.84 | 27.28 | 33.97 | 59.64 | 63.36 | 65.13 | - |
| Alignment | Spatial | - | - | - | - | - | - | - | 63.77 | 66.22 | 68.39 | - |
| | Temporal | - | - | - | - | - | - | - | 51.43 | 43.65 | 49.02 | - |
| | Average ($\text{AP}_{\text{cor}}$) | 23.42 | 40.37 | 49.81 | 10.26 | 10.68 | 18.60 | 22.79 | 56.99 | 58.73 | 61.03 | 62.92 |
| | RCE (%) $\downarrow$ | 15.42 | 13.46 | 15.98 | 57.00 | 53.95 | 46.89 | 46.41 | 11.45 | 11.52 | 10.84 | 11.09 |

  • †: Results from Ref. [254].

  • * denotes the result of our re-implementation.

Figure 8: The general pipelines for LiDAR-based 3D object detection.

IV LiDAR-based 3D Object Detection

In this section, we introduce LiDAR-based 3D object detection, which utilizes point clouds as input data and extracts point cloud features to predict 3D objects. The point cloud (PC) is a set of points in Euclidean space, which can be expressed as $L_{point} \in R^{N \times (3+C)}$, where 3 and C are the numbers of coordinate channels (x, y, z-axis) and extra feature channels (reflection intensity in general), respectively, and N is the number of points. The characteristics of the points are the following: (1) Sparsity. Point clouds are distributed discretely, and the location and spacing of the points are irregular in 3D space. (2) Invariance. For example, when different sensors observe the same object, the order in which the points are collected differs. This makes it hard for a network to fit the data well, because the same object can be represented by differently ordered points. Therefore, applying CNNs directly to point clouds causes information loss and makes training difficult, and preprocessing operations on point clouds are usually required to solve such problems.

Compared to camera-based methods, LiDAR captures precise 3D information, enabling LiDAR-based approaches to achieve higher detection accuracy and robustness, especially in extreme weather conditions [254]. This is because, in comparison to optical radiation, the laser beams emitted by LiDAR systems can penetrate certain weather disturbances, such as raindrops and haze, with only slight interference. However, the high cost of LiDAR remains one of the main barriers to large-scale adoption of LiDAR-based methods. In addition, the lack of semantic information leads to poor classification performance of LiDAR-based methods. Generally, as shown in Fig. 8, LiDAR-based methods can be categorized into four types: (1) view-based 3D object detection, (2) voxel-based 3D object detection, (3) point-based 3D object detection, (4) point-voxel-based 3D object detection. In contrast to previous reviews [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275], our survey extends beyond the conventional classifications of LiDAR-based methods. We adopt a more foundational idea and classify LiDAR-based methods based on their core data representations (such as 2D Bird’s Eye View (BEV), Voxel, Pillar, etc.) and underlying model structure (base models including Convolutional Neural Networks (CNN), Transformers, PointNet, and others). This methodological restructuring aims to provide a comprehensive understanding of the technological paradigms at the heart of LiDAR-based methods, analyzing and classifying these systems from a more essential, technical lineage perspective.

IV-A View-based 3D object detection

The core idea of the view-based methods is to transform point clouds into pseudo-image representations from a bird’s-eye view (BEV) or range view by plane [392], cylindrical [393], or spherical [394] projections. In these representations, each pixel contains 3D spatial information rather than RGB values. Due to the dense representation of pseudo-images, traditional or specialized 2D convolutions can be seamlessly applied to range images, making the feature extraction process highly efficient. However, compared to other LiDAR-based methods, detection using range views is more susceptible to occlusion and scale variations. Hence, some methods [175, 173] transform the data representation from the range view to the bird’s eye view (BEV). Furthermore, because 3D points that are far apart can become adjacent in the 2D image after projection, traditional 2D CNN feature extraction operators may become less effective. In response to this issue, some methods [174, 176, 395] have specifically redesigned feature extraction operators for range images. Based on the different data representation views, the view-based methods can be divided into two categories: 1) Range View, 2) BEV View.

IV-A1 Range View

Due to the sparsity of point cloud data, projecting it directly onto an image plane results in a sparse 2D point map. Therefore, most methods [175, 173, 176, 396, 397, 174] project the point cloud into cylindrical coordinates to generate a dense front-view representation by using the following projection function:

$$
\begin{aligned}
\theta &= \operatorname{atan2}(y, x), \\
\phi &= \arcsin\left(z / \sqrt{x^{2} + y^{2} + z^{2}}\right), \\
r &= \left\lfloor \theta / \Delta\theta \right\rfloor, \\
c &= \left\lfloor \phi / \Delta\phi \right\rfloor,
\end{aligned}
\tag{5}
$$

where $p = (x, y, z)^{\mathrm{T}}$ denotes a 3D point and $(r, c)$ denotes the 2D map position of its projection. $\theta$ and $\phi$ denote the azimuth and elevation angle when observing the point. $\Delta\theta$ and $\Delta\phi$ are the average horizontal and vertical angle resolution between consecutive beam emitters, respectively.
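Eq. (5) can be sketched in a few lines of NumPy. The function name and array layout below are illustrative assumptions; the index definitions follow Eq. (5) as written, so the returned indices can be negative and would need to be offset by the sensor's minimum azimuth and elevation before being used to address an image.

```python
import numpy as np

def range_view_indices(points, d_theta, d_phi):
    """Map 3D points to 2D range-view positions following Eq. (5).

    points:  (N, 3+C) array whose first three columns are x, y, z.
    d_theta: horizontal angular resolution in radians.
    d_phi:   vertical angular resolution in radians.
    Returns (r, c, rng): integer index arrays and the range of each point.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    theta = np.arctan2(y, x)                       # azimuth
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)        # distance to the sensor
    phi = np.arcsin(z / np.maximum(rng, 1e-12))    # elevation, guarded at the origin

    r = np.floor(theta / d_theta).astype(np.int64)
    c = np.floor(phi / d_phi).astype(np.int64)
    return r, c, rng
```

Each pixel of the resulting range image would then typically store quantities such as the range or intensity of the point that falls into it, which is how every pixel comes to carry 3D spatial information rather than RGB values.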

VeloFCN [393] is an influential work that first introduced the projection method in cylindrical coordinates, and it was followed by [174, 176, 175, 173]. LaserNet [396] utilized DLA-Net [398] to obtain multi-scale features and detect 3D objects from this representation. Inspired by LaserNet, some works borrow models from 2D object detection to handle range images, e.g., U-Net [399] is applied in [397, 400, 173], RPN [401] is employed in [173, 174], and FPN [402] is leveraged in [176]. Considering the limitations of traditional 2D CNNs in extracting features from range images, some works resort to novel operators, including range dilated convolutions [174], graph operators [395], and meta-kernel convolutions [176]. Furthermore, some works have focused on addressing occlusion and scale variation issues in range view. Specifically, these methods [175, 173] construct feature transformation structures from range view to point view and from point view to BEV view to convert range features into the BEV perspective.

IV-A2 BEV View

Compared to range view detection, BEV-based detection is more robust to occlusion and scale variation challenges. Hence, extracting features from the range view and detecting objects from the bird’s eye view becomes the most practical solution to range-based 3D object detection.

The bird’s eye view representation is encoded by height, intensity, and density. The point cloud is discretized into a regular 2D grid. To encode more detailed height information, the point cloud is evenly divided into $M$ slices, resulting in $M$ height maps where each grid cell stores the maximum height value of the points in that cell. The intensity feature represents the reflectance value of the point within each grid cell, and the density feature indicates the number of points in each cell. PIXOR [403], which outputs oriented 3D object estimates decoded from pixel-wise neural network predictions, is a pioneering work in this field and is followed by [404, 405, 175, 176]. These methods usually entail three stages. First, the point cloud is projected into a novel cell encoding for bird’s eye view projection. Later, both the object location on the plane and its heading are estimated through a convolutional neural network originally designed for image processing. Considering scale variation and occlusion, RangeRCNN [173] and RangeIOUDet [175] introduced a point view that serves as a bridge from RV to BEV and provides pointwise features for models.
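The height, intensity, and density encoding described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the implementation of any cited method: heights are stored relative to the lower z bound, the intensity channel keeps the maximum reflectance per cell (one possible choice), and the function name `bev_encode` and all ranges are hypothetical.

```python
import numpy as np

def bev_encode(points, x_range, y_range, z_range, cell, num_slices):
    """points: (N, 4) array of x, y, z, intensity. Returns (M + 2, rows, cols)."""
    x, y, z, inten = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, inten = x[keep], y[keep], z[keep], inten[keep]

    n_rows = int(round((x_range[1] - x_range[0]) / cell))
    n_cols = int(round((y_range[1] - y_range[0]) / cell))
    row = np.minimum(((x - x_range[0]) / cell).astype(np.int64), n_rows - 1)
    col = np.minimum(((y - y_range[0]) / cell).astype(np.int64), n_cols - 1)
    slice_h = (z_range[1] - z_range[0]) / num_slices
    sl = np.minimum(((z - z_range[0]) / slice_h).astype(np.int64), num_slices - 1)

    # M height maps: maximum height (relative to z_min) per cell and slice.
    height = np.zeros((num_slices, n_rows, n_cols))
    np.maximum.at(height, (sl, row, col), z - z_range[0])
    # Intensity map: maximum reflectance per cell (assumed aggregation).
    intensity = np.zeros((n_rows, n_cols))
    np.maximum.at(intensity, (row, col), inten)
    # Density map: number of points per cell.
    density = np.zeros((n_rows, n_cols))
    np.add.at(density, (row, col), 1.0)
    return np.concatenate([height, intensity[None], density[None]], axis=0)

points = np.array([[1.0, 1.0, 0.5, 0.2],
                   [1.2, 1.1, 1.5, 0.9],
                   [5.0, 5.0, 0.1, 0.4]])
bev = bev_encode(points, (0, 8), (0, 8), (0, 2), cell=2.0, num_slices=2)
```

The resulting array has $M + 2$ channels and can be fed to an ordinary 2D CNN.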

IV-B Voxel-based 3D object detection

Voxel-based methods divide the sparse point cloud and assign its points to regular voxels, forming a dense data representation; this process is termed voxelization. The overall process of voxel-based methods is illustrated in Fig. 8. Compared to view-based methods, voxel-based methods leverage spatial convolution to effectively perceive 3D spatial information and achieve higher detection accuracy. Voxel-based methods face the following challenges:

  • High computational complexity: Compared with camera-based methods, voxel-based methods require significant memory and computation resources due to the large number of voxels used to represent the 3D space.

  • Spatial information loss: Due to the discrete nature of voxels, details and shape information can be lost or blurred during the voxelization process. Additionally, the limited resolution of voxels makes it challenging to accurately detect small objects.

  • Inconsistency in scale and density: Voxel-based methods are typically performed on voxel grids with specific scales and densities. However, because object scales and point cloud densities vary significantly across scenes, adapting these methods to different scenes is challenging.

To overcome the aforementioned challenges, it is necessary to address the limitations of data representation, improve the network’s feature capacity and target localization accuracy, and enhance the algorithm’s understanding of complex scenes. Doing so is crucial for ensuring safety perception in autonomous driving. Although the optimization strategies of these methods may vary, they share two common perspectives of model optimization: 1) data representation and 2) model structure.

IV-B1 Data representation

Voxel-based methods first rasterize point clouds into discrete grid representations. Grid representations are closely related to accuracy, computational complexity, and memory requirements. A voxel size that is too large causes significant information loss, while a voxel size that is too small increases the computation and memory burden. As shown in Fig. 9, according to the height along the z-axis, grid representations can be categorized into voxels and pillars.

Figure 9: Comparison of voxel voxelization with pillar voxelization. The blue cubes in the image represent non-empty voxels.
Voxel

The voxel grid process divides the 3D space into regular voxels of size ($d_L \times d_W \times d_H$) in the x, y, and z directions, respectively. Due to the sparse distribution of point clouds, the majority of voxels are empty, so only non-empty voxels that contain points are stored and used for feature extraction. As a pioneering work among voxel-based methods [112, 115, 113, 109, 129, 130, 105, 151, 143, 101, 102, 103, 125, 251, 182, 104, 121, 122, 136, 230, 114, 108, 131], VoxelNet [108] proposes a novel voxel feature encoding (VFE) layer to extract features from the points inside a voxel cell. Subsequent works [113, 149, 114, 104, 103, 125, 161] have extended VoxelNet by adopting similar voxel encoding approaches. Existing methods often perform local partitioning and feature extraction uniformly across all positions in the point cloud, which limits the receptive field for distant regions and causes information truncation. Therefore, some works have proposed different approaches to voxel partitioning: 1) Different coordinate systems. Some approaches have reexamined voxel partitioning from different coordinate system perspectives, e.g., [200, 153] from cylindrical and [132] from spherical coordinate systems. Sphereformer [132] facilitates the aggregation of information for sparse distant points by dividing the 3D space into multiple non-overlapping radial windows using spherical coordinates ($r$, $\theta$, $\phi$), thereby enhancing information integration from dense point regions. 2) Multi-scale voxels. Some works generate voxels of different scales [115, 150] or use reconfigurable voxels [155], e.g., HVNet [150] proposes a hybrid voxel network that integrates different scales in the point-level voxel feature encoder (VFE). In addition, two lines of work incorporate additional information for voxel fusion: 1) Additional temporal information. Some methods [122, 126, 406, 407, 408, 409, 410] integrate point cloud data from multiple time steps to obtain global environmental information, effectively mitigating the effects of objects blocking each other and providing a denser representation that captures more detailed spatial information. 2) Additional spatial prior information. Some works encode density information [120, 411, 412, 413] or voxel centroid information [120] in point cloud processing.
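The idea of storing only non-empty voxels can be sketched in a few lines of NumPy. This is a simplified illustration, not VoxelNet's learned VFE layer: each non-empty voxel is summarized by its point count and centroid, two of the hand-crafted quantities mentioned above. The function name `voxelize` and the voxel size are hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size, origin):
    """Group (N, 3) points into voxels and keep only the non-empty ones."""
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Unique integer coordinates are exactly the non-empty voxels.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(coords))   # density per voxel
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)
    centroids = sums / counts[:, None]                     # centroid per voxel
    return coords, centroids, counts

points = np.array([[0.1, 0.1, 0.1],
                   [0.3, 0.2, 0.1],
                   [1.6, 0.1, 0.1]])
coords, centroids, counts = voxelize(points, np.array([0.5, 0.5, 0.5]), np.zeros(3))
```

Here three points occupy only two voxels, and no memory is spent on the empty voxels between them.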

Pillar

Pillars can be considered a special form of voxels. Specifically, point clouds are discretized into a grid uniformly distributed on the x-y plane, creating a set of pillars without binning along the z-axis. Pillar features can be aggregated from points through a PointNet [414] and then scattered back to construct a 2D BEV image for feature extraction. As the pioneering work in this series [183, 118, 135, 119, 181, 123, 156], PointPillars [118] first introduced the pillar representation. Follow-up works have extended ideas from 2D detection to PointPillars. PillarNet [183] adopts the ‘encoder-neck-head’ detection architecture to enhance the performance of pillar-based methods. SWFormer [135] and SST [123] draw inspiration from the Swin Transformer [352] and apply a hierarchical window mechanism to pseudo-images, enabling the network to maintain a global receptive field while achieving faster inference speed. PillarNext [119] integrates a series of mature 2D detection techniques and achieves performance comparable to voxel-based methods.
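The aggregate-then-scatter step for pillars can be sketched as below. As a simplifying assumption, the learned PointNet is replaced by a plain channel-wise max over the raw point features in each pillar; the function name `pillar_pseudo_image` and the grid parameters are hypothetical.

```python
import numpy as np

def pillar_pseudo_image(points, pillar_size, x_range, y_range):
    """points: (N, C) with x, y in the first two columns. Returns (C, rows, cols)."""
    n_rows = int(round((x_range[1] - x_range[0]) / pillar_size))
    n_cols = int(round((y_range[1] - y_range[0]) / pillar_size))
    row = np.floor((points[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    col = np.floor((points[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    keep = (row >= 0) & (row < n_rows) & (col >= 0) & (col < n_cols)
    row, col, feats = row[keep], col[keep], points[keep]

    # No binning along z: every point in the same (row, col) shares one pillar.
    canvas = np.full((feats.shape[1], n_rows, n_cols), -np.inf)
    for ch in range(feats.shape[1]):
        # Max pooling stands in for the learned per-pillar PointNet.
        np.maximum.at(canvas[ch], (row, col), feats[:, ch])
    canvas[np.isinf(canvas)] = 0.0   # empty pillars become zeros
    return canvas

points = np.array([[0.5, 0.5, 1.0],
                   [0.6, 0.4, 2.0],
                   [1.5, 0.5, -1.0]])
image = pillar_pseudo_image(points, 1.0, (0.0, 2.0), (0.0, 2.0))
```

The output is a dense 2D pseudo-image, which is why pillar-based detectors can reuse 2D backbones directly.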

IV-B2 Model Structure

Most voxel-based detectors consist of two fundamental components: voxel-based data representation and voxel-based neural networks. Specifically, there are primarily three major types of neural networks of voxel-based methods: 1) 2D CNNs used for processing BEV feature maps and pillars. 2) 3D CNNs for processing voxels. 3) Transformers for handling both voxels and pillars.

2D CNN

2D CNN is primarily used to detect 3D objects from a bird’s-eye view perspective, including processing BEV (Bird’s Eye View) feature maps and pillars [118, 183, 119, 181, 135]. Specifically, the 2D CNNs used for processing BEV feature maps often come from well-developed 2D object detection networks, such as Darknet [415], ResNet [416], FPN [402], and RPN [401]. Some early voxel-based works drew inspiration from mature ideas in 2D detection, e.g., Voxel-FPN [115]. One significant advantage of 2D CNN compared to 3D CNN is its faster speed. However, because it has difficulty capturing spatial relationships and shape information, 2D CNN typically exhibits lower accuracy.

3D Sparse CNN

3D Sparse CNN consists of two core operators: sparse convolution and submanifold convolution [417], which ensure that the convolutional operation is performed only on non-empty voxels. SECOND [109] implements efficient computation of the sparse convolution [417] and submanifold convolution [418] operators by constructing a hash table, gaining fast inference speed. Due to its outstanding performance, it is followed by [112, 125, 113, 250, 104]. However, the limited receptive field of 3D Sparse CNNs leads to information truncation and restricts the model’s feature extraction capabilities. Meanwhile, the sparse representation of features makes it challenging for the model to capture fine-grained object boundaries and detailed information. To address these issues, two main optimization strategies have emerged: 1) Expanding the model’s receptive field. Some methods [130, 129] extend the concept of large kernel convolution from 2D to 3D space, or introduce additional downsampling layers in the model [112]. 2) Combining sparse and dense representations. Methods in this category typically utilize dense prediction heads to prevent information loss [125, 113, 108, 109, 372], retrieve lost 3D information from the detection process [113, 168, 372, 180, 102, 125], or add auxiliary tasks to the model [372, 201, 419, 103, 104, 151]. Methods employing dense prediction heads typically require high-resolution Bird’s Eye View (BEV) feature maps on which to conduct dense predictions. Considering computational complexity, some recent methods aim to establish global sparse and local dense prediction relationships [124, 131]. Meanwhile, certain detection methods focus on recovering 3D information from the detection process, for instance, the pioneering two-stage detection work Voxel-RCNN [113], which uses the Voxel ROI Pooling module to aggregate early features from voxels near instances and recover lost 3D information. Subsequent works have designed approaches based on the Voxel-RCNN paradigm, such as using corner points [180, 102, 104] or keypoints [125, 105].
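The difference between the two operators, and the receptive-field limitation noted above, can be illustrated with a toy pure-Python sketch. It is not SECOND's GPU implementation: a Python dict plays the role of the hash table, features are scalars, and the names `sparse_conv_sites` and `submanifold_conv` are hypothetical.

```python
from itertools import product

def kernel_offsets(k=3):
    half = k // 2
    return list(product(range(-half, half + 1), repeat=3))

def sparse_conv_sites(active, k=3):
    """Regular sparse convolution: any site whose kernel window touches an
    active input becomes active, so the active set dilates."""
    return {(x + dx, y + dy, z + dz)
            for (x, y, z) in active
            for (dx, dy, dz) in kernel_offsets(k)}

def submanifold_conv(features, weights):
    """Submanifold convolution: outputs are computed only at input sites.
    features: dict coord -> float (hash table of non-empty voxels)."""
    out = {}
    for (x, y, z) in features:
        acc = 0.0
        for (dx, dy, dz), w in weights.items():
            # Hash lookup; empty neighbours contribute nothing.
            acc += w * features.get((x + dx, y + dy, z + dz), 0.0)
        out[(x, y, z)] = acc
    return out

features = {(0, 0, 0): 2.0, (1, 0, 0): 3.0, (5, 5, 5): 1.0}
weights = {off: 1.0 for off in kernel_offsets(3)}   # all-ones 3x3x3 kernel
dilated = sparse_conv_sites(set(features))
out = submanifold_conv(features, weights)
```

The two adjacent voxels exchange information, while the isolated voxel at (5, 5, 5) only sees itself, a small instance of the information truncation that motivates larger kernels and extra downsampling.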

Numerous methods resort to auxiliary tasks to enhance the spatial features and provide implicit guidance for accurate 3D object detection, including IoU prediction to rectify the object confidence scores [151, 104, 103], object shape completion to complete object shapes from sparse point clouds [420, 162], and object part estimation to gain 3D structure information by identifying the part information inside objects [372, 201].

Transformer

In recent years, transformers [352, 353] have developed rapidly in computer vision and have shown strong performance on numerous tasks. Therefore, many endeavors have been made to adapt Transformers to 3D object detection. In particular, recent studies [253, 254] have confirmed the excellent robustness of transformer-based models, which will further advance research in the domain of safety perception for autonomous driving.

Compared with CNNs, the query-key-value design and the self-attention mechanism enable transformers to model global relationships, resulting in a larger receptive field. However, the primary obstacle to applying Transformer-based models efficiently is the quadratic time and space complexity of the global attention mechanism. Hence, it is critical to design specialized attention mechanisms for Transformer-based 3D object detectors. Transformer [351], DETR [355], and ViT [353] are the works that have most significantly influenced 3D transformer-based methods [127, 123, 126, 135, 181, 230, 223, 128]. They have each inspired subsequent 3D detection works in various aspects: the design of attention mechanisms, the architecture of encoders and decoders, and the development of patch-based inputs and architectures similar to visual transformers.
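A back-of-the-envelope count shows why quadratic complexity matters for voxel inputs. The sketch below compares the number of query-key pairs under global attention with the number under non-overlapping window attention; the token count and window size are hypothetical illustrative values, not figures from any cited method.

```python
def global_attention_pairs(n_tokens):
    """Every token attends to every token: N * N score entries."""
    return n_tokens * n_tokens

def window_attention_pairs(n_tokens, window):
    """Tokens are split into non-overlapping windows of `window` tokens;
    attention is computed only inside each window."""
    full, rest = divmod(n_tokens, window)
    return full * window * window + rest * rest

n_voxels = 20_000          # hypothetical number of non-empty voxels
pairs_global = global_attention_pairs(n_voxels)
pairs_window = window_attention_pairs(n_voxels, window=100)
```

With these numbers, restricting attention to windows cuts the score matrix from 400 million entries to 2 million, at the price of a local rather than global receptive field per layer.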

Inspired by transformer [351], VoTr [127] is the first work to incorporate transformers into a voxel-based backbone network, which is composed of sparse attention and sparse submanifold attention modules. Subsequent work [128] has continued to build on the voxel transformer, further optimizing the time complexity of the attention mechanism. DETR [355] has inspired a range of networks to adopt a similar encoder-decoder structure. As a notable example, TransFusion [230] generates object queries from initial detections and applies cross-attention to LiDAR and image features within the Transformer decoder for 3D object detection. Meanwhile, many papers [181, 135, 123] explore and refine the patch-based input mechanism from ViT [353] and the window attention mechanism from Swin Transformer [352], e.g., SST [123] and SWFormer [135] group local regions of voxels into patches, apply sparse regional attention, and then apply region shift to change the grouping. It is noteworthy that SEFormer [181] is the first to introduce object structure encoding into the transformer module.

IV-C Point-based 3D object detection

Benefiting from the progress of deep learning on point clouds [421, 414, 422, 423], point-based 3D object detection inherits many of these frameworks and detects objects directly from the raw points without preprocessing. Compared to voxel-based methods, raw points retain the most original information, which is beneficial for fine-grained feature acquisition. Meanwhile, a series of point-based backbones [414, 421] naturally provide a strong baseline for point-based methods. However, the performance of point-based methods is still influenced by two factors: the number of contextual points and the context radius used in feature learning. For example, increasing the number of contextual points can provide fine-grained 3D information but significantly increases the model’s inference time. Similarly, reducing the context radius can achieve the same effect. Therefore, selecting appropriate values for these two factors enables the model to achieve a balance between accuracy and speed. To address these issues, existing methods mostly focus on optimizing the two basic components of point-based 3D object detectors: 1) Point Cloud Sampling. 2) Feature Learning.

IV-C1 Point Cloud Sampling

FPS (Farthest Point Sampling), originating from the work on PointNet++ [421], is a point cloud sampling method extensively utilized in point-based methods. It aims to select a set of representative points from the raw points, such that their mutual distances are maximized, thereby optimally covering the entire spatial distribution of the point cloud.
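The sampling rule can be sketched with the standard greedy procedure: repeatedly pick the point farthest from everything selected so far. This is a minimal NumPy illustration rather than an optimized GPU kernel; the function name and the choice of starting index are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS over (N, 3) points; returns indices of the sampled points."""
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    # Squared distance from each point to its nearest selected point.
    min_dist = np.full(n, np.inf)
    current = start
    for i in range(n_samples):
        selected[i] = current
        d = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        current = int(np.argmax(min_dist))   # farthest from the selected set
    return selected

points = np.array([[0.0, 0.0, 0.0],
                   [0.1, 0.0, 0.0],
                   [10.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0]])
sampled = farthest_point_sampling(points, n_samples=3)
```

Each iteration scans all N points, so sampling S points costs O(N·S), which is one reason later works look for cheaper or more task-aware sampling strategies.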

PointRCNN [111], a pioneering two-stage detector among point-based methods, utilizes PointNet++ [421] with multi-scale grouping as the backbone network. In stage 1, it generates 3D proposals from point clouds in a bottom-up manner. The stage-2 network refines the proposals by combining semantic features and local spatial features. A similar framework was proposed by [144], whose model eliminates part of the background information with 2D image semantic segmentation.

However, existing methods relying on Farthest Point Sampling (FPS) still face several issues: 1) Points irrelevant to detection also participate in the sampling process, leading to additional computational burden. 2) The distribution of points across different parts of an object is uneven, resulting in suboptimal sampling strategies. Subsequent works have employed design paradigms similar to FPS while addressing these issues, such as segmentation-guided background point filtering [186, 144], random sampling [424], feature space sampling [105], voxel-based sampling [107, 110], coordinate refinement [138], and ray-based grouping sampling [192].

IV-C2 Model Structure

The feature learning stage in point-based methods aims to extract discriminative feature representations from raw points. The neural network used in the feature learning phase should possess the following characteristics: 1) Invariance, where the point cloud backbone network should be insensitive to the ordering of input point clouds, 2) Local awareness, enabling the backbone network to perceive and model local regions and extract local features, 3) The ability to integrate contextual information, allowing the backbone network to extract features from both global and local contextual information.
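The invariance requirement can be illustrated with a toy PointNet-style encoder: a shared per-point layer followed by a symmetric max aggregation. The sketch uses random, untrained weights purely to demonstrate order invariance; the function name and layer sizes are hypothetical.

```python
import numpy as np

def global_feature(points, weight, bias):
    """Shared per-point linear layer + ReLU, then a symmetric max over points."""
    per_point = np.maximum(points @ weight + bias, 0.0)   # (N, F), same weights for every point
    return per_point.max(axis=0)                          # (F,), independent of point order

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))
weight = rng.normal(size=(3, 16))    # random stand-in for learned weights
bias = rng.normal(size=16)

shuffled = points[rng.permutation(len(points))]
feat_a = global_feature(points, weight, bias)
feat_b = global_feature(shuffled, weight, bias)
```

Because max is symmetric in its arguments, reordering the input points leaves the global feature unchanged; local awareness and context integration require additional grouping steps on top of this.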

Based on the aforementioned characteristics, a multitude of detectors have been designed for processing raw points. However, most methods can be categorized according to the core operators they utilize: 1) PointNet-based methods [111, 251, 195, 186, 425]. 2) Graph Neural Network-based methods [110, 426, 424, 427, 134]. 3) Transformer-based methods [138, 428].

PointNet-based

PointNet-based methods [111, 251, 195, 186, 425] primarily rely on Set Abstraction [414] to downsample raw points, aggregate local information, and integrate contextual information, while preserving the symmetry invariance of raw points. Point-RCNN [111], as the first two-stage work among point-based methods, achieved strong performance at its time, yet it still faces the issue of high computational cost. Subsequent works [144, 186] have addressed this issue by introducing an additional semantic segmentation task during the detection process to filter out background points that contribute minimally to detection. Furthermore, some efforts have focused on resolving the issue of the uncontrolled receptive field in PointNet and PointNet++, such as through the use of GNN [161] or Transformer [138].

Graph-based

GNNs (Graph Neural Networks) possess key elements such as an adaptive structure, dynamic neighborhood, the capability to construct both local and global contextual relationships, and robustness against irregular sampling. These characteristics naturally endow GNNs with an advantage in handling irregular point clouds. Point-GNN [110], a pioneering work, designs a one-stage graph neural network to predict objects with an auto-registration mechanism and a merging and scoring operation, which demonstrates the potential of using graph neural networks as a new approach for 3D object detection. Most graph-based point-based methods [110, 426, 424, 427, 429] aim to fully utilize contextual information. This motivation leaves room for further improvements in subsequent works [427, 429].

Transformer-based

Up to this point, a series of methods [126, 430, 428, 217, 138, 431] have explored the use of transformers for feature learning in point clouds and have achieved excellent results. Pointformer [138] introduced local and global attention modules for processing 3D point clouds. The local transformer module models interactions among points within local areas, aiming to learn contextually relevant regional features at the object level. The global transformer, on the other hand, focuses on learning context-aware representations at the scene level. Subsequently, the local-global Transformer combines local features with high-resolution global features to further capture dependencies between multi-scale representations. Group-free [428] directly utilizes all points in the point cloud to compute features for each object candidate, where the contribution of each point is determined by an automatically learned attention module. Specifically, the authors adapted the Transformer to suit 3D object detection, enabling it to model both object-to-object and object-to-pixel relationships and extract object features without manual grouping. Moreover, by iteratively refining the spatial encoding of objects at different stages, the detection performance is further enhanced.

Point-based transformers directly process unstructured and unordered raw point clouds. This results in significantly higher computational complexity compared to structured voxel data. Consequently, transformer-based methods are far less prevalent among point-based methods than among voxel-based methods.

IV-D Point-Voxel based 3D object detection

Point-based methods offer high resolution and preserve the spatial structure of the original data, but they suffer from high computational complexity and inefficiency in handling sparse data. In contrast, voxel-based methods provide a structured data representation, enhancing computational efficiency and facilitating the application of conventional convolutional neural network techniques. However, due to the discretization process, they often lose fine spatial details. Driven by these issues, PV-based methods were developed. Point-voxel methods aim to leverage the fine-grained information capture capabilities of point-based methods and the computational efficiency of voxel-based methods. By integrating these methods, point-voxel based methods enable a more detailed processing of point cloud data, capturing both the global structure and micro-geometric details. This is critically important for safety perception in autonomous driving, as the accuracy of decisions made by autonomous driving systems depends on high-precision detection results.

The key goal of point-voxel methods is to enable feature interplay between voxels and points via point-to-voxel or voxel-to-point transformations. The idea of leveraging point-voxel feature fusion in backbones has been explored by many works [116, 199, 146, 164, 166, 117, 165, 371, 432, 120, 433, 168, 139]. These methods fall into two categories: 1) Early Fusion. Early fusion methods [116, 199, 146, 164, 166, 165] fuse voxel features and point features within the backbone network. 2) Late Fusion. Late fusion methods [117, 371, 432, 120, 433, 168, 139], typically two-stage detection approaches, use voxel-based methods for initial proposal box generation, followed by sampling and refining keypoint features from the point cloud to enhance 3D proposals.

Early Fusion

Some methods [116, 199, 146, 163, 164, 165, 166] have explored using new convolutional operators to fuse voxel and point features, with PVCNN [116] potentially being the first work in this direction. In this approach, the voxel-based branch initially converts points into a low-resolution voxel grid and aggregates neighboring voxel features through convolution. Then, through a process called devoxelization, voxel-level features are transformed back to point-level features and fused with features obtained from the point-based branch. The point-based branch extracts features for each individual point. Since it does not aggregate neighboring information, the method can operate at a higher speed. Following closely, SPVCNN [199] builds upon PVCNN and extends it to the domain of object detection. Other methods attempt improvements from different perspectives, such as auxiliary tasks [146, 163] or multiscale feature fusion [164, 165, 166].
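The voxelize, devoxelize, and fuse pattern described above can be sketched as follows. This is a deliberately reduced illustration and not PVCNN's implementation: the convolution over voxels is omitted, devoxelization copies the feature of the containing voxel instead of interpolating, and fusion is done by concatenation; all three are assumptions made for brevity, as is the function name.

```python
import numpy as np

def point_voxel_fuse(points, point_feats, voxel_size):
    """points: (N, 3); point_feats: (N, F). Returns fused (N, 2F) features."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_vox = int(inverse.max()) + 1

    # Voxelization: average the features of the points inside each voxel.
    sums = np.zeros((n_vox, point_feats.shape[1]))
    np.add.at(sums, inverse, point_feats)
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    voxel_feats = sums / counts
    # (A voxel-branch convolution would be applied to voxel_feats here.)

    # Devoxelization: map voxel features back to points (nearest voxel, simplified).
    devox = voxel_feats[inverse]
    # Fusion: per-point branch features alongside coarse voxel context.
    return np.concatenate([point_feats, devox], axis=1)

points = np.array([[0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.2],
                   [1.5, 0.1, 0.1]])
point_feats = np.array([[1.0], [3.0], [10.0]])
fused = point_voxel_fuse(points, point_feats, voxel_size=1.0)
```

Each point keeps its own fine-grained feature and gains the neighborhood context carried by its voxel.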

Late Fusion

The methods in this series predominantly adopt a two-stage detection framework. Initially, voxel-based methods are employed to generate preliminary object proposals. This is followed by a refinement phase, where point-level features are leveraged for the precise delineation of detection boxes. Shi et al. proposed PV-RCNN [371], a milestone in PV-based methods. It utilizes SECOND [109] as the first-stage detector and proposes a second refinement stage with RoI grid pooling for the fusion of keypoint features. Later works in this domain predominantly follow this paradigm, focusing on advancements in second-stage detection. Notable developments include attention mechanisms [167, 168, 139], scale-aware pooling [433], and point density-aware refinement modules [120].

PV-based methods simultaneously possess the computational efficiency of voxel-based approaches and the capability of point-based methods to capture fine-grained information. However, the construction of point-to-voxel or voxel-to-point relationships, along with the feature fusion of voxels and points, incurs additional computational overhead. Consequently, compared to voxel-based methods, PV-based methods can achieve better detection accuracy, but at the cost of increased inference time.

IV-E Analysis: Accuracy, Latency, Robustness

Figure 10: $AP_{\text{3D}}$ comparison of view-based [174, 176, 175, 173], voxel-based [109, 118, 373, 434, 115, 145, 102, 374, 372, 151, 134, 113, 147, 139, 120, 101, 127, 435, 182, 121], point-based [369, 111, 436, 110, 105, 370, 186], and point-voxel-based [251, 433, 165, 371, 168] methods on the KITTI moderate dataset.

In the autonomous driving sector, the development of LiDAR-based 3D object detection solutions is advancing rapidly. A series of works [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275] have comprehensively summarized the current technological roadmaps, such as the extensive review of LiDAR-based solutions by the Shanghai AI Lab and SenseTime Research [269]. However, there remains a lack of summarization and guidance from the perspective of safety perception and cost impact in autonomous driving. Therefore, in this section, following an analysis of the technological roadmaps and current state of LiDAR-based solutions, we intend to base our discussion on the fundamental principles of ‘Accuracy, Latency, and Robustness.’ This is aimed at providing guidance for the practical implementation of economically efficient and safe sensing in autonomous driving.

IV-E1 Accuracy

Following the discussion of camera-based methods in Section III, we investigate the core factors influencing the performance of LiDAR-based methods by selecting representative and cutting-edge methods from each category for a comparative performance analysis, as shown in Fig. 10. The comparison indicates that the performance of view-based methods is significantly lower than that of the other three categories. View-based approaches transform point clouds into pseudo-images and then process the data using 2D detectors. Although this is favorable for inference speed, it comes at the cost of sacrificing some 3D spatial information. As discussed in Section III, camera-based detection methods that can effectively recover 3D space often exhibit superior detection performance. For example, the detection performance of binocular methods surpasses that of monocular methods, and methods based on the BEV paradigm outperform others. Therefore, the effective representation of 3D spatial information is a primary factor impacting the performance of LiDAR-based methods.

In the development of LiDAR-based detection methods, point-based and PV-based approaches initially outperformed voxel-based methods. However, over time, methods represented by Voxel-RCNN [113], which utilize ROI pooling modules capable of effectively aggregating fine-grained information, have brought voxel-based approaches to comparable or superior performance. The distinction of the Voxel-RCNN approach from earlier voxel-based methods lies in its ROI pooling module, which effectively aggregates local fine-grained information, addressing the loss of detailed 3D spatial information in the voxelization process. As shown in Fig. 6(c) and (d), the comparison between single-stage [109, 118] and two-stage [113, 127] LiDAR methods indicates that the latter’s use of fine-grained data significantly enhances accuracy. This factor also contributed to the early advantage of point-based and PV-based methods over voxel-based approaches. However, recent voxel-based methods, focusing on extracting and utilizing detailed point cloud information, have narrowed this gap in precision. This phenomenon underscores the importance of fine-grained point cloud information in improving detection accuracy.

Additionally, it is important to note the accuracy differences between transformer-based methods and those based on CNNs or PointNet, as shown in Fig. 10. Due to both global and local self-attention mechanisms that capture contextual information from a large spatial range, transformers have an advantage in modeling long-range dependencies. The receptive field of 3D sparse CNNs is constrained by the size of the convolutional kernel and the spatial discontinuities of sparse features, leading to information truncation issues and thus inferior performance compared to transformers. Recent related works [130, 129] provide a good demonstration that when the receptive field of 3D convolutions is expanded, methods based on 3D sparse convolutions can achieve the performance of transformer-based methods. Therefore, effective global relationship modeling and avoiding information truncation are other key factors in enhancing the accuracy of LiDAR-based detection.

IV-E2 Latency

Section III highlights the importance of latency for safety and user experience in autonomous driving. While camera-based methods tend to outperform LiDAR-based methods in terms of inference speed, the latter still maintain a competitive edge due to their accurate 3D perception. We measured the FPS of significant LiDAR-based approaches on an A100 graphics card and evaluated their performance using the AP and NDS metrics reported in the original papers. Fig. 6 indicates that view-based methods excel in model latency due to the reduction in point cloud dimensions and the efficiency of 2D CNNs. Voxel-based methods achieve exceptional inference speed due to the use of structured voxel data and well-optimized 3D sparse convolutions. However, point-based methods face challenges in applying efficient operators during the data preprocessing and feature extraction stages due to the irregular representation of point clouds. Point-GNN [110] is an extreme example of this, with model latency nearly ten times that of contemporary voxel algorithms.

To conclude, optimizations that improve detection performance typically compromise inference speed in autonomous driving models. Achieving a balance between accuracy and speed is an evolving challenge in this field. Future studies should prioritize improving accuracy and FPS while reducing latency, in order to meet the urgent requirements of real-time response and safety in autonomous driving.

IV-E3 Robustness

Previous comprehensive reviews have not focused significantly on the topic of robustness. Presently, research works [253, 254, 388, 287, 288, 389, 390] such as RoboBEV [253] and Robo3D [288] incorporate robustness considerations into 3D object detection, exemplified by factors such as sensor misses. Robo-LiDAR [286] represents the first comprehensive exploration solely dedicated to the robustness of LiDAR-based methods. In a manner akin to BR3D [254], this method evaluates robustness by integrating disturbances into datasets pertinent to 3D object detection, such as KITTI [255]. The method proposes a variety of noise types and 25 typical degradations associated with object and scene-level natural weather conditions, noise interferences, density variations, and object transformations. In this section, we combine the work of Ref. [254] and Robo-LiDAR [286] with the aim of systematically analyzing the robustness of LiDAR-based methods. As shown in Table IV, LiDAR-based methods generally exhibit higher robustness to noise compared to camera-based methods. It is important to note the performance differences between LiDAR-based methods and multimodal methods in noisy environments. In multimodal methods [227, 232, 230, 221, 228, 247, 237], the complementary interplay of data types becomes evident when disturbances are limited to LiDAR sensor data. In such scenarios, image data can partially mitigate the impact on point cloud integrity, consequently elevating the performance of fusion methods above that of methods relying solely on LiDAR. However, when disturbances affect both image and point cloud data concurrently, the efficacy of most multimodal methods significantly diminishes.

As shown in Table VI, under various noise conditions, LiDAR-based methods experience varying degrees of accuracy decline, with the most significant reduction observed in extreme weather scenarios. These results indicate an urgent need in the field of autonomous driving to address the robustness of point cloud detectors. For most types of corruption, voxel-based methods generally exhibit greater robustness than point-based methods. A plausible explanation is that voxelization, through the spatial quantization of a group of adjacent points, mitigates the local randomness and the disruption of spatial information caused by noise and density degradation. Specifically, for severe corruptions (e.g., shear and FFD in the Transformation group), the point-voxel-based method [371] exhibits greater robustness. PointRCNN [111] does not show the highest robustness against any type of corruption, highlighting potential limitations inherent in point-based methods.

Compared to single-stage detectors [109, 143, 127], two-stage detectors [125, 111, 371, 142, 127] exhibit poorer robustness to common degradations, as indicated by their higher $\text{mCE}_{\text{AP}}$ scores in Table VII. One possible cause is that corrupted data degrades proposal generation in stage 1, and the resulting low-quality proposals significantly affect the bounding-box regression of stage 2. In summary, single-stage detectors are more robust to scene-level noise and to object-level noise and density corruptions, while two-stage detectors are mainly more robust to weather and scene-level density degradations. As for Transformation corruptions, single-stage detectors are more robust to FFD, rotation, and translation, whereas two-stage detectors perform better under shear and scale.

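TABLE VI: Comparison of LiDAR-based detectors on the corrupted validation sets of KITTI from Ref. [286] on Car detection with $\text{CE}_{\text{AP}}(\%)$. $\text{CE}_{\text{AP}}(\%)$ denotes the Corruption Error from Ref. [286]. The best one is highlighted in bold.

| Level | Type | Corruption | PVRCNN-voxel | PointRCNN | SECOND | BtcDet | VoTr-SSD | VoTr-TSD | SE-SSD | CenterPoint | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Representation) | | | Point-voxel | Point | Voxel | Voxel | Voxel | Voxel | Voxel | Voxel | |
| Scene-level | Weather | rain | 25.11 | 23.31 | 21.81 | 31.07 | 28.17 | 26.77 | 29.51 | 25.83 | 26.45 |
| Scene-level | Weather | snow | 44.23 | 37.74 | 34.84 | 54.07 | 54.10 | 52.18 | 49.19 | 38.74 | 45.64 |
| Scene-level | Weather | fog | 1.59 | 3.52 | 1.60 | 1.81 | 1.77 | 2.02 | 1.59 | 1.11 | 1.88 |
| Scene-level | Noise | uniform_rad | 10.19 | 8.32 | 9.51 | 9.13 | 3.79 | 4.11 | 9.34 | 8.15 | 7.82 |
| Scene-level | Noise | gaussian_rad | 13.02 | 9.98 | 12.13 | 10.83 | 4.84 | 5.18 | 11.02 | 10.17 | 9.65 |
| Scene-level | Noise | impulse_rad | 2.20 | 3.86 | 2.23 | 2.50 | 2.25 | 3.57 | 1.18 | 1.8 | 2.46 |
| Scene-level | Noise | background | 2.93 | 6.49 | 2.41 | 1.82 | 4.59 | 3.6 | 2.14 | 1.86 | 2.46 |
| Scene-level | Noise | upsample | 0.81 | 1.84 | 0.31 | 0.95 | 0.37 | 0.71 | 0.55 | 0.46 | 0.75 |
| Scene-level | Density | cutout | 3.75 | 3.97 | 4.27 | 3.99 | 4.51 | 3.59 | 4.26 | 4.11 | 4.0 |
| Scene-level | Density | local_dec | 14.04 | - | 13.88 | 14.55 | 14.44 | 12.50 | 17.01 | 14.64 | 14.44 |
| Scene-level | Density | local_inc | 1.40 | 3.34 | 1.33 | 2.20 | 1.66 | 1.69 | 0.90 | 0.95 | 1.68 |
| Scene-level | Density | beam_del | 0.58 | 0.79 | 0.73 | 0.88 | 0.80 | 0.53 | 1.07 | 0.47 | 0.73 |
| Scene-level | Density | layer_del | 2.94 | 3.46 | 3.10 | 3.39 | 3.29 | 3.16 | 3.37 | 2.67 | 3.17 |
| Object-level | Noise | uniform | 15.44 | 12.95 | 9.48 | 12.6 | 2.76 | 4.81 | 6.99 | 6.51 | 8.94 |
| Object-level | Noise | gaussian | 20.48 | 17.62 | 12.98 | 17.05 | 4.72 | 7.46 | 9.56 | 9.49 | 12.42 |
| Object-level | Noise | impulse | 3.3 | 4.7 | 2.53 | 4.07 | 2.88 | 4.29 | 2.2 | 2.11 | 3.26 |
| Object-level | Noise | upsample | 1.12 | 1.95 | 0.67 | 1.33 | 0.08 | 0.4 | 0.22 | 0.16 | 0.74 |
| Object-level | Density | cutout | 15.81 | 15.62 | 14.99 | 15.62 | 15.07 | 16.09 | 16.51 | 14.06 | 15.47 |
| Object-level | Density | local_dec | 14.38 | 14.16 | 13.23 | 14.26 | 12.66 | 14.41 | 15.08 | 12.52 | 13.84 |
| Object-level | Density | local_inc | 13.93 | 14.19 | 13.74 | 13.56 | 11.34 | 13.05 | 11.03 | 11.64 | 12.81 |
| Object-level | Transformation | shear | 37.27 | 40.96 | 40.35 | 41.37 | 39.52 | 37.85 | 40.35 | 40.0 | 39.71 |
| Object-level | Transformation | FFD | 32.42 | 38.88 | 33.15 | 36.77 | 33.14 | 34.26 | 37.96 | 32.86 | 34.93 |
| Object-level | Transformation | rotation | 0.60 | 0.47 | 0.31 | 0.97 | 0.39 | 0.75 | 0.27 | 0.38 | 0.52 |
| Object-level | Transformation | scale | 5.78 | 8.13 | 6.96 | 5.81 | 8.53 | 6.50 | 6.53 | 7.50 | 6.97 |
| Object-level | Transformation | translation | 3.82 | 3.03 | 3.24 | 4.58 | 4.88 | 5.34 | 1.37 | 3.91 | 3.77 |
| mCE | | | 11.49 | 11.64 | 10.39 | 12.21 | 10.42 | 10.60 | 11.17 | 10.09 | 11.01 |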
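
TABLE VII: $\text{CE}_{\text{AP}}(\%)$ of LiDAR-based detectors with different proposal architectures on the corrupted validation sets of KITTI from Ref. [286] on Car detection. $\text{CE}_{\text{AP}}(\%)$ denotes the Corruption Error from Ref. [286]. The best one is highlighted in red (One-stage [109, 143]) or blue (Two-stage [110, 371, 127, 125]).

| Level | Type | Corruption | One-stage | Two-stage |
| --- | --- | --- | --- | --- |
| Scene-level | Weather | rain | 26.50 | 26.42 |
| Scene-level | Weather | snow | 46.04 | 45.39 |
| Scene-level | Weather | fog | 2.01 | - |
| Scene-level | Noise | uniform_rad | 7.55 | 7.98 |
| Scene-level | Noise | gaussian_rad | 9.33 | 9.84 |
| Scene-level | Noise | impulse_rad | 1.30 | 1.02 |
| Scene-level | Noise | background | 3.05 | 3.38 |
| Scene-level | Noise | upsample | 0.41 | 0.95 |
| Scene-level | Density | cutout | 4.35 | 3.88 |
| Scene-level | Density | local_dec | 15.12 | 13.93 |
| Scene-level | Density | local_inc | 1.30 | 1.92 |
| Scene-level | Density | beam_del | 0.87 | 0.65 |
| Scene-level | Density | layer_del | 3.25 | 3.12 |
| Object-level | Noise | uniform | 6.41 | 10.46 |
| Object-level | Noise | gaussian | 9.09 | 14.42 |
| Object-level | Noise | impulse | 2.54 | 3.69 |
| Object-level | Noise | upsample | 0.32 | 0.99 |
| Object-level | Density | cutout | 15.52 | 15.44 |
| Object-level | Density | local_dec | 13.66 | 13.95 |
| Object-level | Density | local_inc | 12.04 | 13.27 |
| Object-level | Transformation | shear | 40.07 | 39.49 |
| Object-level | Transformation | FFD | 34.75 | 35.04 |
| Object-level | Transformation | rotation | 0.32 | 0.63 |
| Object-level | Transformation | scale | 7.34 | 6.74 |
| Object-level | Transformation | translation | 3.16 | 4.14 |
| mCE | | | 10.66 | 11.21 |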
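
The aggregate row and the group-level comparison between single-stage and two-stage detectors can be checked directly from the per-corruption entries of Table VII. The following sketch assumes that mCE is the unweighted mean of the per-corruption $\text{CE}_{\text{AP}}$ values in a column; this is an assumption, although it reproduces the reported one-stage value up to rounding. The dictionary layout and the function names are illustrative only, and the missing two-stage fog entry is stored as `None` and skipped.

```python
from statistics import mean

# CE_AP (%) per corruption as (one_stage, two_stage), copied from Table VII.
# None marks the entry that is missing in the table.
TABLE_VII = {
    "scene/weather": {"rain": (26.50, 26.42), "snow": (46.04, 45.39),
                      "fog": (2.01, None)},
    "scene/noise": {"uniform_rad": (7.55, 7.98), "gaussian_rad": (9.33, 9.84),
                    "impulse_rad": (1.30, 1.02), "background": (3.05, 3.38),
                    "upsample": (0.41, 0.95)},
    "scene/density": {"cutout": (4.35, 3.88), "local_dec": (15.12, 13.93),
                      "local_inc": (1.30, 1.92), "beam_del": (0.87, 0.65),
                      "layer_del": (3.25, 3.12)},
    "object/noise": {"uniform": (6.41, 10.46), "gaussian": (9.09, 14.42),
                     "impulse": (2.54, 3.69), "upsample": (0.32, 0.99)},
    "object/density": {"cutout": (15.52, 15.44), "local_dec": (13.66, 13.95),
                       "local_inc": (12.04, 13.27)},
    "transformation": {"shear": (40.07, 39.49), "FFD": (34.75, 35.04),
                       "rotation": (0.32, 0.63), "scale": (7.34, 6.74),
                       "translation": (3.16, 4.14)},
}


def mean_corruption_error(table, column):
    """Unweighted mean of one column (0 = one-stage, 1 = two-stage), skipping gaps."""
    values = [row[column]
              for group in table.values()
              for row in group.values()
              if row[column] is not None]
    return mean(values)


def more_robust_per_group(table):
    """For each corruption group, name the design with the lower mean CE_AP.

    Only corruptions reported for both designs are compared.
    """
    winners = {}
    for name, group in table.items():
        paired = [row for row in group.values() if None not in row]
        one_stage = mean(row[0] for row in paired)
        two_stage = mean(row[1] for row in paired)
        winners[name] = "one-stage" if one_stage < two_stage else "two-stage"
    return winners


print(round(mean_corruption_error(TABLE_VII, 0), 2))
print(more_robust_per_group(TABLE_VII))
```

Under this assumption the one-stage column averages to roughly 10.65, in line with the reported 10.66. The group-level comparison favors two-stage detectors for weather and scene-level density and single-stage detectors for the remaining groups.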

V Multimodal 3D Object Detection

Multimodal 3D object detection refers to the technique of extracting features from different sensors and integrating them so that they complement one another, thus enabling the detection of 3D objects. As shown in Fig. 11, these methods particularly emphasize the combination of image data and point cloud data. Image data is rich in semantic features, such as color and texture, but often lacks depth information. In contrast, point cloud data provides depth information and geometric structure, which is crucial for accurately perceiving and interpreting the 3D characteristics of a scene. Since a single type of sensor cannot fully and accurately perceive the 3D environment, multimodal 3D object detection acquires features with rich semantic information by fusing various types of data.

Refer to caption
Figure 11: Projection for feature fusion vs. Non-Projection for feature fusion.

In the field of autonomous driving, there are a variety of fusion methods for multimodal 3D object detection. Previous reviews [1, 344, 345, 346, 347, 242, 2, 348, 349, 350, 269, 252, 270, 271, 272, 273, 274, 275] have mostly classified these methods based on different stages of fusion (early, middle, late), but this classification is overly simplistic and does not fully consider the special requirements of autonomous driving. Given the fundamental differences between the two heterogeneous modalities of point clouds and images, the alignment step in multimodal fusion is particularly critical. It ensures the consistency and accuracy of information from different sensors and data sources during the fusion process. In autonomous driving, the key to achieving feature alignment lies in whether to use a calibration matrix (also known as a projection matrix). However, the inherent error of the calibration matrix, being a type of prior knowledge, poses a challenge. Some works like [437, 236] avoid using the projection matrix and reduce projection errors by adopting learning methods.

Therefore, based on how features are aligned, we categorize multimodal 3D object detection methods into (1) projection-based feature alignment and (2) model-based feature alignment. This taxonomy is more fine-grained and better reflects the characteristics and progress of multimodal 3D object detection methods in the field of autonomous driving.

V-A Projection-based 3D object detection

Refer to caption
Figure 12: Projection-based 3D object detection: (a), (b), (c), (d). Non-Projection-based 3D object detection: (e), (f). (a) Point-Projection-based methods [222, 383, 438, 439, 440, 441, 442, 226, 227, 376, 443] integrate image features with the raw point cloud, which then enters the LiDAR-based pipeline. (b) Feature-Projection-based methods [444, 250, 232, 445, 446, 447, 448, 449, 377] merge image features with point cloud features, which then enter the LiDAR-based pipeline. (c) Auto-Projection-based methods [369, 106, 233, 234, 384, 378, 243, 247, 450] employ solutions such as Cross Deformable Attention or Graph Match to enhance feature fusion through neighbor augmentation or offset learning on top of direct projection. (d) Decision-Projection-based methods [451, 452, 453, 380, 381, 196, 454, 382, 107] perform projection and fuse RoIs or detection results. (e) Query-Learning-based methods [230, 236, 237, 437, 245, 455] achieve feature fusion by querying image features with point cloud features through cross-attention, without the utilization of projection matrices. (f) Unified-Feature-based methods [385, 231, 386, 375, 228, 356, 137, 221, 387, 235, 225, 136, 223, 240] unify heterogeneous modalities into a common modality. Projection matrices are commonly employed during the modality unification process, while they are not required during fusion. These methods yield highly robust multimodal features and stand as the state-of-the-art solution for 3D object detection.

Projection-based 3D object detection refers to the use of a projection matrix during the feature fusion stage to integrate point cloud and image features. It is important to clarify that the focus here is on projection during feature fusion, rather than projection at other stages, such as that needed for data augmentation. As shown in Fig. 12, we further classify Projection-based 3D object detection according to the type of projection used in the fusion stage into Point-Projection-based [222, 383, 438, 439, 440, 441, 442, 226, 227, 376, 443], Feature-Projection-based [444, 250, 232, 445, 446, 447, 448, 449, 377], Auto-Projection-based [369, 106, 233, 234, 384, 378, 243, 247, 450], and Decision-Projection-based methods [451, 452, 453, 380, 381, 196, 454, 382, 107].

V-A1 Point-Projection-based 3D object detection

Point-Projection-based 3D object detection methods [222, 383, 438, 439, 440, 441, 442, 443] project image features onto the raw point cloud to enhance the representational capability of the original point cloud data. The initial step in these methods is to establish a strong correlation between LiDAR points and image pixels, which is achieved using calibration matrices. Following this, the point cloud features are augmented with additional data from the correlated pixels, either segmentation scores [222, 438, 456] or CNN features [376, 443, 226, 227, 383]. PointPainting [222] and PointAugmenting [383] represent advancements in multimodal 3D object detection methods by enhancing the traditional cut-and-paste augmentation. These techniques aim to seamlessly integrate data from different domains, such as point clouds and 2D imagery, while carefully managing potential overlaps or collisions between objects in both domains. PointPainting enhances LiDAR points by appending segmentation scores. However, it has limitations in effectively capturing the color and texture details present in images. To address these shortcomings, more sophisticated approaches like FusionPainting [456] have been developed, following a similar paradigm. MVP [443] builds upon the concept of PointPainting [222]. It first applies image instance segmentation and aligns the segmentation masks with the point cloud using a projection matrix. The key distinction of MVP lies in its approach to sampling: it randomly selects pixels within each segmentation mask, ensuring consistency with the points in the point cloud. These selected pixels are then linked to their nearest neighbors in the point cloud, and each pixel is assigned the depth of its linked LiDAR point. Subsequently, these points are projected back to the LiDAR coordinate system, resulting in the generation of virtual LiDAR points.
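
To make the point-decoration step concrete, the sketch below appends per-pixel segmentation scores to LiDAR points in the spirit of PointPainting. It is a simplified illustration rather than the published implementation: the function name `paint_points`, the nested-list score map, and the assumption that pixel coordinates have already been obtained through the calibration matrix are all hypothetical choices made for this example.

```python
import math


def paint_points(points, pixels, seg_scores):
    """Append per-class segmentation scores to each LiDAR point.

    points:     list of (x, y, z, intensity) tuples
    pixels:     list of (u, v) projections, or None if a point has no projection
    seg_scores: H x W nested list; seg_scores[row][col] is a list of class scores
    Points that fall outside the image receive all-zero scores.
    """
    height, width = len(seg_scores), len(seg_scores[0])
    num_classes = len(seg_scores[0][0])
    painted = []
    for point, pixel in zip(points, pixels):
        scores = [0.0] * num_classes
        if pixel is not None:
            # u indexes columns, v indexes rows; floor picks the containing pixel.
            col, row = math.floor(pixel[0]), math.floor(pixel[1])
            if 0 <= row < height and 0 <= col < width:
                scores = list(seg_scores[row][col])
        painted.append(tuple(point) + tuple(scores))
    return painted


# Toy 2 x 2 score map with two classes per pixel.
seg = [[[0.9, 0.1], [0.2, 0.8]],
       [[0.5, 0.5], [0.3, 0.7]]]
pts = [(1.0, 2.0, 3.0, 0.5), (4.0, 5.0, 6.0, 0.1)]
print(paint_points(pts, [(1.2, 0.7), None], seg))
```

The decorated points keep their original four channels and gain one channel per class, so a LiDAR-based detector can consume them after its input dimension is widened accordingly.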

V-A2 Feature-Projection-based 3D object detection

In contrast to Point-Projection-based methods, Feature-Projection-based 3D object detection methods [246, 250, 232, 445, 446, 129, 241, 449, 377] primarily focus on fusing point cloud features with image features during the feature extraction phase of point clouds. During this fusion process, point cloud features are projected onto corresponding image features, and subsequently, these image and point cloud features are integrated together. This process is achieved by applying a calibration matrix to transform the voxel’s three-dimensional coordinate system into the pixel coordinate system of the image, thereby facilitating the effective fusion of point cloud and image modalities. Specifically, the projection of a three-dimensional point cloud onto the image plane can be articulated as follows:

$$z_{c}\left[\begin{array}{c}u\\ v\\ 1\end{array}\right]=h\,\mathcal{K}\left[\begin{array}{ll}R&T\end{array}\right]\left[\begin{array}{c}P_{x}\\ P_{y}\\ P_{z}\\ 1\end{array}\right], \tag{6}$$

where $P_{x}$, $P_{y}$, and $P_{z}$ represent the three-dimensional spatial coordinates of the LiDAR points, while $u$ and $v$ denote the corresponding two-dimensional coordinates. The term $z_{c}$ indicates the depth of the point’s projection on the image plane. Additionally, $\mathcal{K}$ represents the intrinsic parameters of the camera, and $R$ and $T$ signify the rotation and translation of the LiDAR relative to the camera’s reference frame, respectively. The factor $h$ accounts for the scale change due to downsampling.
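
As a rough illustration of Eq. (6), the sketch below maps LiDAR points to pixel coordinates in plain Python. The function name, the toy calibration values, and the identity extrinsics are assumptions made for the example. In addition, $h$ is applied to $u$ and $v$ only; this is one possible reading of the downsampling factor, chosen here because a scalar multiplying the whole right-hand side would cancel in the homogeneous division.

```python
def project_lidar_to_image(points, K, R, T, h=1.0):
    """Map LiDAR points to (feature-map) pixel coordinates, following Eq. (6).

    points: iterable of (Px, Py, Pz) in the LiDAR frame
    K:      3x3 camera intrinsics (nested lists)
    R, T:   3x3 rotation and length-3 translation, LiDAR -> camera
    h:      downsampling scale, assumed here to rescale u and v only
    Returns a list of (u, v, z_c); points behind the camera map to None.
    """
    results = []
    for p in points:
        # LiDAR frame -> camera frame: X_cam = R @ P + T
        cam = [sum(R[i][j] * p[j] for j in range(3)) + T[i] for i in range(3)]
        z_c = cam[2]
        if z_c <= 0:
            results.append(None)
            continue
        # Camera frame -> homogeneous pixel coordinates: K @ X_cam
        hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
        u = h * hom[0] / hom[2]
        v = h * hom[1] / hom[2]
        results.append((u, v, z_c))
    return results


# Toy calibration: LiDAR axes assumed to coincide with the camera axes.
K = [[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = [0.0, 0.0, 0.0]
print(project_lidar_to_image([(1.0, 0.5, 10.0), (0.0, 0.0, -5.0)], K, R, T))
print(project_lidar_to_image([(1.0, 0.5, 10.0)], K, R, T, h=0.25))
```

With real data, `K`, `R`, and `T` would come from the dataset calibration files, and the returned depth `z_c` can be used to discard points that lie behind the image plane.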

A quintessential example of the Feature-Projection-based method, ContFuse [246], employs continuous convolution to fuse multi-scale convolutional feature maps from each sensor. Within this technique, the projection of the point cloud establishes the correspondence between the image and the Bird’s Eye View (BEV) space. In essence, Feature-Projection-based fusion is performed during the point cloud feature extraction phase. Compared to Point-Projection-based methods, these methods do not fuse at the level of the original point cloud but achieve a deeper, feature-level fusion, resulting in more robust performance.

V-A3 Auto-Projection-based 3D object detection

Refer to caption
Figure 13: Examples of misalignment between point clouds and images.

As shown in Fig. 13, a partial image from the KITTI [255] dataset illustrates that projection inaccuracies persist even in this classic, clean dataset. Consequently, projection errors cannot be completely eliminated by manual calibration; they can only be mitigated. This is a frequent challenge in practical dataset deployments. Many studies, including Point-Projection-based and Feature-Projection-based methods, perform fusion through direct projection without addressing the projection error issue. A few works [106, 233, 234, 384, 243, 378, 247, 450] have sought to mitigate these errors through approaches such as projection offsets and neighboring projections. For instance, Deformable Cross Attention [354] has been employed to learn offsets around the already projected locations. We have systematically reviewed and synthesized methods that tackle projection errors, designating them as Auto-Projection-based 3D object detection methods. As representative works addressing feature alignment, HMFI [243], GraphAlign [233], and GraphAlign++ [234] use the prior knowledge of the projection calibration matrix to project onto the corresponding image and then perform local graph modeling. This approach models the intermodal relationships, enabling multimodal 3D object detectors to identify more appropriate alignment relationships and thereby achieve faster and more accurate feature alignment between modalities. AutoAlignV2 [384] focuses on sparse learnable sampling points for cross-modal relational modeling, enhancing calibration error tolerance and significantly accelerating feature aggregation across different modalities. In summary, Auto-Projection-based 3D object detection methods mitigate errors arising from feature alignment by leveraging neighbor relationships or neighbor offsets, thereby enhancing robustness in multimodal 3D object detection.
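
The idea of tolerating small projection errors by looking at neighbors of the projected pixel can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of GraphAlign, HMFI, or AutoAlignV2: the function name, the square neighborhood, and the use of a dot-product softmax as the neighbor weighting are all assumptions made for the example.

```python
import math


def neighbor_aligned_feature(point_feat, image_feats, u, v, radius=1):
    """Aggregate image features around the projected pixel (u, v).

    Each neighbor inside the (2 * radius + 1)^2 window is weighted by a
    softmax over its dot-product similarity to the point feature, so a
    slightly misplaced projection can still pick up the best-matching pixel.
    image_feats is an H x W nested list of feature vectors; u is the column
    index and v the row index. Returns None if the window is outside the image.
    """
    height, width = len(image_feats), len(image_feats[0])
    candidates = []
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            row, col = v + dv, u + du
            if 0 <= row < height and 0 <= col < width:
                candidates.append(image_feats[row][col])
    if not candidates:
        return None
    scores = [sum(a * b for a, b in zip(point_feat, feat)) for feat in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(point_feat)
    return [sum(w * feat[i] for w, feat in zip(weights, candidates)) / total
            for i in range(dim)]


# The best-matching image feature sits one pixel to the right of the projection.
grid = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
grid[1][2] = [10.0, 0.0]
print(neighbor_aligned_feature([10.0, 0.0], grid, u=1, v=1))
```

In this toy case the aggregated feature is dominated by the neighbor at column 2, even though the nominal projection lands on column 1.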

V-A4 Decision-Projection-based 3D object detection

Decision-Projection-based 3D object detection methods [451, 452, 453, 380, 381, 196, 454, 382, 107], as early implementations of multimodal 3D object detection schemes, use projection matrices to align features within Regions of Interest (RoIs) or specific detection results. Graph-RCNN [107] projects each graph node to its location in the camera image and collects the feature vector at that pixel through bilinear interpolation. F-PointNet [381] first performs 2D detection on the image to determine the class and location of each object; for each detected object, the corresponding points in 3D space are then obtained through the transformation matrix between the calibrated sensor parameters and 3D space. MV3D [453] transforms the LiDAR point cloud into Bird’s Eye View (BEV) and Front View (FV) projections for generating proposals. During this process, a specialized 3D proposal network is used to create precise 3D candidate boxes. These 3D proposals are then projected onto feature maps from multiple perspectives to facilitate feature alignment between the two modalities. Differing from MV3D [453], AVOD [452] streamlines this approach by omitting the FV component and introducing a more refined region proposal mechanism. In summary, Decision-Projection-based 3D object detection methods primarily achieve feature fusion at a high level through projection, with limited interaction between heterogeneous modalities. This often leads to the alignment and fusion of erroneous features, reducing accuracy and robustness.
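
The frustum-style association between a 2D detection and the point cloud can be sketched in a few lines. The example below is a minimal illustration of the general idea rather than the F-PointNet implementation; the function name and the assumption that each point already carries a pre-computed image projection are hypothetical.

```python
def points_in_frustum(points, pixels, box):
    """Keep the LiDAR points whose image projection lies inside a 2D box.

    points: list of 3D points (any tuple layout)
    pixels: list of (u, v) projections, or None for points without one
    box:    2D detection as (u_min, v_min, u_max, v_max), bounds inclusive
    """
    u_min, v_min, u_max, v_max = box
    selected = []
    for point, pixel in zip(points, pixels):
        if pixel is None:
            continue  # e.g. the point lies behind the camera
        u, v = pixel
        if u_min <= u <= u_max and v_min <= v <= v_max:
            selected.append(point)
    return selected


pts = [(5.0, 1.0, 0.2), (7.5, -2.0, 0.1), (3.0, 0.0, 0.0)]
pix = [(320.0, 200.0), (900.0, 210.0), None]
print(points_in_frustum(pts, pix, box=(300.0, 150.0, 400.0, 260.0)))
```

The selected subset would then be passed to a 3D network that estimates the box; because the association depends entirely on the 2D detection and the calibration, errors in either propagate directly to the 3D stage.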

V-B Non-Projection-based 3D object detection

Non-Projection-based 3D object detection methods achieve fusion without relying on projection for feature alignment, thereby yielding robust feature representations. They circumvent the limitations of camera-to-LiDAR projection, which often reduces the semantic density of camera features and impacts the effectiveness of techniques like Focals Conv [232] and PointPainting [222]. Non-Projection-based methods typically employ cross-attention mechanisms or construct a unified space to address the inherent misalignment issues in direct feature projection. These methods are primarily divided into two categories: (1) Query-Learning-based [230, 236, 237, 437, 245, 455] and (2) Unified-Feature-based [385, 231, 386, 375, 228, 356, 137, 221, 387, 235, 225, 136, 223]. Query-Learning-based methods remove the need for projection-based alignment during the fusion process altogether. Conversely, Unified-Feature-based methods, though constructing a unified feature space, do not completely avoid projection; it usually occurs within a single modality. For example, BEVFusion [231] utilizes LSS [316] for camera-to-BEV projection. Because this projection takes place before fusion, the approach demonstrates considerable robustness in scenarios with feature misalignment.

V-B1 Query-Learning-based 3D object detection

Query-Learning-based 3D object detection methods, as exemplified by works such as [230, 236, 237, 437, 245, 455], eschew the necessity for projection within the feature fusion process. Instead, they attain feature alignment through cross-attention mechanisms before engaging in the fusion of features. Point cloud features are typically employed as queries, while image features serve as keys and values, facilitating a global feature query to acquire highly robust multimodal features. Furthermore, DeepInteraction [237] incorporates multimodality interaction, wherein point cloud and image features are utilized as distinct queries to enable further feature interaction. In comparison to the exclusive use of point cloud features as queries, the comprehensive incorporation of image features leads to the acquisition of more resilient multimodal features. Overall, Query-Learning-based 3D object detection methods employ a transformer-based structure for feature querying to achieve feature alignment. Ultimately, the multimodal features are integrated into LiDAR-based pipelines, such as CenterPoint[125].
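
The query mechanism described above corresponds to standard scaled dot-product cross-attention. The following sketch shows a single attention head in plain Python, with point cloud features as queries and image features as keys and values. It omits the learned projection matrices, multiple heads, and positional encodings that a real transformer layer would include, and the function name is an assumption for this example.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned weights.

    queries: list of point cloud feature vectors (dimension d)
    keys:    list of image feature vectors (dimension d)
    values:  list of image feature vectors (any dimension), one per key
    Returns one attended image feature per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w * val[i] for w, val in zip(weights, values)) / total
                        for i in range(dim)])
    return outputs


point_queries = [[0.0, 0.0], [4.0, 0.0]]
image_keys = [[1.0, 0.0], [0.0, 1.0]]
image_values = [[1.0, 0.0], [3.0, 4.0]]
print(cross_attention(point_queries, image_keys, image_values))
```

A zero query attends uniformly and returns the mean of the values, whereas the second query leans toward the first image feature. No calibration matrix appears anywhere in the computation, which is the property that distinguishes this family from projection-based fusion.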

V-B2 Unified-Feature-based 3D object detection

Unified-Feature-based 3D object detection methods, represented by works such as [385, 231, 386, 375, 228, 356, 137, 221, 387, 235, 225, 136, 223], generally employ projection before feature fusion, unifying the heterogeneous modalities prior to fusion. In the BEV fusion series, which utilizes LSS for depth estimation [385, 231, 386, 223], the front-view features are transformed into BEV features, followed by the fusion of BEV image and BEV point cloud features. Alternatively, CMT [225] and UniTR [356] employ transformers to tokenize point clouds and images, constructing an implicit unified space through transformer encoding. CMT [225] utilizes projection in the position encoding process, but entirely avoids dependency on projection relations at the feature learning level. FocalFormer3D [375], FUTR3D [228], and UVTR [136] leverage transformer queries to implement schemes similar to DETR3D [88], constructing a unified sparse BEV feature space through queries and thus mitigating the instability introduced by direct projection. VirConv [221], MSMDFusion [387], and SFD [235] construct a unified space through pseudo-point clouds, with the projection occurring before feature learning. The issues introduced by direct projection are addressed through subsequent feature learning. In summary, Unified-Feature-based 3D object detection methods [385, 231, 386, 375, 228, 356, 137, 221, 387, 235, 225, 136, 223] currently represent high-precision and robust solutions. Although they incorporate projection matrices, such projection does not occur during multimodal fusion, which is why they are classified as Non-Projection-based 3D object detection methods. Unlike Auto-Projection-based 3D object detection approaches, they do not directly address projection errors but instead construct a unified space, considering multiple dimensions for multimodal 3D object detection and thereby obtaining highly robust multimodal features.
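
For the BEV fusion series, two of the steps mentioned above can be illustrated with short helper functions. The sketch assumes an LSS-style "lift", in which each image feature is spread over discrete depth bins by an outer product with a predicted depth distribution, and a simple channel concatenation of camera and LiDAR features on a shared BEV grid. The function names are hypothetical, the geometric "splat" step that pools lifted features into BEV cells is omitted, and concatenation is only one of several possible fusion operators.

```python
def lift_pixel(feature, depth_probs):
    """Outer product of a depth distribution and an image feature.

    Returns one copy of the feature per depth bin, scaled by the probability
    that the pixel lies in that bin.
    """
    return [[p * f for f in feature] for p in depth_probs]


def fuse_bev(camera_bev, lidar_bev):
    """Concatenate camera and LiDAR channels cell by cell on a shared BEV grid.

    Both inputs are rows x cols nested lists of feature vectors and must
    cover the same grid.
    """
    if len(camera_bev) != len(lidar_bev) or any(
            len(c_row) != len(l_row) for c_row, l_row in zip(camera_bev, lidar_bev)):
        raise ValueError("camera and LiDAR BEV grids must have the same shape")
    return [[cam + lid for cam, lid in zip(c_row, l_row)]
            for c_row, l_row in zip(camera_bev, lidar_bev)]


print(lift_pixel([2.0, 4.0], [0.25, 0.75]))
print(fuse_bev([[[1.0], [0.5]]], [[[2.0, 3.0], [0.0, 1.0]]]))
```

Because both modalities are expressed on the same grid before they meet, the fusion step itself needs no camera-to-LiDAR projection, which matches the distinction drawn in the text.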

V-C Analysis: Accuracy, Latency, Robustness

Refer to caption
Figure 14: Projection-based 3D object detection: P.P. (Point-Projection-based), F.P. (Feature-Projection-based), A.P. (Auto-Projection-based), D.P. (Decision-Projection-based). Non-Projection-based 3D object detection: Q.L. (Query-Learning-based), U.F. (Unified-Feature-based). (a) $AP_{\text{3D}}$ comparison of Point-Projection-based [222, 376, 226, 227], Feature-Projection-based [377, 232, 241], Auto-Projection-based [369, 106, 243, 450, 378, 233, 247], Decision-Projection-based [453, 452, 454, 196, 380, 382, 379], and Query-Learning-based [245] methods on the KITTI test moderate dataset. (b) The mAP (left) and NDS (right) comparison of Point-Projection-based [383], Feature-Projection-based [129], Auto-Projection-based [384, 233], Query-Learning-based [437, 230, 237], and Unified-Feature-based [136, 228, 223, 231, 387, 225, 356, 386, 375, 385] methods on the nuScenes test dataset.

In the preceding Sections III-D, IV-E, we have conducted a comprehensive analysis of ‘Accuracy, Latency, Robustness’ for camera-only and LiDAR-only approaches. Subsequently, we extend our examination to multimodal 3D object detection methods, employing a similar analytical framework.

V-C1 Accuracy

As shown in Fig. 14, we conducted comparative evaluations on both the KITTI test and nuScenes test datasets. Most Projection-based 3D object detection methods have been evaluated on the KITTI dataset, with only a minority extending their evaluation to nuScenes. As shown in Fig. 14 (a), Feature-Projection-based and Auto-Projection-based methods exhibit superior overall performance, while Decision-Projection-based methods, most of which predate 2020, tend to show relatively lower Average Precision (AP). Only a few Non-Projection-based 3D object detection methods, such as CAT-Det, have been evaluated on the KITTI dataset. As shown in Fig. 14 (b), the latest methods predominantly belong to the Unified-Feature-based category, underscoring the suitability of the panoramic cameras offered by nuScenes for modality-unifying strategies like BEVFusion [231]. Overall, Non-Projection-based methods present more effective solutions in terms of accuracy metrics (e.g., AP, mAP, and NDS).

V-C2 Latency

As shown in Fig. 6, we conducted a comparative analysis of mono-modal 3D object detection methods (LiDAR-only and Camera-only) and multimodal 3D object detection on the KITTI and nuScenes datasets, presenting scatter plots for Latency (FPS) and Accuracy metrics (AP, mAP, NDS, etc.). It is noteworthy that, in comparison to mono-modal 3D object detection methods (LiDAR-only and Camera-only), multimodal 3D object detection approaches generally exhibit lower FPS.

As shown in Fig. 6 (e), the results on the KITTI dataset indicate that GraphAlign excels in both AP and FPS metrics. Additionally, LoGoNet [247], Focals Conv [232], and EPNet [226] demonstrate outstanding performance. As shown in Fig. 6 (f), GraphAlign [233] retains the highest FPS, but its NDS is suboptimal on the nuScenes dataset. In contrast, UniTR performs exceptionally well in both NDS and FPS. Overall, among Projection-based methods, Auto-Projection-based and Feature-Projection-based methods exhibit superior overall performance, while Unified-Feature-based methods perform better still overall. In evaluating on the KITTI and nuScenes datasets, we emphasize the trade-off between FPS and accuracy metrics such as NDS.

V-C3 Robustness

In the previous Sections III-D3 and IV-E3, we analyzed the robustness of mono-modal 3D object detection (Camera-only and LiDAR-only). In this section, based on Tables IV and V, we analyze the robustness of multimodal 3D object detection. The results on KITTI-C [254] and nuScenes-C [254] show that multimodal 3D object detection is more robust than mono-modal 3D object detection (Camera-only and LiDAR-only), with smaller RCE. On KITTI-C, the representative methods LoGoNet [247] (Auto-Projection-based) and VirConv [221] (Unified-Feature-based) exhibit greater robustness, while EPNet [226] (Point-Projection-based) and Focals Conv [232] (Feature-Projection-based) show slightly weaker performance. Additionally, on nuScenes-C, among Non-Projection-based methods, FUTR3D [228], TransFusion [230], BEVFusion [231], and DeepInteraction [237] all demonstrate strong robustness.

VI Conclusion

3D object detection plays a crucial role in autonomous driving perception. In recent years, this field has witnessed rapid development, yielding a plethora of research papers. Based on the diverse data forms generated by sensors, these methods are primarily categorized into three types: image-based, point cloud-based, and multimodal. They are primarily evaluated on accuracy and latency. Numerous reviews have summarized these approaches, focusing predominantly on the core principles of ‘high accuracy and low latency’ in delineating their technical trajectories. However, as autonomous driving technology moves from breakthroughs to practical applications, existing reviews have not treated safety perception as a central concern and thus fail to cover the current technological pathways related to it. For instance, recent multimodal fusion methods typically undergo robustness testing during the experimental phase, a facet not adequately considered in current reviews. Therefore, in this study, we reexamine 3D object detection algorithms with a central focus on the key aspects of ‘Accuracy, Latency, and Robustness.’ We reorganize the methods covered by previous reviews, placing particular emphasis on re-categorizing them from the perspective of safety perception. We aim for this work to offer new insights for future research in 3D object detection, transcending the confines of high-accuracy exploration.

References

  • [1] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, “A survey on 3d object detection methods for autonomous driving applications,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3782–3795, 2019.
  • [2] L. Wang, X. Zhang, Z. Song, J. Bi, G. Zhang, H. Wei, L. Tang, L. Yang, J. Li, C. Jia et al., “Multi-modal 3d object detection in autonomous driving: A survey and taxonomy,” IEEE Transactions on Intelligent Vehicles, 2023.
  • [3] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
  • [4] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudolidar++: Accurate depth for 3d object detection in autonomous driving,” arXiv preprint arXiv:1906.06310, 2019.
  • [5] P. Li, H. Zhao, P. Liu, and F. Cao, “Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving,” in European Conference on Computer Vision.   Springer, 2020, pp. 644–660.
  • [6] Y. Zhang, J. Lu, and J. Zhou, “Objects are different: Flexible monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3289–3298.
  • [7] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, “Disentangling monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
  • [8] L. Wang, L. Zhang, Y. Zhu, Z. Zhang, T. He, M. Li, and X. Xue, “Progressive coordinate transforms for monocular 3d object detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 13 364–13 377, 2021.
  • [9] Z. Qin and X. Li, “Monoground: Detecting monocular 3d objects from the ground,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3793–3802.
  • [10] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, “Learning depth-guided convolutions for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition workshops, 2020, pp. 1000–1001.
  • [11] Q. Lian, P. Li, and X. Chen, “Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1070–1079.
  • [12] Y.-N. Chen, H. Dai, and Y. Ding, “Pseudo-stereo for monocular 3d object detection in autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 887–897.
  • [13] J. Sun, L. Chen, Y. Xie, S. Zhang, Q. Jiang, X. Zhou, and H. Bao, “Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 548–10 557.
  • [14] C. Li, J. Ku, and S. L. Waslander, “Confidence guided stereo 3d object detection with split depth estimation,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 5776–5783.
  • [15] G. Brazil and X. Liu, “M3d-rpn: Monocular 3d region proposal network for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9287–9296.
  • [16] Y. Cai, B. Li, Z. Jiao, H. Li, X. Zeng, and X. Wang, “Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10 478–10 485.
  • [17] H. Chen, Y. Huang, W. Tian, Z. Gao, and L. Xiong, “Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 379–10 388.
  • [18] Y. Chen, L. Tai, K. Sun, and M. Li, “Monopair: Monocular 3d object detection using pairwise spatial relationships,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 093–12 102.
  • [19] W. Bao, B. Xu, and Z. Chen, “Monofenet: Monocular 3d object detection with feature enhancement networks,” IEEE Transactions on Image Processing, vol. 29, pp. 2753–2765, 2019.
  • [20] J. Heylen, M. De Wolf, B. Dawagne, M. Proesmans, L. Van Gool, W. Abbeloos, H. Abdelkawy, and D. O. Reino, “Monocinis: Camera independent monocular 3d object detection using instance segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 923–934.
  • [21] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object detection via keypoint estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
  • [22] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
  • [23] P. Li, X. Chen, and S. Shen, “Stereo r-cnn based 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7644–7652.
  • [24] Y. Liu, L. Wang, and M. Liu, “Yolostereo3d: A step back to 2d for efficient stereo 3d detection,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 13 018–13 024.
  • [25] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485.
  • [26] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European conference on computer vision.   Springer, 2022, pp. 1–18.
  • [27] H. Königshof, N. O. Salscheider, and C. Stiller, “Realtime 3d object detection for automated driving using stereo vision and semantic information,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC).   IEEE, 2019, pp. 1405–1410.
  • [28] Y. Chen, S. Liu, X. Shen, and J. Jia, “Dsgn: Deep stereo geometry network for 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 536–12 545.
  • [29] C. Ji, H. Wu, and G. Liu, “Probabilistic instance shape reconstruction with sparse lidar for monocular 3d object detection,” Neurocomputing, vol. 529, pp. 92–100, 2023.
  • [30] H. Sheng, S. Cai, N. Zhao, B. Deng, M.-J. Zhao, and G. H. Lee, “Pdr: Progressive depth regularization for monocular 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [31] R. Tao, W. Han, Z. Qiu, C.-z. Xu, and J. Shen, “Weakly supervised monocular 3d object detection using multi-view projection and direction consistency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 482–17 492.
  • [32] Y. Hu, S. Fang, W. Xie, and S. Chen, “Aerial monocular 3d object detection,” IEEE Robotics and Automation Letters, vol. 8, no. 4, pp. 1959–1966, 2023.
  • [33] X. Wu, D. Ma, X. Qu, X. Jiang, and D. Zeng, “Depth dynamic center difference convolutions for monocular 3d object detection,” Neurocomputing, vol. 520, pp. 73–81, 2023.
  • [34] C. Tao, J. Cao, C. Wang, Z. Zhang, and Z. Gao, “Pseudo-mono for monocular 3d object detection in autonomous driving,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [35] J. Xu, L. Peng, H. Cheng, H. Li, W. Qian, K. Li, W. Wang, and D. Cai, “Mononerd: Nerf-like representations for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6814–6824.
  • [36] X. Weng and K. Kitani, “Monocular 3d object detection with pseudo-lidar point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
  • [37] L. Wang, L. Du, X. Ye, Y. Fu, G. Guo, X. Xue, J. Feng, and L. Zhang, “Depth-conditioned dynamic message propagation for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 454–463.
  • [38] P. Li and H. Zhao, “Monocular 3d detection with geometric constraint embedding and semi-supervised training,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5565–5572, 2021.
  • [39] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou, “Deep fitting degree scoring network for monocular 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1057–1066.
  • [40] X. Ye, L. Du, Y. Shi, Y. Li, X. Tan, J. Feng, E. Ding, and S. Wen, “Monocular 3d object detection via feature domain adaptation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16.   Springer, 2020, pp. 17–34.
  • [41] E. Jörgensen, C. Zach, and F. Kahl, “Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss,” arXiv preprint arXiv:1906.08070, 2019.
  • [42] Y. Tang, S. Dorn, and C. Savani, “Center3d: Center-based monocular 3d object detection with joint depth understanding,” in DAGM German Conference on Pattern Recognition.   Springer, 2020, pp. 289–302.
  • [43] X. Shi, Z. Chen, and T.-K. Kim, “Distance-normalized unified representation for monocular 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16.   Springer, 2020, pp. 91–107.
  • [44] Z. Zou, X. Ye, L. Du, X. Cheng, X. Tan, L. Zhang, J. Feng, X. Xue, and E. Ding, “The devil is in the task: Exploiting reciprocal appearance-localization features for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2713–2722.
  • [45] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, “Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6851–6860.
  • [46] Z. Wu, Y. Wu, J. Pu, X. Li, and X. Wang, “Attention-based depth distillation with 3d-aware positional encoding for monocular 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2892–2900.
  • [47] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krähenbühl, T. Darrell, and F. Yu, “Joint monocular 3d vehicle detection and tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5390–5399.
  • [48] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, “Delving into localization errors for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4721–4730.
  • [49] J. Ku, A. D. Pon, and S. L. Waslander, “Monocular 3d object detection leveraging accurate proposals and shape reconstruction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 867–11 876.
  • [50] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, “Monocular 3d object detection for autonomous driving,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2147–2156.
  • [51] L. Yang, X. Zhang, J. Li, L. Wang, M. Zhu, and L. Zhu, “Lite-fpn for keypoint-based monocular 3d object detection,” Knowledge-Based Systems, vol. 271, p. 110517, 2023.
  • [52] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8555–8564.
  • [53] Y. Zhou, Y. He, H. Zhu, C. Wang, H. Li, and Q. Jiang, “Monocular 3d object detection: An extrinsic parameter free approach,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7556–7566.
  • [54] X. Liu, N. Xue, and T. Wu, “Learning auxiliary monocular contexts helps monocular 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1810–1818.
  • [55] Z. Chong, X. Ma, H. Zhang, Y. Yue, H. Li, Z. Wang, and W. Ouyang, “Monodistill: Learning spatial features for monocular 3d object detection,” arXiv preprint arXiv:2201.10830, 2022.
  • [56] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, “Monodtr: Monocular 3d object detection with depth-aware transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4012–4021.
  • [57] S. Luo, H. Dai, L. Shao, and Y. Ding, “M3dssd: Monocular 3d single stage object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6145–6154.
  • [58] Z. Li, Z. Qu, Y. Zhou, J. Liu, H. Wang, and L. Jiang, “Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2791–2800.
  • [59] Y. Hong, H. Dai, and Y. Ding, “Cross-modality knowledge distillation network for monocular 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 87–104.
  • [60] X. Chu, J. Deng, Y. Li, Z. Yuan, Y. Zhang, J. Ji, and Y. Zhang, “Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting,” arXiv preprint, Jul 2021.
  • [61] L. Peng, F. Liu, S. Yan, X. He, and D. Cai, “Ocm3d: Object-centric monocular 3d object detection,” arXiv preprint, Apr 2021.
  • [62] A. Simonelli, S. R. Bulo, L. Porzi, P. Kontschieder, and E. Ricci, “Are we missing confidence in pseudo-lidar methods for monocular 3d object detection?” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
  • [63] J. Gu, B. Wu, L. Fan, J. Huang, S. Cao, Z. Xiang, and X.-S. Hua, “Homography loss for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1080–1089.
  • [64] L. Peng, F. Liu, Z. Yu, S. Yan, D. Deng, Z. Yang, H. Liu, and D. Cai, “Lidar point cloud guided monocular 3d object detection,” Jan 2022, pp. 123–139.
  • [65] T. Wang, X. Zhu, J. Pang, and D. Lin, “Fcos3d: Fully convolutional one-stage monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.
  • [66] A. Kumar, G. Brazil, and X. Liu, “Groomed-nms: Grouped mathematically differentiable nms for monocular 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8973–8983.
  • [67] T. Wang, X. Zhu, J. Pang, and D. Lin, “Probabilistic and geometric depth: Detecting objects in perspective,” in Conference on Robot Learning.   PMLR, 2022, pp. 1475–1485.
  • [68] Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, J. Yan, and W. Ouyang, “Geometry uncertainty projection network for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3111–3121.
  • [69] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
  • [70] T. Wang, J. Pang, and D. Lin, “Monocular 3d object detection with depth from motion,” in European Conference on Computer Vision.   Springer, 2022, pp. 386–403.
  • [71] J. Gu, B. Wu, L. Fan, J. Huang, S. Cao, Z. Xiang, and X.-S. Hua, “Homography loss for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1080–1089.
  • [72] P. Li and J. Jin, “Time3d: End-to-end joint monocular 3d object detection and tracking for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3885–3894.
  • [73] Z. Zhou, L. Du, X. Ye, Z. Zou, X. Tan, L. Zhang, X. Xue, and J. Feng, “Sgm3d: Stereo guided monocular 3d object detection,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10 478–10 485, 2022.
  • [74] X. He, F. Yang, K. Yang, J. Lin, H. Fu, M. Wang, J. Yuan, and Z. Li, “Ssd-monodetr: Supervised scale-aware deformable transformer for monocular 3d object detection,” IEEE Transactions on Intelligent Vehicles, 2023.
  • [75] L. Yang, X. Zhang, J. Li, L. Wang, M. Zhu, C. Zhang, and H. Liu, “Mix-teaching: A simple, unified and effective semi-supervised learning framework for monocular 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [76] R. Zhang, H. Qiu, T. Wang, Z. Guo, X. Xu, Y. Qiao, P. Gao, and H. Li, “Monodetr: depth-guided transformer for monocular 3d object detection,” arXiv preprint arXiv:2203.13310, 2022.
  • [77] Y. Zhou, H. Zhu, Q. Liu, S. Chang, and M. Guo, “Monoatt: Online monocular 3d object detection with adaptive token transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 493–17 503.
  • [78] Y. Su, Y. Di, G. Zhai, F. Manhardt, J. Rambach, B. Busam, D. Stricker, and F. Tombari, “Opa-3d: Occlusion-aware pixel-wise aggregation for monocular 3d object detection,” IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1327–1334, 2023.
  • [79] Z. Liu, D. Zhou, F. Lu, J. Fang, and L. Zhang, “Autoshape: Real-time shape-aware monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 641–15 650.
  • [80] Y. Cao, H. Zhang, Y. Li, C. Ren, and C. Lang, “Cman: Leaning global structure correlation for monocular 3d object detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24 727–24 737, 2022.
  • [81] H. Liu, H. Liu, Y. Wang, F. Sun, and W. Huang, “Fine-grained multilevel fusion for anti-occlusion monocular 3d object detection,” IEEE Transactions on Image Processing, vol. 31, pp. 4050–4061, 2022.
  • [82] Y. Zhang, W. Zheng, Z. Zhu, G. Huang, D. Du, J. Zhou, and J. Lu, “Dimension embeddings for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1589–1598.
  • [83] F. Yang, X. Xu, H. Chen, Y. Guo, Y. He, K. Ni, and G. Ding, “Gpro3d: Deriving 3d bbox from ground plane in monocular 3d object detection,” Neurocomputing, vol. 562, p. 126894, 2023.
  • [84] M. Zhu, L. Ge, P. Wang, and H. Peng, “Monoedge: Monocular 3d object detection using local perspectives,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 643–652.
  • [85] Q. Lian, B. Ye, R. Xu, W. Yao, and T. Zhang, “Exploring geometric consistency for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1685–1694.
  • [86] J.-Q. Yu and S.-C. Pei, “Perspective-aware convolution for monocular 3d object detection,” arXiv preprint arXiv:2308.12938, 2023.
  • [87] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, “M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation,” arXiv preprint arXiv:2204.05088, 2022.
  • [88] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning.   PMLR, 2022, pp. 180–191.
  • [89] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 531–548.
  • [90] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang, “Petrv2: A unified framework for 3d perception from multi-camera images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
  • [91] C. Ge, J. Chen, E. Xie, Z. Wang, L. Hong, H. Lu, Z. Li, and P. Luo, “Metabev: Solving sensor failures for bev detection and map segmentation,” arXiv preprint arXiv:2304.09801, 2023.
  • [92] Y. Li, H. Bao, Z. Ge, J. Yang, J. Sun, and Z. Li, “Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1486–1494.
  • [93] Y. Li, J. Yang, J. Sun, H. Bao, Z. Ge, and L. Xiao, “Bevstereo++: Accurate depth estimation in multi-view 3d object detection via dynamic temporal stereo,” arXiv preprint arXiv:2304.04185, 2023.
  • [94] Y. Zhang, W. Zheng, Z. Zhu, G. Huang, J. Lu, and J. Zhou, “A simple baseline for multi-camera 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3507–3515.
  • [95] J. Huang and G. Huang, “Bevdet4d: Exploit temporal cues in multi-camera 3d object detection,” arXiv preprint arXiv:2203.17054, 2022.
  • [96] M. Klingner, S. Borse, V. R. Kumar, B. Rezaei, V. Narayanan, S. Yogamani, and F. Porikli, “X3kd: Knowledge distillation across modalities, tasks and stages for multi-camera 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 343–13 353.
  • [97] H. Liu, Y. Teng, T. Lu, H. Wang, and L. Wang, “Sparsebev: High-performance sparse 3d object detection from multi-camera videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18 580–18 590.
  • [98] H. Zhou, Z. Ge, Z. Li, and X. Zhang, “Matrixvt: Efficient multi-camera to bev transformation for 3d perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8548–8557.
  • [99] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, and Y.-G. Jiang, “Polarformer: Multi-camera 3d object detection with polar transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 1042–1050.
  • [100] L. Yang, T. Tang, J. Li, P. Chen, K. Yuan, L. Wang, Y. Huang, X. Zhang, and K. Yu, “Bevheight++: Toward robust visual centric 3d object detection,” arXiv preprint arXiv:2309.16179, 2023.
  • [101] Z. Song, H. Wei, C. Jia, Y. Xia, X. Li, and C. Zhang, “Vp-net: Voxels as points for 3d object detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [102] G. Wang, B. Tian, Y. Ai, T. Xu, L. Chen, and D. Cao, “Centernet3d: An anchor free object detector for autonomous driving,” arXiv preprint, Jul 2020.
  • [103] Y. Hu, Z. Ding, R. Ge, W. Shao, L. Huang, K. Li, and Q. Liu, “Afdetv2: Rethinking the necessity of the second stage for object detection from point clouds,” in Proceedings of the AAAI Conference on Artificial Intelligence, Jul 2022, pp. 969–979.
  • [104] R. Ge, Z. Ding, Y. Hu, Y. Wang, S. Chen, L. Huang, and Y. Li, “Afdet: Anchor free one stage 3d object detection,” arXiv preprint arXiv:2006.12671, 2020.
  • [105] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3dssd: Point-based 3d single stage object detector,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 040–11 048.
  • [106] J. H. Yoo, Y. Kim, J. Kim, and J. W. Choi, “3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16.   Springer, 2020, pp. 720–736.
  • [107] H. Yang, Z. Liu, X. Wu, W. Wang, W. Qian, X. He, and D. Cai, “Graph r-cnn: Towards accurate 3d object detection with semantic-decorated local graph.”
  • [108] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
  • [109] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
  • [110] W. Shi and R. Rajkumar, “Point-gnn: Graph neural network for 3d object detection in a point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1711–1719.
  • [111] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 770–779.
  • [112] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Voxelnext: Fully sparse voxelnet for 3d object detection and tracking.”
  • [113] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, “Voxel r-cnn: Towards high performance voxel-based 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1201–1209.
  • [114] H. Wang, Z. Chen, Y. Cai, L. Chen, Y. Li, M. A. Sotelo, and Z. Li, “Voxel-rcnn-complex: An effective 3-d point cloud object detector for complex traffic conditions,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–12, 2022.
  • [115] H. Kuang, B. Wang, J. An, M. Zhang, and Z. Zhang, “Voxel-fpn: Multi-scale voxel feature aggregation for 3d object detection from lidar point clouds,” Sensors, vol. 20, no. 3, p. 704, 2020.
  • [116] Z. Liu, H. Tang, Y. Lin, and S. Han, “Point-voxel cnn for efficient 3d deep learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [117] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, “Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection,” International Journal of Computer Vision, vol. 131, no. 2, pp. 531–551, 2023.
  • [118] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 697–12 705.
  • [119] J. Li, C. Luo, and X. Yang, “Pillarnext: Rethinking network designs for 3d object detection in lidar point clouds.”
  • [120] J. S. Hu, T. Kuai, and S. L. Waslander, “Point density-aware voxels for lidar 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8469–8478.
  • [121] H. Wu, C. Wen, W. Li, X. Li, R. Yang, and C. Wang, “Transformation-equivariant 3d object detection for autonomous driving,” Nov 2022.
  • [122] X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu, and H. Li, “Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 680–697.
  • [123] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, “Embracing single stride 3d object detector with sparse transformer,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022.
  • [124] L. Fan, F. Wang, N. Wang, and Z.-X. Zhang, “Fully sparse 3d object detection,” Advances in Neural Information Processing Systems, vol. 35, pp. 351–363, 2022.
  • [125] T. Yin, X. Zhou, and P. Krähenbühl, “Center-based 3d object detection and tracking,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
  • [126] Z. Zhou, X. Zhao, Y. Wang, P. Wang, and H. Foroosh, “Centerformer: Center-based transformer for 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 496–513.
  • [127] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, “Voxel transformer for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3164–3173.
  • [128] C. He, R. Li, S. Li, and L. Zhang, “Voxel set transformer: A set-to-set approach to 3d object detection from point clouds.”
  • [129] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Largekernel3d: Scaling up kernels in 3d sparse cnns,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 488–13 498.
  • [130] T. Lu, X. Ding, H. Liu, G. Wu, and L. Wang, “Link: Linear kernel for lidar-based 3d perception,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1105–1115.
  • [131] L. Fan, F. Wang, N. Wang, and Z. Zhang, “Fsd v2: Improving fully sparse 3d object detection with virtual voxels,” arXiv preprint arXiv:2308.03755, 2023.
  • [132] X. Lai, Y. Chen, F. Lu, J. Liu, and J. Jia, “Spherical transformer for lidar-based 3d recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 545–17 555.
  • [133] Y. Chen, Z. Yu, Y. Chen, S. Lan, A. Anandkumar, J. Jia, and J. M. Alvarez, “Focalformer3d: Focusing on hard instance for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8394–8405.
  • [134] Q. He, Z. Wang, H. Zeng, Y. Zeng, and Y. Liu, “Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds,” in Proceedings of the AAAI Conference on Artificial Intelligence, Jul 2022, pp. 870–878.
  • [135] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, “Swformer: Sparse window transformer for 3d object detection in point clouds,” Oct 2022.
  • [136] Y. Li, Y. Chen, X. Qi, Z. Li, J. Sun, and J. Jia, “Unifying voxel-based representation with transformer for 3d object detection,” Advances in Neural Information Processing Systems, vol. 35, pp. 18 442–18 455, 2022.
  • [137] B. Zhang, J. Yuan, B. Shi, T. Chen, Y. Li, and Y. Qiao, “Uni3d: A unified baseline for multi-dataset 3d object detection,” Mar 2023.
  • [138] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, “3d object detection with pointformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7463–7472.
  • [139] H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X.-S. Hua, and M.-J. Zhao, “Improving 3d object detection with channel-wise transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2743–2752.
  • [140] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d object detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 873–11 882.
  • [141] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “St3d: Self-training for unsupervised domain adaptation on 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10 368–10 378.
  • [142] Q. Xu, Y. Zhong, and U. Neumann, “Behind the curtain: Learning occluded shapes for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 2893–2901.
  • [143] W. Zheng, W. Tang, L. Jiang, and C.-W. Fu, “Se-ssd: Self-ensembling single-stage object detector from point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 494–14 503.
  • [144] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Ipod: Intensive point-based object detector for point cloud,” arXiv preprint arXiv:1812.05276, 2018.
  • [145] L. Du, X. Ye, X. Tan, J. Feng, Z. Xu, E. Ding, and S. Wen, “Associate-3ddet: Perceptual-to-conceptual association for 3d point cloud object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 329–13 338.
  • [146] J. Li, H. Dai, L. Shao, and Y. Ding, “From voxel to point: Iou-guided 3d object detection for point cloud with voxel-to-point decoder,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4622–4631.
  • [147] Z. Li, Y. Yao, Z. Quan, W. Yang, and J. Xie, “Sienet: spatial information enhancement network for 3d object detection from point cloud,” arXiv preprint arXiv:2103.15396, 2021.
  • [148] D. Zhang, D. Liang, Z. Zou, J. Li, X. Ye, Z. Liu, X. Tan, and X. Bai, “A simple vision transformer for weakly semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8373–8383.
  • [149] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu, “Class-balanced grouping and sampling for point cloud 3d object detection,” arXiv preprint arXiv:1908.09492, 2019.
  • [150] M. Ye, S. Xu, and T. Cao, “Hvnet: Hybrid voxel network for lidar based 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1631–1640.
  • [151] W. Zheng, W. Tang, S. Chen, L. Jiang, and C.-W. Fu, “Cia-ssd: Confident iou-aware single-stage object detector from point cloud,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3555–3562.
  • [152] J. Li, H. Dai, L. Shao, and Y. Ding, “Anchor-free 3d single stage detector with mask-guided attention for point cloud,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 553–562.
  • [153] Q. Chen, L. Sun, E. Cheung, and A. Yuille, “Every view counts: Cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization,” Neural Information Processing Systems, Jan 2020.
  • [154] S. Deng, Z. Liang, L. Sun, and K. Jia, “Vista: Boosting 3d object detection via dual cross-view spatial attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8448–8457.
  • [155] T. Wang, X. Zhu, and D. Lin, “Reconfigurable voxels: A new representation for lidar-based point clouds,” arXiv preprint, Apr 2020.
  • [156] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon, “Pillar-based object detection for autonomous driving,” Jan 2020, pp. 18–34.
  • [157] Y. Zeng, Y. Hu, S. Liu, J. Ye, Y. Han, X. Li, and N. Sun, “Rt3d: Real-time 3-d vehicle detection in lidar point cloud for autonomous driving,” IEEE Robotics and Automation Letters, pp. 3434–3440, Oct 2018.
  • [158] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” in Robotics: Science and Systems XII, Jun 2016.
  • [159] Y. Ye, H. Chen, C. Zhang, X. Hao, and Z. Zhang, “Sarpnet: Shape attention regional proposal network for lidar-based 3d object detection,” Neurocomputing, pp. 53–63, Feb 2020.
  • [160] D. Zeng Wang and I. Posner, “Voting for voting in online point cloud object detection,” in Robotics: Science and Systems XI, Jan 2016.
  • [161] Y. Wang and J. Solomon, “Object dgcnn: 3d object detection using dynamic graphs.”
  • [162] X. Zhu, Y. Ma, T. Wang, Y. Xu, J. Shi, and D. Lin, “Ssn: Shape signature networks for multi-class object detection from point clouds,” Jan 2020, pp. 581–597.
  • [163] J. Deng, W. Zhou, Y. Zhang, and H. Li, “From multi-view to hollow-3d: Hallucinated hollow-3d r-cnn for 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 4722–4734, Dec 2021.
  • [164] Z. Miao, J. Chen, H. Pan, R. Zhang, K. Liu, P. Hao, J. Zhu, Y. Wang, and X. Zhan, “Pvgnet: A bottom-up one-stage 3d object detector with integrated multi-level features,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
  • [165] J. Noh, S. Lee, and B. Ham, “Hvpr: Hybrid voxel-point representation for single-stage 3d object detection,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
  • [166] T. Guan, J. Wang, S. Lan, R. Chandra, Z. Wu, L. Davis, and D. Manocha, “M3detr: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 772–782.
  • [167] J. Wang, S. Lan, M. Gao, and L. S. Davis, “Infofocus: 3d object detection for autonomous driving with dynamic information modeling,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16.   Springer, 2020, pp. 405–420.
  • [168] J. Mao, M. Niu, H. Bai, X. Liang, H. Xu, and C. Xu, “Pyramid r-cnn: Towards better performance and adaptability for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2723–2732.
  • [169] Y. Zhang, K. Liu, H. Bao, Y. Zheng, and Y. Yang, “Pmpf: Point-cloud multiple-pixel fusion-based 3d object detection for autonomous driving,” Remote Sensing, vol. 15, no. 6, p. 1580, 2023.
  • [170] Y. Wang, J. Yin, W. Li, P. Frossard, R. Yang, and J. Shen, “Ssda3d: Semi-supervised domain adaptation for 3d object detection from point cloud,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2707–2715.
  • [171] L. Wang, Z. Song, X. Zhang, C. Wang, G. Zhang, L. Zhu, J. Li, and H. Liu, “Sat-gcn: Self-attention graph convolutional network-based 3d object detection for autonomous driving,” Knowledge-Based Systems, vol. 259, p. 110080, 2023.
  • [172] Y. Zhang, Q. Zhang, Z. Zhu, J. Hou, and Y. Yuan, “Glenet: Boosting 3d object detectors with generative label uncertainty estimation,” International Journal of Computer Vision, vol. 131, no. 12, pp. 3332–3352, 2023.
  • [173] Z. Liang, M. Zhang, Z. Zhang, X. Zhao, and S. Pu, “Rangercnn: Towards fast and accurate 3d object detection with range image representation.”
  • [174] A. Bewley, P. Sun, T. Mensink, D. Anguelov, and C. Sminchisescu, “Range conditioned dilated convolutions for scale invariant 3d object detection,” Conference on Robot Learning, May 2020.
  • [175] Z. Liang, Z. Zhang, M. Zhang, X. Zhao, and S. Pu, “Rangeioudet: Range image based real-time 3d object detector optimized by intersection over union,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
  • [176] L. Fan, X. Xiong, F. Wang, N. Wang, and Z. Zhang, “Rangedet: In defense of range view for lidar-based 3d object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
  • [177] H. Cho, J. Choi, G. Baek, and W. Hwang, “itkd: Interchange transfer-based knowledge distillation for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 540–13 549.
  • [178] D. Ye, Z. Zhou, W. Chen, Y. Xie, Y. Wang, P. Wang, and H. Foroosh, “Lidarmultinet: Towards a unified multi-task network for lidar perception,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3231–3240.
  • [179] Y. Wei, Z. Wei, Y. Rao, J. Li, J. Zhou, and J. Lu, “Lidar distillation: Bridging the beam-induced domain gap for 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 179–195.
  • [180] R. Ma, C. Chen, B. Yang, D. Li, H. Wang, Y. Cong, and Z. Hu, “Cg-ssd: Corner guided single stage 3d object detection from lidar point cloud.”
  • [181] X. Feng, H. Du, H. Fan, Y. Duan, and Y. Liu, “Seformer: Structure embedding transformer for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 632–640.
  • [182] H. Wu, J. Deng, C. Wen, X. Li, C. Wang, and J. Li, “Casa: A cascade attention network for 3-d object detection from lidar point clouds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
  • [183] G. Shi, R. Li, and C. Ma, “Pillarnet: Real-time and high-performance pillar-based 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 35–52.
  • [184] S. Dong, L. Ding, H. Wang, T. Xu, X. Xu, J. Wang, Z. Bian, Y. Wang, and J. Li, “Mssvt: Mixed-scale sparse voxel transformer for 3d object detection on point clouds,” Advances in Neural Information Processing Systems, vol. 35, pp. 11 615–11 628, 2022.
  • [185] J. Yin, J. Fang, D. Zhou, L. Zhang, C.-Z. Xu, J. Shen, and W. Wang, “Semi-supervised 3d object detection with proficient teachers,” in European Conference on Computer Vision.   Springer, 2022, pp. 727–743.
  • [186] C. Chen, Z. Chen, J. Zhang, and D. Tao, “Sasa: Semantics-augmented set abstraction for point-based 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 221–229.
  • [187] H. Hu, Z. Liu, S. Chitlangia, A. Agnihotri, and D. Zhao, “Investigating the impact of multi-lidar placement on object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2550–2559.
  • [188] A. Lehner, S. Gasperini, A. Marcos-Ramiro, M. Schmidt, M.-A. N. Mahani, N. Navab, B. Busam, and F. Tombari, “3d-vfield: Adversarial augmentation of point clouds for domain generalization in 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 295–17 304.
  • [189] J. Yang, L. Song, S. Liu, W. Mao, Z. Li, X. Li, H. Sun, J. Sun, and N. Zheng, “Dbq-ssd: Dynamic ball query for efficient 3d object detection,” arXiv preprint arXiv:2207.10909, 2022.
  • [190] N. M. A. A. Dazlee, S. A. Khalil, S. Abdul-Rahman, and S. Mutalib, “Object detection for autonomous vehicles with sensor-based technology using yolo,” International Journal of Intelligent Systems and Applications in Engineering, vol. 10, no. 1, pp. 129–134, 2022.
  • [191] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “St3d++: Denoised self-training for unsupervised domain adaptation on 3d object detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 5, pp. 6354–6371, 2022.
  • [192] H. Wang, S. Shi, Z. Yang, R. Fang, Q. Qian, H. Li, B. Schiele, and L. Wang, “Rbgnet: Ray-based grouping for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1110–1119.
  • [193] A. Xiao, J. Huang, D. Guan, K. Cui, S. Lu, and L. Shao, “Polarmix: A general data augmentation technique for lidar point clouds,” Advances in Neural Information Processing Systems, vol. 35, pp. 11 035–11 048, 2022.
  • [194] C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, “Voxel-mae: Masked autoencoders for pre-training large-scale point clouds,” arXiv preprint arXiv:2206.09900, 2022.
  • [195] D. Zhou, J. Fang, X. Song, L. Liu, J. Yin, Y. Dai, H. Li, and R. Yang, “Joint 3d instance segmentation and object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1839–1849.
  • [196] Z. Wang and K. Jia, “Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 1742–1749.
  • [197] J. Chen, B. Lei, Q. Song, H. Ying, D. Z. Chen, and J. Wu, “A hierarchical graph network for 3d object detection on point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 392–401.
  • [198] Q. Chen, X. Ma, S. Tang, J. Guo, Q. Yang, and S. Fu, “F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds,” in Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019, pp. 88–100.
  • [199] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han, “Searching efficient 3d architectures with sparse point-voxel convolution,” in European conference on computer vision.   Springer, 2020, pp. 685–702.
  • [200] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin, “Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation,” arXiv preprint arXiv:2008.01550, 2020.
  • [201] Q. Chen, L. Sun, Z. Wang, K. Jia, and A. Yuille, “Object as hotspots: An anchor-free 3d object detection approach via firing of hotspots,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16.   Springer, 2020, pp. 68–84.
  • [202] L. Yan, K. Liu, E. Belyaev, and M. Duan, “Rtl3d: real-time lidar-based 3d object detection with sparse cnn,” IET Computer Vision, vol. 14, no. 5, pp. 224–232, 2020.
  • [203] D. Schinagl, G. Krispel, C. Fruhwirth-Reisinger, H. Possegger, and H. Bischof, “Gace: Geometry aware confidence enhancement for black-box 3d object detectors on lidar-data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6566–6576.
  • [204] M. Imad, O. Doukhi, and D.-J. Lee, “Transfer learning based semantic segmentation for 3d object detection from point cloud,” Sensors, vol. 21, no. 12, p. 3964, 2021.
  • [205] Y. Han, H. Zhang, H. Zhang, and Y. Li, “Ssc3od: Sparsely supervised collaborative 3d object detection from lidar point clouds,” arXiv preprint arXiv:2307.00717, 2023.
  • [206] C. Huang, V. Abdelzad, C. G. Mannes, L. Rowe, B. Therien, R. Salay, K. Czarnecki et al., “Out-of-distribution detection for lidar-based 3d object detection,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2022, pp. 4265–4271.
  • [207] S. Liu, W. Huang, Y. Cao, D. Li, and S. Chen, “Sms-net: Sparse multi-scale voxel feature aggregation network for lidar-based 3d object detection,” Neurocomputing, vol. 501, pp. 555–565, 2022.
  • [208] J. You and Y.-K. Kim, “Up-sampling method for low-resolution lidar point cloud to enhance 3d object detection in an autonomous driving environment,” Sensors, vol. 23, no. 1, p. 322, 2022.
  • [209] E. R. Corral-Soto, A. Nabatchian, M. Gerdzhev, and L. Bingbing, “Lidar few-shot domain adaptation via integrated cyclegan and 3d object detector with joint learning delay,” in 2021 IEEE international conference on robotics and automation (ICRA).   IEEE, 2021, pp. 13 099–13 105.
  • [210] X. Wang and K. M. Kitani, “Cost-aware comparison of lidar-based 3d object detectors,” arXiv preprint arXiv:2205.01142, 2022.
  • [211] M. Pitropov, C. Huang, V. Abdelzad, K. Czarnecki, and S. Waslander, “Lidar-mimo: Efficient uncertainty estimation for lidar-based 3d object detection,” in 2022 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2022, pp. 813–820.
  • [212] P. An, J. Liang, J. Ma, Y. Chen, L. Wang, Y. Yang, and Q. Liu, “Rs-aug: Improve 3d object detection on lidar with realistic simulator based data augmentation,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [213] Y. Wu, S. Zhang, H. Ogai, H. Inujima, and S. Tateno, “Realtime single-shot refinement neural network with adaptive receptive field for 3d object detection from lidar point cloud,” IEEE Sensors Journal, vol. 21, no. 21, pp. 24 505–24 519, 2021.
  • [214] Z. Liang, Y. Huang, and Z. Liu, “Efficient graph attentional network for 3d object detection from frustum-based lidar point clouds,” Journal of Visual Communication and Image Representation, vol. 89, p. 103667, 2022.
  • [215] T.-Y. Pan, C. Ma, T. Chen, C. P. Phoo, K. Z. Luo, Y. You, M. Campbell, K. Q. Weinberger, B. Hariharan, and W.-L. Chao, “Pre-training lidar-based 3d object detectors through colorization,” arXiv preprint arXiv:2310.14592, 2023.
  • [216] M. Sualeh and G.-W. Kim, “Dynamic multi-lidar based multiple object detection and tracking,” Sensors, vol. 19, no. 6, p. 1474, 2019.
  • [217] Z. Yang, Y. Zhou, Z. Chen, and J. Ngiam, “3d-man: 3d multi-frame attention network for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1863–1872.
  • [218] Z. Tian, X. Chu, X. Wang, X. Wei, and C. Shen, “Fully convolutional one-stage 3d object detection on lidar range images,” Advances in Neural Information Processing Systems, vol. 35, pp. 34 899–34 911, 2022.
  • [219] W. Zheng, L. Jiang, F. Lu, Y. Ye, and C.-W. Fu, “Boosting single-frame 3d object detection by simulating multi-frame point clouds,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4848–4856.
  • [220] Q. Chen, L. Sun, E. Cheung, and A. L. Yuille, “Every view counts: Cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 224–21 235, 2020.
  • [221] H. Wu, C. Wen, S. Shi, X. Li, and C. Wang, “Virtual sparse convolution for multimodal 3d object detection,” Mar 2023.
  • [222] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: Sequential fusion for 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4604–4612.
  • [223] Y. Xie, C. Xu, M.-J. Rakotosaona, P. Rim, F. Tombari, K. Keutzer, M. Tomizuka, and W. Zhan, “Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection,” arXiv preprint arXiv:2304.14340, 2023.
  • [224] Z. Yu, W. Wan, M. Ren, X. Zheng, and Z. Fang, “Sparsefusion3d: Sparse sensor fusion for 3d object detection by radar and camera in environmental perception,” IEEE Transactions on Intelligent Vehicles, pp. 1–14, 2023.
  • [225] J. Yan, Y. Liu, J. Sun, F. Jia, S. Li, T. Wang, and X. Zhang, “Cross modal transformer via coordinates encoding for 3d object dectection,” arXiv preprint arXiv:2301.01283, 2023.
  • [226] T. Huang, Z. Liu, X. Chen, and X. Bai, “Epnet: Enhancing point features with image semantics for 3d object detection,” in European Conference on Computer Vision.   Springer, 2020, pp. 35–52.
  • [227] Z. Liu, T. Huang, B. Li, X. Chen, X. Wang, and X. Bai, “Epnet++: Cascade bi-directional fusion for multi-modal 3d object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8324–8341, 2023.
  • [228] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detection.”
  • [229] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, “End-to-end multi-view fusion for 3d object detection in lidar point clouds,” Conference on Robot Learning, Jan 2019.
  • [230] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “Transfusion: Robust lidar-camera fusion for 3d object detection with transformers.”
  • [231] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, “Bevfusion: A simple and robust lidar-camera fusion framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 421–10 434, 2022.
  • [232] Y. Chen, Y. Li, X. Zhang, J. Sun, and J. Jia, “Focal sparse convolutional networks for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5428–5437.
  • [233] Z. Song, H. Wei, L. Bai, L. Yang, and C. Jia, “Graphalign: Enhancing accurate feature alignment by graph matching for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3358–3369.
  • [234] Z. Song, C. Jia, L. Yang, H. Wei, and L. Liu, “Graphalign++: An accurate feature alignment by graph matching for multi-modal 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [235] X. Wu, L. Peng, H. Yang, L. Xie, C. Huang, C. Deng, H. Liu, and D. Cai, “Sparse fuse dense: Towards high quality 3d detection with depth completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5418–5427.
  • [236] Y. Li, A. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, B. Wu, Y. Lu, D. Zhou, Q. Le, A. Yuille, and M. Tan, “Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection.”
  • [237] Z. Yang, J. Chen, Z. Miao, W. Li, X. Zhu, and L. Zhang, “Deepinteraction: 3d object detection via modality interaction,” Aug 2022.
  • [238] A. Mahmoud, J. S. Hu, and S. L. Waslander, “Dense voxel fusion for 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 663–672.
  • [239] C.-J. Ho, C.-H. Tai, Y.-Y. Lin, M.-H. Yang, and Y.-H. Tsai, “Diffusion-ss3d: Diffusion model for semi-supervised 3d object detection,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [240] Q. Cai, Y. Pan, T. Yao, C.-W. Ngo, and T. Mei, “Objectfusion: Multi-modal 3d object detection with object-centric fusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 18 067–18 076.
  • [241] Y. Qin, C. Wang, Z. Kang, N. Ma, Z. Li, and R. Zhang, “Supfusion: Supervised lidar-camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 22 014–22 024.
  • [242] R. Qian, X. Lai, and X. Li, “3d object detection for autonomous driving: A survey,” Pattern Recognition, vol. 130, p. 108796, 2022.
  • [243] X. Li, B. Shi, Y. Hou, X. Wu, T. Ma, Y. Li, and L. He, “Homogeneous multi-modal feature fusion and interaction for 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 691–707.
  • [244] H. Li, Z. Zhang, X. Zhao, Y. Wang, Y. Shen, S. Pu, and H. Mao, “Enhancing multi-modal features using local self-attention for 3d object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 532–549.
  • [245] Y. Zhang, J. Chen, and D. Huang, “Cat-det: Contrastively augmented transformer for multi-modal 3d object detection.”
  • [246] Z. Wang, W. Zhan, and M. Tomizuka, “Fusing bird view lidar point cloud and front view camera image for deep object detection,” Cornell University - arXiv, Nov 2017.
  • [247] X. Li, T. Ma, Y. Hou, B. Shi, Y. Yang, Y. Liu, X. Wu, Q. Chen, Y. Li, Y. Qiao, and L. He, “Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion,” Mar 2023.
  • [248] C. Lin, D. Tian, X. Duan, J. Zhou, D. Zhao, and D. Cao, “Cl3d: Camera-lidar 3d object detection with point feature enhancement and point-guided fusion,” IEEE Transactions on Intelligent Transportation Systems, 2022.
  • [249] H. Zhu, J. Deng, Y. Zhang, J. Ji, Q. Mao, H. Li, and Y. Zhang, “Vpfnet: Improving 3d object detection with virtual point based lidar and stereo data fusion,” arXiv preprint arXiv:2111.14382, 2021.
  • [250] Z. Song, G. Zhang, J. Xie, L. Liu, C. Jia, S. Xu, and Z. Wang, “Voxelnextfusion: A simple, unified, and effective voxel fusion framework for multimodal 3-d object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
  • [251] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1951–1960.
  • [252] Y. Wang, Q. Mao, H. Zhu, J. Deng, Y. Zhang, J. Ji, H. Li, and Y. Zhang, “Multi-modal 3d object detection in autonomous driving: a survey,” International Journal of Computer Vision, pp. 1–31, 2023.
  • [253] S. Xie, L. Kong, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, “Robobev: Towards robust bird’s eye view perception under corruptions,” Apr 2023.
  • [254] Y. Dong, C. Kang, J. Zhang, Z. Zhu, Y. Wang, X. Yang, H. Su, X. Wei, and J. Zhu, “Benchmarking robustness of 3d object detection to common corruptions in autonomous driving,” Mar 2023.
  • [255] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 3354–3361.
  • [256] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
  • [257] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska et al., “Lyft level 5 av dataset 2019,” https://level5.lyft.com/dataset, vol. 1, p. 3, 2019.
  • [258] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 9552–9557.
  • [259] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
  • [260] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
  • [261] Q.-H. Pham, P. Sevestre, R. S. Pahwa, H. Zhan, C. H. Pang, Y. Chen, A. Mustafa, V. Chandrasekhar, and J. Lin, “A 3d dataset: Towards autonomous driving in challenging environments,” in 2020 IEEE International conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 2267–2273.
  • [262] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
  • [263] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn et al., “A2d2: Audi autonomous driving dataset,” arXiv preprint arXiv:2004.06320, 2020.
  • [264] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang et al., “Pandaset: Advanced sensor suite dataset for autonomous driving,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC).   IEEE, 2021, pp. 3095–3101.
  • [265] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
  • [266] Z. Wang, S. Ding, Y. Li, J. Fenn, S. Roychowdhury, A. Wallin, L. Martin, S. Ryvola, G. Sapiro, and Q. Qiu, “Cirrus: A long-range bi-pattern lidar dataset,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 5744–5750.
  • [267] J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li et al., “One million scenes for autonomous driving: Once dataset,” arXiv preprint arXiv:2106.11037, 2021.
  • [268] L. Chen, C. Sima, Y. Li, Z. Zheng, J. Xu, X. Geng, H. Li, C. He, J. Shi, Y. Qiao et al., “Persformer: 3d lane detection via perspective transformer and the openlane benchmark,” in European Conference on Computer Vision.   Springer, 2022, pp. 550–567.
  • [269] J. Mao, S. Shi, X. Wang, and H. Li, “3d object detection for autonomous driving: A comprehensive survey,” International Journal of Computer Vision, pp. 1–55, 2023.
  • [270] S. Y. Alaba and J. E. Ball, “Deep learning-based image 3-d object detection for autonomous driving,” IEEE Sensors Journal, vol. 23, no. 4, pp. 3378–3394, 2023.
  • [271] A. Singh and V. Bankiti, “Surround-view vision-based 3d detection for autonomous driving: A survey,” arXiv preprint arXiv:2302.06650, 2023.
  • [272] A. Singh, “Transformer-based sensor fusion for autonomous driving: A survey,” arXiv preprint arXiv:2302.11481, 2023.
  • [273] X. Wang, K. Li, and A. Chehri, “Multi-sensor fusion technology for 3d object detection in autonomous driving: A review,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [274] Y. Peng, Y. Qin, X. Tang, Z. Zhang, and L. Deng, “Survey on image and point-cloud fusion-based object detection in autonomous vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 22 772–22 789, 2022.
  • [275] Y. Wu, Y. Wang, S. Zhang, and H. Ogai, “Deep 3d object detection networks using lidar data: A review,” IEEE Sensors Journal, vol. 21, no. 2, pp. 1152–1171, 2020.
  • [276] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” International Journal of Computer Vision, vol. 126, pp. 973–992, 2018.
  • [277] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” IEEE transactions on image processing, vol. 25, no. 11, pp. 5187–5198, 2016.
  • [278] T. Ort, I. Gilitschenski, and D. Rus, “Grounded: The localizing ground penetrating radar evaluation dataset,” in Robotics: Science and Systems, vol. 2, 2021.
  • [279] M. Pitropov, D. E. Garcia, J. Rebello, M. Smart, C. Wang, K. Czarnecki, and S. Waslander, “Canadian adverse driving conditions dataset,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 681–690, 2021.
  • [280] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
  • [281] C. A. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond et al., “Ithaca365: Dataset and driving perception under repeated and challenging weather conditions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 383–21 392.
  • [282] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” arXiv preprint arXiv:1903.12261, 2019.
  • [283] S. Xie, Z. Li, Z. Wang, and C. Xie, “On the adversarial robustness of camera-based 3d object detection,” arXiv preprint arXiv:2301.10766, 2023.
  • [284] J. Sun, Y. Cao, Q. A. Chen, and Z. M. Mao, “Towards robust LiDAR-based perception in autonomous driving: General black-box adversarial sensor attack and countermeasures,” in 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 877–894.
  • [285] D. Liu, R. Yu, and H. Su, “Extending adversarial attacks and defenses to deep 3d point cloud classifiers,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 2279–2283.
  • [286] S. Li, Z. Wang, F. Juefei-Xu, Q. Guo, X. Li, and L. Ma, “Common corruption robustness of point cloud detectors: Benchmark and enhancement,” IEEE Transactions on Multimedia, 2023.
  • [287] K. Yu, T. Tao, H. Xie, Z. Lin, Z. Wu, Z. Xia, T. Liang, H. Sun, J. Deng, D. Hao, Y. Wang, X. Liang, and B. Wang, “Benchmarking the robustness of lidar-camera fusion for 3d object detection.”
  • [288] L. Kong, Y. Liu, X. Li, R. Chen, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, “Robo3d: Towards robust and reliable 3d perception against corruptions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 994–20 006.
  • [289] L. Kong, S. Xie, H. Hu, L. X. Ng, B. R. Cottereau, and W. T. Ooi, “Robodepth: Robust out-of-distribution depth estimation under corruptions,” arXiv preprint arXiv:2310.15171, 2023.
  • [290] J. Ren, L. Pan, and Z. Liu, “Benchmarking and analyzing point cloud classification under corruptions,” in International Conference on Machine Learning.   PMLR, 2022, pp. 18 559–18 575.
  • [291] M. Zeeshan Zia, M. Stark, and K. Schindler, “Are cars just 3d boxes?-jointly estimating the 3d shape of multiple objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3678–3685.
  • [292] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2040–2049.
  • [293] T. He and S. Soatto, “Mono3d++: Monocular 3d vehicle detection with two-scale 3d hypotheses and task priors,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8409–8416.
  • [294] A. Kundu, Y. Li, and J. M. Rehg, “3d-rcnn: Instance-level 3d object reconstruction via render-and-compare,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3559–3568.
  • [295] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2d detection to 6d pose and metric shape,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2069–2078.
  • [296] D. Beker, H. Kato, M. A. Morariu, T. Ando, T. Matsuoka, W. Kehl, and A. Gaidon, “Monocular differentiable rendering for self-supervised 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16.   Springer, 2020, pp. 514–529.
  • [297] S. Zakharov, W. Kehl, A. Bhargava, and A. Gaidon, “Autolabeling 3d objects with differentiable rendering of sdf shape priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 224–12 233.
  • [298] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Data-driven 3d voxel patterns for object category recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1903–1911.
  • [299] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 7074–7082.
  • [300] A. Naiden, V. Paunescu, G. Kim, B. Jeon, and M. Leordeanu, “Shift r-cnn: Deep monocular 3d object detection with closed-form geometric constraints,” in 2019 IEEE international conference on image processing (ICIP).   IEEE, 2019, pp. 61–65.
  • [301] X. Shi, Q. Ye, X. Chen, C. Chen, Z. Chen, and T.-K. Kim, “Geometry-based distance decomposition for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 172–15 181.
  • [302] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monocular video,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16.   Springer, 2020, pp. 135–152.
  • [303] A. Simonelli, S. R. Bulo, L. Porzi, E. Ricci, and P. Kontschieder, “Towards generalization across depth for monocular 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16.   Springer, 2020, pp. 767–782.
  • [304] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “Gs3d: An efficient 3d object detection framework for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1019–1028.
  • [305] Z. Qin, J. Wang, and Y. Lu, “Monogrnet: A geometric reasoning network for monocular 3d object localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8851–8858.
  • [306] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, “Rethinking pseudo-lidar representation,” Jan 2020, pp. 311–327.
  • [307] J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 193–10 202.
  • [308] Z. Qin, J. Wang, and Y. Lu, “Triangulation learning network: from monocular to stereo 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7615–7623.
  • [309] J. Sun, L. Chen, Y. Xie, S. Zhang, Q. Jiang, X. Zhou, and H. Bao, “Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 548–10 557.
  • [310] Z. Xu, W. Zhang, X. Ye, X. Tan, W. Yang, S. Wen, E. Ding, A. Meng, and L. Huang, “Zoomnet: Part-aware adaptive zooming neural network for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 557–12 564.
  • [311] L. Chen, J. Sun, Y. Xie, S. Zhang, Q. Shuai, Q. Jiang, G. Zhang, H. Bao, and X. Zhou, “Shape prior guided instance disparity estimation for 3d object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5529–5540, 2021.
  • [312] W. Peng, H. Pan, H. Liu, and Y. Sun, “Ida-3d: Instance-depth-aware 3d object detection from stereo vision for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 015–13 024.
  • [313] X. Peng, X. Zhu, T. Wang, and Y. Ma, “Side: center-based stereo 3d detector with structure-aware instance depth estimation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 119–128.
  • [314] R. Qian, D. Garg, Y. Wang, Y. You, S. Belongie, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “End-to-end pseudo-lidar for image-based 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5881–5890.
  • [315] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
  • [316] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16.   Springer, 2020, pp. 194–210.
  • [317] L. Yang, K. Yu, T. Tang, J. Li, K. Yuan, L. Wang, X. Zhang, and P. Chen, “Bevheight: A robust framework for vision-based roadside 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 611–21 620.
  • [318] X. Chi, J. Liu, M. Lu, R. Zhang, Z. Wang, Y. Guo, and S. Zhang, “Bev-san: Accurate bev 3d object detection via slice attention networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 461–17 470.
  • [319] Y. Wang, Y. Chen, and Z. Zhang, “Frustumformer: Adaptive instance-aware resampling for multi-view 3d detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5096–5105.
  • [320] J. Liu, R. Zhang, X. Chi, X. Li, M. Lu, Y. Guo, and S. Zhang, “Multi-latent space alignments for unsupervised domain adaptation in multi-view 3d object detection,” arXiv preprint arXiv:2211.17126, 2022.
  • [321] P. Huang, L. Liu, R. Zhang, S. Zhang, X. Xu, B. Wang, and G. Liu, “Tig-bev: Multi-view bev 3d object detection via target inner-geometry learning,” arXiv preprint arXiv:2212.13979, 2022.
  • [322] P. Cao, H. Chen, Y. Zhang, and G. Wang, “Multi-view frustum pointnet for object detection in autonomous driving,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 3896–3899.
  • [323] C. Xu, B. Wu, J. Hou, S. Tsai, R. Li, J. Wang, W. Zhan, Z. He, P. Vajda, K. Keutzer et al., “Nerf-det: Learning geometry-aware volumetric representation for multi-view 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23 320–23 330.
  • [324] S. Wang, X. Zhao, H.-M. Xu, Z. Chen, D. Yu, J. Chang, Z. Yang, and F. Zhao, “Towards domain generalization for multi-view 3d object detection in bird-eye-view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 333–13 342.
  • [325] Z. Luo, C. Zhou, G. Zhang, and S. Lu, “Detr4d: Direct multi-view 3d object detection with sparse attention,” arXiv preprint arXiv:2212.07849, 2022.
  • [326] D. Wang, X. Cui, X. Chen, Z. Zou, T. Shi, S. Salcudean, Z. J. Wang, and R. Ward, “Multi-view 3d reconstruction with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5722–5731.
  • [327] X. Lin, T. Lin, Z. Pei, L. Huang, and Z. Su, “Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion,” arXiv preprint arXiv:2211.10581, 2022.
  • [328] ——, “Sparse4d v2: Recurrent temporal fusion with sparse model,” arXiv preprint arXiv:2305.14018, 2023.
  • [329] X. Lin, Z. Pei, T. Lin, L. Huang, and Z. Su, “Sparse4d v3: Advancing end-to-end 3d detection and tracking,” arXiv preprint arXiv:2311.11722, 2023.
  • [330] J. Park, C. Xu, S. Yang, K. Keutzer, K. Kitani, M. Tomizuka, and W. Zhan, “Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection,” arXiv preprint arXiv:2210.02443, 2022.
  • [331] K. Xiong, S. Gong, X. Ye, X. Tan, J. Wan, E. Ding, J. Wang, and X. Bai, “Cape: Camera view position embedding for multi-view 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 570–21 579.
  • [332] D. Chen, J. Li, V. Guizilini, R. A. Ambrus, and A. Gaidon, “Viewpoint equivariance for multi-view 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9213–9222.
  • [333] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, “Graph-detr3d: rethinking overlapping regions for multi-view 3d object detection,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5999–6008.
  • [334] C. Shu, J. Deng, F. Yu, and Y. Liu, “3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3580–3589.
  • [335] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, “Bevdistill: Cross-modal bev distillation for multi-view 3d object detection,” arXiv preprint arXiv:2211.09386, 2022.
  • [336] S. Wang, Y. Liu, T. Wang, Y. Li, and X. Zhang, “Exploring object-centric temporal modeling for efficient multi-view 3d object detection,” arXiv preprint arXiv:2303.11926, 2023.
  • [337] X. Jiang, S. Li, Y. Liu, S. Wang, F. Jia, T. Wang, L. Han, and X. Zhang, “Far3d: Expanding the horizon for surround-view 3d object detection,” arXiv preprint arXiv:2308.09616, 2023.
  • [338] J. Ku, A. Pon, and S. Waslander, “Monocular 3d object detection leveraging accurate proposals and shape reconstruction,” arXiv preprint, Apr 2019.
  • [339] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
  • [340] E. Jörgensen, C. Zach, and F. Kahl, “Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss,” arXiv preprint, Jun 2019.
  • [341] M. Braun, Q. Rao, Y. Wang, and F. Flohr, “Pose-rcnn: Joint object detection and pose estimation using 3d object proposals,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2016, pp. 1546–1551.
  • [342] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 270–279.
  • [343] X. Wang, W. Yin, T. Kong, Y. Jiang, L. Li, and C. Shen, “Task-aware monocular depth estimation for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 257–12 264.
  • [344] R. Roriz, J. Cabral, and T. Gomes, “Automotive lidar technology: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [345] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, “A survey on deep learning: Algorithms, techniques, and applications,” ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
  • [346] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” arXiv preprint arXiv:1905.05055, 2019.
  • [347] C. Keqi, Z. Zhiliang, D. Xiaoming, M. Cuixia, and W. Hongan, “Deep learning for multi-scale object detection: A survey,” Journal of Software, vol. 32, no. 4, pp. 1201–1227, 2020.
  • [348] W. Liang, P. Xu, L. Guo, H. Bai, Y. Zhou, and F. Chen, “A survey of 3d object detection,” Multimedia Tools and Applications, vol. 80, no. 19, pp. 29 617–29 641, 2021.
  • [349] D. Fernandes, A. Silva, R. Névoa, C. Simões, D. Gonzalez, M. Guevara, P. Novais, J. Monteiro, and P. Melo-Pinto, “Point-cloud based 3d object detection and classification methods for self-driving applications: A survey and taxonomy,” Information Fusion, vol. 68, pp. 161–191, 2021.
  • [350] G. Zamanakos, L. Tsochatzidis, A. Amanatiadis, and I. Pratikakis, “A comprehensive survey of lidar-based 3d object detection methods with deep learning for autonomous driving,” Computers & Graphics, vol. 99, pp. 153–181, 2021.
  • [351] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [352] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
  • [353] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [354] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
  • [355] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision.   Springer, 2020, pp. 213–229.
  • [356] H. Wang, H. Tang, S. Shi, A. Li, Z. Li, B. Schiele, and L. Wang, “Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6792–6802.
  • [357] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng et al., “Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [358] A. D. Pon, J. Ku, C. Li, and S. L. Waslander, “Object-centric stereo matching for 3d object detection,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 8383–8389.
  • [359] P. Li, S. Su, and H. Zhao, “Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1930–1939.
  • [360] A. Gao, Y. Pang, J. Nie, Z. Shao, J. Cao, Y. Guo, and X. Li, “Esgn: Efficient stereo geometry network for fast 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [361] J. Chen, Q. Wang, W. Peng, H. Xu, X. Li, and W. Xu, “Disparity-based multiscale fusion network for transportation detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 18 855–18 863, 2022.
  • [362] Y. Chen, S. Huang, S. Liu, B. Yu, and J. Jia, “Dsgn++: Exploiting visual-spatial relation for stereo-based 3d detectors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4416–4429, 2022.
  • [363] D. Garg, Y. Wang, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “Wasserstein distances for stereo disparity estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 22 517–22 529, 2020.
  • [364] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint, Apr 2019.
  • [365] C. Yang, Y. Chen, H. Tian, C. Tao, X. Zhu, Z. Zhang, G. Huang, H. Li, Y. Qiao, L. Lu et al., “Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 830–17 839.
  • [366] Y. Wang, B. Yang, R. Hu, M. Liang, and R. Urtasun, “Plumenet: Efficient 3d object detection from stereo images,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 3383–3390.
  • [367] X. Guo, S. Shi, X. Wang, and H. Li, “Liga-stereo: Learning lidar geometry aware representations for stereo-based 3d detector,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3153–3163.
  • [368] Y. Li, Q. Han, M. Yu, Y. Jiang, C. Yeo, Y. Li, Z. Huang, N. Liu, H. Chen, and X. Wu, “Towards efficient 3d object detection in bird’s-eye-view space for autonomous driving: A convolutional-only approach,” arXiv preprint arXiv:2312.00633, 2023.
  • [369] L. Xie, C. Xiang, Z. Yu, G. Xu, Z. Yang, D. Cai, and X. He, “Pi-rcnn: An efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12 460–12 467.
  • [370] Y. Zhang, Q. Hu, G. Xu, Y. Ma, J. Wan, and Y. Guo, “Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 953–18 962.
  • [371] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 529–10 538.
  • [372] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 8, pp. 2647–2664, 2020.
  • [373] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 677–11 684.
  • [374] C. Yu, J. Lei, B. Peng, H. Shen, and Q. Huang, “Siev-net: A structure-information enhanced voxel network for 3d object detection from lidar point clouds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
  • [375] “Focusing on hard instance for 3d object detection,” Aug 2023.
  • [376] V. A. Sindagi, Y. Zhou, and O. Tuzel, “Mvx-net: Multimodal voxelnet for 3d object detection,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 7276–7282.
  • [377] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-sensor fusion for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7345–7353.
  • [378] C. Zhang, H. Wang, Y. Cai, L. Chen, Y. Li, M. A. Sotelo, and Z. Li, “Robust-fusionnet: Deep multimodal sensor fusion for 3-d object detection under severe weather conditions,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–13, 2022.
  • [379] S. Pang, D. Morris, and H. Radha, “Clocs: Camera-lidar object candidates fusion for 3d object detection,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 10 386–10 393.
  • [380] C. Chen, L. Z. Fragonara, and A. Tsourdos, “Roifusion: 3d object detection from lidar and vision,” IEEE Access, vol. 9, pp. 51 710–51 721, 2021.
  • [381] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 918–927.
  • [382] S. Pang, D. Morris, and H. Radha, “Fast-clocs: Fast camera-lidar object candidates fusion for 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 187–196.
  • [383] C. Wang, C. Ma, M. Zhu, and X. Yang, “Pointaugmenting: Cross-modal augmentation for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 794–11 803.
  • [384] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, “Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection,” Jul 2022.
  • [385] H. Hu, F. Wang, J. Su, L. Hu, T. Feng, Z. Zhang, and W. Zhang, “Ea-bev: Edge-aware bird’s-eye-view projector for 3d object detection.”
  • [386] H. Cai, Z. Zhang, Z. Zhou, Z. Li, W. Ding, and J. Zhao, “Bevfusion4d: Learning lidar-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation,” arXiv preprint arXiv:2303.17099, 2023.
  • [387] Y. Jiao, Z. Jie, S. Chen, J. Chen, X. Wei, L. Ma, and Y.-G. Jiang, “Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection,” Sep 2022.
  • [388] Z. Zhu, Y. Zhang, H. Chen, Y. Dong, S. Zhao, W. Ding, J. Zhong, and S. Zheng, “Understanding the robustness of 3d object detection with bird’s-eye-view representations in autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 600–21 610.
  • [389] Y. Zhang, J. Hou, and Y. Yuan, “A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks,” International Journal of Computer Vision, pp. 1–33, 2023.
  • [390] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 677–11 684.
  • [391] D. Rukhovich, A. Vorontsova, and A. Konushin, “Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2397–2406.
  • [392] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015.
  • [393] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
  • [394] B. Wu, A. Wan, X. Yue, and K. Keutzer, “Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018.
  • [395] Y. Chai, P. Sun, J. Ngiam, W. Wang, B. Caine, V. Vasudevan, X. Zhang, and D. Anguelov, “To the point: Efficient 3d object detection in the range image with graph convolution kernels,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16 000–16 009.
  • [396] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 677–12 686.
  • [397] G. P. Meyer, J. Charland, S. Pandey, A. Laddha, S. Gautam, C. Vallespi-Gonzalez, and C. K. Wellington, “Laserflow: Efficient and probabilistic object detection and motion forecasting,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 526–533, 2020.
  • [398] Y. Su, W. Liu, Z. Yuan, M. Cheng, Z. Zhang, X. Shen, and C. Wang, “Dla-net: Learning dual local attention features for semantic segmentation of large-scale building facade point clouds,” Pattern Recognition, vol. 123, p. 108372, 2022.
  • [399] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [400] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, “Rsn: Range sparse net for efficient, accurate lidar 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
  • [401] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1137–1149, Jun 2017.
  • [402] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
  • [403] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
  • [404] J. Beltrán, C. Guindel, F. M. Moreno, D. Cruzado, F. Garcia, and A. De La Escalera, “Birdnet: a 3d object detection framework from lidar information,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2018, pp. 3517–3523.
  • [405] A. Barrera, C. Guindel, J. Beltrán, and F. García, “Birdnet+: End-to-end 3d object detection in lidar bird’s eye view,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2020, pp. 1–6.
  • [406] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, “Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2020.
  • [407] R. Huang, W. Zhang, A. Kundu, C. Pantofaru, D. A. Ross, T. Funkhouser, and A. Fathi, “An lstm approach to temporal 3d object detection in lidar point clouds,” Jan 2020, pp. 266–282.
  • [408] B. Yang, M. Bai, M. Liang, W. Zeng, and R. Urtasun, “Auto4d: Learning to label 4d objects from sequential point clouds,” arXiv preprint, Jan 2021.
  • [409] Z. Yuan, X. Song, L. Bai, W. Zhou, Z. Wang, and W. Ouyang, “Temporal-channel transformer for 3d lidar-based video object detection in autonomous driving,” arXiv preprint, Nov 2020.
  • [410] Z. Zhang, J. Gao, J. Mao, Y. Liu, D. Anguelov, and C. Li, “Stinet: Spatio-temporal-interactive network for pedestrian detection and trajectory prediction,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2020.
  • [411] S. Wang, K. Lu, J. Xue, and Y. Zhao, “Da-net: Density-aware 3d object detection network for point clouds,” IEEE Transactions on Multimedia, 2023.
  • [412] S. Xiong, B. Li, and S. Zhu, “Dcgnn: A single-stage 3d object detection network based on density clustering and graph neural network,” Complex & Intelligent Systems, vol. 9, no. 3, pp. 3399–3408, 2023.
  • [413] Z. Qin, J. Wang, and Y. Lu, “Weakly supervised 3d object detection from point clouds,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4144–4152.
  • [414] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [415] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [416] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
  • [417] B. Graham, “Spatially-sparse convolutional neural networks,” arXiv preprint arXiv:1409.6070, 2014.
  • [418] B. Graham, M. Engelcke, and L. v. d. Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
  • [419] H. Yi, S. Shi, M. Ding, J. Sun, K. Xu, H. Zhou, Z. Wang, S. Li, and G. Wang, “Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), May 2020.
  • [420] Q. Xu, Y. Zhou, W. Wang, C. R. Qi, and D. Anguelov, “Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 446–15 456.
  • [421] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.
  • [422] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019.
  • [423] J. Mao, X. Wang, and H. Li, “Interpolated convolutional networks for 3d point cloud understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1578–1587.
  • [424] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, Z. Chen, J. Shlens, and V. Vasudevan, “Starnet: Targeted computation for object detection in point clouds,” arXiv preprint, Aug 2019.
  • [425] Q. Wang, J. Chen, J. Deng, and X. Zhang, “3d-centernet: 3d object detection network for point clouds with center estimation priority,” Pattern Recognition, p. 107884, Jul 2021.
  • [426] J. Zarzar, S. Giancola, and B. Ghanem, “Pointrgcn: Graph convolution networks for 3d vehicles detection refinement,” arXiv preprint, Nov 2019.
  • [427] M. Feng, S. Z. Gilani, Y. Wang, L. Zhang, and A. Mian, “Relation graph network for 3d object detection in point clouds,” IEEE Transactions on Image Processing, pp. 92–107, Jan 2021.
  • [428] Z. Liu, Z. Zhang, Y. Cao, H. Hu, and X. Tong, “Group-free 3d object detection via transformers,” Apr 2021.
  • [429] Y. Zhang, D. Huang, and Y. Wang, “Pc-rgnn: Point cloud completion and graph neural network for 3d object detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3430–3437.
  • [430] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point transformer,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
  • [431] J. Liu, T. He, H. Yang, R. Su, J. Tian, J. Wu, H. Guo, K. Xu, and W. Ouyang, “3d-queryis: A query-based framework for 3d instance segmentation,” Nov 2022.
  • [432] Y. Chen, S. Liu, X. Shen, and J. Jia, “Fast point r-cnn,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9775–9784.
  • [433] Z. Li, F. Wang, and N. Wang, “Lidar r-cnn: An efficient and universal 3d object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7546–7555.
  • [434] D. Zhou, J. Fang, X. Song, C. Guan, J. Yin, Y. Dai, and R. Yang, “Iou loss for 2d/3d object detection,” in 2019 international conference on 3D vision (3DV).   IEEE, 2019, pp. 85–94.
  • [435] I. Koo, I. Lee, S.-H. Kim, H.-S. Kim, W.-j. Jeon, and C. Kim, “Pg-rcnn: Semantic surface point generation for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18 142–18 151.
  • [436] J. Li, S. Luo, Z. Zhu, H. Dai, A. S. Krylov, Y. Ding, and L. Shao, “3d iou-net: Iou guided 3d object detector for point clouds,” arXiv preprint arXiv:2004.04962, 2020.
  • [437] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, F. Zhao, B. Zhou, and H. Zhao, “Autoalign: Pixel-instance feature aggregation for multi-modal 3d object detection,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Jul 2022.
  • [438] S. Xu, D. Zhou, J. Fang, J. Yin, Z. Bin, and L. Zhang, “Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC).   IEEE, 2021, pp. 3047–3054.
  • [439] R. Nabati and H. Qi, “Centerfusion: Center-based radar and camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
  • [440] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roarnet: A robust 3d object detection based on region approximation refinement,” in 2019 IEEE intelligent vehicles symposium (IV).   IEEE, 2019, pp. 2510–2515.
  • [441] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. Michael Gross, “Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [442] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 677–12 686.
  • [443] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez, “Sensor fusion for joint 3d object detection and semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [444] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion for multi-sensor 3d object detection,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 641–656.
  • [445] D. Xu, D. Anguelov, and A. Jain, “Pointfusion: Deep sensor fusion for 3d bounding box estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 244–253.
  • [446] Y. Li, X. Qi, Y. Chen, L. Wang, Z. Li, J. Sun, and J. Jia, “Voxel field fusion for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1120–1129.
  • [447] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Largekernel3d: Scaling up kernels in 3d sparse cnns,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 488–13 498.
  • [448] Y. Qin, C. Wang, Z. Kang, N. Ma, Z. Li, and R. Zhang, “Supfusion: Supervised lidar-camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 014–22 024.
  • [449] B. Ding, J. Xie, and J. Nie, “C²BN: Cross-modality and cross-scale balance network for multi-modal 3d object detection,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [450] Y. Kim, K. Park, M. Kim, D. Kum, and J. W. Choi, “3d dual-fusion: Dual-domain dual-query camera-lidar fusion for 3d object detection,” arXiv preprint arXiv:2211.13529, 2022.
  • [451] S. Pang, D. Morris, and H. Radha, “Clocs: Camera-lidar object candidates fusion for 3d object detection,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 386–10 393.
  • [452] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object detection from view aggregation,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1–8.
  • [453] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
  • [454] A. Paigwar, D. Sierra-Gonzalez, Ö. Erkent, and C. Laugier, “Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2926–2933.
  • [455] C. Zhang, H. Wang, L. Chen, Y. Li, and Y. Cai, “Mixedfusion: An efficient multimodal data fusion framework for 3-d object detection and tracking,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
  • [456] S. Xu, F. Li, Z. Song, J. Fang, S. Wang, and Z.-X. Yang, “Multi-sem fusion: Multimodal semantic fusion for 3d object detection,” 2023.
Ziying Song was born in Xingtai, Hebei Province, China, in 1997. He received the B.S. degree from Hebei Normal University of Science and Technology (China) in 2019 and a master’s degree from Hebei University of Science and Technology (China) in 2022. He is now a Ph.D. student majoring in Computer Science and Technology at Beijing Jiaotong University (China), with a research focus on computer vision.
Lin Liu was born in Jinzhou, Liaoning Province, China, in 2001. He is now a college student majoring in Computer Science and Technology at China University of Geosciences (Beijing). In Dec. 2022, he was recommended for admission to the master’s program in Computer Science and Technology at Beijing Jiaotong University. His research interests are in computer vision.
Feiyang Jia was born in Yinchuan, Ningxia Province, China, in 1998. He received his B.S. degree from Beijing Jiaotong University (China) in 2020 and a master’s degree from Beijing Technology and Business University (China) in 2023. He is now a Ph.D. student majoring in Computer Science and Technology at Beijing Jiaotong University (China), with a research focus on computer vision.
Yadan Luo (Member, IEEE) received the B.S. degree in computer science from the University of Electronic Science and Technology of China, and the Ph.D. degree from the University of Queensland. Her research interests include machine learning, computer vision, and multimedia data analysis. She is now a lecturer with the University of Queensland.
Guoxin Zhang was born in Xingtai, Hebei Province, China, in 1998. He received his bachelor’s degree in 2021. He is now studying for his master’s degree at the Hebei University of Science and Technology (China). His research interests are in computer vision.
Lei Yang was born in Datong, Shanxi Province, China, in 1993. He received his master’s degree in robotics from Beihang University in 2018. He then worked as an algorithm researcher in the Autonomous Driving R&D Department of JD.COM from 2018 to 2020. Since 2020, he has been a Ph.D. student in the School of Vehicle and Mobility at Tsinghua University. His research interests are computer vision, autonomous driving, and environmental perception.
Li Wang was born in Shangqiu, Henan Province, China, in 1990. He received his Ph.D. degree in mechatronic engineering from the State Key Laboratory of Robotics and System, Harbin Institute of Technology, in 2020. He was a visiting scholar at Nanyang Technological University for two years. He was a postdoctoral fellow in the State Key Laboratory of Automotive Safety and Energy, and the School of Vehicle and Mobility, Tsinghua University. Currently, he is an assistant professor at the School of Mechanical Engineering, Beijing Institute of Technology. Dr. Wang is the author of more than 30 SCI/EI articles. His research interests include autonomous driving perception, 3D robot vision, and multi-modal fusion.
Caiyan Jia, born on March 2, 1976, is a lecturer and a postdoctoral fellow of the Chinese Computer Society. She graduated from Ningxia University in 1998 with a bachelor’s degree in mathematics, and from Xiangtan University in 2001 with a master’s degree in computational mathematics, specializing in intelligent information processing. She received her doctorate degree in engineering, specializing in data mining, from the Institute of Computing Technology of the Chinese Academy of Sciences in 2004. Since 2002, she has reviewed papers for Journal of Software, Journal of Computer Science and Technology, Journal of Computer Science, and IEEE Trans. on Knowledge and Data Engineering, among others.