Sketch to Scene
Project: Street View Scene Reconstruction via Multi-View Sketches Using Deep Learning
Date: 2024.5
Keywords: 3D Reconstruction, Deep Learning, Computer Vision
This project explores the reconstruction of a 3D street view from multi-view sketches, focusing on recreating Field Lane in 19th-century London, a location that is described in Charles Dickens’s Oliver Twist but no longer exists. The work is relevant to anyone who wants to experience historical settings spatially, in three dimensions. It employs a method that estimates depth information from images and uses it to reconstruct a 3D point cloud.
Introduction
Figure 1: An 1847 wood engraving of Field Lane
Field Lane, as depicted in Charles Dickens’s Oliver Twist, is an iconic scene from historical London, portrayed as “a dirtier or more wretched place” than the protagonist had ever seen [1]. While the place no longer exists, surviving sketches provide a glimpse into this past space (Figure 1). This research asks two questions: Can we reconstruct Field Lane in 3D using these sketches? Can the method be generalized to convert other multi-view 2D sketches into 3D reconstructions? To answer them, it builds a pipeline that converts 2D sketches into 3D scenes using deep learning and computer vision techniques.
Overview
Current methods for converting sketches to 3D models mainly generate isolated 3D objects from sketches on plain white backgrounds, lacking additional contextual information. There remains a gap in translating hand-drawn sketches into 3D with a contextual background, such as architectural street views. This research aims to fill that gap by utilizing deep learning models to convert 2D sketches into 3D scenes, thereby reconstructing architectural spaces such as street views from multi-view sketches.
The project first tests reconstructing a 3D scene from a single image, then progresses to multi-view sketches for more complex reconstructions. In general, the experimental pipeline (Figure 2) estimates depth maps from the drawings and uses them to generate a 3D point cloud. Camera intrinsic parameters and feature matching are incorporated to improve the spatial accuracy of the reconstructed scenes.
Figure 2: Pipeline
Single-Sketch 3D Generation
Constructing 3D models from single images relies on two key libraries: PIL is used for loading and manipulating images before they are passed to the models, and Open3D [2] is used for processing and visualizing point clouds for 3D reconstruction.
For depth estimation, two different models are employed and their results compared. The first is GLPNForDepthEstimation, a class from Hugging Face's transformers library that estimates depth using the GLPN model trained on the NYU dataset [3]. The second is Depth Anything [4], a recent and robust depth estimation model. The results show that Depth Anything produces significantly more accurate depth maps.
Figure 3: Depth maps generated by GLPN (top) and Depth Anything (bottom)
The depth maps predicted from the earlier steps are converted into RGB-D images ("D" stands for depth) using the Open3D library. From these RGB-D images, 3D point clouds are generated. This process involves mapping each pixel in the image to a 3D point based on its depth value, normalized and adjusted for perspective distortion relative to the camera's intrinsic parameters. The point cloud is initially visualized by overlaying the original sketch as its texture. Subsequent processes include outlier removal and surface reconstruction to refine the point cloud further (Figure 3). During this phase, normals are estimated and aligned, followed by employing a Poisson surface reconstruction algorithm [5] to derive a mesh from the point cloud, enhancing the visual representation.
Figure 3: Mesh generated from the point cloud after refinement
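The pixel-to-point mapping described above can be illustrated with a minimal NumPy sketch of pinhole back-projection. This is not the project's actual code, which uses Open3D's RGB-D utilities. The function name depth_to_point_cloud and the intrinsic values fx, fy, cx, cy are hypothetical placeholders. Monocular models typically predict relative rather than metric depth, so the scale of the resulting cloud should be treated as arbitrary.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) array of camera-space points.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    All four are assumed values here; the sketches have no known camera.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only pixels with a positive depth value.
    return points[points[:, 2] > 0]

# Toy input: a flat 4x4 "wall" two units in front of the camera.
depth = np.full((4, 4), 2.0)
points = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

The pixel at the principal point maps to a point on the optical axis (x = y = 0). Pixels farther from it spread out in proportion to their depth, which is how errors in the assumed intrinsics show up as stretched or compressed geometry.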
Multi-Sketch 3D Generation
This section explores several approaches to integrate multiple sketches to reconstruct a 3D scene through feature matching, aiming to achieve a more cohesive and detailed reconstruction.
Two sketches (Figure 4) of Chick Lane, a street close to and visually similar to Field Lane, are chosen for the 3D reconstruction. They portray the same street façade from slightly different angles, which is rare in the available material.
Figure 4: Two sketches of Chick Lane
Feature Matching
Feature matching is first tested using the ORB (Oriented FAST and Rotated BRIEF) detector [6] combined with a brute-force matcher. The result is unconvincing, with few matched points (Figure 5). To improve on this, the project incorporates SuperGlue [7], a neural network model that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. It correctly matches architectural features across the sketches, yielding 188 matches (Figure 6) and significantly improving alignment accuracy.
Figure 5: Feature matching result of ORB
Figure 6: Feature matching result of SuperGlue
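To make the ORB baseline concrete, the following is a small NumPy sketch of brute-force matching on binary descriptors using Hamming distance. ORB descriptors are binary strings, so Hamming distance is the usual metric. The mutual nearest-neighbour (cross-check) filter and the function names are assumptions made for illustration; the source does not say which matcher settings were used. The descriptors below are one-byte toy values, not real ORB output.

```python
import numpy as np

def hamming_distances(desc_a, desc_b):
    """Pairwise Hamming distances between binary descriptors.

    desc_a: (Na, B) uint8, desc_b: (Nb, B) uint8 -> (Na, Nb) integer array.
    """
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    # Count differing bits by unpacking each byte into 8 bits.
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def match_cross_check(desc_a, desc_b):
    """Return (index_a, index_b) pairs that are mutual nearest neighbours."""
    dist = hamming_distances(desc_a, desc_b)
    best_b = dist.argmin(axis=1)  # nearest descriptor in B for each A
    best_a = dist.argmin(axis=0)  # nearest descriptor in A for each B
    return [(i, int(j)) for i, j in enumerate(best_b) if best_a[j] == i]

# Toy one-byte descriptors (real ORB descriptors are 32 bytes each).
desc_a = np.array([[0b00001111], [0b11110000]], dtype=np.uint8)
desc_b = np.array([[0b11110001], [0b00001110], [0b10101010]], dtype=np.uint8)
matches = match_cross_check(desc_a, desc_b)
```

A matcher of this kind compares descriptors independently of one another. That is one plausible reason it struggles on hand-drawn strokes, whereas SuperGlue reasons jointly over all keypoints and their positions.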
Approach 1: point cloud generation based on depth map refinement
Depth maps for each sketch are generated using the Depth Anything model. These maps are then refined at key feature points by interpolating the depth values using the average of nearby valid points, enhancing depth accuracy. Point clouds constructed from these refined depth maps for each view are merged. However, this method resulted in poorly aligned point clouds, exhibiting visible discrepancies in scale and dual-layered geometries (Figure 7).
Figure 7: Approach 1 result
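The refinement step in Approach 1 can be sketched as follows. This is an illustrative reading of "the average of nearby valid points", not the project's implementation. The window radius, the definition of a valid depth (finite and positive), and the function name refine_depth_at_keypoints are all assumptions.

```python
import numpy as np

def refine_depth_at_keypoints(depth, keypoints, radius=1):
    """Replace the depth at each (row, col) keypoint with the mean of valid
    depths in a square window around it. The input array is left unmodified.
    """
    refined = depth.astype(np.float64).copy()
    h, w = depth.shape
    for r, c in keypoints:
        r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
        window = depth[r0:r1, c0:c1]
        # Assumed validity rule: finite and strictly positive.
        valid = window[np.isfinite(window) & (window > 0)]
        if valid.size:
            refined[r, c] = valid.mean()
    return refined

# Toy depth map with a hole (0) at a matched keypoint.
depth = np.full((5, 5), 2.0)
depth[2, 2] = 0.0
refined = refine_depth_at_keypoints(depth, [(2, 2)], radius=1)
```

Because each view is refined independently, this kind of local smoothing does not reconcile the overall depth scale between the two sketches, which is consistent with the scale discrepancies reported above.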
Approach 2: direct point cloud generation from matched features
This method constructs a point cloud directly from features matched across two sketches. For each matched point, a 3D point is derived based on its image coordinates and corresponding depth value. Despite its higher precision in alignment, this approach produced a sparse point cloud due to an insufficient number of matched points, limiting the reconstruction's completeness (Figure 8).
Figure 8: Approach 2 result
Approach 3: point cloud generation based on homography application
Homography [8] is applied to rectify perspective discrepancies between the sketches before depth estimation and point cloud generation. This geometric transformation aligns the sketches based on matched features, facilitating more accurate depth map generation for each aligned sketch (Figure 9). The resulting point clouds are merged into a unified 3D model so that all data share a common spatial reference. Outliers are then removed from the merged point cloud, and the views are finely aligned using techniques such as Iterative Closest Point (ICP) [9]. This third approach achieves a more coherent result (Figure 10). It effectively demonstrates the spatial relationship between the views and maintains a consistent scale throughout the reconstructed scene. However, the two geometries are still not perfectly aligned.
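As a rough illustration of how a homography can be estimated from matched points, the sketch below implements the basic Direct Linear Transform (DLT) described in [8], using NumPy only. It assumes exact, outlier-free correspondences and omits coordinate normalization. In practice a robust estimator such as OpenCV's cv2.findHomography with RANSAC would usually be preferred. The function names and the matrix h_true are hypothetical.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 matrix H with dst ~ H @ src from >= 4 point pairs (DLT)."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # assumes h[2, 2] is not zero

def apply_homography(h, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]

# Synthetic check: recover a known homography from four corner points.
h_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, -3.0],
                   [0.001, 0.0005, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100]], dtype=np.float64)
dst = apply_homography(h_true, src)
h_est = estimate_homography(src, dst)
```

A homography is exact only for points on a single plane or for a purely rotating camera. A street façade is roughly planar, which may explain why this alignment helps while still leaving residual misalignment in the off-plane geometry.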
Figure 9: Depth maps after homography application
Figure 10: Approach 3 result
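The fine-alignment step can be illustrated by the closed-form rigid fit that sits at the core of each ICP iteration, in the spirit of the SVD method of [9]. The sketch assumes point correspondences are already known. A full ICP loop would re-estimate them by nearest-neighbour search on every iteration. The determinant-based reflection guard is a common addition, and the function name is hypothetical.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst.

    src, dst: (N, 3) arrays of paired points, so that dst ~= src @ R.T + t.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    cov = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(cov)
    # Flip the last axis if the SVD solution would be a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t

# Synthetic check: a 30-degree rotation about z plus a translation.
angle = np.deg2rad(30.0)
r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
src = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=np.float64)
dst = src @ r_true.T + t_true
r_est, t_est = fit_rigid_transform(src, dst)
```

A rigid fit corrects only rotation and translation. If the two depth maps differ in scale or are locally distorted, as is likely with monocular depth predicted from sketches, some misalignment will remain after ICP.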
Approach 4: point cloud generation based on refined key point adjustment
Unlike previous methods in which both images contribute equally to the reconstruction, this fourth strategy relies on the first image to establish the primary structure of the 3D model. The second image is supplementary for refining and adjusting the 3D point cloud by manipulating matched keypoints. This method leads to a cleaner and more precise 3D model, making this approach effective for reconstructions with minimal overlap (Figure 11).
Figure 11: Approach 4 result
Conclusion
In conclusion, this research proposes a pipeline capable of converting historical street-view sketches into 3D scenes based on depth estimation and feature matching techniques. This approach is helpful for digital heritage and virtual reality applications, enabling the reconstruction of historical sites from limited visual data.
However, the project encounters challenges with multi-view reconstruction because sketches of the same scene from multiple viewpoints are rarely available. This scarcity complicates both the accurate calculation of camera intrinsic parameters and proper image alignment. Additionally, relying solely on photogrammetry techniques proved inadequate for high-quality scene reconstruction from depth maps and feature matching.
References
[1] C. Dickens, Oliver Twist. Ware, England: Wordsworth Editions, 1992.
[2] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A Modern Library for 3D Data Processing.” arXiv, Jan. 29, 2018. doi:
[3] “Datasets « Nathan Silberman.” Accessed: Apr. 30, 2024. [Online]. Available: https://cs.nyu.edu/~fergus/datasets/
[4] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data.” arXiv, Apr. 07, 2024. doi: 10.48550/arXiv.2401.10891.
[5] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction.”
[6] “OpenCV: ORB (Oriented FAST and Rotated BRIEF).” Accessed: May 14, 2024. [Online]. Available: https://docs.opencv.org/4.x/d1/d89/tutorial_py_orb.html
[7] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperGlue: Learning Feature Matching with Graph Neural Networks.” arXiv, Mar. 28, 2020. doi: 10.48550/arXiv.1911.11763.
[8] R. Hartley and A. Zisserman, “Multiple View Geometry in Computer Vision, Second Edition.”
[9] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-Squares Fitting of Two 3-D Point Sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 5, pp. 698–700, Sep. 1987, doi: