Sketch to Scene| BiruCao

Sketch to Scene

Project:

Date:

Keywords:

Street View Scene Reconstruction via Multi-View Sketches Using Deep Learning

Project Presented at ACADIA (Association for Computer-Aided Design in Architecture) 2024 Conference

2024.11

3D Reconstruction, Deep Learning, Computer Vision

This project explores the reconstruction of a 3D street view using multi-view sketches, specifically focusing on recreating Field Lane from 19th century London, a location described in Charles Dickens’s Oliver Twist but no longer exists. The study is important for those interested in experiencing the spatial dimensions of historical settings in a three-dimensional context. It employs a method that extracts depth information from images to reconstruct a 3D point cloud.

Introduction

Figure 1: A 1847 wood engraving of Field Lane

Field Lane, as depicted in Charles Dickens’s Oliver Twist, is an iconic scene from historical London, portrayed as “a dirty or more wretched place” [1]. While this place no longer exists, there are sketches that provide a glimpse into this past space (Figure 1). This research aims to answer the following questions: Can we reconstruct Field Lane in 3D using these sketches? Can this method be generalized to convert other multi-view 2D sketches into 3D reconstructions? This research aims to build a pipeline that converts 2D sketches into 3D scenes using deep learning and computer vision techniques.

Overview

Current methods for converting sketches to 3D models mainly generate isolated 3D objects from sketches on plain white backgrounds, lacking additional contextual information. There remains a gap in translating hand-drawn sketches into 3D with a contextual background, such as architectural street views. This research aims to fill that gap by utilizing deep learning models to convert 2D sketches into 3D scenes, thereby reconstructing architectural spaces such as street views from multi-view sketches.

This project first tests using one single image to reconstruct a 3D scene. Then, it progresses to using multi-view sketches for more complex reconstructions. In general, the experimental pipeline (Figure 2) involves estimating depth maps from drawings, which are then used to generate a 3D point cloud. This process incorporates camera intrinsic parameters and feature matching to achieve spatial accuracy in the reconstructed scenes.

Figure 2: Pipeline

Single-Sketch 3D Generation

Constructing 3D models from single images utilizes several key libraries and models: PIL is used for image manipulation and model processing. Open3d [2] is a library for processing and visualizing point clouds for 3D reconstruction.

For depth estimation, two different models are employed, and their results are evaluated for accuracy. The first is GLPN For Depth Estimation, a class from Hugging Face's transformers library, specifically for estimating depth using the GLPN model trained on the NYU dataset [3]. The second model is Depth Anything [4], a recent robust depth estimation model. The result reveals that the Depth Anything model, significantly improves depth map accuracy.

Figure 3: Depth maps generated by GLPN (top) and Depth Anything (bottom)

The depth maps predicted from the earlier steps are converted into RGB-D images ("D" stands for depth) using the Open3D library. From these RGB-D images, 3D point clouds are generated. This process involves mapping each pixel in the image to a 3D point based on its depth value, normalized and adjusted for perspective distortion relative to the camera's intrinsic parameters. The point cloud is initially visualized by overlaying the original sketch as its texture. Subsequent processes include outlier removal and surface reconstruction to refine the point cloud further (Figure 3). During this phase, normals are estimated and aligned, followed by employing a Poisson surface reconstruction algorithm [5] to derive a mesh from the point cloud, enhancing the visual representation.

Figure 3: Mesh generated from the point cloud after refinement

Multi-Sketch 3D Generation

This section explores several approaches to integrate multiple sketches to reconstruct a 3D scene through feature matching, aiming to achieve a more cohesive and detailed reconstruction.
Two sketches (Figure 4) depicting Chick Lane, close to and visually similar to Field Lane, are chosen for the 3D reconstruction. These sketches portray the same street façade but from slightly varied angles, which is rare in the available materials.

Figure 4: Two sketches of the Chick Lane

Feature Matching

Feature matching is first tested using the ORB (Oriented FAST and Rotated BRIEF) detector [6] combined with a brute force matcher. The result doesn’t look convincing, with few matched points (Figure 5). To enhance this, the project incorporates SuperGlue [7], an advanced neural network model that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. It correctly matches architectural features across the sketches, resulting in 188 matches (Figure 6), significantly improving alignment accuracy.

Figure 5: Feature matching result of ORB

Figure 6: Feature matching result of SuperGlue

Approach 1: point cloud generation based on depth map refinement

Depth maps for each sketch are generated using the Depth Anything model. These maps are then refined at key feature points by interpolating the depth values using the average of nearby valid points, enhancing depth accuracy. Point clouds constructed from these refined depth maps for each view are merged. However, this method resulted in poorly aligned point clouds, exhibiting visible discrepancies in scale and dual-layered geometries (Figure 7).

Figure 7: Approach 1 result

Approach 2: direct point cloud generation from matched features

This method constructs a point cloud directly from features matched across two sketches. For each matched point, a 3D point is derived based on its image coordinates and corresponding depth value. Despite its higher precision in alignment, this approach produced a sparse point cloud due to an insufficient number of matched points, limiting the reconstruction's completeness (Figure 8).

Figure 8: Approach 2 result

Approach 3: point cloud generation based on homography application

Homography [8] is applied to rectify perspective discrepancies between sketches before depth estimation and point cloud generation. This geometric transformation aligns the sketches based on matched features, facilitating more accurate depth map generation for each aligned sketch (Figure 9). The resulting point clouds from each sketch are merged to form a unified 3D model, ensuring that all data share a common spatial reference. The merged point cloud is refined to remove outliers and fused using techniques such as Iterative Closest Point (ICP) [9] for fine alignment and merging. The third approach achieved a more coherent result (Figure 10). It effectively demonstrates the spatial relationship between different views, maintaining a consistent scale throughout the reconstructed scene. However, the two geometries are not perfectly aligned.

Figure 9: Depth maps after homography application

Figure 10: Approach 3 result

Approach 4: point cloud generation based on refined key point adjustment

Unlike previous methods in which both images contribute equally to the reconstruction, this fourth strategy relies on the first image to establish the primary structure of the 3D model. The second image is supplementary for refining and adjusting the 3D point cloud by manipulating matched keypoints. This method leads to a cleaner and more precise 3D model, making this approach effective for reconstructions with minimal overlap (Figure 11).

Figure 11: Approach 4 result

Conclusion

A major challenge in multi-view sketch reconstruction is the limited availability of sketches from multiple perspectives. Future work will focus on expanding the dataset with additional viewpoints to improve scene coverage. Testing the method across different historical periods and architectural styles will evaluate its adaptability and robustness.
In conclusion, this research offers a promising method for digital heritage preservation, enabling the reconstruction of architectural spaces from minimal visual data and providing enriched virtual experiences of historical settings.

References

[1] C. Dickens, Oliver Twist. Ware, England: Wordsworth Editions, 1992

[2] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A Modern Library for 3D Data Processing.” arXiv, Jan. 29, 2018. doi:

10.48550/arXiv.1801.09847.

[3] “Datasets « Nathan Silberman.” Accessed: Apr. 30, 2024. [Online]. Available:

https://cs.nyu.edu/~fergus/datasets/

[4] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth Anything: Unleashing the Power of Large-Scale

Unlabeled Data.” arXiv, Apr. 07, 2024. doi: 10.48550/arXiv.2401.10891.

[5] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction”.

[6] “OpenCV: ORB (Oriented FAST and Rotated BRIEF).” Accessed: May 14, 2024. [Online]. Available:

https://docs.opencv.org/4.x/d1/d89/tutorial_py_orb.html

[7] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperGlue: Learning Feature Matching with Graph

Neural Networks.” arXiv, Mar. 28, 2020. doi: 10.48550/arXiv.1911.11763.

[8] R. Hartley and A. Zisserman, “Multiple View Geometry in Computer Vision, Second Edition”.

[9] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-Squares Fitting of Two 3-D Point Sets,” IEEE Transactions on

Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 5, pp. 698–700, Sep. 1987, doi:

10.1109/TPAMI.1987.4767965.

Using DUSt3R Model

DUSt3R, a 3D reconstruction model, can create 3D models from a series of images without needing pre-existing knowledge of camera calibration or positions. Not only does it generate a 3D model and provide depth information, but it can also estimate camera parameters on its own. However, it’s been designed primarily for realistic images, so our research investigates whether DUSt3R can be adapted for sketches as well.

Figure 12: The DUSt3R model's RGB maps of three Chick Lane drawings

RGB image

Confidence map

The resulting 3D point cloud projects each sketch with an accurate spatial relationship, suggesting that this multi-view sketch method is feasible for historical 3D reconstruction. This transition from single-image reconstructions to multi-view scenarios marks a significant step forward in reconstructing complex architectural scenes. The study provides a way to experience the spatial dimensions of historical settings in a new, immersive 3D context.

ezgif.com-video-to-gif-converter (1).gif

Figure 13: The reconstruction result

Contents