A technical review on: Methods of building rooftop edge detection.

Akshay VR

June 20, 2025

This is Part 3 in the series of articles on large-scale solar potential estimation. The previous articles were Part 1: A beginner’s guide to solar potential and its estimation, and Part 2: Large Scale Solar Potential Analysis: Indian Overview.

Introduction

India is said to have huge solar potential, and it is estimated that its entire energy demand could be met by using only 0.1% of the country’s land. At present, about 63% of India’s total energy demand is still met through thermal power plants. Relying so heavily on non-renewable sources (coal, lignite and diesel) not only raises questions of sustainability but is also a major source of emissions that degrade environmental quality. We are all well aware of the Air Quality Index in our country’s capital, let alone other Indian cities.

To shift India’s dependency away from coal, the best alternative we have is solar energy. Researchers widely regard solar energy as one of the cleanest sources of electricity, and given the figures above, which reflect India’s huge solar potential, we can certainly exploit this abundant energy source.

Development in this domain has been slow, especially in the residential rooftop sector, although the market is showing an upward trend owing to continuously declining solar panel prices over the last decade. There is therefore a pressing need to map solar potential at the village, city, state and country levels. Precise estimation of rooftop solar potential helps in sustainable development, energy policy making, and renewable energy consumption. It is also crucial for revealing how much electricity could be produced by deploying PV systems on building rooftops, and whether it would meet the energy demand.

Accurate city-level potential estimates will also serve as a roadmap for large solar companies to identify potential markets, and for global banks funding renewable development to locate potential sites.

How is large-scale potential estimation done?

For mapping city-scale solar potential, we generally use geospatial tools along with machine learning algorithms. Satellite imagery is essential, since it serves as the base map for building roof segmentation. High-resolution satellite images are preferred for accurate edge detection, as these tools mostly work by pixel-level classification. The major steps in the estimation process are:

Accurate rooftop edge detection
Exclusion of roof area based on shadow analysis
Estimating potential on roof area using open source tools

In this part of the series, we discuss in detail the various tools and ongoing research on accurate rooftop edge detection.

Basic Architecture: Convolutional Neural Network (CNN)

Precise, consistent and stable automated extraction of buildings remains a huge challenge in the processing of aerial or remote sensing images, although remarkable progress has been made in recent years. The difficulty lies in dividing image pixels into object classes, because high-spatial-resolution imagery usually has complex data features that frequently occur in heterogeneous forms.

Whenever we work with image data, we usually choose Convolutional Neural Network (CNN) architectures, a family of deep learning models. They are preferred because of their inherent convolving feature: they work on individual patches of pixels instead of the entire image at once. This leads to faster feature detection and reduced memory requirements. Let us look in detail at how a CNN works.

How does a Convolutional Neural Network (ConvNet/CNN) work?

The architecture of a ConvNet is analogous to the connectivity of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlaps to cover the entire visual area.

A basic ConvNet consists of:

Input image
The input image is either a color image consisting of red, green and blue (RGB) layers or a grayscale image consisting of a single layer. With satellite images, whose sizes can easily reach 8K (7680 × 4320), processing becomes very computationally intensive. The role of the ConvNet here is to reduce the images into a form that is easier to process, without losing features that are critical for a good prediction.
Convolution layers
A convolution layer consists of a set of filters (or kernels) that extract different features from an image, such as color, edges and contrast. These filters can be of different sizes. A 3 × 3 filter, for example, passes over the entire image with a certain stride value. For images with multiple channels (e.g. RGB), the kernel has the same depth as the input image.
Pooling layers
Like convolution layers, pooling layers decrease the computational power required to process the data through dimensionality reduction. Pooling is also useful for extracting dominant features that are rotationally and positionally invariant, which helps the model train effectively.
Fully connected layer
After multiple convolution and pooling layers, the feature maps are flattened into a single column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied at every iteration of training. Over a series of epochs, the model learns to distinguish between dominant and certain low-level features in images and to classify them using the Softmax classification technique.
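
To make the convolution and pooling steps concrete, here is a minimal sketch in plain Python. The 6 × 6 image and the Sobel-like vertical-edge filter are made-up values for illustration, and no padding is assumed. Real models rely on libraries such as TensorFlow rather than nested loops.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image (no padding) and return the feature map."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            # Weighted sum of the k x k patch currently under the kernel.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Hypothetical grayscale patch: dark left half, bright right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

# Hypothetical 3 x 3 filter that responds to vertical edges.
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

edges = convolve2d(image, kernel)                # 4 x 4 feature map
pooled = max_pool2d(edges)                       # 2 x 2 after pooling
flattened = [v for row in pooled for v in row]   # input to the dense layer
```

The feature map is non-zero only where the kernel straddles the dark-to-bright boundary, which is the behaviour an edge detector relies on. Pooling then shrinks the 4 × 4 map to 2 × 2 while keeping the strongest responses.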

CNN based models for building edge detection

In this article we discuss three popular architectures for building edge detection: the Fully Convolutional Neural Network (FCNN), the Multi-modal Deep CNN and DeepRoof. Let us look at these architectures, and their respective accuracy, one by one.

Fully Convolutional Neural Network (FCNN), Wei Liu et al. (2019)

The input data consists of high-spatial-resolution UAV images and the respective digital surface model (DSM). First, the ground truth image is prepared using the UAV images and DSM. Subsequently, the three-channel UAV images and one-channel DSM images are fused into a four-channel image using the Python bindings of the Geospatial Data Abstraction Library (GDAL 2.4.2). The fused four-channel image and the ground truth image are sliced to generate a series of 256 × 256 patches, which are then fed into the chained network (CFCN) for training and testing, using a GPU version of TensorFlow 2.0 with the Keras application programming interface (API).
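
The fusion and slicing steps can be sketched on a toy example. This is only an illustration with nested lists, not the GDAL-based workflow of the paper: the function names are hypothetical, the scene is 4 × 4 and tiled at size 2 for readability, and dropping partial tiles at the image border is an assumption.

```python
def fuse_rgb_dsm(rgb, dsm):
    """Stack an H x W grid of (R, G, B) tuples with an H x W grid of DSM heights."""
    if len(rgb) != len(dsm) or len(rgb[0]) != len(dsm[0]):
        raise ValueError("RGB image and DSM must share the same grid")
    return [
        [(*pixel, height) for pixel, height in zip(rgb_row, dsm_row)]
        for rgb_row, dsm_row in zip(rgb, dsm)
    ]


def slice_patches(image, size=256):
    """Cut non-overlapping size x size patches; partial border tiles are dropped."""
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            patches.append(
                [row[left:left + size] for row in image[top:top + size]]
            )
    return patches


# Hypothetical 4 x 4 scene; a real workflow would use size=256.
rgb = [[(r, c, 0) for c in range(4)] for r in range(4)]
dsm = [[10.0 * r + c for c in range(4)] for r in range(4)]

fused = fuse_rgb_dsm(rgb, dsm)            # every pixel is now (R, G, B, height)
patches = slice_patches(fused, size=2)    # four 2 x 2 patches
```

The ground truth image would be sliced with the same offsets so that each training patch stays aligned with its label.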

Building extraction consists of two chained convolutional neural networks: one for building segmentation and one for boundary constraints.

The area threshold and the height threshold given by the DSM are applied to the results of building extraction to resolve the misclassification of buildings, roads, and man-made landscapes.
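
A minimal sketch of this kind of post-processing is shown below. The `Region` structure and both threshold values are assumptions chosen for illustration; the values used in the paper are not given here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A connected blob predicted as 'building' (hypothetical structure)."""
    label: str
    area_m2: float    # footprint area of the blob
    height_m: float   # mean DSM height above the surrounding ground


# Assumed thresholds, for illustration only.
MIN_AREA_M2 = 20.0
MIN_HEIGHT_M = 2.5


def filter_buildings(regions, min_area=MIN_AREA_M2, min_height=MIN_HEIGHT_M):
    """Keep regions that are both large enough and tall enough to be buildings."""
    return [
        region for region in regions
        if region.area_m2 >= min_area and region.height_m >= min_height
    ]


candidates = [
    Region("house", 120.0, 6.0),
    Region("road segment", 300.0, 0.2),   # large but flat
    Region("garden shed", 6.0, 2.6),      # tall enough but too small
    Region("paved plaza", 450.0, 0.0),    # man-made landscape at ground level
]
kept = filter_buildings(candidates)
```

Roads and paved areas look similar to flat roofs in RGB, so the height test is what separates them from buildings, while the area test removes small elevated objects.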

Measurement of model’s success

The model’s success was measured using four of the most widely used evaluation metrics: precision (correctness), recall (completeness), F1 score, and mean intersection over union (IoU).

Precision is the percentage of true target pixels among the detected target pixels, and recall is the percentage of true target pixels that are detected:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

IoU measures the overlap between detected buildings and labeled buildings:

$$\text{IoU} = \frac{|\text{detected} \cap \text{labeled}|}{|\text{detected} \cup \text{labeled}|}$$
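
As a worked example, the four metrics can be computed from a predicted mask and a labeled mask. The two small masks below are made up for illustration, with 1 marking a building pixel. For binary masks, IoU reduces to TP / (TP + FP + FN).

```python
def segmentation_metrics(predicted, labeled):
    """Pixel-wise precision, recall, F1 and IoU for two binary masks."""
    tp = fp = fn = 0
    for pred_row, label_row in zip(predicted, labeled):
        for p, t in zip(pred_row, label_row):
            if p and t:
                tp += 1      # building pixel correctly detected
            elif p and not t:
                fp += 1      # detected, but not a building
            elif t and not p:
                fn += 1      # building pixel that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Hypothetical 3 x 4 masks.
labeled = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

metrics = segmentation_metrics(predicted, labeled)
# TP = 3, FP = 1, FN = 1, so precision = recall = F1 = 0.75 and IoU = 0.6
```

Note that IoU is lower than F1 for the same masks, because every mismatched pixel counts against the union.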

This model achieved a competitive building recall of approximately 98.67%, 98.62%, and 99.52% in suburban, urban, and urban village areas, respectively. In particular, the results demonstrated that the method’s IoU could reach approximately 96.23%, 96.43%, and 95.76% in suburban, urban, and urban village areas.

Multi-modal deep CNN, Yang et al. (2019)

Y. Chen, L. Tang, X. Yang et al. proposed a multi-modal convolutional neural network for building edge extraction that takes panchromatic and multispectral imagery as input for better spectral-spatial resolution. Panchromatic (PAN) images have only one spectral band but a higher spatial resolution than multispectral (MS) images. MS images, in comparison, have greater spectral resolution. A single CNN is not enough to make good use of the spatial-spectral information available from these two image types, so two CNNs are used, one for each type of image.

“For the multispectral CNN first, to mine the building contextual features present, we use the six complex-value convolutional layers, with a size of 2 × 2, 4 × 4, 6 × 6, 8 × 8, 10 × 10, and 12 × 12, respectively”. Then, to reduce the size of the feature map, a self-adaptation pooling layer is used in the multispectral CNN over a 2 × 2 spatial window with a stride of two.

Finally, to reduce the number of parameters in the multispectral CNN, global average pooling is used over a 3 × 3 spatial window with a stride of one.

For the panchromatic CNN, six complex-value convolutional layers with sizes of 1 × 1, 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11 are used to extract the spatial information. Then, to reduce the dimension of the final feature map, a self-adaption pooling layer is applied over 2 × 2 spatial windows with a stride of two. To counter overfitting, the final complex-value convolutional layer is followed by global average pooling, performed over 5 × 5 spatial windows with a stride of one.
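
The effect of these window sizes and strides on the feature map can be checked with the standard output-size formula, floor((n + 2p − k) / s) + 1. In the sketch below, the 64-pixel input and the absence of padding are assumptions, and each kernel is applied to the same input independently, since the exact layer wiring is not described here.

```python
def output_size(input_size, window, stride=1, padding=0):
    """Side length of the output of a convolution or pooling window."""
    return (input_size + 2 * padding - window) // stride + 1


PATCH = 64                         # hypothetical input side length in pixels
PAN_KERNELS = (1, 3, 5, 7, 9, 11)  # kernel sizes listed for the panchromatic CNN

# Output side length for each kernel applied to the same input, without padding.
conv_sizes = {k: output_size(PATCH, k) for k in PAN_KERNELS}

# A 2 x 2 pooling window with a stride of two halves the feature map.
pooled_size = output_size(PATCH, window=2, stride=2)
```

Larger kernels see more context per output pixel but shrink the map more, which is one reason padding is commonly used when different kernel sizes have to be combined.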

“The modified Rectified linear unit (M-ReLU) activation function is used at every stage of the network. After global average pooling and fully connected layer, the Softmax classifier is applied to extract building at spectral-spatial contextual features”. Precision and Recall evaluation metrics are used for checking the accuracy of the model.
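
The last two stages can be sketched as follows. This uses the textbook definitions of global average pooling (one mean value per feature map) and Softmax, not the windowed pooling or the M-ReLU variant of the paper. The two feature maps are made-up values, and feeding the pooled values straight into Softmax, skipping the fully connected layer, is a simplification.

```python
import math


def global_average_pool(feature_maps):
    """Collapse each H x W feature map to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final feature maps, one per class: building, background.
feature_maps = [
    [[2.0, 4.0], [6.0, 8.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]

scores = global_average_pool(feature_maps)   # [5.0, 1.0]
probabilities = softmax(scores)              # building is far more likely
```

Because global average pooling has no trainable weights, it cuts the parameter count compared with a large dense layer, which is why it is often used against overfitting.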

DeepRoof, Stephen et al. (2019)

DeepRoof relies on the assumption that roof size and roof type, which are essential for estimating the solar system a building can support, are identifiable in high-resolution UAV images.

DeepRoof’s approach has three key steps:

Terrain Segmentation uses deep vision techniques to create a terrain outline of the input image by identifying all the planar roof segments and trees in the image.

Topology Estimation creates a representation of the topology using the terrain outline from the previous step. The heights of the building and of nearby structures that may cast shadows on the roof are approximated using publicly available datasets.

Solar Potential Analysis estimates the per-pixel solar potential of the roof using the output from the previous steps and historical solar irradiance data. Moreover, the algorithm identifies roof locations where panels will receive maximum sunlight, accounting for shade from nearby structures.
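
To see why the heights of nearby structures matter, consider the basic shadow geometry: an obstacle that rises h metres above the roof casts a shadow of length h / tan(sun elevation) at roof level. The sketch below is not DeepRoof's algorithm. The function names are hypothetical, and it assumes the obstacle lies directly between the sun and the roof point.

```python
import math


def shadow_length(obstacle_height_m, roof_height_m, sun_elevation_deg):
    """Horizontal reach, at roof level, of the shadow cast by an obstacle."""
    if not 0 < sun_elevation_deg <= 90:
        raise ValueError("sun elevation must be in (0, 90] degrees")
    excess = obstacle_height_m - roof_height_m
    if excess <= 0:
        return 0.0  # obstacle is not taller than the roof: no shadow on it
    return excess / math.tan(math.radians(sun_elevation_deg))


def is_shaded(distance_m, obstacle_height_m, roof_height_m, sun_elevation_deg):
    """True if a roof point at distance_m from the obstacle lies in its shadow."""
    reach = shadow_length(obstacle_height_m, roof_height_m, sun_elevation_deg)
    return distance_m < reach


# Hypothetical scene: a 12 m tree next to a 6 m roof.
reach_at_45 = shadow_length(12.0, 6.0, 45.0)   # about 6 m
near_point = is_shaded(5.0, 12.0, 6.0, 45.0)   # True
far_point = is_shaded(8.0, 12.0, 6.0, 45.0)    # False
low_sun = is_shaded(8.0, 12.0, 6.0, 30.0)      # True: shadow is about 10.4 m
```

Repeating such a test over the sun positions of a year gives, for every roof pixel, the fraction of time it is shaded, which can then be combined with irradiance data.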

Conclusion: What are the constraints of the above methods?

Getting access to high-quality aerial images for every location is a hefty task. The available architectures for building edge detection show promising results, but they are still not foolproof. There is also no “one size fits all” architecture for every location: different architectures give varied accuracy depending on the input images. Roof types also vary considerably from one location to another. For example, if you train your model on a location with more flat roofs than other types, the architecture will not perform satisfactorily in locations dominated by other roof types such as pitched, gabled or mansard roofs. This is a form of overfitting to the training data.

Still, there has been significant progress in the domain of Artificial Intelligence in the past decade, and many research works are ongoing in this field to build a robust model that performs satisfactorily for any given input data. Several kinds of auxiliary data can also be used as input to enhance accuracy.

In this article we have seen the general methodology followed for estimating solar potential on a large scale. We have discussed in detail some of the available tools for building rooftop extraction, which is the first and most crucial step. It is followed by shadow analysis and solar potential estimation using several open source tools, which were discussed in Part 1: A beginner’s guide to solar potential and its estimation.
