Through a computer vision machine learning model!
Read on to find out how we tackled this.
We are a team of data scientists tackling the problem of distinguishing roads from background in satellite imagery.
Roads and terrain are ever-changing, and some places find it difficult to update their maps frequently.
We believe that no matter where one is in the world, they deserve access to readable and accurate maps, so we set out to build an efficient solution.
Background
Satellite imagery has become inextricably intertwined with our lives, especially in first-world countries, where we leverage it for many key functions such as GPS navigation.
But what about countries that are less technologically advanced and require extra assistance in mapping out their roads?
Problem Statement
Segmentation separates and clusters the elements of interest in an image. Another name for this is pixel-level classification: assigning a label to every pixel in the image. In our case, we only need to distinguish proper roads from mud roads, flat expanses, and the like, and then trace the roads out.
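To make pixel-level classification concrete, here is a minimal NumPy sketch; the shapes and the toy mask are illustrative, not taken from our pipeline:

```python
import numpy as np

# A satellite image: height x width x 3 RGB channels (values are illustrative).
image = np.random.randint(0, 256, size=(1024, 1024, 3), dtype=np.uint8)

# A segmentation model assigns a label to every pixel. For binary road
# segmentation, the output is a mask of the same height and width,
# where 1 = road and 0 = background.
mask = np.zeros((1024, 1024), dtype=np.uint8)
mask[500:510, :] = 1  # a toy horizontal "road" crossing the image

print(f"{mask.mean():.2%} of pixels are labeled road")
```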
In many developing countries, roads are not easily accessible or recognizable, and maps are hard to find; this limitation can hamper response activities during natural disasters.
We built a road segmentation model that predicts roads from satellite imagery. The intent is for non-profits and rescue teams to use it to identify roads, giving them the data they need to reach populations in need.
Data
We utilized the DeepGlobe Road Extraction dataset, which was part of a DeepGlobe Challenge held in 2018.
This dataset consists of 6226 RGB satellite images, each 1024 x 1024 pixels. The images were collected by DigitalGlobe satellites over areas of Thailand, Indonesia, and India at a pixel resolution of 50 cm.
Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., & Raskar, R. (2018). DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images. ArXiv. https://doi.org/10.48550/ARXIV.1805.06561
TRAINING IMAGES AND MASKS: 6226
VALIDATION IMAGES: 1243
TEST IMAGES: 1101
PIXEL RESOLUTION: 50 cm
OUR APPROACH
Convolutional neural networks (CNNs) are well suited to analyzing visual imagery. A benefit of CNNs is that they require less preprocessing before training: instead of relying on hand-engineered features, they learn the relevant features directly from the data.
To validate our model, we split the training images and masks in an 80/20 ratio, since the validation and test images provided with the dataset do not have ground-truth masks.
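As a sketch of how such a split can be done (the file paths and naming pattern are assumptions, and scikit-learn's train_test_split is used for illustration):

```python
from glob import glob
from sklearn.model_selection import train_test_split

# Pair each satellite image with its ground-truth mask (paths are hypothetical).
images = sorted(glob("data/train/*_sat.jpg"))
masks = sorted(glob("data/train/*_mask.png"))

# Hold out 20% of the labeled training data for validation, since the
# dataset's own validation and test images ship without ground-truth masks.
train_imgs, val_imgs, train_masks, val_masks = train_test_split(
    images, masks, test_size=0.2, random_state=42
)
print(len(train_imgs), "training pairs,", len(val_imgs), "validation pairs")
```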
Our code is hosted on Github.io, with work done on the AWS cloud; each of us worked through SSH on parts of the model.
We leveraged AWS EC2 instances to power and train our models. One of our initial runs took nearly 10 hours! After switching instance types, we cut the duration down to 5 hours for 25 epochs.
OUR MODEL(S)
We started out with three models to test and found that Unet yielded the best results with the greatest efficiency.
Originally created for biomedical image segmentation, Unet uses an encoder path that captures the context of the image, while its decoder path recovers the precise, pixel-level localization needed to delineate features.
This was the final model we utilized after observing each model's performance.
Best performing model in terms of loss, accuracy, and epoch duration
Part of an open-source suite from Google, DeepLabV3+ uses atrous (dilated) convolution, which enlarges the receptive field of the convolution layers without reducing the resolution of their output, helping with dense, pixel-wise prediction (see the sketch after the model descriptions). This enhances the segmentation model's ability to capture context at multiple scales and enables larger, denser output feature maps.
Second best performing model in terms of loss
Third best performing model in terms of epoch duration
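To illustrate the atrous (dilated) convolution idea behind DeepLabV3+, here is a minimal PyTorch sketch; the channel counts and sizes are illustrative, not DeepLabV3+'s actual configuration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)  # batch, channels, height, width

# A standard 3x3 convolution sees a 3x3 neighborhood.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# An atrous 3x3 convolution with dilation=2 spreads the same 9 weights
# over a 5x5 neighborhood, enlarging the receptive field at no extra
# parameter cost and without downsampling the feature map.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 64, 128, 128])
print(atrous(x).shape)    # torch.Size([1, 64, 128, 128]) - same resolution
```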
Built upon a pretrained ResNet-101 backbone, the Pyramid Attention Network (PAN) has a Feature Pyramid Attention (FPA) module, which fuses the outputs of convolutions at multiple scales into one, incorporating both larger and smaller features in the same output. Global Attention Upsample (GAU) modules allow the model to focus on specific features by ignoring (or gating out) irrelevant information.
Third best performing model in terms of loss
Second best performing model in terms of epoch duration
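All three architectures are available off the shelf. A minimal sketch of how they can be instantiated with the segmentation_models_pytorch library (the library choice and encoder settings here are assumptions for illustration, not necessarily our exact setup):

```python
import segmentation_models_pytorch as smp

# Binary road segmentation: 3 input channels (RGB), 1 output channel (road).
common = dict(
    encoder_name="resnet101",    # assumed backbone
    encoder_weights="imagenet",  # pretrained encoder weights
    in_channels=3,
    classes=1,
)

unet = smp.Unet(**common)              # encoder-decoder with skip connections
deeplab = smp.DeepLabV3Plus(**common)  # atrous convolution + ASPP decoder
pan = smp.PAN(**common)                # feature pyramid attention + GAU
```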
Outputs
Example outputs from all three models, showing varying accuracy across different road densities in a dense, urban landscape.
Image: Unet, DeepLabV3+, and PAN outputs from top to bottom
Outputs
Example outputs from all three models, showing varying accuracy across different road densities in a rural landscape.
Image: Unet, DeepLabV3+, and PAN outputs from top to bottom
OUR BEST MODEL
Our best model was a Unet trained for 25 epochs with a batch size of 100 and a learning rate of 0.003. It shows a large improvement over the initial model, as seen below.
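A condensed sketch of that training configuration: the epoch count, batch size, and learning rate are the ones quoted above, while the optimizer, loss function, and dummy data loader are assumptions for illustration:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(encoder_name="resnet101", in_channels=3, classes=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)  # lr from above
loss_fn = smp.losses.DiceLoss(mode="binary")  # assumed loss; robust to the
                                              # unbalanced road/background split

# Dummy loader for illustration; in practice this yields real (image, mask) pairs.
data = torch.utils.data.TensorDataset(
    torch.randn(200, 3, 256, 256),
    torch.randint(0, 2, (200, 1, 256, 256)).float(),
)
train_loader = torch.utils.data.DataLoader(data, batch_size=100, shuffle=True)

for epoch in range(25):
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        logits = model(images.cuda())
        loss = loss_fn(logits, masks.cuda())
        loss.backward()
        optimizer.step()
```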
Best Model Outputs
Example outputs from our best model
Image: 3 rows of outputs from the batch-size-100 Unet model
Below are our losses and IoU (Intersection over Union, used as a substitute for accuracy because the unbalanced road and background classes would skew pixel-wise accuracy), as well as a confusion matrix.
In the confusion matrix, "true" refers to background pixel classifications and "false" refers to road pixel classifications.
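As a sketch of how per-pixel IoU and confusion-matrix counts can be computed from binary masks (plain NumPy; note that this sketch treats road as the positive class, the opposite of the labeling convention in our matrix above):

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union for binary masks (1 = road, 0 = background)."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0

def confusion_counts(pred, target):
    """Per-pixel confusion-matrix entries, treating road (1) as positive."""
    tp = int(np.logical_and(pred == 1, target == 1).sum())
    fp = int(np.logical_and(pred == 1, target == 0).sum())
    fn = int(np.logical_and(pred == 0, target == 1).sum())
    tn = int(np.logical_and(pred == 0, target == 0).sum())
    return tp, fp, fn, tn
```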