Umbra Space is a 9-year-old manufacturer and operator of a Synthetic Aperture Radar (SAR) satellite fleet. These satellites can see through clouds, smoke and rain and can capture images day or night at resolutions as fine as 16 cm. SpaceX launched Umbra's first satellite in 2021 and the company currently has a total of eight in orbit. I covered their fleet and open data programme in my Umbra's Open Satellite Feed post, which I published in April.

A few weeks ago, Umbra released almost 1,000 satellite images displaying ships around the world. The images have been made available in a number of SAR image formats and total almost 7.5 TB in size.

Below I've annotated one of the images from this dataset. I've used QGIS to tint the satellite imagery red, highlighted a few ships in green and used CARTO's dark basemap for geographical reference.

These images were taken across a diverse set of locations including many geopolitical hotspots.

$ python3

from glob import glob
import json

from shapely.geometry import shape

pattern = '/home/mark/ship_detection_testdata/*/*/*.json'

with open('umbra_ships.csv', 'w') as f:
    f.write('geom\n')

    for filename in glob(pattern):
        wkt_ = shape(json.loads(open(filename).read())
                         ['collects'][0]['footprintPolygonLla'])\
                   .centroid\
                   .wkt
        f.write('"%s"\n' % wkt_)
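Shapely does the centroid calculation above. As a sanity check, the same value can be derived without any dependencies using the standard shoelace-based centroid formula. A minimal sketch, using a made-up rectangular footprint rather than a real Umbra one:

```python
# Centroid of a simple (non-self-intersecting) polygon ring via the
# shoelace formula -- the same value shapely's .centroid returns for
# a plain polygon. The ring below is made up, not from Umbra metadata.
def ring_centroid(ring):
    a = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1  # shoelace cross-product term
        a += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    a *= 0.5
    return cx / (6.0 * a), cy / (6.0 * a)

ring = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0), (0.0, 0.0)]
print(ring_centroid(ring))  # (2.0, 1.0)
```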

In this post, I'm going to train a ship detection model that works on Umbra's SAR imagery. I'll be using Ultralytics' YOLOv5 and the High-Resolution SAR Images Dataset (HRSID) for training the model.

My Workstation

I'm using a 6 GHz Intel Core i9-14900K CPU. It has 8 performance cores and 16 efficiency cores with a total of 32 threads and 32 MB of L2 cache. It has a liquid cooler attached and is housed in a spacious, full-sized, Cooler Master HAF 700 computer case. I've come across videos on YouTube where people have managed to overclock the i9-14900KF to 9.1 GHz.

The system has 96 GB of DDR5 RAM clocked at 6,000 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive. The system is powered by a 1,200-watt, fully modular, Corsair Power Supply and is sat on an ASRock Z790 Pro RS Motherboard.

I'm running Ubuntu 22 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and I use ArcGIS Pro from time to time which only supports Windows natively.

Installing Prerequisites

I'll be using Python and a few other tools to help analyse the imagery in this post.

$ sudo apt update
$ sudo apt install \
    aws-cli \
    jq \
    python3-pip \
    python3-virtualenv

I'll set up a Python Virtual Environment and install a few packages.

$ virtualenv ~/.umbra_yolo
$ source ~/.umbra_yolo/bin/activate

$ python3 -m pip install \
    duckdb \
    geopy \
    rich \
    shapely

Below will install YOLOv5's dependencies.

$ git clone https://github.com/ultralytics/yolov5 ~/yolov5
$ pip install -r ~/yolov5/requirements.txt

I'll be using JSON Convert (jc) to convert the output of various CLI tools into JSON. This will make it much easier to compile statistics on the files discussed in this post.

$ wget https://github.com/kellyjonbrazil/jc/releases/download/v1.25.2/jc_1.25.2-1_amd64.deb
$ sudo dpkg -i jc_1.25.2-1_amd64.deb

I'll also use DuckDB, along with its JSON and Spatial extensions, in this post.

$ cd ~
$ wget -c https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb

$ ~/duckdb

INSTALL json;
INSTALL spatial;

I'll set up DuckDB to load all these extensions each time it launches.

$ vi ~/.duckdbrc

.timer on
.width 180

LOAD json;
LOAD spatial;

The maps in this post were rendered with QGIS version 3.34. QGIS is a desktop application that runs on Windows, macOS and Linux.

Downloading Satellite Imagery

Umbra currently have 7.4 TB of files in the ship detection folder of their open data S3 bucket.

$ aws s3 ls --no-sign-request \
    --summarize \
    --human \
    --recursive \
    s3://umbra-open-data-catalog/sar-data/tasks/ship_detection_testdata/ \
    | tail -n2

Total Objects: 4573
   Total Size: 7.4 TiB

I'll download the JSON metadata and GeoTIFFs. These add up to 394 GB in size.

$ mkdir -p ~/ship_detection_testdata
$ cd ~/ship_detection_testdata

$ for FORMAT in json tif; do
      aws s3 --no-sign-request \
          sync \
          --exclude="*" \
          --include="*.$FORMAT" \
          s3://umbra-open-data-catalog/sar-data/tasks/ship_detection_testdata/ \
          ./
  done

Below is a breakdown of the file counts bucketed to the nearest GB.

$ ls -lR \
    | jc --ls \
    | ~/duckdb -c "SELECT round(size / 1024 / 1024 / 1024)::int AS size_gb,
                          COUNT(*)
                   FROM READ_JSON_AUTO('/dev/stdin')
                   GROUP BY 1
                   ORDER BY 1;"

┌─────────┬──────────────┐
│ size_gb │ count_star() │
│   int32 │        int64 │
├─────────┼──────────────┤
│       0 │         3787 │
│       1 │          180 │
│       2 │           11 │
│       3 │            9 │
│       4 │            3 │
│       5 │            3 │
└─────────┴──────────────┘

The majority of files are under a GB but some are as large as five GB.
$ ls -lR \
    | jc --ls \
    | ~/duckdb -c "SELECT size,
                          filename
                   FROM READ_JSON_AUTO('/dev/stdin')
                   ORDER BY size DESC
                   LIMIT 10;"

┌────────────┬──────────────────────────────────────┐
│    size    │               filename               │
│   int64    │               varchar                │
├────────────┼──────────────────────────────────────┤
│ 5125833514 │ 2023-12-06-14-30-39_UMBRA-05_GEC.tif │
│ 4973515962 │ 2023-12-30-02-56-59_UMBRA-04_GEC.tif │
│ 4880185578 │ 2023-12-30-01-23-15_UMBRA-04_GEC.tif │
│ 4722625282 │ 2023-12-06-01-53-34_UMBRA-05_GEC.tif │
│ 4351925386 │ 2023-12-04-03-54-25_UMBRA-05_GEC.tif │
│ 4027566020 │ 2023-10-26-05-39-39_UMBRA-05_GEC.tif │
│ 3271758578 │ 2023-12-06-12-24-10_UMBRA-05_GEC.tif │
│ 3196780818 │ 2024-02-10-19-06-27_UMBRA-07_GEC.tif │
│ 3078709927 │ 2024-03-29-17-52-28_UMBRA-05_GEC.tif │
│ 2867770018 │ 2023-10-06-02-28-43_UMBRA-05_GEC.tif │
├────────────┴──────────────────────────────────────┤
│ 10 rows                                 2 columns │
└───────────────────────────────────────────────────┘
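The size bucketing above can also be reproduced without DuckDB. A minimal stdlib sketch, with hypothetical byte counts standing in for the real listing:

```python
from collections import Counter

def bucket_sizes(sizes_in_bytes):
    # Round each size to the nearest GB, mirroring
    # round(size / 1024 / 1024 / 1024)::int in the DuckDB query.
    gb = 1024 ** 3
    return Counter(round(s / gb) for s in sizes_in_bytes)

# Hypothetical file sizes, not the actual bucket contents.
sizes = [120_000_000, 900_000_000, 1_600_000_000, 5_125_833_514]
print(sorted(bucket_sizes(sizes).items()))  # [(0, 1), (1, 2), (5, 1)]
```

Note that Python's round() uses banker's rounding on exact halves while DuckDB rounds halves away from zero; for non-tie values like these the results agree.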

Locations and Equipment

There are a lot of complexities as to whose waters a ship is in. I've made an attempt to use OSM's reverse geocoder to determine which country the centre of each SAR image's footprint is in.

$ python3

import json
from glob import glob
from time import sleep

from geopy.geocoders import Nominatim
from rich.progress import track
from shapely import wkt
from shapely.geometry import shape

geolocator = Nominatim(user_agent='geopy')

with open('imagery.json', 'w') as f:
    for filename in track(glob('*/*/*.json')):
        rec = json.loads(open(filename).read())

        for collect_num, collect in enumerate(rec['collects']):
            geom = shape(collect['footprintPolygonLla'])

            # Convert the POLYGON Z to a POLYGON
            geom = wkt.loads(wkt.dumps(geom, output_dimension=2))

            try:
                osm = geolocator.reverse('%f %f' % (geom.centroid.y,
                                                    geom.centroid.x)).raw
            except Exception as exc:
                print(exc)
                osm = None

            f.write(json.dumps({
                        'osm':        osm,
                        'collect_num': collect_num,
                        'sku':        rec['productSku'],
                        'sat':        rec['umbraSatelliteName'],
                        'filename':   filename,
                        'yyyy_mm':    rec['collects'][0]['startAtUTC'][:7],
                        'centroid_x': geom.centroid.x,
                        'centroid_y': geom.centroid.y,
                        'geom':       geom.wkt
                    }, sort_keys=True) + '\n')

            sleep(1)

The imagery covers at least 53 countries and international waters. Below are the top ten countries represented.

$ jq -S .osm.address.country_code imagery.json \
    | sort \
    | uniq -c \
    | sort -rn \
    | head

    321 "cn"
     80 "us"
     59 "ir"
     34 "kr"
     33 "ru"
     31 "ae"
     30 "sg"
     29 "my"
     29 "dj"
     26 "nl"

Marine Regions has GeoPackage files delineating maritime boundaries. Seven of the nine files contain polygons that I'll use to determine the waters a given centroid is in.

$ cd ~/marineregions_org
$ ls -lh *.gpkg

..  69M .. eez_12nm_v4.gpkg
..  60M .. eez_24nm_v4.gpkg
..  77M .. eez_internal_waters_v4.gpkg
..  16M .. eez_boundaries_v12.gpkg
.. 157M .. eez_v12.gpkg
..  39M .. eez_archipelagic_waters_v4.gpkg
.. 2.1M .. ecs_boundaries_v01.gpkg
.. 7.0M .. ecs_v01.gpkg
.. 8.8M .. High_Seas_v1.gpkg

Below is a rendering of this dataset.

I'll first import these GeoPackage files into DuckDB.

$ for FILENAME in *.gpkg; do
      BASENAME=`echo "$FILENAME" | cut -d. -f1`

      ~/duckdb -c "CREATE OR REPLACE TABLE $BASENAME AS
                   SELECT *
                   FROM ST_READ('$FILENAME');" \
          waters.duckdb
  done

Polygons can overlap one another so I'll first get every point-polygon collision there is.

$ python

import json

import duckdb
from rich.progress import track

tables = ('High_Seas_v1',
          'eez_archipelagic_waters_v4',
          'ecs_boundaries_v01',
          'eez_boundaries_v12',
          'ecs_v01',
          'eez_internal_waters_v4',
          'eez_12nm_v4',
          'eez_v12',
          'eez_24nm_v4')

con = duckdb.connect(database='waters.duckdb')
df = con.sql('INSTALL spatial;')

with open('waters.json', 'w') as f:
    centroid_num = 0

    for geom in track(open('/home/mark/umbra_ships.csv'),
                      total=996):
        try:
            lon, lat = [float(x)
                        for x in geom.strip()
                                     .split('(')[-1]
                                     .split(')')[0]
                                     .split(' ')]
        except:
            continue

        for table in tables:
            df = con.sql("""LOAD spatial;

                            SELECT *
                            FROM """ + table + """
                            WHERE ST_CONTAINS(geom,
                                              ST_POINT(?, ?));""",
                         params=(lon, lat)).to_df()

            if not df.empty:
                out = {'lon': lon,
                       'lat': lat,
                       'centroid_num': centroid_num}
                out[table] = json.loads(df.to_json())
                f.write(json.dumps(out, sort_keys=True) + '\n')

        centroid_num = centroid_num + 1

I'll then take the last point-polygon collision from the results and use it as the location for each image's centroid.

with open('waters_last_geoname.json', 'w') as f:
    for line in open('waters.json'):
        rec = json.loads(line)

        out = {'centroid_num': rec['centroid_num'],
               'lat': rec['lat'],
               'lon': rec['lon']}

        tables = [x for x in rec.keys()
                  if x not in ('centroid_num', 'lat', 'lon')]

        for table in tables:
            out['GEONAME'] = rec[table]['GEONAME']['0']

        f.write(json.dumps(out, sort_keys=True) + '\n')

The above returned 881 results out of the 998 image centroids in this dataset. Below are the top 40 locations for the imagery in Umbra's dataset.

$ ~/duckdb

.maxrows 100

SELECT GEONAME,
       COUNT(*)
FROM READ_JSON('waters_last_geoname.json')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 40;

┌──────────────────────────────────────────┬──────────────┐
│                 GEONAME                  │ count_star() │
│                 varchar                  │        int64 │
├──────────────────────────────────────────┼──────────────┤
│ Chinese Exclusive Economic Zone          │          165 │
│ Chinese Internal Waters                  │          112 │
│ Chinese 12 NM                            │           36 │
│ United States Exclusive Economic Zone    │           31 │
│ United States 12 NM                      │           31 │
│ Russian Exclusive Economic Zone          │           23 │
│ Russian Internal Waters                  │           23 │
│ Malaysian Exclusive Economic Zone        │           22 │
│ Djiboutian Exclusive Economic Zone       │           18 │
│ Iranian Exclusive Economic Zone          │           17 │
│ Iranian Internal Waters                  │           16 │
│ Indonesian Exclusive Economic Zone       │           16 │
│ Panamanian Exclusive Economic Zone       │           13 │
│ Panamanian Internal Waters               │           13 │
│ Malaysian Internal Waters                │           13 │
│ Sri Lankan 12 NM                         │           12 │
│ Sri Lankan Exclusive Economic Zone       │           12 │
│ Yemeni Exclusive Economic Zone           │           12 │
│ Philippine Exclusive Economic Zone       │           12 │
│ Djiboutian Internal Waters               │           12 │
│ Emirati Exclusive Economic Zone          │           11 │
│ Emirati 12 NM                            │           11 │
│ Overlapping claim Taiwan: Taiwan / China │           10 │
│ Maldivian Exclusive Economic Zone        │           10 │
│ Maldivian Archipelagic Waters            │           10 │
│ South Korean Exclusive Economic Zone     │            9 │
│ Malaysian 12 NM                          │            9 │
│ South Korean Internal Waters             │            9 │
│ North Korean 12 NM                       │            8 │
│ North Korean Exclusive Economic Zone     │            8 │
│ Yemeni 12 NM                             │            7 │
│ Egyptian Exclusive Economic Zone         │            7 │
│ Venezuelan Exclusive Economic Zone       │            6 │
│ Djiboutian 12 NM                         │            6 │
│ Indonesian 12 NM                         │            6 │
│ Thai Internal Waters                     │            5 │
│ Thailand Exclusive Economic Zone         │            5 │
│ Singaporean 12 NM                        │            5 │
│ Pakistani Internal Waters                │            5 │
│ Indian Exclusive Economic Zone           │            5 │
├──────────────────────────────────────────┴──────────────┤
│ 40 rows                                       2 columns │
└─────────────────────────────────────────────────────────┘

Below is a breakdown of the satellites and the months in which the images were captured.

$ ~/duckdb

PIVOT READ_JSON_AUTO('imagery.json')
ON sat
USING COUNT(*)
GROUP BY yyyy_mm
ORDER BY yyyy_mm;

┌─────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ yyyy_mm │ UMBRA_04 │ UMBRA_05 │ UMBRA_06 │ UMBRA_07 │ UMBRA_08 │
│ varchar │    int64 │    int64 │    int64 │    int64 │    int64 │
├─────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 2023-01 │       62 │       32 │        0 │        0 │        0 │
│ 2023-02 │       49 │       54 │        0 │        0 │        0 │
│ 2023-03 │       22 │       24 │        0 │        0 │        0 │
│ 2023-04 │       18 │       17 │        1 │        0 │        0 │
│ 2023-05 │       11 │       11 │        1 │        0 │        0 │
│ 2023-06 │       30 │       33 │       45 │        0 │        0 │
│ 2023-07 │       38 │       43 │        8 │        0 │        0 │
│ 2023-08 │        9 │       31 │        9 │        0 │        0 │
│ 2023-09 │       23 │       26 │       19 │        0 │        0 │
│ 2023-10 │       57 │       50 │        4 │        0 │        0 │
│ 2023-11 │       24 │       15 │        1 │        0 │        0 │
│ 2023-12 │       36 │       46 │       22 │        0 │        0 │
│ 2024-01 │        7 │        2 │        7 │        0 │        0 │
│ 2024-02 │       12 │       10 │        4 │        9 │        7 │
│ 2024-03 │       20 │       18 │       12 │        6 │        8 │
│ 2024-04 │        1 │        0 │        1 │        0 │        0 │
├─────────┴──────────┴──────────┴──────────┴──────────┴──────────┤
│ 16 rows                                              6 columns │
└────────────────────────────────────────────────────────────────┘
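DuckDB's ST_CONTAINS handles the point-in-polygon tests above. For illustration, the underlying idea can be sketched in pure Python with the textbook ray-casting algorithm (this is not DuckDB's actual implementation, just the classic approach):

```python
def point_in_ring(lon, lat, ring):
    # Ray casting: count how many polygon edges a ray heading east
    # from the point crosses; an odd count means the point is inside.
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside

# A made-up square "zone" around the origin, not real boundary data.
ring = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0)]
print(point_in_ring(0.5, 0.5, ring))  # True
print(point_in_ring(2.0, 0.5, ring))  # False
```

Real EEZ polygons also have holes, multiple parts and antimeridian crossings, which is why a spatial library is the right tool for the actual lookup.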

Ship Detection Training Data

HRSID contains imagery of 16,951 ships across 5,604 images. The images span a wide range of resolutions, polarisations, sea conditions and geographies. I found a GitHub repository of HRSID's training data already nicely organised. The repository is 2.6 GB including the .git folder.

$ git clone https://github.com/Akashkalasagond/SAR-Ship-Detection/ \
    ~/SAR-Ship-Detection
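YOLO-format datasets like this one store one text file of labels per image, each line being `class x_center y_center width height` with all coordinates normalised to the range 0 to 1. A minimal sketch converting such a line back to pixel corners (the label line here is made up, not from HRSID):

```python
def yolo_label_to_pixels(line, img_w, img_h):
    # YOLO labels: "class x_center y_center width height", with the
    # four coordinates normalised by the image dimensions.
    cls, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return (int(cls),
            (xc - w / 2, yc - h / 2,   # x1, y1
             xc + w / 2, yc + h / 2))  # x2, y2

# Hypothetical ship label on an 800x800 chip.
print(yolo_label_to_pixels('0 0.5 0.5 0.1 0.2', 800, 800))
# (0, (360.0, 320.0, 440.0, 480.0))
```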

Training with YOLOv5

I'll download one of YOLOv5's pre-trained weights.

$ cd ~
$ wget https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov5nu.pt

Below I'll set up a training configuration for YOLOv5.

$ cd ~/SAR-Ship-Detection
$ vi settings.yaml

path:  "/home/mark/SAR-Ship-Detection/"
train: "/home/mark/SAR-Ship-Detection/train/images"
val:   "/home/mark/SAR-Ship-Detection/test/images"

# class names
names:
    0: 'ship'

I'll train a model using the above settings on my Nvidia GeForce GTX 1080.

$ python3

from ultralytics import YOLO

model = YOLO('/home/mark/yolov5nu.pt')

results = model.train(data='/home/mark/SAR-Ship-Detection/settings.yaml',
                      epochs=100,
                      batch=8,
                      imgsz=800,
                      plots=True)

Ultralytics YOLOv8.2.32 🚀 Python-3.10.12 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce GTX 1080, 8192MiB)
...
100 epochs completed in 2.885 hours.
Optimizer stripped from runs/detect/train/weights/last.pt, 5.3MB
Optimizer stripped from runs/detect/train/weights/best.pt, 5.3MB

The training finished in just under 3 hours and produced a 5.3 MB weights file.
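Training metrics like mAP50 hinge on the intersection-over-union between predicted and ground-truth boxes. A minimal sketch of IoU for axis-aligned (x1, y1, x2, y2) boxes, with made-up coordinates:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2): intersection area over union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two hypothetical detections overlapping by half their width.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333...
```

A prediction typically counts as a true positive at mAP50 when its IoU with a ground-truth box is at least 0.5, so these two boxes would not match.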

Inferring Umbra's Imagery

Below I'll attempt to detect ships in Umbra's 998 SAR images using the model I trained.

$ python3

from glob import glob
import json

from rich.progress import track
from ultralytics import YOLO

model = YOLO('runs/detect/train/weights/best.pt')

pattern = '/home/mark/ship_detection_testdata/*/*/*_GEC.tif'

with open('ship_detections.jsonl', 'w') as f:
    for filename in track(glob(pattern), total=998):
        try:
            for results in model(filename):
                for result in results:
                    ships = json.loads(result.tojson())

                    for ship in ships:
                        f.write(json.dumps({**ship,
                                            **{'filename': result.path}},
                                           sort_keys=True) + '\n')
        except Exception as exc:
            print(exc)

During the run, the 5 GB GeoTIFF caused the following exception.

error: OpenCV(4.10.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:79: error: (-215:Assertion failed) pixels <= CV_IO_MAX_IMAGE_PIXELS in function 'validateInputImageSize'

I wasn't able to adjust OpenCV's settings within the time box I had allotted for this post so I wrapped the inference in a try/except statement so as not to hold up the rest of the job.

Below is a section of YOLO's output during inference.

..2023-04-15-01-17-53_UMBRA-05_GEC.tif: 800x800 3 ships, 41.6ms
..2023-08-07-13-53-00_UMBRA-04_GEC.tif: 800x800 (no detections), 45.4ms
..2023-03-25-02-22-27_UMBRA-04_GEC.tif: 800x800 4 ships, 39.1ms
..2023-01-23-18-25-33_UMBRA-05_GEC.tif: 800x800 (no detections), 40.5ms
..2023-08-10-13-17-17_UMBRA-05_GEC.tif: 800x800 4 ships, 38.7ms
..2023-10-25-10-31-48_UMBRA-05_GEC.tif: 800x640 (no detections), 31.7ms
..2023-06-18-01-56-53_UMBRA-04_GEC.tif: 800x800 6 ships, 43.8ms
..2023-11-01-13-52-39_UMBRA-05_GEC.tif: 800x800 1 ship, 43.1ms
..2023-04-23-07-12-19_UMBRA-05_GEC.tif: 800x800 3 ships, 52.2ms
..2023-03-06-02-08-59_UMBRA-04_GEC.tif: 800x800 8 ships, 40.0ms
..2023-10-06-02-14-22_UMBRA-04_GEC.tif: 800x800 (no detections), 80.0ms
..2023-03-09-02-49-06_UMBRA-05_GEC.tif: 800x800 (no detections), 39.5ms
..2023-02-26-14-24-37_UMBRA-04_GEC.tif: 800x800 6 ships, 44.9ms
...

There were 2,534 ships detected. Below is an example record from the results file.

$ wc -l ship_detections.jsonl # 2,534 lines
$ head -n1 ship_detections.jsonl | jq -S .

{
  "box": {
    "x1": 12475.06152,
    "x2": 13001.17188,
    "y1": 9329.18945,
    "y2": 9791.08105
  },
  "class": 0,
  "confidence": 0.45151,
  "filename": "/home/mark/ship_detection_testdata/00174818-00fe-4c89-bf45-cc6149253707/2023-04-15-01-17-53_UMBRA-05/2023-04-15-01-17-53_UMBRA-05_GEC.tif",
  "name": "ship"
}

Below are the ship counts across each confidence bracket.
$ ~/duckdb

SELECT ROUND(confidence, 1),
       COUNT(*)
FROM READ_JSON('ship_detections.jsonl')
GROUP BY 1
ORDER BY 1;

┌──────────────────────┬──────────────┐
│ round(confidence, 1) │ count_star() │
│               double │        int64 │
├──────────────────────┼──────────────┤
│                  0.3 │          615 │
│                  0.4 │          466 │
│                  0.5 │          364 │
│                  0.6 │          320 │
│                  0.7 │          306 │
│                  0.8 │          384 │
│                  0.9 │           79 │
└──────────────────────┴──────────────┘
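As a closing note on the OpenCV exception earlier: the pixel cap can be raised via OpenCV's OPENCV_IO_MAX_IMAGE_PIXELS environment variable, which is commonly recommended to be set before cv2 is first imported. A sketch of how that would look (I haven't re-run the 5 GB GeoTIFF with this, so treat it as untested against this dataset):

```python
import os

# Raise OpenCV's image-size cap. This must be in the environment
# before cv2 is imported for the larger limit to take effect.
os.environ['OPENCV_IO_MAX_IMAGE_PIXELS'] = str(2 ** 40)

# import cv2  # the raised limit would then apply to cv2.imread, etc.
print(os.environ['OPENCV_IO_MAX_IMAGE_PIXELS'])
```

Bear in mind that decoding a ~5 GB GeoTIFF in one piece is memory-hungry; tiling the image before inference is often the more practical route.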