What happened?
TL;DR
During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again.
This incident did not impact any Google Cloud service other than this customer’s one GCVE Private Cloud. Other customers were not impacted by this incident.
Diving Deeper:
Deployment using an exception process
In early 2023, Google operators used an internal tool to deploy one of the customer’s GCVE Private Clouds to meet specific capacity placement needs. This internal tool for capacity management was deprecated and fully automated in Q4 2023 and is therefore no longer required (i.e. no need for human intervention).
Blank input parameter led to unintended behavior
Google operators followed internal control protocols. However, one input parameter was left blank when using an internal tool to provision the customer’s Private Cloud. As a result of the blank parameter, the system assigned a then unknown default fixed 1 year term value for this parameter.
After the end of the system-assigned 1 year period, the customer’s GCVE Private Cloud was deleted. No customer notification was sent because the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool, and not due a customer deletion request. Any customer-initiated deletion would have been preceded by a notification to the customer.
Recovery
The customer and Google teams worked 24x7 over several days to recover the customer’s GCVE Private Cloud, restore the network and security configurations, restore its applications, and recover data to restore full operations.
This was assisted by the customer’s robust and resilient architectural approach to managing risk of outage or failure.
Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration.
Remediation
Google Cloud has since taken several actions to ensure that this incident does not and can not occur again, including:
We deprecated the internal tool that triggered this sequence of events. This aspect is now fully automated and controlled by customers via the user interface, even when specific capacity management is required. We scrubbed the system database and manually reviewed all GCVE Private Clouds to ensure that no other GCVE deployments are at risk. We corrected the system behavior that sets GCVE Private Clouds for deletion for such deployment workflows.
Conclusions