We tested the association between income inequality and flood mortality at the country level by collecting and statistically analysing data on disaster impacts, human flood exposure and socioeconomic conditions from a number of international databases. The explorative and statistical analyses were performed using R v.4.1.3 (ref. 41), while the population and settlement data were analysed using Google Earth Engine42. All our statistical analyses uses a significance level of 5%.

Study extent

Owing to data reliability and comparability reasons, only floods occurring in MHICs between 1990 and 2018 were included in the study: 573 events in 67 countries. We also analysed records from the OECD nations separately, 265 events in 28 countries, to highlight the role of income inequality in the wealthiest countries. As stated in the introduction, we limited the study to MHICs primarily due to comparability reasons and because we aim to shed light on the role of inequality in advanced economies. We use the income class according to the World Bank Atlas method43, based on the per-capita GNI (gross national income).

Flood disaster records

The number of fatalities per flood disaster is the outcome variable in our analysis, as reported in the international disaster database EM-DAT31. EM-DAT records disasters that fulfil at least one of the following criteria: ≥10 fatalities, ≥100 affected people, emergency state declaration and/or an international assistance call. We included EM-DAT records classified as riverine floods, coastal floods, flash floods and tropical cyclones (database accessed 7 December 2022).

To control for human flood exposure, we only included records from EM-DAT that were also geocoded by the GDIS database44. The developers of GDIS geocoded EM-DAT records by matching the location description with one or more administrative subdivisions in the GADM database of Global Administrative Areas (v.3.6; https://gadm.org/). GADM includes administrative subdivisions at various levels, including state and province boundaries (level 1), county and district boundaries (level 2) and smaller administrative boundaries (level 3). Each disaster record has been geocoded to administrative regions at level 1, 2 and/or 3, depending on the location description text in EM-DAT44. The subdivision level to which a record is geocoded depends on the specificity of the location description in EM-DAT and does not necessarily reflect the extent of the actual disaster44. Therefore, we included only records geocoded to level 2 and/or 3. Level 1 subdivisions are typically five times larger than level 2 subdivisions and around ten times larger than level 3 subdivisions. Including records geocoded to level 1 would have resulted in considerably larger exposure estimates, potentially biasing the analysis.

Population and settlement data

To estimate the number of potentially exposed people and the degree of urbanity per flood, we utilized three Global Human Settlement (GHS) products: population grids (GHS-POP) at 250 m spatial resolution45, settlement grids (GHS-SMOD) at 1,000 m spatial resolution46 and the urban centre database (GHS-UCD)47. GHS-SMOD provide settlement maps categorized by degree of urbanization: rural, low-density clusters and high-density clusters. GHS-UCD offers the location and attributes of urban centres around the world.

We calculated the total number of potentially exposed individuals for each record using the population data from GHS-POP. The degree of urbanity was determined by calculating the percentage of the potentially exposed population living in high-density clusters using GHS-SMOD. Both GHS-POP and GHS-SMOD are available as global seamless raster files for the years 1990, 2000 and 2015. We derived yearly estimates through linear interpolation. For the years 2016, 2017 and 2018, we assigned the same values as in 2015. We also calculated the total settlement area for each record using GHS-SMOD. One record was excluded from the analysis as it did not contain any inhabitants according to GHS-POP, which we deemed unrealistic. We also used GHS-UCD to identify the records whose affected regions contain at least one urban centre. We considered the degree of urbanization as a potential confounding variable, since a previous cross-country literature report has identified disparities between urban and rural areas as a significant contributor to overall inequality within countries48.

Economic and demographic data

We obtained country–year observations of income inequality from the Standardized World Income Inequality Database v.9.1 (SWIID)20,49 as the Gini index of disposable (post-tax, post-transfer) household income. We chose the SWIID as the data source since it provides data on disposable income, and the data-collection protocol aims to maximize both coverage and comparability20. Nonetheless, some country–year records were missing. When possible, we assigned missing values with the closest available value, at most three years before or after. As mentioned in the introduction, there are alternative metrics to represent economic inequality. However, the global databases that offer these alternative metrics do not provide data on disposable income, nor do they typically have the same coverage as the SWIID. Nonetheless, the choice of inequality metric is particularly influential for ex post studies that investigate mechanisms behind changes in economic inequality. However, for ex ante analyses that investigate the relationship between pre-disaster levels of economic inequality and disaster impacts, such as in this study, the choice of metric does not influence the results considerably, since the inherent ranking of countries is similar across inequality metrics.

To compare relative living standards across countries and over time, we used the expenditure-side real GDP at constant 2017 prices in US dollars from the Penn World Table v.10.0 (refs. 50,51). This measure is adjusted to price changes, such as inflation50. We converted the GDP variable to per-capita terms using national population totals from the same database.

To control for the potential effect of the population age structure on income distribution, we used country–year observations of the proportion of individuals in the country aged 65 or older from the World Development Indicators database of the World Bank52. This demographic variable was deemed crucial as an ageing population has previously been shown to impact economic inequality53,54.

Initial data exploration

We calculated mortality rates for all records by dividing the reported fatality numbers with the number of individuals living in the flood-affected region in the year of the event. We examined if and how the rates varied across space, time and levels of income inequality and economic development. We used non-parametric Wilcoxon tests to compare the group means. We then analysed whether or not the flood mortality levels had changed over the study period, in terms of the absolute fatality numbers and mortality rates. For this purpose, we used non-parametric Mann–Kendall tests to detect trends. Specifically, we compared mortality levels in EU member states ten years before and after the European Directive of 2007.

We also explored if and how flood mortality varied across degrees of urbanity, represented using two metrics. First, we grouped the records according to the binary variable that identified the records whose affected region contained at least one major urban centre according to GHS-UCD data. In addition, we grouped the records based on the share of potentially exposed people living in high-density clusters.

Regression analyses

In this study, we used an unbalanced panel structure model to analyse the relationship between flood fatalities and socioeconomic factors, with the unit of analysis being an individual flood event i occurring in country j during year t. The model specification is outlined in equation (1):

$$\begin{array}{lll}{{{\mathrm{FATALITIES}}}}_i = \left( {\beta _0 + b_{0,j}} \right) + \beta _1{{{\mathrm{GINI}}}}_{jt}\cr \\ \qquad\qquad\qquad\qquad + \,\beta _2{{{\mathrm{GDP}}}}_{jt} + \beta _3\ln \left( {{{{\mathrm{POT}}}}\_{{{\mathrm{EXP}}}}_{it}} \right)\cr \\ \qquad\qquad\qquad\qquad + \,\beta _4\ln \left( {{{{\mathrm{SETTL}}}}\_{{{\mathrm{AREA}}}}_{it}} \right) + \beta _5{{{\mathrm{HDC}}}}_{it}\cr \\ \qquad\qquad\qquad\qquad + \,\beta _6{{{\mathrm{POP}}}}\_{{{\mathrm{65}}}}_{jt} + \beta _7{{{\mathrm{YEAR}}}}_i + \varepsilon \\ \end{array}$$ (1)

A mixed-effects approach was used, with a random intercept term (b 0,j ) per country to account for variations in the baseline risk across countries. The random effect thus takes into account the lack of independence of records coming from the same country, for instance due to differences that we are not representing with our data, including flood propensity. The response variable, FATALITIES i , represents the number of reported fatalities from each flood event. The covariate variables included the Gini index of disposable income (GINI jt ), the per-capita real GDP (GDP jt ), the number of individuals potentially exposed to the flood event (POT_EXP it ), the area of settlements in the affected regions (SETTL_AREA it ), the proportion of potentially exposed individuals living in high-density clusters (HDC it ), the proportion of individuals aged 65 or older in the country (POP_65 jt ) and a time dummy variable (YEAR i ) to control for temporal changes. β 0 is the average model intercept, β 1– 7 are the regression coefficients for each covariate variable and ε is the error term.

To assess the significance of various variables in relation to flood fatalities, we used a negative binomial generalized linear regression model. This method was chosen due to its appropriateness for count data with an overdispersion of error, and the models were fitted using maximum likelihood estimation with the R package glmmTMB55. To support our analysis, we provide summary statistics of the variables at the sample and group level in Supplementary Tables 1 and 2, respectively.

We designed two versions of the regression models, varying the degree of transformation of the covariate variables as seen in Supplementary Fig. 1. In the first main version, we aimed to estimate the effect of changes in the Gini index and per-capita real GDP on the flood mortality. Using data from 1990 to 2018, we scaled these variables to the units of 3 pp and US$10,000, reflecting the changes experienced by the study sample (Supplementary Table 3). All variables were centred and two highly skewed variables (potentially exposed individuals and the total settlement area) were log-transformed. In the second (standardized) version, we aimed to evaluate the relative importance among the independent variables. To this end, we standardized all covariate variables. Before this, we log-transformed the two highly skewed variables.

Model diagnostics

To evaluate the linearity assumption of the negative binomial regression, we conducted spline regression on the Gini index and per-capita real GDP variables (Supplementary Fig. 3). Each variable was replaced with a natural cubic spline with three knots. The linearity assumption holds well for the main model that includes observations from all 65 MHICs. Limiting the sample to records from the OECD nations, however, gives more variable and uncertain model estimates due to the smaller sample size.

In addition to the linearity assessment, we performed residual diagnostics for the models using the R package DHARMa56 with 1,000 iterations. The negative binomial model structure generally fits the data well. The Kolmogorov–Smirnov test of the main statistical model using observations from MHICs indicates a significant deviation from the expected negative binomial distribution. However, the corresponding quantile–quantile plot (or qq plot) does not show a large deviation from the straight line (Supplementary Fig. 4). Using observations from the OECD nations resulted in slightly less problematic residuals (Supplementary Fig. 5). Standardizing all covariate variables also improved the model fit (Supplementary Fig. 6). Dropping the random effect, however, resulted in more problematic residuals (Supplementary Fig. 7) and a lower model quality in terms of the Akaike information criterion (Supplementary Tables 5–8). On the basis of this, we chose to include the random intercept term in the full model specification, even though 65 out of 67 country estimates were not significantly different from the model estimation (Supplementary Fig. 8).

Methodological limitations

It is important to note the limitations of the present study, which is an observational study performed at the country level, using measures of association to quantify the relationship between flood mortality and income inequality. Although the results suggest a correlation between income inequality and fatality numbers, other methods are needed to indicate causality. In addition, the Gini index is an aggregate metric, which may obscure underlying variables, and here we control for only a few.

Furthermore, the relationship between disaster outcomes and socioeconomic factors is highly context-specific and complex, both across and within countries. The top-down approach of cross-national research simplifies this complexity. We acknowledge that the scale matters, and a local study would have more potential for disentangling complex processes. Cross-national studies also have their own benefits, and we think it is important that research is conducted on a variety of scales. In addition, the limited number of data points per country prevented the inclusion of a random slope term, which would have provided further information about how the relationship between the variables varies across groups. The level of significance should also be interpreted with a certain amount of caution in the OECD models due to the smaller sample.

Data from global databases have limitations. For instance, we proxy flood exposure for the number of individuals living in the flood-affected administrative regions in the year of the event. This is a rough proxy which does not distinguish between flooded and non-flooded areas within these regions. Moreover, the intensity of the flood events is not taken into account. The disaster database EM-DAT records only major disasters, meaning that smaller disasters and instances where a flood hazard did not result in a disaster (that is, ‘success stories’) will be missed. The number of fatalities is a relatively straightforward disaster outcome to measure compared with, for example, economic damages and the number of affected individuals, although the accuracy will nonetheless vary across records57.

Finally, it is important to note that our sampling scheme affects the distribution of sample sizes across continents. A majority of the records in the final sample occurred in the Americas (33%), followed by Europe (29%), Asia (23%), Africa (9%) and Oceania (6%). However, this distribution does not necessarily reflect the true flood frequency for each continent. One reason for this is that the study considered floods only from MHICs. In addition, records that were geocoded to administrative regions at level 1 were excluded from the final sample. This affected some countries more than others. For example, a majority of the records from China in the GDIS database were geocoded to administrative regions at level 1 and were thus excluded from the final sample.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.