In summary, the DR analysis attempts to distinguish correlation from causation by estimating the influence that potential causal factors at a previous time have on the response variable at a later time. While an improvement over static correlations, in which causal direction remains ambiguous, this method is nevertheless insufficient for making absolute claims of causality. Further scrutiny will be required to provide additional support for the provisional causal interpretations suggested in our article.
A related potential problem is that of an “uninformative dataset,” which arises when there is not enough variation in either the potential predictors or the response variables (or both). This is why we need data that sample as many different evolutionary trajectories as possible. To illustrate this point, consider the effect of cavalry on the evolution of social complexity. If our dataset contained only Eurasian societies after horse riding had spread everywhere, then we would not have enough variation to statistically detect the effect of this driver. Including Eurasian societies before 1000 BCE and, most crucially, New World societies, where cavalry arrived very late, is therefore key. In essence, the spread of horse-based warfare, which happened at very different times in different parts of the world, is a “natural experiment” that allows us to estimate its effect on the evolution of social complexity. However, because cavalry and iron metallurgy are so closely correlated in our dataset, we were unable to disentangle the effects of these two technologies. We need additional information to do so [see ( 58 )]. A general conclusion from this discussion is that more work is needed not only to eliminate gaps and to gather new data on additional variables (as we called for in the previous paragraphs) but also to sample more evolutionary trajectories in as diverse settings as possible.
Another fundamental difficulty is the “hidden variable” problem, or omitted variable bias ( 98 ). This happens when our analysis implicates a variable as a causal factor for the response while, in reality, the true cause is a variable not included in the analysis, with which the implicated variable is closely correlated. The only solution to this problem is gathering data on as many potential predictors as possible. This is why the Seshat project defined, and gathered data on, proxies for all major classes of theories that have been proposed by social scientists so far. However, we acknowledge that the set of proxies used in this article is just the beginning. This is where we see the most fruitful area for future research: defining additional proxies for theoretically postulated mechanisms, gathering data on these variables, and rerunning analyses to find whether new variables turn out to be better predictors of social complexity and scale than the ones we have tested so far.
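To make the omitted variable problem concrete, the following minimal R sketch (simulated data and made-up variable names, not Seshat data) generates a hidden driver Z that produces both a candidate predictor X and a response Y; a regression of Y on X alone wrongly implicates X, whereas including Z removes its apparent effect.

```r
# Minimal illustration of omitted variable bias (simulated data, not Seshat data)
set.seed(1)
n <- 1000
Z <- rnorm(n)             # hidden true driver
X <- 0.8 * Z + rnorm(n)   # candidate predictor: correlated with Z but causally inert
Y <- 1.5 * Z + rnorm(n)   # response: driven only by Z

summary(lm(Y ~ X))        # X appears to be a strong "predictor" of Y
summary(lm(Y ~ X + Z))    # once Z is included, the coefficient on X collapses toward 0
```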
Time-resolved data are what enable tests of evolutionary theories against each other, but they do not solve all possible problems in the analysis of evolutionary causation. For this reason, the specific results reported in this article are tentative and contingent on additional data and improved analytic approaches. One recurrent problem with historical data is gaps in the knowledge of historians and archaeologists, resulting in missing data. We dealt with this problem, as well as with uncertainty in estimates and expert disagreements, by multiple imputation (see the “Multiple imputation and nonparametric bootstrap” section). While this is a valid statistical technique, additional research by expert scholars aiming to fill the gaps in the database would be a much more satisfying long-term solution. One of the goals of the Seshat project has been highlighting gaps in our knowledge to motivate such research.
The goal of the DR approach is different, because we aim to use data to adjudicate between different theories of social evolution, each proposing a different causal graph (an example is in Fig. 2). This is generally impossible to do with static (time-unresolved) data, which is why the goal of the Seshat project from its inception was to collect time series data. Unlike with DAGs, the main goal of the DR approach is model selection: choosing which predictor terms should be included on the right-hand side of Eq. 1. We are also interested in estimation, because we want to compare the numerical strengths of different factors, but this goal is secondary to model selection, as we first need to determine which causal graph should be used for coefficient estimation. Thus, the DAG approach, excellent as it is, differs in goals and techniques from the DR approach. The DR approach was designed to resolve questions of causation in evolutionary processes (descent with modification) that unfold slowly in time. It allows us to deal with such complications as mutual causation loops and temporal autocorrelations arising from the inertial nature of evolution, as well as (with fairly straightforward extensions; see the “Dynamic regression” section below) with spatial diffusion and phylogenetic effects. A model’s ability to predict data is interesting not in itself but as a tool for adjudicating between different theories. A more precise term for this approach is retrodiction, because even when we use out-of-sample prediction, it is about the past, not the future (see the “k-fold cross-validation” section).
The most important difference between DR and DAG is that the latter does not explicitly include the time dimension. Thus, instead of causal links in DR, such as X_t → Y_{t+1}, in DAG, causal connections are denoted without time subscripts, as X → Y. Because of this difference, DAGs have to be acyclic; in other words, scenarios of mutual causation cannot be investigated. Furthermore, the main goal of the DAG approach is estimation of the causal effect. This approach is appropriate if we need to know, for example, by how many years a particular drug would increase life expectancy when we only have observational data. To answer this question, an analyst must assume a particular DAG—its form is underdetermined by (time-unresolved) data.
As noted in the Introduction, our statistical approach to causation is based on DR (described in detail in the “Dynamic regression” section) rather than on DAGs. Because the DAG approach has recently gained much popularity, here we explain how the goals and logic of the DR approach differ. Note that the DR framework is based on the ideas of Wiener ( 27 ), which were later developed by Granger ( 28 ). This approach has also been used in the statistical analysis of animal population dynamics ( 97 ).
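As a minimal sketch of this Wiener-Granger logic (simulated data and arbitrary coefficients, not our actual model), the R code below asks whether lagged values of X improve the prediction of Y beyond Y’s own lag, using AIC to compare the two nested models.

```r
# Sketch of the Wiener-Granger logic on a simulated two-variable system
set.seed(2)
T <- 200
X <- numeric(T); Y <- numeric(T)
for (t in 2:T) {
  X[t] <- 0.5 * X[t - 1] + rnorm(1)
  Y[t] <- 0.6 * Y[t - 1] + 0.4 * X[t - 1] + rnorm(1)   # X influences Y with a one-step lag
}
d <- data.frame(Y = Y[3:T], Ylag = Y[2:(T - 1)], Xlag = X[2:(T - 1)])

m0 <- lm(Y ~ Ylag, data = d)          # past of Y only
m1 <- lm(Y ~ Ylag + Xlag, data = d)   # past of Y plus past of X
AIC(m0, m1)                           # lagged X should lower the AIC if it is a "cause"
```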
The next term on the right-hand side represents the effects of the main predictor variables X_k, with g_k as regression coefficients. These variables (described in the “Outlining hypotheses and defining predictor variables” section) are of primary interest because they enable us to test various theories about the evolution of social scale and complexity. Last, ε_{i,t} is the error term. We also include quadratic versions of these terms at a time lag (see the “Dynamic regressions” section in the Supplementary Materials) to explore nonlinear responses to the response and predictor variables.
The fourth term detects autocorrelations due to shared cultural history between location i and other regions j, using the phylogeny variable. Here, the weight applied to the phylogenetic (linguistic) distance between locations i and j is set to 1 if the two locations share the same language, 0.5 if they are in the same linguistic genus, and 0.25 if they are in the same linguistic family. Linguistic genera and families were taken from standard linguistic classifications and Glottolog ( 99 ).
The third term represents potential effects resulting from geographic diffusion. We used a negative exponential kernel to relate the distance δ between location i and location j to the influence of j on i. Unlike a linear kernel, the negative exponential does not become negative at very large δ, instead approaching 0 smoothly. The third term, thus, is a weighted average of the response variable values in the vicinity of location i at the previous time step, with weights falling off to 0 as distance from i increases. One parameter measures how steeply the influence falls off with distance, and a second parameter is a regression coefficient measuring the importance of geographic diffusion. For an overview of potential effects resulting from geographic diffusion, see ( 98 ).
On the right-hand side, a is the regression constant (intercept). The next term captures the influences of past history (“autoregressive terms”), with τ = 1, 2, … indexing time-lagged values of Y (as time is measured in centuries, Y_{i,t−1} refers to the value of the response 100 years before t).
Here, Y_{i,t} is the response variable (Scale, Hier, or Gov) for location (NGA) i at time t. We construct a spatiotemporal series for response and predictor variables by following Seshat polities (or quasi-polities, such as archaeologically attested cultures) that occupied a specific NGA at each century mark during the sampled period. Thus, the time step in the analysis is 100 years.
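To make the structure of these terms concrete, the following schematic R sketch (toy values and a hypothetical kernel steepness, not the published Seshat script) assembles the geographic diffusion and phylogeny terms for a single focal NGA.

```r
# Schematic construction of the diffusion and phylogeny terms for one focal NGA
# (toy values; the kernel steepness used here is purely illustrative)
Y_prev <- c(2.1, 3.4, 1.7, 4.0)       # response at four neighboring NGAs, previous century
delta  <- c(500, 1200, 3000, 8000)    # distances (km) from the focal NGA
steep  <- 1 / 1000                    # hypothetical steepness of the distance kernel
w_geo  <- exp(-steep * delta)         # negative exponential: smooth decay toward 0
diffusion_term <- sum(w_geo * Y_prev) / sum(w_geo)   # weighted average of neighbors

w_phy <- c(1, 0.5, 0.25, 0)           # same language, genus, family; 0 assumed for unrelated
phylo_term <- sum(w_phy * Y_prev) / sum(w_phy)

# Both terms would then enter the regression alongside lagged Y and the predictors X_k, e.g.:
# lm(Y ~ Y_lag1 + X1_lag1 + X2_lag1 + diffusion_term + phylo_term, data = dr_data)
```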
Model selection (choosing which terms to include in the regression model) was accomplished by exhaustive search: regressing the response variable on all possible linear combinations of predictor variables. This means that we tested >100,000 specific hypotheses (this number is further increased because, in addition to the 17 possible predictors, we also investigate the effects of various autoregressive and nonlinear terms). The degree of fit was quantified by the AIC, which penalizes models with too many fitted coefficients. Possible nonlinear effects were checked by adding quadratic terms to the regression model. Standard diagnostic tests were performed for the best-fitting models ( 50 ). To check for cross-equation error correlations, we fitted a seemingly unrelated regression ( 51 ).
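A compact sketch of such an exhaustive search is given below (generic column names and base R only; the published analysis involved 17 predictors plus autoregressive and quadratic terms and is available in the supplementary R scripts).

```r
# Exhaustive model selection by AIC over all non-empty predictor subsets (sketch)
exhaustive_aic <- function(data, response, predictors) {
  out <- lapply(seq_along(predictors), function(k) {
    sets <- combn(predictors, k, simplify = FALSE)
    data.frame(
      model = sapply(sets, paste, collapse = " + "),
      aic   = sapply(sets, function(s) {
        fml <- as.formula(paste(response, "~", paste(s, collapse = " + ")))
        AIC(lm(fml, data = data))
      })
    )
  })
  res <- do.call(rbind, out)
  res[order(res$aic), ]               # best (lowest AIC) model first
}

# Example with toy data:
# d <- data.frame(Y = rnorm(50), A = rnorm(50), B = rnorm(50), C = rnorm(50))
# head(exhaustive_aic(d, "Y", c("A", "B", "C")))
```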
Multiple imputation and nonparametric bootstrap
Missing values, estimated uncertainty, and expert disagreement in the predictors (independent variables) were dealt with by multiple imputation ( 52 ). The response (dependent) variable, however, is not imputed, because such a procedure can result in biased estimates.
Imputation involves replacing missing entries with plausible values, which allows us to retain all cases for the analysis. We use the approach of multiple imputation, in which the analysis is done on many datasets, each created with different imputed values that are sampled in a probabilistic manner. This approach results in valid statistical inferences that properly reflect the uncertainty due to missing values ( 53 ). Our procedure followed the approach introduced in ( 4 ):
1) Expert disagreement. In cases where experts disagree, each alternative coding has the same probability of being selected. Thus, if there are two conflicting codings presented by different experts and we create 20 imputed sets, then each alternative will be used roughly 10 times.
2) Uncertainty. Values that are coded with a confidence interval are sampled from a Gaussian distribution whose mean and variance are estimated, assuming that the interval covers 90% of the probability. For example, if a value of 1000 to 2000 was entered for the polity population variable, then we would draw values from a normal distribution centered on 1500 with an SD of 304 (see the sketch below). Thus, in 10% of cases, the value entered into the imputed set will be outside the data interval coded in Seshat. For categorical or binary variables, we sample coded values in proportion to the number of categories that are presented as plausible. For example, if Hier was coded as [2;3], that is, our degree of knowledge does not allow us to tell whether its value was 2 or 3 at a particular time, then the imputed data will contain “2” for roughly half the sets and “3” for the rest.
3) Missing data. For missing data, we impute values as follows. Suppose that for some polity, we have a missing value for variable A and coded values for variables B to H. We select a subset of cases from the full dataset in which all of the variables A to H have coded values and build a regression model for A. Not all predictors B to H may be relevant for predicting A, and, thus, the first step is selecting which of the predictors should enter the model. Once the optimal model is identified, we estimate its parameters. Then, we go back to the polity (where variable A is missing) and use the known values of the predictor variables for this polity to calculate the expected value of A using the estimated regression coefficients. However, we do not simply substitute the missing value with the expected one (because, as explained above, this is known to result in biased estimates). Instead, we sample from the regression residuals and add the sampled residual to the expected value. We applied the same approach to each missing value in the dataset, yielding an imputed dataset without gaps.
The overall imputation procedure was repeated 20 times, yielding 20 imputed sets that were used in regression analysis as tests for the hypotheses.
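A numerical sketch of steps 2 and 3 in R follows (toy numbers and hypothetical variable names; the published imputation scripts accompany the article).

```r
# Step 2 (uncertainty): sample a polity population coded as the 90% interval [1000, 2000]
lo <- 1000; hi <- 2000
mu  <- (lo + hi) / 2                 # midpoint: 1500
sdv <- (hi - mu) / qnorm(0.95)       # 500 / 1.645, approximately 304
imputed_pop <- rnorm(1, mean = mu, sd = sdv)

# Step 3 (missing data): regression-based imputation with a sampled residual (toy data)
set.seed(3)
complete_cases <- data.frame(B = rnorm(100), C = rnorm(100))
complete_cases$A <- 2 + 0.7 * complete_cases$B - 0.3 * complete_cases$C + rnorm(100, sd = 0.5)
fit <- lm(A ~ B + C, data = complete_cases)
polity_missing_A <- data.frame(B = 0.4, C = -1.2)    # predictors known, A missing
imputed_A <- predict(fit, newdata = polity_missing_A) + sample(residuals(fit), 1)
```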
Another source of potential bias is the violation of the assumptions of the statistical model needed to calculate confidence intervals and associated P values. Regression diagnostics indicate that the distribution of residuals violates the normality assumption (see Supplementary Results). Furthermore, for any data coming from the same geographic locality (NGA), it is possible that values are not truly independent because of memory effects. While we estimate and, when appropriate, model short-term memory effects by fitting autoregressive terms, there is also a possibility, which cannot be discounted, that there is a longer-term memory in the system.
We used a nonparametric bootstrap to deal with this problem. However, instead of sampling (with replacement) each data point (polity-century), we sampled with replacement the whole block of data associated with each NGA. Thus, the bootstrap procedure we used mimics the process by which we constructed the Seshat sample (see the “A brief introduction to the Seshat Global History Databank” section in the Supplementary Materials).
We combined multiple imputation with the bootstrap. First, we created 20 imputed datasets, as described above. Second, we resampled, with replacement, the NGAs in each imputed dataset 500 times, for a total of 20 × 500 = 10,000 bootstrapped datasets. We then calculated the statistics of interest (regression coefficients associated with various predictors) and constructed the frequency distribution of the 10,000 bootstrapped values. The P value is approximated by the proportion of bootstrapped values less than 0 (if the hypothesis we test is that the effect of the predictor is positive) or greater than 0 (otherwise). The 95% confidence interval is then approximated by eliminating the smallest 250 and largest 250 values.
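The sketch below illustrates this NGA-level block bootstrap for a single regression coefficient (hypothetical data frame and column names; the published scripts implement the full procedure).

```r
# NGA-level block bootstrap for one regression coefficient (illustrative sketch)
# Assumes a data frame `dat` with columns NGA, Y, and X1.
block_boot_coef <- function(dat, n_boot = 500) {
  ngas <- unique(dat$NGA)
  replicate(n_boot, {
    resampled <- do.call(rbind, lapply(sample(ngas, length(ngas), replace = TRUE),
                                       function(g) dat[dat$NGA == g, ]))
    coef(lm(Y ~ X1, data = resampled))["X1"]
  })
}

# boots   <- unlist(lapply(imputed_sets, block_boot_coef))  # 20 imputed sets x 500 = 10,000 values
# p_value <- mean(boots <= 0)                  # one-sided test that the effect of X1 is positive
# ci_95   <- quantile(boots, c(0.025, 0.975))  # drop the smallest and largest 2.5% of values
```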
Calculating P values and confidence intervals assuming normality is expected to yield more liberal estimates, while resampling whole blocks of data for each NGA is a more conservative approach. Using these two approaches permits us to bracket the true values. The analysis sequence, thus, follows a two-phase approach. In the first phase (model selection), we check which predictors need to be included in the regression model, which autoregressive terms need to be explicitly modeled, whether the relationships between the response and predictors are linear, whether there are possible omitted variables, and whether NGA fixed effects are needed. As a result, we run many regressions, identify the “best model” (with the smallest AIC), and sort the rest by increasing ΔAIC (difference from the best model). Once this model selection and testing phase is accomplished, the second phase (confidence tests) uses the nonparametric bootstrap to approximate the P values and confidence intervals.
All analyses were performed using R version 4.0.2. R scripts and data files are published as a supplement to the article.