The review was prospectively registered with PROSPERO (https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022336132). The data and statistical code used were uploaded to the Open Science Framework (https://osf.io/yjhvz/?view_only=7f2dae1632c64ce1909ab391b34d2a6e). The systematic review was conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 guidelines (PRISMA 2020; Supplement 1) [28].

Eligibility criteria

Inclusion criteria followed the Participants, Interventions, Comparators, Outcomes, and Study design (PICOS) framework [29]. Studies that did not meet the following criteria were excluded.

Population

The population of interest was patients at increased risk of fracture due to osteoporosis or osteopenia (T-score < −1 SD) [2]. There were no restrictions on age, sex, gender, or race.

Intervention

The intervention involved the addition of exercise to pharmacological therapy. Exercise had to be planned, structured, repetitive, and targeted, with the aim of improving or attaining physical fitness components [30] (e.g., aerobic, walking, cycling, jogging, resistance, strengthening, or endurance exercises, or whole-body vibration). The combination of different types of exercise did not lead to exclusion. Co-interventions in the form of advisory and educational services were allowed (e.g., calls, web calls, educational material in the form of booklets). Other co-interventions with a direct impact on bodily physiology were excluded, such as nutritional interventions other than vitamin D, cholecalciferol, or calcium supplementation to compensate for potential deficiencies.

Pharmacological therapies included the following [12, 14] (Supplement 2): alendronate, risedronate, ibandronate, zoledronate, denosumab, teriparatide, and romosozumab.

Comparators

Comparator groups received pharmacological therapy only, without any exercise training (see Supplement 2 for details of included therapies). Supplementation with vitamin D, cholecalciferol, and/or calcium alone was not considered an included pharmacological therapy. Where participants received different medications, decisions on inclusion or exclusion were based on the most frequently used medication. A participant could not serve as both an intervention and a control group participant (e.g., crossover RCTs).

Outcomes

Outcomes included BMD (areal and volumetric), BTM, fracture healing (radiologically evaluated), and fractures. No bone strength measures were included (see Supplement 3 for further details and priority for extraction).

Study

Included studies were required to be randomized controlled trials (RCTs; parallel, cluster, or cross-over designs) published in German or English as a full peer-reviewed journal publication (i.e., grey literature, including theses and conference abstracts, was excluded). There were no restrictions regarding the date of publication.

Information sources and search strategy

An electronic database search of MEDLINE (via PubMed), EMBASE (via Ovid), CINAHL (via EBSCO Host), and CENTRAL (via the Cochrane Library) was conducted using keywords, Medical Subject Headings (MeSH) terms, and related text words. Searches covered the period from database inception to May 6, 2022. The full search strategy is provided in Supplement 4. Unpublished and ongoing trials were searched via the International Standard Randomized Controlled Trial Number registry (https://www.isrctn.com/), the US National Institutes of Health registry (https://clinicaltrials.gov/), the EU Clinical Trials Register (https://www.clinicaltrialsregister.eu/), the German Clinical Trials Register (https://www.drks.de/), and the Australian New Zealand Clinical Trials Registry (https://www.anzctr.org.au/). In addition, the reference lists of prior relevant systematic reviews were screened for additional studies. Forward and backward citation tracking of included studies was performed via the Web of Science Core Collection. Hand searches were performed for older bisphosphonates (see Supplement 5).

Selection process

Two independent assessors (from a pool of three screeners: AKS, NKA, EAC) screened each record against the predefined inclusion criteria, first by title and abstract and subsequently by full text. Covidence (www.covidence.org) was used for the screening process, and duplicate records were detected and removed via the Covidence software. Disagreements were resolved through discussion between reviewers and, if necessary, by consulting an adjudicator (NLM) and a senior member of the research team (DLB, PJO, BB).

Data collection process and data extraction

Data were extracted independently by two extractors (from a pool of three extractors: AKS, NKA, EAC) using Google Sheets. Differences and extraction errors for the measures of spread and/or impact of studies reporting were identified by a reviewer (AKS) and resolved through discussion between reviewers and, if necessary, by consulting an adjudicator (NLM) and a senior member of the research team (DLB, PJO, BB). Prior to commencing data extraction, this method was piloted on ten studies chosen at random, and the results were collated by one team member (AKS). In the custom Google Sheets extraction form, comment bubbles gave the extractors detailed information on what data to extract for each parameter and on how to address unclear information in publications. The following study information was extracted: publication information (author, year, title, journal, funding), study design, study demographics (e.g., age, sex, number of participants), medication interventions (e.g., category, dose, and frequency), exercise interventions (e.g., types, session and overall intervention duration, intensity, and frequency), measurement time points, and outcomes (e.g., BMD, BTM). If multiple follow-ups existed within a timeframe, we extracted the follow-up data closest to 12 months for BMD outcomes and the end-of-intervention data if the exercise period deviated. For BTM outcomes, the data closest to 3 months and the end-of-intervention data were extracted (see Supplement 3 for further information).

Data for the main outcomes were extracted as number of participants, mean, and standard deviation (SD) where possible. Where these were reported using other measures of center and spread (e.g., median and interquartile range), we used standard equations to convert the data [31]. All applied data handling methods are available for each study and outcome in Supplement 6. Where data were presented in a figure only, ImageJ (https://imagej.nih.gov/ij/) was used to extract the values by measuring the length of the axes in pixels followed by the length of the relevant data of interest [32]. If it was not possible to extract the required data, the information was requested from the corresponding author at least three times over a 4-week period.
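The two data handling steps above can be illustrated with a minimal sketch. It is not the code used in the review: the function names are hypothetical, the quartile-based conversion is one commonly used approximation that assumes roughly normally distributed data (the equations in [31] may differ), and the pixel conversion assumes a linear axis measured from its origin.

```python
from statistics import NormalDist


def mean_sd_from_median_iqr(q1, median, q3):
    """Approximate mean and SD from the quartiles (assumes near-normal data)."""
    # Assumption: the mean is approximated by the average of the three quartiles.
    mean = (q1 + median + q3) / 3.0
    # For a normal distribution the IQR spans 2 * z_0.75 (about 1.35) SDs.
    iqr_in_sd_units = 2.0 * NormalDist().inv_cdf(0.75)
    sd = (q3 - q1) / iqr_in_sd_units
    return mean, sd


def value_from_pixels(data_px, axis_px, axis_min, axis_max):
    """Convert a pixel length measured along a linear axis into data units.

    data_px: pixel distance from the axis origin to the data point
    axis_px: pixel length of the axis between axis_min and axis_max
    """
    if axis_px <= 0:
        raise ValueError("axis_px must be positive")
    return axis_min + (data_px / axis_px) * (axis_max - axis_min)


if __name__ == "__main__":
    # Illustrative numbers only, not data from any included study.
    print(mean_sd_from_median_iqr(40.0, 50.0, 60.0))
    print(value_from_pixels(150.0, 300.0, 0.0, 1.0))
```

The pixel conversion is a plain linear interpolation, so it would not apply to logarithmic axes without first transforming the axis limits.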

Study risk of bias assessment, reporting bias assessment, and certainty assessment

The Cochrane Collaboration Risk of Bias Tool 2 (RoB2) for randomized trials was used to examine potential bias arising from the randomization process, deviations from intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result [33]. An overall risk of bias judgement was made for BMD and BTM outcomes.

The Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach was used to assess the certainty of the evidence [34]. The GRADE approach considers risk of bias, imprecision, inconsistency, indirectness, and publication bias when rating certainty [34]. Further details on the criteria used for GRADE are available in Supplement 7. Because only RCTs were included, a high level of confidence in the evidence was assumed initially, with the possibility of downgrading after the assessment. Two independent assessors (AKS, EAC) assessed risk of bias and GRADE. Conflicts were resolved by discussion and, if necessary, consultation of an adjudicator (NLM).

Synthesis methods and effect measures

All statistical analyses and forest plots were completed in R version 4.2.1 (https://www.r-project.org/) using the R packages meta [35], metafor [36], and metadat [37].

As all outcomes of interest were continuous but could be measured on different scales, the standardized mean difference (SMD) with Hedges' correction (Hedges' g) was used as the effect estimate [38]. Random-effects meta-analysis was used. To estimate the 95% confidence intervals (95%CI), the Hartung-Knapp-Sidik-Jonkman adjustment with ad hoc correction was used [39, 41]. Results with p ≤ 0.05 were considered statistically significant.
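As an illustration of the effect measure, the following minimal sketch computes Hedges' g and an approximate sampling variance from group-level summary data. It is not the review's analysis code (which was written in R): the function name is hypothetical, and the small-sample correction factor and variance formula are the commonly used approximations, which may differ slightly from the exact expressions implemented in the meta and metafor packages.

```python
import math


def hedges_g(n1, mean1, sd1, n2, mean2, sd2):
    """Return (g, var_g) for two independent groups.

    Group 1 is taken as the intervention arm and group 2 as the control arm.
    """
    df = n1 + n2 - 2
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    # Uncorrected standardized mean difference (Cohen's d).
    d = (mean1 - mean2) / pooled_sd
    # Approximate small-sample correction factor.
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = j * d
    # Large-sample variance of d, scaled by the squared correction factor.
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))
    return g, j ** 2 * var_d


if __name__ == "__main__":
    # Illustrative numbers only, not data from any included study.
    g, var_g = hedges_g(20, 1.0, 1.0, 20, 0.5, 1.0)
    print(round(g, 4), round(var_g, 4))
```

Because g is expressed in pooled SD units, outcomes measured on different scales can be combined in one meta-analysis.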

Heterogeneity was assessed using the I² statistic and the reported 95% prediction intervals (95%PI) where possible. The heterogeneity parameter τ (tau) was estimated via restricted maximum likelihood (REML) estimation. Funnel plots and Egger's test as modified by Pustejovsky [42], which assess the risk of publication bias and small-study effects, are only recommended if at least 10 studies are available [43]. This threshold also applies to meta-regression and subgroup analyses [43]. Accordingly, these statistical approaches would only be used if ≥ 10 studies were included. To test the robustness of the results, sensitivity analyses exploring the role of outliers and influential trials were included [44].
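The pooling and heterogeneity steps can be sketched as follows. This is a simplified illustration and not the review's R code. Two assumptions are swapped in to keep the example free of external dependencies: τ² is estimated with the closed-form DerSimonian-Laird estimator rather than REML, and the critical t value (k − 1 degrees of freedom) is supplied by the caller. The ad hoc correction is implemented here as never letting the Hartung-Knapp standard error fall below the classical random-effects standard error, which is one common variant and may not match the exact option used in the review. The function name is hypothetical.

```python
import math


def random_effects_summary(effects, variances, t_crit):
    """Pool effect sizes with a random-effects model and HKSJ-type interval."""
    k = len(effects)
    if k < 2:
        raise ValueError("at least two studies are required")
    df = k - 1

    # Fixed-effect weights and Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # I^2: percentage of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird tau^2 (assumption: stands in for REML here).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights and pooled estimate.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)

    # Classical and Hartung-Knapp standard errors.
    se_classic = math.sqrt(1.0 / sum(w_re))
    q_hk = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w_re, effects)) / df
    se_hk = math.sqrt(q_hk / sum(w_re))

    # Ad hoc correction: use the larger of the two standard errors.
    se = max(se_hk, se_classic)
    ci = (pooled - t_crit * se, pooled + t_crit * se)
    return {"pooled": pooled, "se": se, "ci": ci, "tau2": tau2, "i2": i2}


if __name__ == "__main__":
    # Illustrative numbers only; 4.303 is the 97.5% t quantile for 2 df.
    print(random_effects_summary([0.2, 0.5, 0.8], [0.04, 0.04, 0.04], 4.303))
```

With few studies, the t-based interval is typically wider than a normal-based one, which is the main reason for using this kind of adjustment in small meta-analyses.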