Study design and treatments

FLIGHT-FXR (NCT02855164) was a phase 2, randomized, double-blind, placebo-controlled, dose-finding study with an adaptive design consisting of three sequential parts (Parts A, B and C). The study was conducted between August 2016 and April 2020 at 84 centers in 17 countries (Argentina, Australia, Austria, Belgium, Canada, France, Germany, India, Italy, Japan, Republic of Korea, the Netherlands, Singapore, Slovakia, Spain, Taiwan and the United States).

Study design and number of patients per treatment group are shown in Extended Data Fig. 1 and Table 1. In Part A, 77 patients were randomized (1:1:1:1:1) to receive placebo or tropifexor (10, 30, 60 or 90 μg). After the Data Monitoring Committee (DMC) reviewed the Part A data and recommended doses for Part B, randomization to Part B commenced and 121 patients were randomized (5:4:15) to receive placebo, tropifexor 60 μg or tropifexor 90 μg. Randomization into Part C commenced after completion of Part B randomization, and 152 patients were randomized (1:1:1) to receive placebo, tropifexor 140 μg or tropifexor 200 μg. Study medication was administered once daily for 12 weeks in Parts A and B and for 48 weeks in Part C. All patients entered a 4-week follow-up period after receiving the last dose of study treatment.

The study protocol and all amendments were reviewed by the Independent Ethics Committee or Institutional Review Board for each center. The study was conducted according to the principles of the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E6 Guideline for Good Clinical Practice, which have their origin in the Declaration of Helsinki. Written, informed consent was obtained from each patient at screening before any study-specific procedure was performed.

Patient population

The study included male and female patients (≥18 yr) with elevated ALT (males ≥43 U l−1; females ≥28 U l−1), HFF ≥10% at screening (as assessed by MRI-PDFF) and body weight 40–150 kg (patients with ≥4.5 kg weight reduction within the last 6 months before screening were excluded). In Parts A and B, patients were included if they had either histologic evidence of NASH (liver biopsy obtained ≤2 yr before randomization) with fibrosis stage 1, 2 or 3 and no diagnosis of alternative chronic liver diseases, or a phenotypic diagnosis of NASH (elevated ALT (as specified above), T2DM or elevated glycated hemoglobin (HbA1c ≥ 6.5%), and increased BMI (≥27 kg m−2 for non-Asian race; ≥23 kg m−2 for Asian race)). In Part C, only patients with histologic evidence of NASH (liver biopsy obtained during the screening period or within 6 months before randomization) with fibrosis stage 2 or 3 (NASH CRN), and no diagnosis of alternative chronic liver diseases, were included.

Race was self-reported by the patient and captured on the demography electronic case report form.

Key exclusion criteria were previous exposure to any FXR agonist (including tropifexor), current use or history of alcohol consumption (females >20 g d−1; males >30 g d−1) for a period of more than 3 consecutive months within 1 yr before screening, uncontrolled diabetes (HbA1c ≥ 9.5% within the 60 d before enrollment), presence of cirrhosis on liver biopsy or clinical diagnosis, clinical evidence of hepatic decompensation or severe liver impairment, previous diagnosis of other forms of chronic liver disease and contraindication to MRI. Patients were also excluded if they had a history or current diagnosis of electrocardiogram abnormalities indicating safety risk or were pregnant or nursing (lactating) women. Patients were excluded if taking specific medicines unless on a stable dose (within 25% of baseline dose) for at least 1 month before randomization (Parts A and B) or at least 1 month before biopsy to screening (Part C) and expected to remain stable during the treatment period. Specific medicines included anti-diabetic medications, insulin, beta-blockers, thiazide diuretics, fibrates, statins, niacin, ezetimibe, vitamin E (if doses >200 IU d−1; doses >800 IU d−1 were prohibited), thyroid hormone, psychotropic medications, estrogen or estrogen-containing birth control.

Study design rationale and prespecified interim analysis

Four initial tropifexor doses of 10–90 μg were assessed in Part A based on preclinical results and on the safety and pharmacological activity (elevation of FGF19 up to 6 h after dosing) observed in the first-in-human study21. When ≥90% of the patients in Part A completed 8 weeks of treatment, an interim analysis was performed to provide data for DMC review and recommendation of doses for Part B.

Following DMC recommendation, randomization to Part B began with the tropifexor 90-μg dose (found to be safe and efficacious) and the tropifexor 60-μg dose (the next-highest dose). A second analysis was performed after all patients in Part A completed the week 16 visit. A third analysis of complete Part A and B data (pooled) was performed when all patients randomized to Part B completed the end-of-study visit (week 16) or prematurely discontinued the study. An interim analysis of Part C data (fourth planned reporting event) was performed when all patients completed the week 12 visit (time of primary endpoint) or prematurely discontinued the study. The final data analysis was carried out when all patients in Part C completed the week 52 visit.

Part C was introduced based on DMC recommendation to pursue tropifexor doses >90 μg. Randomization into Part C began after completion of Part B randomization. An exploratory exposure–response analysis of the Part A biomarker data (ALT, AST, FGF19 and GGT) at week 8 suggested investigation of area under the curve (AUC) > 40 ng × h ml−1 to better define a maximum biomarker response. An exploratory population pharmacokinetic (popPK) model was built using PK concentration data of tropifexor in healthy volunteers and patients with NASH. The established popPK model was used to simulate PK exposures for tropifexor 90-, 140- and 200-μg doses and to calculate the proportion of patients achieving AUC > 40 ng × h ml−1. The simulation suggested that at tropifexor 90-, 140- and 200-μg doses, approximately 40%, 80% and 95% of patients, respectively, may achieve an AUC > 40 ng × h ml−1. Thus, tropifexor 140 (predicted mean AUC ~60 ng × h ml−1) and 200 μg (predicted mean AUC ~80 ng × h ml−1) were selected for investigation in Part C to assess the therapeutic range and to characterize dose–response.

The timepoint for week 8 interim analysis in Part A and the treatment duration (12 weeks) for Parts A and B were selected based on internal recommendations. This treatment duration was also supported by Good Laboratory Practice toxicology studies (13 weeks). Further longer-term Good Laboratory Practice toxicology studies (26 weeks in rats and 39 weeks in dogs) enabled tropifexor treatment for 48 weeks in Part C to allow for evaluation of histologic endpoints and long-term safety and efficacy.

Randomization and masking

All eligible patients were randomized in a blinded, unbiased manner using Interactive Response Technology (IRT) to one of the treatment arms. The investigator or his/her delegate contacted the IRT after confirming eligibility. A participant randomization list was generated by the IRT using a validated system which automated the random assignment of participant numbers to randomization numbers. These randomization numbers were used to link the participant to a treatment arm and unique medication number. A separate medication list was produced using a validated system which automated the random assignment of medication numbers to packs containing the investigational drug(s).

Randomization in Parts A and B was stratified by BMI (Asian <30 kg m−2 or ≥30 kg m−2; non-Asian <35 kg m−2 or ≥35 kg m−2) at baseline. Randomization in Part B was also stratified by Japanese or non-Japanese origin to ensure all treatment groups were represented in the subset of Japanese patients. In Part C, randomization was stratified by fibrosis stage 2 or 3, presence or absence of T2DM, and by Japanese or non-Japanese origin.
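The BMI stratification rule for Parts A and B can be sketched as follows. This is an illustrative sketch only, not the IRT implementation; the function name and the stratum labels "low" and "high" are hypothetical, and only the cut-offs are taken from the text above.

```python
def bmi_stratum(bmi: float, is_asian: bool) -> str:
    """Return a BMI stratum label for Parts A and B.

    Cut-offs follow the text: 30 kg/m^2 for Asian patients and
    35 kg/m^2 for non-Asian patients. The labels are hypothetical.
    """
    cutoff = 30.0 if is_asian else 35.0  # kg/m^2
    return "low" if bmi < cutoff else "high"


if __name__ == "__main__":
    # Illustrative values only, not study data.
    print(bmi_stratum(31.2, is_asian=True))
    print(bmi_stratum(31.2, is_asian=False))
```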

In this double-blind study, patients, investigator staff, persons performing the assessments, the Novartis clinical trial team and contract research organization (CRO) associates involved with continued direct study site conduct (or delegates) remained blinded to individual treatment allocation from the time of randomization until database lock for each study part (week 16 for Parts A and B and week 52 for Part C). Randomization data were kept strictly confidential until the time of unblinding and were not accessible by anyone involved in the study except for the PK bioanalyst. The identity of treatments was concealed using study drugs that were all identical in packaging, labeling, schedule of administration, appearance, taste and odor. Additional placebo capsules were given in active treatment groups when needed to maintain blinding.

During the first interim analysis (week 8, Part A), the database was locked after ≥90% of patients completed their week 8 assessments. A Novartis pharmacometrician not involved in the clinical conduct of the study and a CRO performing the statistical analysis were unblinded to the week 8 results; this facilitated data review by the DMC. During the second (week 16, Part A) and third interim analyses (week 16, Parts A + B), Novartis and CRO associates involved in data analysis and reporting were unblinded to data. For the week 12 interim analysis of Part C data, Novartis and CRO associates involved in data management, analysis and reporting, and Novartis management, were unblinded, while Novartis and CRO associates (including field associates) involved with continued direct study site conduct, site personnel and patients remained blinded.

Procedures and assessments

Safety assessments included monitoring of AEs and SAEs, with their severity and relationship to study drug. The Medical Dictionary for Regulatory Activities (MedDRA) v.23.0 was used for the reporting of AEs.

Serum samples for the quantification of target engagement markers FGF19 and C4 were collected predose at baseline and at week 12 in Parts A and B, and predose at baseline and at weeks 12, 24, 40 and 48 in Part C. Samples were collected predose and 4 h postdose at week 6 in all parts.

Blood samples for the assessment of liver enzymes (ALT, AST, GGT, ALP) were obtained at screening, baseline and weeks 1, 2, 4, 6, 8, 12 and 16 in all parts; and additionally at weeks 20, 24, 32, 40, 48 and 52 in Part C. Hy’s law criteria (total bilirubin levels >2× upper limit of normal and ALT >3× upper limit of normal)36 were used in the evaluation for drug-induced serious hepatotoxicity. Body weight was also assessed at the same timepoints as liver enzymes in Parts A, B and C. Height was assessed at screening only, and waist/hip circumference at screening and week 12 in all study parts.
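The laboratory component of the Hy's law criteria stated above can be expressed as a simple check. This is a minimal sketch under the assumption that each laboratory supplies its own upper limit of normal (ULN); the function and parameter names are hypothetical, and clinical evaluation of drug-induced hepatotoxicity involves more than these two thresholds.

```python
def meets_hys_law_lab_criteria(
    total_bilirubin: float,
    bilirubin_uln: float,
    alt: float,
    alt_uln: float,
) -> bool:
    """Return True if total bilirubin > 2x ULN and ALT > 3x ULN.

    Each value must be in the same units as its own ULN.
    ULN values are laboratory-specific inputs (an assumption here).
    """
    return total_bilirubin > 2 * bilirubin_uln and alt > 3 * alt_uln


if __name__ == "__main__":
    # Illustrative values only, not study data.
    print(meets_hys_law_lab_criteria(2.5, 1.2, 130.0, 43.0))
```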

Fibroscan was an optional assessment; if sites had equipment available, it was performed at baseline and at week 12 in all parts and at weeks 12, 24 and 48 in Part C. Assessments at end-of-treatment were not performed in the case of premature treatment discontinuation unless the participant had received ≥8 weeks of therapy. Enhanced liver fibrosis panel and fibrosis biomarker tests were performed at screening, baseline and week 12 in all parts, and additionally at weeks 24 and 48 in Part C.

Fasting lipids were measured at screening, baseline and weeks 2, 6, 12 and 16 in Parts A and B; and at screening, baseline and weeks 2, 6, 12, 20, 24, 40, 48 and 52 in Part C. Management of treatment-emergent dyslipidemia was not prespecified in the study protocol.

Blood collection for PK was performed at week 1 (predose and 2 h postdose) and weeks 2, 4, 6, 8 and 12 (predose) in Part A; and at week 2 (predose and 2 h postdose), week 6 (predose and 4 h postdose), and weeks 4, 8 and 12 (postdose) in Part B. In Part C, blood collection for PK was performed for predose and postdose as the last activity of the visit at weeks 12, 24 and 48, and postdose as the last activity of the visit at weeks 6 and 40.

Itch severity and impact of nocturnal itch on sleep were determined on a 10-cm VAS (score range: 0 (no itch at all/no sleep loss) to 10 (the worst imaginable itch/cannot sleep at all)). Assessments were performed at screening (for sleep only), baseline and weeks 6, 12 and 16 in Parts A and B; and at screening (for sleep only), baseline and weeks 2, 6, 12, 24, 48 and 52 in Part C.

Liver MRI scans were acquired at screening and at week 12 in Parts A and B, and at baseline and weeks 12, 24 and 48 in Part C. Week 12 assessment was not done if the participant prematurely discontinued treatment before week 8. All MRI scans were performed locally (on GE, Philips and Siemens at 1.5 T and 3 T; and Hitachi at 1.5 T, whichever was available) and were evaluated by the central MRI laboratory (BioTelemetry Research, Rochester, NY, USA), blinded to the investigator, participant and sponsor until after the completion of study or study part and database lock.

In Part C, liver biopsies were obtained for all patients at baseline and week 48. Biopsies were stained using hematoxylin and eosin and Masson trichrome stains. Biopsy sections were evaluated by the central histopathologist to confirm eligibility before randomization. Paired review of biopsies was performed after all patients’ participation was completed; baseline and week 48 biopsies of each patient were read together, at the same time, by the central histopathologist, blinded to participant identification, treatment and temporal sequence of samples (baseline or week 48). NASH features in the biopsies were graded using the semiquantitative NASH CRN Histologic Scoring System. This scoring system is composed of the NAS to evaluate the key features of NASH (steatosis, lobular inflammation and hepatocellular ballooning), and the fibrosis score to evaluate fibrosis stage37. NAS was used to determine worsening of steatohepatitis. Two methods, diagnostic category (pathologist’s determination of the presence or absence of steatohepatitis) and score-based definition (FDA/EMA)38,39, were used to determine the resolution of steatohepatitis.

In addition to the central pathologist’s assessment, unstained sections of 198 paired liver biopsies (baseline and week 48) from 99 patients (fibrosis stage 2 (n = 42); fibrosis stage 3 (n = 57)) were analyzed using SHG/TPEF microscopy with computer-assisted analyses for quantitative assessment of steatosis (qSteatosis) and liver fibrosis (qFibrosis), blinded to type of treatment, timepoint and the central pathologist’s scoring. qFibrosis is the overall output of quantitative readout of collagen parameters on a linear scale33. The scanning was performed on a Genesis 200, a fully automated, stain-free multiphoton fluorescence imaging microscope with AI algorithms (HistoIndex Pte.), as described previously33,40.

Prespecified study endpoints

The primary endpoints included occurrence of SAEs, AEs resulting in treatment discontinuation and/or dose reductions, AEs of special interest up to end-of-study, changes in ALT and AST from baseline to week 12, and relative change in % HFF from baseline to week 12. Secondary endpoints included changes from baseline to week 12 in body weight, FGF19 and C4 levels, GGT and fasting lipid profile. Occurrence of potential itch was also assessed using VAS as a patient-reported outcome. VAS for sleep disturbance due to nocturnal itch was assessed as an exploratory endpoint. Additional secondary endpoints for Part C included the proportion of patients achieving ≥1 stage improvement in fibrosis (NASH CRN) without worsening of steatohepatitis or resolution of steatohepatitis without worsening of fibrosis at week 48 compared with baseline, changes in ALT and AST levels from baseline to week 48 and relative change in % HFF from baseline to week 48. Exploratory endpoints at week 48 included changes in total NAS and individual components.

Post hoc analyses

Post hoc analyses included (1) assessment of histologic endpoints based on paired (baseline and week 48) review of biopsies, (2) AI-based digital quantitation of steatosis and liver fibrosis (qSteatosis and qFibrosis, respectively) in paired liver biopsies and (3) response rates at week 48 for relative HFF reduction by ≥30%. To analyze changes in liver fibrosis from baseline to week 48, based on the results from the paired reading by the central pathologist and from the AI-based digital quantitation (qFibrosis), patients in the placebo and both tropifexor arms were categorized as Progressor, No Change or Regressor (P/N/R analysis). The qFibrosis results were expressed both on a linear scale and by stage (F0 to F4) using an algorithm based on the blinded scoring of paired biopsies by the pathologist. For the conventional CRN scoring and for qFibrosis by stage, Progression was defined as a fibrosis increase of ≥1 stage from baseline to week 48 and Regression was defined as a fibrosis decrease of ≥1 stage. For qFibrosis on a linear scale, Progression was defined as an increase of ≥1 s.e.m. and Regression was defined as a decrease of ≥1 s.e.m., based on the qFibrosis algorithm. The s.e.m. was determined when developing the qFibrosis algorithm using a cohort of 200 patients with the full spectrum of NAFLD, which included 42 patients with F2 and 57 patients with F3 stage of fibrosis. The s.e.m. for each fibrosis stage, as determined from the algorithm development, was then applied as a predetermined cut-off in qFibrosis assessment on a continuous scale in all subsequent studies (including the present one). The s.e.m. numerical values for F2 and F3 were 0.09 and 0.086, respectively.
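The P/N/R categorization described above can be sketched as follows. This is an illustrative sketch, not the qFibrosis algorithm itself; the function names are hypothetical, and it assumes that the s.e.m. cut-off applied on the linear scale is the one for the patient's baseline fibrosis stage, which the text does not state explicitly.

```python
# s.e.m. cut-offs for F2 and F3, as reported in the text above.
SEM_BY_STAGE = {"F2": 0.09, "F3": 0.086}


def pnr_by_stage(baseline_stage: int, week48_stage: int) -> str:
    """Categorize by stage: a change of >= 1 stage in either direction."""
    change = week48_stage - baseline_stage
    if change >= 1:
        return "Progressor"
    if change <= -1:
        return "Regressor"
    return "No Change"


def pnr_linear(baseline_q: float, week48_q: float, baseline_stage: str) -> str:
    """Categorize on the linear scale: a change of >= 1 s.e.m.

    Assumption: the cut-off is chosen by baseline stage ("F2" or "F3").
    """
    sem = SEM_BY_STAGE[baseline_stage]
    change = week48_q - baseline_q
    if change >= sem:
        return "Progressor"
    if change <= -sem:
        return "Regressor"
    return "No Change"


if __name__ == "__main__":
    # Illustrative values only, not study data.
    print(pnr_by_stage(3, 2))
    print(pnr_linear(2.50, 2.45, "F2"))
```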

Statistical analysis

All participants who received at least one dose of study drug and had at least one postbaseline safety assessment were included in the safety analysis set for the assessment of safety variables. The full analysis set was defined as all participants to whom study treatment had been assigned at randomization and was used for summarizing demographic and baseline characteristics and assessment of efficacy variables. The end-of-study analysis was conducted on all participant data collected up to the end-of-study visit or the premature treatment discontinuation visit.

Analyses were performed using SAS or the R programming language. The primary variables were assessed using descriptive statistics (incidence of AEs and SAEs, overall and by preferred term) and baseline-adjusted mean estimates and pairwise differences with a 95% CI from a repeated measures (in the case of multiple assessments) analysis of covariance (ANCOVA) model (ALT, AST and relative change in % HFF). All LS means are reported by treatment arm and interpretation of the comparison does not include the 95% CI (of the difference) or P value. ANCOVA models included the baseline assessment and treatment as covariates. Repeated measures ANCOVA also included time (visit) and interaction terms of time with baseline assessment and treatment. Baseline assessment, geographical region and BMI group (stratification factor) were included as covariates.

Missing data for ALT and AST were accounted for by using repeated measures ANCOVA (mixed-effects model repeated measures; MMRM), assuming data were missing at random. In the case of dose reduction or treatment discontinuation, any ALT or AST assessments were set to ‘missing’ for all primary efficacy analyses. Missing data for % HFF were imputed using the baseline value for the week 12 analysis. No imputation was applied for the final analysis in Part C, where an MMRM model was used. In the case of treatment discontinuation, HFF assessments obtained >4 weeks after last treatment were set to ‘missing.’
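The baseline imputation used for % HFF in the week 12 analysis can be illustrated with a short sketch. It assumes the conventional definition of relative change, 100 × (week 12 − baseline) / baseline, which is not spelled out in the text; the function name is hypothetical. Under this definition, imputing a missing week 12 value with the baseline value yields a relative change of zero.

```python
from typing import Optional


def relative_hff_change(baseline: float, week12: Optional[float]) -> float:
    """Relative change (%) in HFF from baseline to week 12.

    A missing week 12 value (None) is imputed with the baseline value,
    as described for the week 12 analysis, giving a change of 0%.
    """
    if week12 is None:
        week12 = baseline  # baseline imputation
    return 100.0 * (week12 - baseline) / baseline


if __name__ == "__main__":
    # Illustrative values only, not study data.
    print(relative_hff_change(20.0, 14.0))
    print(relative_hff_change(20.0, None))
```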

Analyses of secondary variables were also based on descriptive statistics, including change from baseline and pairwise differences versus placebo with 95% CI from repeated measures ANCOVA or pairwise ratio versus placebo with 95% CI from ANCOVA (ratio postdose versus predose for FGF19 and ratio postdose versus baseline for C4 at week 6 back-transformed from log scale). All LS means are reported by treatment arm and interpretation of the comparison does not include the 95% CI or P value. Binary biopsy-based endpoints were analyzed using logistic regression, including baseline fibrosis stage and BMI stratification group as covariates. Missing data for the efficacy variables were accounted for by using repeated measures ANCOVA (MMRM), as applicable, assuming data were missing at random. The same statistical methods were used for the paired review of biopsies, and only patients who had both a baseline and an end-of-treatment biopsy were included.

All P values shown are unadjusted for multiple testing and are therefore descriptive only.

The primary objective of the study was to determine a safe dose or dose range. However, the assessment was to be made based on the whole safety profile and not on quantitatively formulated hypotheses for distinct parameters. Therefore, sample size was based on practicability with respect to expected speed of enrollment and duration of the study, and not on formal statistical criteria. The power considerations for efficacy assessment were based on the mean decrease from baseline in ALT seen with obeticholic acid versus placebo at week 12 (−28 (with an s.d. of 48) versus −11 (with an s.d. of 33), respectively)17. With sample sizes of 90 (Parts A + B) and 50 (Part C) in the tropifexor groups, and 40 (Parts A + B) and 50 (Part C) in the placebo group, the power for a t-test to compare both groups (one-sided type I error 0.05) would be 81% for Parts A + B and 78% for Part C.
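A power calculation of this general kind can be sketched with a normal approximation for a one-sided, two-sample comparison of means with unequal variances. This is only one of several possible conventions, and the function name is hypothetical. The reported values may rest on assumptions not stated here (for example, variance pooling or an exact t-distribution), so this sketch is not expected to reproduce the reported 81% and 78% exactly.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(
    mean_diff: float,
    sd1: float,
    n1: int,
    sd2: float,
    n2: int,
    alpha: float = 0.05,
) -> float:
    """Approximate power of a one-sided two-sample test of means.

    Uses a normal approximation with unpooled variances (an assumption).
    """
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return NormalDist().cdf(abs(mean_diff) / se - z_crit)


if __name__ == "__main__":
    # Inputs from the text: mean ALT decrease -28 (s.d. 48) versus -11 (s.d. 33).
    diff = -28 - (-11)
    print(power_two_sample(diff, 48, 90, 33, 40))  # Parts A + B group sizes
    print(power_two_sample(diff, 48, 50, 33, 50))  # Part C group sizes
```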

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.