**A study to test a new drug to treat depression**

Let’s say we have developed a new drug to treat depression (drug-A) and want to prove beyond doubt that it is better than the standard treatment (drug-B). Let’s also say that experts agree that the patients to be studied are those who are newly diagnosed with mild to moderate depression and have not yet received treatment. The experts also agree that the drugs would need to be taken for at least 6 months, at which point the effects of the treatment on depression would be assessed by all patients completing a validated depression score. The one we’ll use in this example is called the HADS (short for Hospital Anxiety & Depression Scale). We’ll also say that the experts agree that the difference in average scores between the treatments would need to be at least a 20-point improvement on the depression scale of the HADS for that difference to be clinically worthwhile.

So, we have the starting point for our study, which is:

- Compare the new treatment (drug-A) to the current standard treatment (drug-B).
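- Patients to study are newly diagnosed people with mild to moderate depression (for this illustration, that means a score of more than 40 but less than 60 on the HADS depression scale; the real HADS depression subscale runs from 0 to 21, so the numbers here are illustrative only).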
- Treatment period is 6 months.
- Main outcome is the depression score using HADS, measured at 6 months.
- A 20-point difference in the average improvement in score on the depression scale between the treatments is the minimum difference that would be clinically worthwhile (this would mean that if patients on drug-B improved by an average of 5 points at 6 months, patients on drug-A would need to improve by an average of at least 25 points for the treatment effect to be clinically worthwhile).

**Estimating the number of patients to study**

This is where the medical statistician comes in. Using the criteria that the experts have agreed above, the statistician can determine the number of patients who would need to be studied to reliably answer the study question. They determine the sample size by doing a power calculation. Let’s say for our example that to reliably determine beyond reasonable doubt that our new treatment (drug-A) is better than the current treatment (drug-B) would require a sample size of 5,000 patients, with 2,500 receiving drug-A and 2,500 receiving drug-B. With this we have the final main criterion for the study, to go with the ones above:

- Number of patients to study is 5,000 patients with 2,500 treated with drug-A and 2,500 drug-B.
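As a rough sketch of what a power calculation involves, the snippet below uses the standard normal-approximation formula for comparing two group means. The standard deviation, significance level and power are assumptions made purely for illustration (none of them is given above), so the output is not meant to reproduce the 5,000 figure used in this example.

```python
import math
from statistics import NormalDist


def patients_per_group(min_difference, sd, alpha=0.05, power=0.90):
    """Approximate number of patients needed in EACH treatment group.

    Normal-approximation formula for a two-sided comparison of two means:
        n = 2 * (z_(1 - alpha/2) + z_power)^2 * (sd / min_difference)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # chance of detecting a real difference
    n = 2 * (z_alpha + z_power) ** 2 * (sd / min_difference) ** 2
    return math.ceil(n)


# Hypothetical inputs: the 20-point minimum difference from the study criteria
# and an ASSUMED standard deviation of 40 points for the change in score.
print(patients_per_group(min_difference=20, sd=40))
```

Because the required number grows with the square of `sd / min_difference`, halving the minimum worthwhile difference roughly quadruples the number of patients needed, which is why the experts have to agree that difference before the study starts.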

The next question to answer is which method to use for the study. We’ll look here at just two: first, a review of the medical records of patients prescribed either drug and, second, a randomised trial.

**Retrospective review of medical records**

Retrospective means to look back in time, which in this example involves the researchers identifying from medical records patients who were newly diagnosed with mild to moderate depression and treated for at least 6 months with either drug-A or drug-B. The patients would also need to have completed a HADS questionnaire before they started treatment and after 6 months of treatment (this is highly unlikely in routine care, but it doesn’t matter for this illustration). The researchers would then, based upon the sample-size calculation, select the medical records of 2,500 patients treated with drug-A and another 2,500 patients treated with drug-B.

When the medical statistician does the analysis, we find that the average improvement in the depression scale for patients treated with drug-A is 30 points, while for the standard treatment (drug-B) the average improvement is only 5 points. This is a difference of 25 points, so it meets the threshold for the difference being clinically important. Better still, the test of statistical significance shows that this difference is highly statistically significant (p<0.01), which means that it is highly unlikely to be caused by chance.

So, are we home and dry? Can we reliably conclude that the new drug is significantly better than the standard treatment?

**From this can we conclude that the new treatment is better than the standard treatment?**

What we can conclude from this retrospective analysis is that patients treated with the new drug-A compared to similar patients treated with the standard drug-B have a clinically and statistically significant larger improvement in their depression score after 6 months of treatment. What we can’t conclude from this analysis is that the difference is CAUSED by drug-A being better than drug-B, which is ultimately the question we are trying to answer. Why is that?

The reason is something called confounding: although the difference observed is statistically significant, which means it is unlikely to be due to chance, it might be due to another factor rather than to the treatments being compared. What might these confounding factors be?

**Known confounders**

Some confounders are well known and generally recorded in medical records. These might be things like age, gender, severity of disease, other conditions or other treatments being taken. Statistical analysis can take care of these known confounders, so that if the difference between the treatments persisted after adjusting for them, it would mean that the effect is not explained by these known confounders. But can we now conclude that the improvement is due to the new drug-A? Unfortunately, no. Some of the known confounders might not be recorded in the medical records, but the main reason is that the difference might be due to unknown confounders.
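One simple way to adjust for a recorded confounder is stratification: compare the drugs separately within each level of the confounder, then combine the results. The sketch below does this on simulated records in which the two drugs work equally well but severely ill patients are rarely given drug-A. All of the numbers, and the single yes/no `severe` flag, are assumptions made up for the illustration.

```python
import random
from statistics import mean


def simulate_records(n=20000, seed=1):
    """Simulated records of (drug, severe, improvement); both drugs work equally well."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        severe = rng.random() < 0.5            # known confounder, recorded in the notes
        p_drug_a = 0.2 if severe else 0.8      # severe patients rarely get drug-A
        drug = "A" if rng.random() < p_drug_a else "B"
        # Severe patients improve less, whichever drug they take.
        improvement = rng.gauss(10, 10) - (10 if severe else 0)
        records.append((drug, severe, improvement))
    return records


def mean_difference(records):
    """Average improvement on drug-A minus average improvement on drug-B."""
    a = [imp for drug, _, imp in records if drug == "A"]
    b = [imp for drug, _, imp in records if drug == "B"]
    return mean(a) - mean(b)


def adjusted_difference(records):
    """Compare A with B within each severity group, then average the two results."""
    strata = [[r for r in records if r[1] == level] for level in (False, True)]
    return mean(mean_difference(stratum) for stratum in strata)


records = simulate_records()
print("unadjusted difference:", round(mean_difference(records), 1))
print("adjusted difference:  ", round(adjusted_difference(records), 1))
```

The unadjusted comparison makes drug-A look better only because it was given to less severely ill patients; comparing like with like within each severity group removes that apparent advantage. The adjustment works here only because severity was recorded.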

**Unknown confounders**

As the name suggests, unknown confounders are factors that are unknown and might be an alternative cause of the observed difference between the treatments. As an illustration, while the baseline level of depression is known and statistical adjustment can rule it out as the cause of the difference, some doctors might more often prescribe the new drug-A to patients who they suspect have a better chance of improvement or who are more likely to take their medication regularly. This won’t be recorded in the medical record, and the doctors might not even be aware of it themselves. It could nevertheless be an alternative reason for the difference seen between the two groups. Ultimately, the problem with this method, and with any non-randomised study, is that you can never conclusively say that the treatment difference, even if it is statistically significant, is caused by the treatments. Enter the randomised controlled trial.
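To make the prescribing example above concrete, here is a small simulation in which the two drugs are identical in their true effect, yet a review of the records still shows drug-A ahead. The hidden trait (whether a patient reliably takes their medication), the prescribing probabilities and the effect sizes are all hypothetical values chosen for the illustration.

```python
import random
from statistics import mean


def observed_difference(n=20000, seed=1):
    """Drug-A minus drug-B average improvement when doctors choose the drug."""
    rng = random.Random(seed)
    improvements = {"A": [], "B": []}
    for _ in range(n):
        # Hidden trait: the patient takes their medication reliably.
        # It is never written down, so no analysis can adjust for it.
        adherent = rng.random() < 0.5
        # Doctors (perhaps unknowingly) favour drug-A for these patients.
        p_drug_a = 0.8 if adherent else 0.2
        drug = "A" if rng.random() < p_drug_a else "B"
        # The drug makes no difference here; only the hidden trait does.
        improvement = rng.gauss(10, 10) + (20 if adherent else 0)
        improvements[drug].append(improvement)
    return mean(improvements["A"]) - mean(improvements["B"])


print(round(observed_difference(), 1))
```

The printed difference is well above zero even though neither drug is better than the other, and with thousands of patients per group it would also pass a test of statistical significance. The difference is real, but its cause is who was given drug-A, not drug-A itself.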

**The unique value of the randomised controlled trial**

So, let’s now repeat the experiment with all of the same conditions as above, but with one important difference: we do the study prospectively, which means that rather than looking back at records we follow people forward in time, and we include randomisation. The randomisation process is the sole factor determining whether a patient receives drug-A or drug-B. If we combine randomisation with blinding, ideally double-blinding, which means that neither the doctor nor the patient knows which treatment the patient is taking, we can rule out the possibility that any difference observed might be due to confounding. Crucially, that includes unknown confounding. So, we do the trial with proper random allocation of treatment and double-blinding and get the same result as we did above: the improvement in depression scores in patients treated with the new drug-A is clinically significant (a 25-point larger average improvement than with drug-B) and statistically significant (p<0.01). But now, from this randomised evidence, we can conclude that this treatment difference is CAUSED by the new drug-A being better at treating depression in these patients than the current treatment (drug-B).
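The effect of random allocation can be sketched with a small simulation. Patients differ in a hidden trait that strongly affects how much they improve, but a coin toss decides who gets which drug. The trait, the effect sizes and the simple coin-toss allocation are assumptions for illustration; blinding is not modelled here.

```python
import random
from statistics import mean


def randomised_trial_difference(true_extra_benefit_of_a, n=20000, seed=1):
    """Drug-A minus drug-B average improvement under random allocation."""
    rng = random.Random(seed)
    improvements = {"A": [], "B": []}
    for _ in range(n):
        adherent = rng.random() < 0.5      # hidden trait, never recorded
        drug = rng.choice(["A", "B"])      # allocation ignores everything about the patient
        improvement = rng.gauss(10, 10) + (20 if adherent else 0)
        if drug == "A":
            improvement += true_extra_benefit_of_a
        improvements[drug].append(improvement)
    return mean(improvements["A"]) - mean(improvements["B"])


# No true difference between the drugs: the observed difference stays close to zero.
print(round(randomised_trial_difference(true_extra_benefit_of_a=0), 1))
# A true 25-point benefit of drug-A: the observed difference is close to 25.
print(round(randomised_trial_difference(true_extra_benefit_of_a=25), 1))
```

Because the coin toss cannot know anything about the patient, the hidden trait ends up spread evenly across the two groups, so whatever difference remains reflects the drugs themselves, apart from chance variation.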

The unique value of the randomised trial is that it is the only method that can directly demonstrate causation, so it is the only really reliable way to assess the effects of treatment.

A simple rule of thumb for any treatment claim made from non-randomised analysis, whether about benefits or side-effects, is to ask: is there an alternative explanation for this difference? There often is.