The histogram in Fig. 3 summarizes all of this nicely, saving us the hassle of scrolling through a large table (but at the cost of making it harder to compute precise percentages). Each of the bars in Fig. 3 represents one possible number of coughs, as indicated on the horizontal axis; the height of each bar is the number of hours during which that particular number of coughs occurred. The two unusual outcomes, 111 and 85, are easy to spot. Thanks to an abundance of data, this histogram has some obvious structure: it is strongly skewed to the right, with small counts much likelier than large counts. But we would need far more data to obtain a smooth histogram without such a jagged profile.
Fig. 3 Continuously monitored coughs by the number of coughs per hour. Each of the bars on the horizontal axis represents one possible number of coughs; the height of each bar is the number of hours during which that particular number of coughs occurred
What would this histogram look like if we had several years of hourly observations? Instead of waiting to collect all that data, just ask a statistician. One look at this histogram is all it takes to recognize an old friend, a famous distribution of counts known as the negative binomial distribution. It also looks like the even more famous Poisson distribution, so some additional calculations are needed to know which one we have here. A Poisson distribution's average and variance are exactly equal, while a negative binomial distribution's variance exceeds its average, making it overdispersed. It turns out that cough counts are almost always overdispersed, as these are.
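The Poisson versus negative binomial check described above comes down to comparing the variance of the counts with their average. The following is a minimal sketch of that comparison; the function name `dispersion_index` and the toy counts are assumptions made for illustration and are not the study data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean.

    A value near 1 is what a Poisson distribution would produce;
    a value well above 1 indicates overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the dispersion index is undefined")
    return variance(counts) / m


# Made-up hourly cough counts, for illustration only (not the study data).
toy_counts = [0, 2, 3, 5, 8, 12, 20, 41]
index = dispersion_index(toy_counts)
print(f"variance / mean = {index:.2f}")
```

A ratio well above 1, as with these toy counts, points toward the negative binomial rather than the Poisson distribution.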
This distribution has a precise mathematical formula [9], but is essentially a giant table that fills in all of the blanks above, providing the theoretical probability of each possible outcome, not just the ones we happen to have observed thus far. Let us add these theoretical percentages and the corresponding theoretically expected counts to our table of observed results to see how they compare (Table 2).
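For readers who want to see the formula, one common way of writing the negative binomial distribution is given below. The symbols r, p, and N are assumptions chosen for this sketch, and the parameterization in [9] may differ.

```latex
% Probability of observing exactly k coughs in an hour,
% with shape r > 0 and probability parameter 0 < p < 1:
P(X = k) = \frac{\Gamma(k + r)}{k!\,\Gamma(r)}\, p^{r} (1 - p)^{k},
\qquad k = 0, 1, 2, \dots

% Mean and variance under this parameterization:
\mathrm{E}[X] = \frac{r(1 - p)}{p},
\qquad
\mathrm{Var}[X] = \frac{r(1 - p)}{p^{2}}

% Expected number of hours with exactly k coughs, out of N monitored hours:
\text{expected count}(k) = N \cdot P(X = k)
```

Because 0 < p < 1, the variance in this form is always larger than the mean, which is the overdispersion discussed above.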
Table 2 Grouping continuously monitored coughs by the number of coughs per hour with the addition of theoretically expected cough counts
As mentioned earlier, we have an issue with missing data. We do not know how to distinguish between hours with 0 coughs and hours without monitoring, but we would expect the number of zeros to be close to the number of ones or twos. While this undoubtedly affects the fit between the observed and expected percentages, they still agree quite well overall. This is even easier to see if we plot the theoretically expected counts along with the histogram of observations (Fig. 4).
Fig. 4 Continuously monitored coughs by the number of coughs per hour with the addition of theoretically expected cough counts
How do we nail down the specific probability distribution shown in blue in Fig. 4, namely, the negative binomial distribution that fits this data? It is determined by two statistics, the average and the standard deviation of this person's coughs per hour, which in this case are 10.55 and 10.47, respectively. We have good estimates of these statistics for this particular person, since we have 895 hourly observations, but getting good estimates in other situations presents significant modeling challenges.
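One way to turn the two statistics above into a specific negative binomial distribution is to match moments, that is, to choose the parameters so that the distribution's mean and variance equal the observed ones. The sketch below assumes this approach and a common (r, p) parameterization; the text does not say which fitting procedure was actually used, and the function names are hypothetical.

```python
import math


def nb_params_from_moments(mean, sd):
    """Match a negative binomial's mean and variance to the given values.

    Assumed parameterization: mean = r(1 - p)/p, variance = r(1 - p)/p**2.
    """
    var = sd ** 2
    if var <= mean:
        raise ValueError("variance must exceed the mean (overdispersion)")
    r = mean ** 2 / (var - mean)
    p = mean / var
    return r, p


def nb_pmf(k, r, p):
    """Probability of exactly k coughs in an hour; r may be non-integer."""
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


# Summary statistics reported in the text.
MEAN, SD, N_HOURS = 10.55, 10.47, 895

r, p = nb_params_from_moments(MEAN, SD)

# Theoretically expected number of hours for the first few cough counts.
expected = {k: N_HOURS * nb_pmf(k, r, p) for k in range(6)}
for k, hours in expected.items():
    print(f"{k} coughs/hour: about {hours:.1f} hours expected")
```

The expected counts produced this way are the kind of values that would sit alongside the observed counts in a comparison such as Table 2, although the published figures may have been computed differently.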