A statistical model is a mathematical representation of the relationships among variables. All models are simplifications of reality; the key question is whether the simplification adds to our knowledge and understanding of the world. As the statistician George Box once said, “all models are wrong, but some models are useful”.

The starting point in developing a statistical model is an idea about the world. One might suspect that boys have fewer friends than girls do; or that G-type stars have more planets than other stars; or that people who have concurrent sexual partners are at greater risk of getting a sexually transmitted disease; or that a certain type of teaching leads to better grades in math among 4th graders. The range of possible ideas is vast.

The next key step is gathering data. If it is impossible to gather data that bear on your idea, then you will not be able to test your model. Ideally, you take a random sample from the population of interest, but this is not always possible. The studies that lead to statistical models come in different types, which can be classified as experimental, quasi-experimental, or observational.

The next step is to decide on the form of your model. Perhaps the most common models are regression models of various types. These all relate a dependent variable to one or more independent variables. So, our dependent variable might be the risk of a disease and our independent variable the number of partners. We may also want to consider control variables, which are variables that we are not directly interested in, but which may be important to include. For example, it is likely that age, sex, access to condoms, and marital status are all related to the risk of a sexually transmitted disease.

If the dependent variable is continuous, then one common model is known as ordinary least squares regression, or simply regression. If it is a dichotomous variable (such as getting a disease or not), then the most common model is logistic regression. Other forms of regression exist for count variables (such as the number of times a person goes to prison), for which Poisson regression is one option, and for nominal variables (such as marital status), for which multinomial logistic regression is one option.
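To make the distinction between these two model forms concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are simulated, the variable names (`partners`, `score`, `disease`) and the coefficients used to generate them are invented, and the two fitting functions are bare-bones versions with a single independent variable. In practice you would normally use a statistics package rather than code like this.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the simulated data are reproducible

# Hypothetical sample: number of partners (0 to 5) for 5,000 people.
partners = [rng.randint(0, 5) for _ in range(5000)]

# Continuous outcome (made up): a risk score that rises by 2 per partner.
score = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in partners]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Dichotomous outcome (made up): log-odds of disease = -2 + 0.5 * partners.
disease = [1 if rng.random() < sigmoid(-2.0 + 0.5 * x) else 0 for x in partners]


def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = sigmoid(a + b * x) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # g: gradient of the log-likelihood; h: entries of the information matrix.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton update: solve the 2x2 system h * step = g by hand.
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


ols_intercept, ols_slope = fit_ols(partners, score)
logit_intercept, logit_slope = fit_logistic(partners, disease)

print(f"OLS:      score = {ols_intercept:.2f} + {ols_slope:.2f} * partners")
print(f"Logistic: log-odds(disease) = {logit_intercept:.2f} + {logit_slope:.2f} * partners")
```

Because the data were simulated from known coefficients, the estimates should land close to the values used to generate them. Note that the two slopes mean different things: the OLS slope is a change in the outcome itself per additional partner, while the logistic slope is a change in the log-odds of the outcome.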

We then fit our model to our sample and see if the model adds to our knowledge. Are the relationships between the independent variables and the dependent variable strong or weak? Are they in the expected direction? Do they help us understand the world? Are we surprised by them? These are critical questions. Too often, we rely on p-values rather than on our judgment and knowledge.
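One way to see why a p-value alone is not enough is the following sketch, again on simulated data. The sample size, the true slope of 0.03, and the variable names are all assumptions chosen for illustration, and the p-value uses a normal approximation, which is reasonable only because the sample is very large.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the simulated data are reproducible

# Made-up data: a very large sample with a very weak true relationship.
n = 100_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.03 * xi + rng.gauss(0, 1) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Ordinary least squares slope and the share of variance it explains.
slope = sxy / sxx
r_squared = sxy ** 2 / (sxx * syy)

# Standard error of the slope from the residual variance.
residual_var = (syy - slope * sxy) / (n - 2)
se_slope = math.sqrt(residual_var / sxx)

# Two-sided p-value for slope = 0, using the normal approximation.
z = slope / se_slope
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"slope = {slope:.4f}, R^2 = {r_squared:.5f}, p-value = {p_value:.2e}")
```

With this setup the p-value should come out far below conventional thresholds while the model explains well under one percent of the variation in `y`. The relationship is real but weak, and whether it matters is a question for judgment and subject knowledge, not for the p-value.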