I want to fit a logistic model to my survey data. It is a small survey of four residential colonies in which only 154 respondents were interviewed. My dependent variable is "satisfactory transition to work". I found that, of the 154 respondents, 73 said that they have satisfactorily transitioned to work, while the rest did not. So the dependent variable is binary in nature and I decided to use logistic regression. I have seven independent variables (three continuous and four nominal). One guideline suggests that there should be 10 cases for each predictor / independent variable (Agresti, 2007). Based on this guideline I feel that it is OK to run logistic regression. Am I right? If not, please let me know how I should decide the number of independent variables.
I have never really understood the rule of thumb that says "10 cases for each predictor" (and unfortunately I don't have access to the book written by Agresti). What I mean is: if I have 100 subjects of which 10 are cases (the 1's) and 90 non-cases (the 0's), then the rule says "include only 1 predictor". But what if I model the 0's instead of the 1's and then I take the reciprocal of the estimated odds ratios? Would I be allowed to include 9 predictors? That makes no sense to me. (Commented Apr 7, 2012 at 10:13)
Dear Andrea, I have said the same thing that you mean. Out of 154 respondents there are 73 cases (the 1's, and the rest 0's). Could you throw some light on my question? Thanks! (Commented Apr 7, 2012 at 15:56)
In a commentary I have read that one has to look at the minimum of the number of events and non-events. So in the example of 10/100 you end up with one predictor irrespective of how you code it. (Commented Apr 8, 2012 at 11:08)
@psj that sounds reasonable. Do you have any references? (Commented Apr 12, 2012 at 7:30)

There are several issues here.
Typically, we want to determine a minimum sample size so as to achieve a minimally acceptable level of statistical power. The sample size required is a function of several factors, primarily the magnitude of the effect you want to be able to differentiate from 0 (or whatever null you are using, but 0 is most common), and the minimum probability of detecting that effect that you are willing to accept. Working from this perspective, sample size is determined by a power analysis.
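To make the power-analysis idea concrete, here is a minimal sketch under a strong simplifying assumption: a single binary predictor, so that the test of its slope is roughly a two-proportion z-test (normal approximation). The proportions 0.4 and 0.6 and the function names are hypothetical, chosen only for illustration; a model with several correlated predictors would more likely need a simulation or dedicated software.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p0, p1: assumed success probabilities in the two groups (hypothetical).
    The small probability of rejecting in the wrong direction is ignored.
    """
    if not (0 < p0 < 1 and 0 < p1 < 1):
        raise ValueError("proportions must lie strictly between 0 and 1")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p0 + p1) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)  # SE under H0
    se_alt = sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n_per_group)  # SE under H1
    return z.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


def min_n_per_group(p0, p1, target_power=0.80, alpha=0.05):
    """Smallest per-group n whose approximate power reaches the target."""
    if p0 == p1:
        raise ValueError("no effect to detect when p0 == p1")
    n = 2
    while power_two_proportions(p0, p1, n, alpha) < target_power:
        n += 1
    return n


# Hypothetical effect: 40% vs 60% satisfied, with 154 respondents split evenly.
print(power_two_proportions(0.4, 0.6, n_per_group=77))
print(min_n_per_group(0.4, 0.6))
```

The point of the sketch is only that the required N falls out of the effect size and the power you want, not out of a fixed cases-per-predictor ratio.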
Another consideration is the stability of your model (as @cbeleites notes). Basically, as the ratio of parameters estimated to the number of data gets close to 1, your model will become saturated, and will necessarily be overfit (unless there is, in fact, no randomness in the system). The 1 to 10 ratio rule of thumb comes from this perspective. Note that having adequate power will generally cover this concern for you, but not vice versa.
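As a rough illustration of how the rule of thumb plays out with the numbers in the question, the sketch below counts the limiting outcome (the smaller of events and non-events, as suggested in the comments) per estimated slope. One assumption to flag loudly: the question does not say how many categories each nominal variable has, so the level counts here are made up. A nominal variable with k levels uses k - 1 dummy slopes, which is why "seven variables" can mean more than seven parameters.

```python
def events_per_parameter(n_events, n_nonevents, n_params):
    """Size of the rarer outcome class divided by the number of slopes."""
    limiting = min(n_events, n_nonevents)
    return limiting / n_params


# Numbers from the question: 154 respondents, 73 of them "yes".
n_yes = 73
n_no = 154 - n_yes

# Three continuous IVs cost one slope each.
# HYPOTHETICAL level counts for the four nominal IVs (not given in the
# question); each contributes k - 1 dummy-coded slopes.
hypothetical_levels = [2, 3, 3, 4]
n_params = 3 + sum(k - 1 for k in hypothetical_levels)

print(n_params, round(events_per_parameter(n_yes, n_no, n_params), 1))
```

With these made-up level counts the model has 11 slopes and about 6.6 limiting-class cases per slope, so even the lenient rule of thumb would not be met.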
The 1 to 10 rule comes from the linear regression world, however, and it's important to recognize that logistic regression has additional complexities. One issue is that logistic regression works best when the proportions of 1's and 0's are approximately 50% / 50% (as @andrea and @psj discuss in the comments above). Another issue to be concerned with is separation. That is, you don't want to have all of your 1's gathered at one extreme of an independent variable (or some combination of them), and all of the 0's at the other extreme. Although this would seem like a good situation, because it would make perfect prediction easy, it actually makes the parameter estimation process blow up. (@Scortchi has an excellent discussion of how to deal with separation in logistic regression here: "How to deal with perfect separation in logistic regression?") With more IV's, separation becomes more likely, even if the true magnitudes of the effects are held constant, and especially if your responses are unbalanced. Thus, you can easily need more than 10 data per IV.
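A quick way to see what separation looks like is to check, one variable at a time, whether a single cut-point splits the 0's from the 1's. This is only a sketch with a hypothetical helper name and toy data: it catches complete separation on a single continuous IV, and it will miss quasi-separation and separation that only appears in a combination of IVs.

```python
def separates(x, y):
    """Return True if one cut-point on x perfectly splits y's 0's and 1's."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("need at least one 0 and one 1 in y")
    # Complete separation: the two groups' ranges of x do not overlap.
    return max(x0) < min(x1) or max(x1) < min(x0)


# Toy data (hypothetical): every 1 sits above every 0 on x.
print(separates([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # True
# Overlapping groups: no single cut-point works.
print(separates([1.0, 3.0, 2.0, 4.0], [0, 0, 1, 1]))  # False
```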
One last issue with that rule of thumb is that it assumes your IV's are orthogonal. This is reasonable for designed experiments, but with observational studies such as yours, your IV's will almost never be even roughly orthogonal. There are strategies for dealing with this situation (e.g., combining or dropping IV's, conducting a principal components analysis first, etc.), but if it isn't addressed (which is common), you will need more data.
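One simple first look at how far the continuous IV's are from orthogonal is to compute their pairwise correlations and flag the large ones. The sketch below is only that: the variable names and values are invented, the 0.7 cut-off is an arbitrary assumption rather than an established threshold, and pairwise correlations can miss dependence that involves three or more variables at once.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlated_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r) for pairs with |r| at or above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# HYPOTHETICAL continuous IVs; the question does not name its variables.
ivs = {
    "age": [25, 30, 35, 40, 45],
    "years_experience": [2, 6, 11, 15, 21],
    "commute_km": [5, 1, 4, 2, 3],
}
for name_a, name_b, r in correlated_pairs(ivs):
    print(name_a, name_b, round(r, 2))
```

Pairs flagged this way are candidates for the combining, dropping, or principal components strategies mentioned above.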
A reasonable question, then, is what your minimum N should be, and/or whether your sample size is sufficient. To address this, I suggest you use the methods @cbeleites discusses; relying on the 1 to 10 rule will be insufficient.