From Btry4790
Problem 1:
a)
mu0
-0.994372 -1.117302
sigma0
[,1] [,2]
[1,] 0.3081188 0.2855377
[2,] 0.2855377 0.8134664
mu1
1.0492281 0.9808596
sigma1
[,1] [,2]
[1,] 0.7782789 0.1968357
[2,] 0.1968357 0.2499694
pi
0.57
b) (see attached)
phi
0.60
lambda
0.93
c) (see attached)
i P(y_i=1|x_i.) P(y_i=1|x_i.)>0.5
[1,] 1.000000e+00 1
[2,] 1.000000e+00 1
[3,] 1.106875e-11 0
[4,] 1.000000e+00 1
[5,] 1.795992e-16 0
[6,] 1.000000e+00 1
[7,] 1.000000e+00 1
[8,] 1.000000e+00 1
[9,] 1.000000e+00 1
[10,] 1.000000e+00 1
[11,] 4.113176e-11 0
[12,] 2.272105e-09 0
[13,] 4.737765e-15 0
[14,] 1.000000e+00 1
[15,] 1.000000e+00 1
[16,] 6.630721e-12 0
[17,] 1.953301e-14 0
[18,] 1.000000e+00 1
[19,] 1.000000e+00 1
[20,] 1.000000e+00 1
[21,] 2.586914e-16 0
[22,] 1.000000e+00 1
[23,] 1.000000e+00 1
[24,] 2.658593e-11 0
[25,] 1.000000e+00 1
[26,] 5.717125e-10 0
[27,] 8.619959e-15 0
[28,] 3.680921e-15 0
[29,] 1.000000e+00 1
[30,] 1.037743e-12 0
[31,] 1.000000e+00 1
[32,] 1.000000e+00 1
[33,] 1.000000e+00 1
[34,] 1.165631e-11 0
[35,] 1.000000e+00 1
[36,] 2.031293e-12 0
[37,] 1.000000e+00 1
[38,] 9.353792e-14 0
[39,] 1.799203e-13 0
[40,] 1.000000e+00 1
[41,] 1.142122e-14 0
[42,] 1.000000e+00 1
[43,] 1.000000e+00 1
[44,] 1.453942e-14 0
[45,] 1.000000e+00 1
[46,] 8.284799e-10 0
[47,] 1.000000e+00 1
[48,] 4.287281e-14 0
[49,] 1.000000e+00 1
[50,] 9.219260e-05 0
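As a rough illustration of where posterior probabilities like those in the table come from, the sketch below applies Bayes' rule to a two-class bivariate Gaussian model using the part (a) estimates. It assumes that pi is the prior probability of class 1, and it ignores the precinct-level parameters phi and lambda, so it is not expected to reproduce the table entries. The function names are hypothetical and are not taken from the attached work.

```python
import math


def gaussian_pdf_2d(x, mu, sigma):
    """Density of a bivariate normal at x; sigma is a 2x2 nested sequence."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_class1(x, mu0, sigma0, mu1, sigma1, pi):
    """P(y = 1 | x) by Bayes' rule, assuming pi = P(y = 1)."""
    w1 = pi * gaussian_pdf_2d(x, mu1, sigma1)
    w0 = (1.0 - pi) * gaussian_pdf_2d(x, mu0, sigma0)
    return w1 / (w0 + w1)


# Estimates copied from Problem 1(a).
MU0 = (-0.994372, -1.117302)
SIGMA0 = ((0.3081188, 0.2855377), (0.2855377, 0.8134664))
MU1 = (1.0492281, 0.9808596)
SIGMA1 = ((0.7782789, 0.1968357), (0.1968357, 0.2499694))
PI = 0.57

# Hypothetical query point (not one of the 50 observations in the table).
x_new = (0.2, 0.1)
p = posterior_class1(x_new, MU0, SIGMA0, MU1, SIGMA1, PI)
print("P(y=1 | x):", p, "classified as", int(p > 0.5))
```

Because the two class means are far apart relative to the covariances, most points fall clearly on one side, which is consistent with the table entries being very close to 0 or 1.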
Problem 2:
a) (see attached)
mu0
-1.046320 -1.028186
sigma0
[,1] [,2]
[1,] 0.3592381 0.3072740
[2,] 0.3072740 0.7519296
mu1
0.9873772 0.9965616
sigma1
[,1] [,2]
[1,] 0.7194950 0.1436320
[2,] 0.1436320 0.3085462
pi
0.5865971
The parameter estimates from EM fit the data better than the estimates from Problem 1
(log-likelihood of -2571.97 versus -2608.54). Unsurprisingly, the likelihood has multiple
maxima, corresponding to the two possible ways of splitting the data into a mixture of two
Gaussians. The first and last initializations converge to the mixture estimated from the original
data subset. The second initialization converges to the alternate way of defining the mixture.
Also, the farther an initialization is from a maximum, the longer the algorithm takes to
converge, and the rate of convergence reflects the local structure of the marginal likelihood
surface (e.g. the steeper the surface, the faster the convergence).
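The behavior described above can be explored with a minimal EM sketch. To keep it short, it fits a one-dimensional two-component mixture rather than the bivariate model of this problem, and it runs on synthetic data, not the homework data. All function names, the data-generating settings, and the two starting points are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, pi, mu0, var0, mu1, var1):
    """Marginal log-likelihood of the two-component mixture."""
    return sum(
        math.log((1.0 - pi) * normal_pdf(x, mu0, var0) + pi * normal_pdf(x, mu1, var1))
        for x in data
    )


def em_step(data, pi, mu0, var0, mu1, var1):
    """One EM iteration; returns the updated (pi, mu0, var0, mu1, var1)."""
    # E-step: responsibility of component 1 for each observation.
    resp = []
    for x in data:
        w1 = pi * normal_pdf(x, mu1, var1)
        w0 = (1.0 - pi) * normal_pdf(x, mu0, var0)
        resp.append(w1 / (w0 + w1))
    # M-step: responsibility-weighted proportion, means and variances.
    n1 = sum(resp)
    n0 = len(data) - n1
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    new_mu0 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
    new_var1 = sum(r * (x - new_mu1) ** 2 for r, x in zip(resp, data)) / n1
    new_var0 = sum((1.0 - r) * (x - new_mu0) ** 2 for r, x in zip(resp, data)) / n0
    return n1 / len(data), new_mu0, new_var0, new_mu1, new_var1


def run_em(data, init, tol=1e-8, max_iter=500):
    """Iterate EM from init until the log-likelihood gain drops below tol."""
    params = init
    trace = [log_likelihood(data, *params)]
    for _ in range(max_iter):
        params = em_step(data, *params)
        trace.append(log_likelihood(data, *params))
        if trace[-1] - trace[-2] < tol:
            break
    return params, trace


# Synthetic data (hypothetical, not the homework data set).
rng = random.Random(0)
data = [rng.gauss(-1.0, 0.6) for _ in range(100)] + [rng.gauss(1.0, 0.6) for _ in range(100)]

# Two mirrored initializations, each given as (pi, mu0, var0, mu1, var1).
fit_a, trace_a = run_em(data, (0.5, -0.5, 1.0, 0.5, 1.0))
fit_b, trace_b = run_em(data, (0.5, 0.5, 1.0, -0.5, 1.0))

print("final log-likelihoods:", trace_a[-1], trace_b[-1])
print("iterations:", len(trace_a) - 1, len(trace_b) - 1)
print("fit A:", fit_a)
print("fit B:", fit_b)
```

Under these assumptions the two runs should end at essentially the same log-likelihood with the component labels exchanged. The length of each trace gives the number of iterations, so trying starting points farther from the fitted values is one way to examine how convergence time depends on the initialization.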
b) (see attached)
c)
mu0
-0.9937212 -0.9347667
sigma0
[,1] [,2]
[1,] 0.4077858 0.3670646
[2,] 0.3670646 0.8593127
mu1
1.044355 1.020972
sigma1
[,1] [,2]
[1,] 0.6616135 0.1175831
[2,] 0.1175831 0.2945752
lambda
0.876958
phi
0.5600008
The parameter estimates from EM for model B fit the data better than the estimates from Problem 1
(log-likelihood of -2348.566 versus -2407.701). As in the first model, the mixture can be
parameterized in more than one way, each defining the same distribution. We see this because
one of the random initializations converges to the alternate parameterization of the lambdas
(i.e. exchanging the roles of lambda and 1-lambda).
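The claim that different labelings define equivalent distributions can be checked numerically for the first model. The sketch below evaluates the two-Gaussian mixture density at a few points using the Problem 2(a) estimates, once as given and once with the two components exchanged and pi replaced by 1-pi. It assumes that pi is the weight on component 1; the function names and the evaluation points are hypothetical. It does not attempt model B, whose exact form is only given in the attached work.

```python
import math


def gaussian_pdf_2d(x, mu, sigma):
    """Density of a bivariate normal at x; sigma is a 2x2 nested sequence."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_density(x, pi, mu0, sigma0, mu1, sigma1):
    """Mixture density with weight 1 - pi on component 0 and pi on component 1."""
    return (1.0 - pi) * gaussian_pdf_2d(x, mu0, sigma0) + pi * gaussian_pdf_2d(x, mu1, sigma1)


# Estimates copied from Problem 2(a).
MU0 = (-1.046320, -1.028186)
SIGMA0 = ((0.3592381, 0.3072740), (0.3072740, 0.7519296))
MU1 = (0.9873772, 0.9965616)
SIGMA1 = ((0.7194950, 0.1436320), (0.1436320, 0.3085462))
PI = 0.5865971

# Hypothetical evaluation points.
points = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, -0.5)]

for x in points:
    original = mixture_density(x, PI, MU0, SIGMA0, MU1, SIGMA1)
    relabeled = mixture_density(x, 1.0 - PI, MU1, SIGMA1, MU0, SIGMA0)
    print(x, original, relabeled)
```

The two columns agree up to floating-point rounding, so the likelihood cannot distinguish the two labelings, and which one EM reaches depends only on the starting point.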
d) (see attached)
i P(y_i=1|x_i.) P(y_i=1|x_i.)>0.5
[1,] 1.000000e+00 1
[2,] 1.000000e+00 1
[3,] 5.820044e-10 0
[4,] 1.000000e+00 1
[5,] 8.265355e-13 0
[6,] 1.000000e+00 1
[7,] 1.000000e+00 1
[8,] 1.000000e+00 1
[9,] 1.000000e+00 1
[10,] 1.000000e+00 1
[11,] 1.493134e-09 0
[12,] 7.169165e-09 0
[13,] 3.154954e-12 0
[14,] 1.000000e+00 1
[15,] 9.999999e-01 1
[16,] 1.318969e-10 0
[17,] 5.924521e-12 0
[18,] 1.000000e+00 1
[19,] 1.000000e+00 1
[20,] 1.000000e+00 1
[21,] 3.321282e-13 0
[22,] 1.000000e+00 1
[23,] 9.999999e-01 1
[24,] 1.373224e-09 0
[25,] 1.000000e+00 1
[26,] 5.630016e-09 0
[27,] 5.358217e-13 0
[28,] 7.736868e-12 0
[29,] 1.000000e+00 1
[30,] 7.704926e-11 0
[31,] 1.000000e+00 1
[32,] 1.000000e+00 1
[33,] 1.000000e+00 1
[34,] 8.445471e-10 0
[35,] 1.000000e+00 1
[36,] 8.327434e-11 0
[37,] 1.000000e+00 1
[38,] 2.731032e-11 0
[39,] 3.240978e-11 0
[40,] 1.000000e+00 1
[41,] 5.551307e-12 0
[42,] 1.000000e+00 1
[43,] 1.000000e+00 1
[44,] 5.380022e-12 0
[45,] 1.000000e+00 1
[46,] 4.025614e-09 0
[47,] 1.000000e+00 1
[48,] 3.270664e-11 0
[49,] 1.000000e+00 1
[50,] 3.679760e-05 0
Problem 3:
a) Model one has 11 free parameters. Hence AIC1 = 2*11 - 2*(-2571.973) = 22 + 5143.946 = 5165.946
Model two has 12 free parameters. Hence AIC2 = 2*12 - 2*(-2348.566) = 24 + 4697.132 = 4721.132
Model two has the lower AIC (by 444.814), so it improves the likelihood more than enough to warrant the additional model complexity.
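The AIC arithmetic above can be reproduced with a few lines of Python, using AIC = 2k - 2 log L with the parameter counts and maximized log-likelihoods reported in this solution. The function name is a hypothetical helper introduced here.

```python
def aic(num_params, log_likelihood):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * num_params - 2 * log_likelihood


# Parameter counts and log-likelihoods as reported in this solution.
aic1 = aic(11, -2571.973)
aic2 = aic(12, -2348.566)

print("AIC1:", round(aic1, 3))
print("AIC2:", round(aic2, 3))
print("difference (AIC1 - AIC2):", round(aic1 - aic2, 3))
```

The extra parameter in model two costs 2 AIC units, which is small next to the gain in log-likelihood.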
b) Model two explicitly models the overall behavior of each precinct. Hence, if a voter's preference
is closely related to the behavior of their precinct, this model will reduce the bias
in the estimate of voter preference, although the parameter estimates will have greater variance
(since more parameters are being estimated). Conversely, if voter preference is not related
to the behavior of the precinct, the additional model complexity will only increase the variance
of the parameter estimates without improving model fit (i.e. without reducing the bias in the
estimation of individual voter preference).
AS comment: I would put this slightly differently. Model 2 allows
information about party preference to be pooled within precincts, which
will tend to reduce the variance in the estimates of the zij (we will tend
to believe more strongly that each zij is 0 or 1), but possibly at the cost
of a bias if the model is misspecified (we will be more certain but also
more wrong!). This phenomenon is sometimes called "shrinkage".
It is true that the additional parameter will tend to increase variance in
both parameter estimates and in the zijs, especially if the model does not
fit the data well, but I think this will have a secondary effect on the
zijs compared with shrinkage.
One way to think about shrinkage in this case is to consider respondents
near the boundary between the two classes. Under model 2, these
individuals will tend to be "pulled" into the class in which most others
in their precinct fall, even if they would fall in the other class under
model 1. If precinct "coherence" is high, most of the time this will be
the right thing to do, but if coherence is low it may be the wrong thing to
do.
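The "pulling" effect described in the comment can be illustrated with a deliberately simplified calculation. The sketch below is not model 2 itself: it only replaces a single global class prior with a precinct-specific one and applies Bayes' rule in odds form. Every number in it (the likelihood ratio of 0.6 for a borderline respondent, the global prior of 0.57, and the precinct priors of 0.9 and 0.1) is a hypothetical choice made for illustration, as is the function name.

```python
def posterior_from_prior(prior, likelihood_ratio):
    """P(class 1 | x) from the prior P(class 1) and the ratio f1(x) / f0(x)."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


# A borderline respondent whose data slightly favor class 0 (hypothetical value).
likelihood_ratio = 0.6

# Without pooling: one global prior for everyone (hypothetical value).
p_global = posterior_from_prior(0.57, likelihood_ratio)

# With pooling: the prior reflects how most others in the precinct are classified.
p_class1_precinct = posterior_from_prior(0.9, likelihood_ratio)
p_class0_precinct = posterior_from_prior(0.1, likelihood_ratio)

print("global prior:          ", p_global, int(p_global > 0.5))
print("mostly class 1 precinct:", p_class1_precinct, int(p_class1_precinct > 0.5))
print("mostly class 0 precinct:", p_class0_precinct, int(p_class0_precinct > 0.5))
```

With these numbers the respondent falls in class 0 under the global prior but is pulled into class 1 in a precinct dominated by class 1, and the posterior is also further from 0.5 in both precinct cases. Whether that extra certainty is correct depends on how coherent the precinct really is.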