Project 2: Simulation and p value



18:04 Thursday 27th October, 2016
See updates and corrections on the Canvas course site.

STAT35000, Fall 2016, Section 21325
Due: November 14 (Monday), 2016

Abstract

This is the second R project. It is about sampling distributions in R and the calculation of the so-called p value. For the Bernoulli distribution $X \sim \mathrm{Ber}(p) = \mathrm{Bin}(1, p)$, work out the following problems:

1. Sample 50 points from $X \sim \mathrm{Ber}(p_0 = 0.2)$.

2. Calculate the sample proportion $\hat{p}^*$ of "1"s using the sample you obtained. Caution: your result will likely not agree with someone else's. Why not?

3. Do you think your estimate is reasonably good? Explain.

4. Pretend that you forgot the probability of success $p_0$ that you used to generate the above sample of size 50. (You know how you sometimes forget the password of an account you set up only yesterday!) Your guess now is 0.4. Here are two ways to decide whether this guess is good. Which one is better?

   (a) Compare the sample proportion you got in part 2 with your guess 0.4. If they are reasonably close, you will probably adopt your guess 0.4. Write down how you would judge the closeness.

   (b) Here is the other procedure. The logic is to ask: if $p_0 = 0.4$, what is supposed to happen? If something strange has happened, then you should doubt your guess; if whatever happened is reasonable under $p_0 = 0.4$, then there is no reason to reject your guess. Here is the implementation. Use $p_0 = 0.4$ to generate $N = 10000$ samples from $X \sim \mathrm{Ber}(p_0 = 0.4)$, each with sample size $n = 50$. Calculate the sample proportion of successes for each of these samples using R, and denote the results by $\{\hat{p}^{k}_{50},\ k = 1, 2, \cdots, N = 10000\}$. Then plot the histogram of $\{\hat{p}^{k}_{50},\ k = 1, 2, \cdots, N = 10000\}$ to see the distribution of the random variable sample proportion $\hat{p}_{50}$ under the assumption that the true $p_0 = 0.4$. If the observed $\hat{p}^*$ (from part 2) is not in the extreme region of the distribution, you can adopt your guess 0.4. In particular, approximate the probability that $\hat{p}_{50} < \hat{p}^*$ by the relative frequency of the event $\{\hat{p}^{k}_{50} < \hat{p}^*,\ k = 1, 2, \cdots, N = 10000\}$. This probability is related to an important concept in statistics, the p value!

5. In 4(b), we are actually using simulation to approximate the probability that $\hat{p}_{50} = (X_1 + X_2 + \cdots + X_{50})/50 = T/50 < \hat{p}^*$, given that $\{X_i,\ i = 1, 2, \cdots, 50\}$ are independent and identically distributed as $X \sim \mathrm{Ber}(p_0 = 0.4)$. But you know the distribution of $T$! Use it (and a TI-84 or R) to find the exact probability.

6. In fact, you can approximate the exact probability in 5 without simulation! Remember that the central limit theorem applies since $n = 50 > 30$. It says:

   $$\hat{p}_{50} = \frac{X_1 + X_2 + \cdots + X_{50}}{50} \overset{\text{approx.}}{\sim} N\left(\mu, \sigma^2/50\right)$$

   with $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$ for $X \sim \mathrm{Ber}(p_0 = 0.4)$. Using this CLT approximation, calculate the probability.

7. Compare the probability that $\hat{p}_{50} < \hat{p}^*$ obtained in the above three ways (simulation approximation, exact, CLT approximation). Which approximation is better? In fact, you can improve the CLT approximation by using a continuity correction. See the online material at https:// for details. Calculate the continuity-corrected CLT approximation probability. Now which approximation (simulation or continuity-corrected CLT) is better?
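
One possible way to set up steps 1 to 7 in R is sketched below. It is an illustration under stated assumptions, not an official solution. The seed and all variable names are arbitrary choices that do not appear in the assignment. The sketch also compares counts $T = 50\,\hat{p}_{50}$ instead of proportions, which describes the same event but keeps the strict inequality from being affected by floating-point rounding.

```r
# Minimal sketch of steps 1-7; seed and variable names are assumptions.
set.seed(35000)   # arbitrary seed, only so that reruns give the same numbers

n <- 50           # sample size
N <- 10000        # number of simulated samples in step 4(b)

# Steps 1-2: one observed sample from Ber(0.2) and its sample proportion
x_obs  <- rbinom(n, size = 1, prob = 0.2)
t_star <- sum(x_obs)      # observed number of successes
p_star <- t_star / n      # observed sample proportion

# Step 4(b): N samples of size n under the guess p0 = 0.4.
# The sum of n Ber(p0) draws is one Bin(n, p0) draw, so rbinom() with
# size = n produces the success count of each simulated sample directly.
p0        <- 0.4
t_sim     <- rbinom(N, size = n, prob = p0)
p_hat_sim <- t_sim / n

hist(p_hat_sim, breaks = 30,
     main = "Simulated sample proportions under p0 = 0.4",
     xlab = "sample proportion")
abline(v = p_star, lty = 2)   # mark the observed proportion

# Relative frequency of {p_hat_50 < p_star}, computed on integer counts
prob_sim <- mean(t_sim < t_star)

# Step 5: exact probability, P(T < t_star) = P(T <= t_star - 1)
prob_exact <- pbinom(t_star - 1, size = n, prob = p0)

# Step 6: plain CLT approximation, p_hat_50 ~ N(p0, p0 * (1 - p0) / n)
se       <- sqrt(p0 * (1 - p0) / n)
prob_clt <- pnorm(p_star, mean = p0, sd = se)

# Step 7: continuity correction, P(T <= t_star - 1) ~ P(Normal <= t_star - 0.5)
prob_clt_cc <- pnorm((t_star - 0.5) / n, mean = p0, sd = se)

print(c(simulation = prob_sim, exact = prob_exact,
        clt = prob_clt, clt_cc = prob_clt_cc))
```

Because the observed sample is random, the four printed numbers change with the seed. The comparison asked for in step 7 is about how close each approximation comes to the exact value, not about any particular number.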

