
Homework #4

Due 4/22/2020, 11:59pm (but flexible)

4/14/2020

Front Matter

As with HW3, you will write your task code and answer the questions in a separate template file.

As you write your code, run your work “from the top” when you finish each chunk. This will ensure that any errors in rendering are caught before you move on. The top right corner of each gray chunk has a button that will run all of your code from the beginning up to that chunk, and another button that will run that chunk. Use these buttons.

Question 1: Instrumental Variables

In the previous homework, we used 2SLS to estimate a model with an instrumental variable. We used constructed data where we knew all the parts, including unobserved errors and parameters, and saw that 2SLS got us an estimate that was close to the true value, while naive OLS did not.

In this exercise, we will briefly use the AER package’s ivreg to analyze the same data.

Task 1.1 - Setup (4 points)

1. (2 points) Use require(...) to load the wooldridge package, the AER package, and the lmtest and sandwich packages for robust standard errors. Do not re-install the packages (unless you are working from a new computer). Remember, you never include install.packages(...) in your code chunks. A minimal sketch of this chunk follows this list.

2. (2 points) Save the template with your name (LastFirst) in an appropriately named folder on your drive. This is good file management and is super important to keeping your work organized. Note: this is not something to write into your code chunks.
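A minimal sketch of the setup chunk, using only the packages named above (it assumes they are already installed):

require(wooldridge)  # textbook data sets used in the course
require(AER)         # provides ivreg()
require(lmtest)      # coeftest() for robust inference
require(sandwich)    # vcovHC() robust variance estimators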

Questions 1.1 (2 points)

1. (2 points) Where on your computer are you saving your .Rmd file? Does EC420 have its own folder?

Task 1.2 - Loading data (0 points)

I have included the data construction in your template already. Do not change the data. It is identical to HW3.

Questions 1.2 (6 points)

1. (2 points) How many observations does the data have?

2. (2 points) Refreshing your memory from HW3, what variable is the outcome variable? What is the variable of interest? Which variables are endogenous?

3. (2 points) Which variables were our instruments?

4. (2 points) What was your 2SLS estimate from the final sections of HW3?

5. (1 point) What was the true parameter value for βD in HW3? This can also be inferred from the data construction in Task 1.2.


Task 1.3 - Estimating using ivreg (15 points)

An important R skill is being able to figure out the syntax of an R function. We will use the AER package’s ivreg function to estimate the same instrumental variables model as HW3. Use ?ivreg directly in your console (not in your code chunk) to see the syntax for the function. This will tell you how to specify your formula, and what other inputs the ivreg function needs. If you did not follow Task 1.1 and did not require(...) the AER package, then you will not see anything when you type ?ivreg.

1. (3 points) The first input ivreg needs is the formula. We can input our formula and save it as an R object (which we can then input to the call of ivreg). To do this, simply use as.formula(y ~ x + ... | z + ...). You don’t need to put quotations around the formula when you code it up. It is up to you to figure out how to specify the endogenous and instrumental variables in the formula. See the arguments section of the ivreg help for instructions, and then look at your data to see what to put in which place. Use the "recommended" three-part formula format from the help.

Remember that we included our exogenous variables X1, X2 in both stages of our 2SLS in HW3. This is because our exogenous variables "instrument for themselves". That means we specify them as instruments as well. Keep this in mind when writing your formula.

2. (12 points) Run the ivreg command using your formula and the P2 data.frame. Use robust standard errors by wrapping the command in coeftest as you did in HW3. A sketch of both steps follows this list.
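For concreteness, a sketch of what this might look like. Here the outcome name y is a placeholder (check your HW3 data for the actual column name); D, Z, X1, and X2 are the variable names referenced in the questions above:

myFormula = as.formula(y ~ D + X1 + X2 | Z + X1 + X2)  # X1, X2 instrument for themselves
iv.est = ivreg(myFormula, data = P2)
coeftest(iv.est, vcov = vcovHC(iv.est, type = "HC1"))  # HC-robust standard errors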

Question 1.3 - Interpreting ivreg (15 points)

1. (4 points) What is your estimate for βD, the coefficient of interest?

2. (3 points) What is the interpretation of the coefficient βD?

3. (3 points) If D is our treatment variable, what type of treatment effect does βD represent? Is it the average treatment effect (ATE)?

4. (5 points) We had a way of establishing whether or not our model met the relevant first stage requirement (see our notes on Instrumental Variables and 2SLS). What is the criterion (hint: it has to do with an F-test)?

Task 1.4 - Testing (10 points)

1. (5 points) To test the relevant first stage assumption, we will use lm(...) to regress D on Z, the first

stage of our 2SLS (leaving aside the exogenous variables). This will tell us whether or not Z has an

effect on D. Run this simple regression.

2. (5 points) Naturally, ivreg has the ability to output some important tests, including one for the relevant

first stage. We can get this by using summarize(ivreg(myFormula, data=P2)) (robust standard errors

are not necessary here as we won’t be looking at the standard errors of the coefficients).
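Putting both pieces together, a sketch (again treating D and Z as the column names in P2):

first.stage = lm(D ~ Z, data = P2)  # first-stage regression of D on Z
summary(first.stage)                # look at the t-stat on Z and the F-statistic
summary(ivreg(myFormula, data = P2), diagnostics = TRUE)  # reports a "Weak instruments" test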

Question 1.4 - Testing (10 points)

1. (4 points) Using the output from Task 1.4.1’s lm(...), the first-stage regression of D on Z, what can we say about the relevant first stage assumption based on these results? If our instruments are not relevant to the endogenous variable D, we say we have "weak instruments". Do we have weak instruments?

2. (4 points) The output from summary(ivreg(...), diagnostics=TRUE) gives us some additional statistics, including one which tells us about the answer to the previous question. What is the value here, and what does it tell us about our instruments’ relevant first stage?

3. (2 points) Are the results from Task 1.4.1 using lm(...) and the diagnostic result from summary(ivreg(...), diagnostics=TRUE) similar?


Question 2 - Difference-in-Differences

Here we will employ our Difference-in-Differences estimator and compare it to some “naive” estimates. We will first do a little data manipulation and get some practice in R with merging and creating variables.

Task 2.1 (18 points)

1. (1 point) Install (by typing install.packages("...") directly in the console) the zoo and lubridate packages. These are excellent packages for working with time-indexed values like dates and times. They will help us move between date and month.

2. (1 point) Add code to your chunk for this task that uses require(...) to load the zoo and lubridate packages. Also load the lmtest and sandwich packages, as you’ll be using HC-robust errors (they are already installed from the last homework, unless you’ve changed computers).

3. (0 points) We will use the hw4.csv file posted on my GitHub. The line that loads the .csv directly from GitHub is already in your template. Note that you can load data directly from a .csv on the web. Handy!

This is solar installation data from California. Each observation is aggregated publicly-available data about the total quarterly solar watts installed in a city (cecptrating), the sum total cost of all installations in that city-quarter (totalcost), the average incentive received (incentiverate), the average base cost of electricity in that city and quarter (PWRPRICE), the location (city), and our outcome variable of interest: watts installed per owner-occupied household (WPOOH). You will be doing a similar analysis as the Kirkpatrick and Bennear paper we read, but will not be using the exact same data nor get the exact same answer.

4. (6 points) We will use the as.yearqtr(...) function from zoo to manage the time variable, yq. This is quite easy - just use hw4$yq = as.yearqtr(hw4$yq). The column will now be recognizable to R as a time series.

5. (2 points) Try this: Look at head(hw4$yq) and head(hw4$yq + .75). Do you see what this does? Don’t overwrite your yq column with this, just take a look at how R manages adding time periods.

6. (2 points) Use table(...) on the new column called yq to see how many installations we observed in each quarter.

7. (4 points) Choose one city in the data and make a new data.frame with just that city by subsetting. Use plot(...) to plot WPOOH on the Y-axis and yq on the X-axis.

8. (2 points) We often want to see a simple line of best fit. This is easy to add after we have plotted our points. On the next line immediately after your plot(...) command, add an abline with a lm(...) call like this (a sketch of steps 4-8 follows this list):

abline(lm(WPOOH ~ yq, mySubsetData), col="gray50")
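A sketch of steps 4-8 in one place; the city name "Fresno" is only a placeholder, so substitute any city that actually appears in your data:

hw4$yq = as.yearqtr(hw4$yq)                 # step 4: convert yq to a zoo year-quarter
head(hw4$yq); head(hw4$yq + .75)            # step 5: adding .75 moves three quarters ahead
table(hw4$yq)                               # step 6: observations per quarter
mySubsetData = hw4[hw4$city == "Fresno", ]  # step 7: placeholder city
plot(WPOOH ~ yq, data = mySubsetData)
abline(lm(WPOOH ~ yq, mySubsetData), col = "gray50")  # step 8: line of best fit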

Question 2.1 (8 points)

• (1 point) Do all quarters have the same number of cities observed?

• (1 point) What is the mean value for total watts installed, cecptrating, in the data?

• (2 points) What is T, the number of time periods? (Hint: unique(...) returns a vector of all of the unique values of a column.)

• (2 points) What is N, the number of cities? (Same hint: unique(...) returns a vector of all of the unique values of a column.)

• (2 points) What city did you choose for your plot? Does there seem to be a time trend in the outcome variable, WPOOH?
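Possible one-liners for these questions (a sketch; table(...) and unique(...) do the heavy lifting):

table(hw4$yq)                 # cities observed in each quarter
mean(hw4$cecptrating)         # mean total watts installed
length(unique(hw4$yq))        # T, the number of time periods
length(unique(hw4$city))      # N, the number of cities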


Task 2.2 (11 points)

The treatment we are interested in, PACE, is not in our data. We have to merge in information about each city’s treatment status by quarter. Merging data in R is not terribly hard. You just have to have the data you’d like to merge in another R data.frame and know which field(s) are the key fields. “Key” fields are the fields that will be matched up - here, it will be the city. The TREATMENT data has the city, the county, and the date that treatment (PACE) started, if any.

1. (2 points) We will merge in some TREATMENT data on each city in the data. It is located on my GitHub as well and the code is in your template. Use the same read.csv function as in Task 2.1 to load this in. Call the R object holding this data TREATMENT.

2. (2 points) Since city is the "key" field in TREATMENT, we need to check that it is unique. The R function duplicated(...) tells us which values in whatever column we give it are duplicated. If we sum(duplicated(...)) we can see how many duplicated values there are. Use duplicated(...) and sum(...) to see that TREATMENT$city has no duplicates. Duplicate values in the merge will multiply the number of rows in our data, which would be bad!

3. (5 points) Now, we want to merge the city-level data in TREATMENT to the city-quarter data in hw4. We will use merge(...) for this and call the new object PAN.merged (PAN is for PANel). merge takes the following function inputs:

(a) x = hw4 (the data you start with)

(b) y = TREATMENT (the data being merged)

(c) by = c('city') (the "keys" on which we are merging)

(d) all = F (this tells R to do an "inner join", which keeps only those cities that are in both hw4 and TREATMENT)

Put it all together: PAN.merged = merge(x = hw4, y = TREATMENT, by = c('city'), all=F)

4. (2 points) Use names(PAN.merged) to show the names of the columns we now have merged and NROW(PAN.merged) to see how many observations we have. A sketch of these steps follows this list.
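A sketch, assuming TREATMENT has already been read in via the template’s read.csv line:

sum(duplicated(TREATMENT$city))   # should be 0: city is a valid key
PAN.merged = merge(x = hw4, y = TREATMENT, by = c('city'), all = F)
names(PAN.merged)                 # columns after the merge
NROW(PAN.merged)                  # number of observations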

Question 2.2 (4 points)

1. (2 points) How many observations do you have now? Hint: it should be 2,250.

2. (2 points) View() the data to see what we have. Is this panel, time series, or cross-sectional data? Why?

Task 2.3 (16 points)

The last thing we need is to create our time-varying treatment variable from the treatment start date in the data.

1. (2 points) We are going to set the PACE variable based on whether or not yq>=PACE.start. PACE.start is NA for all untreated cities, but contains the date of PACE start for the treated cities. First, convert PAN.merged$PACE.start to an R-recognizable date. Do this by using PAN.merged$PACE.startdate = ymd(PAN.merged$PACE.start).

2. (2 points) Next convert PAN.merged$PACE.startdate to a year-quarter using as.yearqtr(...) as before.

3. (4 points) Create a column in PAN.merged called PACE that is TRUE if yq>=PACE.startdate and FALSE otherwise.

4. (2 points) R has trouble comparing anything to an NA. It’s likely that your PAN.merged$PACE column is a lot of TRUE and NA. Using a subset, assign a value of FALSE to any PACE that is NA:

• PAN.merged[is.na(PAN.merged$PACE), "PACE"] = FALSE

• Note that we don’t have to assign this to a new object or column - we are updating the NA’s in PAN.merged "in place".

• When we compare things using <, >, or ==, we get a type of data that is called a logical. R recognizes the words FALSE and TRUE as the two values of a logical object.

Now that we have our time-varying treatment variable, PACE, we need a non-time-varying indicator for all the treated cities. Just like in lecture, we will call this TMT and it will be TRUE for all cities that are ever treated. We will use an R shortcut function %in% for this.

The shortcut %in% takes whatever is to the left of it and tells you, item by item, whether it is anywhere in the thing on the right. So c(1,2,3) %in% c(2,3,4) will return FALSE,TRUE,TRUE because the last two entries of c(1,2,3), namely 2 and 3, appear in c(2,3,4), while 1 does not.

5. (3 points) Make a new object: treated.cities = PAN.merged[PAN.merged$PACE==T,"city"]. This will be a vector of all of the treatment cities. Then, make a new column: PAN.merged$TMT = PAN.merged$city %in% treated.cities.

The new column in PAN.merged will be TRUE if the city is ever treated, and FALSE otherwise.

6. (3 points) We should probably compare the before-treatment levels of the outcome variable between the two groups. We can use a boxplot to compare the distributions (a sketch of the whole task follows this list):

• boxplot(WPOOH ~ TMT, data = PAN.merged[PAN.merged$PACE==F,])
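A sketch of the whole task, assuming zoo and lubridate are loaded and PACE.start is stored in a year-month-day format (which the task’s ymd(...) call implies):

PAN.merged$PACE.startdate = as.yearqtr(ymd(PAN.merged$PACE.start))  # steps 1-2
PAN.merged$PACE = PAN.merged$yq >= PAN.merged$PACE.startdate        # step 3: TRUE/FALSE/NA
PAN.merged[is.na(PAN.merged$PACE), "PACE"] = FALSE                  # step 4: NA -> FALSE
treated.cities = PAN.merged[PAN.merged$PACE == T, "city"]           # step 5: ever-treated cities
PAN.merged$TMT = PAN.merged$city %in% treated.cities
boxplot(WPOOH ~ TMT, data = PAN.merged[PAN.merged$PACE == F, ])     # step 6: pre-treatment comparison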

Question 2.3 (4 points)

1. (2 points) Use table(...) to see how many treated city-quarter observations we have.

2. (2 points) The boxplot shows the median (the horizontal bar) and the 25th-75th percentile (the edges of the box) for the variable WPOOH for each group before PACE starts. The dots are the “outliers”. From this boxplot, does it look like there is a systematic difference before treatment between the treatment and the control?

Task 2.4 (5 points)

Let’s run some regressions. I’m not going to write out each regression. It’s up to you to construct the right regression formula. Make sure you always use HC-robust errors as usual.

First, let’s be very naive and just compare the before-after amongst the treated cities only. Run a regression of WPOOH on PACE, PWRPRICE, and incentiverate on a subset of the data consisting only of the treatment group. That is, only on the data for the treated. You can subset in the lm(..., data=PAN.merged[PAN.merged$TMT==T,]) call. A sketch follows this list.

• PWRPRICE is the average cost per kWh of electricity for the city.

• incentiverate is the per-watt subsidy given by the state under the California Solar Initiative. It was designed to “step down” over time as more people install solar (it eventually hit zero in early 2014).

• PACE is our treatment variable of interest. It is the presence of a PACE program in that city during that quarter. It varies by city-quarter.
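A sketch of this naive before-after regression with HC-robust errors (assuming lmtest and sandwich are loaded):

naive.tmt = lm(WPOOH ~ PACE + PWRPRICE + incentiverate,
               data = PAN.merged[PAN.merged$TMT == T, ])   # treated cities only
coeftest(naive.tmt, vcov = vcovHC(naive.tmt, type = "HC1"))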

Question 2.4 (11 points)

1. (1 point) By including PWRPRICE and incentiverate, what are we controlling for and why?

2. (3 points) Can you think of anything that isn’t controlled for that could cause bias? Hint: there are lots of possible answers since we are only looking at the treatment group, but make sure you explain why yours could cause bias!

3. (1 point) What is the coefficient on incentiverate and what does it mean?

4. (2 points) Does this comport with your prior expectation? That is, does it make sense? Why or why not?

5. (2 points) What is the coefficient on PACE and what does it mean? Make sure you state your answer including units.

6. (2 points) What is the std. error on the coefficient for PACE, and is it statistically significant?

Task 2.5 (5 points)

Run the regression from 2.4 again, but keep the whole sample. PACE is yq- and city-specific, so you don’t

need to interact it with anything.
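A sketch, same specification as Task 2.4 but on the full sample:

naive.all = lm(WPOOH ~ PACE + PWRPRICE + incentiverate, data = PAN.merged)
coeftest(naive.all, vcov = vcovHC(naive.all, type = "HC1"))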

Question 2.5 (6 points)

1. (2 points) If we call the cities with PACE programs the “treatment”, what do we call the cities without treatment?

2. (1 point) What is the coefficient on PACE in this specification?

3. (1 point) Is it significant? Why or why not?

4. (2 points) Is there anything missing that we have not controlled for?

Task 2.6 (5 points)

Finally, let’s do a Difference-in-Differences specification. PACE is already the interaction between TMT and POST (unlike in lecture, we have time-varying treatment times, so we don’t really have a POST), so all we need to do is add time fixed effects yq and city-level fixed effects city. Run the DID specification. A sketch follows.
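One way to write this, keeping the controls from Task 2.4 and adding the two sets of fixed effects via factor(...) (a sketch, not necessarily the only valid specification):

did = lm(WPOOH ~ PACE + PWRPRICE + incentiverate + factor(yq) + factor(city),
         data = PAN.merged)
coeftest(did, vcov = vcovHC(did, type = "HC1"))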

Question 2.6 (20 points)

1. (3 points) By adding the fixed effects, what have we controlled for?

2. (5 points) What is the identifying assumption for this regression?

3. (2 points) What is the coefficient on PACE and what does it mean (note: it won’t equal the coefficient in the Kirkpatrick and Bennear paper)? Is this the ATE?

4. (2 points) Is it statistically significant and why?

5. (2 points) What is the mean of the outcome in the data (use mean(...)), and is the effect of PACE economically meaningful or not? That is, compared to the average value of WPOOH, is the effect big or small?

6. (2 points) Which specification do you feel is the least biased? Why?

7. (4 points) Scroll down to see the time fixed effects. Do they follow a pattern? Does that pattern match the pattern you saw in the plot in Task 2.1? Note that, depending on what size window you have open, the p-value column might appear down below.


