Read the following instructions carefully:

Problem 1

The purpose of this problem is to place OLS, LASSO, and ridge regression into a common framework—a family of estimators that minimize an objective function based at least partly on the sum of squared residuals (SSR).

For example, recall that the OLS estimator is a procedure that searches the space of all possible linear regression coefficients, \(\mathbf{\beta}^\ast \in \mathbb{R}^k\) (where \(k\) is the number of coefficients to estimate), and picks the option that minimizes SSR. Thus, SSR is the objective function in OLS. LASSO and ridge regression have modified objective functions that include SSR, but also add a “penalty” function that pulls estimates away from certain choices of coefficients even when those choices might lead to a minimal SSR. In this problem, we will visualize the shape of these penalty functions and their effect on coefficient estimates. This problem uses a simulated dataset drawn from the following data-generating process: \[y_i = 2 x_{1i} + 3 x_{2i} + \varepsilon_i,\] where \(X_k \sim N(0, 1)\) for \(k = 1, 2\) and \(\varepsilon_i \sim N(0, 1)\).
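Written out, the three estimators minimize closely related objective functions; only the penalty term differs (with tuning parameter \(\lambda \ge 0\)):

\[\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\beta)^2,\]
\[\hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\beta)^2 + \lambda \sum_{j=1}^{k} \beta_j^2,\]
\[\hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\beta)^2 + \lambda \sum_{j=1}^{k} |\beta_j|.\]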

We provide code to visualize how SSR varies with \(\mathbf{\beta}^\ast\). In the corresponding plot, darker areas represent portions of the coefficient search space that lead to higher SSR. The lightest point, marked with a \(\beta\), represents the solution of the optimization problem. Circular contour lines (gray) indicate “level curves,” or sub-regions in the coefficient space that would lead to the same SSR.

You will do the same for LASSO and ridge objective functions, then examine how the results change with the tuning parameter \(\lambda\).

library(ggplot2)
library(dplyr)

## simulation dataset
set.seed(02139)
n <- 1000
d <- data.frame(x_1 = rnorm(n, mean=0, sd=1),
               x_2 = rnorm(n, mean=0, sd=1))
d$y <- 2 * d$x_1 + 3 * d$x_2 + rnorm(n, mean=0, sd=1)

## grid search for optimal beta
## We usually find beta hat analytically: (X'X)^{-1}X'y
## Here, though, we're exhaustively searching through combinations of the 
##  different possible betas, finding the error, and empirically finding the
##  best beta hats.
beta_grid <- expand.grid(beta_1=seq(-1,5,.1),beta_2=seq(-1,5,.1))

calc_ssr <- function(beta_star, d){
  X <- as.matrix(d[,c("x_1","x_2")])
  e <- d$y - X%*% beta_star
  return(crossprod(e)) 
}

## calculate the error for all combinations of beta1 and beta2
beta_grid$ssr <- apply(beta_grid, 1, calc_ssr, d)

## beta_ols is the vector of two betas that empirically minimizes the
##  sum of squared residuals.
beta_ols <- beta_grid[which.min(beta_grid$ssr),]
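As a quick sanity check (not required by the problem), the grid-search minimizer can be compared against the closed-form OLS solution mentioned in the comment above; assuming a model with no intercept (as in calc_ssr), the two should agree up to the 0.1 resolution of the grid.

## optional check: closed-form OLS solution, beta_hat = (X'X)^{-1} X'y (no intercept)
X_check <- as.matrix(d[,c("x_1","x_2")])
beta_analytic <- solve(crossprod(X_check), crossprod(X_check, d$y))
beta_analytic # should be close to beta_ols (the grid is only accurate to 0.1)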


## ssr contours and ols estimate
ggplot(beta_grid, aes(x=beta_1, y=beta_2)) + 
  geom_tile(aes(fill=log(ssr))) +
  stat_contour(aes(z=ssr), color="gray30") +
  # add the ols estimate
  geom_text(data=beta_ols, label="beta[plain(SSR)]",
           parse=TRUE, hjust=0, vjust=0) +
  geom_vline(xintercept=0) + 
  geom_hline(yintercept=0) +
  scale_fill_gradient(low="#FEE8C8", high="#E34A33",
                      breaks=range(log(beta_grid$ssr)), 
                      labels=c("low","high"), name="SSR") +
  xlab(expression(beta[1])) + 
  ylab(expression(beta[2])) + 
  coord_fixed() + 
  labs(title = "OLS objective function (SSR)")

(a) Calculate the penalty that ridge and lasso add to the OLS loss function, ignoring for now the sum of squared residuals. Visualize the shape of the penalties with contours and shading as above, and in your own words, compare the different shapes of the constraints. Can you reframe the penalty function in terms of a prior on coefficients?

Hint: Assume \(\lambda=1\) for now—the exact value of the penalty function is less important than its shape.

lambda <- 1
beta_grid$ridge_penalty <- lambda * (beta_grid$beta_1^2 + beta_grid$beta_2^2)

## ridge penalty contours
ggplot(beta_grid, aes(x=beta_1, y=beta_2)) +
    geom_tile(aes(fill=ridge_penalty)) +
    stat_contour(aes(z=ridge_penalty), colour="gray30") +
    ## prettifying 
    geom_vline(xintercept=0) + geom_hline(yintercept=0) +
    scale_fill_gradient(low="#FEE8C8", high="#E34A33",
                        breaks=range(beta_grid$ridge_penalty),
                        labels=c("low","high"), name="ridge penalty") +
    xlab(expression(beta[1])) + ylab(expression(beta[2])) + 
    coord_fixed() +
    ggtitle("Ridge penalty function (sum of squared coefs)")

beta_grid$lasso_penalty <- lambda* (abs(beta_grid$beta_1) + abs(beta_grid$beta_2))

## lasso penalty contours
ggplot(beta_grid, aes(x=beta_1, y=beta_2)) +
    geom_tile(aes(fill=log(lasso_penalty+5))) +
    stat_contour(aes(z=lasso_penalty), colour="gray30") +
    ## prettifying
    geom_vline(xintercept=0) + geom_hline(yintercept=0) +
    scale_fill_gradient(low="#FEE8C8", high="#E34A33",
                        breaks=range(log(beta_grid$lasso_penalty+5)),
                        labels=c("low","high"), name="LASSO penalty") +
    xlab(expression(beta[1])) + ylab(expression(beta[2])) + 
    coord_fixed() + 
    ggtitle("LASSO penalty function (L1 norm of coefs)")

Solution: We should notice that the ridge penalty is circular while the lasso penalty is diamond-shaped. These are the consequences of squared and absolute value penalties. In Bayesian terms, both have a zero-centered prior, with ridge’s having a smoother dropoff and lasso having a spike of higher probability at 0.
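To make the prior interpretation concrete: with Gaussian errors, the ridge estimate is the posterior mode (MAP estimate) under independent zero-mean normal priors on each coefficient, while the lasso estimate is the posterior mode under independent zero-mean Laplace (double-exponential) priors,

\[p_{ridge}(\beta_j) \propto \exp\!\left(-\frac{\beta_j^2}{2\sigma_0^2}\right), \qquad p_{lasso}(\beta_j) \propto \exp\!\left(-\frac{|\beta_j|}{b}\right),\]

with \(\lambda\) governed by the prior scale (\(\sigma_0^2\) or \(b\)) relative to the error variance: tighter priors correspond to larger \(\lambda\) and stronger shrinkage toward zero.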

(b) Next, visualize the full loss function for ridge (i.e., the SSR plus the ridge penalty). On each graph, mark both the ridge and OLS coefficient estimates. Do this for \(\lambda\) values of 250, 500, 1,000, and 2,000. What sort of path do your ridge estimates follow as the tuning parameter \(\lambda\) increases?

Repeat for LASSO regression, with \(\lambda\) set to 1,250, 2,500, 5,000, and 10,000. What similarities and differences do you see?

library(plyr)
library(dplyr)
### first make n_lambda copies of beta_grid...
lambdas <- c(250, 500, 1000, 2000)
beta_grid_lambdas <- lapply(lambdas, function(lambda){
    cbind(beta_grid, lambda=lambda)})
## ... then rbind them together
beta_grid_lambdas <- do.call(rbind, beta_grid_lambdas)

## calculate penalty function for each value of lambda
beta_grid_lambdas$ridge_penalty <- beta_grid_lambdas$lambda * 
    (beta_grid_lambdas$beta_1^2 + beta_grid_lambdas$beta_2^2)

## ridge estimates, for each value of lambda
beta_ridge <- plyr::ddply(beta_grid_lambdas, "lambda",
                  function(beta_grid.lambda){
                      beta_grid.lambda[which.min(beta_grid.lambda$ssr
                               + beta_grid.lambda$ridge_penalty),]})

# Equivalently, with dplyr: find the row with the lowest loss for each lambda
beta_ridge <- beta_grid_lambdas %>% 
  group_by(lambda) %>% 
  slice(which.min(ssr + ridge_penalty))

## prep for plotting
lambda_labeller <- function(variable, value){
    plyr::llply(1:length(value), function(i)
        parse(text = paste(variable,"==",value[i])))}

## ridge objective contours
ggplot(beta_grid_lambdas, aes(x=beta_1, y=beta_2)) +
  geom_tile(aes(fill=log(ridge_penalty + ssr))) + # heat map
  stat_contour(aes(z=ridge_penalty + ssr), colour="gray30") + # contour plot
  facet_wrap(~lambda, labeller=lambda_labeller) +
  ## ols estimate
  geom_text(data=beta_ols, 
            label="beta[plain(SSR)]",
            parse=TRUE, hjust=0, vjust=0) +
  ## ridge estimate
  geom_point(data=beta_ridge, shape=4) +
  geom_text(data=beta_ridge,
            label="beta[plain(ridge)]",
            parse=TRUE, hjust=0, vjust=1) +
  ## prettifying
  geom_vline(xintercept=0) + geom_hline(yintercept=0) +
  scale_fill_gradient(low="#FEE8C8", high="#E34A33",
                      breaks=range(log(beta_grid_lambdas$ridge_penalty+
                                       beta_grid_lambdas$ssr)),
                      labels=c("low","high"), name="SSR + ridge penalty") +
  xlab(expression(beta[1])) + 
  ylab(expression(beta[2])) + 
  coord_fixed()  

## change lambdas and repeat for lasso
lambdas <- c(1250, 2500, 5000, 10000)
beta_grid.lambdas <- lapply(lambdas, function(lambda){
    cbind(beta_grid, lambda=lambda)
})
beta_grid.lambdas <- do.call(rbind, beta_grid.lambdas)

## calculate penalty function for each value of lambda
beta_grid.lambdas$lasso_penalty <- beta_grid.lambdas$lambda *
    (abs(beta_grid.lambdas$beta_1) + abs(beta_grid.lambdas$beta_2))

## lasso estimates, for each value of lambda
beta_lasso <- ddply(beta_grid.lambdas,
                   "lambda", function(beta_grid.lambda){
                       beta_grid.lambda[which.min(beta_grid.lambda$ssr
                                                  + beta_grid.lambda$lasso_penalty),]
})

### lasso objective contours
ggplot(beta_grid.lambdas, aes(x=beta_1, y=beta_2)) +
    geom_tile(aes(fill=log(lasso_penalty + ssr))) + # heat map
    stat_contour(aes(z=lasso_penalty + ssr), colour="gray30") + # contour plot
    facet_wrap(~lambda, labeller=lambda_labeller) +
    ## ols estimate
    geom_text(data=beta_ols, label="beta[plain(SSR)]",
              parse=TRUE, hjust=0, vjust=0) +
    ## lasso estimate
    geom_text(data=beta_lasso,
              label="beta[plain(lasso)]",
              parse=TRUE, hjust=0, vjust=1) +
    ## prettifying
    geom_vline(xintercept=0) + geom_hline(yintercept=0) +
    scale_fill_gradient(low="#FEE8C8", high="#E34A33",
                        breaks=range(log(beta_grid.lambdas$lasso_penalty+beta_grid.lambdas$ssr)),
                        labels=c("low","high"), name="SSR + LASSO penalty") +
    xlab(expression(beta[1])) + ylab(expression(beta[2])) + 
    coord_fixed()

Solution: Both regularization methods pull coefficients toward zero. Lasso does so more aggressively than ridge: as \(\lambda\) increases, the loss contours straighten out and become increasingly parallel to the axes, so the minimum can fall exactly on an axis, setting some coefficients to exactly zero.
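For additional intuition (not required by the problem), the same shrinkage paths can be traced directly by fitting ridge and lasso on the simulated data with the glmnet package, assuming it is installed:

library(glmnet)
X_mat <- as.matrix(d[,c("x_1","x_2")])
fit_ridge <- glmnet(X_mat, d$y, alpha=0) # alpha = 0 gives the ridge penalty
fit_lasso <- glmnet(X_mat, d$y, alpha=1) # alpha = 1 gives the lasso penalty
par(mfrow=c(1,2))
plot(fit_ridge, xvar="lambda") # ridge: coefficients shrink smoothly toward zero
plot(fit_lasso, xvar="lambda") # lasso: coefficients hit exactly zero for large lambda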

Problem 2

In this problem, we will compare the abilities of OLS and LASSO to recover causal effects in a simulation exercise. We will use each method to estimate the treatment effect, examine the bias and variance of the resulting \(\hat{\tau}\)s, and generate a graphical comparison of the estimators.

(a) Create a \(1,000 \times 800\) matrix of “control” variables \(\mathbf{X}\), with 1,000 samples for each \(X_k\) drawn from \(X_k \stackrel{iid}{\sim} N(0, 1)\) for \(k = 1, \ldots, 800\). Draw a (confounded) treatment \(T_i = X_{44i} + \alpha_i\), where \(\alpha_i \sim N(0, 1)\). Assume a DGP of

\[\begin{equation} Y_i = 2 T_i - 12X_{2i} - 15 X_{15i} + 9 X_{44i} + 13 X_{300i} + \epsilon_i, \end{equation}\]

where \(\epsilon_i \sim N(0, 1)\). Let \(\tau = 2\) be the effect of \(T\) on \(Y\).

OBS <- 1000
FEATS <- 800
# This generates a matrix of normals
X <- matrix(rnorm(OBS * FEATS, 0, 1), 
            nrow = OBS, ncol = FEATS)
colnames(X) <- as.character(1:FEATS)

# generate a special treatment vector
treat <- rnorm(OBS) +X[,44]

# Generate y with known coefficients
y <- 2*treat - 12*X[,2] - 15*X[,15] + 9*X[,44]  + 13*X[,300] + rnorm(OBS, 0, 1)

(b) First, we will investigate the ability of OLS to recover the true effect in the presence of these control variables. Write two functions to generate OLS estimates of the treatment effect: one that includes all control variables and one that assumes a correctly specified model (i.e., one that includes only the variables with an effect). Generate at least 200 simulated datasets using the DGP above and use your functions to find \(\hat{\tau}_{OLS}\) for both the correctly specified and mis-specified models. Report the bias and variance of the estimates.
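Across \(R\) simulation draws, the bias and variance of an estimator \(\hat{\tau}\) are approximated by

\[\widehat{\text{Bias}}(\hat{\tau}) = \frac{1}{R}\sum_{r=1}^{R}\hat{\tau}_r - \tau, \qquad \widehat{\text{Var}}(\hat{\tau}) = \frac{1}{R-1}\sum_{r=1}^{R}\left(\hat{\tau}_r - \bar{\hat{\tau}}\right)^2 .\]

(The code below reports \(\tau - \bar{\hat{\tau}}\), i.e., the bias with its sign flipped, alongside the sample variance.)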

# Parallelize the simulations below
library(foreach)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Set the global number of replications for all simulations
REPS <- 200
sim_lm_all <- function(){
  X <- matrix(rnorm(OBS * FEATS), 
              nrow = OBS, ncol = FEATS)
  treat <- rnorm(OBS) - X[,44]
  y <- 2*treat - 12*X[,2] - 15*X[,15] + 9*X[,44]  + 13*X[,300] + rnorm(OBS, 0, 1)
  model <- lm(y ~ treat + X)
  tau_hat <- coef(model)['treat']
  return(tau_hat)
}

sim_lm_correct <- function(){
  X <- matrix(rnorm(OBS * FEATS), 
              nrow = OBS, ncol = FEATS)
  treat <- rnorm(OBS) - X[,44]
  y <- 2*treat - 12*X[,2] - 15*X[,15] + 9*X[,44]  + 13*X[,300] + rnorm(OBS, 0, 1)
  model <- lm(y ~ treat + X[,c(2, 15, 44, 300)])
  tau_hat <- coef(model)['treat']
  return(tau_hat)
}

sims_all <- foreach(i=1:REPS) %dopar% sim_lm_all()
sims_all <- do.call(rbind, sims_all)

sims_correct <- foreach(i=1:REPS) %dopar% sim_lm_correct()
sims_correct <- do.call(rbind, sims_correct)
# bias and variance
c(mean(2 - sims_correct), var(sims_correct))
[1] 0.0049309775 0.0007848437
# bias and variance
c(mean(2 - sims_all), var(sims_all))
[1] -0.027642061  0.003046431
sims_ols_all_df <- data.frame(est = as.numeric(sims_all),
                   method = "ols_all")

sims_ols_correct_df <- data.frame(est = as.numeric(sims_correct),
                   method = "ols_correct")

(c) Next, we will use lasso to estimate \(\tau\). Repeat the simulation above, but this time using glmnet instead of lm to estimate \(\tau\) as a function of the treatment and all of X. Unlike OLS, lasso has a tunable parameter, \(\lambda\), which must be set by the researcher. One way of picking a good value for \(\lambda\) is through “grid search” and cross-validation. Write a function that uses 5-fold cross-validation to pick the value of \(\lambda\), out of a range of possible \(\lambda\)s, that minimizes the cross-validated mean squared prediction error. You should consider at least 100 possible values for \(\lambda\) between 0 and the minimum value of \(\lambda\) for which no features are selected. Code is provided below to find the maximum \(\lambda\) and to generate the “grid” of \(\lambda\)s to consider:

X_both <- cbind(treat, X)
max_lambda <- max(abs(colSums(X_both*y)))/ncol(X_both)
lambda_vec <- seq(0, max_lambda, length.out = 200)

(Note: this method of finding a maximum \(\lambda\) requires scaled and centered variables. Our variables are standard normal, so no centering or scaling is required in Problem 2.)
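As a sketch of what that pre-processing could look like if the variables were not already standardized (hypothetical here, since our simulated variables are standard normal), one could use base R's scale():

## hypothetical pre-processing; not needed for the standard-normal X in Problem 2
X_both_std <- scale(X_both)  # center each column and divide by its standard deviation
y_centered <- y - mean(y)    # center the outcome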

Compare your value of \(\lambda\) to the value returned by glmnet’s built-in function cv.glmnet. Report the variables that are selected using your value of \(\lambda\) and compare them to the variables in the true DGP. Then use your value of \(\lambda\) in your simulations.

library(glmnet)

GRID_N <- 200
# 5 fold cross validation
# take the mean error for each lambda, averaged over the 5 folds

# Define the vector of lambda values we're going to check
X_both <- cbind(treat, X)
max_lambda <- max(abs(colSums(X_both*y)))/ncol(X_both)
lambda_vec <- seq(0, max_lambda, length.out = GRID_N)

calc_error <- function(lambda, X, treat, y, n_folds = 5){
  fold <- sample(1:n_folds, nrow(X), replace = TRUE)
  mse <- 0
  for (i in 1:n_folds){
    X_train <- X[fold != i,]
    treat_train <- treat[fold != i]
    y_train <- y[fold != i]
    X_test <- X[fold == i,]
    treat_test <- treat[fold == i]
    y_test <- y[fold == i]
    model <- glmnet(x=as.matrix(cbind(treat_train, X_train)), y = y_train, 
                    family='gaussian', 
                    alpha = 1, lambda = lambda)
    y_hat <- predict(model, cbind(treat_test, X_test))
    error <- y_test - y_hat
    mse <- mse + mean(error^2)
  }
  return(mse / n_folds) # average MSE across the folds
}


# Use %dopar% to parallelize across cores. (Use the values for X, treat, y defined above.)
# Iterate over all values of lambda and calculate the mean MSE across 5 folds.
all_errors <- foreach(i=1:length(lambda_vec), 
                      .packages = c("glmnet")) %dopar% 
  calc_error(lambda_vec[i], X, treat, y, n_folds = 5)
all_errors <- unlist(all_errors)

# lambda is the lambda with the lowest error
lambda <- lambda_vec[which.min(all_errors)]
lambda
[1] 0.09044059
pick_lambda <- cv.glmnet(x=cbind(treat, X), y = y, 
                    family='gaussian', 
                    alpha = 1)
pick_lambda$lambda.min
[1] 0.06523659

Solution: The two values of \(\lambda\) are slightly different but fairly close to each other.

model <- glmnet(x=cbind(treat, X), y = y, 
                    family='gaussian', 
                    alpha = 1, lambda = lambda)
print(broom::tidy(model)$term)
 [1] "(Intercept)" "treat"       "2"           "15"          "44"         
 [6] "197"         "204"         "227"         "291"         "300"        
[11] "371"         "531"         "793"        

Solution: The lasso model with our \(\lambda\) chosen through cross-validation selects more terms than exist in the true DGP, but still drastically reduces the number of included parameters.

# We can use the `tidy` function from the `broom` package to format the
#  glmnet output. This makes it much easier to pull out 
#  the coefficient on `treat`

sim_lasso <- function(lambda){
  X <- matrix(rnorm(OBS * FEATS), nrow = OBS, ncol = FEATS)
  colnames(X) <- as.character(1:FEATS)
  treat <- rnorm(OBS) - X[,44]
  y <- 2*treat - 12*X[,2] - 15*X[,15] + 9*X[,44]  + 13*X[,300] + rnorm(OBS, 0, 1)
  model <- glmnet(x=as.matrix(cbind(treat, X)), y=y, 
                  family='gaussian', alpha = 1, lambda = lambda)
  tidied <- broom::tidy(model)
  tau_hat <- tidied[tidied$term == "treat", "estimate"]
  return(tau_hat)
}

sims <- foreach(i=1:REPS, .packages=c("glmnet", "broom")) %dopar% 
  sim_lasso(lambda)
sims <- do.call(rbind, sims)
# bias and variance

c(mean(2 - sims$estimate), var(sims$estimate))
[1] 0.2503255095 0.0003860938
sims_lasso_df <- data.frame(est = as.numeric(sims$estimate),
                   method = "lasso")

(d) Finally, graphically compare the simulated \(\hat{\tau}\)s of the three models (correct OLS, realistic OLS, and lasso) in three histograms. What do you find? Which model would you prefer in your own data analysis?

sims_df <- rbind(sims_ols_correct_df, sims_ols_all_df)
sims_df <- rbind(sims_df, sims_lasso_df)


model_names = c("ols_correct" = "Correctly specified OLS (unrealistic)",
                "ols_all" = "Incorrectly specified OLS (realistic)",
                "lasso" = "Lasso"
                )

ggplot(sims_df, aes(x = est, fill = method)) + 
  geom_histogram(bins = 60) +
  facet_wrap(~method, nrow = 4, labeller = as_labeller(model_names)) +
  geom_vline(xintercept = 2) +
  #xlim(1.7, 2.3) + 
  labs(title = "Comparing OLS (correct and mis-specified), lasso with simulations",
       subtitle = "(Realistic) OLS is unbiased but high variance. Lasso is lower variance but biased.",
       y = NULL,
       x = expression(hat(tau))) +
  theme_gov2018() +
  theme(legend.position = "none")

Problem 3


In this question, we will compare several classification algorithms by applying them to United Nations Security Council (UNSC) resolutions. Resolution topics and text (parsed from PDF without post-processing) are in unsc_text.csv.

We would like to get an intuition for machine learning algorithms by visualizing them, but the document-term matrix (DTM) that summarizes our UNSC resolutions has over three thousand dimensions (columns, or unique words) even after stemming/stopping and trimming obscure words. To facilitate visualization, we will first pre-process the DTM using dimensional reduction. This step finds the best way to project, or summarize, the DTM into \(\mathbb{R}^2\)—a transformation that is easier to manipulate and understand, but still explains much of the variation in the original \(\mathbb{R}^{3309}\) dataset.
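The projection comes from the singular value decomposition of the \(n \times p\) DTM \(D\):

\[D = U \Sigma V', \qquad D \approx U_m \Sigma_m V_m',\]

where the truncated version keeps only the \(m\) largest singular values; the rows of \(U\) give each document's coordinates in “concept” space and the columns of \(V\) give each word's loading on those concepts. We will later plot documents using two columns of \(U\).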

(a) Load unsc_text.csv. Using regular expressions, drop resolutions for which no text was found—i.e., those that contain only whitespace in the text field. Clean whitespace, numbers, and punctuation, then stem the documents. Reduce the size of the DTM by removing words that occur in two or fewer documents.

Finally, normalize each row so that it sums to unity, to help account for different document lengths. (Why?)
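In symbols, row normalization replaces raw counts with within-document relative frequencies:

\[\tilde{d}_{ij} = \frac{d_{ij}}{\sum_{j'} d_{ij'}},\]

so a long resolution and a short one that use words in the same proportions produce the same row.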

library(tm)
library(SnowballC) # for stemming
library(slam) # default sparse matrix format in tm
library(Matrix) # alternative sparse matrix format

 ### read in data

d = read.csv("D:/2024 spring/gov 2018/unsc_text.csv", stringsAsFactors=F)
d = d[-grep("^\\s+$", d$text),] # drop res for which no text found

### resolution topics

topics_list = strsplit(as.character(d$agenda), "nn")
topics_list = rapply(topics_list, function(topic){  # recursive apply
  gsub("^.* - ", "", topic) # drop numeric topic codes
  }, how="replace") # preserve original nested-list structure
topics_unique = sort(unique(unlist(topics_list)))

# make a matrix of topic dummies
topics_dummies = t(sapply(1:nrow(d),function(i){ 
    topics_unique %in% topics_list[[i]] # boolean vector of length n_topic
    }))
dimnames(topics_dummies) = list(res_symbol=rownames(d), topic=topics_unique)
rm(topics_list)
texts = d$text
names(texts) = gsub("/", "", d$symbol)

files = VectorSource(texts)
corpus = Corpus(files, readerControl=list(reader=files$defaultReader,language="en"))

# clean by removing white space, numbers, and punctuation, then stemming
corpus.clean = tm_map(corpus, content_transformer(stripWhitespace))
corpus.clean = tm_map(corpus.clean, content_transformer(removeNumbers))
corpus.clean = tm_map(corpus.clean, content_transformer(removePunctuation))
corpus.clean = tm_map(corpus.clean, stemDocument)
 
# Create document-term matrix
dtm_slam = DocumentTermMatrix(corpus.clean,control=list(stopwords=stopwords("english"),
        tolower=TRUE,wordLengths=c(3, Inf)))
rownames(dtm_slam) = gsub("/", "", d$symbol)

drop = which(col_sums(dtm_slam>0) < 3) # drop words only appearing in a couple of documents
dtm_slam = dtm_slam[,-drop]

# convert to Matrix
dtm = sparseMatrix(i=dtm_slam$i,j=dtm_slam$j,x=dtm_slam$v,
    dims=c(nrow(dtm_slam), ncol(dtm_slam)))
dimnames(dtm) = dimnames(dtm_slam)
rm(files, corpus, corpus.clean, dtm_slam)

# normalize rows of dtm to account for different document length
dtm_norm = Diagonal(x=1/rowSums(dtm)) %*% dtm

(b) A singular value decomposition (SVD) re-expresses an \(n \times k\) matrix (with \(k > n\)) in terms of \(n\) orthogonal unit vectors, usually ordered by their explanatory power. The SVD is a useful way to obtain a lower-dimensional summary of the DTM: the first \(m\) left-singular vectors, dtm_svd$u[,1:m], represent an \(m\)-dimensional projection of the data, or the \(m\) most important “concepts” embodied in the DTM. They sometimes have a substantive interpretation, but often it is unclear exactly what they represent.

In this problem, we will use the second and third left-singular vectors of the DTM to visualize some classification algorithms in a two-dimensional space. (The first left-singular vector, which loads heavily onto words like united, nation, security, council, resolution, and paragraph, appears to represent bureaucratic-ness; while it is important in explaining many of the words that appear in the resolutions, it contains little useful information for classification.) Get the SVD of the normalized DTM and extract its left-singular vectors, svd(dtm_norm)$u. This is an \(n \times n\) matrix, where each column svd(dtm_norm)$u[,i] is a different weighted combination of our old (pre-SVD) covariate columns. In your own words, interpret the rows of U and the columns of V (svd(dtm_norm)$v).

Visualize all resolutions in a plot with the second and third left-singular vectors (LSV2 and LSV3, i.e., dtm_svd$u[,2] and dtm_svd$u[,3]) on the \(x\)- and \(y\)-axes, respectively. On this plot, highlight resolutions that are hand-coded (by the UN Bibliographic Information System) as pertaining to the “IRAQ-KUWAIT SITUATION” and the “WESTERN SAHARA QUESTION” in different colors. What do you see? Note that the SVD knows nothing about the subject area of resolutions; only word counts were used in the decomposition. (Hint: Remember that there can be more than one agenda topic per resolution. You may wish to use regular expressions.)

For the remainder of this problem, we will work only with resolutions coded as pertaining to “IRAQ-KUWAIT SITUATION” or “WESTERN SAHARA QUESTION”. Create a dataframe of these resolutions, with the second and third left-singular vectors as covariates and their topic as outcome.

library(ggplot2)

dtm_svd = svd(dtm_norm)
rownames(dtm_svd$u) = rownames(dtm_norm) # U rows: doc positions in "concept" space
rownames(dtm_svd$v) = colnames(dtm_norm) # V cols: mapping sum of "concepts" back to word counts

 ### make data for remaining questions

ind_sahara = which(topics_dummies[,dimnames(topics_dummies)$topic=="WESTERN SAHARA QUESTION\n"]==1)
ind_iraq = which(topics_dummies[,dimnames(topics_dummies)$topic=="IRAQ-KUWAIT SITUATION\n"]==1)

X = dtm_svd$u[c(ind_sahara, ind_iraq), 2:3]
colnames(X) = c("LSV2", "LSV3")
y = c(rep(0, length(ind_sahara)), rep(1, length(ind_iraq)))

y_lab = c(rep("sahara", length(ind_sahara)), rep("iraq", length(ind_iraq)))

data = data.frame(y, y_lab, X)

ggplot(data.frame(LSV2=dtm_svd$u[,2], LSV3=dtm_svd$u[,3]), aes(x=LSV2, y=LSV3)) +
 geom_point(size=.5) +
 geom_point(aes(colour=y_lab, shape=y_lab), data, size=3) +
 scale_shape_manual(values=c(15,17), name="") +
 scale_colour_manual(values=c("#F1A340","#998EC3"), name="") +
 theme(legend.position="bottom")

(c) Next, we will use OLS as a classifier. Run a linear regression (without an interaction term) of topic on LSV2 and LSV3, and interpret its fitted values as predictions by cutting them at some suitable threshold. Create a plot with the following components: a grid of points spanning the (LSV2, LSV3) space, shaded by the class your OLS classifier predicts at each point, and the training resolutions plotted on top, with point shapes indicating their actual hand-coded topic.

Hint: You may find it helpful to use expand.grid(), predict(), and geom_tile().

mod_ols = lm(y ~ LSV2 + LSV3, data)

lsv_grid = expand.grid(LSV2=seq(-.1, .15, .001), LSV3=seq(-.06, .17, .001))

lsv_grid$ols = c("sahara","iraq")[(predict(mod_ols, newdata=lsv_grid) > .5) + 1]

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=ols)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data") +
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("OLS")

(d) Repeat for the following models: OLS with an interaction term between LSV2 and LSV3; a support vector machine with a linear kernel; support vector machines with Gaussian (radial) kernels, using gamma = 1 and gamma = 2; and \(k\)-nearest neighbors with \(k = 1\) and \(k = 3\).

While there are many good implementations of support vector machines and K-nearest neighbors, two options are svm() (package e1071) and knn() (package class).

library(e1071) # for svm
library(class) # for knn

mod_ols_interact = lm(as.numeric(y) ~ LSV3*LSV2, data)
lsv_grid$ols_interact = c("sahara","iraq")[(predict(mod_ols_interact, newdata=lsv_grid) > .5) + 1]
data$y_lab <- factor(data$y_lab, levels=c("sahara", "iraq"))
mod_svm_linear = svm(y_lab ~ LSV2 + LSV3, data, kernel="linear")
mod_svm_radial1 = svm(y_lab ~ LSV2 + LSV3, data, kernel="radial", gamma=1)
mod_svm_radial2 = svm(y_lab ~ LSV2 + LSV3, data, kernel="radial", gamma=2)
lsv_grid$svm_linear = predict(mod_svm_linear, newdata=lsv_grid)
lsv_grid$svm_radial1 = predict(mod_svm_radial1, newdata=lsv_grid)
lsv_grid$svm_radial2 = predict(mod_svm_radial2, newdata=lsv_grid)

lsv_grid$knn1 = knn(train=data[,c("LSV2","LSV3")], cl=y_lab,test=lsv_grid[,c("LSV2","LSV3")], k= 1)

lsv_grid$knn3 = knn(train=data[,c("LSV2","LSV3")], cl=y_lab, test=lsv_grid[,c("LSV2","LSV3")], k=3)
 ### plotting

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=ols_interact)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data")+
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("OLS with interaction term")

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=svm_linear)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data")+
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("linear SVM")

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=svm_radial1)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data")+
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("SVM with Gaussian kernel (gamma=1)")

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=svm_radial2)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data")+
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("SVM with Gaussian kernel (gamma=2)")

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=knn1)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data")+
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("One nearest neighbor")

ggplot(lsv_grid, aes(x=LSV2, y=LSV3)) +
 geom_tile(aes(fill=knn3)) +
 geom_point(aes(shape=y_lab), data, size=2) +
 scale_fill_manual(values=c("#F1A340","#998EC3"), name="predicted class for incoming data") +
 scale_shape_manual(values=c(2,4), name="actual class of training data") +
 theme(legend.position="bottom") +
 ggtitle("Three nearest neighbors")