🎓 Final Project

Objective: Choose one of two tracks below. Both tracks require a 12–15 page academic report in LaTeX.

📋 Project Overview

| Track | Requirement |
|---|---|
| Track 1 | Replicate and extend a landmark causal ML paper using original (or similar) data |
| Track 2 | Conduct a structured literature review on a specific ML-in-economics domain |
| Report | 12–15 page academic paper in LaTeX |
| Weight | 40% of final grade |
| Due Date | May 5, 2027 |

🔬 Track 1: Empirical Replication & Extension

Choose one of six landmark papers in causal machine learning, replicate its core results, and propose a meaningful extension.

🎯 Six Paper Options

Option 1: 401(k) and Savings

Paper: Chernozhukov et al. (2018) - Double Machine Learning

Question: What is the causal effect of 401(k) eligibility on net financial assets?

Data

Source: 1991 Survey of Income and Program Participation (SIPP), ~9,915 observations

# Download Option A: DoubleML package
pip install doubleml
from doubleml.datasets import fetch_401K
data = fetch_401K()

# Download Option B: U.S. Census Bureau SIPP
# https://www.census.gov/programs-surveys/sipp.html

Replication Goals

Starter Code

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml_plr(Y, D, X, ml_g, ml_m, n_folds=5):
    """
    Double Machine Learning for Partially Linear Model
    Y = θ*D + g(X) + ε
    
    TODO: 
    1. Split data into K folds
    2. Train ml_g (outcome model) and ml_m (propensity model)
       on training set, predict on test set
    3. Compute residuals: Ỹ = Y - ĝ(X), D̃ = D - m̂(X)
    4. Estimate θ = mean(Ỹ * D̃) / mean(D̃²)
    5. Compute standard errors
    
    Returns: theta, standard_error
    """
    n = len(Y)
    Y_hat = np.zeros(n)
    D_hat = np.zeros(n)
    
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    for train_idx, test_idx in kf.split(X):
        # YOUR CODE HERE
        pass
    
    return theta, se
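
As a sanity check on steps 4 and 5, the final-stage estimator can be verified on simulated residuals where the true θ is known. This is illustrative only: the residuals below are generated directly rather than produced by cross-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
theta_true = 2.0

# Pretend cross-fitting already produced out-of-fold residuals:
D_tilde = rng.normal(size=n)                          # D - m_hat(X)
Y_tilde = theta_true * D_tilde + rng.normal(size=n)   # Y - g_hat(X)

# Step 4: theta = mean(Y_tilde * D_tilde) / mean(D_tilde^2)
theta_hat = np.mean(Y_tilde * D_tilde) / np.mean(D_tilde**2)

# Step 5: standard error from the influence function
psi = (Y_tilde - theta_hat * D_tilde) * D_tilde
se = np.sqrt(np.mean(psi**2) / np.mean(D_tilde**2) ** 2 / n)

print(theta_hat, se)  # theta_hat should be close to 2.0
```

With n = 10,000 the estimate should land within a couple of standard errors of 2.0; if it does not, the residualization or the formula is wrong.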

Extension Ideas

Option 2: NSW Job Training

Paper: Athey & Imbens (2016) - Causal Trees and Forests

Question: Are the effects of job training heterogeneous? Who benefits most?

Data

Source: LaLonde (1986) NSW data (297 treated, 425 control)

# Download Option A: Dehejia-Wahba
# https://users.nber.org/~rdehejia/data/.nswdata2.html

# Download Option B: causaldata package
pip install causaldata
from causaldata import lalonde

Replication Goals

Starter Code

class CausalTree:
    """
    Key difference from standard tree:
    - Standard: minimize prediction MSE
    - Causal: maximize |τ_left - τ_right| (treatment effect heterogeneity)
    """
    
    def fit(self, X, Y, D):
        """
        X: covariates, Y: outcome, D: treatment
        
        TODO: Implement recursive partitioning that maximizes
        treatment effect differences between leaves
        """
        pass
    
    def _treatment_effect(self, Y, D):
        """τ = E[Y|D=1] - E[Y|D=0]"""
        # YOUR CODE HERE
        pass
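
To see what the splitting criterion rewards, here is how a single candidate split could be scored on simulated data with known heterogeneity. This is an illustration of the objective, not part of the required implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
X = rng.uniform(size=n)               # one covariate
D = rng.integers(0, 2, size=n)        # randomized treatment
# True effect: 1 for X < 0.5, 3 for X >= 0.5
Y = np.where(X < 0.5, 1.0, 3.0) * D + rng.normal(size=n)

def naive_tau(Y, D):
    """Difference in means within a leaf: E[Y|D=1] - E[Y|D=0]."""
    return Y[D == 1].mean() - Y[D == 0].mean()

# Score the candidate split X < 0.5 by the effect gap it creates
left = X < 0.5
score = abs(naive_tau(Y[left], D[left]) - naive_tau(Y[~left], D[~left]))
print(score)  # close to |1 - 3| = 2
```

A causal tree searches over candidate splits and prefers the one with the largest score, whereas a standard regression tree would ignore D entirely.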

Extension Ideas

Option 3: Minimum Wage

Paper: Callaway & Sant'Anna (2021) - Staggered DiD

Question: Effect of minimum wage increases with staggered adoption?

Data

Source: Card & Krueger (1994) OR state-level panel

# Option A: Card-Krueger replication
# https://github.com/tyleransom/DiD-example

# Option B: QCEW data
# https://www.bls.gov/cew/

Replication Goals

Starter Code

def compute_att_gt(df, group, time):
    """
    Compute ATT(g,t) following Callaway & Sant'Anna (2021)
    
    Group g: units first treated at time g
    Compare to never-treated or not-yet-treated units,
    using the period just before treatment (g-1) as the baseline
    
    ATT(g,t) = [E[Y_t|g] - E[Y_{g-1}|g]] - [E[Y_t|control] - E[Y_{g-1}|control]]
    
    TODO: Implement group-time ATT calculation
    """
    pass

def event_study_plot(df):
    """
    Plot dynamic treatment effects by event time
    X-axis: Time relative to treatment (-5 to +5)
    Y-axis: Treatment effect with confidence intervals
    """
    pass
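
The ATT(g,t) formula reduces to a 2×2 difference-in-differences once the cohort and periods are fixed. A sketch on a simulated balanced panel (column names and the data-generating process here are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
rows = []
for i in range(200):
    cohort = 3 if i < 100 else 0          # first treated at t=3; 0 = never treated
    alpha = rng.normal()                  # unit fixed effect
    for t in range(1, 6):
        effect = 2.0 if (cohort != 0 and t >= cohort) else 0.0
        rows.append({"id": i, "t": t, "g": cohort,
                     "y": alpha + 0.5 * t + effect + rng.normal(scale=0.1)})
df = pd.DataFrame(rows)

def att_gt(df, g, t):
    """ATT(g, t): cohort g vs. never-treated, base period g - 1."""
    treat, never = df["g"] == g, df["g"] == 0
    d_treat = (df.loc[treat & (df["t"] == t), "y"].mean()
               - df.loc[treat & (df["t"] == g - 1), "y"].mean())
    d_ctrl = (df.loc[never & (df["t"] == t), "y"].mean()
              - df.loc[never & (df["t"] == g - 1), "y"].mean())
    return d_treat - d_ctrl

att = att_gt(df, g=3, t=4)
print(att)  # close to the true effect of 2.0
```

The full estimator loops this comparison over all (g, t) pairs and then aggregates, e.g. by event time for the event-study plot.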

Extension Ideas

Option 4: Proposition 99

Paper: Abadie et al. (2010) - Synthetic Control Method

Question: Did California's tobacco tax reduce cigarette sales?

Data

Source: State-level cigarette consumption (39 states × 31 years)

# Option A: synthdid package
pip install synthdid
from synthdid import get_data
california = get_data('california_prop99')

# Option B: Ortega database
# https://github.com/NiclasOrtG/the-synthetic-control-group

Replication Goals

Starter Code

import numpy as np
from scipy.optimize import minimize

def synthetic_control(Y_target, Y_donors):
    """
    Find optimal weights w that minimize ||Y_target - Y_donors @ w||²
    Subject to: sum(w) = 1, w >= 0 (convex combination)
    
    Y_target: (T_pre,) pre-treatment outcomes for treated unit
    Y_donors: (T_pre, J) pre-treatment outcomes for donor pool
    
    Returns: optimal weights (J,)
    """
    def objective(w):
        return np.sum((Y_target - Y_donors @ w) ** 2)
    
    # TODO: Use scipy.optimize.minimize with constraints
    pass
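
The constrained least-squares problem in the TODO can be handed to scipy.optimize.minimize with the SLSQP method, using bounds for w ≥ 0 and an equality constraint for sum(w) = 1. A sketch on synthetic data where the true weights are known by construction:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T_pre, J = 20, 5
Y_donors = rng.normal(size=(T_pre, J))
w_true = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
Y_target = Y_donors @ w_true              # treated unit lies in the donor span

def objective(w):
    return np.sum((Y_target - Y_donors @ w) ** 2)

res = minimize(
    objective,
    x0=np.full(J, 1.0 / J),               # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * J,              # enforces w >= 0
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum(w) = 1
)
w_hat = res.x
print(np.round(w_hat, 2))  # close to [0.6, 0.4, 0, 0, 0]
```

Because the objective is a convex quadratic and the feasible set is the simplex, SLSQP should recover the true weights here; on real data the treated unit is generally not an exact convex combination of donors, so the pre-treatment fit will not be perfect.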

Extension Ideas

Option 5: Policy Learning

Paper: Athey & Wager (2021) - Policy Learning

Question: How to learn optimal treatment assignment rules?

Data

Source: JTPA dataset (~11,600 participants)

# Option A: econml
from econml.datasets import fetch_jtpa

# Option B: MDRC (requires application)
# https://www.mdrc.org/

Replication Goals

Starter Code

def doubly_robust_scores(Y, D, X):
    """
    Compute doubly robust scores:
    Γ = μ₁(X) - μ₀(X) + D(Y - μ₁(X))/e(X) - (1-D)(Y - μ₀(X))/(1-e(X))
    
    where μ₁, μ₀ are outcome models and e(X) is propensity score
    
    TODO: Implement with cross-fitting
    """
    pass

def learn_policy(X, dr_scores, budget=0.5):
    """
    Learn policy π(X) that maximizes welfare subject to budget
    
    max E[π(X) * Γ]  s.t. E[π(X)] ≤ budget
    
    Solution: Treat if DR score > threshold
    """
    pass
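
Given DR scores, the budget-constrained policy is just a quantile threshold: treat the budget share of units with the largest scores. A sketch with simulated scores standing in for the output of doubly_robust_scores:

```python
import numpy as np

rng = np.random.default_rng(4)
dr_scores = rng.normal(size=1_000)        # stand-in for real DR scores

def threshold_policy(dr_scores, budget=0.5):
    """Treat the `budget` fraction of units with the largest DR scores."""
    cutoff = np.quantile(dr_scores, 1.0 - budget)
    return (dr_scores > cutoff).astype(int)

pi = threshold_policy(dr_scores, budget=0.3)
print(pi.mean())                          # share treated, ~0.30
print(np.mean(pi * dr_scores))            # estimated welfare E[pi(X) * Gamma]
```

In the full exercise the policy is restricted to a class of interpretable rules (e.g., depth-2 trees in X), which is where the learning problem becomes nontrivial.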

Extension Ideas

Option 6: Economic Policy Uncertainty

Paper: Baker et al. (2016) - EPU Index

Question: How to measure EPU from text? How does it affect outcomes?

Data

Source: Newspaper archives OR EPU index directly

# Option A: EPU index (monthly)
# https://www.policyuncertainty.com/

# Option B: News articles (requires ProQuest/Factiva)
# Python: newspaper3k, beautifulsoup

# Option C: GDELT project
# https://www.gdeltproject.org/

Replication Goals

Starter Code

UNCERTAINTY_TERMS = ['uncertain', 'uncertainty', 'risk', 'volatile']
POLICY_TERMS = ['policy', 'regulation', 'legislation', 'government']
ECONOMIC_TERMS = ['economy', 'growth', 'recession', 'inflation']

def count_epu_articles(articles):
    """
    Article counts as EPU if it contains:
    - At least one uncertainty term AND
    - At least one policy term AND
    - At least one economic term
    
    EPU Index = (EPU articles / Total articles) × Normalization
    """
    pass
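
The triple keyword condition can be sketched with naive substring matching. The real index uses newspaper-specific term lists, careful tokenization, and per-paper normalization; this toy version only illustrates the counting rule on two made-up articles:

```python
UNCERTAINTY_TERMS = ['uncertain', 'uncertainty', 'risk', 'volatile']
POLICY_TERMS = ['policy', 'regulation', 'legislation', 'government']
ECONOMIC_TERMS = ['economy', 'growth', 'recession', 'inflation']

def is_epu(text):
    """True if the article mentions at least one term from every category."""
    t = text.lower()
    return all(any(term in t for term in terms)
               for terms in (UNCERTAINTY_TERMS, POLICY_TERMS, ECONOMIC_TERMS))

articles = [
    "Uncertainty over new government regulation weighs on the economy.",
    "Local team wins the championship.",
]
share = sum(is_epu(a) for a in articles) / len(articles)
print(share)  # 0.5 -- only the first toy article satisfies all three conditions
```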

Extension Ideas

📚 Track 2: Literature Review

If you prefer a theoretical / survey-oriented project, you may conduct a literature review instead of an empirical replication. Depending on the topic, you can use the papers provided in class as a seed or explore an area we haven't covered.

Getting Started

  1. Pick a Specific Domain: Narrow your focus to a concrete application rather than surveying ML in economics broadly.
  2. The Snowball Search: Use the bibliography of your "seed paper" to find foundational work (Backward Search) and use Google Scholar's "Cited by" to find recent 2025–2026 developments (Forward Search).
  3. Categorize by Approach: Group papers by whether they focus on Prediction (E[Y|X]) or Causal Inference (e.g., Double ML).
  4. Identify the Gap: Note what is missing. For instance, does a finance paper lack a structural interpretation, or does a policy paper ignore selection bias?
  5. Synthesize: Don't just list papers. Compare them—explain how newer ML methods improve upon traditional econometric benchmarks.

Report Structure (Track 2)

| Section | Recommended Length | Content |
|---|---|---|
| Introduction | 1–2 pages | Motivation, research question, scope of the review |
| Conceptual Framework | 2–3 pages | Key concepts, taxonomy of methods (Prediction vs. Causal Inference) |
| Literature Review | 6–8 pages | Thematic synthesis of 10–15 core papers; backward and forward search |
| Gap & Future Directions | 2–3 pages | Critical assessment of missing links and promising extensions |
| Conclusion | 1 page | Summary and takeaways |

Grading Rubric (Track 2)

| Component | Points | Criteria |
|---|---|---|
| Literature Coverage | 35% | Relevance, breadth, and use of snowball search |
| Analysis & Synthesis | 35% | Thematic organization, critical comparison, not just listing |
| Writing | 20% | Clarity, organization, professionalism |
| Original Insight | 10% | Quality of identified gaps and future directions |

📊 Report Requirements

Format

Suggested Structure

\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{graphicx, amsmath, natbib}

\title{Replication and Extension of [Paper Title]}
\author{Your Name\\ECON6083}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
Brief summary of paper, replication approach, key findings, and extension. 
(150-200 words)
\end{abstract}

\section{Introduction}  % 1-2 pages
% Motivation, research question, overview of replication and extension

\section{Literature and Background}  % 1-2 pages
% Related literature, theoretical framework, institutional details

\section{Data}  % 1-2 pages
% Data sources, sample construction, summary statistics (Table 1)

\section{Empirical Strategy}  % 2-3 pages
% Model specification, identification assumptions, estimation details

\section{Replication Results}  % 3-4 pages
% Main results, robustness checks, comparison to original paper

\section{Extension}  % 2-3 pages
% Motivation, methodology, results, interpretation

\section{Conclusion}  % 0.5-1 page
% Summary, limitations, future research

\bibliographystyle{aer}
\bibliography{references}

\appendix
\section{Additional Results}
\section{Code}

\end{document}

Grading Rubric (Track 1)

| Component | Points | Criteria |
|---|---|---|
| Replication | 40% | Accuracy, completeness, comparison to original |
| Extension | 30% | Originality, motivation, execution |
| Writing | 20% | Clarity, organization, professionalism |
| Code | 10% | Reproducibility, documentation, style |

🛠️ Technical Resources

Required Packages

pip install numpy pandas matplotlib scikit-learn scipy statsmodels

Optional Packages

pip install econml              # DML, Causal Forests
pip install synthdid            # Synthetic Control
pip install linearmodels        # Panel data
pip install transformers        # BERT for text
pip install spacy nltk jieba    # Text processing

📅 Timeline

| Week | Task |
|---|---|
| Week 8–9 | Choose paper, download data, start replication |
| Week 10 | Complete replication, identify extension idea |
| Week 11 | Implement extension |
| Week 12 | Write report, prepare submission |
| May 5 | Final deadline |

⚠️ Important: Start early! Data access and replication often take longer than expected. Come to office hours if you get stuck.