Hi, I'm Howard Zeng
Quantitative Researcher & ML Engineer
I build models that explain themselves, pipelines that don't break, and results you can defend.
About
I build and productionize loan-level prepayment / delinquency models and data pipelines across the full RMBS stack — CRT (STACR/CAS), MIR, Non-QM, Jumbo, and HELOC.
I own modules end-to-end — from factor design and diagnostics to daily risk refresh, tracking, and explainable outputs used in investment decisions.
I care about turning messy, high-dimensional data into models that actually ship — and about writing code the next person can read.
Previously at
Structured credit · Lending analytics · Asset pricing · Derivatives research
Skills
Structured Credit / Modeling
ML / Statistics
Data / Engineering
Systems / Tooling
Experience
- Production prepay/delinquency/loss models across the full resi stack — CRT (STACR/CAS), MIR, Non-QM, Jumbo, HELOC; factors stable under stress.
- End-to-end pipeline: SQL extraction → feature engineering → GAM/XGBoost fitting → automated QC (Jenkins/Python) on loan-level data.
- Own full model lifecycle — factor design through daily risk refresh — and deliver tracking outputs and risk explanations to PMs.
- Built executive dashboard for a $6B equity-loan portfolio; surfaced pricing anomalies that informed portfolio strategy.
- Built SQL/Python pipelines on Azure/Teradata processing 100M+ records with automated anomaly detection.
- End-to-end delivery: KPI design → Plotly visualizations → presentation to senior leadership.
- Led a 6-person team; built predictive models that outperformed the S&P 500 benchmark in backtests.
- Migrated Temporal Fusion Transformers to AWS SageMaker, optimizing 20M+ record workloads.
- Drove full pipeline architecture — feature engineering through deployment — as team lead.
- Reduced dataset size by 91% while preserving predictive signal across ~9M mortgage records; strong holdout AUC.
- Applied SMOTE resampling and Optuna-tuned XGBoost/RF/logistic classifiers with cross-validation.
- End-to-end mortgage default modeling — data prep through model selection and evaluation.
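The model-selection step above can be sketched with stand-ins: the project used SMOTE resampling and Optuna-tuned XGBoost, but to keep this illustration dependency-free it compares scikit-learn's gradient boosting, random forest, and logistic regression by cross-validated AUC on a synthetic imbalanced dataset. All data, class ratios, and parameters here are toy assumptions, not the mortgage records.

```python
# Hedged sketch of cross-validated model comparison on imbalanced data.
# Synthetic data stands in for the ~9M loan-level records; sklearn models
# stand in for the Optuna-tuned XGBoost/RF/logistic classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# ~5% positive class, mimicking a rare-default setting (assumed ratio)
X, y = make_classification(n_samples=3000, n_features=12,
                           weights=[0.95, 0.05], random_state=0)

models = {
    "gbdt": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "logit": LogisticRegression(max_iter=1000),
}

# AUC is threshold-free, so it is a reasonable headline metric here
auc = {name: cross_val_score(m, X, y, cv=3, scoring="roc_auc").mean()
       for name, m in models.items()}
for name, score in auc.items():
    print(f"{name}: mean CV AUC = {score:.3f}")
```

In the real project, hyperparameter search (Optuna) and resampling (SMOTE) would wrap each candidate inside the cross-validation loop so the resampler never sees validation folds.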
- Supported deep demand estimation research with reliable data pipelines and model experiments.
- Large-scale econometric dataset preparation and preliminary model evaluation.
- Independently managed preprocessing and evaluation workflows.
- Published equity research on database technology and virtual power plant sectors.
- Analyzed 19,424 funds using R and Wind terminal data for sector analysis.
- Owned full analysis from data collection through published research notes.
- Improved derivatives strategy Sharpe ratio by 17% through ARIMA/GBDT model refinement.
- Assembled an 11-year (2010–2021) Chinese futures dataset spanning multiple asset classes.
- Designed rolling-window and walk-forward validation for cross-regime robustness.
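The rolling-window validation in the last bullet can be sketched as follows; the data, window lengths, and ridge model are illustrative stand-ins, not the futures dataset or the ARIMA/GBDT models actually used.

```python
# Illustrative rolling-window walk-forward validation: refit on a fixed-length
# training window, score on the next out-of-sample block, then roll forward.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.5, size=n)

train_win, test_win = 200, 50   # assumed window lengths
scores = []
for start in range(0, n - train_win - test_win + 1, test_win):
    tr = slice(start, start + train_win)                     # training window
    te = slice(start + train_win, start + train_win + test_win)  # OOS block
    model = Ridge().fit(X[tr], y[tr])
    scores.append(mean_squared_error(y[te], model.predict(X[te])))

print(f"{len(scores)} walk-forward folds, mean OOS MSE = {np.mean(scores):.3f}")
```

Because each fold trains only on data that precedes its test block, performance is measured across regimes rather than on a single random split.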
Earlier Experience
iRent
Mobile Full Stack Developer | Founder & Team Leader
May 2019 - Jan 2021
Education
Cornell University
2023 - 2024
MS Applied Statistics — Data Science
GPA: 4.08 / 4.3
Relevant Coursework
University of Washington
2019 - 2022
BS Economics — Econometrics
Minor: Applied Math & Data Science
GPA: 3.6 / 4.0
Relevant Coursework
Featured Case Studies
AlphaCycle — Stock Prediction Framework
- Problem
- How to build a reproducible, config-driven research stack for systematic equity signals?
- Approach
- Built a three-layer framework (data / model / reporting) whose pipeline runs data ingestion → feature store → model training → reporting, with walk-forward validation and regime labels.
- Result
- Clean separation of data/model/report layers; config-driven pipelines; evaluation with robust metrics across multiple horizons.
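One way to sketch the "config-driven" idea: a single typed config object drives every layer, so a run is fully reproducible from one file. The names below (ResearchConfig, run_pipeline, the feature and horizon values) are hypothetical illustrations, not AlphaCycle's actual API.

```python
# Minimal sketch of a config-driven research pipeline: each layer reads only
# from the config, so two runs with the same config are identical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchConfig:
    universe: str = "sp500"                    # assumed universe name
    features: tuple = ("mom_12m", "vol_20d")   # illustrative feature names
    horizons: tuple = (1, 5, 21)               # evaluation horizons in days
    train_window: int = 252                    # walk-forward training window
    test_window: int = 21

def run_pipeline(cfg: ResearchConfig) -> dict:
    # Each layer's behavior is determined entirely by cfg (no hidden state).
    return {
        "data": f"ingest {cfg.universe}",
        "model": f"fit on {cfg.train_window}d window, features={cfg.features}",
        "report": f"evaluate horizons {cfg.horizons}",
    }

report = run_pipeline(ResearchConfig())
print(report["report"])
```

Freezing the dataclass keeps the config immutable during a run, which is one simple way to enforce the data/model/report separation described above.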
Retail Sentiment Trading Signals
- Problem
- Can retail investor sentiment on social media predict short-term equity price movements?
- Approach
- Built an NLP pipeline with VADER and FinBERT to score Reddit/Twitter posts, then trained classifiers on sentiment-price lag features.
- Result
- Achieved statistically significant predictive signal for 1–3 day returns; backtested strategy returned ~12% annualized alpha on selected tickers.
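The sentiment-price lag features mentioned above can be sketched with pandas: lagged sentiment scores become predictors and forward 1-3 day returns become targets, with shifts arranged so no look-ahead leaks in. The sentiment values here are random toy numbers, not VADER/FinBERT output, and no predictive claim is made.

```python
# Hedged sketch of lag-feature construction for a sentiment signal.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.bdate_range("2024-01-01", periods=60)
df = pd.DataFrame({
    "sentiment": rng.uniform(-1, 1, len(idx)),  # stand-in for post scores
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
}, index=idx)

# Forward returns over 1-3 day horizons are the prediction targets;
# shift(-h) aligns each day's row with the return realized h days later.
for h in (1, 2, 3):
    df[f"fwd_ret_{h}d"] = df["close"].pct_change(h).shift(-h)

# Lagged sentiment is the feature set: only information available at
# decision time (yesterday and earlier) predicts tomorrow's return.
for lag in (1, 2):
    df[f"sent_lag{lag}"] = df["sentiment"].shift(lag)

df = df.dropna()
print(df.columns.tolist())
```

From here, any classifier can be trained on the `sent_lag*` columns against the sign of `fwd_ret_*`, and backtested fold-by-fold in time order.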
Handling Class Imbalance in Random Forest
- Problem
- Standard Random Forest classifiers degrade on imbalanced datasets common in fraud detection and credit risk.
- Approach
- Systematically compared SMOTE, ADASYN, Tomek links, cost-sensitive learning, and ensemble balancing across multiple imbalance ratios.
- Result
- Cost-sensitive RF with SMOTE achieved 15–20% F1 improvement over baseline on highly skewed datasets.
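The cost-sensitive variant can be sketched with scikit-learn alone: `class_weight="balanced"` reweights errors on the minority class, compared against a default forest by F1 on a skewed synthetic dataset. The study also used SMOTE/ADASYN/Tomek links (via imbalanced-learn), omitted here to keep the sketch dependency-free; the data and any score gap below are illustrative, not the reported 15-20% result.

```python
# Sketch: cost-sensitive random forest vs. default forest on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# ~3% positives, an assumed ratio mimicking fraud/credit-risk settings
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

baseline = RandomForestClassifier(n_estimators=200,
                                  random_state=0).fit(Xtr, ytr)
# class_weight="balanced" scales sample weights inversely to class frequency,
# so minority-class mistakes cost more during tree construction
weighted = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                  random_state=0).fit(Xtr, ytr)

f1_base = f1_score(yte, baseline.predict(Xte))
f1_weighted = f1_score(yte, weighted.predict(Xte))
print(f"baseline F1={f1_base:.3f}, cost-sensitive F1={f1_weighted:.3f}")
```

In the full comparison, each rebalancing technique would be evaluated the same way across several imbalance ratios, with resampling applied inside the training folds only.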