{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 02 — HDD Regression & Normalization\n\nRun OLS regression per postal code, apply quality filters, normalize slopes by building stock characteristics from the tax roll, and produce a thermal intensity metric for ranking."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 — Load Data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy import stats\n\nsns.set_theme(style='whitegrid', palette='colorblind')\n\ngas_df = pd.read_csv('data/ces_gas_consumption.csv', parse_dates=['month'])\nmpac_df = pd.read_csv('data/ces_mpac_summary.csv')\n\nprint(f'Gas data: {gas_df.shape}, MPAC data: {mpac_df.shape}')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 — Per-Postal-Code OLS Regression"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "results = []\nfor pc, group in gas_df.groupby('postal_code'):\n    hdd = group.hdd.values\n    gas = group.gas_gj.values\n\n    slope, intercept, r_value, p_value, std_err = stats.linregress(hdd, gas)\n\n    results.append({\n        'postal_code': pc,\n        'slope': slope,\n        'intercept': intercept,\n        'r_squared': r_value**2,\n        'p_value': p_value,\n        'std_err': std_err,\n        'n_obs': len(group),\n    })\n\nreg_df = pd.DataFrame(results)\nprint(f'Regressions complete: {len(reg_df)} postal codes')\nreg_df.describe().round(4)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 — Regression Quality Checks"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Quality filters\nreg_df['pass_r2'] = reg_df.r_squared > 0.80\nreg_df['pass_slope'] = reg_df.slope > 0\nreg_df['pass_intercept'] = reg_df.intercept > 0\nreg_df['pass_all'] = reg_df.pass_r2 & reg_df.pass_slope & reg_df.pass_intercept\n\nn_pass = reg_df.pass_all.sum()\nn_fail = (~reg_df.pass_all).sum()\nprint(f'Quality check: {n_pass} pass, {n_fail} flagged')\n\n# R² distribution\nfig, ax = plt.subplots(figsize=(10, 4))\nax.hist(reg_df.r_squared, bins=30, edgecolor='black', alpha=0.7)\nax.axvline(0.80, color='red', linestyle='--', label='R² threshold = 0.80')\nax.set_xlabel('R²')\nax.set_ylabel('Postal Codes')\nax.set_title('Regression R² Distribution')\nax.legend()\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4 — Merge with MPAC Data and Compute Heated Volume"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Filter to passing postal codes\nqualified = reg_df[reg_df.pass_all].merge(mpac_df, on='postal_code')\n\n# Estimate heated volume per property\nFLOOR_HEIGHT = 2.7  # m, above grade\nBSMT_HEIGHT = 2.4   # m\n\nqualified['above_grade_vol'] = (qualified.avg_footprint_m2\n                                 * qualified.avg_storeys\n                                 * FLOOR_HEIGHT)\nqualified['below_grade_vol'] = (qualified.avg_footprint_m2\n                                 * qualified.basement_fraction\n                                 * BSMT_HEIGHT)\nqualified['heated_vol_m3'] = qualified.above_grade_vol + qualified.below_grade_vol\n\nprint(f'Qualified postal codes: {len(qualified)}')\nqualified[['postal_code', 'avg_footprint_m2', 'avg_storeys',\n           'basement_fraction', 'heated_vol_m3']].describe().round(1)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 — Normalize: Per-Property Slope and Thermal Intensity"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Per-property slope\nqualified['slope_per_prop'] = qualified.slope / qualified.property_count\n\n# Normalized thermal intensity: gas per HDD per m³ heated volume\nqualified['thermal_intensity'] = qualified.slope_per_prop / qualified.heated_vol_m3\n\nprint('Thermal intensity statistics:')\nqualified.thermal_intensity.describe().round(6)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.6 — Effect of Normalization\n\nCompare raw slope ranking vs. normalized ranking — they should differ substantially."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "qualified['rank_raw'] = qualified.slope.rank(ascending=False).astype(int)\nqualified['rank_normalized'] = qualified.thermal_intensity.rank(ascending=False).astype(int)\n\nfig, ax = plt.subplots(figsize=(8, 8))\nax.scatter(qualified.rank_raw, qualified.rank_normalized, alpha=0.5, s=20)\nax.plot([0, len(qualified)], [0, len(qualified)], 'r--', alpha=0.5, label='No change')\nax.set_xlabel('Rank by Raw Slope')\nax.set_ylabel('Rank by Normalized Thermal Intensity')\nax.set_title('Normalization Changes the Ranking')\nax.legend()\nplt.tight_layout()\nplt.show()\n\n# Rank correlation\nfrom scipy.stats import spearmanr\nrho, p = spearmanr(qualified.rank_raw, qualified.rank_normalized)\nprint(f'Spearman rank correlation: {rho:.3f} (p={p:.4f})')\nprint('Normalization materially reshuffles the ranking.')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.7 — Thermal Intensity by Structure Type"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "fig, ax = plt.subplots(figsize=(10, 5))\nfor stype in ['detached', 'semi-detached', 'row', 'low-rise-apt']:\n    subset = qualified[qualified.structure_type == stype]\n    ax.hist(subset.thermal_intensity, bins=20, alpha=0.5, label=stype)\n\nax.set_xlabel('Normalized Thermal Intensity (GJ / HDD / m³)')\nax.set_ylabel('Postal Codes')\nax.set_title('Thermal Intensity Distribution by Structure Type')\nax.legend()\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.8 — Validate Against Ground Truth\n\nBecause this is synthetic data, we can check if the normalized metric correlates with the true slope."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "fig, ax = plt.subplots(figsize=(8, 6))\nax.scatter(qualified.true_slope, qualified.thermal_intensity, alpha=0.5, s=20)\nax.set_xlabel('True Thermal Slope (known ground truth)')\nax.set_ylabel('Normalized Thermal Intensity (computed)')\nax.set_title('Validation: Normalized Metric vs. Ground Truth')\n\nrho, p = spearmanr(qualified.true_slope, qualified.thermal_intensity)\nax.text(0.05, 0.95, f'Spearman ρ = {rho:.3f}', transform=ax.transAxes, va='top',\n        fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.9 — Save Results"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "qualified.to_csv('data/ces_normalized_results.csv', index=False)\nprint(f'Saved {len(qualified)} qualified postal codes with thermal intensity.')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n**Next:** Notebook 03 ranks postal codes, identifies priority neighbourhoods, and builds the targeting output."
   ]
  }
 ]
}