Add references folder with course materials and notebooks

2026-05-07 21:14:07 -06:00 · 2026-05-07 21:14:07 -06:00 · 5a65894442
commit 5a65894442
parent c201fec61b
12 changed files with 33838 additions and 0 deletions
--- a/Machines.pdf
+++ b/Machines.pdf
--- a/references/CIS490_Project_Workbook.xlsx
+++ b/references/CIS490_Project_Workbook.xlsx
--- a/references/CIS_490_ML_Project_Assignment_Guide.docx
+++ b/references/CIS_490_ML_Project_Assignment_Guide.docx
--- a/references/DANTE:
+++ b/references/DANTE:
--- a/references/LogBERT:
+++ b/references/LogBERT:
--- a/references/Transformer-based
+++ b/references/Transformer-based
--- a/references/cis490_ipynbfiles/CIS490_Decision_Trees_and_Random_Forest.ipynb
+++ b/references/cis490_ipynbfiles/CIS490_Decision_Trees_and_Random_Forest.ipynb
--- a/references/cis490_ipynbfiles/CIS490_K_Nearest_Neighbors_(KNN).ipynb
+++ b/references/cis490_ipynbfiles/CIS490_K_Nearest_Neighbors_(KNN).ipynb
--- a/references/cis490_ipynbfiles/CIS490_Linear_Regression.ipynb
+++ b/references/cis490_ipynbfiles/CIS490_Linear_Regression.ipynb
@@ -0,0 +1,973 @@
 {
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IBRVvriFonN1"
      },
      "source": [
        "## **Lesson: Predicting CVSS Base Score**\n",
        "\n",
        "**Goal:** Use the CVE (Common Vulnerabilities and Exposures) dataset to predict the **baseScore** (severity score from 0–10) using the other CVSS attributes.\n",
        "\n",
        "**What we will cover:**\n",
        "1. Data collection\n",
        "2. Data processing\n",
        "3. Data analysis and summary statistics with visuals\n",
        "4. Why we encode variables\n",
        "5. Linear regression to predict baseScore\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wQakUgalonN3"
      },
      "source": [
        "### **Linear Regression**\n",
        "\n",
        "**Linear regression** is a supervised learning algorithm used to predict a continuous (numerical) target variable based on one or more input features. The goal of linear regression is to model the relationship between the independent variables (features) and the dependent variable (target) as a straight line.\n",
        "\n",
        "---\n",
        "\n",
        "### Key Idea of Linear Regression\n",
        "\n",
        "Linear regression assumes that the relationship between the input $x$ (or multiple inputs) and the output $y$ can be modeled as:\n",
        "\n",
        "$y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\ldots + \\beta_n x_n + \\epsilon$\n",
        "\n",
        "Where:\n",
        "- $y$: The dependent variable (target).\n",
        "- $x_1, x_2, \\ldots, x_n$: Independent variables (features).\n",
        "- $\\beta_0$: The intercept (value of $y$ when all $x_i = 0$).\n",
        "- $\\beta_1, \\beta_2, \\ldots, \\beta_n$: Coefficients (slopes) for each feature.\n",
        "- $\\epsilon$: The error term, accounting for noise or variability not explained by the model.\n",
        "\n",
        "In simple linear regression (one feature), the equation becomes:\n",
        "\n",
        "$y = \\beta_0 + \\beta_1 x + \\epsilon$\n",
        "\n",
        "---\n",
        "\n",
        "### Objectives of Linear Regression\n",
        "\n",
        "The primary goal is to find the values of $\\beta_0$ and $\\beta_1, \\ldots, \\beta_n$ that minimize the difference between the predicted and actual values. This difference is measured using the **residual sum of squares (RSS)**:\n",
        "\n",
        "$RSS = \\sum_{i=1}^N \\left( y_i - \\hat{y}_i \\right)^2$\n",
        "\n",
        "Where:\n",
        "- $y_i$: The actual target value for observation $i$.\n",
        "- $\\hat{y}_i$: The predicted value from the model ($\\hat{y}_i = \\beta_0 + \\beta_1 x_{i1} + \\ldots + \\beta_n x_{in}$).\n",
        "\n",
        "---\n",
        "\n",
        "### How Linear Regression Works\n",
        "\n",
        "1. **Fit a Line to the Data**:\n",
        "   The model calculates the line (or hyperplane for multiple features) that best fits the data. This is achieved by minimizing the RSS using techniques like **Ordinary Least Squares (OLS)**.\n",
        "\n",
        "2. **Prediction**:\n",
        "   Once the line is fitted, predictions for new data points can be made using the equation $\\hat{y} = \\beta_0 + \\beta_1 x_1 + \\ldots + \\beta_n x_n$.\n",
        "\n",
        "3. **Evaluate Performance**:\n",
        "   The model's performance is typically evaluated using metrics such as:\n",
        "   - **Mean Squared Error (MSE)**:\n",
        "     $MSE = \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - \\hat{y}_i \\right)^2$\n",
        "   - **R-squared ($R^2$)**: Measures how well the model explains the variance in the target variable.\n",
        "     $R^2 = 1 - \\frac{\\text{RSS}}{\\text{TSS}}$\n",
        "     Where $\\text{TSS}$ is the total sum of squares.\n",
        "\n",
        "---\n",
        "\n",
        "### Example: Simple Linear Regression\n",
        "\n",
        "Suppose we want to predict a student’s test score ($y$) based on the number of study hours ($x$):\n",
        "\n",
        "1. **Dataset**:\n",
        "   - $x$: [1, 2, 3, 4, 5] (hours studied)\n",
        "   - $y$: [50, 55, 60, 65, 70] (test scores)\n",
        "\n",
        "2. **Equation of the Line**:\n",
        "   The linear regression model fits the line:\n",
        "   $y = 50 + 5x$\n",
        "\n",
        "   Here:\n",
        "   - $\\beta_0 = 50$: The intercept.\n",
        "   - $\\beta_1 = 5$: The slope, meaning each additional hour of study increases the score by 5 points.\n",
        "\n",
        "3. **Prediction**:\n",
        "   If a student studies for 6 hours ($x = 6$):\n",
        "   $\\hat{y} = 50 + 5(6) = 80$\n",
        "\n",
        "---\n",
        "\n",
        "### Strengths of Linear Regression\n",
        "\n",
        "1. **Simplicity**: Easy to understand and implement.\n",
        "2. **Interpretability**: Coefficients provide insights into the relationships between features and the target variable.\n",
        "3. **Efficiency**: Performs well for small to medium-sized datasets with linear relationships.\n",
        "\n",
        "---\n",
        "\n",
        "### Weaknesses of Linear Regression\n",
        "\n",
        "1. **Assumptions**:\n",
        "   - **Linearity**: The relationship between features and the target must be linear.\n",
        "   - **Homoscedasticity**: Constant variance of residuals.\n",
        "   - **Independence**: Residuals should be independent of each other.\n",
        "   - **Normality**: Residuals should be normally distributed.\n",
        "2. **Outliers**: Sensitive to outliers, which can distort the model.\n",
        "3. **Collinearity**: High correlation between features can affect the stability of coefficients.\n",
        "\n",
        "---\n",
        "\n",
        "### Applications of Linear Regression\n",
        "\n",
        "1. **Business**: Predicting sales, profits, or costs based on historical data.\n",
        "2. **Finance**: Modeling stock prices or investment returns.\n",
        "3. **Healthcare**: Estimating medical costs based on patient characteristics.\n",
        "\n",
        "---\n",
        "\n",
        "Linear regression provides a straightforward and powerful method for modeling relationships in data and making predictions for continuous outcomes.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ULWStFd_onN4"
      },
      "source": [
        "## 1. Data Collection\n",
        "\n",
        "Our data comes from the **NVD (National Vulnerability Database)** @ `https://nvd.nist.gov/developers/vulnerabilities`. We collected it by:\n",
        "- Calling the NVD REST API: `https://services.nvd.nist.gov/rest/json/cves/2.0`\n",
        "- Requesting multiple pages of results (with a short wait between requests to respect rate limits)\n",
        "- Keeping only records that have a **CVSS v3** score (so every row has a numeric baseScore and related attributes)\n",
        "\n",
        "The result was saved to a CSV file: `cve_cvss31.csv`. In the next cell we load the libraries we need and then read that file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e4ee0f7e"
      },
      "source": [
        "### How the data collection works (NVD CVE API → CSV)\n",
        "\n",
        "This lesson uses the **NVD (National Vulnerability Database) CVE API v2.0** to download public CVE records and extract the **CVSS v3.1** (or v3.0) fields we need for modeling.\n",
        "\n",
        "**Key idea:** the API returns JSON pages of CVEs. For each CVE, we look for `metrics.cvssMetricV31` (preferred) or `metrics.cvssMetricV30`, then read `cvssData` to get:\n",
        "- the **target**: `baseScore` (a number from 0 to 10)\n",
        "- the **features**: the CVSS categorical components (Attack Vector, Attack Complexity, etc.)\n",
        "\n",
        "We then write the cleaned rows to `cve_cvss31.csv` so the rest of the notebook can work offline and reproducibly.\n",
        "\n",
        "> **Practical note (rate limits):** NVD enforces rate limits; downloads may fail if you request too much too quickly.  \n",
        "> To reduce friction in class, the notebook defaults to *not* re-downloading if the CSV already exists.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "20e6d8e5"
      },
      "outputs": [],
      "source": [
        "# NVD CVE API: fetch CVEs and extract CVSSv3 into separate columns (10 pages, with wait)\n",
        "#\n",
        "# Functionality is intentionally identical to the original script:\n",
        "# - Uses urllib.request (no requests library)\n",
        "# - Iterates up to 100 pages (range(0, 100 * results_per_page, results_per_page))\n",
        "# - Stops early if no vulnerabilities returned\n",
        "# - Prefers CVSS v3.1, falls back to v3.0\n",
        "# - Sleeps 6 seconds only for the first 10 pages\n",
        "# - Writes incrementally to CSV and flushes after each page\n",
        "\n",
        "import urllib.request\n",
        "import json\n",
        "import csv\n",
        "import time\n",
        "\n",
        "# CSV column schema (header row)\n",
        "# First column: CVE ID\n",
        "# Remaining columns: CVSS v3.x fields extracted from cvssData\n",
        "cols = [\n",
        "    \"cveId\",\n",
        "    \"vectorString\",\n",
        "    \"attackVector\",\n",
        "    \"attackComplexity\",\n",
        "    \"privilegesRequired\",\n",
        "    \"userInteraction\",\n",
        "    \"scope\",\n",
        "    \"confidentialityImpact\",\n",
        "    \"integrityImpact\",\n",
        "    \"availabilityImpact\",\n",
        "    \"baseScore\",\n",
        "    \"baseSeverity\",\n",
        "]\n",
        "\n",
        "# Keys to extract from the NVD \"cvssData\" object\n",
        "v3_keys = [\n",
        "    \"vectorString\",\n",
        "    \"attackVector\",\n",
        "    \"attackComplexity\",\n",
        "    \"privilegesRequired\",\n",
        "    \"userInteraction\",\n",
        "    \"scope\",\n",
        "    \"confidentialityImpact\",\n",
        "    \"integrityImpact\",\n",
        "    \"availabilityImpact\",\n",
        "    \"baseScore\",\n",
        "    \"baseSeverity\",\n",
        "]\n",
        "\n",
        "# Number of CVEs requested per API call\n",
        "results_per_page = 2000\n",
        "\n",
        "# Counter for successfully written rows\n",
        "total = 0\n",
        "\n",
        "# Open CSV file for writing (overwrites if exists)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zZfvuzp5onN5"
      },
      "source": [
        "**Why the code is structured this way:** The NVD API returns at most 2,000 CVEs per request, so we use **pagination** (a loop with `startIndex`) to request multiple \"pages\" until we have enough data. We **sleep a few seconds** between requests because the API enforces rate limits (e.g. 5 requests per 30 seconds without an API key); without the wait, we would get \"429 Too Many Requests\" errors. We **prefer CVSS v3.1 then v3.0** because they share the same structure; we only keep rows that have at least one of these so every row has a valid baseScore and component fields. Writing to CSV after each page (**flush**) saves progress so if the run is interrupted we don't lose all collected data."
      ]
    },
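    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The version preference described above can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the collection script itself: `extract_cvss_v3` and the sample record are hypothetical, and the record layout (`cve` → `metrics` → `cvssMetricV31` or `cvssMetricV30` → `cvssData`) is assumed from the field names used in this lesson. Keeping the extraction separate from the network loop makes it easy to try on a hand-written record without calling the API."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def extract_cvss_v3(vuln, keys):\n",
        "    # Return a dict of CVSS v3.x fields for one record, or None if it has no v3 metrics.\n",
        "    cve = vuln.get('cve', {})\n",
        "    metrics = cve.get('metrics', {})\n",
        "    # Prefer v3.1; fall back to v3.0 (assumed to share the same cvssData structure).\n",
        "    entries = metrics.get('cvssMetricV31') or metrics.get('cvssMetricV30')\n",
        "    if not entries:\n",
        "        return None\n",
        "    data = entries[0].get('cvssData', {})\n",
        "    row = {'cveId': cve.get('id')}\n",
        "    row.update({k: data.get(k) for k in keys})\n",
        "    return row\n",
        "\n",
        "# Hypothetical sample record (illustrative values, not real NVD output).\n",
        "sample = {'cve': {'id': 'CVE-0000-0000', 'metrics': {'cvssMetricV30': [\n",
        "    {'cvssData': {'attackVector': 'NETWORK', 'baseScore': 7.5}}]}}}\n",
        "print(extract_cvss_v3(sample, ['attackVector', 'baseScore']))\n",
        "print(extract_cvss_v3({'cve': {'id': 'CVE-0000-0001'}}, ['baseScore']))  # no v3 metrics"
      ]
    },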
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QloKESAvonN5"
      },
      "outputs": [],
      "source": [
        "# We use pandas for loading and processing data, and matplotlib/seaborn for plots.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qVl3mCrhonN6"
      },
      "source": [
        "## 2. Data Processing\n",
        "\n",
        "**Data processing** means loading the data, checking its shape and types, and fixing any issues (like missing values or wrong types) before we analyze or model it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N6_rwgU9onN6"
      },
      "outputs": [],
      "source": [
        "# Load the CSV file into a pandas DataFrame.\n",
        "# A DataFrame is like a table: rows = records (CVEs), columns = attributes.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OGweIRmEonN6"
      },
      "source": [
        "**Why we use pandas and check the data:** We use **pandas** because it is the standard tool for tabular data in Python; it gives us a DataFrame (rows = CVEs, columns = attributes) that we can easily summarize and later pass to scikit-learn. We check **shape** and **info()** to confirm how many rows we have and that each column has the right type (e.g. `baseScore` is numeric). We check **missing values** because gaps in the data can cause errors or biased results when we fit the model—if we had missing baseScores we would drop or fix those rows before modeling."
      ]
    },
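    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The checks described above can be sketched on a tiny made-up table. This is a minimal sketch, not the lesson's own cell: `csv_text`, `sample_df`, and `clean_df` are hypothetical names, and the three rows are invented so the example runs without `cve_cvss31.csv`. One row deliberately lacks a `baseScore` to show what a missing value looks like."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import io\n",
        "import pandas as pd\n",
        "\n",
        "# Tiny made-up sample standing in for the real CSV (values are illustrative).\n",
        "csv_text = '''cveId,attackVector,baseScore\n",
        "CVE-0000-0001,NETWORK,9.8\n",
        "CVE-0000-0002,LOCAL,5.5\n",
        "CVE-0000-0003,NETWORK,\n",
        "'''\n",
        "sample_df = pd.read_csv(io.StringIO(csv_text))\n",
        "\n",
        "print(sample_df.shape)         # (rows, columns)\n",
        "sample_df.info()               # column types and non-null counts\n",
        "print(sample_df.isna().sum())  # missing values per column\n",
        "\n",
        "# Drop rows with a missing target before modeling.\n",
        "clean_df = sample_df.dropna(subset=['baseScore'])\n",
        "print(len(clean_df))"
      ]
    },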
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KZObxMajonN6"
      },
      "outputs": [],
      "source": [
        "# Check data types and whether there are missing values.\n",
        "# object = text (strings); float64 = decimal numbers.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y9tUM1CfonN6"
      },
      "outputs": [],
      "source": [
        "# Count missing values per column. For this dataset we expect very few or none,\n",
        "# because we only kept records that had a CVSS v3 score when we collected the data.\n",
        "\n",
        "\n",
        "# If there were any missing values, we could drop those rows with: df = df.dropna()\n",
        "# For this lesson we will assume the data is complete."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8TKqgZZvonN6"
      },
      "source": [
        "## 3. Data Analysis and Summary Statistics\n",
        "\n",
        "Before building a model, we summarize the data and look at how the variables behave. This helps us understand the distribution of the **target** (baseScore) and the **features** (other columns we will use to predict baseScore)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ERxeye2EonN6"
      },
      "outputs": [],
      "source": [
        "# Summary statistics for numeric columns (here, mainly baseScore).\n",
        "# count, mean, std, min, 25%, 50%, 75%, max.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dqT5X0RJonN7"
      },
      "source": [
        "**Why we use `describe()` and `value_counts()`:** `describe()` gives the mean, min, max, and quartiles for numeric columns (here, mainly baseScore), so we see the scale and spread of our target and can spot outliers. `value_counts()` shows how often each category appears (e.g. NETWORK vs LOCAL); that tells us if some categories are rare, which can affect how well the model learns from them."
      ]
    },
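    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of the two summaries described above, on a made-up four-row table (`sample_df` and its values are hypothetical, not the CVE data). Passing `normalize=True` to `value_counts()` reports proportions instead of counts, which makes rare categories easier to spot."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Made-up rows for illustration only.\n",
        "sample_df = pd.DataFrame({\n",
        "    'attackVector': ['NETWORK', 'NETWORK', 'LOCAL', 'NETWORK'],\n",
        "    'baseScore': [9.8, 7.5, 5.5, 6.1],\n",
        "})\n",
        "\n",
        "print(sample_df['baseScore'].describe())                        # count, mean, std, min, quartiles, max\n",
        "print(sample_df['attackVector'].value_counts())                 # rows per category\n",
        "print(sample_df['attackVector'].value_counts(normalize=True))   # share per category"
      ]
    },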
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Z-bUYAz7onN7"
      },
      "outputs": [],
      "source": [
        "# For categorical columns, we can count how often each value appears.\n",
        "# Example: attackVector (how the attacker reaches the system).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TGRicOJvonN7"
      },
      "source": [
        "## 4. Visualizations\n",
        "\n",
        "Visuals make it easier to see patterns and distributions. We will:\n",
        "- Plot the distribution of **baseScore** (our target).\n",
        "- Plot how many CVEs fall into each category for a few categorical features.\n",
        "- Later we will use encoding so we can include these categories in a regression model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IL4NCUKSonN7"
      },
      "source": [
        "**Why we use these plots:** A **histogram** of baseScore shows whether scores are spread across 0–10 or concentrated in a range (e.g. mostly high), and whether the distribution is roughly symmetric—this helps us interpret the model later. **Bar charts** for categorical columns (e.g. attackVector, baseSeverity) show which categories are most common; that can explain why the model might do better or worse on some groups. The **box plot** of baseScore by baseSeverity shows how the numeric score maps to the severity label (e.g. CRITICAL vs HIGH) and whether there is overlap between categories, which is useful for understanding both the data and the model's job."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9ncctDpPonN7"
      },
      "outputs": [],
      "source": [
        "# Distribution of baseScore (the variable we want to predict).\n",
        "# A histogram shows how many CVEs have scores in each range.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_gHmhgqconN7"
      },
      "outputs": [],
      "source": [
        "# Bar chart: how many CVEs for each attack vector?\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0reRgxHZonN7"
      },
      "outputs": [],
      "source": [
        "# Bar chart: baseSeverity (LOW, MEDIUM, HIGH, CRITICAL).\n",
        "# Order the categories from low to high severity for a clearer picture.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jln3aLgionN7"
      },
      "source": [
        "**Why we use `get_dummies()`:** It turns each category into its own binary (0/1) column so the model sees numbers instead of text. We use one-hot encoding (not label encoding) because our categories are **unordered** (e.g. NETWORK is not \"greater than\" LOCAL); giving them numbers like 1, 2, 3 would wrongly imply an order. We drop **cveId** (just an ID), **vectorString** (the same information is already in the component columns), and **baseSeverity** (it is derived from baseScore—using it would be data leakage and would not reflect real prediction)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yu-kFW6ionN7"
      },
      "outputs": [],
      "source": [
        "# Box plot: baseScore broken down by baseSeverity.\n",
        "# This shows how the numeric score relates to the severity label.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J468dXOOonN7"
      },
      "source": [
        "## 5. Why We Encode\n",
        "\n",
        "**Linear regression** uses numbers. It computes something like:\n",
        "\n",
        "$$\\text{baseScore} \\approx w_1 \\cdot x_1 + w_2 \\cdot x_2 + \\cdots + b$$\n",
        "\n",
        "Our features like **attackVector** (e.g. \"NETWORK\", \"LOCAL\") and **privilegesRequired** (e.g. \"NONE\", \"LOW\") are **categorical**: they are labels, not numbers. The model cannot multiply a weight by \"NETWORK\".\n",
        "\n",
        "**Encoding** means turning categories into numbers in a way the model can use:\n",
        "- **Label encoding:** Replace each category with a single number (e.g. LOW=0, MEDIUM=1, HIGH=2). Good for *ordered* categories.\n",
        "- **One-hot encoding:** Create one binary (0/1) column per category. For example, attackVector=\"NETWORK\" becomes a column `attackVector_NETWORK = 1` and all others 0. Good for *unordered* categories (like attack vector type).\n",
        "\n",
        "For this lesson we use **one-hot encoding** for the categorical CVSS components so the linear model can use them without assuming an order that does not exist."
      ]
    },
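    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of one-hot encoding on a tiny made-up table (`sample_df`, `encoded`, and the values are hypothetical, not the CVE data). It uses `pandas.get_dummies` with `dtype=int` so the indicator columns hold 0/1 integers; each original categorical column is replaced by one indicator column per category, and the numeric `baseScore` column is left unchanged."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Made-up rows for illustration only.\n",
        "sample_df = pd.DataFrame({\n",
        "    'attackVector': ['NETWORK', 'LOCAL', 'NETWORK'],\n",
        "    'privilegesRequired': ['NONE', 'LOW', 'NONE'],\n",
        "    'baseScore': [9.8, 5.5, 7.5],\n",
        "})\n",
        "\n",
        "# One indicator column per category of each listed column.\n",
        "encoded = pd.get_dummies(sample_df, columns=['attackVector', 'privilegesRequired'], dtype=int)\n",
        "print(encoded.columns.tolist())\n",
        "print(encoded)"
      ]
    },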
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5dNk0N4SonN7"
      },
      "outputs": [],
      "source": [
        "# Which columns are categorical? These are the ones we need to encode.\n",
        "# We will NOT use cveId (just an ID) or vectorString (redundant with the component columns).\n",
        "# We also do NOT use baseSeverity when predicting baseScore (it is derived from baseScore; using it would be cheating).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here’s a table summarizing types of encoding, their use cases, and examples for better understanding:\n",
        "\n",
        "| **Encoding Type**        | **Best For**                                         | **Description**                                                                                                                                         | **Example**                                                                                       |\n",
        "|---------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|\n",
        "| **Ordinal Encoding**      | Ordered categories                                  | Converts categories into numeric values that reflect a natural order.                                                                                   | `['low', 'medium', 'high'] → [0, 1, 2]`                                                          |\n",
        "| **One-Hot Encoding**      | Unordered categories                                | Creates a binary column for each category; `1` indicates presence, `0` absence.                                                                         | `['red', 'blue', 'green'] → [1, 0, 0], [0, 1, 0], [0, 0, 1]`                                    |\n",
        "| **Label Encoding**        | Unordered categories (tree-based models)            | Assigns a unique numeric value to each category but does not consider any order.                                                                        | `['cat', 'dog', 'rabbit'] → [0, 1, 2]`                                                          |\n",
        "| **Binary Encoding**       | High-cardinality features                           | Combines label encoding and one-hot encoding by converting labels into binary format and splitting into columns.                                         | `['A', 'B', 'C'] → [0, 0], [0, 1], [1, 0]` (binary representation of `[0, 1, 2]`)                |\n",
        "| **Target Encoding**       | Categorical features influencing the target         | Replaces categories with the mean of the target variable for each category.                                                                             | `['A', 'B', 'C']` with average prices `[50k, 70k, 100k] → [50k, 70k, 100k]`                      |\n",
        "| **Frequency Encoding**    | Categorical features where frequency matters        | Replaces categories with their frequency or count in the dataset.                                                                                       | `['A', 'B', 'B', 'C'] → [1, 2, 2, 1]`                                                           |\n",
        "| **Hash Encoding**         | High-cardinality data in large datasets             | Converts categories into numeric hashes of fixed length; reduces dimensionality while maintaining uniqueness.                                            | `['cat', 'dog', 'rabbit'] → [101, 202, 303]` (hash values depend on implementation)              |\n",
        "| **Mean Encoding**         | Target-related categorical features                | Similar to target encoding but uses the mean target value for each category across training data to avoid data leakage.                                  | `['cityA', 'cityB', 'cityC']` with house prices mean: `[200k, 300k, 250k] → [200k, 300k, 250k]` |\n",
        "| **Position Encoding**     | Features with sequential or spatial relationships   | Encodes ordinal features or those with temporal patterns into values that represent their positions.                                                     | Months `[Jan, Feb, Mar] → [1, 2, 3]`                                                            |\n",
        "| **Embedding**             | High-cardinality features for deep learning models | Maps categories into dense numerical vectors of fixed size (often learned during training).                                                              | `['cat', 'dog', 'rabbit'] → Vector embeddings: `[0.1, 0.2], [0.3, 0.5], [0.7, 0.9]`             |\n"
      ],
      "metadata": {
        "id": "Crj-jDPirK8q"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6M8XBn3ronN7"
      },
      "source": [
        "## 6. Encoding the Data\n",
        "\n",
        "We use pandas' **get_dummies()** to one-hot encode the categorical columns. Each category becomes a new column with 0s and 1s."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MwTywi9gonN8"
      },
      "outputs": [],
      "source": [
        "# One-hot encode the categorical columns. drop_first=True removes one category per column\n",
        "# to avoid redundancy (optional but often used in regression). For simplicity we keep all columns here.\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Correlation heatmap (numeric columns only; pandas excludes non-numeric automatically)\n",
        "# Correlation heatmap: use only numeric columns (df has string/categorical columns)\n"
      ],
      "metadata": {
        "id": "DDnzDPWEstWx"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ux_jcq7NonN8"
      },
      "outputs": [],
      "source": [
        "# Separate the target (what we want to predict) from the features (what we use to predict).\n",
        "# We use only the numeric/dummy columns; we drop cveId, vectorString, and baseSeverity.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WCoXHlMmonN8"
      },
      "source": [
        "**Why we use these metrics:** **MAE** (Mean Absolute Error) is the average absolute difference between predicted and actual baseScore; it is in the same units as the score (0–10) and easy to interpret. **RMSE** (Root Mean Squared Error) also measures average error but penalizes large mistakes more, so it is useful when we care about avoiding big prediction errors. **R²** (R-squared) tells us what fraction of the variation in baseScore is explained by the model (0 = no better than predicting the mean, 1 = perfect predictions); it helps us compare models or judge if the model is useful at all."
      ]
    },
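    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**The metrics in symbols** (a short restatement using the same $y_i$, $\\hat{y}_i$, and $N$ as the formulas earlier in this lesson):\n",
        "\n",
        "$$\\text{MAE} = \\frac{1}{N} \\sum_{i=1}^N \\left| y_i - \\hat{y}_i \\right| \\qquad \\text{RMSE} = \\sqrt{\\frac{1}{N} \\sum_{i=1}^N \\left( y_i - \\hat{y}_i \\right)^2}$$\n",
        "\n",
        "For example, with made-up errors $y_i - \\hat{y}_i$ of $1, -1, 2, 0$: $\\text{MAE} = \\frac{1 + 1 + 2 + 0}{4} = 1$, while $\\text{RMSE} = \\sqrt{\\frac{1 + 1 + 4 + 0}{4}} = \\sqrt{1.5} \\approx 1.22$. RMSE is larger because the single error of 2 is squared before averaging."
      ]
    },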
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7JWPhUeBonN8"
      },
      "source": [
        "## 7. Linear Regression Model\n",
        "\n",
        "We split the data into **train** and **test** sets. We fit the model on the training set and evaluate it on the test set to see how well it predicts baseScore from the encoded features."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w6eRG82zonN7"
      },
      "source": [
        "**Why we split into train and test:** We fit the model only on the **training** set and measure performance on the **test** set. That way we evaluate how well the model **generalizes** to data it has never seen. If we evaluated on the same data we trained on, we would overestimate performance (overfitting). An 80/20 split is a common default; we use **random_state=42** so the split is reproducible.\n",
        "\n",
        "**Why linear regression:** It is a simple, interpretable baseline: the prediction is a weighted sum of the features plus a constant. It works well when the relationship between features and target is roughly linear, and it only needs numeric inputs—which we get after one-hot encoding. We can always try more complex models later."
      ]
    },
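    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The split-fit-evaluate pattern described above can be sketched on synthetic data. This is a minimal sketch under stated assumptions: the 0/1 features, the weights `[3.0, 1.5, -1.0]`, the intercept `2.0`, and the noise level are all made up, so the printed numbers say nothing about the CVE dataset. Because the synthetic target really is linear in the features, the fitted coefficients should land close to the made-up weights."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "from sklearn.linear_model import LinearRegression\n",
        "from sklearn.metrics import mean_absolute_error, r2_score\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "# Synthetic data: three made-up 0/1 features and a linear target with small noise.\n",
        "rng = np.random.default_rng(0)\n",
        "X = rng.integers(0, 2, size=(200, 3)).astype(float)\n",
        "y = 2.0 + X @ np.array([3.0, 1.5, -1.0]) + rng.normal(0, 0.1, size=200)\n",
        "\n",
        "# 80/20 split with a fixed seed so the split is reproducible.\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
        "\n",
        "# Fit on the training set only, then evaluate on the held-out test set.\n",
        "model = LinearRegression().fit(X_train, y_train)\n",
        "y_pred = model.predict(X_test)\n",
        "print(model.intercept_, model.coef_)\n",
        "print(mean_absolute_error(y_test, y_pred), r2_score(y_test, y_pred))"
      ]
    },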
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "x2NiBMpoonN8"
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.linear_model import LinearRegression\n",
        "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
        "import numpy as np\n",
        "\n",
        "# Split: 80% for training, 20% for testing. random_state makes the split reproducible.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fPrSPdSvonN8"
      },
      "outputs": [],
      "source": [
        "# Create the linear regression model and fit it to the training data.\n",
        "\n",
        "\n",
        "# Predict baseScore for the test set.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Sli9vt3qonN8"
      },
      "source": [
        "1. **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values. A lower MSE indicates a better fit.\n",
        "2. **Mean Absolute Error (MAE)**: Represents the average absolute difference between predicted and actual values, giving an idea of the average error in the predictions.\n",
        "3. **R-squared (R²)**: Indicates how well the independent variables explain the variance in the dependent variable. A value closer to 1 signifies a better fit, while negative values indicate poor performance."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U5XeC8QbonN8"
      },
      "outputs": [],
      "source": [
        "# Evaluate the model with common metrics.\n",
        "# MAE = average absolute error; RMSE = root mean squared error (punishes large errors more).\n",
        "# R² = how much of the variation in baseScore is explained by the model (0 to 1, higher is better).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9Woo9jatonN8"
      },
      "outputs": [],
      "source": [
        "# Visual: predicted vs actual baseScore on the test set.\n",
        "# If the model were perfect, all points would lie on the diagonal line.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "91b99102"
      },
      "source": [
        "## 8. Interpreting the Linear Regression Model (What did it learn?)\n",
        "\n",
        "A linear regression model predicts:\n",
        "\n",
        "\\[\n",
        "\\hat{y} = w_0 + w_1 x_1 + \\dots + w_d x_d\n",
        "\\]\n",
        "\n",
        "Because we **one-hot encoded** the CVSS categories, each feature is mostly a 0/1 indicator like:\n",
        "- `attackVector_NETWORK = 1` if the CVE is remotely exploitable over the network\n",
        "\n",
        "So a learned coefficient is interpretable as:\n",
        "\n",
        "> “How much the prediction changes when this category is present, *holding other categories constant*.”\n",
        "\n",
        "This can be useful in cybersecurity analytics because it connects the model back to a human-understandable scoring rubric.\n"
      ]
    },
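    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the \"holding other categories constant\" reading concrete, consider two CVEs that differ only in attack vector. This is a sketch under assumptions: it supposes the encoded columns include `attackVector_NETWORK` and `attackVector_LOCAL` (the second name is assumed, following the pattern of the first), with coefficients written $w_{\\text{NETWORK}}$ and $w_{\\text{LOCAL}}$. Every other term is identical for the two CVEs and cancels in the difference:\n",
        "\n",
        "$$\\hat{y}_{\\text{NETWORK}} - \\hat{y}_{\\text{LOCAL}} = (w_0 + w_{\\text{NETWORK}} + \\dots) - (w_0 + w_{\\text{LOCAL}} + \\dots) = w_{\\text{NETWORK}} - w_{\\text{LOCAL}}$$\n",
        "\n",
        "With purely hypothetical values $w_{\\text{NETWORK}} = 0.9$ and $w_{\\text{LOCAL}} = -0.3$, switching from LOCAL to NETWORK would raise the predicted score by $0.9 - (-0.3) = 1.2$ points. It is this difference between two categories of the same attribute that carries the interpretation, rather than either coefficient on its own."
      ]
    },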
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "85ba77f5"
      },
      "outputs": [],
      "source": [
        "# Inspect the largest-magnitude coefficients to see what the model found most influential.\n",
        "# Note: With one-hot encoding, coefficients depend on which category is used as the implicit reference.\n",
        "\n",
        "\n",
        "# The intercept is the model's baseline when all one-hot features are 0 (not a realistic CVSS vector,\n",
        "# but still part of the linear model's math).\n",
        "print()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "27b94a96"
      },
      "source": [
        "## 9. Diagnostics: Residuals and Range Checks\n",
        "\n",
        "Two practical checks for a regression model:\n",
        "$$\\text{baseScore} \\approx w_1 \\cdot x_1 + w_2 \\cdot x_2 + \\cdots + b$$\n",
        "\n",
        "1) **Residuals**:\n",
        "   \\(\\text{residual} = y - \\hat{y}\\)  \n",
        "   If residuals are highly structured (not “random noise”), the model is missing patterns.\n",
        "\n",
        "2) **Range constraints**: CVSS base score is **bounded between 0 and 10**.  \n",
        "   Linear regression can predict outside that range. In real systems, you might:\n",
        "   - clip outputs to [0, 10]\n",
        "   - use a different model class (e.g., tree-based regressors)\n",
        "   - model a transformed target (more advanced)\n",
        "\n",
        "We'll visualize residuals and check how often predictions fall outside [0, 10].\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "97031fd2"
      },
      "outputs": [],
      "source": [
        "# Residual analysis\n",
        "residuals = y_test - y_pred\n",
        "\n",
        "plt.hist(residuals, bins=30, edgecolor=\"black\", alpha=0.7)\n",
        "plt.xlabel(\"Residual (actual - predicted)\")\n",
        "plt.ylabel(\"Count\")\n",
        "plt.title(\"Residual Distribution (Test Set)\")\n",
        "plt.tight_layout()\n",
        "plt.show()\n",
        "\n",
        "# Residuals vs predicted values (looking for structure)\n",
        "plt.scatter(y_pred, residuals, alpha=0.5)\n",
        "plt.axhline(0, linestyle=\"--\")\n",
        "plt.xlabel(\"Predicted baseScore\")\n",
        "plt.ylabel(\"Residual (actual - predicted)\")\n",
        "plt.title(\"Residuals vs Predicted (Test Set)\")\n",
        "plt.tight_layout()\n",
        "plt.show()\n",
        "\n",
        "# Range check\n",
        "below_0 = (y_pred < 0).sum()\n",
        "above_10 = (y_pred > 10).sum()\n",
        "print(f\"Predictions < 0: {below_0} / {len(y_pred)}\")\n",
        "print(f\"Predictions > 10: {above_10} / {len(y_pred)}\")\n",
        "\n",
        "# Optional: clipped predictions (simple post-processing baseline)\n",
        "y_pred_clipped = np.clip(y_pred, 0, 10)\n",
        "mae_clip = mean_absolute_error(y_test, y_pred_clipped)\n",
        "rmse_clip = np.sqrt(mean_squared_error(y_test, y_pred_clipped))\n",
        "r2_clip = r2_score(y_test, y_pred_clipped)\n",
        "print()\n",
        "print(\"After clipping predictions to [0,10]:\")\n",
        "print(\"MAE:\", round(mae_clip, 4), \"RMSE:\", round(rmse_clip, 4), \"R²:\", round(r2_clip, 4))\n"
      ]
    },
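    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The clipping step in the cell above can be written as a formula, and a short argument suggests why it should not make the error metrics worse. This assumes every actual score $y$ lies in $[0, 10]$, as CVSS base scores do:\n",
        "\n",
        "$$\\hat{y}^{\\text{clip}} = \\min(10, \\max(0, \\hat{y})) \\qquad\\Rightarrow\\qquad \\lvert y - \\hat{y}^{\\text{clip}}\\rvert \\le \\lvert y - \\hat{y}\\rvert \\quad \\text{for all } y \\in [0, 10]$$\n",
        "\n",
        "If $\\hat{y}$ is already inside the range, nothing changes. If, for example, $\\hat{y} = 10.4$ and $y = 9.8$, the absolute error shrinks from $0.6$ to $0.2$. Because each individual error can only shrink or stay the same, MAE and RMSE after clipping are never larger than before."
      ]
    },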
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2d4eda6c"
      },
      "source": [
        "## 10. Baseline Model (Sanity Check)\n",
        "\n",
        "Before trusting any ML model, compare it to a trivial baseline.\n",
        "\n",
        "A common regression baseline is to always predict the **training-set mean** of the target.\n",
        "If your ML model does not beat this baseline, it is not learning useful signal.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "b30f37d6"
      },
      "outputs": [],
      "source": [
        "# Baseline: always predict the mean of y_train\n",
        "y_mean = float(y_train.mean())\n",
        "y_pred_baseline = np.full(shape=len(y_test), fill_value=y_mean)\n",
        "\n",
        "mae_b = mean_absolute_error(y_test, y_pred_baseline)\n",
        "rmse_b = np.sqrt(mean_squared_error(y_test, y_pred_baseline))\n",
        "r2_b = r2_score(y_test, y_pred_baseline)\n",
        "\n",
        "print(\"Baseline (predict mean of y_train):\")\n",
        "print(\"MAE:\", round(mae_b, 4))\n",
        "print(\"RMSE:\", round(rmse_b, 4))\n",
        "print(\"R²:\", round(r2_b, 4))\n"
      ]
    },
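    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a sketch of what to expect from the baseline cell above, write $\\bar{y}_{\\text{train}}$ for the constant prediction and $\\bar{y}_{\\text{test}}$ for the mean of the test targets (notation introduced here for illustration), with sums running over the test rows:\n",
        "\n",
        "$$R^2_{\\text{baseline}} = 1 - \\frac{\\sum_i (y_i - \\bar{y}_{\\text{train}})^2}{\\sum_i (y_i - \\bar{y}_{\\text{test}})^2} \\le 0$$\n",
        "\n",
        "The inequality holds because a sum of squared deviations is smallest when taken around the mean of the same data. The baseline $R^2$ is therefore $0$ only if the two means coincide and slightly negative otherwise, so any clearly positive $R^2$ from the regression model is an improvement over predicting the mean."
      ]
    },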
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mXtAijGWonN9"
      },
      "source": [
        "\n",
        "### Let's operationalize it!! - Steps to Run Streamlit in Google Colab\n",
        "1. **Install Streamlit and ngrok**  \n",
        "   In Colab, install the required packages:\n",
        "   ```python\n",
        "   !pip install streamlit\n",
        "   !pip install pyngrok\n",
        "   ```\n",
        "\n",
        "2. **Write the Streamlit Code**  \n",
        "   Save the Streamlit code to a file within the Colab environment. Add the following to your notebook:\n",
        "   ```python\n",
        "   with open(\"app.py\", \"w\") as f:\n",
        "       f.write('''\n",
        "       # Paste your Streamlit app code here (earlier provided code)\n",
        "       ''')\n",
        "   ```\n",
        "\n",
        "3. **Run the Streamlit App**  \n",
        "   Use `pyngrok` to create a public URL for your Streamlit app. Add this to your notebook:\n",
        "   ```python\n",
        "   from pyngrok import ngrok\n",
        "   import subprocess\n",
        "\n",
        "   # Start the Streamlit app\n",
        "   command = [\"streamlit\", \"run\", \"app.py\"]\n",
        "   process = subprocess.Popen(command)\n",
        "\n",
        "   # Create a public URL\n",
        "   public_url = ngrok.connect(8501)\n",
        "   print(f\"Streamlit app is live at {public_url}\")\n",
        "   ```\n",
        "\n",
        "4. **Access the App**  \n",
        "   After running the notebook cell with the above code, Colab will display a public URL. Click on it to access your Streamlit app.\n",
        "\n",
        "5. **Stop the App**  \n",
        "   To stop the Streamlit app, run:\n",
        "   ```python\n",
        "   process.terminate()\n",
        "   ```\n",
        "\n",
        "### Caveats\n",
        "- Each time you restart the Colab environment, you'll need to reinstall the required packages and repeat the steps.\n",
        "- The app runs temporarily and will stop once the Colab session ends.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_SWNtymxonN9"
      },
      "source": [
        "\n",
        "1. **Sign Up for ngrok**:\n",
        "   - Go to the [ngrok signup page](https://dashboard.ngrok.com/signup) and create a free account if you don't already have one.\n",
        "\n",
        "2. **Get Your ngrok Auth Token**:\n",
        "   - After logging in, go to the [ngrok dashboard](https://dashboard.ngrok.com/get-started/your-authtoken).\n",
        "   - Copy your personal authentication token from this page.\n",
        "\n",
        "3. **Set the Auth Token in Colab**:\n",
        "   - In your Colab notebook, before starting the Streamlit app, authenticate ngrok by running:\n",
        "     ```python\n",
        "     !ngrok config add-authtoken YOUR_AUTH_TOKEN\n",
        "     ```\n",
        "     Replace `YOUR_AUTH_TOKEN` with the token you copied from the ngrok dashboard.\n",
        "\n",
        "4. **Restart the App**:\n",
        "   - After authenticating, rerun the Streamlit app code. The public URL should now be generated successfully.\n",
        "\n",
        "---\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BCD6BlwFonN-"
      },
      "outputs": [],
      "source": [
        "!pip install streamlit\n",
        "!pip install pyngrok"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9zv_NJNGonN-"
      },
      "outputs": [],
      "source": [
        "import getpass\n",
        "ngrok_token = getpass.getpass(\"Enter your ngrok auth token: \")\n",
        "!ngrok config add-authtoken $ngrok_token"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SyR5SKJKonN-"
      },
      "outputs": [],
      "source": [
        "# Run the Streamlit app (same layout as the housing example).\n",
        "# Ensure cve_cvss31.csv and app.py are in the current directory.\n",
        "import subprocess\n",
        "import sys\n",
        "\n",
        "# Optional: install streamlit if not present\n",
        "# !pip install streamlit\n",
        "# # Optional: for a public URL (e.g. in Colab), install pyngrok and run:\n",
        "# # Start Streamlit in the background (app.py loads data, trains model, and provides the UI)\n",
        "# process = subprocess.Popen([sys.executable, \"-m\", \"streamlit\", \"run\", \"app.py\", \"--server.headless\", \"true\"])\n",
        "# print(\"Streamlit app started. Open the URL shown above (usually http://localhost:8501)\")\n",
        "# # print(\"To stop: process.terminate()\")\n",
        "\n",
        "from pyngrok import ngrok\n",
        "command = [\"streamlit\", \"run\", \"app.py\"]\n",
        "process = subprocess.Popen(command)\n",
        "\n",
        "# Create a public URL using ngrok\n",
        "public_url = ngrok.connect(8501)\n",
        "print(f\"Streamlit app is live at {public_url}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xtCWq4ZLonN-"
      },
      "source": [
        "### Run the app locally\n",
        "Run the cell above to start the Streamlit app. Open the URL (e.g. http://localhost:8501) in your browser. Use the sidebar to pick CVSS components and see the predicted base score—same model as in the lesson."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4wOUarhTonN-"
      },
      "source": [
        "---\n",
        "## Summary\n",
        "\n",
        "1. **Data collection:** We used the NVD API to get CVE records with CVSS v3 scores and saved them to a CSV.\n",
        "2. **Data processing:** We loaded the CSV, checked shape and missing values, and prepared the columns.\n",
        "3. **Data analysis:** We used summary statistics and visualizations (histogram, bar charts, box plot) to understand the distribution of baseScore and categorical features.\n",
        "4. **Why we encode:** Linear regression needs numeric inputs; we converted categorical attributes to numbers using one-hot encoding.\n",
        "5. **Linear regression:** We built a model to predict baseScore from the encoded CVSS components, split the data into train/test, and evaluated the model with MAE, RMSE, and R².\n",
        "\n",
        "You can try changing the train/test split (e.g. `test_size=0.3`) or adding more visualizations to explore the data further."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "py",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
 }
--- a/references/cis490_ipynbfiles/CIS490_Logistic_Regression.ipynb
+++ b/references/cis490_ipynbfiles/CIS490_Logistic_Regression.ipynb
--- a/references/cis490_ipynbfiles/CIS490_Neural_Networks.ipynb
+++ b/references/cis490_ipynbfiles/CIS490_Neural_Networks.ipynb
--- a/references/cis490_ipynbfiles/CIS490_Support_Vector_Machines_(SVM).ipynb
+++ b/references/cis490_ipynbfiles/CIS490_Support_Vector_Machines_(SVM).ipynb