{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes in Python\n",
    "\n",
    "Let's look at the ham/spam data in python.\n",
    "\n",
    "First of, here are our python imports.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model class is MultinomialNB.  This is the naive bayes analogue to  \n",
    "**from sklearn.linear_model import LinearRegression**  \n",
    "which we used for linear regression.  \n",
    "\n",
    "The model will do the naive bayes work for us.  \n",
    "\n",
    "We also import tools for measuring our success.  \n",
    "We will use the confusion matrix and the accuracy_score.  \n",
    "For a numeric y we can use rmse or mad (mean absolute deviation).  \n",
    "For categorical outcomes, there are actually a lot of different ways people measure\n",
    "success.  sklearn has a lot of *metrics* for measuring in-sample fit or out of sample\n",
    "accurracy.\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read in and quick look at data \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainB = pd.read_csv(\"https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv\")\n",
    "testB = pd.read_csv(\"https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestB.csv\")\n",
    "trainyB = pd.read_csv(\"https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv\")['smsTrainyB']\n",
    "testyB = pd.read_csv(\"https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestyB.csv\")['smsTestyB']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4169, 1139)\n",
      "(1390, 1139)\n",
      "(4169,)\n",
      "(1390,)\n"
     ]
    }
   ],
   "source": [
    "print(trainB.shape)\n",
    "print(testB.shape)\n",
    "print(trainyB.shape)\n",
    "print(testyB.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 4,169 train observations and 1,390 test observations.  \n",
    "trainB is our x matrix and trainyB is our outcomes ham/spam.  \n",
    "Our corresponding test data is testB and testyB.  \n",
    "\n",
    "Let's have a quick look at the data.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>£wk</th>\n",
       "      <th>€˜m</th>\n",
       "      <th>€˜s</th>\n",
       "      <th>abiola</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   £wk  €˜m  €˜s  abiola\n",
       "0    0    0    0       0\n",
       "1    0    0    0       0\n",
       "2    0    0    0       0\n",
       "3    0    0    0       0\n",
       "4    0    0    0       0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainB.iloc[0:5,0:4]  #first few rows and coluns of train x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    0\n",
       "2    0\n",
       "3    1\n",
       "4    1\n",
       "5    0\n",
       "Name: smsTrainyB, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainyB.iloc[0:6]  # first few train y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.864716\n",
       "1    0.135284\n",
       "Name: smsTrainyB, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainyB.value_counts()/trainyB.shape[0] #train ham/spam frequencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So all of the data has been represented as 0/1.  \n",
    "For y, 1 means spam and 0 means ham.  \n",
    "For x, the (i,j) element is 1 if the jth term is in the ith document and 0 otherwise.  \n",
    "\n",
    "Let's look at the term age and y.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>age</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>smsTrainyB</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3600</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>552</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "age            0   1\n",
       "smsTrainyB          \n",
       "0           3600   5\n",
       "1            552  12"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(trainyB,trainB['age'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0013869625520110957\n",
      "0.02127659574468085\n"
     ]
    }
   ],
   "source": [
    "print(5/(5+3600))\n",
    "print(12/(12+552))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, $P(X_i = 1 | y=1)$ for $x_i =age$, is estimated to be .0213.\n",
    "\n",
    "Note that these numbers match up with what we got using \n",
    "```\n",
    "library(e1071)\n",
    "smsNB = naiveBayes(smsTrain, smsTrainy)\n",
    "```\n",
    "\n",
    "in R. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit Naive Bayes on Train, Predict on Test\n",
    "\n",
    "Ok let' do naive bayes using sklearn.  \n",
    "As usual we: \n",
    "\n",
    "* make a model\n",
    "* fit on train\n",
    "* predict on test\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MultinomialNB()  #create model object\n",
    "model.fit(trainB,trainyB) # fit on train\n",
    "yhat = model.predict(testB) # predict on test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1390,)\n",
      "<class 'numpy.ndarray'>\n",
      "int64\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 1])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(yhat.shape)\n",
    "print(type(yhat))\n",
    "print(yhat.dtype)\n",
    "yhat[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How did we do !!!!?????  \n",
    "\n",
    "What is our out of sample predictive performance on the test data?  \n",
    "\n",
    "The most basic the is the confusion matrix which is simple the cross tab of predicted and actual.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1190,   17],\n",
       "       [  25,  158]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "confusion_matrix(testyB,yhat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we got 25+17 wrong.  \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9697841726618706"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(testyB,yhat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9697841726618706"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(158+1190)/1390"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do the simple cross-tab to check.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>col_0</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>smsTestyB</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1190</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>25</td>\n",
       "      <td>158</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "col_0         0    1\n",
       "smsTestyB           \n",
       "0          1190   17\n",
       "1            25  158"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(testyB,yhat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a quick look at the model object.  \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "?model\n",
    "\n",
    "Parameters\n",
    "----------\n",
    "alpha : float, optional (default=1.0)\n",
    "    Additive (Laplace/Lidstone) smoothing parameter\n",
    "    (0 for no smoothing).\n",
    "\n",
    "fit_prior : boolean, optional (default=True)\n",
    "    Whether to learn class prior probabilities or not.\n",
    "    If false, a uniform prior will be used.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare to what we had in R.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "result in R was 0.97410072\n"
     ]
    }
   ],
   "source": [
    "inr = (1 - 0.02589928)\n",
    "print(f'result in R was {inr}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which is virtually the same."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}