Very basic ML Experiment:

In [196]:
!python -V
Out[196]:
Python 3.12.1

Install packages.

With uv + VS Code, there are two options (I went with the first):

  1. Add them to the project. CLI: uv add pandas, or from Jupyter: !uv add pandas
  2. Install them, bypassing pyproject.toml. Jupyter: !uv pip install pandas

More in README.MD

In [197]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error # mean_squared_error(squared=False) is the older equivalent
In [ ]:
df = pd.read_parquet('../data/green_tripdata_2021-01.parquet')
In [199]:
df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
In [200]:
# get travel duration as a timedelta (dropoff - pickup)
df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
# convert each timedelta element (td) to minutes
df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
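A note on the .apply above: pandas also offers the vectorized .dt accessor, which does the same conversion without a per-row Python lambda. A minimal sketch, assuming duration is a timedelta64 column:

# vectorized equivalent of the lambda-based conversion
df['duration'] = (df.lpep_dropoff_datetime - df.lpep_pickup_datetime).dt.total_seconds() / 60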
In [201]:
# https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
# df = df[df.trip_type == 2] # not required
In [202]:
# Slight deviation from the lecture: we should do the filtering before plotting in this case
df = df[((df.duration >= 1) & (df.duration <= 60))]
In [203]:
# distplot is deprecated; the general alternative is displot.

# Kernel density estimation (KDE) is enabled (smoothing of the graph).
# Density: how likely values are to occur within a certain range (regions of data).
# Density is normalized, so you can compare distributions, unlike frequencies or counts.

# The .set part is unnecessary; sometimes it helps with readability
sns.displot(df.duration, kde=True, stat="density") #.set(xlim=(0, 100),ylim=(0, 0.06))
Out[203]:
<seaborn.axisgrid.FacetGrid at 0x7e07d9cf8d70>
[figure: distribution of trip duration (density, with KDE)]
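A quick illustration of the normalization claim (synthetic data, just a sketch): two samples of very different sizes still line up on a density scale, which raw counts would not.

import numpy as np

rng = np.random.default_rng(42)
small = rng.normal(15, 5, 1_000)
large = rng.normal(15, 5, 50_000)

# counts would differ by ~50x; the densities overlap almost exactly
sns.histplot(small, stat="density", color="C0", alpha=0.5, label="n=1,000")
sns.histplot(large, stat="density", color="C1", alpha=0.5, label="n=50,000")
plt.legend()
plt.show()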
In [204]:
# Checking percentile to see below which values most of our rides are
df.duration.describe(percentiles=[0.95, 0.98, 0.99])
Out[204]:
count    73908.000000
mean        16.852578
std         11.563163
min          1.000000
50%         14.000000
95%         41.000000
98%         48.781000
99%         53.000000
max         60.000000
Name: duration, dtype: float64
In [ ]:
categorical = ['PULocationID', 'DOLocationID']
numerical = ['trip_distance']
In [206]:
df[categorical] = df[categorical].astype(str)
In [ ]:
train_dict = df[categorical + numerical].to_dict(orient='records')

Notes:

One-hot encoding

A technique to convert categorical variables (like strings or IDs) into a numeric format that can be fed to machine learning algorithms.

How it works:

  • For each unique value in a categorical column, a new column is created.
  • In each row, the column corresponding to the value is set to 1, and all others are set to 0.

Example

Before:

  Color
  Red
  Blue
  Green

After one-hot:

  Color=Red  Color=Green  Color=Blue
  1          0            0
  0          0            1
  0          1            0

So, categorical data is represented numerically, ergo usable for most machine learning models.
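A minimal sketch of the same idea with scikit-learn's DictVectorizer, using the toy Color column above (note that it orders columns alphabetically):

from sklearn.feature_extraction import DictVectorizer

toy = [{'Color': 'Red'}, {'Color': 'Blue'}, {'Color': 'Green'}]
dv_toy = DictVectorizer(sparse=False)
print(dv_toy.fit_transform(toy))       # one row per record, one column per unique value
print(dv_toy.get_feature_names_out())  # ['Color=Blue' 'Color=Green' 'Color=Red']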

Matrix

In this case each value of DOLocationID and PULocationID becomes a "column" holding 0 or 1, and each row is a ride. trip_distance remains unchanged.

After DictVectorizer, the matrix will look like this:

  PULocationID=10  PULocationID=15  DOLocationID=20  DOLocationID=30  trip_distance
  1                0                1                0                2.5
  0                1                0                1                1.2
  1                0                0                1                3.8

The final matrix has as many columns as there are unique categorical values (from both columns), plus one column per numerical feature.

In [208]:
dv = DictVectorizer()
X_train = dv.fit_transform(train_dict)
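A quick way to sanity-check the encoding (an inspection sketch, not from the lecture): DictVectorizer exposes the learned column names, and the matrix shape should match the note above.

dv.get_feature_names_out()[:5]  # e.g. 'DOLocationID=10', ..., plus 'trip_distance' at the end
X_train.shape                   # (rides, unique categorical values + 1 numerical column)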
In [209]:
# convert the duration column into a numpy array
# and use it as the learning target
target = 'duration'
y_train = df[target].values
In [210]:
lr = LinearRegression()
lr.fit(X_train, y_train) # learn the weights from the training data
y_pred = lr.predict(X_train) # we can predict on "any" X we give; y is not passed, the model learned the mapping in .fit
In [211]:
# distplot is deprecated; for overlapping plots I found there are
# two alternatives: histplot and kdeplot. kdeplot looks more informative
 
sns.histplot(y_pred, kde=True, stat="density", label="prediction", color="C0", alpha=0.5)
sns.histplot(y_train, kde=True, stat="density", label="actual", color="C1", alpha=0.5)
plt.legend() # render legend labels
plt.show()
Out[211]:
[figure: predicted vs actual duration distributions, overlaid]
In [212]:
# Using root_mean_squared_error is the same as the older mean_squared_error(squared=False)

# measure how good the prediction is by comparing actuals vs predictions: sqrt(mean((y_true - y_pred) ** 2))
root_mean_squared_error(y_train, y_pred)
# Though the model is bad: prediction is off by ~9.8 minutes on average
Out[212]:
9.838799799829626
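As a sanity check, the same number can be computed by hand from the formula in the comment (a quick sketch, assuming numpy is imported as np):

import numpy as np

np.sqrt(np.mean((y_train - y_pred) ** 2))  # matches root_mean_squared_error(y_train, y_pred)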
In [213]:
# refactor into a reusable read-and-prepare function
def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
 
        df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
        df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename) # parquet preserves datetime dtypes, no conversion needed
 
    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
 
    df = df[(df.duration >= 1) & (df.duration <= 60)]
 
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
In [ ]:
df_train = read_dataframe('../data/green_tripdata_2021-01.parquet')
df_val = read_dataframe('../data/green_tripdata_2021-02.parquet')
In [215]:
# This way the model treats the combined PU/DO pair as a unique identifier
# presumably this helps the model learn patterns specific to each PU/DO combination
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']
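A quick way to see why this changes the feature space (an inspection sketch; the counts are data-dependent): the combined pair has far higher cardinality than either ID alone, so DictVectorizer will create one column per observed pair.

# unique pickup IDs, unique dropoff IDs, unique pickup-dropoff pairs
df_train['PULocationID'].nunique(), df_train['DOLocationID'].nunique(), df_train['PU_DO'].nunique()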

training - January
validation - February

In [216]:
categorical = ['PU_DO'] # combined 'PULocationID', 'DOLocationID'
numerical = ['trip_distance']
 
dv = DictVectorizer()
 
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)
 
val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
In [217]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values
In [218]:
lr = LinearRegression()
lr.fit(X_train, y_train)
 
y_pred = lr.predict(X_val)
 
root_mean_squared_error(y_val, y_pred)
Out[218]:
7.758715209092169
In [219]:
# export the fitted DictVectorizer and model together
with open('../models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
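To use the model later, load the pair back the same way (a usage sketch; new_dicts is a hypothetical list of ride dicts in the same format as train_dicts):

with open('../models/lin_reg.bin', 'rb') as f_in:
    dv_loaded, lr_loaded = pickle.load(f_in)

# transform new rides with the same vectorizer, then predict
# y_new = lr_loaded.predict(dv_loaded.transform(new_dicts))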
In [220]:
lr = Lasso(alpha=0.0001) # alpha controls the strength of L1 regularization; a tiny alpha is close to plain linear regression
lr.fit(X_train, y_train)
 
y_pred = lr.predict(X_val)
 
root_mean_squared_error(y_val, y_pred)
Out[220]:
7.616617770546549
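Ridge was imported at the top but never used; for completeness, the analogous L2-regularized variant (the alpha here is an arbitrary assumption, not tuned):

lr = Ridge(alpha=0.0001) # L2 regularization instead of Lasso's L1
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

root_mean_squared_error(y_val, y_pred)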