# get travel duration time, deltadf["duration"] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime# for each element in duration (td) apply {math}df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
# Slght deviation from the lecture, we should do filtering before plotting in this casedf = df[((df.duration >= 1) & (df.duration <= 60))]
In [203]:
# dictplot is deprecated, general alternative is displot.# Kernel density estimation (KDE) is enabled (smoothing of the graph)# Density: how likely values to occur within a certain range (regions of data).# Density is normalized, so you can compare distribution, unlike frequencies or counts.# .set part is unnecessary, sometimes helps with readabilitysns.displot(df.duration, kde=True, stat="density") # .set(xlim=(0, 100),ylim=(0, 0.06))
Out[203]:
<seaborn.axisgrid.FacetGrid at 0x7e07d9cf8d70>
<Figure size 500x500 with 1 Axes>
In [204]:
# Checking percentile to see below which values most of our rides aredf.duration.describe(percentiles=[0.95, 0.98, 0.99])
Out[204]:
count 73908.000000
mean 16.852578
std 11.563163
min 1.000000
50% 14.000000
95% 41.000000
98% 48.781000
99% 53.000000
max 60.000000
Name: duration, dtype: float64
Is a technique to convert categorical variables (like strings or IDs) into a format that can be provided to machine learning algorithms.
How it works:
For each unique value in a categorical column, a new column is created.
In each row, the column corresponding to the value is set to 1, and all others are set to 0.
Example
Before:
Color
Red
Blue
Green
After one-hot:
Color=Red
Color=Green
Color=Blue
1
0
0
0
0
1
0
1
0
So, categorical data is represented numerically, ergo usable for most machine learning models.
Matrix
In this case each value of DOLocationID, PULocationID becomes a “column” which shows 0 or 1, each row is a ride. trip_distance remains unchanged,
After DictVectorizer, matrix will look like this:
PULocationID=10
PULocationID=15
DOLocationID=20
DOLocationID=30
trip_distance
1
0
1
0
2.5
0
1
0
1
1.2
1
0
0
1
3.8
Final matrix has as many columns as there are unique categorical values (from both columns) plus one column for each numerical feature.
In [208]:
dv = DictVectorizer()X_train = dv.fit_transform(train_dict)
In [209]:
# converting col into numpy array.# making it a target for learning.target = "duration"y_train = df[target].values
In [210]:
lr = LinearRegression()lr.fit(X_train, y_train) # makes it learny_pred = lr.predict( X_train) # we can predict on "any" data we give, y is not asked, because it asumes it from .fit
In [211]:
# dictplot is deprecated, for overlapping plots I found there are# two alternatives: histplot and kdeplot. Kdeplot looks more informativesns.histplot( y_pred, kde=True, stat="density", label="prediction", color="C0", alpha=0.5)sns.histplot(y_train, kde=True, stat="density", label="actual", color="C1", alpha=0.5)plt.legend() # render legend labelsplt.show()
Out[211]:
<Figure size 640x480 with 1 Axes>
In [212]:
# About using root_mean..., it's same as using mean_squared_error(squared=false)# calc the wellness of prediction via comparison of train vs pred via formula (y_true - y_pred) ** 2root_mean_squared_error(y_train, y_pred)# Thought model is bad, prediction is off by 9 minutes on average
# This way model treat combined PU/DO as unique identifier# I guess it helps the model to learn on specific patterns of PU/DO combinationdf_train["PU_DO"] = df_train["PULocationID"] + "_" + df_train["DOLocationID"]df_val["PU_DO"] = df_val["PULocationID"] + "_" + df_val["DOLocationID"]
lr = LinearRegression()lr.fit(X_train, y_train)y_pred = lr.predict(X_val)root_mean_squared_error(y_val, y_pred)
Out[218]:
7.758715209092169
In [219]:
# exporting the modelwith open("../models/lin_reg.bin", "wb") as f_out: pickle.dump((dv, lr), f_out)
In [220]:
lr = Lasso(alpha=0.0001) # Alpha uses the math concept of regularizationlr.fit(X_train, y_train)y_pred = lr.predict(X_val)root_mean_squared_error(y_val, y_pred)
Out[220]:
7.616617770546549
A Brief Into MLFlow
Use MLFlow for experiment: tracking, logging and statistical insights and it is supported by within Databricks. You can install it in a python project via uv add mlflow
To start MLFlow tracking server: mlflow ui --backend-store-uri sqlite:///mlflow.db — MLFlow supports remote artifact stores, such as AWS S3 Artifact Stores | MLflow
Within your experiment, MLFlow provides a way to log parameters, models, and artifacts with mlflow.log* methods (like mlflow.log_params(params_dict)) and to automatically log variety of the data utilize MLFlow autolog. But keep in mind that autolog won’t register a model for you.
Kubernetes comes in handy in implementation of MLOps.
Firstly we would need an image, let’s create a dockerfile. This way we can have more granular control of what goes into it, instead of relying on developer images.
It is recommended to avoid learning, transforming and generally working with large data in Apache Airflow and utilize Apache Spark or alternatives instead.
If we work with large datasets they may overload XCom and crash the DAG. For me 1GB in-memory was large enough to consider alternatives.
Deploy via Helm
Airflow allows us to extend image to add more python packages or additional configurations. Let’s do both.
Suppose we need mlflow and couple of other libraries we have in our project:
FROM apache/airflow:3.0.2# Another way is to define image as argument variable# ARG AIRFLOW_IMAGE_NAME # FROM ${AIRFLOW_IMAGE_NAME}# and then pass the arg docker build ... --build-arg AIRFLOW_IMAGE_NAME=$AIRFLOW_IMAGE_NAME COPY pyproject.toml ./RUN uv pip install --no-cache --group airflow-server
Production Guide — helm-chart Documentation
One simple way to extend it is to use Helm | Values Filesvalues.yaml to store configuration information, lets point Airflow to look for DAGs in remote GitHub (you would need to specify credentials for private).
There is a concern: taking into account isolated nature of tasks, we are relying on passing data with XCom. For higher number of experiment or data, consider using Apache Spark or Databricks operators.