{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(mlflow-primer)=\n", "# MLflow Primer\n", "\n", "After `pip install mlflow`, you can track the runs of your code by inserting a few extra lines. The following is not a full tutorial, just a quick tour of the basics to convince you that it's easy to work with.\n", "\n", "MLflow is organized into 'experiments', which are essentially just collections of runs. One run is one execution of your code. MLflow tracks a bunch of metadata automatically, and in addition you can store basically whatever you want in a run. MLflow uses a number of concepts to separate information logically and displays each of them in a different way: 'parameters' (inputs), 'metrics' (outputs), 'tags' (labels) and 'artifacts' (files).\n", "\n", "Once your runs are stored, you can view them through either the UI or the API. We won't use the UI in this guide, because we need to access the stored runs programmatically through the API, but the UI is very useful and trivial to run (check out the MLflow docs).\n", "\n", "The skeleton of the MLflow code to be inserted basically looks like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "import mlflow\n", "\n", "with mlflow.start_run():\n", "    mlflow.log_param('param_1', 3.14)\n", "    mlflow.log_metric('answer', 42)\n", "    mlflow.log_artifact('figure.png')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, you will see a more elaborate and realistic example. 
(note that not all dependent functions are shown)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "application/javascript": [ "IPython.notebook.set_autosave_interval(0)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Autosave disabled\n" ] } ], "source": [ "# prevent Jupyter and your IDE from trying to make simultaneous changes\n", "%autosave 0" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn import datasets\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVC\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "from matplotlib.colors import ListedColormap\n", "import matplotlib.pyplot as plt\n", "\n", "import mlflow\n", "import mlflow.sklearn\n", "\n", "def get_data():\n", "    iris = datasets.load_iris()\n", "\n", "    # use only petal length and petal width as features\n", "    X = iris.data[:, [2, 3]]\n", "    y = iris.target\n", "\n", "    return train_test_split(X, y, test_size=0.35, random_state=0)\n", "\n", "\n", "def feature_engineering(X_train, X_test):\n", "    # fit the scaler on the training data only, then apply it to both sets\n", "    sc = StandardScaler()\n", "    sc.fit(X_train)\n", "    X_train_std = sc.transform(X_train)\n", "    X_test_std = sc.transform(X_test)\n", "    return X_train_std, X_test_std\n", "\n", "\n", "def recombine_data(X_train, X_test, y_train, y_test):\n", "    X_combined_std = np.vstack((X_train, X_test))\n", "    y_combined = np.hstack((y_train, y_test))\n", "    return X_combined_std, y_combined\n", "\n", "\n", "def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):\n", "\n", "    # setup marker generator and color map\n", "    markers = ('s', 'x', 'o', '^', 'v')\n", "    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')\n", "    cmap = ListedColormap(colors[:len(np.unique(y))])\n", "\n", "    # plot the decision surface\n", "    x1_min, 
x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", "    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", "    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),\n", "                           np.arange(x2_min, x2_max, resolution))\n", "    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)\n", "    Z = Z.reshape(xx1.shape)\n", "    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)\n", "    plt.xlim(xx1.min(), xx1.max())\n", "    plt.ylim(xx2.min(), xx2.max())\n", "\n", "    for idx, cl in enumerate(np.unique(y)):\n", "        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],\n", "                    alpha=0.8, c=cmap(idx),\n", "                    marker=markers[idx], label=cl)\n", "\n", "    # highlight test samples with hollow circles\n", "    if test_idx is not None:\n", "        X_test, y_test = X[test_idx, :], y[test_idx]\n", "        plt.scatter(X_test[:, 0], X_test[:, 1],\n", "                    facecolors='none', edgecolors='black',\n", "                    alpha=1.0, linewidth=1, marker='o',\n", "                    s=55, label=\"test set\")\n", "\n", "\n", "def train_knn(data, target, **params):\n", "    knn = KNeighborsClassifier(**params)\n", "    knn.fit(data, target)\n", "    return knn\n", "\n", "\n", "def train_svc(data, target, **params):\n", "    svm = SVC(**params)\n", "    svm.fit(data, target)\n", "    return svm" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. 
Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "filenames": { "image/png": "/home/jeroenf/Projects/bookflow/iris_book/_build/jupyter_execute/primer_mlflow_5_3.png" }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import mlflow\n", "\n", "# set up some parameters for my code\n", "svc_pars = dict(kernel='rbf', random_state=0, gamma=.10, C=1.0)\n", "knn_pars = dict(n_neighbors=5, p=2, metric='minkowski')\n", "algo = 'knn'\n", "\n", "# some free text that you can save with a run\n", "notes = \"I think a knn will work better\"\n", "# you can define your own tags as well. 
In this case, \n", "# I'm reminding myself that this is not a serious run (but a test, for example)\n", "tags = {\"valid\": False}\n", "# set location to save the run data\n", "mlflow.set_tracking_uri('../iris_project/mlruns')\n", "# name of my experiment (= grouping of runs)\n", "mlflow.set_experiment('iris')\n", "\n", "run_name = f'iris_{algo}'\n", "\n", "# let MLflow know this is a run to track\n", "with mlflow.start_run(run_name=run_name) as run:\n", "\n", "    # -- here is just some code, it's not important for now --\n", "    X_train, X_test, y_train, y_test = get_data()\n", "    X_train, X_test = feature_engineering(X_train, X_test)\n", "\n", "    if algo == 'svc':\n", "        params = svc_pars\n", "        model = train_svc(X_train, y_train, **params)\n", "    elif algo == 'knn':\n", "        params = knn_pars\n", "        model = train_knn(X_train, y_train, **params)\n", "\n", "    acc_train = model.score(X_train, y_train)\n", "    acc_test = model.score(X_test, y_test)\n", "\n", "    X_stack, y_stack = recombine_data(X_train, X_test, y_train, y_test)\n", "    # -- computations finished --\n", "\n", "    # we can log parameters to this run (inputs):\n", "    mlflow.log_params(params)\n", "    mlflow.log_param('algo', algo)\n", "    # and we can log metrics to this run (outputs)\n", "    mlflow.log_metric('acc_train', acc_train)\n", "    mlflow.log_metric('acc_test', acc_test)\n", "\n", "    # and also model artifacts. 
\n", "    # even if you don't do ML, if you use sklearn, tensorflow or other common frameworks,\n", "    # you may still be able to save some useful objects with various log_model methods,\n", "    # or with the log_artifact method.\n", "    mlflow.sklearn.log_model(model, 'model')\n", "\n", "    # we can also log plots (and basically any other file)...\n", "    # the test samples are the last rows of the stacked data\n", "    plot_decision_regions(X=X_stack, y=y_stack, classifier=model,\n", "                          test_idx=range(len(y_train), len(y_stack)))\n", "    plt.xlabel('petal length [standardized]')\n", "    plt.ylabel('petal width [standardized]')\n", "    plt.legend(loc='upper left')\n", "    plot_filename = 'decision_region.png'\n", "    plt.savefig(plot_filename)\n", "    # with this method\n", "    mlflow.log_artifact(plot_filename, 'figures')\n", "\n", "    # and also apply some tags to this run\n", "    # the content tag is a special one\n", "    mlflow.set_tag('mlflow.note.content', notes)\n", "    for key, value in tags.items():\n", "        mlflow.set_tag(key, value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You don't see it here, but this run is now saved by MLflow. You can query all the runs through the Python API (which we will do in the next section), but there is also a UI where you can view them conveniently." ] } ], "metadata": { "jupytext": { "formats": "ipynb,md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.12, "jupytext_version": "1.6.0" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "source_map": [ 13, 26, 37, 41, 48, 130, 200 ] }, "nbformat": 4, "nbformat_minor": 4 }