How to send a list as parameter in databricks notebook task?

I am using the Databricks REST API to create a job with a notebook_task on an existing cluster and getting the job_id in return. Then I am calling the run-now API to trigger the job. In this step, I want to send a list as an argument via notebook_params, which throws an error saying "Expected non-array for field value".

Is there any way I can send a list as an argument to the job?

I have tried sending the list argument in base_params as well, with the same error.


2 Answers

I haven't found a native solution yet, but my workaround was to pass the list as a string and parse it back out on the other side:
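A rough sketch of what that can look like on the calling side, assuming the notebook expects a parameter named my_list (the job ID, workspace URL, token, and values are all illustrative):

    import json
    import requests

    date_list = ["2024-01-01", "2024-01-02", "2024-01-03"]

    # notebook_params values must be plain strings, so serialize the list first
    payload = {
        "job_id": 123,
        "notebook_params": {"my_list": json.dumps(date_list)},
    }

    requests.post(
        "https://<databricks-instance>/api/2.1/jobs/run-now",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json=payload,
    )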

Then in Databricks:
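One way to parse it back, assuming the JSON-encoded string from the sketch above (the original answer may have used a simple delimiter split instead):

    import json

    # dbutils.widgets.get is available inside Databricks notebooks and reads the
    # value passed through notebook_params
    raw_value = dbutils.widgets.get("my_list")   # '["2024-01-01", "2024-01-02", "2024-01-03"]'
    date_list = json.loads(raw_value)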

This requires appropriate care around special characters and, for example, conversion back to numeric types.

If the objects in the list are more substantial, sending them as a file to DBFS using the CLI or API before running the job may be another option to explore.


Hi, I may be a bit late, but I found a better solution.

Step 1: Use JSON.stringify() in the console of any browser to convert your value (object, array, JSON, etc.) into a string.

Now use this string value in the body of the run-now request (for example, as a notebook_params value).

In the Databricks notebook, convert the string back to JSON using the Python json module.

Hope this helps



Develop code in Databricks notebooks


This page describes how to develop code in Databricks notebooks, including autocomplete, automatic formatting for Python and SQL, combining Python and SQL in a notebook, and tracking the notebook version history.

For more details about advanced functionality available with the editor, such as autocomplete, variable selection, multi-cursor support, and side-by-side diffs, see Use the Databricks notebook and file editor .

When you use the notebook or the file editor, Databricks Assistant is available to help you generate, explain, and debug code. See Use Databricks Assistant for more information.

Databricks notebooks also include a built-in interactive debugger for Python notebooks. See Use the Databricks interactive debugger .

Get coding help from Databricks Assistant

Databricks Assistant is a context-aware AI assistant that you can interact with using a conversational interface, making you more productive inside Databricks. You can describe your task in English and let the assistant generate Python code or SQL queries, explain complex code, and automatically fix errors. The assistant uses Unity Catalog metadata to understand your tables, columns, descriptions, and popular data assets across your company to provide personalized responses.

Databricks Assistant can help you with the following tasks:

  • Generate code.
  • Debug code, including identifying and suggesting fixes for errors.
  • Transform and optimize code.
  • Explain code.
  • Help you find relevant information in the Azure Databricks documentation.

For information about using Databricks Assistant to help you code more efficiently, see Use Databricks Assistant . For general information about Databricks Assistant, see DatabricksIQ-powered features .

Access notebook for editing

To open a notebook, use the workspace Search function or use the workspace browser to navigate to the notebook and click on the notebook’s name or icon.

Browse data

To explore tables and volumes available for the notebook, click the notebook data icon at the left side of the notebook to open the schema browser.

The For you button displays only those objects that you’ve used in the current session or previously marked as a Favorite.

As you type text into the Filter box, the display changes to show only those objects that contain the text you type. Only objects that are currently open or have been opened in the current session appear. The Filter box does not do a complete search of the catalogs, schemas, tables, and volumes available for the notebook.


If the object is a table, you can do the following:

  • Automatically create and run a cell to display a preview of the data in the table. Select Preview in a new cell from the kebab menu for the table.
  • View a catalog, schema, or table in Catalog Explorer. Select Open in Catalog Explorer from the kebab menu. A new tab opens showing the selected object.
  • Get the path to a catalog, schema, or table. Select Copy … path from the kebab menu for the object.
  • Add a table to Favorites. Select Add to favorites from the kebab menu for the table.

If the object is a catalog, schema, or volume, you can copy the object’s path or open it in Catalog Explorer.

To insert a table or column name directly into a cell:

  • Click your cursor in the cell at the location you want to enter the name.
  • Move your cursor over the table name or column name in the schema browser.

  • Click the double arrow that appears at the right of the name to insert it into the cell.

Keyboard shortcuts

To display keyboard shortcuts, select Help > Keyboard shortcuts . The keyboard shortcuts available depend on whether the cursor is in a code cell (edit mode) or not (command mode).

Find and replace text

To find and replace text within a notebook, select Edit > Find and Replace . The current match is highlighted in orange and all other matches are highlighted in yellow.


To replace the current match, click Replace . To replace all matches in the notebook, click Replace All .

To move between matches, click the Prev and Next buttons. You can also press shift+enter and enter to go to the previous and next matches, respectively.

To close the find and replace tool, click the delete icon or press esc.

Variable explorer

You can directly observe Python, Scala, and R variables in the notebook UI. For Python on Databricks Runtime 12.2 LTS and above, the variables update as a cell runs. For Scala, R, and for Python on Databricks Runtime 11.3 LTS and below, variables update after a cell finishes running.

To open the variable explorer, click the variable explorer icon in the right sidebar.

To filter the display, enter text into the search box. The list is automatically filtered as you type.

Variable values are automatically updated as you run notebook cells.


Run selected cells

You can run a single cell or a collection of cells. To select a single cell, click anywhere in the cell. To select multiple cells, hold down the Command key on macOS or the Ctrl key on Windows, and click in the cell outside of the text area.

To run the selected cells, select Run > Run selected cell(s) .

The behavior of this command depends on the cluster that the notebook is attached to.

  • On a cluster running Databricks Runtime 13.3 LTS or below, selected cells are executed individually. If an error occurs in a cell, the execution continues with subsequent cells.
  • On a cluster running Databricks Runtime 14.0 or above, or on a SQL warehouse, selected cells are executed as a batch. Any error halts execution, and you cannot cancel the execution of individual cells. You can use the Interrupt button to stop execution of all cells.

Modularize your code

This feature is in Public Preview .

With Databricks Runtime 11.3 LTS and above, you can create and manage source code files in the Azure Databricks workspace, and then import these files into your notebooks as needed.

For more information on working with source code files, see Share code between Databricks notebooks and Work with Python and R modules .

Run selected text

You can highlight code or SQL statements in a notebook cell and run only that selection. This is useful when you want to quickly iterate on code and queries.

Highlight the lines you want to run.

Select Run > Run selected text or use the keyboard shortcut Ctrl + Shift + Enter . If no text is highlighted, Run Selected Text executes the current line.

If you are using mixed languages in a cell , you must include the %<language> line in the selection.

Run selected text also executes collapsed code, if there is any in the highlighted selection.

Special cell commands such as %run , %pip , and %sh are supported.

You cannot use Run selected text on cells that have multiple output tabs (that is, cells where you have defined a data profile or visualization).

Format code cells

Azure Databricks provides tools that allow you to format Python and SQL code in notebook cells quickly and easily. These tools reduce the effort to keep your code formatted and help to enforce the same coding standards across your notebooks.

Python black formatter library

Azure Databricks supports Python code formatting using black within the notebook. The notebook must be attached to a cluster with black and tokenize-rt Python packages installed.

On Databricks Runtime 11.3 LTS and above, Azure Databricks preinstalls black and tokenize-rt . You can use the formatter directly without needing to install these libraries.

On Databricks Runtime 10.4 LTS and below, you must install black==22.3.0 and tokenize-rt==4.2.1 from PyPI on your notebook or cluster to use the Python formatter. You can run the following command in your notebook:
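For example, a sketch of that install command using the versions named above:

    %pip install black==22.3.0 tokenize-rt==4.2.1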

or install the library on your cluster .

For more details about installing libraries, see Python environment management .

For files and notebooks in Databricks Git folders, you can configure the Python formatter based on the pyproject.toml file. To use this feature, create a pyproject.toml file in the Git folder root directory and configure it according to the Black configuration format . Edit the [tool.black] section in the file. The configuration is applied when you format any file and notebook in that Git folder.

How to format Python and SQL cells

You must have CAN EDIT permission on the notebook to format code.

Azure Databricks uses the Gethue/sql-formatter library to format SQL and the black code formatter for Python.

You can trigger the formatter in the following ways:

Format a single cell

  • Keyboard shortcut: Press Cmd+Shift+F .
  • Format SQL cell: Select Format SQL in the command context dropdown menu of a SQL cell. This menu item is visible only in SQL notebook cells or those with a %sql language magic .
  • Format Python cell: Select Format Python in the command context dropdown menu of a Python cell. This menu item is visible only in Python notebook cells or those with a %python language magic .
  • Notebook Edit menu: Select a Python or SQL cell, and then select Edit > Format Cell(s) .

Format multiple cells

Select multiple cells and then select Edit > Format Cell(s) . If you select cells of more than one language, only SQL and Python cells are formatted. This includes those that use %sql and %python .

Format all Python and SQL cells in the notebook

Select Edit > Format Notebook . If your notebook contains more than one language, only SQL and Python cells are formatted. This includes those that use %sql and %python .

Limitations of code formatting

  • Black enforces PEP 8 standards for 4-space indentation. Indentation is not configurable.
  • Formatting embedded Python strings inside a SQL UDF is not supported. Similarly, formatting SQL strings inside a Python UDF is not supported.

Version history

Azure Databricks notebooks maintain a history of notebook versions, allowing you to view and restore previous snapshots of the notebook. You can perform the following actions on versions: add comments, restore and delete versions, and clear version history.

You can also sync your work in Databricks with a remote Git repository .

To access notebook versions, click the version history icon in the right sidebar.

Add a comment

To add a comment to the latest version:

  • Click the version.
  • Click Save now.
  • In the Save Notebook Version dialog, enter a comment.
  • Click Save. The notebook version is saved with the entered comment.

Restore a version

To restore a version:

  • Click Restore this version.
  • Click Confirm. The selected version becomes the latest version of the notebook.

Delete a version

To delete a version entry:

  • Click the trash icon for the version.
  • Click Yes, erase. The selected version is deleted from the history.

Clear version history

The version history cannot be recovered after it has been cleared.

To clear the version history for a notebook:

  • Select File > Clear version history .
  • Click Yes, clear . The notebook version history is cleared.

Code languages in notebooks

Set default language

The default language for the notebook appears next to the notebook name.

To change the default language, click the language button and select the new language from the dropdown menu. To ensure that existing commands continue to work, commands of the previous default language are automatically prefixed with a language magic command.

Mix languages

By default, cells use the default language of the notebook. You can override the default language in a cell by clicking the language button and selecting a language from the dropdown menu.


Alternatively, you can use the language magic command %<language> at the beginning of a cell. The supported magic commands are: %python, %r, %scala, and %sql.

When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.

Notebooks also support a few auxiliary magic commands:

  • %sh : Allows you to run shell code in your notebook. To fail the cell if the shell command has a non-zero exit status, add the -e option. This command runs only on the Apache Spark driver, and not the workers. To run a shell command on all nodes, use an init script .
  • %fs : Allows you to use dbutils filesystem commands. For example, to run the dbutils.fs.ls command to list files, you can specify %fs ls instead. For more information, see Work with files on Azure Databricks .
  • %md : Allows you to include various types of documentation, including text, images, and mathematical formulas and equations. See the next section.

SQL syntax highlighting and autocomplete in Python commands

Syntax highlighting and SQL autocomplete are available when you use SQL inside a Python command, such as in a spark.sql command.
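For instance, a small illustrative snippet (the sample table name is an assumption about the data available in your workspace):

    # The SQL string below gets SQL syntax highlighting and autocomplete inside the Python cell
    top_zips = spark.sql("""
        SELECT pickup_zip, COUNT(*) AS trips
        FROM samples.nyctaxi.trips
        GROUP BY pickup_zip
        ORDER BY trips DESC
        LIMIT 10
    """)
    display(top_zips)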

Explore SQL cell results in Python notebooks using Python

You might want to load data using SQL and explore it using Python. In a Databricks Python notebook, table results from a SQL language cell are automatically made available as a Python DataFrame assigned to the variable _sqldf .

In Databricks Runtime 13.3 LTS and above, you can also access the DataFrame result using IPython’s output caching system. The prompt counter appears in the output message displayed at the bottom of the cell results. For example, for a cell whose prompt counter is 2, you would reference the result as Out[2].

The variable _sqldf may be reassigned each time a %sql cell is run. To avoid losing reference to the DataFrame result, assign it to a new variable name before you run the next %sql cell:
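A minimal sketch of that reassignment (the new variable name is illustrative):

    # _sqldf holds the result of the most recent %sql cell; copy it before the next %sql cell overwrites it
    previous_result_df = _sqldf
    display(previous_result_df)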

If the query uses a widget for parameterization, the results are not available as a Python DataFrame.

If the query uses the keywords CACHE TABLE or UNCACHE TABLE , the results are not available as a Python DataFrame.


Execute SQL cells in parallel

While a command is running and your notebook is attached to an interactive cluster, you can run a SQL cell simultaneously with the current command. The SQL cell is executed in a new, parallel session.

To execute a cell in parallel:

  • Run the cell.
  • Click Run Now. The cell is immediately executed.

Because the cell is run in a new session, temporary views, UDFs, and the implicit Python DataFrame ( _sqldf ) are not supported for cells that are executed in parallel. In addition, the default catalog and database names are used during parallel execution. If your code refers to a table in a different catalog or database, you must specify the table name using three-level namespace ( catalog . schema . table ).

Execute SQL cells on a SQL warehouse

You can run SQL commands in a Databricks notebook on a SQL warehouse , a type of compute that is optimized for SQL analytics. See Use a notebook with a SQL warehouse .

Display images

To display images stored in the FileStore , use the following syntax:

For example, suppose you have the Databricks logo image file in FileStore:

When you include the following code in a Markdown cell:
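A sketch of such a cell, assuming the logo was uploaded to FileStore under /FileStore/images/databricks-logo.png (the path and alt text are illustrative):

    %md
    ![Databricks logo](files/images/databricks-logo.png)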

The image is rendered in the cell.

Display mathematical equations

Notebooks support KaTeX for displaying mathematical formulas and equations.
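For example, a Markdown cell like the following (the formula itself is illustrative):

    %md
    \\(c = \\pm\\sqrt{a^2 + b^2}\\)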

The formula renders as typeset math in the cell output.

Include HTML

You can include HTML in a notebook by using the function displayHTML . See HTML, D3, and SVG in notebooks for an example of how to do this.
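A minimal sketch of a displayHTML call (the HTML content is illustrative):

    displayHTML("""
      <h3>Forecast refresh complete</h3>
      <p>This HTML is rendered by <b>displayHTML</b> in the cell output.</p>
    """)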

The displayHTML iframe is served from the domain databricksusercontent.com, and the iframe sandbox includes the allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently blocked by your corporate network, it must be added to an allow list.

Link to other notebooks

You can link to other notebooks or folders in Markdown cells using relative paths. Specify the href attribute of an anchor tag as the relative path, starting with a $ and then follow the same pattern as in Unix file systems:
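For instance, a sketch of such a link in a Markdown cell (the folder and notebook names are illustrative):

    %md
    <a href="$./myFolder/myOtherNotebook">Open the companion notebook</a>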


Ensuring Quality Forecasts with Databricks Lakehouse Monitoring

Peter Park

July 18, 2024 in Platform Blog

Forecasting models are critical for many businesses to predict future trends, but their accuracy depends heavily on the quality of the input data. Poor quality data can lead to inaccurate forecasts that result in suboptimal decisions. This is where Databricks Lakehouse Monitoring comes in - it provides a unified solution to monitor both the quality of data flowing into forecasting models as well as the model performance itself.

Monitoring is especially crucial for forecasting models. Forecasting deals with time series data, where the temporal component and sequential nature of the data introduce additional complexities. Issues like data drift, where the statistical properties of the input data change over time, can significantly degrade forecast accuracy if not detected and addressed promptly.

Additionally, the performance of forecasting models is often measured by metrics like Mean Absolute Percentage Error (MAPE) that compare predictions to actual values. However, ground truth values are not immediately available, only arriving after the forecasted time period has passed. This delayed feedback loop makes proactive monitoring of input data quality and model outputs even more important to identify potential issues early.

Frequent retraining of statistical forecasting models using recent data is common, but monitoring remains valuable to detect drift early and avoid unnecessary computational costs. For complex models like PatchTST , which use deep learning and require GPUs, retraining may be less frequent due to resource constraints, making monitoring even more critical.

Automatic hyperparameter tuning can introduce skew and inconsistent model performance across runs. Monitoring helps you quickly identify when a model's performance has degraded and take corrective action, such as manually adjusting the hyperparameters or investigating the input data for anomalies. Furthermore, monitoring can help you strike the right balance between model performance and computational cost. Auto-tuning can be resource-intensive, especially if it's blindly running on every retraining cycle. By monitoring the model's performance over time, you can determine whether the auto-tuning is actually yielding significant improvements or just adding unnecessary overhead. This insight allows you to optimize your model training pipeline.

Databricks Lakehouse Monitoring is built to monitor the statistical properties and quality of data across all tables, but it also includes specific capabilities tailored for tracking the performance of machine learning models via monitoring inference tables containing model inputs and predictions. For forecasting models, this allows:

  • Monitoring data drift in input features over time, comparing to a baseline
  • Tracking prediction drift and distribution of forecasts
  • Measuring model performance metrics like MAPE, bias, etc as actuals become available
  • Setting alerts if data quality or model performance degrades

Creating Forecasting Models on Databricks

Before we discuss how to monitor forecasting models, let's briefly cover how to develop them on the Databricks platform. Databricks provides a unified environment to build, train, and deploy machine learning models at scale, including time series forecasting models.

There are several popular libraries and techniques you can use to generate forecasts, such as:

  • Prophet: An open-source library for time series forecasting that is easy to use and tune. It excels at handling data with strong seasonal effects and works well with messy data by smoothing out outliers. Prophet's simple, intuitive API makes it accessible to non-experts. You can use PySpark to parallelize Prophet model training across a cluster to build thousands of models, one for each product-store combination.
  • ARIMA/SARIMA: Classic statistical methods for time series forecasting. ARIMA models aim to describe the autocorrelations in the data. SARIMA extends ARIMA to model seasonality. In a recent benchmarking study, SARIMA demonstrated strong performance on retail sales data compared to other algorithms.

In addition to popular libraries like Prophet and ARIMA/SARIMA for generating forecasts, Databricks also offers AutoML for Forecasting. AutoML simplifies the process of creating forecasting models by automatically handling tasks like algorithm selection, hyperparameter tuning, and distributed training. With AutoML, you can:

  • Quickly generate baseline forecasting models and notebooks through a user-friendly UI
  • Leverage multiple algorithms like Prophet and Auto-ARIMA under the hood
  • Automatically handle data preparation, model training and tuning, and distributed computation with Spark

You can easily integrate models created using AutoML with MLflow for experiment tracking and Databricks Model Serving for deployment and monitoring. The generated notebooks provide code that can be customized and incorporated into production workflows.

To streamline the model development workflow, you can leverage MLflow, an open-source platform for the machine learning lifecycle. MLflow supports Prophet, ARIMA, and other models out of the box, and its experiment tracking is well suited to time series experiments, allowing you to log parameters, metrics, and artifacts. While this simplifies model deployment and promotes reproducibility, using MLflow is optional for Lakehouse Monitoring: you can deploy models without it, and Lakehouse Monitoring will still be able to track them.

Once you have a trained forecasting model, you have flexibility in how you deploy it for inference depending on your use case. For real-time forecasting, you can use Databricks Model Serving to deploy the model as a low-latency REST endpoint with just a few lines of code. When a model is deployed for real-time inference, Databricks automatically logs the input features and predictions to a managed Delta table called an inference log table. This table serves as the foundation for monitoring the model in production using Lakehouse Monitoring.

However, forecasting models are often used in batch scoring scenarios, where predictions are generated on a schedule (e.g. generating forecasts every night for the next day). In this case, you can build a separate pipeline that scores the model on a batch of data and logs the results to a Delta table. It's more cost-effective to load the model directly on your Databricks cluster and score the batch of data there. This approach avoids the overhead of deploying the model behind an API and paying for compute resources in multiple places. The logged table of batch predictions can still be monitored using Lakehouse Monitoring in the same way as a real-time inference table.

If you do require both real-time and batch inference for your forecasting model, you can consider using Model Serving for the real-time use case and loading the model directly on a cluster for batch scoring. This hybrid approach optimizes costs while still providing the necessary functionality. You can leverage Lakeview dashboards to build interactive visualizations and share insights on your forecasting reports. You can also set up email subscriptions to automatically send out dashboard snapshots on a schedule.

Whichever approach you choose, by storing the model inputs and outputs in the standardized Delta table format, it becomes straightforward to monitor data drift, track prediction changes, and measure accuracy over time. This visibility is crucial for maintaining a reliable forecasting pipeline in production.

Now that we've covered how to build and deploy time series forecasting models on Databricks, let's dive into the key aspects of monitoring them with Lakehouse Monitoring.

Monitor Data Drift and Model Performance

To ensure your forecasting model continues to perform well in production, it's important to monitor both the input data and the model predictions for potential issues. Databricks Lakehouse Monitoring makes this easy by allowing you to create monitors on your input feature tables and inference log tables. Lakehouse Monitoring is built on top of Unity Catalog as a unified way to govern and monitor your data, and requires Unity Catalog to be enabled on your workspace.

Create an Inference Profile Monitor

To monitor a forecasting model, create an inference profile monitor on the table containing the model's input features, predictions, and optionally, ground truth labels. You can create the monitor using either the Databricks UI or the Python API.

In the UI, navigate to the inference table and click the "Quality" tab. Click "Get started" and select "Inference Profile" as the monitor type. Then configure the following key parameters:

  • Problem Type: Select regression for forecasting models
  • Timestamp Column: The column containing the timestamp of each prediction. It is the timestamp of the inference, not the timestamp of the data itself.
  • Prediction Column: The column containing the model's forecasted values
  • Label Column (optional): The column containing the actual values. This can be populated later as actuals arrive.
  • Model ID Column: The column identifying the model version that made each prediction
  • Granularities: The time windows to aggregate metrics over, e.g. daily or weekly

Optionally, you can also specify:

  • A baseline table containing reference data, like the training set, to compare data drift against
  • Slicing expressions to define data subsets to monitor, like different product categories
  • Custom metrics to calculate, defined by a SQL expression
  • A refresh schedule

Using the Python REST API , you can create an equivalent monitor with code like:
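A rough sketch of what that call can look like, assuming the databricks.lakehouse_monitoring client; the catalog, schema, table, and column names are illustrative, and argument names may differ slightly between client versions:

    from databricks import lakehouse_monitoring as lm

    monitor = lm.create_monitor(
        table_name="dev.forecasting.forecast_inference_log",   # inference log table
        profile_type=lm.InferenceLog(
            problem_type="regression",            # forecasting is monitored as regression
            timestamp_col="inference_timestamp",  # when the prediction was made
            prediction_col="forecast",            # the model's forecasted values
            label_col="actual",                   # optional; populated as actuals arrive
            model_id_col="model_version",         # which model version produced each prediction
            granularities=["1 day", "1 week"],    # aggregation windows
        ),
        baseline_table_name="dev.forecasting.training_set",  # optional reference data
        slicing_exprs=["product_category"],                  # optional data subsets to monitor
        output_schema_name="dev.forecasting_monitoring",     # where the metric tables are written
    )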

The baseline table is an optional table containing a reference dataset, such as your model's training data, to compare the production data against. For forecasting models, which are frequently retrained, a baseline comparison is often unnecessary because the model changes frequently; comparing to a previous time window is usually more useful. If a baseline comparison is desired, update the baseline only when there is a bigger change, such as hyperparameter tuning or an update to the actuals.

Monitoring in forecasting is useful even in scenarios where the retraining cadence is pre-set, such as weekly or monthly. In these cases, you can still engage in exception-based forecast management when forecast metrics deviate from actuals or when actuals fall out of line with forecasts. This allows you to determine if the underlying time series needs to be re-diagnosed (the formal forecasting language for retraining, where trends, seasonality, and cyclicity are individually identified if using econometric models) or if individual deviations can be isolated as anomalies or outliers. In the latter case, you wouldn't re-diagnose, but mark the deviation as an outlier and potentially attach a calendar event or an exogenous variable to the model in the future.

Lakehouse Monitoring will automatically track statistical properties of the input features over time and alert if significant drift is detected relative to the baseline or previous time windows. This allows you to identify data quality issues that could degrade forecast accuracy. For example:

  • Monitor the distribution of key input features like sales amounts. If there is a sudden shift, it could indicate data quality issues that may degrade forecast accuracy.
  • Track the number of missing or outlier values. An increase in missing data for recent time periods could skew forecasts.

In addition to the default metrics, you can define custom metrics using SQL expressions to capture business-specific logic or complex comparisons. Some examples relevant to forecasting:

  • Comparing metrics across seasons or years, e.g. calculating the percent difference in average sales between the current quarter and the same quarter last year
  • Weighting errors differently based on the item being forecasted, e.g. penalizing errors on high-value products more heavily
  • Tracking the percentage of forecasts within an acceptable error tolerance

Custom metrics can be of three types:

  • Aggregate metrics calculated from columns in the inference table
  • Derived metrics calculated from other aggregate metrics
  • Drift metrics comparing an aggregate or derived metric across time windows or to a baseline

Custom metrics such as these can be defined when you create the monitor (see the optional parameters above). By incorporating custom metrics tailored to your specific forecasting use case, you can gain deeper, more relevant insights into your model's performance and data quality.

The key idea is to bring your model's input features, predictions, and ground truth labels together in one inference log table. Lakehouse Monitoring will then automatically track data drift, prediction drift, and performance metrics over time and by the dimensions you specify.

If your forecasting model is served outside of Databricks, you can ETL the request logs into a Delta table and then apply monitoring on it. This allows you to centralize monitoring even for external models.

It's important to note that when you first create a time series or inference profile monitor, it analyzes only data from the 30 days prior to the monitor's creation. Due to this cutoff, the first analysis window might be partial. For example, if the 30-day limit falls in the middle of a week or month, the full week or month will not be included.

This 30-day lookback limitation only affects the initial window when the monitor is created. After that, all new data flowing into the inference table will be fully processed according to the specified granularities.

Refresh Monitors to Update Metrics

After creating an inference profile monitor for your forecasting model, you need to periodically refresh it to update the metrics with the latest data. Refreshing a monitor recalculates the profile and drift metrics tables based on the current contents of the inference log table. You should refresh a monitor when:

  • New predictions are logged from the model
  • Actual values are added for previous predictions
  • The inference table schema changes, such as adding a new feature column
  • You modify the monitor settings, like adding additional custom metrics

There are two ways to refresh a monitor: on a schedule or manually.

To set up a refresh schedule, specify the schedule parameter when creating the monitor using the UI, or with the Python API:
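For example, a sketch under the same assumptions as the earlier create_monitor call; depending on the client version the schedule class may be named CronSchedule or MonitorCronSchedule:

    from databricks import lakehouse_monitoring as lm

    lm.create_monitor(
        table_name="dev.forecasting.forecast_inference_log",
        profile_type=lm.InferenceLog(
            problem_type="regression",
            timestamp_col="inference_timestamp",
            prediction_col="forecast",
            model_id_col="model_version",
            granularities=["1 day"],
        ),
        output_schema_name="dev.forecasting_monitoring",
        schedule=lm.MonitorCronSchedule(
            quartz_cron_expression="0 0 6 * * ?",  # refresh every day at 06:00
            timezone_id="UTC",
        ),
        skip_builtin_dashboard=True,  # keep an existing customized dashboard
    )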

The `CronSchedule` lets you provide a cron expression to define the refresh frequency, such as daily, hourly, etc. You can set `skip_builtin_dashboard` to True, which will skip generating a new dashboard for the monitor. This is especially useful when you have already built a dashboard or have custom charts in the dashboard you want to keep and don't need a new one.

Alternatively, you can manually refresh a monitor using the UI or Python API. In the Databricks UI, go to the "Quality" tab on the inference table, select the monitor, and click "Refresh metrics".

Using Python API, you can create a pipeline that refreshes a monitor so that it's action-driven, for example, after retraining a model. To refresh a monitor in a notebook, use the run_refresh function:
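A minimal sketch, using the same assumed client and table name as above:

    from databricks import lakehouse_monitoring as lm

    # Kicks off a serverless refresh of the profile and drift metrics tables
    refresh_info = lm.run_refresh(table_name="dev.forecasting.forecast_inference_log")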

This submits a serverless job to update the monitor metrics tables. You can continue to use the notebook while the refresh runs in the background.

After a refresh completes, you can query the updated profile and drift metrics tables using SQL. Note, however, that the generated dashboard is updated separately, which you can do by clicking the "Refresh" button in the DBSQL dashboard itself. Conversely, clicking Refresh on the dashboard doesn't trigger monitor calculations; it simply reruns the queries over the metric tables that the dashboard uses to generate visualizations. To update the data behind the dashboard's visualizations, you must refresh the monitor and then refresh the dashboard.

Understanding the Monitoring Output

When you create an inference profile monitor for your forecasting model, Lakehouse Monitoring generates several key assets to help you track data drift, model performance, and overall health of your pipeline.

Profile and Drift Metrics Tables

Lakehouse Monitoring creates two primary metrics tables:

  • A profile metrics table with summary statistics computed per time window: count, mean, stddev, min, and max for numeric columns and for the prediction column, and count, number of nulls, and number of distinct values for categorical columns.
  • A drift metrics table comparing each window to the previous window or to the baseline: the Wasserstein distance for numeric columns and for the prediction column, measuring differences in distribution shape and detecting shifts in the forecast distribution, and the Jensen-Shannon divergence for categorical columns, quantifying the difference between probability distributions.

You can query them directly using SQL to investigate specific questions, such as:

  • What is the average prediction and how has it changed week-over-week?
  • Is there a difference in model accuracy between product categories?
  • How many missing values were there in a key input feature yesterday vs the training data?
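For instance, a sketch of one such query; the metrics table name follows the default <output schema>.<table>_profile_metrics pattern, and the column names are assumptions about the generated schema:

    # Weekly model error from the generated profile metrics table (names are assumptions)
    weekly_error = spark.sql("""
        SELECT window.start AS week_start,
               mean_absolute_percentage_error AS mape
        FROM dev.forecasting_monitoring.forecast_inference_log_profile_metrics
        WHERE granularity = '1 week'
          AND column_name = ':table'   -- table-level rows hold model performance metrics
        ORDER BY week_start
    """)
    display(weekly_error)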

Model Performance Dashboard


In addition to the metrics tables, Lakehouse Monitoring automatically generates an interactive dashboard to visualize your forecasting model's performance over time. The dashboard includes several key components:

  • Model Performance Panel: Displays key accuracy metrics for your forecasting model, such as MAPE, RMSE, bias, etc. These metrics are calculated by comparing the predictions to the actual values, which can be provided on a delay (e.g. daily actuals for a daily forecast). The panel shows the metrics over time and by important slices like product category or region.
  • Drift Metrics Panel: Visualizes the drift metrics for selected features and the prediction column over time.
  • Data Quality Panel: Shows metrics such as the percentage of missing values, the percentage of NaNs, summary statistics such as count, mean, min, and max, and other data anomalies for both numeric and categorical features over time. This helps you quickly spot data quality issues that could degrade forecast accuracy.

The dashboard is highly interactive, allowing you to filter by time range, select specific features and slices, and drill down into individual metrics. The dashboard is often customized after its creation to include any views or charts your organization is used to looking at. Queries that are used on the dashboard can be customized and saved, and you can add alerts from any of the views by clicking on "view query" and then clicking on "create alert". At the time of writing, a customized template for the dashboard is not supported.

It's a valuable tool for both data scientists to debug model performance and business stakeholders to maintain trust in the forecasts.

Leveraging Actuals for Accuracy Monitoring

To calculate model performance metrics like MAPE, the monitoring system needs access to the actual values for each prediction. However, with forecasting, actuals are often not available until some time after the prediction is made.

One strategy is to set up a separate pipeline that appends the actual values to the inference log table when they become available, then refresh the monitor to update the metrics. For example, if you generate daily forecasts, you could have a job that runs each night to add the actual values for the previous day's predictions.
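A sketch of such a pipeline step, with illustrative table and column names:

    from databricks import lakehouse_monitoring as lm

    # Join yesterday's actuals onto the logged predictions
    spark.sql("""
        MERGE INTO dev.forecasting.forecast_inference_log AS log
        USING dev.sales.daily_actuals AS act
          ON log.item_id = act.item_id
         AND log.forecast_date = act.sales_date
        WHEN MATCHED THEN
          UPDATE SET log.actual = act.quantity_sold
    """)

    # Recompute the monitor metrics now that labels are available
    lm.run_refresh(table_name="dev.forecasting.forecast_inference_log")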

By capturing actuals and refreshing the monitor regularly, you can track forecast accuracy over time and identify performance degradation early. This is crucial for maintaining trust in your forecasting pipeline and making informed business decisions.

Moreover, monitoring actuals and forecasts separately enables powerful exception management capabilities. Exception management is a popular technique in demand planning where significant deviations from expected outcomes are proactively identified and resolved. By setting up alerts on metrics like forecast accuracy or bias, you can quickly spot when a model's performance has degraded and take corrective action, such as adjusting model parameters or investigating input data anomalies.

Lakehouse Monitoring makes exception management straightforward by automatically tracking key metrics and providing customizable alerting. Planners can focus their attention on the most impactful exceptions rather than sifting through mountains of data. This targeted approach improves efficiency and helps maintain high forecast quality with minimal manual intervention.

In summary, Lakehouse Monitoring provides a comprehensive set of tools for monitoring your forecasting models in production. By leveraging the generated metrics tables and dashboard, you can proactively detect data quality issues, track model performance, diagnose drift, and manage exceptions before they impact your business. The ability to slice and dice the metrics across dimensions like product, region, and time enables you to quickly pinpoint the root cause of any issues and take targeted action to maintain the health and accuracy of your forecasts.

Set Alerts on Model Metrics

Once you have an inference profile monitor set up for your forecasting model, you can define alerts on key metrics to proactively identify issues before they impact business decisions. Databricks Lakehouse Monitoring integrates with Databricks SQL to allow you to create alerts based on the generated profile and drift metrics tables.

Some common scenarios where you would want to set up alerts for a forecasting model include:

  • Alert if the rolling 7-day average prediction error (MAPE) exceeds 10%. This could indicate the model is no longer accurate and may need retraining.
  • Alert if the number of missing values in a key input feature has increased significantly compared to the training data. Missing data could skew predictions.
  • Alert if the distribution of a feature has drifted beyond a threshold relative to the baseline. This could signal a data quality issue or that the model needs to be updated for the new data patterns.
  • Alert if no new predictions have been logged in the past 24 hours. This could mean the inference pipeline has failed and needs attention.
  • Alert if the model bias (mean error) is consistently positive or negative. This could indicate the model is systematically over or under forecasting.

Built-in queries that power the dashboard views are already generated for you. To create an alert, navigate to the SQL query that calculates the metric you want to monitor from the profile or drift metrics table. Then, in the Databricks SQL query editor, click "Create Alert" and configure the alert conditions, such as triggering when the MAPE exceeds 0.1. You can set the alert to run on a schedule, like hourly or daily, and specify how to receive notifications (e.g. email, Slack, PagerDuty).

In addition to alerts on the default metrics, you can write custom SQL queries to calculate bespoke metrics for your specific use case. For example, maybe you want to alert if the MAPE for high-value products exceeds a different threshold than for low-value products. You could join the profile metrics with a product table to segment the MAPE calculation.

The key is that all the feature and prediction data is available in metric tables, so you can flexibly compose SQL to define custom metrics that are meaningful for your business. You can then create alerts on top of these custom metrics using the same process.

By setting up targeted alerts on your forecasting model metrics, you can keep a pulse on its performance without manual monitoring. Alerts allow you to respond quickly to anomalies and maintain trust in the model's predictions. Combined with the multi-dimensional analysis enabled by Lakehouse Monitoring, you can efficiently diagnose and resolve issues to keep your forecast quality high.

Monitor Lakehouse Monitoring Expenses

While not specific to forecasting models, it's important to understand how to track your usage and expenses for Lakehouse Monitoring itself so you can budget appropriately. Lakehouse Monitoring jobs run on serverless compute infrastructure, so you don't have to manage clusters yourself. To estimate your Lakehouse Monitoring costs, follow these steps:

  • Determine the number and frequency of monitors you plan to create. Each monitor will run on a schedule to refresh the metrics.
  • Estimate the data volume and complexity of the SQL expressions for your monitors. Larger data sizes and more complex queries will consume more DBUs.
  • Look up the DBU rate for serverless workloads based on your Databricks tier and cloud provider.
  • Multiply your estimated DBUs by the applicable rate to get your estimated Lakehouse Monitoring cost.

Your actual costs will depend on your specific monitor definitions and data, which can vary over time. Databricks provides two ways to monitor your Lakehouse Monitoring costs: using a SQL query or the billing portal. Refer to https://docs.databricks.com/en/lakehouse-monitoring/expense.html for more information.

Ready to start monitoring your forecasting models with Databricks Lakehouse Monitoring? Sign up for a free trial to get started. Already a Databricks customer? Check out our documentation to set up your first inference profile monitor today.


Create and manage scheduled notebook jobs

You can create and manage notebook jobs directly in the notebook UI. If a notebook is already assigned to one or more jobs, you can create and manage schedules for those jobs. If a notebook is not assigned to a job, you can create a job and a schedule to run the notebook.

Schedule a notebook job

To schedule a notebook job to run periodically:

Click the Schedule button at the top right of the notebook toolbar.

If jobs already exist for the notebook, the Jobs List dialog appears. To display the Schedule dialog, click Add a schedule .


In the Schedule dialog, optionally enter a name for the job. The default name is the name of the notebook.

Select Manual to run your job only when manually triggered, or Scheduled to define a schedule for running the job. If you select Scheduled , use the drop-downs to specify the frequency, time, and time zone.

In the Compute drop-down, select the compute resource to run the task.

If the notebook is attached to a SQL warehouse, the default compute is the same SQL warehouse.

If your workspace is Unity Catalog-enabled and Serverless Workflows is enabled, the job runs on serverless compute by default.

Otherwise, if you have Allow Cluster Creation permissions, the job runs on a new job cluster by default. To edit the configuration of the default job cluster, click Edit at the right of the field to display the cluster configuration dialog . If you do not have Allow Cluster Creation permissions, the job runs on the cluster that the notebook is attached to by default. If the notebook is not attached to a cluster, you must select a cluster from the Cluster dropdown.

Optionally, enter any Parameters to pass to the job. Click Add and specify the key and value of each parameter. Parameters set the value of the notebook widget specified by the key of the parameter. Use dynamic value references to pass a limited set of dynamic values as part of a parameter value.

Optionally, specify email addresses to receive Alerts on job events. See Add email and system notifications for job events .

Click Submit .

Run a notebook job

To manually run a notebook job:

Click Run now.

Manage scheduled notebook jobs

Click the kebab menu next to the job name. From this menu, you can edit the schedule, clone the job, view job run details, pause the job, resume the job, or delete a scheduled job.

When you clone a scheduled job, a new job is created with the same parameters as the original. The new job appears in the list with the name Clone of <initial job name> .

How you edit a job depends on the complexity of the job’s schedule. Either the Schedule dialog or the Job details panel displays, allowing you to edit the schedule, cluster, parameters, and so on.
