
How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API to Dashboard


Databricks has shaken the data market once again. The company launched a free edition of the Databricks platform with all of the core functionality included. It's a great resource for learning and testing, to say the least.

With that in mind, I created an end-to-end project to help you learn the fundamentals of the main resources inside Databricks.

This project demonstrates a complete Extract, Transform, Load (ETL) workflow inside Databricks. It integrates the OpenWeatherMap API for data retrieval and OpenAI's GPT-4o-mini model to produce personalized, weather-based dressing suggestions.

Let's learn more about it.

The Project

The project implements a full data pipeline inside Databricks, following these steps.

  1. Extract: Fetches current weather data for New York City via the OpenWeatherMap API [1].
  2. Transform: Converts UTC timestamps to New York local time and uses OpenAI's [2] GPT-4o-mini to generate personalized dressing suggestions based on the temperature.
  3. Load: Persists the data into the Databricks Unity Catalog as both raw JSON files and a structured Delta table (Silver Layer).
  4. Orchestration: The notebook with this ETL code is added to a job and scheduled to run every hour in Databricks.
  5. Analytics: The silver layer feeds a Databricks Dashboard that displays the relevant weather information alongside the LLM's suggestions.

Here is the architecture.

Project Architecture. Image by the author.

Great. Now that we understand what we need to do, let's move on to the how of this tutorial.

Note: if you still don't have a Databricks account, go to the Databricks Free Edition page [3], click the sign-up button, and follow the on-screen prompts to get your free access.

Extract: Integrating the API and Databricks

As I usually say, a data project needs data to start, right? So our task here is integrating the OpenWeatherMap API to ingest data directly into a PySpark notebook inside Databricks. This task may look complicated at first, but trust me, it's not.

On the Databricks home page, create a new notebook using the +New button, then select Notebook.

Create a new Notebook. Image by the author.

For the Extract part, we'll need:

1. The API key from OpenWeatherMap.

To get it, go to the API's signup page and complete the free registration process. Once logged in to the dashboard, click on the API keys tab, where you will be able to see it.

2. Import packages

# Imports
import requests
import json

Next, we're going to create a Python class to modularize our code and make it production-ready as well.

  • This class receives the API_KEY we just created, as well as the city and country for the weather fetch.
  • It returns the response in JSON format.
# Creating a class to modularize our code

class Weather:

    # Define the constructor
    def __init__(self, API_KEY):
        self.API_KEY = API_KEY

    # Define a method to retrieve weather data
    def get_weather(self, city, country, units='imperial'):
        self.city = city
        self.country = country
        self.units = units

        # Make a GET request to an API endpoint that returns JSON data
        url = f"https://api.openweathermap.org/data/2.5/weather?q={city},{country}&APPID={self.API_KEY}&units={units}"
        response = requests.get(url)

        # Raise an error for unsuccessful responses; otherwise parse the JSON and return it
        if response.status_code != 200:
            raise Exception(f"Error: {response.status_code} - {response.text}")
        return response.json()

Nice. Now we can run this class. Notice that we use dbutils.widgets.get(). This command reads the Parameters of the scheduled job, which we'll set up later in this article. It's a best practice to keep secrets safe.
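If you run the notebook interactively before the job and its Parameters exist, the widget itself has to be created first; alternatively, a Databricks secret scope can hold the key. A minimal sketch (the scope and key names below are placeholders, not from the project):

# Create a text widget so dbutils.widgets.get('API_KEY') works in interactive runs
dbutils.widgets.text('API_KEY', '')

# Alternative: read the key from a secret scope (hypothetical scope/key names)
# API_KEY = dbutils.secrets.get(scope='weather-pipeline', key='openweathermap-api-key')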

# Get the OpenWeatherMap API key
API_KEY = dbutils.widgets.get('API_KEY')

# Instantiate the class
w = Weather(API_KEY=API_KEY)

# Get the weather data
nyc = w.get_weather(city='New York', country='US')
nyc

Here is the response.

{'coord': {'lon': -74.006, 'lat': 40.7143},
 'weather': [{'id': 804,
   'main': 'Clouds',
   'description': 'overcast clouds',
   'icon': '04d'}],
 'base': 'stations',
 'main': {'temp': 54.14,
  'feels_like': 53.44,
  'temp_min': 51.76,
  'temp_max': 56.26,
  'pressure': 992,
  'humidity': 89,
  'sea_level': 992,
  'grnd_level': 993},
 'visibility': 10000,
 'wind': {'speed': 21.85, 'deg': 270, 'gust': 37.98},
 'clouds': {'all': 100},
 'dt': 1766161441,
 'sys': {'type': 1,
  'id': 4610,
  'country': 'US',
  'sunrise': 1766146541,
  'sunset': 1766179850},
 'timezone': -18000,
 'id': 5128581,
 'name': 'New York',
 'cod': 200}

With that response in hand, we can move on to the Transformation part of our project, where we'll clean and transform the data.

Transform: Formatting the Data

In this section, we'll look at the cleaning and transformation tasks performed on the raw data. We will start by selecting the pieces of data needed for our dashboard. This is simply getting values out of a dictionary (or a JSON).

# Getting information
id = nyc['id']
timestamp = nyc['dt']
weather = nyc['weather'][0]['main']
temp = nyc['main']['temp']
tmin = nyc['main']['temp_min']
tmax = nyc['main']['temp_max']
country = nyc['sys']['country']
city = nyc['name']
sunrise = nyc['sys']['sunrise']
sunset = nyc['sys']['sunset']

Next, let's convert the timestamps to the New York time zone, since the API returns them in UTC.

# Transform sunrise and sunset to datetime in the NYC timezone
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Timestamp, sunrise and sunset to NYC timezone
target_timezone = ZoneInfo("America/New_York")
dt_utc = datetime.fromtimestamp(sunrise, tz=timezone.utc)
sunrise_nyc = str(dt_utc.astimezone(target_timezone).time())  # keep only the sunrise time
dt_utc = datetime.fromtimestamp(sunset, tz=timezone.utc)
sunset_nyc = str(dt_utc.astimezone(target_timezone).time())  # keep only the sunset time
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
time_nyc = str(dt_utc.astimezone(target_timezone))

Finally, we format it as a Spark DataFrame.

# Create a dataframe from the variables
df = spark.createDataFrame([[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]], schema=['id', 'timestamp','weather', 'temp', 'tmin', 'tmax', 'country', 'city', 'sunrise', 'sunset'])

Data cleaned and transformed. Image by the author.

The final step in this section is adding the suggestion from an LLM. Here, we are going to pick some of the data fetched from the API and pass it to the model, asking it to return a suggestion of how a person could dress to be prepared for the weather.

  • You will need an OpenAI API key.
  • Pass the weather condition and the max and min temperatures (weather, tmax, tmin).
  • Ask the LLM to return a suggestion about how to dress for the weather.
  • Add the suggestion to the final DataFrame.
%pip install openai --quiet
from openai import OpenAI
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# Get the OpenAI key
OPENAI_API_KEY = dbutils.widgets.get('OPENAI_API_KEY')

client = OpenAI(
    # This is the default and can be omitted
    api_key=OPENAI_API_KEY
)

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in one sentence.",
    input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
)

suggestion = response.output_text

# Add the suggestion to the df
df = df.withColumn('suggestion', F.lit(suggestion))
display(df)

Cool. We're almost done with the ETL. Now it's all about loading the data. That's the next section.

Load: Saving the Data and Creating the Silver Layer

The last piece of the ETL is loading the data. We will load it in two different ways.

  1. Persisting the raw data in a Unity Catalog Volume.
  2. Saving the transformed DataFrame directly into the silver layer, which is a Delta table ready for dashboard consumption.

Let's create a catalog that will hold all of the weather data that we get from the API.

-- Creating a Catalog
CREATE CATALOG IF NOT EXISTS pipeline_weather
COMMENT 'This is the catalog for the weather pipeline';

Next, we create a schema for the Lakehouse. This one will store the volume with the raw JSON files we fetch.

-- Creating a Schema
CREATE SCHEMA IF NOT EXISTS pipeline_weather.lakehouse
COMMENT 'This is the schema for the weather pipeline';

Now, we create the volume for the raw data.

-- Let's create a volume
CREATE VOLUME IF NOT EXISTS pipeline_weather.lakehouse.raw_data
COMMENT 'This is the raw data volume for the weather pipeline';

We also create another schema to hold the Silver Layer Delta table.

-- Creating the schema to hold the transformed data
CREATE SCHEMA IF NOT EXISTS pipeline_weather.silver
COMMENT 'This is the schema for the weather pipeline';

Once we have everything set up, this is how our Catalog looks.

Catalog ready to receive data. Image by the author.

Now, let's save the raw JSON response into our raw volume. To keep everything organized and prevent overwriting, we'll attach a unique timestamp to each filename.

By appending these files to the volume rather than simply overwriting them, we're creating a reliable “audit trail”. This acts as a safety net, meaning that if a downstream process fails or we run into data loss later, we can always go back to the source and re-process the original data whenever we need it.

# Get timestamp
stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')

# Path to save to
json_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_{stamp}.json'

# Save the data into a JSON file
df.write.mode('append').json(json_path)

While we keep the raw JSON as our “source of truth,” saving the cleaned data into a Delta table in the Silver layer is where the real magic happens. By using .mode("append") and the Delta format, we ensure our data is structured, schema-enforced, and ready for high-speed analytics or BI tools. This layer turns messy API responses into a reliable, queryable table that grows with every pipeline run.

# Save the transformed data into a table in the silver schema
(
    df
    .write
    .format('delta')
    .mode("append")
    .saveAsTable('pipeline_weather.silver.weather')
)

Beautiful! With all of this set, let's check how our table looks now.

Silver Layer Table. Image by the author.
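If you prefer to check it from code rather than the Catalog Explorer, a quick peek at the latest rows works too (a minimal sketch; spark and display are available in any Databricks notebook):

# Peek at the most recent rows of the silver table
display(
    spark.table('pipeline_weather.silver.weather')
    .orderBy('timestamp', ascending=False)
    .limit(5)
)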

Let's start automating this pipeline now.

Orchestration: Scheduling the Notebook to Run Automatically

Moving on with the project, it's time to make this pipeline run on its own, with minimal supervision. For that, Databricks has the Jobs & Pipelines tab, where we can easily schedule jobs to run.

  1. Click the Jobs & Pipelines tab on the left panel.
  2. Find the Create button and select Job.
  3. Click on Notebook to add it to the job.
  4. Configure it like the image below.
  5. Add the API keys to the Parameters.
  6. Click Create task.
  7. Click Run Now to test whether it works.

Adding a Notebook to the Job. Image by the author.

Once you click the Run Now button, it should start running the notebook and display the Succeeded message.

Jobs ran. Image by the author.

If the job is running fine, it's time to schedule it to run automatically.

  1. Click on Add trigger on the right side of the screen, right under the Schedules & Triggers section.
  2. Trigger type = Scheduled.
  3. Schedule type: select Advanced.
  4. Select Every 1 hour from the drop-downs.
  5. Save it.

Excellent. Our pipeline is on auto-mode now! Every hour, the system will hit the OpenWeatherMap API, get fresh weather information for NYC, and save it to our Silver Layer table.
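A quick way to confirm the hourly runs are really appending data is to check the row count and the latest load in the silver table (a small sanity check, not part of the original pipeline):

# Sanity check: total rows and most recent load in the silver table
from pyspark.sql import functions as F

silver = spark.table('pipeline_weather.silver.weather')
display(
    silver.agg(
        F.count('*').alias('total_rows'),
        # timestamp is stored as a string with a fixed format, so max() works lexicographically here
        F.max('timestamp').alias('latest_run')
    )
)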

Analytics: Building a Dashboard for Data-Driven Decisions

The last piece of this puzzle is creating the analytics deliverable, which will present the weather information and provide the user with actionable advice about how to dress for the weather outside.

  1. Click on the Dashboards tab on the left side panel.
  2. Click on the Create dashboard button.
  3. It will open a blank canvas for us to work on.

Dashboard started. Image by the author.

Now, dashboards work based on data fetched from SQL queries. Therefore, before we start adding text and graphics to the canvas, we first need to create some metrics that will serve as the variables feeding the dashboard cards and charts.

So, click on the +Create from SQL button to start a metric and give it a name. For example, for the Location metric, which retrieves the latest fetched city name, I use the query that follows.

-- Get the latest city name fetched
SELECT city
FROM pipeline_weather.silver.weather
ORDER BY timestamp DESC
LIMIT 1

And we must create one SQL query for each metric. You can see all of them in the GitHub repository [4].
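For example, a metric that surfaces the latest temperature together with the LLM's dressing suggestion follows the same pattern (an illustrative query; the metrics in the repository may be split or named differently):

-- Latest temperature and dressing suggestion
SELECT temp, suggestion
FROM pipeline_weather.silver.weather
ORDER BY timestamp DESC
LIMIT 1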

Next, we click on the Dashboard tab and start dragging and dropping elements onto the canvas.

Dashboard creation elements menu. Image by the author.

Once you click on the Text element, it lets you insert a box onto the canvas and edit the text. When you click on the graphic element, it inserts a placeholder for a chart and opens the right-side menu for selecting the variables and configuration.

Interacting with Dashboards in Databricks. Image by the author.

Okay. After all the elements are added, the dashboard will look like this.

Completed Dashboard. Image by the author.

So good! And that concludes our project.

Before You Go

You can easily replicate this project in about an hour, depending on your experience with the Databricks ecosystem. While it's a quick build, it packs a lot in terms of the core engineering skills you'll get to exercise:

  • Architectural Design: You'll learn how to structure a modern Lakehouse environment from the ground up.
  • Seamless Data Integration: You'll bridge the gap between external web APIs and the Databricks platform for real-time data ingestion.
  • Clean, Modular Code: We move beyond simple scripts by using Python classes and functions to keep the codebase organized and maintainable.
  • Automation & Orchestration: You'll get hands-on experience scheduling jobs to ensure your project runs reliably on autopilot.
  • Delivering Real Value: The goal isn't just to move data; it's to produce value. By transforming raw weather metrics into actionable dressing suggestions via AI, we turn “cold data” into a useful service for the end user.

If you liked this content, you can find my contacts and more about me on my website.

https://gustavorsantos.me

GitHub Repository

Here is the repository for this project.

https://github.com/gurezende/Databricks-Climate-Pipeline

References

[1] OpenWeatherMap API: https://openweathermap.org/

[2] OpenAI Platform: https://platform.openai.com/

[3] Databricks Free Edition: https://www.databricks.com/learn/free-edition

[4] GitHub Repository: https://github.com/gurezende/Databricks-Climate-Pipeline
