

How To Use Synthetic Data To Build a Portfolio Project
Image by Author | Canva

 

Introduction

 
Finding real-world datasets can be challenging because they're often private (protected), incomplete (missing features), or expensive (behind a paywall). Synthetic datasets solve these problems by letting you generate data based on your project's needs.

Synthetic data is artificially generated data that mimics real-life datasets. You can control the size, complexity, and realism of a synthetic dataset to tailor it to your data needs.

In this article, we'll explore synthetic data generation methods. We will then build a portfolio project by examining the data, creating a machine learning model, and using AI to develop a complete portfolio project with a Streamlit app.

 

How to Generate Synthetic Data

 
Synthetic data is commonly created randomly, using simulations, rules, or AI.

 
How to Generate Synthetic Data
 

// Method 1: Random Data Generation

To generate data randomly, we'll use simple functions to create values without any specific rules.

It's useful for testing, but it won't capture realistic relationships between features. We'll do it using NumPy's random module and create a pandas DataFrame.

import numpy as np
import pandas as pd
np.random.seed(42)
df_random = pd.DataFrame({
    "feature_a": np.random.randint(1, 100, 5),
    "feature_b": np.random.rand(5),
    "feature_c": np.random.alternative(["X", "Y", "Z"], 5)
})
df_random.head()

 

Here is the output.

 
How to Generate Synthetic Data
 

// Method 2: Rule-Based Data Generation

Rule-based data generation is a smarter and more realistic method than random generation. It follows a precise formula or set of rules, which makes the output purposeful and consistent.

In our example, the size of a house is directly linked to its price. To show this clearly, we'll create a dataset with both size and price. We will define the relationship with a formula:

price = size × 300 + ε (random noise)

This way, you can see the correlation while keeping the data reasonably realistic.

np.random.seed(42)
n = 5
size = np.random.randint(500, 3500, n)
price = size * 300 + np.random.randint(5000, 20000, n)

df_rule = pd.DataFrame({
    "size_sqft": size,
    "price_usd": price
})
df_rule.head()

 

Here is the output.

 
How to Generate Synthetic Data
 

// Method 3: Simulation-Based Data Generation

Simulation-based data generation combines random variation with rules from the real world. This mix creates datasets that behave like real ones.

What do we know about housing?

  • Bigger homes usually cost more
  • Some cities cost more than others
  • There is a baseline price

How do we build the dataset?

  1. Pick a city at random
  2. Draw a home size
  3. Set bedrooms between 1 and 5
  4. Compute the price with a clear rule

Price rule: we start with a base price, apply a city price bump, and then add size × rate.

price_usd = base_price × city_bump + sqft × rate

Here is the code.

import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
CITIES = ["los_angeles", "san_francisco", "san_diego"]
# City price bump: higher means pricier city
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}

def make_data(n_rows=10):
    city = rng.choice(CITIES, size=n_rows)
    # Most homes are near 1,500 sqft, some smaller or larger
    sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
    beds = rng.integers(1, 6, n_rows)

    base = 220_000
    rate = 350  # dollars per sqft

    bump = np.array([CITY_BUMP[c] for c in city])
    price = base * bump + sqft * rate

    return pd.DataFrame({
        "city": city,
        "sqft": sqft.astype(int),
        "beds": beds,
        "price_usd": price.round(0).astype(int),
    })

df = make_data()
df.head()

 

Here is the output.

 
How do we build the synthetic dataset
 

// Method 4: AI-Powered Data Generation

To have AI create your dataset, you need a clear prompt. AI is powerful, but it works best when you set simple, good rules.

In the following prompt, we'll include:

  • Domain: What is the data about?
  • Features: Which columns do we want?
    • City, neighborhood, sqft, bedrooms, bathrooms
  • Relationships: How do the features connect?
    • Price depends on city, sqft, bedrooms, and crime index
  • Format: How should the AI return it?

Here is the prompt.

 

Generate Python code that creates a synthetic California real estate dataset.
The dataset should have 10,000 rows with columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, price_usd.
Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
Price should depend on city premium, sqft, bedrooms, bathrooms, lot size, school score, crime index, and distance from the city center.
Include some random noise, missing values, and a few outliers.
Return the result as a pandas DataFrame and save it to ‘ca_housing_synth.csv’

 

Let's use this prompt with ChatGPT.

 
How do we build the synthetic dataset
 

It returned the dataset as a CSV. Here is the process that shows how ChatGPT created it.

 
How do we build the synthetic dataset
 

This is the most complex dataset we have generated so far. Let's look at the first few rows.

 
How do we build the synthetic dataset

 

Building a Portfolio Project from Synthetic Data

 
We used four different methods to create a synthetic dataset. We will use the AI-generated data to build a portfolio project.

First, we'll explore the data and then build a machine learning model. Next, we'll visualize the results with Streamlit by leveraging AI, and in the final step, we'll cover which steps to follow to deploy the model to production.

 
Building a Portfolio Project from Synthetic Data

 

// Step 1: Exploring and Understanding the Synthetic Dataset

We'll start exploring the data by reading it with pandas and displaying the first few rows.

df = pd.read_csv("ca_housing_synth.csv")
df.head()

 

Here is the output.

 
Building a Portfolio Project from Synthetic Data
 

The dataset includes location (city, neighborhood, latitude, longitude) and property details (size, rooms, year, condition), as well as the target price. Let's check the column names, data types, and size using the info method.
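
The call itself is a one-liner:

df.info()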

 

Building a Portfolio Project from Synthetic Data
 

We have 15 columns, with some, like has_garage or dist_km_center, being quite specific.

 

// Step 2: Model Building

The next step is to build a machine learning model that predicts home prices.

We will follow these steps: define the numeric and categorical columns, split the data, build preprocessing pipelines for each column type, combine them with a random forest regressor in a single pipeline, train and evaluate the model, and plot actual vs. predicted prices.

Here is the code.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- Step 1: Define columns based on the generated dataset
num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built", 
            "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]

# --- Step 2: Split the data
X = df.drop(columns=["price_usd"])
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- Step 3: Preprocessing pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

# --- Step 4: Model
model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])

# --- Step 5: Train
pipeline.fit(X_train, y_train)

# --- Step 6: Evaluate
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# --- Step 7: (Optional) Permutation importance on a subset for speed
pi = permutation_importance(
    pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
    n_repeats=3, random_state=42, scoring="r2"
)

# --- Step 8: Plot actual vs. predicted prices
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.25)
vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([vmin, vmax], [vmin, vmax], linestyle="--", color="red")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
plt.tight_layout()
plt.show()

 

Here is the output.

 
Building a Portfolio Project from Synthetic Data
 

Model Performance:

  • MAE (85,877 USD): On average, predictions are off by about $86K, which is reasonable given the variability in housing prices
  • RMSE (113,512 USD): Larger errors are penalized more; RMSE confirms the model handles sizable deviations fairly well
  • R² (0.853): The model explains ~85% of the variance in home prices, showing strong predictive power for synthetic data
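
Before moving to the dashboard, it helps to persist the trained pipeline so the Streamlit app can load it later. A minimal sketch with joblib, using the real_estate_model.pkl filename referenced in the next step:

import joblib

# Save the full preprocessing + model pipeline to disk for the Streamlit app
joblib.dump(pipeline, "real_estate_model.pkl")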

 

// Step 3: Visualize the Data

In this step, we'll present our process, including EDA and model building, in a Streamlit dashboard. Why Streamlit? You can build a Streamlit dashboard quickly and easily deploy it for others to view and interact with.

Using Gemini CLI

To build the Streamlit application, we'll use Gemini CLI.

Gemini CLI is an AI-powered, open-source command-line agent. You can write code and build applications with it. It's simple and free.

To install it, run the following command in your terminal.

npm install -g @google/gemini-cli

 

After installing, start the agent by running the gemini command in your terminal.
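
gemini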

 

It will ask you to log in with your Google account, and then you'll see the screen where you'll build this Streamlit app.

 
Building a Portfolio Project from Synthetic Data
 

Building a Dashboard

To build a dashboard, we need to create a prompt tailored to our specific data and goal. In the following prompt, we explain everything the AI needs to build a Streamlit dashboard.

Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
Here is the dataset info: 
• Domain: California housing — Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
• Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
• Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
• Context: school_score, crime_index.
• Target: price_usd.
• Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
• Data you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).

The Streamlit app should have:
• A short dataset overview section (shape, column list, small preview).
• Sidebar inputs for every model feature except the target:
- Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
- Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
- Auto-compute dist_km_center from the selected city using the haversine formula and that city’s center.
• A Predict button that:
- Builds a one-row DataFrame with the exact training columns (order-safe).
- Calls pipeline.predict(...) from real_estate_model.pkl.
- Displays Estimated Price (USD) with thousands separators.
• One chart only: What-if: sqft vs price line chart (all other inputs fixed to the sidebar values).
- Quality of life: cache model load, basic input validation, clear labels/tooltips, English UI.
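
The prompt relies on the haversine formula to derive dist_km_center from the selected city. For reference, here is a minimal sketch of that computation; the city-center coordinates below are illustrative placeholders rather than values from the dataset:

import math

# Illustrative city-center coordinates (placeholders, not taken from the dataset)
CITY_CENTERS = {"los_angeles": (34.05, -118.24), "san_francisco": (37.77, -122.42)}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometers
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: distance from a listing's coordinates to the San Francisco center
center_lat, center_lon = CITY_CENTERS["san_francisco"]
dist_km_center = haversine_km(37.80, -122.41, center_lat, center_lon)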

 

Next, Gemini will ask for your permission to create this file.

 
Building a Portfolio Project from Synthetic Data
 

Let's approve and continue. Once it has finished coding, it will automatically open the Streamlit dashboard.

If not, go to the directory containing the app.py file and run streamlit run app.py to start the Streamlit app.

Here is our Streamlit dashboard.

 
Building a Portfolio Project from Synthetic Data
 

When you click on the data overview, you can see a section presenting the data exploration.

 
Building a Portfolio Project from Synthetic Data
 

From the property features on the left-hand side, we can customize the property and make predictions accordingly. This part of the dashboard represents what we did in model building, but with a more responsive look.
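
Under the hood, the Predict button does what the prompt describes: it builds a one-row DataFrame with the training feature columns and calls the saved pipeline. A rough sketch, assuming the pipeline was saved as real_estate_model.pkl and using illustrative input values:

import joblib
import pandas as pd

pipeline = joblib.load("real_estate_model.pkl")

# One row with the same feature columns used in training (values are illustrative)
row = pd.DataFrame([{
    "city": "san_francisco", "neighborhood": "Richmond",
    "latitude": 37.78, "longitude": -122.47,
    "sqft": 1500, "bedrooms": 3, "bathrooms": 2, "lot_sqft": 2500,
    "year_built": 1985, "property_type": "single_family",
    "has_garage": 1, "condition": "excellent",
    "school_score": 8.0, "crime_index": 30.0, "dist_km_center": 5.5,
}])

estimated_price = pipeline.predict(row)[0]
print(f"Estimated Price (USD): {estimated_price:,.0f}")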

Let's select Richmond, San Francisco, single-family, excellent condition, 1,500 sqft, and click the "Predict Price" button:

 
Building a Portfolio Project from Synthetic Data
 

The predicted price is $1.24M. You can also see the actual vs. predicted price for the entire dataset in the second graph if you scroll down.

 
Building a Portfolio Project from Synthetic Data
 

You can adjust more features in the left panel, like the year built, crime index, or the number of bathrooms.

 
Building a Portfolio Project from Synthetic Data
 

// Step 4: Deploy the Model

The next step is pushing your model to production. To do that, you can follow these steps:

  • Push app.py, ca_housing_synth.csv, real_estate_model.pkl, and a requirements.txt (a sample is sketched below) to a public GitHub repository
  • Sign in to Streamlit Community Cloud with your GitHub account and create a new app
  • Select the repository, branch, and app.py as the entry point, then deploy and share the app URL
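
A minimal requirements.txt for this app could look like the following; versions are left unpinned here, and joblib is only needed if the pipeline was saved with it:

streamlit
pandas
numpy
scikit-learn
joblib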

 

Final Thoughts

 
In this article, we discovered different methods to create synthetic datasets: random, rule-based, simulation-based, and AI-powered. Next, we built a portfolio data project, starting from data exploration and building a machine learning model.

We also used an open-source, command-line-based AI agent (Gemini CLI) to develop a dashboard that explores the dataset and predicts house prices based on selected features, including the number of bedrooms, crime index, and square footage.

Creating your own synthetic data allows you to avoid privacy hurdles, balance your examples, and move fast without costly data collection. The downside is that it can reflect your assumptions and miss real-world quirks. If you're looking for more inspiration, check out this list of machine learning projects that you can adapt for your portfolio.

Finally, we looked at how to push your model to production using Streamlit Community Cloud. Go ahead and follow these steps to build and showcase your portfolio project today!
 
 

Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


