I'm probably not the only one who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes I watch videos that seem to be trying to sell me something every couple of seconds.
Either way, it sure is annoying to be bombarded by ads.
In this blog post, I'll explore these sponsor segments using data from a popular browser extension called SponsorBlock, to figure out whether the perceived increase in ads actually happened, and to quantify how many ads I'm watching.
I'll walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All the code is available on my GitHub, and since the dataset is open, I'll also show you how to download it, so you can follow along and play with the data yourself.
These are the questions I'll be trying to answer in this analysis:
- Have sponsor segments increased over time?
- Which channels have the highest share of sponsor time per video?
- What's the density of sponsor segments throughout a video?
To get to these answers, we'll have to cover a lot of ground. Let's get started!
SponsorBlock is an extension that lets you skip ad segments in videos, similar to how you skip Netflix intros. It's incredibly accurate: I don't remember seeing a single wrong segment since I started using it around a month ago, and I watch a lot of smaller, non-English creators.
You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is crowdsourcing!
Users submit the timestamps of ad segments, and other users vote on whether they're accurate. For the average user, who isn't contributing at all, the only thing you have to do is press Enter to skip the ad.
Okay, now that you know what SponsorBlock is, let's talk about the data.
Cleaning the Data
If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take quite a few minutes to download all of it). The database schema can be seen here, although most of it won't be useful for this project.
As one might expect, their database schema is designed for the extension to work properly, not for some guy to basically leech off a huge community effort to find out what percentage of ads his favorite creator runs. So some work will be needed to clean and model the data.
The only two tables that matter for this analysis are:
- sponsorTimes.csv: This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
- videoInfo.csv: Contains the video title, publication date, and channel ID associated with each video.
Before we get into it, these are all the libraries I ended up using. I'll explain the less obvious ones as we go.
pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy
The first step, then, is to load the data. Surprisingly, this was already a bit tricky, as I was getting a lot of errors parsing some rows of the CSV. These were the settings I found to work for the majority of the rows:
import duckdb
import os
# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')
sponsor_times = con.read_csv(
"sb-mirror/sponsorTimes.csv",
header=True,
columns={
"videoID": "VARCHAR",
"startTime": "DOUBLE",
"endTime": "DOUBLE",
"votes": "INTEGER",
"locked": "INTEGER",
"incorrectVotes": "INTEGER",
"UUID": "VARCHAR",
"userID": "VARCHAR",
"timeSubmitted": "DOUBLE",
"views": "INTEGER",
"class": "VARCHAR",
"actionType": "VARCHAR",
"service": "VARCHAR",
"videoDuration": "DOUBLE",
"hidden": "INTEGER",
"fame": "DOUBLE",
"shadowHidden": "INTEGER",
"hashedVideoID": "VARCHAR",
"userAgent": "VARCHAR",
"description": "VARCHAR",
},
ignore_errors=True,
quotechar="",
)
video_info = con.read_csv(
"sb-mirror/videoInfo.csv",
header=True,
columns={
"videoID": "VARCHAR",
"channelID": "VARCHAR",
"title": "VARCHAR",
"revealed": "DOUBLE",
},
ignore_errors=True,
quotechar=None,
)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Here's what a sample of the data looks like:
con.sql("SELECT videoID, startTime, endTime, votes, locked, class FROM sponsor_times LIMIT 5")
con.sql("SELECT * FROM video_info LIMIT 5")


Understanding the data in the sponsorTimes table is ridiculously important; otherwise, the cleaning process won't make any sense.
Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will need to be handled during cleaning.
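To get a feel for how much duplication there is, a quick probe like this (my own check, not part of the original cleaning steps) counts submissions per video:
con.sql("""
    SELECT videoID, COUNT(*) AS submissions
    FROM sponsor_times
    GROUP BY videoID
    ORDER BY submissions DESC
    LIMIT 5
""")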
To find incorrect segments, I'll use the votes and locked columns, since the latter represents segments that were confirmed to be correct.
Another important column is category. There are a bunch of categories like Intro, Outro, Filler, etc. For this analysis, I'll only work with Sponsor and Self-Promo.
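Before filtering, it helps to see how the categories are distributed. This exploratory query is my own addition, not a step from the original cleaning:
con.sql("""
    SELECT category, COUNT(*) AS segments
    FROM sponsor_times
    GROUP BY category
    ORDER BY segments DESC
""")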
I started by applying some filters:
CREATE TABLE filtered AS
SELECT
*
FROM sponsor_times
WHERE category IN ('sponsor', 'selfpromo') AND (votes > 0 OR locked=1)
Filtering for locked segments or segments with more than 0 votes was a big decision. It reduced the dataset by a huge percentage, but doing so made the data very reliable. For example, before doing this, the entire Top 50 channels with the highest percentage of ads were just spam: random channels that ran 99.9% ads.
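If you want to measure the size of that reduction yourself, a comparison along these lines (my own sanity check, not from the original post) does the trick:
con.sql("""
    SELECT
        (SELECT COUNT(*) FROM filtered) AS kept_rows,
        (SELECT COUNT(*) FROM sponsor_times) AS total_rows,
        (SELECT COUNT(*) FROM filtered) * 100.0 / (SELECT COUNT(*) FROM sponsor_times) AS pct_kept
""")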
With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.
That is very much not the case so far, since for a single video we can have multiple user-submitted entries for each segment. To fix this, I'll use window functions to identify whether two or more rows of data represent the same segment.
The first window function compares the startTime of one row with the endTime of the previous one. If these values don't overlap, they are entries for separate segments; otherwise, they are repeated entries for the same segment.
CREATE TABLE new_segments AS
SELECT
-- Coalesce to TRUE to deal with the first row of each window,
-- since the values are NULL, but it should count as a new segment.
COALESCE(startTime > LAG(endTime)
OVER (PARTITION BY videoID ORDER BY startTime), true)
AS new_ad_segment,
*
FROM filtered

The new_ad_segment column is TRUE whenever a row represents a new segment of a video. The first two rows, since their timestamps overlap, are correctly marked as the same segment.
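To see the flag in action for a few rows, an inspection query like this one (my own addition) works:
con.sql("""
    SELECT videoID, startTime, endTime, new_ad_segment
    FROM new_segments
    ORDER BY videoID, startTime
    LIMIT 10
""")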
Next up, the second window function labels each ad segment with a number:
CREATE TABLE ad_segments AS
SELECT
SUM(new_ad_segment)
OVER (PARTITION BY videoID ORDER BY startTime)
AS ad_segment,
*
FROM new_segments

Finally, now that each segment is properly numbered, it's easy to keep the entry that is either locked or has the highest number of votes.
CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
*
FROM ad_segments
ORDER BY videoID, ad_segment, locked DESC, votes DESC

That's it! Now this table has one row for each unique ad segment, and I can start exploring the data.
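As a last sanity check (again, my own addition), it's worth confirming that no video ended up with an absurd number of deduplicated segments:
con.sql("""
    SELECT videoID, COUNT(*) AS segments_per_video
    FROM unique_segments
    GROUP BY videoID
    ORDER BY segments_per_video DESC
    LIMIT 5
""")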
If these queries feel complicated and you need a refresher on window functions, check out this blog post that will teach you everything you need to know about them! The last example covered there is almost exactly the process I used here.
Exploring and Improving the Data
Finally, the dataset is good enough to start exploring. The first thing I did was get a sense of the size of the data (a rough version of the counting query is shown after the list):
- 36.0k Unique Channels
- 552.6k Unique Videos
- 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video
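A counting query along these lines (my own reconstruction, which assumes videoInfo has one row per video) produces those numbers:
con.sql("""
    SELECT
        COUNT(DISTINCT video_info.channelID) AS unique_channels,
        COUNT(DISTINCT unique_segments.videoID) AS unique_videos,
        COUNT(*) AS sponsor_segments
    FROM unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID
""")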
As mentioned earlier, filtering for segments that were either locked or had at least 1 upvote reduced the dataset massively, by around 80%. But that was the price I had to pay to have data I could work with.
To check that there was nothing immediately wrong with the data, I gathered the channels with the most videos:
CREATE TABLE top_5_channels AS
SELECT
channelID,
count(DISTINCT unique_segments.videoID) AS video_count
FROM
unique_segments
LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID
WHERE
channelID IS NOT NULL
-- Some channel IDs are blank
AND channelID != '""'
GROUP BY
channelID
ORDER BY
video_count DESC
LIMIT 5

The number of videos per channel looks realistic… But this is terrible to work with. I don't want to go to my browser and look up channel IDs every time I want to know the name of a channel.
To fix this, I created a small script with functions that fetch these values from the YouTube API in Python. I'm using the requests_cache library to make sure I won't repeat API calls and deplete the API limits.
import requests
import requests_cache
from dotenv import load_dotenv
import os
load_dotenv()
API_KEY = os.getenv("YT_API_KEY")
# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)
def get_channel_name(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()
    try:
        return data.get("items", [])[0].get("snippet", {}).get("title", "")
    except (IndexError, AttributeError):
        return ""
Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you're interested in the code, check the GitHub repo.
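As a rough idea of what those helpers look like (a sketch following the same pattern as get_channel_name; the exact versions live in the repo), the country one could be:
def get_channel_country(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()
    try:
        # The snippet may or may not include a country for a given channel
        return data.get("items", [])[0].get("snippet", {}).get("country", "")
    except (IndexError, AttributeError):
        return ""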
In my DuckDB code, I'm now able to register these Python functions and call them inside SQL! I just have to be very careful to always use them on aggregated and filtered data; otherwise, I can say bye-bye to my API quota.
# This is the script created above
from youtube_api import get_channel_name
# Try registering the function, ignore if it already exists
try:
    con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
    print(f"Skipping function registration (likely already exists): {e}")
# Get the channel names
channel_names = con.sql("""
    select
        channelID,
        get_channel_name(channelID) as channel_name,
        video_count
    from top_5_channels
""")

Much better! I looked up two channels that I'm familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough to me!
The last thing to do, before moving on to actually answering the questions I set out to answer, is to get an idea of the typical duration of the videos.

This matches my expectations, for the most part. I'm still a bit surprised by the number of 20-40-minute videos, since for many years the "meta" was to make videos around 10 minutes long to maximize YouTube's own ads.
Also, I thought the video-duration buckets used in the previous graph were quite representative of how I think about video lengths, so I'll be sticking with them for the next sections.
For reference, this is the pandas code used to create these buckets.
video_lengths = con.sql("""
SELECT DISTINCT ON (videoID)
videoID,
videoDuration
FROM
unique_segments
WHERE
videoID IS NOT NULL
AND videoDuration > 0
"""
).df()
import pandas as pd
# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999]
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]
# Assign each video to a bucket (transform duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)
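To double-check that the buckets look sensible, a one-liner like this (my own quick check) prints how many videos fall into each one:
print(video_lengths["duration_bucket"].value_counts().sort_index())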
Now for the big question: have sponsor segments increased over time? This will prove whether or not I'm being paranoid about everyone trying to sell me something at all times. I'll start, though, by answering a simpler question: what share of a video's runtime is sponsored, for different video durations?
My expectation is that shorter videos dedicate a higher percentage of their runtime to sponsors compared to longer videos. Let's check whether that's actually the case.
CREATE TABLE video_total_ads AS
SELECT
videoID,
MAX(videoDuration) AS videoDuration,
SUM(endTime - startTime) AS total_ad_duration,
SUM(endTime - startTime) / 60 AS ad_minutes,
SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
MAX(videoDuration) / 60 AS video_duration_minutes
FROM
unique_segments
WHERE
videoDuration > 0
AND videoDuration < 5400
AND videoID IS NOT NULL
GROUP BY
videoID
To keep the visualization simple, I'm applying similar buckets, but only up to 90 minutes.
# Define duration buckets (in minutes, up to 90 min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]
# Pull the DuckDB table into a pandas DataFrame
video_total_ads = con.table("video_total_ads").df()
# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)
# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()
# Convert to a percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you're watching shorter-form content on YouTube, then around 10% of it is sponsored! Videos of 12–20 minutes have 6.5% sponsor time, while 20–30 minute videos have only 4.8%.
To move on to the year-by-year analysis, I need to join the sponsor times with the videoInfo table.
CREATE TABLE video_total_ads_joined AS
SELECT
*
FROM
video_total_ads
LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID
Next, let's check how many videos we have per year:
SELECT
*,
to_timestamp(NULLIF (published, 0)) AS published_date,
extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year
FROM
video_total_ads_joined

Not good, not good at all. I'm not exactly sure why, but there are a lot of videos that don't have the timestamp recorded. It seems that only in 2021 and 2022 were videos reliably stored with their published date.
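To quantify the gap, a per-year count like this one (my own addition to the flow) makes the missing data obvious:
con.sql("""
    SELECT
        extract(year FROM to_timestamp(NULLIF(published, 0))) AS published_year,
        COUNT(*) AS videos
    FROM video_total_ads_joined
    GROUP BY published_year
    ORDER BY published_year
""")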
I do have some ideas on how to enrich this dataset with other public data, but it's a very time-consuming process and I'll leave it for a future blog post. I don't intend to settle for an answer based on limited data, but for now, I have to make do with what I have.
I chose to keep the analysis between the years 2018 and 2023, given that these years had more data points.
# Limiting to years for which I have a fair amount of data.
start_year = 2018
end_year = 2023
plot_df = (
con.table("video_total_ads_joined").df()
.query(f"published_year >= {start_year} and published_year <= {end_year}")
.groupby(["published_year", "duration_bucket"], as_index=False)
[["ad_minutes", "video_duration_minutes"]]
.sum()
)
# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There's a steep increase in ad percentage, especially from 2020 to 2021, but afterward it plateaus, especially for longer videos. This makes a lot of sense, since during those years online advertising grew considerably as people spent more and more time at home.
For shorter videos, there does seem to be an increase from 2022 to 2023. But since the data is limited, and I don't have data for 2024, I can't get a conclusive answer on this.
Next up, let's move on to questions that don't depend on the publishing date; this way I can work with a larger portion of the dataset.
This one is fun for me, as I wonder whether the channels I actively watch are the ones that run the most ads.
Continuing from the table created previously, I can simply group the ad and video amounts by channel:
CREATE TABLE ad_percentage_per_channel AS
SELECT
channelID,
sum(ad_minutes) AS channel_total_ad_minutes,
sum(videoDuration) / 60 AS channel_total_video_minutes,
-- Share of each channel's total runtime that is sponsored (referenced by the next query)
sum(ad_minutes) / (sum(videoDuration) / 60) * 100 AS channel_ad_percentage
FROM
video_total_ads_joined
GROUP BY
channelID
I decided to filter for channels that had at least 30 hours of video in the data, as a way of eliminating outliers.
SELECT
channelID,
channel_total_video_minutes,
channel_total_ad_minutes,
channel_ad_percentage
FROM
ad_percentage_per_channel
WHERE
-- At least 30 hours of video
channel_total_video_minutes > 1800
AND channelID IS NOT NULL
ORDER BY
channel_ad_percentage DESC
LIMIT 50
As briefly mentioned earlier, I also created functions to get the country and thumbnail of each channel. This allowed me to create this visualization.

I'm not sure whether this surprised me or not. Some of the channels on this list I watch very often, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.
I also know that both he and Corridor Crew (#32) do a lot of self-sponsoring, promoting their own content and products, so maybe that's also the case for other channels!
In any case, the data looks good, and the percentages seem to match my manual checks and personal experience.
I'd love to know whether channels you watch are present on this list, and whether it surprised you or not!
If you want to see the Top 150 Creators, subscribe to my free newsletter, where I'll be publishing the full list as well as more details about this analysis!
Have you ever thought about at which point of a video ads work best? People probably just skip sponsor segments placed at the beginning, and simply move on and close the video for those placed at the end.
From personal experience, I feel that I'm more likely to watch an ad if it plays around the middle of a video, but I don't think that's what creators usually do.
My goal, then, is to create a heatmap that shows the density of ads across a video's runtime. Doing this was surprisingly non-obvious, and the solution I found was so clever that it kind of blew my mind. Let me show you.
This is the data needed for this analysis. One row per ad, with the timestamps of when each segment starts and ends:

The first step is to normalize the intervals: e.g., I don't care that an ad started at 63 seconds; what I want to know is whether it started at 1% of the video's runtime or at 50%.
CREATE TABLE ad_intervals AS
SELECT
videoID,
startTime,
endTime,
videoDuration,
startTime / videoDuration AS start_fraction,
endTime / videoDuration AS end_fraction
FROM
unique_segments
WHERE
-- Just to make sure we don't have bad data
videoID IS NOT NULL
AND startTime >= 0
AND endTime <= videoDuration
AND startTime < endTime
-- Less than 40h
AND videoDuration < 144000

Great, now all the intervals are comparable, but the problem is far from solved.
I want you to think about how you would solve this. What if I asked you, "At 10% of the runtime, across all videos, how many ads are running?"
I don't believe this is an obvious problem to solve. My first instinct was to create a bunch of buckets and then, for each row, ask "Is there an ad running at 1% of the runtime? What about at 2%? And so on…"
This seemed like a terrible idea, though. I wouldn't be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the implementation of the solution I found was remarkably simple, using the Sweep Line Algorithm, an algorithm that often shows up in programming interviews and puzzles.
I'll show you how I solved it, but don't worry if you don't understand what's happening. I'll share other resources for you to learn more about it later on.
The first thing to do is to transform each interval (startTime, endTime) into two events: one that counts as +1 when the ad starts, and another that counts as -1 when the ad finishes. Afterward, just order the dataset by start time.
CREATE TABLE ad_events AS
WITH unioned as (
-- This is the most important step.
SELECT
videoID,
start_fraction as fraction,
1 as delta
FROM ad_intervals
UNION ALL
SELECT
videoID,
end_fraction as fraction,
-1 as delta
FROM ad_intervals
), ordered AS (
SELECT
videoID,
fraction,
delta
FROM unioned
ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it's already much easier to see the path forward! All I have to do is apply a running sum over the delta column, and then, at any point in the dataset, I know how many ads are running!
For example, if three ads started between 0s and 10s, but two of them also finished, I would have a delta of +3 followed by -2, which means that only one ad is currently running.
Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal places and aggregate them. This isn't strictly necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the number of running ads by the total number of videos, to express it as a percentage.
CREATE TABLE ad_counter AS
WITH rounded_and_grouped AS (
SELECT
ROUND(fraction, 4) as fraction,
SUM(delta) as delta
FROM ad_events
GROUP BY ROUND(fraction, 4)
ORDER BY fraction
), running_sum AS (
SELECT
fraction,
SUM(delta) OVER (ORDER BY fraction) as ad_counter
FROM rounded_and_grouped
), density AS (
SELECT
fraction,
ad_counter,
ad_counter / (SELECT COUNT(DISTINCT videoID) FROM ad_intervals) as density
FROM running_sum
)
SELECT * FROM density

With this data, not only do I know that at the very start of videos (0.0% fraction) there are 69,987 videos running ads, I also know this represents 17% of all videos in the dataset.
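Since I promised pandas snippets as well, here is a minimal sketch of the same sweep-line counting done in pandas. It assumes the ad_intervals table created above and is my own equivalent, not the exact code from the repo:
import pandas as pd
intervals = con.sql("SELECT videoID, start_fraction, end_fraction FROM ad_intervals").df()
# Turn every interval into two events: +1 when an ad starts, -1 when it ends
events = pd.concat([
    pd.DataFrame({"fraction": intervals["start_fraction"], "delta": 1}),
    pd.DataFrame({"fraction": intervals["end_fraction"], "delta": -1}),
])
# Round, aggregate, and take the running sum of the deltas
ad_counter_df = (
    events.assign(fraction=events["fraction"].round(4))
    .groupby("fraction", as_index=False)["delta"].sum()
    .sort_values("fraction")
)
ad_counter_df["running_ads"] = ad_counter_df["delta"].cumsum()
ad_counter_df["density"] = ad_counter_df["running_ads"] / intervals["videoID"].nunique()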
Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it's much more common for channels to run ads at the beginning and end of a video. It's also interesting that there's a plateau around the middle of the video, followed by a drop, as the second half of a video is generally more ad-free.
What I found funny is that it's apparently common for some videos to start right away with an ad. I couldn't picture this, so I manually checked 10 videos, and it's actually true… I'm not sure how representative this is, but most of the ones I opened were gaming-related and in Russian, and they started immediately with ads!
Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised by how simple it was with the Sweep Line trick. If you want to know more about it, I recently published a blog post covering some SQL patterns, and the last one is exactly this problem, just repackaged in the context of counting concurrent meetings.
Conclusion
I really enjoyed doing this analysis, since the data feels very personal to me, especially because I've been hooked on YouTube lately. I also feel that the answers I found were quite satisfying, at least for the most part. To finish off, let's do a last recap!
Have Sponsor Segments Increased Over the Years?
There was a clear increase from 2020 to 2021. This was an effect that happened across all digital media, and it clearly shows in this data. For more recent years, I can't say whether there was an increase or not, as I don't have enough data to be confident.
Which Channels Have the Highest Percentage of Sponsor Time Per Video?
I managed to build a very convincing list of the Top 50 channels that run the highest amount of ads. And I discovered that some of my favorite creators are the ones who spend the most time trying to sell me something!
What's the Density of Sponsor Segments Throughout a Video?
As expected, most people run ads at the beginning and end of videos. Besides that, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free.
Also, there are YouTubers who start a video straight away with ads, which I think is a crazy strategy.
Other Learnings and Next Steps
I liked how clearly the data showed the percentage of ads for different video sizes. Now I know that I'm probably spending 5–6% of my time on YouTube watching ads if I don't skip them, since I mostly watch videos that are 10–20 minutes long.
I'm still not entirely happy with the year-by-year analysis, though. I've already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I'm confident I can use it, together with the YouTube API, to fill some gaps and get a more convincing answer to my question.
Visualization Code
You might have noticed that I didn't provide snippets for plotting the charts shown here. This was on purpose, to keep the blog post more readable, as matplotlib code takes up a lot of space.
You can find all the code in my GitHub repo; that way you can copy my charts if you want to.
That's it for this one! I really hope you enjoyed reading this blog post and learned something new!
If you're curious about interesting topics that didn't make it into this post, or enjoy reading about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.
Want to connect directly or have questions? Reach out anytime at mtrentz.com.
All images and animations are by the author unless stated otherwise.