
Image by Editor | ChatGPT
# Introduction
Welcome to Python for Data Science, a free 7-day mini course for beginners! If you're starting out with data science or want to learn basic Python skills, this beginner-friendly course is for you. Over the next seven days, you'll learn how to work on data tasks using only core Python.
You'll learn how to:
- Work with fundamental Python data structures
- Clean and prepare messy text data
- Summarize and group data with dictionaries (just like you do in SQL or Excel)
- Write reusable functions that keep your code neat and efficient
- Handle errors gracefully so your scripts don't crash on messy input data
- And finally, you'll build a simple data profiling tool to inspect any CSV dataset
Let's get started!
# Day 1: Variables, Data Types, and File I/O
In data science, everything starts with raw data: survey responses, logs, spreadsheets, forms, scraped websites, and so on. Before you can model or analyze anything, you need to:
- Load the data
- Understand its shape and types
- Begin to clean or inspect it
Today, you will learn:
- The basic Python data types
- How to read and write raw .txt files
// 1. Variables
In Python, a variable is a named reference to a value. In data terms, you can think of variables as fields, columns, or metadata.
filename = "responses.txt"
survey_name = "Q3 Customer Feedback"
max_entries = 100
// 2. Data Types You'll Use Often
Don't worry about obscure types just yet. You'll mostly use the following:
Python Type | What It's Used For | Example |
---|---|---|
str | Raw text, column names | "age", "unknown" |
int | Counts, discrete variables | 42, 0, -3 |
float | Continuous variables | 3.14, 0.0, -100.5 |
bool | Flags / binary outcomes | True, False |
None | Missing/null values | None |
Knowing which type you're dealing with, and how to check or convert it, is step zero in data cleaning.
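For example, here's a minimal sketch (with made-up sample values) of checking and converting types as you read raw data:
raw_age = "42"               # numbers often arrive as text
age = int(raw_age)           # convert to int for arithmetic
print(isinstance(age, int))  # True

raw_score = "3.7"
score = float(raw_score)     # convert to float for continuous values

missing = None
if missing is None:          # the standard way to test for a missing value
    print("value is missing")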
// 3. File Input: Reading Raw Data
Most real-world data lives in .txt, .csv, or .log files. You'll often need to load them line by line, not all at once (especially if the files are large).
Let's say you have a file called responses.txt with the following lines:
Yes
No
Yes
Maybe
No
Here's how you read it:
with open("responses.txt", "r") as file:
    lines = file.readlines()

for i, line in enumerate(lines):
    cleaned = line.strip()  # removes \n and spaces
    print(f"{i + 1}: {cleaned}")
Output:
1: Yes
2: No
3: Yes
4: Maybe
5: No
// 4. File Output: Writing Processed Data
Let's say you want to save only the "Yes" responses to a new file:
with open("responses.txt", "r") as infile:
    lines = infile.readlines()

yes_responses = []
for line in lines:
    if line.strip().lower() == "yes":
        yes_responses.append(line.strip())

with open("yes_only.txt", "w") as outfile:
    for item in yes_responses:
        outfile.write(item + "\n")
This is a super simple version of a filter-transform-save pipeline, a concept used daily in data preprocessing.
// ⏭️ Exercise: Write Your First Data Script
Create a file called survey.txt and copy in a handful of survey responses (mixed-case yes/no/maybe answers work well, for example: Yes, no, YES, Maybe, no).
Now write a Python script that does the following (one possible sketch appears after the list):
- Reads the file
- Counts how many times "yes" appears (case-insensitive). You'll learn more about working with strings later in the course, but do give it a go!
- Prints the count
- Writes a clean version of the data (capitalized, no whitespace) to cleaned_survey.txt
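Here's one possible solution sketch, assuming survey.txt sits in the same directory as the script:
yes_count = 0
cleaned_lines = []

with open("survey.txt", "r") as file:
    for line in file:
        answer = line.strip()
        if not answer:
            continue  # skip blank lines
        if answer.lower() == "yes":
            yes_count += 1
        cleaned_lines.append(answer.capitalize())

print(f"'yes' appears {yes_count} times")

with open("cleaned_survey.txt", "w") as outfile:
    for answer in cleaned_lines:
        outfile.write(answer + "\n")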
# Day 2: Basic Python Data Structures
Data science is all about organizing and structuring data so it can be cleaned, analyzed, or modeled. Today you'll learn the four essential data structures in core Python and how to use them for actual data tasks:
- list: for sequences of rows
- tuple: for fixed-position records
- dict: for labeled data (like columns)
- set: for tracking unique values
// 1. List: For Sequences of Data Rows
Lists are the most versatile and common structure, suitable for representing:
- A column of values
- A collection of records
- A dataset of unknown size
Example: Read values from a file into a list.
with open("scores.txt", "r") as file:
    scores = [float(line.strip()) for line in file]

print(scores)
This prints the scores as a list of floats. Now you can compute summary statistics:
average = sum(scores) / len(scores)
print(f"Average score: {average:.2f}")
// 2. Tuple: For Fixed-Structure Records
Tuples are like lists, but immutable and best used for rows with a known structure, e.g., (name, age).
Example: Read a file of names and ages.
Suppose we have the following people.txt:
Alice, 34
Bob, 29
Eve, 41
Now let's read in the contents of the file:
with open("people.txt", "r") as file:
    records = []
    for line in file:
        name, age = line.strip().split(",")
        records.append((name.strip(), int(age.strip())))
Now you can access fields by position:
for person in records:
    name, age = person
    if age > 30:
        print(f"{name} is over 30.")
// 3. Dict: For Labeled Data (Like Columns)
Dictionaries store key-value pairs, the closest thing in core Python to a table row with named columns.
Example: Convert each person record into a dict:
people = []

with open("people.txt", "r") as file:
    for line in file:
        name, age = line.strip().split(",")
        person = {
            "name": name.strip(),
            "age": int(age.strip())
        }
        people.append(person)
Now your data is much more readable and flexible:
for person in people:
    if person["age"] < 60:
        print(f"{person['name']} is probably a working professional.")
// 4. Set: For Uniqueness & Fast Membership Checks
Sets automatically remove duplicates, so they're great for:
- Counting unique categories
- Checking if a value has been seen before (a short sketch of this follows the example below)
- Tracking distinct values without order
Example: From a file of emails, find all unique domains.
domains = set()

with open("emails.txt", "r") as file:
    for line in file:
        email = line.strip().lower()
        if "@" in email:
            domain = email.split("@")[1]
            domains.add(domain)

print(domains)
Output:
{'gmail.com', 'yahoo.com', 'example.org'}
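Sets also give you fast membership checks, the "have I seen this value before?" pattern mentioned above. A tiny sketch with made-up emails:
seen = set()

for email in ["a@gmail.com", "b@yahoo.com", "c@gmail.com"]:
    domain = email.split("@")[1]
    if domain in seen:
        print(f"Already seen: {domain}")
    seen.add(domain)

print("gmail.com" in seen)  # True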
// ⏭️ Exercise: Code a Mini Data Inspector
Create a file called dataset.txt containing a few comma-separated lines, one person per line, with name, age, and role fields.
Now write a Python script that does the following (one possible sketch appears after the list):
- Reads each line and stores it as a dictionary with keys: name, age, role
- Counts how many people are in each role (use a dictionary) and the number of unique ages (use a set)
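One possible sketch, assuming dataset.txt contains comma-separated name, age, role lines:
people = []
role_counts = {}
ages = set()

with open("dataset.txt", "r") as file:
    for line in file:
        name, age, role = [part.strip() for part in line.strip().split(",")]
        person = {"name": name, "age": int(age), "role": role}
        people.append(person)

        if role not in role_counts:
            role_counts[role] = 0
        role_counts[role] += 1
        ages.add(person["age"])

print(role_counts)
print(f"Unique ages: {len(ages)}")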
# Day 3: Working with Strings
Text strings are everywhere in real-world datasets: survey responses, user bios, job titles, product reviews, emails, and more. But they're also inconsistent and unpredictable.
Today, you'll learn to:
- Clean and standardize raw text
- Extract information from strings
- Build simple text-based features (the kind you can use for filtering or modeling)
// 1. Basic String Cleaning
Let's say you get this raw list of job titles from a CSV:
titles = [
    "  Data Scientist\n",
    "data scientist",
    "Senior Data Scientist  ",
    "DATA scientist",
    "Data engineer",
    "Data Scientist"
]
Your job? Normalize it.
cleaned = [title.strip().lower() for title in titles]
Now everything is lowercase and whitespace-free.
Output:
['data scientist', 'data scientist', 'senior data scientist', 'data scientist', 'data engineer', 'data scientist']
// 2. Standardizing Values
Let's say you're only interested in identifying data scientists.
standardized = []
for title in cleaned:
    if "data scientist" in title:
        standardized.append("data scientist")
    else:
        standardized.append(title)
// 3. Counting Words, Checking Patterns
Useful text features:
- Number of words
- Whether a string contains a keyword
- Whether a string is a number or an email
Example:
text = " The price is $5,000! "

# Clean up
clean = text.strip().lower().replace("$", "").replace(",", "").replace("!", "")
print(clean)

# Word count
word_count = len(clean.split())

# Contains a digit
has_number = any(char.isdigit() for char in clean)

print(word_count)
print(has_number)
Output:
the price is 5000
4
True
// 4. Splitting and Extracting Parts
Let's take the email example:
email = " Alice.Johnson@Example.com "
email = email.strip().lower()

username, domain = email.split("@")
print(f"User: {username}, Domain: {domain}")
This prints:
User: alice.johnson, Domain: example.com
This kind of extraction is used in user behavior analysis, spam detection, and the like.
// 5. Detecting Specific Text Patterns
You don't need regular expressions for basic pattern checks.
Example: Check if someone mentioned "python" in a free-text response:
comment = "I am learning Python and SQL for data jobs."

if "python" in comment.lower():
    print("Mentioned Python")
// ⏭️ Exercise: Clean Survey Comments
Create a file called comments.txt with the following lines:
Great course! Loved the pacing.
Not enough Python examples.
Too basic for experienced users.
python is exactly what I needed!
Would like more SQL content.
Excellent – very beginner-friendly.
Now write a Python script that does the following (one possible sketch appears after the list):
- Cleans each comment (strip, lowercase, remove punctuation)
- Prints the total number of comments, how many mention "python", and the average word count per comment
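One possible sketch, assuming comments.txt holds the lines above:
comments = []
python_mentions = 0
total_words = 0

with open("comments.txt", "r") as file:
    for line in file:
        cleaned = line.strip().lower()
        for char in "!.,?-–":
            cleaned = cleaned.replace(char, "")  # crude punctuation removal
        if not cleaned:
            continue
        comments.append(cleaned)
        if "python" in cleaned:
            python_mentions += 1
        total_words += len(cleaned.split())

print(f"Total comments: {len(comments)}")
print(f"Mentions of 'python': {python_mentions}")
print(f"Average word count: {total_words / len(comments):.1f}")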
# Day 4: Group, Count, & Summarize with Dictionaries
You've used dict to store labeled records. Today, you'll go a level deeper: using dictionaries to group, count, and summarize data, just like a pivot table or a GROUP BY in SQL.
// 1. Grouping by a Field
Let's say you have this data.
data = [
    {"name": "Alice", "city": "London"},
    {"name": "Bob", "city": "Paris"},
    {"name": "Eve", "city": "London"},
    {"name": "John", "city": "New York"},
    {"name": "Dana", "city": "Paris"},
]
Goal: Count how many people are in each city.
city_counts = {}

for person in data:
    city = person["city"]
    if city not in city_counts:
        city_counts[city] = 1
    else:
        city_counts[city] += 1

print(city_counts)
Output:
{'London': 2, 'Paris': 2, 'New York': 1}
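A shorter way to write the same counting loop uses dict.get with a default value, a pattern you'll see again in the next section:
city_counts = {}
for person in data:
    city_counts[person["city"]] = city_counts.get(person["city"], 0) + 1
print(city_counts)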
// 2. Summing a Field by Category
Now let's say we have:
salaries = [
    {"role": "Engineer", "salary": 75000},
    {"role": "Analyst", "salary": 62000},
    {"role": "Engineer", "salary": 80000},
    {"role": "Manager", "salary": 95000},
    {"role": "Analyst", "salary": 64000},
]
Goal: Calculate the total and average salary per role.
totals = {}
counts = {}

for person in salaries:
    role = person["role"]
    salary = person["salary"]
    totals[role] = totals.get(role, 0) + salary
    counts[role] = counts.get(role, 0) + 1

averages = {role: totals[role] / counts[role] for role in totals}
print(averages)
Output:
{'Engineer': 77500.0, 'Analyst': 63000.0, 'Manager': 95000.0}
// 3. Frequency Table (Mode Detection)
Find the most common age in a dataset:
ages = [29, 34, 29, 41, 34, 29]

freq = {}
for age in ages:
    freq[age] = freq.get(age, 0) + 1

most_common = max(freq.items(), key=lambda x: x[1])
print(f"Most common age: {most_common[0]} (appears {most_common[1]} times)")
Output:
Most common age: 29 (appears 3 times)
// ⏭️ Exercise: Analyze an Employee Dataset
Create a file employees.txt with the following content:
Alice,London,Engineer,75000
Bob,Paris,Analyst,62000
Eve,London,Engineer,80000
John,New York,Manager,95000
Dana,Paris,Analyst,64000
Write a Python script that does the following (one possible sketch appears after the list):
- Loads the data into a list of dictionaries
- Prints the number of employees per city and the average salary per role
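One possible sketch, reusing the grouping patterns from above and assuming employees.txt is in the working directory:
employees = []

with open("employees.txt", "r") as file:
    for line in file:
        name, city, role, salary = line.strip().split(",")
        employees.append({"name": name, "city": city, "role": role, "salary": int(salary)})

city_counts = {}
totals = {}
counts = {}

for emp in employees:
    city_counts[emp["city"]] = city_counts.get(emp["city"], 0) + 1
    totals[emp["role"]] = totals.get(emp["role"], 0) + emp["salary"]
    counts[emp["role"]] = counts.get(emp["role"], 0) + 1

print(city_counts)
print({role: totals[role] / counts[role] for role in totals})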
# Day 5: Writing Functions
You've written code that loads, cleans, filters, and summarizes data. Now you'll package that logic into functions, so you can:
- Reuse your code
- Build processing pipelines
- Keep scripts readable and testable
// 1. Cleaning Text Inputs
Let's write a function to perform basic text cleaning:
def clean_text(text):
    return text.strip().lower().replace(",", "").replace("$", "")
Now you can apply this to every field you read from a file.
// 2. Creating Row Records
Next, here's a simple function to parse each row in a file and create a record:
def parse_row(line):
    parts = line.strip().split(",")
    return {
        "name": parts[0],
        "city": parts[1],
        "role": parts[2],
        "salary": int(parts[3])
    }
Now your file loading becomes:
with open("employees.txt") as file:
    rows = [parse_row(line) for line in file]
// 3. Aggregation Helpers
So far, you've computed averages and counts of occurrences. Let's write some basic helper functions for the same:
def average(values):
    return sum(values) / len(values) if values else 0

def count_by_key(data, key):
    counts = {}
    for item in data:
        k = item[key]
        counts[k] = counts.get(k, 0) + 1
    return counts
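A quick usage example with the rows loaded above:
salaries = [row["salary"] for row in rows]
print(f"Average salary: {average(salaries):.2f}")
print(count_by_key(rows, "city"))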
// ⏭️ Exercise: Modularize Earlier Work
Refactor yesterday's solution into reusable functions:
load_data(filename)
average_salary_by_role(data)
count_by_city(data)
Then use them in a script that prints the same output as Day 4 (one possible skeleton is sketched below).
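One possible skeleton, reusing parse_row and count_by_key from earlier today:
def load_data(filename):
    with open(filename) as file:
        return [parse_row(line) for line in file]

def count_by_city(data):
    return count_by_key(data, "city")

def average_salary_by_role(data):
    totals = {}
    counts = {}
    for row in data:
        totals[row["role"]] = totals.get(row["role"], 0) + row["salary"]
        counts[row["role"]] = counts.get(row["role"], 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

data = load_data("employees.txt")
print(count_by_city(data))
print(average_salary_by_role(data))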
# Day 6: Reading, Writing, and Basic Error Handling
Data files are often incomplete, corrupted, or misformatted. So how do you deal with them?
Today you'll learn:
- How to read and write structured files
- How to handle errors gracefully
- How to skip or log bad rows without crashing
// 1. Safer File Reading
What happens when you try reading a file that doesn't exist? Here's how you "try" opening the file and catch a FileNotFoundError if it doesn't exist.
try:
    with open("employees.txt") as file:
        lines = file.readlines()
except FileNotFoundError:
    print("Error: File not found.")
    lines = []
// 2. Handling Bad Rows Gracefully
Now let's skip bad rows and process only the complete ones.
records = []

for line in lines:
    try:
        parts = line.strip().split(",")
        if len(parts) != 4:
            raise ValueError("Incorrect number of fields")
        record = {
            "name": parts[0],
            "city": parts[1],
            "role": parts[2],
            "salary": int(parts[3])
        }
        records.append(record)
    except Exception as e:
        print(f"Skipping bad line: {line.strip()} ({e})")
// 3. Writing Cleaned Data to a File
Finally, let's write the cleaned data to a file.
with open("cleaned_employees.txt", "w") as out:
    for r in records:
        out.write(f"{r['name']},{r['city']},{r['role']},{r['salary']}\n")
// ⏭️ Exercise: Make a Fault-Tolerant Loader
Create a file raw_employees.txt with a few incomplete or messy lines like:
Alice,London,Engineer,75000
Bob,Paris,Analyst
Eve,London,Engineer,eighty thousand
John,New York,Manager,95000
Write a script that does the following (one possible sketch appears after the list):
- Loads only valid records
- Prints the number of valid rows
- Writes them to validated_employees.txt
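One possible sketch of the fault-tolerant loader:
valid_records = []

try:
    with open("raw_employees.txt") as file:
        lines = file.readlines()
except FileNotFoundError:
    print("Error: File not found.")
    lines = []

for line in lines:
    try:
        name, city, role, salary = line.strip().split(",")
        valid_records.append({"name": name, "city": city, "role": role, "salary": int(salary)})
    except ValueError as e:
        print(f"Skipping bad line: {line.strip()} ({e})")

print(f"Valid rows: {len(valid_records)}")

with open("validated_employees.txt", "w") as out:
    for r in valid_records:
        out.write(f"{r['name']},{r['city']},{r['role']},{r['salary']}\n")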
# Day 7: Build a Mini Data Profiler (Project Day)
Great work on making it this far. Today, you'll create a standalone Python script that:
- Loads a CSV file
- Detects column names and types
- Computes useful stats
- Writes a summary report
// Step-by-Step Outline
1. Load the file:
def load_csv(filename):
    with open(filename) as f:
        lines = [line.strip() for line in f if line.strip()]
    header = lines[0].split(",")
    rows = [line.split(",") for line in lines[1:]]
    return header, rows
2. Detect column types:
def detect_type(value):
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"
3. Profile each column:
def profile_columns(header, rows):
    summary = {}
    for i, col in enumerate(header):
        values = [row[i].strip() for row in rows if len(row) == len(header)]
        col_type = detect_type(values[0])
        unique = set(values)
        summary[col] = {
            "type": col_type,
            "unique_count": len(unique),
            "most_common": max(set(values), key=values.count)
        }
        if col_type == "numeric":
            nums = [float(v) for v in values if v.replace('.', '', 1).isdigit()]
            summary[col]["average"] = sum(nums) / len(nums) if nums else 0
    return summary
4. Write the summary report:
def write_summary(summary, out_file):
    with open(out_file, "w") as f:
        for col, stats in summary.items():
            f.write(f"Column: {col}\n")
            for k, v in stats.items():
                f.write(f"  {k}: {v}\n")
            f.write("\n")
You can use the functions like so:
header, rows = load_csv("employees.csv")
summary = profile_columns(header, rows)
write_summary(summary, "profile_report.txt")
// ⏭️ Final Exercise
Use your own CSV file (or reuse earlier ones). Run the profiler and inspect the output.
# Conclusion
Congratulations! You've completed the Python for Data Science mini-course. 🎉
Over this week, you've moved from basic Python data structures to writing modular functions and scripts that handle real data problems. These are the basics, and by that I mean really basic stuff. I suggest you use this as a starting point and learn more of Python's standard library (by doing, of course).
Thanks for learning with me. Happy coding and data crunching ahead!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.