Python is a high-level programming language with many uses, including data analysis, web development, automation, and even artificial intelligence. It is simple, easy to read, and easy to write.
In this article we will discuss the use of Python in data analytics from a beginner's point of view.
First, we need to set up an environment to write the code, and cover some Python basics, before getting into analysis.
Setting up an environment
You need an environment to write your code. In this case we will use Google Colab, which is the easiest way to get started as it does not require any installation.
To open a new notebook:
Go to colab.research.google.com
Sign in with your Google account
Click "New notebook"
This is where you can start writing your code.
Python Basics
Before working with and analyzing data in Python, you need to understand a few fundamental ideas, starting with variables, data types, and functions.
Variables
Variables are basically containers that hold information. They store and label data in memory, data that is then used throughout the program.
```python
current_course = "Data analytics"
print(current_course)  # Data analytics
```
Data Types
Every value in Python has a type. The most common types you will encounter are the following:
```python
city = "Nairobi"        # string (text)
population = 5_000_000  # integer (whole number)
gdp_growth = 5.4        # float (decimal)
is_capital = True       # boolean (True or False)
```
Understanding types helps you avoid errors and work with your data correctly.
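One quick way to confirm what you are working with is the built-in type() function:

```python
city = "Nairobi"
population = 5_000_000
gdp_growth = 5.4
is_capital = True

print(type(city))        # <class 'str'>
print(type(population))  # <class 'int'>
print(type(gdp_growth))  # <class 'float'>
print(type(is_capital))  # <class 'bool'>
```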
Strings
A string is any piece of text wrapped in quotes. It is one of the most common data types you will encounter in real datasets. Examples:
```python
name = "Martin Kamau"
city = "Nairobi"
course = "Data Analytics"
```
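Strings also come with built-in methods that are useful when cleaning messy text in real datasets. A small sketch, using a made-up name with stray spaces:

```python
name = "  martin kamau  "

print(name.strip())          # martin kamau  (surrounding whitespace removed)
print(name.strip().title())  # Martin Kamau  (each word capitalized)
print(name.strip().upper())  # MARTIN KAMAU
```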
Dictionaries — Labelled Data
A dictionary stores data as key-value pairs. Instead of just a list of values, each value has a label (the key) that describes what it is.
```python
student_info = {
    "name": "Martin",
    "city": "Nairobi",
    "age": 25,
    "course": "Data Analytics"
}
```
To retrieve a value, use its key in square brackets:
```python
print(student_info["name"])  # Output: Martin
print(student_info["age"])   # Output: 25
```
You can add new entries or update existing ones:
```python
student_info["lesson_mode"] = "Online"  # Add a new key
student_info["age"] = 26                # Update an existing key
```
To see all the keys or all the values:
```python
print(student_info.keys())    # Output: dict_keys(['name', 'city', 'age', ...])
print(student_info.values())  # Output: dict_values(['Martin', 'Nairobi', 26, ...])
```
A dictionary maps directly onto a single row in a dataset — a row in a DataFrame is essentially a collection of labelled values, one per column. Take note of this pattern.
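You can also walk through every key-value pair at once with .items(), which is handy when inspecting a whole record:

```python
student_info = {
    "name": "Martin",
    "city": "Nairobi",
    "age": 25,
    "course": "Data Analytics"
}

# .items() yields each (key, value) pair in turn
for key, value in student_info.items():
    print(key, "->", value)
```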
Tuples — Fixed Collections
A tuple is similar to a list, but with one key difference: once you create it, you cannot change it. It is immutable.
```python
months = ("Jan", "Feb", "Mar", "Apr")
```
You access items in a tuple the same way as a list, using their index (position), starting from 0:
```python
print(months[2])  # Output: Mar
```
When would you use a tuple?
Tuples are ideal for data that should never change, e.g. months of the year, days of the week, or latitude and longitude coordinates.
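To see that immutability in action, trying to reassign an item raises a TypeError:

```python
months = ("Jan", "Feb", "Mar", "Apr")

# Tuples cannot be modified after creation
try:
    months[0] = "January"
    changed = True
except TypeError:
    changed = False

print("Tuple changed?", changed)  # Tuple changed? False
```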
Loops — Repeating Actions
A loop lets you repeat a block of code multiple times without writing it out manually each time. There are two kinds: for loops and while loops.
For Loop
A for loop repeats once for each item in a collection:

```python
fruits = ["Mango", "Bananas", "Oranges", "Grapes"]
for fruit in fruits:
    print("I like", fruit)
```

Output:

```
I like Mango
I like Bananas
I like Oranges
I like Grapes
```
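For loops are also how you aggregate values in plain Python. A minimal sketch, using a hypothetical list of daily sales figures:

```python
daily_sales = [120, 340, 275, 410]  # made-up figures for illustration

total = 0
for sale in daily_sales:
    total = total + sale  # accumulate one value per pass

print("Total sales:", total)  # Total sales: 1145
```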
While Loop
A while loop repeats as long as a condition remains true.
```python
count = 1
while count <= 5:
    print("Count:", count)
    count = count + 1
```

Output:

```
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
```
Functions — Reusable Blocks of Code
A function is a named block of code that performs a specific task. You define it once, then call it whenever you need it.
```python
def greet(name):
    message = "Hello " + name
    return message
```
def — tells Python you are defining a function
greet — the name you are giving the function
name — a parameter: a placeholder for the value you will pass in
return — sends a result back out of the function
To call the function:

```python
print(greet("Job"))
```

```
Hello Job
```
You can also write functions that take numbers and do calculations. When analyzing data with Python, you will often write functions to clean or transform data in a consistent way across many columns or files.
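As a sketch of that idea, here is a small function that formats a decimal growth figure as a percentage string; the function name and values are purely illustrative:

```python
def to_percent(value):
    """Format a decimal growth figure as a percentage string."""
    return str(round(value * 100, 1)) + "%"

print(to_percent(0.054))  # 5.4%
print(to_percent(0.125))  # 12.5%
```

Once defined, the same function can be applied to every value in a column, which keeps the transformation consistent.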
Importing Libraries
Python comes with a lot of built-in features, but for data analysis we need extra tools called libraries. A library is a collection of pre-written code that adds new capabilities to Python.
We will use two:
requests — for fetching data from the internet
pandas — for turning that data into a structured table and saving it as a CSV
Think of it this way: requests brings the data in, pandas makes it something you can actually work with.
```python
import requests
import pandas as pd
```
Creating and Hosting Your Dataset
For this we are going to create a synthetic dataset using mockaroo.com (a free tool that lets you create realistic fake datasets with your own columns, rows, and data types). We will download the dataset in JSON format.
Upload to GitHub and Get the Raw Link
A raw GitHub link gives you direct access to the file contents — which is exactly what requests needs to fetch the data.
To host your file:
1. Create a new public repository on github.com
2. Upload your .json file to the repository
3. Open the file on GitHub and click the Raw button
4. Copy the URL from your browser — it will look like this: raw.githubusercontent.com/kamaumartin209-sketch/Introduction-to-Python-Practice-assignment-/refs/heads/main/MOCK_DATA.json
Fetching the JSON Data
Now we use requests to fetch the JSON file directly from that raw GitHub URL.
```python
url = "https://raw.githubusercontent.com/kamaumartin209-sketch/Introduction-to-Python-Practice-assignment-/refs/heads/main/MOCK_DATA.json"
response = requests.get(url)
data = response.json()
```
Before doing anything with the response, always check whether the request actually succeeded:
```python
print(response.status_code)
```

```
200
```
A status code tells you whether the request worked or not:
Code 200 = Success — data came through
Code 404 = Not found — check your URL
Code 403 = Forbidden — the file is not publicly accessible
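The checks above can be sketched as a small helper function (hypothetical, not part of the requests library):

```python
def describe_status(code):
    """Map common HTTP status codes to a short explanation."""
    messages = {
        200: "Success - data came through",
        403: "Forbidden - the file is not publicly accessible",
        404: "Not found - check your URL",
    }
    return messages.get(code, "Unexpected status code: " + str(code))

print(describe_status(200))  # Success - data came through
print(describe_status(404))  # Not found - check your URL
```

In practice, requests also provides response.raise_for_status(), which raises an exception automatically when the response is an error code.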
Check the data type:

```python
type(data)  # list
```

Inspect the first record:

```python
data[0]
```

```
{'student_id': 1,
 'first_name': 'Shepperd',
 'last_name': 'Hinsche',
 'passport_number': 'AB123456',
 'destination': 'China',
 'gender': 'Male',
 'flight_time': '5:11 AM',
 'departure_date': '4/29/2022',
 'arrival_date': '11/20/2022',
 'flight_duration_hours': 6}
```
Inspect the Dictionary Keys and Values
Each record in the list is a dictionary.
Keys from the first record:

```python
data[0].keys()
```

```
dict_keys(['student_id', 'first_name', 'last_name', 'passport_number', 'destination', 'gender', 'flight_time', 'departure_date', 'arrival_date', 'flight_duration_hours'])
```

Values from the first record:

```python
data[0].values()
```

```
dict_values([1, 'Shepperd', 'Hinsche', 'AB123456', 'China', 'Male', '5:11 AM', '4/29/2022', '11/20/2022', 6])
```

To access a specific value by its key:

```python
data[3]["destination"]  # 'Brazil'
```
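Combining loops and dictionaries already lets you do simple analysis on the raw list, for example counting how many students are headed to each destination. A sketch using a few made-up records in the same shape as the fetched data:

```python
# Made-up records for illustration
data = [
    {"first_name": "Shepperd", "destination": "China"},
    {"first_name": "Amina", "destination": "Brazil"},
    {"first_name": "Job", "destination": "China"},
]

counts = {}
for record in data:
    dest = record["destination"]
    counts[dest] = counts.get(dest, 0) + 1  # add 1, starting from 0 if unseen

print(counts)  # {'China': 2, 'Brazil': 1}
```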
Converting to a DataFrame
```python
df = pd.DataFrame(data)
```

This takes the list of dictionaries and turns it into a proper table, with one row per record and one column per key.
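Once converted, pandas gives you quick ways to inspect the table: df.head() shows the first rows, df.shape reports (rows, columns), and df.columns lists the column names. A minimal sketch with two made-up records:

```python
import pandas as pd

# Made-up records in the same shape as the fetched data
data = [
    {"student_id": 1, "first_name": "Shepperd", "destination": "China"},
    {"student_id": 2, "first_name": "Amina", "destination": "Brazil"},
]

df = pd.DataFrame(data)
print(df.shape)             # (2, 3)
print(df.columns.tolist())  # ['student_id', 'first_name', 'destination']
print(df.head())
```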
Saving as CSV
At this point you have successfully created a DataFrame organized into a table. You can now export it as a CSV file:

```python
df.to_csv("student_flights.csv", index=False)  # index=False skips writing row numbers
```
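To confirm the file was written correctly, you can read it straight back with pd.read_csv. A sketch with a one-row table:

```python
import pandas as pd

# Write a tiny example table, then read it back
df = pd.DataFrame([{"student_id": 1, "destination": "China"}])
df.to_csv("student_flights.csv", index=False)

check = pd.read_csv("student_flights.csv")
print(check.shape)              # (1, 2)
print(check["destination"][0])  # China
```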
Conclusion
Python is a simple, structured, and dynamic programming language, which is why it works well for analysis, as we have seen with the workflow here: generating a dataset and converting it into a structured, exportable table.
With more libraries and tools, Python remains a great resource for analyzing large datasets, supporting faster and better decision making.