Data Analysis with Python part 1

We hear about data all the time. Data in the form of text, audio, video, images, and many other formats is the foundation of today's digital economy. Humans are said to be creating 1.7 million megabytes of data every second. This data, created by a huge number of devices, offers endless possibilities to businesses, enterprises, and the economy, and it also forms the core of Artificial Intelligence models. The latest technologies rely on huge volumes of data to retrieve useful information. Data by itself is worth little unless it is properly analyzed for some purpose. In this blog, Data Analysis with Python part 1, we will understand what data is and how to start working with it.

Data Analysis:

We have discussed how huge amounts of data are being produced, manually by scientists and digitally by each and every one of us. Every time a webpage is searched or a mobile screen is tapped, data bytes are generated. But this data in its raw form is of no use; data is not information. Data science and data analysis help us bring useful insights and information out of these heaps of data.

[Image: Data Analysis with Python. Source: Real Python]

Let’s start our journey of understanding data by first stating a problem for Data Analysis with Python part 1. Over the course of this series, we will try to solve this data problem.

Suppose we want to sell a car, and we want to sell it at a good price: not so low that we make a loss, and not so high that no one buys it. To put a reasonable price on the car, think like a data analyst. Is there any data on car sales available? Does this data relate prices to the characteristics of the cars? Which car features mainly decide the price? These are the kinds of questions along which a data analyst will start working towards a solution.

Understanding Data:

We have now established that we need data in order to solve our car sales problem. So, let’s introduce our data. We will be using Jeffrey C. Schlimmer’s open dataset of used car prices. It is in CSV format, where values are separated by commas and each line is a record. At this link, you can find documentation on the columns of the dataset. There are 26 columns in the dataset. The first column, symboling, indicates the insurance risk level of the car: -3 means low risk and +3 indicates high risk. The second attribute, normalized-losses, gives the average loss payment per year; its values range from 65 to 256. Our target value, or label, is column 26: the price. Price is what we want to predict from the dataset, whereas all other attributes or columns are predictors.

Some Python Packages for Data Analysis and Data Science:

As we will perform our data analysis in the Python programming language, it is good to have a gentle introduction to the important Python libraries that make working with data easier and more efficient. Libraries contain functions and modules that help us implement solutions without writing extensive code; we directly use functionality already defined in the library.

For a gentle introduction to the Python libraries for data analysis, we will divide them into three groups.

Scientific Computing Libraries:

Pandas is a well-known library for data manipulation and analysis. For this purpose, Pandas mostly uses the DataFrame data structure, a two-dimensional table of rows and columns.

The NumPy library uses arrays for its inputs and outputs. It is also used for processing data in the form of matrices.

The SciPy library has functions to help with advanced math problems such as integrals, differential equations, and optimization.

Data Visualization Libraries:

Data visualization helps in better understanding data and in communicating our work to others in the form of maps, charts, and graphs.

Matplotlib is the most widely used library for data visualization.

Seaborn is another library that helps in making customizable graphs. It is based on Matplotlib.

ML Algorithm Libraries:

These libraries help in building machine learning models from datasets. The models are then trained to produce solutions for a particular problem. Algorithmic libraries help implement everything from basic to complex machine learning models.

Scikit-learn helps in implementing statistical models using classification, regression, clustering, and other ML techniques. NumPy, SciPy, and Matplotlib form the basis of this library.

Importing and Exporting Data in Python:

Data acquisition is the process of loading data, say into a notebook, from various sources. We will use Python’s Pandas library for this purpose. Once the data is loaded, we can perform all our data analysis procedures on it. Two things matter when loading data with Pandas: the file format and the file path. The file extension gives an idea of the format of the data, that is, how the data is encoded; some common formats are JSON, CSV, and XLSX. The path tells us where the data is stored, either locally on a computer or online. Our dataset of used car prices is located online. In this dataset, each line or row is a datapoint, and the several values associated with each datapoint are separated by commas, indicating that the data is stored in CSV format.

The following few lines of Python code will read the data.

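The original post showed this step as a screenshot; here is a minimal sketch of the same idea. The URL below points to the UCI repository copy of the dataset and is an assumption, so substitute your own path if needed.

import pandas as pd

# Assumed online location of the used-car dataset
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# header=None because the file contains no header row
df = pd.read_csv(path, header=None)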

The read_csv method assumes that the first row in the dataset contains header values. If this is not the case, as in our dataset, we explicitly pass header=None in the function call.

Once the data is loaded into the data frame, it is good practice to view a portion of it to check that everything has loaded correctly.

df prints the whole dataset, which is obviously not recommended for huge datasets.

df.head(n) displays the top n rows of the dataset. Similarly, df.tail(n) shows the bottom n rows.
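For example, with the DataFrame df loaded above:

# Show the first 5 rows and the last 5 rows
print(df.head(5))
print(df.tail(5))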


If column names are not present, Pandas assigns integer values as column names. It gets a bit annoying working with integers as names, so we can set them to actual names describing the columns. The header values for our dataset are present online in a separate file. We can set the header row to these values as follows.

Put all the column names in a list called headers, then set df.columns = headers.
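A sketch of this step; the column names below are taken from the dataset’s documentation, so double-check them against the documentation linked above:

# Column names as given in the dataset documentation (verify against the docs)
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]

# Replace the integer column names with the descriptive ones
df.columns = headers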


After performing data analysis on a data frame, we can export this data to a file.

For exporting to a CSV file, the to_csv method is used, which takes the path of the file as a parameter.

Similarly, read_json() will import a JSON file and to_json() will export to a JSON file.
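For example (the file names here are illustrative):

# Export the data frame to CSV; index=False drops the row index column
df.to_csv("automobile.csv", index=False)

# The same pattern works for JSON
df.to_json("automobile.json")
df2 = pd.read_json("automobile.json")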

Analyzing Data with Python:

Next in Data Analysis with Python part 1, we will discuss some important methods that all aspiring data scientists and data analysts should know about. After importing the data, the next step is to explore it. Pandas offers a number of methods for exploring the features of the data.

Such analysis gives an overview of the datatypes used in the dataset, which data analysts may want to change. The main datatypes Pandas uses to store data are int, float, object, and datetime.

Pandas automatically assigns datatypes to the columns depending on their content. Therefore, it is helpful to quickly go through the datatypes after data import in case something needs to be changed. For example, Pandas may assign the price column a datatype of object, whereas we would want it to be a float value. The attribute df.dtypes returns the datatype of each column. Moreover, an experienced data scientist can guess, by looking at the columns’ datatypes, which operations can be performed on particular columns.
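A short sketch; the conversion below assumes the price column contains only numeric strings, and real data often needs missing values handled first:

# Check the datatype assigned to each column
print(df.dtypes)

# Convert price from object to float (assumes purely numeric values)
df["price"] = df["price"].astype("float")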

After examining and handling the datatypes, the next step is the distribution of data in the columns. Statistical metrics and summaries help a data scientist locate anomalies in the data, such as deviations and extreme outliers.

The method df.describe() gives a quick statistical summary of the data. It returns the number of values in each column as count, the mean of these values as mean, the standard deviation of the column as std, the maximum and minimum values, and some other useful statistics. It works on columns with numerical datatypes.


df.describe(include="all") would include all columns. For columns with the object datatype, it returns values such as unique, top, and freq. Unique is the number of distinct values in the column, top is the most frequently occurring item, and freq is the number of times the top value occurs.

If a particular statistical metric cannot be calculated for a particular column depending on its datatype, NaN is returned.
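For example:

# Summary statistics for numeric columns only
print(df.describe())

# Include object (categorical) columns too; inapplicable stats show as NaN
print(df.describe(include="all"))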


Another useful method, df.info(), prints a concise summary of the data frame: the column names, their datatypes, the number of non-null values, and the memory usage.
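For example:

# Concise summary: column dtypes, non-null counts, and memory usage
df.info()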

Accessing Databases with Python:

Now in Data Analysis with Python part 1, we will discuss how to work with databases. Python code connects to a database using API calls. An SQL API consists of libraries and methods for working with a DBMS. The application program first uses API functions to connect to the database, and then uses other functions to pass SQL statements to it, for example to query data or check status. After communication with the database is finished, the connection is closed, an important step for saving resources.

DB-API is the standard Python API for accessing relational databases. Using this standard, we can write a single program that works with multiple databases. Connection objects and query objects are the two main concepts in DB-API. Connection objects maintain the connection with the database and manage transactions. Query objects, or cursor objects, are used for running queries on the database; a cursor is used to go through the results returned from the database.

The cursor() method returns a new cursor using the connection.

The commit() method commits the current transaction.

The rollback() method rolls the database back to the start of any pending transaction.

The close() method closes the database connection.

The following piece of code opens a connection to the database, creates a cursor object on the connection, runs queries and retrieves results using this cursor, and finally closes the connection, freeing up resources.

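The original post showed this as a screenshot; here is a minimal sketch using Python’s built-in sqlite3 module, which follows the DB-API standard. The database, table, and column names are hypothetical, and an in-memory database is used so the example runs on its own.

import sqlite3

# Open a connection (in-memory for illustration; use a file path in practice)
connection = sqlite3.connect(":memory:")

# Create a cursor object on the connection
cursor = connection.cursor()

# A tiny hypothetical table so the query below has something to run against
cursor.execute("CREATE TABLE cars (make TEXT, price REAL)")
cursor.execute("INSERT INTO cars VALUES ('audi', 23875), ('toyota', 7898)")
connection.commit()

# Run a query and go through the results using the cursor
cursor.execute("SELECT make, price FROM cars WHERE price > ?", (20000,))
for row in cursor.fetchall():
    print(row)

# Close the cursor and the connection to free resources
cursor.close()
connection.close()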

Wrap Up:

In Data Analysis with Python part 1, we discussed how to start understanding data. Data today constitutes the backbone of technology, and thus of research and innovation. To implement the latest technologies, such as Artificial Intelligence, it is important that we know how to handle and manipulate data, because this data forms the foundation of AI and machine learning models.
