Data Analysis using Julia

Recently I started reading Anshul Joshi’s latest book, “Learning Julia”. It is an excellent book for beginners who want to learn basic data analysis and visualization using Julia.
This blog is inspired by the book; here I have tried to put together the most important data analysis features Julia provides. It will help you get started with basic data analysis and visualisation in Julia using DataFrames.

What is Julia?

Julia: “Walks like python. Runs like C.”

Julia is a high-level, high-performance, dynamic programming language for technical computing, with a syntax familiar to users of other technical computing environments.
It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.

Installation

Before we start, we need to install Julia and a few specific packages.

Installing Julia

 

    1. Download Julia for your specific system from here:

https://julialang.org/downloads/

    2. Follow the platform-specific instructions to install Julia on your system from here:

https://julialang.org/downloads/platform.html

Installing IJulia and Jupyter Notebook.

Data science and scientific computing are privileged to have an amazing interactive tool
called Jupyter Notebook. With Jupyter Notebook, you can write and run code in an
interactive web environment with many capabilities.
Go to the Julia prompt and type the following code:

In [ ]:
julia> Pkg.add("IJulia")

After IJulia is successfully installed, you can run the following code to launch the notebook:

In [ ]:
julia> using IJulia
julia> notebook()

Now, let’s start exploring Julia with Jupyter Notebook.

Exploratory Analysis with Julia.

Data scientists spend a lot of time transforming raw data into a usable format.
Julia provides extensive libraries for dealing with raw data and transforming it into a meaningful format.

The first step in any kind of data analysis is exploring the dataset at hand. There are two ways to do that:

  • Exploring the data tables and applying statistical methods to find patterns in numbers.
  • Plotting the data to find patterns visually.

Julia provides DataFrames for exploring data and applying statistical methods.

Introduction to DataFrames.jl

A dataframe is similar to an Excel workbook: you have column names referring to columns, and rows that can be accessed by row number. The essential difference is that, in a dataframe, column names and row numbers are known as the column and row index. This is similar to pandas.DataFrame in Python or data.frame (and data.table) in R.

Julia provides a package named DataFrames.jl, which offers the data structures needed for loading tabular data from sources such as CSV, TXT, or Excel files.

In [ ]:
julia> Pkg.add("DataFrames")
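
Once DataFrames is installed, here is a minimal sketch (with made-up values, not from our dataset) of how columns and cells of a dataframe are indexed:

In [ ]:
using DataFrames

# a toy dataframe with two named columns
toy = DataFrame(Name = ["Asha", "Ravi", "Meera"], Income = [5849, 4583, 3000])

toy[:Income]     # access a whole column by its name (a Symbol)
toy[2, :Name]    # access a single cell by row number and column name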

Let’s work on a real-world scenario for a better understanding.
The dataset is a Loan Approval dataset, which can be downloaded from here.
It describes loan applicants and records whether each loan was approved (the Loan_Status column).

Importing library and dataset

In [ ]:
using DataFrames

# read the csv file into a DataFrame
df = readtable("train.csv", separator=',')

Now that we have loaded the CSV data into a dataframe,
let’s explore it using the different DataFrame functions Julia provides.

DataFrame functions

In [6]:
size(df)
Out[6]:
(614, 13)
In [7]:
names(df)
Out[7]:
13-element Array{Symbol,1}:
 :Loan_ID          
 :Gender           
 :Married          
 :Dependents       
 :Education        
 :Self_Employed    
 :ApplicantIncome  
 :CoapplicantIncome
 :LoanAmount       
 :Loan_Amount_Term 
 :Credit_History   
 :Property_Area    
 :Loan_Status

The dataset is not that big (614 rows).
Let’s look at the first 10 rows to get a better feel for what our data looks like.
The head(df, n) function returns the first n rows of a dataset.

In [8]:
head(df,10)
Out[8]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
1 LP001002 Male No 0 Graduate No 5849 0.0 NA 360 1 Urban Y
2 LP001003 Male Yes 1 Graduate No 4583 1508.0 128 360 1 Rural N
3 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66 360 1 Urban Y
4 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120 360 1 Urban Y
5 LP001008 Male No 0 Graduate No 6000 0.0 141 360 1 Urban Y
6 LP001011 Male Yes 2 Graduate Yes 5417 4196.0 267 360 1 Urban Y
7 LP001013 Male Yes 0 Not Graduate No 2333 1516.0 95 360 1 Urban Y
8 LP001014 Male Yes 3+ Graduate No 3036 2504.0 158 360 0 Semiurban N
9 LP001018 Male Yes 2 Graduate No 4006 1526.0 168 360 1 Urban Y
10 LP001020 Male Yes 1 Graduate No 12841 10968.0 349 360 1 Semiurban N

By looking at the above data, we can conclude that:

  • Gender, Married, Education, Self_Employed, Credit_History, and Loan_Status are categorical variables with two categories each; Dependents and Property_Area are categorical too, but with more than two levels (see the quick check after this list).
  • Loan_ID is a unique identifier, so it doesn’t provide any valuable information.
  • The LoanAmount variable contains missing values.
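
A quick sketch (not part of the original walk-through) to confirm these observations, for example the levels of Dependents and the number of missing LoanAmount entries:

In [ ]:
# distinct categories of the Dependents column (ignoring missing entries)
unique(dropna(df[:Dependents]))

# number of missing values in LoanAmount
sum(isna.(df[:LoanAmount]))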

Numerical Data

Let’s explore the LoanAmount column.
The describe() function returns summary statistics (minimum, maximum, mean, quartiles, and the number of missing values)
for the given data.

In [9]:
describe(df[:LoanAmount])
Summary Stats:
Mean:           146.412162
Minimum:        9.000000
1st Quartile:   100.000000
Median:         128.000000
3rd Quartile:   168.000000
Maximum:        700.000000
Length:         592
Type:           Int64
Number Missing: 22
% Missing:      3.583062

Categorical Data

For non-numerical variables like Education and Property_Area, we can check the frequency distribution of the variable
to understand whether the data makes sense or not.
The countmap() function (from the StatsBase package) maps each value to its number of
occurrences (or total weight) in the dataset:

In [10]:
using StatsBase
countmap(df[:Property_Area])
Out[10]:
Dict{Union{DataArrays.NAtype, String},Int64} with 3 entries:
  "Urban"     => 202
  "Rural"     => 179
  "Semiurban" => 233

Dealing with Missing Data

Using the showcols() function we can check for missing values in each of the variables:

In [11]:
showcols(df)
614×13 DataFrames.DataFrame
│ Col # │ Name              │ Eltype  │ Missing │
├───────┼───────────────────┼─────────┼─────────┤
│ 1     │ Loan_ID           │ String  │ 0       │
│ 2     │ Gender            │ String  │ 13      │
│ 3     │ Married           │ String  │ 3       │
│ 4     │ Dependents        │ String  │ 15      │
│ 5     │ Education         │ String  │ 0       │
│ 6     │ Self_Employed     │ String  │ 32      │
│ 7     │ ApplicantIncome   │ Int64   │ 0       │
│ 8     │ CoapplicantIncome │ Float64 │ 0       │
│ 9     │ LoanAmount        │ Int64   │ 22      │
│ 10    │ Loan_Amount_Term  │ Int64   │ 14      │
│ 11    │ Credit_History    │ Int64   │ 50      │
│ 12    │ Property_Area     │ String  │ 0       │
│ 13    │ Loan_Status       │ String  │ 0       │

As we can see, Credit_History (50), Self_Employed (32), and LoanAmount (22) have the most missing values.

There are multiple ways of fixing missing values in a dataset.
Take LoanAmount, for example: there are numerous ways to fill the missing values, the simplest being replacement by the mean.
We will replace the missing values in the LoanAmount variable with its mean:

In [12]:
#replace missing loan amount with mean of loan amount
df[isna.(df[:LoanAmount]),:LoanAmount] = floor(mean(dropna(df[:LoanAmount])))

#replace 0.0 of loan amount with the mean of loan amount 
df[df[:LoanAmount] .== 0, :LoanAmount] = floor(mean(dropna(df[:LoanAmount])))
Out[12]:
146.0

Now, if we cross-check the LoanAmount variable, we can see there are no more missing values.

In [13]:
describe(df[:LoanAmount])
Summary Stats:
Mean:           146.397394
Minimum:        9.000000
1st Quartile:   100.250000
Median:         129.000000
3rd Quartile:   164.750000
Maximum:        700.000000
Length:         614
Type:           Int64
Number Missing: 0
% Missing:      0.000000

For categorical data, we can replace missing values with the mode of the entire column.

In [14]:
#replace missing gender with mode of gender values
 df[isna.(df[:Gender]), :Gender] = mode(dropna(df[:Gender]))
Out[14]:
"Male"
In [15]:
describe(df[:Gender])
Summary Stats:
Length:         614
Type:           String
Number Unique:  2
Number Missing: 0
% Missing:      0.000000

Similarly, we can fill in the missing values of each remaining variable, as sketched below.
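
A rough sketch of how that could look (the column choices here are mine, simply mirroring the isna./mode/mean pattern used above):

In [ ]:
# fill the remaining categorical columns with their mode
for col in [:Married, :Dependents, :Self_Employed, :Credit_History]
    df[isna.(df[col]), col] = mode(dropna(df[col]))
end

# fill the remaining numeric column with (the floor of) its mean
df[isna.(df[:Loan_Amount_Term]), :Loan_Amount_Term] = floor(mean(dropna(df[:Loan_Amount_Term])))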
For more detail on data munging and data exploration with Julia, please refer to the book “Julia for Data Science”.

Visualisation in Julia

Another effective way of exploring the data is visually, using various kinds of plots.
Here we will focus on creating some of the simplest and most commonly used
graphs and plots. The library of choice for creating these plots will
be PyPlot, which is based on Python’s matplotlib.pyplot module.
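
If PyPlot is not installed yet, it can be added from the Julia prompt just like the other packages we have used so far:

In [ ]:
julia> Pkg.add("PyPlot")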

Basic Plot

In [12]:
using PyPlot
In [13]:
x = linspace(0,4pi,1000) 
y = cos.(pi + sin.(x))
plot(x, y, color="red", linewidth=2.0, linestyle="--")
title("Using Sin and Cos functions")
Out[13]:
PyObject Text(0.5,1,u'Using Sin and Cos functions')
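
The figure can also be written to disk; PyPlot exposes matplotlib's savefig for this (the filename below is just an example):

In [ ]:
savefig("sin_cos_plot.png")   # save the current figure as a PNG file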

Bar graphs

In [14]:
x= [10,20,30,40,50]; y= [2,4,6,8,10]
title("Vertical Bar Graph")
bar(x,y, color="red")
Out[14]:
(PyObject <matplotlib.patches.Rectangle object at 0x0000000027375278>, PyObject <matplotlib.patches.Rectangle object at 0x0000000027375048>, PyObject <matplotlib.patches.Rectangle object at 0x0000000027375550>, PyObject <matplotlib.patches.Rectangle object at 0x0000000027375898>, PyObject <matplotlib.patches.Rectangle object at 0x0000000027375B70>)
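
A horizontal version uses barh, which PyPlot also exposes; here is a small sketch reusing the same x and y:

In [ ]:
title("Horizontal Bar Graph")
barh(x, y, color="blue")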

Histogram / Heat Map

In [15]:
x= rand(100)
y= rand(100)
title("Histogram")
hist2D(x,y, bins= 10)
Out[15]:
([0.0 2.0 … 0.0 0.0; 1.0 1.0 … 2.0 2.0; … ; 4.0 3.0 … 0.0 2.0; 2.0 0.0 … 0.0 2.0], [0.00109174, 0.100789, 0.200487, 0.300184, 0.399881, 0.499579, 0.599276, 0.698973, 0.798671, 0.898368, 0.998066], [0.014186, 0.112429, 0.210671, 0.308914, 0.407156, 0.505398, 0.603641, 0.701883, 0.800126, 0.898368, 0.996611], PyObject <matplotlib.image.AxesImage object at 0x00000000270326D8>)
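
A colour scale makes the 2-D histogram easier to read; matplotlib's colorbar is available through PyPlot (an optional addition, not in the original plot):

In [ ]:
colorbar()   # add a colour scale for the bin counts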

Plotting DataFrames with Plots

Now that we are familiar with basic data plotting, let’s study the distribution of various variables in our Loan Approval dataset.
Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.

In [25]:
using Plots, StatPlots
pyplot()  # set the backend to PyPlot (matplotlib.pyplot)
Out[25]:
Plots.PyPlotBackend()
In [20]:
Plots.histogram(dropna(df[:ApplicantIncome]),bins=50,xlabel="ApplicantIncome",labels="Frequency")
Out[20]:

Next, we look at a box plot to understand the distribution. A box plot for ApplicantIncome can be plotted with:

In [21]:
Plots.boxplot(dropna(df[:ApplicantIncome]), xlabel="ApplicantIncome")
Out[21]:

By looking at the above two graphs, we can see that there are some extreme values in ApplicantIncome.

Now, let’s look at the histogram and box plot of LoanAmount:

In [22]:
Plots.histogram(dropna(df[:LoanAmount]),bins=50,xlabel="LoanAmount",labels="Frequency")
Out[22]:
In [24]:
Plots.boxplot(dropna(df[:LoanAmount]), ylabel="LoanAmount")
Out[24]:

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some data munging: LoanAmount has missing as well as extreme values (we filled the missing ones above), while ApplicantIncome has a few extreme values.
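
One common way to reduce the effect of such extreme values (a sketch of my own, not part of the original analysis) is a log transform of LoanAmount:

In [ ]:
# hypothetical extra column: log-transformed loan amount
# (assumes the missing LoanAmount values were already filled above)
df[:LoanAmount_log] = log.(dropna(df[:LoanAmount]))
Plots.histogram(df[:LoanAmount_log], bins=20, xlabel="log(LoanAmount)", labels="Frequency")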
You can learn more about data visualization with Plots from the Plots.jl documentation.

Calling Python Libraries from Julia

Julia provides functionality for calling Python and R libraries.
Using another language inside Julia may not be the best idea as far as speed and
efficiency are concerned, but there may be situations where you are required to do so.
We can call Python libraries using the PyCall package.

In [ ]:
julia> Pkg.add("PyCall")
In [ ]:
using PyCall
@pyimport pandas as pd
df1 = pd.read_csv("train.csv")
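
Any other Python module can be imported and called the same way, for example Python's math module:

In [ ]:
@pyimport math
math.sin(math.pi / 4)   # call Python's math.sin from Julia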

I hope this blog helps you get started with Julia.
For a deeper understanding of Julia, please refer to the book “Learning Julia”.
