Data Analysis using Julia
Recently I started reading Anshul Joshi’s latest book, “Learning Julia”. It is an amazing book for beginners who want to learn basic data analysis and visualization using Julia.
This blog is inspired by the book. Here I have tried to put together the most important data analysis features that Julia provides, so that you can get started with basic data analysis and visualisation using DataFrames.
What is Julia?
Julia: “Walks like Python. Runs like C.”
Julia is a high-level, high-performance, dynamic programming language for technical computing, with a syntax that is familiar to users of other technical computing environments.
It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.
Installation
Before we start, we need to install Julia and its specific packages.
Installing Julia
- Download Julia for your specific system from here:
https://julialang.org/downloads/
- Follow the platform-specific instructions to install Julia on your system from here
https://julialang.org/downloads/platform.html
Installing IJulia and Jupyter Notebook.
Data science and scientific computing are privileged to have an amazing interactive tool
called Jupyter Notebook. With Jupyter Notebook, you can write and run code in an
interactive web environment with many capabilities.
Go to the Julia prompt and type the following code:
julia> Pkg.add("IJulia")
After IJulia is successfully installed, type the following code to launch the notebook:
julia> using IJulia
julia> notebook()
Now, let’s start exploring Julia with Jupyter Notebook.
Exploratory Analysis with Julia
Data scientists spend a lot of their time transforming raw data into a usable format.
Julia provides an extensive library for dealing with raw data and transforming it into a meaningful format.
The first step in any kind of data analysis is exploring the dataset at hand. There are two ways to do that:
- Exploring the data tables and applying statistical methods to find patterns in numbers.
- Plotting the data to find patterns visually.
Julia provides DataFrames for exploring data and applying statistical methods.
Introduction to DataFrames.jl
A dataframe is similar to an Excel worksheet: column names refer to columns, and rows can be accessed by row number. The essential difference is that, in a dataframe, the column names and row numbers are known as the column index and row index. This is similar to pandas.DataFrame in Python or data.table in R.
Julia provides a package named DataFrames.jl, which supplies the data structures needed to load table-like data from sources such as CSV, text, or Excel files.
julia> Pkg.add("DataFrames")
Let’s work on a real-world scenario for a better understanding.
The dataset is the Loan Approval dataset, which can be downloaded from here.
The variables are described as follows:
Importing library and dataset
using DataFrames
#read csv file into DataFrame
df= readtable("train.csv", separator=',')
Now we have loaded the CSV data into a dataframe.
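To get a feel for what loading a CSV into columns involves, here is a minimal sketch that uses only base Julia. It is not how readtable works internally: the function name parse_csv and the three-row excerpt are made up for illustration, quoted fields are not handled, and empty fields are stored as `missing` (the marker that current Julia versions use for absent values).

```julia
# Hypothetical three-row excerpt in the shape of train.csv (values are made up).
csv_text = """
Loan_ID,Gender,LoanAmount
LP001002,Male,
LP001003,Male,128
LP001005,,66
"""

# Naive parser: one vector per column, empty fields become `missing`.
# It does not handle quoted fields or type conversion.
function parse_csv(text)
    lines = split(strip(text), '\n')
    header = split(lines[1], ',')
    columns = Dict(String(name) => Union{String,Missing}[] for name in header)
    for line in lines[2:end]
        fields = split(line, ',')  # empty fields are kept by default
        for (name, field) in zip(header, fields)
            push!(columns[String(name)], isempty(field) ? missing : String(field))
        end
    end
    return columns
end

columns = parse_csv(csv_text)
println(columns["LoanAmount"])
```

Every column stays a vector of strings here; a real reader also infers numeric types, which is one reason to prefer a package for actual work.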
Let’s explore the data using the different dataframe functions in Julia.
Dataframe functions
size(df)
names(df)
The dataset is not that big (614 rows).
Let’s look at the first 10 rows to get a better feel for what our data looks like.
The head(df, n) function returns the first n rows of a dataset.
head(df,10)
By looking into the above data we can conclude that:
- Gender, Married, Dependents, Education, Self_Employed, Credit_History and Loan_Status are all categorical variables, most of them with just two categories.
- Loan_ID is a unique identifier, so it doesn’t provide any information that is useful for the analysis.
- The LoanAmount variable contains missing values.
Numerical Data
Let’s explore the LoanAmount column.
The describe() function returns summary statistics (such as the minimum, maximum and mean) for
the given data.
describe(df[:LoanAmount])
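To make it concrete what such a summary contains, the same kind of numbers can be computed by hand. This is only a sketch under assumptions: the loan amounts are made up, the field names lo, hi, avg and med are my own, and `missing` with skipmissing is how current Julia versions represent and skip absent values.

```julia
using Statistics

# Hypothetical loan amounts; `missing` marks absent entries.
loan_amounts = [128, missing, 66, 120, 141, missing, 95]

# Drop the missing entries before computing any statistic.
present = collect(skipmissing(loan_amounts))

summary_stats = (
    lo = minimum(present),
    hi = maximum(present),
    avg = mean(present),
    med = median(present),
    nmissing = count(ismissing, loan_amounts),
)
println(summary_stats)
```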
Categorical Data
For non-numerical variables like Education and Property_Area, we can check the frequency distribution
to understand whether the data makes sense or not.
The countmap() function maps each value to the number of
occurrences (or total weights) in the dataset:
countmap(df[:Property_Area])
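The idea behind countmap() is easy to sketch with a plain dictionary. The function name frequency_table and the sample values below are hypothetical; this is an illustration of the value-to-count mapping, not the library’s implementation.

```julia
# Build a value => count mapping by hand.
function frequency_table(values)
    counts = Dict{eltype(values),Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Hypothetical values standing in for the Property_Area column.
areas = ["Urban", "Rural", "Urban", "Semiurban", "Urban", "Rural"]
counts = frequency_table(areas)
println(counts)
```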
Dealing with Missing data
Using the showcols() function we can check for missing values in each of the variables:
showcols(df)
As we can see, LoanAmount and Credit_History have the largest number of missing values.
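The per-column missing count is simple to reproduce by hand, which can help when checking the numbers. In this sketch the columns and their values are made up, and a plain Dict of vectors stands in for the dataframe.

```julia
# Hypothetical columns standing in for the dataframe (values are made up).
columns = Dict(
    "LoanAmount" => [128.0, missing, 66.0, missing],
    "Gender" => ["Male", "Female", missing, "Male"],
    "ApplicantIncome" => [5849, 4583, 3000, 2583],
)

# Count the missing entries in every column.
missing_counts = Dict(name => count(ismissing, col) for (name, col) in columns)

for name in sort(collect(keys(missing_counts)))
    println(name, ": ", missing_counts[name], " missing")
end
```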
There are multiple ways of fixing missing values in a dataset.
Take LoanAmount, for example: the simplest approach is to replace the missing values with the mean.
We will replace the missing values in the LoanAmount variable with its mean:
#replace missing loan amount with mean of loan amount
df[isna.(df[:LoanAmount]),:LoanAmount] = floor(mean(dropna(df[:LoanAmount])))
#replace 0.0 of loan amount with the mean of loan amount
df[df[:LoanAmount] .== 0, :LoanAmount] = floor(mean(dropna(df[:LoanAmount])))
Now, if we cross-check the LoanAmount variable, we can see that there are no more missing values.
describe(df[:LoanAmount])
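The same mean-replacement idea can be sketched on a plain vector. The values are made up, and the sketch uses `missing`, skipmissing and coalesce from current Julia versions rather than the isna/dropna calls shown above.

```julia
using Statistics

# Hypothetical loan amounts; `missing` marks absent entries.
loan_amounts = [128.0, missing, 66.0, 120.0, 141.0, missing, 95.0]

# Mean of the values that are present, rounded down as in the text.
fill_value = floor(mean(skipmissing(loan_amounts)))

# coalesce returns its first non-missing argument, so broadcasting it
# replaces every missing entry with fill_value and leaves the rest alone.
filled = coalesce.(loan_amounts, fill_value)
println(filled)
```

Filling with the mean keeps the column mean roughly where it was, but it shrinks the variance, so it is a quick fix rather than a careful one.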
For categorical data, we can replace missing values with the mode of the entire column.
#replace missing gender with mode of gender values
df[isna.(df[:Gender]), :Gender] = mode(dropna(df[:Gender]))
describe(df[:Gender])
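Here is a rough sketch of the same mode replacement on a plain vector. The helper most_frequent is a hypothetical name, the values are made up, and ties between equally common values are broken arbitrarily.

```julia
# Return the most common non-missing value (ties are broken arbitrarily).
function most_frequent(values)
    counts = Dict{Any,Int}()
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
    end
    best_value, best_count = nothing, 0
    for (v, n) in counts
        if n > best_count
            best_value, best_count = v, n
        end
    end
    return best_value
end

# Hypothetical values standing in for the Gender column.
genders = ["Male", missing, "Female", "Male", "Male", missing, "Female"]

mode_value = most_frequent(genders)
filled = coalesce.(genders, mode_value)
println(filled)
```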
Similarly, we can fill in the missing values in each of the other variables.
For more detail on data munging and data exploration with Julia, please refer to the book “Julia for Data Science”.
Visualisation in Julia
“Another effective way of exploring the data is by doing it visually using various kinds of plots”.
Here we will focus on creating some very simple and commonly used
graphs and plots. The library we will use to create them is
PyPlot, which is based on Python’s matplotlib.pyplot module.
Basic Plot
using PyPlot
x = linspace(0,4pi,1000)
y = cos.(pi .+ sin.(x))
plot(x, y, color="red", linewidth=2.0, linestyle="--")
title("Using Sin and Cos functions")
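A note for readers on Julia 1.0 or later: linspace was removed there, and range with the stop and length keywords builds the same grid. The sketch below only computes the data and leaves out the plotting calls, so it runs without PyPlot installed.

```julia
# 1000 evenly spaced points from 0 to 4π (replacement for linspace(0, 4pi, 1000)).
x = range(0, stop = 4pi, length = 1000)

# Dotted operators apply the functions and the addition element by element.
y = cos.(pi .+ sin.(x))

println(length(x), " points, y ranges from ", minimum(y), " to ", maximum(y))
```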
Bar graphs
x= [10,20,30,40,50]; y= [2,4,6,8,10]
title("Vertical Bar Graph")
bar(x,y, color="red")
Histogram / HEAT MAP
x= rand(100)
y= rand(100)
title("Histogram")
hist2D(x,y, bins= 10)
Plotting DataFrames with Plots
Now that we are familiar with basic data plotting, let’s study the distribution of various variables in our Loan Approval dataset.
Let us start with numeric variables – namely ApplicantIncome and LoanAmount
using Plots, StatPlots
pyplot() ##Set the backend as matplotlib.pyplot
Plots.histogram(dropna(df[:ApplicantIncome]),bins=50,xlabel="ApplicantIncome",labels="Frequency")
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
Plots.boxplot(dropna(df[:ApplicantIncome]), xlabel="ApplicantIncome")
Looking at the two graphs above, we can see that there are some extreme values for ApplicantIncome.
Now, let’s look at the histogram and box plot of LoanAmount:
Plots.histogram(dropna(df[:LoanAmount]),bins=50,xlabel="LoanAmount",labels="Frequency")
Plots.boxplot(dropna(df[:LoanAmount]), ylabel="LoanAmount")

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values.
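One common way to put a number on “extreme” is the 1.5 × IQR rule that box plots typically use for their whiskers. The sketch below assumes that rule; the function name iqr_outliers and the income values are made up for illustration.

```julia
using Statistics

# Return the values lying more than `whisker` IQRs outside the quartiles.
function iqr_outliers(values; whisker = 1.5)
    q1, q3 = quantile(values, [0.25, 0.75])
    spread = q3 - q1
    lower = q1 - whisker * spread
    upper = q3 + whisker * spread
    return filter(v -> v < lower || v > upper, values)
end

# Hypothetical applicant incomes with one very large value.
incomes = [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 81000]
outliers = iqr_outliers(incomes)
println(outliers)
```

Whether flagged values should be dropped, capped or transformed depends on the analysis; the rule only points out the candidates.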
You can learn more about data visualization using Plots from Plots.jl Documentation
Calling Python Library in Julia
Julia provides functionality to call Python and R libraries.
Using another language inside Julia may not be the best of ideas as far as speed and
efficiency are concerned, but there may be situations where you need to do so.
We can call a Python library using PyCall.
julia> Pkg.add("PyCall")
using PyCall
@pyimport pandas as pd
df1= pd.read_csv("train.csv")
I hope this blog will help you get started with Julia.
For a deeper understanding of Julia, please refer to the book “Learning Julia”.