# Data Science with Julia

## Introduction to Julia

Julia is:

- A high-level, dynamic programming language.
- Designed for the requirements of high-performance numerical and scientific computing.
- Equally effective for general-purpose programming.
- A sophisticated just-in-time (JIT) compiler, implemented using LLVM.
- Built-in support for distributed parallel execution.
- Numerical accuracy and an extensive mathematical function library.
- The ability to define function behavior across many combinations of argument types — a feature called multiple dispatch.
- Performance approaching that of statically compiled languages like C/C++.
- Can call Python functions via the PyCall package, and MATLAB code via the MATLAB.jl package.
- Can call C functions directly, with no wrappers or special APIs needed.
- Powerful shell-like capabilities for managing other processes.
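
The zero-overhead C interface deserves a quick illustration. Here is a minimal sketch that calls standard C library functions directly via `ccall`; the library names `"libm"` and `"libc"` are assumed to resolve on the host system (on some platforms these symbols live in a differently named library):

```julia
# Call C's floor() from libm directly -- no wrapper or special API needed.
x = ccall((:floor, "libm"), Float64, (Float64,), 3.7)
println(x)
# 3.0

# Call C's strlen() from libc on a Julia string.
n = ccall((:strlen, "libc"), Int, (Ptr{Uint8},), "hello")
println(n)
# 5
```

Each `ccall` takes the (symbol, library) pair, the C return type, a tuple of C argument types, and the arguments themselves; Julia handles the conversions.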

Julia tries to eliminate the performance trade-off and provide a single environment productive enough for prototyping and efficient enough for deploying performance-intensive applications. It is a flexible dynamic language, appropriate for scientific and numerical computing, with performance comparable to traditional statically typed languages. Julia was publicly released in 2012 and has amazed many in the data science world. The language is still a few years away from a 1.0 release candidate, with its current version at 0.3.

Where shouldn't Julia be used?

- Julia is very young.
- Breaking changes still land frequently in the core language.
- Language features are still being added.
- Not all packages are available yet, and those that are have not been tested thoroughly enough.

## Julia by example

### User-Defined Types

```
# User-Defined Types
type UserDefinedType
    i::Integer
    s::String
end
user_t = UserDefinedType(24, "this is a UserDefinedType")
@show user_t.i
# user_t.i => 24
@show user_t.s
# user_t.s => "this is a UserDefinedType"
```

### Linear Algebra

```
# Vectors
v = [1, 1]
# Vector Operations
@show v + [2, 0] # vector addition
# v + [2,0] => [3,1]
@show v + 2 # same as v + [2,2]
# v + 2 => [3,3]
@show 10v # scalar multiplication
# 10v => [10,10]
# Matrices
M = [1 1 ; 0 1]
# Matrix Addition
(M + 1, M + [0 0 ; 5 5])
# ( 2x2 Array{Int64,2}:
#    2  2
#    1  2,
#   2x2 Array{Int64,2}:
#    1  1
#    5  6 )
```

### Functions

```
# Named functions
f(x) = 20x
function g(x)
    return x * 20
end
@show f(10)
# f(10) => 200
@show g(15)
# g(15) => 300
# Anonymous functions assigned to variables
h = x -> x * 10
i = function(x)
    x * 20
end
@show h(5)
# h(5) => 50
@show i(5)
# i(5) => 100
# Operators are functions
+(10,5)
# 15
```

### Multiple Dispatch

```
muldis(x::String) = println("You entered the string: $x")
muldis(x::Integer) = x * 20
muldis(x::UserDefinedType) = println(x.s)
methods(muldis)
# 3 methods for generic function muldis:
# muldis(x::String) at In[20]:1
# muldis(x::Integer) at In[20]:2
# muldis(x::UserDefinedType) at In[20]:3
muldis("Hello World")
# You entered the string: Hello World
muldis(user_t)
# this is a UserDefinedType
muldis(10)
# 200
```

### Object-Oriented Programming

```
type UserObject
    data::Union(Integer, String)
    set::Function
    function UserObject()
        this = new()
        this.data = ""
        # Method overloading: one setter per argument type
        function setter(x::Integer)
            println("Setting an integer")
            this.data = x
        end
        function setter(x::String)
            println("Setting a string")
            this.data = x
        end
        this.set = setter
        return this
    end
end
obj = UserObject()
obj.set(2410)
obj.set("Hello World")
```

### Functional Programming

```
# Sum of odd integers between 1 and 10
values = 1:10
myMapper = x -> x
myFilter = x -> x % 2 == 1
myReducer = (x,y) -> x + y
mapped = map( myMapper, values )
filtered = filter( myFilter, mapped )
reduced = reduce( myReducer, filtered )
# reduced => 25
```

### Parallel Programming

Count the number of heads in a large number of coin tosses in parallel. The computation is automatically distributed across all available compute nodes, the partial results are reduced by summation (+), and the final result is returned to the calling node.

```
nheads = @parallel (+) for i = 1:100000000
    int(randbool())
end
println(nheads)
# 49996972
```

### Basic Statistics

Install StatsBase by running `Pkg.add("StatsBase")`:

```
using StatsBase
x = rand(100) # uniform distribution [0,1]
println( "mean: ", mean(x) )
# mean: 0.5035696936745758
println( "variance: ", var(x) )
# variance: 0.0791810085560676
println( "skewness: ", skewness(x) )
# skewness: 0.041519550098815805
```

### Tabular Data

Install DataFrames by running `Pkg.add("DataFrames")`:

```
using DataFrames
df = DataFrame(
A = [6, 3, 4],
B = ["a", "b", "c"],
C = [1//2, 3//4, 5//6],
D = [true, true, false]
)
```

Next, load the classic Iris dataset, which is easily available through the RDatasets package (install it by running `Pkg.add("RDatasets")`):

```
using RDatasets
iris = dataset("datasets", "iris")
head(iris)
```

Group by Species and compute the mean of PetalLength per group:

```
by( iris, :Species, df -> mean(df[:PetalLength]) )
```

#### Data Visualization

Install Gadfly by running `Pkg.add("Gadfly")`:

```
using Gadfly
set_default_plot_size(20cm, 12cm)
plot(iris, x = "PetalLength", y = "PetalWidth", color = "Species", Geom.point)
```

### K-means clustering

K-means clustering aims to partition n observations into k clusters. Each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

- Randomly pick k centroids (these will be the centers of the clusters).
- Assign each data point to the closest centroid.
- Move each centroid to the average location of its assigned data points.
- Repeat the preceding two steps until the assignments stop changing, or change very little.

The problem is computationally difficult (NP-hard).
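
The steps above (known as Lloyd's algorithm) can be sketched in a few lines of plain Julia before reaching for a package. `simple_kmeans` and all the names below are illustrative, not part of Clustering.jl:

```julia
# A minimal sketch of Lloyd's algorithm for k-means.
# `data` is a d x n floating-point matrix (one observation per column),
# the same orientation Clustering.jl expects.
function simple_kmeans(data, k; iters = 100)
    d, n = size(data)
    # 1. Pick k random observations as the initial centroids.
    centroids = data[:, randperm(n)[1:k]]
    assignments = zeros(Int, n)
    # Run a fixed number of iterations; a real implementation
    # would stop once the assignments no longer change.
    for iter in 1:iters
        # 2. Assign each point to the closest centroid.
        for j in 1:n
            dists = [norm(data[:, j] - centroids[:, c]) for c in 1:k]
            assignments[j] = indmin(dists)
        end
        # 3. Move each centroid to the mean of its assigned points.
        for c in 1:k
            members = find(assignments .== c)
            if !isempty(members)
                centroids[:, c] = mean(data[:, members], 2)
            end
        end
    end
    return assignments, centroids
end
```

In practice the Clustering package should be preferred, since it handles initialization, convergence checks, and empty clusters far more carefully.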

Install Clustering by running `Pkg.add("Clustering")`:

```
# K-means Clustering
using Clustering
features = array(iris[:, 1:4])'
# group the data onto 3 clusters
result = kmeans( features, 3 )
plot(iris, x = "PetalLength", y = "PetalWidth", color = result.assignments, Geom.point)
```

### Principal Component Analysis

PCA is used to reduce the dimensionality of a data set and to identify new, meaningful underlying variables. Reducing the number of variables to a few makes the data easier to interpret. PCA creates interpretable linear combinations of the original variables, where each linear combination corresponds to a principal component.

Install MultivariateStats by running `Pkg.add("MultivariateStats")`:

```
# Principal Component Analysis
using MultivariateStats
pc = fit(PCA, features; maxoutdim = 2)
reduced = transform(pc, features)
@show size(reduced)
plot(iris, x = reduced[1,:], y = reduced[2,:], color = "Species", Geom.point)
```

The `@show` output, `size(reduced) => (2,150)`, confirms that the four original features were projected onto two principal components across the 150 observations.

## Conclusion

Julia has, or is actively developing, a wide range of statistical packages like R's; it is as easy to write and learn as Python; and it executes at speeds similar to C/C++. Although Julia's core language is still evolving, the basic language design has been stable for quite some time. Most data science tasks can now be done in Julia as easily as they can be done in Python or R.