Data Science With Julia

Introduction to Julia

Julia is:

  • A high-level, dynamic programming language.
  • Addresses the requirements of high-performance numerical and scientific computing.
  • Equally effective for general-purpose programming.
  • Sophisticated compiler: just-in-time (JIT) compilation, implemented using LLVM.
  • Distributed parallel execution.
  • Numerical accuracy and an extensive mathematical function library.
  • The ability to define function behavior across many combinations of argument types; this functionality is called multiple dispatch.
  • Performance approaching that of statically-compiled languages like C/C++.
  • The PyCall package lets you call Python functions from Julia; MATLAB functions can be called via the MATLAB.jl package.
  • Can call C functions directly, with no wrappers or special APIs needed.
  • Powerful shell-like capabilities for managing other processes.
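The C interop mentioned above needs no boilerplate at all. As a minimal sketch (assuming a Unix-like system where the C math library resolves as "libm"), here is a direct call to C's floor:

```julia
# Call C's floor() from libm directly: ccall takes the (function, library)
# pair, the return type, a tuple of argument types, and the arguments.
c_floor = ccall((:floor, "libm"), Float64, (Float64,), 2.7)
println(c_floor)
# 2.0
```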

Julia tries to eliminate the performance trade-off and provide a single environment that is productive enough for prototyping and efficient enough for deploying performance-intensive applications. It is a flexible dynamic language, appropriate for scientific and numerical computing, with performance comparable to traditional statically-typed languages. Julia was publicly released in 2012 and has amazed many in the data science world. The language is still a few years away from a 1.0 release candidate, with its current version at 0.3.

Where shouldn't Julia be used?

  • Julia is very young.
  • Breaking changes still land frequently in the core language.
  • Language features are still being added.
  • Not all packages are available, and many of those that are have not been tested well enough.

Julia by example

User-Defined Types

# User-Defined Types
type UserDefinedType
    i::Integer
    s::String
end

user_t = UserDefinedType(24, "this is a UserDefinedType")

@show user_t.i
# user_t.i => 24
@show user_t.s
# user_t.s => "this is a UserDefinedType"

Linear Algebra

# Vectors
v = [1, 1]

# Vector Operations

@show v + [2, 0] # vector addition
# v + [2,0] => [3,1]
@show v + 2      # same as v + [2,2]
# v + 2 => [3,3]
@show 10v         # scalar multiplication
# 10v => [10,10]

# Matrices
M = [1 1 ; 0 1]

# Matrix Addition

M + 1, M + [0 0 ; 5 5]

# ( 2x2 Array{Int64,2}:
#  2  2
#  1  2,
# 2x2 Array{Int64,2}:
#  1  1
#  5  6 )

Functions

# Named functions
f(x) = 20x
function g(x)
    return x * 20
end

@show f(10)
# f(10) => 200
@show g(15)
# g(15) => 300

# Anonymous functions assigned to variables
h = x -> x * 10
i = function(x)
    x * 20
end

@show h(5)
# h(5) => 50
@show i(5)
# i(5) => 100

# Operators are functions
+(10,5)
# 15

Multiple Dispatch

muldis(x::String)  = println("You entered the string: $x")
muldis(x::Integer) = x * 20
muldis(x::UserDefinedType) = println(x.s)

methods(muldis)

# 3 methods for generic function muldis:
#     muldis(x::String) at In[20]:1
#     muldis(x::Integer) at In[20]:2
#     muldis(x::UserDefinedType) at In[20]:3

muldis("Hello World")
# You entered the string: Hello World
muldis(user_t)
# this is a UserDefinedType
muldis(10)
# 200

Object-Oriented Programming

type UserObject
    data::Union(Integer, String)
    set::Function

    function UserObject()
        this = new()
        this.data = ""

        # Method Overloading
        function setter(x::Integer)
            println("Setting an integer")
            this.data = x
        end
        function setter(x::String)
            println("Setting a string")
            this.data = x
        end
        this.set = setter

        return this
    end
end

obj = UserObject()
obj.set(2410)
obj.set("Hello World")

Functional Programming

# Sum of odd integers between 1 and 10

values = 1:10
myMapper  = x -> x
myFilter  = x -> x % 2 == 1
myReducer = (x,y) -> x + y

mapped    = map( myMapper, values )
filtered  = filter( myFilter, mapped )
reduced   = reduce( myReducer, filtered )
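Since the map step above is just the identity, the same pipeline collapses to a single expression:

```julia
# Sum of odd integers between 1 and 10, composed in one line
odd_sum = reduce(+, filter(x -> x % 2 == 1, 1:10))
println(odd_sum)
# 25
```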

Parallel Programming

Count the number of heads in a large number of coin tosses in parallel. The computation is automatically distributed across all available worker processes, the partial results are reduced by summation (+), and the final result is returned to the calling process.

nheads = @parallel (+) for i=1:100000000
  int(randbool())
end
println(nheads)
# 49996972

Basic Statistics

Install StatsBase by doing:

Pkg.add("StatsBase")

using StatsBase

x = rand(100)    # 100 draws from the uniform distribution on [0,1)
println( "mean:     ", mean(x) )
# mean:     0.5035696936745758
println( "variance: ", var(x) )
# variance: 0.0791810085560676
println( "skewness: ", skewness(x) )
# skewness: 0.041519550098815805

Tabular Data

Install DataFrames by doing:

Pkg.add("DataFrames")

using DataFrames

df = DataFrame(
    A = [6, 3, 4],
    B = ["a", "b", "c"],
    C = [1//2, 3//4, 5//6],
    D = [true, true, false]
)

(Figure: the resulting DataFrame)

We use the classic Iris dataset, which is easily available through the RDatasets package.

Pkg.add("RDatasets")

using RDatasets

iris = dataset("datasets", "iris")
head(iris)

(Figure: the first rows of the iris DataFrame)

Group by Species and compute the mean PetalLength per group:

by( iris, :Species, df -> mean(df[:PetalLength]) )

Data Visualization

Install Gadfly by doing:

Pkg.add("Gadfly")

using Gadfly
set_default_plot_size(20cm, 12cm)
plot(iris, x = "PetalLength", y = "PetalWidth", color = "Species", Geom.point)

(Figure: Gadfly scatter plot of PetalWidth vs. PetalLength, colored by Species)

K-means clustering

K-means clustering aims to partition n observations into k clusters. Each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

  1. Randomly pick k centroids (these will be the centers of the clusters).
  2. Assign each data point to the closest centroid.
  3. Move each centroid to the average location of the data points assigned to it.
  4. Keep repeating the preceding two steps until the assignments don’t change, or change very little.

The problem is computationally difficult (NP-hard).
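Before reaching for a library, the steps above can be sketched directly. Here is a minimal 1-D illustration (simple_kmeans is a made-up name for this sketch, not part of Clustering.jl, and for simplicity it uses the first k points as initial centroids rather than random ones):

```julia
# Minimal 1-D k-means sketch (simple_kmeans is a made-up helper for
# illustration; the real work below is done by Clustering.jl).
function simple_kmeans(data, k, iters)
    centroids = data[1:k]                  # 1. pick k initial centroids
    assignments = zeros(Int, length(data))
    for iter in 1:iters
        # 2. assign each point to its closest centroid
        for i in 1:length(data)
            best = 1
            for j in 2:k
                if abs(data[i] - centroids[j]) < abs(data[i] - centroids[best])
                    best = j
                end
            end
            assignments[i] = best
        end
        # 3. move each centroid to the average of its assigned points
        for j in 1:k
            members = data[assignments .== j]
            if !isempty(members)
                centroids[j] = sum(members) / length(members)
            end
        end
    end
    return centroids, assignments
end

centroids, assignments = simple_kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], 2, 10)
# assignments => [1, 1, 1, 2, 2, 2]
```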

Install Clustering by doing:

Pkg.add("Clustering")

# K-means Clustering
using Clustering

features = array(iris[:, 1:4])'
# group the data into 3 clusters
result = kmeans( features, 3 )
plot(iris, x = "PetalLength", y = "PetalWidth", color = result.assignments, Geom.point)

(Figure: scatter plot colored by k-means cluster assignment)

Principal Component Analysis

PCA is used to reduce the dimensionality of a data set and to identify new, meaningful underlying variables. Reducing the number of variables to a few makes it possible to interpret the data in a more meaningful form. PCA creates interpretable linear combinations of the original variables, where each linear combination corresponds to a principal component.

Install MultivariateStats by doing:

Pkg.add("MultivariateStats")

# Principal Component Analysis
using MultivariateStats

pc = fit(PCA, features; maxoutdim = 2)
reduced = transform(pc, features)
@show size(reduced)
plot(iris, x = reduced[1,:], y = reduced[2,:], color = "Species", Geom.point)

# size(reduced) => (2,150)

(Figure: the data projected onto the first two principal components)

Conclusion

Julia has, or is actively developing, a wide range of statistical packages like R; it is as easy to write and learn as Python; and its execution speed is similar to C/C++. Julia’s core language is still evolving, although the basic language design has been stable for quite some time. Most data science tasks can now be done in Julia as easily as they can be done in Python or R.
