GSoC 2016 Progress

Categorical Data Support [SciRuby]

Work Done This GSoC

Summary

It was a great summer. My project was to add categorical data support in Daru and Statsample.

This is my GSoC project page.

I’m happy to say that I implemented all of my goals and achieved much more.

In summary, I implemented the following this summer:

  • Support for categorical data in Daru
  • Support to visualize categorical data using Nyaplot, GnuplotRB and Gruff
  • Support for categorical data with formula language in Statsample-GLM

The code

The following are the main pull requests regarding my project:

  • Daru#134

    It does the following:

    • CategoricalIndex class for handling categorical index
    • Category module to add categorical data support in Daru::Vector and Daru::DataFrame
    • Visualization support for categorical data
  • Statsample-GLM#31
    • Added Formula language support
    • Categorical data support in regression
  • Statsample#51 It implements formula language and categorical data support for regression in Statsample. This is unmerged because we are not sure whether we should remove linear regression support from Statsample. See here. We will either end up merging this pull request or moving linear regression from Statsample to Statsample-GLM.
  • Daru#208 It does the following:
    • Implements missing value support for categorical data
    • Improves the missing values API to make it simple and improve performance of Daru update operations

Here are other pull requests not necessarily related to the project.

Now I will talk in detail about the work in these pull requests:

Daru#134

This was my major work during weeks 1 to 6.

You can find every detail of my work, such as what exactly I implemented, why I made certain decisions and how to use it, in the following posts:

This PR has been merged.

Statsample-GLM#31

The following posts discuss my work in detail:

This PR is about to be merged. We are just waiting for the new Daru to be released.

Statsample#51

This pull request is currently unmerged. It implements the same functionality as the above pull request does for Statsample-GLM.

Earlier, our plan was to implement support for categorical data in both Statsample and Statsample-GLM. However, linear regression is also present in Statsample-GLM, and since linear regression in Statsample performs better than the one in Statsample-GLM, we are looking to remove linear regression from Statsample and move it to Statsample-GLM. More information is here.

So, we will be doing one of these two things:

  • Merge this pull request and do not remove linear regression from Statsample.
  • Or move linear regression from Statsample to Statsample-GLM.

Daru#208

This improves the current structure of the missing values API in Daru and introduces missing values support for categorical data. More information can be found here.

Improve Missing Values API in Daru [Week 11-12]

The end of GSoC is near. I finished the formula language implementation a bit early and decided to devote the remaining time to some other important issues.

During these last two weeks I solved some issues in Daru and mainly worked on this issue regarding how missing values are handled in Daru.

The following were the shortcomings:

  • Update operations like #[]=, #set_at were slow.
  • Any value could be set as a missing value, which made the checks for missing values somewhat hard.

Now, Daru follows a simple approach of considering only nil and Float::NAN as missing values. Although one loses the flexibility of designating an arbitrary value as missing, this has greatly simplified many things and the improvement in performance is significant. Further, one can now simply use #replace_values to change the values one wants to treat as missing to nil.

In addition, updates have become very fast without compromising the caching of missing values. I accomplished this with the following strategy:

  • During updates, the cache that stores the positions of nil and Float::NAN becomes outdated and isn’t rebuilt until those positions are required.
  • When the positions of nil or Float::NAN are required by any of the missing value methods, they are returned directly if the cache is up to date; if the cache is outdated, it is rebuilt first.

This way one has the best of both worlds: updates remain fast and the caching of nil and Float::NAN positions is maintained.
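
The lazy-cache strategy described above can be illustrated with a small pure-Ruby sketch. This is not Daru's actual implementation: the class LazyMissingVector and its method names are hypothetical, and the sketch only tracks a single combined list of missing positions.

```ruby
# Hypothetical illustration of a lazily rebuilt missing-value cache.
# It is not Daru's code; names and structure are assumptions.
class LazyMissingVector
  def initialize(data)
    @data = data.to_a.dup
    @missing_positions = nil # nil marks the cache as outdated
  end

  # Updates only invalidate the cache, so they stay cheap.
  def []=(pos, value)
    @data[pos] = value
    @missing_positions = nil
  end

  def [](pos)
    @data[pos]
  end

  # The cache is rebuilt only when the positions are actually requested.
  def missing_positions
    @missing_positions ||= @data.each_index.select { |i| missing?(@data[i]) }
  end

  private

  def missing?(value)
    value.nil? || (value.is_a?(Float) && value.nan?)
  end
end

vec = LazyMissingVector.new([1, nil, 3.0, Float::NAN])
vec[0] = nil # cheap: no scan happens here
vec[1] = 2   # cheap: no scan happens here
p vec.missing_positions # one scan, triggered by the first query after the updates
```

Repeated calls to missing_positions without an intervening update reuse the cached array, which is the trade-off the two bullets describe.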

I ran the following benchmarks:

require 'benchmark'
require 'daru'

n = 100_000
dv = Daru::Vector.new 1..n

Benchmark.bm do |x|
  # Repeated single-element updates
  x.report do
    100.times { dv[0] = nil }
  end

  # Updates interleaved with queries that need the missing positions
  x.report do
    10.times do
      dv[0] = nil
      # Replace reject_values with only_valid when benchmarking the old API
      10.times { dv.reject_values nil }
    end
  end
end

And these are the results before and after:

# Before
       user     system      total        real
   1.840000   0.000000   1.840000 (  1.840055)
  15.080000   0.060000  15.140000 ( 15.462978)

# After
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000308)
  11.120000   0.160000  11.280000 ( 11.385459)

Here’s the summary of the old and new API regarding handling of missing values:

Methods added in Daru::Vector (and category):

  • reject_values
  • include_values?
  • indexes
  • count_values
  • replace_values

Methods added in Daru::DataFrame:

  • reject_values
  • include_values?
  • replace_values

Methods removed in Daru::Vector:

  • missing_values
  • missing_values=
  • update
  • exists?
  • set_missing_positions

Other methods, #has_missing_data? and #n_valid, have been deprecated.

Conclusion

  • As you can see, the performance of Daru’s update methods has improved substantially, and the effects will be far reaching, from improving other things in Daru to improving performance in Statsample and Statsample-GLM.
  • Along the way I learned how to use tools like ruby-prof to profile the code and understand where performance is lagging.
  • I noticed that methods like #[] are proving to be a bottleneck and that there is room to improve them.
  • Thanks to Victor for suggesting this change in Daru, proposing a good API and helping me all the way through implementing it.

Shortcut Symbols [Week 9-10]

With the work done in Week 9 and 10, Statsample-GLM now supports shortcut symbols in Formula Language.

With this addition, regression has become more R/Patsy-like and more convenient.

Symbols Added

There are two shortcut symbols now being supported:

  • *
  • /

a*b is a shortcut for a+b+a:b. This is commonly used in regression models.

a/b is a shortcut for a+a:b. It’s quite useful when dealing with nested categorical variables: a/b makes sense when b is nested inside a.

Brackets

Support for brackets has also been added, so one can form expressions that use brackets. For example, (a+b):c evaluates to a:c + b:c.

Symbols and brackets can be combined within a single formula. For example, (a+b)*(c+d) gives a+b+c+d+a:c+a:d+b:c+b:d.
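
To make the expansion rules concrete, here is a minimal pure-Ruby sketch that applies them to already-parsed operands. It is not the Statsample-GLM parser: the helper names (var, plus, interact, star, slash, show) are made up for illustration, and a formula is modelled simply as an array of terms, each term being an array of variable names.

```ruby
# Illustrative only: a formula is an array of terms,
# and a term is an array of variable names.
def var(name)
  [[name]]
end

# a + b: union of the terms on both sides
def plus(lhs, rhs)
  (lhs + rhs).uniq
end

# a : b: every term on the left interacts with every term on the right
def interact(lhs, rhs)
  lhs.product(rhs).map { |l, r| (l + r).uniq }.uniq
end

# a * b is a shortcut for a + b + a:b
def star(lhs, rhs)
  plus(plus(lhs, rhs), interact(lhs, rhs))
end

# a / b is a shortcut for a + a:b
def slash(lhs, rhs)
  plus(lhs, interact(lhs, rhs))
end

def show(formula)
  formula.map { |term| term.join(':') }.join('+')
end

a, b, c, d = %w[a b c d].map { |name| var(name) }
puts show(star(a, b))                   # a+b+a:b
puts show(slash(a, b))                  # a+a:b
puts show(interact(plus(a, b), c))      # a:c+b:c
puts show(star(plus(a, b), plus(c, d))) # a+b+c+d+a:c+a:d+b:c+b:d
```

Brackets correspond to nesting the helper calls, which is why (a+b)*(c+d) distributes over all four variables.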

Note

There are certain limitations to the current formula language:

  1. Since interactions beyond 2-way are not supported yet, formulas like a*b*c won’t work.
  2. There’s no mechanism to deal with cases such as a*a.

Formula Language in Statsample

Earlier, the plan was to implement the formula language in Statsample as well. However, Statsample supports only linear regression, which is also available in Statsample-GLM under the name Normal Regression. We are therefore planning not to implement the formula language in Statsample, but rather to remove linear regression support from Statsample if it doesn’t offer any advantage over Normal Regression in Statsample-GLM. For info, see here.

Example using shortcut symbols

> df = Daru::DataFrame.from_csv 'spec/data/df.csv'
=> #<Daru::DataFrame(14x6)>
             y      a      b      c      d      e
      0      0      6   62.1     no female      A
      1      1     18   34.7    yes   male      B
      2      1      6   29.7     no female      C
      3      0      4     71     no   male      C
      4      1      5   36.9    yes   male      B
      5      0     11   58.7     no female      B
      6      0      8   63.3     no   male      B
      7      1     21   20.4    yes   male      A
      8      1      2   20.5    yes   male      C
      9      0     11   59.2     no   male      B
     10      0      1   76.4    yes female      A
     11      0      8   71.7     no female      B
     12      1      2   77.5     no   male      C
     13      1      3   31.1     no   male      B
> df.to_category 'c', 'd', 'e'
> train = df.first 10
> test = df.last 4
> reg = Statsample::GLM::Regression.new 'a~b*c', train, :normal, algorithm: :mle
> reg.model.coefficients :hash
{:c_yes=>5.678447231711081,
 :b=>0.0007560417597709064,
 :"c_yes:b"=>-0.06481888635745593,
 :constant=>7.6233202721217825}
# Now let's obtain predictions from this model
> reg.predict test
=> #<Daru::Vector(4)>
0 8.407366176569727
1 7.677528466297357
2 7.681913508504028
3 7.646833170850658

Internal Structure of Formula Language

The formula language is built on three classes:

  • FormulaWrapper
  • Formula
  • Token

When a new model is created through Statsample::GLM::Regression#new, FormulaWrapper is called first.

The FormulaWrapper class does the necessary preprocessing. It mainly does two things:

  • It applies the shortcut symbols and reduces the expression to one containing only : and +.
  • After reducing to a simple expression containing only : and +, it groups the terms based on the numerical terms they interact with.

After FormulaWrapper has formed the groups, it processes each of them using the Formula class.

The Formula class takes each group and forms tokens that do not overlap; that is, when they are converted to a dataframe, that dataframe contains no redundancy.

The Token class stores the column names and can expand these columns when fed a dataframe.

Sounds confusing?

Let’s try an example:

Let’s say our expression is x*a + b*c, where x is a numerical vector and a, b and c are categorical.

  1. First, it is converted to a simple expression by FormulaWrapper: 1+x+a+x:a+b+c+b:c. Notice that the shortcut symbols have disappeared and only + and : remain.
  2. Now 1+x+a+x:a+b+c+b:c is split into two groups, [1, a, b, c, b:c] and [1, a]. The first group has 1 as its common numerical interaction term, while the second group has x.
  3. Both groups are then processed by Formula to produce a dataframe with full rank.
  4. The first group is parsed to 1+a(-)+b(-)+c(-)+b(-):c(-) by the Formula class. Here a(-) means that vector a is contrast coded to reduced rank, while a means it’s coded to full rank.
  5. The second group is parsed to x + x:a(-).
  6. In the end these terms are combined, and the resulting parsed expression is the sum of the two expressions above, i.e. 1+a(-)+b(-)+c(-)+b(-):c(-)+x+x:a(-).
  7. These terms are then expanded into dataframes by the Token class, and the dataframes are concatenated to form the final dataframe for the given expression.
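
Step 2 of the walkthrough above, the grouping by numerical interaction term, can be sketched in a few lines of pure Ruby. This is only an illustration under simplifying assumptions: the function group_by_numeric is hypothetical, it takes the already-simplified expression as a string, and the caller tells it which variables are numeric.

```ruby
# Hypothetical sketch of the grouping step; not FormulaWrapper's real code.
# Each term is split into its numeric part (the group key) and the rest.
def group_by_numeric(expression, numeric_vars)
  groups = Hash.new { |hash, key| hash[key] = [] }
  expression.split('+').each do |term|
    vars = term.split(':')
    numeric, categorical = vars.partition { |v| numeric_vars.include?(v) }
    categorical -= ['1'] # the constant carries no categorical part
    key = numeric.empty? ? '1' : numeric.join(':')
    groups[key] << (categorical.empty? ? '1' : categorical.join(':'))
  end
  groups
end

groups = group_by_numeric('1+x+a+x:a+b+c+b:c', ['x'])
groups.each { |numeric, terms| puts "#{numeric}: [#{terms.join(', ')}]" }
# 1: [1, a, b, c, b:c]
# x: [1, a]
```

The two printed groups match the ones in step 2: everything multiplied by the constant 1, and everything multiplied by x.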

Conclusion

We saw an overview of how the formula language works inside Statsample::GLM, and how shortcut symbols and brackets have made its usage much more convenient and powerful.

Formula Language Implementation [Week 7-8]

After the end of 6 weeks we have category data support in Daru. Now in the coming weeks we will be adding support for category data in Statsample and Statsample-GLM.

Currently, Statsample and Statsample-GLM do not support regression with category data.

With the introduction of formula language I am looking to accomplish the following:

  • To support regression with category data
  • To provide convenience of formula language to create regression models

In these two weeks I have implemented a formula language, but it is limited in certain ways. The work of the following weeks will fill this gap.

Let’s talk about the formula language I have implemented in these two weeks.

Formula Language

The formula language which I aim to implement is similar to that used in R and Patsy.

With the work of these two weeks, the formula language has the following features:

  • It supports 2-way interaction.
  • It supports : and +.
  • It supports inclusion/exclusion of the constant or intercept term.

And since I have followed the Patsy way of implementing the formula language, it has an edge over R. Patsy has a more accurate algorithm for deciding whether to use a full- or reduced-rank coding scheme for categorical factors, and the same is inherited in Statsample and Statsample-GLM.

R can sometimes give an under-specified model, but this is not the case with our implementation. One example is the expansion of 0 + a:x + a:b, where x is numeric. More information about this can be found here.
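
As a reminder of what full-rank and reduced-rank coding mean, here is a small pure-Ruby sketch of dummy coding. It is an illustration only: dummy_code is a hypothetical helper, it uses the first observed category as the reference level, and it does not reproduce the Patsy algorithm for choosing between the two schemes.

```ruby
# Hypothetical helper: dummy-code a categorical column.
# full_rank: true  -> one 0/1 column per category
# full_rank: false -> the first category is dropped as the reference level
def dummy_code(values, name, full_rank: false)
  categories = values.uniq
  categories = categories.drop(1) unless full_rank
  categories.map do |category|
    ["#{name}_#{category}", values.map { |v| v == category ? 1 : 0 }]
  end.to_h
end

c = %w[no yes no no yes]

reduced = dummy_code(c, 'c')
full    = dummy_code(c, 'c', full_rank: true)

p reduced # {"c_yes"=>[0, 1, 0, 0, 1]}
p full    # {"c_no"=>[1, 0, 1, 1, 0], "c_yes"=>[0, 1, 0, 0, 1]}

# With full-rank coding the columns add up to a column of ones,
# which duplicates the intercept if one is present in the model.
p full.values.transpose.map(&:sum) # [1, 1, 1, 1, 1]
```

This is why a factor is usually coded to reduced rank when the model already contains an intercept, and to full rank when nothing else spans that column.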

I am thankful to Patsy for it made my work very easy by providing all the details in their documentation. Without it I would have fallen into many pitfalls.

Now let’s see the formula language in action in Statsample and Statsample-GLM.

Regression in Statsample-GLM

Regression in Statsample-GLM has become an easy task and in addition it now supports category data as predictor variables.

Let’s see this with an example.

Let’s assume a dataframe df with numeric columns a and b, and category columns c, d and e.

Let’s create a logistic model with predictors a, a:b, c and c:d.

Earlier, we would have had to do the following.

Since we couldn’t code category variables, let’s leave out c and c:d.

> train['a:b'] = train['a'] * train['b']
> train = train['a', 'a:b', 'y']
> mod = Statsample::GLM.compute train, 'y', :logistic, constant: 1
> # Now let's obtain predictions
> test['a:b'] = test['a'] * test['b']
> test = test['a', 'a:b']
> mod.predict test
=> #<Daru::Vector(3)>
      0 0.9999
      1 0.0123
      2 0.5925

Now, with the introduction of the formula language, it has become a very easy task, with no work required to preprocess the dataframe.

> reg = Statsample::GLM::Regression.new 'y~a+a:b+c+c:d', train, :logistic
> reg.predict test
=> #<Daru::Vector(3)>
      0 0.2999
      1 0.1523
      2 0.8925

The above code not only enables predictions with category data but also shows how powerful the formula language is.

Here’s a notebook that describes the use of formula language in Statsample-GLM using real life data.

Let’s have a look at Statsample now.

Statsample

With Statsample, it’s the same. Now one can perform multiple regression with the formula language and category variables as predictors.

> reg = Statsample::FitModel.new 'y~a+a:b+c+c:d', train
> mod = reg.model

This will give a multiple linear regression model.

Conclusion

The introduction of formula language and ability to handle category data has given a great boost to Data Analysis in Ruby and I really hope we keep improving it further and further.

In the coming weeks I look forward to implementing the following:

  • Add more than 2-way interaction support
  • Support for shortcut symbols ‘*’, ‘/’, etc.

Visualization [Week 5-6]

During these two weeks I added visualization of categorical data, along with support for a new plotting library, Gruff.

Daru supports visualization via three libraries: Nyaplot, GnuplotRB and Gruff.

Let’s discuss them one by one.

Nyaplot

Nyaplot is the default plotting library for Daru. Nyaplot allows creation of a variety of plots with Daru easily. Its biggest strength lies in its ability to draw interactive plots.

Now Daru also supports categorical data visualization using Nyaplot. It mainly has two aspects:

  • For a category vector, it lets one view the frequencies of the categories in a bar graph.
  • For a dataframe containing a category vector, it lets one draw scatter and line plots grouped by that category vector, with the groups distinguished by shape, size and color.

Here are some examples of visualization of category data using Nyaplot in Daru.

GnuplotRB

GnuplotRB is another great library which has built-in support for the Daru data structures Daru::Vector and Daru::DataFrame. Though it doesn’t operate directly on vectors and dataframes but uses its own API, it provides out-of-the-box support for plotting Daru::Vector and Daru::DataFrame.

GnuplotRB’s strength lies in offering highly customized plots through a very simple API.

No work was needed to support categorical data visualization in GnuplotRB: it works out of the box, owing to its easy-to-use API that lets one combine multiple plots with a variety of features.

Here’s a notebook demonstrating ways in which category data can be visualized using GnuplotRB.

Gruff

Gruff is a new plotting library that has just been added to Daru. Gruff offers remarkably beautiful plots with very little effort. It also offers pie and sidebar plots, which the other two libraries currently don’t.

These two notebooks show examples related to plotting of Daru::Vector and Daru::DataFrame:

Choose from different libraries

To easily move between all these libraries, Daru has the following methods:

  • Daru.plotting_library
  • Daru::Vector#plotting_library
  • Daru::DataFrame#plotting_library

Daru.plotting_library can be used to set the current plotting library. For example, using Daru.plotting_library = :gruff one can switch the plotting library to Gruff. This means all plots created from then on will use Gruff.

In order to change the plotting library for only a specific vector, one can use Daru::Vector#plotting_library. For example, dv.plotting_library = :gruff will change the plotting library only for vector dv, and all other vectors will be created using the library set by Daru.plotting_library.

The same goes for dataframes: one can use df.plotting_library = :gruff to set the plotting library for dataframe df to Gruff.
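
The interplay between a global default and a per-object override can be sketched in plain Ruby. This is not Daru's code: TinyPlotConfig and TinyVector are hypothetical stand-ins, and the sketch assumes that an object picks up the global setting at creation time and can override it afterwards.

```ruby
# Hypothetical stand-in for a library-wide setting such as Daru.plotting_library.
module TinyPlotConfig
  class << self
    attr_writer :plotting_library

    # Nyaplot is the default, as in the text above.
    def plotting_library
      @plotting_library || :nyaplot
    end
  end
end

# Hypothetical stand-in for a vector with its own plotting_library setting.
class TinyVector
  attr_accessor :plotting_library

  def initialize
    # Assumption: the global setting is captured when the object is created.
    @plotting_library = TinyPlotConfig.plotting_library
  end
end

first = TinyVector.new                    # created under the default library
TinyPlotConfig.plotting_library = :gruff  # change the global setting
second = TinyVector.new                   # created under the new setting
first.plotting_library = :gnuplot         # override for one object only

puts first.plotting_library   # gnuplot
puts second.plotting_library  # gruff
```

The per-object writer touches only that object, so other vectors keep whatever library they were created with.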

Summary

Along with support for categorical data, Daru now also has the ability to visualize category data. I identified and addressed a few shortcomings of some of these libraries, and we at SciRuby are motivated to overcome the remaining ones and make visualization in Daru more complete.

Categorical Data [Week 3-4]

Daru now has the capability to store and process categorical data.

Daru now has three types of vector:

  • :object
  • :numeric
  • :category (new)

With the introduction of categorical data, Daru now has two benefits:

  1. Storage of categorical data is very efficient.
  2. Tasks related to categorical data have become a lot easier.

The reason for 1 is that an ordinary vector stores the data as a plain array and doesn’t take advantage of the fact that most of the entries are repeated.

Let’s discuss the various tasks related to categorical vectors which can now be done easily.

(The purpose of this blog post is to give an overview of what tasks can be accomplished with categorical data. To learn what each method does and how to use it, please look at this notebook.)

As soon as one declares a categorical variable, one can look at the frequency count of each category to get a feel for the data:

> rank = Daru::Vector.new ['III']*10 + ['II']*5 + ['I']*5, type: :category, categories: ['I', 'II', 'III']
> rank.frequencies
=> #<Daru::Vector(3)>
   I   5
  II   5
 III  10

One can look at the summary of the data to learn common figures about categorical data, such as how many categories are present, which is the most frequent category, etc.

> rank.summary
=> #<Daru::Vector(6)>
         size           20
   categories            3
     max_freq           10
 max_category          III
     min_freq            5
 min_category            I

It’s possible to convert a numerical variable into a categorical variable. For example, heights stores height measurements and we want to bin them into the categories low, medium and high:

> heights = Daru::Vector.new [30, 35, 32, 50, 42, 51]
> heights_cat = heights.cut [30, 40, 50, 60], labels: ['low', 'medium', 'high']
> heights_cat.frequencies
=> #<Daru::Vector(3)>
    low      3
 medium      1
   high      2
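
The binning idea behind cut can be sketched in pure Ruby. This is a simplified illustration, not Daru's implementation: bin_values is a hypothetical helper, and it assumes each interval includes its lower edge and excludes its upper edge, which is consistent with the frequencies shown above (50 falls into high).

```ruby
# Hypothetical helper: assign each value the label of the interval
# [edges[i], edges[i + 1]) it falls into, or nil if it falls outside.
def bin_values(values, edges, labels)
  values.map do |value|
    index = edges.each_cons(2).find_index { |low, high| value >= low && value < high }
    index && labels[index]
  end
end

heights = [30, 35, 32, 50, 42, 51]
binned  = bin_values(heights, [30, 40, 50, 60], %w[low medium high])

p binned       # ["low", "low", "low", "high", "medium", "high"]
p binned.tally # {"low"=>3, "high"=>2, "medium"=>1}
```

The tally reproduces the counts in the output above: three low, one medium and two high.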

Given a dataframe, it’s possible to extract rows based on the categories. It uses the same Arel-like query syntax as an ordinary vector. For example:

> df = Daru::DataFrame.new({
  id: [0, 1, 2, 3, 4, 5, 6, 7],
  grade: %w[A C B A C C B B],
  name: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

> df[:grade] = df[:grade].to_category ordered: true, categories: %w[A B C]

# Let's list entries with grade less than 'C'
> df.where df[:grade].lt('C')
=> #<Daru::DataFrame(5x3)>
       grade    id  name
     0     A     0     a
     2     B     2     c
     3     A     3     d
     6     B     6     g
     7     B     7     h

The benefit is that lt compares the categories using the order we set.

By defining a custom order of categories and setting ordered to true, one can sort the categories, find the min, max, etc. For example:

# Assuming df defined as above
# Let's rename the categories to show that lexical order is not followed
# while sorting with categorical data
> df[:grade].rename_categories 'A' => 'Good', 'B' => 'Average', 'C' => 'Bad'
> df
=> #<Daru::DataFrame(8x3)>
           grade      id    name
       0    Good       0       a
       1     Bad       1       b
       2 Average       2       c
       3    Good       3       d
       4     Bad       4       e
       5     Bad       5       f
       6 Average       6       g
       7 Average       7       h
> df.sort! [:grade]
=> #<Daru::DataFrame(8x3)>
           grade      id    name
       0    Good       0       a
       3    Good       3       d
       2 Average       2       c
       6 Average       6       g
       7 Average       7       h
       1     Bad       1       b
       4     Bad       4       e
       5     Bad       5       f
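
What an ordered category buys us can be shown with a small pure-Ruby sketch. It is illustrative only and does not use Daru: sort_positions and positions_below are hypothetical helpers that compare values by their rank in a user-supplied category order rather than lexically.

```ruby
# Hypothetical helpers: compare categories by their position in `order`.
def category_rank(order)
  order.each_with_index.to_h
end

# Positions of `values` sorted by category order (ties keep original order).
def sort_positions(values, order)
  rank = category_rank(order)
  values.each_index.sort_by { |i| [rank[values[i]], i] }
end

# Positions whose category comes strictly before `bound` in the order.
def positions_below(values, order, bound)
  rank = category_rank(order)
  values.each_index.select { |i| rank[values[i]] < rank[bound] }
end

order  = %w[Good Average Bad]
grades = %w[Good Bad Average Good Bad Bad Average Average]

p sort_positions(grades, order)         # [0, 3, 2, 6, 7, 1, 4, 5]
p positions_below(grades, order, 'Bad') # [0, 2, 3, 6, 7]
```

The first result matches the row order of the sorted dataframe above, and the second matches the rows returned by the lt query, even though Good, Average, Bad is not alphabetical.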

Example

Please have a look at this notebook, which describes the use of categorical data through an example.

Internal Structure

It’s similar to the internal structure of the categorical index.

To efficiently store the duplicates of categories and make retrieval possible in constant time, categorical data in Daru uses two data structures:

  • Hash-table: To map each category to positional values. It is represented as @cat_hash.
  • Array: To map each position to an integer which represents a category.

For example:

dv = Daru::Vector.new [:a, :b, :a, :b, :c], type: :category

For dv, the hash table and array would be:

@cat_hash = {a: [0, 2], b: [1, 3], c: [4]}

@array = [0, 1, 0, 1, 2]

The hash table helps us retrieve all instances which belong to a category in constant time.

Similarly, the array helps us retrieve the category of an instance in constant time.

And the reason to store integers in the array instead of the category names themselves is to avoid unnecessary use of space.
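
The two structures can be built in a few lines of pure Ruby. This is a minimal sketch rather than Daru's code: TinyCategoryStore and its method names are hypothetical, and the integer array is called codes here instead of @array.

```ruby
# Hypothetical sketch of the hash table plus integer array described above.
class TinyCategoryStore
  attr_reader :cat_hash, :codes

  def initialize(data)
    # category => positions where it occurs
    @cat_hash = {}
    data.each_with_index { |value, pos| (@cat_hash[value] ||= []) << pos }

    # position => small integer code of its category
    @categories = @cat_hash.keys
    code_of = @categories.each_with_index.to_h
    @codes = data.map { |value| code_of[value] }
  end

  # One hash lookup gives every position of a category.
  def positions_of(category)
    @cat_hash.fetch(category, [])
  end

  # Two array lookups give the category stored at a position.
  def category_at(pos)
    @categories[@codes[pos]]
  end
end

store = TinyCategoryStore.new([:a, :b, :a, :b, :c])
p store.cat_hash         # {:a=>[0, 2], :b=>[1, 3], :c=>[4]}
p store.codes            # [0, 1, 0, 1, 2]
p store.positions_of(:b) # [1, 3]
p store.category_at(4)   # :c
```

Each category name is stored once, however often it occurs, while the per-position storage is just an integer.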

Categorical Index [Week 1-2]

Daru just got a new capability: Categorical Index.

Now one can organize vectors and dataframes using an index that is categorical.

Daru now has 4 types of indexes to index data:

  • Daru::Index
  • Daru::MultiIndex
  • Daru::DateTimeIndex
  • Daru::CategoricalIndex (new)

Daru::Index is the usual index, where every value is unique.

Daru::MultiIndex is for indexing with more than one level.

Daru::DateTimeIndex is for indexing with dates. It’s a powerful means of analyzing time series data.

The new Daru::CategoricalIndex is helpful when data is indexed by a sparsely populated index, in which each unique index value is a category.

Please visit this link first to get a basic understanding of how indexing works in Daru::Vector, and this link for Daru::DataFrame.

Example

Let’s see an example.

(Alternatively, you can see this example in an IRuby notebook here.)

require 'daru'

require 'open-uri'
content = open('https://d37djvu3ytnwxt.cloudfront.net/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/WHO.csv')
df = Daru::DataFrame.from_csv content

df[0..5]

  => #<Daru::DataFrame(194x6)>
                 Country     Region Population    Under15     Over60 FertilityR
            0 Afghanista Eastern Me      29825      47.42       3.82        5.4
            1    Albania     Europe       3162      21.33      14.93       1.75
            2    Algeria     Africa      38482      27.42       7.17       2.83
            3    Andorra     Europe         78       15.2      22.86        nil
            4     Angola     Africa      20821      47.58       3.84        6.1
            5 Antigua an   Americas         89      25.96      12.35       2.12
            6  Argentina   Americas      41087      24.42      14.97        2.2
            7    Armenia     Europe       2969      20.34      14.06       1.74
            8  Australia Western Pa      23050      18.95      19.46       1.89
            9    Austria     Europe       8464      14.51      23.52       1.44
           10 Azerbaijan     Europe       9309      22.25       8.24       1.96
           11    Bahamas   Americas        372      21.62      11.24        1.9
           12    Bahrain Eastern Me       1318      20.16       3.38       2.12
           13 Bangladesh South-East     155000      30.57       6.89       2.24
           14   Barbados   Americas        283      18.99      15.78       1.84
          ...        ...        ...        ...        ...        ...        ...

The data is about countries. The region column describes the region that country belongs to. A region can have more than one country.

This is an ideal place to use Categorical Index if we want to study different regions.

> df.index = Daru::CategoricalIndex.new df['Region'].to_a

  #<Daru::CategoricalIndex(194): {Eastern Mediterranean, Europe, Africa, Europe, Africa, Americas, Americas, Europe, Western Pacific, Europe, Europe, Americas, Eastern Mediterranean, South-East Asia, Americas, Europe, Europe, Americas, Africa, South-East Asia ... Africa}>

Let’s see all the regions there are:

> df.index.categories
  ["Eastern Mediterranean", "Europe", "Africa", "Americas", "Western Pacific", "South-East Asia"]

Let’s find out how many countries lie in the Africa region:

> df.row['Africa'].size
  46

Finding out the mean life expectancy of Europe is as easy as:

> df.row['Europe']['LifeExpectancy'].mean
  76.73584905660377

Let’s see the minimum life expectancy of South-East Asia:

> df.row['South-East Asia']['LifeExpectancy'].min
  63

Internal architecture

To efficiently store the index and make retrieval possible in constant time, Daru::CategoricalIndex uses two data structures:

  • Hash-table: To map each category to positional values. It is represented as @cat_hash.
  • Array: To map each position to an integer which represents a category.

For example:

idx = Daru::CategoricalIndex.new [:a, :b, :a, :b, :c]

For idx, the hash table and array would be:

@cat_hash = {a: [0, 2], b: [1, 3], c: [4]}

@array = [0, 1, 0, 1, 2]

The hash table helps us retrieve all instances which belong to a category in constant time.

Similarly, the array helps us retrieve the category of an instance in constant time.

And the reason to store integers in the array instead of the category names themselves is to avoid unnecessary use of space.
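
To show how such a hash table serves row lookups like df.row['Africa'], here is a pure-Ruby sketch on toy data. It is not Daru::CategoricalIndex: TinyCategoricalIndex and its methods are hypothetical, and the labels and scores below are made-up values, not taken from the WHO dataset.

```ruby
# Hypothetical sketch of label-based lookup through a category => positions table.
class TinyCategoricalIndex
  def initialize(labels)
    @positions = Hash.new { |hash, key| hash[key] = [] }
    labels.each_with_index { |label, pos| @positions[label] << pos }
  end

  def categories
    @positions.keys
  end

  # All row positions carrying the given label.
  def pos(category)
    @positions.fetch(category, [])
  end
end

# Toy data for illustration only.
regions = %w[Europe Africa Europe Americas Africa]
scores  = [10, 20, 30, 40, 50]

index = TinyCategoricalIndex.new(regions)
africa_rows = index.pos('Africa')
africa_scores = scores.values_at(*africa_rows)

p index.categories                          # ["Europe", "Africa", "Americas"]
p africa_rows                               # [1, 4]
p africa_scores.sum.to_f / africa_scores.size # 35.0
```

Selecting every row of a category costs one hash lookup plus the size of the result, regardless of how many other rows the data has.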

When and why to use Categorical Index

Use a categorical index if you have a categorical variable, or data where there is more than one instance of the same object, and you want to index the dataframe by that column.

It will save you a lot of space and make access to the same category fast.

Also, if you want your dataframe to be indexed by a column in which not every entry is unique, categorical index will come to the rescue.