It was a great summer. My project was to add categorical data support in Daru and Statsample.
This is my GSoC project page.
I’m happy to say that I implemented all of my goals and achieved much more.
To conclude I implemented the following this summer:
The following are the main pull requests regarding my project:
It does the following:
CategoricalIndex class for handling categorical index
Category module to add categorical data support in Daru::Vector and Daru::DataFrame
Here are other pull requests not necessarily related to the project.
Now I will talk in detail about the work in these pull requests:
This was my main work during weeks 1 to 6.
You can find every detail of my work like what exactly I implemented, why I made certain decisions and how to use it in the following posts:
This PR has been merged.
The following posts discuss my work in detail:
This PR is about to get merged. Just waiting for the new Daru to be released.
This pull request is currently unmerged. It implements the same functionality as the above pull request does for StatsampleGLM.
Earlier, our plan was to implement support for categorical data in both Statsample and StatsampleGLM. However, linear regression is also present in StatsampleGLM, and since linear regression in Statsample performs better than the one in StatsampleGLM, we are looking to remove linear regression from Statsample and move it to StatsampleGLM. More information is here.
So, we will be doing one of these two things:
This improves the current structure of the missing values API in Daru and introduces missing values support for categorical data. More information can be found here.
During these last two weeks I solved some issues in Daru and mainly worked on this issue regarding how missing values are handled in Daru.
The following were the shortcomings:
#[]= and #set_at were slow.

Now, Daru follows a simple approach of only considering nil and Float::NAN as the missing values. Although one loses the flexibility of assigning an arbitrary value as missing, this has greatly simplified many things, and the improvement in performance is significant. Further, one can now simply use #replace to change the values one wants to treat as missing to nil.
In addition to that, updates have become blazingly fast without compromising the caching of missing values. I accomplished this with the following strategy:
On every update, the cache of positions of nil and Float::NAN gets outdated and doesn’t get updated until we require those positions.
When the positions of nil or Float::NAN are required by any of the missing value methods, they are returned if the cache isn’t outdated; if the cache is outdated, it is rebuilt first.

This way one has the best of both worlds. The updates remain fast and the caching of nil and Float::NAN is also maintained.
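The strategy described above can be sketched in plain Ruby. This is only a minimal illustration under assumptions, not Daru's actual implementation: the class name LazyMissingVector and its methods are hypothetical.

```ruby
# Hypothetical illustration of a lazily rebuilt cache of missing positions.
class LazyMissingVector
  def initialize(data)
    @data = data.dup
    @missing_positions = nil # nil means the cache is outdated
  end

  # Updates only touch the data and mark the cache as outdated,
  # so they stay cheap no matter how many are performed in a row.
  def []=(pos, value)
    @data[pos] = value
    @missing_positions = nil
  end

  # The cache is rebuilt only when the positions are actually needed.
  def missing_positions
    @missing_positions ||= @data.each_index.select { |i| missing?(@data[i]) }
  end

  private

  # Only nil and Float::NAN count as missing.
  def missing?(value)
    value.nil? || (value.is_a?(Float) && value.nan?)
  end
end

vec = LazyMissingVector.new([1, nil, 3.0, Float::NAN])
vec.missing_positions # => [1, 3]
vec[0] = nil          # no scan happens here
vec.missing_positions # => [0, 1, 3]
```

A run of many updates followed by a single query pays for only one scan of the data.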
I ran the following benchmarks:

And these are the results before and after:

Here’s the summary of the old and new API regarding handling of missing values:
Methods added in Daru::Vector (and category):
Methods added in Daru::DataFrame:
Methods removed in Daru::Vector:
and other methods, #has_missing_data? and #n_valid, have been deprecated.
rubyprof to benchmark the code and understand where the performance is lagging.
#[] are proving to be a bottleneck, and there is room for improving them.

With this addition, the regression has become more R/Patsy-like and more convenient.
There are two shortcut symbols now being supported: * and /.
a*b is a shortcut for a+b+a:b. This is commonly used within regression models.
a/b is a shortcut for a+a:b. It’s quite useful when dealing with nested categorical variables. a/b makes sense when b is nested inside a.
This week bracket support has been added, so one can form expressions involving brackets. For example, (a+b):c would evaluate to a:c + b:c.
It supports any level of sophistication with symbols and brackets. For example, (a+b)*(c+d) would give a+b+c+d+a:c+a:d+b:c+b:d.
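The expansion rules above can be sketched in a few lines of plain Ruby. This is a toy model under assumptions, not the parser used in StatsampleGLM: formulas are built by hand as arrays of terms rather than parsed from strings, and the helper names interact, star, slash and to_formula are hypothetical.

```ruby
# A formula is an array of terms; a term is an array of variable names.
# [['a'], ['b']] stands for a+b and [['a', 'b']] stands for a:b.

# Interaction (:) distributes over +, so (a+b):c becomes a:c + b:c.
def interact(left, right)
  left.product(right).map { |l, r| l + r }
end

# a*b is a shortcut for a+b+a:b.
def star(left, right)
  left + right + interact(left, right)
end

# a/b is a shortcut for a+a:b.
def slash(left, right)
  left + interact(left, right)
end

def to_formula(terms)
  terms.map { |term| term.join(':') }.join('+')
end

a, b, c, d = %w[a b c d].map { |name| [[name]] }

puts to_formula(star(a, b))         # a+b+a:b
puts to_formula(slash(a, b))        # a+a:b
puts to_formula(interact(a + b, c)) # a:c+b:c
puts to_formula(star(a + b, c + d)) # a+b+c+d+a:c+a:d+b:c+b:d
```

Here Ruby's array + plays the role of the formula's +, which is why brackets need no special handling in this sketch.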
However, the current formula language has certain limitations:
a*b*c wouldn’t work.
a*a.

Earlier, the plan was to implement the formula language in Statsample as well. But Statsample supports only linear regression, which StatsampleGLM also provides under the name Normal Regression, so we are planning not to implement the formula language in Statsample and instead to remove linear regression support from Statsample if it doesn’t offer any advantage over Normal Regression in StatsampleGLM. For more information, see here.

There are three classes by which formula language works:
FormulaWrapper
Formula
Token
When creation of a new model is invoked by Statsample::GLM::Regression#new, FormulaWrapper is called first.
The FormulaWrapper class does the necessary preprocessing. It mainly does two things:
It reduces the expression so that it contains only : and +.
Once only : and + remain, it groups terms based on the numerical terms they are interacting with.

After FormulaWrapper has formed the groups, it processes each of these groups using the Formula class.
The Formula class takes each group and forms tokens which do not overlap; that is, if they are converted to a dataframe, that dataframe won’t contain redundancy.
The Token class stores the column names and can expand these columns when fed a dataframe.
Sounds confusing?
Let’s try an example:
Let’s say our expression is x*a + b*c, where x is a numerical vector and a, b and c are categorical.
The expression is first handled by FormulaWrapper. It will be simplified to 1+x+a+x:a+b+c+b:c. Notice that the shortcut symbols have disappeared and only + and : remain.
1+x+a+x:a+b+c+b:c is grouped into two groups, [1, a, b, c, b:c] and [1, a]. The first group has 1 as its common numerical interaction term, while the second group has x.
Each group is passed to Formula to produce a dataframe with full rank.
The first group is turned into 1+a()+b()+c()+b():c() by the Formula class. a() implies that vector a is contrast coded to reduced rank, while a implies it’s coded to full rank.
The second group becomes x + x:a().
The final expression is 1+a()+b()+c()+b():c()+x+x:a().
Each term is expanded into a dataframe by the Token class, and these dataframes are concatenated to form the final dataframe for the given expression.

We saw an overview of how the formula language works inside Statsample::GLM. Shortcut symbols together with brackets have made its usage much more convenient and powerful.
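The grouping step from the x*a + b*c example can be sketched as follows. This is a plain Ruby illustration under assumptions, not the FormulaWrapper code: terms are written by hand as arrays of variable names, the intercept 1 is an empty term, and the helpers group_by_numeric and show are hypothetical.

```ruby
# Terms of the simplified expression 1+x+a+x:a+b+c+b:c.
# The intercept 1 is represented by an empty term.
terms   = [[], ['x'], ['a'], ['x', 'a'], ['b'], ['c'], ['b', 'c']]
numeric = ['x'] # x is the only numerical variable in the example

# Group terms by the numerical variables they interact with.
def group_by_numeric(terms, numeric)
  groups = Hash.new { |hash, key| hash[key] = [] }
  terms.each do |term|
    num_part = term & numeric # numerical variables in this term
    cat_part = term - numeric # what is left after factoring them out
    groups[num_part] << cat_part
  end
  groups
end

def show(term)
  term.empty? ? '1' : term.join(':')
end

group_by_numeric(terms, numeric).each do |num_part, cat_terms|
  puts "#{show(num_part)} => [#{cat_terms.map { |t| show(t) }.join(', ')}]"
end
# 1 => [1, a, b, c, b:c]
# x => [1, a]
```

The two printed groups match the ones in the walkthrough: the terms interacting with 1 and the terms interacting with x.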
Currently, Statsample and StatsampleGLM do not support regression with category data.
With the introduction of formula language I am looking to accomplish the following:
In these two weeks I have implemented a formula language, but it is limited in certain ways. The work of the following weeks will fill this gap.
Let’s talk about the formula language I have implemented in these two weeks.
The formula language which I aim to implement is similar to that used within R and Patsy.
With the work of these two weeks, the formula language has the following features:
It supports : and +.

Since I have followed the Patsy way of implementing the formula language, it has an edge over R: Patsy has a more accurate algorithm for deciding whether to use a full- or reduced-rank coding scheme for categorical factors, and the same is inherited in Statsample and StatsampleGLM.
R can sometimes give an underspecified model, but this is not the case with our implementation. One example is the expansion of 0 + a:x + a:b, where x is numeric. More information about this can be found here.
I am thankful to Patsy for it made my work very easy by providing all the details in their documentation. Without it I would have fallen into many pitfalls.
Now let’s see the formula language in action in Statsample and StatsampleGLM.
Regression in StatsampleGLM has become an easy task and in addition it now supports category data as predictor variables.
Let’s see this with an example.
Let’s assume a dataframe df with numeric columns a and b, and category columns c, d and e.
Let’s create a logistic model with predictors a, a*b, c and c:d.
Earlier, we would have done the following. Since we couldn’t code category variables, let’s leave out c and c:d.

Now, with the introduction of the formula language, it has become a very easy task with no work required to preprocess the dataframe.

The above code not only enables predictions with category data but also reflects the powerful formula language.
Here’s a notebook that describes the use of formula language in StatsampleGLM using real life data.
Let’s have a look at Statsample now.
With Statsample, it’s the same. Now one can perform multiple regression with formula language and category variables as predictors.

This will give a multiple linear regression model.
The introduction of formula language and ability to handle category data has given a great boost to Data Analysis in Ruby and I really hope we keep improving it further and further.
In the coming weeks I look forward to implementing the following:
Daru supports visualization via three libraries:
Let’s discuss them one by one.
Nyaplot is the default plotting library for Daru. Nyaplot allows easy creation of a variety of plots with Daru. Its biggest strength lies in its ability to draw interactive plots.
Now Daru also supports categorical data visualization using Nyaplot. It mainly has two aspects:
Here are some examples of visualization of category data using Nyaplot in Daru.
GnuplotRB is another great library which has inbuilt support for the Daru data structures Daru::Vector and Daru::DataFrame. Though it doesn’t operate directly on vectors and dataframes but uses its own API, it provides out-of-the-box support for plotting Daru::Vector and Daru::DataFrame.
GnuplotRB’s strength lies in offering highly customized plots with a very simple-to-use API.
No work was done to support categorical data visualization in GnuplotRB, because it supports it out of the box owing to its easy-to-use API, which lets one draw multiple plots with a variety of features.
Here’s a notebook demonstrating ways in which category data can be visualized using GnuplotRB.
Gruff is a new plotting library that has just been added to Daru. Gruff offers remarkably beautiful plots with very little effort. It also offers pie and sidebar plots, which the other two libraries currently don’t offer.
These two notebooks show examples related to plotting of Daru::Vector and Daru::DataFrame:
To easily move between all these libraries, Daru has the following functions:
Daru.plotting_library
Daru::Vector#plotting_library
Daru::DataFrame#plotting_library
Daru.plotting_library can be used to set the current plotting library. For example, using Daru.plotting_library = :gruff one can switch the plotting library to Gruff. This means all the plots created hereafter will use Gruff for plotting.
In order to change the plotting library for only a specific vector, one can use Daru::Vector#plotting_library. For example, dv.plotting_library = :gruff will only change the plotting library for vector dv, and all other vectors will be created using the library set by Daru.plotting_library.
The same goes for dataframes: one can use df.plotting_library = :gruff to set the plotting library for data frame df to Gruff.
Along with support for categorical data, Daru now also has the ability to visualize category data. I found and addressed a few shortcomings of some of these libraries, and we at SciRuby are motivated to overcome those shortcomings and make visualization in Daru more complete.
Daru now has three types of vector:
:object
:numeric
:category (new)
With the introduction of categorical data, Daru now has two benefits:
The reason for 1 is that in an ordinary vector the data is stored as an array, which doesn’t take into account the fact that most of the entries are the same.
Let’s discuss the various tasks related to categorical vectors which can now be done easily.
(The purpose of this blog is to give an overview of what tasks can be accomplished with categorical data. To learn what each method does and how to use it, please look at this notebook.)
As soon as one declares a categorical variable, one can look at the frequency count of each category to get a sense of the data:

One can look over the summary of the data to get to know common numbers about categorical data, like how many categories are present, which is the most frequent category, etc.

It’s possible to convert a numerical variable into a categorical variable. For example, heights stores measures of heights and we want to categorize them into the categories low, medium and high:
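The idea of turning numbers into categories can be sketched in plain Ruby. This is only an illustration under assumptions and does not use Daru's own method for this task: the heights, the cut points and the helper name categorize are all hypothetical.

```ruby
# Hypothetical heights (in cm) and cut points, chosen only for illustration.
heights = [150, 162, 171, 185, 158, 176]
edges   = [0, 160, 175, Float::INFINITY]
labels  = %i[low medium high]

def categorize(values, edges, labels)
  values.map do |value|
    # Index of the first half-open interval [lower, upper) containing the value.
    bin = edges.each_cons(2).find_index { |lower, upper| value >= lower && value < upper }
    labels.fetch(bin)
  end
end

categorize(heights, edges, labels)
# => [:low, :medium, :medium, :high, :low, :high]
```

Each pair of neighbouring edges defines one category, so three labels need four edges.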

Given a dataframe, it’s possible to extract rows based on the categories. It uses the same Arel-like query syntax as an ordinary vector. For example:

The benefit is that we used lt based on the order we set.
By defining a custom order of categories and setting ordered to true, one can sort the categories, find the min, max, etc. For example:
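How a custom order can drive sorting, min, max and comparisons like lt is sketched below in plain Ruby. This is an illustration under assumptions, not Daru's implementation: the class OrderedCategories and its methods are hypothetical stand-ins.

```ruby
# Hypothetical stand-in for an ordered categorical vector.
class OrderedCategories
  def initialize(data, order)
    @data = data
    # Rank of each category is its position in the custom order.
    @rank = order.each_with_index.to_h
  end

  def sort
    @data.sort_by { |value| @rank.fetch(value) }
  end

  def min
    @data.min_by { |value| @rank.fetch(value) }
  end

  def max
    @data.max_by { |value| @rank.fetch(value) }
  end

  # Entries that come strictly before `category` in the custom order.
  def lt(category)
    limit = @rank.fetch(category)
    @data.select { |value| @rank.fetch(value) < limit }
  end
end

sizes = OrderedCategories.new(%i[medium high low medium low], %i[low medium high])
sizes.sort      # => [:low, :low, :medium, :medium, :high]
sizes.min       # => :low
sizes.max       # => :high
sizes.lt(:high) # => [:medium, :low, :medium, :low]
```

Without the explicit order, the symbols would sort alphabetically, which would put high before low.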

Please have a look at this notebook, which describes the use of categorical data through an example.
The internal structure is similar to that of the categorical index.
To efficiently store the duplicates of categories and make retrieval possible in constant time, categorical data in Daru uses two data structures: a hash table, @cat_hash, and an array.
For example:

For dv, the hash table and array would be:

The hash table helps us retrieve all instances which belong to a category in constant time.
Similarly, the array helps us retrieve the category of an instance in constant time.
The reason for storing integers in the array instead of the category names themselves is to avoid unnecessary use of space.
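The two structures can be sketched in plain Ruby as follows. This is a minimal illustration under assumptions, not Daru's actual code: the helper build_category_store and the sample data are hypothetical.

```ruby
# Hypothetical sketch of the two structures behind a categorical vector.
def build_category_store(data)
  categories = data.uniq
  code_of    = categories.each_with_index.to_h

  # Array: one small integer code per entry instead of the category itself.
  codes = data.map { |value| code_of.fetch(value) }

  # Hash table: category => positions at which it occurs.
  cat_hash = Hash.new { |hash, key| hash[key] = [] }
  data.each_with_index { |value, pos| cat_hash[value] << pos }

  [categories, codes, cat_hash]
end

categories, codes, cat_hash = build_category_store([:a, 1, :a, 1, :c])

codes                # => [0, 1, 0, 1, 2]
cat_hash[:a]         # all positions of :a           => [0, 2]
cat_hash[1]          # all positions of 1            => [1, 3]
categories[codes[4]] # category of the entry at position 4 => :c
```

Both lookups are a single hash or array access, regardless of how long the vector is.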
Now one can organize vectors and dataframes using an index that is categorical.
Daru now has 4 types of indexes:
Daru::Index is the usual index, where every value is unique.
Daru::MultiIndex is for indexing with more than one level.
Daru::DateTimeIndex is for indexing with dates. It’s a powerful means to analyze time series data.
The new Daru::CategoricalIndex is helpful for data indexed by a sparsely populated index, with each unique index value as a category.
Please visit this link first to get a basic understanding of how indexing works in Daru::Vector, and this link for Daru::DataFrame.
Let’s see an example.
(Alternatively you can also see this example in iRuby notebook here)

The data is about countries. The region column describes the region that a country belongs to. A region can have more than one country.
This is an ideal place to use a categorical index if we want to study different regions.

Let’s see all regions there are:
Let’s find out how many countries lie in the Africa region.

Finding out the mean life expectancy of Europe is as easy as

Let’s see the maximum life expectancy of SouthEast Asia

To efficiently store the index and make retrieval possible in constant time, Daru::CategoricalIndex uses two data structures: a hash table, @cat_hash, and an array.
For example:

For idx, the hash table and array would be:

The hash table helps us retrieve all instances which belong to a category in constant time.
Similarly, the array helps us retrieve the category of an instance in constant time.
The reason for storing integers in the array instead of the category names themselves is to avoid unnecessary use of space.
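To see why the hash table makes per-category queries cheap, here is a plain Ruby sketch. It is an illustration under assumptions, not the Daru::CategoricalIndex implementation: the region names and life expectancy numbers are made up, and the helpers positions_by_category and mean_for are hypothetical.

```ruby
# Hypothetical data; region names and numbers are made up for illustration.
regions         = [:africa, :europe, :africa, :asia, :europe]
life_expectancy = [60.0, 80.0, 62.0, 75.0, 82.0]

# category => positions, built once when the index is created
def positions_by_category(index)
  cat_hash = Hash.new { |hash, key| hash[key] = [] }
  index.each_with_index { |category, pos| cat_hash[category] << pos }
  cat_hash
end

def mean_for(category, cat_hash, values)
  positions = cat_hash.fetch(category) # one hash lookup, no scan of the index
  positions.sum { |pos| values[pos] } / positions.size
end

cat_hash = positions_by_category(regions)
cat_hash[:africa].size                       # countries in Africa => 2
mean_for(:europe, cat_hash, life_expectancy) # => 81.0
```

Only the rows of the requested category are touched when computing the aggregate.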
Use a categorical index if you have a categorical variable, or data with more than one instance of the same object, and you want to index the dataframe by that column.
It will save you a lot of space and make access to entries of the same category fast.
Also, if you want your dataframe to be indexed by a column in which not every entry is unique, a categorical index will come to the rescue.