Summary
It was a great summer. My project was to add categorical data support in Daru and Statsample. This is my GSoC project page. I’m happy to say that I met all of my goals and achieved much more.
To conclude, I implemented the following this summer:
 Support for categorical data in Daru
 Support to visualize categorical data using Nyaplot, GnuplotRB and Gruff
 Support for categorical data with formula language in StatsampleGLM
The code
The following are the main pull requests regarding my project:
 Daru#134
It does the following:
 CategoricalIndex class for handling a categorical index
 Category module to add categorical data support in Daru::Vector and Daru::DataFrame
 Visualization support for categorical data
 StatsampleGLM#31
 Added Formula language support
 Categorical data support in regression
 Statsample#51 It implements formula language and categorical data support for regression in Statsample. It is unmerged because we are not yet sure whether linear regression support should be removed from Statsample. See here. We will end up either merging this pull request or moving the linear regression from here to StatsampleGLM.
 Daru#208 It does the following:
 Implements missing value support for categorical data
 Improves the missing values API to make it simple and improve performance of Daru update operations
Here are other pull requests not necessarily related to the project.
Now I will talk in detail about the work in these pull requests:
Daru#134
This was my main work during weeks 1 to 6.
You can find every detail of my work, including what exactly I implemented, why I made certain decisions, and how to use it, in the following posts:
This PR has been merged.
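As a rough illustration of the idea behind categorical data and a categorical index, here is a minimal plain-Ruby sketch. It is not Daru's implementation: the class name TinyCategorical and its methods are hypothetical and exist only for this example. It stores each observation as a small integer code, keeps the list of categories separately, and can look up every position that holds a given category.

```ruby
# Hypothetical illustration only; this is not how Daru is implemented.
class TinyCategorical
  attr_reader :categories, :codes

  def initialize(values)
    # Distinct categories, in order of first appearance.
    @categories = values.uniq
    # Map each category to an integer code, then encode every observation.
    lookup = @categories.each_with_index.to_h
    @codes = values.map { |v| lookup[v] }
  end

  # Number of observations in each category.
  def frequencies
    counts = Hash.new(0)
    @codes.each { |c| counts[@categories[c]] += 1 }
    counts
  end

  # All positions holding the given category; one category can map to
  # many positions, which is the lookup a categorical index has to support.
  def positions(category)
    code = @categories.index(category)
    return [] if code.nil?

    @codes.each_index.select { |i| @codes[i] == code }
  end

  # Rebuild the original values from the codes.
  def to_a
    @codes.map { |c| @categories[c] }
  end
end

vec = TinyCategorical.new(%i[a b a c a])
p vec.categories    # [:a, :b, :c]
p vec.codes         # [0, 1, 0, 2, 0]
p vec.frequencies   # {a: 3, b: 1, c: 1} (shown as {:a=>3, ...} on older Rubies)
p vec.positions(:a) # [0, 2, 4]
```

The posts linked above describe the real API and the design decisions behind it.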
StatsampleGLM#31
The following posts discuss my work in detail:
This PR is merged.
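To give a feel for what categorical data support in regression involves, here is a minimal plain-Ruby sketch of dummy coding, one common way to turn a categorical variable into numeric predictors. It is an assumption-laden illustration, not the code from the pull request: the helper dummy_code and its reference: parameter are hypothetical. One category is taken as the reference and every other category gets its own 0/1 column, so a variable with k categories contributes k - 1 columns and stays non-collinear with the intercept.

```ruby
# Hypothetical helper for illustration; not part of StatsampleGLM.
# Returns a hash mapping each non-reference category to its 0/1 indicator column.
def dummy_code(values, reference: nil)
  categories = values.uniq
  reference ||= categories.first
  unless categories.include?(reference)
    raise ArgumentError, "unknown reference category: #{reference.inspect}"
  end

  (categories - [reference]).map do |category|
    [category, values.map { |v| v == category ? 1 : 0 }]
  end.to_h
end

income = %w[low high mid low]
p dummy_code(income)
# {"high"=>[0, 1, 0, 0], "mid"=>[0, 0, 1, 0]}
p dummy_code(income, reference: "high")
# {"low"=>[1, 0, 0, 1], "mid"=>[0, 0, 1, 0]}
```

The resulting columns can then be placed next to the numeric predictors in the design matrix of a regression.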
Statsample#51
This pull request is currently unmerged. It implements the same functionality as the above pull request does for StatsampleGLM.
Our earlier plan was to implement support for categorical data in both Statsample and StatsampleGLM. However, linear regression is also present in StatsampleGLM, and since linear regression in Statsample performs better than the one in StatsampleGLM, we are looking to remove linear regression from Statsample and move it to StatsampleGLM. More information is here.
So, we will do one of these two things:
 Merge this pull request and do not remove linear regression from Statsample.
 Or move linear regression from Statsample to StatsampleGLM.
Daru#208
This improves the structure of the missing values API in Daru and introduces missing value support for categorical data. More information can be found here.
Installation
There are no special installation instructions to try the code. Installing the gems daru and statsample-glm is sufficient.
To install Daru, do:
gem install daru

You need to install the latest master of StatsampleGLM to try my code, since no version of StatsampleGLM has been released since my code was merged.
You can also try the notebooks mentioned in the links for each PR to try the respective code.
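One way to get the latest master is to point Bundler at the Git repository. The Gemfile below is a sketch under assumptions: the repository URL and the branch name master are my best guess at the project's location and are not taken from this post, so adjust them if they differ.

```ruby
# Gemfile (sketch; the repository URL and branch below are assumptions)
source 'https://rubygems.org'

gem 'daru'
gem 'statsample-glm',
    git: 'https://github.com/SciRuby/statsample-glm.git',
    branch: 'master'
```

Running bundle install in the same directory then fetches both gems.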