GSoC 2016 Progress

Categorical Data Support [SciRuby]

Work Done This GSoC

Summary

It was a great summer. My project was to add categorical data support to Daru and Statsample. This is my GSoC project page. I’m happy to say that I met all of my goals and achieved much more.

In summary, I implemented the following this summer:

  • Support for categorical data in Daru
  • Support to visualize categorical data using Nyaplot, GnuplotRB and Gruff
  • Support for categorical data with formula language in Statsample-GLM

The code

The following are the main pull requests regarding my project:

  • Daru#134 It does the following:
    • CategoricalIndex class for handling categorical index
    • Category module to add categorical data support in Daru::Vector and Daru::DataFrame
    • Visualization support for categorical data
  • Statsample-GLM#31
    • Added Formula language support
    • Categorical data support in regression
  • Statsample#51 It implements formula language and categorical data support for regression in Statsample. This is unmerged because we are not sure whether we should remove the linear regression support from Statsample or not. See here. We will end up either merging this pull request or moving the linear regression from here to Statsample-GLM.
  • Daru#208 It does the following:
    • Implements missing value support for categorical data
    • Improves the missing values API to make it simple and improve performance of Daru update operations

Here are other pull requests not necessarily related to the project.

Now I will talk in detail about the work in these pull requests:

Daru#134

This was my major work during weeks 1 to 6.

You can find every detail of my work, such as what exactly I implemented, why I made certain decisions, and how to use it, in the following posts:

This PR has been merged.

Statsample-GLM#31

The following posts discuss my work in detail:

This PR is merged.
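
To give a rough idea of what categorical data support in regression involves: a categorical predictor is usually expanded into 0/1 indicator (dummy) columns, with one category left out as the baseline. The following is a plain-Ruby sketch of that idea only. It does not use Daru or Statsample-GLM and is not how the pull request implements it; the helper name `dummy_code` and the sample data are hypothetical.

```ruby
# Hypothetical helper, for illustration only (not part of Daru or Statsample-GLM).
# Expands a categorical column into 0/1 indicator columns.
# The first category seen is treated as the baseline and gets no column.
def dummy_code(values)
  values.uniq.drop(1).each_with_object({}) do |category, columns|
    columns[category] = values.map { |v| v == category ? 1 : 0 }
  end
end

# Made-up sample data: "red" appears first, so it becomes the baseline.
color = %w[red green red blue green]

dummy_code(color).each do |category, column|
  puts "#{category}: #{column.inspect}"
end
```

Dropping one category avoids a redundant column when the regression model also has an intercept term.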

Statsample#51

This pull request is currently unmerged. It implements the same functionality as the above pull request does for Statsample-GLM.

Earlier, our plan was to implement support for categorical data in both Statsample and Statsample-GLM. However, linear regression is present in both gems, and since linear regression in Statsample performs better than the one in Statsample-GLM, we are looking to remove it from Statsample and move it to Statsample-GLM. More information is here.

So, we will be doing one of these two things:

  • Merge this pull request and do not remove linear regression from Statsample.
  • Or move linear regression from Statsample to Statsample-GLM.

Daru#208

This improves the current structure of the missing values API in Daru and introduces missing values support for categorical data. More information can be found here.

Installation

There are no special installation instructions to try the code. Installing the gems daru and statsample-glm is sufficient.

To install Daru, do:

gem install daru

One needs to install the latest master of Statsample-GLM to try my code, since no release of Statsample-GLM has been made since my code was merged.
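
One possible way to get the master branch is through Bundler. This is a minimal sketch of a Gemfile, assuming the repository lives at github.com/SciRuby/statsample-glm and that its default branch is named master; neither detail is stated in this post, so adjust as needed.

```ruby
# Gemfile (sketch; the repository URL and branch name are assumptions)
source 'https://rubygems.org'

gem 'daru'
gem 'statsample-glm',
    git: 'https://github.com/SciRuby/statsample-glm.git',
    branch: 'master'
```

Running `bundle install` in the directory containing this Gemfile would then fetch both gems.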

To try the respective code, one could use the notebooks linked with each PR above.
