Six weeks in, we have category data support in Daru. In the coming weeks we will add support for category data to Statsample and StatsampleGLM.
Currently, Statsample and StatsampleGLM do not support regression with category data.
With the introduction of formula language I am looking to accomplish the following:
 To support regression with category data
 To provide the convenience of a formula language for creating regression models
In these two weeks I have implemented a formula language, but it is limited in certain ways. The work of the following weeks will fill this gap.
Let's talk about the formula language I have implemented in these two weeks.
Formula Language
The formula language I aim to implement is similar to the one used in R and Patsy.
With the work of these two weeks, the formula language has the following features:
 It supports 2-way interaction.
 It supports : and +.
 It supports inclusion/exclusion of the constant or intercept term.
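To make these features concrete, here is a minimal sketch of how a formula string could be split into a response, a list of terms and an intercept flag. It is not the parser used in Statsample: `parse_formula` and the keys it returns are hypothetical names, and it assumes the Patsy-style convention that a literal 0 term drops the intercept.

```ruby
# Hypothetical helper, not part of Statsample: splits a formula such as
# 'y ~ a + a:b + c + c:d' into its response, terms and intercept flag.
# Assumes only '+' and ':' appear on the right-hand side, and that a
# literal 0 term removes the intercept while 1 (or nothing) keeps it.
def parse_formula(formula)
  response, rhs = formula.split('~', 2).map(&:strip)
  raise ArgumentError, "formula needs a '~'" if rhs.nil?

  tokens = rhs.split('+').map(&:strip).reject(&:empty?)
  intercept = !tokens.include?('0')

  # Each remaining token is a term; ':' separates the factors of an interaction.
  terms = (tokens - %w[0 1]).map { |t| t.split(':').map(&:strip) }.uniq
  if terms.any? { |t| t.size > 2 }
    raise ArgumentError, 'only 2-way interactions are handled in this sketch'
  end

  { response: response, terms: terms, intercept: intercept }
end

p parse_formula('y ~ a + a:b + c + c:d')
p parse_formula('y ~ 0 + a:x + a:b')
```

Representing every term as an array of factor names keeps main effects and interactions uniform: a main effect is simply a term with one factor.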
Since I have followed the Patsy way of implementing the formula language, it has an edge over R. Patsy has a more accurate algorithm for deciding whether to use a full- or reduced-rank coding scheme for categorical factors, and the same is inherited in Statsample and StatsampleGLM.
R can sometimes give an underspecified model, but this is not the case with our implementation. One example is the expansion of 0 + a:x + a:b, where x is numeric. More information about this can be found here.
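The difference between the two coding schemes can be seen on a single categorical column. The sketch below is illustrative only (the `dummy_code` helper is hypothetical, not the Statsample or Patsy implementation): full-rank coding emits one indicator column per level, while reduced-rank coding drops the first level as a reference.

```ruby
# Hypothetical helper for illustration; not the Statsample or Patsy code.
# Returns a hash of indicator (0/1) columns for a categorical column.
# full: true  -> one column per level (full-rank coding)
# full: false -> the first level is dropped as the reference (reduced-rank)
def dummy_code(name, values, full:)
  levels = values.uniq.sort
  levels = levels.drop(1) unless full
  levels.each_with_object({}) do |level, columns|
    columns["#{name}[#{level}]"] = values.map { |v| v == level ? 1 : 0 }
  end
end

c = %w[x y z y x]
p dummy_code('c', c, full: true)
p dummy_code('c', c, full: false)
```

With an intercept in the model, the full-rank columns would sum to the intercept column, so the reduced form is appropriate. Without an intercept or another term covering the factor, dropping a level leaves the model underspecified, which is why the choice has to be made term by term.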
I am thankful to the Patsy project: its documentation provides all the details, which made my work much easier. Without it I would have fallen into many pitfalls.
Now let's see the formula language in action in Statsample and StatsampleGLM.
Regression in StatsampleGLM
Regression in StatsampleGLM has become an easy task and in addition it now supports category data as predictor variables.
Let's see this with an example.
Let's assume a dataframe df with numeric columns a, b and category columns c, d, e.
Let's create a logistic model with predictors a, a*b, c and c:d.
Earlier, we would have done the following. Since we could not code category variables, let's leave out c and c:d.
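As a rough sketch of the manual step this involved, the product term has to be computed by hand and added as its own column before fitting. Plain Ruby arrays stand in for dataframe columns here, the data is made up, and `product_column` is a hypothetical helper, not a Daru or StatsampleGLM method.

```ruby
# Hypothetical helper: element-wise product of two numeric columns,
# i.e. the a*b predictor that had to be built by hand.
def product_column(a, b)
  raise ArgumentError, 'columns differ in length' unless a.size == b.size

  a.zip(b).map { |x, y| x * y }
end

# Plain hash of arrays standing in for a dataframe (illustrative data).
df = { a: [1, 2, 3, 4], b: [0.5, 1.0, 1.5, 2.0], c: %w[x y x z] }

# Keep only what the old workflow could handle: a and the hand-made a*b.
# The category column c is left out because it could not be coded.
predictors = { a: df[:a], ab: product_column(df[:a], df[:b]) }
p predictors
```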

Now, with the introduction of the formula language, this has become a very easy task, with no work required to preprocess the dataframe.

The above code not only enables predictions with category data but also reflects the power of the formula language.
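To illustrate what a term such as c:d asks for, the sketch below builds one indicator column for each combination of levels of two category columns. It is a simplified illustration with a hypothetical helper (`interaction_columns`) and made-up data, and it ignores the full versus reduced-rank decision that the real implementation makes.

```ruby
# Hypothetical helper, for illustration only: one 0/1 column per
# combination of levels of two category columns (full-rank coding of c:d).
def interaction_columns(name_c, c, name_d, d)
  raise ArgumentError, 'columns differ in length' unless c.size == d.size

  pairs = c.uniq.sort.product(d.uniq.sort)
  pairs.each_with_object({}) do |(lc, ld), columns|
    key = "#{name_c}[#{lc}]:#{name_d}[#{ld}]"
    columns[key] = c.zip(d).map { |vc, vd| vc == lc && vd == ld ? 1 : 0 }
  end
end

c = %w[x y x y]
d = %w[p p q q]
interaction_columns('c', c, 'd', d).each { |key, col| puts "#{key}: #{col}" }
```

Each row has a 1 in exactly one of these columns, so together they encode which (c, d) cell the observation falls into.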
Here’s a notebook that describes the use of formula language in StatsampleGLM using real life data.
Let's have a look at Statsample now.
Statsample
With Statsample, it's the same. One can now perform multiple regression using the formula language, with category variables as predictors.

This will give a multiple linear regression model.
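For intuition about what fitting such a model involves once the design matrix exists, here is a minimal ordinary least squares sketch using Ruby's `matrix` library and the normal equations. It is not how Statsample computes its fit; the `ols` helper and the toy data are assumptions made for illustration.

```ruby
require 'matrix'

# Hypothetical helper: ordinary least squares via the normal equations,
# beta = (X'X)^-1 X'y. Fine for a toy example, not numerically robust.
def ols(rows, y)
  x = Matrix.rows(rows)
  xt = x.transpose
  ((xt * x).inverse * xt * Vector.elements(y)).to_a.map(&:to_f)
end

# Design matrix columns: intercept, numeric a, indicator for one level of c.
rows = [[1, 0, 0], [1, 1, 1], [1, 2, 0], [1, 3, 1]]
y    = [1, 6, 5, 10] # generated as 1 + 2*a + 3*indicator

p ols(rows, y)
```

The indicator column is exactly the kind of column that reduced-rank coding of a two-level category variable produces, so a category predictor enters the fit like any other numeric column.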
Conclusion
The introduction of formula language and ability to handle category data has given a great boost to Data Analysis in Ruby and I really hope we keep improving it further and further.
In the coming weeks I look forward to implementing the following:
 Add support for more than 2-way interactions
 Support for shortcut symbols ‘*’, ‘/’, etc.