After six weeks of work, Daru now supports category data. In the coming weeks we will add support for category data to Statsample and Statsample-GLM.
Currently, Statsample and Statsample-GLM do not support regression with category data.
With the introduction of a formula language I am looking to accomplish the following:
- To support regression with category data
- To provide the convenience of a formula language for creating regression models
In these two weeks I have implemented a formula language, but it is limited in certain ways. The work of the following weeks will fill these gaps.
Let's talk about the formula language I have implemented in these two weeks.
The formula language I aim to implement is similar to the one used in R and Patsy.
With the work of these two weeks, the formula language has the following features:
- It supports 2-way interaction.
- It supports
- It supports inclusion/exclusion of the constant or intercept term.
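To make the features above concrete, here is a minimal sketch in plain Ruby of how such a formula could be split into its parts. It is not the Statsample code: the method name parse_formula and the shape of the returned hash are made up for illustration, and it assumes that terms are separated by '+', that ':' marks an interaction, and that '0' and '1' switch the intercept off and on.

```ruby
# Hypothetical helper, NOT the Statsample API: splits a formula such as
# 'y ~ 0 + a + b:c' into a response, an intercept flag and a list of terms.
def parse_formula(formula)
  lhs, rhs = formula.split('~', 2).map(&:strip)
  raise ArgumentError, "formula needs a '~'" if rhs.nil?

  intercept = true # assumed default: the constant term is included
  terms = []
  rhs.split('+').map(&:strip).each do |token|
    raise ArgumentError, 'empty term in formula' if token.empty?

    case token
    when '0' then intercept = false # '0' drops the constant term
    when '1' then intercept = true  # '1' keeps it explicitly
    else
      factors = token.split(':').map(&:strip)
      if factors.size > 2
        raise ArgumentError, 'only 2-way interactions are handled'
      end
      terms << factors
    end
  end
  { response: lhs, intercept: intercept, terms: terms.uniq }
end

# Example call: no intercept, main effect a, interaction b:c.
p parse_formula('y ~ 0 + a + b:c')
```

Each term comes back as an array of factor names, so a main effect has one element and a 2-way interaction has two.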
And since I have followed the Patsy way of implementing the formula language, it has an edge over R. Patsy has a more accurate algorithm for deciding whether to use a full- or reduced-rank coding scheme for categorical factors, and Statsample and Statsample-GLM inherit it.
R can sometimes give an under-specified model, but this is not the case with our implementation. One example is the expansion of 0 + a:x + a:b, where x is numeric. More information about this can be found here.
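The basic choice between the two coding schemes can be illustrated with a small plain-Ruby sketch. It does not reproduce Patsy's algorithm or the R case mentioned above; it only shows, for a single category column with made-up values, why the coding has to depend on what else is in the model. The helper names indicator_columns and design_rank are hypothetical.

```ruby
require 'matrix'

# Hypothetical helper: one 0/1 indicator column per level (full-rank coding),
# or one per level except the first (reduced-rank coding).
def indicator_columns(values, reduced:)
  levels = values.uniq.sort
  levels = levels.drop(1) if reduced
  levels.map { |level| values.map { |v| v == level ? 1 : 0 } }
end

# Returns [number of columns, rank] of the resulting design matrix.
def design_rank(values, intercept:, reduced:)
  columns = indicator_columns(values, reduced: reduced)
  columns.unshift(Array.new(values.size, 1)) if intercept
  matrix = Matrix.columns(columns)
  [matrix.column_count, matrix.rank]
end

a = %w[p q r p q r] # made-up category data with three levels

cols, rank = design_rank(a, intercept: true, reduced: false)
puts "intercept + full-rank coding:    #{cols} columns, rank #{rank}"

cols, rank = design_rank(a, intercept: true, reduced: true)
puts "intercept + reduced-rank coding: #{cols} columns, rank #{rank}"

cols, rank = design_rank(a, intercept: false, reduced: false)
puts "no intercept + full-rank coding: #{cols} columns, rank #{rank}"
```

When the number of columns exceeds the rank, the columns are redundant, which is what happens when an intercept is combined with full-rank coding. Without an intercept, full-rank coding is the one that avoids losing a level.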
I am thankful to Patsy: its documentation provides all the details, which made my work very easy. Without it I would have fallen into many pitfalls.
Now let's see the formula language in action in Statsample and Statsample-GLM.
Regression in Statsample-GLM
Regression in Statsample-GLM has become an easy task, and it now also supports category data as predictor variables.
Let's see this with an example.
Let's assume a dataframe df with numeric columns b, and having category column
Let's create a logistic model with predictors
Earlier, we would have done the following.
Since we can’t code category variables, let's leave
Now, with the introduction of the formula language, this has become a very easy task, with no work required to preprocess the dataframe.
The above code not only enables predictions with category data but also reflects the power of the formula language.
Here’s a notebook that describes the use of the formula language in Statsample-GLM using real-life data.
Let's have a look at Statsample now.
With Statsample, it's the same. Now one can perform multiple regression with the formula language and category variables as predictors.
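To give a feel for the kind of preprocessing the formula language takes off the user's hands, here is a plain-Ruby sketch of expanding an interaction between a numeric column and a category column into numeric columns. It is only an illustration under assumptions: the data, the column names num and cat, and the helper interaction_columns are all made up, and the real implementation additionally decides which levels to keep.

```ruby
# Hypothetical data: one numeric column and one category column.
data = {
  num: [1.5, 2.0, 0.5, 3.0],
  cat: %w[x y x z]
}

# Hypothetical helper: builds one column per category level. Each column
# holds the numeric value where the row has that level, and 0.0 elsewhere.
def interaction_columns(numeric, category)
  category.uniq.sort.each_with_object({}) do |level, cols|
    cols[level] = numeric.zip(category).map { |n, c| c == level ? n : 0.0 }
  end
end

interaction_columns(data[:num], data[:cat]).each do |level, column|
  puts "num:cat[#{level}] -> #{column.inspect}"
end
```

Doing this by hand for every category predictor and interaction is exactly the work a formula removes.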
This will give a multiple linear regression model.
The introduction of the formula language and the ability to handle category data have given a great boost to data analysis in Ruby, and I really hope we keep improving it.
In the coming weeks I look forward to implementing the following:
- Add more than 2-way interaction support
- Support for shortcut symbols ‘*’, ‘/’, etc.
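For reference, in R and Patsy the shortcut a*b stands for a + b + a:b, and a/b stands for a + a:b. A minimal sketch of that rewriting in plain Ruby could look as follows. The helper expand_shortcut is hypothetical, handles only a single operator with two operands, and says nothing about how Statsample will eventually implement it.

```ruby
# Hypothetical helper: rewrites a single shortcut term into plain terms,
# following the R/Patsy meaning of '*' and '/'.
def expand_shortcut(term)
  if term.include?('*')
    a, b = term.split('*', 2).map(&:strip)
    [a, b, "#{a}:#{b}"] # a*b => a + b + a:b
  elsif term.include?('/')
    a, b = term.split('/', 2).map(&:strip)
    [a, "#{a}:#{b}"]    # a/b => a + a:b
  else
    [term.strip]        # nothing to expand
  end
end

puts expand_shortcut('a*b').join(' + ')
puts expand_shortcut('a/b').join(' + ')
```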