With the work done in Weeks 9 and 10, Statsample-GLM now supports shortcut symbols in its formula language.
With this addition, specifying a regression has become more R/Patsy-like and more convenient.
Symbols Added
Two shortcut symbols are now supported:
*
/
a*b is a shortcut for a+b+a:b. This is commonly used in regression models.
a/b is a shortcut for a+a:b. It’s quite useful when dealing with nested categorical variables: a/b makes sense when b is nested inside a.
Brackets
This week, bracket support has been added, so one can form expressions involving brackets. For example, (a+b):c would evaluate to a:c + b:c.
Symbols and brackets can be combined freely, subject to the limitations noted below. For example, (a+b)*(c+d) would give a+b+c+d+a:c+a:d+b:c+b:d.
Note
There are certain limitations to the current formula language:
- Since interactions of more than two variables are not supported yet, a formula like a*b*c won’t work.
- There is no mechanism to deal with cases such as a*a.
Formula Language in Statsample
Earlier, the plan was to implement the formula language in Statsample as well. However, Statsample supports only linear regression, which Statsample-GLM already provides under the name Normal Regression. We are therefore planning not to implement the formula language in Statsample, and instead to remove linear regression support from Statsample if it offers no advantage over Normal Regression in Statsample-GLM. For more information, see here.
Example using shortcut symbols
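As a library-independent illustration of the expansions described above, here is a minimal sketch of a small recursive-descent expander. It is not Statsample-GLM's implementation: the module name FormulaSketch and all of its method names are hypothetical, and it assumes the R/Patsy precedence in which : binds tighter than * and /, which in turn bind tighter than +. It does not add an intercept, does not enforce the two-way interaction limit, and does not handle repeated factors such as a*a.

```ruby
# Hypothetical sketch: expands shortcut symbols and brackets into a
# formula that uses only '+' and ':'. A term is an array of factor
# names, so ['a', 'b'] stands for a:b; an expression is a list of terms.
module FormulaSketch
  module_function

  # Split the formula into identifiers and single-character operators.
  def tokenize(str)
    str.scan(/[A-Za-z_]\w*|[+:*\/()]/)
  end

  # sum := product ('+' product)*
  def parse_sum(tokens)
    terms = parse_product(tokens)
    while tokens.first == '+'
      tokens.shift
      terms += parse_product(tokens)
    end
    terms
  end

  # product := interaction (('*' | '/') interaction)*
  def parse_product(tokens)
    left = parse_interaction(tokens)
    while ['*', '/'].include?(tokens.first)
      op = tokens.shift
      right = parse_interaction(tokens)
      left = if op == '*'
               left + right + cross(left, right) # a*b => a + b + a:b
             else
               left + nest(left, right)          # a/b => a + a:b
             end
    end
    left
  end

  # interaction := atom (':' atom)*
  def parse_interaction(tokens)
    left = parse_atom(tokens)
    while tokens.first == ':'
      tokens.shift
      left = cross(left, parse_atom(tokens))
    end
    left
  end

  # atom := identifier | '(' sum ')'
  def parse_atom(tokens)
    tok = tokens.shift
    return [[tok]] unless tok == '('

    inner = parse_sum(tokens)
    tokens.shift # drop the closing ')'
    inner
  end

  # Every term on the left interacts with every term on the right.
  def cross(left, right)
    left.product(right).map { |l, r| l + r }
  end

  # Each right-hand term is nested inside all left-hand factors.
  # For a compound left operand this follows R's convention (an assumption).
  def nest(left, right)
    outer = left.flatten
    right.map { |r| outer + r }
  end

  def expand(str)
    parse_sum(tokenize(str)).map { |term| term.join(':') }.join('+')
  end
end

puts FormulaSketch.expand('a*b')         # prints a+b+a:b
puts FormulaSketch.expand('a/b')         # prints a+a:b
puts FormulaSketch.expand('(a+b):c')     # prints a:c+b:c
puts FormulaSketch.expand('(a+b)*(c+d)') # prints a+b+c+d+a:c+a:d+b:c+b:d
```

Representing each term as an array of factor names keeps the two expansion rules down to list concatenation and a Cartesian product, which is why brackets need no special handling beyond recursion.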
Internal Structure of Formula Language
The formula language is implemented by three classes:
FormulaWrapper
Formula
Token
When a new model is created with Statsample::GLM::Regression#new, FormulaWrapper is called first.
The FormulaWrapper class does the necessary preprocessing. It mainly does two things:
- It applies the shortcut symbols, reducing the expression to one containing only : and +.
- It then groups the terms of this simplified expression by the numerical terms they interact with.
After FormulaWrapper has formed the groups, it processes each of them using the Formula class.
The Formula class takes each group and forms tokens that do not overlap; that is, when the tokens are expanded into a dataframe, that dataframe contains no redundant columns.
The Token class stores the column names and can expand these columns when given a dataframe.
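To make the idea of expanding a column with or without redundancy concrete, here is a minimal sketch using a plain Hash of arrays in place of a dataframe. TokenSketch, its fields, and the output column names such as a_q are hypothetical and are not the library's Token API. The sketch also assumes that reduced-rank coding simply drops the first level in sorted order.

```ruby
# Hypothetical stand-in for a token: a categorical column name plus a flag
# saying whether to keep every level (full rank) or drop one (reduced rank).
TokenSketch = Struct.new(:column, :full_rank) do
  # data is a plain Hash of column name => Array of values.
  # Returns a Hash of indicator column name => Array of 0/1 values.
  def expand(data)
    values = data.fetch(column)
    levels = values.uniq.sort
    # Assumption: reduced rank drops the first level in sorted order.
    levels = levels.drop(1) unless full_rank
    levels.map do |level|
      ["#{column}_#{level}", values.map { |v| v == level ? 1 : 0 }]
    end.to_h
  end
end

data = { 'a' => %w[p q r q] }

full = TokenSketch.new('a', true).expand(data)
# full has three columns: a_p, a_q and a_r

reduced = TokenSketch.new('a', false).expand(data)
# reduced has two columns: a_q and a_r
```

The three full-rank indicator columns always sum to a column of ones, so placing them next to an intercept would make one of them redundant. Dropping a level removes that redundancy.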
Sounds confusing?
Let’s try an example:
Let’s say our expression is x*a + b*c, where x is a numerical vector and a, b and c are categorical.
- First, FormulaWrapper converts it to a simple expression: 1+x+a+x:a+b+c+b:c. Notice that the shortcut symbols have disappeared and only + and : remain.
- Next, 1+x+a+x:a+b+c+b:c is split into two groups, [1, a, b, c, b:c] and [1, a]. The terms in the first group share the numerical interaction term 1, while those in the second group share the numerical interaction term x.
- Both groups are then processed by Formula to produce a dataframe of full rank.
- The first group is parsed to 1+a(-)+b(-)+c(-)+b(-):c(-) by the Formula class. a(-) means that vector a is contrast coded to reduced rank, while a means it is coded to full rank.
- The second group is parsed to x + x:a(-).
- In the end, these terms are combined, and the resulting parsed expression is the sum of the two expressions above, i.e. 1+a(-)+b(-)+c(-)+b(-):c(-)+x+x:a(-).
- The terms are then expanded into dataframes by the Token class, and these dataframes are concatenated to form the final dataframe for the given expression.
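The grouping step in the walkthrough above can be sketched as follows. This is an illustration only, not FormulaWrapper's code: the method name group_by_numerical is hypothetical, and it assumes the caller supplies the list of numerical column names.

```ruby
# Hypothetical sketch: group the '+'-separated terms of a simplified
# formula by the numerical factors they interact with.
def group_by_numerical(expression, numerical)
  groups = Hash.new { |hash, key| hash[key] = [] }
  expression.split('+').each do |term|
    factors = term.split(':')
    num, cat = factors.partition { |f| numerical.include?(f) }
    # A bare '1' is the intercept, not a categorical factor.
    cat -= ['1']
    key = num.empty? ? '1' : num.join(':')
    groups[key] << (cat.empty? ? '1' : cat.join(':'))
  end
  groups
end

groups = group_by_numerical('1+x+a+x:a+b+c+b:c', ['x'])
# groups['1'] is ["1", "a", "b", "c", "b:c"]
# groups['x'] is ["1", "a"]
```

Each key is the numerical part of a term (or 1 when there is none), and each value lists what remains of the terms once that numerical part is factored out, which reproduces the two groups from the example.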
Conclusion
We saw an overview of how the formula language works inside Statsample::GLM, and how shortcut symbols and brackets have made its usage much more convenient and powerful.