With the work done in Weeks 9 and 10, Statsample-GLM now supports shortcut symbols in Formula Language.
With this addition, regression has become more R/Patsy-like and more convenient to use.
Symbols Added
Two shortcut symbols are now supported: * and /.
a*b is a shortcut for a+b+a:b. This is commonly used in regression models.
a/b is a shortcut for a+a:b. It's quite useful when dealing with nested categorical variables: a/b makes sense when b is nested inside a.
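To make the two expansions concrete, here is a minimal sketch in plain Ruby. The helpers expand_star and expand_slash are hypothetical names used only for illustration; they are not part of Statsample-GLM, and terms are represented as plain strings.

```ruby
# Hypothetical helpers, not part of Statsample-GLM: terms are plain strings.

# a*b  =>  a + b + a:b
def expand_star(a, b)
  [a, b, "#{a}:#{b}"]
end

# a/b  =>  a + a:b  (b is nested inside a, so b gets no main effect)
def expand_slash(a, b)
  [a, "#{a}:#{b}"]
end

puts expand_star('a', 'b').join(' + ')
puts expand_slash('a', 'b').join(' + ')
```

The only difference between the two is that / drops the main effect of the right-hand variable, which is what nesting calls for.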
Brackets
This week, support for brackets has been added, so one can form expressions that use brackets. For example, (a+b):c evaluates to a:c + b:c.
Symbols and brackets can be combined freely. For example, (a+b)*(c+d) gives a+b+c+d+a:c+a:d+b:c+b:d.
Note
There are certain limitations to the current formula language, though:
- Since interactions beyond 2-way are not supported yet, a formula like a*b*c wouldn't work.
- There's no mechanism yet to deal with cases such as a*a.
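The two limitations can be seen on the expanded terms: a*b*c expands to a+b+c+a:b+a:c+b:c+a:b:c, which contains a 3-way interaction, and a*a expands to a+a+a:a, which interacts a variable with itself. The sketch below flags such terms; the method names are hypothetical and the check is an illustration only, not something Statsample-GLM is known to perform.

```ruby
# Hypothetical checks on already-expanded terms such as "a:b:c".

# Number of variables taking part in the interaction.
def interaction_order(term)
  term.split(':').size
end

# True when the same variable appears more than once, as in "a:a".
def repeated_factor?(term)
  factors = term.split(':')
  factors.uniq.size < factors.size
end

def unsupported?(term)
  interaction_order(term) > 2 || repeated_factor?(term)
end

%w[a:b a:b:c a:a].each do |term|
  puts "#{term}: #{unsupported?(term) ? 'not handled yet' : 'ok'}"
end
```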
Formula Language in Statsample
Earlier, the plan was to implement the formula language in Statsample as well. However, Statsample supports only linear regression, and that is already available in Statsample-GLM under the name Normal Regression. So we are planning not to implement the formula language in Statsample, and instead to remove linear regression support from Statsample if it doesn't offer any advantage over Normal Regression in Statsample-GLM. For info, see here.
Example using shortcut symbols
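As a library-independent illustration of what the shortcut symbols and brackets expand to, the following sketch parses a formula right-hand side and returns its terms. ShortcutExpander is a hypothetical class written for this post's examples, not Statsample-GLM's implementation. It assumes R/Patsy-style precedence (: binds tighter than * and /, which bind tighter than +), applies a/b as a + a:b even when a is itself a bracketed sum (a simplifying assumption), does not add the intercept, and does no error handling.

```ruby
# Hypothetical mini-parser, independent of Statsample-GLM.
class ShortcutExpander
  def initialize(formula)
    # Split into variable names and operator/bracket symbols; spaces are skipped.
    @tokens = formula.scan(/[A-Za-z_]\w*|[+:*\/()]/)
    @pos = 0
  end

  # Returns the expanded terms, e.g. ["a", "b", "a:b"] for "a*b".
  def expand
    parse_sum.uniq
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    @pos += 1
    @tokens[@pos - 1]
  end

  # sum := product ('+' product)*
  def parse_sum
    terms = parse_product
    while peek == '+'
      advance
      terms += parse_product
    end
    terms
  end

  # product := interaction (('*' | '/') interaction)*
  def parse_product
    left = parse_interaction
    while ['*', '/'].include?(peek)
      op = advance
      right = parse_interaction
      crossed = cross(left, right)
      # '*' keeps both sides plus their interactions; '/' drops the right side.
      left = op == '*' ? left + right + crossed : left + crossed
    end
    left
  end

  # interaction := atom (':' atom)*
  def parse_interaction
    left = parse_atom
    while peek == ':'
      advance
      left = cross(left, parse_atom)
    end
    left
  end

  # atom := name | '(' sum ')'
  def parse_atom
    token = advance
    return [token] unless token == '('

    inner = parse_sum
    advance # consume the closing ')'
    inner
  end

  # Distributes ':' over two lists of terms.
  def cross(left, right)
    left.product(right).map { |l, r| "#{l}:#{r}" }
  end
end

['a*b', 'a/b', '(a+b):c', '(a+b)*(c+d)'].each do |formula|
  puts "#{formula}  =>  #{ShortcutExpander.new(formula).expand.join(' + ')}"
end
```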
Internal Structure of Formula Language
The formula language is built on three classes:
- FormulaWrapper
- Formula
- Token
When a new model is created with Statsample::GLM::Regression#new, FormulaWrapper is called first.
The FormulaWrapper class does the necessary preprocessing. It mainly does two things:
- It applies the shortcut symbols and reduces the expression to one containing only : and +.
- After reducing to a simple expression containing only : and +, it groups the terms based on the numerical terms they interact with.
After FormulaWrapper has formed the groups, it processes each of them using the Formula class.
The Formula class takes each group and forms tokens that do not overlap; that is, when they are converted to a dataframe, that dataframe contains no redundancy.
The Token class stores the column names and can expand these columns when fed a dataframe.
Sounds confusing?
Let's try an example:
Let's say our expression is x*a + b*c, where x is a numerical vector and a, b and c are categorical.
- First, it is converted to a simple expression by FormulaWrapper. It is simplified to 1+x+a+x:a+b+c+b:c. Notice that the shortcut symbols have disappeared and only + and : remain.
- Now 1+x+a+x:a+b+c+b:c is grouped into two groups, [1,a,b,c,b:c] and [1,a]. The first group has 1 as its common numerical interaction term, while the second group has x.
- Now both groups are processed by Formula to produce a dataframe with full rank.
- The first group is parsed to 1+a(-)+b(-)+c(-)+b(-):c(-) by the Formula class. a(-) means that vector a is contrast coded to reduced rank, while a means it is coded to full rank.
- The second group is parsed to x + x:a(-).
- In the end these terms are combined, and the resulting parsed expression is the sum of the two expressions above, i.e. 1+a(-)+b(-)+c(-)+b(-):c(-)+x+x:a(-).
- These terms are then expanded into dataframes by the Token class, and the dataframes are concatenated to form the final dataframe for the given expression.
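The grouping step in the walkthrough can be sketched in a few lines of plain Ruby. The function group_by_numeric and its data layout are assumptions made for illustration and do not reflect FormulaWrapper's actual internals; the set of numerical variables is passed in explicitly, whereas the library would presumably read it from the dataframe.

```ruby
# Hypothetical sketch of grouping terms by the numerical part they interact with.
# terms:   expanded terms such as "x:a"
# numeric: names of the numerical variables
def group_by_numeric(terms, numeric)
  groups = Hash.new { |hash, key| hash[key] = [] }
  terms.each do |term|
    factors = term.split(':')
    num, cat = factors.partition { |f| numeric.include?(f) }
    # Terms with no numerical factor share the constant 1 as their numerical part.
    key = num.empty? ? '1' : num.join(':')
    # A purely numerical term leaves 1 as its remaining part.
    groups[key] << (cat.empty? ? '1' : cat.join(':'))
  end
  groups
end

terms = %w[1 x a x:a b c b:c]
groups = group_by_numeric(terms, ['x'])
groups.each { |numeric_part, rest| puts "#{numeric_part}: [#{rest.join(', ')}]" }
```

With x as the only numerical variable, this yields the two groups from the example: [1, a, b, c, b:c] for the numerical part 1 and [1, a] for the numerical part x.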
Conclusion
We have seen an overview of how the formula language works inside Statsample::GLM, and how shortcut symbols and brackets make it much more convenient and powerful to use.