With the work done in Weeks 9 and 10, Statsample-GLM now supports shortcut symbols in its formula language.
With this addition, specifying a regression has become more R/Patsy-like and more convenient.
Two shortcut symbols are now supported:
- a*b is a shortcut for a+b+a:b. This is commonly used in regression models.
- a/b is a shortcut for a+a:b. It is quite useful when dealing with nested categorical variables: a/b makes sense when b is nested inside a.
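To make the two shortcuts concrete, here is a minimal sketch in plain Ruby. It is not how Statsample-GLM implements them; the helper names `interact`, `cross`, `nest` and `show` are made up for illustration, and only the single-variable case described above is exercised.

```ruby
# A formula is modelled as an array of terms; each term is a sorted array of
# variable names, so the interaction a:b is ['a', 'b'].
# All helper names here are hypothetical, not part of Statsample-GLM.

# Pair every term on the left with every term on the right.
def interact(left, right)
  left.product(right).map { |l, r| (l | r).sort }.uniq
end

# a*b => a + b + a:b
def cross(left, right)
  (left + right + interact(left, right)).uniq
end

# a/b => a + a:b
def nest(left, right)
  (left + interact(left, right)).uniq
end

# Render the term list back into formula notation.
def show(terms)
  terms.map { |t| t.join(':') }.join('+')
end

a = [['a']]
b = [['b']]
puts show(cross(a, b)) # a+b+a:b
puts show(nest(a, b))  # a+a:b
```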
This week, support for brackets has been added, so one can write expressions that use brackets. For example, (a+b):c would evaluate to a:c + b:c.
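The way brackets distribute over the operators can be sketched with a small recursive-descent expander. This is a toy written for this post's examples, not the parser inside Statsample-GLM: the class name `ToyExpander` is hypothetical, and the operator precedence (`:` before `*` and `/`, before `+`) is an assumption borrowed from R/Patsy.

```ruby
# Hypothetical toy expander; not the Statsample-GLM implementation.
# Assumed grammar (precedence follows R/Patsy):
#   sum     := product ('+' product)*
#   product := inter (('*' | '/') inter)*
#   inter   := atom (':' atom)*
#   atom    := NAME | '(' sum ')'
class ToyExpander
  def initialize(formula)
    # Keep variable names and operators; whitespace and anything else is skipped.
    @tokens = formula.scan(/[A-Za-z_]\w*|[+*\/:()]/)
  end

  def expand
    terms = sum
    raise ArgumentError, "unexpected token #{@tokens.first.inspect}" unless @tokens.empty?
    terms.map { |t| t.join(':') }.join('+')
  end

  private

  def sum
    terms = product
    terms |= product while accept('+')
    terms
  end

  def product
    left = inter
    while (op = accept('*') || accept('/'))
      crossed = interact(left, inter)
      right = crossed.empty? ? [] : nil
      # a*b => a + b + a:b ; a/b => a + a:b
      # (for a bracketed sum on the left of '/', this toy distributes term by
      # term, which may differ from R/Patsy)
      left = op == '*' ? (left | @last_right | crossed) : (left | crossed)
    end
    left
  end

  def inter
    left = atom
    left = interact(left, atom) while accept(':')
    @last_right = left
    left
  end

  def atom
    token = @tokens.shift
    if token == '('
      inner = sum
      raise ArgumentError, "missing ')'" unless accept(')')
      inner
    elsif token =~ /\A[A-Za-z_]\w*\z/
      [[token]]
    else
      raise ArgumentError, "unexpected token #{token.inspect}"
    end
  end

  # Consume the next token only if it matches.
  def accept(symbol)
    @tokens.shift if @tokens.first == symbol
  end

  # Pair every term on the left with every term on the right.
  def interact(left, right)
    left.product(right).map { |l, r| (l | r).sort }.uniq
  end
end

puts ToyExpander.new('(a+b):c').expand     # a:c+b:c
puts ToyExpander.new('(a+b)*(c+d)').expand # a+b+c+d+a:c+a:d+b:c+b:d
```

Each bracketed sub-expression is expanded into a list of terms first, so the outer operator only ever combines two term lists.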
It supports arbitrarily nested combinations of symbols and brackets. For example, (a+b)*(c+d) would give a+b+c+d+a:c+a:d+b:c+b:d.
However, the current formula language has certain limitations:
- Since more than 2-way interactions are not supported yet, formula like
- There’s not a mechanism to deal with cases such as
Formula Language in Statsample
Earlier, the plan was to implement the formula language in Statsample as well. However, Statsample supports only linear regression, which is already available in Statsample-GLM under the name Normal Regression. We are therefore planning not to implement the formula language in Statsample, and instead to remove linear regression support from Statsample if it offers no advantage over Normal Regression in Statsample-GLM. For info, see here.
Example using shortcut symbols
Internal Structure of Formula Language
The formula language is implemented by three classes: FormulaWrapper, Formula and Token.
When a new model is created, FormulaWrapper is called first.
The FormulaWrapper class does the necessary preprocessing. It mainly does two things:
- It applies the shortcut symbols and reduces the expression to one containing only + and :.
- After reducing to a simple expression containing only + and :, it groups terms based on the numerical terms they interact with.
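The grouping step can be sketched in a few lines of plain Ruby. This only illustrates the idea; the function name `group_by_numeric` is hypothetical, terms are represented as plain strings, and the caller is assumed to know which variables are numerical.

```ruby
# Hypothetical sketch: split already-expanded terms by the numerical
# variables they involve. Terms are strings such as 'x:a'; terms with no
# numerical variable are filed under the key '1'.
def group_by_numeric(terms, numeric)
  terms.group_by do |term|
    key = term.split(':').select { |v| numeric.include?(v) }
    key.empty? ? '1' : key.join(':')
  end
end

terms   = '1+x+a+x:a+b+c+b:c'.split('+')
numeric = ['x'] # assumption: x is the only numerical variable
groups  = group_by_numeric(terms, numeric)

groups.each { |key, members| puts "#{key}: #{members.join(', ')}" }
# 1: 1, a, b, c, b:c
# x: x, x:a
```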
Once FormulaWrapper has formed the groups, it processes each of them using the Formula class.
The Formula class takes each group and forms tokens that do not overlap; that is, when they are expanded into a dataframe, that dataframe contains no redundant columns.
The Token class stores the column names and can expand these columns when given a dataframe.
Let's try an example.
Say our expression is x*a + b*c, where x is a numerical vector and a, b and c are categorical.
- First, it is converted to a simple expression by FormulaWrapper. It is simplified to 1+x+a+x:a+b+c+b:c. Notice that the shortcut symbols have disappeared and only + and : remain.
- Next, 1+x+a+x:a+b+c+b:c is split into two groups: the terms that involve no numerical variable (1, a, b, c and b:c) and the terms that involve x (x and x:a). The first group has 1 as its common numerical interaction term, while the second group has x.
- Both groups are then processed by Formula to produce a dataframe with full rank.
- First group will be parsed to
Here, a(-) implies that the vector a is contrast coded to reduced rank, while a implies it is coded to full rank.
- The second group will be parsed to x + x:a(-).
- In the end these terms are combined and resultant parsed expression is the sum of the above two expressions, i.e.
- The tokens are then expanded into dataframes by the Token class, and these dataframes are concatenated to form the final dataframe for the given expression.
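The difference between a and a(-) in the steps above, and the final concatenation, can be sketched as follows. This is only an illustration with plain Ruby hashes standing in for dataframes: the function `expand_categorical`, the column naming scheme and the choice of which category gets dropped are all assumptions, not the behaviour of the Token class.

```ruby
# Hypothetical sketch of expanding a categorical column into indicator columns.
#   full: true  -> one column per category (written a in the post)
#   full: false -> the first category is dropped (written a(-) in the post);
#                  which category is dropped is an assumption made here.
def expand_categorical(name, values, full:)
  levels = values.uniq.sort
  levels = levels.drop(1) unless full
  levels.each_with_object({}) do |level, columns|
    columns["#{name}_#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

a = %w[p q r q]
full    = expand_categorical('a', a, full: true)
reduced = expand_categorical('a', a, full: false)

puts full.keys.inspect    # ["a_p", "a_q", "a_r"]
puts reduced.keys.inspect # ["a_q", "a_r"]

# Concatenating the per-token tables column-wise gives the final table.
x = { 'x' => [0.5, 1.0, 1.5, 2.0] }
design = x.merge(reduced)
puts design.keys.inspect  # ["x", "a_q", "a_r"]
```

Dropping one category is what keeps the combined columns from being redundant once an intercept or another fully coded term is already present.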
We have seen an overview of how the formula language works inside Statsample::GLM. Shortcut symbols, together with brackets, have made it much more convenient and powerful to use.