Daru can now store and process categorical data.
Daru now has three types of vector:
- :object
- :numeric
- :category (new)
With the introduction of categorical data, Daru gains two benefits:
- Categorical data is stored very efficiently.
- Tasks related to categorical data have become a lot easier.
The reason for the first benefit is that an ordinary vector stores its data as a plain array, which does not take into account that most of the entries are the same.
Let's discuss the various tasks that are now easy to do with a categorical vector.
(The purpose of this blog post is to give an overview of what tasks can be accomplished with categorical data. To learn what each method does and how to use it, please look at this notebook.)
As soon as one declares a categorical variable, one can look at the frequency count of each category to get a feel for the data.
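As a rough illustration of what such a frequency count computes, here is a plain-Ruby sketch. It does not use Daru's API; the method name category_frequencies and the sample data are made up for this example.

```ruby
# Plain-Ruby sketch (not Daru's API): count how often each category occurs.
def category_frequencies(values)
  # Hash.new(0) makes every unseen category start at a count of zero.
  values.each_with_object(Hash.new(0)) { |value, counts| counts[value] += 1 }
end

# Hypothetical sample data, chosen only for illustration.
sizes = %w[small large small medium small large]
p category_frequencies(sizes)
```

For the sample above, small is counted three times, large twice and medium once.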
One can look at a summary of the data to learn the common figures about categorical data, such as how many categories are present, which category is the most frequent, etc.
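The kind of figures such a summary reports can be sketched in plain Ruby as follows. This is only an approximation of the idea, not Daru's own summary output; category_summary and its keys are hypothetical names.

```ruby
# Plain-Ruby sketch (not Daru's API): basic summary figures for categorical data.
def category_summary(values)
  counts = values.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  # max_by returns the [category, count] pair with the highest count.
  most_frequent, top_count = counts.max_by { |_category, count| count }
  {
    size: values.size,              # total number of entries
    categories: counts.size,        # number of distinct categories
    most_frequent: most_frequent,   # category that occurs most often
    most_frequent_count: top_count  # how many times it occurs
  }
end

# Hypothetical sample data, chosen only for illustration.
p category_summary(%w[small large small medium small large])
```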
It's possible to convert a numerical variable into a categorical variable. For example, suppose heights stores height measurements and we want to bin them into the categories low, medium and high.
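The binning idea can be sketched in plain Ruby like this. The cut points of 160 and 180 (in centimetres) and the method name height_category are assumptions made for this example; they are not taken from Daru.

```ruby
# Plain-Ruby sketch (not Daru's API): bin numeric heights into three categories.
# The cut points 160 and 180 are hypothetical and chosen only for illustration.
def height_category(height_cm)
  if height_cm < 160
    :low
  elsif height_cm < 180
    :medium
  else
    :high
  end
end

heights = [150, 165, 172, 181, 190]  # hypothetical measurements
categories = heights.map { |height| height_category(height) }
p categories
```

Each interval here includes its lower bound and excludes its upper bound, so a height of exactly 160 counts as medium.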
Given a dataframe, it's possible to extract rows based on the categories. It uses the same Arel-like query syntax as an ordinary vector.
The benefit is that a comparison such as lt follows the order we set for the categories.
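To show what an order-aware "less than" filter does, here is a plain-Ruby sketch over an array of hashes. It is not Daru's query syntax; CATEGORY_ORDER, rows_below and the sample rows are hypothetical.

```ruby
# Plain-Ruby sketch (not Daru's API): keep rows whose category comes
# before a given category in a user-defined order.
CATEGORY_ORDER = %i[low medium high].freeze  # hypothetical custom order

def rows_below(rows, column, category)
  limit = CATEGORY_ORDER.index(category)
  # Compare positions in the custom order, not the symbols themselves.
  rows.select { |row| CATEGORY_ORDER.index(row[column]) < limit }
end

# Hypothetical sample rows, chosen only for illustration.
rows = [
  { name: "a", height: :low },
  { name: "b", height: :high },
  { name: "c", height: :medium }
]
p rows_below(rows, :height, :medium)
```

Comparing the symbols directly would sort them alphabetically, which is why the position in the custom order is used instead.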
By defining a custom order of categories and setting ordered to true, one can sort the categories, find the min, max, etc.
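The effect of a custom order on sorting and on min and max can be sketched in plain Ruby as follows. This is a sketch of the idea only, not Daru's implementation; sort_by_order and min_max_by_order are hypothetical helper names.

```ruby
# Plain-Ruby sketch (not Daru's API): sort and find min/max using a custom
# category order instead of the natural ordering of the values.
def sort_by_order(values, order)
  rank = order.each_with_index.to_h  # category => position in the order
  values.sort_by { |value| rank.fetch(value) }
end

def min_max_by_order(values, order)
  rank = order.each_with_index.to_h
  values.minmax_by { |value| rank.fetch(value) }
end

order  = %i[low medium high]      # hypothetical custom order
values = %i[high low medium low]  # hypothetical sample data
p sort_by_order(values, order)
p min_max_by_order(values, order)
```

Using fetch means a value that is missing from the order raises a KeyError instead of being silently misplaced.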
Example
Please have a look at this notebook, which describes the use of categorical data through an example.
Internal Structure
It's similar to the internal structure of the categorical index.
To store duplicate categories efficiently and make retrieval possible in constant time, categorical data in Daru uses two data structures:
- Hash table: maps each category to its positional values. It is represented as @cat_hash.
- Array: maps each position to an integer that represents a category.
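For example, suppose dv is a categorical vector holding the values a, b, a, b, c.
For dv, the hash table would map a to positions 0 and 2, b to positions 1 and 3, and c to position 4, while the array would be 0, 1, 0, 1, 2.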
The hash table lets us retrieve all instances that belong to a given category with a single lookup.
Similarly, the array lets us retrieve the category of an instance in constant time.
The reason for storing integers in the array, rather than the category names themselves, is to avoid unnecessary use of space.
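The two structures can be sketched in plain Ruby as follows. This is a minimal sketch under stated assumptions, not Daru's actual implementation: only the name @cat_hash comes from the description above, while the class CategoricalSketch, the other instance variables and the methods are hypothetical.

```ruby
# Plain-Ruby sketch (not Daru's implementation) of a hash table plus an
# array of integer codes for storing categorical data.
class CategoricalSketch
  attr_reader :cat_hash, :codes

  def initialize(values)
    @cat_hash   = {}  # category => positions where it occurs
    @code_of    = {}  # category => integer code (hypothetical helper)
    @categories = []  # integer code => category (hypothetical helper)
    @codes      = []  # position => integer code

    values.each_with_index do |value, position|
      unless @code_of.key?(value)
        # First time this category is seen: give it the next free code.
        @code_of[value] = @categories.size
        @categories << value
        @cat_hash[value] = []
      end
      @cat_hash[value] << position
      @codes << @code_of[value]
    end
  end

  # All positions holding the given category, via one hash lookup.
  def positions_of(category)
    @cat_hash.fetch(category, [])
  end

  # Category stored at a position, via two array lookups.
  def category_at(position)
    @categories[@codes.fetch(position)]
  end
end

dv = CategoricalSketch.new(%i[a b a b c])  # hypothetical sample data
p dv.cat_hash
p dv.codes
p dv.positions_of(:a)
p dv.category_at(3)
```

Each category name is stored once, and every entry costs only one small integer in the codes array, which is where the space saving comes from.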