Daru just got a new capability: Categorical Index.
You can now organize vectors and dataframes using an index that is categorical.
Daru now has 4 types of indexes for indexing data:
- Daru::Index
- Daru::MultiIndex
- Daru::DateTimeIndex
- Daru::CategoricalIndex (new)
Daru::Index is for the usual index, where every value is unique.
Daru::MultiIndex is for indexing with more than one level.
Daru::DateTimeIndex is for indexing with dates. It is a powerful means of analyzing time series data.
The new Daru::CategoricalIndex is helpful for data indexed by a sparsely populated index, with each unique index value treated as a category.
Please visit this link first to get a basic understanding of how indexing works in Daru::Vector, and this link for Daru::DataFrame.
Example
Let’s see an example.
(Alternatively, you can see this example in an iRuby notebook here.)
The data is about countries. The region column describes the region that each country belongs to. A region can have more than one country.
This is an ideal place to use a Categorical Index if we want to study the different regions.
Let’s see all regions there are:
Let’s find out how many countries lie in the Africa region.
Finding out the mean life expectancy of Europe is as easy as:
Let’s see the maximum life expectancy of South-East Asia:
Internal architecture
To store the index efficiently and make retrieval possible in constant time, Daru::CategoricalIndex uses two data structures:
- Hash table: maps each category to the positions where it occurs. It is represented as @cat_hash.
- Array: maps each position to an integer that represents a category.
For example:
For idx, the hash table and array would be:
The hash table helps us retrieve all instances that belong to a category in constant time.
Similarly, the array helps us retrieve the category of an instance in constant time.
The reason for storing integers in the array, instead of the category names themselves, is to avoid unnecessary use of space.
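The two structures described above can be sketched in plain Ruby. This is only a toy model, not Daru's actual implementation: the class ToyCategoricalIndex, the methods positions and category_at, and the codes array are names made up for this illustration. Only @cat_hash is a name taken from the description above.

```ruby
# Toy model of a categorical index (hypothetical, not Daru's real code).
class ToyCategoricalIndex
  attr_reader :cat_hash, :codes, :categories

  def initialize(labels)
    @cat_hash = {} # category => positions where that category occurs
    @codes = []    # position => integer code of the category at that position
    code_of = {}   # category => integer code, assigned in order of first appearance

    labels.each_with_index do |label, pos|
      code_of[label] ||= code_of.size
      (@cat_hash[label] ||= []) << pos
      @codes << code_of[label]
    end

    # Hashes keep insertion order, so categories[code] is the category for that code.
    @categories = code_of.keys
  end

  # All positions that belong to a category: a single hash lookup.
  def positions(category)
    @cat_hash.fetch(category)
  end

  # Category stored at a position: two array reads.
  def category_at(pos)
    @categories[@codes.fetch(pos)]
  end
end

idx = ToyCategoricalIndex.new([:a, :b, :a, :b, :c])
idx.cat_hash        # => {a: [0, 2], b: [1, 3], c: [4]}
idx.codes           # => [0, 1, 0, 1, 2]
idx.positions(:b)   # => [1, 3]
idx.category_at(4)  # => :c
```

Each category name is stored once, and the per-position array holds only small integers. That is the space saving the integer codes are meant to provide.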
When and why to use Categorical Index
Use a Categorical Index if you have a categorical variable, or data with more than one instance of the same object, and you want to index the dataframe by that column.
It will save you a lot of space and make access to the same category fast.
Also, if you want your dataframe to be indexed by a column in which not every entry is unique, a categorical index will come to the rescue.