Daru just got a new capability => Categorical Index.
You can now organize vectors and dataframes using an index that is categorical.
Daru now has 4 types of indexes for indexing data:
- Daru::Index
- Daru::MultiIndex
- Daru::DateTimeIndex
- Daru::CategoricalIndex (new)
Daru::Index is the usual index, where every value is unique.
Daru::MultiIndex is for indexing with more than one level.
Daru::DateTimeIndex is for indexing with dates. It's a powerful means of analyzing time series data.
Daru::CategoricalIndex is helpful when data is indexed by a sparsely populated index, that is, one with few distinct values, where each unique index value acts as a category.
Let’s see an example.
(Alternatively, you can see this example in an IRuby notebook here)
The data is about countries. The region column describes the region that a country belongs to. A region can have more than one country.
This is an ideal place to use a Categorical Index if we want to study different regions.
Let’s see all regions there are:
Let’s find out how many countries lie in the Africa region.
Finding the mean life expectancy of Europe is as easy as
Let’s see the maximum life expectancy of South-East Asia
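To make the queries above concrete without depending on Daru's API, here is a plain-Ruby sketch of what they compute. The rows are made-up toy values, not the dataset used in this post, and `life_expectancies` is a hypothetical helper introduced only for illustration. It scans the rows linearly, so it shows the results of the queries rather than how Daru answers them.

```ruby
# Toy rows with made-up values; NOT the dataset used in this post.
ROWS = [
  { region: 'Africa',          country: 'Country A', life_expectancy: 60.0 },
  { region: 'Europe',          country: 'Country B', life_expectancy: 80.0 },
  { region: 'Africa',          country: 'Country C', life_expectancy: 62.0 },
  { region: 'South-East Asia', country: 'Country D', life_expectancy: 70.0 },
  { region: 'Europe',          country: 'Country E', life_expectancy: 78.0 },
  { region: 'South-East Asia', country: 'Country F', life_expectancy: 74.0 }
].freeze

# Hypothetical helper: life expectancy of every row in the given region.
def life_expectancies(rows, region)
  rows.select { |row| row[:region] == region }
      .map { |row| row[:life_expectancy] }
end

# All regions there are (each distinct value is one category).
regions = ROWS.map { |row| row[:region] }.uniq

# How many countries lie in the Africa region.
africa_count = ROWS.count { |row| row[:region] == 'Africa' }

# Mean life expectancy of Europe.
europe = life_expectancies(ROWS, 'Europe')
europe_mean = europe.sum / europe.size

# Maximum life expectancy of South-East Asia.
sea_max = life_expectancies(ROWS, 'South-East Asia').max

puts "Regions: #{regions.join(', ')}"
puts "Countries in Africa: #{africa_count}"
puts "Mean life expectancy in Europe: #{europe_mean}"
puts "Max life expectancy in South-East Asia: #{sea_max}"
```

With a categorical index on the region column, each of these becomes a lookup by category instead of a scan over every row.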
To store the index efficiently and make retrieval possible in constant time,
Daru::CategoricalIndex uses two data structures:
- Hash-table: To map each category to positional values. It is represented as
- Array: To map each position to an integer which represents a category.
idx, the hash table and array would be:
The hash table helps us retrieve all instances that belong to a category with a single lookup.
Similarly, the array helps us retrieve the category of an instance in constant time.
And the reason to store integers in the array instead of the category names themselves is to avoid unnecessary use of space.
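The two structures can be sketched in plain Ruby as follows. This is a minimal illustration under stated assumptions, not Daru's actual implementation: `TinyCategoricalIndex` and its method names are hypothetical, chosen only to mirror the hash table and array described above.

```ruby
# A minimal sketch of a categorical index.
# TinyCategoricalIndex is a hypothetical name, not a Daru class.
class TinyCategoricalIndex
  attr_reader :positions, :codes, :categories

  def initialize(labels)
    @categories = []  # integer code => category name
    @code_of    = {}  # category name => integer code
    @positions  = {}  # hash table: category => positions where it occurs
    @codes      = []  # array: position => integer code of its category

    labels.each_with_index do |label, pos|
      unless @code_of.key?(label)
        # First time we see this category: give it the next integer code.
        @code_of[label] = @categories.size
        @categories << label
        @positions[label] = []
      end
      @positions[label] << pos
      @codes << @code_of[label]
    end
  end

  # All positions that hold the given category (one hash lookup).
  def pos(category)
    @positions.fetch(category, [])
  end

  # Category stored at a position (one array lookup, then decode the integer).
  # Raises IndexError if the position is out of range.
  def category_at(position)
    @categories[@codes.fetch(position)]
  end
end

idx = TinyCategoricalIndex.new(%i[a b a b c])
p idx.positions        # :a maps to [0, 2], :b to [1, 3], :c to [4]
p idx.codes            # => [0, 1, 0, 1, 2]
p idx.pos(:a)          # => [0, 2]
p idx.category_at(4)   # => :c
```

Each category name is stored once, and the per-position array holds only small integers, which is where the space saving comes from when the same category repeats many times.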
When and why to use Categorical Index
Use it if you have a categorical variable, or data where there is more than one instance of the same object, and you want to index the dataframe by that column.
It will save you a lot of space and make access to all entries of the same category fast.
Also, if you want your dataframe to be indexed by a column in which not every entry is unique, a categorical index will come to the rescue.