GSoC 2016 Progress

Categorical Data Support [SciRuby]

Categorical Data [Week 3-4]

Daru now has the capability to store and process categorical data.

Daru now has three types of vector:

  • :object
  • :numeric
  • :category (new)

With the introduction of categorical data, Daru gains two benefits:

  1. Categorical data is stored efficiently.
  2. Tasks related to categorical data have become a lot easier.

The reason for the first benefit is that an ordinary vector stores its data as a plain array and does not take advantage of the fact that most of the entries are the same.

Let's discuss the various tasks that are now easy to do with a categorical vector.

(The purpose of this post is to give an overview of the tasks that can be accomplished with categorical data. To learn what each method does and how to use it, please look at this notebook.)

As soon as one declares a categorical variable, one can look at the frequency count of each category to get a feel for the data:

> rank = Daru::Vector.new ['III']*10 + ['II']*5 + ['I']*5, type: :category, categories: ['I', 'II', 'III']
> rank.frequencies
=> #<Daru::Vector(3)>
   I   5
  II   5
 III  10

One can look at the summary of the data to see common figures for categorical data, such as how many categories are present, which category is the most frequent, etc.

> rank.summary
=> #<Daru::Vector(6)>
         size           20
   categories            3
     max_freq           10
 max_category          III
     min_freq            5
 min_category            I

It's possible to convert a numerical variable into a categorical variable. For example, heights stores height measurements and we want to sort them into the categories low, medium and high:

> heights = Daru::Vector.new [30, 35, 32, 50, 42, 51]
> heights_cat = heights.cut [30, 40, 50, 60], labels: ['low', 'medium', 'high']
> heights_cat.frequencies
=> #<Daru::Vector(3)>
    low      3
 medium      1
   high      2
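
To make the binning step concrete, here is a minimal plain-Ruby sketch of the idea. It is not Daru's implementation: cut_sketch is a hypothetical helper, and it assumes each bin is left-closed and right-open, which is consistent with the counts shown above (50 lands in high).

```ruby
# Plain-Ruby sketch of binning; not Daru's implementation.
# Assumption: each bin is left-closed, right-open, i.e. [low, high).
def cut_sketch(values, edges, labels)
  raise ArgumentError, 'need one label per bin' unless labels.size == edges.size - 1

  values.map do |v|
    # Index of the first bin whose range contains v, or nil if none does.
    bin = edges.each_cons(2).find_index { |low, high| v >= low && v < high }
    bin && labels[bin]
  end
end

heights = [30, 35, 32, 50, 42, 51]
binned  = cut_sketch(heights, [30, 40, 50, 60], %w[low medium high])

# Count how many values fell into each label.
counts = binned.each_with_object(Hash.new(0)) { |label, h| h[label] += 1 }
```

In this sketch a value outside every bin maps to nil; how Daru treats such values is not covered here.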

Given a dataframe, it's possible to extract rows based on their categories. This uses the same Arel-like query syntax as an ordinary vector. For example:

> df = Daru::DataFrame.new({
  id: [0, 1, 2, 3, 4, 5, 6, 7],
  grade: %w[A C B A C C B B],
  name: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

> df[:grade] = df[:grade].to_category ordered: true, categories: %w[A B C]

# Let's list entries with grade less than 'C'
> df.where df[:grade].lt('C')
=> #<Daru::DataFrame(5x3)>
       grade    id  name
     0     A     0     a
     2     B     2     c
     3     A     3     d
     6     B     6     g
     7     B     7     h

The benefit is that lt compares grades using the category order we set.
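
One way to picture an ordered comparison is to compare categories by their position in the category list. The sketch below is plain Ruby under that assumption, not Daru's code, and positions_below is a hypothetical helper.

```ruby
# Plain-Ruby sketch; not Daru's implementation.
# Assumption: an ordered categorical compares values by their position
# in the user-supplied category list.
order  = %w[A B C]
grades = %w[A C B A C C B B]

# Hypothetical helper: row positions whose category ranks below `limit`.
def positions_below(values, order, limit)
  cutoff = order.index(limit)
  raise ArgumentError, "unknown category: #{limit}" if cutoff.nil?

  values.each_index.select { |i| order.index(values[i]) < cutoff }
end

below_c = positions_below(grades, order, 'C')
```

Here below_c holds the positions 0, 2, 3, 6 and 7, the same rows returned by the where query above.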

By defining a custom order of categories and setting ordered to true, one can sort the categories, find the min and max, etc. For example:

# Assuming df defined as above
# Let's rename the categories to show that lexical order is not followed
# while sorting categorical data
> df[:grade].rename_categories 'A' => 'Good', 'B' => 'Average', 'C' => 'Bad'
> df
=> #<Daru::DataFrame(8x3)>
           grade      id    name
       0    Good       0       a
       1     Bad       1       b
       2 Average       2       c
       3    Good       3       d
       4     Bad       4       e
       5     Bad       5       f
       6 Average       6       g
       7 Average       7       h
> df.sort! [:grade]
=> #<Daru::DataFrame(8x3)>
           grade      id    name
       0    Good       0       a
       3    Good       3       d
       2 Average       2       c
       6 Average       6       g
       7 Average       7       h
       1     Bad       1       b
       4     Bad       4       e
       5     Bad       5       f
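
The same ordering idea can be sketched in plain Ruby: rank each category by its position in the custom order and sort on that rank. This is an illustration under that assumption, not Daru's implementation, and the variable names are made up for the example.

```ruby
# Plain-Ruby sketch; not Daru's implementation.
order  = %w[Good Average Bad] # custom, non-lexical order
grades = %w[Good Bad Average Good Bad Bad Average Average]

# Rank of each category is its position in `order`.
rank = order.each_with_index.to_h

# Sort row positions by category rank; the position itself breaks ties,
# so rows within one category keep their original relative order.
sorted_positions = grades.each_index.sort_by { |i| [rank[grades[i]], i] }

# Smallest and largest category under the custom order.
lowest  = grades.min_by { |g| rank[g] }
highest = grades.max_by { |g| rank[g] }
```

A lexical sort would put Average first; with the custom ranks, Good comes first and Bad last, matching the sorted dataframe above.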

Example

Please have a look at this notebook, which describes the use of categorical data through an example.

Internal Structure

It is similar to the internal structure of the categorical index.

To store duplicate categories efficiently and make retrieval possible in constant time, categorical data in Daru uses two data structures:

  • Hash table: maps each category to the positions where it occurs. It is represented as @cat_hash.
  • Array: maps each position to an integer that represents a category.

For example:

dv = Daru::Vector.new [:a, :b, :a, :b, :c], type: :category

For dv, the hash table and array would be:

@cat_hash = {a: [0, 2], b: [1, 3], c: [4]}

@array = [0, 1, 0, 1, 2]

The hash table lets us retrieve all instances that belong to a given category in constant time.

Similarly, the array lets us retrieve the category of an instance in constant time.

The array stores integers rather than the category names themselves to avoid using unnecessary space.
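
The two structures can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not Daru's source: the class name CategoricalSketch, the methods positions_of and category_at, and the extra @categories list that turns an integer back into a category are all assumptions made for the example.

```ruby
# Plain-Ruby sketch of the two structures; names are hypothetical.
class CategoricalSketch
  attr_reader :cat_hash, :array

  def initialize(values)
    # Assumption: integer code i stands for @categories[i].
    @categories = values.uniq
    code_of     = @categories.each_with_index.to_h
    @cat_hash   = @categories.map { |c| [c, []] }.to_h

    # One pass: record each position under its category and
    # store the category's integer code at that position.
    @array = values.each_with_index.map do |value, pos|
      @cat_hash[value] << pos
      code_of[value]
    end
  end

  # All positions holding `category`: a single hash lookup.
  def positions_of(category)
    @cat_hash.fetch(category, [])
  end

  # Category stored at `pos`: two array lookups.
  def category_at(pos)
    @categories[@array.fetch(pos)]
  end
end

dv = CategoricalSketch.new([:a, :b, :a, :b, :c])
a_positions = dv.positions_of(:a)
last_cat    = dv.category_at(4)
```

For this input, dv.cat_hash and dv.array come out the same as the @cat_hash and @array shown above.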
