The end of GSoC is near. I finished the formula language implementation a bit early and decided to devote the remaining time to some other important issues.
During these last two weeks I fixed a few issues in Daru and mainly worked on this issue regarding how Daru handles missing values.
The following were the shortcomings:
- Update operations like
- Any value could be set as a missing value, which made the checks for missing values somewhat hard.
Now, Daru follows a simple approach of only considering
Float::NAN as the missing values. Although one loses the flexibility of assigning an arbitrary value as missing, this has greatly simplified many things, and the improvement in performance is significant. Further, one can now simply use
#replace to change the values that should be treated as missing to
In addition, updates have become blazingly fast without compromising the caching of missing values. I accomplished this with the following strategy:
- During updates, the cache that stores the positions of
Float::NAN becomes outdated and is not refreshed until those positions are required.
- When the missing positions of either
Float::NAN are required by any of the missing-value methods, they are returned directly if the cache isn't outdated; if the cache is outdated, it is rebuilt first.
This way one has the best of both worlds: updates remain fast, and the caching of
Float::NAN is maintained.
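The lazy caching strategy described above can be sketched roughly as follows. This is only an illustration in plain Ruby, not Daru's actual implementation: the class `LazyMissingVector` and its methods are hypothetical names, and the sketch assumes that `nil` counts as missing alongside `Float::NAN`.

```ruby
# Hypothetical stand-in for a vector; NOT Daru's real implementation.
class LazyMissingVector
  def initialize(data)
    @data = data.dup
    @missing_positions = nil # nil marks the cache as outdated
  end

  # An update only writes the value and marks the cache as outdated,
  # so it never pays the cost of rescanning the data.
  def []=(pos, value)
    @data[pos] = value
    @missing_positions = nil
  end

  def [](pos)
    @data[pos]
  end

  # The cache is rebuilt only when it is outdated and actually needed.
  def missing_positions
    @missing_positions ||= @data.each_index.select { |i| missing?(@data[i]) }
  end

  def cache_outdated?
    @missing_positions.nil?
  end

  private

  # Assumption: nil and Float::NAN are the missing values.
  # NaN is not equal to itself, so it must be detected with #nan?.
  def missing?(value)
    value.nil? || (value.is_a?(Float) && value.nan?)
  end
end
```

With this design, many consecutive updates trigger at most one rescan, and that rescan happens only when a missing-value query is made afterwards.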
I ran the following benchmarks:
[Benchmark code listing]
And these are the results before and after:
[Benchmark results listing]
Here’s the summary of the old and new API regarding handling of missing values:
Methods added in Daru::Vector (and category):
- replace values
Methods added in Daru::DataFrame:
Methods removed in Daru::Vector:
and other methods
#n_valid have been deprecated.
- As you can see, the performance of Daru's update methods has improved substantially, and the effects will be far-reaching, from improving other parts of Daru to improving performance in Statsample and Statsample-GLM.
- Along the way I learned how to use tools like
ruby-prof to benchmark the code and understand where performance is lagging.
- I noticed that methods like
# are proving to be bottlenecks, and there is room to improve them.
- Thanks to Victor for suggesting this change in Daru, providing a good API, and helping me all the way through implementing it.