Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short introduction. For those who are already using assertr, the text below will point out the improvements.
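To give a flavor of what assertr does (a minimal sketch using the built-in mtcars data, not code from the release itself): validation rules are chained onto a data frame, and any violated rule halts the pipeline with an informative error.

```r
library(assertr)
library(magrittr)

mtcars %>%
  verify(nrow(.) > 10) %>%                 # whole-table condition
  assert(within_bounds(0, Inf), mpg) %>%   # per-column predicate
  insist(within_n_sds(4), mpg)             # bounds computed from the data itself
```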
The Bayesian approach to ridge regression
In a previous post, we demonstrated that ridge regression (a form of regularized linear regression that attempts to shrink the beta coefficients toward zero) can be super-effective at combating overfitting and lead to a far more generalizable model. That approach to regularization used penalized maximum likelihood estimation (for which we used the amazing glmnet package). There is, however, another approach… an equivalent approach… but one that allows us greater flexibility in model construction and lends itself more easily to an intuitive interpretation of the uncertainty of our beta coefficient estimates. I’m speaking, of course, of the Bayesian approach.
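The equivalence alluded to here is the standard one: the ridge estimate is the posterior mode (MAP estimate) under a zero-mean Gaussian prior on the coefficients. In symbols:

```latex
% Penalized maximum likelihood (ridge):
\hat{\beta}_{\text{ridge}}
  = \underset{\beta}{\arg\min}\;
    \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2

% Bayesian formulation: with likelihood
%   y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)
% and prior
%   \beta \sim \mathcal{N}(0, \tau^2 I),
% the posterior mode is exactly the ridge solution with
%   \lambda = \sigma^2 / \tau^2
```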
Using Python decorators to be a lazy programmer: a case study
Decorators are considered one of the more advanced features of Python, and they are often the last topic covered in a Python class or introductory book. Unfortunately, they also trip up many beginning and even intermediate Python programmers. Those who stick it out and work through them, though, will be handsomely rewarded for their hard work.
Computational foreign language learning: a study in Spanish verb usage
Abstract: I did some computer-y stuff to construct a personal Spanish text corpus and create a Spanish verb study guide specifically tailored to the linguistic variety of Spanish I intend to consume and produce. It worked fairly well. It also revealed a (somewhat) generalizable picture of the relative frequencies of Spanish verb tenses and moods. This technique may prove to be extremely beneficial to Spanish-language pedagogy. If you’re uninterested in my motivations or procedure, you can skip to the section labeled “results”.
Genre-based Music Recommendations Using Open Data (and the problem with recommender systems)
After 12 long months of pouring my soul into it, my book, Data Analysis with R, was finally published. After the requisite 2-4 day breather, I started thinking about how I was going to get back into the swing of regular blog posts, and decided that the easier and softer way was to cannibalize and expand on an example from the book.
Kickin' it with elastic net regression
With the kind of data that I usually work with, overfitting regression models can be a huge problem if I’m not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the L2 norm (Euclidean length) of the coefficient vector, which results in “shrinking” the beta coefficients. The aggressiveness of the penalty is controlled by a parameter, lambda.
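As a minimal sketch of what this looks like in practice (using glmnet on simulated data, since the excerpt doesn’t include the original code): alpha = 0 requests the pure L2 (ridge) penalty, alpha = 1 the lasso, values in between the elastic net mix, and cv.glmnet chooses lambda by cross-validation.

```r
library(glmnet)

set.seed(1)
# simulated data, purely for illustration
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- x %*% rnorm(20) + rnorm(100)

# alpha = 0 is ridge; alpha = 1 is the lasso; in between is elastic net
cv_fit <- cv.glmnet(x, y, alpha = 0)

# shrunken coefficients at the cross-validated choice of lambda
coef(cv_fit, s = "lambda.min")
```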
Lessons learned in high-performance R
On this blog, I’ve had a long-running investigation/demonstration of how to make an “embarrassingly parallel” but computationally intractable (on commodity hardware, at least) R problem more performant by using parallel computation and Rcpp.
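As a minimal sketch of the parallel-computation half (the toy task below is mine, not from the original posts): for an embarrassingly parallel job, R’s built-in parallel package lets you swap lapply for mclapply and fan the work out across cores.

```r
library(parallel)

# a toy embarrassingly parallel task: each element is processed
# independently, so the work splits cleanly across cores
slow_task <- function(i) {
  Sys.sleep(0.01)  # stand-in for real computation
  sqrt(i)
}

serial_res   <- lapply(1:100, slow_task)
parallel_res <- mclapply(1:100, slow_task, mc.cores = 4)  # fork-based; mc.cores > 1 not supported on Windows

identical(unlist(serial_res), unlist(parallel_res))  # TRUE
```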