Decision Trees & Random Forests
Our September 2020 meetup covered decision trees and random forests. It featured three presentations on tree-based methods in R:
Hierarchical clustering in R (Kavana Rudresh)
treeheatr - an R package to create interpretable decision tree visualizations (Trang Le)
Decision trees vs. random forests (Karla Fettich)
#RLadies Join us today at 6pm for our September event! We are excited to have a line-up of presenters @trang1618 @kfettich @kavana_rudresh to share how to engage with tree-based methods in R. https://t.co/iCTY8n6Hub
— R-Ladies Philly (@RLadiesPhilly) September 15, 2020
Hierarchical clustering in R
Our first presenter was Kavana Rudresh, Enterprise Business Intelligence Manager for Strategic Analytics at Comcast Corporation. Materials from Kavana’s presentation are available here.
Kavana explained some key concepts within clustering, before focusing on hierarchical algorithms. Within hierarchical clustering, she discussed the distinction between bottom-up (agglomerative/additive) and top-down (divisive) types, before moving on to questions of how to measure similarity and how to choose the number of clusters.
Kavana also walked us through a script which employs some of these techniques with mall customer data and discussed tips for using clustering techniques with messy real-world data.
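The agglomerative workflow described above can be sketched in a few lines of base R. This is not Kavana's script: it uses the built-in USArrests data as a stand-in for the mall customer data, and the Euclidean distance, Ward linkage, and k = 4 are illustrative assumptions rather than recommendations from the talk.

```r
# Stand-in data (assumption): built-in USArrests, not the mall customer data
x <- scale(USArrests)                 # standardize so no variable dominates the distance

# Similarity measure: Euclidean distance between standardized rows
d <- dist(x, method = "euclidean")

# Bottom-up (agglomerative) clustering with Ward linkage
hc <- hclust(d, method = "ward.D2")

# Choose a number of clusters by cutting the tree; k = 4 is arbitrary here
clusters <- cutree(hc, k = 4)
print(table(clusters))

# Draw the dendrogram and outline the four clusters
plot(hc, cex = 0.6, main = "Agglomerative clustering of USArrests")
rect.hclust(hc, k = 4, border = "red")
```

In practice the choice of k is usually guided by inspecting the dendrogram heights or by a criterion such as the silhouette width, rather than fixed in advance.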
In the discussion period we talked a bit about how dendrograms can be misinterpreted; it is important to look at where leaves join rather than how they are arranged relative to each other. One suggestion was to visualize a baby’s crib mobile, where the branches can rotate without changing the structural relationship between the leaves.
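The crib mobile idea can be checked in code. The sketch below, which again uses a subset of the built-in USArrests data as an assumed example, flips every branch of a dendrogram and confirms that the heights at which leaves join (the cophenetic distances) do not change even though the left-to-right leaf order does.

```r
# Assumed example data: the first ten states of the built-in USArrests data
states <- rownames(USArrests)[1:10]
hc <- hclust(dist(scale(USArrests[states, ])), method = "average")

dend <- as.dendrogram(hc)
flipped <- rev(dend)                  # rotate every branch, like spinning a mobile

# Cophenetic distance = height at which two leaves first join
d_original <- as.matrix(cophenetic(dend))[states, states]
d_flipped  <- as.matrix(cophenetic(flipped))[states, states]

order_changed  <- !identical(labels(dend), labels(flipped))
same_structure <- isTRUE(all.equal(d_original, d_flipped))

print(order_changed)
print(same_structure)
```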
treeheatr - an R package to create interpretable decision tree visualizations
Trang Le is a postdoctoral fellow with Jason Moore at the Computational Genetics Lab, University of Pennsylvania. She is the author and maintainer of 5 R packages and an active contributor to the automated machine learning tool TPOT.
Trang’s presentation was about the package treeheatr, which she authors and maintains. treeheatr creates interpretable decision tree visualizations which incorporate a heatmap of the data at the tree’s leaf nodes. The presentation slides are available here and a recording of the presentation is available here. Trang started by reviewing some other options for visualizing decision tree models, before introducing treeheatr and how to use it. A vignette is available here.
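As a rough idea of what basic usage looks like, here is a minimal sketch, assuming the package's heat_tree() function and its bundled penguins dataset; it is not taken from Trang's slides, and the vignette remains the authoritative guide. The call is guarded so the script still runs where treeheatr is not installed.

```r
# Guard (assumption): only run the example if treeheatr is available
has_treeheatr <- requireNamespace("treeheatr", quietly = TRUE)

if (has_treeheatr) {
  # penguins ships with treeheatr; target_lab names the outcome column
  p <- treeheatr::heat_tree(treeheatr::penguins, target_lab = "species")
  print(p)   # draws the decision tree with the heatmap at the leaf nodes
} else {
  message("treeheatr is not installed; try install.packages('treeheatr')")
}
```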
You can learn more about treeheatr on the GitHub website, in the GitHub repository, and on CRAN!
Decision trees vs. random forests
Our third presenter was Karla Fettich, Head of Algorithm Development at Orchestrall, where she leads behavioral data analytics efforts and predictive model development for healthcare IT innovation. She is also an organizer of R-Ladies Philly!
Karla provided an introduction to random forests. Her slides are available here and a recording of her presentation is available here. The presentation included some background on decision trees versus random forests, a high-level explanation of how random forest algorithms work, and some discussion of the advantages and disadvantages of the approach.
Karla also walked through an implementation with some fictitious data - a case of the Mondays - and highlighted some “gotchas” to watch out for.
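To make the high-level idea concrete, here is a hand-rolled sketch of the two ingredients behind a random forest: many trees fit on bootstrap samples, and a majority vote across them. It is not Karla's code or data. It uses the built-in iris data and the rpart package, and it simplifies the real algorithm by drawing a random subset of predictors once per tree rather than at every split. The number of trees and the subset size are arbitrary assumptions.

```r
library(rpart)

set.seed(42)

# Assumed example data: built-in iris, with 30 rows held out for testing
test_idx <- sample(nrow(iris), 30)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

predictors <- setdiff(names(iris), "Species")
n_trees <- 50

# Each tree sees a bootstrap sample and a random pair of predictors
# (simplification: real random forests resample predictors at each split)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  vars <- sample(predictors, 2)
  rpart(reformulate(vars, response = "Species"), data = boot, method = "class")
})

# Every tree votes on every test row; the majority class wins
votes <- sapply(trees, function(tr) as.character(predict(tr, test, type = "class")))
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# A single decision tree on all predictors, for comparison
single_tree <- rpart(Species ~ ., data = train, method = "class")
tree_pred <- as.character(predict(single_tree, test, type = "class"))

forest_acc <- mean(forest_pred == test$Species)
tree_acc   <- mean(tree_pred == test$Species)
print(c(forest = forest_acc, single_tree = tree_acc))
```

For real work, a dedicated package such as randomForest or ranger handles the per-split sampling, out-of-bag error, and variable importance that this sketch leaves out.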
Thank you
Many thanks to our fantastic presenters, Kavana, Trang, and Karla, and to R-Ladies Global for making the virtual event possible via Zoom.