December 2019 Recap: Machine Learning Workshop
On December 2, 2019, almost 50 data enthusiasts joined R-Ladies Philly for a workshop on machine learning in R.
The workshop was led by Trang Le. Trang is a researcher at the University of Pennsylvania who has authored several R packages and applies machine learning to biomedical data. Follow her on Twitter and at https://trang.page.
At a high level, this workshop covered:
- Intro to the caret package and why it exists
- Live demo and exercises using a dataset of beer reviews
- Trang’s insights into good practices
Off to a great start with our machine learning workshop! pic.twitter.com/lNUCAoiQye
— R-Ladies Philly (@RLadiesPhilly) December 2, 2019
The materials for this workshop are available online:
- Slides: https://slides.com/trang1618/caret-rladies
- RStudio Cloud: bit.ly/33MFHLy
- Code: https://github.com/trang1618/rladies-caret
Do you even caret all?
The caret package was created to solve the problem of lots of modeling packages that didn’t play well together. caret currently unifies over 200 models!
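As a rough illustration of what "unifying" models means in practice, the sketch below fits two different model types through the same `train()` call, changing only the `method` string. It is not taken from the workshop materials: the built-in `mtcars` data, the chosen predictors, and the two methods (`"lm"` and `"knn"`) are assumptions picked only because they need no extra packages.

```r
library(caret)

set.seed(42)

# 5-fold cross-validation, shared by both models
ctrl <- trainControl(method = "cv", number = 5)

# Same interface, different underlying model: only `method` changes.
# mtcars is a stand-in dataset (assumption), not the workshop's beer data.
fit_lm  <- train(mpg ~ wt + hp, data = mtcars, method = "lm",  trControl = ctrl)
fit_knn <- train(mpg ~ wt + hp, data = mtcars, method = "knn", trControl = ctrl)

# modelLookup() lists the tuning parameters caret knows for a method
knn_params <- modelLookup("knn")
print(knn_params)

# Predictions also go through one common predict() call
head(predict(fit_lm,  newdata = mtcars))
head(predict(fit_knn, newdata = mtcars))
```

Because both fits are `train` objects, the same resampling setup and the same `predict()` call apply to each, which is what makes it practical to swap models in and out.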
Trang suggested that you get started with the caret website. She also reminded us that the package is not perfect. When you find issues or errors, contribute to the codebase or submit issues!
Machine learning for beer lovers
This workshop used the beer ratings dataset available on Kaggle (link). This data is freely available and provides a tasty example to practice on. Each review includes ratings on the appearance, aroma, palate, taste, and overall impression of a beer. Reviews include product and user information, followed by each of these five ratings, and a plaintext review.
Remember: it is important to clean your data! Trang recommended the skimr package and its skim function to quickly get a look at your dataset.
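For readers who have not used it, a minimal sketch of that first look might be as follows. The built-in `iris` data frame is used here as a stand-in (an assumption); in the workshop the same call would have been applied to the beer reviews data.

```r
library(skimr)

# skim() summarises every column by type: missing values, completeness,
# and basic statistics such as mean, sd, and quantiles for numeric columns.
# iris is a placeholder dataset, not the workshop's beer data.
summary_tbl <- skim(iris)
print(summary_tbl)
```

Scanning the missing-value counts and the distribution of each column is usually enough to spot the columns that need cleaning before any modeling starts.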
For this workshop, we used 1,000 reviews to predict the ABV (alcohol by volume) of beers from the reviews.
Machine learning 101: before building your models… MAKE SURE YOUR DATASET IS CLEAN. (Or, can't do the fun stuff until you've completed your data cleaning chores) #rstats @trang1618
— R-Ladies Philly (@RLadiesPhilly) December 3, 2019
Making some predictions…
Using 1,000 reviews from the beer review dataset, attendees practiced…
- Dimensionality reduction with principal component analysis
- Fitting a support vector machine model, then tuning the parameters
- Testing a random forest model
- Using the unstructured text reviews to predict and then evaluating which words were the most predictive
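The first three exercises can be sketched in caret roughly as below. This is not the workshop's code (that lives in the GitHub repository linked above); it is a small stand-in that assumes the built-in `mtcars` data with `mpg` as the numeric target in place of ABV, a radial-kernel SVM (`"svmRadial"`, which needs the kernlab package), and a random forest via `"rf"` (which needs the randomForest package). The kernel choice, the tuning grid, and the split proportion are all illustrative assumptions.

```r
library(caret)

set.seed(2019)

# Hold out part of the data for a final check (stand-in data, see above)
idx      <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_df <- mtcars[idx, ]
test_df  <- mtcars[-idx, ]

ctrl <- trainControl(method = "cv", number = 5)

# SVM on principal components: centering, scaling, and PCA are applied
# inside each resample via preProcess; tuneLength asks caret to try
# 5 candidate values of the cost parameter.
svm_fit <- train(mpg ~ ., data = train_df,
                 method     = "svmRadial",
                 preProcess = c("center", "scale", "pca"),
                 tuneLength = 5,
                 trControl  = ctrl)

# Random forest with an explicit grid for mtry (predictors tried per split)
rf_fit <- train(mpg ~ ., data = train_df,
                method    = "rf",
                tuneGrid  = expand.grid(mtry = c(2, 4, 6)),
                trControl = ctrl)

# Compare both models on the held-out rows (RMSE, R-squared, MAE)
svm_perf <- postResample(predict(svm_fit, test_df), test_df$mpg)
rf_perf  <- postResample(predict(rf_fit,  test_df), test_df$mpg)
print(svm_perf)
print(rf_perf)
```

Passing `"pca"` through `preProcess` rather than transforming the data up front keeps the dimensionality reduction inside the cross-validation loop, so the components are re-estimated on each training fold.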
Closing thoughts
Trang wrapped up with a Q&A session, where she compared machine learning frameworks in R and Python and discussed what “counts” as machine learning.
Thank you!
Thank you to all our attendees, our sponsors (Elsevier), and especially Trang!!
Do you even caret all? With Trang Le, PhD! pic.twitter.com/ZeN8DkTFQT
— R-Ladies Philly (@RLadiesPhilly) December 2, 2019
About our sponsor: Elsevier fuses evidence-based trusted content, cutting-edge technology and analytics in a range of innovative digital applications for end users in the scientific, academic and medical worlds. Our leading-edge applications, platforms and products are used globally to advance science, aid discovery, improve patient outcomes and to positively impact people’s lives.
This post was authored by Alice Walsh. For more information contact philly@rladies.org