Putting those bike pedals to work with a comprehensive exploratory data analysis, navigating through a near-inferno of namespace and dependency issues in package development, and how you can ensure bragging rights during your next play of Guess My Name using decision trees.
Episode Links
- This week's curator: Tony Elhabr - @TonyElHabr (Twitter) & @[email protected] (Mastodon)
- My Year of Riding Danishly
- Tame your namespace with a dash of suggests
- Guess My Name with Decision Trees
- Entire issue available at rweekly.org/2024-W08
Supplement Resources
- {fusen} - Inflate your package from a simple flat Rmd https://thinkr-open.github.io/fusen/
- R Packages Second Edition https://r-pkgs.org/
- {usethis} - Automate package and project setup https://usethis.r-lib.org/
Supporting the show
- Use the contact page at https://rweekly.fireside.fm/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @theRcast (Twitter) and @[email protected] (Mastodon)
- Mike Thomas: @mike_ketchbrook (Twitter) and @[email protected] (Mastodon)
Music credits powered by OCRemix
- Swing Indigo - The Legend of Zelda: Majora's Mask - sschafi1 - https://ocremix.org/remix/OCR04560
- What Lurks Behind the Door - Final Fantasy V - Lucas Guimaraes, Andrew Steffen - https://ocremix.org/remix/OCR04542
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 153 of the R Weekly Highlights podcast. This is the weekly show where we talk about the latest happenings and the tremendous resources that you can find every single week at the rweekly.org website. My name is Eric Nantz, and I'm delighted that you joined us from wherever you are around the world. We're just about past the halfway point in February, so spring is coming soon, I hope. But I can't do the show alone. Of course, I am joined by my awesome cohost, Mike Thomas. Mike, how are you doing this morning? I'm doing well, Eric. Yeah, I think the audience can probably hear in our voices that spring isn't quite here yet, but we're getting there. Yeah, it's the combo: the cold weather, kids bringing you-know-what home from their respective day cares or schools. It just never ends. But, nonetheless, we're gonna power on through here. We got a lot of exciting content to share with you all today, and the content for this particular issue was curated by Tony Elhabr, who had tremendous help, as always, from our fellow R Weekly team members and contributors like all of you around the world with your awesome pull requests and suggestions for more excellent resources.
As I said, Mike, we do sense spring is coming, and that's when I know my kids like to start getting their bikes out to take their bike rides around the neighborhood and whatnot. Well, it's appropriate that, as the weather's warming up, our first highlight today is taking a very data driven approach to just how far you can take your bike, and analyzing that with some really fun exploratory data analysis and the like. And this post comes to us from Greg Dubrow, who is a data analyst now based in Denmark, and he has a very passionate hobby, so to speak, of riding his bike basically everywhere he can go. He's been doing this wherever he's been in the world. The first part of the blog post gives a nice background on the various bikes he's had growing up, even a little mishap he had last year, which any bike enthusiast can probably relate to.
But, nonetheless, he talks about taking advantage of recording his bike riding data using an app called Strava. I'd not heard of this before, but, apparently, it gives you a boatload of metrics having to do with your bike riding. And the first question comes, okay, you've got this data in the app, right? How do you get it out of there? Well, you could download a bundle from your profile on the Strava site as one way to get a CSV text dump, which he does end up doing. But like anything else, there is an API for that, right? And not only that, there is an R package to help you grab this data from R itself called rStrava, which he utilizes alongside, like I said, the aforementioned CSV data dump, if you will, and merges them together to give him a nice tidy dataset after some very usual cleaning, reshaping, and manipulation of dates and whatnot so that it's actually ready for analysis. So, again, an awesome data driven approach to take advantage of modern tech to put this data into R itself.
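For anyone curious what that retrieval step can look like, here is a minimal sketch, not Greg's exact code, of pulling ride activities with {rStrava} and tidying them up; the app name and the STRAVA_* environment variables for the credentials are placeholders you would set up yourself.

```r
# A minimal sketch, not Greg's exact code: authenticate with the Strava API via
# {rStrava}, pull the activity list, and tidy it into a rides data frame.
# The app name and the STRAVA_* environment variables are placeholders.
library(rStrava)
library(dplyr)
library(lubridate)

stoken <- httr::config(
  token = strava_oauth(
    app_name      = "my_ride_analysis",            # hypothetical app name
    app_client_id = Sys.getenv("STRAVA_CLIENT_ID"),
    app_secret    = Sys.getenv("STRAVA_SECRET"),
    app_scope     = "activity:read_all",
    cache         = TRUE
  )
)

rides <- get_activity_list(stoken) |>
  compile_activities() |>
  filter(type == "Ride") |>
  mutate(
    start_date = ymd_hms(start_date),
    month      = month(start_date, label = TRUE)
  )
```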
You've got the data. Now what? Like a lot of the posts we cover in R Weekly, we start with some fun exploratory data analysis. And to get things going quickly, he makes use of a very cool package that I've actually seen utilized by some of my colleagues at the day job as well as others in the community, called DataExplorer. This is a really nice package that gives you a very quick way to explore, say, the missingness in your data, as well as doing some very nice correlation heat maps right off the bat with your numeric variables. And he sees that, yeah, most of these variables have a positive correlation with the key metrics, such as distance and the actual moving time of the bike, where this app is apparently smart enough to detect when the bike is actually moving versus stationary.
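A quick-look pass with {DataExplorer} in that spirit can be as short as the sketch below, assuming the `rides` data frame from the earlier sketch; `plot_missing()` and `plot_correlation()` do the heavy lifting.

```r
# Quick-look EDA with {DataExplorer}, assuming the `rides` data frame from the
# sketch above: per-variable missingness, then a correlation heatmap of the
# numeric columns.
library(DataExplorer)

plot_missing(rides)                            # percent missing per variable
plot_correlation(rides, type = "continuous")   # heatmap of numeric correlations
```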
So a really novel use of tech here, and the post has both the variables with their percent of missing values as well as this aforementioned heat map with the correlations, to help begin informing just what kind of relationships he will explore later on in the post. And you start to see some of the things you might expect intuitively, such as the average speed of his bike being positively correlated with distance, albeit not a huge relationship right off the bat, and also that the power output of the ride, measured as average wattage, may have some negative correlations with other metrics and the like. Augmenting these visuals from the correlation perspective is the tried and true scatterplot, which uses a very novel functional approach that he got inspiration for from one of Cédric Scherer's posts (he does all things ggplot2), and with a little bit of purrr magic with patchwork he's able to get these nice correlation plots for the key response variables of distance, moving time, and average speed. You, again, start to see positive correlations amongst many of the key metrics, such as calories, average watts, moving time, etcetera, to give him a better idea of what he might expect out of a more rigorous analysis.
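The functional idea here, hedged and simplified rather than Greg's actual code, is to write one scatterplot function, map it over the predictors with {purrr}, and stitch the panels together with {patchwork}; the predictor and response column names below are assumptions based on the metrics mentioned in the episode.

```r
# Hedged sketch of the functional scatterplot idea: one plotting function,
# mapped over predictors with {purrr}, assembled with {patchwork}.
# The predictor/response column names are assumptions based on the episode.
library(ggplot2)
library(purrr)
library(patchwork)

scatter_vs <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = x, y = y)
}

predictors <- c("calories", "average_watts", "moving_time")  # assumed names

map(predictors, \(p) scatter_vs(rides, x = p, y = "distance")) |>
  wrap_plots(ncol = 3)
```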
And just where does this more rigorous analysis start? Well, we all like some tables, right? So he starts off with creating some fun gt tables of the metrics in terms of the total time, elevation, and total calorie output throughout the year. And you can see that out of 446 rides, yeah, he's burned a lot of calories and generated a lot of energy. Lots of cool summaries there. And then we get some more visuals, Mike, where he looks at the seasonal pattern of his rides per month. So why don't you take us through some of the visuals you're seeing here? Yeah. It's a really nice visual blog post. I think Greg notes that he leveraged
[00:06:31] Mike Thomas:
some of the tutorials that Cédric Scherer has put together. And if you are in the DataViz space, especially the R DataViz space, that is a name that you are certainly familiar with. One of the visuals that I thought was pretty cool was the one showing the number of rides that he has, and their type, in this polar area coordinate diagram. And the reason that he did that is to correlate them to the hour of the day that the ride took place. So this chart is sort of representing a clock, which is a really cool use case for these polar area coordinate diagram type charts. I struggle to find a lot of use cases for those charts in a lot of my EDA and DataViz work, but I think this is a perfect use case for it.
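As a hedged sketch of that "clock" idea, again assuming the `rides` data frame and a `start_date` column from the earlier sketch, you can count rides by hour and wrap a bar chart onto polar coordinates:

```r
# Count rides by hour of day, then wrap the bars onto polar coordinates so the
# 24 hours read like a clock face. Assumes the `rides` data frame from earlier.
library(ggplot2)
library(dplyr)
library(lubridate)

rides |>
  mutate(hour = hour(start_date)) |>
  count(hour) |>
  ggplot(aes(x = factor(hour, levels = 0:23), y = n)) +
  geom_col(fill = "steelblue") +
  coord_polar(start = 0) +
  labs(x = NULL, y = "Number of rides", title = "Rides by hour of day")
```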
So I really appreciated that. You know, going back to talking about some of the dependent variables that he used: one of the dependent variables that Greg used was called kilojoules, I believe. Hopefully, I'm pronouncing that somewhat correctly. And he used that variable instead of calories because, according to Garmin's frequently asked questions, calories expended are the total energy you expended in the time that it took to do the workout, while kilojoules are the energy actually produced by the workout, and that formula is watts times seconds divided by 1,000. So that was very interesting to me, and, you know, I know a lot of workout apps out there sort of focus on calories, and maybe they should be focusing on kilojoules instead. So I thought that was pretty interesting. You know, some of these gt tables, and the way that he was able to format them in the blog post to have a lot of these gt tables actually side by side, instead of one on top of the other, using some HTML, was a pretty cool, nifty trick to make this blog post nice and neatly put together as well.
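That kilojoules definition is just work equals power times time: one watt is one joule per second, so a quick back-of-the-envelope version looks like this (the numbers are made up for illustration).

```r
# Work in kilojoules from average power and moving time:
# 1 watt = 1 joule per second, so kJ = watts * seconds / 1000.
avg_watts      <- 150        # example average power in watts
moving_seconds <- 90 * 60    # a 90-minute ride
kilojoules     <- avg_watts * moving_seconds / 1000
kilojoules                   # 810 kJ of work for this example ride
```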
So a lot of really interesting work here, and then he actually fits some models at the end of this. There's a time model, a kilojoules model, and then a watts model as well. And the model fits all look pretty good. I'm pretty impressed, and it's sort of a function, I think, of just the amount of data that this Strava app allows you to have control over and take a look at. And you can do some pretty cool analysis, as Greg shows us here, with your workout and exercise data, particularly with your cycling data in the Strava app. So a really cool use case, I think, walking through a lot of different types of data visualization and some predictive modeling, sort of an end to end data science project here using a pretty nifty data set. So, hats off to Greg for a great start to the R Weekly highlights this week.
[00:09:25] Eric Nantz:
Yeah. Really awesome approaches here. And looking at even the source of the Quarto document that this blog post is based on, you're gonna see, like you said, Mike, those nifty tricks of putting the tables and some of the plots side by side. A really nice, easily viewable post here. We got the table of contents on the right margin, so lots of ways to hop back and forth amongst the sections. So, again, Quarto gives you a lot of these niceties out of the box. And really, hats off to owning your data as best you can. Albeit, yeah, a third party app is collecting it, but fair play to Strava for giving users an API to expose this, because there are some others out there in the fitness tracking space that aren't quite as friendly about you getting your exercise and workout metrics out of them. So it's really encouraging to see. For all you cyclists listening out there, and I'm a part-time cyclist when I can be.
Yeah, it's really cool to see him be able to take advantage of this amount of data. And, you know, Greg's got to be a pretty fit person to have this number of rides in 2023, even with the injury that he underwent earlier in the year. Like, that is impressive stuff, impressive dedication, and, yeah, maybe I need to not be so lazy this year. We'll see.
[00:10:48] Mike Thomas:
And hats off to Greg as well, and to Quarto, for the nice collapsed code chunks that allow you to see exactly how he did what he did in this blog post.
[00:10:58] Eric Nantz:
Yeah. Isn't that a great, you know, UX, so to speak, for digesting the parts you like? Maybe you're more interested in, say, the modeling part, or more interested in the visualization part, or, of course, interested in everything, but you can opt in to looking at all those details and still get the full cohesive story here. So, yeah, really enjoyed the post. Looks like he spent a lot of time drafting this together. But, again, it's all available for us to see in the open. And, yeah, I'm gonna have to maybe get a new bike this year so I can start tracking some metrics.
[00:11:28] Mike Thomas:
Me too.
[00:11:38] Eric Nantz:
Now maybe, Mike, you're on a long bike ride, and it may seem like every time you think you've gotten to your destination, there's some little hurdle along the way. Right? Maybe it's a traffic light. Maybe it's, you know, who knows what else is happening out there. Sometimes package development can feel like that: you're so close, you get that glimmer of hope, and then something really crazy happens. That's where our next highlight comes in, coming to us from our fine friends at ThinkR. This has been authored by Swann Floc'hlay. I probably didn't pronounce that at all correctly, but apologies in advance.
But they have an excellent blog post here about how you can tame the namespace of your R package by making use of the Suggests field. So, like any good story, this starts with what seems to be a smooth road. The example package in this post is a simple wrapper on top of ggplot2 to help export a plot, with a simple function they call save_plot. It looks innocent enough. Right? We are simply letting the user specify, after they specify the ggplot object, the extension, which can be one or more of, like, PNG, JPEG, or PDF, where to put it, and what the file name is.
Pretty straightforward stuff: it's just wrapping a call to ggsave() with a little bit of purrr on top of that. And the usage looks very straightforward. The example looks very logical: you're gonna plot your dataset, use the save_plot function, and clean up after yourself. A well constructed example. So, like anything in package development, you're gonna start checking this on your local system, using devtools::check() most of the time. It comes through with flying colors. No issues at all. No errors. No warnings. No notes. You are feeling good about it. And, as good practice, this package code is on version control with GitHub.
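As a rough sketch of what a wrapper like that might look like (the argument names here are illustrative, not necessarily the post's exact ones), the function simply loops over the requested extensions and calls ggsave() once per format.

```r
# A hedged sketch of a ggsave() wrapper in the spirit of the post's save_plot();
# argument names are illustrative, not necessarily the post's exact ones.
save_plot <- function(plot, filename, path = ".",
                      extensions = c("png", "jpeg", "pdf")) {
  purrr::walk(
    extensions,
    \(ext) ggplot2::ggsave(
      filename = paste0(filename, ".", ext),
      plot     = plot,
      path     = path
    )
  )
  invisible(filename)
}

# p <- ggplot2::ggplot(mtcars, ggplot2::aes(wt, mpg)) + ggplot2::geom_point()
# save_plot(p, "mtcars-scatter", path = tempdir(), extensions = c("png", "svg"))
```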
So why rely on just your local system to do the checking? We like to use GitHub Actions now to do a lot of this automated checking as well. And that's where things start to go a little off the rails, because we are now embodying the infamous slogan in software development: it works fine on my machine. But on the CI check, we see a cryptic error. And where does this lead? Well, you go down the rabbit hole of checking the logs. In the details, there's an error saying that there was an error running the example code, which, again, looks straightforward.
It works fine locally, right? Go down that backtrace a bit further, and then you see this error from loadNamespace(): there is no package called 'svglite'. Oh, boy. What on earth happened here? Now we gotta put our detective hats on. Where in the world is svglite being used, Mike? What gives?
[00:14:52] Mike Thomas:
Well, you'd only see this in the source code, potentially, or in the package DESCRIPTION file, but it's a dependency of ggplot2 specific to the ggsave() function. When you are saving a plot using the ggsave() function from ggplot2 as an SVG, it's going to employ the svglite package. And one of the reasons why you don't have this in your CI/CD check is because svglite is not a hard dependency of ggplot2; svglite is in the Suggests portion of the dependencies of ggplot2. So it is not going to get installed automatically when you specify that ggplot2 is a dependency of your package.
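One way to confirm that from your own session, without digging through source code, is to inspect ggplot2's dependency fields directly; both calls below are standard base/utils tooling.

```r
# svglite shows up under ggplot2's Suggests, not its Imports, so it is not
# installed automatically when your package depends on ggplot2.
tools::package_dependencies(
  "ggplot2",
  db    = installed.packages(),
  which = "Suggests"
)

# Or read the fields straight from the installed DESCRIPTION:
packageDescription("ggplot2", fields = c("Imports", "Suggests"))
```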
So, you know, this is very familiar territory, probably, for those of us who have done a lot of R package development and run into similar situations before, either, you know, wrestling with whether a dependency should be a hard import or should be in the Suggests section of your DESCRIPTION file, or how to manage dependencies that are only specific to, maybe, vignettes, or things like that. So this is very familiar territory to me. I have, maybe, a story that I might tell if we have a little bit of time here, but I think it sort of just goes to the overall narrative that sometimes things may work well on your machine, but when you start to employ CI/CD that, you know, is going to leverage a different machine to build and test that package, you may see a failure there. And, honestly, that's a good thing. Because what that means is that when someone else on a completely different machine than yours wants to use your package, and that's the whole idea, to build software that is useful for other folks as well, they might run into the same issue. Even though you saw no issues, warnings, errors, or notes in your own devtools check run, you know, it's a good thing, in my opinion, that when you send it off to GitHub Actions in this case, this error presented itself and pops up and, you know, lets you know exactly what the issue was. I think the error message is pretty descriptive, and, obviously, you do have to do a little bit of detective work to figure out, hey, where the heck is svglite even impacting us in this particular case. But if you've been around the block a little bit with ggplot2 and you sort of understand what you're trying to do here in terms of saving that plot, hopefully, it won't take you too long to figure that out, as was the case for Swann.
So, you know, I think this blog post also nicely calls out the use of the {fusen} package, which, for those who are unfamiliar, is a package that streamlines the development of R packages. And I believe it sort of uses an R Markdown approach, a chunks approach, to execute different commands through a nice documentation framework, and sort of build out all of the different components that you need for that R package. So, you know, I don't know. Eric, if you wanna take over more of the final solution here to ensure that this,
[00:18:23] Eric Nantz:
that this package passed all of its CI/CD checks when it went to GitHub Actions. Oh, yeah. I'm chomping at the bit for this because, boy, do I feel seen on some of the ways this can be handled. So, now that we realize that svglite is definitely required for this save_plot function, there are a couple of options for how this could be tackled. One is that, in the roxygen2 preamble for this function, we declare an importFrom for svglite, for the svglite() function. That is certainly a valid approach. Right? Well, when you run a check again, even locally, you're gonna see something appear that may seem really scary, and frankly, it can be: you will be warned that, hey, you know what, the imports of your package may seem small, may only seem like they need ggplot2 and svglite.
But that has now ballooned to 21 non-default packages that are gonna be required at install time to get your package installed. Now, the CRAN maintainers put this check in R CMD check. The note, I should say, is that importing from so many packages makes the package vulnerable to any of them becoming unavailable. Yes, we have covered over the years of this podcast when one dependency suddenly got, quote, unquote, archived on CRAN. In fact, that even affected ggplot2, I believe, amongst others. So how appropriate here. But you still need this package, or do you need it?
Now, this is where Suggests comes in: taking this preamble out of the roxygen2 documentation, putting svglite in the Suggests field, and then, in your function, having a check via the requireNamespace() function to see whether the user has installed it on their local system, and prompting them to install it in order to get full support for all those file types that are being exported. But the function will still export, in the case of the solution here, the file types that don't need svglite. So there's a happy medium in here, from the user experience to what they politely call avoiding the backlash of what can happen, both for you as the developer and for the end user knowing what to do next.
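A minimal sketch of that Suggests pattern, not the post's exact code: the function only requires svglite at the moment an SVG is requested, fails with a helpful message if it's missing, and keeps the other formats working regardless.

```r
# Minimal sketch of the Suggests pattern: only require svglite the moment an
# SVG is requested, and keep the other formats working without it.
save_plot <- function(plot, filename, path = ".", extensions = c("png", "pdf")) {
  if ("svg" %in% extensions && !requireNamespace("svglite", quietly = TRUE)) {
    stop(
      "Package 'svglite' is needed to export SVG files. ",
      "Install it with install.packages(\"svglite\").",
      call. = FALSE
    )
  }
  purrr::walk(
    extensions,
    \(ext) ggplot2::ggsave(paste0(filename, ".", ext), plot = plot, path = path)
  )
  invisible(filename)
}
```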
So it may seem like a little more upfront work, but the good news is that now the dependency footprint of your package going to CRAN has become much smaller, such that the user, or your CI/CD when running the examples, can opt in to installing this package without you having a hard dependency on it, hence minimizing the potential for the dreaded archival status on CRAN if any of these dependencies end up going away. There is another nugget here, though: when you install a package from, say, GitHub, like they show in a nice snippet here using remotes::install_github() with the name of the package repository, by default it will not install dependencies that are marked as Suggests.
That's where you have to supply the dependencies argument set to TRUE in order for your local system, as an end user, to grab, in this case, that svglite dependency. That's a nuance that has tripped me up so much in my day to day work, when I thought I'd installed a package from GitHub, I'm ready to throw it in my Shiny app or throw it into some other pipeline, and then I realize, oh, what's that error? Oh, nope, didn't get that Suggests package when I installed it. So that's at the end of the post, but it's bitten me many times. And that's contrary to the base R install.packages() function, right, which will install by default the packages that are listed in the Suggests portion of the DESCRIPTION file? That is correct.
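Here's the gotcha Eric describes, in code. By default remotes::install_github() skips packages listed under Suggests; passing dependencies = TRUE pulls them in too. The repository name below is just a placeholder.

```r
# By default, remotes::install_github() does not install Suggests;
# dependencies = TRUE pulls them in as well. The repo name is a placeholder.
remotes::install_github("some-org/somepackage")                        # no Suggests
remotes::install_github("some-org/somepackage", dependencies = TRUE)   # with Suggests
```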
There's a dichotomy there that, unless you really stumble into it, you don't really know exists. So that was some shared learning for me. Well, I'm curious, Mike, to hear your tale, if you will, of this issue. For me, I'm of two minds on this. If I'm developing a package that's just for internal use at my day job, admittedly, I'm probably just gonna put it in Imports anyway, because at that point, I know I'm not going to CRAN. It's more about whether this pipeline I'm making would benefit from being thrown onto, say, a Shiny app hosted on Posit Connect or other areas, and what's the easiest way for me to control what's happening there. And that's typically, if I have a DESCRIPTION file, just throw it all in Imports.
But if I'm in the situation of either, you know, CI/CD or CRAN itself, yeah, I definitely would take this advice to heart, because it will make your life as a maintainer easier not to worry about, in this case, svglite's tangled web of dependencies when all you needed was one additional file type to export. It is not gonna impact the baseline functionality of said package; it's just an enhancement on top of it. I admit sometimes it's hard to find that good threshold of when you go Imports only or when you go Suggests. I think that comes through experience. But for me, it also depends on what context you're gonna be releasing this package in.
[00:24:04] Mike Thomas:
Yeah, I would agree. You know, I think I'll default here and punt a little bit and say that the R Packages book, which is authored by Hadley Wickham and maybe Jenny Bryan and a few others, don't quote me on that, has some really good discussion about when to list a package as a hard dependency versus a soft dependency, and also how to handle packages that are just being used in a vignette. This is something that we run into a lot, you know. And ggplot2 is one for me that this happens with quite a bit, because maybe I don't have any functions within my package itself that leverage ggplot2; we're just returning data. But in my vignette, I wanna show how this package can be useful to users. So I wanna build a beautiful chart, you know, and have that be on our pkgdown site, and folks come and see that and look at that and be like, wow, this is what I can do with this package. But not necessarily, you know, autoplot anything for them, because I have plenty of opinions on why I don't necessarily like functions that do that. We just try to return the data and let you, you know, use ggplot2, use echarts4r, use gt, do whatever you want to make it beautiful.
But, I guess, one other thing that I will add as an anecdote here, which Swann mentioned and which I think is really good advice: when you are managing the dependencies within your particular package, I highly recommend using the usethis::use_package() function, which will handle not only, you know, where things should be listed within that DESCRIPTION file, making sure you don't have anything duplicated in there, but also the relationship between that DESCRIPTION file and your NAMESPACE file, and any updates that need to be made to that NAMESPACE file, which should not be done by hand. So I would recommend leveraging that package. Just another example where usethis can be awesome.
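For reference, the calls Mike is describing look like this; the type argument picks which DESCRIPTION field the dependency lands in.

```r
# Let {usethis} edit DESCRIPTION for you; the type argument picks the field.
usethis::use_package("ggplot2")                     # adds ggplot2 to Imports
usethis::use_package("svglite", type = "Suggests")  # adds svglite to Suggests
```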
I'll try to keep this brief, but we developed an open source package recently that actually downloads some publicly available data that's stored as a zip file on a website. And when it downloads, you know, onto my machine or onto any of our company's local machines, any of my teammates' machines, the data comes in as a strange set of text files inside the zip file. Say there are four datasets, for example; there would be eight text files total. The first four would be the column headers, and then the next four would be the actual data itself without column headers. So you just have to stitch them together: you know, file 1 is the headers for the data in file 5, file 2 goes with file 6, file 3 with 7, and 4 with 8. You can match them up; they're ordered that way when you download them online.
So we never had any issues locally with essentially, you know, returning a data frame that matches the headers to the data itself, until we tried to do this programmatically with our package in our unit testing, deployed to GitHub Actions CI/CD. When we ran these tests locally, everything worked totally fine, because we just said, you know, you match file 1 with file 5 and 2 with 6, and so on and so forth. When our package ran, you know, on a Linux box, probably, on GitHub Actions, the files that got unzipped from the zip folder got unzipped in a strange order, not the order that they were stored in. They just got totally reordered. And it was very difficult to figure out, you know, what the issue was that was going on here. So we just had to write in a little extra logic that uses the naming conventions of the filenames to match them together, which wasn't a big deal at all. But we couldn't rely on the order that those files came in, because for whatever reason, when this ran on GitHub Actions and unzipped these files, they unzipped in a totally different order than what took place locally for us. So, you know, just an example of how you can pass all of your checks locally, but not necessarily when you are running it in a separate environment, which is a good thing. I was glad that we ran into that issue, so that if for some reason somebody leveraging our package experienced the same issue where the files unzipped in a different order, you know, our functions would still work.
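A hedged sketch of that kind of fix: pair header files with data files by name rather than trusting the order unzip() returns them in. The zip file name and the "header"/"data" naming convention here are hypothetical stand-ins for whatever the real files are called.

```r
# Hedged sketch: pair header files with data files by name, never by the order
# unzip() happens to return them in. The zip name and the "header"/"data"
# naming convention are hypothetical stand-ins for the real files.
files   <- unzip("download.zip", exdir = tempdir())
headers <- sort(grep("header", files, value = TRUE))  # e.g. header_01.txt, ...
bodies  <- sort(grep("data",   files, value = TRUE))  # e.g. data_01.txt, ...

datasets <- Map(function(h, d) {
  cols <- scan(h, what = character(), sep = ",", quiet = TRUE)  # assumed format
  df   <- read.csv(d, header = FALSE)
  names(df) <- cols
  df
}, headers, bodies)
```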
[00:28:41] Eric Nantz:
Yeah. Boy, is that just par for the course: expect the unexpected, so to speak, when you put things into CI/CD. I had a similar thing, albeit it was more me shooting myself a bit in the foot on this one. I was doing a pipeline in GitHub Actions. It's not an R package; it's a set of functions that grabs a SQLite database from online, does some transposing, does some massaging, does some fuzzy duplicate finding, and then sends it back out as S3 objects. Well, in my development, I was starting to do, like, date cleaning with the lubridate package. I had installed it locally, but I forgot to add it to my manifest for the GitHub Action to install. And I actually was leveraging friend of the show Peter Solymos's deps package to make a JSON file of the dependencies. I just forgot to rerun that darn thing and then commit it. So I'm like, wait, that worked fine on my machine. I was like, well, dependencies again. So it happens even in a non package context. So you just gotta keep that stuff up to date, whether it's an renv.lock file, a deps JSON file, or whatever else. Just, yeah, keep that stuff up to date, man.
[00:29:51] Mike Thomas:
It's hard. It's hard. There's a lot to it. But in my opinion, it's super important, because we want the user experience of others, you know, using the software that we create to be as high quality as possible.
[00:30:04] Eric Nantz:
That's it, it's all the name of the game, isn't it? Yep. But a really entertaining read here, and very informative too. So credit to the author for sharing her knowledge from the development trenches, so to speak. And, Mike, yeah, those are two pretty heavy content highlights there. We're going to have some fun with this next one, because I like me a little game now and then. So our next highlight is going to do a little fun classification magic to hopefully help you win this game even faster than you might expect. And this is coming to us from Michael Hoe, a professor in statistics and data science at the University of Gresfel in Germany.
Again, pronunciation is not my strong suit here. But he starts off this blog post with a game that he likes to play called Guess My Name, where each player has a card with 16 kind of avatars on it. They each have different, like, you know, hair color, maybe a slightly different shirt, different genders, all that. And the object of the game is that each player picks who they wanna represent in this game, and then the opponent has to ask questions to help narrow down who that player actually is on the board. They can only ask questions about the picture of the person, and the response must be either yes or no. So you can start crossing off who it is not, and then eventually figure out who that person actually is. So, of course, naturally, the winner of the game is gonna be the one that finds the answer in the fewest questions.
And so what Michael does in the first part of his post is actually compile a spreadsheet of the roughly 12 questions that one would logically ask in this game, along with how each of the characters would answer them. You can download that spreadsheet right off the blog post if you want to look at it for reference. He's got a snippet of it in the post itself, with questions such as, do they have headgear, do they have glasses, blonde hair, etcetera, etcetera. Now, that's a great starting point, right? But how do you determine what you should ask first in order to maximize your chance of winning?
It's decisions, decisions, right? Well, literally here, because we're gonna look at decision trees as a way to take a data driven approach to finding the solution. This is not something I would have expected, but it's a pretty clever use of the classical classification tree method, where he feeds in the data, in this case, like I said, that spreadsheet of questions along with each character's yes or no answer to those questions, and throws it into an rpart call after some massaging of the data. And then the rest of the blog post gives you a decision tree that you can use as a strategy going forward for how you might identify these players.
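As a hedged sketch of that approach, not Michael's exact code, you can treat each character as a row, each yes/no question as a column, and let {rpart} grow a fully saturated classification tree that doubles as your question-asking playbook; the `characters` data frame and its column names below are assumed.

```r
# Hedged sketch, not Michael's exact code: one row per character, one yes/no
# column per question, and a fully grown classification tree from {rpart}.
# The `characters` data frame and its column names are assumed.
library(rpart)
library(rpart.plot)

# characters: data frame with a factor column `name` plus logical/0-1 columns
# such as blonde_hair, glasses, headgear, ...
fit <- rpart(
  name ~ .,
  data    = characters,
  method  = "class",
  control = rpart.control(minsplit = 1, minbucket = 1, cp = -1, xval = 0)
)

rpart.plot(fit)  # the tree is your playbook: ask the root question first
```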
So, before seeing this post, Mike, would you have guessed that the first question someone should ask is whether the player has blonde hair?
[00:33:31] Mike Thomas:
I wouldn't have guessed that, but it has been a little bit of time since I last played a game like this. I think there's a game that's very, very similar, maybe more US based, and I think it's called Guess Who. Played it a lot when I was growing up, and, you know, you flip the people up and down after you ask questions, and it's kinda like a 20 questions game to see who can figure it out first. So this sounds like exactly the same thing. And if I remember correctly, I think that was one of my top questions: does the player have blonde hair? You know? Trying to narrow down who the person is that you're trying to get at. But I guess it all depends on your strategy and the data, or the people, that you have on your card.
[00:34:17] Eric Nantz:
That's right. And when you look at this decision tree further, once you get through that initial question, then we start branching off into questions like, does the picture have something green in it, or something red? So you're looking at color next across the entire picture. Then the type of gear they're wearing: is there some headgear? Is there a short sleeve or a long sleeve shirt visible? Can we see the eyebrows or the glasses? Like, it is a very interesting dichotomy here of where you should go next. But, yeah, it just shows you another clever use of classification methods, which, again, are typically the backbone of almost all types of machine learning, especially in the dichotomous approach where you're trying to optimize predicting a response.
These are the fundamental building blocks. Obviously, if you're in a more rigorous analysis, you might look at other classification methods such as a random forest, GBM, and the like. But if you ever want a gentle introduction to what classification trees are actually doing, with, perhaps, ways you can use this in your activity time later on the weekends, you could do a lot worse than seeing what Michael's post has here. No, this is a super cool, super fun application. I guess it's not something that I've seen too often in the highlights before. We've seen some exercise data,
[00:35:42] Mike Thomas:
you know, before, but in terms of actually, like, playing a game, I think this is super cool, and a great use case, and I think a great learning case for leveraging decision trees and the rpart package itself. You know, sometimes I think algorithms can be explained best on, like, small data sets where you can see exactly the decisions that are being made by the algorithm, or how the math really plays out. Just as another final story for the week, I do have somebody in my family, I'm not gonna name names, but when we play Clue, they essentially need a laptop with Excel open next to them to be able to track everybody's responses in a spreadsheet, which is a way-over-the-top analytics approach.
And, to be honest, they actually are the person that usually wins, which makes me feel like I need to do something during that game. But I'm off the clock, you know, so to speak, when I'm playing board games, so I try not to pull out all the stops. But if this person does keep winning, I may have to spin up R and something like this and come back to Michael's post, because I think a decision tree type of approach is not only applicable to this particular game that we're looking at here; Clue is another one that comes up for me that would be a fantastic use case and application for leveraging something like a decision tree, because it's a question based game as well, where you're trying to narrow things down into the smallest number of pieces of information that give you the best idea of what the true answer is at the end of the day. So,
[00:37:24] Eric Nantz:
that's the end of story time for me for this week. It does make me think, albeit you'd probably have to use even more rigorous methods for this, but as a kid, I loved playing that game Battleship and trying to figure out which space I should target first, and then, based on that, knowing that it didn't hit, how far away I should go for my next target. I could sense there'd be a lot of fun data driven approaches to that. But for those of you listening, if you're interested in more of the technical math behind this idea, well, the blog post has a terrific appendix where you can really get schooled, so to speak, on how the classification methodology works. Michael does a terrific job with the mathematical optimization formulas that are under the hood. There are also some nice visuals along the way. So if you're in a situation where you're trying to learn about classification in general, like I said, not only do we get the fun intro of this post, but the appendix is really giving you a lot of great details for how this all works under the hood.
[00:38:29] Mike Thomas:
Absolutely. And Battleship, oh, that's another good one. Where to guess next? That kinda brings me to the Monty Hall problem a little bit, like, which door should you pick based on the first door that you selected? I once, I don't know, spent too much time with an actual deck of cards trying to prove out whether the Monty Hall solution was actually the right way to go, and that was probably my first introduction to Bayes without really knowing it.
[00:38:55] Eric Nantz:
Isn't it interesting? It's almost everywhere in your life, but you just may not realize it until it's right in front of you. Yeah. Well, you make me wanna play games the rest of the day, but, unfortunately, I won't be able to do that. What we can tell you, though, is that the rest of the R Weekly issue has a terrific section of additional resources: tutorials, blog posts, new packages, updated packages, tons more to choose from. So we'll take a couple minutes here to talk about our additional highlights. And sticking with the earlier part of the show, looking at great uses of ggplot2 for EDA and whatnot, there is a terrific package called ggmagnify.
This has been authored by David Hugh-Jones, who I believe we have featured on the highlights before. In essence, you can take a ggplot object and then, within that same plot, draw a boundary around, in essence, a panel that you can then use to zoom in on those particular data points, and put that in a section on the same plot. This is pretty interesting to me because, obviously, I'm a big fan of interactive graphics and interactive HTML, where if you had this in, say, plotly or whatnot, you could just do the zooming of a particular plot in real time, get those coordinates, and then zoom back out. But if you're in the realm where maybe you're confined to static representations, then ggmagnify might be a really interesting way for you to call out a particular section of that scatterplot or that distribution plot, and then be able to really emphasize just what's happening in that subregion of the plot, while maintaining, you know, a good clever use of space and whatnot. So, ggmagnify: I had never heard of this one before, so I'm gonna put that in my visualization bookmarks to follow up with later on.
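A small sketch of the idea, using the geom_magnify() layer that ggmagnify provides; the from/to coordinates below are just illustrative ranges over the built-in faithful data, not anything from the package docs.

```r
# Sketch of the {ggmagnify} idea: a full scatterplot plus an inset that
# magnifies one region of it. The from/to ranges are purely illustrative.
library(ggplot2)
library(ggmagnify)

ggplot(faithful, aes(eruptions, waiting)) +
  geom_point() +
  geom_magnify(
    from = c(xmin = 1.5, xmax = 2.5, ymin = 45, ymax = 60),  # region to zoom in on
    to   = c(xmin = 3.6, xmax = 5.0, ymin = 44, ymax = 62)   # where the inset goes
  )
```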
[00:40:48] Mike Thomas:
No, that's a great one, Eric. I'm gonna shout out Romain François for his blog post on the req_perform_stream() function from the httr2 package, and how he found a bug and submitted a pull request as well. So how did he run into it? For folks who are connected to him on social media, you may have seen that he has authored a package recently that will write a poem for you, I believe leveraging ChatGPT, about anything that you want. I think it started out around Valentine's Day, which is where I saw it first. And he actually ran into an issue asking the chattr package, which is from, I believe, the mlverse, which leverages ChatGPT and which I know a lot of tangential packages have been built on top of at this point. He was asking it to write a poem about golem, and he asked it to use many, many emojis in that poem.
And, unfortunately, he received an error. And one of the reasons for the error, or I think the big reason in particular, is that this package, chattr, leverages the httr2 package and, obviously, streams back the response from the ChatGPT API. And it streams it back in chunks of bytes. And, unfortunately, it was cutting off an emoji in the middle of its bytes, because an emoji is encoded as a few bytes, and it couldn't essentially stitch the emoji back together from two separate chunks of bytes, if you will. So this was, I guess, an unexpected bug, and Romain went as far as submitting a pull request to the httr2 package with a way to go about fixing this.
Potentially, I think, using the readLines() function instead of the readBin() function. And you can actually go into the pull request, it's linked in the blog post, and you can see, you know, the fantastic conversation, the way that he frames the problem to the r-lib team that manages the httr2 package. You can see his back and forth conversation with Hadley Wickham on how they eventually resolved this bug, closed the pull request, and merged the fix back into main. So now, if you are installing the httr2 package from GitHub, and hopefully soon from CRAN, you won't run into the same issue that Romain ran into when trying to ask ChatGPT for a response that includes emojis.
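To see why a chunk boundary can break an emoji, here is a toy illustration (this is not the httr2 internals, just the underlying UTF-8 issue): a single emoji is several bytes, and half of it is not valid UTF-8 on its own.

```r
# Toy illustration of the underlying UTF-8 issue (not the httr2 internals):
# a single emoji is several bytes, and half of it is not valid UTF-8.
emoji_bytes <- charToRaw("\U0001F6B2")   # the bicycle emoji: f0 9f 9a b2
length(emoji_bytes)                      # 4 bytes

chunk1 <- emoji_bytes[1:2]               # pretend the stream cut the emoji here
chunk2 <- emoji_bytes[3:4]

validUTF8(rawToChar(chunk1))             # FALSE: half an emoji is not valid UTF-8
validUTF8(rawToChar(chunk2))             # FALSE
validUTF8(rawToChar(c(chunk1, chunk2)))  # TRUE once the chunks are stitched back
```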
[00:43:39] Eric Nantz:
My goodness. Yeah, you see my rabbit holes, right? I mean, credit to Romain and Hadley for finding a fix for this, because this is an area where I take for granted that all these symbols are just gonna work, no matter if we're passing data in or taking data out. Oh, there's a lot behind those emojis, folks. It ain't just the fun graphics. There are a lot of raw bytes and bits there, and if you're not handling them correctly, especially when we're involving, like you said, the ChatGPT APIs and this chattr package, knowing what's being handed off can be quite important. So, credit to Romain and Hadley for putting this together. And, yes, at the end of the post is a very lovely poem about one of our favorite packages, called golem, that will never cease to amaze us.
[00:44:26] Mike Thomas:
Absolutely. And I think, you know, if you do follow that pull request, it's also just a great example of how to contribute to a package and make it as easy as possible on the maintainers to get that bug fixed.
[00:44:42] Eric Nantz:
Excellent. Excellent. And you know what else makes it easier for you to learn about what's happening in the data science and R communities? Well, that's where you just bookmark rweekly.org. You can check that out every Monday morning; we have a new issue released, like this one we talked about here. But, of course, the entire back catalog is also available. You can see it right on the home page. And, of course, this is a project by the community, for the community. The lifeblood is you and the community. So we love your pull requests, we love your suggestions. You can get in touch by sharing that resource via a pull request. We just talked about pull requests, right? R Weekly is very embedded in that workflow. The link is directly on each issue's front page, where you get a link to the upcoming issue draft for next week, and you'll be able to quickly send your pull request there. It's all Markdown all the time. Very easy to get up and running quickly.
And, also, yeah, we're always happy if you wanna join the team in a curation role. We definitely have spots, and we have links so you can get information on that at rweekly.org. And, of course, we love hearing from you in the audience as well. You have a few ways to get in touch with your humble hosts here. One is the contact page; we have the link directly in the episode show notes. We also have a fun little extra: if you have a modern podcast app like Podverse, Fountain, Castamatic, and whatnot, you can send those fun little boosts along the way, which go directly from you to us in your favorite podcast app. And, of course, we have some presence on the social medias.
I am mostly on Mastodon these days with @rpodcast@podcastindex.social, sporadically on the weapon X thing with @theRcast, and then also on LinkedIn, sharing some posts and other fun announcements from time to time. And, Mike, where can the listeners get ahold of you? Sure. You can find me on Mastodon as well at @[email protected],
[00:46:39] Mike Thomas:
or the other place that I am present on social media a lot is on LinkedIn. If you search Ketchbrook Analytics, k e t c h b r o o k, you can probably find out what I'm up to.
[00:46:50] Eric Nantz:
Awesome stuff. Like we heard about before, congrats on that recent package you open sourced. I'm sure there are lots of fun stories behind that as well that you'll see in Mike's LinkedIn posts from time to time. So Yes. Thank you. Absolutely. So we're gonna close up shop here for episode 153, and we hope to see you back for episode 154 of the R Weekly Highlights podcast next week.
Hello, friends. We are back with episode 153 of the R Weekly Highlights podcast. This is the weekly show where we talk about the latest happenings and the tremendous resources that you can find every single week at the rweekly.org website. My name is Eric Nantz, and I'm delighted that you joined us from wherever you are around the world. We're about in the past the halfway point in February, so spring is coming soon, I hope. But I can't do the show alone. Of course, I am joined by my awesome cohost, Mike Thomas. Mike, how are you doing this morning? I'm doing well, Eric. Yeah. I think probably the audience can can hear in our voices that spring isn't quite here yet, but, we're getting there. Yeah, combo. The cold weather, kids bringing you know what home from their respective day cares or schools. It just it it never ends. It never ends. But, nonetheless, we're gonna power on through here. We got a lot of exciting content to share with you all today, and this content for this particular issue was curated by Tony Elhaubar, who had tremendous help, as always, from our fellow rweekly team members and members like you, contributors like all of you around the world with your awesome poll requests and suggestions for more excellent resources.
As I said, Mike, we do sense spring is coming, and that's where I know my kids like to start getting their bikes out to take their bikes rides around the neighborhood and whatnot. Well, it's appropriate that as the weather's warming up, our first highlight today is taking a very data driven approach to just how far you can take your bikes and analyze that for some real fun, exploratory data analysis and and and the like. And this post comes to us from Greg Dubrow, who is a data analyst who is now based in Denmark, and he has a, you know, very passionate, hobby, so to speak, of riding his bike basically everywhere he can go. He's been doing this wherever he's been in the world. His first part of the blog post gives a nice background on the various bikes he's had growing up, even a little mishap he had, last year during that, which any bike enthusiast can probably relate to.
But, nonetheless, he talks about, you know, taking advantage of recording his bike riding data using an app called Strava. I've not heard of this before, but, apparently, it gives you a boatload of metrics having to do with your bike riding. And the first question comes, okay. Well, you got this data on the app. Right? How do you get it out of that? Well, you could download a bundle from your profile on the Strava site as one way to get a CSV text dump of that, which he does end up doing. But like anything else, there is an API for that. Right? And not only that, there is an R package to help you grab this data from R itself called RStrava, which he utilizes as well as, like I said, the aforementioned CSV data dump, if you will, and merges that together to give them a nice tidy dataset after some, you know, very usual cleaning and reshaping and and manipulation of dates and whatnot so that it's actually ready for analysis. So, again, an awesome data driven approach, to take advantage of modern tech to put this data into R itself.
You've got the data. Now what? Like a lot of the posts we cover in our weekly, we start with some fun exploratory data analysis. And to get things going quickly, he makes use of a very cool package that I've actually seen utilized with some of my colleagues at the day job as well as others in the community called data explorer. This is a really nice package that gives you a very quick way to explore, say, the missingness in your data as well as doing some very nice correlation heat maps right off the bat with your numeric variables. And he senses that, yeah, most of these variables have a positive correlation to the key metrics, such as distance and the actual moving time of the bike, where this app is apparently smart enough to detect when the bike is actually moving versus stationary.
So really novel use of tech here, but the post has both the variables and their percent of missing values as well as this aforementioned, heat map with the correlations to help begin informing just what kind of relationships he will explore later on in the post. And you start to sense some of the things you might think intuitively, such as the average speed of his bike being positively correlated with distance, albeit not as a huge relationship right off the bat, and also some correlations with the power output of the ride and measured an average wattage usage may have some negative correlations with other metrics and the like. And then augmenting these visuals from the correlation perspective is the tried and true scatterplot, which uses a very novel functional approach that he got inspiration from one of Cedric Shurer's posts that he does all things ggplot2 and with a little bit of permac magic with patchwork, able to get these nice correlation plots for the key response variables of distance, moving time, and average speed, and you, again, start to see positive correlations amongst many of the key metrics, such as calories, average watts, moving time, and etcetera, to give him a better idea of what he might expect out of a more rigorous analysis.
And just where does this regular analysis take place? Well, we all like some tables. Right? So he starts off with creating some fun GT tables of just the metrics in terms of the total time, elevation, total calories output throughout the year. And you can see out of 446 rides, yeah, he's burned a lot of calories and generated a lot of energy. Lots of cool, cool summaries there. And then we got some more visuals, Mike, where he's looks at the seasonal pattern of his ride shares per month. So why don't you take us through some of the visuals you're seeing here? Yeah. It's a really nice, visual blog post. I I think Greg notes that he leveraged
[00:06:31] Mike Thomas:
some of the tutorials that Cedric Shearer has put together. And if you are in the DataViz space, especially in the our DataViz space, that is a name that you are certainly familiar with. One of the visuals that I thought was was pretty cool, that that he used was, most rides during, you know, it's showing, the number of rides that he has and and the type, sort of in this polar area coordinate diagram. And the reason that he did that is to correlate them to the hour of the day that it took place. So it's a this this chart is sort of representing a clock which is a really cool, I think, use case of these polar area coordinate, diagram type charts. I struggle to find a lot of use cases for those those charts in a lot of my EDA analysis and DataViz work, but I think this is a perfect use case for it.
So I really appreciated that. You know, going back to talking about some of the dependent variables that he used. One of the the dependent variables, that Greg used was called kilojoules, I believe. Hopefully, I'm pronouncing that somewhat correctly. And, he used that variable instead of calories because, according to Garmin, frequently asked questions, calories expended are the total energy, in the time that it took to do the workout that you expended, while kilojoules is the energy burned, actually burned by the workout and that formula is watts times seconds times a1000. So that was that was very interesting to me, and, you know, I know a lot of workout apps there sort of focus on calories, and and maybe they should be focusing on kilojoules instead. So I I thought that that was pretty interesting, you know, some of these GT tables, and the way that he was able to format them in the blog post to have a lot of these GT tables actually side by side, instead of one on top of the other using some HTML was a pretty cool, nifty trick to make this, this blog post sort of nice and neatly put together as well.
So a lot of really interesting, work here, and then he actually fit some models at the end of this here. There's a time model, a kilojoules model, and then a Watts model as well. And the the model fits all look look pretty good. I'm pretty impressed and it's sort of a function, I think, of just the amount of data that, this Strava app allows you to to have, control over and to take a look at. And you can do some pretty cool, as as Greg shows us here, some pretty cool analysis with your your workout and exercise data, particularly with your your cycling data in the Strava app. So a really cool use case, I think walking through a lot of different types of data visualization, some predictive modeling, sort of an an end to end data science project here using a pretty nifty data set. So, hats off to to Greg for a great start to our weekly highlights this week.
[00:09:25] Eric Nantz:
Yeah. Really awesome approaches here. And looking at even the source of this quartile document that this blog post is based on, you're gonna see, like you Mikey said, those nifty tricks of putting the the tables and the some of the plot side by side. Really, really nice, easily viewable post here. We got the table of contents on the right margin. Yeah. Lots of ways to hop back and forth amongst us. So, again, Quartle gives you a lot of these niceties out of the box. And really, hats off to owning your data as best you can, albeit, yeah, a third party app is collecting it, but fair play to Strava for giving an API for users to expose this because there are some others out there in the fitness tracking space that aren't quite as friendly about you getting your exercise and workout metrics out of it. So really encouraging to see For all you cyclists listening out there and I'm a part time cyclist when I can.
Yeah, that's really cool to see you be able to take advantage of this amount of data. And, you know, he's got to be, Greg's got to be a pretty fit person to be able to have this amount of rides in in 2023 even with the injury that he underwent earlier in the year. Like, that is impressive stuff, impressive dedication, and, yeah, may maybe I need to not be so lazy this year. We'll see.
[00:10:48] Mike Thomas:
And hats off to Greg as well and Quarto for the nice, collapsed code chunks that allow you to see exactly how he did what he did in this blog post.
[00:10:58] Eric Nantz:
Yeah. Isn't that a great, you know, UX, so to speak, of digesting the parts you like? And then maybe you're more interested in, say, the modeling part or more interested in the visualization part or, of course, interested in everything, but you can opt in to looking at all those details and still get the full cohesive story here. So, yeah, really enjoyed the post. Looks like he spent a lot of time drafting this together. But, again, it's all all available for us to see in the open. And and, yeah, I'm gonna have to maybe get a new bike this year so I can start tracking some metrics.
[00:11:28] Mike Thomas:
Me too.
[00:11:38] Eric Nantz:
Now maybe, Mike, you've been on a long bike ride where it seems like every time you think you've reached your destination, there's some little hurdle along the way. Right? Maybe it's a traffic light, maybe it's who knows what else out there. Sometimes package development can feel like that: you're so close, you get that glimmer of hope, and then something really crazy happens. That's where our next highlight comes in, coming to us from our fine friends at ThinkR. It was authored by Swann Floc'hlay, and I probably didn't pronounce that at all correctly, so apologies in advance.
But they have an excellent blog post here about how you can tame the namespace of your R package by making use of the Suggests field. Like any good story, this starts with what seems to be a smooth road. The example package in this post is a simple wrapper on top of ggplot2 to help export a plot, with a simple function they call save_plot. It looks innocent enough, right? We are simply letting the user specify the ggplot object, the extension, which can be one or more of, say, PNG, JPEG, or PDF, where to put it, and what the file name is.
It's pretty straightforward stuff, just wrapping a call to ggsave() with a little extra logic on top of that. The usage looks very straightforward and the example looks very logical: you plot your dataset, use save_plot(), and clean up after yourself. A well-constructed example. So, like anything in package development, you're gonna start checking this on your local system, using devtools::check() most of the time, and it comes through with flying colors. No issues at all: no errors, no warnings, no notes. You are feeling good about it. And, as good practice, this package code is under version control on GitHub.
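To make the setup concrete, here is a minimal sketch of what a save_plot()-style wrapper around ggsave() might look like; the actual function in the ThinkR post may differ, and the argument names here are my own:

```r
library(ggplot2)

# Hypothetical sketch, not the post's exact code: save one ggplot object to
# one or more file formats by looping over the requested extensions.
save_plot <- function(plot, extensions = c("png", "pdf"),
                      path = ".", filename = "plot") {
  for (ext in extensions) {
    ggsave(
      filename = file.path(path, paste0(filename, ".", ext)),
      plot     = plot
    )
  }
  invisible(plot)
}

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
save_plot(p, extensions = c("png", "svg"), path = tempdir())  # svg needs svglite
```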
So why just rely on your local system to do the checking? We like to use GitHub Actions these days to do a lot of this automated checking as well. And that's where things start to go a little off the rails, because we are now embodying that infamous slogan in software development: it works fine on my machine. On the CI check, we see a cryptic error. And where does the problem lie? Well, you go down the rabbit hole of checking the logs, and in the details there's an error saying there was an error running the example code, which, again, looks straightforward.
It works fine locally, right? Go down that backtrace a bit further, and then you see this error about loadNamespace: there is no package called 'svglite'. Oh, boy. What on earth happened here? Now we gotta put our detective hats on. Where in the world is svglite being used, Mike? What gives?
[00:14:52] Mike Thomas:
Well, it's going to be in, and you'd only see this in the source code potentially, or in the package DESCRIPTION file, but it's going to be a dependency of ggplot2 specific to the ggsave() function. When you are saving a plot as an SVG using the ggsave() function from ggplot2, it's going to employ the svglite package. And one of the reasons you don't catch this until the CI/CD check is that svglite is not a hard dependency of ggplot2: svglite sits in the Suggests portion of ggplot2's dependencies. So it is not going to get installed automatically when you specify that ggplot2 is a dependency of your package.
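If you want to verify where svglite sits relative to ggplot2 on your own machine, one quick check (assuming ggplot2 is installed locally) is to read its DESCRIPTION metadata:

```r
# svglite shows up under Suggests, not Imports, for ggplot2
utils::packageDescription("ggplot2", fields = "Imports")
utils::packageDescription("ggplot2", fields = "Suggests")
```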
So this is probably very familiar territory for those of us who have done a lot of R package development and run into similar situations before, either wrestling with whether a dependency should be a hard import or should go in the Suggests section of the DESCRIPTION file, or figuring out how to manage dependencies that are only specific to, say, vignettes, or things like that. This is very familiar territory to me. I have maybe a story I might tell if we have a little bit of time, but I think it goes to the overall narrative that sometimes things may work well on your machine, but when you start to employ CI/CD that leverages a different machine to build and test that package, you may see a failure there. And honestly, that's a good thing, because it means that when someone else on a completely different machine than yours wants to use your package, and that's the whole idea, to build software that is useful for other folks as well, they might run into the same issue. Even though you saw no errors, warnings, or notes in your own devtools check run, it's a good thing, in my opinion, that when you send it off to GitHub Actions in this case, this error presents itself and lets you know exactly what the issue is. I think the error message is pretty descriptive, and obviously you do have to do a little bit of detective work to figure out, hey, where the heck is svglite even impacting us in this particular case. But if you've been around the block a little bit with ggplot2 and you understand what you're trying to do in terms of saving that plot, hopefully it won't take you too long to figure that out, as was the case for Swann.
So I think this blog post also nicely calls out the use of the fusen package, which, for those who are unfamiliar, is a package that streamlines the development of R packages. I believe it uses an R Markdown, chunk-based approach to execute different commands through a nice documentation framework and build out all of the different components you need for an R package. So, Eric, do you wanna take over the final solution here, to ensure that
[00:18:23] Eric Nantz:
this package passed all of its CI/CD checks when it went to GitHub Actions? Oh, yeah, I'm chomping at the bit for this, because boy, do I feel seen on some of the ways this can play out. So now that we realize that svglite is definitely required for this save_plot function, there are a couple of options for how this could be tackled. One is that in the roxygen2 preamble for this function, we declare an importFrom for the svglite function from the svglite package. That is certainly a valid approach, right? Well, when you run a check again, even locally, you're gonna see something that may seem really scary, and frankly it can be, where you will be warned: hey, you know what, the imports of your package may seem small, may only seem like it needs ggplot2 and svglite.
But that has now ballooned to 21 non-default packages that will be required at install time to get your package installed. The CRAN maintainers put this check in R CMD check, and the note says that importing from so many packages makes the package vulnerable to any of them becoming unavailable. Yes, we have covered over the years of this podcast what happens when one dependency suddenly gets, quote unquote, archived on CRAN; in fact, that has even affected ggplot2, I believe, amongst others. So how appropriate here. But you still need this package, or do you?
This is where Suggests comes in: you take that importFrom out of the roxygen2 preamble, put svglite in the Suggests field, and then in your function you add a check via the requireNamespace() function to see whether the user has svglite installed on their local system, and prompt them to install it in order to get full support for all the file types being exported. In the solution here, the function will still export the file types that don't need svglite. So there's a happy medium in here, from the user experience to what they politely call avoiding the backlash of what can happen, both for you as a developer and for the end user who needs to know what to do next.
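A minimal sketch of that guard, assuming a hypothetical SVG-only helper rather than the post's exact function:

```r
# Hypothetical function: svglite is listed under Suggests, so we check for it
# at run time with requireNamespace() and give the user a clear next step.
save_plot_svg <- function(plot, filename) {
  if (!requireNamespace("svglite", quietly = TRUE)) {
    stop(
      "Package 'svglite' is required to export SVG files. ",
      "Install it with install.packages(\"svglite\").",
      call. = FALSE
    )
  }
  ggplot2::ggsave(filename, plot = plot)
}
```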
It may seem like a little more upfront work, but the good news is that the dependency footprint of your package going to CRAN is now much smaller, and the user, or your CI/CD when running the examples, can opt in to installing this package without a hard dependency on it, hence minimizing the potential for the dreaded archival status on CRAN if any of those dependencies end up going away. There is another nugget here, though: when you install a package from, say, GitHub, and they have a nice snippet here using remotes::install_github() with the name of the package repository, by default it will not install dependencies that are marked as Suggests.
That's where you have to supply dependencies = TRUE in order for your local system, as an end user, to grab, in this case, that svglite dependency. That's a nuance that has tripped me up so much in my day-to-day work: I thought I'd installed a package from GitHub, I'm ready to throw it into my Shiny app or into my pipeline, and then I realize, oh, what's that error? Nope, didn't get that Suggests package when I installed it. That's at the end of the post, but it's bitten me many times. And, similarly, base R's install.packages() won't install the packages listed in the Suggests portion of the DESCRIPTION file by default either, right? That is correct.
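In code, the nuance Eric is describing comes down to the dependencies argument of remotes::install_github() ("user/pkgrepo" below is a placeholder repository name, not a real package):

```r
# "user/pkgrepo" is a placeholder GitHub repository.
remotes::install_github("user/pkgrepo")                       # skips Suggests
remotes::install_github("user/pkgrepo", dependencies = TRUE)  # installs Suggests too
```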
There's a nuance there that, unless you really stumble into it, you don't know exists. So that was some shared learning for me. Well, I'm curious, Mike, to hear your tale, if you will, of this issue. For me, I'm of two minds on this. If I'm developing a package that's just for internal use at my day job, admittedly, I'm probably just gonna put everything in Imports anyway, because at that point I know I'm not going to CRAN. It's more about the pipeline I'm making: what's the benefit when I throw this onto, say, a Shiny app hosted on Posit Connect or elsewhere, and what's the easiest way for me to control what's happening there? Typically, if I have a DESCRIPTION file, I just throw it all in Imports.
But in the situation of either CI/CD or CRAN itself, yeah, I would definitely take this advice to heart, because it will make your life as a maintainer easier not to have to worry about, in this case, svglite's tangled web of dependencies when all you needed was one additional type of file to export. It's not gonna impact the baseline functionality of the package; it's just an enhancement on top of it. I admit it's sometimes hard to find that threshold of when you go to Imports versus when you go to Suggests. I think that comes with experience, but for me, it also depends on what context you're gonna be releasing the package in.
[00:24:04] Mike Thomas:
Yeah, I would agree. I think I'll default here and punt a little bit and say that the R Packages book, which is authored by Hadley Wickham and Jenny Bryan and maybe a few others, don't quote me on that, has some really good discussion about when to list a package as a hard dependency versus a soft dependency, and also how to handle packages that are only being used in a vignette. This is something that we run into a lot, and ggplot2 is one where it happens quite a bit for me, because maybe I don't have any functions within my package itself that leverage ggplot2, we're just returning data, but in my vignette I wanna show how this package can be useful. So I wanna build a beautiful chart and have it on our pkgdown site, so folks come and see it and think, wow, this is what I can do with this package, but not necessarily auto-plot anything for them, because I have plenty of opinions on why I don't like functions that do that. We just try to return the data and let you use ggplot2, use echarts4r, use gt, do whatever you want to make it beautiful.
But one other thing I'll add as an anecdote here, which Swann mentioned and which I think is really good advice, is that when you are managing the dependencies within your particular package, I highly recommend using the usethis::use_package() function. It handles where things should be listed within the DESCRIPTION file, making sure you don't have anything duplicated in there, and it pairs nicely with the tooling that keeps your NAMESPACE file up to date, which should not be edited by hand. So I would recommend leveraging that; it's just another example of where usethis can be awesome.
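For instance, declaring the two dependencies from this example with usethis looks something like this (run from within the package project):

```r
# usethis edits the DESCRIPTION file for you and flags duplicates
usethis::use_package("ggplot2")                     # listed under Imports
usethis::use_package("svglite", type = "Suggests")  # listed under Suggests
```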
I'll try to keep this brief, but we developed an open source package recently that downloads some publicly available data stored as a zip file on a website. And when it downloads onto my machine, or onto any of our company's local machines, my teammates' machines, the data comes in as a strange set of text files inside the zip file. Say there are four datasets, for example: there would be eight text files total. The first four would be the column headers, and the next four would be the actual data itself without column headers. So you just have to stitch them together: file 1 is the headers for the data in file 5, file 2 goes with file 6, file 3 with 7, and 4 with 8. You can match them up; they're ordered that way when you download them online.
So we never had any issues locally with essentially returning a data frame that matches the headers to the data itself, until we tried to use our package in our unit testing, run programmatically on GitHub Actions CI/CD. When we ran these tests locally, everything worked totally fine, because we just said, you match file 1 with file 5, and 2 with 6, and so on and so forth. But when our package ran on GitHub Actions, probably on a Linux box, the files that got unzipped from the zip folder came out in a strange order, not the order they were stored in; they got totally reordered. It was very difficult to figure out what the issue was. So we had to write in a little extra logic that uses the naming conventions of the filenames to match them together, which wasn't a big deal at all, but we couldn't rely on the order the files came in, because for whatever reason, when this ran on GitHub Actions and unzipped the files, they unzipped in a totally different order than what happened locally for us. So, just an example of how you can pass all of your checks locally but not necessarily when you run things in a separate environment, which is a good thing. I was glad we ran into that issue, so that if somebody leveraging our package experiences the same thing, where the files unzip in a different order, our functions will still work.
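A small sketch of the name-based matching Mike describes; the directory and the "*_headers.txt" / "*_data.txt" naming convention here are hypothetical, not the real dataset's:

```r
# Hypothetical file layout: pair header files with data files by a shared
# prefix instead of trusting the order unzip() happens to return them in.
files        <- list.files("unzipped_dir", pattern = "\\.txt$", full.names = TRUE)
header_files <- files[grepl("_headers\\.txt$", files)]
data_files   <- files[grepl("_data\\.txt$", files)]

prefix <- function(x) sub("_(headers|data)\\.txt$", "", basename(x))

pairs <- data.frame(
  headers = header_files[match(prefix(data_files), prefix(header_files))],
  data    = data_files
)
pairs  # each row: the header file that belongs with each data file
```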
[00:28:41] Eric Nantz:
Yeah, boy, is that just par for the course: expect the unexpected, so to speak, when you put things into CI/CD. I had a similar thing, albeit it was more me shooting myself in the foot. I was building a pipeline in GitHub Actions. It's not an R package, it's a set of functions that grabs a SQLite database from online, does some transposing, some massaging, some fuzzy duplicate finding, and then sends it back out as S3 objects. Well, in my development I started doing some date cleaning with the lubridate package. I had installed it locally, but I forgot to add it to my manifest for the GitHub Action to install. I was actually leveraging friend of the show Peter Solymos's deps package to make a JSON file of the dependencies; I just forgot to rerun that darn thing and commit it. So I'm like, wait, that worked fine on my machine. Well, dependencies again. It happens even in non-package contexts, so you've just gotta keep the habit of keeping that stuff up to date, whether it's an renv.lock file, a deps JSON, or whatever else. Just keep that stuff up to date, man.
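For the renv.lock route Eric mentions, the habit is basically two calls (a sketch of one option; the deps package workflow he used has its own equivalent step):

```r
# renv records the packages a project actually uses into renv.lock;
# rerun snapshot() after adding something like lubridate, then commit the file.
renv::init()      # one-time setup for the project
renv::snapshot()  # refresh renv.lock so CI installs the same dependencies
```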
[00:29:51] Mike Thomas:
It's hard. There's a lot to it. But in my opinion, it's super important, because we want the user experience of others using the software we create to be as high quality as possible.
[00:30:04] Eric Nantz:
That's the name of the game, isn't it? Yep. But a really entertaining read here, and very informative too, so credit to ThinkR for sharing that knowledge from the development trenches, so to speak. And, Mike, those were two pretty heavy content highlights, so we're going to have some fun with this next one, because I like me a little game now and then. Our next highlight is going to do a little fun classification magic to hopefully help you win a game even faster than you might expect. And this is coming to us from Michael Höhle, a professor in statistics and data science at the University of Greifswald in Germany.
Again, pronunciation is not my strong suit. He starts the blog post by introducing a game he likes to play called Guess My Name, where each player has a card with 16 avatars on it. They each have different hair color, maybe a different shirt, different genders, and so on. Each player picks who they wanna represent in the game, and then the opponent has to ask questions to narrow down who that person actually is on the board. They can only ask questions about the picture of the person, and the response must be either yes or no. So you can start crossing off who it is not and eventually figure out who that person actually is. Naturally, the winner of the game is the one who finds the answer in the fewest questions.
So, in the first part of his post, Michael compiles a spreadsheet of about 12 questions that one would logically ask in this game, along with the answers for each of the characters those questions apply to. You can download that spreadsheet right off the blog post if you want it for reference, and he's got a snippet of it in the post itself, with questions such as: do they have headgear, do they have glasses, do they have blonde hair, et cetera. Now that's a great starting point, right? But how do you determine what you should ask first in order to maximize your chance of winning?
It's decisions, decisions, right? Well, literally, because we're gonna look at decision trees as a way to take a data-driven approach to find the solution. This is not something I would have expected, but it's a pretty clever use of the classical classification tree method: he feeds in the data, in this case that spreadsheet of questions and the yes-or-no membership of each character for those questions, throws it into an rpart call after some massaging of the data, and the rest of the blog post gives you a decision tree you can use as a strategy going forward for how to identify these characters.
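A toy version of the approach, with a made-up question matrix rather than Michael's actual spreadsheet, might look like this with rpart:

```r
library(rpart)

# Made-up board: one row per character, yes/no answers to three questions.
guess_data <- data.frame(
  name        = factor(c("Anna", "Ben", "Carla", "David", "Emma", "Finn")),
  blonde_hair = factor(c("yes", "no",  "yes", "no",  "no",  "yes")),
  glasses     = factor(c("no",  "yes", "yes", "no",  "yes", "no")),
  headgear    = factor(c("no",  "no",  "no",  "yes", "yes", "yes"))
)

# Grow a classification tree predicting the name from the answers; with so few
# rows we loosen the control settings so splits are actually attempted.
fit <- rpart(
  name ~ blonde_hair + glasses + headgear,
  data    = guess_data,
  method  = "class",
  control = rpart.control(minsplit = 2, minbucket = 1, cp = 0)
)
fit  # the printed splits suggest which question to ask first
```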
So, before seeing this post, Mike, would you have guessed that the first question someone should ask is whether the character has blonde hair?
[00:33:31] Mike Thomas:
I wouldn't have guessed that, but it has been a little bit of time since I last played a game like this. I think there's a very similar game, maybe more US-based, called Guess Who. I played it a lot when I was growing up: you flip the people up and down after you ask questions, and it's kind of like a 20-questions game to see who can figure it out first. So this sounds like exactly the same thing. And if I remember correctly, I think that was one of my top questions: does the person have blonde hair? Trying to narrow down who the person is that you're trying to get at. But I guess it all depends on your strategy and the data, or the people, that you have on your card.
[00:34:17] Eric Nantz:
That's right. And when you look at this decision tree further, once you get through that initial question, we start branching off into questions like: does the picture have something green in it, or something red? So you're looking at color next, across the entire picture. Then the type of gear they're wearing: is there some headgear? Is a short-sleeve or long-sleeve shirt visible? Can we see the eyebrows, or the glasses? It's a very interesting progression of where you should go next. But, yeah, it just shows you another clever use of classification methods, which are typically the backbone of almost all types of machine learning, especially in a dichotomous setting where you're trying to optimize or predict a response.
These are the fundamental building blocks. Obviously, if you're doing a more rigorous analysis, you might look at other classification methods such as random forests, GBMs, and the like. But if you ever want a gentle introduction to what classification trees are actually doing, with perhaps some ways you can use this in your activity time later on the weekends, you could do a lot worse than seeing what Michael's post has here. No, this is a super cool, super fun application, and not something I've seen too often in the highlights before. We've seen some exercise data,
[00:35:42] Mike Thomas:
you know, before, but in terms of actually playing a game, I think this is super cool, and a great use case, and a great learning case for leveraging decision trees and the rpart package itself. Sometimes I think algorithms can be explained best on small datasets where you can see exactly the decisions being made by the algorithm, or how the math really plays out. Just as another final story for the week: I do have somebody in my family, I'm not gonna name names, but when we play Clue, they essentially need a laptop with Excel open next to them to track everybody's responses in a spreadsheet, which is a way-over-the-top analytics approach.
And, to be honest, they actually are the person who usually wins, which makes me feel like I need to do something during that game. But I'm off the clock, so to speak, when I'm playing board games, so I try not to pull out all the stops. But if this person does keep winning, I may have to spin up R and something like this and come back to Michael's post, because I think a decision-tree type of approach is not only applicable to this particular game we're looking at here; Clue is another one that comes to mind that would be a fantastic use case and application for something like a decision tree, because it's a question-based game as well, where you're trying to narrow things down to the smallest number of pieces of information that give you the best idea of what the true answer is at the end of the day. So,
[00:37:24] Eric Nantz:
that's the end of story time for me for this week. It does make me think, albeit you'd probably have to use even more rigorous methods for this, but as a kid I loved playing Battleship: trying to figure out which space to target first, and then, knowing that a shot missed, how far away to place the next one. I could sense there'd be a lot of fun data-driven approaches to that. But for those of you listening, if you're interested in more of the technical math behind this idea, the blog post has a terrific appendix where you can really get schooled, so to speak, on how classification methodology works. Michael does a terrific job with the mathematical optimization formulas under the hood, and there are some nice visuals along the way. So if you're trying to learn about classification in general, not only do we get the fun intro of this post, but the appendix also gives you a lot of great detail on how this all really works under the hood.
[00:38:29] Mike Thomas:
Absolutely. Oh, Battleship, that's another good one: where to guess next? That kind of brings me to the Monty Hall problem a little bit, like which door should you pick based on the first door you selected? I once spent, I don't know, too much time with an actual deck of cards trying to prove out whether the Monty Hall solution was actually the right way to go, and that was probably my first introduction to Bayes without really knowing it.
[00:38:55] Eric Nantz:
Isn't it interesting? It's almost everywhere in your life; you just may not realize it until you stumble into it. Well, you make me wanna play games the rest of the day, but unfortunately I won't be able to do that. What we can tell you, though, is that the rest of the R Weekly issue has a terrific section of additional resources: tutorials, blog posts, new packages, updated packages, tons more to choose from. So we'll take a couple of minutes here to talk about our additional highlights. And sticking with the earlier part of the show, looking at great uses of ggplot2 for EDA and whatnot, there is a terrific package called ggmagnify.
This has been authored by David Hugh-Jones, who I believe we have featured on the highlights before. In essence, you can take a ggplot object and then, within that same plot, draw a boundary around a region of data points and place a zoomed-in inset of that region on the same plot. This is pretty interesting to me because, obviously, I'm a big fan of interactive graphics and interactive HTML, where if you had this in, say, plotly or whatnot, you could just zoom into a particular part of the plot in real time, get those coordinates, and then zoom back out. But if you're in the realm where maybe you're confined to static representations, then ggmagnify might be a really interesting way to call out a particular section of that scatterplot or distribution plot and really emphasize what's happening in that subregion while maintaining a clever use of space. So, ggmagnify, I'd never heard of this one before, so I'm gonna put it in my visualization bookmarks to follow up with later on.
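A rough sketch of the idea; note the geom_magnify() arguments below are my reading of the package's interface (from/to as xmin, xmax, ymin, ymax), so double-check against the ggmagnify documentation:

```r
library(ggplot2)
library(ggmagnify)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point()

# Magnify a crowded region of the scatterplot (from = xmin, xmax, ymin, ymax)
# and draw the zoomed inset elsewhere on the same panel (to = same format).
p + geom_magnify(
  from = c(5.0, 5.6, 2.5, 3.2),
  to   = c(6.6, 7.9, 3.8, 4.4)
)
```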
[00:40:48] Mike Thomas:
No, that's a great one, Eric. I'm gonna shout out Romain François for his blog post on the req_perform_stream() function from the httr2 package, and how he found a bug and submitted a pull request as well. For folks who are connected to him on social media, you may have seen that he recently authored a package that will write a poem for you, I believe leveraging ChatGPT, about anything that you want. I think it started out around Valentine's Day, which is where I saw it first. And he actually ran into an issue asking the chattr package, which I believe is from the mlverse, leverages ChatGPT, and has had a lot of tangential packages built on top of it at this point. He was asking it to write a poem about golem, and he asked it to use many, many emojis in that poem.
Unfortunately, he received an error, and the big reason for it is that the chattr package leverages the httr2 package and streams back the response from the ChatGPT API in chunks of bytes. It was cutting off an emoji in the middle of its bytes, because an emoji is encoded as a few bytes, and it couldn't stitch the emoji back together from two separate chunks of bytes, if you will. So this was, I guess, an unexpected bug, and Romain went as far as submitting a pull request to the httr2 package with a way to fix it.
Potentially, I think, using the readLines() function instead of the readBin() function. And you can go into the pull request, it's linked in the blog post, and see the fantastic conversation, the way he frames the problem to the r-lib team that maintains httr2. You can see his back-and-forth conversation with Hadley Wickham on how they eventually resolved this bug, closed the pull request, and merged it back into main. So now, if you are installing the httr2 package from GitHub, and hopefully soon from CRAN, you won't run into the same issue that Romain ran into when asking ChatGPT for a response that includes emojis.
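To see why a byte-chunked stream can choke on emoji, here is a small base-R illustration of the failure mode (this is not the httr2 code, just the underlying UTF-8 issue):

```r
emoji  <- "\U0001F600"      # grinning face; four bytes in UTF-8
bytes  <- charToRaw(emoji)
chunk1 <- bytes[1:2]        # pretend the stream was cut mid-emoji
chunk2 <- bytes[3:4]

validUTF8(rawToChar(chunk1))             # FALSE: half an emoji is not valid text
validUTF8(rawToChar(c(chunk1, chunk2)))  # TRUE: stitch the raw bytes, then decode
```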
[00:43:39] Eric Nantz:
My goodness. Yeah, you saw my rabbit holes, right? Credit to Romain and Hadley for finding a fix for this, because this is an area where I take for granted that all these symbols are just gonna work, no matter whether we're passing data in or taking data out. Oh, there's a lot behind those emojis, folks; it ain't just the fun graphics. There are a lot of raw bytes and bits that, if not handled correctly, because we're involving, like you said, the ChatGPT API and what this chattr package is handing off, can matter quite a bit. So credit to Romain and Hadley for putting this together. And, yes, at the end of the post is a very lovely poem about one of our favorite packages, golem, which will never cease to amaze us.
[00:44:26] Mike Thomas:
Absolutely. And if you do follow that pull request, it's just a great example of how to contribute to a package and make it as easy as possible on the maintainers to get that bug fixed.
[00:44:42] Eric Nantz:
Excellent. And you know what else makes it easier for you to learn about what's happening in the data science and R communities? Well, that's where you just bookmark rweekly.org. Check it out every Monday morning: we have a new issue released, like the one we talked about here, and of course the entire back catalog is also available right on the home page. This is a project by the community, for the community: the lifeblood is you and the community. So we love your pull requests, we love your suggestions. You can get in touch and share a resource via pull request; we just talked about pull requests, right? R Weekly is very embedded in that workflow. The link is directly on each issue's front page, where you get a link to the upcoming issue draft for next week, so you can quickly send your pull request there. It's all Markdown, all the time. Very easy to get up and running quickly.
And, also, we're always happy if you wanna join the team in a curation role. We definitely have spots, and there are links so you can get information on that at rweekly.org. And, of course, we love hearing from you in the audience as well. You have a few ways to get in touch with your humble hosts here. One is the contact page; we have the link directly in the episode show notes. We also have a fun little extra: if you have a modern podcast app like Podverse, Fountain, or Castamatic, you can send those fun little boosts along the way, directly from you to us in your favorite podcast app. And, of course, we have some presence on the social medias.
I am mostly on Mastodon these days with @[email protected], sporadically on the weapon X thing with @theRcast, and also on LinkedIn, sharing some posts and other fun announcements from time to time. And, Mike, where can the listeners get a hold of you? Sure. You can find me on Mastodon as well at @[email protected],
[00:46:39] Mike Thomas:
or the other place I'm present on social media a lot, which is LinkedIn. If you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, you can probably find out what I'm up to.
[00:46:50] Eric Nantz:
Awesome stuff. And like we heard about before, congrats on that recent package you open sourced. I'm sure there are lots of fun stories behind that as well that you'll see in Mike's LinkedIn posts from time to time. Yes, thank you. Absolutely. So we're gonna close up shop here for episode 153, and we hope to see you back for episode 154 of the R Weekly Highlights podcast next week.