A realistic take on converting the NY Forest Carbon Assessment modeling pipeline to the tidymodels suite, and a review of R package development workflows in the Positron IDE.
Episode Links
- This week's curator: Jon Calder - @[email protected] (Mastodon) & @jonmcalder (X/Twitter)
- Converting New York’s Forest Carbon Assessment to Tidymodels
- R package development in Positron
- Entire issue available at rweekly.org/2024-W32
- Tidy Modeling with R e-book: https://www.tmwr.org
- maestro: Orchestration of data pipelines https://whipson.github.io/maestro/
- Pharma RUG: The Rise of R in China’s Pharmaceutical Industry https://www.r-consortium.org/blog/2024/08/01/pharma-rug-the-rise-of-r-in-chinas-pharmaceutical-industry
- R/Pharma APAC track call for talks: https://rinpharma.com/post/2024-07-17-apac-track/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon) and @mikeketchbrook (X/Twitter)
- Person, Place, or Groove? - Pictionary - The Orichalcon - http://ocremix.org/remix/OCR01548
[00:00:03]
Eric Nantz:
Hello, friends. We're back with episode 174 of the R Weekly Highlights podcast. This is the weekly podcast where we talk about the terrific resources in the highlight sections that are being shared, along with much more content, in this week's R Weekly issue. My name is Eric Nantz, and I'm delighted you joined us from wherever you are around the world. We are in the month of August already, and time has flown by quickly. So, of course, I gotta buckle up my virtual seat belt here as we get along the ride to an eventual conference next week. But, of course, I need to bring in my awesome cohost, because I never do this alone. He's joining me here, Mike Thomas. Mike, how are you doing today?
[00:00:42] Mike Thomas:
Doing well, Eric. I am 6 days until my flight leaves, and I can't be more excited
[00:00:49] Eric Nantz:
to get to Seattle. That's right. In fact, yep. As you've heard in the previous episodes, Mike and I will both be at posit::conf in Seattle, which means that, during our usual, quote, unquote, recording time next week, I'll actually be giving a presentation around that time. So you won't be having an episode next week, but, nonetheless, we'll make it up to you later in the month. And as we've mentioned before, if you are gonna be in the area for posit::conf, please come say hi to us. We're gonna be out and about. I'm actually arriving pretty early because I'll be part of the R/Pharma Summit that's happening on Sunday before the conference, and I'll be in one of the workshops on databases.
So I'll be around. Mike, you're getting in on Monday, it sounds like. So, yeah, we're definitely looking forward to connecting with you listeners out there.
[00:01:36] Mike Thomas:
Please say hi. Yes.
[00:01:38] Eric Nantz:
Awesome stuff. Yeah. And, again, I can neither confirm nor deny that I have some sticker swag with me. I'm still trying to get stuff together. I still have to pack, so lots of things to bring with me. But good thing we don't have to write the issue ourselves. We've got a handy curator team that handles that for the project. And speaking of going above and beyond, for the 2nd week in a row, our curator this week is Jon Calder. He really stepped in to help out with the scheduling for the rest of our curator team. So, again, many of you hopefully know this by now: R Weekly is a complete volunteer effort. So anytime we can pitch in and help each other out, it's just so valuable to us. Our curator team always goes above and beyond, so, certainly, my thanks to Jon for stepping in. And, as always, he has had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and suggestions.
And, yes, it's been great to see the momentum behind the adoption of the tidymodels ecosystem for many machine learning, prediction, and other pipelines in the modeling space. We've covered numerous segments here in the podcast about some of the recent advancements and the internal tooling that's been happening over the years. And it's always great to see members of the community start to adapt this to their existing workflows, especially in cases where they've had maybe some internal custom solutions. And now they wanna see where tidymodels fits in terms of giving them advantages, reconstructing their modeling pipelines, and what that experience is like. So our first highlight today is doing just that. It comes to us from Mike Mahoney, who is now at the USGS, the US Geological Survey for those who are outside of our US circle here, as one of their scientists and biologists.
They've been doing some great work in the R community with their tooling. But in particular, Mike is involved with a very important effort for the New York Forest Carbon Assessment, which in a nutshell is trying to objectively measure the amount of forest coverage within the state of New York and help predict the changes in this coverage as part of the state's recent Climate Leadership and Community Protection Act, which aims to reduce carbon emissions from, obviously, fossil fuels and other sources by the year 2050, I believe. They want something like an 85% reduction, which means that they're trying to take advantage of what forests and other plants can provide in helping offset some of those emissions.
So this summer, they've been working on version 2 of their automation and modeling pipeline for this assessment, where, again, they've had some internal functions to help with the tuning and creation of these stacked ensembles. But now they wanna use the broader ecosystem of tidymodels to bring all that together in one cohesive structure. Now, unfortunately, he can't share the actual data they're using for this ensemble project at this time, but the blog post does a terrific job of taking advantage of publicly available data from, actually, not far from your neck of the woods, Mike. They're taking advantage of tree canopy data from online sources in the city of Boston from 2019.
And the goal of this post is to illustrate, very similarly to how they're adapting the New York Forest Carbon Assessment, fitting 2 types of models: MARS, which is multivariate adaptive regression splines, as well as gradient boosting models, which, again, are highly popular in the machine learning and prediction space. So the first part of the post talks about how he assembles the actual predictor data and the outcome data, which, again, I think is very comprehensive. If you're into learning how to obtain these data, Mike definitely has you covered with all the data preprocessing steps and the various packages you need. And lo and behold, once you do some data preprocessing and visualization, you now have a tidy dataset with the outcomes of interest that he wants to predict.
So with that, now it's time for the model fitting. There they've got a cell values object, which is, again, hosting the outcome data and the predictor data. One little nugget here right off the bat is that, just like with anything in R, when you're doing a prediction model, you need to have some kind of formula object to denote the relationship between the outcome and the predictor variables. And there's a handy little function in base R called DF2formula(), which I wasn't familiar with, which is basically intelligent enough to assume, if your dataset has the outcome variable as the first column and the rest of the columns are predictor variables, that structure for you. Then you don't have to write the typical formula syntax of outcome, tilde, and then all the different combinations of predictors. It literally just takes the data as is and builds your formula object right off the bat. So that's another one of those hidden base R nuggets that you see from time to time. It's always great to see those.
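To make that concrete, here's a minimal sketch of that base R helper (the function is stats::DF2formula(), available since R 3.6; the column names here are made up for illustration):

```r
# DF2formula() builds a modeling formula from a data frame,
# treating the first column as the outcome and the rest as predictors.
df <- data.frame(
  canopy     = c(0.42, 0.31, 0.55),  # outcome (first column)
  elevation  = c(12, 48, 30),
  impervious = c(0.70, 0.50, 0.20)
)
f <- stats::DF2formula(df)
f
# canopy ~ elevation + impervious
```

No hand-written `outcome ~ x1 + x2 + ...` needed; the data frame's column order carries the structure.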
Now in the tidymodels ecosystem, there is kind of a stepwise fashion to how to produce these workflows. The first step is to create the recipe, which is gonna be the building blocks for the actual fit itself, where you simply feed in the formula and the dataset that contains your observations. Simple enough for the recipes package to do that. And then it's time to define the model specifications, which, again, one of the great things about the tidymodels ecosystem is that it has dedicated packages to help with some of these things. Now that may not always work well, which we'll get to later on, but he outlines 2 specifications, one for the GBM model and one for the MARS model, which, again, depending on the model type, may have different sets of parameters to specify.
And then, for many of the parameters where he doesn't know right off the bat what the best values are, such as the number of trees or the tree depth on the GBM side of it, tidymodels, of course, lets you tune those parameters and optimize them, leveraging the hardhat package and its tune() function. You'll see in the code that the model specification is littered throughout with tune() calls with no arguments. So it's trying to do a lot for you from an abstraction perspective, which, again, may or may not always be perfect, but we'll get to that in a little bit.
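As a rough sketch of what those two specifications might look like (engine choices here, lightgbm via the bonsai package and earth for MARS, are assumptions; the exact arguments in the post may differ):

```r
library(parsnip)
library(tune)    # tune() is a placeholder, re-exported from hardhat
library(bonsai)  # registers the lightgbm engine for boost_tree()

# Gradient boosting spec: unknown parameters are left as tune() placeholders
gbm_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")

# MARS spec via the earth engine, with its own tunable parameters
mars_spec <- mars(num_terms = tune(), prod_degree = tune()) |>
  set_engine("earth") |>
  set_mode("regression")
```

Each bare tune() call marks a parameter for the tuning step to fill in later, rather than fixing a value up front.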
You've got those model specifications now. Now it's time to set up the workflow, which is coming from the workflowsets package, where you can put in your recipe as well as your 2 model specifications. Making sure you've got all that integrated together gives you this nested data frame back where you can see your workflows, in this case, and the information inside. And then once you actually do the prediction, it'll capture the results as well. But what Mike is singing high praises of is that a lot of this took a lot of code back in the days before tidymodels; a lot of things you had to build on the fly. Certainly, Max Kuhn had offered the caret package well before tidymodels, and many people used that to orchestrate their workflows. Still, you had to learn the nuances of how to stitch all this together.
Tidymodels is trying to abstract away a lot of that manual effort so that you can have fit-for-purpose functions to define all this. So, again, we'll come back to the benefits and trade-offs of that later on. And then once it's ready to go, you've got to now start composing your resamples of the data, as well as tuning your workflow sets. In other words, finding what those optimal parameters are. That's done with the workflow_map() function, which is gonna take that set of combinations of the recipe and the model specifications.
You feed in a certain set of metrics and parameters, such as your space for the grid search, and then also two arguments that we'll come back to later on, the metrics and the control arguments, which help with specifying some defaults that can be surprising if you're not ready for them. And then once you have that done, it's time to actually run the tuning and see what your best fit is based on your metric of interest. So in this example, he's looking at the root mean square error to see which model, or which set of parameters of the model, fits best.
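Pulling those steps together, a hedged sketch of the workflow set and tuning call might look like this (data, object names, resample counts, and grid size are all illustrative, not taken from the post; control_stack_grid() from stacks saves the predictions that the later stacking step needs):

```r
library(parsnip)
library(recipes)
library(rsample)
library(workflowsets)
library(yardstick)
library(stacks)  # for control_stack_grid()
library(bonsai)  # registers the lightgbm engine for boost_tree()

# Illustrative data: outcome in the first column (names are made up)
cell_values <- data.frame(canopy = runif(100),
                          elevation = rnorm(100),
                          impervious = runif(100))

rec <- recipe(canopy ~ ., data = cell_values)

# Abbreviated specs (one tunable parameter each, for brevity)
gbm_spec  <- boost_tree(trees = tune::tune()) |>
  set_engine("lightgbm") |> set_mode("regression")
mars_spec <- mars(num_terms = tune::tune()) |>
  set_engine("earth") |> set_mode("regression")

# Cross every preprocessor with every model spec
wf_set <- workflow_set(preproc = list(rec = rec),
                       models  = list(gbm = gbm_spec, mars = mars_spec))

folds <- vfold_cv(cell_values, v = 5)

tuned <- workflow_map(
  wf_set, "tune_grid",
  resamples = folds,
  grid      = 10,                   # size of the grid search
  metrics   = metric_set(rmse),     # the metrics argument Eric mentions
  control   = control_stack_grid()  # keep predictions/workflows for stacking
)

# Rank every tuned configuration by RMSE
rank_results(tuned, rank_metric = "rmse")
```

The control argument is the surprising default he flags: without a control that saves predictions and workflows, the stacking step downstream has nothing to work with.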
And he's able to locate, for each of the model types, which particular model won out. It's got a somewhat ambiguous name, something like Preprocessor1_Model09, because it's actually fitting different models for each of these combinations, so each gets a unique ID. But he's not interested in just cherry-picking the best fits of these. He wants to use the ensemble technique, which is basically able to take all these model fits and figure out, with some analysis, which are the, quote, good enough parameter sets that he can then use later on in the actual prediction side of it. So it takes advantage of the information from all these different model types.
And this is a great advancement in the tidymodels ecosystem. They have a package called stacks, which is able to add, or literally stack together, all these different model fits and then see which get the highest weight in terms of giving the best performance, and really get objective measures around that. So then he's got a nice tidy output here of the top 4 members contributing the best prediction power, you might say. And it is a mix of the LightGBM and the MARS models for these different combinations of the preprocessing and the model fit itself.
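A minimal sketch of that stacking step with the stacks package (the object names are assumptions; `tuned` stands in for a workflow set that was tuned with control = control_stack_grid(), so its predictions were saved):

```r
library(stacks)

# `tuned` is assumed: a workflow set tuned with control_stack_grid()
ens <- stacks() |>
  add_candidates(tuned) |>  # pool every tuned configuration as a candidate
  blend_predictions() |>    # penalized regression keeps only "good enough" members
  fit_members()             # refit the surviving members on the full training data

ens  # printing shows the retained members and their stacking weights

# Predicting with the ensemble is then one call, e.g.:
# predict(ens, new_data = new_cells)  # returns a tidy tibble of .pred values
```

blend_predictions() is where the weights come from: members whose coefficients are shrunk to zero simply drop out of the ensemble.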
And then he can take that and feed it directly into the predict function. That's pretty neat. Just predict with this ensemble object that he's created with stacks and the dataset that has the predictor values, and you'll get a tidy data frame with the predicted values back. That's, again, another less-code solution to get your predictions. And this is a pretty comprehensive start-to-finish flow. I will say there were some little gotchas along the way that he is able to talk about in the next section of the post. And Mike, our author of the post, has a unique perspective on this because he was an intern on the tidymodels team years ago. So he's had some inside look at this, and that makes his critiques here even more fascinating. So, Mike, why don't you take us through that?
[00:13:31] Mike Thomas:
Yeah. Mike was an intern back in 2022 on the tidymodels team, and I have to imagine that while that must have given him a huge leg up in converting their legacy code into tidymodels, 2 years in tidymodels time must be like a decade in terms of the amount of new functionality and packages and design choices that have been added to the ecosystem since then. I mean, we have survival analysis in tidymodels now, and that was never on the radar back in 2022, at least as far as I could tell. And, you know, as we know, that ecosystem, that universe of packages within tidymodels, has grown. It reminds me a little bit of that recent blog post that we had about creating package universes. And, Eric, you and I discussed some of the trade-offs that you have to consider when doing that. Right? And it reminds me a little bit of the end of Mike's blog post here, because a lot of these tidymodels packages work together.
As Mike notes, there are sort of 3 packages that work together just for hyperparameter tuning itself. The tune package takes care of grid searching, the developer-oriented hardhat package owns the infrastructure around hyperparameter tuning, and then the dials package owns the actual grid construction. And when you think about hyperparameter tuning itself and trying to create these different methods within each of these packages and having them work together, well, not only do you have to do that for one type of model, but I'm assuming that you have to create different object-oriented methods for all of these functions across all of these packages for all of the models that tidymodels supports. And I know on the tidymodels homepage they have a list of all the different modeling algorithms that they support: tree-based, regression-based algorithms, all sorts of different stuff. There's a ton within that ecosystem that they support, but I imagine that once you want to add an additional model type, it's a pretty extensive process to ensure that you can support that model type not just in one package but across all of these different packages that work together to create these modeling workflows.
So I am not a user of scikit-learn, I'll be honest. I've reviewed some scikit-learn code in the past, and I know it's very highly regarded in the Python ecosystem. I don't know this for a fact, but I have to imagine that the tidymodels team must have had the benefit of taking a look at what works well in scikit-learn and what doesn't when they went to move from caret to tidymodels and create this new framework. So I'd be curious to see if Python suffers from these same sorts of issues or, if not, how they're handled in scikit-learn. Because I have to agree with Mike that this is somewhat of a pain point if you are doing some pretty hardcore machine learning and predictive modeling as Mike and his team clearly are. Right? Creating a lot of different types of models, trying to ensemble them together, trying to tune hyperparameters, and doing it in a way such that the code is as efficient as possible. There's a function in here that I didn't even know existed from the workflowsets package called workflow_map(), which I have to imagine is like a purrr-like approach to developing workflows across a bunch of different models, hyperparameter tuning, and evaluating and comparing those models in a really programmatic approach, as opposed to hard coding things for each one of these models and then trying to compare and evaluate the outcomes of these models separately. So I would advise you to take a look. I don't know how far down in the weeds we want to go into some of Mike's specific gripes, if you will. I think complaint is sort of a strong word, though that's what he uses in the header here. But I think he's really just pointing out some of the things that you, as a tidymodels user, the deeper you get into it, will face as well.
And he's calling out some of the things that worked for him in terms of workarounds, some of the things that he learned. He admits that some of this stuff is straight in the documentation, and some of it is not, and you have to take a lot of time to figure it out yourself. And this is super relatable. I think there is something that was documented in a package here that Mike spent 26 hours trying to figure out before he was able to actually understand what was going on.
And it had to do, I think, with the defaults in hyperparameter tuning, and some of these workflows that were failing extremely slowly, unfortunately, and weren't bringing the errors to light quickly enough for the end user. This is all to say, at the end of the blog post, that they're still using tidymodels, because I think, net net, at the end of the day, Mike and his team believe that the pros and the benefits that they've received from switching over to tidymodels outweigh the cons. And I think, like anything with open source software, hopefully some of the complaints and the issues that they faced are things that will be resolved over the years within the tidymodels ecosystem, enhancing these things and making that user experience a little easier. I've used tidymodels many times before and really enjoyed it. It's definitely a bit of a learning curve from caret. I think you have to have more of a purrr-like, higher-level design thought process in mind when you're leveraging tidymodels and understanding how all of these different packages, like parsnip, rsample, yardstick, and workflows, work together to accomplish what you're trying to accomplish. But it is super powerful, and I'm glad to see that the team is still sticking with it. And this is like a wealth of information around tidymodels and a great crash course, if you are trying to get into the weeds of machine learning in R, on some of the design choices that Mike and his team made to create these models and evaluate them programmatically.
It's fantastic. I'm not sure, Eric, if I've seen a blog post recently that goes into this level of detail within tidymodels for us. So, very welcome blog post. I think it's a great not only technical discussion but also sort of practical discussion, from a team perspective on on what's worked well for them and and what hasn't and where they're planning to go in the future. I was wracking my brain as you were walking through this, and I
[00:20:38] Eric Nantz:
don't recall one in the recent months, or probably even the past year, of anything this comprehensive, because there is a great mix here. Again, one of the points he mentions towards the end is that you may think you can kind of hone in on one particular aspect of tidymodels, but he's saying that it really took him having a holistic view of how these different pieces fit together. That may be an issue for those that are kinda new to these suites of packages that have a cohesive API or opinionated way of integrating together.
Now as I say that, you may be thinking to yourself, well, that sure sounds an awful lot like the tidyverse itself. Right? I mean, certainly, they got inspiration from the tidyverse on a few things. But when I do data processing pipelines with the tidyverse, most of my time is spent with, I'll call it, a core set of maybe 2 packages, maybe 3 at the most, like dplyr, tidyr, and purrr to help with some mapping and processing. Oftentimes I don't quite have to get into the weeds so much with some of the other packages, but sometimes I do. And so it always helps, as you're building these use cases for yourself or maybe for your team, to document these intangible kinds of learnings, because as comprehensive as the documentation might be for these given packages, it's how they integrate together that matters.
And, certainly, the tidymodels team has done great work to put these freely available online books out about all the different ways that tidymodels can be used. Max Kuhn, Julia Silge, and others have been very front and center with that, and we highly recommend you check out the tidymodels site to get links to those particular resources. I do think, though, that having a post like what Mike has done here touches on things that, again, are kind of on the more practical side. And I have been a victim myself, as someone who uses HPC systems on a weekly basis.
Oftentimes, with jobs that won't complete in a day or sometimes 2 days, it can be costly when you thought you had a default set right, and then you find out after the fact you forgot to save that prediction result. You forgot to do that one little adjustment to p-values, in my case, and whoops, gotta go back to it. So there are some things that you can do to help minimize the impact of that, which I don't know how well they carry over to tidymodels. But what we tell people on our team is, if you have a simulation pipeline and you wanna do, like, 10,000 simulations, you know it's gonna take a while.
You really only wanna do a few of them first to make sure you've ironed out all your connections and all the outputs you're saving, so that you're not surprised after running them over that amount of time. So I had the feels when I read that part of Mike's post. I've been there many, many times. But these are all things that, again, you kind of have to learn by doing, and then documenting your learning process is so helpful. And I do think there's going to be tremendous value for those that are adopting tidymodels in their pipelines right now to see, again, literal real-world usage of this and the lessons along the way.
I'm confident that the tidymodels team will take this hopefully constructive feedback here, and maybe we'll see some enhancements to these more use-case approaches in the documentation, and not just the developer-facing detail that you might see in the weeds of a package manual or when you pull up the help for tune() or whatnot. You're not really getting the full picture at that point. So maybe we'll see improvements on that. But, again, this is kind of the unofficial contract you sign when you leverage a suite of packages that are meant to be coupled together tightly. There may be cases where the abstraction doesn't give you the full story. You need to kinda get in the weeds a little bit, like Mike has done here.
[00:24:52] Mike Thomas:
I agree, Eric. And that's a great call-out to the Tidy Modeling with R book. I think if you are interested in tidymodels, or if you're stuck on something, that can be a great resource. We'll put that in the show notes, but it's tmwr.org.
[00:25:06] Eric Nantz:
Couldn't be easier to get to. That's right. Yeah. And it should be on your virtual bookshelf, or even your printed bookshelf there. It's a valuable resource. And, again, the ecosystem is always evolving, too. I think they do quarterly updates from time to time on the tidymodels blog. I've seen Max and Emil and others do a great job of writing those up. But, yeah, what Mike has done here in this post is of tremendous value to the community. Alright. We talked about looking at things from a developer perspective and a user perspective. We're gonna put our dev hats on, Mike, a little bit, because we're gonna talk about what has been a very hot topic these days and how that applies to package development workflows.
And if you recall, it was a month or so ago. It was kind of quietly put out there, but there is a new IDE authored by Posit called Positron. And, again, the geek in me cannot ignore the fact that the name of my favorite movie ever is in the name Positron. So take that for what it's worth; I have visions of the MCP talking to me right now. End of line. But in any event, it's really interesting to see the uptake on this, and I dare say we'll be hearing a lot more about it at posit::conf a week from now. But what we're seeing here is a post from Stephen Turner, who, coincidentally, when I started my R journey many, and I do mean many, years ago, wrote one of the 2 blogs I discovered that helped me in my journey, especially coming from another language like SAS and trying to make heads or tails of what R was doing under the hood.
Stephen Turner, the author of this post, was one of them because he wrote this terrific blog called Getting Genetics Done, which was so instrumental in my learning journey with R, and it is just terrific to see him resurrecting this effort in a new blog. This post is one of the latest that he's put on it; I believe it launched earlier in July. So a huge thank you to Stephen. And, you know, I don't think I've met you personally, or if I have, it must have been years ago, but you have been very instrumental to my journey with R. So it's terrific to be able to cover one of your posts here in the highlights. Nonetheless, what he talks about in his post is his early adoption and user experience of Positron to mimic what he's done in the RStudio IDE, and before that, Emacs with ESS, over the years, and that is building an R package. What is the experience like in that space?
So we're gonna dive into some of his findings in the highlights now. First of which: if you're not familiar with how Positron operates, Positron is actually a fork of the open source version of Visual Studio Code, called Code OSS. So when you look at Positron, it's gonna look different than the RStudio IDE, but that's by design, because it literally is the VS Code shell with Posit's, you might call them, design choices on top to help bring some of that RStudio IDE functionality. Not quite all of it, because this is a beta product, which we'll get to in a little bit. But he puts links to great resources about Positron if you're new to it, such as the wiki on their GitHub repository. Appsilon's done a nice intro to Positron, as has Andrew Heiss, who's been frequently featured on the highlights, with his experience of Positron too. So definitely have a look at those after you listen to this, but let's get to the actual package development workflow.
So in RStudio, what do we typically do? We like to create a new project to help house our package code, and Positron brings its own spin on this as well. This is, again, one of their additions on top of the Visual Studio Code-like experience, where they let you choose 3 different project types: either a Python project, an R project, or a Jupyter notebook. So right off the bat, you've got a little wizard to guide you along the way. Of course, he chooses the R project, and you've got, again, that familiar-looking R project file created for you, which is just like what you would have in RStudio. So there's already a familiar part of your experience.
And then how do you actually create a package? Many of us are now using the usethis package to create a package from scratch in that same directory. He does that, no gotchas there. He's got the scaffolding right off the bat in a package he's calling hello, and then he puts in a simple function called isay. It's kinda like your hello world type, but with just a little sampling of different strings in there. He writes that function, and now we get to some of the differences, because there is a very convenient feature in the RStudio IDE that I use every time I make a new function for a package.
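The scaffolding steps he describes map to a couple of usethis calls, which behave the same in Positron's console as they do in RStudio (the package and file names here mirror the post):

```r
# Lay down a package skeleton in a new "hello" directory
usethis::create_package("hello")

# Then, inside that project: create R/isay.R to hold the new function
usethis::use_r("isay")
```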
There is either a keyboard shortcut or a menu entry to dynamically insert an roxygen skeleton of the parameter documentation right there above your function, just a click or a keyboard shortcut away. Unfortunately, that's not in Positron yet, so you're gonna have to write out the docs yourself. Of course, it's not too difficult, and you'll get code completion, but it is just one of those conveniences that hasn't quite been replicated in the Positron experience yet. But nonetheless, he's able to document his function, and now comes the iteration. Right? When you're writing a package, you wanna develop your function, test that things are working, and update the documentation manual pages dynamically.
And what's nice about Positron is you can import a keyboard mapping of shortcuts that will mirror very closely what you might have used in RStudio. So you can import that optionally, and then use that familiar, say, Cmd+Shift+D or Ctrl+Shift+D to populate the documentation manual pages on the spot; that works just right here as well. So that's gonna take care of, again, the manual pages, and it's gonna take care of putting the function name in the NAMESPACE for exporting, so you don't have to do any manual effort on that front. Just knowing the shortcut or the command palette, you'll be able to do either one of those.
Another interesting thing is that, from time to time, we like to install the package in our local environment as we're iterating on it. Before, RStudio would call R CMD INSTALL verbatim. In Positron, it's actually leveraging the pak package, pak::local_install(), which, again, I didn't know existed. So a little learning there. It's not a surprise that Posit would use some of their tooling from the r-lib suite of packages to help automate some of these processes behind the scenes. So fair play to them. Of course, as you develop a package, you're gonna see your fair share of warnings if you need to update, say, documentation names or other things like that, or the license entry. You get all that in the console just like you would with RStudio.
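That install step, sketched: pak installs the package from the local source directory, resolving any missing dependencies first, which per the post is what Positron calls under the hood:

```r
# Install the package in the current working directory, along with
# any missing dependencies, instead of shelling out to R CMD INSTALL
pak::local_install(".")
```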
So no surprises there. You can run devtools::check() and get the results there. You can build in your tests and run devtools::test() very quickly, and it's gonna open the results right up in your IDE. Again, we're seeing a lot of similarities there. There is one other thing that's kinda missing from a development perspective that you might use from time to time: for things such as the covr package, which helps you look at the test coverage percentage of the functions you develop, there has always been a handy RStudio add-in in the RStudio IDE that lets you run that report quickly from a menu click. As of this recording, Positron's not supporting RStudio add-ins yet. I think that's something that is being worked on, because even in my Visual Studio Code experience with developing R projects, I can get add-ins there, thanks to the efforts of Miles McBain and others from the community. So I think that's on the roadmap, but that's something to be aware of if you wanna adopt Positron right away.
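Even without the add-in, the underlying functions are callable from the console. A sketch of that development loop, assuming devtools and covr are installed:

```r
library(devtools)

check()   # run R CMD check: docs, NAMESPACE, examples, tests, license
test()    # run the testthat suite and print results in the console

# What the RStudio coverage add-in wraps: compute per-function test
# coverage for the package and open an interactive report.
cov <- covr::package_coverage()
covr::report(cov)
```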
So for most of the actual workflow, package development looks pretty seamless with Positron. Now, there are some additional things missing from the Positron experience that I leverage heavily when I put on my VS Code hat for a second. That is the remote container and remote connection functionality, meaning that I could have Visual Studio Code on one system, but then have another server, either on my local LAN, my local HPC environment, or in the cloud, that actually has the R process on it, and then I can just farm out my computations to that while working as if it's local. That's not quite there yet in Positron. They are working on it, and it would be a game changer for me personally when they adopt it. But what was interesting is what Stephen was able to do here. Little did I know, on top of writing great content about R, he's actually authored a package to help build a Dockerfile from your package project, called pracpac.
Say that three times fast. But, nonetheless, this is new to me. I've only been familiar with the dockerfiler package by our friends at ThinkR. But this looks pretty nifty, and he even has a link to the paper about the package as well. We'll put that in the show notes. He was able, even in Positron, to use the pracpac package to create the Dockerfile with R baked in without any fuss. So at least he can get the Dockerfile going; he just can't do remote container development with that Dockerfile in Positron yet. But, hey, it's great to see that he's able to get a Dockerfile in there, so if you wanna throw this into another environment, you'll have those same dependencies, at both the system level and the R package level, ready to go. And he says the experience of that was very smooth, because Positron does take advantage of most of VS Code's features, such as syntax highlighting for bash scripting and other file types. So if you're in a multilingual environment, yeah, Positron is definitely gonna be a big help with that.
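For the curious, the basic pracpac call looks roughly like this. The function name and argument follow pracpac's documentation, but check the package site for the current API; this is a sketch, not Stephen's exact code:

```r
# From the root of an R package project, scaffold Docker assets:
# a docker/ directory containing a Dockerfile that installs the
# package's R dependencies (optionally via renv) and the package itself.
pracpac::use_docker(pkg_root = ".")
```

From there you'd build the image with `docker build` as usual.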
So the nutshell to me is that things are looking promising. But, again, this is beta. Be warned about that. But I do think that the seeds are planted. And if you want to put on your speculation hat with me, Mike, Stephen kind of hints at this in a comment in the post. It looks like a lot of the development energy is going to Positron now, so we may have to put a little toast out there to our friendly RStudio IDE, because I don't know if its days are numbered. I guess we'll find out next week. What do you think?
[00:36:42] Mike Thomas:
Yeah. Maybe we'll find out next week. I know that some folks have raised that question, and the messaging thus far, I think, is that RStudio is gonna continue to be supported. You know, there's a lot of different skill levels of R developers. Right? There's R beginners. There's a lot of folks who are probably intermediate, who are consumers of R packages and do a lot of their work, ETL stuff, and analysis in R, but maybe aren't developing R packages or doing anything much more hardcore than analysis. I think that's probably a large portion of the R community out there, and that covers those first two classes: the R beginners and the folks who are mostly R consumers.
I'm not sure how much incentive they have to move from RStudio. I don't know if Positron feels like a more welcoming environment than RStudio, but I'm super biased, because RStudio was probably one of my first ever IDEs. And then I looked at VS Code for a long time. It looked so scary to me when I originally booted it up for the first time and tried to do some work in it. I had no idea where anything was, but that's probably my bias from being so used to RStudio, the layout there, and where the features are. So I don't really know the answer to whether someone who starts out in RStudio versus someone who starts out in Positron is able to make the best progress the quickest.
But it would be interesting to see how that plays out for new users, whether they're migrating directly to Positron in the future or whether folks are still starting out with RStudio. But, you know, as you said, Eric, for those of us that need some of these maybe niche features that we get in VS Code, like remote SSH as you talked about, so that you can feel like you're working locally while actually SSH'd into some remote environment, that's pretty powerful. I think that's probably going to come fairly soon to the Positron ecosystem. When I was looking through the issues and the discussions on GitHub, it looks like that one is fairly promising in terms of Posit's ability to actually get it incorporated.
The other one that sort of hurts me, which we've talked about before, is the idea of these VS Code dev containers that allow you to develop as if you're local while you're actually in a containerized environment. That's huge for us in collaboration on our team, but I fully understand that the dev containers extension is proprietary Microsoft code. So there's not an easy way for Positron, or any other IDE for that matter, to recreate that unless Microsoft someday decides to actually open source how they go about doing it. So I don't know which one's going to come first, whether Microsoft will open source it or Positron will incorporate it, but that's a big one for me that I would love to see incorporated. But, again, these are niche things that probably the majority of the R community may not necessarily care about. And some of the trade-offs here, and the things that we're getting from Positron, I think are huge benefits for potentially a lot of that intermediate class. So it'll be interesting to see the adoption of Positron, what takes place as it moves out of beta, and what significant changes get incorporated for quality of life for folks.
But, you know, one of the issues I imagine Posit is up against is accommodating so many different levels of R users. Yes. Right? Yes. It's tricky. So I don't envy what they're trying to do. It's a large-scale problem to solve. But I do think the fact that VS Code's core is open source, and allows Posit to tailor it to whatever they want it to be, is powerful in and of itself, and I guess another testament to open source. So we'll see how things progress here. I love the fact that folks are starting to dive into it and pick up on the strengths and the weaknesses of it for the rest of us to be able to get up to speed early as we adopt it.
So, you know, kudos to Stephen as well for doing that.
[00:41:35] Eric Nantz:
Absolutely. And one thing I think about is, first, I do want to mention that I've used my fair share of attempts at IDEs before RStudio. They were hit or miss at best. I remember one in particular on Linux called RKWard that tried to do a lot of things, and others tried to make the Eclipse IDE fit with R projects. Oh, boy. That was gnarly, buddy. If anybody listening remembers that, give me a shout, because I can share stories with you about that. I knew at that time, as R was starting to see more adoption in industry, it was gonna be a tough sell to get people to develop in those kinds of IDEs, much less the command line. I mean, hey, I love the command line as much as anybody, but it's not for everybody. Right? So when RStudio came out, you can't overstate the influence it had on those new to R and new to data science, having this cohesive experience, everything in one place, so to speak, in your development journey.
But, boy oh boy, did it help with adoption. It certainly did in my industry. So that's why I'm cautiously optimistic that Positron will eventually be tailored in a more usable fashion to those use cases for people who are new to the language and who are not interested in, say, package development. They just wanna get their data science done, get their statistical model fits done, get that Quarto report out there, and get on their way. I think the bones are there; it's just gonna be a little while. But with this soft beta that they rolled out, their issue tracker is quite extensive now, last I checked. There's a lot on there, so it's a lot for the team to prioritize. But, yeah, we shall see. There is one little nugget more; speaking of open source, there's always the issue of licensing. Right?
A little hidden nugget here in what Stephen mentions at the end, which I think we should call out, is that Positron is not using something like the GPL, and not using Apache. It's using what's called the Elastic License 2.0, which you may not have heard about unless you're really familiar with all the nuts and bolts of software licenses. But let me read the blurb that he calls out here, because I think I may know the backstory. It says: you may not provide the software to third parties as a hosted or managed service where the service provider provides users with access to any substantial set of the features or functionality of the software.
There have been vendors, I won't name names, that have bundled the open source version of RStudio into their, what I call, platform-as-a-service offerings. And I've heard, unofficially, I won't put names on this, that Posit was not too thrilled about this business practice. So it does not surprise me that they would take this next step with this fresh start and put that clause in there, but that may change where Positron can actually be integrated. So I guess we'll watch this space, but that's something to look out for if you are in that kind of software provider space. So that was a little nugget I didn't expect to see, but we'll stay tuned on that.
[00:45:09] Mike Thomas:
Yeah. It's interesting. Eric, that's why we have you on the podcast: you know the backstory around all of this. You have environments like the one that Stephen works in, where you may need to host something like RStudio Server. Right? Mostly internally, or maybe you're doing some sort of consortium collaboration, and you wanna stand that up in a way that is easily accessible, you know, cloud-hosted, something like that, for a group of individuals, maybe internally or externally. So I would love to see this license say something like: you're not allowed to resell Positron in that fashion, but if you wanna stand it up for free and not make any money off of it, in a similar fashion to how you would stand up RStudio Server, then go for it.
But as Stephen mentions, and in his words he's not a lawyer, it looks like Positron's license may preclude the ability to do something like that. Yeah. That's,
[00:46:23] Eric Nantz:
that's looking more likely. Maybe, again, these will be things we hear more about in between sessions at next week's posit::conf. But if you're in tune with this space of those in the data science platform-as-a-service area, let me just say that if you're a fan of pizza in the US, you can probably guess the name of the company I'm thinking about. I'll leave it at that; I can't say more. But, nonetheless, this was a very comprehensive post by Stephen, and, again, it's a pleasure to have him on the highlights. I'm really looking forward to seeing how he continues to use Positron in other efforts like this, as I'll be testing the waters a little bit in my continued Shiny development. There is one thing that I want that hopefully I can talk to the Posit folks about next week.
Please, please make a Nix package for Positron, and then I'm happy. Hey, it's open source; pull requests are welcome. Yeah, I know. Bruno, I don't know if you're listening; maybe we need to talk about it. Maybe we need strength in numbers on this one.
[00:47:29] Mike Thomas:
There you go.
[00:47:30] Eric Nantz:
Well, strength in numbers is also a way to describe the benefit of R Weekly itself. It is the strength of the community that populates every single issue. So we'll take a couple of minutes to talk about our additional finds here. Now I'm gonna talk about something that I totally did not expect to talk about, especially in the circle I operate in. We have a new package that's been released in the R ecosystem called maestro, positioned to create and orchestrate data pipelines in R.
I know a little bit about pipelines, because I happen to work closely with the author of targets himself, Will Landau. So, of course, the first thing I'm thinking of is: what is maestro all about? We'll put a link to this in the show notes, of course, but at a high level, maestro is definitely positioning itself in the realm of data processing. I don't envision it's trying to encroach on the comprehensive ability of targets to flex across the many types of analytical workflows that we often see in data science and operational automation efforts. But the nuts and bolts of maestro are kind of fascinating: it takes advantage of roxygen tags on a function that you create for your data processing or ETL to define characteristics such as the schedule of when the pipeline will run, like the frequency and the actual start time of the job, and there are other decorators as well. Then, once you have your functions decorated, you use the functions built into maestro to build that schedule and then to execute it. It does leverage other packages that help with, you might say, the more high-demand computation, such as the future package and the furrr package, if you wanna have that map-like functionality for your running schedule. It looks like it can interface with those quite a bit, but I don't think this is an either-or. When I saw maestro get mentioned, I went immediately to its package site, and they have an article on their motivation for creating maestro.
And, of course, there is a section comparing it to related packages. I was pleased that they talked about targets there, because I would have reached out to the author right away if I hadn't seen it. It looks like they are interested in seeing where maestro and targets can complement each other, because, at least in some of the targets critiques I've heard over the years, it's a little more difficult for these more ETL-style pipelines, especially the extract portion of data processing, and especially if you're interfacing with databases or APIs that you're ingesting data from. There are ways you can use targets for that; you just have to get a little more into the nuts and bolts of how you orchestrate your return objects and whatnot. So I'll be very interested to see if this ends up being a cohesive relationship or not. But with that said, if you're in this space of producing comprehensive data pipelines, maestro may be worth a look.
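A minimal sketch of that decorator idea, based on maestro's documentation. The tag names and schedule syntax should be checked against the current package docs, and the pipeline itself is hypothetical:

```r
# pipelines/refresh_orders.R
# The schedule is declared in roxygen-style tags above a plain R function:

#' @maestroFrequency 1 day
#' @maestroStartTime 2024-08-05 09:00:00
refresh_orders <- function() {
  # hypothetical ETL body: pull from an API, clean, write to a database
  message("refreshing orders at ", Sys.time())
}
```

An orchestrator script then builds the schedule from that folder and runs whatever is due:

```r
library(maestro)

schedule <- build_schedule(pipeline_dir = "pipelines")
run_schedule(schedule)
```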
[00:50:42] Mike Thomas:
Absolutely, Eric. I'm very interested to see how that evolves. That sort of came out of left field for me; the maestro package wasn't something I had on my radar, but it's really, really interesting, especially as we've been getting into a lot more targets projects lately, a lot more of these sort of pipeline types of projects. So it'll be a potential new tool in our tool belt as well. I have sort of a different highlight that I wanted to call out. It is an interview that the R Consortium published, held with Pharma RUG China organizer Joe Zhu, who spoke with the R Consortium about how he has grown the R community in China, specifically in the pharma space.
I think it's something that we don't necessarily talk about enough: how international R is. A couple of cool facts about Joe: he has his PhD in statistics, which he studied for in New Zealand, which we know is the birthplace of R. He has worked at Roche for the past four years, where he helped open source 30 software packages. There are four that are called out here. I'm not super familiar with them, Eric, but you might be, being in the pharma ecosystem: formatters, rtables, rlistings, and tern. Oh, yes. Sounds like there are many more beyond that.
This R community in China has done a lot of collaboration and I think is involved pretty heavily in the Pharma RUG, the pharma R user group, in general. And there are a lot of great details about how Joe has helped organize many events, a lot of them hybrid, and how he's been able to pull those off using Microsoft Teams while also hosting them in person. So I thought it was a really nice interview; if you're looking to learn about organizing R user groups in your neck of the woods, you might get something out of it.
[00:52:46] Eric Nantz:
Yeah. I'm really glad to see this, because we are seeing a lot more momentum from our colleagues in the Asia Pacific region adopting R in life sciences. This is hugely instrumental in making that journey hopefully easier for those who are on it. And, you know, very closely related to this, some of you know I'm one of the organizers of the annual R/Pharma virtual conference. One thing we've always struggled with is how to best accommodate our fine friends over in the Asia Pacific region, because we're mostly a US-based conference in terms of time zones and the way we stream our talks. Well, I'm happy to say that, alongside reading this post, if you are in the Asia Pacific region, there is going to be an R/Pharma event tailored specifically to Asia Pacific.
So we're gonna have a link to that. It's actually in this week's R Weekly issue that the call for talks is open for that particular effort. We'll put that in the show notes just in case, too. Jonathan Carroll has actually been quite nice to advertise that on Mastodon and whatnot. So, yeah, this is a wonderful time, if you're in the life sciences space in the Asia Pacific region, to dive into this momentum. I liken it to when R/Pharma was first introduced in 2018. None of that had been built before. Now we're trying to expand the scope of this, because we wanna give everybody across the world a unique opportunity to share their learnings and learn from each other along the way. So, yeah, it's gratifying to see the R Consortium team keep on spotlighting these great use cases, and I'm very excited to see where the future goes here.
And, of course, you can't just be excited about that. You gotta be excited about the rest of the issue. There is a boatload of additional content here covering the full gamut of new packages, many of which we wish we could talk about here today, but there's only so much time in a day. But there are lots of great tutorials as well. I see a full gamut of spatial visualizations, tidyverse pipelines, even a great post by Nicola about creating typewriter-styled images. I think that's great for my retro feels as well. Love to see that. So much more content there. Where do you go for it? If you don't know by now, you know now: it's rweekly.org. You bookmark that; every single week, we've got a new issue for you. And if you wanna contribute to the project, we rely on contributions from the community.
So, please, if you find that great new blog post, maybe you wrote that post, or you find a great package or a great tutorial, send it to us via a pull request, directly linked at the top right corner of the home page. It's all Markdown all the time. Right? Markdown is how I live in writing my content, and for some internal presentations, having to go back to PowerPoint just doesn't feel right. I feel right writing in Markdown, in Quarto or R Markdown, and I'm gonna put my foot down on that. Hopefully, that becomes more of a trend in my industry, but I digress. Either way, it's all Markdown all the time, so you can send your pull request right there. We'll be glad to merge that in for the week. And, also, we love to hear from you. We've got a contact page directly linked in this podcast episode's show notes. I did a little HTML hacking to get that together after we moved our podcast hosting, so hopefully it works out for you. But, also, you can get in touch with us on social media.
I am on Mastodon these days at [email protected], and also on LinkedIn. Send me a shout there. And, again, if you're gonna be at posit::conf, I will be there at this time next week for sure, so we hope to hear from you. And, Mike, where can listeners get a hold of you? You can find me on Mastodon at [email protected].
[00:56:30] Mike Thomas:
You can find me on LinkedIn just by searching Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, to see what I'm up to lately. Or, Eric, as you said, find me in person next week if you're gonna be in Seattle. We'd love to chat with you. Yep. I'm gonna do my best to wear some kind of R-related swag shirt every single day, so I'm easy to spot.
[00:56:49] Eric Nantz:
But who knows? Maybe everybody's wearing R swag, so it may not be easy to spot.
[00:56:53] Mike Thomas:
I'm gonna print up some R Weekly Highlights podcast t-shirts for us. Oh, yeah. Because this is an audio podcast. People might not have a single clue what I look
[00:57:05] Eric Nantz:
like. Yeah. I still remember one time at the first Shiny dev conference, I was asking a question, and somebody looked over and said, hey, I know that voice. Yeah. So we're gonna get, I'm sure, our fair share of that. Nonetheless, we could blab around all day. We're always excited to talk about this stuff, but we've got to close up shop here. We've got our day jobs to get back to, but we're very happy you joined us today for this latest episode of R Weekly Highlights. And, again, we will not be back next week because we'll be at posit::conf, but we look forward to connecting with you all again with a new episode two weeks from now. So until then, goodbye, everybody.
Hello, friends. We're back with episode 174 of the R Weekly Highlights podcast. This is the weekly podcast where we talk about the terrific resources in the highlight sections, along with much more content, shared in this week's R Weekly issue. My name is Eric Nantz, and I'm delighted you joined us from wherever you are around the world. We are in the month of August already, and time has flown by quick. So, of course, I gotta buckle up my virtual seat belt here as we go along for the ride to an eventual conference next week. But, of course, I need to bring in my awesome cohost, because I never do this alone; he's joining me here, Mike Thomas. Mike, how are you doing today?
[00:00:42] Mike Thomas:
Doing well, Eric. I am six days from my flight leaving, and I can't be more excited
[00:00:49] Eric Nantz:
to get to Seattle. That's right. In fact, yep, as you've heard in previous episodes, Mike and I will both be at posit::conf in Seattle, which means that, during our, quote unquote, recording time next week, I'll actually be giving a presentation around that time. So you won't be having an episode next week, but, nonetheless, we'll make it up to you later in the month. We've mentioned this before: if you are gonna be in the area for posit::conf, please come say hi to us. We're gonna be out and about. I'm actually arriving pretty early, because I'll be part of the R/Pharma summit that's happening on the Sunday before the conference, and I'll be in one of the workshops on databases.
So I'll be around. Mike, you're getting in on Monday, it sounds like. So, yeah, we're definitely looking forward to connecting with you listeners out there.
[00:01:36] Mike Thomas:
Please say hi. Yes.
[00:01:38] Eric Nantz:
Awesome stuff. Yeah. And, again, I can't confirm or deny that I have some sticker swag with me. I'm still trying to get stuff together; I still have to pack, so lots of things to bring with me. But good thing we don't have to write an entire issue ourselves. We've got a handy curator team that handles that for the project. And speaking of going above and beyond, for the second week in a row, our curator this week is Jon Calder. He really stepped in to help out with the scheduling for the rest of our curator team. So, again, many of you hopefully know this by now: R Weekly is a complete volunteer effort. So anytime we can pitch in and help each other out, it's just so valuable to us. Our curator team always goes above and beyond, so, certainly, my thanks to Jon for stepping in. And, as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world, with your pull requests and suggestions.
And, yes, it's been great to see the momentum behind the adoption of the tidymodels ecosystem for many machine learning, prediction, and other pipelines in the modeling space. We've covered numerous segments here on the podcast about some of the recent advancements in its internal tooling over the years. And it's always great to see members of the community start to adapt it to their existing workflows, especially in cases where they've had maybe some internal custom solutions, and now they wanna see where tidymodels fits in terms of giving them advantages, reconstructing their modeling pipelines, and what that experience is like. So our first highlight today is doing just that. It comes to us from Mike Mahoney, who is now a scientist at the USGS, which is the US Geological Survey for those outside of our US circle here.
He's been doing some great work in the R community with his tooling. But in particular, Mike is involved with a very important effort, the New York Forest Carbon Assessment, which in a nutshell is trying to objectively measure the amount of forest coverage within the state of New York and help predict changes in this coverage as part of the state's recent climate protection act, which aims to minimize carbon emissions from, obviously, fossil fuels and other sources. By the year 2050, I believe, they want, like, an 85% reduction, which means that they're trying to take advantage of what forests and other plants can provide in helping offset some of those emissions.
So this summer, they've been working on version 2 of their automation and modeling pipeline for this assessment, where, again, they've had some internal functions to help with the tuning and creation of these stacked ensembles. But now they wanna use the broader ecosystem of tidymodels to bring all that together in one cohesive structure. Now, unfortunately, he can't share the actual data they're using for this ensemble project at this time, but the blog post does a terrific job of taking advantage of publicly available data from, actually, not far from your neck of the woods, Mike: tree canopy data from online sources for the city of Boston from 2019.
And the goal of this post is to illustrate, very similarly to how they're adapting the New York Forest Carbon Assessment, fitting two types of models: MARS, which is multivariate adaptive regression splines, as well as gradient boosting models, which, again, are highly popular in the machine learning and prediction space. So the first part of the post talks about how he assembles the predictor data and the outcome data, which, again, I think is very comprehensive. If you're into learning how to obtain these data, Mike definitely has you covered with all the data preprocessing steps and the various packages you need. And lo and behold, once you do some data preprocessing and visualization, you now have a tidy dataset with the outcomes of interest that he wants to predict.
So with that, now it's time for the model fitting. There they've got a cell values object, which is, again, holding the outcome data and the predictor data. One little nugget right off the bat: just like with anything in R, when you're fitting a prediction model, you need some kind of formula object to denote the relationship between the outcome and the predictor variables. And there's a handy little function in base R called DF2formula(), which I wasn't familiar with, which basically is intelligent enough to assume that your dataset has the outcome variable as the first column and the rest of the columns as predictor variables. So you don't have to write the typical formula syntax of outcome, tilde, and then all the different combinations of predictors; it literally takes the data as is and builds your formula object right off the bat. That's another one of those hidden base R nuggets that you see from time to time. It's always great to see those.
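A quick illustration of that helper from base R's stats package; the data frame here is a made-up stand-in, not the post's actual dataset:

```r
# First column is treated as the outcome, the rest as predictors.
df <- data.frame(
  canopy     = c(0.42, 0.55, 0.31),
  elevation  = c(12, 48, 30),
  impervious = c(0.8, 0.2, 0.5)
)

f <- stats::DF2formula(df)
# f is equivalent to: canopy ~ elevation + impervious
```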
Now, in the tidymodels ecosystem, there is kind of a stepwise fashion to how you produce these workflows. The first step is to register the recipe, which is gonna be the building block for the actual fit itself, where you simply feed in the formula and the dataset that contains your observations. Simple enough for the recipes package to do that. Then it's time to define the model specifications, and, again, one of the great things about the tidymodels ecosystem is that it has dedicated packages to help with each of these steps. Now, that may not always work perfectly, which we'll get to later on, but he outlines two specifications, one for the GBM model and one for the MARS model, and, depending on the model type, you may have different sets of parameters to specify.
And for many of the parameters where he doesn't know the best values right off the bat, such as the number of trees or the tree depth on the GBM side, tidymodels, of course, lets you tune and optimize those parameters, leveraging the hardhat package and its tune() function. You'll see in the code that tune() is littered throughout the model specification with no arguments. So it's trying to do a lot for you from an abstraction perspective, which, again, may not always be perfect, but we'll get to that in a little bit.
You've got the specifications and the models now. Now it's time to set up the workflow, which comes from the workflowsets package, where you put in your recipe as well as your two model specifications. With all that integrated together, you get a nested data frame back where you can see, in this case, your two workflows and the information inside, and once you actually run the tuning, it'll capture the results as well. What Mike is singing high praises of is that a lot of this would have taken a lot of code back in the days before tidymodels, when a lot of things had to be built on the fly. Certainly, Max Kuhn had offered the caret package well before tidymodels, and many people used that to orchestrate their flows, but you still had to learn the nuances of how to stitch all this together.
Tidymodels is trying to abstract away a lot of that manual stitching so that you have fit-for-purpose functions to define all this. So, again, we'll come back to the benefits and trade-offs of that later on. And then once it's ready to go, you start composing your resamples of the data as well as tuning your workflow sets, in other words, finding those optimal parameters, with the workflow_map() function, which takes that set of combinations of the recipe and the model specifications.
You feed in a set of metrics and parameters, such as the space for the grid search, and then two arguments that we'll come back to later on, the metrics and the control arguments, which specify some defaults that can be surprising if you're not ready for them. We'll come back to that in a little bit. Once you have that done, it's time to actually run the tuning and see what your best fit is based on your metric of interest. In this example, he's looking at the root mean square error to see which model, or which set of parameters for a model, fits best.
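Pulling those steps together, here's a hedged sketch of the workflow-set pattern. Object names like my_recipe, gbm_spec, mars_spec, and train_data are placeholders standing in for the post's actual objects:

```r
library(tidymodels)  # loads workflowsets, rsample, tune, yardstick, ...

# my_recipe, gbm_spec, mars_spec stand in for the post's recipe and model specs
wf_set <- workflow_set(
  preproc = list(base = my_recipe),
  models  = list(gbm = gbm_spec, mars = mars_spec)
)

folds <- vfold_cv(train_data, v = 10)  # the resamples

tuned <- workflow_map(
  wf_set,
  fn        = "tune_grid",
  resamples = folds,
  grid      = 25,                # size of the grid-search space
  metrics   = metric_set(rmse),  # the metrics argument mentioned above
  control   = control_grid(      # the surprising defaults live here:
    save_pred     = TRUE,        # keep predictions for later ensembling
    save_workflow = TRUE
  )
)

rank_results(tuned, rank_metric = "rmse")  # best fits by RMSE
```

The control_grid() flags matter because the stacking step later needs the saved predictions and workflows.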
And he's able to locate, for each of the model types, which particular configuration wins. It's got a somewhat ambiguous name, like Preprocessor1_Model09, because it's actually fitting different models for each of these combinations, so each gets a unique ID. But he's not interested in just cherry-picking the single best of these. He wants to use an ensemble technique, which takes all these model fits, figures out with some analysis which are the, quote, good enough parameter sets, and then uses those later on in the actual prediction. So he's taking advantage of the information from all these different model types.
And this is a great advancement in the tidymodels ecosystem. They have a package called stacks, which can literally stack together all these different model fits and then see which carries the highest weight in terms of giving the best performance, with objective measures around that. So he's got a nice tidy output here of the top four members contributing the best predictive power, and it's a mix of the LightGBM and the MARS models across the different combinations of the preprocessing and the model fit itself.
And then he can take that and feed it directly into the predict function. That's pretty neat: just predict on this ensemble object he's created with stacks, supply the dataset that has the predictor values, and you get a tidy data frame of predicted values back. Again, another less-code solution to get your predictions. This is a pretty comprehensive start-to-finish flow. And I will say there were some little gotchas along the way that he talks about in the next section of the post. And Mike, the author of the post, has a unique perspective on this, because he was an intern on the tidymodels team years ago. So he's had an inside look, and that makes his critiques here even more fascinating. So, Mike, why don't you take us through that?
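The ensembling and prediction steps just described can be sketched with the stacks API like this. Again, a hedged sketch: `tuned` and `new_predictors` are placeholder names, and this assumes the tuning was run with predictions and workflows saved:

```r
library(stacks)

# Requires tuning run with control_grid(save_pred = TRUE, save_workflow = TRUE)
ens <- stacks() |>
  add_candidates(tuned) |>   # pool every tuned LightGBM/MARS configuration
  blend_predictions() |>     # penalized regression keeps the "good enough" members
  fit_members()              # refit the retained members on the full training data

# A tidy data frame of predictions for new predictor values:
predict(ens, new_data = new_predictors)
```

blend_predictions() is where the "good enough" weighting happens: members whose coefficients are shrunk to zero drop out of the ensemble.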
[00:13:31] Mike Thomas:
Yeah. Mike was an intern back in 2022 on the tidymodels team, and I have to imagine that while that must have given him a huge leg up in converting their legacy code to tidymodels, two years in tidymodels time must be like a decade in terms of the amount of new functionality, packages, and design choices that have been added to the ecosystem since then. I mean, we have survival analysis in tidymodels now, and that was never on the radar back in 2022, at least as far as I could tell. And as we know, that universe of packages within tidymodels has grown. It reminds me a little bit of that recent blog post we had about creating package universes, and, Eric, you and I discussed some of the trade-offs you have to consider when doing that. It also reminds me a little bit of the end of Mike's blog post here, because a lot of these tidymodels packages work together.
As Mike notes, there are sort of three packages that work together just for hyperparameter tuning itself. The tune package takes care of the grid search, the developer-oriented hardhat package owns the infrastructure around hyperparameter tuning, and the dials package owns the actual grid construction. And when you think about hyperparameter tuning and creating these different methods within each of these packages and having them work together, well, not only do you have to do that for one type of model, but I'm assuming you have to create different object-oriented methods for all of these functions, across all of these packages, for all of the models that tidymodels supports. I know on the tidymodels homepage they have a list of all the different modeling algorithms they support, tree-based, regression-based, all sorts of different stuff, and there's a ton within that ecosystem that they support. But I imagine that once you want to add an additional model type, it's a pretty extensive process to ensure you can support it not just in one package but across all of these different packages that work together to create these modeling workflows.
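That division of labor can be seen in miniature at the console. A hedged sketch, assuming the dials and hardhat packages are installed:

```r
library(dials)

hardhat::tune()  # the zero-argument placeholder you drop into a model spec
trees()          # dials owns the parameter object and its default range
grid_regular(trees(), tree_depth(), levels = 3)  # dials constructs the search grid
# the tune package then drives the actual search via tune_grid() / tune_bayes()
```

Three packages, three responsibilities: placeholder, parameter definitions, and the search itself.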
So I am not a user of scikit-learn, I'll be honest. I've reviewed some scikit-learn code in the past, and I know it's very highly regarded in the Python ecosystem. I don't know this for a fact, but I have to imagine the tidymodels team had the benefit of looking at what works well in scikit-learn and what doesn't when they went to move from caret to tidymodels and create this new framework. So I'd be curious to see if Python suffers from these same sorts of issues, or if not, how they're handled in scikit-learn. Because I have to agree with Mike that this is somewhat of a pain point if you're doing some pretty hardcore machine learning and predictive modeling, as Mike and his team clearly are. Right? Creating a lot of different types of models, trying to ensemble them together, trying to tune hyperparameters, and doing it in a way such that the code is as efficient as possible. There's a function in here I didn't even know existed from the workflowsets package called workflow_map(), which I have to imagine is a purrr-like approach to developing workflows across a bunch of different models, hyperparameter tuning, and evaluating and comparing those models in a really programmatic way, as opposed to hard-coding things for each one of these models and then trying to compare and evaluate the outcomes separately. I don't know how far down into the weeds we want to go on some of Mike's specific gripes, if you will. I think complaint is sort of a strong word, but that's what he uses in the header here. I think he's really just pointing out some of the things that you, as a tidymodels user, will face as well the deeper you get into it.
He's also calling out some of the workarounds that worked for him, and he admits that some of this stuff is straight in the documentation and some of it is not; you have to take a lot of time to figure it out yourself. And this is super relatable: I think there was something documented in a package here that Mike spent 26 hours trying to figure out before he was able to understand what was going on.
And it had to do, I think, with the defaults in hyperparameter tuning and some of these workflows that were failing extremely slowly, unfortunately, and weren't surfacing the errors quickly enough to the end user. This is all to say that at the end of the blog post, they're still using tidymodels, because I think, net-net, Mike and his team believe the pros and benefits they've received from switching over to tidymodels outweigh the cons. And like anything with open source software, hopefully some of the complaints and issues they faced are things that will be resolved over the years within the tidymodels ecosystem, making that user experience a little easier. I've used tidymodels many times before and really enjoyed it. There's definitely a bit of a learning curve coming from caret: I think you have to have more of a purrr-like, higher-level design thought process in mind when you're leveraging tidymodels, and an understanding of how all these different packages, like parsnip, rsample, yardstick, and workflows, work together to accomplish what you're trying to accomplish. But it is super powerful, and I'm glad to see the team is sticking with it. This post is a wealth of information around tidymodels and a great crash course, if you're trying to get into the weeds of machine learning in R, on some of the design choices Mike and his team made to create these models and evaluate them programmatically.
It's fantastic. I'm not sure, Eric, if I've seen a blog post recently that goes into this level of detail within tidymodels. So, a very welcome blog post. I think it's a great discussion, not only technical but also practical, from a team perspective, on what's worked well for them, what hasn't, and where they're planning to go in the future. I was wracking my brain as you were walking through this, and I
[00:20:38] Eric Nantz:
don't recall one in recent months, or probably even the past year, that's anything this comprehensive. There's a great mix here. One of the points he mentions towards the end is that you may think you can hone in on one particular aspect of tidymodels, but he's saying it really took him having a holistic view of how these different pieces fit together. That may be an issue for those who are new to this suite of packages that have a cohesive API, or an opinionated way of integrating together.
Now as I say that, you may be thinking to yourself, well, that sure sounds an awful lot like the tidyverse itself. Right? Certainly they got inspiration from the tidyverse on a few things. But when I do data processing pipelines with the tidyverse, most of my time is spent with a core set of maybe two packages, maybe three at the most, like dplyr, tidyr, and purrr to help with some mapping and processing. Oftentimes I don't have to get into the weeds of the other packages, though sometimes I do. So it always helps, as you're building these use cases for yourself or maybe for your team, to document these intangible learnings, because as comprehensive as the documentation for any given package might be, it's how they integrate together that matters.
And certainly the tidymodels team has done great work to put freely available online books out there about all the different ways tidymodels can be used. Max Kuhn, Julia Silge, and others have been very front and center with that, and we highly recommend you check out the tidymodels site to get links to those. I do think, though, that a post like what Mike has done here touches on things that are on the more practical side. And I have been a victim myself, as someone who uses HPC systems on a weekly basis.
Oftentimes, with jobs that won't complete in a day, or sometimes two days, it can be costly when you thought you had a default set right and then find out after the fact that you forgot to save that prediction result, or forgot that one little adjustment to p-values in my case, and whoops, gotta go back and rerun it. So there are some things you can do to minimize the impact of that, and I don't know how well they carry over to tidymodels, but what we tell people on our team is: if you have a simulation pipeline and you wanna do, say, 10,000 simulations, you know it's gonna take a while.
You really only wanna do a few of them first to make sure you've ironed out all your connections and all the outputs you're saving, so that you're not surprised after running them for all that time. So I had the feels when I read that part of Mike's exposé. I've been there many, many times. These are all things you kind of have to learn by doing, but documenting your learning process is so helpful. And I do think there's going to be tremendous value for those adopting tidymodels in their pipelines right now to see literal real-world usage of this and the lessons along the way.
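That pilot-run habit can be sketched in plain base R. Everything here is illustrative, not from the post:

```r
# Run a small pilot first to shake out bugs in outputs and saving,
# then scale up to the full simulation count on the cluster.
run_one <- function(i) {
  # stand-in for one real simulation
  data.frame(sim = i, estimate = mean(rnorm(100)))
}

n_sims  <- 5  # pilot; bump to 10000 once the pipeline is proven out
results <- do.call(rbind, lapply(seq_len(n_sims), run_one))

# Save results as you go so a late failure doesn't cost the whole run
out_file <- file.path(tempdir(), sprintf("sim-results-%d.rds", n_sims))
saveRDS(results, out_file)
```

The point is that the pilot exercises exactly the same code path, including the save step, before you commit the cluster hours.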
I'm confident that the tidymodels team will take this hopefully constructive feedback, and maybe we'll see some enhancements to these more use-case-oriented approaches in the documentation, and not just the developer-facing material you might see in the weeds of a package manual or when you run ?tune. You're not really getting the full picture at that point. So maybe we'll see improvements there. But again, this is kind of the unofficial contract you sign when you leverage a suite of packages that are meant to be coupled tightly together: there may be cases where the abstraction doesn't give you the full story, and you need to get into the weeds a little bit like Mike has done here.
[00:24:52] Mike Thomas:
I agree, Eric. And that's a great callout to the Tidy Modeling with R book. If you're interested in tidymodels, or if you're stuck on something, that can be a great resource. We'll put it in the show notes, but it's tmwr.org.
[00:25:06] Eric Nantz:
Couldn't be easier to get to. That's right. Yeah. And it should be on your virtual bookshelf, or even your printed bookshelf. It's a valuable resource. And the ecosystem is always evolving, too; I think they do quarterly updates on the tidymodels blog, and I've seen Max and Emil and others do a great job writing those up. But yeah, what Mike has done in this post is of tremendous value to the community. Alright. We talked about looking at things from a developer perspective and a user perspective. Now we're gonna put our dev hats on, Mike, because we're gonna talk about what has been a very hot topic these days and how it applies to package development workflows.
If you recall, about a month or so ago it was kind of quietly put out there that there's a new IDE authored by Posit called Positron. And the geek in me cannot ignore the fact that the name of my favorite movie ever is in the name Positron. Take that for what it's worth; I have visions of the MCP talking to me right now. End of line. But in any event, it's really interesting to see the uptake on this, and I dare say we'll be hearing a lot more about it at posit::conf a week from now. What we're seeing here is a post from Stephen Turner who, coincidentally, when I started my R journey many, and I do mean many, years ago, there were two blogs I discovered that helped me, especially coming from another language like SAS and trying to make heads or tails of what R was doing under the hood.
Stephen Turner, the author of this post, wrote one of them, a terrific blog called Getting Genetics Done that was so instrumental in my learning journey with R, and it's just terrific to see him resurrecting this effort in a new blog. This post is one of the latest he's put on it; I believe the blog launched earlier in July. So a huge thank you to Stephen. I don't think I've met you personally, or if I have, it must have been years ago, but you have been very instrumental to my journey with R, and it's terrific to be able to cover one of your posts here in the highlights. What he talks about in this post is his early adoption and user experience of Positron, trying to mimic what he's done in the RStudio IDE, and before that Emacs with ESS, over the years: building an R package. What is the experience like in that space?
So we're gonna dive into some of his findings here in the highlights now. First of all, if you're not familiar with how Positron operates: Positron is actually a fork of the open source version of Visual Studio Code, called Code OSS. So when you look at Positron, it's gonna look different than the RStudio IDE, but that's by design, because it literally is the VS Code shell with Posit's design choices on top to bring in some of that RStudio IDE functionality. Not quite all of it, because this is a beta product, which we'll get to in a little bit. He puts in links to great resources about Positron if you're new to it, such as the wiki on their GitHub repository; Appsilon has done a nice intro to Positron, as has Andrew Heiss, who's been frequently featured on the highlights, with his experience of Positron too. So definitely have a look at those after you listen to this, but let's get to the actual package development workflow.
So in RStudio, what do we typically do? We like to create a new project to house our package code, and Positron brings its own spin on this as well. This is, again, one of their additions on top of the Visual Studio Code-like experience: they let you choose from three different project types, either a Python project, an R project, or a Jupyter notebook. So right off the bat, you've got a little wizard to guide you along the way. Of course he chooses the R project, and you get that familiar-looking R project file created for you, just like you would have in RStudio. So there's already a familiar part of the experience.
And then how do you actually create a package? Many of us are now using the usethis package to create a package from scratch in that same directory. He does that, no gotchas there. He's got the scaffolding right off the bat in a package he's calling hello, and then he puts in a simple function called isay. It's kinda like your hello world type of function, but with a little sampling of different strings in there. He writes that function, and now we get to some of the differences, because there is a very convenient feature in the RStudio IDE that I use every time I make a new function for a package.
There is either a keyboard shortcut or a menu entry to dynamically insert a roxygen skeleton of the parameter documentation right above your function, just a click or a keyboard shortcut away. Unfortunately, that's not in Positron yet, so you're gonna have to write out the docs yourself. Of course it's not too difficult, and you'll get code completion, but it is just one of those conveniences that hasn't quite been replicated in the Positron experience yet. Nonetheless, he's able to document his function, and now comes the iteration. Right? When you're writing a package, you wanna develop your function, test that things are working, and update the documentation and manual pages dynamically.
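For reference, here's roughly what that roxygen skeleton looks like once typed out by hand. The function body is a made-up stand-in for the post's isay(), not its actual code:

```r
#' Say hello in a random style
#'
#' @param name Character scalar: who to greet.
#'
#' @return A length-one character vector.
#' @export
#'
#' @examples
#' isay("world")
isay <- function(name) {
  # pick one of a few greeting strings at random
  greeting <- sample(c("Hello", "Howdy", "Hiya"), 1)
  paste0(greeting, ", ", name, "!")
}
```

In RStudio the shortcut pre-fills the @param, @return, and @export tags for you; in Positron, for now, you type them yourself.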
What's nice about Positron is you can import a keyboard mapping of shortcuts that mirrors very closely what you're used to from RStudio's keyboard shortcuts. You can optionally import that, and then use that familiar Cmd+Shift+D or Ctrl+Shift+D to populate the documentation manual pages on the spot; that works just right here as well. So that's gonna take care of regenerating the manual pages, and it's gonna take care of putting the function name in the NAMESPACE for exporting, so you don't have to do any manual effort on that front. Just knowing the shortcut or the command palette, you'll be able to do either one of those.
Another interesting thing: from time to time, we like to install the package into our local environment as we're iterating on it. Before, in RStudio, it would call R CMD INSTALL verbatim. In Positron, it's actually leveraging the pak package, pak::local_install(), which, again, I didn't know existed. So a little learning there. It's no surprise that Posit would use some of their own tooling from the r-lib suite of packages to automate some of these processes behind the scenes. Fair play to them. Of course, as you develop a package you're gonna see your fair share of warnings, if you need to update documentation names or the license entry or other things like that, and you get all that in the console just like you would with RStudio.
So no surprises there. You can run devtools::check() and get the results right there. You can build your tests with testthat, or use usethis to scaffold a test file, and it'll open that up right in the IDE. Again, we're seeing a lot of similarities. There is one other thing that's missing from a development perspective that you might use from time to time: the covr package, which helps you look at the test coverage percentage for the functions you develop. There's a handy RStudio addin in the RStudio IDE that lets you run that report quickly from a menu click. As of this recording, Positron doesn't support RStudio addins yet. I think that's something being worked on, because even in my Visual Studio Code experience developing R projects, I can get addins working, thanks to the efforts of Miles McBain and others in the community. So I think that's on the roadmap, but it's something to be aware of if you wanna adopt Positron right away.
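The whole iterate-document-check loop just described boils down to a handful of console calls. A hedged cheat sheet, assuming devtools, usethis, pak, and covr are installed, and run from the package project root; "isay" is the example function name from the post:

```r
# Typical R package development loop, from the package project root:
devtools::document()       # regenerate man/ pages and NAMESPACE (Cmd/Ctrl+Shift+D)
pak::local_install(".")    # what Positron runs to install the local package
devtools::check()          # full R CMD check with readable console output
usethis::use_test("isay")  # scaffold tests/testthat/test-isay.R
devtools::test()           # run the test suite
covr::package_coverage()   # the coverage report behind the RStudio addin
```

The same calls work identically in RStudio, Positron, or a plain terminal R session, which is part of why the IDE swap is mostly painless.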
So for most of the actual workflow, package development looks pretty seamless with Positron. Now, there are some additional things missing from the Positron experience that I leverage heavily when I put my VS Code hat back on, namely the remote container and remote connection functionality, meaning I can have Visual Studio Code on one system, but have another server, either on my local LAN, my HPC environment, or in the cloud, that actually runs the R process, and then I can farm out my computations to it while it looks as if everything is local. That's not quite there yet in Positron. They are working on it, and that would be a game changer for me personally when they adopt it. But what was interesting is what Stephen was able to do here: little did I know, on top of writing great content about R, he's actually authored a package to help build a Dockerfile from your package project, called pracpac.
Say that three times fast. Nonetheless, this was new to me; I've only been familiar with the dockerfiler package by our friends at ThinkR. But this looks pretty nifty, and he even has a link to the paper about the package as well, which we'll put in the show notes. He was able, even in Positron, to use the pracpac package to create a Dockerfile with R baked in without any fuss. So at least he can get the Dockerfile going; he just can't do remote container development with that Dockerfile in Positron. But hey, it's great that he can get a Dockerfile in there if you wanna throw this into another environment with those same dependencies, at both the system level and the R package level, ready to go. And he says the experience was very smooth, because Positron does take advantage of most of VS Code's features, such as syntax highlighting for bash scripting and other file types. So if you're in a multilingual environment, Positron is definitely gonna be a big help.
So, in a nutshell, things are looking promising. But again, this is beta; be warned about that. I do think the seeds are planted. And if you want to put on your speculation hat with me, Mike, Stephen kind of hints at this in a comment in the post: it looks like a lot of the development energy is going to Positron now, so we may have to put a little toast out there to our friendly RStudio IDE, because I don't know if its days are numbered. I guess we'll find out next week. What do you think?
[00:36:42] Mike Thomas:
Yeah. Maybe we'll find out next week. I know some folks have raised that question, and the messaging thus far, I think, is that RStudio is gonna continue to be supported. You know, there are a lot of different skill levels of R developers. Right? There are R beginners, and there are a lot of folks who are probably intermediate, who are consumers of R packages and do a lot of their ETL work and analysis in R, but maybe aren't developing R packages or doing anything much more hardcore than analysis. I think those two groups, the beginners and the intermediate R consumers, are probably a large portion of the R community out there.
I'm not sure how much incentive they have to move from RStudio. I don't know if Positron feels like a more welcoming environment than RStudio, but I'm super biased, because RStudio was probably one of my first ever IDEs. And then I looked at VS Code for a long time; it looked so scary to me when I originally booted it up for the first time and tried to do some work in it. I had no idea where anything was. But that's probably just my bias from being so used to RStudio and its layout and where the features are. So I don't really know the answer to whether someone who starts out in RStudio versus starts out in Positron is able to make the best progress the quickest.
But it would be interesting to see how that plays out for new users, whether they migrate directly to Positron in the future or still start out with RStudio. But as you said, Eric, for those of us that need some of these maybe niche features we get in VS Code, like remote SSH, so that you can feel like you're working locally while actually SSH'd into some remote environment, that's pretty powerful. I think that's probably going to come fairly soon to the Positron ecosystem. When I was looking through the issues and discussions on GitHub, that one looks fairly promising in terms of Posit's ability to actually get it incorporated.
The other one that hurts me, which we've talked about before, is the idea of the VS Code dev containers that allow you to develop as if you're local while you're actually in a containerized environment. That's huge for collaboration on our team, but I fully understand that that VS Code extension is proprietary Microsoft code. So there's not an easy way for Positron, or any other IDE for that matter, to recreate it unless Microsoft someday decides to open source how they go about doing that. So I don't know which will come first, whether Microsoft will open source it or Positron will incorporate it, but that's a big one for me that I would love to see incorporated. But again, these are niche things that the majority of the R community may not necessarily care about. And some of the trade-offs here, and the things we're getting from Positron, I think are huge benefits for potentially a lot of that intermediate class. So it'll be interesting to see the adoption of Positron, what takes place as it moves out of beta, and what significant quality-of-life changes get incorporated for folks.
But one of the issues I imagine Posit is up against is accommodating so many different levels of R users. Yes, it's tricky. So I don't envy what they're trying to do; it's a large-scale problem to solve. But I do think the fact that VS Code is open source, and allows Posit to tailor it to whatever they want it to be, is powerful in and of itself, and I guess another testament to open source. So we'll see how things progress here. I love the fact that folks are starting to dive into it and pick up on its strengths and weaknesses for the rest of us, so we can get up to speed early as we adopt it.
So kudos to Stephen as well for doing that.
[00:41:35] Eric Nantz:
Absolutely. First, I do want to mention that I used my fair share of attempts at IDEs before RStudio, and they were hit or miss at best. I remember one in particular on Linux called RKWard that tried to do a lot of things, and others tried to make the Eclipse IDE fit with R projects. Oh boy, that was gnarly, buddy. If anybody listening remembers that, give me a shout, because I can share stories with you about it. I knew at the time, as I was trying to learn it, that as R was taking more adoption in industry, it was gonna be a tough sell to get people to develop in those kinds of IDEs, much less at a command line. I mean, hey, I love the command line as much as anybody, but it's not for everybody. Right? So when RStudio came out, you can't overstate the influence it had on those getting new to R and new to data science: having this cohesive experience, everything in one place, so to speak, in your development journey.
But boy oh boy, did it help with adoption. It certainly did in my industry. So that's why I'm cautiously optimistic that Positron will eventually be tailored in a more usable fashion to the use cases of those new to the language who aren't interested in, say, package development. They just wanna get their data science done, get their statistical model fits done, get that Quarto report out there, and get on their way. I think the bones are there; it's just gonna be a little while. With this soft beta they rolled out, their issue tracker is quite extensive now, last I checked. There's a lot on there, so it's a lot for the team to prioritize. But yeah, we shall see. There is one little nugget more: speaking of open source, there's always the issue of licensing. Right?
Little hidden nugget here in what Steven mentions at the end, which I think we should call out, is that Positron is not using things like GPL, not using Apache. It's using what's called the Elastic License 2.0, which you may not have heard about unless you're really familiar with all the nuts and bolts of software licenses. But let me read the blurb that he calls out here, because I think I may know the backstory about it. It says: you may not provide the software to third parties as a hosted or managed service where the service provider provides users with access to any substantial set of the features or functionality of the software.
There have been vendors, I won't name names, that have bundled the open source version of RStudio into their, what I call, platform-as-a-service offerings. And I've heard, unofficially, I won't put names on this, that Posit was not too thrilled about this business practice. So it does not surprise me that they would take this next step with this fresh start to put this in there, but that may change where Positron can actually be integrated. So I guess we'll watch this space, but that's something to watch out for if you are in that kind of software provider space. So that was a little nugget I didn't expect to see, but we'll stay tuned on that.
[00:45:09] Mike Thomas:
Yeah. It's interesting, too, Eric. That's why we have you on the podcast, because you know the backstory around all of this. You have, you know, environments like the one that Steven works in where you may need to host something like RStudio Server. Right? Mostly internally, or maybe you're doing some sort of consortium collaboration, right? And you wanna stand that up in a way that is easily accessible, you know, cloud hosted, something like that, to a group of individuals, maybe internally or externally. So I would love to see this license say something like: you're not allowed to resell Positron in that fashion, but if you wanna stand it up for free and not make any money off of it, in a similar fashion to how you would, you know, stand up RStudio Server, then go for it.
But this definitely, as Steven mentions in his case, and he's not a lawyer, but it looks like Positron's license may preclude the ability to do something like that. Yeah. That's,
[00:46:23] Eric Nantz:
that's looking more likely. Maybe, again, these will be things we hear more about in between sessions at next week's posit::conf. But if you're in tune to this space of those in the data science platform-as-a-service area, let me just say that if you're a fan of pizza in the US, you can probably guess the name of the company I'm thinking about. I'll leave it at that. But, nonetheless, this was a very comprehensive post by Steven. And, again, it's a pleasure to have him on the highlights, and I'm really looking forward to seeing how he continues to use Positron in other efforts like this, as I'll be testing the waters a little bit in my continued Shiny development. There is one thing that I hopefully can talk to the Posit folks about next week.
Please, please make a Nix package for Positron, please, and then I'm happy. Hey, it's open source. Pull requests are welcome. Yeah, I know. I don't know, Bruno, if you're listening. Maybe we need to talk about it. Maybe we need strength in numbers on this one.
[00:47:29] Mike Thomas:
There you go.
[00:47:30] Eric Nantz:
Well, now, strength in numbers is also a way to describe the benefit of R Weekly itself. It is the strength of the community that populates every single issue. So we'll take a couple of minutes to talk about our additional finds here. Now I'm gonna talk about something that I totally did not expect to talk about, especially in the circle I operate in. We have a new package that's been released in the R ecosystem called maestro, positioned to create and orchestrate data pipelines in R.
I know a little bit about pipelines because I happen to work closely with the author of targets himself, Will Landau. So, of course, the first thing I'm thinking is: what is maestro all about? We'll put a link to this in the show notes, of course, but at a high level, maestro is definitely positioning itself in the realm of data processing. I don't envision it's trying to encroach on the comprehensive ability of targets to be flexible across the many types of analytical workflows that we often see in data science and operational automation efforts. But the nuts and bolts of maestro are kind of fascinating: they're taking advantage of roxygen tags on a function that you create for your data processing or ETL step to define characteristics such as the schedule of when the pipeline will run, like the frequency and the actual start time of that job, and there are other decorators as well. And then once you have your functions decorated, you use the functions built into maestro to build that schedule and then to execute it. They do leverage other packages that help with, you might say, more of the high-demand computation, such as the future package and the furrr package, if you wanna have that map-like functionality for your running schedule. It looks like it can interface with those quite a bit, but I don't think this is an either/or. When I saw maestro get mentioned, I went immediately to their package site, and they have an article on their motivation for creating maestro.
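To give a concrete flavor of that decoration pattern, here's a minimal sketch. The tag names (`maestroFrequency`, `maestroStartTime`) and the `build_schedule()`/`run_schedule()` functions are my recollection of maestro's API, so double-check them against the package site before relying on this:

```r
# pipelines/daily_extract.R
# In maestro, a pipeline is just a plain R function decorated with
# roxygen2-style tags that describe its schedule.

#' Hypothetical ETL step: pull yesterday's records and persist them
#' @maestroFrequency 1 day
#' @maestroStartTime 2024-08-05 09:00:00
daily_extract <- function() {
  # ... fetch from an API or database, then write out the result ...
  invisible(TRUE)
}

# orchestrator.R (run by cron, GitHub Actions, etc.)
# library(maestro)
# schedule <- build_schedule(pipeline_dir = "pipelines")  # parse the tags
# run_schedule(schedule, orch_frequency = "1 day")        # run whatever is due
```

The orchestrator itself is deliberately dumb: each time it wakes up, maestro checks which decorated functions are due based on their tags, which is what lets it hand the actual parallel execution off to future/furrr.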
And, of course, there is a section on comparisons to related packages. I was pleased that they talked about targets there, because I would have reached out to the author right away if I hadn't seen it. It looks like they are interested in seeing where maestro and targets can complement each other because, at least in some of the targets critiques I've heard over the years, it's a little more difficult for these more ETL-style pipelines, especially the extract portion of data processing, and especially if you're interfacing with databases or APIs that you're ingesting data from. There are ways you can use targets for that. You just have to get a little more into the nuts and bolts of how you orchestrate your return objects and whatnot. So I'll be very interested to see if this ends up being a cohesive relationship or not. But with that said, if you're in this space of producing comprehensive data pipelines, maestro may be worth a look.
[00:50:42] Mike Thomas:
Absolutely, Eric. I'm very interested to see how that evolves. That sort of came out of left field for me; that maestro package wasn't something that I had on my radar, but it's really, really interesting, you know, especially as we've been getting into a lot more targets projects lately, a lot more of these sort of pipeline types of projects. So it'll be, you know, a potential new tool in our tool belt as well. I have sort of a different highlight that I wanted to call out. It is an interview the R Consortium published and held with Pharma RUG China organizer Zhou Xu, who spoke with the R Consortium about how he has grown the R community in China, you know, specifically in the pharma space.
I think something we don't necessarily talk about enough is how international R is. A couple cool facts about Zhou: he has his PhD in statistics, which he studied for in New Zealand, which we know is the birthplace of R. He's worked at Roche for the past 4 years, where he helped open source 30 software packages. There are 4 that are called out here. I'm not super familiar with them, Eric, but you might be, you know, being in the pharma ecosystem: formatters, rtables, rlistings, and tern. Oh, yes. Sounds like there are many more beyond that.
This R community in China has done a lot of collaboration and, I think, is involved pretty heavily in the Pharma RUG, the pharma R user group, in general. And there's just a lot of great talk about how Zhou has helped organize a lot of events, a lot of hybrid events, and how he's been able to pull those off using Microsoft Teams as well as, you know, hosting these hybrid events in person. So I thought it was a really nice interview for anyone who's looking to learn about organizing R groups in their neck of the woods; you might get something out of this interview.
[00:52:46] Eric Nantz:
Yeah. I'm really glad to see this because we are seeing a lot more momentum from our colleagues in the Asia Pacific region adopting R in life sciences. This is hugely instrumental to making that journey hopefully easier for those that are on it. And, you know, very closely related to this, some of you know I'm one of the organizers of the annual R/Pharma virtual conference. One thing that we've always struggled with is how to best accommodate our fine friends over in the Asia Pacific region, because we're mostly a US-based conference in terms of time zones and the way we stream our talks. Well, I'm happy to say that, alongside reading this post, if you are in the Asia Pacific region, there is going to be an R/Pharma event tailored specifically to Asia Pacific.
So we're gonna have a link to that. It's actually in this week's R Weekly issue that the call for talks is open for that particular effort. We'll put that in the show notes just in case too. Jonathan Carroll has actually been quite nice to advertise that on Mastodon and whatnot. So, yeah, this is a wonderful time, if you're in the life sciences space in the Asia Pacific region, to dive into this momentum. I liken it to when R/Pharma was first introduced in 2018. None of that had been built before. Now we're trying to expand the scope of this because we wanna give everybody across the world a unique opportunity to share their learnings and learn from each other along the way. So, yeah, it's gratifying to see the R Consortium team keep on spotlighting these great use cases. And, yeah, I'm very excited to see where the future goes here.
And, of course, you can't just be excited about that. You gotta be excited about the rest of the issue. There is a boatload of additional content here spanning the full gamut of new packages, many of which we wish we could talk about here today, but there's only so much time in a day. There are lots of great tutorials as well. I see a full gamut of spatial visualizations, tidyverse pipelines, even a great post by Nicola about creating typewriter-styled images. I think that's great for my retro feels as well. Love to see that. So much more content there. Where do you go for it? If you don't know by now, you know now: it's rweekly.org. You bookmark that; every single week we've got a new issue for you. And if you wanna contribute to the project, we rely on your contributions and the community's.
So, please, if you find that great new blog post, maybe you wrote that post, or you find a great package or a great tutorial, send that to us via a pull request, directly linked at the top right corner of the home page. It's all Markdown all the time, right? Markdown is how I live when writing my content, and for some internal presentations, having to go back to PowerPoint just doesn't feel right. I feel right writing in Markdown, in Quarto or R Markdown, and I'm gonna put my foot down on that. Hopefully that becomes more of a trend in my industry, but I digress. Either way, it's all Markdown all the time, so you can send your pull request right there. We'll be glad to merge that in for the week. And, also, we love to hear from you as well. We've got a contact page linked directly in this podcast episode's show notes. Did a little HTML hacking to get that together after we moved our podcast hosting, so hopefully it works out for you. But, also, you can get in touch with us on social media.
I am on Mastodon these days at @[email protected], also on LinkedIn. Send me a shout there. And, again, if you're gonna be at posit::conf, I will be there at this time next week for sure, so we hope to hear from you. And, Mike, where can listeners get a hold of you? You can find me on Mastodon at @[email protected].
[00:56:30] Mike Thomas:
You can find me on LinkedIn just by searching Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, to see what I'm up to lately. Or, Eric, as you said, find me in person next week if you're gonna be in Seattle. We'd love to chat with you. Yep, I'm gonna do my best to wear some kind of R-related swag shirt every single day, so I'm easy to spot.
[00:56:49] Eric Nantz:
But who knows? Maybe everybody's wearing R swag, so it may not be easy to spot.
[00:56:53] Mike Thomas:
I'm gonna print up some R Weekly Highlights podcast t-shirts for us. Oh, yeah. Because this is an audio podcast. People might not have a single clue what I look
[00:57:05] Eric Nantz:
like. Yeah. I still remember one time, at the first Shiny Dev conference, I was asking a question, and somebody looked over and said, hey, I know that voice. So we're gonna get, I'm sure, our fair share of that. Nonetheless, we could blab around all day. We're always excited to talk about this stuff, but we gotta close up shop here. We've got our day jobs to get back to, but we're very happy you joined us today for listening to this latest episode of R Weekly Highlights. And, again, we will not be back next week because we'll be at posit::conf, but we look forward to connecting with you all again with a new episode 2 weeks from now. So until then, goodbye, everybody.