Our candid takes on the state of CRAN's role in light of recent package archival events, how creative use of LLMs could greatly streamline your next literature review, and a few great illustrations of laziness being a good thing in your R session.
Episode Links
- This week's curator: Eric Nantz - @[email protected] (Mastodon) & @rpodcast.bsky.social (BlueSky) & @theRcast (X/Twitter)
- Is CRAN Holding R Back?
- How to use large language models to assist in systematic literature reviews
- Lazy introduction to laziness in R
- Entire issue available at rweekly.org/2025-W08
- A guide to contributing to open-source Python packages https://arilamstein.com/blog/2025/01/02/a-guide-to-contributing-to-open-source-python-packages/
- Continue https://www.continue.dev/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Gurgling Desert Pond - Final Fantasy Random Encounter - blackguitar - https://ocremix.org/remix/OCR02555
- Hotel Rhumba - Earthbound - The Pancake Chef - https://ocremix.org/remix/OCR00526
[00:00:03]
Eric Nantz:
Hello, friends. We're back with episode 196 of the R Weekly Highlights podcast. Oh my goodness. That's four away from the big 200. It's sneaking up on us, folks, but in any event, this is the weekly show where we talk about the awesome highlights and additional resources that are shared every single week at rweekly.org. My name is Eric Nantz, and I'm feeling a little more decent now, but I will admit I've been under the weather lately. So I'll do my best to get through it, but I'm feeling good enough right now. But this is definitely one of those times where, more than ever, I am so glad I don't do this alone, because I've got my awesome cohost,
[00:00:40] Mike Thomas:
Mike Thomas, to pick up the pieces when I fall apart. Mike, how are you doing? I'm doing pretty good, Eric. I feel about 50% healthy. I'm fighting it too. So hopefully your 50% and my 50% can add up to a hundred, if that math checks out. I do want to do a shameless plug, and we didn't even talk about this pre-show, but were you on another podcast recently that anybody can check out?
[00:01:03] Eric Nantz:
I will be. Unfortunately, that host got sick too, so we had to postpone. So much for talking about it pre-show. No. No. It's all good. It's all good. I will definitely link that out when it's published, but that'll be a fun episode on the Dakota radio program. So stay tuned on my social feeds for when that gets out. But, yeah. Apparently, it's going everywhere, even across the country too. So, nonetheless, we are good enough for today. And, let me, you know? Gosh. It's been such a whirlwind week. I gotta figure out who curated this issue. Oh, wait. Oh, wait. Look in the camera, Eric. Yeah. It is me. It was me that curated this week's issue. As always, great to give back to the project, and this was a very tidy selection.
And as always, I had tremendous help from the fellow R Weekly team members, and all of you out there with your great pull requests and suggestions. We got those merged in, and some of those ended up becoming a highlight this week. So without further ado, we're gonna lead off today with arguably one of the more spicy takes we've had on the show this year at least, and maybe even the past year as well. And this is authored by, I'll just say, a good friend of mine from the community, Ari Lamstein, who I've met at previous Posit conferences. And in fact, I have fond memories; he was actually one of the students at one of my previous Shiny production workshops. So he's always been, you know, trying to be on the cutting edge of what he does, and he does consulting. He's done all sorts of things.
But this came into my feed around the time of the curation this week, and he has this provocative title: is CRAN, CRAN being the Comprehensive R Archive Network, holding R back? So let's give a little context here, and then Mike and I are gonna have a little banter about the pros and cons of all this. So Ari has authored a very successful package in the spatial, you know, mapping space called choroplethr. I hope I'm saying that right. But he's maintained that for years and years. He would say that it is in a stable state. He did rapidly iterate on it over the years in the initial development, especially when it was literally part of his job at the time to work on this package.
But, yeah, it's been in a stable state, you know, along those lines of if it ain't broke, don't fix it. Well, it was about a month or so ago that he was informed that his package was going to be archived. Meaning, and for those who aren't aware, when CRAN archives a package, they will not support the binary installation of that package with the typical install.packages function out of the box, because of some issue that has been raised by an R CMD check, or maybe another issue that the maintainers think should have been resolved and ends up not being resolved.
Now you may ask, well, what was the issue with the package? It wasn't with Ari's package. It was with a dependency package called acs, which I've not heard about until now. Apparently, that got a warning. And what was the warning about? Well, if you go digging into the acs package's page on CRAN, where it gives a link to the R CMD check results, try this one on for size, folks. It was a note; it was not a warning. It was a note.
[00:04:38] Mike Thomas:
Right.
[00:04:39] Eric Nantz:
And it says: configure: /bin/bash is not portable. Mike, when did R include a bash shell? Do you know? Did I miss something here?
[00:04:52] Mike Thomas:
I must have missed it too. I mean, I have to imagine that this is when the, right, this is when the checks run, so it probably has something to do with the runner itself and the Linux box where this is being executed. This may be above my pay grade, but that is not an informative message.
[00:05:10] Eric Nantz:
And it literally has nothing to do with R itself, or even another language that a package could be basing off of, such as C++ or Fortran or the like. So Ari, you know, he did not like this, and, frankly, I can understand where he's coming from here. And he decided that, well, choroplethr is not at fault here. And you know what? I'll let the chips fall where they may, and it is now archived. So his package is now archived because of the acs package being archived. I personally don't know who authored acs, not that I really care to for the purpose of this discussion. It has been archived, and now choroplethr is archived as a result.
So Ari, in the rest of the post, has some pretty candid viewpoints on the state of CRAN at this time. I will admit he's not alone in some of the recent discourse I've seen online. I've heard community members like Josiah Parry, who I have great respect for, voice some qualms about some recent notes he's received from CRAN as he's been trying to either add a new package or update an existing package. And Ari, you know, really is being provocative here about what impact CRAN is having now on the future growth of R itself.
And then Ari has lately, in the last, I would say, few years, been working on additional projects in Python. So he does compare and contrast how CRAN compares to Python's arguably default package repository called PyPI. Now let's have a little fun with this, Mike. I'm going to play, hopefully, the role of the good cop, supposedly: why CRAN, even with these issues, is still a valued piece of the R ecosystem, and maybe this is just a very small blip in the overall
[00:07:20] Mike Thomas:
success of it. And then your job is gonna be to talk me down on this. So, are you ready? Buckle up. Yeah. We can do that. And I think as a disclaimer, I would say, and I don't wanna speak for you, but both of us have mixed feelings on both sides of the fence. Yes. But I think we'll go through the exercise of good cop, bad cop here. I think that's a great idea. Yep. So
[00:07:43] Eric Nantz:
I've been a long-time R user since 2006. So I've known the ecosystem for a while. I dare say that as somebody new to the language, as I was getting familiar with base R, I would suddenly have courses in grad school that talked about some novel statistical methods, and my dissertation on top of that. Without the CRAN ecosystem, I wouldn't have had a way to easily, once I did my, say, lit review or other research about the methodology I needed, find that package and be able to install it right away in my R session.
And that package being authored by leading researchers in those methodologies, and the fact that this was a curated set, meant that I could have full confidence that the package I'm installing is indeed going to work on my system, and that it has been approved by a curator-type group, which is what the CRAN team is, so that I could use it in my research. Other programming languages do have a more automated system where, yeah, you can throw any package on there, but you have no idea about the quality of it. You have no idea if it's going to destroy your system sometimes. You have no idea if it's even doing the right thing, statistically speaking.
In my opinion, one of the great advantages of R is indeed that CRAN is this targeted, stable set that's very akin to, say, the Debian-type philosophy in the Linux world. It's stable. You can rely on it, it will not break on you, and they are making a concerted effort to make sure that this is a reliable network for you to install packages from. And I don't think R is where it is today without CRAN having that curated group behind the scenes to make all this happen. What do you think?
[00:09:51] Mike Thomas:
Yeah. So I think that maybe the times have changed a little bit. And I think that maybe five-plus years ago, people were developing packages not via GitHub, this GitHub package development workflow that really exists now, I think, across the space. You know, sometimes you'll go to an old package and you'll search for it on Google or your favorite LLM, I guess, these days. And you'll try to find, you know, the GitHub repository that sits behind that package so you can take a look through the code, and it doesn't exist. You're just taken to, like, the PDF, right, of the package itself. And that's very frustrating nowadays, but I guess that's probably reflective of maybe the workflow that used to exist, you know, five or ten years ago, where we didn't really have this GitHub package development driven workflow. And this is something that was raised on Bluesky by Yoni Sidi. They said, back in 2016, that they thought the R community was sort of setting itself up for problems by not building this infrastructure and software development life cycle geared towards a focused GitHub package development, where sort of most packages are developed right now, and where CI/CD can be set up and all sorts of things like that. So I think, unfortunately, in a lot of ways, CRAN has not caught up with the times.
And then just some of the rigor and inflexibility that they have, I think, can be construed as over the top, for lack of a better word. I think archiving a package because of a note seems absurd. And I think that the time windows for folks to fix these things are unnecessarily short. I remember ggplot2, it was a year or two ago, was almost archived due to a dependency, you know, that had been archived. And ggplot2 has, like, a whole entire company behind it that can work on trying to rescue that. They have relationships with the CRAN maintainers, you know, the volunteers that work on CRAN, that the rest of us do not have. Right? We're limited to emailing back and forth, and I'm sure anybody that's done that before, you know, has struggled with that, for lack of a better word, in some sense.
So I think we have this dichotomy where if you are an R user, yes, CRAN can be very beneficial. But if you are an R developer of packages, it can potentially cause, you know, more headaches than it solves.
[00:12:37] Eric Nantz:
I definitely resonate with that. So let's put away our caps and whatever uniforms here. Let's be real here. I definitely think that this was a heavy-handed instance. I think that it's one thing to ensure broad compatibility across the different architectures that R supports. But this was something like a /bin/bash note that admittedly was not affecting users up to this point anyway. Ari has always been very responsive to user feedback on his packages, and he has not heard one iota, in both his testing and others', about choroplethr being affected by this.
I think this is an artifact of a system that has gotten the R community at large to where it is. I mean, like, without CRAN in the very beginning, I still stand by it: R would not be as successful as it is now. But as you said, Mike, this is a different time now. This is a different time where there are multiple personas, so to speak, leveraging R, going from more of an academic type of, you know, statistical environment for research to now being used across industries. My industry in particular is really doubling down on it, and it is being relied on in production way more than maybe someone might have thought five or six years ago. So I do think that there needs to be a healthy assessment on where things can be improved upon.
I think transparency is one of them. I think leveraging newer technology for automation is another, because let's face it, another key project that we've covered on this show many, many times now is the growth of R-universe, where, no, it's not a human-curated effort per se, but it is leveraging a lot of automation to give, you know, confident, reliable package binary installations across different operating systems, as well as new technology, hint hint, WebAssembly, to make this even more future-proof. I think there is somewhere in the middle where maybe eventually either CRAN or another effort that is slowly coming up in the discussion, the R-multiverse project, which we'll be hearing, I think, more about this year, gives us the best of both: they still have a human in the loop, but yet take advantage of the modern technology that, say, R-universe has pioneered, to make this hopefully a better experience for everybody involved, for the consumer, i.e. the user of the package, but also the maintainers and developers of the package.
And in Ari's post, he does throw out some statistics about how, you know, Python has a lot more, you know, metrics of their packages being downloaded and whatnot. Well, admittedly, Ari, that can be a loaded statistic, so to speak, because, and this is actually captured in a LinkedIn post where Ari shared his blog post, with some great comments as well, I agree with people like Joe Cheng who commented on this. PyPI, yeah, maybe it's an automated thing, but it is a wild west. And dependency management in Python, I'll stand on this soapbox,
still leaves a lot to be desired, and hence there are people that are tearing down and standing up their environments left and right for a single project. Joe in particular had a comment that he must have installed pandas, like, over 50 or 60 times when he was testing various things with his efforts in Python. So, yeah, it's great that PyPI has this lower-friction approach to get something online for a Python package, but you can go to the extreme in the other direction, and it can cause havoc in that regard. So I still think there's a middle ground to be had here.
But in any event, I still think that this was a very heavy-handed approach taken with archiving choroplethr and acs as a result of this kind of note. Because in the end, did it affect the users? No. Maybe it affected one esoteric build target that CRAN uses. I won't say Solaris, even though I do wanna say it, because that's usually the butt of many jokes for antiquated architecture. But I think it's good to at least have these discussions, and hopefully with the efforts that are happening with R-universe and, in the not-so-distant future, the R-multiverse project, we'll get to a middle ground somewhere.
[00:17:35] Mike Thomas:
Yeah. I have to say, you know, the experience of installing an R package is, I think, one of the big benefits of R over Python: you can install packages from within R, right, without a second piece of software like pip, which is, you know, incredibly frustrating for Python newbies, especially. And I guess my only last other point would be, and I'm not saying that this is the case with acs, I don't know much about the package, but if you are an R package developer, one thing that you can do to try to mitigate risk, in my opinion, is to take a look and see how actively maintained the dependencies of your package are. And if you're going to add a new package, make sure you take a look and see if there is somebody, you know, contributing to that package recently, that it looks like they're maintaining it, such that they would be responsive if an issue like this did happen. There are a lot of R packages out there that haven't been touched in, you know, four-plus years, if you go back and take a look at the code base, and it's probably only a matter of time until something happens. And if nobody's there to respond to it, that's going to be an issue for all of the packages that depend on it. So in my opinion, that's an additional way that you can try to mitigate some risk.
[00:18:54] Eric Nantz:
Yeah. Very valid point, Mike. And my other key, you know, thought I had is that a maintainer should not be penalized for having a stable package where maybe a check does arise, but it has nothing to do with, like, how R itself works or anything like that. If it's having to do with this kind of arbitrary Linux shell prompt that the package has nothing to do with, that should be treated differently than, say, oh, wait a minute, your use of, like, an object class system just completely broke, or a test completely broke, you know, whatever. That's a different story. I don't think these notes are all created equal here. And that nuance is not lost on me. That was a note, not a warning, not an error, a note, which, yes, I mean, you obviously strive to minimize those, but, again, those are not all created equal.
[00:19:52] Mike Thomas:
Likewise. And I guess maybe the last thing I would say is if you are somebody who's developed R packages and you're looking to dabble into potentially developing Python packages, Ari's also drafted a blog post on his blog called A Guide to Contributing to Open-Source Python Packages, which
[00:20:08] Eric Nantz:
you might find very interesting. Very good, Mike. We will link to that in the show notes. So up next in the highlights, well, it is 2025. You're usually gonna hear something about large language models and how they're, you know, helping productivity or helping a data science type of pipeline. And so in our next highlight, we do have an interesting use case that, admittedly, when I was using R for my dissertation, probably would have been really helpful in my lit review, because we are literally gonna talk about how large language models could help you take a more systematic approach to literature review.
And this post is coming to us from the Seascape Models group at the University of Tasmania. Shout out to Tasmania. That's a first on the highlights in the duration of this podcast. That is awesome. Yes. If you ever had any doubt about how international R is in its footprint, you know, this is proof right there. Nonetheless, this is the first time I've heard about their research group, and I don't know who exactly authored it, but I'm gonna say it's from their research group. They talk about a very practical approach: in their research, it's one thing to assemble all the resources or manuscripts or papers that comprise a lit review, but actually extracting the key information from them is another. So in particular, what they walk through in this post is how you can leverage an off-the-shelf, you know, large language model, you know, a ChatGPT-like service.
And then to be able to take a set of PDFs and extract the information from them, and to take that text and try to clean that up as well. And then, once you get that going for, like, a single type of manuscript, how you batch all this together. So first, like I said, the approach here is using one of the off-the-shelf providers for generative, you know, AI large language models. They are using Anthropic in this post, and I've heard good things about that. So, of course, there's no such thing as a free lunch here. You're gonna need an API key, because that will be leveraged as you interact with the service to grab the information back from their chatbot-like interface.
And so they've got some nice callouts of packages that help you with managing environment variables. You know, typically, I'm an old-school guy. I've always done the .Renviron file in my working directory. There's this great package that I admittedly need to use more often called dotenv that will help you do this in a slightly more agnostic way. You just set up a .env file in your home directory, put in your Anthropic key, and you're off to the races, because dotenv has a little load_dot_env function to import that in. And then you can use that as your environment variable, and you're good to go there. So a great little package right off the bat for that side of it.
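A minimal sketch of that setup, assuming the variable is named ANTHROPIC_API_KEY (the name and file location are just illustrations):

```r
# Contents of a .env file, one KEY=value pair per line:
# ANTHROPIC_API_KEY=sk-ant-...

library(dotenv)

# Read the .env file and set its entries as environment variables
load_dot_env(file = ".env")

# Retrieve the key later when building API requests
api_key <- Sys.getenv("ANTHROPIC_API_KEY")
```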
And then the package that they're using to interact with the Anthropic models is called the tidychatmodels package, which I am not familiar with. I'll have to do some research on where this package comes from, but it looks pretty straightforward here. You create a chat object defining the name of your service, your API key, and the version of the API. But you could use other APIs as well. So I'll have to look at this package in more detail later on. Looks pretty nifty here. Once you get all that set up, now, just like with any of these services, you gotta figure out what you want to use for your prompt and how to perform that. So they have a little, you know, basic example for adding a message based on the role that you supply and the prompt text.
Role is a key concept here, because there are typically two roles: there's a user role and a system role, where you may get more precise control over, say, the system role, and they give you some links to determine which is best suited for you. In this case, they're gonna use more of a system role here to give a little more granular control over the type of model that they're gonna use in the LLM, you know, interrogation. And you can add certain parameters such as temperature or max tokens, which they have lots of links to documentation on where you can find more information. But this kinda checks out with my explorations of the ellmer package, where I learned very quickly that the prompt is the key here, along with some other interesting things you can augment that interface to the chat service with.
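Roughly how that chat setup reads with tidychatmodels, going off the function names described in the post; the model name, parameter values, and piping style here are assumptions, not the authors' exact code:

```r
library(tidychatmodels)

chat <- create_chat("anthropic", Sys.getenv("ANTHROPIC_API_KEY")) |>
  add_model("claude-3-5-sonnet-20240620") |>   # placeholder model name
  add_params(temperature = 0.2, max_tokens = 500) |>
  add_message(
    role = "system",
    message = paste(
      "You are a research assistant summarizing the methods section of a",
      "paper on turtle mortality. Extract key statistics on sample size",
      "and year of study and nothing else."
    )
  )
```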
So once you have all that, they've got it ready to go to look at, you know, getting the text summarized, but you gotta get the text in there itself. So in this case, they've leveraged the pdftools package for a convenient way to grab the text from a PDF that you've downloaded on your computer, with the pdf_text function. But you also have to make sure that you are able to authenticate directly to your API service, because, on top of the authentication, you may have to format the text effectively. The author notes that the first time they ran this, they got a 400 error from the API service because the formatting wasn't correct. Because when you extract text from a PDF, there can be some strange symbols in there, some strange artifacts, and you gotta clean that up a little bit. So that's a good walkthrough on the practical ways of leveraging this workflow. But yeah, for the rest of the post, Mike, why don't you talk us through some of the challenges the authors of this workflow had here?
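A small sketch of that extraction step; the file name is hypothetical, and the cleanup shown is just one plausible way to strip the odd characters that trip up the API:

```r
library(pdftools)
library(stringr)

# pdf_text() returns one character string per page
pages <- pdf_text("turtle-mortality-paper.pdf")
paper_text <- paste(pages, collapse = "\n")

# Drop non-printable characters and collapse stray whitespace so the
# text can be sent to the API without formatting errors
paper_text <- paper_text |>
  str_replace_all("[^[:print:]\\n]", " ") |>
  str_squish()
```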
[00:26:35] Mike Thomas:
Yeah. You know, it's really nice that we have these APIs that allow us to do things programmatically. I think, like you mentioned, it wasn't quite as straightforward as extracting the text from the PDF using that pdftools package; there was some cleanup for those characters. But once that took place, it was pretty straightforward to be able to interact with this LLM, and I believe that's much thanks to this tidychatmodels package, which has functions in it like add_params, where you can set what's called, like, the temperature and the max number of tokens that you expect to interact with, with respect to the LLM. You can add a particular message, in this case, a system prompt. And the system prompt here was: you are a research assistant who has been asked to summarize the methods section of a paper on turtle mortality.
You will extract key statistics on sample size and year of study, and do not extract any more information beyond this point. So those system prompts are intended to sort of fix the response, or the way that the LLM will respond to you, prior to even providing your prompt, your user prompt. And then finally, there's a function called add_message, which allows you to submit that user prompt. And in this case, it'll be the text that was extracted from the PDF. In order to send this to the LLM, there's a final function called perform_chat that will send that. You can save the output of that to an object. In this case, the author used an object named new_chat.
And then to take a look at what the LLM gave you back, you can use the extract_chat function from the same package against that object that you saved. And it's pretty cool here to take a look at the results. I will give one hot tip. So the results come back as text where the LLM says, you know, based on the methods section, here are the key statistics: a sample size of 357 sets for the large-scale longline fishery, and the year of the study was 2018. So it is text that you would then have to parse, and there's a little bit of code that leverages a lot of functions from stringr and I think dplyr as well in order to actually extract, you know, just the numeric values for the sample size and the year.
I will give a hot tip here based upon our experience in the past. We have done things like use the system prompt to tell the LLM to only return data in JSON format with the following elements, like sample size and year of study. And that will actually spit back out JSON that you can, you know, then consume much more easily without having to do any string data wrangling, if you will. So that may be helpful to some of those folks out there. But if you can ensure that the output that you're getting is fairly standardized, like it seems to be the case here, then, you know, hopefully, we can program the whole solution. Right? And we don't have to do different parsing logic based upon, you know, the different prompts that we are essentially querying.
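A sketch of that structured-output tip; the prompt wording and field names are made up for illustration, but the parsing side is just jsonlite:

```r
library(jsonlite)

# Hypothetical system prompt asking for machine-readable output only
json_prompt <- paste(
  "You are a research assistant. From the methods section provided,",
  'return ONLY valid JSON of the form {"sample_size": <integer>,',
  '"year_of_study": <integer>} with no other text.'
)

# Suppose `response` is the text the LLM sent back
response <- '{"sample_size": 357, "year_of_study": 2018}'

result <- fromJSON(response)
result$sample_size    # 357
result$year_of_study  # 2018
```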
And so the code here is a fantastic walkthrough, pretty lightweight in terms of the number of packages that are being used, which is awesome. There are some considerations here and some great discussion at the end around cost uncertainty. If you're doing these things programmatically, right, you need to make sure that you're calculating, estimating your cost. You know, most of these third-party providers have the ability to set budgets, I think, so that you don't go over, you know, $10 or $100, whatever it is that you, you know, have linked in your account. I know at least the Claude models have that, which is really nice.
And maybe the last thing that I will say, just on a related topic while we're talking about LLMs, is a totally separate shout-out, but there's this project out there called Continue. I don't know if you've heard of it, Eric. No, I have not. It's a VS Code extension, and I heard about it on Hugo Bowne-Anderson's recent podcast, I think the Vanishing Gradients podcast, which interviewed one of the, I believe, chief developers of this Continue project. And it's an open-source coding assistant. And you know me, all open source, you know, go away, Microsoft.
It does have the ability, so you can select sort of your backend model that you want to use, and that could be OpenAI, it could be Claude, but it could also be a local model, like one of those from the Ollama project, which is what I've been using, one of the smaller ones. And it has the ability to just chat with your code. It has the ability to bring in particular pieces of context. Like, you can use a GitHub repository just by specifying the URL, and it will crawl that GitHub repository, you know, using your GitHub PAT that you can supply it with, if it's a private repository. And it will essentially do all the text embedding work for you to, you know, put that in a vector database, or whatever it's called, to be able to utilize that as context that you can chat with. You can put in multiple sources of documentation like that if you want.
And you can also highlight pieces of code and, you know, chat with the context being that particular highlighted piece of code. It does autocomplete, all those really good things. And it's a really, really cool user experience. And I would encourage anyone out there who is looking for, you know, the Copilot experience without necessarily wanting to interact with a third party, and who wants to continue to leverage open-source software, to take a look at this Continue project. I think it's maybe continue.dev,
[00:32:58] Eric Nantz:
but we can link to it in the show notes. Yeah, we're gonna link to it. And while you were talking, I wanted to see, hey, wait a minute, I wonder if I could plug this into Positron in my bleeding-edge evaluation of Positron. The good news is this extension is on the OpenVSX registry, meaning it's not locked into VS Code. You could use it on the variants. So I think I'm doing that later today, Mike. I'm gonna give this a shot, because I have been hesitant to get on the Copilot train even for my open-source stuff; I've been leveraging some other alternatives, and this may be one of them. So it just goes to show you there's a lot of advancement here. And by the way, yeah, plus 100 to your tip about giving the prompt some detailed information on how to get results back. Getting results back in a structured way like JSON just opens up so many possibilities, and I leveraged that technique extensively when I was making this fun little haunted places Shiny app that leveraged an LLM for R/Pharma last year. I made sure that when I made it randomly generate the quiz questions, it gave them back to me in JSON so that I could present them in Shiny very easily with dynamic inputs. So there's a lot at your fingertips here, I think this post does highlight.
Do some quick tests first in a specific use case, and you will learn a lot along the way. But we're only just scratching the tip of the iceberg with this, so to speak. There's a lot more to come in this space. And last, but certainly not least, Mike, I will admit on a day like this when I'm not feeling like myself, I do feel a little lazy about certain tasks. But we're not gonna talk about lazy in a negative connotation here, because our last highlight today is literally giving us a very practical take on the many different ways that laziness is actually available to you as an R user, depending on your context, depending on your workflow and the packages that you're utilizing.
And so this last highlight comes to us from the R-hub blog and from a very brilliant group of authors, if I do say so myself: we've got Maëlle Salmon, Athanasia Mowinckel, and Hannah Frick. So that's quite a trio right there to start off with. And so I'm gonna introduce certain pieces of this, and I'll turn over the mic for the rest. But if you've been an R user the last few years, there's probably one interpretation of lazy that you've heard throughout your journey with R, and that is the concept of lazy evaluation. That is arguably one of R's biggest selling points, the idea of lazy evaluation.
And what this really means is that if you have a function with arguments, those arguments are only going to be evaluated at runtime, so to speak, when they are accessed. So you may be able to pass, like, a huge value for an argument, maybe that's like a data frame or some other large vector. If it's not used, it's not gonna really matter. It's just gonna sit there in the definition. Only when you actually call something that leverages it will it actually be used. And there is a counter concept to that called eager evaluation. But the typical default behavior for R is lazy evaluation, and they have a link to a great chapter from the Advanced R book, authored by Hadley Wickham, with another more, you know, thorough introduction to lazy evaluation.
And in fact, in base R itself, another concept that you may be familiar with is the idea of a promise, where it may be something that is on tap to be evaluated, but it's in essence more of a recipe to get to that value. They call that an expression most of the time, and it can come with an environment as well. And again, only when you need it will it be evaluated into memory. So that can be very important depending on your workflow. I mentioned promise, right? Well, there is a very important part of the R ecosystem that leverages a different take on promises with respect to high-performance computing and async processing.
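A quick base R illustration of both ideas, lazy arguments and promises; nothing here is package-specific:

```r
# Lazy evaluation: an argument is only evaluated when it is first used
f <- function(x, y) {
  x + 1   # y is never touched, so its expression never runs
}
f(10, stop("this error never fires"))
#> [1] 11

# A base R promise: delayedAssign() stores the recipe, not the value
delayedAssign("big_vector", {
  message("computing now...")
  rnorm(1e6)
})
# Nothing has been computed yet; the message appears only on first access
length(big_vector)
#> computing now...
#> [1] 1000000
```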
And that comes from the future package. The future package is authored by Henrik Bengtsson, another really brilliant researcher in the R community that I had the pleasure to meet at posit::conf a few years ago. And in the future package, this promise is thought of more as a placeholder for a value. And, again, there are different ways to configure this with the future package. You can have what's called a lazy future or an eager future. And in essence, when you define a future, the default behavior is actually to be eager, meaning that the moment you define the function that has that future encapsulated in it, it kicks off that computation right on the spot.
However, you can also feed in an argument of lazy = TRUE, meaning that the future will not start right away. It will not hog your R session; you can do other things in the console and do whatever you want, and only when you request that future's value will it actually kick off. So that can be important if you're new to the future package and wondering, well, wait a minute, I thought the whole point was that it could run in the background. You gotta be careful in how you define how that future is spelled out or initialized.
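A small future sketch of eager versus lazy; the multisession plan is just one choice of backend:

```r
library(future)
plan(multisession)  # run futures in background R sessions

# Eager (the default): evaluation kicks off as soon as the future is created
f_eager <- future(Sys.getpid())

# Lazy: nothing runs until the value is actually requested
f_lazy <- future({
  Sys.sleep(2)
  "done"
}, lazy = TRUE)

value(f_eager)  # worker PID; this work started right away
value(f_lazy)   # evaluation is triggered here, then "done" is returned
```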
So, again, their version of lazy is not quite the same as the definition we heard in base R and whatnot, but that is important if you're into that space. And now we're gonna shift context to data itself, because a very key concept in database operations within the R language is the idea of lazy operations. What does this mean? Well, in a nutshell, with these database backends, such as, say, MySQL or SQLite or others, you might define queries with the help of the dbplyr package, which supports a lot of dplyr-like syntax but for databases. You may have an analytical pipeline where you're gonna take data, maybe mutate a few things, summarize with group processing, but only when you, quote unquote, collect that result will it actually be evaluated. So it's a way to efficiently run SQL queries instead of, as with a typical data frame, running all those steps one by one in memory right away.
So that is a really important concept that the dbplyr package surfaces. And also, more recently, the dtplyr package does a similar thing with data.table as the backend for managing that data. And again, much like the database, you know, paradigm we mentioned earlier, the lazy way of doing it is gonna capture the intent of those data processing steps, but not actually do anything until that result is requested, i.e. collected with a collect function. Again, a really important selling point. But there is, in the realm of databases, another, newer contender that has even more of a nuanced take on this, and that is the duckplyr package, because Mike and I are big cheerleaders for the DuckDB backend. I absolutely love it.
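Here's a minimal dbplyr sketch of that capture-then-collect pattern, using an in-memory SQLite database for illustration:

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# This only builds a lazy query; nothing runs against the database yet
query <- tbl(con, "mtcars") |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(query)          # inspect the SQL dbplyr generated

result <- collect(query)   # only now does the SQL actually execute

dbDisconnect(con)
```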
So with duckplyr, again, this is using DuckDB on the backend, so to speak. Now, there can be a little bit of a problem here with respect to traditional dplyr-type usage. Usually, things are eager by default in dplyr, like I mentioned, with a typical data frame. But with DuckDB, one of the reasons we want to use it in the first place is that we can optimize the queries, optimize the computations, before they're actually run. So duckplyr does need the same concept of laziness that those traditional packages like dbplyr actually use.
Now this is what's interesting here. The way duckplyr is pulling this off, and we're getting a little in the weeds here, is that it is leveraging ALTREP, which is one of the more fantastic contributions to R itself in recent years, where there's more power behind vectorized operations, but it also supports what's called deferred evaluation. More specifically, and I quote from the post here: ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed.
So that means for duckplyr that they can have a special version of these callbacks and other functions to interrogate whatever is the root of that operation, say the query, an analytical summarize, or what have you. So then duckplyr, by proxy, is actually lazy in terms of how it runs its operations, but it seems eager to you as the end user when you're running, like, a duckplyr-based pipeline. They've got examples here where there could be cases where it is very important to utilize this functionality, and cases where it might be more applicable to add a little more control to it, or add a safeguard to it.
I've never played with this before, but there's a concept called prudence to control just how automatic this lazy ALTREP evaluation is here. There's stingy, and then there's thrifty. I love these names, by the way. Those are really creative. And they've got examples in the post with the mtcars dataset of the differences between how these are approached. So this is something that you probably wanna look at with the recent version of duckplyr. It had an upgrade, I think, within the last few weeks. There's a lot of rapid development on it, and I think it's got tons of potential for leveraging high-performance workflows with a database as the backend. And, again, a clever use of laziness with respect to ALTREP. So I am eager to try that out.
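A rough sketch of what that prudence control looks like, based on the post's description; the exact argument names and accepted values may differ between duckplyr versions, so treat this as illustrative rather than definitive:

```r
library(duckplyr)
library(dplyr)

# "stingy": results are never materialized automatically; you have to
# collect() explicitly before using them as an ordinary data frame
lazy_cars <- as_duckdb_tibble(mtcars, prudence = "stingy")

summary_q <- lazy_cars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

collected <- collect(summary_q)   # DuckDB runs the optimized query here

# "thrifty": small results materialize automatically, larger ones stay lazy
thrifty_cars <- as_duckdb_tibble(mtcars, prudence = "thrifty")
```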
But, of course, there are way more ways that laziness and, you know, lazy evaluation play a role in the rest of your kind of typical R workflows. So, Mike, why don't you take us through those?
[00:44:33] Mike Thomas:
Yes. A few more quick hitters for us in this blog post. When we talk about lazy loading of data in packages, I think a lot of us have experienced this before. When you're in R, right, you can quickly access, like, the iris and the mtcars datasets, which are built into your installation of R. I'm not sure, Eric, you probably have to help me with this a little bit, if they are loaded into memory prior to calling them, prior to actually evaluating them. But that's sort of the concept here: if you have an R package that has a package dataset in it and sets the LazyData field in the DESCRIPTION file to true, then the exported datasets are lazily loaded, and they're available without having to call the data function, right, for those particular datasets.
But they're not actually taking up memory until they are accessed. So that's something interesting there. It's something that we've run into a few times, actually. We have some functions in some of our packages that programmatically, sort of, you know, with the use of, like, regular expressions and stringr, try to decide which internal package dataset you want to leverage in that function, and unfortunately you have to call library on the package first in order for that function to work. You can't just namespace it, or else it will fail. And I'm not sure if we've solved that yet. It's a bit of a workaround.
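A tiny sketch of the LazyData mechanics; the package and dataset names here are hypothetical:

```r
# In the DESCRIPTION file of a hypothetical package {mypkg}:
# LazyData: true

# With LazyData: true, an exported dataset is available by name after
# attaching the package, but it only occupies memory once first accessed
library(mypkg)
head(example_data)   # example_data is loaded into memory at this point

# You can also request a packaged dataset explicitly
data("example_data", package = "mypkg")
```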
[00:46:07] Eric Nantz:
Is that something you've run into before, Eric? Yeah, the hard way, quite a bit, even with my golem-powered Shiny apps where I include an internal dataset as, like, a way to have my colleagues test, or as an example set that the app would use. I've had to do some, you know, very weird hacks of, like, just running an arbitrary command on that data frame to trick it into loading into memory before the function completes. I don't really have a great solution for that. So, hey, Colin, if you're listening, maybe you could help me out with that, by the way. But, nonetheless, that's where I've encountered that bugaboo the most.
[00:46:46] Mike Thomas:
Yes. Yes. No, that's a great point. And there are a couple of links here, I think, that may help discuss this concept of lazy data further. There's the R Packages book by Hadley Wickham and Jenny Bryan, and then there's also the Writing R Extensions manual, which I think is more authored by some of the core R developers, so it comes from that perspective. Those might be two good resources if you're interested in learning a little bit more about lazily loading data in packages. I love lazy logic that checks to see if something even needs to be rerun, and that's sort of the concept of caching, right, in a broad sense.
And the authors here give the example of the lazy argument in the pkgdown build_site function, which, if that argument is set to true, will only rebuild articles and reference pages if the source is newer than the destination, which makes a whole lot of sense and can save a whole lot of time depending on how big your project is. And that's something that I have to talk to a client about today, because we have a GitHub Action that is taking way more time than it needs to take.
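For reference, that lazy rebuild is a single argument (assuming a standard pkgdown setup):

```r
# Rebuild only the articles and reference pages whose source files are
# newer than their rendered output, instead of regenerating the whole site
pkgdown::build_site(lazy = TRUE)
```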
[00:47:58] Eric Nantz:
I feel seen about that. Absolutely. Yep.
[00:48:01] Mike Thomas:
I digress. Similar concept with the lazytest package that helps you only rerun tests that failed during the last run. And the last example here is regarding regular expressions. I had never heard the terminology lazy being applied to regular expressions, but if your regular expression is finding all matches of whatever pattern you're looking for, that's considered eager. And if it's only finding the first match, or the fewest number of repetitions, as the authors define it here, as possible, then it's considered to be lazy. And in the example that they provide, the question mark character in the regular expression is what adds this laziness.
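A short stringr illustration of greedy versus lazy matching; the pattern is just an example:

```r
library(stringr)

x <- "<b>bold</b> and <i>italic</i>"

# Greedy (eager): "+" grabs as much as possible, spanning both tags
str_extract(x, "<.+>")
#> [1] "<b>bold</b> and <i>italic</i>"

# Lazy: "+?" stops at the fewest repetitions needed, so only the first tag
str_extract(x, "<.+?>")
#> [1] "<b>"
```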
So, a ton of examples here, a really, really interesting blog post. I think it's always interesting, you know, whatever these authors put out. There are some neat perspectives that maybe we don't think about or have on a day-to-day basis. And I would say that, if you didn't get it already, there are a lot of different definitions around laziness when it comes to programming, and R programming especially. They did omit one definition of laziness, which is the one that takes place when people just copy and paste code from ChatGPT and don't even look at it before incorporating it into their project or repository, or even worse, pushing it to production.
That's bad laziness, as opposed to a lot of the good laziness that we were talking about today. But context is king, as I say. And, yes,
[00:49:32] Eric Nantz:
we both have had experiences where that's happened, and we're like, oh, boy, is this what we're in for now? Just my two cents. Yeah. Yeah. But I think it's a viewpoint that's shared by a lot of people. But, yeah, lots of great additional, you know, links in this post to dive into each of these in greater detail. As I said, I'm really intrigued by duckplyr's approach to this, because I've never seen something kinda try to straddle the line and show both eagerness and laziness depending on the way you're interrogating it. So I'm gonna do some homework after the show about that, because I'm trying to up my DuckDB power here, so to speak, after that great workshop I took back at posit::conf last year. I'm all in on that train. And, yeah, in this case, lazy is definitely not a bad thing in many of the approaches here.
And what else is not bad is R Weekly itself. I would dare say we're not lazy in terms of how we curate the issue. That is very much an eager evaluation, in a good way. Normally, we do our additional finds, but we are running a bit low on time, so we're gonna close up shop here and, again, invite you, if you wanna help contribute to the project, the best way to do that is with a pull request to R Weekly itself and the upcoming issue. If you found that great blog post that maybe spurs up a lot of discussion in the community, like we had with Ari's post, or a great technical deep dive, or a great way to use a new R package out there, we're just a pull request away. All Markdown, all the time. The template's already there. Head to rweekly.org for complete details on that. And we love hearing from you on the social medias. A great shout-out to those that have gotten in touch and sent us some good things on social media.
But you can find me. I'm now on Bluesky, where I'm at rpodcast.bsky.social. I'm also on Mastodon, where I'm at [email protected]. And I'm on LinkedIn. You can search my name, and you'll find me there. And, Mike, where can the listeners find you?
[00:51:35] Mike Thomas:
Sure. You can find me on Bluesky at mike-thomas.bsky.social, or on LinkedIn if you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, you can see what I'm up to lately. Very good stuff. And thank you again. We made it. With our 50%
[00:51:54] Eric Nantz:
workflow, we somehow made it. So that's why having a cohost is a really good idea in these times. So, nonetheless, we will close up shop here for episode 196 of R Weekly Highlights. Yeah, we're not far away from 200, folks. It's coming up soon. And we'll be back with episode 197 of R Weekly Highlights next week.
Hello, friends. We're back of episode a 96 of the Our Weekly Highlights podcast. Oh my goodness. That's four away from the big 200. It's sneaking up on us folks, but in any event, this is the weekly show where we talk about the awesome highlights and additional resources that are shared every single week at ourweekly dot o r g. My name is Eric Nance, and, I'm I'm feeling a little decent now, but I will admit I've been under the weather lately. So we'll do my best to get through it, but I'm feeling good enough right now. But this is definitely one of those times where more than ever, I am so glad I don't do this alone because I got my awesome cohost,
[00:00:40] Mike Thomas:
Mike Thomas, to pick up the pieces when I fall apart. Mike, how are you doing? I'm doing pretty good, Eric. I feel about 50% healthy. I'm fighting it too. So, hopefully, you're 50% and my 50% can can add up to a hundred if that math checks out. I do want a shameless plug, and we didn't even talk about this pre show, but were you on another podcast recently that anybody can check out?
[00:01:03] Eric Nantz:
I will be. Unfortunately, that post got sick too. So we had the postcode now. So talked about a preshow. No. No. It's all good. It's all good. I will definitely link that out when it's published, but we're that'll be a a fun episode that will be Dakota radio program. So stay tuned on my social feeds for when that gets out. But, yeah. Apparently, it's going everywhere even across the country too. So, nonetheless, we are we're we're good enough for today. And, let me you know? Gosh. It's been such a whirlwind week. I gotta figure out who curated this issue. Oh, wait. Oh, wait. Look in the camera, Eric. Yeah. It is me. It was me that curated this week's issue. As always, great to give back to the project, and this was a very tidy selection.
And as always, I had tremendous help from the fellow, our rookie team members, and all of you out there with your great poll requests and suggestions. We got those merged in, and some of those ended up becoming a highlight this week. So without further ado, we're gonna lead off today with arguably one of the more spicy takes we've had on the show this year at least, and maybe even maybe even the past year as well. And this is authored by, I'll just say, a good friend of mine from the community, Ari Lamstein, who I've met at previous positive conferences. And in fact, I have fond memories. He was actually one of the students at one of my previous shiny production workshops. So he's been always, you know, trying to be on the cutting edge of what he does, and he does consulting. He's done all sorts of things.
But this came into my feed, around the time of the curation this week, and he has this provocative title, is CRAN, CRAN being the comprehensive r archive network, holding our back. So let's give a little context here, and then Mike and I are gonna have a little banter here about the the pros and cons of all this. So Ari has authored a very successful package in the spatial, you know, mapping space called coral plethora. I hope I'm saying that right. But he's maintained that for years and years. He would say that it is in a stable state. He did rapidly iterate on it over the years in the initial development, especially when it was literally part of his job at the time to work on this package.
But, yeah, it's been in a stable state, you know, along those things where if it ain't broke, don't fix it. Well, it was about a month or so ago. He was informed that it was his package was going to be archived. Meaning, and for those aren't aware, when CRAN archives a package, they will not support the binary installation of a package with the typical install dot packages function out of the box because of some issue that has been raised by an r command check or maybe another issue that the maintainers thing should have been resolved and end up not being resolved.
Now you may ask, well, what was the issue with the package? It wasn't with Ari's package. It was of a dependency package called ACS, which I've not heard about until now. Apparently, that got a warning. And what was the warning about? Well, if you go in the digging of the ACS packages, post on CRAN where it gives a link to the command check warnings, Try this one for for size, folks. The note, it was not a warning. It was a note.
[00:04:38] Mike Thomas:
Right.
[00:04:39] Eric Nantz:
And it says, configure slash bin slash bash is not portable. Mike, when did r include a bash shell? Do you know? Did I miss something here?
[00:04:52] Mike Thomas:
I must have missed it too. I mean, I have to imagine that this is when the right. This is when the the checks run, so it probably has something to do with the runner itself and the the Linux box where the this is being executed on. This may be above my pay grade, but that is not an informative message.
[00:05:10] Eric Nantz:
And it literally has nothing to do with r itself or even, another language that a package could be basing off of such as c plus plus or Fortran or the like. So Ari, you know, he did not like this, and, frankly, I can understand where he's coming from here. And he decided that, well, chloropleather is not at fault here. And you know what? I'll let the chips follow where they may, and it is now archived. So his package is now archived because of the ACS package being archived. We I personally don't know who authored it, not that I really care to for the purpose of this discussion. It has been archived, and now Choroplethor is now archived as a result.
This, so Ari and the rest of the post has some pretty candid viewpoints on the state of CRAN at this time. I will admit he's not alone in some of the recent discourse I've seen online. I've heard, community members like Josiah Perry, who I have great respect. He's had some qualms about some recent notes he's received from crayon as he's been trying to, I mean, either add a new package or update an existing package. And Ari, you know, really is being provocative here about what impact is Kran having now on the future growth of r itself.
And then Ari has lately in the last, I would say, a few years, been working on additional projects in Python. So he does compare and contrast how CRAN compares to Python's arguably default package repository called PyPy. Now let's, let's have a little fun with this, Mike. I'm going to play, hopefully, the role of a good cop, supposedly. YCran, even with these issues, is still a valued piece of the our ecosystem and that maybe this is just a very small blip in the overall
[00:07:20] Mike Thomas:
success of it. And then your job is gonna be able to talk me down on this. So let me you ready? Buckle up. Yeah. We we can do that. And I think as a disclaimer, I would say that I don't wanna speak for you, but both of us have mixed feelings on both sides of the fence. Yes. But I think we'll we'll go through the exercise of, good cop, bad cop here. I think that's a great idea. Yep. So
[00:07:43] Eric Nantz:
I've been a long time R user since 02/2006. So I've I've I've I've known the ecosystem for a while. I dare say that as somebody new to the language, as I was getting familiar with Baysar and then suddenly I would have courses in grad school that talked about some novel statistical methods and also my dissertation on top of that. Without the CRAN ecosystem of a way to easily once I did my, say, WIP review or other research about the pack the methodology I needed, and then finding that package and then be able to install that right away in my r session.
But that package being authored by leading researchers in those methodologies and the fact that this was a curated set that I could have full confidence that the package I'm installing is indeed going to work on my system, and it has been has been approved from a curated, you know, curator type of group that the CRAN team is and to be able to use that in my research. Other programming languages do have a more automated system where, yeah, you can throw any package on there, but you have no idea about the quality of it. You have no idea if it's going to destroy your system sometimes. You have no idea if it's even doing the right thing, statistically speaking.
In my opinion, one of the great advantages of r is indeed that CRAN is this targeted stable set that's very akin to, say, the the Debian type philosophy in the Linux world. It's stable. You can rely on it, will not break on you, and they are making a concerted effort to make sure that this is a reliable network for you to install packages on. And I don't think ours is where it is today without CRAN having that that curated group behind the scenes to make all this happen. What do you think?
[00:09:51] Mike Thomas:
Yeah. So I think that maybe the times have changed a little bit. I think that maybe five-plus years ago, people were not developing packages via GitHub, this GitHub package development workflow that really exists now, I think, across the space. You know, sometimes you'll go to an old package and you'll search for it on Google, or your favorite LLM, I guess, these days, and you'll try to find the GitHub repository that sits behind that package so you can take a look through the code, and it doesn't exist. You're just taken to, like, the PDF, right, of the package itself. And that's very frustrating nowadays, but I guess that's probably reflective of the workflow that used to exist, you know, five or ten years ago, where we didn't really have this GitHub-driven package development workflow. And this is something that was raised on BlueSky by Yoni Sidi, who said, you know, back in 2016, that they thought the R community was sort of setting itself up for problems by CRAN not building this infrastructure and software development life cycle geared towards a focused GitHub package development flow, where sort of most packages are developed right now, and where CI/CD can be set up and all sorts of things like that. So I think, unfortunately, in a lot of ways, CRAN has not caught up with the times.
And then just some of the rigor and inflexibility that they have, I think, can be construed as over the top, for lack of a better word. I think archiving a package because of a note seems absurd. And I think that the time windows for folks to fix these things are unnecessarily short. I remember ggplot2, a year or two ago, was almost archived due to a dependency that had been archived. And ggplot2 has, like, a whole entire company behind it that can work on trying to rescue that. They have relationships with the CRAN maintainers, you know, the volunteers that work on CRAN, that the rest of us do not have. Right? We're limited to emailing back and forth, and I'm sure anybody that's done that before has struggled with that, for lack of a better word, in some sense.
So I think we have this dichotomy where, if you are an R user, yes, CRAN can be very beneficial. But if you are an R developer of packages, it can potentially cause, you know, more headaches than it solves.
[00:12:37] Eric Nantz:
I definitely resonate with that. So let's put away our caps and whatever uniforms here. Let's be real here. I definitely think that this was a heavy-handed instance. I think it's one thing to ensure broad compatibility across the different architectures that R supports. But this was something like a bin/bash note that admittedly was not affecting users up to this point anyway. Ari has always been very responsive to user feedback on his packages, and he has not heard one iota, in both his testing and others', about choroplethr being affected by this.
I think this is an artifact of a system that has gotten the R community at large to where it is. I mean, without CRAN in the very beginning, I still stand by this: R would not be as successful as it is now. But as you said, Mike, this is a different time now. This is a different time where there are multiple personas, so to speak, leveraging R, going from more of an academic, you know, statistical environment for research to now being used across industries. My industry in particular is really doubling down on it, and it is being relied on in production way more than maybe someone might have thought five or six years ago. So I do think that there needs to be a healthy assessment of where things can be improved upon.
I think transparency is one of them. I think leveraging newer technology for automation is another, because let's face it, another key project that we've covered on this show many, many times now is the growth of R-universe, where, no, it's not a human-curated effort per se, but it is leveraging a lot of automation to give, you know, confident, reliable package binary installations across different operating systems, as well as new technology, hint hint, WebAssembly, to make this even more future-proof. I think there is somewhere in the middle where maybe eventually either CRAN or another effort that is slowly coming up in the discussion, the R-multiverse project, which we'll be hearing, I think, more about this year, could give us a best of both worlds: they still have a human in the loop, but yet take advantage of the modern technology that, say, R-universe has pioneered to make this hopefully a better experience for everybody involved, for the consumer, i.e., the user of the package, but also the maintainers and developers of the package.
And in Ari's post, he does throw out some statistics about how, you know, Python has a lot more, you know, metrics of their packages being downloaded and whatnot. Well, admittedly, Ari, that can be a loaded statistic, so to speak, because, and this is actually captured in a LinkedIn post where Ari shared his blog post, with some great comments as well, I agree with people like Joe Cheng who commented on this. PyPI, yeah, maybe it's an automated thing, but it is a wild west. And dependency management in Python, I'll stand on this soapbox,
still leaves a lot to be desired, and hence, there are people that are destroying and re-creating their environments left and right for a single project. Joe in particular had a comment about how he must have installed pandas, like, over 50 or 60 times when he was testing various things with his efforts in Python. So, yeah, it's great that PyPI has this less-friction approach to get something online for a Python package, but you can go extreme in the other direction, and it can cause havoc in that regard. So I still think there's a middle ground to be had here.
But in any event, I still think that this was a very heavy-handed approach taken with archiving choroplethr and acs as a result of this kind of note. Because in the end, did it affect the users? No. Maybe it affected one esoteric build target that CRAN uses. I won't say Solaris, even though I do wanna say it, because that's usually the butt of many jokes for antiquated architecture. But I think it's good to at least have these discussions, and hopefully, with the efforts that are happening with R-universe and, in the not-so-distant future, the R-multiverse project, we'll get to a middle ground somewhere.
[00:17:35] Mike Thomas:
Yeah. I have to say, you know, the experience of installing an R package is, I think, one of the big benefits of R over Python: you can install packages from within R, right, without a second piece of software like pip, which is, you know, incredibly frustrating for Python newbies especially. And I guess my only last other point would be, and I'm not saying that this is the case with acs, I don't know much about the package, but if you are an R package developer, one thing that you can do to try to mitigate risk, in my opinion, is to take a look and see how actively maintained the dependencies of your package are. And if you're going to add a new dependency, make sure you take a look and see if there is somebody who's, you know, contributed to that package recently, such that it looks like they're maintaining it and they would be responsive if an issue like this did happen. There are a lot of R packages out there that haven't been touched in, you know, four-plus years if you go back and take a look at the code base, and it's probably only a matter of time until something happens. And if nobody's there to respond to it, that's going to be an issue for all of the packages that depend on it. So in my opinion, that's an additional way that you can try to mitigate some risk.
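(For listeners who want to put that advice into practice, here's a minimal sketch of one way to check dependency freshness from the R console. choroplethr stands in as the example package, and the `Published` column name reflects the CRAN metadata returned by `tools::CRAN_package_db()`; adjust as needed.)

```r
# Sketch: how recently were a package's hard dependencies last updated on CRAN?
library(tools)

ap   <- available.packages()                          # current CRAN index
deps <- package_dependencies("choroplethr", db = ap,
                             which = c("Depends", "Imports"))[[1]]

db <- CRAN_package_db()                               # richer metadata, incl. publication dates
db[db$Package %in% deps, c("Package", "Published")]   # very old dates may signal maintenance risk
```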
[00:18:54] Eric Nantz:
Yeah. Very valid point, Mike. And my other key, you know, thought is that a maintainer should not be penalized for having a stable package where maybe a check note does arise, but it has nothing to do with, like, how R itself works or anything like that. If it's having to do with this kind of arbitrary Linux shell detail that the package has nothing to do with, that should be treated differently than, say, oh, wait a minute, your use of, like, an object class system just completely broke, or a test completely broke, you know, whatever. That's a different story. I don't think these notes are all created equal here, and that nuance is not lost on me. That was a note, not a warning, not an error, a note, which, yes, I mean, you obviously strive to minimize those, but, again, those are not all created equal.
[00:19:52] Mike Thomas:
Likewise. And I guess maybe the last thing I would say is, if you are somebody who's developed R packages and you're looking to dabble into potentially developing Python packages, Ari's also drafted a post on his blog called A Guide to Contributing to Open-Source Python Packages, which,
[00:20:08] Eric Nantz:
you might find very interesting. Very good, Mike. We will link to that in the show notes. So up next in the highlights, well, it is 2025; you're usually gonna hear something about large language models and how they're, you know, helping productivity or helping a data science type of pipeline. And so in our next highlight, we do have an interesting use case that, admittedly, would have been really helpful in my lit review when I was using R for my dissertation, because we are literally gonna talk about how large language models could help you take a more systematic approach to a literature review.
And this post is coming to us from the Seascape Models group at the University of Tasmania. Shout out to Tasmania; that's a first on the highlights in the duration of this podcast. That is awesome. Yes, if you ever had any doubt about how international R is in its footprint, you know, this is proof right there. Nonetheless, this is the first time I've heard about their research group, and I don't know who exactly authored the post, but I'm gonna say from their research group, they talk about a very practical approach. It's one thing to assemble all the resources or manuscripts or papers that comprise a lit review, but it's another to actually extract the key information from them. So in particular, what they walk through in this post is how you can leverage an off-the-shelf, you know, large language model, a ChatGPT-like service.
And then to be able to take a set of PDFs and extract the information from them, take that text, and try to clean it up as well. And then, once you get that going for, like, a single manuscript, how you batch all this together. So first, like I said, the approach here is using one of the off-the-shelf providers for generative AI large language models. They are using Anthropic in this post, and I've heard good things about that. So, of course, there's no such thing as a free lunch here. You're gonna need an API key, because that will be leveraged as you interact with the service to grab the information back from their chat-bot-like interface.
And so they've got some nice call-outs of packages that help you with managing environment variables. You know, typically, I'm an old-school guy; I've always done the .Renviron file in my working directory. There's this great package that I admittedly need to use more often called dotenv that will help you do this in a slightly more agnostic way. You just set up a .env file in your home directory, put in your Anthropic key, and you're off to the races, because dotenv has a little load_dot_env() function to import that in. And then you can use that as your environment variable, and you're good to go there. So a great little package right off the bat for that side of it.
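(As a quick illustration of that setup, here's a minimal sketch; the variable name ANTHROPIC_API_KEY is just an assumption for the example, so match it to whatever your downstream code expects.)

```r
# Sketch: keep the API key out of your scripts in a .env file, then load it.
#
#   contents of .env (one line):
#   ANTHROPIC_API_KEY=sk-ant-xxxxxxxx
library(dotenv)

load_dot_env(file = ".env")       # reads the file and sets the environment variable
Sys.getenv("ANTHROPIC_API_KEY")   # now available to whatever calls the API
```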
And then the package that they're using to interact with the Anthropic models is called the tidychatmodels package, which I am not that familiar with; I'll have to do some research on where this package comes from, but it looks pretty straightforward here. You create a chat object defining the name of your service, your API key, and the version of the API. But you could use other APIs as well. So I'll have to look at this package in more detail later on. It looks pretty nifty here. Once you get all that set up, now, just like with any of these services, you've gotta figure out what you want to use for your prompt and how to perform that. So they have a little, you know, basic example for adding a message based on the role that you supply and the prompt text.
Role is a key concept here, because there are typically two roles: there's a user role and a system role, where you may get more precise control over, say, the system role, and they give you some links to determine which is best suited for you. In this case, they're gonna use a system role here to give a little more granular control over how the model behaves in the LLM, you know, interrogation. And you can add certain parameters such as temperature or max tokens, which they have lots of documentation links on where you can find more information. But this kinda checks out with my explorations of the ellmer package, where I learned very quickly that the prompt is the key here, along with some other interesting things you can augment with that interface to that chat service.
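(To make that concrete, here's a hedged sketch of the kind of tidychatmodels setup the post describes; the model name and exact argument spellings are assumptions for illustration, so check the package documentation before copying this.)

```r
# Sketch: build a chat object, pick a model, set parameters, and add a system prompt.
library(tidychatmodels)

chat <- create_chat("anthropic", Sys.getenv("ANTHROPIC_API_KEY")) |>
  add_model("claude-3-haiku-20240307") |>                # assumed model name
  add_params(temperature = 0.2, max_tokens = 500) |>     # controls randomness / response length
  add_message(
    role    = "system",
    message = "You are a research assistant summarising the methods section of a paper."
  )
```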
So once you have all that, they've got it ready to go to, you know, get the text summarized, but you've gotta get the text in there itself. So in this case, they've leveraged the pdftools package for a convenient way to grab the text from a PDF that you've downloaded on your computer, with the pdf_text() function. But you also have to make sure that you are able to authenticate directly to your API service, and on top of the authentication, you may have to format the text effectively. The author notes that the first time they ran this, they got a 400 error from the API service because the formatting wasn't correct. Because when you extract text from a PDF, there can be some strange symbols in there, some strange artifacts, and you've gotta clean that up a little bit. So that's a good walkthrough on the practical ways of leveraging this workflow. And for the rest of the post, Mike, why don't you talk us through some of the challenges and learnings that the authors had with this workflow here?
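(Here's a minimal sketch of that extraction step; the file name is hypothetical, and the cleanup is just one generic way to strip the odd characters the author ran into, not the post's exact code.)

```r
# Sketch: pull the raw text out of a downloaded paper and tidy it up before sending.
library(pdftools)
library(stringr)

pages <- pdf_text("turtle_paper.pdf")                 # one character string per page
text  <- paste(pages, collapse = "\n")

text <- str_replace_all(text, "[^[:print:]]", " ")    # drop non-printable characters
text <- str_squish(text)                              # collapse runs of whitespace
```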
[00:26:35] Mike Thomas:
Yeah. You know, it's really nice that we have these APIs that allow us to do things programmatically. I think, like you mentioned, it wasn't quite as straightforward as just extracting the text from the PDF using that pdftools package; there was some cleanup needed for those characters. But once that took place, it was pretty straightforward to be able to interact with this LLM, and I believe that's much thanks to this tidychatmodels package, which has functions in it like add_params(), where you can set what's called, like, the temperature and the max number of tokens that you expect to interact with, with respect to the LLM. You can add a particular message, in this case, a system prompt. And the system prompt here was: you are a research assistant who has been asked to summarize the methods section of a paper on turtle mortality.
You will extract key statistics on sample size and year of study, and do not extract any more information beyond this point. So those system prompts are intended to sort of fix the response, or the way that the LLM will respond to you, prior to even providing your user prompt. And then finally, there's a function called add_message(), which allows you to submit that user prompt, and in this case, it'll be the text that was extracted from the PDF. In order to send this to the LLM, there's a final function called perform_chat() that will send that. You can save the output of that to an object; in this case, the author used an object named new chat.
And then, to take a look at what the LLM gave you back, you can use the extract_chat() function from the same package against that object that you saved. And it's pretty cool here to take a look at the results. I will give one hot tip. So the results come back as text where the LLM says, you know, based on the methods section, here are the key statistics: a sample size of 357 sets for a large-scale longline fishery, and the year of the study was 2018. So it is text that you would then have to parse, and there's a little bit of code that leverages a lot of functions from stringr, and I think dplyr as well, in order to actually extract just the numeric values for the sample size and the year.
I will give a hot tip here based upon our experience in the past. We have done things like use the system prompt to tell the LLM to only return data in JSON format with the following elements, like sample size and year of study. And that will actually spit back out JSON that you can, you know, then consume much more easily without having to do any string data wrangling, if you will. So that may be helpful to some of those folks out there. But if you can ensure that the output that you're getting is fairly standardized, like it seems to be the case here, then, you know, hopefully, we can program the whole solution, right? And we don't have to do different parsing logic based upon, you know, the different prompts that we are essentially querying.
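(A small sketch of that tip, not code from the post: if the system prompt asks for JSON only, the reply can be parsed directly with jsonlite instead of regex wrangling. The field names and the example reply are made up, matching the numbers discussed above.)

```r
# Sketch: parse a JSON-formatted LLM reply instead of string-wrangling free text.
library(jsonlite)

response <- '{"sample_size": 357, "year_of_study": 2018}'   # what the model would return
result   <- fromJSON(response)

result$sample_size     # 357
result$year_of_study   # 2018
```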
And so the code here is a fantastic walkthrough, pretty lightweight in terms of the number of packages that are being used, which is awesome. There are some considerations here and some great discussion at the end around cost uncertainty. If you're doing these things programmatically, right, you need to make sure that you're calculating and estimating your cost. You know, most of these third-party providers have the ability to set budgets, I think, so that you don't go over, you know, $10 or $100 or whatever it is that you have linked in your account. I know at least the Claude models have that, which is really nice.
And maybe the last thing that I will say, just on a related topic while we're talking about LLMs, is a totally separate shout-out, but there's this project out there called Continue. I don't know if you've heard of it, Eric. No, I haven't. It's a VS Code extension, and I heard about it on Hugo Bowne-Anderson's recent podcast, I think the Vanishing Gradients podcast, which interviewed one of the, I believe, chief developers of this Continue project. And it's an open-source coding assistant. And you know me and open source, you know, go away, Microsoft.
It does have the ability, so you can select sort of your back-end model that you want to use, and that could be OpenAI, it could be Claude, but it could also be a local model, like one of those from the Ollama project, which is what I've been using, one of the smaller ones. And it has the ability to just chat with your code. It has the ability to bring in particular pieces of context. Like, you can point it at a GitHub repository just by specifying the URL, and it will crawl that GitHub repository, you know, using your GitHub PAT if it's a private repository that you can supply it with. And it will essentially do all the text embedding work for you to, you know, put that in a vector database, or whatever it's called, to be able to utilize that as context that you can chat with. You can put in multiple sources of documentation like that if you want.
And you can also highlight pieces of code and, you know, chat with the context being that particular highlighted piece of code. It does autocomplete, all those really good things. And it's a really, really cool user experience. And I would encourage anyone out there who is looking for, you know, the Copilot experience without necessarily wanting to interact with a third party, and who wants to continue to leverage open-source software, to take a look at this Continue project. I think it's maybe continue.dev,
[00:32:58] Eric Nantz:
but we can link to it in the show notes. Yeah, we're gonna link to it. And while you were talking, I wanted to see, hey, wait a minute, I wonder if I could plug this into Positron in my bleeding-edge evaluation of Positron. The good news is this extension is on the Open VSX registry, meaning it's not locked into VS Code; you could use it in the variants. So I think I'm doing that later today, Mike. I'm gonna give this a shot, because I have been hesitant to get on the Copilot train even for my open-source stuff; I've been leveraging some other alternatives, and this may be one of them. So it just goes to show you there's a lot of advancement here. And by the way, yeah, plus 100 to your tip about giving the prompt some detailed information on how to get results back. Getting results back in a structured way like JSON just opens up so many possibilities, and I leveraged that technique extensively when I was making this fun little haunted places Shiny app that leveraged an LLM for R/Pharma last year. I made sure that when I made it randomly generate the quiz questions, it gave them back to me in JSON so that I could present them in Shiny very easily with dynamic inputs. So there's a lot at your fingertips here, I think this post does highlight.
Do some quick tests first in a specific use case, and you will have a lot that you'll learn along the way. But we're only just scratching the tip of the iceberg with this, so to speak. There's a lot more to come in this space. And last, but certainly not least, Mike, I will admit on a day like this when I'm not feeling like myself, I do feel a little lazy about certain tasks. But we're not gonna talk about lazy in a negative connotation here, because our last highlight today is literally giving us a very practical take on the many different ways that laziness is actually available to you as an R user, depending on your context, depending on your workflow and the packages that you're utilizing.
And so this last highlight comes to us from the R-hub blog and from a very brilliant group of authors, if I do say so myself: we've got Maëlle Salmon, Athanasia Mowinckel, and Hannah Frick. So that's quite a trio right there to start off with. And so I'm gonna introduce certain pieces of this, and I'll turn over the mic for the rest. But if you've been an R user the last few years, there's probably one interpretation of lazy that you've heard throughout your journey with R, and that is the concept of lazy evaluation. That is arguably one of R's biggest selling points: the idea of lazy evaluation.
And what this really means is that if you have a function with arguments, those arguments are only going to be evaluated at runtime, so to speak, when they are accessed. So you may pass, like, a huge value for an argument; maybe that's a data frame or some other large vector. If it's not used, it's not gonna really matter. It's just gonna sit there in the definition. Only when you actually call something that leverages it will it actually be used. And there is a counter-concept to that called eager evaluation. But the typical default behavior for R is lazy evaluation, and they have a link to a great chapter from the Advanced R book, authored by Hadley Wickham, with another more thorough introduction to lazy evaluation.
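(A two-line example of that behavior: the unused argument is never evaluated, so the call below succeeds even though forcing it would throw an error.)

```r
# Sketch: lazy evaluation of function arguments in base R.
f <- function(x, y) x * 2

f(10, stop("never evaluated"))   # returns 20; `y` is never touched, so stop() never runs
```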
And in fact, in base R itself, another concept that you may be familiar with is the idea of a promise, where it may be something that is on tap to be evaluated, but it's in essence more of a recipe to get to that value. They call that an expression most of the time, or it could come from an environment as well. And again, only when you need it will it be evaluated in memory. So that can be very important depending on your workflow. I mentioned promise, right? Well, there is a very important part of the R ecosystem that leverages a different take on promises with respect to high-performance computing and async processing.
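(Before we get to futures, base R's promise machinery can be made visible with delayedAssign(), which stores exactly that kind of recipe and only runs it on first access; a minimal sketch:)

```r
# Sketch: a promise created by hand; the expression runs only when the value is needed.
delayedAssign("big_result", {
  message("computing now...")
  sum(seq_len(1e6))
})

# Nothing has been computed yet; the message appears only on first access.
big_result
#> computing now...
#> [1] 500000500000
```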
And that comes from the future package. The future package is authored by Henrik Bengtsson, another really brilliant researcher in the R community that I had the pleasure to meet at posit::conf a few years ago. And in the future package, this promise is thought of more as a placeholder for a value. And, again, there are different ways to configure this with the future package: you can have what's called a lazy future or an eager future. And in essence, when you define a future, the default behavior is actually to be eager, meaning that the moment you define the function that has that future encapsulated, it kicks off that work right on the spot and, in some setups, it's gonna wait until that task is done.
However, you can also feed in an argument of lazy = TRUE, meaning that that future will just be queued up. It will not hog your R session; you can do other things in the console and do whatever you want, and only when you ask for that future's value will it actually run. So that can be important if you're new to the future package, to figure out, well, wait a minute, I thought the whole point was that it could run in the background. You've gotta be careful in how you define how that future is spelled out or initialized.
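(A small sketch of that eager-versus-lazy distinction with the future package; plan(multisession) is just one possible backend choice for illustration.)

```r
# Sketch: eager futures start resolving when created; lazy futures wait to be asked.
library(future)
plan(multisession)

f_eager <- future(Sys.getpid())               # work is dispatched right away
f_lazy  <- future(Sys.getpid(), lazy = TRUE)  # nothing happens yet

value(f_lazy)                                 # first request triggers evaluation, then returns
```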
So, again, their version of lazy is not quite the same as the definition we heard in base R and whatnot, but that is important if you're into that space. And now we're gonna shift context to data itself, because a very key concept in database operations within the R language is the idea of lazy operations. What does this mean? Well, in a nutshell, with these database back ends, such as, say, MySQL or SQLite or others, you define queries with the help of, say, the dbplyr package, which accompanies a lot of dplyr-like syntax but for databases. You may have an analytical pipeline where you're gonna take data, maybe mutate a few things, summarize with group processing, but only when you, quote, unquote, collect that result will it actually be evaluated. So it's a way to efficiently run SQL queries instead of, as with a typical data frame, running all those steps one by one in memory right away.
So that is a really important concept that the dbplyr package surfaces. And more recently, the dtplyr package does a similar thing with data.table as the back end for managing that data. And again, much like the database paradigm we mentioned earlier, the lazy way of doing it is gonna capture the intent of those data processing steps, but not actually do anything until that result is requested, i.e., collected with the collect() function. Again, a really important selling point. But there is, in the realm of databases, another newer contender that has an even more nuanced take on this, and that is the duckplyr package, because Mike and I are big cheerleaders for the DuckDB back end. I absolutely love it.
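(Here's a minimal dbplyr sketch of that pattern using an in-memory SQLite database as a stand-in: the pipeline only builds SQL until collect() is called.)

```r
# Sketch: a lazy database pipeline; nothing runs until collect().
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

query <- tbl(con, "mtcars") |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(query)   # inspect the translated SQL; still lazy at this point
collect(query)      # now the query actually runs and returns a tibble

dbDisconnect(con)
```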
So with duckplyr, again, this is using DuckDB on the back end, so to speak. Now, there can be a little bit of a problem here with respect to traditional dplyr-type usage. Usually, things are eager by default in dplyr, like I mentioned, with a typical data frame. But with DuckDB, one of the reasons we wanna use it in the first place is that we can optimize the queries, optimize the computations, before they're actually run. So duckplyr does need this same concept of laziness that those traditional packages like dbplyr actually need.
Now, this is what's interesting here. The way duckplyr is pulling this off, and we're getting a little in the weeds here, is that it is leveraging ALTREP, which is one of the more fantastic contributions to base R in recent years. It gives more power behind vectorized operations, and it supports what's called deferred evaluation. More specifically, and I quote from the post here, ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed.
So that means duckplyr can have a special version of these callbacks and other functions to interrogate whatever is at the root of that operation, say the query, an analytical summarize, or what have you. So duckplyr, by proxy, is actually lazy in terms of how it runs its operations, but it seems eager to you as the end user when you're running, like, a duckplyr-based pipeline. So they've got examples here where there are cases where it's very important to utilize this functionality, and cases where it might be more applicable to add a little more control to it, or add a safeguard to it.
I've never played with this before, but there's a concept called prudence to control just how automatically this ALTREP-based lazy evaluation is done here. There's stingy, and then there's thrifty. I love these names, by the way; those are really creative. And they've got examples in the post with the mtcars set of the differences between how these are approached. So this is something that you probably wanna look at with the recent version of duckplyr; it had an upgrade, I think, within the last few weeks or so. There's a lot of rapid development on it, and I think it's got tons of potential for leveraging high-performance workflows with a database at the back end. And, again, a clever use of laziness with respect to ALTREP. So I'm eager to try that out.
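(A heavily hedged sketch of what that might look like in practice: the as_duckdb_tibble() constructor and the prudence values follow the post's description, but the exact function and argument names may differ across duckplyr versions, so treat this as illustrative only.)

```r
# Sketch (assumed API): control duckplyr's automatic materialisation via "prudence".
library(duckplyr)
library(dplyr)

cars_duck <- as_duckdb_tibble(mtcars, prudence = "stingy")   # assumed constructor/argument

res <- cars_duck |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

# With a "stingy" frame, the result stays deferred until you explicitly materialise it,
# e.g. via collect() or coercion to a plain data frame.
collect(res)
```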
But, of course, there are way more ways that laziness and, you know, lazy evaluation play a role in the rest of the typical R workflows that you might have. So, Mike, why don't you take us through those?
[00:44:33] Mike Thomas:
Yes, a few more quick hitters for us in this blog post. When we talk about lazy loading of data in packages, I think a lot of us have experienced this before. When you're in R, right, you can quickly access, like, the iris and the mtcars datasets, which are built into your installation of R. I'm not sure if they're loaded into memory prior to calling them, prior to actually evaluating them; Eric, you'll probably have to help me with this a little bit. But that's sort of the concept here: if you have an R package that does have a package dataset in it and sets the LazyData field in the DESCRIPTION file to true, then the exported datasets are lazily loaded, and they're available without having to call the data() function, right, for those particular datasets.
But they're not actually taking up memory until they are accessed. So that's something interesting there. It's something that we've run into a few times, actually. We have some functions in some of our packages that programmatically, sort of, you know, with the use of, like, regular expressions and stringr, try to decide which internal package dataset you want to leverage in that function. And unfortunately, you have to call library() on the package first in order for that function to work; you can't just namespace it, or else it will fail. And I'm not sure if we've solved that yet. It's a bit of a workaround.
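(For package authors who haven't set this up before, a short hedged sketch: recent versions of usethis handle the DESCRIPTION field for you, though the exact behavior may vary by version.)

```r
# Sketch: ship a lazily loaded dataset with a package.
# usethis::use_data() saves data/mydata.rda and, in recent versions, sets
# `LazyData: true` in DESCRIPTION, so users can reach mydata after
# library(mypackage) without calling data(), and nothing is read into
# memory until the object is first touched.
mydata <- data.frame(id = 1:3, label = c("a", "b", "c"))
usethis::use_data(mydata, overwrite = TRUE)
```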
[00:46:07] Eric Nantz:
Is that something you've run into before, Eric? Yeah, the hard way, quite a bit, even with my golem-powered Shiny apps that include an internal dataset as, like, a way to have me or my colleagues test, or an example set that the app would use. I've had to do some, you know, very weird hacks, like just running an arbitrary command on that data frame to trick it into loading into memory before the function completes. I don't really have a great solution for that. So, hey, Colin, if you're listening, maybe you could help me out with that, by the way. But, nonetheless, that's where I've encountered that bugaboo the most.
[00:46:46] Mike Thomas:
Yes, yes, no, that's a great point. And there are a couple of links here, I think, that may help discuss this concept of lazy data further. There's the R Packages book by Hadley Wickham and Jenny Bryan, and then there's also the Writing R Extensions manual, which I think is more, sort of, authored by some of the core R developers, so it comes from that perspective. So those might be two good resources if you're interested in learning a little bit more about lazily loading data in packages. I love lazy logic that checks to see if something ever needs to be rerun, and that's sort of the concept of caching, right, in a broad sense.
And the authors here give the example of the lazy argument in the pkgdown build_site() function, which, if that argument is set to true, will only rebuild articles and reference pages if the source is newer than the destination, which makes a whole lot of sense and can save a whole lot of time depending on how big your project is. And that's something that I have to talk to a client about today, because we have a GitHub Action that is taking way more time than it needs to take.
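(The call in question, for reference, as described above; no other arguments are needed for the lazy behavior.)

```r
# Sketch: only rebuild articles and reference pages whose source is newer than
# the built output, instead of regenerating the whole site every time.
pkgdown::build_site(lazy = TRUE)
```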
[00:47:58] Eric Nantz:
I feel seen about that. Absolutely. Yep.
[00:48:01] Mike Thomas:
I digress. Similar concept with the lazytest package, which helps you only rerun tests that failed during the last run. And the last example here is regarding regular expressions. I had never heard the terminology lazy being applied to regular expressions, but if your regular expression is matching as much of whatever pattern you're looking for as it possibly can, that's considered eager, or greedy. And if it's only finding the first match, or the fewest number of repetitions possible, as the authors define it here, then it's considered to be lazy. And in the example that they provide, the question mark character in the regular expression is what adds this laziness.
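(A tiny illustration of that last point, not taken from the post: the trailing question mark turns a greedy quantifier into a lazy one.)

```r
# Sketch: greedy vs. lazy quantifiers in a regular expression.
library(stringr)

x <- "<p>first</p><p>second</p>"

str_extract(x, "<p>.+</p>")    # greedy: "<p>first</p><p>second</p>"
str_extract(x, "<p>.+?</p>")   # lazy:   "<p>first</p>"
```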
So, a ton of examples here; a really, really interesting blog post. I think it's always interesting, you know, whatever these authors put out. There are some neat perspectives that maybe we don't think about or have on a day-to-day basis. And I would say that, if you didn't get it already, there are a lot of different definitions around laziness when it comes to programming, and R programming especially. They did omit one definition of laziness, which is the one that takes place when people just copy and paste code from ChatGPT and don't even look at it before incorporating it into their project or repository, or, even worse, pushing it to production.
That's bad laziness, as opposed to a lot of the good laziness that we were talking about today. But context is king, as I say. And, yes,
[00:49:32] Eric Nantz:
we've both had experiences where that's happened, and we're like, oh, boy, is this what we're in for now? Just my two cents. Yeah. But I think it's a viewpoint that's shared by a lot of people. But, yeah, there are lots of great additional links in this post to dive into each of these in greater detail. As I said, I'm really intrigued by the duckplyr approach to this, because I've never seen something kinda try to toe the line between both eagerness and laziness depending on the way you're interrogating it. So I'm gonna do some homework after the show about that, because I'm trying to up my DuckDB power here, so to speak, after that great workshop I took back at posit::conf last year. I'm all in on that train. And, yeah, in this case, lazy is definitely not a bad thing in many of the approaches here.
And what else is not bad is R Weekly itself. I would dare say we're not lazy in terms of how we curate the issue; that is very much an eager evaluation, in a good way. Normally, we do our additional finds, but we are running a bit low on time, so we're gonna close up shop here and, again, invite you, if you wanna help contribute to the project, the best way to do that is with a pull request to R Weekly itself for the upcoming issue, whether you found a great blog post that maybe spurs a lot of discussion in the community, like we had with Ari's post, or a great technical deep dive, or a great way to use a new R package out there. We're just a pull request away. All Markdown, all the time; the template's already there. Head to rweekly.org for complete details on that. And we love hearing from you on the social medias. A great shout-out to those that have gotten in touch and sent us some good things on social media.
But you can find me: I'm now on BlueSky, where I'm @rpodcast.bsky.social. I'm also on Mastodon, where I'm @[email protected]. And I'm on LinkedIn; you can search my name, and you'll find me there. And, Mike, where can the listeners find you?
[00:51:35] Mike Thomas:
Sure. You can find me on BlueSky at @mike-thomas.bsky.social, or on LinkedIn; if you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to lately. Very good stuff. And thank you again. We made it. In our 50%
[00:51:54] Eric Nantz:
workflow, we somehow made it. So that's why having a cohost is a really good idea in these times. So, nonetheless, we will close up shop here for R Weekly Highlights episode 196. Yeah, we're not far away from 200, folks; it's coming up soon. And we'll be back with episode 197 of R Weekly Highlights next week.