How the recent frontier LLM releases compare for successfully generating R code, our take on the new Test Set data science podcast, and a surprising entry in the world of languages equipped for data science.
Episode Links
Supplement Resources
Supporting the show
Music credits powered by OCRemix
Episode Links
- This week's curator: Sam Parmar - @parmsam@fosstodon.org (Mastodon) & @parmsam_ (X/Twitter)
- 2025-12-05 AI Newsletter
- The Test Set: Now on YouTube + a look at what’s next
- Haskell IS a Great Language for Data Science
- Entire issue available at rweekly.org/2025-W50
Supplement Resources
- How well do LLMs generate R code (Shiny app) https://skaltman-model-eval-app.share.connect.posit.cloud/
- Python is not a great language for data science (Claus Wilke) Part 1 https://blog.genesmindsmachines.com/p/python-is-not-a-great-language-for
- DataHaskell https://www.datahaskell.org/
Supporting the show
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @rpodcast@podcastindex.social (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @mike_thomas@fosstodon.org (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
Music credits powered by OCRemix
- Marble Dash - Sonic the Hedgehog - Joshua Morris - https://ocremix.org/remix/OCR01365
- The Belmont Chill - Super Castlevania IV - Blak_Omen - https://ocremix.org/remix/OCR01195
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 215 of the R Weekly Highlights podcast. Sorry for missing last week, but the day job's end-of-year stuff got me wrapped up in all sorts of fun last week. I'm back once again and really excited to close out the year strong, so to speak. We've got more awesome R content, as shared in this week's R Weekly issue, to talk to you all about. So my name is Eric Nantz, and I'm delighted that you joined us from wherever you are around the world and wherever you're listening. And joining me, as always, is my awesome cohost, Mike Thomas, who just had to endure yet more rants in our preshow. And, Mike, how are you doing today? Doing great, Eric. Hoping to wrap up the year strong here with R Weekly Highlights heading into the holidays. It is frigid out here on the East Coast of the US, but we'll try to keep everybody warm. That's right. I was wearing my other jacket downstairs until we got started, because it is a bit chilly here in the good old basement.
We don't exactly have a huge budget for recording here; we'll get into more on that later. But nonetheless, we can talk about some awesome stuff that doesn't need a budget, so to speak, because it comes for free for all of you in this week's R Weekly issue, which has been curated by Sam Parmar this week. He's one of our somewhat newer curators, but he's actually been on the team for over a year, and he's been tremendously helpful as always with this issue. And he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world, with your pull requests over at the rweekly.org site. So let's get right to it, because the world of large language model development moves at a breakneck pace.
And one of the nice things we've seen come out in the data science community to try and cut through all the fluff, so to speak (because believe me, there's a lot of fluff out there; hate to bring up LinkedIn from time to time, but there's a lot of AI slop on there, so you gotta watch it; I kid, I kid, not really) is the recent effort launched by Posit called the AI newsletter. It's authored by Sara Altman and Simon Couch, who have both been really involved in developing LLM technology and packages over at Posit, and they give you the straight scoop, so to speak, on a lot of the newer developments and a lot of the practical issues that you may hear about but may want a more robust take on, especially from the data science perspective.
So their most recent issue came out earlier in the month, on December 5, and in this issue there are a couple cool things that caught my eye. First of which, as I mentioned, these models are rapidly evolving, and there are some new releases from the frontier providers. And I guess it's somewhat unsurprising, but still interesting, to see that from Anthropic's side of things, the new Claude Opus 4.5 model has been deemed by Sara and Simon here to be the best coding model available according to their benchmarks, which does kind of track with some of the things we've heard from various community members in the world of LLMs.
Seems like Claude has been the one that most software developers in general turn to, to generate their programming code of choice. And it's getting really nice with R code, as I've been playing with it a bit here and there in Positron Assistant. So I'll certainly be eager to try Claude Opus 4.5. And it does sound like they are working on the pricing to be a little more competitive compared to some initial releases. So that's something to keep in mind too, which does remind me: there's another part of the post that clears up what can be a point of confusion for those new to the world of frontier LLM providers.
There are often two ways to pay for these services. One is that you literally sign up for an account on these services and pay a monthly fee; you may pay it yearly, but it's a static fee. But then there's also the API token method, where you pay for what you use in terms of API tokens. In general, especially if you're new to this, a lot of folks, myself included, recommend you go the API token route as you're just getting your feet wet, and then maybe turn to the full account subscription for the times you really need it. But there's lots of great information in the post on where you can learn more about those choices.
The other model that got talked about here is Google's recent update to Gemini, which I know a lot of people have been using, because I believe Gemini is actually free to start with to get your feet wet. They have a new release called Nano Banana Pro. Who knows what they use to name these things, but I'm sure it generates some fun images. What was interesting in this post is that they took a little bit of a challenge here and found that it might actually be helpful as an image model (which is what this one is) that can generate interesting and at least semi-clear technical diagrams of software.
So in the blog post, you'll see when you click on this after listening that they asked it to create an infographic explaining how the ellmer R package works, and it looks like a pretty solid diagram. Certainly, there may be some questionable choices on where the arrows are going in terms of the layout, but I could see this being kinda neat. If you don't wanna code something up in one of those open source diagramming tools, or in proprietary software, maybe an LLM can ingest your overall code base and help you do that. So I'll take note of that for future developments and the like.
But, yeah, there are definitely more highlights here in terms of other interesting developments. The other one I'll highlight before I turn it over to Mike is that the Posit team has been regularly benchmarking how well the LLMs generate R code, because there are some nice packages they've been developing (Simon especially), such as the vitals package, along with a built-in test case called "are" (as in An R Eval) to run on all these models and see how well each performs. And now, instead of just relying on a blog post every so often to summarize performance, they have a fun little Shiny app that we'll link to in the show notes that compares the model performance, and you can choose as many models as are offered here, and there's quite a bit offered. When I looked at it earlier this morning preparing the show notes, it definitely shows that Claude Opus is the leader in terms of the percent correct in code generation against the ground truth, so to speak, in this "are" scenario, followed by GPT-5 and then Sonnet 4.5 and the like. But you can add in the older models, and you can see the evolution of how well, or maybe not so well, some of these models are performing. So it's a great little tool to bookmark, not just for performance, but also for the ratio of cost to performance and getting the nitty-gritty on pricing details, which might be really important if you're in an organization, or just important to you if you're paying out of pocket for this.
So really nice app here. Straight to the point. Nicely done with bslib and the like. So definitely bookmark that if you're new to this space. Yeah, I always enjoy reading this when it comes out every few weeks or so. But, Mike, what did you think about the roundup in this newsletter?
[00:08:31] Mike Thomas:
Yeah. At the end of last month, we saw a ton of activity in sort of the closed source, frontier model updates. That Gemini 3 Pro update, I think, kinda rocked the world. I think it sort of threw Anthropic and Claude, and ChatGPT, maybe OpenAI, for a minute, but I think their responses, between GPT-5.1 Pro and now Claude Opus 4.5, have been pretty strong as well. And I've also been experimenting with both a little bit and kinda concur that Claude Opus 4.5 is the best for what we're doing day in and day out with our coding and Quarto type of stuff.
Some of the stuff I was experimenting with in Gemini 3 Pro was good; I think maybe a little too verbose, a little too research-y, and a little less targeted, but I think there are probably some fantastic applications as well. And the pricing model changes are pretty interesting. It sounds like Anthropic actually increased the price of their smaller Haiku model and decreased the price of Opus, kind of causing the costs to converge towards the balanced Sonnet pricing that they have. So it's interesting that the highest performing model, supposedly, with Opus, is now sort of on the cheaper side, and it seems like that's gonna be a bait and switch, just to get you in, right? And then ramp up the cost eventually once you get hooked. But we're pretty hooked on Opus, so they'll probably have us for a while.
And the one thing I haven't explored, but am glad the Posit team did, in terms of Simon and Sara in this article, is exploring these image models as well and getting into multimodality here with Gemini's Nano Banana Pro. And one of the things I think they found interesting, compared to every other image model I've seen thus far, is that it's not necessarily throwing out images that have junk text in them and Wingdings and stuff that looks like it's from another language. I saw something going around on LinkedIn, or Bluesky maybe, where there was an image in a research paper that was published in the prestigious science journal Nature,
I think. That was like total AI slop, and it has since been retracted. I think it was a research paper on autism, pretty serious stuff. And it's been retracted because it was, I think, found to be half AI-generated, or at least the diagrams were. And if they had used Nano Banana Pro, maybe their junk science wouldn't have been discovered, or at least not as
[00:11:24] Eric Nantz:
quickly. So I don't know. It makes me both happy and sad at the same time. I don't know.
[00:11:29] Mike Thomas:
Yes. Exactly. But Sara and Simon did find an interesting use case for the Nano Banana Pro image generation model in their data science workflows, in terms of generating what they're calling coherent, sometimes, technical diagrams. In a lot of our repositories and a lot of the work and documentation that we do at Ketchbrook, we're almost always creating workflow diagrams, technical diagrams that accompany the code and the software we're creating. And there are great tools out there, like Mermaid JS and Excalidraw, that allow us to do those things, but it does take some time. So if there's the potential for Nano Banana Pro to be able to speed that process up, or maybe even just help us iterate, that's pretty interesting.
And the other part of the article they shared in the newsletter is some additional updates to R packages: the vitals package now has a 0.2.0 release, so it's still very early in its life, but that's a really cool package for doing evals and evaluating LLM tools using ellmer. So 0.2.0 is now on CRAN as well. And, yeah, as you mentioned, Eric, that Shiny app they have developed to track performance, in terms of R code generation and the accuracy of R code generation across these new frontier models, is really, really cool.
I think it draws data from using the vitals package as well, and I know they try to update it regularly. That's something that I'd seen Sharon Machlis do over on Bluesky and not a lot of other folks trying to do, at least in terms of R benchmarking. So for those of us that are still developing in R and haven't switched everything over to Python, this is fantastic, because most of the benchmarking I've seen thus far in terms of code generation is just around Python. And I've been using the API pricing model for all of these services for a while, but I think it's probably time that I consider the all-in monthly cost, because that may be beneficial: we have caps on most of our APIs where I get a notification every time we use $5.
And that $5 notification is starting to ding, like, every other day or every three days. So it's probably time to consider that all-in monthly cost, but it has been a great way to dip our feet into trying to leverage these in our workflows.
[00:14:10] Eric Nantz:
Yeah, I think so, especially when you get into more of the agentic side of things and you're not just doing simple code completions or simple answers to the cryptic JavaScript or XML-based parsing that I've done in the past. I heard a story from a colleague at a different company who, just as a challenge, tried to build like an internal version of Stack Overflow. And it worked pretty well, but I don't think plain API usage would have cut it; it was very much an agentic thing. That just shows you how far people are pushing these things. And like I mentioned before, there's generating full-blown Linux distribution configuration files in a couple hours, versus the, like, five weeks it might take somebody new to Nix to roll their own, or whatnot. So this agentic flow is happening, and I'm hearing Claude Code and Cursor being used quite a bit in those situations.
But I will say, back on the API front, if you want something where you can kind of pick and choose the models but in one overall platform, OpenRouter has been really helpful to me. It helps you pick and choose between, say, OpenAI's models, Anthropic's, Google's, and others. You just basically give your service account credentials for these API keys to OpenRouter, and they'll take care of the rest. So some tools can benefit from that when you don't wanna have to hop back and forth between, like, five or six different services; you wanna put it all in one place. This is not meant to be free advertising; I'm just saying it actually works pretty well. So you might wanna check that out if you're using more than a handful of these at a time, because it can be a lot to manage, especially for someone new like me. So, yeah, lots of great developments here. And, again, it's really great to see this ellmer ecosystem just grow exponentially.
We learned a lot more about this in recent conferences I've been a part of: both the Gen AI day for R/Pharma, which has its recordings on YouTube, and some great content that will be on YouTube soon from our recent R/Pharma conference, with respect to how we're leveraging ellmer and the like. So good times to be had. And speaking of good times, jeez, Mike, there's a little competition out there in the world of podcasting, apparently. But, no, it's all in good fun, of course, because our friends at Posit have been hard at work developing some novel podcast content called The Test Set. Sounds like you've had a chance to listen to a few of these, and maybe you could take us through your impressions of The Test Set.
[00:17:09] Mike Thomas:
I have. I think they have a higher budget than we have, Eric. They've got some really nice backgrounds, and it's very well produced. It's great. I think it originally started out with Michael Chow as the host, and now they've brought Wes McKinney in as a cohost on the most recent episodes, I should say. And they've been fantastic: really great deep dives into the backgrounds of many prominent figures in data science across both R and Python. Folks like Julia Silge, who you would know; James Blair, who works at Posit as an engineer; and, most recently, Kelly Bodwin, who I believe is a professor of statistics and data science at Cal Poly, so it was really interesting to get her perspective on how she teaches and the different strategies she has, as well as her background coming up through data science. So I have really enjoyed it thus far. It sounds like they have some pretty exciting guests on the roadmap for 2026 as well: folks from dbt, Shopify, Astral, Mode, Meta, and a bunch more. So I can't get enough of it, which maybe is unhealthy: listening to data science podcasts after I'm done working for the day, walking the dog, and mowing the lawn on the weekends.
But I have long been bitten by the data science bug, and I started out probably listening to the original R Podcast by my cohost here, as well as the Shiny Developer Series, and just ingesting as much as I possibly can. So I really enjoy these conversations. For somebody who works fully remote like my team, it's a way to connect with the greater data science community: hear what everybody's thinking about, interested in, and working on. So I appreciate Posit making the investment to put this together, produce it, and share it with the world.
[00:19:16] Eric Nantz:
Yep. It is in my podcast queue on my little phone here to catch up on the back catalog. I did listen to a bit of one of the episodes earlier this week with Mine Çetinkaya-Rundel on her experiences with teaching in the world of data science, especially with the advent of LLMs. And she has some interesting takes on how she's leveraging the novel technology to make running a course easier, but also on the real fundamental issues with students now growing up in this new age. It's almost like when old-timers like me were growing up with the Internet just starting in high school and college, and how that revolutionized my workflow. But, yeah, great perspectives, great lineup of guests.
So like I said, it must be nice to have a positive budget instead of a pseudo-negative one like what we have here, but, hey, fair play, and they've got the right minds behind it. So I'll definitely have that in my podcatcher of choice, feed of choice, because it is both on YouTube and in your favorite podcast provider. And I would say it's definitely audio-friendly. Like you said, Mike, with my consumption of content in the world of data science and open source software development, I am often doing something else while I listen. And then the challenge, when I hear a great insight, is being able to time it: hit pause, write it down, or at least jot down a link to follow up on. But there are some creative solutions I'm sure in that space. Anyway, The Test Set is going strong, and I look forward to seeing what else they have in their episode pipeline.
[00:20:57] Mike Thomas:
Good podcast name too.
[00:21:00] Eric Nantz:
Yes. Naming things is hard. It was easy back then because there was no R podcast before, so I got the easy one. And someday, I tell you, I'm gonna get that one back up and running again. I've got plans, buddy. Got plans. So I know, Mike, you've been talking, especially in your recent conference talk, about bilingual data science, especially from the R and Python side of things. And there was a post a few weeks ago by a prominent member of the R community, Claus Wilke.
It definitely caused a bit of a stir with his take on why Python was not such a great language for data science. It was certainly a mix of, I would say, things I nodded my head at and some things where I'm like, yeah, that sounds like just a bad experience, and maybe with better environment management or things like that it might have gone better. But it definitely got a lot of people talking, a lot of people thinking. And our last highlight here is from one of the leading thinkers in the space of alternative ways to do data science, Jonathan Carroll.
He was inspired by Claus's post to look at a language that you may, on the surface, not expect to be great for data science workflows. But in Jonathan's mind, it's actually come a pretty long way. What language are we talking about here? Haskell. Haskell definitely kind of toes the line between a lower-level compiled language and having some interpreted components to it. If my memory serves me correctly, I believe Haskell is the basis for the Pandoc utility, which is what we use quite a bit as the backend converter system going from, like, Markdown over to HTML or to open document formats and whatnot.
I could be wrong on that; someone may have to fact-check me on it. I'm sure they will. But that's where I first heard of Haskell. I do know that there are a lot of different use cases for it. Up until recently, Jonathan had been using Haskell in a lot of interesting Advent of Code challenges, but he has seen some interesting traction in the world of Haskell with respect to data science, such as the DataHaskell project, which is meant to be a curated organization of various packages in the Haskell ecosystem, plus tutorials and, in the future, learning environments to get your journey off on the right foot using Haskell for data science.
Now, there are some key differences between Haskell and the languages we're familiar with, such as R or Python, that are definitely worth taking note of; the syntax would definitely take some learning if you're new to it. One interesting one is that we're so used to writing the name of a function and then putting the names of arguments within parentheses. There are no parentheses for that in Haskell; it's just spaces. So that takes a little getting used to when you're reading it for the first time, even for a simple sum function or whatnot.
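For a concrete picture, here's a minimal sketch of that spacing-based function application, assuming GHC and the standard Prelude:

```haskell
-- Function application is just whitespace: no parentheses around arguments.
xs :: [Int]
xs = [1, 2, 3, 4]

total :: Int
total = sum xs     -- where R would write sum(x), Haskell writes sum xs

main :: IO ()
main = print total -- prints 10
```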
But there are some things that are kind of similar. For example, Haskell has the concept of lists, but lists in Haskell are kinda more like vectors in R, where each element needs to be the same type. You can't mix and match, like, an integer and a character in a Haskell list; they must be a single type. Another interesting one: ever since a couple years ago, R has had the native pipe operator, obviously inspired by magrittr and other packages. Haskell has its own pipe in the system too. So if you're used to pipe workflows, more power to you; you can use it.
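As a small illustration, Haskell's usual pipe is the (&) operator from Data.Function; a minimal sketch:

```haskell
import Data.Function ((&))

-- A homogeneous list (all Int), piped through two stages,
-- much like an R pipe chain.
result :: [Int]
result =
  [1, 2, 3, 4]
    & map (* 2)     -- [2,4,6,8]
    & filter (> 4)  -- [6,8]

main :: IO ()
main = print result
```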
There are some nuances, though. As you go from left to right, it'll pass the left side of the pipe chain in at the end of whatever's on the right side, as opposed to the first argument like R's native pipe does. Again, you may not run into that unless you're really taking this for a spin, but it is something you might wanna look into. Going back to what Claus had mentioned about why he felt Python wasn't a great language for data science, and what attributes of R do make it a great language for data science, there are four key pillars that Claus talked about: immutability, or basically the idea of keeping things static and not modifying them without guardrails, with respect to objects and results of functions and whatnot; a built-in concept of missing values; vectorization; and hooks for nonstandard evaluation.
So in the remainder of Jonathan's post, he talks about how Haskell also addresses these four pillars. First of which is immutability, and this is where it helps that Haskell is what we call a strongly typed language. That may be a term unfamiliar to you if you've only done R or Python work, but basically, what strongly typed means in this context is that it is really difficult, if not impossible, to change the types of the objects you work with unless you use the guardrails or functions that the language provides. This is where you can't really change, like, a character to a number easily, or things like that.
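A quick sketch of what that feels like in practice: mixing an Int with fractional math won't type-check until you convert explicitly (again assuming the standard Prelude):

```haskell
n :: Int
n = 3

-- n / 2 alone is a type error: (/) needs a Fractional type, and Haskell
-- won't silently coerce. fromIntegral is the explicit guardrail.
half :: Double
half = fromIntegral n / 2

main :: IO ()
main = print half -- prints 1.5
```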
And Haskell does share those traits of being a strongly typed language. There have been some recent advancements at this intersection of R and strongly typed languages to enhance R itself, such as the vapour project from John Coene (a typed take on R) and rlang's type checks, and other efforts like that. But Haskell has it all built in, so that's one nice feature. It also has a concept of missing values. You do have to, I think, tap into another module for part of it, but you can use built-in constructs that have interesting labels, such as Just or Nothing.
This one made me scratch my head a little bit, but there's an example here with a list of the numbers one to four, and then a list with Just 1, Just 2, Nothing, and Just 4. There are checks to see which one has missing values and which one doesn't, and it does detect which one has a missing value. And if you want to do a summarization of the one that just has 1, 2, and 4, you have to exclude the Nothing from that list before you can do it. There's no na.rm type of argument in Haskell that we're seeing here, but that's pretty interesting.
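A minimal sketch of that example using the Maybe machinery from base (catMaybes and isNothing live in the Data.Maybe module):

```haskell
import Data.Maybe (catMaybes, isNothing)

complete :: [Int]
complete = [1, 2, 3, 4]

withMissing :: [Maybe Int]
withMissing = [Just 1, Just 2, Nothing, Just 4]

main :: IO ()
main = do
  print (sum complete)                -- 10
  print (any isNothing withMissing)   -- True: this list has a "missing" value
  -- No na.rm here: drop the Nothings explicitly before summing.
  print (sum (catMaybes withMissing)) -- 7
```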
The other interesting thing is that while it doesn't have built-in vectorization, it does have great support for iteration via the map kind of constructs that we use from purrr and the like, but they come built into Haskell. So there are some nice examples of doing those kinds of mapping operations.
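A small sketch of that mapping style standing in for vectorized arithmetic:

```haskell
-- Where R vectorizes x^2 or x + y, Haskell reaches for map and zipWith.
squares :: [Int]
squares = map (^ 2) [1, 2, 3, 4]              -- [1,4,9,16]

pairSums :: [Int]
pairSums = zipWith (+) [1, 2, 3] [10, 20, 30] -- [11,22,33]

main :: IO ()
main = do
  print squares
  print pairSums
```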
But the part that Jonathan thinks really shows some great promise is nonstandard evaluation, where you can do a lot of interesting dplyr-like constructs with Haskell, again combining that pipe operator to do things like filtering, changing variables, or deriving variables, as they call it. And you can print out the data frame in a nice text display, not too dissimilar from how we print out a tibble in the world of R or, of course, a pandas data frame in Python. Some pretty interesting things you can do there. Now, again, the syntax is quite different; it's not like I'm going to Haskell anytime soon. But it is interesting to see that some languages you may not expect come with some built-in tricks, or can tap into their own package ecosystems, to do some data science operations. And it does sound like they're trying to beef up the resources on the use of Haskell for data science. I think Jonathan himself is contributing some small pull requests in different repositories.
And maybe the Haskell world of data science kinda takes off in 2026. Who knows? But as usual with open source, it's great to have choice and great to learn something new along the way. So if you've got some downtime in December, maybe give Haskell a spin. Who knows? Yeah. I think Jonathan
[00:30:33] Mike Thomas:
has been one, throughout the years of R Weekly, to explore different options in data science programming. I think it may have been him that explored APL,
[00:30:46] Eric Nantz:
the programming language. That's right. We covered that. That was a fun one, if I remember correctly,
[00:30:52] Mike Thomas:
which just kind of used, like, arbitrary random characters that didn't necessarily make a whole lot of logical sense, but worked just fine. And I think maybe that was the point of APL. Haskell is definitely a little easier to read and consume for me. And it sounds like a lot of ground has been broken through this DataHaskell project, as well as this dataframe package within Haskell, that allows it to be quite a bit friendlier to data scientists as well. You know, I think that familiar pipe operator is really interesting when you take a look at the documentation in terms of how it lines up with your familiarity in R. I think one of the big differences is that instead of using parentheses for function calls, you actually use a space. So in R, you would write sum(x), and in Haskell, that would just be sum x. And, yeah, as you mentioned, Eric, there's the whole immutability concept as well, for which Jonathan gave this example in the blog post: if I have an R vector with three elements that I define using the lowercase c function, I can overwrite the second element by just assigning some value to the vector with square brackets after it containing the number 2. And that's really difficult, kind of off limits, in Haskell. Not necessarily impossible, but they make it difficult to do. You have to jump through quite a few hoops to be able to update a single element within a vector.
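To make that concrete, here's a minimal sketch; setAt here is a hypothetical helper written for illustration, not a standard library function:

```haskell
-- Rebuilding a list to "change" one element; the original stays intact.
setAt :: Int -> a -> [a] -> [a]  -- hypothetical helper, zero-based index
setAt i x xs = take i xs ++ [x] ++ drop (i + 1) xs

main :: IO ()
main = do
  let v = [10, 20, 30] :: [Int]
  print (setAt 1 99 v) -- [10,99,30], a brand new list
  print v              -- [10,20,30], v itself never changed
```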
So that's one interesting difference, and I think that leads to the concepts of immutability and strongly typed languages. And Haskell also has some really strong support for handling missing values as well, which can always be this tricky thing in R, where we have plain NA and then we also have NA_character_, NA_integer_, and so on for all these different types, which can make it difficult. And if you have to deal with missingness, Haskell may potentially do a better job.
And as you mentioned, in terms of performance, it lacks this built-in vectorization; it's not an array language. But since it's compiled, it really compensates for this using some pretty nifty compiler tricks, it seems like, which is pretty cool. And it also aligns with the nonstandard evaluation type of features that we see in things like dplyr. There is actually a great article, I believe, in the docs for this Haskell dataframe library that does a comparison with dplyr: how you would filter rows using the dplyr filter function, compared to what you do in Haskell, which looks to me, on the outset, like sort of a combination between R and Python a little bit. There's an at symbol that specifies data types in the middle of the filter statement, but it's not that far off. So maybe an interesting thing to look at if you're doing Advent of Code or something like that, where you're trying to explore different edges of the data science ecosystem and trying languages that you haven't picked up before; Haskell might be a good one to take a crack at.
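We haven't reproduced the dataframe package's own filter API here (the one with the @ type annotations); instead, here's a plain-Haskell sketch of the same idea over a list of records, just to show the shape of a dplyr::filter(df, score > 5) style operation:

```haskell
import Data.Function ((&))

-- A tiny stand-in for a data frame: one record per row.
data Row = Row { name :: String, score :: Int } deriving (Show)

rows :: [Row]
rows = [Row "a" 4, Row "b" 9, Row "c" 7]

main :: IO ()
main =
  rows
    & filter (\r -> score r > 5) -- roughly dplyr::filter(df, score > 5)
    & mapM_ print                -- prints the two surviving rows
```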
Particularly now that this DataHaskell ecosystem is available, and this dataframe package has had the tires kicked on it for quite a while now, it seems like a great option. So great blog post from Jonathan. Always very interested to see how he's pushing the fringes of data science.
[00:34:49] Eric Nantz:
Yep. And speaking of pushing, there was a great write-up in the vectorization portion of his post where you can see some of the benefits of Haskell having good compilation tricks; it can actually be quite fast for things that would be challenging for R itself to run, such as operating on a vector of integers of huge length. There's a great example of just reversing the sorting of that, taking literally almost zero seconds on Haskell's side versus over four and a half seconds on the R side. So while, again, the vectorization isn't quite built in, it does have hooks and compilation that sound like they're good for performance too. So for those with the need for speed, it sounds like Haskell can get you there as well. But speaking of speed, we'd better speed right through to the end of this, because we've got other things to attend to and won't have time for additional finds this week. But, again, it's a fantastic issue put together by Sam here, and you can find it at rweekly.org, of course. That's always where you'll find the latest issue, as well as links to all the previous issues with wonderful content over the year. And, yeah, 2025 has been a fantastic year of content, I would say.
And we love hearing from you as well. You can always fact-check me on what I get wrong in my summaries here, and you can get in contact with us a few different ways. You can send us a message via the contact form, which is in the episode show notes wherever you're listening to this humble little podcast. You can also get in touch with us on social media. I am on Bluesky at @rpodcast.bsky.social. I'm also on Mastodon at @rpodcast@podcastindex.social. And I'm on LinkedIn; I'm not causing AI fluff there, I promise. You can search for my name, and you'll find me. And, Mike, where can the listeners find you? You can find me on Bluesky at @mike-thomas.bsky.social,
[00:36:55] Mike Thomas:
or you can find me on LinkedIn if you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, to find out what we're up to. Awesome stuff. Yeah, I know you've recently completed some really fun projects; we hope to hear about those somewhat soon, though time's our enemy, of course, for writing all that stuff up. And there are some new R packages on CRAN in the last month that we've published, which we're pretty excited about. Although one of them is having some issues that we need to update. The joys of CRAN. We got the CRAN email. Yep. We were good. And then the website
[00:37:27] Eric Nantz:
that the package interacts with changed. It put up a firewall. Oh. A "verify that you're a human."
[00:37:31] Mike Thomas:
Yes, a "verify that you're a human" type of page, and it broke our download.file function in our
[00:37:38] Eric Nantz:
package, just in case you've been there. If you know, you know. Yep. That's bitten me in other places many, many times. But nonetheless, it's very exciting stuff to see you contributing to open source. And with that, we will close up episode 215 of R Weekly Highlights, and we hopefully will be back with another fresh episode to help wrap up the year next week.
Hello, friends. We are back with episode 215 of the Our Weekly Highlights podcast. Sorry for missing last week, but, sometimes the the day job end of year stuff got me wrapped up in all sorts of fun stuff last week, but I'm back once again and really excited to close out the year strong here, so to speak. We got more awesome r content as was shared on this week's r weekly issue to talk to you all about. So my name is Eric Nance, and I'm delighted that you joined us from wherever you are around the world and wherever you're listening. And keeping me on, this is always is my awesome cohost, Mike Thomas, who just had to endure yet more rants in our preshow here. And, Mike, how are you doing today? Doing great, Eric. Hoping to wrap up the year strong here with the our weekly highlights heading into the holidays. It is frigid out here on the East Coast in The US, but, we'll try to keep everybody warm. That's right. I was wearing my other jacket here downstairs until we got started with this because it is a bit chilly here in the good old basement here.
We don't exactly have a huge budget here for recording here. We'll get in more than that later. But nonetheless, we can talk about some awesome stuff that doesn't need a budget, so to speak, because it comes for free for all of you on this week's our weekly issue and has been curated by Sam Palmer this week, another one of our somewhat newer curator, but he's actually been on the team for over a year, and he's been tremendously helpful as always with this issue. And he had tremendous help from our fellow rweekly team members and contributors like all of you around the world of your poll request right at our rweekly.org site. So let's get right to it because the world of artificial intelligence development of large language models moves at a breakneck pace throughout the years.
And one of the nice things that we've seen come out in the data science community to try and cut through all the fluff of this, so to speak, because believe me, there's a lot of fluff out there. Hate to bring up LinkedIn from time to time, but there's a lot of slop on there with AI so you gotta watch it there. I I kid. I kid. Not really. But one of the things that has been helpful is the recent effort launched by Pazit called the AI newsletter, which is authored by Sarah Altman and Simon Couch, both have been really involved in developing LOM technology and packages over at posit, and they give it you the straight scoop so to speak, on a lot of the newer developments and a lot of the practical issues that you may hear about but maybe want, more, you know, robust take on especially from the data science perspective.
So their most recent issue that came out, earlier in the month at the December 5, and in this issue there are a couple cool things that caught my eye. First of which, as I mentioned, these models are rapidly evolving, and there are some new releases on the Frontier providers. And I guess somewhat not surprising, but yet it's still interesting to see that from Anthropic's side of things, the new Claude Opus four dot five model has been deemed by by Sarah and and Simon here to be the best coding model available according to their benchmarks, which does kind of track from some of the things we've heard in the various community members here that are in the world of LOMs.
Seems like Claude has been one that most software developers developers in general turn to, to generate their programming code of choice. And it's getting really nice with our code as I've been playing with it a bit here and there in positron assistant. So I'll certainly be eager to try cloud Opus four dot five. And it does sound like they are working on the pricing to be a little more competitive compared to maybe some initial releases. So that's something to keep in mind too, which does remind me there's another part of the post that kinda clears up, which can be a confusion for those new to the world of Frontier l o m providers.
There's often two ways to pay for these services. One of which is that you literally sign up for an account on these services and pay a monthly fee. You may pay it yearly, but it's a static fee. But then there's also the API token method where you pay for what you use in terms of API tokens. In general, especially if you're new to this, a lot of folks, myself included, recommend you go the API token route as you're just getting your feet wet with things. And it maybe turns to the more rigorous, account, you know, account subscription for times that you really need it. But there's lots of great information in the post on where you can learn more about those those, choices.
The other model that got talked about here is Google's recent update to Gemini, which I know a lot of people have been using because I believe Gemini is actually free to start with, to get your feet wet, but they have a new release called nano banana pro. Who knows what they use to name these things, but I'm sure that generates some fun images. But what was interesting in this post here is that they took a little bit of a challenge here and saw that it might actually be helpful in terms of being an image model, that's what this one is, that can generate interesting and at least semi clear technical diagrams of software.
So in the blog post, you'll see when you when you click on this after listening that they asked it to create an infographic explaining how the Elmer R package works, and it looks like a pretty solid graph. Certainly, there may be some questionable choices on where the arrows are going in terms of their layout, but I could see this being kinda neat. If you don't wanna code up something in those, open source kind of diagramming software or the other diagramming software offer that or software that's proprietary, maybe an LOM can ingest kinda your overall code base and help you do that. So I'll I'll take that note of that for future developments and the like.
But, yeah, there's definitely more highlights here in terms of other, interesting developments. The other one I'll highlight before I turn it over to Mike here is that they have been, deposit teams and regularly benchmarking how well the LMs are generating our code because there are some nice packages that they've been developing, Simon especially, such as the vitals package, and then using kind of a built in, you might call test case called r, a r e, to run on all these models and see how well it performs. And now instead of just relying on the blog post to come every so often to summarize performance, they have a fun little shiny app that we'll link to in the show notes that compares the model performance, and you can choose as many as are offered here and there's there's quite a bit offered here. But when I looked at it earlier this morning preparing the show notes here, it definitely shows that Claude Opus is the leader in terms of getting the percent correct in cogeneration based on the truth, so to speak, and this are, you know, scenario followed by GPT five and then SONNET four dot five and the like. But you can you can add in, like, the older models, and you can see the evolution of how how well or maybe not so well some of these models are performing. So great little tool to bookmark, not just for performance, but also the ratio of cost to performance and getting the nitty gritty on pricing details, which if you're in an organization, might be really important or just important to you, so to speak, if you're paying out of pocket for this.
So really nice app here. Straight to the point. Nicely done by b s lib and the like. So definitely bookmark that if you're new to this space. Yeah. I always enjoy reading this when it comes out every few weeks or so, but, Mike, what did you think about the roundup here in this newsletter?
[00:08:31] Mike Thomas:
Yeah. The the end of last month, we saw a ton of activity in sort of the closed source, you know, frontier model updates that Gemini three pro update, I think, kinda rock the world. I think it sort of threw anthropic and and claw and, ChatGPT, maybe OpenAI for a minute, but I think, that their responses between GPT 5.1 pro and now CloudOpus 4.5 have been pretty strong as well. And I've also been experimenting with both a little bit and kinda concur that CloudOpus 4.5 is is the best for what we're doing day in and day out with our coding, and quarto type of stuff.
The some of the stuff that I was experimenting with Gemini three Pro was good. I think maybe, like, a little too verbose, a little too research y, and a little less targeted, but I think that there's probably some, fantastic applications as well. And the pricing model changes is is pretty interesting. It sounds like, Anthropic actually increased the price of, like, their smaller haiku model and decreased the price of of Opus, kind of causing the the cost to converge towards the the balanced SONNET, other sort of pricing model that they have. So it's it's interesting that the the the highest, performing, supposedly, model with Opus is now sort of one of the the cheaper sides, and it seems like that's gonna be a bait and switch, like, just to get you in. Right? And then then ramp up the cost eventually once you get hooked. But we're we're pretty hooked on on Opus, so they'll probably have us for a while.
And the one thing I I haven't explored, but glad the Paza team did in terms of Sam and and Sarah in this this article, or Simon and and Sarah, excuse me, is explore sort of these image models as well and get into multimodality here with Gemini's Nano Banana Pro. And one of the things I think that they found interesting compared to every other image model that I've seen thus far is it's not necessarily throwing out images that have, like, junk text in them and Wingdings and stuff that looks like it's from another language. I saw something going around on LinkedIn where or Blue Sky maybe, where there was an image in a research paper that was published in the prolific science magazine Nature.
I think. That was like a total AI junk slop, has since been retracted. I think it was a research paper on autism, pretty pretty serious stuff. And, been retracted because it was, I think, found to be just, you know, half AI generated or at least the the diagrams were. And if they use Nano Banana Pro, they're maybe their junk science wouldn't have been discovered or at least not as
[00:11:24] Eric Nantz:
quickly. So I don't know if it's It makes me both happy and sad at the same time. I don't know.
[00:11:29] Mike Thomas:
Yes. Exactly. But, Sarah and Simon did find, you know, an interesting use case for these nano for the nano Banana Pro, image generation models and their data science workflows in terms of generating, co what they're calling coherent, sometimes, technical diagrams, which in a lot of our repositories and a lot of our work and documentation that that we do at Catchbook, you know, we're almost always creating workflow diagrams, technical diagrams that are accompanied with the code and the software that we're creating. And, you know, there's great tools out there like, Mermaid JS and things like that, Excalidraw, that allow us to do those things, but it it does take some time. So if there's the potential for, you know, Nano Banana Pro to be able to speed that process up or maybe even just help us iterate, that's pretty interesting.
And the other part of, I think the article here that they shared in the the newsletter is some additional updates to our packages, in terms of the vitals package, now has a o dot 2.o release, so still very early in its life, but that's a really cool package for doing evals and evaluating LLM tools using Elmer. So o o dot two dot o is now on CRAN as well. And, yeah, as you mentioned to Eric, that Shiny app that they have developed to track performance, in terms of R code generation and the accuracy of R code generation across these new Frontier models is really, really cool.
I think it draws data, from using the the vitals package as well, and they I know they try to update it regularly. That's something that I'd seen Sharon Machlis do over on Blue Sky and not a lot of other folks trying to do, at least in terms of our benchmarking. So for the, those of us that are still developing in R and haven't switched everything over to Python, this is fantastic because most of the benchmarking that I've seen thus far in terms of code generation is just around Python. And we've been using or or I've been using the API model, pricing model for all of these services for a while, but I think it's it's probably time that I consider maybe the all in monthly cost because that may be beneficial because I have a we have caps on most of our APIs where, I get a notification every time we use $5.
And that that $5 notification is starting to to ding, like, every other day or every every three days. So probably time to consider that all in monthly cost, but it was a great way to or it has been a great way to dip our feet into trying to leverage these in our workflows.
[00:14:10] Eric Nantz:
Yeah. I think, especially when you get into more of the agentic side of things and you're not just doing simple code completions or simple answers to cryptic JavaScript or XML based parsing that I've done in the past, I know people are trying to build I heard a story from another colleague at a different company. Just as a challenge, try to build like a like an internal version of a Stack Overflow. And it worked pretty well, but I don't think just the API usage was kinda got it. It was very much an agentic thing, but that just shows you just how far people are pushing these things. And like I mentioned before, generating full blown Linux distribution configuration files for it in a couple hours versus, like, five weeks it might take somebody new to roll the nicks or whatnot. So that's it's this agentic flow is happening, and I'm hearing code and cursor being used quite a bit in those situations.
But I will say back on the API front, if you want something where you can kind of pick and choose the models, but in, like, one overall platform, OpenRouter has been really helpful to me. It helps you, you know, pick and choose between, say, you know, OpenAI's models, Anthropics, Google's, and others. And then you just basically give your service account credentials for these API keys into OpenRouter, and they'll take care of the rest. So some tools can benefit from that where you just wanna be you don't wanna have to hop back and forward to, like, five or six different services. You wanna put it all in one place. There's not this is not meant to be free advertising. I'm just saying it actually works pretty well. So you might wanna check that out. If you're using more than a handful of these at a time, it can be a lot to manage, especially someone new like me. So, yeah, lots of great developments here. And, again, really great to see this Elmer ecosystem just grow exponentially.
We learned a lot more about this in recent conferences I've been a part of, both the, Gen AI day for our pharma that we have the recordings on YouTube and some great content that will be on YouTube soon from our recent r pharma conference in respect to how we're leveraging Elmer and the like. So good times to be had. And speaking of good times, jeez, Mike. There's a little competition out there in the world of podcasting, apparently, but, no, it's all in good fun, of course, because our friends at Posit have been hard at work in developing some novel podcast content called the test set. Sounds like you've had a chance to listen to a few of these, and maybe you could take us through what your impressions are of of the test set.
[00:17:09] Mike Thomas:
I have. I think they have a higher budget than we have, Eric. They've got some really nice backgrounds, and it's very well produced. It's great. I think originally, it sort of started out with, Michael Chow as the host, and now they've brought Wes McKinney in as a co host on most of the most recent episodes, I should say. And they've been fantastic. You know, really great deep dives into the backgrounds of many sort of prominent figures in data science across both both R and Python. Folks like Julia Silgi, you would know, James Blair who works at Posit as as an engineer there, and, Kelly Baldwin most recently, who I believe is a a professor of statistics, in data science at Cal Poly, which is really interesting to get her perspective on how she teaches and, sort of different strategies that she has as well as her background coming up through data science. So I have really enjoyed it thus far. It sounds like they have some pretty exciting guests on the roadmap for 2026 as well, folks from DBT, Shopify, Astral, Mode, Meta, and and a bunch more. So I can't get enough of, which maybe is unhealthy listening to data science podcasts after I'm done working for the day and walking the dog and mowing the lawn on the weekends.
But I have long been bit by the data science bug and started out probably listening to the, original R podcast by my cohost here as well as Shiny developer series and and just ingesting as much as I possibly can. So I really enjoy these conversations. It's as a way, maybe for somebody who works kind of fully remote in my team. It's it's a way to connect with the greater, you know, data science ecosystem and and folks that are out there, Greater data science community, hear what everybody's thinking about, interested in, and and working on. So I appreciate posit making the investment to put this together and and produce it and share it with the world.
[00:19:16] Eric Nantz:
Yep. It is on my catalog, on my little phone here to catch up on the back catalog. I did listen to a little bit of, one of the episodes earlier this week with, Minh, Chitchennai, Rundell, on her experiences with teaching in the world of data science, especially in the advent of LOMs. And she has some interesting takes on how she's leveraging the novel technology to make running a course easier, but also kind of the real fundamental issues with students now growing up in this new age, which is almost like, you know, when old timers like me were growing up with the Internet just starting in high school and college and how how that revolutionized my my my workflow there. But it's a but, yeah, great perspectives, great, you know, lineup of guests.
So like I said, it must be nice to have a positive budget instead of a pseudo negative one like what we have here, but, hey, fair play, and then they got the right minds behind it. So I'll definitely have that in my podcaster of choice, feeder of choice because it is both on YouTube and in your favorite podcast provider. Certainly, I would say definitely audio friendly. I think like you said, Mike, I'm my consumption of content in the world of data science and open source software development, I am often doing something else while I listen. And then the challenge is when I hear this great insight of, like, being able to time it, hit pause, write it down, or at least jot down a link to follow-up on. But there are some creative things I'm sure in that space. But, yeah, test sets, going on strong and look forward to seeing what else they have in their in their, episode pipeline.
[00:20:57] Mike Thomas:
Good podcast name too.
[00:21:00] Eric Nantz:
Yes. Naming things is hard. I can but I it was easy back then because there was no R podcast before, so I got the easy one. Better than that. Yes. I'm gonna get better than that. I someday, I tell you, I'm gonna get that one back up and running again. I've got I've got plans, buddy. Got plans. So I know, Mike, you've been talking, especially in your recent conference talk about, you know, bilingual type of data science, especially from the R and Python side of things. And there was a post a few weeks ago by, a prominent member of the r community, Klaus Wilk.
Definitely caused a bit of a stir about his take on why Python was not such a great language for data science. Certainly a mix of, I would say, things I nodded my head on and some things I'm like, yeah. That sounds like just a bad experience and maybe with a better environment management or things like that, it might have gotten better. But it definitely got a lot of people talking, a lot of people thinking. And our last highlight here is definitely from one of the leading thinkers in the space of alternative ways to do data science, John Carroll.
He was inspired by Claus's post to look at a language that you may, on the surface, not expect to be great for data science workflows. But in Jonathan's mind, it's actually come a pretty long way. What language are we talking about here? Haskell. Haskell definitely straddles the line between a lower level compiled language and something with more interpreted-feeling components to it. If my memory serves me correctly, I believe Haskell is the language behind the Pandoc utility, which is what we use quite a bit as the back end converter system going from, say, Markdown over to HTML or to word-processor formats and whatnot.
I could be wrong on that; someone may have to fact-check me. I'm sure they will. But that's where I first heard of Haskell. I do know that there are a lot of different use cases for it. And up until recently, Jonathan has been using Haskell for a lot of interesting Advent of Code challenges, but he has seen some interesting traction in the world of Haskell with respect to data science, such as the DataHaskell project, which is meant to be a curated organization of various packages in the Haskell ecosystem, plus tutorials and, in the future, learning environments to get your journey started on the right foot with Haskell for data science.
Now, there are some key differences between Haskell and the languages we're familiar with, such as R or Python, that are definitely worth taking note of, especially since the syntax would take some learning if you're new to it. One interesting one is that we're so used to using parentheses to feed arguments into a function: the name of the function, and then the arguments within the parentheses. No parentheses in Haskell. It's just spaces. So that takes a little getting used to when you're reading it for the first time, even for a simple sum function or whatnot.
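To make that concrete, here's a minimal sketch of what space-based function application looks like next to the R equivalent. This is plain Prelude Haskell, nothing package-specific:

```haskell
-- R:       sum(c(1, 2, 3, 4))
-- Haskell: function application is just a space; parentheses only group.
main :: IO ()
main = do
  print (sum [1, 2, 3, 4])               -- 10
  print (maximum (map (* 2) [1, 2, 3]))  -- 6; the parens just group the inner call
```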
But there are some things that are kind of similar, such as the concept of a list, though lists in Haskell are kind of more like vectors in R, where every element needs to be the same type. You can't mix and match, say, an integer and a character in a Haskell list. They must be a single type. But one interesting thing is that, ever since a couple of years ago, R has had the native pipe operator, obviously inspired by magrittr and other packages. Haskell has a pipe of its own too. So if you're used to pipe workflows, more power to you. You can use it.
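A quick sketch of that homogeneity rule, again just base Haskell:

```haskell
main :: IO ()
main = do
  let xs = [1, 2, 3] :: [Int]  -- fine: every element is an Int
  print xs
  -- let ys = [1, "a"]         -- won't compile: a Haskell list can't mix
  --                           -- Int and String, unlike R's list(1, "a")
```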
There are some nuances, though. As you go from left to right, it'll pass the left side of the pipe chain in as the last argument of whatever's on the right side, as opposed to the first argument like R's pipe. You may not run into that unless you're really taking it for a spin, but it is something you might want to look into. But going back to what Claus had mentioned about why he felt Python wasn't a great language for data science and what attributes of R do make it a great one, there are four key pillars that Claus talked about. First was immutability, or basically the idea of keeping things static and not modifying objects and results of functions without guardrails; then a built-in concept of missing values; vectorization; and hooks for nonstandard evaluation.
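In base Haskell, the usual pipe-like operator is `&` from Data.Function, where `x & f` simply means `f x`; because Haskell functions are curried, the piped value effectively lands in the last argument slot, which is the nuance described above. A minimal sketch:

```haskell
import Data.Function ((&))  -- x & f  means  f x

main :: IO ()
main =
  [1 .. 10]
    & filter even  -- keep the evens: [2,4,6,8,10]
    & map (* 3)    -- triple them:    [6,12,18,24,30]
    & sum          -- 90
    & print
```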
So in the remainder of Jonathan's post, he talks about how Haskell also addresses these four pillars. The first is immutability, and this is where Haskell being what we call a strongly typed language comes in. That may be a term that's not familiar to you if you've only done R or Python work, but basically, what strongly typed means in this context is that it is really difficult, if not impossible, to change the types of the objects you work with unless you use the guardrails or functions that the language provides. You can't just casually change, say, a character to a number or things like that.
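Here's a small sketch of that guardrail idea in plain Haskell; the conversion functions shown, show and fromIntegral, are standard Prelude:

```haskell
main :: IO ()
main = do
  let n = 5 :: Int
  -- putStrLn ("count: " ++ n)      -- won't compile: an Int is not a String
  putStrLn ("count: " ++ show n)    -- conversions must be explicit, via show
  let x = fromIntegral n :: Double  -- same story for Int -> Double
  print (x / 2)                     -- 2.5
```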
And Haskell shares those traits of a strongly typed language. There have been some recent advancements at this intersection of R and strongly typed languages to enhance R itself, such as the vapour framework from John Coene, a typed take on R, plus rlang's type checks and other efforts like that. But Haskell has it all built in, so that's one nice feature. It also has a concept of missing values: you can use built-in constructs with interesting labels, Just or Nothing, which come from Haskell's Maybe type.
This one made me scratch my head a little bit, but there's an example here of a list of the numbers one to four, and then a list with Just 1, Just 2, Nothing, and Just 4. There are checks to see which list has missing values and which one doesn't, and it does detect which one has a missing value. And if you want to do a summarization of the one that just has 1, 2, and 4, you have to exclude the Nothing from the list before you can do it. There's no na.rm type of argument in Haskell that we're seeing here, but that's pretty interesting.
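A minimal sketch of that pattern using the standard Maybe type; catMaybes, from Data.Maybe in base, is one common way to drop the Nothings before summing, though the post may do it differently:

```haskell
import Data.Maybe (catMaybes)

vals :: [Maybe Int]
vals = [Just 1, Just 2, Nothing, Just 4]  -- Nothing plays the role of NA

main :: IO ()
main = do
  print (Nothing `elem` vals)   -- True: the list contains a "missing" value
  print (sum (catMaybes vals))  -- 7: drop the Nothings first, then sum
```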
The other interesting thing is that while Haskell doesn't have built-in vectorization, it does have great support for iteration via the map kind of constructs we use from purrr and the like, except they come built in to Haskell. So there are some nice examples of doing those kinds of mapping operations. But the part that Jonathan thinks really shows some great promise is nonstandard evaluation, where you can do a lot of interesting dplyr-like constructs in Haskell, combining that pipe operator to do things like filtering and changing variables, or deriving variables as they call it. And you can print out the data frame in a nice text display, not too dissimilar to how we'd print out a tibble in the world of R or, of course, a pandas data frame in Python.
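The built-in mapping piece is plain Prelude; the dplyr-like data frame syntax lives in the dataframe package, whose exact API is best checked in its own docs, so this sketch only shows the mapping part:

```haskell
-- R (purrr): purrr::map_dbl(1:4, \(x) x^2)
main :: IO ()
main = print (map (^ 2) [1 .. 4 :: Int])  -- [1,4,9,16]
```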
Some pretty interesting things you can do there. Now, again, the syntax is quite different. It's not like I'm moving to Haskell anytime soon, but it is interesting to see that some languages you may not expect come with some built-in tricks, or can tap into their own package ecosystem, for data science operations. And it does sound like they're trying to beef up the resources on using Haskell for data science. I think Jonathan himself is contributing some small pull requests to different repositories.
And maybe the Haskell world of data science kind of takes off in 2026. Who knows? But as usual with open source, it's great to have choice and great to learn something new along the way. So if you've got some downtime in December, maybe give Haskell a spin. Who knows? Yeah. I think Jonathan
[00:30:33] Mike Thomas:
has been one, throughout the years of R Weekly, to explore different options in data science programming. I think it may have been him that explored APL,
[00:30:46] Eric Nantz:
a programming language. That's right. We covered that. That was a fun one, if I remember correctly,
[00:30:52] Mike Thomas:
which just kind of used arbitrary, random characters that didn't necessarily make a whole lot of logical sense, but worked just fine. And I think maybe that was the point of APL. Haskell is definitely a little easier to read and consume for me. And it sounds like a lot of ground has been broken through this DataHaskell project, as well as this dataframe package within Haskell, which allows it to be quite a bit friendlier to data scientists. You know, I think that familiar pipe operator is really interesting when you take a look at the documentation in terms of how it lines up with your familiarity in R. I think one of the big differences is that instead of using parentheses for function calls, you actually use a space. So in R, you would do sum, open paren, x, close paren, and in Haskell, that would just be sum, space, x. And, yeah, as you mentioned, Eric, there's the whole immutability concept as well. Jonathan gave an example in the blog post: if I have an R vector with three elements that I define using, you know, the lowercase c function, I can overwrite the second element by just assigning some value to the vector with square brackets after it that have the number 2 in them. And that's really difficult, kind of off limits, in Haskell. Not necessarily impossible, but they make it difficult. You have to go through quite a few hoops to be able to update a single element within a vector.
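A sketch of what those hoops look like in plain Haskell; setAt here is a hypothetical helper, not a standard function, and the point is that it builds a brand-new list rather than mutating the old one:

```haskell
-- setAt is a hypothetical helper: replace the element at index i.
setAt :: Int -> a -> [a] -> [a]
setAt i x xs = take i xs ++ [x] ++ drop (i + 1) xs

main :: IO ()
main = do
  let v  = [10, 20, 30] :: [Int]  -- in R: v <- c(10, 20, 30); v[2] <- 99
      v' = setAt 1 99 v           -- builds a new list
  print v   -- [10,20,30], the original is untouched
  print v'  -- [10,99,30]
```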
So that's one interesting difference, and I think it leads to the concept of immutability and strongly typed languages. And Haskell also has some really strong support for handling missing values, which can always be this tricky thing in R, where we have plain NA and then also NA_character_, NA_integer_, and so on for all the different types, which can make things difficult. So if you have to deal with missingness, Haskell may potentially do a better job.
And as you mentioned, in terms of performance, it lacks this built-in vectorization. It's not an array language. But since it's compiled, it really compensates using some pretty nifty compiler tricks, it seems like, which is pretty cool. And it also has analogues to the nonstandard evaluation features that we see in things like dplyr. There's actually a great article, I believe in the docs for this Haskell dataframe library, that compares how you would filter rows using the dplyr filter function to what you'd do in Haskell, which to me looks, just on the outset, like sort of a combination between R and Python. You know, there's an @ symbol that specifies data types in the middle of the filter statement, but it's not that far off. So it may be an interesting thing to look at if you're someone doing Advent of Code or something like that, where you're trying to explore different edges of the data science ecosystem and try languages you haven't picked up before. Haskell might be a good one to take a crack at.
Particularly now that this DataHaskell ecosystem is available and this dataframe package has had the tires kicked on it for quite a while now, it seems like a great option. So, great blog post from Jonathan. Always very interested to see how he's pushing the fringes of data science.
[00:34:49] Eric Nantz:
Yep. And speaking of pushing, there was a great write-up in the vectorization portion of his post where you can see some of the benefits of Haskell: it has some good compilation tricks, and it can actually be quite fast at things that would be challenging for R itself to run, such as operating on a huge sequence of integers. There's a great example of reversing and sorting a sequence like that, taking almost literally zero seconds on Haskell's side versus over four and a half seconds on the R side. So while, again, vectorization isn't built in, the compilation hooks do sound like they're good for performance too. For those with the need for speed, it sounds like Haskell can get you there as well. But speaking of speed, we'd better speed right through to the end of this, because we've got other things to attend to, so we won't have time for additional finds. But, again, fantastic issue put together by Sam here, and you can find it at rweekly.org, of course. That's always where you find the latest issue, as well as links to all the previous issues with wonderful content over the year. And, yeah, 2025 has been a fantastic year of content, I would say.
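For reference, here's a rough analogue of the kind of benchmark described above; the exact code and timings are in Jonathan's post, and this only shows the shape of the operation:

```haskell
import Data.List (sort)

-- Build a large descending sequence, sort it, and force the result.
main :: IO ()
main = print (last (sort [n, n - 1 .. 1]))
  where
    n = 1000000 :: Int
```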
And we love hearing from you as well. You can always fact-check me on what I get wrong in my summaries here, and you can get in contact with us a few different ways. You can send us a contact submission form, which is linked in the episode show notes wherever you're listening to this humble little podcast. You can also get in touch with us on these social media outlets out there. I am on Bluesky at @rpodcast.bsky.social. I'm also on Mastodon at @rpodcast@podcastindex.social. And I'm on LinkedIn; I'm not one for the AI fluff, but you can search for my name, and you'll find me there. And, Mike, where can the listeners find you? You can find me on Bluesky at @mike-thomas.bsky.social,
[00:36:55] Mike Thomas:
or you can find me on LinkedIn if you search Ketchbrook Analytics, K E T C H B R O O K, to find out what we're up to. Awesome stuff. Yeah. I know you've recently completed some really fun projects. We hope to hear about those somewhat soon, but time's our enemy, of course, when it comes to writing all that stuff up. And there are some new R packages on CRAN in the last month that we've published, which we're pretty excited about. Although one is having some issues that we need to update. The joys of CRAN. We got the CRAN email. Yep. We were good. And then the website,
[00:37:27] Eric Nantz:
that the package interacts with changed. Put up a firewall. Oh. Verify that you're a human.
[00:37:31] Mike Thomas:
A verify-that-you're-a-human type of page, and it broke the download.file() call in our package.
[00:37:38] Eric Nantz:
Just in case you've been there: if you know, you know. Yep. That's bitten me in other places many, many times. But nonetheless, very exciting stuff to see you contributing to open source. And with that, we will close up episode 215 of R Weekly Highlights, and we will hopefully be back with another fresh episode to help wrap up the year next week.
Episode Wrapup