It's been far too long since our last episode of R Weekly Highlights, but we are finally back with episode 207! In this episode we learn about novel ways to automate fancy Quarto content, how we can be on our best behavior with behavior-driven development, and finding that pesky portion of data breaking long data pipelines with a magical debugging technique. Plus one of your hosts could not resist a hot take or two!
Episode Links
- This week's curator: Ryo Nakagawara - @[email protected] (Mastodon) & @rbyryo.bsky.social (Bluesky) & @R_by_Ryo (X/Twitter)
- Generating quarto syntax within R
- An Introduction to Behavior-Driven Development in R
- Dive()ing into the hunt #rstats
- Entire issue available at rweekly.org/2025-W28
- {quartose}: Dynamically Generate Quarto Syntax https://quartose.djnavarro.net/
- Cucumber: write automated tests in plain language https://cucumber.io/
- {cucumber}: An implementation of the Cucumber testing framework in R https://jakubsobolewski.com/cucumber/
- Shiny App-Packages https://mjfrigaard.github.io/shiny-app-pkgs/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Chillin' with the Bros - Super Smash Bros - Jamphibious - https://ocremix.org/remix/OCR03072
- Flames of Darkness - Mega Man ZX - Vyper - https://ocremix.org/remix/OCR01966
[00:00:03]
Eric Nantz:
Hello, friends. Oh my goodness. It's great to be back with episode 207 of the R Weekly Highlights podcast. And I do admit this layoff has been a lot longer than intended, but we're happy to be back here. And if you're new to the show, this is what usually is the weekly podcast where we talk about the great resources that have been shared in the highlights section at rweekly.org, along with so much more of our various adventures in the world of R and open source. My name is Eric Nantz, and, yeah, as I said, the layoff was definitely longer than intended because of a combination of company mandated vacation and, you know, other business to resolve, but I am back here keeping things together, at least somewhat, piece by piece. But keeping things together or not, I would never wanna do an episode alone after a layoff.
And back with a vengeance is my awesome cohost, Mike Thomas. Mike, how are you doing? And more importantly, how are you sounding today?
[00:01:00] Mike Thomas:
Sounding much better. And to be honest, Eric, when I hadn't heard from you for a couple weeks after my prior audio issues in our most recent podcast, I thought I might have been fired. So I am glad to hear that that's not the case and to have gotten back in touch last week and finally been able to get back on the mic with you this week. So I'm super excited. Hopefully, I remember how to do this.
[00:01:22] Eric Nantz:
You and me both. And, as I was telling you in the preshow, you will know firsthand if you're getting the pink slip, so to speak. But, no, you are very safe here. This is always one of my makeshift therapy sessions every week after all the chaos I go through in my dev life. But nonetheless, we got lots of great things to talk about here. And as you well know if you've listened before, R Weekly is a volunteer effort, and every week we have a rotating curator. This past issue that we're talking about today has been curated by Ryo Nakagawara, another one of our OG curators on the team. And as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world, with your pull requests and other wonderful suggestions.
So in the midst of this layoff, I have been, you know, putting my hand in various things, and one of them is definitely writing a lot of documentation, and not just booting up a static Word document. I always go to one of my favorite newer tools in the ecosystem, and that is Quarto, for generating scientific, you know, documentation that leverages R, even Python from time to time. And that's been really wonderful for me to keep showcasing a lot of the new capabilities I'm developing. But where Quarto, and R Markdown before that, got their major, you might say, footing and, you know, future use cases has been the world of data analysis, to be able to make reproducible research. Right? So you don't have to copy and paste all those tables and graphs that you're making in R or Python or whatever else. It's all automatically inserted in.
And with the HTML format, you can do a lot of fun things. I have been pushing the envelope quite a bit. I know, Mike, you have as well with your websites and other documentation you're making. But there are times that you wanna kinda merge those concepts together and really make a novel, you know, user experience for that web based report, but it's based on maybe different levels of data or different, you know, datasets that are all getting a similar analysis. So how can you blend these two worlds together and take advantage of, you know, Quarto and R's built-in automation?
So our first highlight today is coming from Danielle Navarro, who has been at the forefront of a lot of things in reproducible research, as well as some really awesome artwork and, you know, generative art, if you ever want to check her blog out for that. But in this post here, Danielle talks about a recent, you know, situation, which was inspired by a data analysis where they had to repeat many different types of plots or tables across different partitions of the data. So in the example that Danielle comes up with here, they leverage the babynames R package as a fun way to look at all the different names out there, kind of looking at the different permutations of the name that they have used throughout their life, going from Dan to Danny, different spellings of Danny, all the way to Danielle.
So first off the bat, in Quarto, they have created what's called a tabset of this visualization. It looks like a histogram over the years in the babynames dataset, color coded by sex, of the frequency of that name. But right off the bat in Quarto, you get a nice tab based, you know, layout with each tab being a different name, and that is a function that we'll get to in a little bit, how they built that. But that's great right there, and it doesn't stop with plots. You can do that with tables as well, or even just print out the data. And then this is where things get fancy. Maybe you don't wanna be limited just to the Quarto tabset feature.
You wanna construct this report and take advantage of some nice features in HTML, like putting text in the sidebar or the margin of the report. So there are some other functions you can create to leverage some more HTML features, like putting custom divs, which are like unique blocks of HTML constructs, but they're uniquely identified, where you could fit in some margin text with this. You could also fit in these nice little callout boxes in the margin itself if you wanna put it that way. And throughout, there is an example function that takes the content of this block of text, if you will, along with the class and any separators if there are multiple pieces of content.
And this is pretty interesting how you can do this, but how did Danielle pull this off? Well, they have created a new R package, mostly scratching their own itch here, called quartose. I hope I'm pronouncing that right. But the motivation was, again, coming from a reporting perspective with all these different groups or partitions of data. You don't wanna have to manually repeat these chunks that you're putting into this HTML content. So this package has a way, via code chunks, to automate and iterate through these many different groups or many rows and be able to put this HTML based content in dynamically and have it write the Quarto syntax for you. So instead of you having to write, like, the three colons, the curly bracket, and then the ID or the CSS attribute or whatever the construct is you're doing in Quarto, and then putting the text in manually, it's gonna use R to create these code chunks for you.
This is not a new concept in and of itself, because I remember my early days of R Markdown and knitr. I would very much use this same technique in my biomarker data reports, where I might have 200 protein biomarkers that all get a visualization or maybe a bolded list of key features of that biomarker. And I would use R and knitr to automate, say, making a bolded list of these key features and just dynamically put it into text via, you know, back then, just paste, or now glue or things like that. So this is basically taking that to Quarto, but now giving these additional nice HTML features that Quarto has, like these custom margin blocks and the tabsets, and putting it all in directly.
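For anyone who hasn't seen this trick, here is a minimal sketch of the knitr-era pattern Eric describes. The biomarker names and bullet text are made up for illustration, and quartose wraps a friendlier API around the same underlying idea:

```r
#| results: asis
# Run inside an R chunk of a Quarto (or R Markdown) document. The "asis" option
# tells knitr to treat whatever we cat() out as raw markdown, not console output.
biomarkers <- list(
  "Biomarker A" = c("elevated at baseline", "strong dose-response trend"),
  "Biomarker B" = c("stable across visits")
)

for (nm in names(biomarkers)) {
  cat("## ", nm, "\n\n", sep = "")                        # one section heading per biomarker
  cat(paste0("- **", biomarkers[[nm]], "**"), sep = "\n")  # a bolded bullet per key feature
  cat("\n\n")
}
```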
Now, Danielle is upfront at the end here, saying they're not sure if this is gonna be useful to other people, but it's useful to them right now. And perhaps you might be able to learn something from it if you've been struggling with repeating similar analytical or data displays in your Quarto doc without having to manually type that in every time. I think there's a lot of potential here, and I may definitely take inspiration from this for my next data driven report that I make with Quarto. And, no, it's not gonna be in Microsoft Word, because HTML is the new hotness, although it shouldn't be that new to anybody listening to this show.
Mike, what did you think about Danielle's exploration here?
[00:08:45] Mike Thomas:
Calling HTML, like, something new is, you know, sort of funny, sort of silly, given how old the language itself is, but I totally agree with your sentiment. So we do a lot of reporting, and I do a lot of reporting on a day to day basis. And lots of times this involves sort of a traditional data analysis approach to maybe building the same chart, a time series, for example, for each independent variable in your dataset, if you're doing, like, a machine learning problem. And you could hard code this in your Quarto document and have a section that says, you know, variable one, and then have your plot. And then, you know, do your two hashtag pound signs for your next section and just say variable two and show your next plot. But if you are trying to do something that's much more dynamic than that, if your dataset changes and you don't wanna have to change your code and those variable names or anything like that, a great trick that's existed sort of since the R Markdown days and has been adopted in Quarto is this output "asis" capability, which allows you to essentially put raw markdown or HTML in your chunk, and it will actually render that nicely in your report.
And you can wrap all of that in a for loop or take a list and stick it into a purrr statement. So I have to imagine what's going on under the hood here in the quartose package that Danielle has put together is quite a bit of that, probably: some output "asis" type of functionality, list type functionality, to be able to render these lists in a dynamic way that spits out sort of multiple sections in your report or, you know, these multiple tabsets. And that's something that, if you've ever done it, is very powerful, but it does take a good amount of code to pull that off. (Eric: Right.) I think what Danielle has done here is provide us with some really nice helper wrapper functions around that type of syntax to make this way easier and way less verbose than what I, you know, have typically written in the past and what you may have typically written in the past to accomplish that same thing.
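As a rough illustration of what Mike is describing, here is raw Quarto markdown generated from a loop, building a tabset with one small table per group. The grouping variable and table are arbitrary, and quartose presumably hides this plumbing behind its helper functions:

```r
#| results: asis
# Emit a Quarto tabset with one tab (and one small table) per group.
# A purrr::walk() over the groups would work the same way as this for loop.
groups <- split(mtcars, mtcars$cyl)

cat("::: {.panel-tabset}\n\n")
for (grp in names(groups)) {
  cat("## ", grp, " cylinders\n\n", sep = "")
  cat(knitr::kable(head(groups[[grp]], 3)), sep = "\n")  # kable() returns markdown lines
  cat("\n\n")
}
cat(":::\n")
```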
So very excited about this. I think it's gonna be a great helper set of tools and little functions. It's almost like an internal package that I would have loved to have written for myself, if you've ever, you know, written yourself a toy package, maybe at the day job, to do things that you do on a day to day basis but you're not sure necessarily needs to see the light of day on CRAN or anything like that. In this case, I'm very glad that Danielle put this out there in the world. I'm excited to use it. Another thing, while we're on, like, the purrr topic, that I saw just as a tangent: I think Charlie Gao has done quite a bit of work, and there's a new release of purrr, maybe today, that came out and allows for much more parallel processing with purrr. So use all those cores on your machine and see how fast it can go. But really excited to see this quartose R package and to be able to start using it in my own reporting.
[00:12:00] Eric Nantz:
Yeah. There's a lot of wonderful concepts here at play that, until you really get tired of manually coding this up, you may not fully appreciate. But I know I've been there in my early days as a data scientist, kind of, in this role on the trials and these biomarkers. Yeah, I got so tired of having to manually code this up every single time. I would port over functions from, like, one report to another. And, yes, I should have made a package out of that. But back then, I was like, at least I got something. But, yeah, I can take a lot of inspiration from quartose here. And in the package write-up, Danielle notes that there are actually other packages in this space a little bit.
One, I believe, is called quartabs, by Sasaki, if I'm pronouncing that right, as well as qreport by Dr. Frank Harrell himself, who has been one of the more progressive adopters of knitr and R Markdown and now Quarto. So there's definitely an appetite for this kind of reusable and very, very convenient set of functions to automate the process of entering these code blocks, or these output blocks, I should say, leveraging R itself. So a great resource here, and I'm definitely gonna dive into this, like I said, for my next reporting adventures. And, yeah, on the purrr topic, goodness gracious. I was an early adopter of the furrr package when that first came out, and it did an immense job of at least giving me that taste of multicore kind of goodness with purrr. But now that Charlie Gao is working at Posit full time, I sense big things for the world of iteration and taking advantage of multiple architectures for your multiprocessing and now, frankly, HPC components. So watch this space. It's gonna get real fun real fast.
Now we're gonna go to a pretty big departure from what we just talked about, more of a conceptual exercise, but one that I'm definitely curious about, though I might need to think about it a bit more before I start adopting it into my workflow. Let's talk through this here. Mike and I have spoken in previous episodes about different ways you can structure your development, you know, philosophies. One that gets a lot of attention in the DevOps kinda community, especially when you're building, say, an application or a package that's gonna be used by lots of people, is making sure you're on top of, like, the testing paradigm alongside your development, and they call that test driven development. That has served a lot of people well. I have kind of hopped back and forth with that. There are times I should have done the testing sooner, and then I regret it. And I realize I've got to bolt it on now, and I gotta, you know, get through that pretty quickly.
There is another spin on this that kinda goes a bit deeper than just testing in and of itself, and that is the concept of behavior driven development, or BDD for short. And you can use this in a lot of different domains. But in the second highlight here, we're gonna talk about a way that this can apply to either package development or definitely Shiny app development. And this post has been authored by Jakub Sobolewski, who is a software engineer over at Appsilon. And he has been looking at this topic for quite some time. We may have even covered a previous post that Jakub has written in this particular space. But if you've never heard about BDD or behavior driven development before, this post is for you, because he's gonna walk you through just how this works in a very accessible example here.
The idea around BDD is basically three major points, and this does have some parallels to the, you might say, famous or infamous agile philosophy. So don't turn off your podcast yet if you're sick of agile. We're not gonna beat you to death over that. But the idea is that you still capture some kind of what's called a user story, which is meant to be a high-level statement about what you want to accomplish. And in the case of the example we're gonna talk about here, if you're a customer of a bookstore, you might say, I wanna be able to get a book and add it to my cart so I can buy it, whether it was on a website or even in a brick-and-mortar store.
But then that's one thing to capture at a high level. What do you actually do with that? Well, you need to refine it somehow, and that becomes more specific examples about how you would actually fulfill this story. And then lastly, based on those examples, you create what are called the specifications. What are the requirements? There are a lot of different terms for this throughout the development world. What are the rules or the criteria to actually accomplish this? So back to the bookstore or the book example: you've got that general user story.
Now, to refine this, you've got to look at a few different principles here. You need some context. And then when something occurs, such as an event, then there should be some kind of result of that event. So if you want to be more verbose about this in the bookstore example, it's like: you're in the bookstore, you walked in, you're gonna select a book, such as, in this case, The Hobbit, then you wanna add this to your cart, and then you should be able to see it in the cart. Now, it sounds like I'm talking to a four year old about this, but that is the process of starting to break this out further. Because in those statements, we didn't really say what kind of store this was. Right?
It could be an online store like an Amazon type equivalent, or it could be, to be really geeky, a terminal application, or, like I said, a physical place that you walk into. So then the final step is, okay, how do we get specifications from this? And that's where you start to really get into the nuts and bolts about how you pull this off and actually develop the software, in this case, to accompany this. So in the case of this blog post, he's walking us through creating a new class, an R6 class, for this bookstore. And within it, there are certain methods that you're gonna have, such as selecting the book, adding it to the cart, and then verifying what it actually includes.
There's literally code that you can write to do this, and this is where the stuff kind of flips in my head a little bit. He starts off with a testthat test where, right off the bat, even though there is no actual code to do this yet, he's got the code in it of what he wants to accomplish. No kidding, it's gonna fail when you first run this, but that's the hook here. He's putting the code in a test before the code's actually been fleshed out. I still have to wrap my head around this a little bit, but I'm keeping an open mind here. Once you've got the test established and what you wanna actually verify, now, of course, you gotta actually flesh out this bookstore class.
That's where we start to see actual code here. And it's interesting where he puts this. He puts this in a setup chunk in testthat, not even in his actual R library yet, like an R folder or whatnot. Now he's got the bookstore R6 class with a few different public methods on there. They're pretty much empty right now, but they're actually there. They've got the skeleton of what he wants to accomplish. And then the last step is to satisfy the specifications, such as a function that selects a book where you get some result back with the book details, a function to actually add it to the cart, or wherever you're putting these, with a unique ID for that.
And then lastly, returning the list of what's in the cart. So then you start to flesh out in this class, or the specifications for this, just boilerplate structures or functions to actually pull this off, and then plug that into the test code. So you get the specs documented somewhere in code, put that back in the original test script, or the R6 class, I should say, and then run the test again. And lo and behold, you've got it passing. Now, again, I have never done that approach before. I've never started with a test with functions that don't actually work yet and then gone kind of backwards from that. I will have to train myself on that if I do adopt this. It'll take a little getting used to.
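To make the shape of this concrete, here is a rough sketch of the test-first flow described above. This is not Jakub's exact code, and it is shown with the R6 skeleton already filled in so the snippet is self-contained; in the post, the test comes first and fails until the class catches up:

```r
library(testthat)
library(R6)

# Skeleton of the bookstore class; in true test-first style these methods start empty.
Bookstore <- R6Class("Bookstore",
  public = list(
    cart = NULL,
    initialize = function() self$cart <- list(),
    select_book = function(title) list(title = title),   # return book details
    add_to_cart = function(book) {
      self$cart[[length(self$cart) + 1]] <- book         # give it a slot in the cart
      invisible(self)
    },
    cart_titles = function() vapply(self$cart, `[[`, character(1), "title")
  )
)

# The specification, written as a test before the implementation existed.
test_that("a selected book ends up in the cart", {
  store <- Bookstore$new()
  book <- store$select_book("The Hobbit")          # Given I am in the bookstore
  store$add_to_cart(book)                          # When I add The Hobbit to my cart
  expect_equal(store$cart_titles(), "The Hobbit")  # Then I see it in my cart
})
```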
Now, as I said, Jakub has looked at this space for a while, because he's actually got an R package, which may have made it to R Weekly before, called cucumber, which lets you kinda set up this BDD approach in a very, you know, elegant syntax that looks very much like a YAML kind of structure. And that's made in what's called a feature text file, in this case bookstore.feature, where the feature is called the bookstore, with a scenario nested in that. And then there are these different, like, ways of phrasing the example use of that user story: the given, when, then statements.
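For reference, a bookstore feature file in that layout follows the standard Gherkin keywords. A made-up sketch (the step wording here is illustrative, not copied from the post) would look something like this, which the cucumber package then maps onto R step definitions and runs like tests:

```gherkin
# bookstore.feature -- readable by non-programmers, parsed by the cucumber package
Feature: Bookstore cart
  Scenario: Adding a book to the cart
    Given I am in the bookstore
    When I add "The Hobbit" to my cart
    Then I should see "The Hobbit" in my cart
```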
And then the cucumber package will somehow import that. I've never used it before, so I'm just looking at the blog post here. And it will basically help you turn that into test specifications, or testing functionality. And then it's got its own test function to take advantage of that text file with the test that you put in testthat, and it will give you the results of whether you pass or not. I will definitely need to sit down with this a bit. I think there's potential here. I'll have to figure out just which use case this best benefits me in. But maybe for a new project, I might give this a whirl and see what happens. I can definitely see some value in being deliberate early on with the general parts of this and then getting more specific later on. So definitely some good food for thought here, Mike, but I'm curious.
Have you even come close to developing things like this before?
[00:22:53] Mike Thomas:
No, I haven't. I'm still awful at taking the test driven development approach. I'll just be straight up: I don't do it. I should. I should. It's the way to go. But I don't. (Eric: You and me both. You're among friends here.) But this is really interesting to me, and I've tried to trace this down, like, all the way to the source a little bit. And it looks like there is this open source project that, I believe, is beyond — was it Jakub at Appsilon who authored the blog post? I believe it's beyond that. And it's this open source project called Cucumber that lets you write automated tests in plain language. And that dot feature document that we were talking about, the type of document the cucumber R package is able to parse and take a look at, I guess it's called a Gherkin document, which is another word that I had never heard before, G-H-E-R-K-I-N.
And it uses a set of special keywords to give structure and meaning to executable specifications. This is right from the documentation on the open source Cucumber project website. But this is a massive project, it seems like, that I just happen to have never come across before. We can put it in the show notes. It's cucumber.io. And this is essentially, I believe, the open source project that led Jakub to take the initiative to create the cucumber R package, and that's why it's named cucumber. But it's pretty interesting that we're able to parse something that, as you said, Eric, looks much more like a YAML file into essentially a set of unit tests that are very, you know, sort of stepwise driven development, feature driven development. It lends itself to a Shiny context, I feel like, in a lot of ways. Right? And when I was looking at the R6 code that Jakub had in the blog post, it made me feel like, okay, we're almost in a Shiny setting, and I know he works at Appsilon, which does a lot of Shiny development work. So I imagine that all ties together well there. But if there is a world where we can make unit testing more accessible to people — the percentage of, you know, data analysts out there that may not even know about unit testing or have ever used testthat is probably nonzero, unfortunately.
But if there's a way where we can make that more accessible to folks, you know, through this Cucumber framework or YAML-like specifications. And, also, these dot feature files, these Gherkin documents, this is something that I could bring to, like, a business analyst who has no (Eric: Exactly.) concept of coding or anything like that. I don't wanna bring them my testthat script. Right? That's gonna look like Wingdings to them. I'd much rather bring them something like this. And if I can not only bring this to them to review and edit, but then not have to take the step of, okay, here's the plain English version of it, now I need to translate it into, you know, R code — if this is already done for me, that's one less step in the process that can really streamline your testing workflows, which are not trivial. Right? I think, as much as we probably don't want to admit it, AI nowadays and the ChatGPTs of the world have made things like, you know, roxygen documentation or docstrings in Python or unit testing a little easier for us, which is awesome.
And I think this maybe sort of piggybacks on that in a way that can make these types of things that have in the past been sort of tedious for us to do, the things that we put off till the last minute, just easier for us to get done and wrap into our workflows, into our packages, and things like that. So this has opened up a whole new world for me, this Cucumber framework, and I'm really interested to dive into it further.
[00:26:59] Eric Nantz:
Yeah. I remember I tried something kinda similar to this. There's an excellent book that's been authored by Martin Frigaard on building Shiny apps as R packages. And he talks about one of our favorites, golem, of course, amongst others. But he does have a section around this specification driven kind of development for developing your tests. So it is kinda like a hybrid of this. So it's not the first time I've seen something like this in action. But, yeah, the Cucumber framework — I've never seen anything treated quite like this before. So I do have a potential project in mind, and you and I have talked about this before.
A lot of times in our Shiny apps, we'll do the business logic testing via a separate R package, or just testthat with the functions that don't depend on Shiny. And I've got some of those established, but for more of the higher level things, I might take a crack at cucumber and see how that might help me out, especially in the case where it's not just testing the business logic. It's testing what you might call that end to end usage of the app. Or, in the past, yeah, I would work with an analyst or somebody on our team and be like, I've got this set of steps.
Just take a half hour, run through it, let me know if it works. Nobody wants to do that anymore. So if we can automate that side of it with a combo of cucumber and then maybe some frameworks like shinytest2 or Playwright, there are lots of opportunities to take this in future directions on that cohesive user experience. There are a lot of great examples on the cucumber R package site, which, again, we'll link to in the notes. So I would say, definitely, if you're in a collaborative environment and you need some advice from either just your end users or fellow developers without getting mired in the language syntax of how the tests are done, I can definitely see how this cucumber approach can really help here.
And last but certainly not least in our highlights episode today — it's a rite of passage, folks, when you're working through lots of data. Everything looks great for the first few rows, maybe the first 100 rows. And then really deep in that large dataset, there is just one partition that breaks everything apart. You may be using that purrr package you just referenced to do a lot of iteration, or just straight up looping for that matter, and you just don't know why it failed at that particular point. If you're like me in the past, you would say, okay, I'll try to trace back or do print statements of, like, which variable was being processed. And then when it gets to the one that fails, I'll try to subset the data manually for that particular one that failed, try to run stuff manually to see what broke or whatnot.
That might be good for smaller cases, but when you have a large dataset and lots of functions that you're running, that's not scalable. You need some help, so to speak, to make finding that problem a lot easier. Well, our last highlight today comes from a very authoritative source in this space of novel data science with R. It is authored by our former R Weekly curator, Miles McBain, who has been, of course, one of the forefront pioneers of, you know, novel data analysis. He has also been an early adopter of the targets package; he's been one of its main vocal advocates.
But in this post here, he calls it diving into the hunt. And when you say hunt, it is very much like trying to find a needle in a haystack here. So the situation was he had a large dataset, or large sets of datasets, maybe more than one. And there are a lot of steps — he's got the example here — that look like something you might write in an interactive notebook fashion, such as a Jupyter notebook or a Quarto document or an R Markdown document. You're making new variables, making new variables based on those previously made variables, you're transforming them, and then more transformations. It's like a step by step thing, but he said imagine doing that for thousands of lines of stuff like this, split across different files entirely.
And he does make the analogy that this might look familiar to somebody either as an interactive data analysis or to somebody that's a wizard in Microsoft Excel and has a whole bunch of formulas in each row of that table that build upon each other or build upon columns and stuff. Very much a literal step by step by step, you know, representation of that data flow. Well, that's great for them. But if you're inheriting this kind of code and you encounter one of these issues, how do you actually figure this out? Well, for one, there is some kind of grouping. Right? It might be just a row of your data or it might be a group of rows. Well, let's say a row for the sake of simplicity here.
There are derivations being made on each row. And how do you figure out, once you know there's a problem, how to get to that environment where the problem resides? So in essence, he kinda wants a way at a high level to zoom in literally on that set of data or part of the data, that row that's causing the problem, but do it in a way that's native to the environment that, in this case, the tidyverse kind of packages like dplyr are operating on. Here is the trick that Miles has come up with: the simple function he's written called dive. I say simple because there's only two lines of code, but I don't think it was simple how he got there, not in the least.
Goodness gracious. So here's the first step. Something I don't take much advantage of is that you can take a list of things in R and literally transform it into an environment. Mind blown moment number one. Number two, if you're familiar with R's object constructs, a data frame is a special type of list. So this first line converts the list of the data frame into an environment object. And then — I still can't quite wrap my head around this — within a function called local, which I believe means you're evaluating within a specified environment, you run the browser function, which I use every single day in my debugging. But I'll just put the browser statement in, like, the top of a reactive or the top of, like, a table output or whatnot, and literally just do the debugger from there.
But you can feed into this local function not just the call you wanna run, but the environment you wanna run it in. So the second argument to local is this data frame environment object. So now he can simply write all this pipeable code, but in the end, pipe that to the dive function. And that's after filtering for that specific row or ID that he knows has a problem. So once you isolate where the problem is, use your familiar piping syntax and put it into this dive call. And then, when the browser kicks in, you're not in your global environment. You're in the data frame's environment. And that's where you can do some really nice diagnostics, like trying those mutations that you have in those mutate calls, really getting to know what those variables are representing, so that you've taken away another source of maybe discrepancy or variation when your debugging environment isn't quite the same as what's in that pipeline environment.
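A minimal sketch of that two-line idea (see Miles' post for his exact version) could look like this — convert the incoming data frame into an environment, then open the debugger inside it:

```r
# Not necessarily Miles' exact implementation, just the mechanics as described:
# the piped-in rows become an environment, and browser() drops you inside it,
# so column names behave like ordinary variables while you poke around.
dive <- function(.data) {
  local(browser(), envir = list2env(as.list(.data), envir = new.env()))
}

# Hypothetical usage: filter down to the row you know is misbehaving, then dive in.
# problem_data |>
#   dplyr::filter(id == "the-bad-one") |>
#   dive()
```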
This takes care of that. He says he's now added this to his .Rprofile, so he has it wherever he is in his data analysis needs. And he doesn't really have to rewrite anything substantial other than finding where the problem is and piping it to this dive function. Oh my goodness, could I have used this about ten or twelve years ago when I was analyzing 55,000 genetic markers and there's just that one that's out of domain range and you just don't know why. This function would have been hugely helpful for that. So definitely take a look at this post. It's very short and to the point, but it is very relatable based on my past experience and my debugging adventures with complicated datasets. So credit to Miles for once again blowing my mind. It looks like a simple function, but it wasn't simple how he got there. So I'm gonna be adding this to my .Rprofile for sure. Mike, how about you?
[00:36:40] Mike Thomas:
This is wizardry at its finest, and no surprise that it's coming from Miles. This is, I guess, sort of an exploitation of R's ability to have multiple environments. Right? And that idea. And I would say, if you are a little earlier on in your R journey and you don't know about the browser function, please learn about the browser function and how to use the debugger in R. It can be life changing in terms of being able to much more quickly diagnose and address issues than if you're trying to do that without the browser function or the debugger.
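If browser() is new to you, the basic move is just to drop it inside the function you are debugging and run the code. A toy example:

```r
# Execution pauses at browser(); from the prompt you can run ls(), print x,
# or type `n` to step line by line and `c` to continue.
standardize <- function(x) {
  browser()
  (x - mean(x)) / sd(x)
}

# standardize(c(1, 2, 3, NA))   # stepping through would reveal the NA propagating
```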
But this is, as I mentioned, you know, sort of combining that debugging concept and capabilities with multi-environment specifications. And as you said, it's a simple, quote, unquote, simple function, a two-liner, but there's a lot going on here and a lot of power. And this is one of the coolest tricks that I think I've seen in a long time in any programming language. So I'm absolutely gonna add this to my arsenal and my repertoire as well. Probably stick it right in my .Rprofile, as you mentioned, and I think this will be really handy for a lot of use cases that we have.
I have seen so much code like the example code that Miles described at the beginning, where everything is very flat, not very well modularized. There's really no function extraction or anything like that, and you're just sort of creating data frames and overriding them as you do your next step, not a lot of chained assignment going on. I have been critiqued in the past, and our team has been as well, in some of, like, the model validations or the code reviews that we do, for caring too much about software development best practices. "And the code works, so we just wanna make sure that, you know, the correct inputs are getting converted to the correct outputs, essentially, and that there's no bugs in the logic." And, technically, there's no bugs in the logic, sure, but you've got a lot of fragile code hanging around here, and potentially the next person that steps in is gonna have no idea how to address it, especially when something either breaks or you're asked for that inevitable enhancement. And it's tricky to know exactly where to apply that.
And anyways, but I digress. I can just very much relate to Miles' sort of story here that set off this blog post, in his code review of some tricky code and trying to find a needle in a haystack, and what a useful little function he's put together there. For those similar use cases that I face on a fairly, unfortunately, consistent basis, he was able to drum up something that makes his life easier, and it's certainly gonna make mine easier in the future as well. So much thanks to Miles, not only for the blog post, but for the handy trick function.
[00:39:43] Eric Nantz:
I couldn't say it better myself. Two additional thoughts on that. First, as you alluded to, R comes with a lot of these great features built in. This is not something you had to add as an extra package. Right? The environment conversion, the, you know, ability to run functions in specified environments, the browser function. I have not met any other language that comes with this so elegantly, readily available off the bat. I'll stand on my soapbox for that. Yes, gotta get the hot take out. I'm not done yet either. While we were away on our little hiatus, you probably saw this, Mike, there was some chatter about a very provocative post that alluded to R being in its last days.
No, it's not. I can tell you for sure that in my industry it is most definitely not. And how do I know? Well, I'm not gonna pretend that life sciences dominates everything, and in data science, no, it doesn't. But when you look at robust development of capabilities that do have a very important need and need to stand the test of production usage, not just behind our company's firewalls but also going out to our health authorities or whatnot — yes, we are seeing some promising avenues from Python, and no shade on the Python listeners out there. But in the world of statistics and data science, R has a leg up on a lot of these things, and it's gonna be hard to close that gap anytime soon.
But, again, it's not even an either-or thing. I just do not feel that we're in any decline here, so to speak, even with the world of AI taking off. We've got ellmer, folks. We've got ellmer to help us on the R side with that. And I've heard people like Hadley and Joe say that the ellmer interface is just fun to use. I get it — you don't feel like you're making a compromise going to R for this stuff compared to Python. It is an elegant interface because of R's extensibility and class system and whatnot. We can do a lot of interesting tricks. So, like Miles opens up this post with, you've got so much of this available.
You can tackle almost anything with it. It's just the limits of your imagination, and I fully agree with him. In fact, you were calling this post kind of wizardry here. He actually gave a talk at a previous RStudio Conf, like, on the magic of open source. Right? So he is definitely our expert wizard here in the world of R. So a really, really fun post here. And, yeah, those reports of R's demise, I'm gonna say, are greatly exaggerated. Hot take over.
[00:42:39] Mike Thomas:
I couldn't agree more. I think it boils down to — that post was ridiculous. I saw that as well. I'm glad that you called it out. I saw it all over LinkedIn. And I think what has clearly happened is that the great things about R, folks have been trying to port over to Python, and the great things about Python, folks have been trying to port into R as well. So it's very much use whichever one you want, but don't trash on the other language just because you don't like it as much. You don't need to do that. There's no need. So I'm gonna shamelessly plug while we're still on the subject. Can I shamelessly plug?
(Eric: You got it.) It's gonna be a talk at posit::conf this year given by yours truly, titled Building and Managing Multilingual Data Science Teams. So I am a big proponent of both. It doesn't necessarily have to be only one or the other. They both have their strengths. They both have their weaknesses. But we don't need to trash on one side or the other.
[00:43:40] Eric Nantz:
If that isn't a great teaser to come to posit::conf to watch Mike's talk, I don't know what is. So, yeah, I can't wait to hopefully be in the front row for that one when I'm over there. It is very much the state of the world we're living in now in our dev lives. Right? So, I mean, I've got some colleagues on the Python side doing some great work, and then I'm integrating that into my R workflows, and it's all working well. But, yeah, like Miles says here, we've got a lot at our fingertips. You may, in your R journey, encounter some hairy situations, whether it's on the data analysis side, the Shiny app side, or the package development side; chances are there is a way to get out of those bugs. And, yeah, you may use AI for that, or you may not need to, like in this blog post. Sometimes the tried and true is the best way to go.
Speaking of the best way to go, we have lots of other content at rweekly.org. We invite you to check out the rest of the issue. We're running a little low on time, so we won't do our additional finds here. But if you're new to the website, it's got a great set of sections, all clearly labeled, whether it's package updates, tutorials, or events in the community — lots of great things happening in this ecosystem. And R Weekly — I don't remember how many years it's been since we started running this project — has stood the test of time as the truly open and community driven way of giving you this great content.
And no bots are powering this. No organization is overseeing all this. This is all driven by us passionate, you know, advocates of R and data science. So that was my little mini soapbox number two, but nonetheless, we definitely invite you to get in touch with us. If you have some interesting takes you wanna share with us, we always welcome all opinions from all sectors, and you can get in touch with us in multiple ways. First of which, in our podcast episode show notes, we've got a little contact form. Feel free to send us a note there. And we've also got our availability on social media these days. I have been a bit quiet, but I'm hoping to get back into it. I am available on Bluesky at @rpodcast.bsky.social.
I am also on Mastodon with @[email protected], and I'm on the aforementioned LinkedIn. I try to stay away from the clickbait stuff; I just post some relevant stuff. Search my name, and you'll find me there. And, Mike, where can listeners find you?
[00:46:13] Mike Thomas:
Likewise, it's been a bit quiet, but certainly, if the podcast today is any indication, I'm hoping to get back out there a little bit. And you can find me on Bluesky at mike-thomas.bsky.social, or you can find me on LinkedIn if you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, and you can see what I'm up to.
[00:46:34] Eric Nantz:
And I believe I saw you on a golf course recently. I was jealous already because I wanna get my golf game back on. Maybe someday we'll hit a round of nine or eighteen, and you'll pummel me to death with your skills, but we'll have fun. That will not happen. But we can certainly golf. Yeah, I'm gonna enjoy that with the best of them. But, nonetheless, we're gonna sink the eighteenth putt here and close up shop for R Weekly Highlights for this week. And hopefully things are back to normal. Of course, you never know with the way life goes, especially in the summer. Shout out to all of you that have to shuffle kids around to, you know, summer camps and whatnot that can wreak havoc on schedules. I'll do my best to wrangle that. So we'll close up shop here. Thank you so much for listening to this episode 207 of R Weekly Highlights, and we hopefully will be back with another episode of R Weekly Highlights next week.
Hello, friends. Oh my goodness. It's great to be back with episode 207 of the our weekly holidays podcast. And I do admit this layoff has been a lot longer than intended, but we're happy to be back here. And if you're new to the show, this is the what usually is the weekly podcast where we talk about the great resources that have been shared in the highlights section at rweekly.0rg, along so much more of our various adventures in the world of r and open source. My name is Eric Nance, and, yeah, as I said, the layoff was definitely longer than intended because a combination of company mandated vacation and, you know, other business to resolve, but I am back here keeping things together, at least somewhat, piece by piece. But keeping things together, I would never wanna do a, after a layoff an episode alone.
And back with a vengeance is my awesome cohost, Mike Thomas. Mike, how are you doing? And more importantly, how are you sounding today?
[00:01:00] Mike Thomas:
Sounding much better. And to be honest, Eric, when I hadn't heard from you for a couple weeks after my prior audio issues in our most recent podcast, I thought I might have been fired. So I am glad to hear that that's not the case and to have gotten back in touch last week and finally been able to get back on the mic with you this week. So I'm super excited. Hopefully, I remember how to do this.
[00:01:22] Eric Nantz:
You and me both. And, I was telling you in the preshow, you will you will know firsthand if you're getting the the pink slip, so to speak. But, no, you are you are very safe here. This is always one of my, makeshift therapy sessions every week after all the chaos I go through in my dev life. But nonetheless, we got lots of great things to talk about here. And as you well know, if you listen before our weekly is a volunteer effort and every week we have a rotating curator. And this past issue that we're talking about today has been curated by Rio Nakagorora, another one of our OG curators on the team. And as always, he had tremendous help from our fellow Arruki team members and contributors like all of you around the world. We've heard a poll request and other wonderful suggestions.
So in in the midst of this layoff, I have been, you know, put in my hand in various things, and one of them is definitely writing a lot of documentation and not just booting up a static word document. I always go to one of my favorite new tools in the ecosystem, and that is Quarto for generating scientific, you know, documentation that leverages R, even Python from time to time. And that's been really wonderful for me to keep saying a lot of the new capabilities I'm developing into. But where Quarto and Rmarkdown before that got their major, you might say, future and, you know, future use cases has been the world of data analysis to be able to make reproducible research. Right? So you don't have to copy and paste all those tables and graphs that you're making in r or Python or whatever else. It's all automatically inserted in.
And with the HTML format, you can do a lot of fun things with that format. I have been pushing the envelope quite a bit. I know, Mike, you have as well with your websites and other documentation you're making. But there are times that you wanna kinda merge those concepts together, really make a novel, you know, user experience for that web based report, but it's based on maybe different levels of data or different, you know, datasets, but they're all doing a similar analysis. So how can you kinda blend these two worlds together and take advantage of, you know, quarto and ours built in automation in play?
So our first highlight today is coming from Danielle Navarro who has been at the forefront of a lot of things and reproducible research, as well as some really awesome artwork and, you know, generative art, if you ever want to check her blog out for that. But in this post here, Danielle talks about a recent, you know, situation, which was, inspired by a data analysis, and they had to repeat many different types of plots or tables across these different partitions of the data. So in this example that Danielle comes up with here, they leverage the baby names, our package as a a fun way to look at all the different names out there to kind of look at how you would, look at the different permutations of the name that that they have used, throughout their life going from Dan, Danny, Danny, different spellings of Danny all the way to Daniella.
So little nice little, first off the bat, in quarto, they have created what's called a tab set of this visualization. Looks like a histogram of the years and the baby names dataset and the color coded by genders to, the frequency of that. But right off the bat in quarto, you get a nice tab based, you know, layout for each tab being a different name, and that is, a function that we'll get to in a little bit how how they built that. But that's great right there, and that doesn't stop with plots. You can do that with tables as well or even just print out the data. And then this is where things get fancy. Maybe you don't wanna be limited just to the chordal tab set feature.
You wanna construct this report and take advantage of some nice features in HTML, like putting text in the sidebar or the margin of the report. So there are some other functions you can create to leverage some more HTML features, like putting custom divs, which are like unique blocks of of HTML constructs, but they're uniquely identified where you could fit in some margin text with this. You could also fit on these nice little call out boxes in the margin itself if you wanna put it in this way. And throughout, there is an example function that has the contact, the content of this block of text, if you will, along with the class and any separators if there's multiple pieces of content.
And this is pretty interesting how you can do this, but how did how did Danielle pull this off? Well, they have created a new r package mostly scratching this own itch here called Quartos. I hope I'm pronouncing that right. But the motivation was, again, going to a reporting perspective, all these different groups are partitions of data. You don't wanna have to manually repeat these chunks that you're putting into this HTML content. So this package has a way via code chunks to automate and iterate through these many different groups or many rows and be able to put these HTML based content in dynamically and have it write the quartal syntax for you. So instead of you having the right, like the the three colons, the curly bracket, and then the ID or the the CSS attribute or whatever the construct is you're doing in quartile and then putting the text in manually, it's gonna use R to create these code chunks for you.
This is not a new concept in and of itself because I remember my early days of R Markdown and Knitter. I would very much use this same technique in my biomarker data reports where I might have 200 protein biomarkers. They're all doing a visualization or maybe a bolded list of key features of that biomarker. And I would use r and knitter to automate, say, making a bolded list of these key features and just dynamically put it into text via, you know, back then, just, you know, paste or now glue or things like that. So this is basically taking that to Quarto, but now giving these additional nice HTML features that Quarto has, like these custom margin blocks, the tab sets, and putting it all in directly.
Now Danielle is upfront at the end here saying not sure if that's gonna be useful to other people, but it's useful to them right now. And perhaps you might be able to learn something from it if you've been struggling with repeating similar analytical or data displays in your portal doc without having to manually type that in every time. I think there's a lot of potential here, and I may, definitely take inspiration from this for my next, data driven report that I make with Kortal. And, no, it's not gonna be in Microsoft Word because HTML is the new hotness, although it shouldn't be that new to anybody listening to this show.
Mike, what did you think about Daniel's exploration here?
[00:08:45] Mike Thomas:
Calling HTML, like, something new is, you know, as the language itself is is sort of funny, sort of silly, but I I totally agree with your sentiment. So we do a lot of reporting, and I I do a lot of reporting on a day to day basis. And lots of times, this involves sort of a traditional data analysis approach to maybe building the same chart, a time series, for example, for each independent variable in your dataset, if you're doing like a machine learning problem. And you could hard code this in your quarto document and have a section that says, you know, variable one, and then have your plot. And then, you know, do your two hashtag pound signs for your next section and just say variable two and show your next plot. But if you are trying to do something that's much more dynamic than that, if your dataset changes and you don't wanna have to change your code and those variable names or anything like that, a great trick that's existed sort of since the our markdown days and has been adopted in quarto is is this, output as is capability, which allows you to essentially put raw markdown HTML, in your chunk, and it will actually render that nicely in your report.
And you can wrap all of that in a for loop or take a list and stick it into a per statement. So I have to imagine imagine what's going on under the hood here in the Quartos package that Daniela has put together is quite a bit of that, probably. Some some output as is type of functionality, list type functionality, to be able to render these these lists in a dynamic way that spits out sort of multiple sections in your report or, you know, these multiple tab sets. And that's something that if you've ever done it, it's very powerful, but it does take a good amount of code to write to pull that off. And Right. I think what Danielle has done here is provide us with some really nice helper wrapper functions around that type of syntax to make this way easier than and way less verbose than what I, you know, have typically written in the past and what you may have typically written in the past to accomplish that same thing.
So very excited about this. I think it's gonna be a great helper set of tools and little functions. It's it's almost like an internal package that I would have loved to have written for myself if you've ever, you know, written yourself a toy package, maybe at the day job to, you know, do things that you do on a day to day basis, but you're not sure necessarily needs to see the light of crayon or anything like that. In this case, I'm very glad that Danielle put this out there in the world. I'm excited to use it. Another thing I think while we're on, like, the per topic that I saw just as a tangent is I think, Charlie Gao has done quite bit of work, and there's a new release maybe today of Per that came out that allows for much more, parallel processing of Per. So use all those cores on your machine and see how fast it can go. But really excited to see this Quartos, our package and and to be able to start using it in my own reporting.
[00:12:00] Eric Nantz:
Yeah. There's a lot of wonderful concepts here at play that until you really realize and get tired of manually coding this up, you may not fully appreciate it, but I know I've been there in my early days of a data data scientist kind of says this role on the on the trials and these biomarkers. Yeah. I got so tired of having to manually code this up every single time. I would port over functions from, like, one report to another. And, yes, I should have ran a package out of that. But back then, I was like, at least I got something. But, yeah, I can take a lot of inspiration from Cortos here. And then in the in the package of write up, Danielle notes that there are actually other packages in this space a little bit.
One, I believe, is called quartabs, by Suzuki Sasaki Usuki, if I'm pronouncing that right, as well as qreport by Doctor Frank Harrell himself, who has been one of the more progressive adopters of knitr and R Markdown and now Quarto. So there's definitely an appetite for this kind of reusable and very convenient set of functions to automate the process of generating these code blocks, or these output blocks I should say, leveraging R itself. Great resource here, and I'm definitely gonna dive into this, like I said, for my next reporting adventures. And, yeah, on the purrr topic, goodness gracious. I was an early adopter of the furrr package when that first came out, and it did an immense job of at least giving me that taste of multicore goodness with purrr-style iteration. But now that Charlie Gao is working at Posit full time, I sense big things for the world of iteration, taking advantage of multiple architectures for your multiprocessing and now, frankly, HPC components. So watch this space. It's gonna get real fun real fast.
Now we're gonna go to a pretty big departure from what we just talked about, more of a conceptual exercise, but one that I'm definitely curious about. I might need to think about it a bit more before I start adopting it into my workflow, but let's talk through this here. Mike and I have spoken in previous episodes about different ways you can structure your development philosophies. One that gets a lot of attention in the DevOps community, especially when you're building, say, an application or a package that's gonna be used by lots of people, is making sure you're on top of the testing paradigm alongside your development, and they call that test driven development. That has served a lot of people well. I have kind of hopped back and forth with it. There are times I should have done the testing sooner, and then I regret it and realize I have to bolt it on now and get through that pretty quickly.
There is another spin on this that goes a bit deeper than just testing in and of itself, and that is the concept of behavior driven development, or BDD for short. You can use this in a lot of different domains, but in the second highlight here we're gonna talk about a way this can apply to either package development or, definitely, Shiny app development. This post has been authored by Jakub Sobolewski, who is a software engineer over at Appsilon. He has been looking at this topic for quite some time; we may have even covered a previous post that Jakub has written in this particular space. But if you've never heard about BDD, or behavior driven development, before, this post is for you, because he's gonna walk you through how this works with a very accessible example.
The idea around BDD is basically three major points, and this does have some parallels to the, you might say, famous or infamous agile philosophy. So don't turn off your podcast yet if you're sick of agile; we're not gonna beat you to death over that. The idea is that you still capture some kind of what's called a user story, which is meant to be a high level description of what you want to accomplish. In the case of the example we're gonna talk about here, if you're a customer at a bookstore, you might say: I wanna be able to get a book and add it to my cart so I can buy it, whether that's on a website or in a brick and mortar store.
But that's just capturing it at a high level. What do you actually do with that? Well, you need to refine it somehow, and that becomes more specific examples about how you would actually fulfill this story. And then lastly, based on those examples, you create what are called the specifications. What are the requirements? There are a lot of different terms thrown around in the development world. What are the rules or the criteria to actually accomplish this? So back to the book example, you've got that general user story.
Now, to refine this, you've got to look at a few different principles. You need some context. Then when something occurs, such as an event, there should be some kind of result of that event. So if you want to be more verbose about this in the bookstore example, it's like: you're in the bookstore, you walked in, you're gonna select a book, in this case The Hobbit, then you wanna add it to your cart, and then you should be able to see it in the cart. Now it sounds like I'm talking to a four year old about this, but that is the process of starting to break this out further, because in those statements we didn't really say what kind of store this was. Right?
It could be an online store, like an Amazon type equivalent, or it could be, to be really geeky, a terminal application, or, like I said, a physical place that you walk into. So then the final step is, okay, how do we get specifications from this? And that's where you start to really get into the nuts and bolts of how you pull this off and actually develop the software to accompany it. In the case of this blog post, he's walking us through creating a new class, an R6 class for this bookstore. And within it there are certain methods, such as selecting the book, adding it to the cart, and then verifying what the cart actually includes.
There's literally code that you can write to do this, and this is where the stuff kind of flips in my head a little bit. He starts off with a testthat test where, right off the bat, even though there is no actual code to do this yet, he's got the code in it of what he wants to accomplish. No kidding, it's gonna fail when you first run it, but that's the hook here. He's putting the code in a test before the code's actually been fleshed out. I still have to wrap my head around this a little bit, but I'm keeping an open mind. Once you've got the test established and what you wanna actually verify, now, of course, you gotta actually flesh out this bookstore class.
That's where we start to see actual code here. And it's interesting where he puts this: he puts it in a setup chunk in testthat, not even in his actual R library yet, like an R folder or whatnot. Now he's got the bookstore R6 class with a few different public methods on there. They're pretty much empty right now, but they're actually there; they've got the skeleton of what he wants to accomplish. And then the last step is to satisfy the specifications, such as a function that selects a book and gives you back the book details, and a function to actually add it to the cart, or wherever you're putting these, with a unique ID for it.
And then lastly, returning the list of what's in the cart. So then you start to flesh out in this class, for the specifications, the boilerplate structures or functions to actually pull this off, and then plug that into the test code. So you get the specs documented somewhere in code, put that back alongside the R6 class, and then run the test again. And lo and behold, you've got it passing. Now, again, I have never taken that approach before. I've never started with a test with functions that don't actually work yet and then worked kind of backwards from that. I will have to train myself on it if I do adopt this. It'll take a little getting used to.
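To make that flow concrete, here's a minimal sketch in testthat and R6 of the kind of thing being described. The class and method names are our own illustrative choices, not the exact code from Jakub's post, and in the BDD flow you'd write the test_that() block first, watch it fail, and only then fill in the class until it passes:

```r
# Illustrative sketch of the write-the-test-first flow, not Jakub's code.
library(testthat)
library(R6)

Bookstore <- R6Class("Bookstore",
  public = list(
    cart = NULL,
    initialize = function() {
      self$cart <- list()
    },
    select_book = function(title) {
      # In a real app this might look the title up in an inventory
      list(title = title)
    },
    add_to_cart = function(book) {
      self$cart[[length(self$cart) + 1]] <- book
      invisible(self)
    },
    cart_contents = function() {
      vapply(self$cart, function(b) b$title, character(1))
    }
  )
)

# This test is the specification: it would be written (and fail) before
# the class body above exists.
test_that("a selected book ends up in the cart", {
  store <- Bookstore$new()
  book <- store$select_book("The Hobbit")
  store$add_to_cart(book)
  expect_equal(store$cart_contents(), "The Hobbit")
})
```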
Now, as I said, Jakub has been looking at this space for a while, because he's actually got an R package, which may have made it to R Weekly before, called cucumber, which lets you set up this BDD approach with a very elegant syntax that looks much like a YAML kind of structure. That lives in what's called a feature text file, in this case bookstore.feature, where the feature is called bookstore, nested in that is a scenario, and then there are these different ways of phrasing the example use of that user story: the given, when, then statements.
And then the cucumber package will import that. I've never used it before, so I'm just going off the blog post here. It will basically help you turn that into test specifications, or testing functionality, and it's got its own test function to take advantage of that text file along with the test code you put in testthat, and it will tell you whether you pass or not. I will definitely need to sit down with this a bit. I think there's potential here; I'll have to figure out which use case benefits me most. But maybe for a new project I might give this a whirl and see what happens. I can definitely see some value in being deliberate early on with the general parts of this and then getting more specific later on. So definitely some good food for thought here, Mike, but I'm curious.
Have you even come close to developing things like this before?
[00:22:53] Mike Thomas:
No, I haven't. I'm still awful at taking the test driven development approach. I'll just be straight up: I don't do it. I should. It's the way to go, but I don't. But this is really interesting to me, and I've tried to trace this down all the way to the source a little bit. It looks like there is an open source project behind this that goes beyond Jakub at Appsilon, who authored the blog post. It's this open source project called Cucumber that lets you write automated tests in plain language. And that type of document, the .feature document we were talking about, the one the cucumber R package is able to parse and take a look at, I guess is called a Gherkin document, which is a word I had never heard before: g h e r k i n.
And it uses a set of special keywords to give structure and meaning to executable specifications; that's right from the documentation on the open source Cucumber project website. But this is a massive project that I just happen to have never come across before. We can put it in the show notes: it's cucumber.io. And this open source project, I believe, is what led Jakub to take the initiative to create the cucumber R package, and that's why it's named cucumber. It's pretty interesting that we're able to parse something that, as you said, Eric, looks much more like a YAML file into essentially a set of unit tests for very stepwise, feature driven development. It lends itself to a Shiny context, I feel like, in a lot of ways. Right? When I was looking at the R6 code that Jakub had in the blog post, it made me feel like, okay, we're almost in a Shiny setting, and I know he works at Appsilon, which does a lot of Shiny development work, so I imagine that all ties together well there. The percentage of data analysts out there who may not even know about unit testing, or have never used testthat, is probably nonzero, unfortunately.
But if there's a way we can make that more accessible to folks through this Cucumber framework or YAML-like specifications, that's big. Also, these .feature files, these Gherkin documents, are something I could bring to, like, a business analyst who has no concept of coding or anything like that. I don't wanna bring them my testthat script. Right? That's gonna look like Wingdings to them. Much better to bring them something like this. And if I can not only bring this to them to review and edit, but then not have to take the step of, okay, here's the plain English version of it, now I need to translate it into R code; if that's already done for me, that's one less step in the process, which can really streamline your testing workflows, and those are not trivial. Right? I think, as much as we probably don't want to admit it, AI nowadays and the ChatGPTs of the world have made things like roxygen documentation, or docstrings in Python, or unit testing a little easier for us, which is awesome.
And I think this maybe piggybacks on that in a way that can make these types of things that have been tedious for us in the past, the things we put off till the last minute, easier for us to get done and wrap into our workflows, into our packages, and things like that. So this has opened up a whole new world for me, this Cucumber framework, and I'm really interested to dive into it further.
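For a sense of what such a plain-language spec looks like, here's a rough illustration written out from R. The Given/When/Then keywords come from the upstream Cucumber project's Gherkin syntax; we haven't verified the exact conventions the cucumber R package expects, so treat the wording and file name as hypothetical:

```r
# Rough illustration of a Gherkin .feature file, written out from R.
# The step wording is hypothetical; the keywords follow the upstream
# Cucumber project's Gherkin syntax.
writeLines(c(
  "Feature: Bookstore",
  "  Scenario: Adding a book to the cart",
  "    Given I am in the bookstore",
  "    When I select \"The Hobbit\"",
  "    And I add it to my cart",
  "    Then I should see \"The Hobbit\" in my cart"
), "bookstore.feature")
```

The appeal Mike describes is that this file, not the R test code, is the artifact a non-programmer can review and edit.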
[00:26:59] Eric Nantz:
Yeah, I remember I tried something kinda similar to this. There's an excellent book by Martin Frigaard on building Shiny apps as R packages, and he talks about one of our favorites, golem, of course, amongst others. But he does have a section around this specification driven kind of development for building your tests, so it's kinda like a hybrid of this. So it's not the first time I've seen something like this in action. But, yeah, the Cucumber framework, I've never seen anything quite like it before. And I do have a potential project in mind; you and I have talked about this before.
A lot of times in our Shiny apps we'll do the business logic testing via a separate R package, or just testthat with the functions that don't depend on Shiny. I've got some of those established, but for more of the higher level things I might take a crack at cucumber and see how that might help me out, especially since it's not just testing the business logic; it's testing what you might call the end to end usage of the app. In the past, yeah, I would work with an analyst or somebody on our team and be like, I've got this set of steps.
Just take a half hour, run through it, let me know if it works. Nobody wants to do that anymore. So if we can automate that side of it with a combo of cucumber and maybe some frameworks like shinytest2 or Playwright, there are lots of opportunities to take this in future directions for that cohesive user experience. There are a lot of great examples on the cucumber R package site, which, again, we'll link to in the notes. So I would say, definitely, if you're in a collaborative environment and you need input from either your end users or fellow developers without getting them mired in the language syntax of how the tests are done, I can see how this Cucumber approach can really help.
And last but certainly not least in our highlights episode today: it's a rite of passage, folks, when you're processing lots of data. Everything looks great for the first few rows, maybe the first 100 rows, and then, really deep in that large dataset, there is just one partition that breaks everything apart. You may be using that purrr package we just referenced to do a lot of iteration, or just straight up looping for that matter, and you just don't know why it failed at that particular point. If you're like me, in the past you would say, okay, I'll try to trace back or do print statements of which variable was being processed. And then when it gets to the one that fails, I'll try to subset the data manually for that particular one and run stuff by hand to see what broke.
That might be fine for smaller cases, but when you have a large dataset and lots of functions you're running, it's not scalable. You need some help, so to speak, to make finding that problem a lot easier. Well, our last highlight today comes from a very authoritative source in this space of novel data science with R. It is authored by our former R Weekly curator, Miles McBain, who has been, of course, one of the forefront pioneers of novel data analysis. He has also been an early adopter of the targets package and one of its main vocal advocates.
In this post, which he calls diving into the hunt, the hunt is very much like trying to find a needle in a haystack. The situation was he had a large dataset, or large sets of datasets, maybe more than one. And there are a lot of steps, he's got an example here, that look like something you might write in an interactive notebook fashion, such as a Jupyter notebook or a Quarto or R Markdown document. You're making new variables, making more variables based on those previously made variables, transforming them, and then doing more transformations. It's a step by step thing, but he said imagine doing that for thousands of lines of code, split across different files entirely.
And he does make the analogy that this might look familiar, either as an interactive data analysis or to somebody who's a wizard in Microsoft Excel and has a whole bunch of formulas in each row of a table that build upon each other or upon other columns. Very much a literal step by step representation of that data flow. Well, that's great for them. But if you're inheriting this kind of code and you encounter one of these issues, how do you actually figure it out? Well, for one, there is some kind of grouping. Right? It might be just a row of your data or it might be a group of rows. Let's say a row, for the sake of simplicity.
There are derivations being made on each row. And once you know there's a problem, how do you get to the environment where the problem resides? In essence, he wants a way, at a high level, to zoom in literally on that part of the data, the row that's causing the problem, but do it in a way that's native to the environment that, in this case, the tidyverse kind of packages like dplyr are operating on. Here is the trick that Miles has come up with: a simple function he's written called dive. I say simple because it's only two lines of code, but I don't think it was simple how he got there, not in the least.
Goodness gracious. So here's the first step. Something I don't take much advantage of is that you can take a list of things in R and literally transform it into an environment. Mind blown moment number one. Number two, if you're familiar with R's object constructs, a data frame is a special type of list. So the first line converts the data frame, as a list, into an environment object. And then, I still can't quite wrap my head around this, within a function called local, which I believe means you're evaluating within a specified environment, you run the browser function, which I use every single day in my debugging. But normally I'll just put the browser statement at, like, the top of a reactive or the top of a table output or whatnot, and literally just work the debugger from there.
But you can feed into this local function not just the expression you wanna run, but the environment you wanna run it in. So the second argument to local is that data frame environment object. Now he can keep all his pipeable code, but at the end pipe it into the dive function, and that's after filtering for the specific row or ID that he knows has a problem. Once you isolate where the problem is, use your familiar piping syntax and put it into this dive call. Then when the browser kicks in, you're not in your global environment; you're in the data frame's environment. And that's where you can do some really nice diagnostics, like trying the mutations you have in those mutate calls and really getting to know what those variables represent, so you've removed another source of discrepancy where your debugging environment isn't quite the same as what's in the pipeline environment.
This takes care of that. He says he's now added this to his .Rprofile, so he has it wherever his data analysis needs take him, and he doesn't really have to rewrite anything substantial other than finding where the problem is and piping it into this dive function. Oh my goodness, could I have used this about ten or twelve years ago, when you're analyzing 55,000 genetic markers and there's just that one that's out of the expected range and you just don't know why. This function would have been hugely helpful for that. So definitely take a look at this post. It's very short and to the point, but it is very relatable based on my past experience and my debugging adventures with complicated datasets. Credit to Miles for once again blowing my mind. It looks like a simple function, but it wasn't simple how he got there. I'm gonna be adding this to my .Rprofile for sure. Mike, how about you?
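Based purely on the description above, a function like this could look roughly as follows. This is a sketch of the idea, not necessarily Miles's exact implementation from the post, and the `id` column in the usage comment is hypothetical:

```r
# Sketch of the dive() idea: turn the (already filtered) data frame into
# an environment and drop into the debugger inside it.
dive <- function(.data) {
  env <- list2env(as.list(.data))   # a data frame is a named list of columns
  local(browser(), envir = env)     # browse with the columns as variables
}

# Hypothetical usage: isolate the problem row first, then dive into it.
# dat |>
#   dplyr::filter(id == "the_one_that_breaks") |>
#   dive()
```

Once the browser prompt appears, the column values of that filtered slice are right there as ordinary variables, so you can replay the failing mutate logic interactively in exactly the environment the pipeline sees.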
[00:36:40] Mike Thomas:
This is wizardry at its finest, and no surprise that it's coming from Miles. It is, I guess, an exploitation of R's ability to have multiple environments, right, and that whole idea. And I would say, if you are a little earlier on in your R journey and you don't know about the browser function, please learn about the browser function and how to use the debugger in R. It can be life changing in terms of being able to much more quickly diagnose and address issues than if you're trying to do that without the browser function or the debugger.
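For anyone who hasn't used it, here's a tiny, self-contained illustration of what dropping browser() into a function looks like; the function itself is just a made-up example:

```r
# browser() pauses execution at the call and opens an interactive prompt
# in the function's environment, where you can inspect variables, step
# with `n`, and continue with `c`.
summarise_column <- function(df, col) {
  x <- df[[col]]
  browser()          # pause here to poke at `x` interactively
  mean(x, na.rm = TRUE)
}

# summarise_column(mtcars, "mpg")  # run interactively to try it
```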
But this is, as I mentioned, sort of combining that debugging concept and those capabilities with multi environment specifications. And as you said, it's a simple, quote unquote, function, a two liner, but there's a lot going on here and a lot of power. This is one of the coolest tricks I think I've seen in a long time in any programming language. So I'm absolutely gonna add this to my arsenal and my repertoire as well, probably stick it right in my .Rprofile as you mentioned, and I think it will be really handy for a lot of use cases that we have.
I have seen so much code like the example code Miles described at the beginning, where everything is very flat, not very well modularized. There's really no function extraction or anything like that, and you're just creating data frames and overriding them as you do your next step, not a lot of chained assignment going on. I have been critiqued in the past, and our team has been as well, in some of the model validations or code reviews that we do, for caring too much about software development best practices when the code works; the response is that we just wanna make sure the correct inputs are getting converted to the correct outputs, essentially, and that there are no bugs in the logic. And, technically, there are no bugs in the logic, sure, but you've got a lot of fragile code hanging around, and potentially the next person who steps in is gonna have no idea how to address it, especially when something breaks or you're asked for that inevitable enhancement, and it's tricky to know exactly where to apply it.
Anyways, but I digress. I can very much relate to Miles's story here that set off this blog post, his code review of some tricky code and trying to find a needle in a haystack. What a useful little function he's put together; for those similar use cases that I face on an unfortunately consistent basis, he was able to drum up something that makes his life easier, and it's certainly gonna make mine easier in the future as well. So much thanks to Miles, not only for the blog post, but for the handy trick of a function.
[00:39:43] Eric Nantz:
I couldn't say it better myself. Two additional thoughts on that. First, as you alluded to, R comes with a lot of these great features built in. This is not something you had to add as an extra package. Right? The environment conversion, running functions in specified environments, the browser function. I have not met any other language that has this so elegantly, readily available off the bat. I'll stand on my soapbox for that. Yes, gotta get the hot take out, and I'm not done yet either. While we were away on our little hiatus, you probably saw this, Mike: there was some chatter about a very provocative post that alluded to R being on its last days.
No, it's not. I can tell you for sure that in my industry it most definitely is not. Now, I'm not gonna pretend that life sciences dominates everything in data science; no, it doesn't. But look at robust development of capabilities that fill a very important need and have to stand the test of production usage, not just behind our company's firewalls but also with our health authorities or whatnot. Yes, we are seeing some promising avenues from Python, and no shade on the Python listeners out there, but in the world of statistics and data science, R has a leg up on a lot of these things, and it's gonna be hard to close that gap anytime soon.
And, again, it's not even an either-or thing. But I do not feel that we're in any decline here, so to speak, even with the world of AI taking off. We've got ellmer, folks. We've got ellmer to help us on the R side with that. And I've heard people like Hadley and Joe say that the ellmer interface is just fun to use. You don't feel like you're making a compromise going to R for this stuff compared to Python. It is an elegant interface because of R's extensibility with the class system and whatnot. We can do a lot of interesting tricks. So, like Miles opens up this post here, you've got so much of this available.
You can tackle almost anything with it; it's just the limits of your imagination, and I fully agree with him. In fact, you were calling this a kind of wizardry post here. He actually gave a talk at a previous RStudio Conf about the magic of open source. Right? So he is definitely our expert wizard here in the world of R. Really fun post. And, yeah, those reports of R's demise, I'm gonna say, are greatly exaggerated. Hot take over.
[00:42:39] Mike Thomas:
I couldn't agree more. That was ridiculous. I saw that as well, I'm glad you called it out, and I saw it all over LinkedIn. I think what has clearly happened is that the great things about R, folks have been trying to port over to Python, and the great things about Python, folks have been trying to port into R as well. So it's very much use whichever one you want, but don't trash the other language just because you don't like it as much. You don't need to do that. There's no need. So I'm gonna shamelessly plug while we're still on the subject. Can I shamelessly plug?
There's gonna be a talk at posit::conf this year given by yours truly, titled Building and Managing Multilingual Data Science Teams. So I am a big proponent of both. It doesn't necessarily have to be only one or the other. They both have their strengths, they both have their weaknesses, but we don't need to trash one side or the other.
[00:43:40] Eric Nantz:
If that isn't a great teaser to come to posit::conf to watch Mike's talk, I don't know what is. So, yeah, I can't wait to hopefully be in the front row for that one when I'm over there. It is very much the state of the world we're living in now in our dev lives. Right? I mean, I've got some colleagues on the Python side doing some great work, and I'm integrating that into my R workflows, and it's all working well. But, yeah, like Miles says here, we've got a lot at our fingertips. You may, in your R journey, encounter some hairy situations, whether it's on the data analysis side, the Shiny app side, or the package development side, but chances are there is a way to get out of those bugs. And, yeah, you may use AI for that, or you may not need to, like in this blog post. Sometimes the tried and true is the best way to go.
Speaking of the best way to go, we have lots of other content over at rweekly.org, and we invite you to check out the rest of the issue. We're running a little low on time, so we won't do our additional finds here. But if you're new to the website, it's got a great set of sections, all clearly labeled, whether it's package updates, tutorials, or events in the community; lots of great things are happening in this ecosystem. And R Weekly, through I don't remember how many years it's been since we started running this project, has stood the test of time as the truly open and community driven way of giving you this great content.
And no bots are powering this, no organization is overseeing it; this is all driven by us passionate advocates of R and data science. That's mini soapbox number two, but nonetheless, we definitely invite you to get in touch with us. If you have some interesting takes you wanna share, we always welcome opinions from all sectors, and you can reach us multiple ways. First of which, in our podcast episode show notes we've got a little contact form, so feel free to send us a note there. And we've also got our availability on social media these days. I have been a bit quiet, but I'm hoping to get back into it. I am available on Bluesky at @rpodcast.bsky.social.
I am also on Mastodon at @[email protected], and I'm on the aforementioned LinkedIn. I try to stay away from the clickbait stuff and just post relevant things. Search my name and you'll find me there. And, Mike, where can listeners find you?
[00:46:13] Mike Thomas:
Likewise, it's been a bit quiet, but certainly, if today's podcast is any indication, I'm hoping to get back out there a little bit. You can find me on Bluesky at @mike-thomas.bsky.social, or you can find me on LinkedIn if you search Ketchbrook Analytics, k e t c h b r o o k, and see what I'm up to.
[00:46:34] Eric Nantz:
And I believe I saw you on a golf course recently. I was jealous, because I wanna get my golf game back on. Maybe someday we'll hit a round of nine or eighteen, and you'll beat me to death with your skills, but we'll have fun. That will not happen, but we can certainly golf. Yeah, I'm gonna enjoy that with the best of them. But, nonetheless, we're gonna sink the eighteenth putt here and close up shop for R Weekly Highlights for this week. Hopefully things are back to normal; of course, you never know with the way life goes, especially in the summer. Shout out to all of you who have to shuffle kids around to day camps, summer camps, and whatnot that can wreak havoc on schedules. I'll do my best to wrangle that. So we'll close up shop here. Thank you so much for listening to this episode 207 of R Weekly Highlights, and we hopefully will be back with another episode of R Weekly Highlights next week.
R is here to stay!
Episode Wrapup