Key learnings from learners in recent R workshops, advice on navigating thorny package installation issues within renv, and a showdown of how the parquet and RDS formats perform with large data sets.
Episode Links
- This week's curator: Ryo Nakagawara - @R_by_Ryo (Twitter) & @[email protected] (Mastodon)
- Teaching you - teaching me
- Things that can go wrong when using renv
- Parquet vs the RDS Format
- Entire issue available at rweekly.org/2024-W06
Supplement Resources
- Quartaki - an introduction to Quarto https://drmowinckels.io/quartaki/
- R project management https://www.capro.dev/workshop_rproj/
- r2u - CRAN as Ubuntu binaries https://eddelbuettel.github.io/r2u/
- Shiny and Arrow https://posit.co/blog/shiny-and-arrow
- data.table new release and governance structure https://rdatatable-community.github.io/The-Raft/posts/2024-01-30-new_governance_new_release-toby_hocking/
- rix is looking for testers https://www.brodrigues.co/blog/2024-02-02-nix_for_r_part_9/
- The 2024 Shiny Conference call for speakers https://www.shinyconf.com/call-for-speakers
Supporting the show
- Use the contact page at https://rweekly.fireside.fm/contact to send us your feedback
- R-Weekly Highlights on Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @theRcast (Twitter) and @[email protected] (Mastodon)
- Mike Thomas: @mike_ketchbrook (Twitter) and @[email protected] (Mastodon)
Music credits powered by OCRemix
- Tails and the Music Maker - Picolescence - zircon - https://ocremix.org/remix/OCR02176
- Wily theme - Mega Man 2 - TheManPF, Chocobao, DakotaCityRag, Gamer of the Winds, Zach Chapman - https://ocremix.org/remix/OCR04485
[00:00:03]
Eric Nantz:
Hello, friends. We're back with episode 151 of the R Weekly Highlights podcast. If you are new to the show, this is where we talk about the latest issue of R Weekly that you can find at rweekly.org, and in particular, the highlights that have been selected by our curation team, along with our usual banter and rambles along the way. My name is Eric Nantz, and I'm delighted that you joined us wherever you are around the world. And as always, joining me right at the virtual hip here is my cohost, Mike Thomas. Mike, how are you doing this morning?
[00:00:31] Mike Thomas:
Doing well, Eric. It's pretty crazy that we've surpassed 150 recordings now of R Weekly Highlights. And, I guess, what's the next milestone? 200 to look forward to?
[00:00:43] Eric Nantz:
That's right. The big 200. And, yeah, I know a lot of the podcasts I've listened to, they'll either do a fun little retrospective thing, or they just might act like everything's business as usual. So we'll see what happens when we get there, but it should be fun one way or another. And this week's issue, speaking of fun, is from our longtime curator on the team, Ryo Nakagawara. I have very fond memories of meeting him IRL at one of the Posit, or RStudio, conferences long ago. That was a fun time. I hope I get to meet up with him again someday. But as always, he had tremendous help from our fellow R Weekly team members, and contributors like all of you around the world with your pull requests and other awesome recommendations. Well, Mike, you and I are both, at one point, one way or another in our various projects or consultations.
We do have to do a little guidance or teaching along the way on various concepts. For me, I've definitely been doing a bit of that with, you know, helping with a little bit of internal R training in my organization, getting some analysts lined up with the latest and greatest resources that we have in the R ecosystem. Well, our first highlight is doing just that. But it's a great perspective because anytime that I've done it, whether it's a tutorial, a forum meetup, or a mini workshop, it's not just that I'm trying to help the learners, if you will, you know, learn a new concept. I often learn just as much as they do, especially from that perspective or persona of somebody maybe new to those frameworks or new to the language itself.
And our first highlight today comes from Athanasia Mowinckel, who is a cognitive neuroscientist and now a great R developer who recently gave a fantastic talk, by the way, at posit::conf this past year on using R-universe for her package development. So I really recommend that talk if you haven't seen it. Well, in her latest blog post, she talks about some of the learnings she's had while she was helping others learn at the recent Digital Scholarship Days at the University of Oslo. And in particular, she talked about her findings from two workshops that she gave. One was called Quartaki.
Cool name. And that's an introduction to Quarto. We've actually been talking about Quarto quite a bit in our off-recording chats here, so that's timely. As well as R project management, another area that we're gonna be touching on a little bit later in the show, actually. So in particular, in these two workshops, she had a few interesting findings that you have probably encountered before in one way or another, but one of them is definitely pretty esoteric and has happened to me many times before. And we'll start with Quarto here. Because if you're familiar with creating slides in Quarto or, frankly, even R Markdown before that, a lot of the organization of slides is governed by the use of headings, like the markdown syntax headings.
In particular, having a single heading is going to give you one of those kind of title-like slides with big text in the middle. And then when you do a two-level heading, i.e., with the two hashtags, that's when you typically denote a new slide with content underneath. Well, what was interesting in the workshop prep that she did for this Quarto workshop is that she noticed that there was a slide that had some nice content, and then suddenly the huge text in the middle superimposed on top, which I have had mishaps with in the past with xaringan, back before I adopted Quarto. And I remember I was getting bewildered by just what exactly is going on here. Well, apparently, what one of the learners picked up on is that if you do a single heading but then do another heading that's 3 or more hashes in front,
it's not going to do a new slide after that first-level heading. It's just simply going to superimpose that big text on top of the content that was under that three- or, say, four-level heading, like in her example. So the fix, to force a new slide that doesn't have, say, a typical two-level markdown heading, is the three-dashes syntax, and this works for Quarto, and I believe for xaringan as well, where you force it to create a new slide. Now that's something that you kind of learn from experience. Typically we don't have to use that, but I have used it sparingly in the past. And it's a good mental note to me and future self that I can use that same syntax: okay, I can force a new slide with those three dashes instead of having the mishap that we saw in her presentation slides. There's a quick sketch of that heading behavior below. But the good news is once you make the fix, it's all in version control. Right? Just fix that, commit it, push it, and recompile your slides. So all is well that ends well, as I say.
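To make that concrete, here's a minimal sketch of the heading behavior described above, written from R so it can be run directly; the file name is made up, and this illustrates the general Quarto revealjs rules rather than the exact example from the workshop.

```r
# Hypothetical file name; illustrates Quarto revealjs heading behavior:
# a level-one heading gives a big title-style slide, a level-two heading
# starts a regular content slide, a level-three (or deeper) heading alone
# does NOT start a new slide, and a bare "---" forces a slide break.
slides <- c(
  "---",
  "title: Heading demo",
  "format: revealjs",
  "---",
  "",
  "# A section title slide",     # level one: big centered text
  "",
  "---",                         # force a new slide without a level-two heading
  "",
  "### Just a sub-heading",      # on its own, this would not have started a slide
  "",
  "Some content underneath.",
  "",
  "## A regular content slide",  # level two: new slide with content below
  "",
  "- bullet one",
  "- bullet two"
)
writeLines(slides, "heading-demo.qmd")
# quarto::quarto_render("heading-demo.qmd")  # requires the quarto package and CLI
```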
But speaking of Quarto, another interesting nuance is the idea of how things are actually named between HTML and some of the interfaces we use to build the slides. In particular, if you're familiar with web development, you've probably heard about something called a horizontal rule, which is literally the HR tag in HTML, which gives you that nice horizontal straight line that you can use to separate or partition content, if you will. Well, she asked her learners to add that in their Quarto slides. Right? And some were using the RStudio IDE with the insert option in the visual editor, as opposed to the markdown editor.
Others were using the nice command palette, which has a shortcut to start adding in commands. And you notice there's a discrepancy between the two: in the visual editor, the insert option calls it horizontal rule, but in the command palette version, it's called horizontal line. And, yeah, this may seem trivial, but it can trip people up sometimes. Like, well, what did she ask me to put in? Why is it called something different? And sure enough, you know, she put an issue on the issue tracker about this difference, and sure enough, the Posit team has fixed this. So now it's consistently called horizontal line in the IDE. So good catch.
But, again, it's one of those things that you don't really notice until you put this in front of people. So that was an interesting find on the Quarto side of things. Yeah. I think it's interesting that, I guess, RStudio, or Posit, decided to go with
[00:07:27] Mike Thomas:
horizontal line instead of horizontal rule, which, from a development perspective and especially web development, is what it's always been called.
[00:07:37] Eric Nantz:
Now the next sections of her blog post definitely hit home with certain things I've dealt with, and that's dealing with the constraints that sometimes IT will bring upon you. In particular, one of my biggest bugaboos is spaces in file names or directory names. I have never had good luck with that. And apparently, some of the learners got tripped up with some of this too with respect to some of the materials that she had put together for the project management piece. And again, sometimes you can't control what you can't control. Right? But she has been in contact with IT about how we can make sure that these directory names at least have a little more structure around them, or these file names have a little more structure about them, to try and have the best of both worlds.
Those that are doing more programming, dealing with the file systems that are being created, and those that are simply just trying to get their work done. So it can be a thorny issue. I do admit every time I tell people how to interact, you know, nicely with our Posit Workbench internally or our HPC systems internally, I'm always that annoying person that says no spaces, only underscores and dashes in your file and directory names. You'll thank me later. And usually they do, actually. But that can trip people up as well. So I felt seen on that one. And this next one really hits home: as installations of, let's say, R itself accumulate over time, you've got multiple R versions, which might mean that you've messed around with settings on some of the versions, maybe not. Maybe you've interacted with those site configuration files, which are basically the installation-level Rprofile.site or Renviron.site files.
Maybe your IT group or whoever admins have done some tricky things with library paths. Well, guess what she discovered in some of their shared, you know, project areas: they had 5 library paths at various levels in the stack, and some of them were just completely empty. So sure enough, a lot of crap can happen. She has a fun little ggplot of the different R versions that they have available and how many libraries are installed inside of them. So, again, these things happen over time. I'm privileged because our Linux team, which is top notch at my org, has a system in place where we can load a specific R version as needed with what's called the module command in Linux. So we can just quickly say if we want to use R 4.2.2 or go back to R 3.6.3 for, like, an esoteric reason, we have all that segregated away. But not everybody's that lucky. So I can definitely sympathize with the Wild West of where packages are installed.
So that was an interesting finding that she had with respect to her organization. So all in all, a really entertaining blog post. And again, it just shows you that it's not just the learners benefiting from the workshop. Those of us on the other side of it, on the other side of the fence, so to speak, learn just as much. So terrific blog post by Athanasia, and again, I highly recommend her previous talks as well. Really entertaining stuff.
[00:10:56] Mike Thomas:
Yeah. It seems like all the content that Athanasia is putting together lately is really awesome. I really enjoy following her work, and I couldn't agree more with sort of the overall sentiment of this blog post that sometimes you learn more when you teach than what you might expect, you know, because one of the best debugging exercises, as she puts it, that you can possibly do is to actually go out and teach something and say it out loud. It's somewhat like talking to the little rubber ducky, right, that they say developers should have on their desk. So it's great feedback. It's a great exercise to, sort of, troubleshoot your understanding and really solidify, sort of, what you thought you knew, and then figure out where the gaps exist when you actually go to teach that to other folks. So that's a really interesting reflection that resonated a lot with me, as well as the Quarto stuff and project management. You know, I think we've absolutely all been there, with multiple versions of R on shared file systems and trying to collaborate sort of before the world of renv, then Posit Package Manager, and now Docker, which is very relevant to me because we have a new open source package, and I am trying to figure out what is the earliest version of R that our package will successfully install with.
We went back to 3.5 and realized that it breaks there because some dependencies, I think tidyr being one of them, rely on R 3.6 or greater. So, instead of having to install all of those different versions of R on my computer, I'm able to just change the base image and spin up a Docker container and run the installation and see if it succeeds or fails for our package, which is pretty cool. I think just overall we have a lot more tooling now around R version management and dependency management than we used to have, but I certainly remember the heyday of logging on to sort of a shared network and seeing a million different versions of R, a million different library paths, as she notes that they had 5 different R library paths at various levels, across many different versions of R. So that was sort of nostalgic to see for me, but a great write-up. Lots to learn from here, and thanks to Athanasia for this blog post.
[00:13:22] Eric Nantz:
Yeah. And we're gonna have direct links to each of the workshops that she gave in the show notes because the material is freely available. Terrific slides. Terrific, you know, hands-on work in those workshops. So, again, if you're in the space of kind of beginning your educational journey, if you will, with your respective organization on these concepts, yeah, this is some great material to draw upon. So as we were just talking about that, Mike, one of the features in the previous highlight's workshop about project management was leveraging the renv package for managing your R dependencies in a given project.
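For anyone newer to renv, here's a minimal sketch of the typical workflow; the package being installed is just an example, and none of this is lifted from the workshop materials.

```r
# Run inside a project directory; renv keeps a private library per project.
renv::init()               # set up the project library and an renv.lock file

install.packages("dplyr")  # installs into the project library as usual
renv::snapshot()           # record the exact package versions in renv.lock

# A collaborator (or future you) recreates the same library from the lockfile
renv::restore()
```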
And I have used renv quite a bit as the foundation for my reproducibility workflows at the day job, even for my open source projects and, yes, some very important collaborations with a certain government agency and submission pilots. But like anything in life, nothing is perfect. Right? And there are going to be some snags you encounter along the way, some of which may not be inflicted by renv directly, but things that a new user might encounter in their various setups. And our next blog post for our second highlight today comes from Hugo Gruson, who is a data engineer with the data.org organization and part of the Epiverse project, which has been featured in highlights in the past. And his blog post is talking about some of the things that can go a little wrong when you use renv. So I'm gonna run through some of the issues that Hugo highlights, and then, Mike, I'll turn it over to you for some of the solutions that he talks about. But the first issue, and as a Linux user, oh boy, do I know about this, is the issue of binary package installations versus compiling from source.
So, yes, if you are on those operating systems and you want to use a current version of a package, there's usually no problem installing the binary versions. What happens, though, is that you might need an older version of a package. And in that case, CRAN is not going to have binary versions of older packages available. You are now into the compiling-from-source world, which actually is the default route for most Linux usage of R itself. And that's where you want to pay attention to maybe the package's DESCRIPTION file and see if they call out any system requirements. Because for certain packages, there are going to be system requirements to be able to compile from source.
Typically, these are, like, C-level libraries or other utilities that you might get in your package manager, for instance. But that's going to trip up renv, especially because it might be trying to install an older version of the package from CRAN. And then you're into some issues trying to figure out, okay, what other system library do I need? Now, there are some tools in the ecosystem to help with finding this. I know there are some packages out there, I believe by Posit's team, that help to, you know, identify package dependencies at the system level.
A lot of times you'll end up Googling it anyway, and then you'll figure out the package name that you might need if you're on an Ubuntu system or a Red Hat system or macOS, what kind of library you might need for that. And apparently, there are some gotchas in addition for those on Apple Silicon, i.e., the M1 chips, with source package installation. So there's a good note, especially around gfortran, for that. So my sympathies for anybody that's encountering that issue, because that must be thorny to troubleshoot if you don't know what you're looking for. And I know this from experience because, you know, as Mike mentioned just now, with Docker environments, guess what? That's Linux. So you're gonna have to put in those system dependencies before you start installing those packages. And if you don't know what you're looking for, that can be tricky. Must use r2u. There you go, plug number one. So they're trying to simplify this if you're in the container world. But if you don't know, you don't know. And now you do.
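As a rough sketch of the kind of helper being alluded to here, this is one way to look up a package's declared system requirements before compiling from source; I'm assuming pak's sysreqs helper for the example, and the package name is arbitrary.

```r
# Hedged example: pak can report the system libraries a package declares in
# its SystemRequirements field, per operating system.
pak::pkg_sysreqs("curl")
# On Ubuntu, for instance, this should point at the libcurl/openssl dev
# packages you would need to apt-get install before a source build succeeds.
```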
Well, there are other issues to deal with as well in this space. Maybe you've done the homework. Maybe you've got that system dependency already installed. But how long ago did you install that? And did a recent package that ended up having to be compiled from source utilize a newer version of that same library? That's the example in Hugo's post here: the matrixStats package had a compilation error for those trying to install a version from 2021, about a DOUBLE_XMAX undeclared variable. And sure enough, there was an explanation for this in the release notes of matrixStats, which mentioned that they were moving to a different constant, DBL_MAX, instead of the legacy one called DOUBLE_XMAX.
And there you go. That could trip you up too, because you think, hey, I've done what I needed to do, I got that system library in there. But sometimes you have to keep those up to date as well. So lots of little gotchas. And you may be wondering, oh my goodness, how am I gonna deal with all this? How the heck do I navigate this? And that's where the next part of the post talks about some potential solutions in this space.
[00:18:52] Mike Thomas:
Yeah. So as you noted, you know, CRAN will only provide binaries, I think, for the most recent versions of the R packages that are available on CRAN. However, Posit Package Manager provides a larger collection of binaries for different package versions historically, across different platforms as well, via, you know, the public Posit Package Manager, which is awesome. So for those using renv, by default it may try to install the packages in your renv.lock file from CRAN. My recommendation is to switch that over to Posit Package Manager as quickly as possible. I think you'll find the installation experience not only less prone to running into errors with some of those system dependencies, but also faster, if you're installing binaries as opposed to from source. So that would be recommendation number 1, and that's the first sort of solution that's posited here. No pun intended.
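A minimal sketch of that recommendation, assuming the public Posit Package Manager URL; the exact repository path (especially the distro-specific Linux binary URLs) varies by setup, so treat these URLs as placeholders.

```r
# Point the session at Posit Package Manager so installs prefer its binaries;
# on Linux you would typically use a distro-specific path such as
# https://packagemanager.posit.co/cran/__linux__/jammy/latest
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/latest"))

install.packages("dplyr")  # now resolved via Posit Package Manager
renv::snapshot()           # records the repository URL in renv.lock for collaborators
```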
And then he talks about extending the scope of reproducibility and introduces rig, which is honestly a tool that I have not used enough, but absolutely should. And rig is an R installation manager within the r-lib ecosystem, and it allows you to sort of go back and forth between different versions of R. I believe you can run code against multiple versions of R at the same time. There are some pretty wild things that you can do with rig that I think help solve some of these issues that you may run into, working with different versions of packages across different versions of R. So I think rig can be a really helpful tool for troubleshooting, or doing some of that exploratory work to make sure that your environment is set up correctly and appropriately, in a way that's not going to fail. And then, obviously, you know, we've talked about these at length, even discussed already in the highlights, but there's Docker, there's Nix. Shout out, Bruno Rodrigues.
He has a series of blog posts on Nix which are linked within this blog. So hopefully he's super stoked to see Nix being represented here again, with all of the fantastic resources that he's put together there. There's a link to using renv with Docker, the renv vignette that's within the, I believe, renv pkgdown site, as well as a link to a paper that's an introduction to Rocker, which is one of the most popular images out there for working with R. And that paper is authored by none other than the creator of r2u himself, Dirk Eddelbuettel, who I was talking about, as well as Carl Boettiger.
So that might be a paper that you may be interested in checking out that was published in 2017 in the R Journal. But some just fantastic resources here that allow you to explore some of the different potential solutions for handling these issues that you might be running into when you are trying to work on a new project with potentially an older version of R or older versions of R packages. And sort of the final note here, to summarize this blog post in its entirety, and, Eric, I think you can share this sentiment: I don't think that package management, or environment management, is a solved problem quite yet at this point.
So I think there are a lot of similar sentiments in the Python ecosystem as well. There's pyenv, there's venv, there's pipenv, and so on. A lot of different ways to go about trying to do it, and I don't know if any of them are perfect. And renv is obviously, you know, in my opinion at least, the most recent and sort of, you know, best attempt at package management thus far in the R ecosystem. I think it improves upon some of the things that Packrat, the previous package management package, tried to handle. I know that there is the pak, p-a-k, package as well, which does allow you to create sort of a lock file as well and manage some of these things. But, again, I don't think it's a perfectly solved problem yet. Maybe it never will be. You know, it's a very tricky thing to manage. But I think, in terms of some of those issues that you may run into when using renv, this blog post is a great resource on some of the ways that you can try to troubleshoot those issues.
[00:23:17] Eric Nantz:
Yeah. I echo a lot of those same thoughts, Mike. And it takes me to, even just at a broader level, the issue of distributing software on Linux in general, because there are a lot of issues there that are very common with what's happening in the R ecosystem of packages and the Python ecosystem of package dependencies. There have been some new standards put in place to help give developers kind of a single, quote, unquote, target so that those on any Linux distribution, no matter what, can install these software utilities.
I'm thinking of Flatpak as one that's gotten the most attention, with Snaps probably close behind that. And in R, you're right, there are a lot of different ways to tackle this, and I don't think there is a perfect one in place. I do think what needs to happen, though, and I think this blog post is a great kind of precursor to it, is recognizing that these different paths of reproducibility that you want to take, whether it's the full system reproducibility, talking about the full stack, if you will, or just the package dependencies, just that perspective. Those personas can mean different things in terms of how far you go with these solutions.
So, certainly, what I'm keeping an eye on is, yes, I do often integrate renv with containers, but not renv out of the box. I am gonna configure it a little bit to my liking to make sure that it plays nicely, like you said, with Posit Package Manager, a huge win for container development and package environments. But also, again, you shouted them out: Bruno is on such a roll here with spreading the message of Nix, in particular the rix package that he is co-developing. Nix is taking a lot of, you know, I would say, mindshare in the general software development communities. Certainly, it's a huge topic on the podcasts I listen to. And I think with time we're gonna start seeing some enhancements to what Bruno is working on with rix, but also maybe others sharing their thoughts on it. I know quite a few people in the community are starting to dip their toes in it, myself included; still got a ways to go.
But anything that can simplify that full stack, with or without containers, I think is going to come up kind of above the surface, if you will, as teams and organizations figure out the best way to tackle this. But there was a nugget in the conclusion here I wanna emphasize: when you have multiple members involved on a team for a reproducibility kind of project, there needs to be a real team effort to keep up to date with everything. And I still recommend that if there's even a 2-person or larger team, one person is kind of in charge of handling the renv side of things, if you're doing renv for your package management, because, trust me, there be dragons when you have multiple people clobbering that renv lock file in a GitHub repo and not knowing which one is which. Which change should I pull in? So you're smiling. I know you know what I'm talking about, Mike. We've been there, and it is rough when you don't have that delineation
[00:26:32] Mike Thomas:
set up front. I think there are a ton of organizations out there that struggle with this. We do a lot of work around this to try to set up data science teams and data science collaborative workflows within some of our clients' organizations that we work with, and, you know, like you said, it's an evolving, you know, not perfectly solved problem, but you have to implement a framework and you have to set some sort of controls around how you're going to at least try to employ some of these best practices for collaboration between team members across projects.
Otherwise, you'll just be in a world of pain.
[00:27:07] Eric Nantz:
Yeah. And, you know, data science is hard enough, folks. We don't need more pain alongside our data science adventures. So, yeah, certainly, if you've had your share of ill successes or, frankly, maybe even not-so-great moments with package and environment reproducibility, we'd love to hear about it. We'll tell you how to get in touch with us later in the show. And rounding out our highlights today, something that's right up both of our wheelhouses lately in different ways. But, you know, we've been pretty vocal on this podcast and some of our other ventures that it's a new era in terms of data storage formats. We're talking about databases traditionally or some of these newer methods.
And in particular, a format that we are very excited about is the parquet format, part of the, you know, Apache Arrow project. There are lots of interesting ways that you can leverage this technology to streamline your data storage needs. And, yeah, my cohost here, Mike, yeah, you know a thing or two about this. But this last highlight is coming from Colin Gillespie, who is the CTO of Jumping Rivers, who have been big proponents of advancing computing in their data science consulting projects and blogging for all of us to learn from. And this is part of a series of posts that are diving deep into Apache Arrow with respect to the R ecosystem. And in this blog post here, Colin talks about some of the benefits that you can see in parquet versus what has been the traditional format in the R ecosystem since, frankly, the beginning of the language, and that's the RDS format.
[00:28:53] Mike Thomas:
Absolutely. And you know that I am a huge fan of the parquet format and sort of all the advances that have come within data storage in the last, I don't even know how long it's been, 12, 18 months, between parquet, DuckDB, all those things. It's happened very, very quickly. And Colin leverages one of the most popular example datasets, I believe within the arrow R package, which allows you to easily work with parquet format files and query them using dplyr syntax, which we know and love. And that dataset is the NYC taxi data, and I believe that's New York City taxi data, which is a pretty large dataset, so it makes for a good example when wrangling and querying this large parquet file. And so one of the, you know, big comparisons here is between parquet and RDS files, as you talked about, Eric, which is a file format that we as R users have been using, I think, for as long as R's been around, or as long as I can remember at least, for essentially saving any type of object. Right? It could be a data frame, could be a list, could be a model, often. So it's a very flexible file storage format. And, you know, to date, when we typically compared RDS storage to, like, a CSV, especially if you are storing a data frame, most of the time that RDS file was going to be smaller and snappier to load than really having to read a CSV file. But now that we have this new file storage format called parquet, which is columnar storage, we've sort of gone through that comparison again, this time comparing RDS to the parquet file format.
And that's what Colin's blog post is doing here. And I think you'll be fairly surprised at the results: at least in this example with the New York City taxi dataset, the parquet version appears to outperform the RDS version of this file across a few different metrics. So I don't know if you wanna dive in. I could do the drum roll and you can dive into the results here for us, Eric.
[00:31:09] Eric Nantz:
Alright. Here we go, folks. Yes, the results are in. And one thing to note with the parquet format, and how the arrow package writes to parquet, is that it's taking advantage of a compression utility called Snappy, which is a fun little name right there. But that alone is a huge gain in terms of writing this taxi dataset to disk. And in particular, in the average of the metrics that Colin put together here, it takes on average about 4 seconds to write this taxi dataset to parquet format. Whereas using the gzip compression library with RDS takes 27 seconds on average to write that to disk. Now that is some massive savings right there.
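Here's a rough sketch of the kind of write/read comparison being described, using a small made-up data frame rather than the multi-gigabyte taxi data, so the absolute timings won't match Colin's numbers.

```r
library(arrow)

# Placeholder data standing in for the NYC taxi data
df <- data.frame(
  id    = rep(1:1000, each = 100),
  group = sample(letters, 100000, replace = TRUE),
  value = rnorm(100000)
)

# Writing: parquet uses Snappy compression by default, RDS shown with gzip
system.time(write_parquet(df, "demo.parquet"))
system.time(saveRDS(df, "demo.rds", compress = "gzip"))

# Reading the two formats back into memory
system.time(read_parquet("demo.parquet"))
system.time(readRDS("demo.rds"))
```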
Some nuances here about parquet versus, like, the traditional things like CSV and whatnot: parquet uses column-based storage for how it writes the dataset, which means it can take advantage of repeating values of, like, a numeric index, advantages with, like, common character strings, advantages with POSIXct times, lots of interesting optimizations. We don't have time to get into it all on this podcast, but there are also some references in Colin's post if you wanna really dive into that. So, yes, we already see writing is significant here. How about reading itself?
Now the results aren't quite as drastic. But as you might guess, because of the different ways data is organized behind the scenes of these formats, it actually takes on average about 0.3 or 0.4 seconds to read that into memory from parquet. Whereas for RDS, it takes about 5-ish, 6 seconds on average. Now that, if you're doing interactive analysis, may not be a huge deal to you if you're just kinda doing your data reporting and explorations. But what's the space that you and I play in, Mike? It's Shiny apps. Yep. It can mean everything. Yeah. It can mean absolutely everything. And I'm literally dealing with this right now as I speak, with an open source project where I don't want to load the entire contents of, in this case, a 4,000,000-row dataset.
I wanna just grab what I need at the app load and then, as needed, add in more. I am using parquet for that. It is a very optimal solution. And, yeah, Mike, you know a thing or two about loading parquet in a Shiny app. You wrote a darn article about it, didn't you? Yep. You can find it on the Posit blog. It's a couple years old now. It may need some updating,
[00:33:48] Mike Thomas:
but, yes, there's a blog post called Shiny and Arrow, a match made in high-performance computing heaven, or something like that. So feel free to check that out if you are interested in leveraging parquet files to make your Shiny apps snappy.
[00:34:04] Eric Nantz:
Absolutely. So you can see that, you know, we don't wanna get into the cliche of "it depends on your use case." But how it concludes, or, you know, the obvious question is: okay, for someone new to this world, which one should you use, parquet or RDS, for your next project? Well, as you saw from the metrics on writing, there are just massive gains for writing voluminous data like this taxi data to disk with parquet. I think that if that's a concern to you, and you're doing this on a regular basis and for efficiency, it does seem like parquet is a clear winner on that. For reading, importing into your R session, again, I think it depends on the context you're dealing with here. But I do think that, yeah, if you're in a pipeline that needs as fast a response time as possible, whether that's a Shiny app or other situations, I think parquet is very attractive for those features alone.
Now, one thing to keep in mind, though, is that if you are trying to keep as lean a stack as possible, we were talking about dependencies earlier, right? Well, guess what? RDS is built into R. It's been built into R since the very beginning. So if you don't want to depend on the arrow package for importing this into your R session, that's another, you know, win for the RDS camp, if you will. And again, for smaller datasets, RDS has had no issues in my Shiny apps or my other, you know, data science needs. So, again, it's there. It's always there. You can depend on it no matter where you're running or which version of R, no headaches on that front alone. But I did have an interesting use case for parquet. I'm gonna, you know, give a little insider baseball here on this very podcast on my exploration of this at the day job, where our clinical data sets are organized and, you've guessed it, they're SAS data sets organized across many, many, many different directory and subdirectory patterns based on the treatment, based on the study name and whatnot. Many subdirectories inside. Right?
Well, we get questions from leadership about kind of how many data sets we actually have or, like, how many are SAS? How many programs do we have in this whole space that are SAS based? How many are R based? You know, can we get some metrics around it? So no one's going to do this manually. Right? Ain't nobody got time for that. So let's see if we can read all this metadata into some form of a data structure so we can interrogate it just like any database. Right? I used to use a bloated, and I do mean bloated, SQLite database to house all this. It worked fine-ish until recently, because I had a silly thing with modification times. I kind of had to re-pivot.
So in this re-pivoting, I thought, well, wait a minute here. There's a logical grouping in how these are organized, where there's what I'll call the treatment ID of the treatment, and then within that, there's an umbrella of different studies or experiments under it. Well, this is ripe for grouping in a logical way by those two variables. And instead of having everything written to one massive file, why not distribute these as parquet files for the metadata? So that if I know I only need one particular treatment ID and one particular study I wanna get the data from, I can get this just as easily from Arrow parquet files as I could with anything else. Plus, if I need to update only a specific treatment ID and study combination, I don't have to touch the rest of the study and treatment combinations. I can just update that one set, and it will still magically bind all together if I need it to further on. The magic of the arrow and dplyr packages. It's all right there.
So that is saving me immense time. And the parquet files are fast. They're an efficient size. And I just feel a lot more organized in how I'm keeping track of all this. So that was my recent success with parquet. So, yes, your voice was in my head, Mike, as I was in this rearchitecting adventure. Like, I gotta get away from this monolithic set. What can I do here? And parquet was the answer.
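Here's a hedged sketch of the partitioned-metadata idea described above; the column names (treatment_id, study_id), file names, and paths are invented for illustration, not the actual setup.

```r
library(arrow)
library(dplyr)

# Invented stand-in for the file metadata collected across directories
file_metadata <- data.frame(
  treatment_id = c("trt_001", "trt_001", "trt_002"),
  study_id     = c("study_A", "study_B", "study_A"),
  file         = c("adsl.sas7bdat", "adae.sas7bdat", "adsl.sas7bdat"),
  modified     = Sys.time()
)

# Write one parquet file per treatment/study combination
write_dataset(
  file_metadata,
  path = "metadata_store",
  format = "parquet",
  partitioning = c("treatment_id", "study_id")
)

# Later, pull just one combination without touching (or reading) the rest
open_dataset("metadata_store") |>
  filter(treatment_id == "trt_001", study_id == "study_A") |>
  collect()
```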
[00:38:14] Mike Thomas:
I don't wanna sound too cliche, but I am very proud of you, Eric. Great job.
[00:38:19] Eric Nantz:
Well, as you know, we just scratched the surface here. Every R Weekly issue has a lot more terrific content for you to learn about the R ecosystem, data science, integrations of R, and many other ways to inspire your daily journeys with data science. And, of course, we have the link to the issue in the show notes, but we're going to take a couple of minutes for some additional finds here. And I want to give a great shout out to a project that just keeps rolling along and had a massive update recently, and that is data.table, which just had a major new release along with a new governance structure for how they manage the project's life going forward.
This is a really fascinating post authored by Toby Dylan Hocking, and, again, we'll link to it in the show notes, but it's a really great kind of road map of what they've done to help put a little more governance around the data.table project. There have been newer members joining the team, there's been new maintainership, and lots of transparency on what they're looking at as new features going forward. So on top of that new release, it's really a great time, if you've been using data.table and you wanna get involved with the project. They're making it even more transparent what the road map is and their contribution guidelines and kind of where things are at going forward. So a big shout out again to the data.table team. They're doing immense work in this space.
Always have tremendous respect for that project, and congrats
[00:39:48] Mike Thomas:
on the release of 1.15.0. Yes. Congrats to that team. That's fantastic news. I'm gonna reach across the aisle to Bruno and shout out his new blog post called Reproducible data science with Nix, part 9: rix is looking for testers. So this is a call-to-action blog post. The rix package, spelled r-i-x, is an R package that leverages Nix. Essentially, it allows you, I believe, to work with Nix and that configuration from R. So if you are interested in Nix for environment and package management and want to kick the tires on his rix package and give some feedback, I think that would be really greatly appreciated. So check out this blog post for info on how to get started.
[00:40:36] Eric Nantz:
Yeah. Huge congrats to Bruno and Philipp, the co-maintainer of rix. They have been doing immense work on this over 5 months and counting, according to the blog post. So, yeah, getting real-world usage of rix is hugely important as they get to this stable state, if you will. And, yeah, count me in, Bruno. I'm gonna be testing the heck out of rix. I've already done initial explorations near the end of last year, but I am firmly on board with seeing just how far we can take it. And my initial experiences have been quite positive, to say the least. But, yeah, I'll definitely put it through some more rigor. And again, a shout out to all of you in the community that have been even just remotely curious about this: give it a shot. Let them know what you think. Because I do think, in the reproducibility story, that this is going to get a lot more traction as we get more users involved. So, again, huge congrats to Bruno and Philipp on getting close to this major milestone.
But, of course, he doesn't just want to hear from all of you. We want to hear from all of you too. Right? And the best way to get in touch with us, well, you've got a few ways to do it, actually. First of which, there's a contact page directly linked in the episode show notes if you want to give us a quick shout out on that. Also, if you're on the modern podcast app train like a few of us are, using, say, Podverse, Fountain, Castamatic, Pod Home, or many others, there are lots of great ways to get in touch with us on that via the boost functionality, or from the Podcast Index directly, where this podcast is proudly hosted. You can find details on that in the show notes of the episode as well. And I'm doing some fun projects analyzing a massive amount of, literally, podcast metadata as we speak. And it's got a lot of geekery behind the scenes with Quarto, pointblank, Docker, GitHub Actions. There's going to be a lot to talk about with this.
I just conquered GitHub Actions successfully with an automated runtime of these pipelines, which I felt pretty stoked about, because I'm still a bit of a newer user of GitHub
[00:42:46] Mike Thomas:
but I'm getting there. I'm getting there, Mike. The old dog here learns a new trick once in a while. No, that's awesome. I have no doubt that you'll nail that. GitHub Actions has been definitely a game changer for us at Ketchbrook.
[00:42:56] Eric Nantz:
Yeah. It's one of those things where you just can't imagine how you lived without it all these years, but it is there, and it's a great service. And, also, if you wanna get in touch with us on the social medias, I'm mostly on Mastodon these days. My handle is [email protected]. I am sporadically on the Weapon X thingy at @theRcast, and also on LinkedIn from time to time. And also a quick reminder, we plugged this a couple of weeks ago, but the call for talks is still open for the upcoming Appsilon Shiny conference. So if you have a talk you'd like to share with the rest of those Shiny enthusiasts out there, Mike and I are obviously big fans of this conference. Yeah, we'll have a link again to the conference registration and talk submissions in the show notes. And, Mike, where can a listener get a hold of you? Sure. You can find me on LinkedIn by searching Ketchbrook Analytics,
[00:43:50] Mike Thomas:
k-e-t-c-h-b-r-o-o-k. Or you can find me on Mastodon at [email protected].
[00:43:59] Eric Nantz:
Very nice, Mike. And, yeah, you know, Mike deserves some extra praise here. You're not gonna know this from the polished version you hear of this episode, but he had to put up with a lot of shenanigans during our recording today. So my thanks to you for putting up with all that. As if you haven't had to put up with me on other episodes. So we're even. Yeah, we'll see who causes the chaos next time around. But in any event, that's gonna wrap episode 151 of R Weekly Highlights, and we're so happy you listened to us, and we hope you join us for another episode of R Weekly Highlights next week.
Hello, friends. We're back with episode a 151 of the R Weekly Holidays podcast. If you are new to the show, this is where we talk about the latest issue of Our Weekly that you can find at rweekly.org, and in particular, the highlights that have been selected by our curation team along with our usual banter and rambles along the way. My name is Eric Nantz, and I'm delighted that you joined us wherever you are around the world. And as always, joining me right at the virtual hip here is my cohost, Mike Thomas. Mike, how are you doing this morning?
[00:00:31] Mike Thomas:
Doing well, Eric. It's pretty crazy that we've surpassed a 150, recordings now of the our weekly highlights. And, I guess, what's the next milestone? 200 to look forward to?
[00:00:43] Eric Nantz:
That's right. The Vague 200. And, yeah. I know a lot of the podcasts I've listened to, they'll either do a fun little retrospectively thing, or they just might act like everything's business as usual. So we'll see what happens when we get there, but it should be fun one way or another. And this week's issue, speaking of fun, is from our longtime curator on the team, Ryo Nakagawara. So I have very fond memories of meeting him in IRL at one of the pa or r studio conferences long ago. That was a fun time. I hope I get to meet up with him again someday. But as always, he had a tremendous help from our fellow, our Wiki team members, and contributors like all of you around the world with your poll requests and other awesome recommendations. Well, Mike, you and I are both at one point, one way or another in our various projects or consultations.
We do have to do a little guidance or teaching along the way on various concepts. For me, I've definitely been doing a bit of that with, you know, helping with a little bit of inside our training in my organization, getting some analysts lined up with the latest and greatest resources that we have in your ecosystem. Well, our first highlight is doing just that. But it's a great perspective because anytime that I've done, whether it's a tutorial, a forum meetup, or a mini workshop, it's not just I'm trying to help the the learners, if you will, you know, learn a new concept. I often learn just as much as they do, especially from that perspective or persona of somebody maybe new to those frameworks or new to the the language itself.
And our first highlight today comes from Athanasia Mowenkel, who is a cognitive neuroscientist and now a great R developer who recently gave a fantastic talk, by the way, at Posikoff this past year on using R Universe for her package development. So really recommend that talk if you haven't seen it. Well, on her latest blog post, she talks about some of the learnings she's had while she was helping others learn at a recent digital scholarship days at her University of Oslo. And in particular, she talked about her findings from 2 workshops that she gave. One was called Kortaki.
Cool name. And that's an introduction to Quarto. We've actually been talking about Quarto quite a bit in our off recording here, so that's timely. As well as our project management, another area that we're gonna be touching on a little bit later in the show, actually. So in particular, in these two workshops, she had a few interesting findings that probably have you have encountered before in one way or another, but one of them is definitely pretty esoteric that has happened to me, many times before. And we'll start with quarto here Because if you're familiar with creating slides in quarto or, frankly, even Rmarkdown before that, a lot of the organization of slides is governed by the use of headings, like the the markdown syntax headings.
In particular, having a single heading is going to give you one of those kind of title like slides with a big text in the middle. And then when you do a two level heading, I. E. With the 2 hashtags, that's when you typically denote a new slide with content underneath. Well, what was interesting in the workshop prep that she did for this quarter workshop is that she noticed that there was a slide that had some nice content, and then suddenly the huge text in the middle superimposed on top, which I have had mishaps of in the past with sharing in back when before I adopted Qartle. And I remember I was getting bewildered by just what exactly is going on here. Well, apparently, one of the learners picked up about this is that if you do a single heading but then do another heading that's 3 or more hashes in front.
It's not going to do a new slide after that first level heading. It's just simply going to superimpose that big text on top of the context that was under that 3 or, say, 4 level heading, like in her example. So the switch was to force a new slide that doesn't have, say, a typical two level markdown heading. There is the 3 dashes syntax, and this works for both Quarto, and I believe for sharing it as well, where then you force it to create a new slide. Now that's something that you kind of learn from experience. It's typical we don't have to use that, but I have used that sparingly in the past. And it's a good mental note to me and future self that I can use that same syntax of, okay, I can force a new slide with that nomenclature or that, 3 dashes instead of having the mishap that we saw in her presentation slides. But the good news is once you make the fix, it's all in version control. Right? Just fix that, commit it, push it, and recompile your slides. So all is well that ends well, as I say.
But speaking of quartile, another interesting nuance is the idea of how things are actually named between HTML and some of the interfaces we use to build the slides. In particular, if you're familiar with web development, you've probably heard about something called a horizontal rule, which is literally the HR tag in HTML, which gives you that nice horizontal straight line that can use a separate or partition content, if you will. Well, she asked her learners to add that in her portal slides. Right. And then some are using the RStudio IDE with the insert field and the markdown editor, the visual editor.
Others are using the nice command palette, which has a shortcut to start adding in commands. And you notice there's a discrepancy between the 2 is that in the visual editor, the insert calls it horizontal rule. But then the command palette version, it's called horizontal line. And, yeah, this may seem trivial, but it can trip up people sometimes. Like, well, that what did what did she ask me to put in? Why is it called different? And sure enough, you know, she put an issue on the Cornell issue tracker with this difference, and sure enough, the Pasa team has fixed this. So now it's consistently called horizontal line in the IDE now. So good catch.
But, again, one of those things that you don't really notice until you put this in front of people. So that was that was an interesting find on the portal side of things. Yeah. I think it's interesting that they called it that they I guess our studio or or posit decided to go with
[00:07:27] Mike Thomas:
horizontal line instead of horizontal rule which is sort of from a development perspective and especially web development, what it's what it's always been called.
[00:07:37] Eric Nantz:
Now the next sections of her blog post definitely hit home with certain things I've dealt with, and that's dealing with the constraints that sometimes IT will bring upon you. And that is in particular one of my biggest bugaboos is spaces and file names or directory names. I have never had good luck with that. And apparently, some of the learners got tripped up with some of this too in respect to some of the materials that she had put together for the project management piece. And again, sometimes you can't control what you can't control. Right? But she has been in contact with IT about how can we make sure that these directory names at least have a little more structure around them or these file names have a little more structure about them to try and have the best of both worlds.
Those that are doing more programming, dealing with the file systems that are being created, and those that are simply just trying to get their work done. So it can be a thorny issue. I do admit every time I tell people how to interact, you know, nicely with our POSIT workbench internally or over our HPC systems internally, I'm always that annoying person that says no spaces, only underscores and dashes in your file and their directory names. You'll thank me later. And usually they do actually. But that can trip people up as well. So I felt seen on that one. And this next one is really hitting home is that as installations, let's say, r itself accumulate over time, you got multiple r versions, which might mean that you've messed around with settings on some of the versions, maybe not. Maybe you've interacted with those site configuration files, which are basically the installation level, our profile, or our environment files.
Maybe your IT group or whoever admins have done some tricky things with library pass. Well, guess what? That what she discovered is that there are some of their shared, you know, project areas. They had 5 library paths at various levels in the stack, and some of them were just completely empty. So sure enough, a lot of crap can happen. She has a fun little ggplot of the different r versions that they have available and how many libraries are installed inside of them. So that, again, these things happen over time. I'm privileged because our Linux team, which is top notch at my org, has a system in place where we can load a specific R version as needed with what's called the module command in Linux. So we can just quickly say if we want to use r 422 or go back to r 363 for, like, an esoteric reason, we have all that segregated away. But not everybody's that lucky. So I can definitely sympathize with the Wild West of where packages are installed.
So that was an interesting finding that she had with respect to her organization. So all in all, really entertaining blog posts. And again, just shows you that it's not just the learners benefiting from the workshop. Us on the other side of it, on the other side of the fence, so to speak. We learn just as much. So terrific blog post by Athanasia, and we really, again, I highly recommend her previous talks as well. Really entertaining stuff.
[00:10:56] Mike Thomas:
Yeah. It it seems like all the content that Athanasia is is putting together lately is is really awesome. I really enjoy, following her work and I couldn't agree more with sort of the overall sentiment of this blog post that sometimes you learn more when you teach than, what you might expect, you know, because it's one of the best debugging exercises as she puts it, that you can possibly do is to actually go out and and teach something and and say it out loud. It's it's somewhat like talking to the little rubber ducky, right, that, since developers say that you should have, on your desk. So it's it's great feedback. It's a great exercise in order to, sort of, troubleshoot, your understanding and and really solidify, sort of, what you what you thought you knew and then figure out, where the gaps exist when you actually go to teach that to other folks. So that that's a really interesting reflection that resonated a lot with me, as well as the the Quarto stuff and project management. You know, I think we've absolutely all been there been there, with multiple versions of R on shared file systems and and trying to collaborate sort of before the world of RN, then posit package manager, and, now Docker, which is very relevant to me because we have a a new open source package, and I am trying to figure out what is the earliest version of R that our package will successfully install with.
We went back to to 3.5 and realized that it breaks there because some dependencies, I think Tidyr being one of them, rely on our 3.6 or greater. So, instead of having to install all of those different versions of R on my computer, I'm able to just change the change the base image and and spin up a Docker container and and run the installation and see if it succeeds or if it fails of our package which is is pretty cool I think just overall we have a lot more tooling now around our version management and dependency management than we used to have but but I certainly remember the heydays of logging on to sort of a shared network and seeing a 1000000 different versions of R, a 1000000 different library paths as she notes that they had 5 different R library paths at various levels, across many different versions of R. So that was that was, sort of nostalgic to see for me, but a great write up. Lots to learn from here, and thanks to Athanasya for this blog post.
[00:13:22] Eric Nantz:
Yeah. And we're gonna have direct links to each of the workshops that she gave in the show notes because the material is freely available. Terrific slides, terrific hands-on work in those workshops. So, again, if you're in the space of beginning your educational journey, if you will, with your respective organization on these concepts, this is some great material to draw upon. And as we were just talking about, Mike, one of the features in the previous highlight's workshop about project management was leveraging the renv package for managing your R dependencies in a given project.
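For anyone new to renv, the basic loop those workshops (and the rest of this discussion) revolve around looks roughly like this; a minimal sketch, not tied to any particular project:

```r
# install.packages("renv")   # once, outside the project

renv::init()       # create a project-local library plus an renv.lock file
# ...install packages as usual with install.packages() or renv::install()...
renv::snapshot()   # record the exact package versions into renv.lock
renv::status()     # check for drift between the library and the lockfile

# A collaborator (or future you) recreates the same library with:
renv::restore()
```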
And I have used renv quite a bit as the foundation for my reproducibility workflows at the day job, even for my open source projects and, yes, some very important collaborations with a certain government agency and submission pilots. But like anything in life, nothing is perfect. Right? There are going to be some snags you encounter along the way, some of which may not be inflicted by renv directly, but rather are things a new user might encounter in their various setups. Our next blog post, for our second highlight today, comes from Hugo Gruson, who is a data engineer with the data.org organization and part of the Epiverse project, which has been featured in the highlights in the past. His blog post talks about some of the things that can go a little wrong when you use renv. So I'm gonna run through some of the issues that Hugo highlights, and then, Mike, I'll turn it over to you for some of the solutions he talks about. The first issue, and as a Linux user, oh boy do I know about this, is binary package installations versus compiling from source.
So, yes, if you are on Windows or macOS and you want to use the current version of a package, there's usually no problem installing the binary version. What happens, though, is that you might need an older version of a package. In that case, CRAN is not going to have binaries of older package versions available, and you are now in the compiling-from-source world, which is actually the default route for most Linux usage of R itself. That's where you want to pay attention to the package's DESCRIPTION file and see if it calls out any system requirements, because for certain packages there are going to be system requirements to be able to compile from source.
Typically these are C-level libraries or other utilities that you might get from your operating system's package manager, for instance. But that's going to trip up renv, especially because it might be trying to install an older version of the package from CRAN, and then you're into some issues trying to figure out, okay, what other system library do I need? Now, there are some tools in the ecosystem to help with finding this. I know there are some packages out there, I believe from Posit's team, to help identify a package's dependencies at the system level.
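One of those tools is the Posit-maintained pak package; a quick sketch, assuming pak's pkg_sysreqs() helper and its argument names as documented at the time, which are worth double-checking against your installed version:

```r
# install.packages("pak")   # if not already available

# Which OS-level libraries does sf (plus its dependency tree) need on Ubuntu 22.04?
pak::pkg_sysreqs("sf", sysreqs_platform = "ubuntu-22.04")
# Typically reports packages such as libgdal-dev, libgeos-dev, libproj-dev, etc.
```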
A lot of times you'll end up Googling it anyway, and then you'll figure out the package name you might need if you're on an Ubuntu system or a Red Hat system, or what kind of library you might need on macOS. And apparently there are some additional gotchas for those on Apple Silicon, i.e. the M1 chips, with source package installation. There's a good note especially around gfortran for that. So my sympathies to anybody encountering that issue, because that must be thorny to troubleshoot if you don't know what you're looking for. I know this from experience because, as Mike mentioned just now with Docker environments, guess what? That's Linux. So you're gonna have to put in those system dependencies before you start installing those packages, and if you don't know what you're looking for, that can be tricky. You could use r2u, there you go, another plug for it, if you're in the container world and want to simplify this. But if you don't know, you don't know. And now you do.
Well, there are other issues to deal with in this space as well. Maybe you've done the homework. Maybe you've got that system dependency already installed. But how long ago did you install it, and does the package you now have to compile from source expect a newer version of that same library or toolchain? The example in Hugo's post is that the matrixStats package had a compilation error for those trying to install a version from 2021, complaining about an undeclared variable called DOUBLE_XMAX. And sure enough, there was an explanation for this in the release notes of matrixStats, which mentioned they were moving to a different construct, the constant DBL_MAX, instead of the legacy DOUBLE_XMAX.
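To see this class of failure for yourself, the trigger is simply requesting a version old enough that no binary exists; a hypothetical sketch, where the exact version that breaks will depend on your R version and compiler toolchain:

```r
# Asking for an old enough version forces a source build, since CRAN only keeps
# binaries for current releases. Both calls below pull 2021-era sources:
renv::install("[email protected]")
# or, outside of renv:
remotes::install_version("matrixStats", version = "0.58.0")

# If the build then fails with something like "'DOUBLE_XMAX' undeclared",
# the culprit is old source code meeting a newer R toolchain, not renv itself.
```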
And there you go. That could trip you up too, because you think, hey, I've done what I needed to do, I got that system library in there. But sometimes you have to keep those up to date as well. So lots of little gotchas. You may be wondering, oh my goodness, how the heck do I navigate all this? And that's where the next part of the post talks about some potential solutions in this space.
[00:18:52] Mike Thomas:
Yeah. So as you noted, CRAN will only provide binaries, I think, for the most recent versions of the R packages that are available on CRAN. However, Posit Package Manager provides a larger collection of binaries for different package versions historically, and across different platforms as well, via the public Posit Package Manager, which is awesome. For those using renv, by default it may try to install the packages in your renv.lock file from CRAN. My recommendation is to switch that over to Posit Package Manager as quickly as possible. I think you'll find the installation experience not only less prone to running into errors with some of those system dependencies, but also faster, since you're installing binaries as opposed to building from source. So that would be recommendation number 1, and that's the first sort of solution that's posited here. No pun intended.
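In practice, the switch is usually just a repos option set before the next snapshot; a minimal sketch for a project already using renv, using the public P3M CRAN URL (swap in a distro-specific URL if you want Linux binaries):

```r
# Point the project at Posit Public Package Manager so renv installs binaries
# where possible. For Linux binaries, use the distro-specific URL instead,
# e.g. .../cran/__linux__/jammy/latest for Ubuntu 22.04.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/latest"))

# renv records this repository the next time you snapshot:
renv::snapshot()

# Confirm what the project will use:
getOption("repos")
```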
And then they talk about extending the scope of reproducibility and introduce rig, which is honestly a tool that I have not used enough, but absolutely should. rig is the R Installation Manager from the r-lib ecosystem, and it allows you to install and switch between different versions of R; I believe you can even run code against multiple versions of R at the same time. There are some pretty wild things you can do with rig that I think help solve some of these issues you may run into working with different versions of packages across different versions of R. So I think rig can be a really helpful tool for troubleshooting, or for doing some of that exploratory work to make sure your environment is set up correctly and appropriately in a way that's not going to fail. And then, obviously, there are the tools we've talked about at length, even already in this episode: there's Docker, there's Nix. Shout out, Bruno Rodrigues.
He has a series of blog posts on Nix which are linked within this blog, so hopefully he's super stoked to see Nix being represented here again with all of the fantastic resources he's put together. There's a link to using renv with Docker, the renv vignette that's within the renv pkgdown site, I believe, as well as a link to a paper that's an introduction to Rocker, which is one of the most popular image stacks out there for working with R. That paper is authored by none other than the creator of r2u himself, Dirk Eddelbuettel, whom I was just talking about, along with Carl Boettiger.
So that might be a paper you're interested in checking out; it was published in 2017 in the R Journal. Just fantastic resources here that let you explore some of the potential solutions for handling the issues you might run into when you are trying to work on a new project with a potentially older version of R or older versions of R packages. And the final note here, to summarize this blog post in its entirety, and Eric, I think you can share this sentiment: I don't think that package management, or environment management, is a solved problem quite yet at this point.
There are a lot of similar sentiments in the Python ecosystem as well. There's pyenv, there's venv, there's pipenv, and so on: a lot of different ways to go about it, and I don't know if any of them are perfect. And renv is obviously, in my opinion at least, the most recent and best attempt at package management thus far in the R ecosystem. I think it improves upon some of the things that Packrat, the previous package management package, tried to handle. I know there is also the pak (p-a-k) package, which lets you create a lockfile as well and manage some of these things. But again, I don't think it's a perfectly solved problem yet. Maybe it never will be; it's a very tricky thing to manage. But in terms of some of those issues you may run into when using renv, this blog post is a great resource on some of the ways you can try to troubleshoot them.
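For reference, pak's take on lockfiles looks roughly like this; a sketch assuming pak's lockfile_create() and lockfile_install() helpers as documented:

```r
# install.packages("pak")

# Resolve the current project's dependency tree and write it to a lockfile:
pak::lockfile_create("deps::.", lockfile = "pkg.lock")

# Later, e.g. in a CI job, install exactly what the lockfile recorded:
pak::lockfile_install("pkg.lock")
```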
[00:23:17] Eric Nantz:
Yeah. I echo a lot of those same thoughts, Mike. And it takes me, at an even broader level, to the issue of distributing software on Linux in general, because a lot of the issues that are very common there echo what's happening in the R ecosystem of packages and the Python ecosystem of package dependencies. There have been some new standards put in place to help give developers a single, quote unquote, target so that those on any Linux distribution, no matter what, can install these software utilities.
I'm thinking of Flatpak as the one that's gotten the most attention, with Snaps probably close behind. And in R, you're right, there are a lot of different ways to tackle this, and I don't think there is a perfect one in place. What I do think needs to happen, though, and this blog post is a great precursor to it, is recognizing that the different paths of reproducibility you might take, whether it's full system reproducibility, the full stack if you will, or just the package dependencies, mean different things for how far you go with these solutions. Those personas matter.
Certainly what I'm keeping an eye on is, yes, I do often integrate renv with containers, but not renv out of the box. I'm going to configure it a little to my liking to make sure it plays nicely, like you said, with Posit Package Manager, a huge win for container development and package environments. But also, and you shouted him out already, Bruno is on such a roll here spreading the message of Nix, in particular the rix package that he is co-developing. Nix is taking up a lot of mindshare, I would say, in the general software development communities; it's certainly a huge topic on the podcasts I listen to. And I think with time we're gonna start seeing some enhancements to what Bruno is working on with rix, but also maybe others sharing their thoughts on it. I know quite a few people in the community are starting to dip their toes into it, myself included, though I've still got a ways to go.
But anything that can simplify that full stack, with or without containers, I think is going to rise above the surface, if you will, as teams and organizations figure out the best way to tackle this. There was a nugget in the conclusion I want to emphasize here: when you have multiple members involved on a team for a reproducibility-minded project, there needs to be a real team effort to keep everything up to date. And I still recommend that, even on a 2-person or larger team, one person is in charge of handling the renv side of things if you're using renv for your package management. Because, trust me, there be dragons when you have multiple people clobbering that renv.lock file in a GitHub repo and not knowing which change is which or which change to pull in. You're smiling; I know you know what I'm talking about, Mike. We've been there, and it is rough when you don't have that delineation
[00:26:32] Mike Thomas:
set up front. I think there are a ton of organizations out there that struggle with this. We do a lot of work around this, trying to set up data science teams and data science collaborative workflows within some of our clients' organizations. And, like you said, it's an evolving, not perfectly solved problem, but you have to implement a framework and set some sort of controls around how you're going to at least try to employ some of these best practices for collaboration between team members across projects.
Otherwise, you'll just be in a world of pain.
[00:27:07] Eric Nantz:
Yeah. And, you know, data science is hard enough, folks. We don't need more pain alongside our data science adventures. So certainly, if you've had your share of successes or, frankly, maybe even some not-so-great moments with package and environment reproducibility, we'd love to hear about it. We'll tell you how to get in touch with us later in the show. And rounding out our highlights today is something that's right up both of our wheelhouses lately, in different ways. We've been pretty vocal on this podcast and in some of our other ventures that it's a new era in terms of data storage formats, whether we're talking about traditional databases or some of these newer methods.
In particular, a format that we are very excited about is the parquet format, part of the Apache Arrow project. There are lots of interesting ways you can leverage this technology to streamline your data storage needs. And, yeah, my cohost here, Mike, knows a thing or 2 about this. This last highlight comes from Colin Gillespie, the CTO of Jumping Rivers, who have been big proponents of advancing computing in their data science consulting projects and blogging about it for all of us to learn from. This is part of a series of posts diving deep into Apache Arrow with respect to the R ecosystem. In this blog post, Colin talks about some of the benefits you can see with parquet versus the traditional format we've been using in the R ecosystem since, frankly, the beginning of the language: the RDS format.
[00:28:53] Mike Thomas:
Absolutely. And you know that I am a huge fan of the parquet format and all the advances that have come in data storage in the last, I don't even know how long it's been, 12 to 18 months, between parquet, DuckDB, all those things. It's happened very, very quickly. Colin leverages one of the most popular example datasets used with the arrow R package, which lets you easily work with parquet files and query them using dplyr syntax, which we know and love. That dataset is the NYC taxi data, on New York City taxi trips, which is a pretty large dataset, so it makes for a good example when wrangling and querying a large parquet file. And one of the big comparisons here is parquet versus RDS files, as you talked about, Eric. RDS is a file format that we as R users have been using, I think, for as long as R has been around, or at least as long as I can remember, for saving essentially any type of object: it could be a data frame, it could be a list, it could often be a model. So it's a very flexible file storage format. And to date, when we typically compared RDS storage to something like a CSV, especially when storing a data frame, most of the time that RDS file was going to be smaller and snappier to load than reading a CSV file. But now that we have this newer columnar storage format called parquet, we've sort of gone through that comparison again, this time comparing RDS to the parquet file format.
And that's what Colin's blog post is doing here. I think you'll be fairly surprised at the results: at least for this example, the parquet version of the New York City taxi dataset appears to outperform the RDS version across a few different metrics. So I'll do the drum roll and you can dive into the results here for us, Eric.
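For anyone who has not touched arrow yet, the query pattern Colin benchmarks looks roughly like this; a sketch assuming a local copy of the taxi data in parquet form, with the path and column names illustrative:

```r
library(arrow)
library(dplyr)

# Open a parquet file (or a directory of parquet files) without loading it:
taxi <- open_dataset("data/nyc-taxi/")   # path is illustrative

# dplyr verbs are translated and pushed down to arrow; nothing is pulled into
# memory until collect() is called:
taxi |>
  filter(passenger_count > 2) |>                           # columns are illustrative
  group_by(passenger_count) |>
  summarise(trips = n(),
            avg_fare = mean(fare_amount, na.rm = TRUE)) |>
  collect()
```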
[00:31:09] Eric Nantz:
Alright. Here we go, folks. The results are in. One thing to note about the parquet format, and how the arrow package writes to parquet, is that it takes advantage of a compression utility called Snappy, which is a fun little name right there. That alone is a huge gain in terms of writing this taxi dataset to disk. In particular, averaging the metrics that Colin put together here, it takes about 4 seconds to write this taxi dataset to the parquet format, whereas using the gzip compression in RDS takes about 27 seconds on average to write it to disk. That is some massive savings right there.
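If you want to run this kind of comparison on your own data, the shape of the benchmark is simple; a minimal sketch with a small stand-in data frame, so the timings will differ wildly from Colin's numbers:

```r
library(arrow)

# A small stand-in dataset; Colin's post uses the much larger NYC taxi data.
df <- data.frame(
  id  = seq_len(1e6),
  grp = sample(letters, 1e6, replace = TRUE),
  val = rnorm(1e6)
)

# Writing: parquet (Snappy compression by default) vs RDS (gzip by default)
system.time(write_parquet(df, "df.parquet"))
system.time(saveRDS(df, "df.rds"))

# Reading the files back in:
system.time(df_pq  <- read_parquet("df.parquet"))
system.time(df_rds <- readRDS("df.rds"))

# File sizes are worth comparing too:
file.size(c("df.parquet", "df.rds"))
```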
Some of the nuance here, compared to traditional formats like CSV, is that parquet writes the dataset in a column-based, partitioned layout, which means it can take advantage of repeated values, say a numeric index, common character strings, or POSIXct timestamps: lots of interesting optimizations. We don't have time to get into it all on this podcast, but there are references in Colin's post if you want to really dive into that. So, yes, we've already seen that the writing gains are significant. How about reading?
Now the results aren't quite as drastic, but as you might guess, because of the different way data is organized behind the scenes in these formats, it takes on average about 0.3 to 0.4 seconds to read the data into memory from parquet, whereas for RDS it takes about 5 or 6 seconds on average. If you're doing interactive analysis, that may not be a huge deal when you're just doing your data reporting and explorations. But what's the space that you and I play in, Mike? It's Shiny apps. Yep. It can mean everything. Yeah, it can mean absolutely everything. And I'm literally dealing with this right now, as we speak, with an open source project where I don't want to load the entire contents of, in this case, a 4,000,000-row dataset.
I want to just grab what I need at app load and then, as needed, pull in more. I am using parquet for that, and it is a very optimal solution. And, yeah, Mike, you know a thing or two about loading parquet in a Shiny app. You wrote a darn article about it, didn't you? Yep. You can find it on the Posit blog. It's a couple years old now, and it may need some updating,
[00:33:48] Mike Thomas:
but, yes, there's a blog post called Shiny and Arrow, a match made in high-performance computing heaven, or something like that. So feel free to check that out if you are interested in leveraging parquet files to make your Shiny apps snappy.
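The pattern that article describes, in spirit, is to keep the data on disk as parquet and only collect the slice a user asks for; a minimal sketch, with the dataset path, column names, and input choices all hypothetical:

```r
library(shiny)
library(arrow)
library(dplyr)

# Opened once at app start-up: this reads metadata only, not the rows.
trips <- open_dataset("data/trips_parquet/")   # hypothetical path

ui <- fluidPage(
  selectInput("vendor", "Vendor", choices = c("CMT", "VTS")),  # hypothetical values
  tableOutput("summary")
)

server <- function(input, output, session) {
  output$summary <- renderTable({
    trips |>
      filter(vendor_id == input$vendor) |>     # filter is pushed down to arrow
      group_by(payment_type) |>
      summarise(n_trips  = n(),
                avg_fare = mean(fare_amount, na.rm = TRUE)) |>
      collect()                                # only the small summary enters memory
  })
}

shinyApp(ui, server)
```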
[00:34:04] Eric Nantz:
Absolutely. So you can see that, without getting too cliche, it depends on your use case. The obvious question, if you're new to this world, is: which one should you use for your next project, parquet or RDS? Well, as you saw from the metrics, there are massive gains when writing voluminous data like this taxi data to disk with parquet. If that matters to you and you're doing this on a regular basis, then for efficiency parquet does seem like the clear winner. For reading, importing into your R session, again I think it depends on the context you're dealing with. But I do think that if you're in a pipeline that needs a fast response time, whether that's a Shiny app or another situation, parquet is very attractive for those features alone.
Now, one thing to keep in mind is that if you are trying to keep as lean a stack as possible, and we were talking about dependencies earlier, well, guess what? RDS is built into R. It's been built into R since the very beginning. So if you don't want to depend on the arrow package for importing data into your R session, that's another win for the RDS camp, if you will. And again, for smaller datasets, RDS has caused no issues in my Shiny apps or my other data science needs. It's always there; you can depend on it no matter where you're running or which version of R, with no headaches on that front. But I did have an interesting use case for parquet, and I'm gonna give a little inside baseball here on this very podcast about my exploration at the day job, where our clinical datasets are organized as, you guessed it, SAS datasets spread across many, many different directory and subdirectory patterns based on the treatment, the study name, and whatnot. Many subdirectories inside. Right?
Well, we get questions from leadership about how many datasets we actually have: how many are SAS? How many programs do we have in this whole space that are SAS based? How many are R based? Can we get some metrics around it? No one's going to do this manually; nobody's got time for that. So let's see if we can read all this metadata into some form of data structure so we can interrogate it just like any database. I used to use a bloated, and I do mean bloated, SQLite database to house all this. It worked fine-ish until recently, when I hit a silly issue with modification times and kind of had to re-pivot.
In this re-pivoting I thought, well, wait a minute. There's a logical grouping in how these are organized: there's what I'll call the treatment ID, and within that an umbrella of different studies or experiments. This is ripe for grouping in a logical way by those two variables. So instead of having everything written to one massive file, why not distribute the metadata as parquet files partitioned that way? If I know I only need one particular treatment ID and one particular study, I can get that data just as easily from arrow parquet files as I could from anything else. Plus, if I need to update only a specific treatment ID and study combination, I don't have to touch the rest of the treatment and study combinations. I can just update that one set, and it will still magically bind all together if I need it to further on. The magic of the arrow and dplyr packages. It's all right there.
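A rough sketch of the layout Eric describes, with arrow's write_dataset() handling the partitioning; treatment_id, study_id, and the sample values are hypothetical stand-ins for his internal metadata:

```r
library(arrow)
library(dplyr)

# Hypothetical metadata: one row per file, keyed by treatment and study.
metadata_df <- data.frame(
  treatment_id = c("TRT001", "TRT001", "TRT002"),
  study_id     = c("STUDY-01", "STUDY-02", "STUDY-01"),
  file_name    = c("adsl.sas7bdat", "adae.sas7bdat", "adsl.sas7bdat"),
  modified     = Sys.time()
)

# Write one set of parquet files per treatment/study combination
# (grouping variables become the partition folders):
metadata_df |>
  group_by(treatment_id, study_id) |>
  write_dataset("metadata_parquet")

# Read back only the slice you need; other partitions are skipped on disk:
open_dataset("metadata_parquet") |>
  filter(treatment_id == "TRT001", study_id == "STUDY-02") |>
  collect()

# To refresh one combination, rewrite only its matching partition:
refreshed <- subset(metadata_df, treatment_id == "TRT001" & study_id == "STUDY-02")
refreshed |>
  write_dataset("metadata_parquet",
                partitioning = c("treatment_id", "study_id"),
                existing_data_behavior = "delete_matching")
```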
So that is saving me immense time. The parquet files are fast, they're an efficient size, and I just feel a lot more organized in how I'm keeping track of all this. So that was my recent success with parquet. And yes, your voice was in my head, Mike, as I was on this re-architecting adventure: I've gotta get away from this monolithic dataset, what can I do here? And parquet was the answer.
[00:38:14] Mike Thomas:
I don't wanna sound too cliche, but I am very proud of you, Eric. Great job.
[00:38:19] Eric Nantz:
Well, as you know, we just scratched the surface here. Every R Weekly issue has a lot more terrific content for you to learn about the R ecosystem, data science, integrations with R, and many other ways to inspire your daily journeys with data science. And, of course, we have the link to the issue in the show notes, but we're going to take a couple of minutes for some additional finds here. I want to give a great shout out to a project that just keeps rolling along and had a massive update recently: data.table just had a major new release, combined with a new governance structure for how they manage the project's life going forward.
This is a really fascinating post authored by Toby Dylan Hocking, and again we'll link to it in the show notes, but it's a really great road map of what they've done to put a little more governance around the data.table project. There have been newer members joining the team, there's new maintainership, and there's lots of transparency on what they're looking at as new features going forward. So on top of that new release, it's a really great time, if you've been using data.table and want to get involved with the project, to do so. They're making the road map, their contribution guidelines, and where things are headed even more transparent. So a big shout out again to the data.table team. They're doing immense work in this space.
I've always had tremendous respect for that project, and congrats
[00:39:48] Mike Thomas:
on the release of 1.15.0. Yes, congrats to that team. That's fantastic news. I'm gonna reach across the aisle to Bruno and shout out his new blog post, Reproducible Data Science with Nix, Part 9: rix is looking for testers. This is a call-to-action blog post: the rix package, spelled r-i-x, is an R package that leverages Nix, and essentially allows you, I believe, to work with Nix and its configuration from R. So if you are interested in Nix for environment and package management and want to kick the tires on his rix package and give some feedback, I think that would be greatly appreciated. Check out this blog post for info on how to get started.
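If you want to answer that call for testers, the entry point is a single rix() call that writes a default.nix for your project; the argument names below reflect the documentation at the time of the post and may have changed, so treat this as a sketch:

```r
# rix was not on CRAN at the time; see Bruno's post for the current install route.
library(rix)

rix(
  r_ver        = "4.3.2",                 # pin a specific R version
  r_pkgs       = c("dplyr", "ggplot2"),   # CRAN packages to include
  system_pkgs  = NULL,                    # extra system-level tools, if any
  ide          = "other",                 # or "rstudio", per the docs
  project_path = ".",                     # writes default.nix into this folder
  overwrite    = TRUE
)
# Then, from a shell with Nix installed, run nix-build and nix-shell in that folder.
```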
[00:40:36] Eric Nantz:
Yeah. Huge congrats to Bruno and Philipp, the co-maintainer of rix. They have been doing immense work on this for over 5 months and counting, according to the blog post. So, yeah, getting real-world usage of rix is hugely important as they get to a stable state, if you will. And count me in, Bruno. I'm going to be testing the heck out of rix. I already did some initial explorations near the end of last year, but I am firmly on board with seeing just how far we can take it, and my initial experiences have been quite positive, to say the least. I'll definitely put it under some more rigor. And again, a shout out to all of you in the community who have been even remotely curious about this: give it a shot and let them know what you think. I do think that, in the reproducibility story, this is going to get a lot more traction as we get more users involved. So, again, huge congrats to Bruno and Philipp on getting close to this major milestone.
But, of course, he doesn't just want to hear from all of you; we want to hear from all of you too. Right? And you've got a few ways to get in touch with us. First, there's a contact page directly linked in the episode show notes if you want to give us a quick shout out there. Also, if you're on the modern podcast app train like a few of us, using, say, Podverse, Fountain, Castamatic, Pod Home, or many others, there are lots of great ways to get in touch with us via the boost functionality, or from the Podcast Index directly, where this podcast is hosted. You can find details on that in the show notes of the episode as well. And I'm doing some fun projects analyzing a massive amount of, literally, podcast metadata as we speak. There's a lot of geekery behind the scenes with Quarto, pointblank, Docker, and GitHub Actions. There's going to be a lot to talk about with this.
I just conquered GitHub Actions, successfully getting an automated run of these pipelines, which I felt pretty stoked about because I'm still a bit of a newer user of GitHub Actions,
[00:42:46] Mike Thomas:
but I'm getting there. I'm getting there, Mike. The old dog here learns a new trick once in a while. No, that's awesome. I have no doubt that you'll nail it. GitHub Actions has definitely been a game changer for us at Ketchbrook.
[00:42:56] Eric Nantz:
Yeah. It's one of those things where you just can't imagine how you lived without it all these years, but it's there, and it's a great service. Also, if you want to get in touch with us on social media, I'm mostly on Mastodon these days; my handle is [email protected]. I am sporadically on the Weapon X thingy at @theRcast, and also on LinkedIn from time to time. And a quick reminder, we plugged this a couple of weeks ago, but the call for talks is still open for the upcoming Appsilon Shiny conference. So if you have a talk you'd like to share with the rest of the Shiny enthusiasts out there, Mike and I are obviously big fans of this conference, and we'll have a link to the conference registration and talk submissions in the show notes. And, Mike, where can the listeners get a hold of you? Sure. You can find me on LinkedIn by searching Ketchbrook Analytics,
[00:43:50] Mike Thomas:
k-e-t-c-h-b-r-o-o-k. Or you can find me on Mastodon at [email protected].
[00:43:59] Eric Nantz:
Very nice, Mike. And, yeah, Mike deserves some extra praise here. You're not gonna know this from the polished version of this episode you hear, but he had to put up with a lot of shenanigans during our recording today. So my thanks to you for putting up with all that. As if you haven't had to put up with me on other episodes... so we're even. Yeah, we'll see who causes the chaos next time around. But in any event, that's gonna wrap episode 151 of R Weekly Highlights. We're so happy you joined us, and we hope you join us for another episode of R Weekly Highlights next week.