An honest take on common patterns and anti-patterns for reuse of data analyses that hits a bit too close to home for your hosts, a cautionary tale of garbage online references pretending to be authentic material, and a new (human-created) cheat sheet with terrific best practices front and center.
Episode Links
- This week's curator: Sam Parmar - @parmsam_ (Twitter) & @[email protected] (Mastodon)
- Patterns and anti-patterns of data analysis reuse
- $%@! R help from $%@! AI
- Best Practice for R :: Cheat Sheet
- Entire issue available at rweekly.org/2024-W12
Additional Links
- Jon Harmon's request for additional R4DS funding: https://fosstodon.org/@R4DSCommunity/112099679313058951
- Linux Unplugged Episode 554: SCaLEing Nix https://www.jupiterbroadcasting.com/show/linux-unplugged/554/
Supporting the show
- Use the contact page at https://rweekly.fireside.fm/contact to send us your feedback
- R-Weekly Highlights on Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @theRcast (Twitter) and @[email protected] (Mastodon)
- Mike Thomas: @mike_ketchbrook (Twitter) and @[email protected] (Mastodon)
Music credits powered by OCRemix
- A Crook Man's Eyes - Mega Man 5 - Nightswim - http://ocremix.org/remix/OCR03679
- Plastik Skies - VROOM: Sega Racing - Palpable, Diodes - https://ocremix.org/remix/OCR03726
[00:00:03]
Eric Nantz:
Hello, friends. We're back with episode 157 of the R Weekly Highlights podcast. My name is Eric Nantz, and I'm so delighted you joined us from wherever you are around the world for our weekly show where we talk about the latest highlights that you can see in this week's R Weekly issue. And as always, I am joined at the hip here by my linemate in R Weekly fun, my co-host, Mike Thomas. Mike, how are you doing today?
[00:00:27] Mike Thomas:
I'm doing well, Eric. It was starting to warm up here on the East Coast, but now we're getting a cold week. So it's a little frustrating, but I think that may be the theme of this week's highlights. We are venting this week in some of these highlights, and I am here for it.
[00:00:42] Eric Nantz:
Oh, as am I. And even in the preshow, which y'all can listen to, Mike heard about some recent rabbit holes that I went down. But, yeah, we're gonna have a lot to share today, because we feel very related to a lot of the concepts we're about to talk about here. And how is this issue possible? Well, our curator this week was Sam Parmar, another good friend of mine from the life sciences industry. He has put together a terrific issue that we're gonna talk about here. And as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your awesome pull requests and other suggestions.
So, yeah, let's get the, quote, unquote, venting session going, and we're gonna go fast with this because Miles McBain here, Mike, has some very insightful tidbits to share, which I can tell have been gleaned from a lot of experience in data science, called Patterns and anti-patterns of data analysis reuse. So what do we have here? Why don't you set this up for us? This is too relatable. It's a phenomenal blog post
[00:01:52] Mike Thomas:
talking about, you know, the data analyst curse, and understanding that in every data analysis and data scientist role that Miles has been in, and I agree with this as well, at some point in time you're redoing variations of the same analysis. And there are two assumptions that he's making for those who would be able to relate to this blog post. The first one is that your work is written in code like R, Python, Julia, or Rust. If you're using, in his words, Power BI or, God forbid, Excel, you probably won't relate to this. And then second, you're using a technology like Quarto, R Markdown, or Shiny, such that your end deliverable is generated from that code. So if this sounds like you, I am assuming that maybe you'll be able to relate to this blog post as I did.
And when you're redoing, you know, that same analysis in different ways, one of the first things that you may start out doing as a beginning developer is copying and pasting. Right? From one version of your report to the next version of your report that you need to create. And this can be a quick solution, but it may not be very extensible or maintainable, because when it comes time to update some sort of global version of this analysis, you would have to copy and paste to each different version that you have out there. And when you're trying to fix a bug or create an enhancement, that same concept would apply: instead of just doing that update in one specific place, you would need to copy and paste it to all of the different places, because you don't have something like a template.
And this is where you can start to move on towards, oh, I'm going to create a single template that is going to have parameters in it that I can set, which will allow me to run different versions of this analysis just based upon the different parameters that I'm passing to it. And in theory, this is great, and I think this is exactly what we all strive for. But if you have done this long enough, if you have been in this world long enough, you'll start to see that as you create this parameterized global version, with each new variation of your analysis that someone's asking you to run, or each new dataset that's coming in, there's going to be a new edge case that you're going to have to handle.
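As a rough sketch of the parameterized-template idea Mike describes (the file names and param names here are invented for illustration), a Quarto or R Markdown report declares its params in the YAML header, and each variation of the analysis becomes a render call that overrides them:

```r
# report.Rmd would carry a YAML header along these lines (illustrative):
# params:
#   dataset: "study_a.csv"
#   endpoint: "response"

# Each new variation is then rendered from the one shared template:
rmarkdown::render(
  "report.Rmd",
  params      = list(dataset = "study_b.csv", endpoint = "survival"),
  output_file = "report_study_b.html"
)
```

The same pattern works with `quarto::quarto_render(execute_params = ...)` for Quarto documents; the point is that the variations live in the render calls, not in copied copies of the report.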
And what that means is probably an additional parameter, and if you're like me, this can get to where you're starting to write conditional if statements to test and see if this particular variation matches this one very, very specific edge case. And now your global template is just getting super bloated, because it's trying to handle all of these different particular cases. And then at some point, someone says, hey, why don't we manage those parameters with YAML or JSON? Because, sure. Right? Now we're talking about configuration.
And, you know, then you have a YAML or a JSON file that is supposed to manage and handle these parameters. And maybe it starts out small, because there are only a few global parameters there. But, again, as you introduce these different versions of this analysis, or somebody asks for this new thing, or this new dataset comes in that wants to look at your analysis from a different angle, you're just adding additional parameters. And now all of a sudden your YAML file starts to get pretty long. And then maybe at some point you are having a second YAML file to manage the configuration of the first YAML file, and it's YAML all the way down. And things just start to be unwieldy, and you start to think, hey, maybe I just need to go back to cut and paste.
Copy and paste. And it can drive you a little insane, as Miles may have gotten to in the paragraph here titled Power Overwhelming, which I think is where we start to go off the rails, and I mean that in the nicest way, because this is incredibly relatable. So then, Eric, if you wanna take it away to talk about how maybe we start to move towards a package framework.
[00:06:31] Eric Nantz:
Right. And first, yes, we all can very much relate to this, because the entire spectrum of that buildup to this point, I have seen with my own eyes. I have committed some of this with my own hands, if you will. And, yes, sometimes the only way we learn is through painful experiences. I have had some extremely sophisticated templates in R Markdown before that had a boatload of params in them. And sometimes they would be shared with different teams, and then they realize, oh, yep, that particular dataset for that study has this type of efficacy variable, and I didn't cover it. So it just keeps adding on, adding on, adding on, until it gets to a point where no one knows where the central place is for that thing, and everybody copy pasted to a different study. Some of this is ironically still happening. We are trying to put the reins on it. But yes.
And when you get to this point, you think about: what are ways that I can make it easier for us to maintain some kind of structure to this, still make it easy for the end user to implement in their analysis pipelines, but still be able to tap into some of the modern practices to help maintain this reusable code? And, yes, spoiler alert, that does mean creating an internal package, and that may be intimidating to many people. But the thing is, I would say, once you've been through these hardships, you're at a point to appreciate the upfront work to build a package, maybe more so than if we just told you this when you're brand new to data science in your particular industry or particular group.
Then you're gonna be ready to absorb some great resources out there already, especially in the R ecosystem, to get a package off the ground. Why should you do this, though? It's so that, instead of having these template variables and these massive templates, you can have functions with function parameters that cover much of this operation and functionality. And you don't have to have it perfect the first time. Maybe you just automate certain parts of it and kinda build on it over time. But having the package is gonna let you opt into additional best practices, to get ready for cases where maybe your package analysis functions are being used in ways you didn't anticipate.
But you can build in things like automated testing. You can build in documentation on these parameters, so that you can use the wonderful tools like usethis, testthat, and devtools to help make this package more robust in the R ecosystem. And, of course, Python fans, you have similar frameworks on that side as well. But just getting to that package step is a huge first step to start righting some of the wrongs that you may have experienced in your respective effort. Now, like anything, there are some gotchas to worry about. And another issue that I've seen firsthand, and I've seen very talented people do this firsthand, is that you started this great analysis package for your group.
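For a flavor of what that upfront package work looks like, here is a minimal sketch of scaffolding an internal package with the tools Eric names (the package name `myteamtools` and function name `summarise_endpoint` are placeholders, not anything from the post):

```r
# Scaffold an internal analysis package with usethis/devtools.
usethis::create_package("myteamtools")   # creates DESCRIPTION, R/, etc.
usethis::use_r("summarise_endpoint")     # one R/ file per analysis function
usethis::use_test("summarise_endpoint")  # matching testthat test file
usethis::use_package("dplyr")            # declare a dependency in DESCRIPTION

devtools::document()  # generate roxygen2 docs into man/ and NAMESPACE
devtools::test()      # run the testthat suite
devtools::check()     # full R CMD check before sharing with the team
```

Each former template parameter then becomes a documented, testable function argument instead of a knob buried in a YAML file.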
Maybe you called it the name of your group. Who knows? And then over time, either you or others say, hey, you know what, what if this package did this new thing? Suddenly, your catalog of functions going into this internal package starts to balloon up. And maybe it gets to the point where you have so many functions, and some of them just don't really relate to each other. But because it felt like such hard work to get that package off the ground, everybody just wants to put it in one place, so they only have to load one package, and then it's all there. But you're running the risk, as Miles points out, of complexity overload and a lot of bloat.
And especially if you need to make a change or deprecate something in that package, suddenly the whole package is being updated in ways that maybe you didn't anticipate. And so that's where, going through this exercise, yes, getting to the package is a great first step, but there are a lot of diminishing returns if you decide to put everything but the kitchen sink, so to speak, in this one internal package. I've also seen this in a capability tool that we used to author for helping design clinical trials. We had a monolithic, and I do mean monolithic, application that was meant to do everything for our clinical program, and we just could not maintain it anymore. There was just so much in frameworks that honestly half of us didn't even understand, and it was all in one monolithic code base.
At some point, the technical debt just became too much. What did we realize, and what Miles transitions to in this piece, is that instead of having this single package, try making your own internal group of packages. He calls it your own personal 'verse of packages. Of course, we're familiar within the R ecosystem with the tidyverse and other groups of related packages that may make some decisions, may have common, you might say, data structures that they operate on, but they've separated their purposes. They've separated their concerns into fit-for-purpose packages.
This way, instead of having to update this monolithic piece with maybe that one little change, now you have a set of packages. They all contribute to a greater whole, as they say, but now you can write updates to these in fit-for-purpose fashion. So he mentions in his examples here, in his current job, he's got a package called check yourself, definitely before you wreck yourself, I'm just saying, to help you run some quality checks, you know, on your dataset. Great first step in a data analytical pipeline; that makes a lot of sense. And then, because Miles is a huge fan of reproducible analytical pipelines, they have a package on top of targets called TDC targets.
And that's helping them build these pipelines in a unified way. It's still leveraging targets on the back end, but they're helping bootstrap that a bit easier. But, see, he separated out the data checking and the analytical pipeline building into these two sets of packages. There may be others that deal with internal APIs. I'm living in that world right now. Do I wanna put all API calls in one, like, company package? Oh, heck no. We wanna separate that out into its own fit-for-purpose thing, because, spoiler alert for me, testing APIs is a much more unwieldy effort than testing normal R functions. So why would you want one monolithic package to do all of that? You wanna separate that out as best you can. You're getting flexibility, but it is gonna take discipline to get there.
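To make the reproducible-pipeline side concrete, here is a toy `_targets.R` using the targets package that Miles's approach builds on. The helper functions `run_quality_checks()` and `fit_analysis_model()` are hypothetical stand-ins for what would live in separate fit-for-purpose packages, and the file path is invented:

```r
# _targets.R: a minimal targets pipeline (illustrative names throughout).
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  # Track the raw file so changes to it invalidate downstream targets.
  tar_target(raw_file, "data/study.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),
  # These two steps would come from separate internal packages,
  # e.g. a data-checking package and a modeling package.
  tar_target(checked_data, run_quality_checks(raw_data)),
  tar_target(model_fit, fit_analysis_model(checked_data))
)
# tar_make() then rebuilds only the targets whose upstream inputs changed.
```

The separation of concerns shows up naturally here: the pipeline package orchestrates, while the checking and modeling logic each live in their own package.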
So I do think that it's not gonna be easy to do this all right away. You've gotta start somewhere. But, honestly, the first step is recognizing when you have a problem. Because sometimes you may ship these great end products, like these monolithic templates or monolithic packages, and everybody in leadership thinks, oh, you're doing great, this is helping the company so much. But what are you standing on? Is that foundation solid? You really gotta pay attention to that, because it's one thing to get the short-term win by doing the copy-paste method, but it's gonna fall down on you at some point.
And, honestly, like I said, nobody gets this perfect the first time, and we are all continually learning on this. And that's where the post concludes: you're gonna have these humble beginnings, right, but then you're really honing your craft as you go along. You all don't wanna see the first internal package I did at the company. It took a lot of shortcuts that I'm not proud of. But getting there was a huge first step for me. And then I learned from the communities, I learned from my teammates; I'm very privileged to learn from Will Landau, basically, every time I talk to him about something new. So all these things just build upon each other. You're going to get even more comfortable with this.
And, obviously, time is another factor. Time is not infinite. We have to prioritize this. But, honestly, I'm of the belief that if you take the time up front to set this up the right way, even if you don't quite know how you're gonna get to that end goal yet, but you know that there are some best practices you wanna start with, that is gonna be a huge help to minimize the technical debt that Miles is definitely outlining here if you go with that, quote, unquote, easy approach at first. So, all in all, I think the biggest piece of advice is: when you see you're doing your copy pasting a bit too much, stop, pause, and try to think about what are ways that we can make this reusable and, more importantly, maintainable in your team.
I really resonate with many things in this, and credit to Miles for putting this in such a comprehensive yet very much evolutionary type of story of what data analysis pipelines are all about. And definitely, like I said, start small. But once you start small and do fit for purpose, I think you're gonna be on the right foot. So
[00:16:20] Mike Thomas:
definitely spoken from experience, I can tell, with his insights here. So, yeah, really excellent post, and I think, Eric, you and I have been through this quite a few times in our internal adventures. Right? No, it's a little too relatable, and there are a lot of things to try to balance here, right, as you move from a script to a function to a package, and then back and forth depending on your use case. If I may just read a very small excerpt that I think is worth reading: he talks about creating a massive function that gets written with maybe a dozen arguments, that has hundreds of lines of code, that's not really much different than just a wrapper around some sort of data analysis script that you would have. And it's great that you're using functions, but you're actually attempting to template your entire solution using the function's arguments.
And he has a little footnote in here that says, if a function starts taking YAML configuration files as arguments, you are on borrowed time. And the last paragraph I wanna read, this is, if Shakespeare was a data scientist, he would have written this: such a function is pretty much untestable due to the combinatoric explosion of possible sets of inputs, and you can bet that the internals are not written in high-level domain-specific code. When I look at that function signature, I hear the screams of matrices being ground to a fine powder of bits through the conical burrs of nested for loops and labored index arithmetic. I mean, it's incredible. It's poetic.
[00:17:52] Eric Nantz:
And it's most definitely real. Right? Yep, I have seen this. The approach I've seen is, hey, you know what, for this package we're just gonna have to use our own custom CSV with all the params inside. Like, oh my gosh, no, please stop the pain. Yes. But the crutch that people will fall back on is, you know what, it's a lot of work to put all those as function parameters, and I don't wanna test that, so just prune the config. It's all about the user configs. No, it's not about the user configs. What it's about is building an actual package that has actual documentation and actual testing.
Yes, I am firm on that, because when you don't do that, you may not pay for it right away, but somebody's gonna pay for it. And it'll most likely be your end users, and that's about the worst result of all.
[00:18:49] Mike Thomas:
I agree. No, we have a client who leverages a third party API where, in order to send your params to that API, you send your data, and then you send this wild ASCII file, a text file that just, you know, has no delimiters or whatever you wanna call it. You just have to, like, add an N or a Y depending on which things you want to receive back from the API, and then you have to zip it all up and send a zip file. It's pretty wild, but let's vent about something else, Eric.
[00:19:29] Eric Nantz:
Yeah. I think we need to now think about, you know what, we've thrown a lot of knowledge your way, and we acknowledge that it's not always easy for you to know everything at once. Right? We're all continually learning. And, of course, if we don't have the answer to a question, we're probably gonna ask for help in certain ways, especially online. Right? I mean, how many times have I googled for how to do something esoteric with, like, R to call this, you know, API parameter or whatnot, or do this new statistical function that I'm just not as familiar with? So like anybody else, Mike, I know you and I have searched, you know, the internets and webs quite a bit for certain help here and there. And with the growth of R over the years, you're starting to see a lot more results tailored to R itself, certainly posts on Stack Overflow and things like that.
And then you'll start to see in these search results, you know, some things that look interesting but don't quite look right. And what is this trend we're talking about here? Well, our next highlight is coming from Matt Dray, another previous contributor to R Weekly. He's noticed this trend too, and we'll just kinda take this bit by bit, because I think those of you that have used R for a bit and have been searching for your various help or tutorials out there have probably seen this. You will often see these websites that are now coming up higher in the rankings that look almost too good to be true, but they kind of are too good to be true, because these are sites where you can kind of tell, whether it's said explicitly or kind of implicitly, that they're being written by some kind of bot, maybe some kind of AI framework.
We don't want to send traffic there. So, like Matt, we are not going to tell you the names of these sites, but they can very easily hook you into thinking these are authentic, very authoritative sources, I should say. And sometimes you'll see that the same overall site is producing, like, thousands of these guides for each package individually. But the guides are not coming from the package authors. They're not coming from people in the community that we've, you know, seen, or, like, an authentic blog post. They're these, like, AI-generated narratives that somehow have some great search engine optimization built in, so that they're showing near the top of your search results.
But, clearly, they're not playing the right way here, and I think it kinda stinks. So I think it's pretty clear to me why this is a bad thing to see happen, but it's also a reality that we all need to deal with. It's not quite a one-to-one example here, but, of course, most of us have cell phones, and we get these robocalls left and right on our cell phones. And we know they're bogus, even though they try to act like they're coming from our area code. Right? But these are, like, some real bogus results that are trying to show, quote, unquote, a guide to using a certain package. It's not. It's not. But, yeah, Matt continues in his post about just why, in his opinion, this is a bad thing that keeps happening. Mike, why don't you take us through why? What are some of the downsides of what we're seeing here? Yeah. You know, I think what's gonna happen here, for the most part, is that
[00:23:05] Mike Thomas:
with some of these summarizations that take place, when it's a bot or, you know, AI, whatever that means, sort of summarizing and trying to scrape the web and put together these, you know, help sites, I think what's gonna happen is just that a lot of the specificity is going to get washed out. And not only is the content terrible, because it's not written by a human for humans, but the ethics are pretty bad as well. They're really just trying to either make money off of you somehow or redirect you to some affiliate site, you know. And you also have to remember that, in a way, they're sort of stealing their content.
A lot of the content on the web that people are putting out there, in terms of, you know, Stack Overflow help and stuff that's actually authored by someone, and we went down this rabbit hole a little bit when we started talking about Copilot, a lot of this content does not necessarily have, like, a Creative Commons license behind it that says, hey, go for it and scrape this and use it however you want. It's probably scraping stuff that the author may have not given them consent to scrape. So that's crappy.
You know, there's the fact that it's moving up in terms of SEO, because, I guess, that's probably something that AI is fairly good at as well. Right? To try to make this site look like a site that gets a lot of clicks. That's, I guess, the name of the game these days, unfortunately. And I'm seeing it myself: I'm having a harder time getting to Stack Overflow links, which have traditionally been the thing that has helped me the most, that has gotten me to my answer the quickest. And usually, maybe I have to sort through two or three or four different Stack Overflow links to find my exact solution. But in the past, those links would come up very, very high, you know, if not the first result, then the second result.
And now they're much sparser, unfortunately. And, like Matt, I am having to sort through a lot of this crap, unfortunately. So it stinks, you know. And that is taking sort of a pessimistic view here, which maybe I share. I don't know. I hadn't put too much thought into this until Matt's blog post, but we don't really see an avenue where this gets better as opposed to worse. I don't know, Eric.
[00:25:51] Eric Nantz:
Yeah. I think it's a reality, like I said, even with my robocall example. It's like, no matter how many you block, there are always others that are gonna keep coming. And I think there are gonna be these sites that crop up now, with AI and automation becoming so much easier for the masses, or in the case of some of these, you know, unethical corporations or whoever's behind some of these, to just launch all these automated processes on some server somewhere and do the SEO gaming of search results. I think the biggest thing we can recommend is to have, like, a careful eye as you're searching these results. And I think over time, you'll see these patterns, such as Matt has been talking about here in this post.
But I'm gonna say the best place to draw upon for, you know, help for, say, a package itself is hopefully the package documentation itself. Most of them now have pkgdown sites. They usually have a GitHub repository, or GitHub-like repository, and, you know, seeing what issues have been talked about for that particular package on their issue board, that's a great way to learn, even just by reading through that. And also leveraging community-built resources that you know are being built by humans. And guess what? Another spoiler: R Weekly is built by humans. Right? We are linking to content that has been created by package authors, by data scientists, by others in the field, content that is authentic. And that's why we have a curator to always sift through. We get noise too, just as much as anyone else, but we wipe the heck out of those. We make sure those don't get into our issue.
Unfortunately, for search engines, we don't have that control. Right? They're just always gonna kinda pop up from time to time. I think if it sounds too good to be true, it probably is, so to speak. So definitely have a careful eye for that, especially those that, as Matt points out, will have some random affiliate links somewhere, maybe in the footer of the site or a sidebar or whatnot. No, I don't see that for authentic R-based content or data science content in my day. So I think it's more about experience: you're gonna be able to see this more quickly. But I think what Matt does here is at least bring awareness to the issue that this is real. It's probably not slowing down.
And so just make sure that you are, you know, looking at the authentic community-based or, in some cases, the developer-authored resources to really get you in the right direction for your particular issue. But, Matt, I see at the end of the post, yeah, you also grew up in the times I grew up in. Good old floppy disks. Right? We didn't have fancy AI bots generating these queries. We had to make sure that 5.25 inch floppy disk somehow worked in our IBM PCs or Apple IIGSs. Shout out to all those that used vintage Apple computers. So it's a different time now. And I think with that just come some new skills that we have to learn about finding the best from the noise, as they say.
[00:29:01] Mike Thomas:
Yep. No. I remember, if you wanted to know something, you looked it up in an encyclopedia. At the library, no less. I think we had an encyclopedia on a CD-ROM at home or something like that when I was growing up. That's right. Yep. Times have certainly changed, and I think, you know, unfortunately, that means navigating the Internet and search results requires new skills to be able to do so. But it's unfortunate that some of these sites out there exist, because it sort of feels like cheating.
[00:29:49] Eric Nantz:
Now, with that said, of course, what are ways that you can kinda get your journey of data science started off right, especially if you're new and you wanna turn to a human-generated resource to help you? Well, one thing that has helped me over the years, and I think many others would agree, is the concept of having that handy cheat sheet next to you. So if you're looking up stuff all the time but you just want a quick reminder of how something works, cheat sheets are a great thing to have at your desk, or on your virtual wallpaper or what have you, to kinda get those concepts reinforced from time to time. And our last highlight is actually a new cheat sheet in the R community, authored by Jacob Scott, who is a data scientist based in the UK.
And Jacob has put together this Best Practice for R cheat sheet, and he is very much upfront in his repository that this is highly opinionated towards some of his preferred workflows. But I think there are some concepts here that we can very much relate to, especially in the context of, if you are following the advice of what Miles authored in our first highlight, some ways you can get started pretty quickly. One thing, whether you're running RStudio proper or not, is having some kind of project structure. It's one of those things that you take for granted. But boy, oh boy, I have seen countless times people, like, throw all their R scripts in one directory when they have no real relation to each other. Just throw it all in there. Right?
Have you ever done that, Mike?
[00:31:27] Mike Thomas:
I've seen it. Maybe I did it when I was starting out, potentially, when I didn't necessarily know best practices around, you know, what an R project even was. You know, like I said, when I was taught R in undergrad, we were only taught R Markdown. I didn't even know what an R script was. I only knew the existence of .Rmd files. So there's a possibility that at some point in my journey, which I don't wanna dive back into, I would have been guilty of that, but I am very happy to have found and understood R projects.
[00:32:04] Eric Nantz:
Yes. And, yeah, certainly the examples he's talking about here are specific to RStudio, but you can do this in any of the typical IDEs as well. I mean, VS Code has workspaces you can utilize. And, of course, there are loads of extensions in the classical frameworks that people turn into IDEs, like Vim or Emacs, that do similar things too. The idea is just logical grouping of your code. And he's got a nice little snippet here in the cheat sheet about what his project structure looks like. It's got, you know, subfolders for the scripts themselves. It's got, you know, potential database query SQL scripts. And, of course, he's using renv too. That's even another best practice that I think needs to be reinforced: these projects can have wildly different dependency requirements. And on the R side of things, having renv is a real bulletproof way, give or take a few gotchas here and there, of making sure that you can reproduce that R package execution environment, within reason, from project to project, and be able to have that finer-tuned control over your dependencies.
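The renv workflow Eric describes boils down to a handful of calls; here is a minimal sketch of the per-project cycle:

```r
# Inside a fresh project: create a project-local package library
# and an renv.lock file to record versions.
renv::init()

# Packages now install into the project library, not the global one.
install.packages("dplyr")

# Record the exact versions currently in use into renv.lock.
renv::snapshot()

# Later, on another machine or for a collaborator cloning the project,
# reinstall exactly the locked versions.
renv::restore()
```

Committing `renv.lock` alongside the code is what gives each project its own reproducible dependency environment.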
And then also, there are some great sections in here about how to create a reprex, which, again, relates to what we were talking about with getting help. Right? Finding ways to effectively communicate and effectively search. Well, if you know that you're having an issue with a given package or, you know, another utility, the best way to get help from the community, whether it's on Stack Overflow or, say, Posit Community or whatnot, is having a reprex, so that it shows in a very concise manner what exact error you're getting and lets others reproduce that error. This is not the first time we've mentioned reprex on this highlights podcast over the years.
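For listeners who haven't tried it, a minimal reprex workflow might look like this (the misspelled column name is deliberate, to produce an error worth sharing):

```r
# reprex() runs a small, self-contained snippet in a clean session and
# copies a formatted version (code + output) to the clipboard, ready to
# paste into Stack Overflow or Posit Community.
library(reprex)

reprex({
  library(dplyr)
  starwars |>
    summarise(avg = mean(heigth))  # typo on purpose: this is the error we want help with
})
```

Because the snippet runs in a fresh session, it also forces you to include every library call and piece of data the problem actually needs.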
So I think having that skill set is a great way to put yourself in the best position to not only ask for help, but then to receive it as well. And then there are also some great sections here about how to connect to databases. I've used the DBI package quite a bit, and he's got a little snippet about connecting to a database with that as well. And then others, such as styling. Again, there can be different takes on how you style your code. What Jacob recommends here, I believe, is a variation of the tidyverse style guide with certain pieces. But, again, I think the key, as we mentioned maybe a few weeks ago, is just consistency.
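As a generic sketch of the DBI pattern mentioned here (this is not Jacob's actual snippet; RSQLite is used as a stand-in driver):

```r
library(DBI)

# Connect: swap the driver for your backend, e.g. RPostgres::Postgres()
# or odbc::odbc() with real host/user credentials.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

dbWriteTable(con, "mtcars", mtcars)

# Push the aggregation to the database rather than pulling all rows:
res <- dbGetQuery(con, "
  SELECT cyl, AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY cyl
")

dbDisconnect(con)  # always release the connection when done
```

The nice part of DBI is that the R-side code stays the same when you change the backend; only the `dbConnect()` call differs.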
Once you have consistent styling, no matter if you use the tidyverse framework or, let's say, Google's framework or your own company's framework, having consistency is gonna help you, as well as your future self and collaborators on those projects. And then it concludes with some links for learning more about R and the R community, such as R for Data Science, R Packages for building R packages, and, for really getting into the nuts and bolts of R, Advanced R. And, yes, for the Shiny fans out there, a link to Mastering Shiny as well. Highly recommended. I think it's a great way to get started. For those new to the R framework, this is one of those great examples to get you started off the right way, and it's gonna whet your appetite, so to speak, to dive into some of these concepts in more detail.
So really nice job here, Jacob, and I think it belongs in many collections of cheat sheets out there.
[00:35:36] Mike Thomas:
Yeah. This is absolutely beautiful, design-wise, the way that he drew this up. It says that he originally created a similar version of this cheat sheet specifically for use in the UK Department for Education, but he's created this more generalized version, I think, for everybody else, which is fantastic. And it sort of makes me think that perhaps, you know, you may wanna create a cheat sheet similar to this within your own organization that mentions and outlines some of the specific best practices that you wanna follow and employ within your own organization. So if you're looking for inspiration to do something like that, this would be a great place to start.
[00:36:16] Eric Nantz:
Yeah. And he does have a link in the GitHub repo to additional cheat sheets that are available on Posit's site. You know, we see many contribute to those, both from Posit and outside of Posit too. That style is very reminiscent of them. So I think it's an interesting way to get started the right way, and I'm all for it. Again, human-generated R resources out there. No bot made that resource. I can pretty much tell that one. That's right. Yep. And like I said, what else do bots not create? Well, the R Weekly issue itself. We've got a curator helping with that every single week. And, again, Sam did a tremendous job with this issue, and we'll take a couple minutes to talk about the additional finds that we found in this issue.
For me, I'm still very much in my learning journey of, you know, APIs with R, but also developing web-based resources with R, pushing, like, Shiny in new directions and pushing even the Quarto sites I create in new directions. And friend of the show Albert Rapp has another terrific blog post here in his web dev for R series. One area that I simply have to keep looking up every time, because it's not muscle memory yet, is getting a handle on selecting certain elements in your CSS style sheets. And he's got both a video and an accompanying blog post that talk about how you can select particular tags of a certain type, with both the source code and solution right there in the post itself, selecting elements by class, lots of things that, unless you really practice a bit, especially if you're new to web dev, are gonna seem pretty foreign to you. But he brings it home with how he used these techniques to modify some of the styling behind a GT table that he was creating, so you can give it a little extra personality, a little extra style along the way. So if you're in the world of CSS styling and you're just not sure how to get at that particular nagging element that you wanna give, like, a bold font or a red background color, this is a great post to let you dive into just what kind of detective skills you might need to get to said element and make it look the way you want it to.
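To make the idea concrete, here is a hedged sketch, not Albert's actual code, of attaching custom CSS with a class selector to a {gt} table; the table `id` and the colors are made up:

```r
library(gt)

head(mtcars) |>
  gt(id = "cars") |>          # the id lets the CSS target only this table
  opt_css(css = "
    /* '#cars .gt_col_heading' selects elements with the gt column-heading
       class, scoped to this one table */
    #cars .gt_col_heading {
      background-color: #8b0000;
      color: white;
      font-weight: bold;
    }
  ")
```

The detective work Eric mentions is mostly in your browser's inspector: finding which class gt (or any HTML widget) assigns to the element you want, then writing a selector narrow enough to hit only that element.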
[00:38:41] Mike Thomas:
And, Mike, what did you find? Oh, that's a great find, Eric. And the Shiny developer in me and the web scraper in me is very interested in checking that one out, to dive down and figure out very specific CSS and style elements on a web page. That's awesome. I found, very interestingly, a phenomenal article in Nature by Jeffrey M. Perkel called "No installation required: how WebAssembly is changing scientific computing." I think we teased this a little bit last week, but it's a fantastic walkthrough of WebAssembly. It starts off with some quotes from George Stagg at Posit, who has done so much work on the webR package to allow us to write R code that is compatible with the WebAssembly framework and essentially have the work run in the user's browser, no server required, which obviously is pretty game-changing, something we've talked about many times here. And it's really the theme of this article. It's a really interesting article because there are little anecdotes from many people with many different perspectives on this topic.
One of those people being my co-host, statistician Eric Nantz. Oh, hello. And you have an awesome quote here where you say you believe WebAssembly will minimize, from the reviewer's perspective, many of the steps that they had to take to get an application running on their own machine in the context of clinical trials. I totally agree. I really enjoyed reading this article, and it's very exciting, again, for me to see this topic being picked up in something like Nature.
[00:40:20] Eric Nantz:
It's super exciting because it means that it's gaining steam out there. Yeah. I'm as excited as anybody right now in this space, and the fact that we're being recognized in places I never even dreamed of as we're kinda pushing the envelope here. I think it's just another piece where this has a chance to transform so many things, not only in my industry, but in many other places as well. And we're just at the tip of the iceberg. There's still a lot of growth here, but I sense that, you know, we're gonna be talking about this for years to come as one of the next big evolutions in technology at the intersection of data science. We're on the way, Mike. We're on the way.
Yep. And one other thing I wanna leave off with: a good friend of the show as well, Jon Harmon, has been, you know, sending out some posts on his Mastodon account and LinkedIn about a recent unfortunate event with the R for Data Science community, in that the particular provider that they've been using for assembling funds to keep the project going has unfortunately changed direction. And now they're kind of looking at other ways to receive, you know, robust funding through a robust infrastructure. So I will just mention, if you're in this space, maybe you'll be able to help Jon out with some advice on where to go for additional funding opportunities for R for Data Science and platforms that they can leverage.
Certainly get in touch with Jon personally. I'm sure he would love to hear, you know, some other advice that people have along the way. So, again, really hoping to see the R for Data Science group keep going. But I know it's always tough when you rely on a platform to help centralize some of this and then they pull the rug from under you. So let's hope for the best, Jon. And, certainly,
[00:42:16] Mike Thomas:
if we find any, you know, resources, we'll pass them along your way. Yes. No. Please help out Jon if you have the means to be able to do so, because this R4DS community is fantastic. I think that it's helping a lot of people get up to speed with R and get introduced to R. It's helping people like me with very niche questions, and it's just an incredible community of folks who are willing to help one another and to listen and to try to encourage each other in our R programming journeys.
[00:42:47] Eric Nantz:
Absolutely. Absolutely. So, I was reading some of his latest posts here as of 4 days ago. They did get some additional funding before that host kinda pulled the rug from under them, but that's not gonna last forever. So, again, I'll put a link in the show notes to where you can contact Jon and contact the project. So, again, we really hope for the best here. And talking about finding great resources for help, I mean, we've said this many times: the R for Data Science community has so many helpful participants at all skill levels. It is a wonderful resource out there. They have book clubs. They have groups dedicated to different packages or different frameworks. It is all there for the taking.
And, really, some of the best support you can give is even just helping out with that community on top of financial donations. So I'm sure he would welcome that as well. And speaking of welcoming, we welcome your feedback too with this humble little, you know, endeavor we call a podcast here. And what are ways that you can help us out? Well, first, the R Weekly project itself. We'd love to get your new package idea or new package resource. If you have a blog post, a tutorial, or an announcement you wanna share, we're just a pull request away. It's all Markdown all the time. Where you can do that is linked right in the top right corner of rweekly.org, the link to the current issue's draft, or upcoming issue's draft, I should say. And then you can just send your pull request there, and our curator of the week will be glad to merge that in for you. And as well, we love to hear from you in the community.
There are many ways to do that. We have a little contact page in this episode's show notes that you can send us direct feedback with. You can also use a modern podcast app like Podverse, Fountain, Castamatic, CurioCaster, I could go on and on. They have a little boost functionality where you can send a fun little message along the way. And a quick congrats to my friends at Jupiter Broadcasting, because they used this modern infrastructure, via the Fountain app, to do live podcast episodes on the ground at the recent SCaLE and Nix conferences in California. It was a good time to be had, so you might wanna search them out. That was some amazing content there. And, yeah, the Nix stuff was quite entertaining. I was thinking of Bruno right away when I listened to this. I may have to dust off the Nix stuff now. And, again, I'm even more inspired than I was last week. I'm sure his ears are ringing. Yes. Yeah. I'll have to link that in the show notes. Bruno, I think you'll find it very interesting.
But what's also awful interesting is hearing from you, as I said. You can also get in touch with us on social media. I am on Mastodon mostly these days at @[email protected]. I am also sporadically on the weapon X thing at @theRcast, as well as on LinkedIn. You can just search for my name, and you'll see all my show announcements and other fun announcements there. And, Mike, where can the listeners find you? Yeah. Probably best on LinkedIn. If you search Ketchbrook Analytics, k e t c h b r o o k, you can find out what I'm up to,
[00:45:54] Mike Thomas:
or on Mastodon, @[email protected].
[00:45:59] Eric Nantz:
Awesome stuff. And certainly, yeah, we love hearing from you, as I said. And the R Weekly train keeps on going, and hopefully we keep going again for the foreseeable future. But I can guarantee you there will be no robotic voices on this podcast. You can be sure we are the authentic Eric and Mike, whether you like it or not.
[00:46:18] Mike Thomas:
That's right.
[00:46:19] Eric Nantz:
Yep. So we will close up shop here. And thanks again for joining us from wherever you are. And it definitely helps if you wanna spread the word to others in your organizations learning data science. You know? Spreading the word about the podcast is probably some of the best support we can get, so we greatly appreciate that. So we will close up shop here, and we will be back with another episode of R Weekly Highlights next week.
[00:00:27] Mike Thomas:
I'm doing well, Eric. It was starting to warm up here on the East Coast; now we're getting a cold week. So it's a little frustrating, but I think that may be the theme of this week's highlights. We are venting this week in some of these highlights, and I am here for it.
[00:00:42] Eric Nantz:
Oh, as am I. And even in the preshow, that all of you can listen to, Mike here heard my vents about some recent rabbit holes that I went down. But, yeah, we're gonna have a lot to share today, because we do feel very related to a lot of the concepts we're about to talk about here. And how is this issue possible? Well, our curator this week was Sam Parmar, another good friend of mine from the life sciences industry. He has put together a terrific issue we're gonna talk about here. And as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your awesome pull requests and other suggestions.
So, yeah, let's get the, quote, unquote, venting session going, and we're gonna go fast with this, because Miles McBain here, Mike, has some very insightful tidbits to share, which I can tell have been gleaned from a lot of experience in data science, called the patterns and anti-patterns of data analysis reuse. So what do we have here? Why don't you set this up for us? This is too relatable. It's a phenomenal blog post
[00:01:52] Mike Thomas:
talking about, you know, the data analyst curse and, you know, understanding that in every data analysis and data scientist role that Miles has been in, and I agree with this as well, at some point in time you're redoing variations of the same analysis. And there are two assumptions that he's making for those who would be able to relate to this blog post. The first one is that your work is written in code like R, Python, Julia, or Rust. If you're using, in his words, Power BI or God forbid Excel, you probably won't relate to this. And the second is that you're using a technology like Quarto, R Markdown, or Shiny, such that your end deliverable is generated from that code. So if this sounds like you, I am assuming that maybe you'll be able to relate to this blog post as I did.
And when you're redoing, you know, that same analysis in sort of different ways, one of the first things that you may start out doing as a beginning developer is copying and pasting. Right? From one version of your report to the next version of your report that you need to create. And, you know, this can be a quick solution, but it may not be very extensible or maintainable, because when it comes time to update some sort of global, you know, version of this analysis, you would have to copy and paste to each different version that you have out there. And when you're trying to fix a bug or create an enhancement, that same concept would apply: instead of just doing that update in one specific place, you would need to copy and paste it to all of the different places, because you don't have something like a template.
And this is where you can start to move on towards, oh, I'm going to create a sort of single template that is going to have parameters in it that I can set, that will allow me to run different versions of this analysis just based upon the different parameters that I'm passing to it. And in theory, this is great. And I think this is exactly what we all strive for. But if you have done this long enough, if you have sort of been in this world long enough, you'll start to see that as you create this parameterized global version, with each new variation of your analysis that someone's asking you to run, or each new dataset that's coming in, there's going to be a new edge case that you're going to have to handle.
And what that means is probably an additional parameter. And if you're like me, this can get to where now you're starting to write, like, conditional if statements to test and see, you know, if this particular variation matches this one very, very specific edge case. And now your global template is just getting super bloated, because it's trying to handle all of these different particular cases. And then at some point, someone says, hey, why don't we manage those parameters with YAML or JSON? Because, for sure. Right? Now we're talking about configuration.
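The parameterized-template pattern Mike describes, sketched for R Markdown; the study and endpoint names here are made up for illustration:

```r
# report.Rmd would declare its parameters in the YAML front matter, e.g.:
# ---
# title: "Study report"
# params:
#   study: "STUDY-001"
#   endpoint: "mpg"
# ---
# Inside the document, code chunks refer to params$study, params$endpoint.

# Each variation is then rendered from the single template:
rmarkdown::render(
  "report.Rmd",
  params = list(study = "STUDY-002", endpoint = "hp"),
  output_file = "report-STUDY-002.html"
)
```

This is exactly the point where the post's warning applies: each new edge case tempts you to add one more entry to `params`, and the template quietly turns into the bloated configuration machine described next.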
And, you know, then you have a YAML or a JSON file that is supposed to manage and handle these parameters. And maybe it starts out small, because there are only a few global parameters there. But, again, as you introduce these different versions of this analysis, or somebody asks for this new thing, or this new dataset comes in that wants to look at your analysis from a different angle, you're just adding additional parameters. And now, all of a sudden, your YAML file starts to get pretty long. And then maybe at some point you're having a second YAML file manage the configuration of the first YAML file, and it's YAML all the way down. And things just start to be unwieldy, and you start to think, hey, maybe I just need to go back to cut and paste.
Copy and paste. And it can drive you a little insane, as, I would say, Miles may have gotten to in the paragraph here titled Power Overwhelming, which I think is where we start to sort of go off the rails, and I mean that in the nicest way, because this is incredibly, incredibly relatable. So then, Eric, if you wanna take it away to talk about how maybe we start to move towards a package framework.
[00:06:31] Eric Nantz:
Right. And first, yes, we all can very much relate to this, because the entire spectrum of that build-up to this point, I have seen with my own eyes. I have committed some of this with my own hands, if you will. And, yes, sometimes the only way we learn is through painful experiences. I have had some extremely sophisticated templates in R Markdown before that had a boatload of params in the end. And sometimes they would be shared with different teams, and then they'd realize, oh, yep, that particular dataset for that study has this type of efficacy variable, and I didn't cover it. So it just keeps adding on, adding on, adding on, until it's at a point where no one knows where the central place for that thing is, and everybody's copy-pasted it to a different study. Some of this is, ironically, still happening. We are trying to put the reins on it. But yes.
And when you get to this point, you think about: what are ways that I can make it easier for us to maintain some kind of structure to this, still make it easy for the end user to implement in their analysis pipelines, but still be able to tap into some of the modern practices that help maintain this reusable code? And, yes, spoiler alert, that does mean creating an internal package, and that may be intimidating to many people. But the thing is, I would say, once you've been through these hardships, you're at a point to appreciate the upfront work to build a package, maybe more so than if we just told you this when you're brand new to data science in your particular industry or particular group.
Then you're gonna be ready to absorb some great resources out there already, especially in the R ecosystem, to get a package off the ground. Why should you do this, though? Because that way, instead of having these template variables and these massive templates, you can have functions with function parameters that cover much of this operation and functionality. And you don't have to have it perfect the first time. Maybe you just automate certain parts of it and kinda build on it over time. But having the package is gonna let you opt into additional best practices, to get ready for cases where maybe your package's analysis functions are being used in ways you didn't anticipate.
But you can build in things like automated testing. You can build in documentation on these parameters, so that you can use the wonderful tools like usethis, testthat, and devtools to help make this package more robust in the R ecosystem. And, of course, Python fans, you have similar frameworks on that side as well. But just getting to that package step is a huge first step to start righting some of the wrongs that you may have experienced in your respective effort. Now, like anything, there are some gotchas to worry about. And another issue that I've seen firsthand, and I've seen very talented people do this firsthand, is that you started this great analysis package for your group.
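A hedged sketch of the bootstrap Eric alludes to; the package and function names are hypothetical, and the function body is deliberately tiny:

```r
# Scaffold an internal package interactively with usethis:
usethis::create_package("ourteamtools")    # hypothetical package name
usethis::use_r("summarise_endpoint")       # creates R/summarise_endpoint.R
usethis::use_test("summarise_endpoint")    # paired testthat file
usethis::use_package("stats")              # declare dependencies in DESCRIPTION

# R/summarise_endpoint.R -- a documented function replaces a template param:
#' Summarise an endpoint by treatment arm
#' @param data A data frame containing an `arm` column and the endpoint.
#' @param endpoint Name of the endpoint column, as a string.
summarise_endpoint <- function(data, endpoint) {
  stopifnot(endpoint %in% names(data))   # fail fast with a clear error
  stats::aggregate(data[[endpoint]],
                   by = list(arm = data$arm), FUN = mean)
}

# tests/testthat/test-summarise_endpoint.R -- the automated test:
# test_that("means are computed per arm", {
#   d <- data.frame(arm = c("A", "A", "B"), resp = c(1, 3, 5))
#   out <- summarise_endpoint(d, "resp")
#   expect_equal(out$x, c(2, 5))
# })
```

Running `devtools::check()` then gives you documentation, tests, and dependency checks in one pass, which is the robustness a pile of parameterized templates never had.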
Maybe you called it the name of your group. Who knows? And then over time, either you or others say, hey, you know what? What if this package did this new thing? And suddenly, your catalog of functions going into this internal package starts to balloon up. And maybe it gets to the point where you have so many functions, and some of them just don't really relate to each other. But because it felt like such hard work to get that package off the ground, everybody just wants to put it in one place, so they only have to load one package and then it's all there. But you're running into the risk, as Miles points out, of complexity overload and a lot of bloat.
And especially if you need to make a change or deprecate something in that package, suddenly the whole package is being updated in ways that maybe you didn't anticipate. And so that's where, going through this exercise, yes, getting to the package is a great first step, but there are a lot of diminishing returns if you decide to put everything but the kitchen sink, so to speak, in this one internal package. I've also seen this in a capability tool that we used to author for helping design clinical trials. We had a monolithic, and I do mean monolithic, application that was meant to do everything for our clinical program, and we just could not maintain it anymore. There was just so much, in frameworks that honestly half of us didn't even understand, and it was all in one monolithic code base.
At some point, the technical debt just became too much. What did we realize, and what Miles transitions to in this piece, is that instead of having this single package, try making your own internal, like, group of packages. He calls it the personal 'verse of packages. Of course, we're familiar within the R ecosystem with the tidyverse and other groups of related packages that may make some shared decisions, may have common, you might say, data structures that they operate on, but they've separated their purposes. They've separated their concerns into fit-for-purpose packages.
This way, instead of having to update this monolithic piece for maybe that one little change, now you have a set of packages. They all contribute to a greater whole, as they say, but now you can write updates to these in fit-for-purpose fashion. So he mentions in his examples that in his current job, he's got a fun package called check yourself, definitely before you wreck yourself, I'm just saying, to help you run some quality checks, you know, for your dataset. Great first step in a data analytical pipeline; that makes a lot of sense. And then, because Miles is a huge fan of reproducible analytical pipelines, they have a package on top of targets called TDC targets.
And that's helping them build these pipelines in a unified way. It's still leveraging targets on the back end, but they're helping bootstrap that a bit easier. But, see, he separated out the data checking and the analytical pipeline building into two sets of packages. There may be others that deal with internal APIs. I'm living in that world right now. Do I wanna put all API calls in one, like, company package? Oh, heck no. We wanna separate that out into its own fit-for-purpose thing, because, spoiler alert for me, testing APIs is a much more unwieldy effort than testing normal R functions. So why would you want one monolithic package to do all of that? You wanna separate that out as best you can. You're getting flexibility, but it is gonna take discipline to get there.
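For listeners who haven't used targets, a minimal sketch of the reproducible-pipeline style being discussed; the file paths, target names, and the tiny quality-check function are all illustrative, not Miles's actual code:

```r
# _targets.R -- defines the pipeline; run it with targets::tar_make()
library(targets)

# A tiny quality gate in the spirit of a data-checking package:
check_data <- function(d) {
  stopifnot(nrow(d) > 0, !anyNA(d$y))  # fail the pipeline early on bad input
  d
}

list(
  tar_target(raw_data,   read.csv("data/raw.csv")),     # ingest
  tar_target(clean_data, check_data(raw_data)),          # quality checks
  tar_target(model,      lm(y ~ x, data = clean_data))   # analysis step
)
```

The win is that `tar_make()` only rebuilds targets whose upstream inputs changed, so the data-checking step and the modeling step stay separate, cacheable units rather than one giant script.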
So I do think that it's not gonna be easy to do this all right away. You've gotta start somewhere. But, honestly, the first step is recognizing when you have a problem. Because sometimes you may send out these great, like, end products, these monolithic templates or monolithic packages, and everybody in leadership thinks, oh, you're doing great. Yeah, this is helping the company so much. But what are you standing on? Is that foundation solid? You really gotta pay attention to that, because it's one thing to get the short-term win by doing the copy-paste, you know, method, but it's gonna fall down on you at some point.
And, honestly, like I said, nobody gets this perfect the first time, and we are all continually learning on this. And that's where the post concludes: you're gonna have these humble beginnings, right? But then you're really honing your craft as you go along. You all don't wanna see the first internal package I did at the company. It took a lot of shortcuts that I'm not proud of. But getting there was a huge first step for me. And then I learned from the community, I learned from my teammates; I'm very privileged to learn from Will Landau, basically, every time I talk to him about something new. So all these things just build upon each other. You're going to get even more comfortable with this.
And, obviously, time is another factor. Time is not infinite. We have to prioritize this. But, honestly, I'm of the belief that if you take the time up front to set this up the right way, even if you don't quite know how you're gonna get to that end goal yet, but you know that there are some best practices you wanna start with, that is gonna be a huge help to minimize the technical debt that Miles is definitely outlining here if you go with that, quote, unquote, easy approach at first. So, all in all, I think the biggest piece of advice is: when you see you're doing your copy-pasting a bit too much, stop, pause, and try to think about ways that you can make this reusable and, more importantly, maintainable in your team.
I really resonate with many things in this, and credit to Miles for putting this in such a, you know, comprehensive yet very much evolutionary type story of what data analysis pipelines are all about. And definitely, like I said, start small. But once you start small and build fit-for-purpose, you know, I think you're gonna be on the right foot. So
[00:16:20] Mike Thomas:
definitely spoken from experience. I can tell from his insights here. So, yeah, really excellent post, and I think, Mike, you and I have been through this quite a few times in our internal adventures. Right? No, it's a little too relatable, and there are, you know, a lot of things to try to balance here. Right? As you move from a script to a function to a package, and then back and forth, sort of depending on your use case. If I may just read a very small excerpt that I think is worth reading: you know, he talks about creating a massive function that gets written with maybe a dozen arguments, you know, that has hundreds of lines of code, that's not really much different than just a wrapper around some sort of data analysis script that you would have. And it's great that you're using functions, but you're actually, like, attempting to template your entire solution using the function's arguments.
And he has a little footnote in here that says, if a function starts taking YAML configuration files as arguments, you are on borrowed time. And the last paragraph I wanna read, this is, if Shakespeare was a data scientist, he would have written this: such a function is pretty much untestable due to the combinatoric explosion of possible sets of inputs, and you can bet that the internals are not written in high-level domain-specific code. When I look at that function signature, I hear the screams of matrices being ground to a fine powder of bits through the conical burrs of nested for loops and labored index arithmetic. I mean, it's incredible. It's poetic.
[00:17:52] Eric Nantz:
And it's most definitely real. Right? Yep. I have seen this. The approach I've seen is, hey, you know what? For this package, we're just gonna have to use our own custom CSV with all the params inside. Like, oh my gosh. No. Please stop the pain. But it is the crutch that people will fall back on: you know what, it's a lot of work to put all those as function parameters, and I don't wanna test that. Just throw it in the config. It's all about the user configs. No, it's not about the user configs. What it's about is building an actual package that has actual documentation and actual testing.
Yes. I am firm on that because when you don't do that, you may not pay for it right away, but somebody's gonna pay for it. And it'll most likely be your end users, and that's about the worst result of all.
[00:18:49] Mike Thomas:
I agree. No. We have a client who leverages a third-party API where, in order to send your params to that API, you send your data and then you send this wild, like, ASCII file, a text file that just, you know, has no delimitation or whatever you wanna call it. You just have to, like, add an N or a Y depending on which things you want to receive back from the API, and then you have to zip it all up and send a zip file. It's pretty wild. But let's vent about something else, Eric.
[00:19:29] Eric Nantz:
Yeah. I think we now need to think about, you know what, we've thrown a lot of knowledge your way, and we acknowledge that it's not always easy for you to know everything at once. Right? We're all continually learning. And, of course, if we don't have the answer to a question, we're probably gonna ask for help in certain ways, especially online. Right? I mean, how many times have I googled how to do something esoteric with, like, R to call this, you know, API parameter or whatnot, or do this new statistical function that I'm just not as familiar with? So, like anybody else, Mike, I know you and I have searched, you know, the internets and webs quite a bit for certain help here and there. And, you know, with the advent of R over the years, you're starting to see a lot more results there tailored to R itself, certainly posts on Stack Overflow and things like that.
And then you'll start to see in these search results some things that look interesting but don't quite look right. And what is this trend we're talking about here? Well, our next highlight is coming from Matt Dray, another previous contributor to R Weekly. He's noticed this trend too, and we'll unpack this for a bit, because I think those of you that have used R for a while and have been searching for your various help or tutorials have probably seen this. You will often see these websites that are now coming up higher in the rankings that look almost too good to be true, and they kind of are too good to be true, because these are sites where you can kind of tell, whether it's said explicitly or implicitly, that they're being written by some kind of bot, maybe some kind of AI framework.
We don't want to send traffic there. So neither Matt nor I are going to tell you the names of these sites, but it is very easy for them to hook you in by acting like they're authentic, you know, very authoritative sources, I should say. And sometimes you'll see that the same overall site is producing, like, thousands of these guides, one for each package individually. But the guides are not coming from the package authors. They're not coming from people in the community that we've seen, like an authentic blog post. They're these AI-generated narratives that somehow have some great search engine optimization built in so that they're showing near the top of your search results.
But, clearly, they're not playing the right way here, and I think it kinda stinks. I think it's pretty clear why this is a bad thing to see happen, but it's also a reality that we all need to deal with. It's not quite a one-to-one example here, but, of course, most of us have cell phones and we get these robocalls left and right. And we know they're bogus even though they try to act like they're coming from our area code. Right? But these are, like, some real bogus results that are trying to show, quote, unquote, a guide to using a certain package. It's not. It's not. But, yeah, Matt continues in his post about just why, in his opinion, this is a bad thing that keeps happening. Mike, why don't you take us through it? What are some of the downsides of what we're seeing here? Yeah. You know, I think what's gonna happen here for the most part is that,
[00:23:05] Mike Thomas:
with some of these summarizations that take place, when it's a bot or, you know, AI, whatever that means, sort of summarizing and trying to scrape the web and put together these, you know, help sites, I think a lot of the specificity is just going to get washed out. And not only is the content terrible, because it's not written by a human for humans, but the ethics are pretty bad as well. They're really just trying to either make money off of you somehow or redirect you to some affiliate site, you know. And you also have to remember that, in a way, they're sort of stealing their content, you know.
A lot of the content on the web that people are putting out there, in terms of, you know, Stack Overflow help, is stuff that's actually authored by someone. And, as we found when we went down the rabbit hole a little bit talking about Copilot, a lot of this content does not necessarily have, like, a Creative Commons license behind it that says, hey, you know, go for it and scrape this and use it however you want. You know, it's probably scraping stuff that the author may not have given them consent to scrape. So that's crappy.
You know, there's the fact that it's moving up in terms of SEO, because, you know, I guess that's probably something that AI is fairly good at as well. Right? To try to make this site look like a site that gets a lot of clicks. You know, that's, I guess, the name of the game these days, unfortunately. And I'm seeing it myself. You know, I'm having a harder time getting to Stack Overflow links, which have traditionally been the thing that has helped me the most, has gotten me to my answer the quickest. And usually, maybe I have to sort through two or three or four different Stack Overflow links to find my exact solution. But in the past, those links would come up very, very high, you know, if not the first result, then the second result.
And now they're much sparser, unfortunately. And, like Matt, I am having to sort through a lot of this crap, unfortunately. So it stinks, you know. And that is taking sort of a pessimistic view here, which maybe I share. I don't know. I hadn't put too much thought into this until Matt's blog post, but, you know, I don't really see an avenue where this gets better as opposed to gets worse. I don't know, Eric.
[00:25:51] Eric Nantz:
Yeah. I think it's a reality, like I said, with my robocall example. No matter how many you block, there are always others that are gonna keep coming. And I think there are gonna be these sites that crop up now, with AI and automation becoming so much easier for the masses, or in the case of some of these, you know, unethical corporations or whoever's behind them, to just launch all these automated processes on some server somewhere and do the SEO gaming of search results. I think the biggest thing we can recommend is to have, like, a careful eye as you're searching these results. And I think over time, you'll see these patterns, such as Matt has been talking about here in this post.
But I'm gonna say the best place to draw upon for, you know, help for, say, a package itself is hopefully the package documentation itself. Most of them now have pkgdown sites. They usually have a GitHub repository, or a GitHub-like repository, and, you know, seeing what issues have been talked about for that particular package on their issue board, that's a great way to learn even just by reading through that. And also leveraging community-built resources that you know are being built by humans. And guess what? Another spoiler: R Weekly is built by humans. Right? We are linking to content that has been created by package authors, by data scientists, by others in the field that are authentic. And that's why we have a curator to always sift through. We get noise too, just as much as anyone else, but we weed the heck out of those. We make sure those don't get into our issue.
Unfortunately, for search engines, we don't have that control. Right? They're just always gonna kinda pop up from time to time. I think if it sounds too good to be true, it probably is, so to speak. So definitely keep a careful eye on those which, as Matt points out, will have some random affiliate links somewhere, maybe in the footer of the site or in a sidebar or whatnot. No, I don't see that for authentic R-based content or data science content in my day. So I think it's more about experience. You're gonna be able to spot this more quickly. But I think what Matt does here is at least bring awareness to the issue that this is real. It's probably not slowing down.
And so just make sure that you are, you know, looking at the authentic community-based or, in some cases, developer-authored resources to really get you in the right direction for your particular issue. But, Matt, I see at the end of the post, yeah, you also grew up in the times I grew up in. Good old floppy disks. Right? We didn't have fancy AI bots generating these queries. We had to make sure that 5.25-inch floppy disk somehow worked in our IBM PCs or Apple IIGSes. Shout out to all those that used vintage Apple computers. So it's a different time now. And I think with that just come some new skills that we have to learn about finding the best from the noise, as they say.
[00:29:01] Mike Thomas:
Yep. No. I remember, if you wanted to know something, you looked it up in an encyclopedia. So I very much yeah. At the library, no less. Yeah. At the library, no less. I think we had an encyclopedia on a CD-ROM at home or something like that when I was growing up. That's right. Yep. No. Times have certainly changed, and I think, you know, unfortunately, that means navigating the Internet and search results requires new skills to be able to do so. But it's unfortunate that some of these sites out there exist, because it sort of feels like cheating.
[00:29:49] Eric Nantz:
Now, with that said, of course, what are ways that you can kinda get your journey of data science started off right, especially if you're new and you wanna turn to a human-generated resource to help you? Well, one thing that has helped me over the years, and I think many others would agree, is the concept of having that handy cheat sheet next to you. So if you're looking up stuff all the time but you just want a quick reminder of how something works, you know, cheat sheets are a great thing to have at your desk or on your virtual wallpaper or whatever have you to kinda get those concepts reinforced from time to time. And our last highlight is actually a new cheat sheet in the R community authored by Jacob Scott, who is a data scientist based in the UK.
And Jacob has put together this best practice for R cheat sheet, and he is very much upfront in his repository that this is highly opinionated toward some of his preferred workflows. But I think there are some concepts here that we can very much relate to, especially in the context of, if you are following the advice of what Miles authored in our first highlight, some ways you can get started pretty quickly. One thing, whether you're running RStudio proper or not: having some kind of project structure is one of those things that you take for granted. But boy, oh boy, I have seen countless times people, like, throw all their R scripts in one directory with no real relation to each other. Just throw it all in there. Right?
Have you ever done that, Mike?
[00:31:27] Mike Thomas:
I've seen it. Maybe I did it when I was starting out, potentially, when I didn't necessarily know best practices around, you know, what an R project even was. You know, like I said, when I was taught R in undergrad, we were only taught R Markdown. I didn't even know what an R script was. I only knew of the existence of .Rmd files. So there's a possibility that at some point in my journey, which I don't wanna dive back into, I would have been guilty of that, but I am very happy to have found and understood R projects.
[00:32:04] Eric Nantz:
Yes. And, yeah, certainly the examples he's talking about here are specific to RStudio, but you can do this in any other typical IDE as well. I mean, VS Code has workspaces you can utilize. And, of course, you know, there are loads of extensions in the classical frameworks that people turn into IDEs, like Vim or Emacs, that do similar things too. The idea is just logical grouping of your code. And he's got a nice little snippet here in the cheat sheet about what his project structure looks like. It's got, you know, subfolders for the scripts themselves. It's got, you know, potential database query SQL scripts. And, of course, he's using renv too. That's another best practice that I think needs to be reinforced: these projects can have wildly different dependency requirements. And on the R side of things, renv is a real bulletproof way, give or take a few gotchas here and there, of making sure that you can reproduce that R execution environment, within reason, from project to project and have that finer-tuned control of your dependencies.
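As a rough sketch (the folder names here are illustrative, not copied from Jacob's cheat sheet), that kind of project layout might look like:

```
my-analysis/
├── my-analysis.Rproj   # opens the project in RStudio with its own context
├── renv.lock           # renv's record of exact package versions
├── R/                  # analysis scripts, grouped logically
├── sql/                # database query scripts
└── data/               # raw or intermediate data
```

With renv, the usual workflow is `renv::init()` to set up the project library, `renv::snapshot()` to record versions into renv.lock, and `renv::restore()` to rebuild that exact library on another machine.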
And then, also, there are some great sections in here about how to create a reprex, which, again, goes back to what we were talking about with getting help. Right? Finding ways to effectively communicate and effectively search. Well, if you know that you're having an issue with a certain package or another, you know, utility, the best way to get help from the community, whether it's on Stack Overflow or, say, Posit Community or whatnot, is having a reprex, so that it shows in a very concise manner what exact error you're getting and lets others reproduce that error. This is not the first time we've mentioned reprexes on this highlights podcast over the years.
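A minimal sketch of that workflow (assuming the reprex package is installed; the model-fitting snippet is just a stand-in for whatever code triggers your issue):

```r
# The key idea: the snippet must run on its own, with its own library()
# calls and self-contained data, on anyone's machine.
library(datasets)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Then wrap it up (commented out here since rendering needs pandoc and
# a clipboard); reprex() returns the code plus its printed output as
# Markdown, ready to paste into Stack Overflow or a GitHub issue:
# reprex::reprex({
#   library(datasets)
#   fit <- lm(mpg ~ wt, data = mtcars)
#   coef(fit)
# })
```

The discipline of shrinking your problem down to something that self-contained often reveals the bug before you even post it.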
So I think having that skill set is a great way to put yourself in the best position to not only ask for help, but then to receive it as well. And then there are also some great sections here about how to connect to databases. And I've used the DBI package quite a bit. He's got a little snippet about connecting to a database with that as well. And then others, such as styling. Again, there can be different takes on how you style your code. What Jacob recommends here, I believe, is a variation of the tidyverse style guide with certain pieces. But, again, I think the key, as we mentioned maybe a few weeks ago, is just consistency.
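Here's a hedged sketch of that DBI pattern (using an in-memory SQLite database as a stand-in; Jacob's cheat sheet may well use a different driver and connection details):

```r
library(DBI)

# Connect; swap RSQLite::SQLite() for your driver, e.g. odbc::odbc()
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a table, then query it with plain SQL
dbWriteTable(con, "cars", mtcars)
res <- dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl")
print(res)

dbDisconnect(con)  # always clean up the connection
```

The nice part of DBI is that the dbConnect/dbGetQuery/dbDisconnect calls stay the same when you swap the backend, so a snippet like this ports across databases with only the driver line changing.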
Once you have consistent styling, no matter if you use the tidyverse framework or, let's say, Google's framework or your own company's framework, having consistency is gonna help you as well as your future self and collaborators on those projects. And then it concludes with some links to learning more about the R community and learning R itself, with resources such as R for Data Science, building packages with R Packages, and really getting into the nuts and bolts of R with Advanced R. And, yes, for the Shiny fans out there, a link to Mastering Shiny as well. Highly recommended. And I think it's a great way to get started. I think for those new to the R framework, this is one of those great examples to get you started off the right way, and it's gonna whet your appetite, so to speak, to dive into some of these concepts in more detail.
So really nice job here, Jacob, and I think it belongs in many collections of cheat sheets out there.
[00:35:36] Mike Thomas:
Yeah. This is absolutely beautiful, design-wise, the way that he drew this up. It says that he originally created a similar version of this cheat sheet specifically for use in the UK Department for Education, but he's created this more generalized version, I think, for everybody else, which is fantastic. And it sort of makes me think that perhaps, you know, you may wanna create a cheat sheet similar to this within your own organization that, you know, mentions and outlines some of the specific best practices that you wanna follow and employ within your own organization. So if you're looking for inspiration to do something like that, this would be a great place to start.
[00:36:16] Eric Nantz:
Yeah. And he does have a link in the GitHub repo to additional cheat sheets that are available on Posit's site. You know, we see many contribute to those, both from Posit and outside of Posit too. That style is very reminiscent of those. So I think it's an interesting way to get started the right way, and I'm always all for it. Again, human-generated R resources out there. No bot made that resource. I can pretty much tell that one. That's right. Yep. And like I said, what else do bots not create? Well, it's the R Weekly issue itself. We've got a curator helping with that every single week. And, again, Sam did a tremendous job with this issue, and we'll take a couple minutes to talk about our additional finds in this issue.
For me, I'm still very much in my learning journey of, you know, APIs with R, but also developing web-based resources with R and pushing, like, Shiny in new directions, and pushing even the Quarto sites I create in new directions. And friend of the show Albert Rapp has another terrific blog post here in his web dev for R series. One area that I simply have to keep looking up every time, because it's not muscle memory yet, is getting a handle on selecting certain elements in your CSS style sheets. And he's got both a video and an accompanying blog post that talk about how you can select particular tags of a certain type, with both the source code and solution right there in the post itself: selecting elements by class, lots of things that, unless you really practice a bit, especially if you're new to web dev, are gonna seem pretty foreign to you. But he brings it home with how he used these techniques to modify some of the styling behind a gt table that he was creating. So you can give it a little extra personality, a little extra style along the way. So if you're in the world of CSS styling and you're just not sure where to go to get that particular nagging element that you wanna make, like, a bold font or a red background, this is a great post to let you dive into just what kind of detective skills you might need to get to said element and make it look the way you want it to.
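In the spirit of Albert's post (this snippet is my own illustration, not his code), CSS selectors can be pointed at a gt table from R via `opt_css()`, using a fixed table `id` so the rules only hit that one table:

```r
library(gt)

tbl <- head(mtcars) |>
  gt(id = "cars-table") |>  # fixed id, so the selectors below stay scoped
  opt_css(css = "
    /* id + class selector: all column heading cells in this table */
    #cars-table .gt_col_heading { background-color: #1b9e77; color: white; }
    /* tag selector: every body cell gets a hover effect */
    #cars-table td:hover { font-weight: bold; }
  ")
```

The `.gt_col_heading` class name is one gt assigns internally, so it can shift between package versions; inspecting the rendered table in your browser's dev tools is the reliable way to find the selector you actually need.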
[00:38:41] Mike Thomas:
Oh, that's a great find, Eric. And the Shiny developer in me and the web scraper in me is very interested in checking that one out, to dive down and figure out very specific CSS and style elements on a web page. That's awesome. I found, very interestingly, a phenomenal article in Nature by Jeffrey M. Perkel called "No installation required: how WebAssembly is changing scientific computing." I think we teased this a little bit last week, but it's a fantastic walkthrough of WebAssembly. It starts off with some quotes from George Stagg at Posit, who has done so much work on webR to allow us to write R code that is compatible with the WebAssembly framework and essentially have the work run in the user's browser, no server required, which obviously is pretty game-changing, something we've talked about many times here. And that's really the theme of this article. It's a really interesting article because there are little anecdotes from many people with many different perspectives on this topic.
One of those people being my cohost, statistician Eric Nantz. Oh, hello. And you have an awesome quote here where you say you believe WebAssembly will minimize, from the reviewer's perspective, many of the steps that they had to take to get an application running on their own machine in the context of clinical trials. I totally agree. I really enjoyed reading this article, and it's very exciting, again, for me to see this topic being picked up in something like Nature.
[00:40:20] Eric Nantz:
It's super exciting because it means that it's gaining steam out there. Yeah. I'm as excited as anybody right now in this space, and the fact that we're being recognized in places I never even dreamed of as we're kinda pushing the envelope here. I think it's just another piece where this has a chance to transform so many things, not only in my industry, but in many other places as well. And we're just at the tip of the iceberg. There's still a lot of growth here, but I sense that, you know, we're gonna be talking about this for years to come as one of the next big evolutions in technology at the intersection of data science. We're on the way, Mike. We're on the way.
Yep. And one other thing I wanna leave off with: a good friend of the show as well, Jon Harmon, has been, you know, sending out some posts on his Mastodon account and LinkedIn about a recent unfortunate event with the R for Data Science community, in that the particular provider that they've been using for assembling funds to keep the project going has unfortunately changed direction. And now they're kind of looking at other ways to receive, you know, robust funding through a robust infrastructure. So I will just mention, if you're in this space, maybe you're able to help Jon out with some advice on where to go for additional funding opportunities for R for Data Science and platforms that they can leverage.
Certainly get in touch with Jon personally. I'm sure he would love to hear, you know, some other advice that people have along the way. So, again, really hoping to see the R for Data Science group keep going. But I know it's always tough when you rely on a platform to help centralize some of this, and then they pull the rug from under you. So let's hope for the best, Jon. And, certainly,
[00:42:16] Mike Thomas:
if we find any, you know, resources, we'll pass them along your way. Yes. No. Please help out Jon if you have the means to be able to do so, because this R4DS community is fantastic. I think that it's helping a lot of people get up to speed with R and get introduced to R. It's helping people like me with very niche questions. And it's just an incredible community of folks who are willing to help one another and to listen and to try to encourage each other in our R programming journeys.
[00:42:47] Eric Nantz:
Absolutely. Absolutely. So, I was reading some of his latest posts as of four days ago. They did get some additional funding before that host kinda pulled the rug from under them, but that's not gonna last forever. So, again, I'll put a link in the show notes to where you can contact Jon and contact the project. So, again, we really hope for the best here. And talking about finding great resources for help, I mean, we've said this many times: the R4DS data science community has so many helpful participants at all skill levels. It is a wonderful resource out there. They have book clubs. They have groups dedicated to different packages or different frameworks. It is all there for the taking.
And, really, some of the best support you can give is even just helping out with that community, on top of financial donations. So I'm sure he would welcome that as well. And speaking of welcoming, we welcome your feedback too with this humble little, you know, endeavor we call a podcast here. And what are ways that you can help us out? Well, first, the R Weekly project itself. We'd love to get your new package idea or new package resource. If you have a blog post, a tutorial, or an announcement you wanna share, we're just a pull request away. It's all Markdown all the time. Where you can do that is linked right in the top right corner of rweekly.org, the link to the current issue's draft, or upcoming issue draft, I should say. And then you can just send your pull request there, and our curator of the week will be glad to merge that in for you. And as well, we love to hear from you in the community.
There are many ways to do that. We have a little contact page in this episode's show notes that you can send us direct feedback with. You can also use a new modern podcast app like Podverse, Fountain, Castamatic, CurioCaster. I could go on and on. They have a little boost functionality. You can send a fun little message along the way. And quick congrats to my friends at Jupiter Broadcasting, because they used this modern infrastructure from the Fountain app to do live podcast episodes on the ground at the recent SCaLE and Nix conferences in California. It was a good time to be had, so you might wanna search them out. That was some amazing content there. And, yeah, the Nix stuff was quite entertaining. So I was thinking of Bruno right away when I listened to this. I may have to dust off the Nix stuff now. And, again, I'm even more inspired than I was last week. I'm sure his ears are ringing. Yes. Yeah. I'll have to link that in the show notes. Bruno, I think you'll find it very interesting.
But, as well, what's awful interesting is hearing from you, as I said. You can also get in touch with us on social media. I am on Mastodon mostly these days with @rpodcast at podcastindex.social. I am also sporadically on the weapon X thing at @theRcast, and as well on LinkedIn. You can just search for my name, and you'll see all my show announcements and other fun announcements there. And, Mike, where can the listeners find you? Yeah. Probably best on LinkedIn. If you search Ketchbrook Analytics, k e t c h b r o o k, you can find out what I'm up to,
[00:45:54] Mike Thomas:
or on mastodon@[email protected].
[00:45:59] Eric Nantz:
Awesome stuff. And certainly, yeah, we love hearing from you, as I said. And the R Weekly train keeps on going. And hopefully, we keep going again for the foreseeable future. But I can guarantee you there will be no robotic voices on this podcast. You can be sure we are the authentic Eric and Mike, whether you like it or not.
[00:46:18] Mike Thomas:
That's right.
[00:46:19] Eric Nantz:
Yep. So we will close up shop here. And thanks again for joining us from wherever you are. And it definitely helps if you wanna spread the word to others in your organizations learning data science. You know, spreading the word about the podcast is probably some of the best support we can get, so we greatly appreciate that. So we will close up shop here, and we will be back with another episode of R Weekly Highlights next week