A peek behind the curtain of how R handles that batch of code you send to the console, an adventure in automating the translation of Quarto documents to multiple languages, and there's no time like the present to give your code a little linting love.
Episode Links
- This week's curator: Sam Parmar - @[email protected] (Mastodon) & @parmsam_ (X/Twitter)
- Long input lines
- Translating Quarto (and other markdown files) into Any Language
- Get your codebase lint-free forever with lintr
- Entire issue available at rweekly.org/2024-W36
- News from R Submissions Working Group – Pilot 3 Successfully Reviewed by FDA
- Mastodon Accounts Posting About #RStats
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon) and @theRcast (X/Twitter)
- Mike Thomas: @mike[email protected] (Mastodon) and @mikeketchbrook (X/Twitter)
- Torvus Clockwork - Metroid Prime 2: Echoes - DarkeSword - https://ocremix.org/remix/OCR01507
- Sleep, My Sephy (Judgement Day) - Final Fantasy VII: Voices of the Lifestream - Pot Hocket - https://ff7.ocremix.org/
[00:00:03] Eric Nantz:
Hello, friends. We are back with episode 177 of the R Weekly Highlights podcast. This is the weekly podcast where we talk about the awesome highlights and other resources that are shared every single week at rweekly.org. My name is Eric Nantz, and as always, I am delighted you've joined us from wherever you are around the world.
[00:00:21] Mike Thomas:
And you know what, folks? All is right with the world because I am not flying the plane solo this week. I have my awesome cohost, Mike Thomas, making his triumphant return to the podcast. Mike, how have you been? Been doing well, Eric. Yes, speaking about from around the world, I feel like I've been around the world a little bit. I was fortunate enough to be able to attend posit::conf and see you there, and I know we weren't able to record that week, and then I sort of went straight from there to more conferences in Boston and elsewhere, and finally just getting my feet back on the ground at home, back in the office. And I apologize to you for making you do it for a couple weeks on your own, but you did a great job. And I'm looking forward to being back at it today.
[00:01:07] Eric Nantz:
Yeah, I survived, but I think the listeners will agree that things are more correct now with you here. But, yeah, back a couple episodes ago, I gave my own kind of conference experience. I wanna make sure you get some time to talk about what you thought about posit::conf and anything you wanna share with the audience about what you took away from it.
[00:01:30] Mike Thomas:
Sure. Well, it's always just so invigorating to get to be around other, you know, data scientists and like-minded folks, and listen to talks and just connect and collaborate and be in that community face to face together. So I thought they did a great job putting on the conference.
[00:01:46] Eric Nantz:
I was able to see some of my team members at Ketchbrook that I actually hadn't seen in person before. Yeah. Which is really special. And I was able to be at the table when you saw him for the first time. That was awesome. Yeah.
[00:01:58] Mike Thomas:
That's right. Shout out Ivan. And it was a fantastic experience. You know, some of the highlights for me, I guess on the Shiny side, were Joe Cheng's talk about leveraging, you know, large language models within your Shiny app, which are able to essentially put together a DuckDB query just from natural language, show you that DuckDB query, and return the results on screen to you, which was pretty incredible. I think we have probably no less than a billion opportunities to incorporate that into some Shiny apps that we currently have. But, you know, there were a bunch of fantastic talks. I think it's easy to take for granted all of the work that goes into putting together these talks and how hard folks work to be able to put these presentations together for us. So just a big shout out to everyone that participated and gave a talk this year and was able to provide us with some fantastic content. So I had a great time. I got to see a little bit more of Seattle as well than I had seen in the past. Sometimes at these conferences, it's tricky because you're just sort of glued to the conference and sometimes it's hard to make it out of the hotel, but I was able to go out with a few clients as well, finally was able to see the big Pike Place Market downtown there, which was cool, and did a little walking around and enjoyed getting out to the West Coast and seeing Seattle. So we'll see what, I guess, Atlanta brings next year.
[00:03:30] Eric Nantz:
That's right. You're going to Hotlanta next year. It'll be my first time there. But, yeah, it's always funny trying to get a read of the room whenever they make those announcements. It's always, you know, mostly cheers, sometimes a couple groans, but either way, it should be a good time nonetheless. I don't think there were any boos, so that's good. Yeah. That's true. No heat, so to speak, from that announcement. But, yeah, that will be here before we know it, because the months go by fast. It was, again, terrific to see you once again and just to hang out with you a bit. And, yeah, like you said, I was pulled in many different directions, so I didn't get to see as much of Seattle as I would like, but who knows? Maybe I'll be back there next year for other events or whatnot. But, nonetheless, it was awesome. And, yeah, I hope that the hype I put around Joe's Shiny announcement was worth it because, for the listeners, I had a sneak preview of that before Joe made that presentation. And poor Mike here was just asking me, any hints, any hints, and when's it coming? When's it coming? It's like, stay tuned for Wednesday. And, yep, it seems like it didn't disappoint.
[00:04:37] Mike Thomas:
Yep. For anybody that needs this information: Eric is a great secret keeper.
[00:04:44] Eric Nantz:
When I'm told to zip the lips, I know how to zip the lips, usually. Usually. But this time, I honored good old Joe on that one. But, nonetheless, yeah, amazing experience. And, you know, we gotta get to business here. We got our fun weekly issue to talk about here. This has been curated by Sam Parmar, who was also at posit::conf. I got a chance to hang out with him multiple times, and I actually had some nice dinner with him one of the nights too, just walking around the Seattle area, but he and I always have lots to talk about. And for this, he had tremendous help, as always, from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and other resource suggestions.
We're gonna go technical here, Mike, on this first one, in areas that, admittedly, I kinda take for granted, but I guess could bite you if you're not careful. Now with R, R is one of those you might call interpreted languages, which means that when you launch R, whether it's through the console directly, through its own GUI interface on Windows, or through an IDE like, you know, RStudio or the new Positron or whatever have you, you're gonna get a REPL, which is kind of what the R console looks like. But if you don't know what a REPL is, and believe me, even when I first got in I had no idea what that is, that is the read-eval-print loop. And both R and Python and other languages come with this so that you can type a command, hit enter, and then you get the result.
Sometimes when you have a long command, you can actually space this out to multiple lines if you want and just hit enter, and you'll get a little plus sign, not plus as in addition, but a plus marking the new line that you're adding to that console input. And you can do this for as much as you want, albeit you might get tired if you do this a lot in the console directly. So most of the time, we end up writing R scripts or Python scripts, depending on which language you use. And that way, you can run the whole file, or you could send certain bits of that file into the console itself. Apparently, recently, there was a user that sent so much input to this console, or to the interpreter itself, that they ended up hitting a limit R has had historically for a while: 4,096 bytes. And I did a little crack research before this. Apparently, a byte is a character, so it's kind of a one-to-one translation, which means that this user was sending a command string that was over 4,096 bytes in that session.
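To make the continuation behavior and the scale of that limit concrete, here's a minimal base-R sketch (the 4,096 figure is the historical buffer size discussed in the post; the generated string is just an illustration):

```r
# A single expression can span multiple console lines; the REPL shows a
# "+" continuation prompt for each extra line until the expression completes:
total <- 1 +
  2 +
  3

# Quick sense of scale: a command string well past the historical 4,096-byte
# console buffer is easy to generate programmatically.
long_command <- paste0("c(", paste(rep("1", 2000), collapse = ", "), ")")
nchar(long_command)  # 6001 characters, i.e. over the old limit

# Mike's quick math below: roughly how many 80-character lines fit in 4,096 bytes?
floor(4096 / 80)  # 51
```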
[00:07:37] Mike Thomas:
It might not actually be that long, though. Right? If it's a character, it's not like it's 4,000 lines. How many characters if you get your 80-character width, right? It's what we try to adhere to. Well, yeah. For styling, sure. Yeah. For styling purposes. So in a regular set of 80-character-width lines, that's gonna be, let me do some quick math here. Oh, boy.
[00:08:01] Eric Nantz:
Dangerous.
[00:08:02] Mike Thomas:
51 lines?
[00:08:05] Eric Nantz:
Well, there may be something else going on here, and this is where, again, we're gonna get into the weeds a little bit with the author of this post, which comes directly from the R blog. Tomáš Kalibera, who is one of the members of the R Core team, apparently has been diving into this issue as a result of a report that went to the R-devel mailing list. And, apparently, now there have been advancements in the development version of R to help get to, quote unquote, unlimited lines that you can throw at the parser. So I'm gonna give you my take on how this works. This is a lot of machinery here that, admittedly, I've not gotten into as much. But one has to wonder just what process was generating that massive set of input that was causing this byte limit to be hit. And, again, maybe I have it completely wrong on how a character translates to bytes, but, nonetheless, let's talk about how, when you send a command to R, it actually gets in there.
So, first, Tomáš is quick to point out that the parser is not the issue here, because the parser needs the functionality to look at code with input of technically unlimited length that could span multiple lines. And one concrete example that he highlights here is an if statement. Because with an if statement, you need to look at that entire if statement, everything going into it, to know what to do in terms of the rest of the console input. So that is always going to have to be parsed, in theory, at unlimited length, depending on how extensive it is. So that's where that needs to happen.
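As a small illustration of why the parser has to accept multi-line input of arbitrary length, an if/else construct is only complete once the final brace arrives:

```r
# The parser cannot evaluate anything until the whole if/else unit is read;
# typed interactively, each intermediate line would show a "+" continuation prompt.
y <- 10
if (y > 5) {
  z <- "big"
} else {
  z <- "small"
}
z  # "big"
```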
But it's not just the parser at play here. There's also the REPL itself, which, again, as I mentioned, is getting input from either the user typing or maybe copying and pasting that code into the console. And there is, under the hood, an API function in R itself. They call it R_ReadConsole, which is modeled after a C function called fgets, again, outside of my wheelhouse here. But, apparently, that has a capability to know whether it's getting one massive line or whether it's getting multiple lines separated by, say, a carriage return. Hence, like I mentioned earlier, when you type a line, hit enter, but you're not done yet, you'll get that plus sign in the console. It's gonna know how to interpret that.
But that buffer, basically that empty space, if you will, that the REPL uses to grab all this input, has been at a constant definition for the length limit, which has been 4,096 bytes, apparently, since 2008. And before that, it was actually 1,024. So it was much smaller back then. So, apparently, there were ways, up until recently, where you could overflow this REPL buffer. I don't want that to be confused, and I'm abusing terminology here, because I often hear about buffer overruns when code crashes. I don't think this is quite the same, but, nonetheless, the consequence here, up until recently, was that if you had input that extended past this 4,096-byte limit, it would just truncate it. So it's as if you just didn't type in the rest of your code, so to speak.
So, again, that's been a gotcha, apparently, that some people have seen, but now it seems like that is a thing of the past in the development version. So there are improvements that have been made, but there are some nuances to keep in mind in terms of where you're actually interacting with R. So there is on Windows, historically, since my early days of R, the RGui that comes when you install R on a Windows machine. It looks like its own mini IDE, if you wanna call it that. It's got its own console in there, but there were issues identified with it that made it so that buffer size was getting hit more often than it should.
And now they've fixed that by giving kind of an intermediate interface between the user typing in, hitting enter, and then that going to the REPL for interpretation, to help account for lines that could exceed this limit. So, apparently, that's landing soon in production R. We don't know when yet, but it looks like there have been some improvements on that side of it. And then Rterm itself, another thing within Windows, that's like the console itself, is getting enhancements as well to help circumvent some of these limitations.
Again, I'm not a Windows user anymore, so I don't know exactly if that will affect me day to day. My guess is not. For those out there on Linux, you know, Unix in general also has a piece here, because it uses the readline library whenever you start typing in the console and then sends that to the R interpreter, and that has had limits before. You may run into that, give or take, depending on what OS you're on. I mean, macOS also encounters this. But in conclusion, Tomáš says that there are ways you can defend yourself against this. I think the biggest way is to put everything in a file, usually, if you have a huge massive chunk of code.
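That file-based workaround can be as simple as writing the code out and sourcing it, so the console never has to swallow one enormous chunk of input. A minimal sketch (the temp file here is just for illustration):

```r
# Write 100 assignments to a temporary script, then source() it;
# source() reads the file directly, bypassing the interactive console buffer.
script <- tempfile(fileext = ".R")
writeLines(sprintf("x%d <- %d", 1:100, 1:100), script)
source(script)
x100  # 100
```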
Otherwise, if you are grabbing this code from, say, another process that's generating it, I wonder if it was kind of akin to when you have JavaScript and you can make what's called a minified JavaScript file, where it takes away all those carriage returns, but then you get this, like, massive wide text that the JavaScript parsers know how to deal with. I wonder if something like that was going on with this user, and they were just getting this massive string of R code that was causing these limits. Again, speculation on my end, but it looks like these limitations are hopefully gonna be a thing of the past if you've encountered them before.
So, again, we invite you to check out the blog post, because I've only gone high level on what Tomáš has outlined here. But if you find yourself in this situation, then, yeah, this post may help encourage you that the new versions of R will have a more unlimited limit, if you will, as opposed to this more truncated 4,096-byte limit. Well, I tried to do it justice anyway. Like, what are your thoughts on this?
[00:15:10] Mike Thomas:
No, Eric, this is a tricky one, and I think you boiled it down for us as best as possible. It, again, reminds me of how much actually goes on under the hood when you execute an R function. Right? There's a lot of wizardry to compile that code and to get it to execute, right, the actual C code that is getting run under the hood. And a lot of work goes into making this all work perfectly across different operating systems as well. So it's pretty incredible how easy it is for us to take for granted how well that works. And this is a limitation I haven't necessarily run into myself before. I've, I think, seen some strange things when I try to copy and paste, you know, large amounts of code into the console, and I certainly try to avoid that, especially when I'm jumping between IDEs, RStudio versus VS Code and dev containers and things like that. I'll run into some strange issues. So I think, as you said, you know, a best practice to keep in mind is to just contain everything in scripts. If you have something really long that you want to execute, you can source that script if you want, or just highlight the code that you want to run and execute that code itself, as opposed to, you know, copying and pasting large amounts of code directly into your console.
This also brings me back to my early days of R and RGui, which is something that I had not opened up in quite a long time, and it brought back some memories pretty quickly, because that is admittedly where I started in R. I'm not even sure if RStudio existed
[00:16:58] Eric Nantz:
when I was getting started, not to date myself too much. Speak for yourself! My goodness. You know what IDEs I used in the past?
[00:17:09] Mike Thomas:
And, you know, when you do install R, I was looking at my local R installation and seeing, you know, Rgui.exe, Rterm.exe, and, you know, these are, I guess, things hanging around that we sort of take for granted. Right? They're packaged in with our local installation of R. But if you're using RStudio, or if you're using Positron nowadays, or VS Code or whatever it is, you know, these sort of legacy items, I would imagine that Rterm is still applicable, but probably RGui isn't applicable for folks using one of those IDEs. These legacy executables and applications, if you will, are still there, you know, if you wanted to get some nostalgia, pop into them and try them a little bit. I remember, after I found RStudio, I started a new job and was working with somebody who was writing a little bit of R code, but we were sort of siloed, so we weren't working together very well. And I went over to his desk one day, and he was using RGui and didn't know that you could write an R script and, like, save an R script that you could rerun later. So he thought that you just had to, you know, save commands in, like, a Notepad text file or something like that, and that was the way, I guess, that his professor taught him in university. He never showed him that you could save an R script, which is a little scary, but it makes me think about how far we've come. But that was a really interesting blog post, you know. It sounds like this is a potentially solved issue in the coming releases of R, and maybe something that we don't necessarily have to dive too deep in the weeds on in the future, or worry about, because they're handling so much of this behind-the-scenes stuff for us.
[00:18:53] Eric Nantz:
Yeah. Again, you know, all the machinery behind, or under, the hood of R. We definitely need to appreciate it more, and the fact that we can have modern tooling on top of it and still leverage, you know, the language in many different ways. Yeah, it's a testament to how far things are going, and, yeah, development is not slowing down anytime soon. So, yep, a lot of things that you can go down rabbit holes on, and, certainly, this is one of those. And, yeah, speaking of having, like, modern interfaces on top of R and whatnot, of course, one of the great interfaces, or new add-ons, that many are using in terms of reproducible research and the like has been the Quarto ecosystem, the Quarto publishing engine for documents that can be coded up in markdown, but then have code embedded inside, whether it's R or Python and whatnot.
And one of the best things about Quarto has always been the ability, of course, to write in markdown. I always joke, kind of at the end of each episode, that if you don't know how to write markdown in 5 minutes, Yihui would give you $5. I remember that presentation from years ago. But, you know, markdown, of course, can be written in any language. Right? And we have a very diverse community in data science, as we all well know. So what if you're in a situation where you've written this great Quarto document and now you want to send that to collaborators that have a different, you know, spoken language or primary language?
Well, this next post is gonna give you some insights on how you can do just that with the advent of technology. This blog post is coming to us from Frank Aragona, and he talks about how he was able to translate Quarto documents, and indeed any markdown file, into any other language. So there are some services in play here that can account for this, and he did take a look at the existing services out there. Unfortunately, a lot of them, in his opinion, required API access that ends up needing your credit card even if you're not planning on paying for it, services from Google and others like DeepL or whatnot.
But he did find another avenue called Hugging Face Transformers, which does provide APIs to get pretrained models that are tailored for translation. And so now the key is, well, how do you actually use this thing? There is an existing R library for Hugging Face, but it required Conda to install some Python libraries. And like the author here, me and Conda don't quite get along, especially in my work environment. So, of course, I look for ways to simplify that. So he ended up coding up a more friendly wrapper to install these packages via reticulate and the pip install interface for Python.
And, of course, this does necessitate reticulate. And then, through his package, it'll download the Hugging Face Transformers Python bindings, which, again, I'm not as familiar with. But once that's in there, he's got an R package called translatemd. He's got the code in the blog post on how to get that onto your system once you have reticulate ready to go, and it's gonna spin up a virtual environment for you, which in the Python world is like what we have with renv for containing your dependencies. He calls it r-transformers, but you can rename it whatever you like.
And then once you have that ready to go, you can feed in a Quarto document, and it will basically take a multi-step approach. It'll parse the Quarto document, apply the translation, and then, from that kind of tidy form of the parsed translation, it'll rewrite to a new Quarto file, a new document in the translated language. So he's got a snippet of how this looks when you start parsing it, and much like I've talked about in previous highlights about parsing code files, you do see the different areas that have been parsed, whether it's the YAML, inline text, headings, more inline text, blocks of code, or whatnot.
And then once you unwrap that and translate it, you can see side by side in the post how he's gone from the English version to, I believe, the Spanish version. So that's looking pretty good, and then he's got a little picture at the bottom of the post that shows the 2 documents side by side. But like everything, Mike, there are a little bit of gotchas here that you might want to be aware of before you translate this. Why don't you walk us through some of the little bugs he's found here?
[00:24:21] Mike Thomas:
Yeah. There's a couple of bugs it seems like you might have to manually adjust for, and some of them are somewhat evident just from the 2 screenshots, which are really cool. I think it's an awesome feature of this blog post, showing on the left the English original, I guess, Quarto document, if you will, and then on the right side the equivalent Spanish-translated version of the same Quarto document. So it's really cool to see these 2 things side by side. One of the things that unfortunately happened during the translation is that a particular section header that had a single, you know, pound sign, to have that as sort of an h1 header, the pound sign got removed, or the hashtag, for the younger audience out there. So you can see, sort of on the right side, that bold heading that would correspond on the left side doesn't exist in the Spanish translation. So it's just something that you would have to go in and add that pound sign for yourself. Not a huge deal in a small document. In a big-document situation, it would be a little painstaking to have to go through and do that in a lot of different places.
We just rolled out 9 reports, 9 Quarto reports, that were due on the 31st, that we rolled out to our clients, and each one of them was about 77 pages long. So in that case, it would have been difficult for us to do. But if you have a small little, you know, Quarto report, it shouldn't be too big of a deal to go in and add those headers, and there's always control-F, too, right, so we can try to find the particular header text and then go in there and just stick that pound sign in front of it. The lightparser package, which is leveraged here, has a known bug as well with Quarto chunk YAML parameters. In particular, if you have a chunk that is specified as eval: false, excuse me, such that the code shows up but it actually doesn't get evaluated, it translates that into eval: no, the n-o, instead of eval: false. So it looks like the lightparser folks are working on rectifying that issue.
I haven't looked into it yet. Frank, the author of this blog post, says that hopefully this is fixed, but maybe we can do a little follow-up next week to take a look at whether or not that bug still exists. But Frank also wraps up by letting you know that it's probably a good idea, because we're potentially automating a lot here, to go through the translated document with a really fine-tooth comb and make sure that there are no other bugs. Because I think in situations like this, right, it can handle a lot for you, you know, maybe upwards of 90%, but there are some edge cases depending on what you're trying to do in your Quarto document. We do a lot of crazy stuff with include chunks, you know, child documents and things like that. So you never know, you know, if there are particular edge cases that this translator has not been able to solve yet at this point. So I think it's a good idea to take your time and take a look at the output. Know that a lot of the manual work that you would have had to have done has been taken care of for you, but there might still be some little spots you might have to tune up.
[00:27:36] Eric Nantz:
Yeah. It's not a direct one-to-one analogy here, but I've been looking at things like this even for the production of this very show. The podcasting service that we use has a functionality where it'll produce a transcript for our episodes. Right? I mean, it's kind of like a translation from audio to text, so it's not quite the same. But just like you said, Mike, about maybe doing a double check before you sign off on it, I even notice it will mess up a few keywords here and there that I tend to spot and then put a correction in, but there may be others that I don't catch. So you won't get everything perfect in these, but, of course, we cannot, you know, overstate how much time this can save, especially if you're gonna do this routinely to multiple languages. I think that's a huge win for accessibility, certainly, depending on your needs. I could see a use case where maybe you have an application, whether Shiny or something else, and the user's doing a bunch of stuff, and then you want them to download or reproduce a report of those findings, hence a Quarto document or a markdown document, whatever have you. You could plug something like this in and have that report available in multiple languages. There are lots of automation ideas you could have at play here.
[00:28:49] Mike Thomas:
Very, very cool. Yeah. A little radio button allowing the user to pick which language they wanna download the report in.
[00:28:56] Eric Nantz:
Yeah. I think, yeah, the possibilities once you get your hands on these tailored models, there's just so much fun stuff we can do with it. So credit to Frank here for another awesome use case of both automation and Quarto. A lot of nice solutions here. Gotta love it. Last but certainly not least in our highlights episode here, we've got another great use case of, speaking of cleaning things up sometimes, yeah, yours truly's code sometimes needs to get cleaned up a little bit. I know, Mike, you write perfect code every time. Right? Right? I wish. You wish. Yep. You and me both. Well, you could manually look at what you've messed up, which, again, I often do, and make the corrections, but there are ways that you can have something else scan your code and make your life a bit easier.
And that, again, is our last highlight for today, another great blog post coming from the one and only Maëlle Salmon, who, again, has been a frequent contributor to the highlights for a very long time now, where she talks about a really great way to get started with using lintr, the lintr package, to get your code base in tip-top shape. And I admit, when I first heard of linting, I didn't know heads or tails what this was all about. Now I'm starting to come around to it, but this is a post I wish I had seen years ago, when I started to see this mentioned, or saw this magical stuff being used in other people's dev sessions, styling their code with a keystroke just so it's perfect. And I'm like, how does that even work? Well, now we're gonna figure out how that works. So, again, the lintr package is what drives this in the R ecosystem, but many languages have equivalents in their package ecosystems.
The first step is you need to tell lintr what it's going to do, and that is through a configuration file. And that is a file named .lintr, so it's by default a quote unquote hidden file. But if you put that at the root of the project that you're gonna use the linter on, you can then put different options in there. And you can start by letting lintr do everything it wants to do, and that's where, in the snippet she has here in the post, there is a function called linters_with_tags, and tags is an optional parameter. If you make it NULL, it's gonna use everything, every check that it wants to do. And then also the encoding as well, which most of the time you can make UTF-8. Sorry for any Windows users that have to do something different with that, but that's usually what I stick with.
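Putting that together, a minimal .lintr along the lines Maëlle describes might look like this (a sketch of the defaults discussed above, not her exact file):

```
# .lintr -- place at the root of the project (a plain-text config file)
linters: linters_with_tags(tags = NULL)  # NULL tags = run every available linter
encoding: "UTF-8"
```

With that file in place, running the linter from the project root will pick the configuration up automatically.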
And then once you have that, if, say, you're writing a package, like in the use case of this post, and you wanna lint it, it's just lintr::lint_package(). And then it's gonna depend on the environment you're running in. Say, in the RStudio IDE, you're gonna get another tab appear next to your code or console where it'll highlight all the different issues that the linter has identified. And, yeah, yours truly sometimes has a long list of these, and I have to parse through it a little bit. So now you know what the issue is. What are you gonna do about it? So Maëlle outlines a few points here. Mike, why don't you take us through what you might wanna do once you've found the problems?
[00:32:45] Mike Thomas:
Yeah. Once you've identified, you know, some of the problems, one example that Maëlle provides here is maybe you have a function that is reading a super long path. And hopefully, you know, we're not doing too much hard coding of a specific path on our machine; we're using things like the here package to be able to sort of build out that path, you know, relative to anyone else's system who might be running our code. But you can break that up: instead of supplying that path directly within the function argument, you could define that path as a variable first and then pass that variable into the following function that is going to, you know, access that particular path or apply the logic against that path. So it's interesting things like that, you know, some simple stuff that you can integrate.
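A sketch of that refactor (the file names here are made up for illustration):

```r
# Before: a long hard-coded path crammed into the call, blowing past 80 chars
# data <- read.csv("C:/Users/me/projects/analysis/data/2024/cleaned/obs.csv")

# After: build the path first (here::here() resolves it relative to the
# project root, so it works on collaborators' machines too), then pass
# the variable into the reading function
library(here)

obs_path <- here("data", "2024", "cleaned", "obs.csv")
data <- read.csv(obs_path)
```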
One of the nice things about this linting functionality is that, and I know that I always have, you know, particular edge cases in my code where maybe it's just not possible to get this line down to 80 characters, for example. Like, it's not feasible. I would have to use, you know, glue or paste statements and it would start to get ugly, and it's, you know, it's only 83 characters and it just really sort of makes sense to leave it the way it is. So if I wanna skip the linting for a particular line, you can actually add this string called nolint, plus the name of the linter exclusion, as a comment on that specific line of your R file, I believe.
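For the record, the exclusion comment goes on the offending line of the R source file itself (project-wide exclusions can instead live in the `exclusions` field of `.lintr`); a quick sketch:

```r
# Suppress every linter on this one line with a trailing comment:
x <- some_unavoidably_long_call(alpha, beta, gamma, delta)  # nolint

# Or suppress just one named linter, leaving all the others active:
y <- another_long_call(alpha, beta, gamma)  # nolint: line_length_linter.
```

The function names here are placeholders; only the `# nolint` comment syntax is the part lintr recognizes.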
One of the cool things as well about the lintr package, if I remember correctly, is that you can sort of set up custom linting, you know, based upon the styling guidelines that you might have internally. So, you know, depending on if you like your arguments written in wide format versus long format, there's only actually one correct answer there. It's long format, but we could have a whole podcast about that sometime soon. So you can do interesting things like that. And then, you know, one of the, I think, final really nice things as well is the ability to, you know, put this all in a GitHub Action for CI/CD purposes, where you can specifically lint just the changed files, which is nice so that you're not running this CI/CD and spending your GitHub Actions minutes on your entire package repository, or your entire R repository if it's not necessarily a package. You can just run your linter against the new stuff as part of your CI/CD process and then get those warnings or corrections, you know, during that pull request, before things end up in the main branch of your repository and headed to prod. I think that's really interesting. It's something that we haven't leveraged at Ketchbrook before. You know, most of our CI/CD is just around either rebuilding the pkgdown site or running unit tests, but linting, I think, is a really cool additional use case for, you know, having these automated checks as part of your continuous integration workflows. And I really appreciate Maëlle calling out the ability to introduce linting as part of your CI/CD process, and particularly only linting the changed stuff, because that's something that we struggle with as well on occasion: using up a lot of GitHub Actions minutes on code that's already been checked before.
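Locally, you can approximate the "only lint what changed" idea with a few lines of R. This sketch assumes git is on your PATH and your default branch is named main; in CI, the r-lib/actions repository ships a ready-made lint-changed-files example workflow that does the equivalent for you:

```r
# List the .R files that differ from main, then lint only those
changed <- system2("git", c("diff", "--name-only", "main", "--", "*.R"),
                   stdout = TRUE)

# One lint result object per changed file
results <- lapply(changed, lintr::lint)
```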
[00:36:13] Eric Nantz:
Yeah. One interesting thing about, you know, the relatively newer IDEs out there, or new uses of IDEs out there, is that even just writing R code itself, depending on your settings, I'm thinking, of course, of the VS Code R extension, it will, by default, run lintr already, and it will already highlight in your code where you're, like, past the 80-character width, or other things, like where you're referencing a variable that hasn't been defined yet. And I admit there are times where, like, I don't wanna see that right away. I just want to see that, like, maybe when I'm close to, like, my cleanup stage. So there are ways, as Maëlle's outlined here, that you can turn certain checks off. I'm still getting to know, like, the best use cases of that for me. I admit I haven't put it in GitHub Actions yet because, I don't wanna say I have trust issues, but it's like, something that's gonna edit your code? You're kinda a little nervous, like, is it gonna get it right? We just saw with the translation thing, it might not get all those translations correct. But maybe I need to trust more. Trust the process, as the old sports cliché goes.
[00:37:21] Mike Thomas:
Me too, Eric. Yeah. It's hard during development when you see, like, you know, in VS Code, the Problems panel, right, just loaded down with all sorts of problems with the linting of your code, and it can, you know, cause you to, I guess, spend some unnecessary time at the beginning of the process as opposed to doing that final cleanup once things are in a better state. But maybe that's just my workflow, and maybe it's actually helpful to rectify those things up front. So, to each their own.
[00:37:52] Eric Nantz:
Yep. Exactly. I still remember doing those Twitch livestreams back in the day, and suddenly I'd boot up VS Code and this file has, like, 10 or 20 of these squigglies on their lines. I'm like, oh, the poor viewers are seeing my sloppy code. Oh, no. So then I quickly turn that linter off, and it looks like everything's perfect. Wink wink. Exactly. Not really. To each their own, exactly. But, yeah, as always, the tooling is there. It's your experience, right, how you wanna tweak it, but it's, again, great that we have all this at our disposal for sure. And we have a lot at our fingertips, if you will, when you look at the rest of the issue that Sam has curated for us. And we'll take a couple of minutes here for our additional finds. And a huge congratulations must go out to my fellow members of the R Consortium Submissions Working Group, because the R Consortium Submission Pilot 3 has officially been approved by the health authorities at the FDA.
This is monumental. In this particular pilot, this was looking at the ways of using R to create what in the pharma industry we call ADaM datasets. That's a specific format that is kind of longitudinal in nature, but it's often a key intermediate layout that is used to populate the tables, figures, and listings that often go into our clinical study reports. So, again, a great way to show that, yes, we can use R in many aspects of the clinical submission process. This was certainly a key focus of my posit::conf talk on the efforts of Shiny and WebAssembly. But again, we're looking at all the different parts of a pharma submission process.
And like always, all the materials are reproducible. All the materials are on GitHub. We wanna make sure it benefits the entire industry. And again, we invite you to check out the blog post where it's got all the key contacts. I've been involved with this, so my thanks to everybody that led the project, from the working group side and the regulator side. We could not do this without them. So, again, major congratulations. And that means I'm up next, so to speak. I'm on deck for Pilot 4, so I'm super excited.
[00:40:24] Mike Thomas:
Congratulations to everybody involved, Eric. That is monumental, really exciting to see, especially in the industry that you're working in, really adopting R so heavily and really sort of pushing the language to be able to change the world in a lot of ways. So that's fantastic. Really excited for you. One additional find that I had is what looks like a, you know, Shinylive app, potentially, that's hosted by Sharon Machlis that she created, and it leverages the rtoot package, which interacts with the Mastodon API. And what she has is a nice little interactive table here, it looks like DT, that lists all of the accounts on Mastodon that have made a post with the #RStats hashtag at least twice in the last 30 days. So there are some familiar names on here.
I see Dirk Eddelbuettel, Steven Sanderson, Luke Pembleton. I see a lot of folks that I follow in the R data science community as well. So it's a really cool little app to be able to explore, and I believe congratulations go out to Sharon as well on her recent retirement, but it seems like she is not retiring from doing continued R and data science exploration.
[00:41:48] Eric Nantz:
Yeah. I give a lot of credit to her. She's always been very enlightening in her work to us all, and she seems to be enjoying her next stage of life. And she was recently on the Data Science Hangout with Rachael and Libby this past week. So, hopefully, the recording of that will be out soon for those that weren't able to tune in live. But, yeah, I always like to see these great resources shared. Again, taking advantage of technology, things like, again, compiling Shiny to run in a web browser. You cannot have it better than that. And, exactly, there's a whole bunch more to choose from in this issue. We wish we could talk about it all, but we've got our day jobs to get back to. But we're gonna leave you with our usual: where do you find all this? It's at rweekly.org.
The link for the current issue is always at the home page, and, of course, the archive is available too if you wanna check out all the back catalog as well. There's a search bar if you wanna search for specific topics. And this project is driven by the community, so we invite you to share that great resource you found online, whether you wrote it or you found someone else's great resource. Please send us a pull request, which is linked at the top right corner at rweekly.org. All markdown, all the time. You won't need a fancy API to translate that. Just put in your link; you are all set to go. We have the template right there for you. And as well, we love to hear from you in the audience, and we've got a few ways of doing that. You can get in touch with us via the contact page, which is linked in the episode show notes in your favorite podcast player that you're listening on. You can also send us a fun little boost if you have a modern podcast app as well.
You can also get in touch with us on the social medias, and the aforementioned Mastodon is where you'll find me the most. I am at rpodcast at podcastindex.social. You'll also find me on LinkedIn, just search my name, and you'll find me there. And on the X thingy, occasionally, I've got @theRcast. Mike, where can the listeners find you? Sure. You can find me on Mastodon at mike_thomas at fosstodon
[00:43:54] Mike Thomas:
dot org, or you can find me on LinkedIn. If you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to.
[00:44:03] Eric Nantz:
Very good. And, yeah, you've got a lot of things cooking, man. It must have taken a lot of time to write all those Quarto reports. Great job nonetheless.
[00:44:10] Mike Thomas:
Thank goodness for parameterization.
[00:44:13] Eric Nantz:
Let's just put it that way. Oh, yeah. That was a hot topic on the last episode if you wanna listen to that as well. Yep. I've been using that a lot in my day job too. I cannot live without it. So we can't parameterize everything in our life, but we can parameterize reports. So with that, we're gonna close up shop here. Again, great to have you back in the saddle once again, Mike. It's great to not have to do this alone for a third week in a row.
[00:44:40] Mike Thomas:
No. I'm back for a while now. That's awesome.
[00:44:43] Eric Nantz:
Awesome. Yeah. Exactly. So we're gonna close up shop here, and that'll do it for episode 177 of R Weekly Highlights. And we'll be back with episode 178 of R Weekly Highlights next week.
[00:01:07] Eric Nantz:
Yeah. I survived, but I think the listeners will agree that things are more correct now with you here. But, yeah, back when... it was a couple episodes ago, I gave my kind of conference experience. I wanna make sure you got some time to talk about what you thought about posit::conf and anything you wanna share with the audience about what you took away from that. Sure. Well, it's always just so
[00:01:30] Mike Thomas:
sort of invigorating to get to be around other, you know, data scientists and and like minded folks and listen to talks and just connect and and collaborate and be in that community face to face together. So I I thought they did a great job putting on the conference.
[00:01:46] Eric Nantz:
I was able to see some of my team members at Ketchbrook that I actually hadn't seen in person before. Yeah. Which is really special. And really, I was able to be at the table when you saw him for the first time. That was awesome. Yeah.
[00:01:58] Mike Thomas:
That's right. Shout out Ivan. And it was a fantastic experience. You know, some of the highlights for me, I guess, on the Shiny side were, you know, really Joe Cheng's talk about leveraging, you know, large language models within your Shiny app, that are able to essentially put together a DuckDB query, show you that DuckDB query just from natural language, and return the results on screen to you, which was pretty incredible. I think we have probably no less than a billion opportunities to incorporate that into some Shiny apps that we have currently. But, you know, there were a bunch of fantastic talks. I think it's easy to take for granted all of the work that goes into putting together these talks, and how hard folks work to be able to put these presentations together for us. So just a big shout out to everyone that participated and gave a talk this year and was able to provide us with some fantastic content. So I had a great time. I got to see a little bit more of Seattle as well than I had seen in the past. Sometimes at these conferences, it's tricky because, you know, you're just sort of glued to the conference and sometimes it's hard to make it out of the hotel, but I was able to go out with a few clients as well, and finally was able to see the big Pike Place Market, I think, downtown there, which was cool, and did a little walking around and enjoyed getting out to the West Coast and seeing Seattle. So we'll see what, I guess, Atlanta brings next year.
[00:03:30] Eric Nantz:
That's right. You're going to Hotlanta next year. So it'll be my first time there. But, yeah, it's always funny trying to get a read of the room whenever they make those announcements. It's always, you know, mostly cheers, sometimes a couple groans, but either way, it should be a good time nonetheless. I don't think there were any boos, so that's good. Yeah. That's true. Yeah. No heat, so to speak, from that announcement. But, yeah, that will be here before we know it, because I know the months go by fast. But, yeah, it was, again, terrific to see you once again and just to hang out with you a bit. And, yeah, like you said, I was pulled in many different directions, so I didn't get to see as much of Seattle as I would like. But who knows? Maybe I'll be back there next year for other events or whatnot. But, nonetheless, it was awesome. And, yeah, I hope that the hype I put around Joe's Shiny announcement was worth it, because, for the listeners, I had a sneak preview of that before Joe made that presentation. And poor Mike here was just asking me, any hints, any hints, and when's it coming? When's it coming? It's like, stay tuned for Wednesday. And, yep, it seems like it didn't disappoint.
[00:04:37] Mike Thomas:
Yep. For anybody that needs to keep information under wraps, Eric is a great secret keeper.
[00:04:44] Eric Nantz:
When I'm told to zip the lips, I know how to zip the lips, usually. Usually. But this time, I honored good old Joe on that one. But, nonetheless, yeah, amazing experience. And, you know, we gotta get to business here. We've got our fun weekly issue to talk about here. This has been curated by Sam Parmar, who was also at posit::conf. I got a chance to hang out with him multiple times, and I actually had some nice dinner with him one of the nights too, just walking around the Seattle area, but he and I always have lots to talk about. And for this, he had tremendous help, as always, from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and other resource suggestions.
We're gonna go technical here, Mike, on this first one, in areas that, admittedly, I kinda take for granted, but I guess could bite you if you're not careful. Now, R is one of those you might call interpreted languages, which means that when you launch R, whether it's through the console directly, through its own GUI interface on Windows, or through an IDE like, you know, RStudio or the new Positron or what have you, you're gonna get a REPL, which is kind of what the R console looks like. But if you don't know what a REPL is, and believe me, even when I first got in, I had no idea what that is: that is the read-eval-print loop. And both R and Python and other languages come with this so that you can type a command, hit enter, and then you get the result.
Sometimes when you have a long command, you can actually space this out to multiple lines if you want and just hit enter, and you'll get a little plus sign, not plus as in addition, but plus for the new line that you're adding to that console input. And you can do this as much as you want, albeit you might get tired if you do this a lot in the console directly. So most of the time, we end up writing our R scripts or Python scripts, depending on which language you're in. And that way, you can run the whole file, or you could send certain bits of that file into the console itself. Apparently, recently, there was a user that had sent a lot of input to this console, or to the interpreter itself, and ended up hitting a limit that R has had historically for a while of 4,096 bytes. And I did a little crack research before this. Apparently, a byte is a character, so it's kind of a one-to-one translation, which means that this user was sending a string of commands that was over 4,096 bytes in that session.
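As a quick illustration of that continuation behavior, here is what an incomplete expression looks like at the interactive R console: the prompt switches from `>` to `+` until the parser has seen a complete statement:

```r
> total <- sum(1, 2,
+              3, 4)   # the `+` prompt: R is still reading this expression
> total
[1] 10
```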
[00:07:37] Mike Thomas:
It might not actually be that long, though. Right? If it's a character, it's not like it's 4,000 lines. How many characters in a... if you've got your 80-character width, right? It's what we try to adhere to. Well, yeah. For styling, sure. Yeah. For styling purposes. So in a regular set of 80-width lines, that's gonna be... let me do some quick math here. Oh, boy.
[00:08:01] Eric Nantz:
Dangerous.
[00:08:02] Mike Thomas:
51 lines?
[00:08:05] Eric Nantz:
Well, there may be something else going on here, and this is where, again, we're gonna get into the weeds a little bit. The author of this post, which comes directly from the R blog, is Tomas Kalibera, who is one of the members of the R Core team, and he apparently has been diving into this issue as a result of a report that went on the R-devel mailing list. And, apparently, now there have been advancements in the development version of R to help get to, quote unquote, unlimited lines that you can throw through the parser. So I'm gonna give you my take on how this works. There is a lot of machinery here that, admittedly, I've not gotten into as much. But one has to wonder just what process was generating that massive set of input that was causing all these bytes to hit the limit. And, again, maybe I have it completely wrong on how a character translates to bytes, but, nonetheless, let's talk about how, when you send a command to R, it actually gets in there.
So, first, Tomas is quick to point out that the parser is not the issue here, because the parser already has the functionality to look at code with input of technically unlimited length that could span multiple lines. And one concrete example that he highlights here is an if statement. Because with an if statement, you need to look at that entire if statement, everything going into it, to know what to do in terms of the rest of the console input. So that is always going to have to be parsed at, in theory, an unlimited length, depending on how extensive that is. So that's where that needs to happen.
But it's not just the parser at play here. There's also the REPL itself, which, again, as I mentioned, is getting input from either the user typing or maybe copy-pasting that code into the console. And there is, under the hood, an API function in R itself. They call it R_ReadConsole, which is modeled after a C function called fgets, again, outside of my wheelhouse here. But, apparently, that has the capability to know whether it's getting one massive line or if it's getting multiple lines separated by, say, a carriage return. Hence, like I mentioned earlier, when you type a line, hit enter, but you're not done yet, you'll get that plus sign in the console. It's gonna know how to interpret that.
But that buffer, basically that empty space, if you will, that the REPL uses to grab all this input, has had a constant definition for the limit length, which has been 4,096 bytes apparently since 2008. And before that, it was actually 1,024, so it was much smaller back then. So, apparently, there were ways up until recently where you could overflow this REPL buffer. I don't think that's to be confused... I'm abusing terminology here, but I often hear about buffer overruns when code crashes. I don't think this is quite the same, but, nonetheless, the consequence here up until recently is that if you had an input that expanded past this 4,096 limit, it would just truncate it. So it's as if you just didn't type in the rest of your code, so to speak.
So that, again, has been a gotcha, apparently, that some people have seen, but now it seems like that is a thing of the past in the development version. So there are improvements that have been made, but there are some nuances to keep in mind in terms of where you're actually interacting with R. So on Windows there has historically been, since my early days of R, the R GUI that comes when you install R on a Windows machine. It looks like its own mini IDE, if you wanna call it that. It's got its own console in there, but there were issues that have been identified with it that made it so that that buffer size was getting hit more often than it should.
And now they've fixed that by giving kind of an intermediate interface between the user typing in, hitting enter, and that then going to the REPL for interpretation, to help account for lines that could exceed this limit. So, apparently, that's landing soon in production R. We don't know when yet, but it looks like there have been some improvements on that side of it. And then Rterm itself, another thing within Windows, that's like the console itself, is getting enhancements as well to help circumvent some of these limitations.
Again, I'm not a Windows user anymore, so I don't know exactly if that will affect me day to day. My guess is not. And then for those out there on Linux, and, you know, Unix in general, there's also a factor here, because they use the readline library whenever you start typing in the console and then send that to the R interpreter, and that has had limits before. You may run into that, give or take, depending on what OS you're on. I mean, macOS also encounters this. But in conclusion, Tomas says that there are ways you can defend yourself against this. I think the biggest way is to put everything in a file, usually, if you have a huge massive chunk of code.
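In other words, rather than pasting thousands of characters at the prompt, write them to a script and let R read the file directly, which sidesteps the console input buffer entirely. A tiny self-contained sketch (the file name is made up):

```r
# Save the long-running code to a script once...
writeLines("big_total <- sum(seq_len(100))", "long_chunk.R")

# ...then execute it without ever pasting it into the console
source("long_chunk.R")
big_total
#> [1] 5050
```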
Otherwise, it looks like, if you are grabbing this code from, say, another process that's generating it... I wonder if it was kind of akin to when you have JavaScript and you can make what's called a minified JavaScript file, where it takes away all those carriage returns, but then you get this, like, massive wide text that the JavaScript parsers know how to deal with. I wonder if something like that was going on with this user, and they were just getting this massive, massive string of R code that was causing them to hit these limits. Again, speculation on my end, but it looks like these limitations are hopefully gonna be a thing of the past if you've encountered them before.
So, again, we invite you to check out the blog post, because I've only gone high level on what Tomas has outlined here. But if you find yourself in this situation, then, yeah, this post may help encourage you that the new versions of R will have a more unlimited limit, if you will, as opposed to this more truncated 4,096
[00:15:10] Mike Thomas:
limit. Well, I tried to do it justice anyway. Like, what are your thoughts on this? No, Eric. This is a tricky one, and I think you boiled it down for us as best as possible. I think it, again, reminds me of, like, how much actually goes on under the hood when you execute an R function. Right? There's a lot of wizardry to compile that code and to get it to execute, right, the actual C code, right, that is getting run under the hood. And a lot of work goes into making this all work perfectly across different operating systems as well. So it's pretty incredible how easy it is for us to take for granted how well that works. And this is a limitation I haven't run into myself necessarily before. I've, I think, seen some strange things when I try to copy and paste, you know, large amounts of code into the console, and I certainly try to avoid that, especially when I'm jumping between IDEs, and RStudio versus VS Code and dev containers and things like that. I'll run into some strange issues. So I think, as you said, you know, a best practice to keep in mind is to just, you know, contain everything in scripts. If you have something really long that you want to execute, you can source that script if you want, or just highlight the code that you want to run and execute that code itself, as opposed to, you know, copying and pasting large amounts of code directly into your console.
This also brings me back to my early days of R and R GUI, which is something that I had not opened up in quite a long time, and it brought back some memories pretty quickly, because that is admittedly where I started in R. I'm not even sure if RStudio existed
[00:16:58] Eric Nantz:
when I was getting started in R, not to date myself too much, but, at least... Speak for yourself. My goodness. You know what IDEs I'd used in the past? And
[00:17:09] Mike Thomas:
and, you know, when you do install R, I was looking at my local R installation and seeing, you know, Rgui.exe, Rterm.exe, and, you know, these are, I guess, things hanging around that we sort of take for granted. Right? That are packaged in with our local installation of R. But if you're using RStudio, or if you're using Positron nowadays, or VS Code, or whatever it is, you know, these sort of legacy items... I would imagine that Rterm is still applicable, but probably R GUI isn't applicable for folks using, you know, one of those IDEs. These legacy executables and applications, if you will, are still there, you know, if you wanted to get some nostalgia; pop into them and try them out a little bit. I remember, after I found RStudio, I started a new job and was working with somebody who was writing a little bit of R code, but we were sort of siloed, so we weren't working together very well. And I went over to his desk one day, and he was using R GUI and didn't know that you could write an R script and, like, save an R script that you could rerun later. So he thought that you just had to, you know, save commands in, like, a Notepad text file or something like that, and that was the way, I guess, that his professor taught him in university; he never showed him that you could save an R script, which is a little scary. But it makes me think about how far we've come. But, really interesting blog post. You know, it sounds like this is potentially a solved issue in the coming releases of R, and maybe something that we don't necessarily have to dive too deep into the weeds on in the future, or worry about, because they're handling so much of this behind-the-scenes stuff for us.
[00:18:53] Eric Nantz:
Yeah. Again, you know, all the machinery behind the hood here, or under the hood, of R. Yeah. We definitely need to appreciate it more, and the fact that we can have modern tooling on top of it and still leverage, you know, the language in many different ways. Yeah. It's a testament to how far things are going, and, yeah, development is not slowing down anytime soon. So, yep, there are a lot of things that you can go down rabbit holes on, and certainly this is one of those. And, yeah, speaking of having, like, modern interfaces on top of R and whatnot, of course, one of the great interfaces or new add-ons that many are using in terms of reproducible research and the like has been the Quarto ecosystem, the Quarto publishing engine for documents that can be coded up in markdown, but then have code embedded inside, whether it's R or Python and whatnot.
And one of the best things about Quarto has always been the ability, of course, to write in markdown. I mean, I always joke, kind of at the end of each episode: if you can't learn how to write markdown in 5 minutes, Yihui would give you $5. I remember that presentation from years ago. But, you know, markdown, of course, can be written in any language. Right? And we have a very diverse community in data science, as we all well know. So what if you're in a situation where you've written this great Quarto document, and now you want to send it to collaborators that have a different, you know, spoken language or primary language?
Well, this next post is gonna give you some insights on how you can do just that with the advent of technology. This blog post is coming to us from Frank Aragona, and he talks about how he was able to translate Quarto, and indeed any markdown file, into any other language. So there are some services in play here that can account for this, and he did take a look at, you know, the existing services out there. Unfortunately, a lot of them, in his opinion, required API access that ends up needing your credit card even if you're not planning on paying for it, services from Google and others like DeepL or whatnot.
But he found another avenue called Hugging Face transformers, which does provide APIs to get pretrained models that are tailored for translation. And so now the key is, well, how do you actually use this thing? There is an existing R library for Hugging Face, but it required Conda to install some Python libraries. And like the author here, me and Conda don't quite get along, especially in my work environment. So, of course, I look for ways to simplify that. He ended up coding up a more friendly wrapper to install these packages via reticulate and the pip install interface for Python.
And, of course, this does necessitate reticulate. Through his package, it'll download the Hugging Face transformers Python bindings, which, again, I'm not as familiar with. But once that's in there, he's got an R package called translatemd. He's got the code in the blog post on how to get that onto your system once you have reticulate ready to go, and it's gonna spin up a virtual environment for you, which in the Python world is like what we have with renv for containing your dependencies. He calls it r-transformers, but you can rename it whatever you like.
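To make that setup concrete, here's a rough sketch of how a reticulate-managed virtual environment like this is typically created. The reticulate functions are real, but the exact steps translatemd automates may differ, so treat this as an illustration rather than the package's actual internals:

```r
library(reticulate)

# Create a dedicated virtual environment for the translation models
# (the post uses the name "r-transformers"), then install the
# Hugging Face libraries into it via pip.
virtualenv_create("r-transformers")
virtualenv_install("r-transformers",
                   packages = c("transformers", "torch", "sentencepiece"))

# Point reticulate at that environment before doing any translation.
use_virtualenv("r-transformers", required = TRUE)
```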
And then once you have that ready to go, you can feed in a Quarto document, and it will basically take a multi-step approach. It'll parse the Quarto document, apply the translation, and then, from that kind of tidy form of the parsed translation, rewrite a new Quarto file, a new document in the translated language. He's got a snippet of how this looks when you start parsing it, and much like I've talked about in previous highlights on parsing code files, you do see the different areas that have been parsed, whether it's the YAML, inline text, headings, more inline text, blocks of code, or whatnot.
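As a hedged sketch of that parse-translate-rewrite flow: lightparser exposes functions for splitting a Quarto file into a tidy tibble and recombining it, and the translation step in between is shown here as a hypothetical translate_text() stand-in for whatever translatemd actually calls under the hood:

```r
library(lightparser)

# Parse the .qmd into a tidy tibble: one row per YAML block,
# heading, inline-text section, or code chunk.
parsed <- split_to_tbl("report.qmd")

# Hypothetical step: translate only the prose rows, leaving the
# code chunks and YAML untouched.
# parsed$text <- translate_text(parsed$text, to = "es")

# Write the (translated) pieces back out as a new Quarto document.
combine_tbl_to_file(parsed, output_file = "report-es.qmd")
```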
And then once you unwrap that and translate it, you can see side by side in the post how he's gone from the English version to, I believe, the Spanish version. That's looking pretty good, and he's got a little picture at the bottom of the post that shows the two documents side by side. But like everything, Mike, there are a few gotchas here that you might want to be aware of before you translate. Why don't you walk us through some of the little bugs he's found here? Yeah. There's a couple of bugs it seems like you might have to manually
[00:24:21] Mike Thomas:
adjust for, and some of them are somewhat evident just from the two screenshots, which are really cool. I think an awesome feature of this blog post is showing on the left the original English Quarto document, if you will, and on the right side the equivalent Spanish-translated version of the same Quarto document. So it's really cool to see these two things side by side. One of the things that unfortunately happened during the translation is that a particular section header that had a single pound sign, to mark it as sort of an h1 header, lost that pound sign (or the hashtag, for the younger audience out there). So you can see on the right side that the bold heading that would correspond to the one on the left doesn't exist in the Spanish translation. It's just something that you would have to go in and fix by adding that pound sign yourself. Not a huge deal in a small document; in a big document it would be a little painstaking to have to go through and do that in a lot of different places.
We just rolled out nine Quarto reports that were due on the 31st to our clients, and each one of them was about 77 pages long. In that case it would have been difficult for us to do, but if you have a small little Quarto report, it shouldn't be too big of a deal to go in and add those headers. And there's always control-F too, right, so we can find the particular header text, go in there, and just stick that pound sign in front of it. The lightparser package, which is leveraged here, has a known bug as well with Quarto chunk YAML parameters. In particular, if you have a chunk that is specified as `eval: false`, such that the code shows up but doesn't actually get evaluated, it translates that into `eval: no`, the n-o, instead of `eval: false`. So it looks like the lightparser package is working on rectifying that issue.
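For anyone who hasn't hit it, the chunk-option issue described above looks roughly like this (an illustration of the reported bug, not a reproduction of it):

```
# Original Quarto chunk option (code is displayed but not evaluated):
#| eval: false

# What reportedly comes back after the lightparser round-trip,
# a YAML synonym rather than what was originally written:
#| eval: no
```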
I haven't looked into it yet. Frank, the author of this blog post, says that hopefully this is fixed, but maybe we can do a little follow-up next week to take a look at whether or not that bug still exists. Frank also wraps up by letting you know that, because we're potentially automating a lot here, it's probably a good idea to go through the translated document with a fine-tooth comb and make sure that there are no other bugs. In situations like this, it can handle a lot for you, maybe upwards of 90%, but there are edge cases depending on what you're trying to do in your Quarto document. We do a lot of crazy stuff with include chunks, child documents, and things like that. So you never know if there are particular edge cases that this translator has not been able to solve yet. I think it's a good idea to take your time and look at the output. Know that a lot of the manual work you would have had to do has been taken care of for you, but there might still be some little spots you have to tune up.
[00:27:36] Eric Nantz:
Yeah. It's not a direct one-to-one analogy here, but I've been looking at things like this even for the production of this very show. The podcasting service that we use has a functionality where it'll produce a transcript for our episodes. Right? It's kind of like a translation from audio to text, so it's not quite the same. But just like you said, Mike, about doing a double check before you sign off on it, I've noticed that it will mess up a few keywords here and there that I tend to spot and then put a correction in, but there may be others that I don't catch. So you won't get everything perfect with these, but, of course, we cannot overstate how much time this can save, especially if you're gonna do this routinely for multiple languages. I think that's a huge win for accessibility, certainly depending on your needs. I could see a use case where maybe you have an application, Shiny or something else, the user's doing a bunch of stuff, and then you want them to download or reproduce a report of those findings. Hence a Quarto document or a markdown document, whatever you have, you could plug something like this in and have that report available in multiple languages. There's lots of automation ideas you could have at play here.
[00:28:49] Mike Thomas:
Very, very cool. Yeah. A little radio button allowing the user to pick which language they wanna download the report in.
[00:28:56] Eric Nantz:
Yeah. I think, with the possibilities once you get your hands on these tailored models, there's just so much fun stuff we can do with it. So credit to Frank here for another awesome use case of both automation and Quarto. A nice solution here. Gotta love it. Last but certainly not least in our highlights episode, we've got another great use case, speaking of cleaning things up sometimes. Yeah, yours truly's code sometimes needs to get cleaned up a little bit. I know, Mike, you write perfect code every time. Right? Right? I wish. You wish. Yep. You and me both. Well, you could manually look at what you've messed up, which, again, I often do, and make the corrections, but there are ways that you can have something else scan your code and make your life a bit easier.
And that, again, is our last highlight for today. Another great blog post coming from the one and only Maëlle Salmon, who has been a frequent contributor to the highlights for a very long time now, where she talks about a really great way to get started with the lintr package to get your code base in tip-top shape. And I admit, when I first heard of linting, I couldn't make heads or tails of it. Now I'm starting to come around to it, but this is a post I wish I had seen years ago, when I started to see this magical stuff being used in other people's dev sessions, or styling their code with a keystroke just so it's perfect. And I'm like, how does that even work? Well, now we're gonna figure out how that works. So, again, the lintr package is what drives this in the R ecosystem, but many languages have equivalents in their package ecosystems.
The first step is you need to tell lintr what it's going to do, and that is through a configuration file named .lintr, so it's by default a quote-unquote hidden file. If you put that at the root of the project that you're gonna run the linter on, you can then put different options in there. You can start by letting lintr do everything it wants to do, and that's where, in the snippet she has in the post, there is a function called linters_with_tags(), where tags is an optional parameter. If you set it to NULL, it's gonna use everything, every check it knows about. And then there's the encoding as well, which most of the time you can set to UTF-8. Sorry to any Windows users that have to do something different there, but that's usually what I stick with.
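As a concrete illustration, the minimal .lintr file described here would look something like this, following lintr's documented config format:

```
linters: linters_with_tags(tags = NULL)
encoding: "UTF-8"
```

With tags = NULL, every available linter runs; once you know which checks you actually care about, you can swap in a narrower set of tags or a hand-picked list of linters instead.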
And then once you have that, if, say, you're writing a package, like in the use case of this post, and you wanna lint it, it's just lintr::lint_package(). What happens next depends on the environment you're running in; say, in the RStudio IDE, you're gonna get another tab appear next to your console where it'll highlight all the different issues that the linter has identified. And, yeah, yours truly sometimes has a long list of these, and I have to parse through it a little bit. So now you know what the issues are. What are you gonna do about them? Maëlle outlines a few points here. Mike, why don't you take us through what you might wanna do once you've found the problems?
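For reference, the calls involved are one-liners (these are real lintr functions, though how the results are displayed varies by IDE, as noted above):

```r
# Lint every file in the package from its root directory:
lintr::lint_package()

# Or target a single file or a directory instead:
lintr::lint("R/my-functions.R")
lintr::lint_dir("R")
```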
[00:32:45] Mike Thomas:
Yeah. Once you've identified some of the problems, one example that Maëlle provides here is maybe you have a function that is reading a super long path. Hopefully we're not doing too much hard-coding of a specific path on our machine; we're using things like the here package to build out that path relative to anyone else's system who might be running our code. But you can break that up: instead of supplying that path directly within the function argument, you could define that path as a variable first and then pass that variable into the following function that is going to access that particular path or apply the logic against it. So it's interesting things like that, some simple stuff that you can integrate.
One of the nice things about this linting functionality is that, and I know I always have particular edge cases in my code where maybe it's just not possible to get a line down to 80 characters, for example. Like, it's not feasible. I would have to use glue or paste statements, and it would start to get ugly, and it's only 83 characters, and it just really makes sense to leave it the way it is. So if I wanna skip the linting for a particular line, you can actually add this string called nolint, along with the name of the linter to exclude, as a comment on that specific line of code.
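Per lintr's documentation, that exclusion is written as a comment on the offending line itself, something like this (the long function name is just a placeholder for whatever call you can't shorten):

```r
# Suppress all linters on this one line:
fit <- some_model_with_a_really_long_interface(data, spec) # nolint

# Or suppress only the line-length check, naming the linter
# (note the trailing period in the linter name):
fit <- some_model_with_a_really_long_interface(data, spec) # nolint: line_length_linter.
```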
One of the cool things as well about the lintr package, if I remember correctly, is that you can set up custom linting based upon the styling guidelines that you might have internally. So, depending on if you like your arguments written in wide format versus long format, there's only actually one correct answer there. It's long format, but we could have a whole podcast about that sometime soon. So you can do interesting things like that. And then one final really nice thing is the ability to put this all in a GitHub Action for CI/CD purposes, where you can specifically lint only changed files. That's nice so that you're not running CI/CD and spending your GitHub Actions minutes on your entire package repository, or your entire R repository if it's not necessarily a package. You can just run your linter against the new stuff as part of your CI/CD process and get those warnings or corrections during the pull request, before things end up in the main branch of your repository and headed to prod. I think that's really interesting. It's something that we haven't leveraged at Ketchbrook before. Most of our CI/CD is just around either rebuilding the pkgdown site or running unit tests, but linting, I think, is a really cool additional use case for having these automated checks as part of your continuous integration workflows. And I really appreciate Maëlle calling out the ability to introduce linting as part of your CI/CD process, and particularly only linting the changed stuff, because that's something we struggle with as well on occasion: using up a lot of GitHub Actions minutes on code that's already been checked before.
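For the CI/CD piece, the r-lib/actions repository ships an example workflow for exactly this (lint-changed-files.yaml); the sketch below is abbreviated and from memory, so check the maintained example before relying on it:

```yaml
on:
  pull_request:

name: lint-changed-files

jobs:
  lint-changed-files:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::lintr, any::gh, any::purrr
      # The maintained example then queries the PR for its list of
      # changed files and passes only those paths to lintr::lint().
```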
[00:36:13] Eric Nantz:
Yeah. One interesting thing about the relatively newer IDEs out there, or newer uses of IDEs, is that even just writing your R code, depending on your settings, and I'm thinking, of course, of the VS Code R extension, it will run the linter by default and already highlight in your code where you're, like, past the 80-character width, or where you're referencing a variable that hasn't been defined yet. And I admit there are times where I don't wanna see that right away; I just want to see that maybe when I'm close to my cleanup stage. So there are ways, as Maëlle outlines here, that you can turn certain checks off. I'm still getting to know the best use cases of that for me. I admit I haven't put it in GitHub Actions yet, because, I don't wanna say I have trust issues, but with something that's gonna edit your code, you're a little nervous. Like, is it gonna get it right? We just saw with the translation thing that it might not get all those translations correct. So maybe I need to trust more. Trust the process, as the old sports cliche goes.
[00:37:21] Mike Thomas:
Me too, Eric. Yeah. It's hard during development when you see, like, in VS Code, the Problems panel just loaded down with all sorts of linting problems with your code, and it can cause you to spend some unnecessary time at the beginning of the process as opposed to doing that final cleanup once things are in a better state. But maybe that's just my workflow, and maybe it's actually helpful to rectify those things up front. So to each their own.
[00:37:52] Eric Nantz:
Yep. Exactly. I still remember doing those Twitch livestreams back in the day, and suddenly I'd boot up VS Code and this file has, like, 10 or 20 of these squigglies on the lines. I'm like, oh, the poor viewers are seeing my sloppy code. Oh, no. So then I quickly turned that linter off, and it looks like everything's perfect. Wink wink. Exactly. Not really. To each their own. Exactly. But, yeah, as always, the tooling is there. It's your experience, right, how you wanna tweak it, but it's, again, great that we have all this at our disposal for sure. And we have a lot at our fingertips, if you will, when you look at the rest of the R Weekly issue that Sam has curated for us. We'll take a couple of minutes here for our additional finds. And a huge congratulations must go out to my fellow members of the R Consortium submissions working group, because the R Consortium Submission Pilot 3 has officially been approved by the health authorities at the FDA.
This is monumental. This particular pilot was looking at ways of using R to create what in the pharma industry we call ADaM datasets. That's a specific format that is kind of longitudinal in nature, but it's often a key intermediate layout that is used to populate the tables, figures, and listings that often go into our clinical study reports. So, again, a great way to show that, yes, we can use R in many aspects of the clinical submission process. This was certainly a key focus of my posit::conf talk on the efforts of Shiny and WebAssembly. But again, we're looking at all the different parts of a pharma submission process.
And like always, all the materials are reproducible. All the materials are on GitHub. We wanna make sure it benefits the entire industry. And again, we invite you to check out the blog post, where it's got all the key contacts. I've been involved with this, so my thanks to everybody that led the project, from the working group side and the regulator side. We could not do this without them. So, again, major congratulations. And that means I'm up next, so to speak. I'm on deck for pilot 4, so I'm super excited.
[00:40:24] Mike Thomas:
Congratulations to everybody involved, Eric. That is monumental, really exciting to see, especially in the industry that you're working in, really adopting R so heavily and pushing the language to be able to change the world in a lot of ways. So that's fantastic. Really excited for you. One additional find that I had is what looks like a Shinylive app, potentially, hosted by Sharon Machlis that she created, and it leverages the rtoot package, which interacts with Mastodon. What she has is a nice little interactive table, it looks like DT, that lists all of the accounts on Mastodon that have made a post with the #RStats hashtag at least twice in the last 30 days. So there's some familiar names on here.
I see Dirk Eddelbuettel, Steven Sanderson, Luke Pembleton, Kelly Bodwin, and a lot of folks that I follow in the R data science community as well. So it's a really cool little app to explore. And, I believe, congratulations go out to Sharon as well on her recent retirement, but it seems like she is not retiring from continued R and data science exploration.
[00:41:48] Eric Nantz:
Yeah. Credit to her. She's always been very enlightening with her work to us all, and she seems to be enjoying her next stage of life. She was recently on the Data Science Hangout with Rachel and Libby this past week, so, hopefully, the recording of that will be out soon for those that weren't able to tune in live. But, yeah, I always like to see these great resources shared, again, taking advantage of technology, things like compiling Shiny to run in a web browser. You cannot have it better than that. And, exactly, there's a whole bunch more to choose from in this issue. We wish we could talk about it all, but we got our day jobs to get back to. But we're gonna leave you with our usual: where do you find all this? It's at rweekly.org.
The current issue is always at the home page, and, of course, the archive is available too if you wanna check out the back catalog. There's a search bar if you wanna search for specific topics as well. And this project is driven by the community, so we invite you to share that great resource you found online, whether you wrote it or you found someone else's great resource. Please give us a pull request, which is linked at the top right corner at rweekly.org. All markdown, all the time. You won't need a fancy API to translate that. Just put in your link. You are all set to go. We have the template right there for you. And as well, we love to hear from you in the audience, and we got a few ways of doing that. You can get in touch with us via the contact page, which is linked in the episode show notes in your favorite podcast player that you're listening on. You can also send us a fun little boost if you have a modern podcast app as well.
You can also get in touch with us on the social medias, and the aforementioned Mastodon is where you'll find me the most. I am @[email protected]. You'll also find me on LinkedIn; just search my name, and you'll find me there. And on the X thingy, occasionally, we've got @theRcast. Mike, where can the listeners find you? Sure. You can find me on Mastodon at @mike_thomas@fosstodon
[00:43:54] Mike Thomas:
dot org, or you can find me on LinkedIn. If you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to.
[00:44:03] Eric Nantz:
Very good. And, yeah, you got a lot of things cooking, man. It must have taken a lot of time to write all those Quarto reports. Great job nonetheless.
[00:44:10] Mike Thomas:
Thank goodness, for parameterization.
[00:44:13] Eric Nantz:
Let's just put it that way. Oh, yeah. That was a hot topic on the last episode, if you wanna listen to that as well. Yep. I've been using that a lot in my day job too. I cannot live without it. So we can't parameterize everything in our life, but we can parameterize reports. So with that, we're gonna close up shop here. Again, great to have you back in the saddle, Mike. It's great to not have to do this alone for a third week in a row.
[00:44:40] Mike Thomas:
No. I'm back for a while now. That's awesome.
[00:44:43] Eric Nantz:
Awesome. Yeah. Exactly. So we're gonna close up shop here, and that'll do it for episode 177 of R Weekly Highlights. And we'll be back with episode 178 of R Weekly Highlights next week.
Mike's posit::conf takeaways
Episode Wrapup