Fresh off an amazing experience at posit::conf(2025), the R Weekly Highlights podcast is back with episode 211! Eric and Mike share their experiences at the conference and then dive into an excellent collection of highlights. We learn about a myriad of packages for programmatically writing and parsing markdown documents, initial impressions of vibe-coding an R package to help learn Japanese, and the immense lengths the R/exams project goes to in cleaning up messy scans of exam papers.
Episode Links
- This week's curator: Jon Carroll - @[email protected] (Mastodon) & @jonocarroll.fosstodon.org.ap.brid.gy (Bluesky) & @carroll_jono (X/Twitter)
- All the Ways to Programmatically Edit or Parse R Markdown / Quarto Documents
- I Vibe Coded an R Package
- Quality Control for Scanned Multiple-Choice Exams
- Entire issue available at rweekly.org/2025-W39
- Mike's presentation slides: Building Multilingual Data Science Teams https://ketchbrookanalytics.github.io/multilingual-data-science-presentation
- Eric's presentation slides: Introducing Shinystate - Launching Shiny collaboration to new heights https://rpodcast.github.io/shinystate-positconf2025/#/titleslide
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R Weekly Highlights on Podcastindex.org - You can send a boost into the show directly from the Podcast Index. First, top up with Alby, and then head over to the R Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Twoson Hits the Road - EarthBound - djpretzel - https://ocremix.org/remix/OCR01427
- Gemini Salsa - Mega Man 3 - MkVaff - https://ocremix.org/remix/OCR00146
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 211 of the R Weekly Highlights podcast. And yes, it is true. We had to check the episode numbers. It has been a hot minute since we last talked with you. But this is the weekly, usually weekly, podcast where we talk about the awesome resources that are shared in this week's R Weekly issue at rweekly.org. My name is Eric Nantz, and, yes, I am probably the biggest culprit for why we've had the long layoff. There was a busy few weeks getting prepared for what we'll be talking about, I think, very shortly: a very eventful posit::conf. But I'm happy to be back, settling down and processing what I learned.
But thank goodness I'm not doing this alone because he's back as well, my awesome cohost, Mike Thomas. Mike, how are you today? Does it feel odd to be back into this?
[00:00:51] Mike Thomas:
Feels odd. It feels great. I've been on the road quite a bit lately. Had some conferences out on the West Coast. Had a conference, obviously, down in Atlanta at posit::conf last week, which was fantastic. I feel like I'm just giving presentations for a living and not writing a lot of code, but I did scratch and claw together an R package over the weekend while I was on the couch down with a cold. So, hopefully, that will come to see the light of day soon.
[00:01:20] Eric Nantz:
Nice. We've both been in package mode recently. So, for those who aren't aware, both Mike and I did have the good fortune of being presenters at this year's posit::conf. And I'm just gonna give the floor to you first, Mike, because your talk about multilingual data science teams had so many nuggets in there and so many words of wisdom for everybody in this situation. You had the room in the palm of your hand, but how was that experience for you?
[00:01:49] Mike Thomas:
It was a great experience, and I think, as with you and anyone else that has presented in the past or is used to giving presentations, it's all about preparation. The more you prepare, the easier it is. And I felt that I was well prepared, at least in part thanks to the team at Articulation, a contractor that Posit hires to essentially help us make sure that our presentation is put together in a timely manner and that we do a good job of organizing our thoughts. So that was a huge help. And it's very easy to speak about a topic that you're passionate about. So I enjoyed giving the presentation and got a lot of great feedback. But how about you, Eric?
[00:02:32] Eric Nantz:
Yeah. I do echo the preparation, and I do admit mine went down to practically near the wire on a certain part of it, which I'll touch on later. But I was able to, for the first time I think ever in a non-industry type of conference (and by industry I mean my day job), present something I've actually built in the open source world that has actually seen the light of day. And so I talked about my newest R package called shinystate, which takes bookmarkable state to new heights, and I think I did a pretty nice job of rounding that out along with my motivation for it. I did spend a lot of time on the messaging, because anytime you have, like, twenty minutes or less, I think it's almost harder to prepare for than if you get that nice one-hour seminar or whatever. So well received, I think. And I'm also happy to say that the package, a day after the presentation, finally hit CRAN. My first CRAN package in over fourteen years, folks. I was able to get it on by hook or by crook, and I went through that checklist probably about 50 times by the time I submitted it, but it feels good. And it also seems like a lot of people have big ideas for it, so I've gotta watch that issue tracker pretty closely now.
But, overall, great experience. And then the day before the conference, I was busy with the R/Pharma Summit. Great shout out to all my life sciences colleagues out there that attended that event. I was running around doing some audio magic here and there, but giving a presentation there as well. So by the time the conference was over, I definitely was ready for a bit of rest. Not that I got much of it, but, nonetheless, it was eventful. And, yeah, all the great conversations, hanging out with you during one of the lunches and hanging out with some other peeps. It was just as advertised. And no matter where Posit goes in terms of their direction, the people make it a worthwhile experience. So I had a blast.
[00:04:40] Mike Thomas:
Likewise. Great talk, Eric. If you haven't seen it, hopefully, when the recordings come out for the folks that purchased a ticket, you can see it today. And then a few months from now, I think it'll be on YouTube. But I am so pumped about your shinystate package. So thank you for all your work on that.
[00:04:57] Eric Nantz:
Awesome stuff. And, yeah, I can't wait to actually use it for my projects now that I have it out there. That's my biggest motivation for it. And I am actually dusting off the cobwebs of the Shiny Developer Series site. I'm gonna put a blog section on it, and I'll have a blog post about the package in the near future too, because I gotta get on the blogging train when I get a chance and follow your lead, amongst countless others that do a way better job with this than I do. Definitely not me. I'm not sure who you're thinking of, but there are plenty of people out there. I've seen them, man. I've seen them. You do a bang-up job. But we gotta talk about the awesome R Weekly issue before we go too much further here, checking my notes because it's been a minute. Our issue this week has been curated by Jonathan Carroll, another one of our longtime curators and really one of the founding fathers of this project, I must say. And as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and other wonderful suggestions.
So we lead off with the ways that you can write your markdown documents beyond the traditional way. And when I say markdown, it's almost like the English language to me at this point in terms of technical documentation. I'm writing it every day. It's going in many different places, whether it's a GitHub issue summary, a GitHub pull request note, or just my technical notes. And in the R community, we have many ways of authoring our markdown documents. I even just said the first one: R Markdown is what pioneered a lot of these efforts, and more recently the Quarto framework, amongst others as well.
It's one thing to manually write it. Right? What if you wanna programmatically generate and parse markdown content for further downstream or upstream processing? Our first highlight is just for you, because this is a really big deep dive into many different ways you can approach this programmatic perspective on markdown, and it's published on the rOpenSci blog. So, by the way, shout out to Noam Ross from rOpenSci. I was able to meet him at the conference, and it may have been our first meeting ever, but I've been talking to him online for years and years. So that was fun, to talk shop with him for a bit. But this post on the rOpenSci blog was written by Maëlle Salmon, former curator of R Weekly, along with Christophe Dervieux and Zhian Kamvar.
And this is a big one. So we'll summarize what I think the key points are that I'm taking away here. And, of course, Mike, your perspective is appreciated too. It first talks about what exactly markdown is. I hope by now most of you listening are familiar with markdown, but if you're not, there's a great intro to how you write the syntax. And there are a few different ways you can write certain bits of syntax, but it does go over what I mentioned earlier, the various ways that you can interact with markdown, especially as an R user. And, again, there are quite a few out there. And, of course, one of the biggest selling points recently is that you might have some front matter in the markdown written in typical YAML format that defines certain options for rendering or other parameters.
And, of course, you can embed programmatic code chunks using R, Python, or other languages with the special fenced chunk syntax. Again, all that's in the blog post. If you've done markdown before, you're pretty familiar with that stuff. Now, what if you are in a situation where you wanna build a template, but you wanna dynamically insert certain pieces from, let's say, an outside R program or whatever else? There are various ways to accomplish this just to get things off the ground. In the R community, markdown and R Markdown rely heavily on the knitr package by Yihui Xie, which has been the glue of so many things for what I do in markdown.
It comes with a function called knit_expand(), which will let you inject values into your markdown syntax through placeholders for variables. You can inject certain other things into that as well, but there's more than one game in town. There's also the whisker package, authored by Edwin de Jonge. That's actually used by the pkgdown framework; I didn't know that until this post. There's also the brew package, which is a long-time package in the R ecosystem that I've used in the past. And if you wanna get outside of R itself, there is, of course, Pandoc, which is used by more than a handful of markdown rendering engines.
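As a quick flavor of knit_expand(), here is a minimal sketch; the template text and values are just a toy illustration, not taken from the post:

```r
library(knitr)

# Fill {{placeholders}} in an inline template; the same idea works with a template file
knit_expand(
  text = "The sample for {{name}} is drawn from Normal({{mean}}, {{sd}}).",
  name = "Ada", mean = 5, sd = 2
)
#> [1] "The sample for Ada is drawn from Normal(5, 2)."
```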
So we get some nice examples based on giving a homework assignment where you want to inject the name of the student into the document as well as a customized solution, or I should say a customized mean and standard deviation for the question. The example walks through a workflow based on the whisker package, which, for those of you familiar with the glue package, will feel right at home. I've played with it a bit. It is pretty nice. And a new one I learned: once you generate those lines of dynamically generated content and wanna write that out, there's this newer package (to me anyway) called brio, which apparently is kinda like a souped-up version of what base R can do for reading and writing arbitrary lines of text. So I'll have to put that in my toolbox for later. The example basically has a custom function called make_assignment() that wraps this together, and sure enough, we get a nice little data frame out that has, for each student, the custom mean and SD parameters and then the file name that it's writing to. And once you run that, you get a markdown file with everything filled in dynamically.
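Here is a minimal sketch of that whisker-plus-brio pattern; the template text and the exact arguments of the make_assignment() helper are my own illustration rather than the code from the post:

```r
library(whisker)
library(brio)

# A tiny markdown template with {{mustache}} placeholders
template <- "# Homework for {{name}}

Simulate 100 draws from a Normal({{mean}}, {{sd}}) distribution
and report the sample mean."

# Illustrative helper: fill the template for one student and write it out
make_assignment <- function(name, mean, sd) {
  filled <- whisker.render(template, data = list(name = name, mean = mean, sd = sd))
  path <- paste0("assignment-", tolower(name), ".md")
  brio::write_lines(filled, path)
  data.frame(name = name, mean = mean, sd = sd, file = path)
}

make_assignment("Ada", mean = 5, sd = 2)
```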
Definitely a great way to get started, and that is for generating. Now let's talk about parsing. This is where I've wandered before. I've been through grep nightmares parsing stuff in text documents, markdown or not. So that is an approach if you're willing to get your hands kinda dirty and muddy with regex: go for it. Not my cup of tea anymore. I've been down that road before. But the newer thing in town for importing these kinds of documents programmatically is changing them into the abstract syntax tree layout, or AST.
In markdown, you have these common constructs like headings, lists, code blocks, and numerous other elements. You can think of ASTs as a way to give you a kind of nested list structure, where you can then parse that, modify certain elements of it, and write it back out. This is being used in many different ways that I've seen in the community, but the post continues with a real-life example that I did not know happened. There is this great reference that I've seen in the ggplot2 ecosystem in the past called the ggplot2 extensions gallery.
And apparently somebody hijacked one of the links in that gallery. Not so nice, I'll say. So what if you were on the receiving end of that? If you were the maintainer and needed to quickly find which links were hijacked, you'd be parsing this huge markdown file or set of markdown files where those links are scattered. What are the best ways to handle that? There are many different parsers that take the AST route for markdown, some of which include the tinkr package, authored by Maëlle Salmon herself and maintained by Zhian Kamvar, another coauthor of this post, which translates markdown to XML using the commonmark rendering engine, which is used quite a bit.
XML still gives me the heebie-jeebies sometimes, so it's not my preferred way. But, hey, XML is quite powerful. I do acknowledge that. So send your feedback to me if you have any differences of opinion or if you're successful with XML. I'd love to hear it. There are some new games in town here. One is called the md4r package by Colin Rundel. It is in an experimental state, a development package out there, but it does take that list approach of translating the AST into a nested list, all built in R. Pandoc itself will let you do this too, either in its native AST form or in JSON format, but then you would have to use Lua filters to be able to parse through all that, or JSON tooling if you're going the JSON route. If you're familiar with JSON, you might have some good options there that technically don't depend on another language.
There are also newer ones out there like parseQMD by Nic Crane, who I also met at the conference, a new package that parses Quarto markdown documents and is basically built on JSON as well, with jsonlite. There is also the parsermd package, yet another package by Colin Rundel. Seems like he's really in tune with this space quite a bit, and that one uses the Boost Spirit library, which is based in C++. There was a talk about that at a long-ago rstudio::conf back in 2021 that might be helpful as well.
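To make that hijacked-link hunt concrete, here is a rough sketch of what the tinkr route can look like, assuming tinkr's yarn interface; the file name is made up and this is an illustration rather than code from the post:

```r
library(tinkr)
library(xml2)

# Read a markdown file into an XML abstract syntax tree (commonmark flavour)
doc <- tinkr::yarn$new("gallery.md")   # "gallery.md" is a stand-in path

# Pull out every link node and its destination so the URLs can be audited
links <- xml_find_all(doc$body, ".//md:link", ns = md_ns())
data.frame(
  text = xml_text(links),
  url  = xml_attr(links, "destination")
)

# After fixing nodes in place, write the (normalised) markdown back out
# doc$write("gallery-fixed.md")
```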
There is a lot to digest here because you've got a lot of choice. So in the rest of this post, they talk about what you can use to inform your decision on choosing a parser. Part of that will depend on whether you're gonna import code chunks and do something with them or not. That might switch the game a little bit, because in that case you might benefit from XML, since that can handle the code chunks a bit better. But that would vary by use case, of course. The post also talks about that metadata I mentioned at the top, usually in YAML format, which you separate with those three horizontal dashes as the front matter.
I definitely have done things in the past to grab those dynamically, and it wasn't the prettiest in the world. But if you are using R Markdown, there is a handy function called yaml_front_matter() that will help you extract that through one function call. And I say, yes, please. The last thing I wanna do is rewrite that kind of stuff unless I absolutely have to. And then there are also other ways to do this with, say, the yaml package itself. And if it was in a different type of markup format like TOML, there are some other packages like RcppTOML, as well as others, to be able to parse that too.
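For the front matter piece, a minimal sketch might look like this; the file name and fields are placeholders, not from the post:

```r
# Grab just the YAML front matter from an R Markdown / Quarto file as a named list
rmarkdown::yaml_front_matter("report.Rmd")

# Or parse a YAML block yourself with the yaml package
yaml::yaml.load("title: My report\nparams:\n  mean: 5")
```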
So of the two things, I would say generating documents is probably the easier thing to tackle at first. And with those tools we mentioned earlier, I think you can get underway very quickly. Parsing, you've got a few options. Just like everything in R, you've got a few options at your disposal. I think it really depends on the complexity of what you're parsing that's gonna determine which way you go. Again, I probably won't resort to XML-based methods unless I have to, but that could just be me feeling imposter syndrome with some of that stuff, and if I give it a go, maybe I'll feel better about it. But I am intrigued by a lot of these tools that we're seeing in the community. And if I am in this situation in the future, I'm gonna use this post as a great starting point for that markdown generating and parsing journey.
But, yeah, markdown all the time, no matter which way you slice it. Mike, what did you think of this?
[00:17:31] Mike Thomas:
Yeah. Markdown is the English language for us programmers. I like how you phrased that, Eric. I had a client today who had developed a statistical model in R and did the model documentation in Microsoft Word. Oh, boy. And when we went to do some counts, just simple counts of the data that they provided us, that they used to build the model, those counts didn't match what was in the table in the Word document. And I asked them if they had ever heard of R Markdown or Quarto, and they said no. And a single tear fell from my eye. So, wow, apparently we still have work to do, or this person lives under a rock. But, you know, I think this article right here is not necessarily about small changes. This is about making significant changes that probably need to be made in a number of places, where it makes sense to use a programmatic approach. Although, I would say in the small-case situation, I love VS Code's search and find-and-replace-all, which probably now exists in Positron as well. Very useful tool. Sure does. Seen that, and you need that replace-all all week, baby.
If you need to find an instance of something across your whole entire repository and replace all instances of that with something else, it's a pretty sweet tool to check out. I was definitely intrigued by the tinkr package here that I believe Maëlle Salmon kind of dreamed up, as stated in the blog post, for going from markdown to XML via xml2. Eric, I'm happy to be the good cop to your bad cop with respect to our feelings around XML. I'm here for it. I've been doing a ton of it lately. This all stems back to that package I was working on over the weekend, where there's a REST API that sends back an XBRL file, which I had never heard of before. I guess it's sort of like Excel but for the finance world, and you'd need some sort of an XBRL reader, and I was not about to do that. I just want it to end up in a data frame that we can do our cool R stuff with.
So fortunately, I was able to convert that XBRL to XML and then parse that into a beautiful data frame. At first I was quite intimidated, and then once I figured out some of the wizardry that the xml2 package can do for us, that with a little bit of purrr made it fairly straightforward to get that XML data into a nice data frame. So, initially intimidated, I'm here to tell you, Eric, it's not that scary.
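For a flavor of the xml2-plus-purrr pattern Mike describes, here is a toy sketch; the XML is made up to stand in for the converted XBRL, so it's not his actual code:

```r
library(xml2)
library(purrr)

# Illustrative XML standing in for converted XBRL facts
xml <- read_xml('
  <facts>
    <fact name="Revenue" period="2024">100</fact>
    <fact name="Expenses" period="2024">60</fact>
  </facts>')

# One row per <fact> node: attributes become columns, node text becomes the value
facts <- map_dfr(
  xml_find_all(xml, ".//fact"),
  \(node) data.frame(
    name   = xml_attr(node, "name"),
    period = xml_attr(node, "period"),
    value  = as.numeric(xml_text(node))
  )
)
facts
```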
[00:20:25] Eric Nantz:
Okay. I feel safer now. Okay. That'll be my next journey.
[00:20:30] Mike Thomas:
I did think that it was kind of important to note, especially as you go the other direction of trying to do this sort of round trip, where we're going from maybe markdown to some other format like XML and then back to markdown. There's a whole section here called the impossibility of a perfect round trip. And the authors give the example of using that tinkr package to go from XML back to markdown, where list items in your markdown that were previously represented with an asterisk would then be represented with a hyphen instead.
And I think the asterisk is kind of old school anyway for list items. Eric, I'm not sure if you use asterisks or hyphens in our show notes here. I have always used asterisks, but today I have switched to hyphens. I'm taking a stand. We have a full mix in our show notes, unfortunately. Yes, 100%. But this is all to say that there are probably some gotchas when you are doing this sort of round trip to go from some different representation back to markdown, particularly in situations where markdown may have multiple ways to do the same thing. Right?
The author of the package that's going to go that direction for you will have to make a choice about which to use. So I thought that was interesting. And the thing I really love about this blog post is that it's not just a showcase of a single tool. It really gives us this whole world of possible tooling that you could use to accomplish and solve these different types of use cases we're talking about here, weighing their pros and cons and benefits, which I think is extremely helpful. Because I'm sure that everyone in this situation has a different sort of nuanced use case they're trying to solve, and one tool may work better than another, or they may have more comfort with, you know, HTML versus XML representations of their markdown information. So it may be helpful here to know the different possibilities that are out there based upon your preferences.
[00:22:39] Eric Nantz:
Yeah. And I do think, certainly on the parsing side of markdown, that's becoming even more important to me lately. Like every conference this year, a big theme is building custom AI tools on top of things. And we have internal markdown-based documents at the day job that I'd love to be able to parse and ingest very quickly so that I can use them for multiple sets of purposes. Maybe that feeds into RAG down the road or whatnot, and there may be cases where I have to get lower level with the content versus just pointing it at a directory of markdown files. So I'm definitely intrigued by the different possibilities of what we can do here, and there is, like I said, a lot of choice involved. There may not be the perfect way, but I think with the combination of tooling here, you can get quite far.
And markdown is not stopping anytime soon. Who knows? Maybe by the time I'm, like, 80 years old, markdown will still be around. At least I hope so. You know, Mike, learning new programming languages can already be difficult enough, especially depending on what background you have. Heck, learning new spoken languages in and of itself can be quite difficult as well. I can speak from experience because I have been slowly trying to learn Mandarin Chinese, because my wife is from China. I can say the basic phrases, but as a native English speaker, oh, goodness, it is just a different world. And it just makes you realize how illogical the English language is compared to other languages that have a lot more structure to them. That's just my opinion. You know, hot take of the day.
Well, I'm not the only one learning a new language, albeit I'm not learning mine that well, but our next highlight here comes to us from the aforementioned curator of the issue, Jonathan Carroll, because he is on a journey to learn Japanese as another language. And along the way, he has decided to take matters into his own hands and not just create an R package to help with some of the written side of the language, but actually vibe code it along the way too. It is always fascinating to see how far people can go here. So let's learn about Jonathan's journey. As I mentioned, he's been learning Japanese. I guess his daughter has been learning it in high school, so he wanted to tag along for the ride.
And there are, of course, external services and resources that help with language learning, such as Duolingo, which I think has been good in his opinion. Another one that he was recommended to try is called WaniKani. They all have slightly different takes on how to use repetition and how they link words together. In practice, they gamify things a little bit. You've got a leaderboard to level up your tiers of skill. And speaking and listening is one thing; it's a whole different ballgame when it comes to writing these types of languages, because it is very easy for two characters to be confused.
Japanese uses what he calls a logographic writing system, which is not so much based on the typical alphabet like we have in English and other languages. The characters look almost like drawings. They're hand strokes forming different symbols, and just one minor difference can mean a huge difference in how the word associated with that character is spoken. So I can definitely sympathize with that. I cannot write Chinese worth a lick unless I get a lot of help. But Jonathan has definitely been on that same train with the Japanese language as well.
So he thought, boy, it would be nice to synthesize some of this information together. And he looked at recent resources online and learned about a technique from a person named Alex Chan, who had a post about storing language vocabulary as an actual graph, almost like a knowledge graph of sorts, where you link together certain characters that share a similar component and may look similar. And you can do that for Chinese, as the original author did; he wanted to try it for Japanese. So how does he go about building it? Obviously, you could just throw your LLM at it and, every time you need it, ask the LLM to build it, but he wanted to go deeper into this. He was hoping there might be a way to make a package out of this and not just turn to the LLM for one-off requests.
No R package exists that does this. He did stumble upon an older resource from seven years ago that tapped into the API of the WaniKani service. So that's interesting. He was able to take inspiration from that and then try to assemble the data that would go into building these kinds of custom graphs of the language characters. So he got an API key, went to the API docs, and was able to use the httr2 package, which I use quite a bit for my web API needs these days, and was able to grab the endpoint as a big old set of JSON blobs, which typically is what we deal with with API data.
Some metadata was associated with that, and he's got an example in the post of what that code looked like, a good hybrid of httr2 as well as purrr mapping and all that, and you can get that into a nice tidy data frame. Of course, he could have stopped there, but no, no, no. He wanted to go further with this. He's been watching the space on tools like Claude Code and others. He even saw an interesting video about how it was linked to Obsidian, which is a very popular note-taking app built around markdown. There's markdown again. And it was able to ingest the markdown that was in these Obsidian notebooks and do some really interesting things with it.
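As a rough sketch of that httr2-plus-purrr flow (the endpoint, header, and field names here are assumptions about a WaniKani-style API, not Jonathan's actual code):

```r
library(httr2)
library(purrr)

# Hypothetical request to a subjects endpoint, authenticated with a stored API key
resp <- request("https://api.wanikani.com/v2/subjects") |>
  req_headers(Authorization = paste("Bearer", Sys.getenv("WANIKANI_API_KEY"))) |>
  req_perform() |>
  resp_body_json()

# Flatten the list of JSON blobs into a tidy data frame
subjects <- map_dfr(resp$data, \(x) data.frame(
  id      = x$id,
  type    = x$object,
  meaning = x$data$meanings[[1]]$meaning
))
```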
So he thought, okay, this is interesting. What if I take inspiration from that and start to vibe code a package with Claude Code, just to see how far I can take building this new package to, again, generate these custom graphs of the language symbols? And this is where here we go, right? So he booted it up, and it started to think of a plan of what to accomplish and what it needed to do to build the package. First of which: query that service API, then figure out which functions to use to start assembling this, then write the documentation on its own, as well as the unit tests on its own.
And then he just kind of watched for a bit. I've seen these in action from other people. It just starts with a checklist of things to go through and works one by one. Some steps take longer than others, and it can be interesting to watch. He was using, I believe, something called whisper to accomplish some of this too; I haven't looked too carefully. But once it was done, it had built the package with modern approaches to the API calls with httr2, as I mentioned earlier, even ran devtools check on its own, and it mocks some tests using httptest. So that's pretty nice. That's a pretty modern way of doing it.
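For the mocking piece, here is a minimal sketch using httptest2, the httr2-flavored sibling of httptest; treat the directory name and endpoint as assumptions rather than Jonathan's exact setup:

```r
library(testthat)
library(httptest2)

# Record real API responses once into tests/testthat/subjects/, then replay
# them offline so the test suite never hits the live service
with_mock_dir("subjects", {
  test_that("subjects endpoint returns a successful response", {
    resp <- httr2::request("https://api.wanikani.com/v2/subjects") |>
      httr2::req_perform()
    expect_equal(httr2::resp_status(resp), 200)
  })
})
```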
And then he would check it over. Sounds like it wasn't too bad. He instructed it to write various commit messages, and it wanted to add itself as a coauthor. Apparently it's, like, co-authored by Claude. That was pretty interesting. But that's a good thing, right? As long as you wanna be transparent about what it's actually doing here. And then once he had the package, he tried it on some of that JSON output. Seems like it worked pretty well, but there were some issues along the way. He had to tweak things a little bit to get not just the IDs of these characters but also the actual symbols themselves. And so he was able to interrupt Claude Code for a bit, have it change course a little, and then he was able to get the actual symbols out in various ways.
And once it all wrapped up, he had a package with over 133-ish tests. That's pretty impressive, all while ingesting the API. He's got the package on GitHub; it's all linked right there in the blog post. And in fact, he even goes a step further, and I believe he's also put together a Shiny app as a front end to this, so he's got that linked too. Again, lots of additional vibe coding here to get all that squared away. But in the end, this is one of the best ways of learning, right? To learn by doing. The caveats I resonate with at the end of this post are: certainly, you may wanna think twice about this approach being used in very high-risk situations.
He jokes that you probably don't wanna do this to connect to your banking app or anything to do with financial stuff. And then the other thing to note is that things like Claude Code don't come as a free lunch, right? You're gonna have to pay some fee for the back and forth with the AI service. So you gotta keep that in mind. Hopefully, it's not too expensive, but you never know. You might rack up $20 or perhaps more depending on how far you take it. So just be watchful of that. And, again, he definitely has the mindset that this is a good tool for assisting your development.
I'm not losing my job anytime soon to an agentic AI, I think, but I definitely think it can help and speed up the process of going back and forth with the different dev tasks of package development. And he's not pretending this package itself is gonna be used a lot. It does what he wants it to do, but this was a learning opportunity for him. And, yeah, I've definitely seen a lot of people start to use Claude Code for purposes like this, and for a real novel need that Jonathan had, it definitely got him there. He probably could have gotten there himself, but it probably would have taken more time. So, certainly, there are benefits and trade-offs for how much time you wanna invest versus the cost of using this, and hopefully not having to refactor a lot of code after a vibe coding session.
But it is a sign of the times, right? So I'm happy to see Jonathan learning out loud here, and hopefully it's supercharging his Japanese learning journey.
[00:34:20] Mike Thomas:
This is really interesting, and I think this whole blog post, for anybody that's trying to weigh how much they should care about this whole vibe coding thing, or perhaps how they could, should, or shouldn't incorporate AI and large language models into their software development workflows, is just a fantastic end-to-end write-up that's very transparent about what worked well, what didn't work well, and sort of why it worked the way that it did. So, Eric, your journey with Mandarin and its differences from the English language resonates very heavily with me, with a three-year-old who we're working on reading and writing with. It's like you teach them one rule in English, and the next word you run into breaks that rule.
There's so many edge cases, Mike. Yep. Yep. Which is very frustrating, and I guess we take it for granted. But on some of the busy days at the day job, I'm sure you, Eric, feel like I'm over here doing vibe podcasting. I did wanna let the listeners know that they should not worry: this podcast is not Google's NotebookLM, at least not yet. No. I was taking a look at a Python package that wraps an API, and I was trying to make a similar R package over the weekend that wraps the same API. And the developer, who I'm very impressed with, made the initial version of the Python package, and the API service recently updated its protocol.
So he had to go in and essentially refactor the whole package. And I could see in the commits that Claude did the whole thing pretty much: wrote the tests, refactored the tests when things weren't working. I thought it was incredible. And I think, to Jonathan's point in this blog post, similar to him: it works, but is it the best, most concise way to write the code to do what it's trying to accomplish? I'm not sure. That API endpoint I'm talking about provided data in a couple of different possible formats depending on your request. One was semicolon delimited, and the other option was XML.
And it was using Polars and could have just specified, like, a read_delim with a semicolon, but Claude decided to ask for the XML and write, like, 50 or 100 lines of code to parse that XML into a data frame, as opposed to when I was doing it in R, where it was two lines of code to do that. So I think that's just an example, and I think it plays out in a lot of what I've seen in terms of me asking for individual responses from, you know, ChatGPT or Claude in my day-to-day coding journey: it will give me code that tends to be a lot more verbose than what I would normally write.
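For reference, the kind of two-line R approach Mike is describing might look like this, with a stand-in file name for the downloaded payload:

```r
library(readr)

# Read the semicolon-delimited response straight into a data frame
facts <- read_delim("api_response.txt", delim = ";")
```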
And I think, again, if you don't care about the part in the middle, the code that's actually executing the thing, and you only care about the end goal, maybe a lot of times you'll be okay just kind of blindly vibe coding. And, again, it all comes down to risk, as you mentioned, Eric, and as Jonathan mentions here. So weigh that into your decision, but these tools are also getting better day in and day out. One of the things that I also struggle with on this topic is ensuring that I'm leveraging the latest and greatest from the packages that I'm using when these LLMs are trained on older data. Recently, I was leveraging tidyr's pivot_wider, and Claude was giving me an argument that no longer existed.
It had long since been deprecated, and it was asking me to use that. I looked in the docs, and that argument didn't exist anymore. So I run into a lot of situations like that as well, where we're developing software and trying to make sure that we're using the latest and greatest, as opposed to some of what I think these LLMs philosophically do, which is just regurgitate the past so that we don't make a lot of progress forward. That's probably a big thought to end this section of the podcast on. But, long story short, I really appreciated Jonathan's fully transparent walkthrough of his journey on this package.
[00:39:01] Eric Nantz:
Me as well, and I am early in this process too, but I actually had a very, I won't say stressful, situation. Back to my shinystate presentation: the night before, I had an example demo app, but I was just looking at it and thought, no, I don't like it. I wanted to do something that brought some fun to it but was also an easy example. So, yes, I booted up Positron Assistant with the Anthropic agent mode, and I basically vibe coded a retro-eighties-video-game-looking app for the demonstrations. At 1 AM. It was awesome. It was what I needed, because otherwise I could've spent probably a week doing that, knowing my OCD for developing Shiny UIs. But it did the job, man. It did the job. You did that the night before?
[00:39:54] Mike Thomas:
Yes. The night before. Oh my goodness. Go watch Eric's presentation, please. That will blow your mind.
[00:40:01] Eric Nantz:
It did the job. I mean, would it be exactly the way I would've written it? No. But it was good enough, and that's all I needed. I just needed good enough. It was just for the slides, just some prerecorded demos via OBS. And once I got that app going, it was on to the next part, but it literally did save my behind when I had that very literally down-to-the-wire change in my demo direction. And that kinda made me a believer that, in the right situation, this can definitely be a big help. But as you said, Mike, cautions abound for certain situations. In the end, it can be a nice tool in your toolbox.
[00:40:42] Mike Thomas:
It's a heck of a Claude sales pitch.
[00:40:45] Eric Nantz:
Where's the podcast revenue coming in, man? I don't know. And rounding out our highlights today, it does feel like I'm going back in time on this one. I'm not meaning that in a disparaging way, but back in graduate school, when I was a TA in our stats department, we indeed would be part of the group that worked with our faculty to create dynamically generated quizzes and exams. I still remember making these in Perl and LaTeX code of all things, so let's not get into those tangents. But for this next highlight, this last one I should say: if you're in a situation where you need to generate, say, dynamically generated statistics-based exams, and you're confined to certain, I'll say, older formats, this may be just for you.
This is actually a blog post from the R/exams project, and I think we may have mentioned them very briefly in previous highlights. Holy smokes, this thing is really comprehensive. If you are in this space of teaching statistical knowledge or whatever else, this is a suite of packages that are literally meant for producing and parsing written exams in many different formats, one of those being multiple choice. And, unfortunately, many of these are still being done on paper, with the students filling them out and then turning them in. There can be some issues when you scan those results in, and some really gnarly issues that this blog post highlights. They call it quality control for scanned multiple-choice exams.
And I don't pretend to know all the internals of how the R/exams tooling works in the R ecosystem, but the way they've conducted this and the way they have a solution to tackle some of these gnarly problems is definitely intriguing to me. So there's a couple of demos here. One of which is an exam that has some pretty typical problems: it's got five single-choice answers, and then somehow when it was scanned, it was rotated upside down and certain things were marked up a little oddly. Some of these choices actually have multiple check marks when there should have been a single-choice answer.
So the suite of packages they have for R/exams has a way to scan it and rotate it the right way, and then to literally run a wrapper function called nops_fix, where it'll give you kind of a visual of what needs to be corrected and let you enter it in, either at the prompt with what the real value should be, or by clicking through a GUI-like interface to fix all that. So, yeah, it's kind of a manual effort, but their clever trick is that when they import it, they actually make a zip file of the different problems as images. And then once you fix a problem, it dynamically makes new metadata, I believe in JSON format or whatnot.
And then it ingests that back into the parsed object that importing this exam scan file generates. And then you can unlink the temporary files and get this all cleaned up. Boy, oh boy, where was this twenty-five years ago when I needed it? But, nonetheless, better late than never. And then their second demo has a lot of problems, not just the rotating issue: they've got many other corrupted markers, where it was supposed to be a single choice but several are marked, and in fact some of the scan results are even off the page, so to speak. Really gnarly stuff.
And so the package again lets you fix these dynamically, but then you get a nice little HTML-based GUI on this one, where it actually thinks there are more questions than there should be, but you can at least select only the ones that are supposed to be selected, and then it'll write back out the actual number of questions. And then you can review all this as you go. It keeps everything that you work on in the workspace, so you can easily go through it with your human eye, so to speak, before you make the final call on fixing it up. But, man, I can just imagine how much time this pipeline saves, not just for creating these exams, but also for ingesting them when you're dealing with these, I hate to say legacy, but pretty legacy formats in terms of the way education works these days. I'm just really impressed with what they've accomplished here. And like I said, if you're in the education space and you wanna start learning from what they're doing, they've got a whole bunch of blog posts about what they're doing with this suite of packages, even some integrations with AI tooling to help along the way. So I definitely learned something new that the me from many years ago would have been all over for exam generation.
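For a sense of the workflow described in the post, here is a very rough sketch; the function arguments are assumptions on my part, so check the R/exams documentation for the real interface:

```r
library(exams)

# 1. Scan the filled-in answer sheets from a directory of images into a
#    machine-readable results file (plus a zip of the scanned sheets)
scanned <- nops_scan(dir = "scans")

# 2. Interactively review and repair problem sheets (rotated pages, multiple
#    check marks, off-page markers), either at the console or via the GUI
nops_fix(scanned)

# 3. Evaluate against the solutions and write out the results
# nops_eval(register = "register.csv", solutions = "solutions.rds", scans = scanned)
```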
[00:46:18] Mike Thomas:
This is absolutely incredible. It does sort of remind me of the memes out there where, say, men will do anything but fix themselves, which is like, we'll do anything in R to parse information and allow us to programmatically interact with scanned exam documents instead of just not creating scanned PDF exams in the first place. But c'est la vie, that's just the way that it goes. One area that I think is really, really interesting, and I'd love to dig in further to see exactly how they did it, is these functions that sort of pop up an interactive session. Right? Whether that be some sort of HTML widget or whatever it is, some HTML interface that allows you to take a look at the PDF and maybe interact with it as well. We have a ton of use cases lately where we are using LLMs to parse data out of PDFs to provide standardized answers for downstream reporting or downstream decision making, things like that.
But our users are wanting some sort of audit step in the middle, where they can see the section of the PDF that the LLM used to get its answer, and that's an area that we struggle with. And I feel like if I dove deeper into the internals of the packages being used here, it might give me some big pointers on how to solve that problem. Because I think that's probably a problem that exists for a lot of folks who are trying to merge the worlds of large language models and PDFs these days. So very, very interesting to me. I'm excited to look a little bit further underneath the hood. But if you are in this space at all in education, needing to administer exams and then review and grade them, still doing that very manually, and looking for tooling that can help, I mean, this is gonna be mind-blowing if you haven't run into it before.
[00:48:25] Eric Nantz:
Yeah. I think this does just about everything you need in this space, so I think it's worth checking out for sure. And, you know, usually I try to give credit to the authors, but this blog post, and this project in general, has a lot of people behind it, so I have a link to their contact information in the show notes if you want to get in touch with them, and they're also on social media as well. So we'll have a link to all that. But, yeah, really fantastic effort there, just like the rest of this is. Yep, go ahead. Yeah, it looks like maybe Achim Zeileis is tagged at the bottom of this. Oh, it is. Okay, good eye. So, Achim, fantastic job to you and the rest of the team on this fantastic pipeline.
And the rest of the issue is fantastic as well with R Weekly, and it never stops, even when we were at a lull for a bit, but we are back up and running, and luckily not a moment too soon. There are lots of great resources to look at on top of these highlights that we talked about. We are a bit pressed for time, so I will have to skip our additional finds, but again, you'll see a whole bunch that Jonathan's curated here that you can look at at your leisure. And, also, we love hearing from you. Again, shout out to all the attendees at posit::conf who said some nice words about our humble little podcast and wondered when it was coming back. And I said it was coming back, trust us. So thank you to all of you that said hi to both Mike and me. It was gratifying to connect with you in person. We always love hearing from all of you. But since we're not at the conference anymore, if you wanna get in touch with us, there are a few ways of doing that. We have the contact form directly linked in the episode show notes, which was still up and running last time I checked. We also have the project itself that you can help out with via pull requests.
Everything is linked at rweekly.org. Click the little GitHub icon in the upper right and you'll get directly to the pull request template, so the curator of the next issue can benefit from what you found. And a little birdie tells me that might be me next, so I need all the help I can get. But last but not least, you can find us on social media. I am at [email protected]. We'll try to post more often now that I'm getting through some of this crazy August and September conference and presentation stuff that I've been involved with. You can also find me on Mastodon at @[email protected], and I'm on LinkedIn. Just search my name. You're gonna see me there. Mike, where can the listeners find you? You can find me on Bluesky at @mike-thomas.bsky.social.
[00:50:53] Mike Thomas:
You probably won't find me on LinkedIn if you search my name. But if you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to lately.
[00:51:03] Eric Nantz:
Awesome stuff. And like I said, it feels comfy again being able to do this. Hopefully, we can get back to a regular cadence, but you and I have both been in the middle of a lot going on, so we'll keep trudging along. But as always, we thank you so much for listening out there from wherever you are around the world. We really appreciate it, and we look forward to hopefully being back with a new episode of R Weekly Highlights next week.
Hello, friends. We are back with episode 211 of the Our Week of Holidays podcast. And yes, it was true. We had to check the episode numbers. This has been a hot minute since we last talked with you. But this is the weekly, usually, weekly podcast where we talk about the awesome resources that are shared on this week's our weekly issue at ourweekly.0rg. My name is Eric Nance, and, yes, I am probably the biggest culprit for why we've had the long layoff. There was a a busy few weeks getting prepared for what we'll be talking about, I think, very shortly of a very eventful posit conference, but I'm happy to be back back settling down and processing what I all learned.
But thank goodness I'm not doing this alone because he's back as well, my awesome cohost, Mike Thomas. Mike, how are you today? Does it feel odd to be back into this?
[00:00:51] Mike Thomas:
Feels odd. It feels great. I've been on the road quite a bit lately. Had some conferences in on the West Coast. Had a conference, obviously, down in Atlanta at POSIT COMP last week, which was fantastic. I feel like I I'm just giving presentations for a living and not writing a lot of code, but I did scratch and claw together in our package over the weekend, while I was kind of on the couch down with the cold. So, hopefully, that will come to see the light of day soon.
[00:01:20] Eric Nantz:
Nice. We've both been a package mode recently. So, for those aren't aware, both Mike and I did have the good fortune of of being presenters at this year's Posikov. And I'm just gonna, you know, give the floor to you first, Mike, because your talk about kind of bilingual data science teams, you had so many nuggets in there and so many words of wisdom for everybody in this situation. You had you had the room in the palm of your hand, but how how was that experience like for you?
[00:01:49] Mike Thomas:
It was a great experience, and I think I'm sure as with you and anyone else that has presented in the past or is used to giving presentations, it's all about preparation. And the more you the more you prepare, the easier it is. And I felt that, I was well prepared, not not, at at least in part to the team at Articulation who posit, is a contractor that posit hires to essentially help us, make sure that our presentation is put together in a timely manner and that we do a good job of organizing our thoughts. So that was a huge help. I enjoyed yeah. It's it's very easy to speak about a topic that you're passionate about. So I enjoyed giving the presentation, got a lot of great feedback. But how about you, Eric?
[00:02:32] Eric Nantz:
Yeah. I do echo the preparation, and I do admit mine went down to practically near the wire on a certain part of it, which I'll touch on later. But, I was able to for the first time, I think, ever in a non industry, like, type of conference, I was able and I mean industry. I got my day job. I was able to present something I've actually built in the open source world that actually has seen the light of day. And so I've talked about my newest r package called shiny state, which takes bookmarkable state to new heights, and I think I did a pretty nice, way of rounding that out and my motivation for it and, did spend a lot of time on the messaging because anytime you have, like, twenty minutes or less, and I think it's almost harder to prepare for those than if you get that nice one hour seminar or whatever. So well received, I think. And I'm also happy to say that the package a day after the presentation finally hit crayons. So my first crayon package in over fourteen years, folks, I I mean, if I was able to get it on by hook or by crook, and I went through that checklist probably about 50 times. It seemed like by the time I filed it, but it it feels good, and it also feels good. It seems like a lot of people have big ideas for it, so I've gotta watch that issue tracker pretty closely now.
But, overall, great experience. And then the day before the conference, I was busy with the Our Pharma Summit. Great shout out to all my life sciences colleagues out there that attended that event. And I was running around and doing some audio magic here and there, but giving a presentation there as well. So by the time the conference is over, I definitely was ready for a bit of rest. Not that I got much of it, but, nonetheless, it was, it was eventful and, yeah, all the great conversations hanging out with you during one of the lunches and hanging out with some other peeps. Yeah. It was it was just as advertised. And no matter where posit goes in terms of their direction, the people make it make it the WER file experience. So I had a blast.
[00:04:40] Mike Thomas:
Likewise. Great talk, Eric. If you haven't seen it, hopefully, when the recordings come out for the folks that purchased a ticket, you can see it, today. And then in a few months from now, I think it'll be on YouTube. But I am so pumped about your Johnny State package. So thank you for all your work on that.
[00:04:57] Eric Nantz:
Awesome stuff. And, yeah, I can't wait to actually use it for my projects now. And I have it out there. It's my biggest motivation for it. And I am actually dusting off the cobwebs of the shiny dev series site. I'm I'm gonna put a blog section on it, and I have a blog post about the package in the near future too because I gotta I gotta get on the blog and train too when I get a chance and follow your lead amongst countless others that do a way better job with us than I do. So Definitely not me. I'm not sure what you're thinking of, but there are plenty of people out there. I've seen them, man. I've seen them. You did you do a bang up job. But we gotta talk about the awesome our weekly issue before we go too much further here and checking my notes because it's been a minute. Our issue this week has been curated by Jonathan Carroll, another one of our longtime curators and really one of the founding fathers of this project, I must say. And as always, he had tremendous help from our fellow Arrow Crew team members and contributors like all of you around the world with your poll requests and other wonderful suggestions.
So we lead off with a ways that you can not just write your markdown documents in the traditional way. And when I say markdown being what almost is like the kind of like the English language to me at this point in terms of technical documentation, I'm writing it every day. It's going in many different places, whether it's a GitHub issue, summary, a GitHub pull request note, or just my technical notes. As in the R community, we have many ways of authoring our mark our markdown documents. I even just said the first one. R markdown is what pioneered a lot of these efforts. More recently, the quartal framework amongst others as well.
It's one thing to manually write it. Right? What if you wanna programmatically generate and parse markdown content for further downstream or upstream processing? Our first highlight is just for you because this is a really big deep dive into many different ways you can approach this programmatic perspective of markdown, and this is authored on the rOpenSci blog. So, by the way, shout out to to Noam Ross from rOpenSci. I was able to meet him at the conference, and it may have been our first meeting ever, but been talking to him online for years and years. So that was that was fun to to talk shop with him for a bit. But this post on your open side blog was written by Mahal Salmon, former former curator of our weekly, along with Christophe Dever and Gian Convar.
And this is a big one. So we'll summarize what I think are the key points I'm taking away here. And, of course, Mike, your perspective is appreciated too. It first talks about what exactly markdown is. I hope by now most of you listening are familiar with markdown, but if you're not, there's a great intro to how you write the syntax. And there are a few different ways you can write certain bits of syntax, but it does go over what I mentioned earlier, the various ways that you can interact with markdown, especially as an R user. And, again, there are quite a bit out there. And, of course, one of the biggest selling points recently is that you may have some front matter in the markdown written in typical YAML format that might define certain options for rendering or other parameters.
And, of course, you can embed programmatic code chunks using R, Python, or other languages, with that special kind of fenced chunk syntax. Again, all of that is in the blog post, and if you've done markdown before, you're pretty familiar with that stuff. Now, what if you are in a situation where you wanna build a template, but you wanna dynamically insert certain pieces from, let's say, an outside R program or whatever else? There are various ways to accomplish this just to get things off the ground. In the R community, markdown and R Markdown rely heavily on the knitr package by Yihui Xie, which has been the glue of so many things for what I do in markdown.
It comes with a function called knit_expand, which will let you inject into your markdown syntax with maybe those placeholders for variables. You can inject certain things into that as well, but there's more than that in town. There's also the whisker package, authored by Edwin de Jonge. That's actually used by the pkgdown framework; I didn't know that until this post. There's also the brew package, which is a longtime package in the R ecosystem I've used in the past. And if you wanna get outside of R itself, there is, of course, Pandoc, which is used by more than a handful of markdown rendering engines.
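To make that concrete, here's a minimal sketch of the knit_expand() idea, assuming knitr is installed; the template text and values below are ours for illustration, not from the post.

```r
library(knitr)

# A tiny markdown template with {{placeholder}} markers (made-up example)
template <- "
## Assignment for {{student}}

Simulate 100 draws from a Normal({{mu}}, {{sigma}}) distribution.
"

# knit_expand() substitutes the placeholders with the named values we pass in
filled <- knit_expand(text = template, student = "Ada", mu = 5, sigma = 2)
cat(filled)
```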
So we got some nice examples based on giving a kind of homework assignment where you want to inject the name of the student into the document, as well as a customized solution, or I should say, a customized mean and standard deviation for the question. And then the example talks about a workflow based on the whisker package, which, for those of you familiar with the glue package, will make you feel right at home. I've played with it a bit. It is pretty nice. And a new one I learned: once you generate those lines of dynamically generated content, if you wanna write that out, there's this newer package, to me anyway, called brio, b-r-i-o, which apparently is kinda like a souped-up version of what base R can do for reading and writing arbitrary lines of text. So I'll have to put that in my toolbox for later. But the example basically has a custom function called make_assignment that wraps this together. And sure enough, we get a nice little data frame out that has, for each student, the custom mean and SD parameters and then the file name that it's writing to. And once you run that, you get a markdown file with everything filled in dynamically.
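And a rough sketch of that whisker-plus-brio workflow, loosely modeled on the post's homework idea; the template, the make_assignment() helper, and the student values here are illustrative stand-ins rather than the authors' actual code.

```r
library(whisker)
library(brio)
library(purrr)

# Mustache-style template; {{name}}, {{mu}}, {{sd}} are the placeholders
template <- "## Homework for {{name}}

Simulate data from a Normal({{mu}}, {{sd}}) distribution and report the mean.
"

# One row of parameters per student (illustrative values)
students <- data.frame(
  name = c("Ada", "Grace"),
  mu   = c(5, 10),
  sd   = c(2, 3),
  file = c("ada.md", "grace.md")
)

# Hypothetical helper in the spirit of the post's make_assignment()
make_assignment <- function(name, mu, sd, file) {
  filled <- whisker.render(template, data = list(name = name, mu = mu, sd = sd))
  write_lines(filled, file)  # brio writes the rendered lines out as UTF-8
  invisible(file)
}

# Generate one markdown file per student
pwalk(students, make_assignment)
```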
Definitely a great way to get started, and that is for generating. Now let's talk about parsing. This is where I've wandered into grep nightmares before, parsing stuff in text documents, markdown or not. So that is an approach if you're willing to get your hands kinda dirty and muddy with regex, go for it. Not my cup of tea anymore. I've been down that road before. But there is a new thing in town for parsing these kinds of documents programmatically, and that is changing them into the abstract syntax tree layout, or AST.
This is kinda like how, in markdown, you have these common constructs, say headings, lists, code blocks, and numerous other elements. And you can think of ASTs as a way to give you a kind of nested list structure, where you can then parse that, modify certain elements of it, and write it back out. This is being used in many different ways that I've seen in the community, but the post continues with a real-life example that I did not know happened. There is this great reference that I've seen in the ggplot2 ecosystem in the past called the ggplot2 extensions gallery.
And apparently, somebody hijacked one of the links in that gallery. Not so nice, I'll say. And so what if you were on the receiving end of that? If you were the maintainer of that and you needed to quickly find which links were hijacked, you're parsing out this huge markdown file, or set of markdown files, where these links are scattered. What are the best ways to handle that? So there are many different parsers that do the AST route for markdown, some of which include the tinkr package, authored by Maëlle Salmon herself and maintained by Zhian Kamvar, another coauthor of this post, that translates markdown to XML using the rendering engine called commonmark, which is used quite a bit.
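A minimal sketch of what that tinkr-to-XML step can look like, assuming tinkr and xml2 are installed; the file name is a placeholder, and the namespace-agnostic XPath is our shortcut rather than anything from the post.

```r
library(tinkr)
library(xml2)

# Parse a markdown file into a commonmark-flavoured XML tree
# ("gallery.md" is a placeholder file name)
doc <- to_xml("gallery.md")

# doc$body is an xml2 document; pull out every link node regardless of namespace
links <- xml_find_all(doc$body, ".//*[local-name() = 'link']")

# Inspect the destinations, e.g. to spot a hijacked URL
xml_attr(links, "destination")
```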
XML still gives me the heebie-jeebies sometimes, so it's not my most desired way. But, hey, XML is quite powerful, I do acknowledge that. So send your feedback to me if you have any differences of opinion or if you're successful with XML; I'd love to hear it. There are some new games in town here. One is called the md4r package by Colin Rundel. It is in an experimental state, but it does kinda take that list approach of translating the AST to a nested list, all built in R; it's a development package out there. Pandoc itself will also let you do this, and you can use either its native form of the AST or the JSON format, but then you would have to use Lua filters to be able to parse through all that, or JSON tooling if you're doing JSON. If you're familiar with JSON, you might be able to have some good stuff there that technically doesn't depend on another language.
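For the Pandoc route, a hedged sketch of reading its JSON AST from R, assuming the pandoc CLI is installed on the system; the file name is made up.

```r
library(jsonlite)

# Ask the pandoc CLI for its JSON abstract syntax tree of a markdown file
# ("notes.md" is a placeholder file name)
ast_json <- system2("pandoc",
                    args = c("-f", "markdown", "-t", "json", "notes.md"),
                    stdout = TRUE)

ast <- fromJSON(paste(ast_json, collapse = ""), simplifyVector = FALSE)

# The AST is a nested list: each top-level block has a type ("t") and content ("c")
vapply(ast$blocks, function(b) b$t, character(1))
```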
There are also newer ones out there like parseqmd by Nic Crane, who I also met at the conference, a new package that parses Quarto markdown documents, and it's basically built on JSON as well with jsonlite. There is also the parsermd package, yet another package by Colin Rundel, who seems to be really in tune with this space. That one uses the Boost Spirit library, which I think may be based in C, although I'd have to fact-check that. There was a talk about it at a long-ago rstudio::conf in 2021 that might be helpful as well.
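And a tiny parsermd sketch, assuming the package is installed; the file name is a placeholder and the tibble conversion is our guess at a convenient next step.

```r
library(parsermd)

# Parse an R Markdown document into an abstract syntax tree object
# ("analysis.Rmd" is a placeholder file name)
ast <- parse_rmd("analysis.Rmd")

# Printing shows an outline of headings, chunks, and markdown blocks
ast

# If the tibble method is available, the AST flattens into a data frame
tibble::as_tibble(ast)
```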
There is a lot to digest here because you've got a lot of choice. So in the rest of this post, they talk about what you can use to inform your decision on choosing a parser. And part of that will depend on whether you're gonna import code chunks and do something with them or not. That might switch the game a little bit, because in that case you might benefit from XML as well, since that might handle the code chunks a bit better. But that would vary by use case, of course. The post also talks about that metadata I mentioned at the top that's usually in YAML format, separated with those three horizontal dashes as the front matter.
I definitely have done things in the past to grab that dynamically. Wasn't the prettiest in the world. But if you are using R Markdown, there is a handy function called yaml_front_matter that will help you extract that through one function call. And I say, yes, please. The last thing I wanna do is rewrite that kind of stuff unless I absolutely have to. And there are also other ways to do this too with, say, the yaml package itself. And if it was in a different type of markup format like TOML, t-o-m-l, there are some other packages like RcppTOML, as well as others, to be able to parse that as well.
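A short sketch of that front-matter extraction, assuming rmarkdown and yaml are available; the file name and YAML fields are made up.

```r
# rmarkdown ships a helper that reads just the YAML header of a document
meta <- rmarkdown::yaml_front_matter("report.Rmd")  # placeholder file name
meta$title

# The yaml package covers the general case, e.g. a standalone YAML string
yaml::yaml.load("title: My report\nformat: html")
```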
So I think generating documents, I would say, of the two things, is probably the easier thing to tackle at first. And with those tools we mentioned earlier, I think you can get underway very quickly. Parsing, you've got a few options. Just like everything in R, you've got a few options at your disposal. I think it really depends on the complexity of what you're parsing that's gonna determine which way you go. Again, I probably won't resort to XML-based methods unless I have to, but that could just be me feeling imposter syndrome with some of that stuff, where if I give it a go, maybe I'll feel better about it. But I am intrigued by a lot of these tools that we're seeing in the community. And if I am in this situation in the future, I'm gonna use this post as a great starting point for that markdown generating and parsing journey.
But, yeah, markdown all the time, no matter which way you slice it. Mike, what did you think of this?
[00:17:31] Mike Thomas:
Yeah. Markdown is the English language for us programmers. I like how you phrased that, Eric. I had a client today who had developed a statistical model in R and did the model documentation in Microsoft Word. Oh, boy. And when we went to do some counts, just simple counts of the data that they provided us, that they used to build the model, those counts didn't match what was in the table in the Word document. And I asked them if they had ever heard of R Markdown or Quarto, and they said no. And a single tear fell from my eye. So Wow. We still have work to do, apparently, or this person lives under a rock. But, you know, I think this article right here is not necessarily about small changes that we're making. These are about making significant changes that probably need to be made in a number of places, where it makes sense to use a programmatic approach. Although, I would say in the small-case situation, I love VS Code's search and find-replace-all, which probably now exists in Positron as well. Very useful tool. Sure does. Seen that, and needed that replace-all this week, baby.
If you need to find an instance of something across your whole entire repository and replace all instances of that with something else, it's a pretty sweet tool to check out. I was definitely intrigued by the tinkr package here that I believe Maëlle Salmon kind of dreamed up, as it's stated in the blog post, for going from markdown to XML via xml2. Eric, I'm happy to be the good cop to your bad cop with respect to our feelings around XML. I'm here for it. I've been doing a ton of it lately. This all stems back to that package I was working on over the weekend, where a REST API sends back an XBRL, x-b-r-l, file, which I had never heard of before. I guess it's sort of like Excel, but for the finance world. You needed some sort of an XBRL reader, and I was not about to do that. I just wanted it to end up in a data frame that we can do our cool R stuff with.
So fortunately, I was able to convert that XBRL to XML and then parse that into a beautiful data frame. At first, I was quite intimidated. And then once I figured out some of the wizardry that the xml2 package can do for us, that, with a little bit of purrr, made it fairly straightforward to get that XML data into a nice data frame. So, initially intimidated, I'm here to tell you, Eric, it's not that scary.
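Not Mike's actual XBRL code, but a hedged sketch of the general xml2-plus-purrr pattern he's describing; the XML snippet and field names are invented for illustration.

```r
library(xml2)
library(purrr)

# A made-up XML snippet standing in for the converted XBRL payload
doc <- read_xml('
  <facts>
    <fact name="revenue" period="2024">100</fact>
    <fact name="expenses" period="2024">60</fact>
  </facts>')

# Grab every <fact> node, then map each one to a single-row data frame
facts <- xml_find_all(doc, ".//fact")

df <- map_dfr(facts, function(node) {
  data.frame(
    name   = xml_attr(node, "name"),
    period = xml_attr(node, "period"),
    value  = as.numeric(xml_text(node))
  )
})

df
```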
[00:20:25] Eric Nantz:
Okay. I feel safer now. Okay. That'll be my next journey.
[00:20:30] Mike Thomas:
I did think that it was kind of important to note, especially as you go the other direction of trying to do this sort of round trip, where we're going from maybe markdown to some other format like XML and then back to markdown. There's a whole section here called the impossibility of a perfect round trip. And the authors give the example of using that tinkr package to go maybe from XML to markdown, where list items in your markdown that were previously represented with an asterisk would then be represented with a hyphen instead.
And I think the asterisk is kind of old school anyways for list items. Eric, I'm not sure if you use asterisks or hyphens in our show notes here. I have always used asterisks, but today, I have switched to hyphens. I'm taking a stand. We have a full mix in our show notes, unfortunately. Yes. 100%. But this is all to say that there are probably some gotchas when you are doing this sort of round trip, going from some different representation back to markdown, in situations particularly where markdown may have multiple ways to do the same thing. Right?
The author of the package that's going to go that direction for you will have to make a choice about which to use. So I thought that was interesting. And the thing I really love about this blog post is that it's not just a showcase of a single tool. It really is giving us this whole entire world of possible tooling that you could use to accomplish and solve these different types of use cases that we're talking about here, and weighing their pros and cons and benefits, which I think is extremely helpful, because I'm sure that everyone in this situation has a different sort of nuanced use case that they're trying to solve. And one tool may work better than another, or they may have more comfort with, you know, HTML versus XML representations of their markdown information. So it may be helpful here to know the different possibilities that are out there based upon your preferences.
[00:22:39] Eric Nantz:
Yeah. And I do think, certainly, the parsing side of markdown is becoming even more important to me lately. Like at every conference this year, a big theme is building custom AI tools on top of things. We have internal markdown-based documents at the day job that I'd love to be able to parse and ingest very quickly so that I can use them for multiple sets of purposes. Maybe that feeds into RAG down the road or whatnot, and there may be cases where I have to get to a lower level of the content versus just pointing it to a directory of markdown files. So I'm definitely intrigued by the different possibilities of what we can do here, and there is, like I said, a lot of choice involved, and there may not be one perfect way, but I think with the combination of tooling here, you can get quite far.
And markdown is not stopping anytime soon. Who knows? Maybe by the time I'm, like, 80 years old or something, markdown will still be around. At least I hope so. You know, Mike, learning new programming languages can already be difficult enough, especially depending on what background you have. Heck, learning new spoken languages in and of itself can be quite difficult as well. I can speak from experience, because I have been slowly trying to learn Mandarin Chinese, because my wife is from China. I can say the basic phrases, but as a native English speaker, oh, goodness, it is just a different world, and it just makes you realize how illogical the English language is compared to other languages that have a lot more structure to them. That's just my opinion. You know, hot take anyway.
Well, I'm not the only one learning a new language, albeit I'm not learning it that well. Our next highlight comes to us from the aforementioned curator of the issue, Jonathan Carroll, because he is on a journey to learn Japanese as another language. And along the way, he has decided to take matters into his own hands and not just create an R package to help with some of the written side of the language, but actually vibe code it along the way. So this is always fascinating, to see how far people can go here. So let's learn about Jon's journey. As I mentioned, he's been learning Japanese. I guess his daughter has learned it in high school, so he wanted to tag along for the ride.
And there are, of course, some external services or resources that help with language, such as Duolingo, which I think has been good in his opinion. Another one that he was recommended to try is called WaniKani. They all have slightly different takes on how to use repetition and how they link words together. In practice, they gamify things a little bit: you've got a leaderboard to level up on your tiers of skill. And speaking and listening is one thing; it's a whole different ballgame when it comes to writing these types of languages, because it is very easy for two characters to look nearly the same.
This is what he calls a logographic-based writing system, which is not so much based on the typical alphabet like we have in English and other languages. The characters look like drawings almost; they're hand strokes forming different symbols, and just one minor difference can mean a huge difference in how the word associated with that character is spoken. So I can definitely sympathize with that. I cannot write Chinese worth a lick unless I get a lot of help. But Jonathan has definitely been on that same train with the Japanese language as well.
So he thought, boy, it would be nice to kind of synthesize some of this information together. And he looked at the recent resources online and learned about a technique from a person named Alex Chan, who had a post about storing language vocabulary as an actual graph, almost like a knowledge graph of sorts, where you link together certain characters that share a similar component and may look similar. And you can do that for Chinese, as the original author did; he wanted to try that for Japanese. So how does he go about building it? Obviously, you could just throw your LLM at it and, every time you need it, ask the LLM to build it, but he wanted to go deeper into this. He was hoping there might be a way to make a package out of this and not just turn to the LLM for one-off requests.
No R package exists that does this. He did stumble upon an older resource from seven years ago that tapped into the API of the WaniKani service. So that's interesting. He was able to take the inspiration from that and start assembling the data that would go into building these kinds of custom graphs of the language characters. So he got an API key, went to the API docs, and was able to use the httr2 package, which I use quite a bit for my web API needs these days, and was able to grab the endpoint as a big old set of JSON blobs, which typically is what we deal with with API data.
And some metadata was associated with that, and he's got an example in the post of what that code looked like: a good hybrid of httr2 as well as purrr mapping and all that. And you can get that into a nice tidy data frame. Of course, he could have stopped there, but no, no, no, he wants to go further on this. He's been watching the space of tools like Claude Code and others. He even saw this interesting video about how it was linked to Obsidian, which is a very popular markdown-based note-taking app. There's markdown again. And it was able to ingest the markdown that was in these Obsidian notebooks and do some really interesting things with it.
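A rough sketch of that httr2-plus-purrr pattern against the WaniKani v2 API; the endpoint, auth header, and field names here are assumptions based on the public docs, not Jon's actual code.

```r
library(httr2)
library(purrr)

# Assumed endpoint and bearer-token auth for the WaniKani v2 API
resp <- request("https://api.wanikani.com/v2/subjects") |>
  req_headers(Authorization = paste("Bearer", Sys.getenv("WANIKANI_API_KEY"))) |>
  req_perform()

subjects <- resp_body_json(resp)

# Flatten the JSON blobs into a tidy data frame; the field names are assumptions
map_dfr(subjects$data, function(x) {
  data.frame(
    id    = x$id,
    level = x$data$level %||% NA_integer_,
    slug  = x$data$slug  %||% NA_character_
  )
})
```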
So he thought, okay, this is interesting. What if I take inspiration from that and start to vibe code a package with Claude Code, just to see how far I can take building this new package to, again, generate these custom graphs of the language symbols? And this is where, here we go, right? So he booted it up and started to think of a plan of what to accomplish here and what it needed to do to build the package: first of which, query that service API, then figure out which functions to use to start assembling this, and then write the documentation on its own, as well as the unit tests on its own.
And then he just kind of watched for a bit. I've seen these in action from other people: it just kind of starts with a checklist of things to go through, and it goes one by one. Some steps take longer than others, and it can be interesting to watch. He was using, I believe, something called whisper to accomplish some of this too; I haven't looked too carefully. But once it was done, it had built the package with modern approaches to the scraping with httr2, as I mentioned earlier, even ran a devtools check on its own, and it mocks some tests using httptest. So that's pretty nice. That's a pretty modern way of doing it.
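For the mocked tests, a hedged sketch using httptest2, the httr2-flavored mocking package, which is our assumption about what was generated; get_subjects() is a hypothetical package function, and the fixture directory name is illustrative.

```r
library(testthat)
library(httptest2)

# The first run records API responses into tests/testthat/wanikani/ as fixtures;
# later runs replay them so the tests never hit the live API
with_mock_dir("wanikani", {
  test_that("subjects endpoint returns a data frame", {
    result <- get_subjects()  # hypothetical package function
    expect_s3_class(result, "data.frame")
  })
})
```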
And then he would check it over. Sounds like it wasn't too bad. And then he instructed it to do various commit messages, and it wanted to add itself as a coauthor. Apparently, it's, like, co-authored by Claude. That was pretty interesting. But that's a good thing, right? As long as you wanna be transparent on what it's actually doing here. And then once he had the package, he gives us some of that JSON output. Seems like it worked pretty well, but there were some issues along the way. He had to tweak things a little bit to get not just the IDs of these characters but also the actual symbols themselves. And so he was able to interrupt Claude Code for a bit, have it change course a little, and then he was able to get the actual symbols out in various ways.
And once it all wrapped up, he had a package with 133-ish tests. That's pretty impressive, all while ingesting the API. And he's got the package on GitHub; it's all linked right there in the blog post. In fact, he even goes a step further, and I believe he's also put together a Shiny app that serves as a front end to this, so he's got that linked too. Again, lots of additional vibe coding to get all that squared away. But in the end, this is one of the best ways of learning, right? To learn by doing. The caveats I resonate with at the end of this post are, first, that you certainly may wanna think twice about this approach being used in very high-risk situations.
He jokes that you probably don't wanna do this to connect to your banking app or anything to do with financial stuff. And then the other thing to note is that things like Claude Code don't come as a free lunch. Right? You're gonna have to pay some fee for the back and forth with the AI service. So you gotta keep that in mind. Hopefully, it's not too extensive, but you never know. You might rack up $20 or perhaps more depending on how far you take it. So just be watchful of that. And, again, he definitely has the mindset that this is a good tool for assisting your development.
I'm not losing my job anytime soon over an agentic AI, I think, but I definitely think it can help and speed up the process of going back and forth with the different dev tasks of package development. And he's not pretending this package itself is gonna be used a lot. It does what he wants it to do, but this was a learning opportunity for him. And, yeah, I've definitely seen a lot of people start to use Claude Code for purposes like this, and for the really novel need that Jon had, it definitely got him there. And he probably could have gotten there himself, but it probably would have taken more time. So, certainly, there are benefits and trade-offs in how much time you wanna invest versus the cost of using this, and hopefully not having to refactor a lot of code after a vibe coding session.
But it is a sign of the times, right? So happy to see Jon learning out loud here, and hopefully it's supercharging his Japanese learning journey.
[00:34:20] Mike Thomas:
This is really interesting, and I think this whole blog post, for anybody that's trying to weigh how much they should care about this whole vibe coding thing, or perhaps how they could, should, or shouldn't incorporate AI and large language models into their software development workflows, is just a fantastic end-to-end blog post that's very transparent about what worked well, what didn't work well, and sort of why it worked the way that it did. So, Eric, your journey with Mandarin and its differences with the English language resonates very heavily with me, with a three-year-old who we're working on reading and writing with. It's like you teach them one rule in English, and the next word you run into breaks that rule.
There's so many edge cases, Mike. Yep. Yep. Which is very frustrating, and I guess we take it for granted. But on some of the busy days at the day job, I'm sure you, Eric, feel like I'm over here doing vibe podcasting. But I did wanna let the listeners know that they should not worry: this podcast is not Google's NotebookLM, at least not yet. No. I was taking a look at a Python package that wraps an API, and I was trying to make a similar R package over the weekend that wraps the same API. And the developer, who I'm very impressed with, made the initial version of the Python package, and the API service recently updated its protocol.
So he had to go in and essentially refactor the whole package. And I could see in the commits that Claude did the whole thing pretty much: wrote the tests, refactored the tests when things weren't working. And I thought it was incredible. And I think, to Jonathan's point in this blog post, it's similar for him: it works, but is it the best, most concise way to write the code to do what it's trying to accomplish? I'm not sure. That API endpoint I'm talking about provided data in a couple of different possible formats depending on your request. One was semicolon-delimited, and the other option was XML.
And it was using polars and could have just specified, like, a read_delim with a semicolon, but Claude decided to ask for the XML and write, like, 50 or 100 lines of code to parse that XML into a data frame, as opposed to when I was doing it in R, where it was two lines of code to be able to do that. So I think that's just an example, and I think that plays out in a lot of what I've seen in terms of me asking for individual responses from ChatGPT or Claude in my day-to-day coding journey: it will give me code that tends to be a lot more verbose than what I would normally write.
And I think, again, if you don't care about the part in the middle, the code that's actually executing the thing, and you only care about the end goal, maybe a lot of times you'll be okay just kind of blindly vibe coding. And, again, it all comes down to risk, as you mentioned, Eric, and as Jonathan mentions here. So weigh that into your decision, but these tools are also getting better day in and day out. One of the things that I also struggle with on this topic is ensuring that I'm leveraging the latest and greatest from the packages that I'm using, when these LLMs are trained on older data. Recently, I was leveraging tidyr's pivot_wider, and Claude was giving me an argument that no longer existed.
It had been long since deprecated, and it was asking me to use it. And I looked in the docs, and that argument didn't exist anymore. So I run into a lot of situations like that as well, where, as we're developing software, we're trying to make sure that we're using the latest and greatest, as opposed to what I think, philosophically, these LLMs do, which is just regurgitate the past so that we don't make a lot of progress forward. That's probably a big way to end this section of the podcast. But, long story short, I really appreciated Jonathan's fully transparent walkthrough of his journey on this package.
[00:39:01] Eric Nantz:
Me as well, and I am early in this process too, but I actually had a very, I won't say stressful, situation. Back to my shinystate presentation: the night before, I had an example demo app, but I was just looking at it and I was like, no, I don't like it. I wanted to do something that brought some fun to it, but was also an easy example. So, yes, I booted up Positron Assistant with the Anthropic agent mode, and I basically vibe coded a retro-eighties-video-game-looking app for the demonstrations. At 1 AM. It was awesome. It was what I needed, because otherwise I could've spent probably a week doing that, knowing my OCD around developing Shiny UIs, but it did the job, man. It did the job. You did that the night before?
[00:39:54] Mike Thomas:
Yes. The night before. Oh my goodness. Go watch Eric's presentation, please. That will blow your mind.
[00:40:01] Eric Nantz:
It did the job. I mean, would it be exactly the way I would have written it? No. But it was good enough, and that's all I needed. I just needed good enough. It was just for the slides, just some prerecorded demos with OBS. And once I got that app going, I was on to the next part. But it literally did save my behind when I had that, quite literally, down-to-the-wire change in my demo direction. And that kinda made me a believer that, in the right situation, this can definitely be a big help. But as you said, Mike, cautions abound for certain situations. In the end, though, it can be a nice tool in your toolbox.
[00:40:42] Mike Thomas:
It's a heck of a Claude sales pitch.
[00:40:45] Eric Nantz:
Where's the podcast revenue coming in, man? I don't know. And rounding out our highlights today, it does feel like I'm going back in time on this one. I don't mean that in a disparaging way, but back in graduate school, when I was a TA in our stats department, we indeed would be part of the group that worked with our faculty to create dynamically generated quizzes and exams. I still remember making these in Perl and LaTeX code, of all things, so let's not get into those tangents. But for this next highlight, this last one I should say, if you're in a situation where you need to dynamically generate, say, statistics-based exams, and you're confined to certain, I'll say, older formats, this may be just for you.
This actually is a blog post from the R/exams project, and I think we may have mentioned them very briefly in previous highlights. Holy smokes, this thing is really comprehensive. If you are in this space of teaching statistical knowledge or whatever else, this is a suite of packages that are literally meant for producing and parsing written exams in many different formats, one of those being multiple-choice format. And, unfortunately, many of these are still being done on paper, with the students filling them out and then turning them in. And there can be some issues when you scan those results in, some really gnarly issues that this blog post highlights. They call it quality control for scanned multiple-choice exams.
And I don't pretend to know all the internals of how the R/exams tooling works in the R ecosystem, but the way they've conducted this and the way they have a solution to tackle some of these gnarly problems is definitely intriguing to me. So there are a couple of demos here. One of which is an exam that has some pretty typical problems: it's got five single-choice answers, and then somehow, when it was scanned, it was rotated upside down and certain things have been marked up a little oddly. And some of these choices actually have multiple check marks in them when there should have been a single-choice answer.
So the suite of packages that they have for R/exams has a way to scan it and rotate it the right way, and then to literally run a wrapper function called nops_fix, where it'll give you kind of a visual of what needs to be corrected and then let you enter, either at the prompt, what the real value should be, or click through a GUI-like interface to fix all that. So, yeah, it's a kind of manual effort, but their clever trick is that when they import it, they actually make a zip file of the different problems as images. And then once you fix a problem, it dynamically makes new metadata, I believe in JSON format or whatnot.
And then it ingests that back into the parsed object that importing this exam scan file generates. And then you can unlink the temporary files and get this all cleaned up. Boy, oh boy, where was this twenty-five years ago when I needed it? But, nonetheless, better late than never. And then their second demo has a lot of problems, not just the rotating issue: they've got many other corrupted markers where it was supposed to be, again, one choice among multiple choices, and some of the scan results are even off the page, so to speak. Really gnarly stuff.
And so the package again lets you fix these dynamically, but then you get a nice little HTML-based-looking GUI on this one, where it actually thinks there are more questions than there should be, but you can at least select only the ones that are supposed to be selected, and then it'll write back out the actual number of questions. And you can review all of this as you go. It keeps everything that you work on in the workspace so you can easily page through it with your human eye, so to speak, before you make the final call on fixing things up. But, man, I can just imagine how much time this pipeline saves, not just for creating these exams, but also for ingesting them when you're dealing with these, I hate to say legacy, but pretty legacy formats, in terms of the way education is these days. I'm just really impressed with what they've accomplished here. And like I said, if you're in the education space and you wanna start learning from what they're doing, they've got a whole bunch of blog posts about what they're doing with this suite of packages, even some integrations with AI tooling as well to help along the way. So I definitely learned something new that the me from many years ago would have been all over for my exam generation.
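A heavily hedged sketch of what that scan-fix-evaluate loop might look like with the exams package; nops_scan() and nops_eval() are functions I believe the package ships, nops_fix() is the helper named in the post, and all arguments shown are illustrative guesses rather than the documented signatures.

```r
library(exams)

# Scan the photographed answer sheets; nops_scan() bundles results and images
# into a ZIP file (the images path here is illustrative)
scans <- nops_scan(images = Sys.glob("scans/*.png"))

# nops_fix() is the repair helper described in the post for rotated pages,
# multiple ticks, and other gnarly scans; the argument shown is an assumption,
# so check the R/exams documentation for the real interface
nops_fix(scans)

# With corrected scans in hand, evaluation would proceed roughly as usual:
# nops_eval(register = "register.csv", solutions = "solutions.rds", scans = scans)
```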
[00:46:18] Mike Thomas:
This is absolutely incredible. It does sort of remind me of the memes out there where, say, men will do anything but fix themselves, which here is like: we'll do anything in R to parse information and allow us to programmatically interact with scanned exam documents, instead of just not creating scanned PDF exams in the first place. But c'est la vie, that's just the way it goes. One area that I think is really, really interesting, and I'd love to dig in further to see exactly how they did it, is these functions that pop up an interactive session. Right? Whether that be some sort of HTML widget or whatever it is, some HTML interface that allows you to take a look at the PDF and maybe interact with it as well. We have a ton of use cases lately where we are using LLMs to parse data out of PDFs to provide standardized answers for downstream reporting or downstream decision making, things like that.
But our users are wanting some sort of audit step in the middle where they can see the section of the PDF that the LLM used to get its answer, and that's an area that we struggle with. And I feel like if I dove deeper into the internals of the packages that are being used here, it might give me some big pointers on how to solve that problem, because I think that's probably a problem that exists for a lot of folks who are trying to merge the worlds of large language models and PDFs these days. So this is very, very interesting to me. I'm excited to look a little bit further underneath the hood. But if you are in this space at all in education, needing to administer exams and then review them and grade them, still doing that very manually, and looking for tooling that can help, I mean, this is gonna be mind-blowing if you haven't run into it before.
[00:48:25] Eric Nantz:
Yeah. I think this does just about everything you need in this space, so I think it's worth checking out for sure. And you know I usually try to give credit to the authors, but this blog post, and this project in general, has a lot of people behind it, so I have a link to their contact information in the show notes if you want to get in touch with them, and they're also on social media as well. So we'll have a link to all that. But, yeah, really fantastic effort there, just like the rest of this is. Yep. Go ahead. Yeah. It looks like maybe Achim Zeileis is tagged at the bottom of this. Oh, it is. Okay. Good eye. So, Achim, fantastic job to you and the rest of the team on this fantastic pipeline.
And the rest of the issue is fantastic as well with R Weekly, and it never stops, even when we were at our lull for a bit. But it is back up, we're back up and running, and luckily not a moment too soon. There are lots of great resources to look at on top of these highlights that we talked about. We are a bit pressed for time, so I will have to skip our additional finds, but, again, you'll see a whole bunch that Jon has curated here that you can look at at your leisure. And, also, we love hearing from you. Again, shout out to all the attendees at posit::conf who said some nice words about our humble little podcast and wondered when it was coming back. And I said, it was coming back. Trust us. So thank you to all of you that said hi to both Mike and me. It was gratifying to connect with you in person. We always love hearing from all of you. But since we're not at the conference anymore, if you wanna get in touch with us, there are a few ways of doing that. We have the contact form directly linked in the episode show notes; that's still up and running, last time I checked. We also have the project itself that you can help out with via pull requests.
Everything is at rweekly.org. Click the little GitHub icon in the upper right, and you'll get directly to the pull request template so the curator of the next issue can benefit from what you found. And a little birdie tells me that might be me next, so I need all the help I can get. But last but not least, you can find us on social media. I am @rpodcast.bsky.social on Bluesky. We'll try to post more often now that I'm getting through some of this crazy August and September conference and presentation stuff that I've been involved with. You can also find me on Mastodon at @[email protected], and I'm on LinkedIn; just search my name, and you're gonna see me there. Mike, where can the listeners find you? You can find me on Bluesky at @mike-thomas.bsky.social.
[00:50:53] Mike Thomas:
You probably won't find me on LinkedIn if you search my name. But if you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to lately.
[00:51:03] Eric Nantz:
Awesome stuff. And like I said, it feels comfy again being able to do this, and hopefully we can get back to a regular cadence, but you and I have both been in the middle of a lot of stuff going on. So we'll keep trudging along. But as always, we thank you so much for listening out there, wherever you are around the world. We really appreciate it, and we look forward to hopefully being back with a new episode of R Weekly Highlights next week.
Eric & Mike take over posit::conf(2025)!
Episode Wrapup