Context is king in a trifecta of R packages harnessing LLMs as your virtual assistant for package development and data science. Plus, the world (of data) is at your fingertips for data exploration and for sharing your insights with the innovative closeread Quarto extension.
Episode Links
- This week's curator: Jonathan Kitt - @jonathankitt.bsky.social (Bluesky)
- Three experiments in LLM code assist with RStudio and Positron
- Gapminder: how has the world changed?
- Downloading datasets from Our World in Data in R
- Entire issue available at rweekly.org/2025-W06
- From hours to minutes: accelerating your tidymodels code https://youtu.be/pTMiDHFIiPQ?si=4qZC3_NLUVhD5hAa
- Efficient Machine Learning with R https://emlwr.org/
- Closeread: Bringing scrollytelling to Quarto https://youtu.be/KqLxy66B3lQ?si=w7DIB4QhYk2u6cLG
- Closeread Posit contest submissions https://forum.posit.co/tag/closeread-prize-2024
- Extract Information From Images and PDFs With R & LLMs https://3mw.albert-rapp.de/p/extract-information-from-images-and-pdfs-with-r-llms
- USGS: Mapping water insecurity in R with tidycensus https://waterdata.usgs.gov/blog/acs-maps/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Costa del Sol DANCE - Final Fantasy VII - posu yan - https://ocremix.org/remix/OCR00095
- Aerobotics - Mega Man 8 - Just Coffee - https://ocremix.org/remix/OCR03323
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 194 of the R Weekly Highlights podcast. If you're new, this is the weekly podcast where we talk about the excellent highlights and additional resources that are shared every single week at rweekly.org. My name is Eric Nantz, and I'm delighted you joined us from wherever you are around the world. And as always, even in this February, I'm joined by my same cohost, and that's Mike Thomas. Mike, how are you doing today?
[00:00:29] Mike Thomas:
Doing pretty well, Eric. It was kind of a long January here in The US, and it seems like we're in for an even longer February. But happy to be on the highlights today, and, may your datasets continue to be available.
[00:00:44] Eric Nantz:
Let's certainly hope so. I will say on Saturday, I had a good little diversion from all this stuff happening. I was with about 70,000 very enthusiastic fans at WWE's Royal Rumble right here in the Midwest, and that was a fun time. My voice has finally come back. Lots of fun surprises, some not so fun surprises, but that's why we go to these things, so we can voice our pleasure or displeasure depending on the storyline. But it was an awesome time. I've never been to one of what the WWE calls their, quote unquote, premium events (they used to be called pay-per-views) at a stadium as big as our Lucas Oil Stadium here in Indianapolis. So I had a great time, and, yeah, I'm slowly coming back to the real world now, but it was well worth the price of admission.
[00:01:39] Mike Thomas:
That is super cool. That must have been an awesome experience. Is Lucas Oil Stadium a dome?
[00:01:45] Eric Nantz:
It is. Yep. That's the home of the Indianapolis Colts. It's been around for about fifteen years, I believe, now. Last time I was there, I was at a Final Four, our NCAA basketball tournament, way back when, where we saw one of my favorite college basketball teams, Michigan State. Unfortunately, we lost to Butler that year, but it was a good time nonetheless. So, we won't be laying the smackdown on R for this. We're gonna put over R, as they say in the business, and that is an emphatic yeet, if you know what I mean. Yeet.
But speaking of enthusiastic, I am very excited that this week's issue is the very first issue curated by the newest member of the R Weekly team, Jonathan Kitt. Welcome, Jonathan. We are so happy to have you on board the team. And as always, just like all of us on our first time of curation, it's a lot to learn, but he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and suggestions. And Jonathan did a spectacular job with his first issue, so we're gonna dive right into it with arguably still one of the hottest topics in the world of data science and elsewhere in tech, and that is how in the world can we leverage the newer large language models, especially in our R, data science, and development workflows.
We have sung the praises of recent advancements on the R side of things, i.e., with Hadley Wickham's ellmer package, which I've had some experience with. And now we're starting to see kind of an ecosystem start to spin up around this foundational package for setting up those connections to hosted or even self-hosted large language models and APIs. In particular, one of his fellow Posit software engineers, Simon Couch from the tidymodels team, had the pleasure of enrolling in one of Posit's internal AI hackathons that were being held last year. And he learned about the ellmer package and shinychat, along with others, for the first time.
And he saw tremendous potential in how this can be used across different realms of his workflows. Case in point, it is about twice a year that the Posit team, or the tidyverse team, I should say, undergoes spring cleaning of their code base. Now what does this really mean? Well, you can think of it a lot of ways, but in short, it may be updating some code that the packages are using. Maybe it's using an outdated dependency or a deprecated function from another package, and here comes the exercise of making sure that's up to date with, say, the newest blessed versions of said function and whatnot, such as the cli package having a more robust version of the abort functionality when you wanna throw an error in your function, as opposed to what rlang was exposing in years before.
It's one thing if you only have a few files to replace, right, with that stop syntax or abort syntax from rlang. Imagine if you have hundreds of those instances. And imagine if it's not always so straightforward as that find and replace that you might do in an IDE such as RStudio or Positron. Well, that's where Simon, as a result of participating in this hackathon, created a prototype package called clipal, which will let you highlight certain sections of your R script in RStudio and then run an addin function call to convert that to a newer syntax, in this case going from the rlang abort syntax to the cli package's version of that.
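To make that kind of conversion concrete, here is a minimal, hand-written sketch of the before and after that a tool like clipal or pal is asked to produce. The function and message are made up for illustration and are not from Simon's post.

```r
# Hypothetical helper, purely for illustration
check_n <- function(n) {
  if (!is.numeric(n) || n <= 0) {
    # Before: old-style error signalling with rlang and paste0()
    rlang::abort(paste0("`n` must be a positive integer, not ", n, "."))
  }
  invisible(n)
}

check_n_cli <- function(n) {
  if (!is.numeric(n) || n <= 0) {
    # After: the cli equivalent, with inline markup and glue-style interpolation
    cli::cli_abort("{.arg n} must be a positive integer, not {n}.")
  }
  invisible(n)
}
```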
The proof of concept worked great, but it was obviously a very specific case. Yet he saw tremendous potential here, so much so that he has spun up not one, not two, but three new packages, all wrapping the ellmer functionality combined with some interesting integrations with the rstudioapi package to give that within-editor type of context to these different packages. So I'll lead off Simon's summary here on where the state is on each of these packages, with the first being the true successor to clipal, which is called pal. This comes with built-in prompts, if you will, that are tailored to the developer, i.e., the package developer.
If you think of the more common things that we do in package development, it's, you know, building unit tests or writing our roxygen2 documentation or having robust messaging via the cli package. Those are just a few things. But pal was constructed to have that additional context of the package, i.e., the functions you're developing in that package. So you could, say, highlight a snippet and say give me the roxygen2 documentation already filled out for that function that you're highlighting. That's just one example. You could also, like I said, build in cli calls, or convert other calls to those cli calls if you wanna do aborts or messages or warnings and whatnot.
And that already has saved him an immense amount of time with his package development, especially in the spring cleaning exercise. He does have plans to put pal on CRAN in the coming weeks, but he saw tremendous potential here. That's not all, Mike. He didn't wanna stop there with pal, because there are some other interesting use cases that may not always fit in that specific package development workflow or the type of assumptions
[00:07:59] Mike Thomas:
that pal gives us. So why don't you walk us through those? Yeah. There's two more that I'll walk through, and one is more on the package development side, and then one is more on the analysis side for, you know, day-to-day R users and not necessarily package developers. So the first of which is called ensure, e n s u r e. And one of the interesting things about ensure is that it actually does a couple of different things that pal does not do. Pal sort of assumes that all of the context that you need is in the selection and the prompt that you provide it. But when we think about, in the example that Simon gives here, writing unit tests, it's actually really important to have additional pieces of context that may not be in just the single file that you're looking at, the prompt that you're writing, or, you know, the highlighted selection that you've chosen.
You may actually need to have access to package datasets, right, that you'll need to, you know, include in that unit test, that maybe aren't necessarily in the script or the snippet of code that you're focusing on at the moment. So ensure, you know, goes beyond the context that you have highlighted or that is showing on screen and actually sort of looks at the larger universe, I believe, of all of these scripts and items that are included in your package. And it looks like, you know, unit testing here is probably the biggest use case for ensure, in that you can leverage a particular function, like one in a .R file within your R directory.
And if you want to scaffold or really create, I guess, a unit test for that, it's as easy, I believe, as, you know, highlighting the lines of code that you're looking to write a unit test for. And just a hotkey shortcut will actually spin up a brand new test-whatever testthat file. It'll stick that file in the appropriate location under, you know, tests/testthat for those of us that are R package developers out there. And it will start to write those unit tests on screen for you in that testthat file. And there's a nice little GIF here that shows sort of the user experience. And it's pretty incredible that we have the ability to do that, and it looks really, really cool. So I think that's, you know, really the main goal of the ensure package.
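As a rough illustration of the kind of file ensure scaffolds, here is a hand-written testthat sketch for the hypothetical check_n_cli() function from earlier; it is not ensure's actual output, just the shape of what such a scaffold looks like.

```r
# tests/testthat/test-check_n.R
library(testthat)

test_that("check_n_cli() accepts positive numbers", {
  expect_no_error(check_n_cli(3))
})

test_that("check_n_cli() rejects non-positive input", {
  expect_error(check_n_cli(0), "positive integer")
  expect_error(check_n_cli(-1), "positive integer")
})
```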
Then the last one I wanna touch on is called gander. And again, I think this one is a little bit more, you know, day-to-day data analysis friendly. The functionality here is that you are able to highlight a specific snippet of text, or it also looks like Simon mentions that, you know, you can also not highlight anything and it'll actually take a look at all of the code that is in the script that you currently have open. And by pressing, you know, a quick keyboard shortcut, it looks like you can leverage this addin, which will pop up sort of like a modal that will allow you to enter a prompt.
And in this example, you know, there's a dataset on screen. Simon just highlights the name of that dataset. I think it's the Stack Overflow dataset, but it's just like iris or gapminder. And he highlights it, you know, the modal pops up, and he says, you know, create a scatter plot. Right? And all of a sudden, the selection on screen is replaced by ggplot code that's going to create this scatter plot. And he can continue to iterate on the code by saying, you know, jitter the points or make the x axis formatted in dollars, things like that. And it's really, really cool how quickly he is able to create this customized ggplot with formatting, with faceting, all sorts of different things, in a way that is obviously much quicker and more efficient, even if you are having to do some minor tweaks to what the LLM is going to return at the end of the day, than if you were going to just, you know, completely write it from scratch. So, pretty incredible here. There's another GIF that goes along with it demonstrating this. It looks like not only in this pop-up window is there an input for the prompt that you wanna give it, but there is also another option called interface, which I believe allows you to control whether you wanna replace the code that you've highlighted, or, I would imagine, whether you wanna add on to the code that you've highlighted instead of just replacing it, you know, if you wanna create sort of a new line with the output of the LLM. So a really cool couple of packages here that are definitely creative ways to leverage this new large language model technology to provide us with some AI-assisted coding tools. So big thanks to Simon and the team for developing these, and for sort of the creativity that they're having around leveraging these LLMs to help us in our day-to-day workflows.
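For a flavor of what lands in the script, here is a hand-written sketch of the sort of ggplot2 code an assistant like gander might insert after a "create a scatter plot" prompt and a couple of follow-ups. The gapminder dataset stands in for the one Simon uses, so treat the specifics as illustrative.

```r
library(ggplot2)
library(gapminder)  # stand-in dataset for illustration

# Scatter plot with jittered points and a dollar-formatted x axis,
# mirroring the kind of iterative tweaks described above
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_jitter(alpha = 0.6) +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "GDP per capita (log scale)",
    y = "Life expectancy (years)",
    colour = "Continent"
  )
```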
[00:13:18] Eric Nantz:
Yeah. I see immense potential here in the fact that, you know, with these being native R solutions inside our R sessions, they grab the context not just potentially from that snippet highlighted, but from the other files in that given project, whether a package or a data science type project, with the information on the datasets themselves. Like, that is immense value without you having to really be in a separate window or browser with, say, ChatGPT or whatnot, trying to give it all the context you can shake a stick at and hoping that it gets it. Not always the case. So there are a lot of interesting extensions here that, again, are made possible by ellmer. And, you know, like I said, immense potential here. Simon is quick to stress that these are still experimental.
He sees, obviously, some great advantages to this paradigm of literally the bot, if you will, that you're interacting with injecting directly into the R file that you're writing at the time. Again, that's a good thing. It may sometimes be not such a good thing if it is going on a hallucination or something, perhaps, who knows. So my tip here is if you're gonna do this day to day and you're not using version control, you really should, in case it goes completely nuts on you. You don't want that in your commit history, with somebody asking you in code review, what on earth were you thinking there? Oh, it wasn't me. Really? Well, it kinda was, since you were using the bot anyway. But nonetheless, having version control, I think, is a must here. But I do see Simon's point that other frameworks in this space, he mentions it kind of in the middle of the post, are leveraging more general kinds of back ends for interacting with your IDE, or Git-like functionality of showing you the difference between what you had before and what you have now after the AI injection. So you could review that kind of quick before you say, I like it, yep, let's get it in, or not so much, I wanna get that stuff out of here and try again. So I would imagine, and this is Eric putting on his speculation hat here, that with the advancements in Positron and leveraging the more general VS Code extension ecosystem, there might be an even more robust way to do this down the road on that side of it. But the advantage of the rstudioapi package he is leveraging is that, thanks to some shims that have been created on the Positron side, this works in both the classic RStudio and in Positron. And I think that's, again, tremendous value at this early stage for those that are, you know, preferring, say, RStudio as of now over Positron, but still giving flexibility for those who wanna stay on the bleeding edge to leverage this tech as well. So I think there's a lot to watch in this space, and Simon definitely does a tremendous job with these packages at this early stage.
[00:16:31] Mike Thomas:
That's for sure. Yeah. I appreciate sort of the Posit team's attention to UX because, again, I think that's sort of the most important thing here as we bring in, you know, tools that create very different workflows than maybe what we're necessarily used to. I think it's important that we meet developers and data analysts and data scientists, you know, in the best place possible.
[00:16:58] Eric Nantz:
And I mentioned at the outset that Simon is part of the tidymodels ecosystem team. I will put some quick plugs in the show notes because he is, I should say, writing a new book called Efficient Machine Learning with R, which he first announced at the R/Pharma conference last year with an excellent talk. So he's been really knee deep into figuring out the best ways to optimize his development, both from a code writing perspective and from an execution perspective, in the tidymodels ecosystem. So, Simon, I hope you get some sleep, man, because you're doing a lot of awesome work in this space.
[00:17:34] Mike Thomas:
I was thinking the same thing. I don't know how he does it.
[00:17:45] Eric Nantz:
And speaking of someone else that we wonder how on earth they pull this off with the time they have, our next highlight is, you might say, revisiting a very influential dataset that made tremendous waves in the data storytelling and visualization space, but with one of the new Quarto tools to make it happen. And longtime contributor to R Weekly highlights, Nicola Rennie, is back again on the highlights, as she has drafted her first use of the Closeread Quarto extension, applied to Hans Rosling's famous Gapminder dataset visualization.
If you didn't see, or I should say, if you didn't hear our previous year's highlights, we did cover the Closeread Quarto extension that was released by Andrew Bray and James Goldie. In fact, there was a talk about this at the aforementioned posit::conf last year, which we'll link to in the show notes. But Closeread, in a nutshell, gives you a way to have that interactive, what you might call scrollytelling, kind of enhanced web-based reading of a report, visualizations, interactive visualizations. You've seen this crop up from time to time from, say, the New York Times blog that relates to data science. Other reporting companies, or startups out there, have leveraged similar interactive visualizations.
Like, I even saw an article on ESPN of all things that was using this kind of approach. So it's used everywhere now. But now we, as adopters of Quarto, can leverage this without having to reinvent the wheel in terms of all the HTML styling and other fancy enhancements we would have to make. This Closeread extension makes all of it happen free of charge. So what exactly is this tremendous report that Nicola has drafted here? She calls it Gapminder: how has the world changed? And right off the bat, on the cover of this report is basically a replica of the famous animated visualization, plotting GDP, or gross domestic product, per capita against life expectancy on the y axis, with the size of the bubbles representing the population of each country represented there.
So once you start scrolling the page, and again, we're an audio podcast, right, we're gonna do the best we can with this, she walks through those different components. First, the GDP, with some nice, looks like, line plots that are faceted by the different regions of the world, getting more details on gross domestic product, and then getting to how that also is affected by population growth. Again, another key parameter in this life expectancy story, which gets to the life expectancy side of it. And as you're scrolling through it, the plot neatly transitions as you navigate to the new text on the left sidebar.
It is silky smooth, just really, really top notch user experience here. And then she isolates what one of these years looks like. She calls it the world in 2007, showing that four-quadrant view you get when you look at low versus high GDP and low versus high life expectancy. And as she's walking through this plot, she's able to zoom in on each of these quadrants as you're scrolling through it to look at, like I said, these four different areas, and that's leveraging the same visualization. It's just using these clever tricks to isolate these different parts of the plot. Again, silky smooth. This is really, really interesting to see how she walks through those four different areas, and then she closes out with the animation once again that goes through each year from the 1950s all the way to the early 2000s, with, again, links to all of her code, GitHub repository, and whatnot.
But for a first pass at Closeread, this is a top notch product, if I dare say so myself. And boy, oh boy, I am really interested in trying this out. There was actually a Closeread contest that was put together by Posit late last year, and I believe the submissions closed in January, this past month. But if you want to see how others are doing in this space, in addition to Nicola's visualization here, we'll have a link in the show notes to the Posit Community posts that are tagged with this Closeread contest, so you can kinda see what other people are doing in this space. And maybe we'll hear about the winners later on. But this one will have a good chance of winning, if I dare say so myself. So I am super impressed with Closeread here, and Nicola's
[00:22:52] Mike Thomas:
very quick learning of it. Yeah. It's pretty incredible. And I was going through the Quarto document that sort of lives behind this, and it actually seems pretty easy to get up to speed with this scrollytelling concept. It's pretty incredible. I think there are a couple of different specific tags that, you know, allow you to do this, or maybe to do it easily. It looks like there is a .scale-to-fill tag that I believe probably handles a lot of the zoom in, zoom out, sort of the aspect ratio of the plots, or GIFs in Nicola's case, that are being put on screen. Because in her visualization, it's almost like there's this whole left-hand sidebar, right, that has a lot of the context and the narrative text that goes along with the visuals on the right side of the screen.
You know, some of the pretty incredible things that I thought were interesting here: not only was she able to fit a lot of these plots in a nice aspect ratio on the right side of the screen, but there's also actually a section of the scrollytelling visualization where she zooms in, across four different slides if you will, on four different quadrants of the same plot, to tell the story of those quadrants, you know, one being low GDP per capita and low life expectancy, another low GDP per capita and high life expectancy, and the other two as well, vice versa. And it's pretty awesome, I guess, how the visualization sort of nicely slides from one quadrant to the other as you scroll to the next slide, if you will.
So, for any of the data vis folks out there, data journalism folks out there, I imagine that in order to accomplish something like this in the past, it was probably a lot of D3.js type of work, and the end product here, compared to the Quarto code that I'm looking at, is pretty incredible. And it just sort of gives me the idea that a lot of the heavy lifting has been done for us in the ability to create these Quarto-based scrollytelling types of visualizations. So I'm super excited about this.
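For a rough, from-memory sketch of what the Closeread markup can look like (the exact attribute names, such as the scale-by focus effect, should be checked against the extension's documentation): a scrollytelling section is wrapped in a .cr-section div, sticky content gets an id prefixed with cr-, and narrative paragraphs reference that id to trigger transitions. The file paths and title below are illustrative.

```markdown
---
title: "Gapminder: how has the world changed?"
format: closeread-html
---

::::{.cr-section}

Narrative text in the sidebar; mentioning @cr-gdp brings the sticky plot into view.

Scrolling further can zoom toward one quadrant of the same figure. [@cr-gdp]{scale-by="1.5"}

:::{#cr-gdp .scale-to-fill}
![](images/gdp-vs-life-expectancy.png)
:::

::::
```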
[00:25:26] Eric Nantz:
You know, it made me go to the wayback machine a little bit on this. I'm gonna bring Shiny into this because I love to bring Shiny into almost all my conversations. But back in 2020, of all things, I remember I had the good fortune of presenting at the poster session at the conference, and my topic was kinda highlighting the latest innovations in the Shiny community, and I was trying to push for, could we ever have something like a shinyverse or whatnot of these community extensions. And to do this poster, I didn't wanna just do, you know, PowerPoint or anything. Come on now. You know me. But I leveraged our good friend, John Coene.
He had a development package way back in the day called fullPage, which was a way to create kind of a Shiny app that had these scrollytelling-like elements. But I will say he was probably too far ahead of his time on that. I won't say it was that easy to use. And, frankly, he would probably acknowledge that too. Here's my idea. I still have the GitHub repo of this, you know, poster I did. I would love to have my hand at converting that to Closeread and, wait for it, somehow embedding a Shinylive app inside of it. Can it be done?
[00:26:42] Mike Thomas:
I think it can too. I think you'd be breaking some new ground, Eric. But,
[00:26:47] Eric Nantz:
If anybody's up for that challenge, I know it's you. How did I just nerd snipe myself? Like, how does that happen, Mike? You must be hypnotizing me or something without even saying anything. I have no idea.
[00:26:59] Mike Thomas:
Peer pressure.
[00:27:13] Eric Nantz:
Now you may be wondering out there, yeah, the Gapminder data, we are fortunate that we have a great R package that literally gives us this kind of data. So once Nicola has this package loaded, she's able to, you know, create this awesome Closeread scrollytelling type of report. Well, there are many, many other sources of data that can surface similarly important domains such as what we saw in the Gapminder dataset. And you may be wondering, where can I get my hands on some additional data like this so I can do my own, you know, reporting? Maybe with Closeread or Shiny or Quarto, whatever have you. Our last highlight is giving you another terrific resource of data for these kinds of situations.
This last highlight comes to us from Kenneth Tay, who is an applied researcher at LinkedIn, and his latest blog post talks about some recent advancements in this portal called Our World in Data, which I had not seen before this highlight, but it is, I believe, a nonprofit organization whose mission is to create accessible research and data to make progress against the world's largest problems. So you might think of, say, poverty, life expectancy, some of the other issues that the Gapminder dataset highlighted. But they wanna make sure that anybody who has the desire and the skill set to use, say, a language like R, or whatever else, to produce visualizations to really start to summarize and explore these data, has as little friction as possible to access them.
And, yes, you could access their portal. You could download the data manually on their website. But it was earlier in 2024 that this group exposed an API to access these data. So Kenneth, in his blog post here, walks through what it's like to use this new API, particularly what they call the public chart API, because it is the basis for, I believe, some of the interactive visualizations that their web portal is exposing here. But because there is an API now, he brings back a little bit of old school flavor here, the httr, or "hitter", package. That was one of those cases where I'd been spelling it out all this time, but in the httr2 README, Hadley literally says how it's pronounced. So thank you, Hadley. I wish all our package authors would do that.
[00:29:45] Mike Thomas:
In case the baseball player didn't give it away.
[00:29:48] Eric Nantz:
Exactly. So hats off on the new package itself. Back to Kenneth's exploration here: he shows us how, with the old school httr along with a little tidyverse magic and jsonlite loaded into the session, he can work with this API. He needs all three of those because, first, it's one thing to access the data itself, which apparently are exposed as CSV files on the back end, but the API lets you grab these directly. But the metadata comes back in JSON format. So he wants to use jsonlite to help massage some of that too.
So the first exploration of this, and the snippet on the blog post, is looking at the average monthly surface temperature around the world. So once he's got the URL of the dataset, then he assembles the query parameters, which, again, in the world of APIs, you might have some really, really robust documentation. Maybe some other times you have to kind of guess along the way. It's kind of a roll of the dice, isn't it?
[00:30:52] Mike Thomas:
Yeah. I find the latter to be the case more often, especially in professional settings, unfortunately, which seems to make no sense.
[00:31:01] Eric Nantz:
Who would ever think that? But yet, I feel seen when you say that. Yes. Even as of this past week. My goodness. Don't get me started. So luckily for this, there is a healthy mix here, I would say. So he's got some query parameters: the version of the API, the type of CSV to return, which can be the full set or a filtered set, which I'll get to in a little bit, and whether to use long or short column names in the dataset that's returned back. And then, also, he does a similar thing for the metadata. That's another GET request as well, and then he parses that content directly from the JSON format.
So the metadata comes back as a list, because most of the time when you get JSON back, it is basically a big nested list, and that gives some high level information on the dataset being returned. So you basically get, for each variable, a character string of the variable name and the description of that variable. So that's great. Now for the data itself, again, he's setting up similar query parameters. This time, he's gonna demonstrate what it's like to bring in a filtered version of that data right off the bat. And that is where there's a little guessing involved, because he went through the web portal of this, played with the interactive filters that the portal gives him, and looked at the end of the URL. So if you're new to the way requests are made to an API, you might use GET requests when you're wanting to grab something from the API.
More often than not, you'll attach different flags or different variables at the end of the URL, often in, like, key-value type pairs with an ampersand separating the different parameters. So once he explored this web portal, he was able to grok that, oh, yeah, there is a parameter for selecting the country. So I'm gonna, you know, put that in the query parameters and feed it the direct value. And then once he does the GET request on that, and this is important here, the content that's coming back can usually be three different flavors: the raw, you might say binary, representation of that value, the textual value of it, or the parsed JSON or XML version of it.
In this case, it was a text value coming back, because it's literally the CSV content, as if you just had the CSV open in a new file on your computer. That's how the text is coming back. So he feeds that into a read_csv directly. And lo and behold, you've got yourself a tidy dataset as a result of that. And then with that, he just did a simple plot of the time in years versus the surface temperature across the USA, just to show that that's exactly how you would bring that data in. And there's a lot more you can do with this type of data. But, again, it's a good example of, first, going to the documentation where it's available. But then when things maybe aren't as well documented, yeah, nothing beats a little trial and error. Right? Sometimes that's the best bet we get, and that's how he was able to do that filtered dataset pull. But, nonetheless, if you're looking for inspiration, looking at similar data as we covered in the second highlight, but across a wide range of world-specific types of data, I think this portal has a lot of potential.
And, yes, R is your friend. Again, we're grabbing these data from almost any source you can imagine. So a really great blog post, straight to the point. You could take this code and run with it today. And, in fact, a good exercise would be, what would you do to convert that to the httr2 syntax, which shouldn't be too much trouble. But, nonetheless, you've got a great example to base your explorations off of here.
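For readers who want to try this themselves, here is a minimal sketch of the pattern described above using httr, jsonlite, and readr. The chart slug and query parameter names follow the post's description but should be treated as assumptions and checked against Kenneth's post and the Our World in Data chart API documentation.

```r
library(httr)
library(jsonlite)
library(readr)

# Chart slug as described in the post (treat as an assumption)
base <- "https://ourworldindata.org/grapher/average-monthly-surface-temperature"

# GET the data as a filtered CSV: API version, CSV type, column name style,
# and a country filter discovered by inspecting the interactive chart's URL
resp <- GET(
  paste0(base, ".csv"),
  query = list(
    v = 1,
    csvType = "filtered",
    useColumnShortNames = "true",
    country = "USA"
  )
)

# The body is plain text (the CSV itself), so parse it directly
temps <- read_csv(content(resp, as = "text"))

# The metadata endpoint returns JSON, which parses into a nested list
meta <- fromJSON(
  content(GET(paste0(base, ".metadata.json")), as = "text")
)
str(meta, max.level = 1)
```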
[00:34:53] Mike Thomas:
Yeah. I think it's just a good reminder in general, especially for, you know, junior data science folks who are just starting out, that your data isn't always going to be in a CSV format. Yes, I know that Our World in Data allows you to export that. But a question that you should be asking, in order to try to automate things as much as possible for yourself, is often, you know, is there an API for this dataset, or is there an underlying database that we can connect to, so that I can just run my code directly against that, run my script with one click, as opposed to having to go someplace and download the data to a CSV first before I do my analysis. So, you know, that way you can sort of automate a recurring script that you have against data that might be updating, but in the same schema, on some particular basis.
I think, yeah, this is a fantastic example of leveraging Our World in Data's API to do that, some really nice base plotting, some really nice ggplot plotting as well, a pretty cool mix that's been put together. And like you said, Eric, a great example of dealing with what's called a GET request, which is where you're actually just modifying the suffix of the URL in order to filter the dataset that's going to get returned here. So it's a really great example of doing that with a couple of different parameters that are being managed. I guess one parameter being tab equals chart, another one specifying the time or the date range that we're looking to get data back within. And then the last one being the two countries here, in the case of this last example, where we're plotting the average monthly temperature for the entire world and then for Thailand as well. So, you know, two items in the legend here. As you said, a great walkthrough blog post of using a publicly available API to wrangle some data and make it pretty.
[00:36:54] Eric Nantz:
Yeah. The limit's only your imagination at this point. So like I said earlier, you could take what Nicola made with her Closeread example, apply it to this kind of data, and go to town with a great learning journey, great for a blog post such as this, you know. And again, maybe, like you said, speaking to the data scientists out there that are looking to get into an industry or a data science type of role, it never hurts, if you've got the time and the energy, to build a portfolio of things like this, because you never know just how useful that will be as you're trying to showcase what you find and what skill set you have to generate insights from data like this. Because, not to pull the old back-in-my-day syntax here, but we didn't have access to these types of data when I was looking for a job back then. So take advantage of it, folks. It is here for the taking.
Speaking of what else you need to take advantage of, you need to take advantage of R Weekly, folks, because if this isn't bookmarked for reading every single week, you are missing out, because this issue has well more than what we just talked about here in these highlights. We got a great batch of additional tutorials, new packages that have been released, new events coming up. It's the full gamut. So we'll take a couple minutes for our additional finds here. And, leveraging what we talked about at the outset of the show with Simon's explorations of interacting with ellmer, a very common problem across many, many different industries and organizations is dealing with data that, I'm gonna go on a limb here, is kind of trapped in image or PDF format.
Because whether you like it or not, there's gonna be some team out there that said, you know what, we have this great set of data here, and they act like everything is perfectly accessible. And then you as a data scientist say, oh, yeah? Where are the CSVs? Where are the parquet files if they're really up to date? Oh, no. No. They're in these PDFs. Oh, gosh. Okay. What do I do now? Yes, there are things like OCR that can help you to an extent, but with the advent of AI, there might be an even easier way to do that. So frequent contributor to the R Weekly highlights and elsewhere, Albert Rapp, has another post in his three minute Wednesday series on how he was able to leverage ellmer to extract text from both an image file of an invoice as well as a PDF version of that image, and to be able to grab, you know, certain numeric quantities like number of billable hours, time period, and whatnot.
I think this is a very relatable issue that, again, many organizations, big or small, are gonna have to deal with at some point. And I've seen projects being spun up at the hashtag day job where they're looking at ways of building this from scratch. Well, if you're an R user, maybe ellmer with its image extraction functionality might get you 90% of the way there. Hashtag just saying. So excellent post, Albert, and I may be leveraging this sooner than
[00:40:07] Mike Thomas:
later. No. That's awesome. We have some projects that are doing the same thing with some of these self-hosted open weights models, to be able to take a look at a PDF and extract very particular pieces of information that we want from it, and we can tell it, you know, give us that back in JSON form, which allows us to leverage it downstream. Of course, you have to build a bunch of guardrails around that to make sure it's not hallucinating, because it's a black box. Yep. But it's pretty powerful stuff, and the accuracy that we're seeing is pretty shocking, pretty awesome.
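For the curious, here is a minimal sketch of the ellmer pattern Albert describes, assuming an OpenAI-compatible backend with an API key already set; the model name, file path, and prompt wording are illustrative, and his post is the reference for the full workflow.

```r
library(ellmer)

# Assumes OPENAI_API_KEY is set in the environment; the model is illustrative
chat <- chat_openai(model = "gpt-4o-mini")

# content_image_file() attaches a local image so the model can read values
# such as billable hours and the billing period straight off the invoice
reply <- chat$chat(
  "Extract the number of billable hours and the billing period from this invoice, as JSON.",
  content_image_file("invoice.png")
)
reply
```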
But what I want to call out is an article by the USGS, which is the US Geological Survey, on mapping water insecurity in R with tidycensus. They just always do an absolutely beautiful job with data visualization. All the code is here. A lot of these visuals actually deal with households that were lacking plumbing in 2022 in the US, and then changes via, I guess, barbell plots, they're called. I don't know if there are any other names for them. Lollipop plots?
[00:41:13] Eric Nantz:
Yeah. I've seen them thrown around interchangeably.
[00:41:15] Mike Thomas:
Yep. Yep. To take a look at improvements in plumbing facilities, particularly in New Mexico and Arizona, which were the two states, based upon the 2022 census data, that I think had the lowest rates of household plumbing. So, you know, it may be a niche topic for some, for lack of a better word. But the data visualizations that they have here on these choropleth maps are really, really nice. I love the color palettes that they use. I really love the walkthrough that they provide on the website in terms of the code and the narrative around how they made the decisions they made to go from dataset to visuals. I think it's a great job. You know, on the East Coast here, water scarcity is not something that we really are concerned about. But I know on the West Coast, because we do a lot of our work in agriculture, it's quite a big deal in terms of water rights and water access and things like that.
So I really appreciate the work that the USGS is doing on this particular, you know, niche.
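For anyone wanting to reproduce the general idea, here is a hedged sketch with tidycensus and ggplot2. The ACS variable code below is a placeholder, so look up the actual plumbing-facilities variable with load_variables() (a Census API key is required), and see the USGS post for their real workflow and styling.

```r
library(tidycensus)  # requires census_api_key("<your key>")
library(ggplot2)

# "B25049_004" is a placeholder-style variable code; confirm the real
# "lacking complete plumbing facilities" variable via load_variables(2022, "acs5")
plumbing <- get_acs(
  geography = "county",
  variables = "B25049_004",
  state = "NM",
  year = 2022,
  geometry = TRUE
)

# Simple county-level choropleth of the estimates
ggplot(plumbing, aes(fill = estimate)) +
  geom_sf(colour = "white", linewidth = 0.1) +
  scale_fill_viridis_c(direction = -1) +
  labs(fill = "Households",
       title = "Households lacking complete plumbing (illustrative)") +
  theme_void()
```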
[00:42:27] Eric Nantz:
Yeah. And I have a soft spot for the great work they're doing. My wife actually was fortunate early in her career to have an internship at the USGS, and albeit this was a day when R wasn't quite as readily used as it is now, it's great to see this group in particular being really modern with their approaches. And, again, top notch narrative, top notch visualization, so really exciting to see. And I believe we featured this group in previous highlights, so you wanna check out the back catalog for some of the excellent work they've been doing in this space previously. So excellent find, Mike, and there are a lot more finds than just those. So, again, we invite you to check out the rest of the R Weekly issue at rweekly.org.
We, of course, have a direct link to this particular issue in the show notes, but, also, you wanna check the back catalog of both the issue as well as this humble podcast itself, because we got so many great things to talk about here, so many great things to learn. As you heard, I've basically nerd sniped myself into a new project, hopefully this year, that I can work on with Shiny and Closeread. So we'll see what happens there. But, yeah, if you wanna see what else is happening, and if you want to be a part of what's happening here in terms of what the readers are gonna see every week, we value your contributions, and the best way to do that is, again, head to rweekly.org.
You'll see in the top right corner a link to the upcoming issue's draft, where you can send a pull request to tell us about that great new package, that great new visualization, that great new use case of Shiny or AI or other technologies that you see in this data science community. We'd love to hear it. Again, all Markdown all the time. I would stress, as I was told years ago, if you couldn't learn R Markdown in five minutes, he'd give you $5, and he never had to give away any money for it. So there you go, folks. And, also, we love hearing from you. You can get in touch with us via the contact page in the episode show notes. You can also send us a fun little boost with a modern podcast app. Those details are in the show notes as well. And you can also get in touch with us on the social medias.
I am on Mastodon these days with @[email protected]. I am also on Bluesky as well, where I am @rpodcast.bsky.social, I believe that's how to say it. Mike, where can the listeners find you?
[00:44:51] Mike Thomas:
Yes. I am on Bluesky for the most part these days at @mike-thomas.bsky.social. Also on Mastodon a little bit, @[email protected]. And you can check out what I'm up to on LinkedIn if you search Ketchbrook Analytics, k e t c h b r o o k. And a bit of a shout out here, a self plug, that we are looking for a DevOps expert. If you are somebody who has expertise in Docker, a little Kubernetes, Azure preferred, but it doesn't really matter because we're all spinning up Linux servers at the end of the day, we could use some help managing ours and our clients' ShinyProxy environments. So any DevOps folks out there, please feel free to reach out.
[00:45:40] Eric Nantz:
I'm sure there are many of you out there. So, yeah, take up Mike on this tremendous opportunity. I'm still learning the DevOps ropes. We share many stories about that in our adventures there. So that's a great plug, Mike. And I'm also on LinkedIn as well. But, yeah, we'll add that little call out to the show notes as well if you're interested in pursuing that. Nonetheless, we're gonna close up shop here for episode 194 of R Weekly Highlights. Before I go, I wanna send a very hearty congratulations to Chris Fisher and the team at Jupiter Broadcasting. They recently had episode 600 of Linux Unplugged.
Tremendous achievement, folks. You'll be seeing a boost from me in the coming days. I don't know if we'll ever get there, Mike, but, nonetheless, that's a huge number for a podcast of that size. So congrats to them. Well, we'll get to 200 at least, and we'll see what happens after that. Alright. We got no plans to stop. Well, yeah. Me either. Yep. We'll see what happens, buddy. But, nonetheless, we hope you enjoyed episode 194 of R Weekly Highlights, and we'll be back with another episode, one ninety five, next week.
Hello, friends. We are back of episode 94 of the Our Weekly Highlights podcast. If you're new, this is the weekly podcast where we talk about the excellent highlights and additional resources that are shared every single week at rweekly.0rg. My name is Eric Nantz, and I'm delighted you join us from wherever you are around the world. And I'm always joined in this February, but it's still my same cohost, and that's my choice here, Mike Thomas. Mike, how are you doing today?
[00:00:29] Mike Thomas:
Doing pretty well, Eric. It was kind of a long January here in The US, and it seems like we're in for an even longer February. But happy to be on the highlights today, and, may your datasets continue to be available.
[00:00:44] Eric Nantz:
Let's certainly hope so. I will say on Saturday, I had a good little diversion from all this, stuff happening. I was with about 70,000 very enthusiastic fans at WWE's Royal Rumble right here in the Midwest, and that was a fun time. My voice has finally come back. Lots of fun surprises, some not so fun surprises, but that's why we go to these things so we can voice our pleasure or displeasure depending on the on the storyline. But it was a awesome time. I've never been to what the WWE has as their, quote unquote, premium events. It used to be called pay per views at a stadium as big as our Lucas Oil Stadium here in Indianapolis. So I I had a great time and, yeah. I'm I'm slowly coming back to the the real world now, but it it was it was it was well worth the price of admission.
[00:01:39] Mike Thomas:
That is super cool. That must have been an awesome experience. Luca Oils Lucas Oil Stadium, a dome?
[00:01:45] Eric Nantz:
It is. Yep. That's the home of the Indianapolis Colts. It's been around for about fifteen years, I believe now. Last time I was there, I was at a final four, our NCAA basketball tournament way way back when where we saw my, one of my favorite college basketball teams, Michigan State. Unfortunately, we lose the butler that year, but it was a good time nonetheless. So, we won't be weighing the the smack down on r for this. We are gonna we're gonna put over r as they say in the business, and that is an emphatic yeet, if you know what I mean. Yeet.
But speaking of enthusiastic, I am very excited that this week's issue is the very first issue curated by our newest member of the our weekly team, Jonathan Kidd. Welcome, Jonathan. We are so happy to have you on board the team. And as always, just like all of us, our first time of curation, it's a lot to learn, but he had tremendous help from our fellow r Wiki team members and contributors like all of you around the world with your poll request and suggestions. And Jonathan did a spectacular job with his first issue, so we're gonna dive right into it with arguably still one of the hottest topics in the world of data science and elsewhere in in tech, and that is how in the world can we leverage the newer large language models, especially in our r and data science and development workflows.
We have sung the praises of recent advancements on the r side of things, I e with Hadley Wickham's Elmer package, which I've had some experience with. And now we're starting to see kind of an ecosystem start to spin up around this foundational package for setting up those connections to hosted or even self hosted large language models and APIs. And in particular, one of his, fellow posit software engineers, Simon Couch from the Tidymodels team. He, had the pleasure of, of enrolling in one of Posit's internal AI hackathons that were being held last year. And he learned about some, you know, the Elmer package and Shiny chat along with others for the first time.
And he saw tremendous potential on how this can be used across different realms of his workflows. Case in point, it is about twice a year that the Posit team or the Tidyverse team, I should say, undergoes spring cleaning of their code base. Now what does this really mean? Well, you can think of it a lot of ways, but in short, it may be updating some code that the packages is using. Maybe it's using an outdated dependency or a deprecated function from another package, and here comes the exercise of making sure that's up to date with, say, the newest, you know, blessed versions of that said function and whatnot, such as the CLI package having a more robust version of the abort functionality when you wanna throw an error in your function as opposed to what our lang was exposing in years before.
It's one thing if you only have a few files to replace, right, with that stop syntax or our or abort syntax from our lang. Imagine if you have hundreds of those instances. And imagine if it's not always so straightforward as that find and replace that you might do in a in an IDE such as r studio or positron. Well, that's where Simon in as a result of in participating in this hackathon, he created a prototype package called CLIPAL, which will let you highlight certain sections of your Rscript in RStudio and then run an add in function call to convert that to, like, a newer syntax, and in this case, that abort syntax going from r lang to the CLI packages version of that.
The proof of concept worked great, but it was obviously a very specific case. Yet, he saw tremendous potential here so much so that he has spun up not one, not two, but three new packages all wrapping the Elmer functionality combined with some interesting integrations with the RStudio API package to give that within editor type of context to these different packages. So I'll lead off, Simon's summary here on where the state is on each of these packages with the first true successor to COIPow, which was called pal. This is comes with built in prompts, if you will, that are tailored to the developer, I e the package developer.
If you think of the more common things that we do in package development, it's, you know, building unit tests or doing our oxygen documentation or having robust messaging via the COI package. Those are just a few things. But pow was constructed to have that additional context of the package and I e the functions you're developing in that package. So you could say, highlight a snippet and say give me the r oxygen documentation already filled out with that that function that you're highlighting. That's just one example. You could also, like I said, build in, like, CLI calls or convert those CLI calls from other calls if you wanna do aborts or messages or warnings and whatnot.
And that already has saved him immense amount of time with his package development, especially in the spring cleaning exercise. He does have plans to put Pal on CRAN in the coming weeks, but he saw tremendous potential here. That's not all. Mike, he he didn't wanna stop there with Pal because there are some other interesting use cases that may not always fit in that specific package development workflow or the type of assumptions
[00:07:59] Mike Thomas:
that Pal gives us. So why don't you walk us through those? Yeah. So I'll there's two more that I'll walk through, and one is more on the package development side, and then one is more on the analysis side for, you know, day to day our users and not necessarily package developers. So the first of which is called ensure, e n s u r e. And, one of the interesting things about ensure is that it it actually does a couple of different things that PAL does not do. And PAL sort of assumes that all of the context that you need is in the selection and the prompt that you you provide it. But, when we think about in the example that Simon gives here writing unit tests, it's actually really important to have additional pieces of context that may not be in just the single file that you're looking at, the prompt that you're writing, or, you know, the highlighted selection, that you've chosen.
You may actually need to have access to package datasets. Right? That you'll need to to, you know, include in that unit test that maybe aren't necessarily in the script or the the snippet of code that you're focusing on, at the moment. So ensure, you know, goes beyond the context that you have highlighted or or is showing on screen and actually sort of looks at the larger universe, I believe, of, all of these scripts and items that are included in your package. And it looks like, you know, unit testing here is probably the the biggest use case for ensure in that you can, leverage a particular function, like a dot r function within your r directory.
And if you want to scaffold or or really create, I guess, a a unit test for that, it's as easy, I believe, as, you know, highlighting the text that you or highlighting the lines of code that you're looking to write a unit test for. And, just a hot key shortcut that will actually spin up a brand new test dash whatever, test that file. It'll stick that file in the appropriate location under, you know, test test that for those of us that are our package developers out there. And it will start to write, the those unit tests on screen for you in that test that file. And there's a nice little GIF here that, shows sort of the user experience. And it's it's pretty incredible, that we have the ability to do that, and it looks really really cool. So I think that's, you know, really the main goal of the insure package.
Then the last one I wanna touch on is called Gander. And again, I think this one is a little bit more, you know, day to day data analysis friendly. The, functionality here is that you are able to highlight a specific, snippet of text or it also looks like Simon mentions that, you know, you can also not highlight anything and it'll actually take a look at, you know, all of the code that, is in the script that you currently have open. And you by pressing, you know, a quick keyboard shortcut, it looks like, you can leverage this add in, which will pop up sort of like a modal that will allow you to enter a prompt.
And in this example, you know, there's a dataset on screen. Simon just highlights the the name of that dataset. I think it's the Stack Overflow dataset, but it's just like Iris or Gapminder. And he highlights it, you know, the modal pops up and he says, you know, create a scatter plot. Right? And all of a sudden, the selection on screen is replaced by a g g plot code that's going to create this scatter plot. And he can continue to do that and iterate on the code by saying, you know, jitter the points or, you know, make the x axis formatted in dollars, things like that. And it's it's really, really cool how quickly he is able to, really create this customized g g plot with formatting, with fastening, all sorts of types of different things, in a way that is is obviously much quicker and and more efficient even if you are having to do some minor tweaks, to what the LLM is going to return at the end of the day than if you were going to just, you know, completely write it from scratch. So, pretty incredible here. There's another GIF that goes along with it demonstrating this. It looks like not only in this pop up window is there an input for the prompts that you wanna give it, but there is also another option called interface, which I believe allows you to control whether you wanna replace the code that you've highlighted, or I would imagine whether you wanna add on to the code that you've highlighted instead of just replacing it, you know, if you wanna create sort of a a new line, with the output of the LLM. So really cool couple of packages here that are definitely creative ways to leverage this new large language model technology to try to use, you know, provide us with some AI assisted coding tools. So big thanks to to Simon for and the team for developing these and sort of the creativity that they're having around, leveraging these LLMs to help us in our day to day workflows.
[00:13:18] Eric Nantz:
Yeah. I see immense potential here and the fact that, you know, with these being native R solutions inside our R sessions, grabbing the context not just potentially from that snippet highlighted, but the other files in that given project, whatever a package or a data science type project with the information on the datasets themselves. Like, that is immense value without you having the really in a separate window or browser and say chat g p t or whatnot, trying to give it all the context you can shake a stick at and hope that it gets it. Not always the case. So there is a lot of interesting extensions here that, again, are made possible by Elmer. And, you know, like like I said, immense potential here. Simon is quick to stress that these are still experimental.
Simon sees, obviously, some great advantages to this paradigm where the bot you're interacting with is injecting code directly into the R file you're writing at the time. Again, that's a good thing. It may sometimes not be such a good thing if it goes off on a hallucination or something; we don't know. So my tip here is, if you're gonna do this day to day and you're not using version control, you really should, in case it goes completely nuts on you. You don't want that in your commit history with somebody asking you in code review, what on earth were you thinking there? Oh, it wasn't me. Really? Well, it kinda was, since you were using the bot anyway. Nonetheless, having version control, I think, is a must here. But I do see Simon's point that other frameworks in this space, he mentions this around the middle of the post, are leveraging more general back ends for interacting with your IDE, or Git-like functionality that shows you the difference between what you had before and what you have after the AI injection, so you can review that quickly before you say, I like it, let's get it in, or, not so much, I wanna get that out of here and try again. So I would imagine, and this is Eric putting his speculation hat on here, that with the advancements in Positron and the more general VS Code extension ecosystem, there might be an even more robust way to do this down the road on that side. But the advantage of the rstudioapi package he's leveraging is that, thanks to some shims created on the Positron side, this works in both classic RStudio and in Positron. And I think that's tremendous value at this early stage for those who prefer, say, RStudio for now over Positron, while still giving flexibility to those who wanna stay on the bleeding edge to leverage this tech as well. So I think there's a lot to watch in this space, and Simon definitely does a tremendous job with these packages at this early stage.
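If you want to keep that review step inside R, here is a minimal sketch using the gert package, assuming the project is already a git repository; the file name and commit message are placeholders.

```r
library(gert)

# See which files the assistant touched
git_status()

# Review the actual changes before staging anything
git_diff()

# Stage and commit only what you have reviewed (path and message are placeholders)
git_add("R/analysis.R")
git_commit("Add LLM-drafted plotting code (reviewed)")
```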
[00:16:31] Mike Thomas:
That's for sure. Yeah, I appreciate the Posit team's attention to UX because, again, I think that's the most important thing here as we bring in tools that create very different workflows than what we're necessarily used to. I think it's important that we meet developers and data analysts and data scientists in the best place possible.
[00:16:58] Eric Nantz:
And I mentioned at the outset that Simon is part of the tidymodels ecosystem team. I will put some quick plugs in the show notes because he is also writing a new book called Efficient Machine Learning with R, which he first announced at the R/Pharma conference last year with an excellent talk. So he's been knee deep in figuring out the best ways to optimize his development, both from a code writing perspective and from an execution perspective, in the tidymodels ecosystem. So, Simon, I hope you get some sleep, man, because you're doing a lot of awesome work in this space.
[00:17:34] Mike Thomas:
I was thinking the same thing. I don't know how he does it.
[00:17:45] Eric Nantz:
And speaking of someone else where we wonder how on earth they pull this off with the time they have, our next highlight is, you might say, revisiting a very influential dataset that made tremendous waves in the data storytelling and visualization space, but with one of the new Quarto tools to make it happen. Longtime contributor to R Weekly and the highlights, Nicola Rennie, is back again, as she has drafted her first use of the Closeread Quarto extension, applied to Hans Rosling's famous Gapminder visualization.
If you didn't hear our previous highlights episodes, we did cover the Closeread Quarto extension that was released by Andrew Bray and James Goldie. In fact, there was a talk about this at the aforementioned posit::conf last year, which we'll link to in the show notes. But Closeread, in a nutshell, gives you a way to have that interactive, what you might call scrollytelling, enhanced web-based reading of a report, with visualizations, including interactive ones. You've seen this crop up from time to time in, say, the New York Times' data journalism, and other reporting outfits and startups out there have leveraged similar interactive visualizations.
I even came across an article on ESPN, of all things, that was using this kind of approach. So it's used everywhere now. But now we, as adopters of Quarto, can leverage this without having to reinvent the wheel in terms of all the HTML styling and other fancy enhancements. The Closeread extension makes all of it happen free of charge. So what exactly is this tremendous report that Nicola has drafted here? She calls it Gapminder: how has the world changed? And right off the bat, the cover of this report is basically a replica of the famous animated visualization, plotting GDP, gross domestic product, per capita against life expectancy on the y axis, with the size of each bubble representing the population of the country it stands for.
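As a rough approximation, the static version of that opening chart can be reproduced in a few lines of ggplot2 with the gapminder package; the log x axis and the 2007 snapshot are the conventional choices for this chart rather than anything pulled from Nicola's source.

```r
library(gapminder)
library(ggplot2)

# One year of the classic Gapminder chart: GDP per capita vs. life expectancy,
# bubbles sized by population and coloured by continent
gapminder |>
  subset(year == 2007) |>
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, colour = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "GDP per capita (log scale)", y = "Life expectancy (years)",
    size = "Population", colour = "Continent"
  )
```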
Once you start scrolling the page, and again, we're an audio podcast, we're gonna do the best we can with this, she walks through those different components. First, GDP, with some nice line plots faceted by the different regions of the world, giving more detail on gross domestic product, and then how that is affected by population growth, another key variable, before getting to the life expectancy side of it. And as you're scrolling through, the plot neatly transitions as you reach the new text in the left sidebar.
It is silky smooth, really top-notch user experience. Then she isolates what one of these years looks like. She calls it the world in 2007, showing the four-quadrant view you get when you look at low versus high GDP and low versus high life expectancy. And as she walks through this plot, she zooms in on each of those quadrants as you scroll, and that's leveraging the same visualization; it's just using clever tricks to isolate different parts of the plot. Again, silky smooth. It's really interesting to see how she walks through those four areas, and then she closes out with the animation once again, which goes through each year from the 1950s all the way to the early 2000s, with links to all of her code, GitHub repository, and whatnot.
But for a first pass at Closeread, this is a top-notch product, if I dare say so myself. And boy, oh boy, am I interested in trying this out. There was actually a Closeread contest put together by Posit late last year, and I believe submissions closed in January, this past month. But if you want to see how others are doing in this space, in addition to Nicola's visualization, we'll have a link in the show notes to the Posit Community posts tagged with this Closeread contest, so you can see what other people are doing. Maybe we'll hear about the winners later on, but this one will have a good chance of winning, if I dare say so myself. So I am super impressed with Closeread here, and with Nicola's very quick learning of it.
[00:22:52] Mike Thomas:
Yeah, it's pretty incredible. I was going through the Quarto document that lives behind this, and it actually seems pretty easy to get up to speed with this scrollytelling concept. I think there are a couple of different specific tags that allow you to do this, or at least make it easy. It looks like there is a .scale-to-fill tag that I believe handles a lot of the zoom in, zoom out, the aspect ratio of the plots, or GIFs in Nicola's case, that are being put on screen. Because in her visualization there's this whole left-hand sidebar that has a lot of the context and the narrative text that goes along with the visuals on the right side of the screen.
Some of the things I thought were pretty incredible here: not only was she able to fit a lot of these plots in a nice aspect ratio on the right side of the screen, but there's also a section of the scrollytelling visualization where she zooms in, across four different slides if you will, on four different quadrants of the same plot to tell the story of those quadrants, one being low GDP per capita and low life expectancy, another low GDP per capita and high life expectancy, and the other two vice versa. And it's pretty awesome how the visualization nicely slides from one quadrant to the other as you scroll to the next slide.
So for any of the data vis folks out there, the data journalism folks out there, I imagine that accomplishing something like this in the past probably took a lot of D3.js type of work, and the end product here, compared to the Quarto code that I'm looking at, is pretty incredible. It gives me the sense that a lot of the heavy lifting has been done for us in the ability to create these Quarto-based scrollytelling visualizations. So I'm super excited about this.
[00:25:26] Eric Nantz:
You know, it made me go to the wayback machine a little bit on this. I'm gonna bring Shiny into this, because I love to bring Shiny into almost all my conversations. Back in 2020, of all things, I had the good fortune of presenting at the poster session at the conference, and my topic was highlighting the latest innovations in the Shiny community. I was trying to push for whether we could ever have something like a shinyverse, or whatnot, of these community extensions. And to do this poster, I didn't wanna just do PowerPoint or anything. Come on now, you know me. So I leveraged the work of our good friend John Coene.
He had a development package way back in the day called fullPage, which was a way to create a Shiny app with these scrollytelling-like elements. But I will say he was probably too far ahead of his time on that. I won't say it was that easy to use, and, frankly, he would probably acknowledge that too. So here's my idea. I still have the GitHub repo of that poster. I would love to try my hand at converting it to Closeread and, wait for it, somehow embedding a Shinylive app inside of it. Can it be done?
[00:26:42] Mike Thomas:
I think it can. I think you'd be breaking some new ground, Eric. But if anybody's up for that challenge, I know it's you.
[00:26:47] Eric Nantz:
How did I just nerd-snipe myself? Like, how does that happen, Mike? You must be hypnotizing me or something without even saying anything. I have no idea.
[00:26:59] Mike Thomas:
Peer pressure.
[00:27:13] Eric Nantz:
Now, you may be wondering out there: with the Gapminder data, we are fortunate that we have a great R package that literally gives us this kind of data. So once Nicola has that package loaded, she's able to create this awesome Closeread scrollytelling report. Well, there are many, many other sources of data that can surface similarly important domains to what we saw in the Gapminder set. And you may be wondering where you can get your hands on additional data like this so you can do your own reporting, maybe with Closeread or Shiny or Quarto, whatever have you. Our last highlight is giving you another terrific resource of data for these kinds of situations.
This last highlight comes to us from Kenneth Tay, who is an applied researcher at LinkedIn, and his latest blog post is talking about some recent advancements in this portal called Our World in Data, which I had not seen before this highlight. It is, I believe, a nonprofit organization whose mission is to create accessible research and data to make progress against the world's largest problems. So you might think of, say, poverty, life expectancy, some of the other issues that the Gapminder set highlighted. But they wanna make sure that anybody who has the desire and the skill set to use, say, a language like R, or whatever else, to produce visualizations and really start to summarize and explore these data, has as little friction as possible in accessing them.
And yes, you could access their portal and download the data manually from their website, but earlier in 2024 this group exposed an API to access these data. So Kenneth, in his blog post, walks through what it's like to use this new API, in particular what they call the chart API, because it is the basis for, I believe, the interactive visualizations that their web portal exposes. And because there is an API now, he brings back a little old-school flavor here: the httr, or hitter, package. That was one of those cases where I'd been spelling it out all this time, but in the httr2 README, Hadley literally says how it's pronounced. So thank you, Hadley. I wish all our package authors would do that.
[00:29:45] Mike Thomas:
In case the baseball player didn't give it away.
[00:29:48] Eric Nantz:
Exactly. So, kudos on the new package itself. Back to Kenneth's exploration here: he shows us how, with old-school httr along with a little tidyverse magic and jsonlite loaded into the session, he gets at these data. He needs all three of those because, first, it's one thing to access the data itself, which apparently are exposed as CSV files on the back end, and the API lets you grab those directly, but the metadata comes back in JSON format. So he wants to use jsonlite to help massage some of that too.
So the first exploration, and the snippet on the blog post, is looking at the average monthly surface temperature around the world. Once he's got the URL of the dataset, he assembles the query parameters, which, again, in the world of APIs, sometimes come with really robust documentation, and other times you have to kind of guess along the way. It's kind of a roll of the dice, isn't it?
[00:30:52] Mike Thomas:
Yeah, I find the latter to be the case more often, especially in professional settings, unfortunately, which seems to make no sense.
[00:31:01] Eric Nantz:
Who would ever think that? And yet I feel seen when you say that. Yes, even as of this past week. My goodness, don't get me started. So luckily, for this one, there's a healthy mix, I would say. He's got some query parameters: the version of the API, the type of CSV to return, which can be the full set or a filtered set, which I'll get to in a little bit, and whether to use long or short column names in the dataset that comes back. And then he does a similar thing for the metadata. That's another GET request, and he parses that content directly as JSON.
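Here is a rough sketch of that metadata request in R. The chart slug, parameter names, and shape of the returned list follow the pattern described in the post, so treat the exact URL and fields as assumptions.

```r
library(httr)
library(jsonlite)

# Metadata for a chart comes back as JSON (chart slug is an assumption)
meta_url <- "https://ourworldindata.org/grapher/average-monthly-surface-temperature.metadata.json"
meta_resp <- GET(
  meta_url,
  query = list(v = "1", csvType = "full", useColumnShortNames = "true")
)

# Parse the response body as a nested list and peek at its top level
meta <- fromJSON(content(meta_resp, as = "text"), simplifyVector = FALSE)
str(meta, max.level = 1)
```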
The metadata comes back as a list, because most of the time when you get JSON back it is basically a big nested list, and that gives some high-level information on the dataset being returned. So you get, basically, for each variable, a character string with its name and its description. So that's great. Now for the data itself: again, he sets up similar query parameters, and this time he demonstrates what it's like to bring in a filtered version of the data right off the bat. That's where there was a little guessing, because he went through the web portal, played with the interactive filters the portal gives him, and looked at the end of the URL. If you're new to the way requests are made to an API, these are GET requests, where you're looking to grab something from the API.
More often than not, you'll attach different flags or variables to the end of the URL, often as key-value pairs with an ampersand separating the different parameters. So once he explored the web portal, he was able to grok that, oh yeah, there is a parameter for selecting the country, so he puts that in the query parameters and feeds in the value directly. And then once he does the GET request, and this is important here, the content that comes back can usually be one of three flavors: the raw, you might say binary, representation of the value; the textual value; or a parsed JSON or XML version of it.
In this case, it was a text value coming back, because it's literally the CSV content, as if you had the CSV open in a new file on your computer. That's how the text comes back. So he feeds that into read_csv() directly, and lo and behold, you've got yourself a tidy dataset. And then with that, he just does a simple plot of year versus surface temperature across the USA, to show that that's exactly how you'd bring that data in. There's a lot more you can do with this type of data, but, again, it's a good example of, first, going to the documentation where it's available, and then, when things aren't as well documented, nothing beats a little trial and error. Sometimes that's the best bet we get, and that's how he was able to do the filtered dataset pull. Nonetheless, if you're looking for inspiration, this is similar data to what we covered in the second highlight, but across a wide range of world-specific topics. I think this portal has a lot of potential.
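And here is a sketch of the data pull itself, mirroring the flow just described: a GET request for the filtered CSV, the body read as text, then straight into read_csv(). The chart slug, the country filter syntax, and the column names in the plot are assumptions carried over from the post's example, so check names() on what you actually get back.

```r
library(httr)
library(readr)
library(ggplot2)

# Filtered CSV pull (chart slug and country filter syntax are assumptions)
data_url <- "https://ourworldindata.org/grapher/average-monthly-surface-temperature.csv"
resp <- GET(
  data_url,
  query = list(
    v = "1",
    csvType = "filtered",
    useColumnShortNames = "true",
    country = "~USA"
  )
)

# The body is plain text that is literally CSV, so hand it to read_csv()
temps <- read_csv(content(resp, as = "text"))

# Column names below are assumptions; inspect names(temps) first
ggplot(temps, aes(x = year, y = temperature_2m)) +
  geom_line()
```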
And yes, R is your friend; again, we're grabbing these data from almost any source you can imagine. So it's a really great blog post, straight to the point. You could take this code and run with it today. In fact, a good exercise would be converting it to the httr2 syntax, which shouldn't be too much trouble. But nonetheless, you've got a great example to base your explorations on here.
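For that exercise, a minimal httr2 translation of the same pull might look like this, with the URL and parameter values carried over from the sketch above, so the same caveats apply.

```r
library(httr2)
library(readr)

# Same filtered pull, translated to httr2's pipeable interface
temps <- request("https://ourworldindata.org/grapher/average-monthly-surface-temperature.csv") |>
  req_url_query(
    v = "1",
    csvType = "filtered",
    useColumnShortNames = "true",
    country = "~USA"
  ) |>
  req_perform() |>
  resp_body_string() |>
  read_csv()
```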
[00:34:53] Mike Thomas:
Yeah, I think it's just a good reminder in general, especially for junior data science folks who are just starting out, that your data isn't always going to be in CSV format. Yes, I know that Our World in Data allows you to export that. But a question you should be asking, in order to automate things as much as possible for yourself, is: is there an API for this dataset, or an underlying database I can connect to, so that I can run my code directly against it and run my script with one click, as opposed to having to go someplace and download the data to a CSV before I do my analysis? That way you can automate a recurring script against data that might keep updating, but in the same schema, on some regular basis.
I think, yeah, this is a fantastic example of leveraging Our World in Data's API to do that, with some really nice base plotting and some really nice ggplot2 work as well, a pretty cool mix that's been put together. And like you said, Eric, it's a great example of dealing with a GET request, where you're just modifying the suffix of the URL in order to filter the dataset that gets returned. So it's a really nice example of managing a couple of different parameters: one being tab=chart, another specifying the time or date range we want data back within, and the last one being the two countries in this final example, where we're plotting the average monthly temperature for the entire world and then for Thailand as well, so two items in the legend. As you said, a great walkthrough blog post on using a publicly available API to wrangle some data and make it pretty.
[00:36:54] Eric Nantz:
Yeah, the limit's only your imagination at this point. Like I said earlier, you could take what Nicola made with her Closeread example, apply it to this kind of data, and go to town with a great learning journey, great for a blog post such as this. And again, maybe speaking to the data scientists out there looking to get into industry or into a data science type of role, it never hurts, if you've got the time and the energy, to build a portfolio of things like this, because you never know just how useful that will be as you're trying to showcase what you find and what skill set you have for generating insights from data like this. Not to pull the old back-in-my-day card here, but we didn't have access to these types of data when I was looking for a job early on. So take advantage of it, folks. It is here for the taking.
Speaking of what else you need to take advantage of: you need to take advantage of R Weekly, folks, because if it isn't bookmarked for reading every single week, you are missing out. This issue has well more than what we just talked about in these highlights. We've got a great batch of additional tutorials, new packages that have been released, new events coming up, the full gamut. So we'll take a couple of minutes for our additional finds here. And, leveraging what we talked about at the outset of the show with Simon's explorations of interacting with ellmer, a very common problem across many different industries and organizations is dealing with data that, I'm gonna go out on a limb here, is kind of trapped in image or PDF format.
Because, whether you like it or not, there's gonna be some team out there that says, you know what, we have this great set of data here, and they act like everything is perfectly accessible. And then you as a data scientist say, oh yeah, where are the CSVs? Where are the Parquet files, if they're really up to date? Oh no, no, they're in these PDFs. Oh, gosh. Okay, what do I do now? Yes, there are things like OCR that can help you to an extent, but with the advent of AI there might be an even easier way. So frequent contributor to R Weekly and elsewhere, Albert Rapp, has another post in his Three Minute Wednesday series on how he was able to leverage ellmer to extract text from both an image file of an invoice and a PDF version of that image, and to grab certain numeric quantities like the number of billable hours, the time period, and whatnot.
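As a rough sketch of that idea, ellmer exposes helpers for attaching an image to a chat and for pulling out structured fields. The field names, file path, and model below are assumptions for illustration, and the extraction method name may differ across ellmer versions, so treat this as a sketch rather than Albert's actual code.

```r
library(ellmer)

# Describe the fields we want back from the invoice (names are illustrative)
invoice_fields <- type_object(
  billable_hours = type_number("Total billable hours on the invoice"),
  period_start   = type_string("Start of the billing period, ISO 8601"),
  period_end     = type_string("End of the billing period, ISO 8601")
)

# Chat object against a provider you have credentials for (model name is illustrative)
chat <- chat_openai(model = "gpt-4o-mini")

# Ask for structured data extracted from an image of the invoice
chat$extract_data(
  content_image_file("invoice.png"),
  type = invoice_fields
)
```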
I think this is a very relatable issue that many organizations, big or small, are gonna have to deal with at some point. I've seen projects being spun up at the hashtag day job where they're looking at ways of building this from scratch. Well, if you're an R user, maybe ellmer with its image extraction functionality gets you 90 percent of the way there. Hashtag just saying. So, excellent post, Albert, and I may be leveraging this sooner rather than later.
[00:40:07] Mike Thomas:
No, that's awesome. We have some projects doing the same thing with some of these self-hosted open-weights models, to take a look at a PDF and extract very particular pieces of information that we want from it. We can tell it to give us that back in JSON form, which allows us to leverage it downstream. Of course, you have to build a bunch of guardrails around that to make sure it's not hallucinating, because it's a black box. But it's pretty powerful stuff, and the accuracy that we're seeing is pretty shocking, pretty awesome.
But what I want to call out is an article by the USGS, the US Geological Survey, on mapping water insecurity in R with tidycensus. They always do an absolutely beautiful job with data visualization, and all the code is here. A lot of these visuals deal with households that were lacking plumbing in 2022 in the US, and then the changes over time, via, I guess, barbell plots they're called. I don't know if there are any other names for them. Lollipop plots?
[00:41:13] Eric Nantz:
Yeah. I've seen them thrown around interchangeably.
[00:41:15] Mike Thomas:
Yep, to take a look at improvements in plumbing facilities, particularly in New Mexico and Arizona, which were the two states that, based on the 2022 census, I think had the lowest rates of household plumbing. So it may be a niche topic for some, for lack of a better word, but the data visualizations they have here on these choropleth maps are really, really nice. I love the color palettes they use, and I really love the walkthrough they provide on the website in terms of the code and the narrative around how they made the decisions to go from dataset to visuals. I think it's a great job. You know, on the East Coast here, water scarcity is not something we're really concerned about, but I know on the West Coast, because we do a lot of our work in agriculture, it's quite a big deal in terms of water rights and water access and things like that.
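For anyone curious about the general pattern behind maps like these, here is a rough tidycensus plus ggplot2 sketch, not the USGS team's actual code. It assumes you have a Census API key configured, and the ACS variable code is a placeholder; look it up with load_variables(2022, "acs5") before using it.

```r
library(tidycensus)
library(ggplot2)

# County-level ACS estimates with geometry attached (variable code is a placeholder;
# a Census API key must be set via census_api_key() beforehand)
plumbing <- get_acs(
  geography = "county",
  state     = c("NM", "AZ"),
  variables = "B25049_004",
  year      = 2022,
  survey    = "acs5",
  geometry  = TRUE
)

# Choropleth of the estimates
ggplot(plumbing) +
  geom_sf(aes(fill = estimate), colour = NA) +
  scale_fill_viridis_c() +
  labs(fill = "Households")
```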
So I really appreciate the work that the USGS is doing on this particular, you know, niche.
[00:42:27] Eric Nantz:
Yeah, and I have a soft spot for the great work they're doing. My wife was actually fortunate early in her career to have an internship at the USGS, albeit in a day when R wasn't quite as readily used as it is now, but it's great to see this group in particular being really modern with their approaches. And again, top-notch narrative, top-notch visualization, so it's really exciting to see. I believe we've featured this group on previous highlights, so you'll wanna check out the back catalog for some of the excellent work they've been doing in this space. So excellent find, Mike, and there are a lot more finds than just those. Again, we invite you to check out the rest of the R Weekly issue at rweekly.org.
We, of course, have a direct link to this particular issue in the show notes, but you'll also wanna check the back catalog of both the issues and this humble podcast itself, because we've got so many great things to talk about and so many great things to learn. As you heard, I've basically nerd-sniped myself into a new project, hopefully this year, that I can work on with Shiny and Closeread, so we'll see what happens there. But if you wanna see what else is happening, and if you want to be a part of what the readers are gonna see every week, we value your contributions, and the best way to do that is, again, to head to rweekly.org.
You'll see in the top right corner a link to the upcoming issue's draft, where you can send a pull request to tell us about that great new package, that great new visualization, that great new use case of Shiny or AI or other technologies that you see in this data science community. We'd love to hear it. Again, it's all Markdown, all the time. I was told years ago that if you couldn't learn Markdown in five minutes, he'd give you five dollars, and he never had to give out any money. So there you go, folks. And also, we love hearing from you. You can get in touch with us via the contact page in the episode show notes. You can also send us a fun little boost with a modern podcast app; those details are in the show notes as well. And you can also get in touch with us on the social medias.
I am on Mastodon these days at @[email protected]. I am also on Bluesky, where I am @rpodcast.bsky.social; I believe that's how you say it. Mike, where can the listeners find you?
[00:44:51] Mike Thomas:
Yes, I am on Bluesky for the most part these days at mike-thomas.bsky.social. I'm also on Mastodon a little bit at @[email protected], and you can check out what I'm up to on LinkedIn if you search Ketchbrook Analytics, K E T C H B R O O K. And a bit of a shout out, a self plug, here: we are looking for a DevOps expert. If you're somebody who has expertise in Docker, a little Kubernetes, Azure preferred, but it doesn't really matter because we're all spinning up Linux servers at the end of the day, we could use some help managing our and our clients' ShinyProxy environments. So any DevOps folks out there, please feel free to reach out.
[00:45:40] Eric Nantz:
I'm sure there are many of you out there, so, yeah, take Mike up on this tremendous opportunity. I'm still learning the DevOps ropes; we share many stories about our adventures there. So that's a great plug, Mike. I'm also on LinkedIn as well, and we'll add that little call out to the show notes if you're interested in pursuing that. Nonetheless, we're gonna close up shop here for episode 194 of R Weekly Highlights. Before I go, I wanna send a very hearty congratulations to Chris Fisher and the team at Jupiter Broadcasting, who recently had episode 600 of Linux Unplugged.
Tremendous achievement, folks. You'll be seeing a boost from me in the coming days. I don't know if we'll ever get there, Mike, but nonetheless, that's a huge number for a podcast of that size. So congrats to them. Well, we'll get to 200 at least, and we'll see what happens after that. Alright, I've got no plans to stop. Well, yeah, me either. We'll see what happens, buddy. But nonetheless, we hope you enjoyed episode 194 of R Weekly Highlights, and we'll be back with episode 195 next week.