Thriving in a multi-lingual data science lifestyle while authoring your next Quarto project, putting LLMs to the scientific test with parsing manuscripts, and replicating a life-saving spatial visualization originally created over 170 years ago!
Episode Links
- This week's curator: Ryo Nakagawara - @[email protected] (Mastodon) & @rbyryo.bsky.social (Bluesky) & @R_by_Ryo (X/Twitter)
- Creating multilingual documentation with Quarto
- The ellmer package for using LLMs with R is a game changer for scientists
- {SnowData} 1.0.0: Historical Data from John Snow's 1854 Cholera Outbreak Map
- Entire issue available at rweekly.org/2025-W12
- Wes McKinney & Hadley Wickham (on cross-language collaboration, Positron, career beginnings, & more) https://youtu.be/D-xmvFY_i7U
- Tabby Quarto extension https://quarto.thecoatlessprofessor.com/tabby/
- Plotting the Past: The 1854 Cholera Outbreak Visualized in R https://simplifyingstats.wordpress.com/2025/01/18/plotting-the-past-the-1854-cholera-outbreak-visualised-in-r/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Lost in a Nightmare - Castlevania: Symphony of the Night - Palpable - https://ocremix.org/remix/OCR03001
- Cammy's London Drizzle - Super Street Fighter II: The New Challengers - MkVaff - https://ocremix.org/remix/OCR00453
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 199 of the R Weekly Highlights podcast. This is the weekly show where we talk about the excellent resources that are in the highlights section, amongst much more, in this week's R Weekly issue. My name is Eric Nantz, and I'm delighted you've joined us from wherever you are around the world.
[00:00:22] Mike Thomas:
And very happy to be joined, as always, by my awesome cohost who's been with me for many of those 199 episodes, Mike Thomas. Mike, how are you doing today? Doing great, Eric. Yeah. It's pretty crazy to think that we're coming up on that number. I don't know exactly what my count is. I'll have to do a little web scraping or leverage some APIs, maybe run some dplyr, to figure out how many of those I've contributed to. But it feels like a lot, and it's been a blast. It sure has been. Maybe we should throw the RSS feed into some LLM or something, and it'll tell us. You know, that's all the rage these days, which
[00:00:56] Eric Nantz:
Wah wah wah. Well, you know, what can you do? And there will be a section where we touch on that a bit, but, nonetheless, we are happy to talk about this latest issue, which has been curated by Ryo Nakagawara, another one of our OG curators on the R Weekly team, who was very helpful in getting me on board many, many years ago. Seems like many years ago. Maybe it hasn't been that long, but, man, time flies in open source, doesn't it? But as always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world with your pull requests and other suggestions. And it is a lifestyle, Mike, that you and I live every day. We were just talking about this on the preshow: we may be using multiple languages, often all at once, in our projects. It is the reality, whether you're a solo developer and you've seen that great utility that just isn't in your primary language of choice, but there's an open source equivalent in another language, and you wanna bring it all together.
Certainly, that's been a key focus of many vendors, consulting companies, and enterprises in today's world: to build tooling and capabilities that can leverage multiple languages at once. And there was an interesting chat with two very prominent leaders in this space about their thoughts on interoperability across different languages in data science. So our first highlight is a blog post authored by Isabella Velasquez over at Posit, and it summarizes a recent fireside chat featuring none other than Hadley Wickham, chief scientist at Posit, author of the tidyverse, and many other important contributions in the R community. Obviously, he needs almost no introduction to our audience here. And he was joined by Wes McKinney, the architect of pandas, who has also been working with Posit as well.
And they were recently, as I mentioned, part of a fireside chat hosted by Posit, where they invited a small gathering of fellow data scientists, much in the style of what Posit does basically every week with the Data Science Hangout. It was kind of a mini hangout of sorts. The YouTube video is online, and if you wanna watch or listen to that after you listen to this very show, we'll have a link in the show notes, of course, to that recording. But there was one aspect that Isabella touched on in her blog post that I think is a very important capability we are living every day, and it was brought up by one of the audience members in that fireside chat: the use of multiple languages, and being able to leverage the Quarto documentation and publishing system to help users compare code snippets between how you might accomplish something in R and how you might accomplish something in Python.
We all know that R and Python, as well as Julia and Observable, are languages that are well supported in Quarto. And so this blog post talks about a couple of solutions you might reach for as you're crafting these resources and putting these multiple snippets in place, letting the user see the different snippets in a grouped fashion in real time as they navigate through your interface. And within Quarto, there is functionality to do tab sets in your page to help organize content. This could be a web page, a standard HTML document, or a Quarto slide deck with reveal.js. The tab sets are usable almost everywhere you have interactive HTML.
And what a lot of people will do is they'll have, say, one tab for code and another tab for output. And like I said, in this context, they might have different tabs for, like, how you do a certain function or a certain analysis in R. And then, maybe for teaching purposes or getting members of your team up to speed, you might have the equivalent snippet in Python. There are two ways to accomplish this. One is built in to Quarto itself, and another is a really awesome extension that does a similar thing. So the first way to accomplish this is that when you're defining these tab panels, there is a panel-tabset div that you put in your Quarto document to set up this group.
You give that group a name. You can call it, like, language or whatnot. In fact, that could be any naming you want. But then within the subheadings of this tab set, you put the R code under an R heading and then, say, a Python block under a Python heading. And then you make sure that these are tagged with the same group name across multiple tab sets. And then when you click, like, the R tab in one tab set, the other tab set will react accordingly. Same with clicking the Python tab.
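As a concrete illustration of the built-in approach described above, here is a minimal sketch of two grouped tab sets in a Quarto document. The chunk contents are placeholders, not taken from the blog post:

````markdown
::: {.panel-tabset group="language"}

## R

```{r}
# simple scatterplot in R
plot(mtcars$wt, mtcars$mpg)
```

## Python

```{python}
# the equivalent Python snippet would go here
```

:::

::: {.panel-tabset group="language"}

## R

```{r}
head(mtcars)
```

## Python

```{python}
# the equivalent Python snippet would go here
```

:::
````

Because both tab sets share `group="language"`, clicking the R or Python tab in one of them switches the other one as well.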
And there are examples of this in action in the blog post, so you're invited to check that out after listening to this. But if you go through that example, there's, like, an R section where it's just doing a simple scatterplot and then a Python section doing a similar thing with both a scatterplot and, like, a data table. So that's one way to do it, built in. But I mentioned there was a great extension to do this as well, and it's called Tabby. This is authored by one of our favorites to follow in the Quarto space and amongst all of data science, James Balamuta, who goes by the Coatless Professor, always one of my favorite handles.
And this extension, Mike, I think is just up my alley for some recent projects. Why don't you walk us through this? Yeah. Absolutely. It's an extension that
[00:07:12] Mike Thomas:
accomplishes much of, you know, what the first half of the blog post talks about. But the first half of the blog post is a little bit more manual in that you would need to specify the tab sets explicitly, as opposed to using this Tabby extension. You can have all of your different chunks, which I think could be dynamic. So, say, under one tab set, for example, you have R, Python, and Julia. And then under another tab set, you have, you know, R and Julia. And you don't actually need to change the wrapper, the Quarto div, if you will, at all. You can just use this .tabby call with group equals language, and it will programmatically build out the tab set for you, as I understand it, with one tab for each of the chunks underneath it based on the particular language of each chunk. So I think it's a really nifty, smooth way to go about accomplishing what the first half of the blog post does a little bit more manually. If you want to install this Tabby extension, just open up a terminal wherever you have Quarto installed and run quarto add coatless-quarto/tabby.
It's gonna install this extension under an _extensions subdirectory that you might wanna take a look at if you're using version control. And the setup and ability to get going is really quick. They have some basic examples with Python, JavaScript, and R, all underneath this .tabby div, if you will, in Quarto. And it generates this beautiful three-tabbed tabset with Python, JavaScript, and R code in it. There's a great, I guess we'd call it a pkgdown site. I'm not sure if it is a pkgdown site, because this isn't an R package, right? It's a Quarto extension. But there's a great site that's been put together for this extension. It's quarto.thecoatlessprofessor.com/tabby.
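After running `quarto add coatless-quarto/tabby`, usage looks roughly like the sketch below, based on the description in this discussion: code chunks go directly inside a `.tabby` div and the extension builds one tab per language. Treat the exact div attributes as a sketch rather than the definitive syntax, and see the extension site for the real examples:

````markdown
::: {.tabby group="language"}

```{r}
# R chunk: Tabby creates an "R" tab for this automatically
summary(mtcars$mpg)
```

```{python}
# Python chunk: Tabby creates a "Python" tab for this automatically
```

:::
````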
So maybe that makes me think that James has even more Quarto extensions coming our way. We'll have to see. But a great blog post here, a great extension, super useful. One that I have read about before, but admittedly have not tried and put into my workflow yet, and I need to, because we have a billion use cases for this.
[00:09:40] Eric Nantz:
In fact, I have a use case I'll share right now. As Mike knows from our preshow banter, I've been on an AWS journey, if you will, with how we deploy Shiny apps and how we deploy maybe custom APIs. Now, most of you know I'm an R user. I develop most of everything in R. And while there are a lot of advancements in the R ecosystem with respect to interacting with cloud providers like AWS, like Azure, and whatnot, when you start to get in the weeds of certain bits of these services, and for AWS, you know, it's object stores, there's, like, the Secrets Manager, there's IAM roles and all that jazz, not important for what I'm talking about here. What is important is that as you're thinking about, okay, what are the best ways I can call these APIs from R, you start to search and you get some hits here and there.
But then most of the time when you search for this, there are other languages coming up at the top, especially in this case, Python with the boto3 library. So what I want to do with this kind of paradigm, especially with this Tabby extension, is, well, right now I'm writing notes mainly for myself as I'm navigating through all this, but I wanna empower other statisticians and data scientists on my team to be able to deploy these resources too, so I'm not, you know, what I like to call the bus factor for a lot of this. So I'm writing kind of this document, and it probably will be a Quarto site when I'm finished with it. And I wanna put in the equivalent of, like, the R snippet to tie into these APIs and the Python snippet with this boto3 library.
Because a lot of times when IT asks, you know, oh, you're having trouble authenticating? What are you using? Oh, you're not using boto3? Like, oops. Well, I do use it on the side for my testing, to make sure it's not me messing up the R side of things, to confirm the Python side is working first. So in this documentation, I can do tab sets. The user sees, okay, here's the R way of connecting to this API or deploying this thing, and then they hit the Python tab, and they'll get the same thing in Python. That's how the majority of our internal documentation from the IT groups has been written: it's like, we assume you're using boto3, so here you go. Well, I'm, like, the first R user on this journey. So I'm thinking this might be a useful technique to follow (a sketch of what one of those grouped tab sets might look like is below), and what a great way to put this all together with this Tabby extension in Quarto. I'm sold.
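Here is a hedged sketch of one of those grouped tab sets for the AWS use case. The transcript does not name an R-side package, so the use of paws here (and the trivial list-buckets call) is purely an assumption for illustration:

````markdown
::: {.panel-tabset group="language"}

## R

```{r}
# Assumption: using the paws package for AWS access from R
library(paws)
s3 <- s3()
s3$list_buckets()
```

## Python

```{python}
# The boto3 equivalent that internal IT documentation typically assumes
import boto3
s3 = boto3.client("s3")
print(s3.list_buckets()["Buckets"])
```

:::
````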
[00:12:14] Mike Thomas:
Definitely. And if you envision that the audience is either gonna be a Python person or an R person, for example, you can group all of these tab sets together with the Tabby extension so that if someone switches from R to Python in one tab set, all of the tab sets that you've decided should be grouped together will switch from R to Python as well.
[00:12:38] Eric Nantz:
That's a huge user experience enhancement, and one that, like I said, is gonna be a huge asset to me personally as I try to document this journey and these notes in real time so that, a, I don't forget a year from now, and, b, I can empower other developers to join me on this. But we teased at the outset, everybody, that in today's day and age, we usually have something to say about large language models on this podcast. One nice thing about the R Weekly highlights is we definitely cut through the fluff and the noise that can be out there, and we try to showcase real, novel uses of the technology. And like we said the other week, Mike and I are always learning something new, and it looks like this blog post is gonna give us just that. We have a recent use case with the recently highlighted ellmer package, and this one is titled "The ellmer package for using LLMs with R is a game changer for scientists."
So this is very much geared towards the research side of things. This blog post has been authored by the Seascape Models group, which I think we featured back in episode 196. I believe the lead of this group is Chris Brown, but we don't exactly know who authored the blog post, so we're just gonna say it's from the Seascape Models team. And after the introduction, where the author talks about why the ellmer package in particular is a game changer for scientists, well, it's a lot of the things we talk about here as we're learning about this: ellmer is really helping you automate a lot of the setup to many of the LLM providers out there, as well as the capabilities of tool calling and some other nice enhancements that we're gonna be getting to later on in this post, especially when dealing with textual data.
So the first part of the post, we'll breeze through this, but ellmer also has great documentation on how you set up your authentication to these services. I believe in this example they're using the Anthropic API for Claude, which is something that Mike and I are using routinely in our projects now, so we're starting to get familiar with that. And then it walks through just the basic ways of setting up the chat object in ellmer based on your system prompt and the model. You can also specify the max number of tokens. That can be quite important if you're watching costs too, to make sure you're not getting overcharged there.
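A minimal sketch of that chat setup in ellmer, assuming the Anthropic/Claude backend discussed here. The prompt wording and model string are placeholders rather than the blog's exact values, and an ANTHROPIC_API_KEY environment variable is assumed to be set:

```r
library(ellmer)

# Create a Claude-backed chat object with a system prompt, a specific model,
# and a cap on output tokens to help keep costs in check
chat <- chat_claude(
  system_prompt = "You are an assistant that extracts details from scientific manuscripts.",
  model = "claude-3-5-sonnet-latest",  # placeholder model name
  max_tokens = 1000
)
```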
But there are some interesting nuggets as we get to the use case that this blog post is highlighting. And in the use case they're talking about, a lot of times the data or the insights you wanna summarize are contained in PDF format. Whether it's a research paper or maybe some internal documentation that was written years ago, it's in some sort of PDF. So ellmer, out of the box, and I believe this is new to the CRAN release that we touched on a couple weeks ago, has functions that can aid with the processing of text from PDFs. The author's first attempt at this: they have a manuscript on turtle fishing, which is an interesting read. You can check that out at your leisure.
They first tried to import the paper dynamically from the web link using the content_pdf_url() function that ellmer exposes. Well, that didn't work. There was a 403, a typical HTTP error that means you're not authorized to access the resource. So the author speculates this may have been bot protection trying to prohibit, you know, rogue processes from scraping web content. The workaround is to download the PDF locally on your system. Then they ran the equivalent function, but now it's content_pdf_file(). That points to the local copy of this turtle manuscript, and that actually works. So you get an object, they call it my_pdf, and that will be used later on in all this.
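In code, the two attempts look roughly like this. The URL and file name are hypothetical stand-ins for the turtle-fishing manuscript:

```r
# First attempt: read the PDF straight from the web.
# For the author this failed with a 403 (likely bot protection on the journal site).
# my_pdf <- content_pdf_url("https://example.org/turtle-fishing-paper.pdf")

# Workaround: download the PDF manually, then point ellmer at the local copy
my_pdf <- content_pdf_file("turtle-fishing-paper.pdf")
```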
But after setting up the Claude chat object, again using the familiar system prompt, the specific model, and the max tokens, this is interesting: ellmer has some built-in functions to help you define different object types. In this case, there is a type_object() function, which I have not used yet. Within this, you can write arbitrary definitions of the information you're trying to look for in the paper. Here, they have ones for the sample size of the study in the manuscript, the year of the study, and the method.
And within these type objects, you can use ancillary functions called, like, type_number() and type_string() to tell it what kind of value it should get when it's scraping this PDF after running the model. So with all that set up, the author then calls the chat object's extract_data() method, again feeding in that PDF object they got from reading the file, and the type argument uses this type object that was defined; they call it paper_stats. So when you get that back, you start to see in this object the different slots for those three kinds of data it was looking for: the sample size, the year of the study, and then the method, which was another string of, like, what was the statistical method in a paragraph or less. And it works, but with cautions everywhere, right? You do have to be careful about what results you're getting back and the kind of framing you're putting in with this.
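Putting those pieces together, a sketch of the type definition and the extraction call looks something like this. The field descriptions are paraphrased from the discussion above, not copied from the blog post:

```r
# Describe the three pieces of information to pull out of the paper
paper_stats <- type_object(
  sample_size = type_number("Sample size of the study"),
  year        = type_number("Year the study was conducted"),
  method      = type_string("Statistical method used, in a paragraph or less")
)

# Ask the chat object to extract those fields from the PDF content
results <- chat$extract_data(my_pdf, type = paper_stats)

# Then inspect each slot that comes back
results$sample_size
results$year
results$method
```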
And in this case, it looked like the sample size was the correct amount. But, again, you have to be cautious. And if you wanna run this in a batch setting, that's the next part of the post. What if it's not just one paper? What if it's, like, a hundred or so of these? You could write a wrapper function like they do here to kind of automate this process of grabbing the text, extracting the data with that custom type object definition, and returning the result. Now, the other risk they talk about is that if the chat service can't find the answer, it might try to make one up, called a hallucination in the common lingo. There are some safeguards you can put in your ellmer calls to try to suppress that, in these type function calls, where there's a parameter called required.
If you set that to false, then typically, if it doesn't find that particular piece of data with the type you defined, it should just leave it as a null. Not always, though. They caution that it still might try to do something, so you've gotta, again, look at the results. In their experience, when they set required to false, it was still hallucinating some of these answers for these different types. So, again, your mileage may vary, but what I'm learning here is that, a, this type object definition is hugely important when you're in this context of scraping information from these PDFs or other kinds of structured text.
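For the batch setting, here is a rough sketch of the kind of wrapper the post describes, with required = FALSE on each type so missing values should, ideally, come back as NULL rather than being invented, though, as noted above, that is not guaranteed. The directory name and prompt wording are assumptions:

```r
extract_paper_stats <- function(pdf_path) {
  # Fresh chat per paper so earlier papers don't leak into the context
  chat <- chat_claude(
    system_prompt = "Extract study details from this scientific manuscript.",
    max_tokens = 1000
  )
  spec <- type_object(
    sample_size = type_number("Sample size of the study", required = FALSE),
    year        = type_number("Year the study was conducted", required = FALSE),
    method      = type_string("Statistical method used, in a paragraph or less", required = FALSE)
  )
  chat$extract_data(content_pdf_file(pdf_path), type = spec)
}

# Run over a folder of papers (hypothetical directory)
pdf_files <- list.files("papers", pattern = "\\.pdf$", full.names = TRUE)
all_stats <- lapply(pdf_files, extract_paper_stats)
```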
And in some of the reflections that wrap up the post, some things you wanna think about: you know, these things aren't free, right? I was just telling Mike, I just ponied up to pay for the professional Claude account, so I don't wanna burn through my usage in one month because I'm doing a bunch of repeated calls as I'm prototyping things. So you may want to keep your testing a little minimal as you're iterating through this, maybe set the token limit lower or whatnot. And then dealing with structured or unstructured text can be problematic, especially in PDFs. That's why, if you have a resource in HTML, oh, you're golden, right? Because that's structured in a markup language, easy to scrape. But we don't always have that luxury. If you do have it, though, they recommend grabbing the HTML first.
And like everything in these pipelines, the prompt is a huge part of this. It probably takes a few iterations to get this right. But make it as domain specific as possible, I think, and also maybe repeat it a couple of times with some experiments so you can see the variation that occurs when you run the same prompt or maybe tweak one or two sentences in that prompt. I think that's also a good practice to have. But the potential is here, folks. The potential is here: if you've got data trapped in these PDFs or other textual documents, it looks like ellmer is trying to help you grab that data out, and help you define how you wanna extract the different pieces with these type objects.
And lo and behold, you might have a great way to at least reduce, what, half your effort, maybe 80% of your effort, of trying to do this all yourself. I still remember at the day job there was a vendor we paid a lot of money to do this quote-unquote curation of these PDFs all manually. It would take them months upon months upon months to finish it. And even then, we'd get this big old dataset out in Excel format, and we wouldn't know heads or tails of how to make sense of it. So will this replace everything? No. But I think this is a huge win for research out there, especially for the type of information we're seeing in this highlight. So I learned something new with the ellmer package, as always. Mike, what do you think about all this?
[00:23:02] Mike Thomas:
Yeah. Absolutely. It's pretty crazy. And I think that this blog post and some of these highlights from the ellmer package actually taught me some things about what's going on in the, I guess, LLM ecosystem, and some things that I hadn't realized yet. I think one thing we touched on is there is a function in here that's called, what is it, content_pdf_file() or content_pdf_url(). And I was thinking initially that that, you know, maybe extracted the text from the PDF, stored it in a vector database, and this was sort of a RAG type of thing. But it sounds like, from taking a look at some of the ellmer documentation, it's actually sending the PDF file to the LLM service itself and then letting the LLM service do that. And it seems like now some of these APIs, at least, and probably the front-end interfaces in Claude and ChatGPT and all of them, actually allow for a PDF file as input and not just, you know, your prompt text, if you will. So I think that's what's going on here. Obviously, if you have, you know, sensitive information in a PDF at your company, you wanna make sure that you check with somebody before you send it to one of these third-party services, because you never know exactly what they're going to do with it.
And then this whole type object idea is really interesting too. And I've got to imagine that it's still sort of imperfect, but it's nice that some of these LLM services are starting to provide some of these, you know, maybe guardrails, or whatever you wanna call them, to better, you know, hopefully structure the result that you're going to get. So this type_object(), and type_number(), type_string(), all of these different object formats that we have the ability to specify within ellmer, it looks like these are fairly recent updates to the LLM services out there that allow us to do that.
Yeah, we have plenty of use cases where we want to do this exact thing, right? You have some sort of large PDF, and you have a couple of data points that your analysis team is looking to get out of those PDFs. And instead of having them control-F through the whole document, it would be great to just throw it at an LLM, get those answers right away, and stick them in an Excel workbook for them to analyze. But, you know, I think you still run the risk, depending on how high-stakes the situation is, you know, how important the model's accuracy is, of getting the wrong answer, right? And you still need to spot check all of those answers if it's a situation where the results of this LLM are going to be used for downstream decision making. But again, it seems like we're getting some improvements here to make things a little bit more accurate, you know, provide a little bit more guardrails to better ensure that what we want is what we're getting out of these models. So that's really, really interesting to me.
You know, pretty exciting stuff from what I can tell. And, again, that type object paradigm is a recent LLM feature. According to the vignette on structured data on the ellmer pkgdown site, this structured data, a.k.a. structured output, is a recent LLM feature that we have the ability to leverage. And, again, I'm not sure if this goes for all of the LLMs out there or only some of them. But it definitely is interesting, because I think if it's not implemented in all of them, it will be soon. And these new features definitely help us as data scientists be able to, ideally, better serve our end users.
[00:27:01] Eric Nantz:
And I can see this also combining really nicely with what I mentioned earlier, the fact that ellmer is a front end to this tool calling paradigm, where we can let the R process on your local system help the chat model get to an answer that may be more dynamically dependent on, say, the current time, the current situation, or current data that you don't want exposed to the LLM but you want available to it in a local sense. And imagine having a set of PDFs or other structured or unstructured text that gives additional context to a data analysis you're doing, or a summarization of a research process. You've got some data already in house that has some of the information, but you wanna help it out with this other information, so you combine this ability to get the PDF text out with some of the recent advancements that ellmer's bringing with tool calling. And then I'll also plug that, I believe just this week, I saw a post on the Posit blog about a sister package to ellmer called gander, by Simon Couch, that just hit CRAN as well, which will let ellmer be aware of the R environment and the objects you have loaded in your session.
So I see lots of interesting ways this could all be melded together to help give additional power to the bots, if you will, the chat bots, to get these insights more quickly. And hopefully, if you're sifting through, like, hundreds of these documents or hundreds of these documentation pages, it may be more of a screening thing, where you're just like, okay, I only need a certain set of this for this decision. When you get to the set you've narrowed down from this big fishing net, over to, like, maybe five or 10 key inputs, that's where you really wanna validate the final answer yourself. But like you said, Mike, everything you said I agree with. This could aid in decision making; you just gotta be cautious about it.
But it's just interesting to see how you can blend all this together, which I'm still
[00:29:15] Mike Thomas:
trying to wrap my head around, but I think it's a really promising start here. I agree. Yeah. We have a pretty big project going on right now where we're doing something similar. We get these 200-, 300-page PDFs that we're trying to get 20 data points out of. And it's a situation where accuracy is really important. So we're actually building a Shiny app front end, and our AI engineer has leveraged some of these open-weights models and put together some pretty cool, you know, system prompts, as well as, I think, just the regular prompts that you throw at these LLMs. And what we're doing is we're returning all of the instances where we found, in the PDF, snippets of text that could match the particular answer the user is looking for, and then giving them radio buttons in the Shiny app to pick which one they think is the correct one, or manually override it in a text input. So we're pretty excited about it. That's our way to avoid, you know, all of the risk that comes with automating that process fully.
[00:30:19] Eric Nantz:
Yeah. I'm super excited to see where that effort goes. And as always, I'm a novice in this space. I'm learning a bit too in my local prototyping, but lots of novel use cases here, and we've got the tooling to make it happen. And our last highlight today showcases some pioneering efforts from long ago. We're talking about over a hundred and seventy years ago, to be exact, on some novel uses of visualization to help with a very important health issue that occurred back in 1854, folks. Yeah.
I'm maybe an old timer, but I'm not that old. I kid. I kid. So this highlight is actually a package that's just been released called SnowData. It is authored by Neema Minnag, and I know I didn't get that right at all, but she's a postdoctoral researcher at Maynooth University, and this package exposes some of the datasets that she curated as part of her exploration from earlier this year, in January, to be exact, to understand, with modern tooling, how one could recreate the visualization of this 1854 cholera outbreak, this time in R.
But to give you a very quick overview: in 1854, a very influential physician, John Snow, used a data-driven approach to trace the source of a very harsh, very devastating cholera outbreak in London, and he was able to trace it back to a contaminated water pump. What he did was actually start mapping out where the outbreaks were occurring and then trace that back to where these water pumps were located in the city, using cartography and literally drawing in notebooks. You know, there are scans of this online that we can link in the show notes. That basically helped the government figure out, hey, we found the source, we gotta stop this now to keep the outbreak from spreading even further.
So we invite you to check out the blog post that has been authored here as a great accompanying piece to the package. The package, to be transparent, is just the datasets and not much else, so you're kind of left to your own devices to figure out how to put all this together. But the blog post literally walks through her effort, at the time, to build this all up herself, which you now have in the package: the cholera cases dataset with the locations, the x and y coordinates, and the observation IDs, along with the water pump location information.
And in the blog post, she shows some clever use of the rasterVis package to take a TIFF object, which is, again, available online through these different publicly available domains, and render in your graphics device what definitely looks like the scanned copy of this map of the streets in London where these outbreaks were taking place. So it's a cool way to have a vintage-looking representation, but annotated on top of the plot are dots showing the outbreaks. And she was also able to change the size of the dots based on the prevalence of the outbreaks in that particular region.
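A minimal sketch of that plotting approach, using rasterVis as described above. The file path, data frame columns, and styling here are assumptions standing in for the SnowData objects and the blog's actual code:

```r
library(raster)
library(rasterVis)
library(latticeExtra)

# Scanned 1854 map, downloaded as a TIFF (hypothetical local path)
snow_map <- raster("snow_map_1854.tif")

# Stand-ins for the cholera case and pump datasets; column names are assumptions
cases <- data.frame(x = c(12, 18, 25), y = c(30, 34, 28), deaths = c(2, 9, 4))
pumps <- data.frame(x = c(17), y = c(32))

# Render the scan as a grayscale base layer, then overlay dots sized by case counts
levelplot(snow_map, margin = FALSE, col.regions = gray.colors(255)) +
  layer(panel.points(x, y, pch = 19, cex = sqrt(deaths), col = "firebrick"), data = cases) +
  layer(panel.points(x, y, pch = 17, cex = 1.5, col = "navy"), data = pumps)
```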
And there's a lot more going on with respect to how the street angles were calculated to get the locations a little more precise. Really novel stuff that I wouldn't know heads or tails of how to do myself. So if you've seen a pioneering effort in spatial visualization, whether it's health related or otherwise, I think what we see in this highlight is a great showcase of what you can do with R. It gets a little into the weeds, but the SnowData package combined with these mapping techniques is a great way to recreate that vintage visualization that came from John Snow back in the day.
[00:34:59] Mike Thomas:
Yeah. Absolutely, Eric. This story was, I think, one of the most famous early uses of geospatial data analysis to solve what was really a public health crisis at the time. This is such a nice little package surrounding that story. And I think the package could be a great utility for, like, an undergrad data science course mini project. If I'm thinking back to when I first learned R in Stats 101, 201, whatever it was, we did not do any fun, interesting projects like this that were tied to a real story. It was like rnorm: let's take a look at the normal distribution, tune some parameters, yep, change the parameters and see what happens. I feel like you could use this as a real-world use case that happened with this cholera outbreak, and try to follow John Snow's steps, and play a little detective, and find the location of the source of the outbreak. I think that would be fun for students, getting their hands on a little R programming, a little geospatial data analysis, data vis type stuff with, you know, these, I think they're terra datasets, or something like that. So, I think, yeah, a really interesting little package and a really nice use case to tell this really awesome story of John Snow saving the day.
[00:36:16] Eric Nantz:
Yeah. Excellent. Like I said, pioneering work on a major health crisis, and there are so many of these out in history. And if we can get our hands on the original data sources, yeah, we should be able to digitize them and recreate these novel visualizations. And I admit, having a project like this sure would beat, you know, the infamous balls-in-an-urn, figure-out-the-probabilities exercise. Oh, this one I could sink my teeth into, even as a novice with spatial visualization. But there's a lot more you can sink your teeth into with this issue of R Weekly. We've got, of course, the link to the full issue itself. It's got its usual gamut of new packages, great tutorials, and blog posts.
We're running a bit short on time, so we won't do our additional finds. But, again, we have everything linked in the show notes if you wanna check out the full issue. Ryo, as always, does a fantastic job here. And, you know, it's all fantastic as well because of you and the community. So if you wanna help out the project, the best way to do that, if you find that great blog post or that great tutorial, is to send us a pull request with that link right at rweekly.org. There's a link in the upper right corner that gets you to the GitHub template for the pull request.
It's a little easy-to-use template to follow, and it's all Markdown all the time. We'd love to contribute whatever you find to the next issue. And, also, we love hearing from you. And as you saw in the episode number, next week's a big one: episode 200, folks. It's hard to believe we're getting there, but we are. So if you have a favorite memory you wanna share with us, we'd love to read it on the show. You can get in touch with us multiple ways. We have a contact page in the episode show notes; it's right there for you to send a little web form. You can also get in touch with us on social media. I am on Bluesky these days at @rpodcast.bsky.social.
I'm also on Mastodon, where I'm at [email protected]. And I'm on LinkedIn. Just search my name, and you'll find me there. And, Mike, where can listeners find you?
[00:38:22] Mike Thomas:
You can find me on Bluesky at mike-thomas.bsky.social, or you can find me on LinkedIn. If you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, you can see what I'm up to lately.
[00:38:36] Eric Nantz:
Very good. And, yeah, I got a preview of what you're up to, and I am intrigued, to say the least. So we hope to hear more about that soon. But like I said, next week's a big one, episode 200, and we'll see what actually happens there. We're hoping you can join us for that next week as well. So until then, we'll close up shop for episode 199 of R Weekly Highlights, and we will indeed be back with episode 200 next week.
Hello, friends. We are back with episode a 99 of the Our Weekly Highlights podcast. This is the weekly show where we talk about the excellent resources that are in our highlights section amongst much more in this week's our weekly issue. My name is Eric Nance, and I'm delighted you join us from wherever you are around the world.
[00:00:22] Mike Thomas:
And very happy to be joined as always by my awesome cohost who's been with me for many of those a hundred and nine ninety nine episodes, Mike Thomas. Mike, how are you doing today? Doing great, Eric. Yeah. It's it's pretty crazy to think that we're coming up on that number. I don't know exactly what my count is. I'll have to do a little web scraping or, leverage some APIs to be able to to run some deep wire and figure out how many of those I've contributed to. But it feels like a lot, and it's been a blast. It sure has been. Maybe we should throw the the RSS feed in some LOM or something, and it'll tell it for us. You know, that's all the rage these days, which
[00:00:56] Eric Nantz:
Wah wah wah. Well, you know, what can you do? And there will be a section where we touch on that a bit, but, nonetheless, we are happy to talk about this latest issue that has been curated by real Nakakura, another one of our OG curators on the our weekly team who has been very helpful for to get me on board for many, many years ago. Seems like many years ago. Maybe it hasn't been that long, but, man, time flies in open source, doesn't it? But as always, he had tremendous help from our foe, our working team members, and contributors like all of you around the world with your poll request and other suggestions. And it is a lifestyle, Mike, that you and I live of every day. We're we're just talking about this on the preshow where we may be losing multiple languages often at once in our projects. It is the reality, whether you're a solo developer and you've seen that great utility and it just isn't in your maybe primary language of choice, but there's an open source equivalent in another language, and you wanna bring it all together.
Certainly, that's been a key focus of many vendors, consultation companies, and enterprises in today's world to build tooling and capabilities that can have the ability to leverage multiple languages at once. And there was an interesting chat with two very prominent roles or leaders in this space about their thoughts on interoperability across different languages and data signs. So our first highlight is a blog post that has been offered by Isabella Velasquez over at Posit, and it is summarizing a recent fireside chat that consisted of none other than Hadley Wickham, chief scientist at Posit, author of the tidy verse, and many other important contributions in the art community. Obviously, needs almost no introduction to our audience here. And he was joined by Wes McKinney, the, you know, the, obviously, the architect of pandas, and it was also been working on, with posit as well.
And they were recently, as I mentioned, part of a fireside chat that, was hosted by posit where they invited, a small gathering of fellow data scientists, much in the style of the what they've done basically every week called the data science hangout. It was kind of a mini hangout of sorts. The YouTube video is online. And if you wanna watch that or listen to that after you listen to this very show, we'll have a link in the show notes, of course, to that recording. But there was one aspect that Isabella touched on in her blog post that I think is very much a important reality, but an important capability that we are living every day that was brought up by one of the audience members in that discussion or that fireside chat is the use of, you know, multiple languages and being able to leverage the Quarto documentation publishing system that to help users look at different code snippets between how you might accomplish something in R and how you might accomplish something in Python.
We both we all know that both R and Python as well as Julia and observable are all languages that are well supported in quarto. And so this, this blog post is talking about a couple solutions that you might have as you're crafting these resources together and you're putting these multiple snippets in place of letting the user really see kind of in a group fashion, these different snippets in real time as they're navigating through your interface. And within Quarto, there is the functionality to do tab sets in your page, to help organize content. This could be a web page. This could be a standard HTML document. It could be a cordial slide deck. We reveal JS. The tab sets are usable almost everywhere they have interactive HTML.
And what a lot of people will do is they'll have, say, a tab maybe for code, another tab for output. And like I said, in this context, they might have different tabs for, like, how you do certain function or certain analysis in r. And then maybe for teaching purposes or, you know, getting members of your team up to speed, you might have the equivalent, snippet in Python. There are two ways to accomplish this. One is, built in to Quarto itself, and another is a really awesome extension that does a similar thing. So the first way to accomplish this is that when you're defining these tab panels, there is a panel kind of div operator that you put in your Chordal document to set up this group.
You give that that group a name. You can call it, like, language or whatnot. In fact, that that could be any naming you want. But then within the subheadings of this tab set, you put the r code with an r heading and then say a Python block with a Python heading. And then you make sure that these are, again, tagged appropriately with the group amongst multiple tabs that so that they have the same group name. And then when you click, like, the r tab in one tab set, the other tab set will react accordingly. Same with clicking the Python tab.
And there are examples of this in action in the blog post, so you're invited to check that out after the, after listening to this. But if you go through that that example, there's, like, an r section where it's just doing a simple scatterplot and then a Python section doing a similar thing with both a scatter plot and, like, a data table. So that's one way to do it built in. But I mentioned there was a great extension to do this as well, and it's called Tabby. This is authored by one of our favorites to follow in the quartal space and amongst all data science, James Balutoma, who goes by the cultless professor, always one of my favorite handles.
And this extension, Mike, I think is just up my alley for some recent projects. Why don't you walk us through this? Yeah. Absolutely. It's an extension that
[00:07:12] Mike Thomas:
accomplishes much of, you know, what the first half of the blog post, talks about. But the first half of the blog post is a little bit more manual in that, you would need to sort of specify the tab sets explicitly as opposed to using this Tabby extension. You can have all of your different chunks, which I think could be dynamic. So, say, under one tab set, for example, you have r, Python, and Julia. And then under another tab set, you have, you know, r and Julia. And you don't actually need to change sort of the wrapper, the the quarto div, if you will, at all. You can just use this dot tabbing, call and then group equals language, and it will programmatically build out the tab set for you as I understand it, for one tab for each of the, chunks that you have underneath it based upon the particular language for each of those chunks. So I think it's a really nifty, smooth way to be able to go about, accomplishing, you know, what the first half of the blog post does a little bit more manually. If you want to install this Tabby extension, just open up a terminal, wherever you have quarto installed and run quarto add, colist dash quarto backslash tabby.
It's gonna install this extension under an underscore extension subdirectory, that you might wanna take a look at if you're using version control. And the setup and ability to get going is is really quick. They have some basic examples with Python, JavaScript, and R, all underneath this dot tabby div, if you will, in quarto. And it generates this beautiful three tabbed tabset with Python, JavaScript, and R code in it. There's a great I guess we'd call it a package down site. I'm not sure if it is a package down site because this isn't an R package. Right? It's a quarto extension. But there's a great a great site that, that we have put together here for this extension. It's quarto.thecultlistprofessor.combackslashtabby.
So maybe that makes me think that, is it James? Has even more, more quarto extensions coming our way. We'll have to see. But, great great blog post here, great extension, super useful. One that I have read about before, but admittedly have not tried and put into my workflow yet, and I need to because we have a billion use cases for this.
[00:09:40] Eric Nantz:
In fact, I have a use case I'll I'll share right now as, Mike knows from our preshow banter that I've been on a AWS, journey, if you will, with leveraging, how we deploy Shiny apps and how we deploy maybe custom APIs. Now most of you know, I'm I'm an R user. I I develop most of my everything in R. And while there are a lot of advancements in the R ecosystem respect to interacting with cloud, you know, mechanisms or cloud providers like AWS, like Azure, and whatnot. When you start to get in the weeds of, like, certain bits of these services and let's say for AWS, you're you know, it was object stores. There's, like, the secrets manager. There's I'm roles and all that jazz. Not important for what I'm talking about here. What is important is as you're thinking about, okay, what are the best ways I can call these APIs from our and you start to search and you get some hits here and there.
But then most of the time when you search for this, there are other languages are coming up at the top of this, especially in this case, Python with the Boto three library. So what I want to do with this kind of paradigm, especially with this Tabby extension, is I'm I right now, I'm writing notes for mainly myself as I'm navigating through all this, but I wanna empower other statisticians and data scientists in my team to be able to deploy these resources too. So I'm not, you know, what I like to call the bus factor for a lot of this. So I'm writing kind of this document, and it probably will be a portal site when I'm finished with it. And I wanna put in the equivalent of, like, the r snippet to tie into these APIs and the Python snippet with this boto three library.
Because a lot of times when IT asks, you know, oh, you're having trouble authenticating that, what are you using? Oh, you're not using Boto three? Like, oops. Well, I I do on the side to do my testing to make sure it's not me messing up the r side of things to make sure that the Python side is working first. So in this documentation, I can do tab sets. So when the user says, okay, here's the r way of connecting to this API or deploying this thing. And then they hit the Python tab, and they'll get the same thing in Python. That's where the majority of our internal documentation from the IT groups have written as is like, we assume we're using Boto three, so here you go. Well, I'm like the the first r user in this journey. So I'm thinking this might be a useful technique to follow, and and what a great way to put this all together with this Tabby extension in Corto. I'm I'm sold.
[00:12:14] Mike Thomas:
Definitely. And if you envision that the audience is either gonna be a Python person or an R person, for example. You can group all of these tab sets together with the Tabby extension so that if someone switches from R to Python in one tab set, all of the tab sets, that you've decided to should be grouped together will switch from r to Python as well.
[00:12:38] Eric Nantz:
That's a huge user experience enhancement and one that, like I said, this is gonna be a huge asset to me personally as I try to document this journey and notes in real time so that, a, I don't forget in a year from now, and b, I can empower other developers to join me on this. But we teased at the outset, everybody that, you know, it is in today's day and age. We usually have something to say about large language models on this podcast these days, but we're trying. One nice thing about the r weekly highlights is we definitely cut through the fluff and, the noise that can be out there. And we try to showcase real novel uses of the technology. And like we said the other week, Mike and I are always learning something new, and it looks like this blog post is gonna give us just that. We have a recent use case with the recently highlighted Elmer package, but this one is titled the Elmer package for using large language models and r is a game changer for scientists.
So this is very much geared towards the research side of things. But this is being authored this blog post has been authored by the seascapes models group, which I think we featured back in episode a 96. I believe the lead of this group is Chris Brown. But we don't exactly know who authored a blog post. So we're just gonna say it's from the seascapes team. And after the introduction where the author talks about, you know, why is the Elmer package in particular a game changer for scientists, well, a lot of the things that we talk about here as we're learning about this, Elmer is really helping you automate a lot of this setup to many of the l o m providers out there, as well as the capabilities of tool calling and some other nice enhancements. So I think we're gonna be getting to later on in this post, especially when dealing with textual data.
So the first part of the post, we'll we'll breeze through this, but Elmer also has great documentation on how you set up your authentication to these services. I believe in this example, they're using the anthropic API from Claude, which is something that Mike and I are using routinely in our projects now. So we're starting to get familiar with that. And then it walks through just the basic ways of setting up the chat object in Elmer based on your system prompt, the model. You can also specify the max number of tokens. That can be quite important if you're working on costs too, to make sure you're not overcharging there.
But there are some interesting nuggets as we get to the use case that this blog post is highlighting here. And the use case that they're talking about, A lot of times the data or the insights that you wanna summarize are contained in PDF format. If they're from a research paper, maybe some internal documentation that was written years ago, It's in some sort of PDF. So Elmer out of the box, and I believe this is new to the CRAN release that we touched on last week or a couple weeks ago. There are functions that can aid with the processing of text from PDFs. Though, the author's first attempt at this is they have a manuscript on turtle fishing, which is interesting read. You can check that out in your leisure.
They first tried to import this, paper online dynamically for the web link using the content PDF URL function that Elmer exposes. Well, that didn't work. There was a four zero three. That's a typical HTML errors means that you're unauthorized or there's a server error around there. So he, the author speculates that this may have been a bot that's trying to prohibit, you know, some rogue processes from scraping web content. So the workaround is that you download the PDF locally on your system. Then they ran the same function, but now it's content PDF file. So that browses to the local copy of this, turtle manuscript, and that actually works. So you get an object. They call it my PDF, and that will be used later on in all this.
But after setting up the clawed chat object and, again, using the the familiar system prompt, the specific model, and the max tokens, this is interesting. Elmer has some built in functions to help you with conversion of different object types. And in this case, there is a type object function, which I have not used yet. But within this, you can have some arbitrary definitions of the information you're trying to look for in this paper here. In this case, they have some for sample size of the study and the manuscript, the year of the study and the method.
And within these type objects, you can have ancillary functions called, like, type underscore number, type underscore string to kind of give it what it should get when it's scraping this PDF after running the model. So if all that's set up, then the author calls all of this and the chat object with the extract data method. Again, feeding in that PDF object that they got from extracting from the file. And then the type is using this type object that was defined. They call it paper stats. So when you get that back, you start to see then in this object, the different slots for those three kind of data types that was looking for the sample size, the year of the study, and then the method, which was another string of like, what was the statistical method and a paragraph or less, and it works, but by cautions everywhere. Right? You do have to be cautious about, you know, what results you're getting back and the type of framing you're putting in with this.
And this, it looked like the sample size was the correct amount. But, again, you have to be cautious. And if you wanna run this in a batch setting, that's the next part of the post. What if it's not just one paper? It's like a hundred or so or such of these. So you could write a wrapper function like they do here to kinda automate this process of grabbing the text, extracting the data with that custom type object, you know, definition, and returning the result. Now the other risks that they talk about is that if the chat, you know, object or the chat service can't find the answer, it might try to make one up called hallucination and, and the common lingo. There are some safeguards you can put in your Elmer calls to try to suppress that in these type, type function calls where there's a parameter called required.
If you set that to false, typically speaking, if it doesn't find that particular set of data with that type you define, it should just leave it as a null. Not always though. They caution that they still might try to do something. So you've gotta, again, look at the results, in their experience when they set that required false, it was still hallucinating on some of these answers from these different types. So again, your mileage may vary, but what I'm learning here is that a, this this type, object definition is hugely important when you're in this context of scraping information from these PDFs or other type of structured text.
And in some of the reflections that wrap up the post, there are some things you wanna think about. These things aren't free, right? I was just telling Mike, I just ponied up to pay for the professional Claude account, so I don't wanna burn through my usage in one month because I'm doing a bunch of repeated calls as I'm prototyping things. So you may want to keep your testing a little minimal as you're iterating through this, maybe set the token limit lower or whatnot. And then dealing with structured or unstructured text can be problematic, especially in PDFs. That's why, if you have a resource in HTML, you're golden, right? Because that's structured in a markup language and easy to scrape. We don't always have that luxury, but if you do, they recommend grabbing the HTML first.
And like everything in these pipelines, the prompt is a huge part of this. It probably takes a few iterations to get right. But if you make it as domain specific as possible, and also maybe repeat it a couple times with some experiments, you can see the variation that occurs whenever you run the same prompt or tweak one or two sentences in it. I think that's also a good practice to have. But the potential is here, folks. If you've got data trapped in these PDFs or other textual documents, it looks like ellmer is trying to help you grab that data out and define how you wanna extract the different pieces with these type objects.
And lo and behold, you might have a great way to at least reduce, what, half your effort, maybe 80% of your effort, of trying to do this all yourself. I still remember at the day job there was a vendor we paid a lot of money to do this quote unquote curation of these PDFs all manually. It would take them months upon months to finish, and even then we'd get this big old dataset out in Excel format that we couldn't make heads or tails of. So will this replace everything? No. But I think this is a huge win for research out there, especially for the type of information that we're seeing in this highlight. So I learned something new with the ellmer package, as always. Mike, what do you think about all this?
[00:23:02] Mike Thomas:
Yeah. Absolutely. It's pretty crazy. And I think that this blog post and some of these highlights from the ellmer package actually taught me some things about what's going on in the, I guess, LLM ecosystem that I hadn't realized yet. One thing that we touched on is there's a function in here called, what is it, content_pdf_file() or content_pdf_url(). And I was thinking initially that that, you know, maybe extracted the text from the PDF, stored it in a vector database, and this was sort of a RAG type of thing. But it sounds like, from taking a look at some of the ellmer documentation, it's actually sending the PDF file to the LLM service itself and letting the LLM service do that. And it seems like some of these APIs now, at least, and probably the front end interfaces in Claude and ChatGPT and all of them, actually allow a PDF file as input and not just, you know, your prompt text, if you will. So I think that's what's going on here. Obviously, if you have, you know, sensitive information in a PDF at your company, you wanna make sure you check with somebody before you send that to one of these third party services, because you never know exactly what they're going to do with it.
And then this whole type object idea is really interesting too. I've got to imagine it's still sort of imperfect, but it's nice that some of these LLM services are starting to provide some of these, you know, maybe guardrails, or whatever you wanna call them, to better, hopefully, structure the result that you're going to get. So this type_object(), and type_number(), type_string(), all of these different object formats that we have the ability to specify within ellmer, it looks like these are fairly recent updates to the LLM services out there that allow us to do that.
Yeah, we have plenty of use cases where we want to do this exact thing. Right? You have some sort of large PDF, and you have a couple data points that your analysis team is looking to get out of those PDFs. And instead of having them control-F through the whole document, it would be great to just throw it at an LLM, get those answers right away, and stick it in an Excel workbook for them to analyze. But, you know, I think you still sort of run the risk of getting the wrong answer, depending on how high risk the situation is and how important the model's accuracy is. Right? And you still need to spot check all of those answers if it's a situation where, you know, the results of this LLM are going to be used for downstream decision making. But again, it seems like we're getting some improvements here to make things a little bit more accurate, you know, provide a little bit more guardrails to better ensure that what we want is what we're getting out of these models. So that's really, really interesting to me.
You know, pretty exciting stuff from what I can tell. And, again, that object paradigm is a recent LLM feature according to the vignette on structured data on the ellmer package pkgdown site: this structured data, a.k.a. structured output, is a recent LLM feature that we have the ability to leverage. And, again, I'm not sure if this, you know, goes for all of the LLMs out there or only some of them. But it definitely is interesting, because I think if it's not implemented in all of them, it will be soon. And these new features definitely help us as data scientists be able to ideally, you know, better serve our end users.
[00:27:01] Eric Nantz:
And I can see this also combining really nicely with what I mentioned earlier, the fact that ellmer is a front end to this tool calling paradigm, where we can let the R process on your local system help the chat model get to an answer that may be more dynamically dependent on, say, the current time, or the current situation, or current data that you don't want exposed to the LLM but want available to it in a local sense. Imagine having a set of PDFs or other structured or unstructured text that's giving additional context to a data analysis that you're doing, or a summarization of a research process. You've already got some data in house that has some of the information, but you wanna help it out with this other information: you combine this ability to get the PDF text out with some of the recent advancements that ellmer's bringing with tool calling. And then I'll also plug, I believe just this week, I saw a post on the Posit blog that a sister package to ellmer called gander by Simon Couch just hit CRAN as well, which will let ellmer be aware of the R environment and the objects that you have loaded in your session.
So I see lots of interesting ways this could all be melded together to help give additional power to the bots, if you will, the chat bots, to get these insights more quickly. And hopefully, if you're sifting through, like, hundreds of these documents or hundreds of these documentation pages, it may be more of a screening thing where you're just like, okay, I only need a certain subset of this for this decision. When you narrow down from that big fishing net to maybe a set of five or ten key inputs, that's where you really wanna validate things yourself, just getting to that final answer. But like you said, Mike, I agree with everything you said. This could aid in decision making; you just gotta be cautious about it.
But it's just interesting to see how you can blend all this together, which I'm still
[00:29:15] Mike Thomas:
trying to wrap my head around, but I think it's a really promising start here. I agree. Yeah, we have a pretty big project going on right now where we're doing something similar. We get these 200-, 300-page PDFs that we're trying to get 20 data points out of, and it's a situation where accuracy is really important. So we're actually building a Shiny app front end, and our AI engineer has leveraged some of these open weights models and put together some pretty cool, you know, system prompts, as well as, I think, just sort of the regular prompts that you throw at these LLMs. And what we're doing is we're returning all of the instances in the PDF where we found snippets of text that could match the particular answer the user is looking for, and then giving them radio buttons in the Shiny app to pick which one they think is the correct one, or manually override it in a text input. So we're pretty excited about it. That's our way to avoid, you know, all of the risk that comes with automating that process fully.
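(Not Mike's actual app, just a minimal Shiny sketch of the human-in-the-loop pattern he describes, assuming you already have a handful of candidate snippets returned by the LLM for a single data point; all names and snippets here are hypothetical.)

```r
library(shiny)

# Hypothetical candidate snippets an LLM pulled out of a long PDF for one field
candidates <- c(
  "n = 250 patients were enrolled in the study",
  "a sample of 250 adults completed follow-up",
  "250 respondents returned the questionnaire"
)

ui <- fluidPage(
  radioButtons("pick", "Which snippet answers the question?", choices = candidates),
  textInput("override", "Or enter the value manually"),
  verbatimTextOutput("answer")
)

server <- function(input, output, session) {
  # Manual override wins; otherwise take the selected snippet
  output$answer <- renderPrint({
    if (nzchar(input$override)) input$override else input$pick
  })
}

shinyApp(ui, server)
```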
[00:30:19] Eric Nantz:
Yeah. I'm super excited to see where that effort goes. And as always, I'm a novice in this space; I'm learning a bit too in my local prototyping, but there are lots of novel use cases here, and we've got the tooling to make it happen. And our last highlight today is showcasing some pioneering efforts from long ago, we're talking over a hundred and seventy years ago to be exact, on some novel uses of visualization to help with a very important health issue that occurred back in 1854, folks. Yeah.
I'm maybe an old timer, but I'm not that old. I kid, I kid. So this highlight here is actually a package that's just been released called SnowData. It is authored by Niamh Mimnagh (I know I didn't get that pronunciation right at all); she's a postdoctoral researcher at Maynooth University, and this package is exposing some of the datasets she curated as part of her exploration from earlier this year, in January to be exact, to understand how, with modern tooling, one could recreate this 1854 cholera outbreak map, visualized this time in R.
But to give you a very quick overview: in 1854, a very influential physician, John Snow, used a data driven approach to trace the source of what was becoming a very devastating cholera outbreak in London, and he was able to trace it back to a contaminated water pump. What he did was to actually start mapping out where the outbreaks were occurring and then trying to trace that back to where the water pumps were located in the city, using cartography and literally writing on notebooks. There are scans of this online that we can link in the show notes. All of that was basically to help the government figure out: hey, we found the source, we gotta stop this now to keep the outbreak from spreading even further.
So we invite you to check out the blog post that accompanies this, which makes a great companion to the package. The package, to be transparent, is just the datasets and not much else, so you're kind of left to your own devices to figure out how to put this all together. But the blog post literally walks through her effort to, at the time, build this all up herself, which you can now get from the package: the cholera cases dataset with the locations, the x and y coordinates, and the observation IDs, along with the water pump location information.
And in the blog post, she shows some clever use of the rasterVis package to take a TIFF object that is, again, available online through publicly available sources and render in your graphics device what definitely looks like the scanned copy of this map of the streets in London where these outbreaks were taking place. So it's a cool way to have a vintage looking representation, but annotated on top of the plot are dots showing the outbreaks. And then she was also able to change the size of the dots based on the prevalence of the outbreak in that particular region.
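(In lieu of her exact code, here's a simplified sketch of the idea using terra's base plotting rather than rasterVis; the file name, dataset names, and column names are assumptions for illustration, so check the SnowData documentation and her blog post for the real ones.)

```r
library(terra)
library(SnowData)

# Scanned 1854 Soho map as a georeferenced TIFF (placeholder file name)
snow_map <- rast("snow_map_1854.tif")
plot(snow_map, col = grey.colors(255), legend = FALSE)

# Hypothetical dataset/column names: cholera case locations sized by the
# number of cases recorded at each address, plus the water pump locations
points(cholera_cases$x, cholera_cases$y,
       pch = 16, col = "firebrick",
       cex = 0.5 + cholera_cases$count / 5)
points(pump_locations$x, pump_locations$y,
       pch = 17, col = "blue", cex = 1.2)
```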
And there's a lot more going on with respect to how the street angles were calculated to get the locations a little more precise. Really novel stuff that I wouldn't know heads or tails of how to do myself. So if you've seen some pioneering effort in spatial visualization, whether it's health related or otherwise, I think what we see in this highlight from Niamh is a great showcase of what you can do with R. It gets a little into the weeds, but SnowData combined with these techniques for mapping is a great way to recreate that vintage visualization that came from John Snow back in the day.
[00:34:59] Mike Thomas:
Yeah. Absolutely, Eric. This story was, I think, one of the most famous early uses of geospatial data analysis to solve what was really a public health crisis at the time, and this is such a nice little package surrounding that story. And I think the package could be a great utility for, like, an undergrad data science course mini project. If I'm thinking back to when I first learned R in stats 101, 201, whatever it was, we did not do any sort of fun, interesting projects like this tied to a real event. It was like rnorm(): let's take a look at the normal distribution, assume some parameters, change the parameters, and see what happens. I feel like you could use this as a real world use case that happened with this cholera outbreak, try to follow John Snow's steps, play a little detective, and find the location of the source of the outbreak. I think that would be fun for students to get their hands on a little R programming, a little geospatial data analysis, data vis type stuff with, you know, these, I think they're terra datasets, or something like that. So I think, yeah, a really interesting little package, a really nice use case to tell this really awesome story of John Snow saving the day.
[00:36:16] Eric Nantz:
Yeah. Excellent. Like I said, pioneering work on a major health crisis, and there are so many of these out in history. If we can get our hands on these original data sources, we should be able to digitize them and recreate these novel visualizations. And I admit, having a project like this sure would beat, you know, the infamous "you've got some red balls in an urn, figure out the probabilities" exercise. This is one I could sink my teeth into, even as a novice with spatial visualization. But there's a lot more you can sink your teeth into with this issue of R Weekly. We've got, of course, the link to the full issue itself, with its usual gamut of new packages, great tutorials, and blog posts.
We're running a bit short on time, so we won't do our additional finds. But, again, we have everything linked in the show notes if you wanna check out the full issue. Ryo, as always, does a fantastic job here. And, you know, it's all fantastic as well because of you and the community. So if you wanna help out the project, the best way to do that, if you find that great blog post or that great tutorial, is to send us a pull request with that link right at rweekly.org. There's a link in the upper right corner that gets you to the GitHub template for the pull request.
It's a nice, easy-to-use template to follow, all Markdown all the time, and we'd love to include whatever you find in the next issue. Also, we love hearing from you. As you saw from the episode number, next week's a big one: episode 200, folks. It's hard to believe we're getting there, but we are. So if you have a favorite memory you wanna share with us, we'd love to read it on the show. You can get in touch with us multiple ways. We have a contact page linked in the episode show notes, right there for you to send us a little web form. You can also get in touch with us on social media as well. I am on Bluesky these days at @rpodcast.bsky.social.
I'm also on Mastodon, where I'm @[email protected]. And I'm on LinkedIn; just search my name and you'll find me there. And, Mike, where can listeners find you?
[00:38:22] Mike Thomas:
You can find me on Bluesky at @mike-thomas.bsky.social, or you can find me on LinkedIn. If you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k, you can see what I'm up to lately.
[00:38:36] Eric Nantz:
Very good. And, yeah, I got a preview of what you're up to, and I am intrigued, to say the least. So we hope to hear more about that soon. But like I said, next week's a big one, episode 200, and we'll see what actually happens there. We're hoping you can join us for that next week as well. So until then, we'll close up shop for episode 199 of R Weekly Highlights, and we will indeed be back with episode 200 next week.