An illuminating set of tips for making the most of "fifty shades of grey" in your next monochrome visualisation, how the unique formatting features of the Scotland census data were tamed with the power of R, and how the first-ever native mobile application powered by R has opened the door to innovation across many parts of data science.
Episode Links
- This week's curator: Jon Carroll - @[email protected] (Mastodon) & @jonocarroll.fosstodon.org.ap.brid.gy (Bluesky) & @carroll_jono (X/Twitter)
- Designing monochrome data visualisations
- The life changing magic of tidying text files
- Rlinguo — Why Did We Build It?
- Entire issue available at rweekly.org/2025-W07
- ColorBrewer palettes https://colorbrewer2.org
- John's GitHub repository for tidying census files https://github.com/johnmackintosh/tidy-scotland-census
- Introducing Rlinguo, a native mobile app that runs R https://rtask.thinkr.fr/introducing-rlinguo-a-native-mobile-app-that-runs-r/
- mirai v2.0 - Continuous Innovation https://shikokuchuo.net/posts/25-mirai-v2/
- Key considerations for retiring/superseding an R package https://epiverse-trace.github.io/posts/superseding-bpmodels/
- Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
- R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
- A new way to think about value: https://value4value.info
- Get in touch with us on social media
- Eric Nantz: @[email protected] (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
- Mike Thomas: @[email protected] (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
- Chillout - Mega Man 2 - sedure - https://ocremix.org/remix/OCR01175
- Black Genesis - Final Fantasy VI Balance & Ruin - Brandon Strader, Rexy - https://ocremix.org/remix/OCR02796
[00:00:03]
Eric Nantz:
Hello, friends. We are back with episode 195 of the R Weekly Highlights podcast. This is the weekly show where we talk about the great highlights and additional resources that are shared every single week at rweekly.org. My name is Eric Nantz, and thank you for being patient with us. We are a couple days away from our usual release because, you know, as much as we would like to control our schedules, sometimes things just fall in our laps and certain groups just need our attention those days. So that happened to me on our usual recording day, but we're back here. And I say we, of course, because I am joined by my awesome cohost, Mike Thomas. Mike, how are you doing this fine morning? Doing well, Eric. Yeah. We're a little later in the week this week, and for once it's not me, but you're helping to balance it out. So I appreciate that. Yes. The yin and yang are starting to balance, or the force, if you will. So it'll happen, and it may even happen next week too. We'll figure it out. But nonetheless, we're happy to be back here on this day, and our issue this week has been curated by another one of our OG curators on the team, Jonathan Carroll.
As always, he had tremendous help from our fellow R Weekly team members and contributors like all of you around the world, with your pull requests and other suggestions. And we're gonna visit the visualization corner right away on this episode of R Weekly Highlights, and in particular a type of visualization technique that may seem retro to you, especially when you think about how you used to print paper documents back in the days of dot matrix printers, when not so many colors were available in those printers.
So this post is coming to us from Nicola Rennie, who is talking to us today about alternatives to the typical color palettes that you might be using in your visualizations in R. Now the first question is, why would you not want to take advantage of colors in your visualizations when you wanna share them with the world or share them by other means? Well, there may be cases where you don't really have a choice. One of those being that in the world of academia, or publishing in general, you may have requirements from the publisher that, nope, no color plots. Gotta be monochrome.
Black and white, basically, or, what monochrome technically means, different shades of a single color. Typically we do think of that as gray, the different shades of gray, right? And there are a lot of publications that require that right off the bat. Although there is another side benefit that you may not realize when you're going down this route: this could be a win for accessibility as well if you structure it the right way. And I think Nicola's advice here will get you along the way to do that.
Now you might be thinking, okay, now I have to do the monochrome plot. You know, there is an easy way to do this, right? You print the PDF and you choose either color or black and white, and that will just convert everything to that black and white spectrum. Well, in the first part of the post, Nicola talks about why that's probably not a great idea, because you're losing a lot of the nuance in these different colors, especially if they're closer together via what's called saturation. And she has a nice visual. Again, we're audio here, so we'll describe the visualization: a horizontal bar chart. In the legend of that visualization, she colors the bars by transmission type. But when you look at the black and white converted version of it, you really cannot tell the difference between at least two of the items in the legend, which of course is really bad if you can't figure out which color is going with which. So that means, yes, you're gonna have to code this up deliberately, like you would with any typical visualization.
So your next bet is to look at the different palettes available, and of course one of the mainstay packages for palettes is ColorBrewer, and by proxy the RColorBrewer palettes, and you might be looking at those. She links to the ColorBrewer website, where you can look at which palettes are what they call photocopy friendly, meaning that if you were gonna scan or literally photocopy a document that has a plot in it, and it was gonna be converted to black and white, which palettes are actually more amenable to that. So she shows, in the next example using the RColorBrewer package, a palette called Set3, where at least this time you're getting closer to the right direction, with legend colors that do look somewhat different. Although I'll be honest, the two grayish ones are so hard to tell apart on my screen, but it's a small step to get there.
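For those following along in the show notes, here's a minimal sketch of what swapping in a Brewer palette looks like in ggplot2. This is our illustration, not Nicola's exact code; the mpg example data and aesthetic choices are ours.

```r
# A minimal sketch (not Nicola's exact code) of applying the Brewer "Set3"
# palette in ggplot2; the mpg example data and aesthetics are illustrative.
library(ggplot2)
library(RColorBrewer)

# Peek at the palette's hex codes before committing to it
brewer.pal(n = 5, name = "Set3")

# Horizontal bar chart filled by a discrete variable
ggplot(mpg, aes(y = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
```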
But you may wanna take it a step further and really think about the fact that you're in this monochrome paradigm. Maybe there are palette types that are better suited for this world you're about to embark into. And taking a step back, what are the types of color palettes that we typically use? One is sequential, where you have more of a gradient, a gradual decrease or increase in the shade of the color. So in this case, think of going from a darker color to a lighter color. You also have diverging palettes, where the extremes of those scales are very different, but the middle is, you know, kind of in the middle, so to speak, kind of blending them together.
And then you have the discrete type of scale, where there's no real ordering to it; it's just different colors for each category. Now which one of these is better suited for the monochrome type of visualization? Well, when you look at the sequential palette, it's actually not too bad, because you can see the lighter color at the lower end of a continuous type of legend all the way to the darker color. It actually does translate pretty well. She's got an example in this case of using, I believe, the highway mileage on the scale, where the lighter color is fewer miles and the darker color is more miles on a dot plot. And at least you can see the darker dots compared to the lighter dots. Is it perfect?
Yeah, your results may vary there. The one that definitely does not translate well is the diverging palette, because when you think about those different extremes and the middle, how do you really convert that well? The direction might get lost in translation in those conversions. Another difficult one is the discrete palette, because you never know, based on the colors you chose, which ones are gonna translate well to that conversion. And in fact, in that little resource we just told you about, the ColorBrewer palette lookup, Nicola says there's really only one that's photocopy friendly. So you're kinda stuck if you're still searching, maybe trying to guess which ones are really best suited for it.
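As a companion to the sequential case described above, here's a hedged sketch of mapping a continuous variable straight onto shades of grey; the variables plotted are our choices, not necessarily the ones from the post.

```r
# A sequential, monochrome-friendly scale: map a continuous variable onto
# shades of grey (variable choices here are illustrative, not from the post).
library(ggplot2)

ggplot(mpg, aes(x = displ, y = cty, colour = hwy)) +
  geom_point(size = 3) +
  scale_colour_gradient(low = "grey85", high = "grey10", name = "Highway mpg") +
  theme_minimal()
```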
So she recommends maybe starting your design of the plot with the monochrome world in mind; you might then start to choose colors that are visually different enough when you do that monochrome conversion. In ggplot2 there is a scale grey family of functions that will give you that grey palette directly, converting from the default color palette. And she does another example where you can see the three different legend items, for transmission again in this case, going from almost black all the way to very light gray with a darker gray in the middle, and you can see those really pop out in that visualization.
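Here's a short sketch of the scale grey family Eric mentions, applied to a discrete variable; again, the example data is ours.

```r
# Sketch of the scale_*_grey() family: discrete categories mapped to evenly
# spaced greys; start/end control how light or dark the extremes are.
library(ggplot2)

ggplot(mpg, aes(y = class, fill = factor(cyl))) +
  geom_bar() +
  scale_fill_grey(start = 0.15, end = 0.85, name = "Cylinders") +
  theme_minimal()
```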
But again, it definitely takes attention to detail to make sure you're picking colors that are distinguishable enough for that conversion to really play well there. You also have to keep in mind the type of chart you're making. This works really well for a bar chart where, obviously, the bars themselves take up a lot of area in that overall canvas. Whereas for the dot plots, obviously, your eyes have to squint a little bit to look at these dots, and it may not be so obvious to see those distinct shades even if they look really good in the bar chart. So you've got to think about the type of plot you're doing; that should inform you, alongside just having the monochrome framework in mind, which type of colors you want to use for that conversion.
But one thing to open your eyes about is that you don't always have to stick with just colors. When you're in the monochrome mindset, you can take advantage of other features in the visualization to really make your visualizations pop. And I think, Mike, she has some really great advice on the different types of patterns you can use in these visualizations.
[00:09:56] Mike Thomas:
Right? She does. And this took me back a little bit to R 101, probably using base plotting. And I don't know if this is the case for everyone, or for a lot of maybe the older folks listening, but I feel like my introductory R plotting knowledge, or classes, were filled with using shapes as points. And I remember that quite fondly, and I have, honestly, Eric, forgotten that that's even possible. I don't think I've used a shape, and this is terrible, I'm so sorry, Nicola, I don't think I've used a shape in a plot in a long, long time in R. But it's useful to remember that that's even something that's possible to do.
And Nicola mentions two different packages here, and I think that they can be really handy, especially as we think about the monochrome world: the ggpattern package and the fillpattern package. I think the ggpattern package allows you to leverage, like, texture and shapes within your ggplots specifically. And again, this is audio, so it's hard to do. But it looks like, in a bar chart example, you can have one section of the bar with sort of polka dots in it, another section of the bar, representing a different class in a discrete variable, with diagonal lines through it, and another one that looks like plaid. So it's an interesting approach to getting away from leveraging color to represent different classes in a discrete variable and instead representing them with patterns. And it's, I don't know, really creative to me to see, and really interesting. And honestly, it does make it very easy to discern the different categories.
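A rough sketch of the ggpattern idea Mike describes is below; the pattern names and the scale function reflect our understanding of the package's API rather than the blog post's code, so double-check the ggpattern documentation before relying on it.

```r
# Distinguish discrete groups by texture instead of colour (assumed ggpattern API).
library(ggplot2)
library(ggpattern)

ggplot(mpg, aes(y = class)) +
  geom_bar_pattern(
    aes(pattern = drv),
    fill = "white", colour = "black", pattern_fill = "black"
  ) +
  scale_pattern_manual(values = c("4" = "stripe", "f" = "circle", "r" = "crosshatch")) +
  theme_minimal()
```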
I think maybe for folks who aren't traditionally used to living in the monochrome world, there's just one thing you may have to watch out for. If I was giving this as a deliverable to a client or something like that, I think that they might be distracted a little bit by the patterns, just because it does look a little bit old school. But I think if you work with them to communicate the reasons why you did so, and they understand that, you know, the deliverable should be monochrome for reasons x, y, and z, I think that this is a fantastic approach to consider as well, just to make it really easy for the eye to be able to tease apart the different categories that are at play here.
And I think the fillpattern package allows you to do the same thing, but it will work with base R graphics, as opposed to the ggpattern package, which is very specific to ggplot2. You know, one of the other considerations that Nicola sort of has as a theme throughout this blog post is that as you work with monochrome palettes, it may come to light that the plot you're using, whether it be a scatter plot or a bar chart or whatnot, may actually not be the best way to represent that data in the monochrome world, and you might want to consider switching to a different type of plot. I think in the diverging palette example, right, where it's very difficult to represent a diverging gradient in a monochrome world, she recommended potentially switching to a lollipop chart so you can measure the magnitude of change in a different way, one that would allow a monochrome legend to make a whole lot more sense to the end user. Another gotcha that she mentions to watch out for is the case of missing data. And she uses the example of choropleth maps, where, Eric, I'm sure you've seen this before, perhaps with states in the US: if we have a choropleth map where each state sort of has its own color, sometimes if a state doesn't report data, we'll show that state as gray as opposed to all of the other states having a particular color. And if you're trying to represent missing data, obviously, in the monochrome world, gray is not going to work, right? Because your color scales are all different shades of gray. And I think that's another case where you're going to have to think creatively about different ways to showcase the data, or make the point you're trying to make, maybe with a different chart where it makes more sense to leverage your monochrome color palette in a more effective way for others. So a lot of fantastic tips here, and some links at the end of the blog post as well. And I appreciate Nicola not just walking through the examples of how to apply this, but also really taking the time to critically think about all the considerations you have to take into account when you leverage these types of approaches, really thinking about it holistically.
[00:15:02] Eric Nantz:
Yeah. And, as always, I like to, or maybe don't like to, date myself so much on this very podcast. The last time I did patterns like we're seeing in the bar chart example here was for my dissertation way back in 2008, folks, because we thought, well, okay, it's gonna be published at some point. Spoiler: I never did. But, nonetheless, for the requirements of grad school, we did have to do monochrome prints. And I remember for those bars, I was like, oh, no, how the heck am I gonna distinguish these different, you know, disease types in the competing risk output? Oh, patterns.
So, yeah, I got the retro vibes when I looked at the example Nicola put together here. And overall, yeah, excellent advice here. You go through the full spectrum of what you might try first all the way to, like, the real principles that she feels we should all have in mind when we're in this, you know, tunnel vision approach to monochrome. Other techniques that she mentions can help too: if you have the ability to facet your plots, that can help pop out the differences even further in the case of, say, the dot plots we were seeing in the examples. And annotations as well: if you've got room to put labels above, like, the bars to distinguish those different categories, go for it. Obviously, you have to balance how busy it's gonna look versus how presentable it is, you know, without it. But in the end, you've got ways to account for these potential ambiguities.
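Here's a quick sketch of the faceting and annotation ideas just mentioned: split the groups into panels so the grey shades carry less of the message, and label the bars directly. The data and variables are illustrative.

```r
# Faceting plus direct labels as alternatives to relying on shade alone.
library(ggplot2)
library(dplyr)

mpg |>
  count(class, drv) |>
  ggplot(aes(x = n, y = class)) +
  geom_col(fill = "grey40") +
  geom_text(aes(label = n), hjust = -0.2, size = 3) +
  facet_wrap(vars(drv)) +
  theme_minimal()
```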
And she does link to additional resources that she used to build this blog post, so lots of great reading if you find yourself in this space. But I think, overall, it's another great testament to the power that we have in visualizations in R to think from multiple perspectives and also think about accessibility at the same time. So a really valuable post that's going in my visualization bookmarks to take a look at in the future, because you never know if I'll ever get back into the publishing world again. Who the heck knows what that publisher is gonna require from us?
[00:17:11] Mike Thomas:
Totally. And I feel like Nicola is always covering almost forgotten topics like this. So huge props for tackling this one.
[00:17:34] Eric Nantz:
So as we were just talking about, things can get a little messy when you're dealing with certain types of visualization palettes unless you have the right frame of mind. Well, you know what else can be messy? Data. Because in the real world, we don't get those great textbook examples, right, where everything's neatly tidied, with, like, 20 observations, 20 rows, no missing values, and everything's rectangular and annotated correctly. No, no, no. In the real world, we have to deal with a lot of interesting, sometimes downright confusing formats of how these data are populated.
Our next highlight is taking us through a journey of how we can do this processing applied to real-world data, census data from Scotland, to be exact. This post is authored by John MacKintosh. John has been around the R community, I think, for many, many years; I've seen his name quite a bit. He is currently a data specialist and package developer at NHS Highland. And the motivation for this post is that he and his team were working with Scotland's census data from 2022.
And they do, to their credit, give you a way to download the data, packaged up as CSV files in a large zip archive, so you can grab that. He says that once you, you know, extract that out, you've got 71 files with around 46,000 rows and a highly variable number of columns. So already these files may have some inconsistencies off the bat. But wait, there's more in terms of how these are formatted. So buckle up, because this may give you some flashbacks to any messy preprocessing days you've had in your data science journey. First, there are three rows at the top of these CSVs that have what you might call metadata about the data files, but they're useless for your analysis. So, a, you don't want them to be imported; you wanna ignore those.
But then at the end of the files, there are about eight or so rows that contain additional text or metadata that could also be discarded. So your data is kind of sandwiched between these messy rows. But then, once you wipe those out, the main files may have headers in multiple places. Yikes. So those have to be combined somehow, because they might be telling you different information about the columns. And then, wait for it, there may be different types of delimiters being used in those first few rows. Oh, goodness. I'm already getting triggered just reading this. But as John says, if you only had a couple of files here, sure, you could manually account for that. But when you're talking about the volume of files in this download, of course you're gonna have to take a programmatic approach and cross your fingers that you're able to figure this out.
So he walks us through his journey of trying to import these. His first step is something I would have tried as well: two reads or imports of the data per file. You first read it in as, like, a temp data frame or a temp file, and then he goes really low level on the second read, using the scan function to try and isolate the data area between those junky rows at the top and the bottom. That way he can track where the actual data begins and then try to figure out where the actual header is in that chunk.
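A simplified sketch of that two-pass idea is below. This is not John's actual code: the file name and the "Area" header marker are made up for illustration.

```r
# Peek at the raw lines first, work out where the real data starts and ends,
# then import only that slice. Markers and file name are hypothetical.
raw_lines <- readLines("census_table.csv")

header_row <- grep("Area", raw_lines)[1]          # first line that looks like a header
blanks     <- which(raw_lines == "")
footer_row <- blanks[blanks > header_row][1]      # footer junk assumed to start after a blank line

dat <- read.csv(text = paste(raw_lines[header_row:(footer_row - 1)], collapse = "\n"))
```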
He leveraged the vroom package, v-r-o-o-m, which I think was authored by Jim Hester years ago and gives you great performance in importing textual data files, because it does have a parameter for skipping certain rows and for suppressing the need to get column names. Unfortunately, on top of those parameters, he couldn't figure out a good way to know how many rows to skip without doing that prior scan. So he was hoping vroom could do it all. Not quite.
But then he went back to data.table, which, again, has been highlighted quite strongly in previous episodes and comes with a function called fread, which already gets a lot of praise from the community for being a highly performant way to import large textual data files like CSVs. And sure enough, you can set header to FALSE, and data.table was, in his opinion, intelligent enough to exclude those junk rows from the beginning, but he was still left with the multiple header issue. Then he's got some code snippets where he tries to figure out that actual rectangular data area in between the junk rows, with a little bit of grep magic inside data.table calls. And then, to account for the number of header rows, he makes use of the tail function to strip out those extra rows.
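Here's a hedged sketch of that data.table route; the regex for spotting area codes and the "eight junk rows" are placeholders based on the post's description, not John's exact logic.

```r
# fread() everything with header = FALSE, locate the rectangular data block,
# and drop the trailing junk. Patterns and counts are illustrative.
library(data.table)

dt <- fread("census_table.csv", header = FALSE, fill = TRUE)

first_data_row <- which(grepl("^S\\d+", dt$V1))[1]   # e.g. rows whose first column is an area code
body <- dt[first_data_row:(.N - 8)]                  # keep the data block, drop ~8 footer rows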
Then, to combine the header rows together, he builds an intelligent vector that consolidates their information and can be used as the column names. But then it gets even more interesting. He has to pivot the data from wide format to long format, grab the values themselves, and make sure that any messy values, say hyphens or other weird characters, are stripped out so the result can become numeric, because I believe it was mostly numeric data coming from this census.
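Continuing the sketch above, this is roughly what the header consolidation and reshape could look like; the header row positions and the characters stripped out are assumptions for illustration, not John's exact approach.

```r
# Paste the header rows into one set of column names, melt wide to long,
# and coerce the messy values to numeric.
library(data.table)

hdr <- dt[(first_data_row - 2):(first_data_row - 1)]   # assumed: header rows sit just above the data
h1  <- unlist(hdr[1], use.names = FALSE)
h2  <- unlist(hdr[2], use.names = FALSE)
setnames(body, c("area", paste(h1[-1], h2[-1], sep = "_")))

long <- melt(body, id.vars = "area", variable.name = "measure", value.name = "value")
long[, value := as.numeric(gsub("[-,]", "", value))]    # strip hyphens/commas, then coerce
```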
And then he again makes great use of data.table's set functions; other snippets of code are in this post. And, yes, there is a GitHub repo that has all of this assembled, which I ended up reading before the recording because I couldn't believe just how complex this was. Like, this is not for the faint of heart, yet data.table was quite valuable for doing it. But he wanted to take this even further and optimize the format away from CSV once all the messy stuff was accounted for. And this is gonna make my data-processing mind light up here: he converts it to Parquet files and throws them into DuckDB, because why not? It's a perfect use case for something like that, so you can take advantage of a more efficient file format and be able to process just what you need.
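For reference, here's a sketch of that final optimisation step: write the cleaned table to Parquet and query it through DuckDB. File paths and the query are placeholders.

```r
# Write the cleaned table to Parquet, then query it directly with DuckDB.
library(arrow)
library(DBI)
library(duckdb)

write_parquet(long, "census_clean.parquet")

con <- dbConnect(duckdb())
dbGetQuery(con, "
  SELECT measure, SUM(value) AS total
  FROM 'census_clean.parquet'
  GROUP BY measure
  ORDER BY total DESC
")
dbDisconnect(con, shutdown = TRUE)
```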
So there are lots of benefits to that approach, but in the end, he wraps the post up with some practical tips if you find yourself in this situation. One of which is: don't do what I just did when I first read these data files, dreading any code I'd have to write. Take it step by step, folks. It may seem insurmountable when you look at one of these as a whole, but if you take the approach of, okay, how do I deal with these headers? How do I strip out the junk at the end and then figure out the data area? Then you're kinda breaking it up into components, right? You can't boil the ocean all at once, as I say. And that leads into the second point.
You may think you can just throw all this into a map call that processes each data file in an iterative fashion, but it may not all work the first time. So, again, really isolate a few different use cases and then scale up after that. Speaking of purrr itself, an underrated function that is great in these situations, when you can't fully count on reliability, is the safely function. You can use safely to wrap the utility function that does your processing or importing. That way it doesn't, like, crash the rest of the script; you can then parse after the fact what the error was, or just ignore that stuff altogether if you have other means to account for it.
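Here's a sketch of that safely pattern; import_census() is a hypothetical stand-in for whatever per-file function you've written, and the directory path is made up.

```r
# Wrap your per-file importer so one bad file doesn't kill the whole loop,
# then inspect the failures afterwards.
library(purrr)

safe_import <- safely(import_census)   # import_census() is a hypothetical helper

files   <- list.files("extracted_csvs", pattern = "\\.csv$", full.names = TRUE)
results <- map(files, safe_import)

ok     <- compact(map(results, "result"))   # the successful imports
errors <- compact(map(results, "error"))    # anything that blew up, with messages
```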
And then also, the base R installation itself has a lot of great string functions. I admit I grab stringr and stringi every time I do string processing. But if you're really thinking about performance, or minimizing your dependency footprint, it may be worth the investment in time to learn about the grep and gsub calls that come with base R. Yeah, the syntax can be a little hard to get at first, but if you practice enough, I think you'll get the hang of it. And, again, John's got a great repository set up where you can see just how he uses those base string functions in his code.
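A few base R string calls of the kind being described, on a made-up messy vector:

```r
# Base R string handling, no stringr required; x is invented example data.
x <- c("1,234", "  567 ", "-", "n/a", "89")

grepl("^[0-9, ]+$", trimws(x))        # which entries look numeric?
gsub(",", "", x)                      # strip thousands separators
as.numeric(gsub("[^0-9.]", "", x))    # keep digits/decimal point; the rest become NA
sub("\\s+$", "", x)                   # trim trailing whitespace only
```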
And, apparently, he says, after writing this, that there is a tool, I believe it comes with Excel or other Microsoft products, called Power Query. I've never heard of this before, but apparently it can help with these messy imports. Do you know anything about this, Mike? You're lucky if you've never heard of Power Query, Eric. Okay, great, I'll keep it that way. The day job has not forced me to use it, and I don't plan to. But, nonetheless, he says that could have made things easier. I admit, I don't know if I wanna rely on a proprietary tool to make that easier. I think if you can script it out, future you and future reproducibility will thank you for it, but it's good to know that there are alternatives in this space. All in all, a good reminder that, a, the real world is never as perfect as the textbooks would like you to believe with these data formats.
And, b, you can really combine a lot of great functionality from data.table with some of the base R string processing functions. And if you take it step by step, you can get to where you need to be to take advantage of the fancy stuff, like DuckDB and Parquet, down the road. So a very enlightening post, and, hopefully, I never have to encounter anything as messy as what these census data files presented to John here.
[00:28:16] Mike Thomas:
No, it's a big effort to undertake something like this. I know because I've done it before, and I'm sure a lot of folks listening have done the same thing as well, because the way that data is published sometimes, publicly available data specifically, can be crazy. And this is not an exception. I'm sure that all of the folks interested in working with the Scotland census data will hopefully see this blog post and catch wind of the location where John has landed the data in DuckDB, all clean and easy to pick up. What a great use case. And, yeah, unfortunately, it seems like this particular project had a little bit of everything in it in terms of what he was up against. I really appreciate the top tips there, like the use of purrr's safely to make sure that, if you are looping through something, your loop can continue or take an action if it runs into an error.
I'll be honest, I'm fairly guilty of using purrr's safely inside, like, an if statement, as opposed to using probably the more appropriate tryCatch type of function. But it's just so easy, unfortunately. I was actually doing this earlier this week, doing a little data archiving, data rescue, for some US government program data for a client, data we weren't sure was gonna continue to be around or not. I was downloading it and sticking it in an S3 bucket, and some of the datasets that the data dictionary said were available were actually not there at all when you went to the URL. So purrr's safely, you know, saved me quite a bit and allowed me to loop through everything without having to change my iterator, if you will.
And, you know, the last tip there, that base string functions are very useful and overlooked, I couldn't agree with more. I think, for those of us that have to clean up messy data, string manipulation, like grep and regular expressions and what we get from the stringr and stringi packages, is an absolute lifesaver. Although I can't say I use stringi too much; I think a lot of that functionality has been mapped into stringr. Sometimes it's tricky to get that regular expression pattern just right.
I'll be honest, ChatGPT has helped me expedite that process quite a bit. So that would be my tip if you're struggling with regular expressions at all: try to look there first, because it might take care of everything you need without you having to figure out wildcards and placeholders and, you know, lengths of characters and all sorts of different crazy stuff like that that happens in regular expressions, where the syntax looks really weird but the power is absolutely incredible. And when it all works, it is so satisfying. So, a big thanks to John for his efforts here and for documenting them on this particular project, and a great blog post, because I think it's something we can all relate to.
[00:31:26] Eric Nantz:
Yeah, that reminds me that my very first use ever of ChatGPT was indeed for regex help, because I was like, I could do the whole Stack Overflow thing, but wait a minute, all these people are talking about that. Let's give it a shot. And, yes, in that case, it worked immensely well. But, yeah, I think in the current climate, you may find yourself in a situation where you have to grab this data sooner rather than later from these sources, so it never hurts to have these techniques available to you. Now, getting back to some of the work I'm doing at the day job, we're working with a vendor who is giving us CSV files of certain event-type data.
We've given them the requirements for how we want this data to be formatted. Guess what? They don't always follow that, so we've had to build an internal package to account for those things. But we are hoping that they give us API access to the raw data so that we can have a more consistent, you might say reliable, pattern for how the data is represented, because I feel more comfortable handling some JSON coming back from an API for certain datasets than a cryptic CSV that may or may not have a header, and may or may not even have the right columns spelled correctly. Yes, this happens when you have manual effort from people copying from one system to a CSV that goes through some stupid web portal, and then we have to be the ones who consume it. I'm not bitter at all, but it happens, folks. So if you get the chance to leverage an API, take advantage of it if you can.
[00:33:03] Mike Thomas:
Absolutely.
[00:33:15] Eric Nantz:
Well, in our last highlight here, we are gonna call back to an initiative and a huge development in the world of Shiny that has both Mike and me absolutely giddy about what's possible in this new world we find ourselves in, taking advantage of WebAssembly with R and Shiny itself. This last post comes to us from a fellow curator of R Weekly, Colin Fay, who, of course, is a brilliant developer and data scientist, the author of one of our favorite R packages in the entire world, golem, as well as other great innovations in the world of Shiny. He wrote a post on the ThinkR blog about the mobile app that they released late last year or early this year, called Rlinguo, which took the Shiny community by storm in a great way, and this post takes a step back to ask why they actually did this. To give a recap, check out the back catalog for when we talked about this great effort in detail, but in a very quick recap here, Rlinguo is an actual app that you can install on your mobile device via the Play Store, or the Apple App Store for iOS.
And you can, in essence, take a little quiz about your knowledge of R in this very responsive, nicely themed, installable application that runs completely self-contained on your mobile device and wraps R under the hood via webR and WebAssembly. I am super excited about this. Those of you who have listened to this show for a bit know that I've been on a journey with WebAssembly and some very important external collaboration. So anytime I get to see WebAssembly in the wild, I am all for it. But what Colin talks about is, again, the big picture here.
Why is this so important to the community at large, who want to take advantage of R on a mobile device? The big takeaway here is that having that capability to be mobile and have the power of R at your disposal is a huge benefit across many different situations. So he walks through a few of these, and in each case I can relate to different aspects of it. One of which: we were talking about data earlier. What if you're in the field? What if you're in the trenches grabbing this data and you need a way to record it on the spot? Maybe you're on location somewhere.
Who knows? You may be on a mountain somewhere. You may be in the rainforest. Who knows? But you probably will not have reliable Wi-Fi or Internet access in these remote locations, right? Yet having a self-contained app on your device that can help you track that data, maybe leveraging R to do some processing or some other storage of it, is absolutely a massive benefit, because, again, you can run this in a completely offline kind of mode. You improve your efficiency, you bring the data closer to your actual end product. Really, really helpful.
Number two, a great way to learn, again, in an offline fashion. Think of when you and I were in school, Mike. Wouldn't it have been great if we'd had this technology back then? Instead, you know, we had to read textbooks, right? We had to take notes and hope that we could run things on maybe an old Windows installation or something like that, and hope that everything just worked. Or, in the case of my grad school, SSH to a server without knowing what the heck R was at the time. But imagine having this on a mobile device, where you can learn about a key concept, maybe the central limit theorem or maybe some other very important statistical concept.
And you can learn this wherever you are and explore it, but, again, without having to be at your computer or have a textbook open to do it. So it can be an interactive learning device, which, again, is very similar to what they did with this quiz app that they worked on. It was completely interactive, but completely offline as well; it had everything self-contained. A really, really novel use case. I think education is already taking advantage of WebAssembly; there are already a lot of resources. We speak highly of the quarto-live extension, where you can embed WebAssembly-powered apps into a Quarto document.
George Stagg and the Quarto team are doing immense work in this space, with a reimagination of the learnr package as a new way to leverage Quarto. Lots of great potential here in the mobile space as well with webR and WebAssembly. Then, when you think about other industries where you need real-time feedback really quickly, Colin calls out a couple of other use cases. One of which is having real-time quality control when you're in the manufacturing space. Maybe you need to inspect something that's going through a production line, enter some feedback, check that everything's working. You can look at quality control metrics on the spot based on some inputs you give it, and very quickly detect that something's going wrong, because those checks are hugely important in manufacturing when things are being produced at scale.
And then, rounding out the post, are other considerations such as managing your supply chain and making sure you're optimizing it based on the logistics you're dealing with. Maybe you have to run some models on the spot based on certain parameters, and that might change the way you send products out, or how you distribute to warehouses or whatnot. That could be quite helpful. And going back to being in the field, so to speak, in the healthcare industry: maybe you're helping out with a huge unmet medical need in, like, a clinic somewhere remote in Africa or some other region, and you're gonna need to input patient characteristics and get some kind of score out to determine the next treatment plan. So in the world of diagnostic-type evaluation, I see this playing a huge role.
Who knows how long that will take, because, obviously, I come from life sciences, and things don't exactly move at a breakneck pace. But I do see this as a potential for those in the field to take advantage of R to help with some of that evaluation and some of that processing. So you now have at your disposal a way to run R on your mobile device in all sorts of different use cases. I don't think Colin even begins to cover all the different possibilities here. But what ThinkR has proven is that, with the right initiative and with the tools available, you can enter this space and probably cause some really innovative breakthroughs with R at the back end. Am I here for it? Oh, abso-freaking-lutely I am. This is awesome.
[00:40:45] Mike Thomas:
It is awesome. It feels like this Rlinguo project has a lot of gasoline on it, and I think as soon as somebody drops a match, this is going to catch fire throughout the whole R ecosystem. I think it's just new at this point, and as it continues to gain more exposure, I see it really exponentially taking off pretty quickly. It's an absolutely incredible breakthrough. As you were talking about back when we were in school, having to read textbooks and things like that, I know, and I don't think it's a big secret here, but at least in the US, a lot of teachers in high school, and probably college as well, struggle with kids being on their phone all the time, right, and being distracted by that. So if you can actually bring the learning to the phone, then they can sit there with their phones and learn at the same time, and you're, I guess, putting the education in the place where most of their attention is. So maybe this is practically helpful in that respect. I think that's pretty interesting. I also know that in more professional contexts, in manufacturing settings and health care, as you were saying, there are actually companies out there that make these almost cell-phone-looking devices to monitor sensor data, little handheld data analysis tools that are either smaller versions of an iPad or customized pieces of hardware.
And I think we might just be able to take the functionality of those pieces of hardware and leverage, you know, your smartphone, by using this Rlinguo approach to visualize the data. You know, we have the power of WebAssembly here; we have the power of DuckDB Wasm. And I think all of that number crunching, as it gets better and better, and it already is fantastic, can probably perform just as well on a mobile device, a smartphone, if you will, as opposed to some of these customized pieces of technology that I think are probably getting a little obsolete and outdated compared to what we have right now. And you don't need to find somebody who's a Swift developer to do it. This is my, like, favorite thing of all time, because we have clients who ask, hey, could this be on a mobile app? And I was like, yeah, well, we're gonna have to go hire somebody who knows how to write iOS apps. I think the language is called Swift, if I'm not mistaken.
Yep. And, oh, and then Android too. Who knows what language that is? Good luck with that SDK, folks, I'm just saying. Exactly. So if we can avoid that, it's absolutely incredible, and we can use a programming language that's a little more literate, like R, to do our work. I think it's gonna be absolutely incredible. So a lot of potential benefits here. I see this as being a huge breakthrough, and it's really only a matter of time before it absolutely takes off.
[00:43:51] Eric Nantz:
Yeah. Again, when I installed Rlinguo, even in its preview mode when Colin gave me the heads-up that this was coming out, my first takeaway was that I could not tell this was an R thing, right? And not just the look of it, which, again, we have great tools in Shiny in general to make things not look like a Shiny app, but the performance: there was no lag whatsoever. It literally looked and performed like it was built by a professional vendor that's been doing mobile development for who knows how many years. Who knows? It could have been in Swift or, you know, Objective-C, or whatever languages are being used in that space, or Java itself in the SDK. So this just enables so many possibilities. I'm still wrapping my head around it. I mean, I'm taking baby steps with WebAssembly. The fact that we were able to submit a WebAssembly-powered Shiny app to the FDA is still mind-boggling to me, that we pulled that off, but that's just the tip of the iceberg for where this can take us in the world of life sciences, but also, as we talk about here, many, many different industries too.
The one thing I just wanna make sure people realize when they're exploring WebAssembly for the first time, especially with the power of webR and shinylive, and again, this is more for awareness, I'm not trying to be a Debbie Downer here: if you do interact with online resources via an API call, that's probably not the best idea, because in theory, when you bring all this stuff down, all that material for authentication is in the manifest, if you will, the bundle of code that gets downloaded to the device. So that's just something to consider if you're gonna build something like this that does, in fact, interact with an API layer. But like I said, with Colin's use cases here, being built in a self-contained way with no reliance on online access to actually power the machinery behind it, you're good to go. I just wanted to make sure I throw that out there, because I often get people asking me, why isn't that thing I'm building in WebAssembly?
I would love it to be, but if I have to interact with an online service, yeah, that's one little gotcha. But maybe that gets better someday. Who knows?
[00:46:05] Mike Thomas:
Yeah. It's a great consideration.
[00:46:07] Eric Nantz:
But all in all, still very inspiring. And also, to be honest, you should be inspired by almost everything in this issue. Jon has curated an excellent issue here; he always does a fantastic job. So we'll take a minute, or a couple minutes, I should say, for our additional finds. And speaking of huge improvements in the ecosystem, I do have a soft spot, because of my day job, my research areas, and my career, for high performance computing with R, and one of the very solid foundations in that space has come to light in the last few years.
The mirai package, authored by Charlie Gao, has had a substantial update in the recent version 2.0 upgrade. This is a huge release for Charlie and his team of developers, and he has a great blog post about it. I'll mention a couple of the great enhancements here, especially if you find yourself leveraging cloud resources as your high performance computing backend, things like AWS Batch or other resources. He's got a much easier way to launch all those background processes via a more clever use of SSH under the hood. And, by proxy, the protocol for bringing the connection information back and forth has changed from the WebSocket layer to TCP, which is basically a lower-level way to communicate with these online resources. It's faster and, in his words, even more reliable.
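For anyone new to the package, here's a minimal local sketch of mirai's core API, not the new v2.0 remote/SSH launchers; the model-fitting task inside is just an invented placeholder.

```r
# Spin up background daemons, fire off an async task, collect the result.
library(mirai)

daemons(4)                                  # four local background processes

m <- mirai(
  {
    Sys.sleep(2)                            # stand-in for a long-running model fit
    summary(lm(y ~ x, data = dat))$r.squared
  },
  dat = data.frame(x = rnorm(100), y = rnorm(100))
)

while (unresolved(m)) Sys.sleep(0.1)        # or call_mirai(m) to block until done
m$data

daemons(0)                                  # tear the daemons back down
```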
And the other enhancement that caught my eye, since we are leveraging this within a Shiny app: with what Charlie has done in collaboration with Joe Cheng to bring mirai integration to the new ExtendedTask paradigm in Shiny itself, you now have a much more elegant way to cancel that workflow in a Shiny context if you decide, oh, wait a minute, that's not what I wanted to run, get me out of here. There's a great way to implement that cancel functionality, and he's linked to, I believe, a vignette article that talks about that integration in great detail, along with some great interaction with the purrr package as well, because purrr, in its development version, is providing more parallel mapping capabilities, and now mirai can be a great back end to power that, not just, say, the future package like before. So, great updates in version 2.0 of mirai, and I'm super excited for what Charlie has in store for future releases. It's a wonderful package. Yes. Absolutely. All that
[00:48:52] Mike Thomas:
asynchronous possibility, I think, is really incredible in terms of what we're able to push the envelope on. So my takeaway here is a blog post on the Epiverse site authored by James Mba Azam, Hugo Gruson, and Sebastian Funk. The title is Key Considerations for Retiring or Superseding an R Package. And I believe Epiverse has a suite of different R packages that they help maintain. I also saw a recent post by Hadley Wickham, maybe on Bluesky or somewhere like that, where he is authoring a sort of history of the Tidyverse. And I'm sure part of that history will include things like ggplot versus ggplot2, maybe the reshape2 package moving to dplyr and tidyr, and, you know, the life cycles that these packages take when you're hoping that users move to the new version, but you don't want to fully take away the old version, even though it's pretty legacy and not maintained, because perhaps folks have legacy workflows that leverage your old versions of packages. So there are a lot of decision points you need to make in order to try to accommodate as many people as possible, and I think this blog post is a really nice reflection on that, which is why I wanted to highlight it.
[00:50:21] Eric Nantz:
Yeah, this is a very, very important post, especially if you find yourself developing a lot of packages and you learn so much along the way that you wanna make sure, a, you have a way to take advantage of what you learned, but also not forget those who were early adopters of your previous packages. I remember my colleague, Will Landau, was wrestling through a lot of this too as he transitioned from his drake package over to targets, because targets, in his opinion, contains a lot of what he learned in the process of developing drake and what he made better for the targets ecosystem. So I know he's thought about a lot of these principles too. Highly recommended reading, and I'll be watching Hadley's Bluesky posts on that too, and hopefully he comes out with an interesting article, or whatever knowledge gets shared there, alongside this great post from the Epiverse team. A really, really solid find. And, again, there are lots of solid finds in the rest of the issue too, so we invite you to check it out. It is linked directly in the show notes as always.
And, also, you can find the full gamut of updated packages, new blog posts, new tutorials, upcoming events, and whatnot. So lots of great things to sink your reading chops into. And we love hearing from you, as far as helping us with the project as a whole. R Weekly is a community project through and through, no corporate sponsor overlords here. We are just driven by the efforts of the team, and even yours truly will have an issue to curate next week. And I'm building a completely over-engineered Shiny app to manage our scheduling paradigm that I hope to talk about in the near future, but I'm doing it all with late-night hacking, if you will, because it's not my day job, folks. But where we could use your help is finding those great resources. If you've seen one online, whether you authored it or found someone else who did, we're just a pull request away, because everything's open on GitHub, folks. Just head to rweekly.org.
Open the pull request tab in the upper right corner. You'll be taken to the template, which you can simply fill out, and our curator of the week, which, if you do this now, will be yours truly, will be glad to get that resource into the next issue. But we also love hearing from you online as well. We are on the social medias. I am on Bluesky at @rpodcast.bsky.social, also on Mastodon at @[email protected], and I'm on LinkedIn, causing all sorts of fun stuff there; search my name and you'll find me. Mike, where can they find you? Primarily on Bluesky these days, in terms of social media,
[00:53:03] Mike Thomas:
outside of LinkedIn, at @mike-thomas.bsky.social, or on LinkedIn if you search Ketchbrook Analytics, k-e-t-c-h-b-r-o-o-k. You can find out what we are up to, and we are still on the hunt for a DevOps engineer, for anybody interested out there who knows a little bit of Terraform, Docker, Kubernetes, and Azure. That's the stack.
[00:53:28] Eric Nantz:
Yeah, I know there are a lot of great people out there working with that stack, and even now I'm trying to educate myself on some of those tools, and it is a brand new world to me. So having that kind of expertise is always helpful. If you're interested, get a hold of Mike; he'll be glad to talk to you. But with that, we will close up shop on episode 195 of R Weekly Highlights. We thank you so much for joining us, and we'll be back with another episode of R Weekly Highlights next week.
Hello, friends. We are back with episode 95 of the Our Weekly Highlights podcast. This is the weekly show where we talk about the great highlights and additional resources that are shared every single week at ourweekly.0rg. My name is Eric Nance, and thank you for being patient with us. We are a couple days away from our usual release because, you know, as much as we would like to control our schedules, sometimes things just fall in our laps and and certain groups just need our attention those days. So that happened on me on our usual recording day, but we're back here. And I say we, of course, because I am joined by my awesome cohost, Mike Thomas. Mike, how are you doing this fine morning? Doing well, Eric. Yeah. We're a little later in the week this week. And for once, it's not me, but I you're helping to helping to balance it out. So I appreciate that. Yes. The the yin and yang are are starting to balance or the force, if you will. So that is it'll happen, and it may even happen next week too. We'll figure it out. But nonetheless, we are we're happy to be back here on on this day, and our issue this week has been curated by another one of our OG curators on the team, Jonathan Carroll.
As always, he had tremendous help from our fellow, our weekly team members, and contributors like all of you around the world with your poll request and other suggestions. And we're gonna visit the, visualization corner right away on this episode of our weekly highlights. And in particular, type of type of technique for visualization that may seem retro to you, especially when you think about how you used to print paper documents back in the in the days of dot matrix printers and not so much colors available in those printers.
So this post is coming to us from Nicola Rennie who is talking to us today about alternatives to the typical color palettes that you might be using in your visualizations in R. Now the first question is, why would you not want to take advantage of colors in your visualizations when you wanna share that with the world or share that in other means? Well, there may be cases where you don't really have a choice. One of those being that in the world of academia or even just other academia in general, you may have requirements from the, public publisher that nope. No color plots. Gotta be monochrome.
Black and white, basically, or what monochrome is technically speaking is it's different shades of a single color. So typically, we do think of that as gray. The different spectrums are gray. Right? And so there are a lot of publications that require that right off the bat. Although there is another side benefit to this that you may not realize when you're going down this route is that this could be a win for accessibility as well if you structure it the right way. And I think what Nicola's, advice here will get you along the way to do that.
Now you might be thinking, okay. Now I have to do the monochrome plot. You know, there is an easy way to do this. Right? You print the PDF and you could choose either color or black or white. Right? And that will just convert everything to that that black and white spectrum. Well, in the first part of the post, Nicola talks about that's probably not a great idea because you're losing a lot of the nuance in these different colors especially if they're closer together via what's called saturation. And she has a nice visual, again, we're audio here. We'll we'll speak of the visualization, horizontal bar chart. And so in this legend of the visualization, she colors the bars by by transmission type, But then when you look at the the black and white converted version of it, you cannot tell really at all the difference between at least two of these lead these items in the legend, which of course is really bad if you can't figure out which color is going with which. So that means yes. You're gonna have to, you know, code this up like you would with any typical visualization.
So your next bet is to look at the different palettes available and of course one of these packages that you know is one of the mainstays for looking at palettes is the color brewer and by proxy the our color brewer palettes and you might be looking at those And then she links to a website, for color brewer where you can look at which palettes are what they call photocopy friendly, meaning that if you were gonna scan or literally copy this this document that has a plot in and it was gonna convert the black and white, which palettes are actually more amenable to that. So she does show in the next example using, the color brewer package, which, pallet, it's called set three, where at least in this time, you're getting a closer to the right direction with legend colors that do look somewhat different. Although I'll be honest, the two grayish ones are so hard to tell on my screen which ones are a movie different, but it's it's it's a small step to get there.
But you may wanna take it a step further and really think about the fact that you're in this monochrome paradigm. Maybe there are palette types that are better suited for this world you're about to embark into. And taking a step back, what are the types of color palettes that we typically use? One is sequential, where you have more of a gradient, a gradual decrease or increase in the shade of the color, so in this case thinking of going from a darker color to a lighter color. You also have diverging palettes, where the extremes of the scale are very different, but the middle is, you know, kind of in the middle, so to speak, blending them together.
And then you have the discrete type of scale, where there's no real ordering to it; it's just different colors for each category. Now, which one of these is better suited for the monochrome type of visualization? Well, when you look at the sequential palette, it's actually not too bad, because you can see the lighter color at the lower end of a continuous type of legend all the way to the darker color at the top. It actually does translate pretty well. She's got an example in this case using, I believe, highway mileage on the scale, where the lighter color means fewer miles and the darker color means more miles on a dot plot, and at least you can see the darker dots compared to the lighter dots. Is it perfect? Well, your results may vary there.
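As a rough illustration of that sequential idea, here is a minimal sketch of a grey gradient on a continuous variable, loosely mirroring the highway-mileage dot plot described above but using ggplot2's built-in mpg data rather than Nicola's exact example.

```r
library(ggplot2)

# Light grey for low values, near-black for high values on a continuous scale
ggplot(mpg, aes(x = displ, y = cty, colour = hwy)) +
  geom_point(size = 3) +
  scale_colour_gradient(low = "grey85", high = "grey10") +
  theme_minimal()
```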
The ones that definitely do not translate well are diverging palettes, because when you think about those different extremes and the middle, how do you really convert that well? The direction might get lost in translation in those conversions. But another difficult one is the discrete palette, because you never know, based on the colors you chose, which ones are gonna translate well to that conversion. And in fact, in that little resource we just told you about, the ColorBrewer palette lookup, Nicola says there's really only one discrete palette that's photocopy friendly. So you're kinda stuck if you're still searching there, or trying to guess which ones are really best suited for it.
So she recommends that by starting your design of the plot with the monochrome world in mind, you might then choose colors that are visually different enough when you do that monochrome conversion. In ggplot2 there is a family of scale_*_grey() functions that will give you a gray palette in place of the color palette. And she does another example where you can see the three different legend items, in this case transmission again, going from almost black all the way to very light gray, with a darker gray in the middle, and you can see that really pop out in the visualization.
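Here is a minimal sketch of that scale_*_grey() approach on a discrete variable, again substituting mpg's drive type for the transmission example in the post; the start and end arguments control how dark and how light the extremes of the grey ramp are.

```r
library(ggplot2)

ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge") +
  # near-black through to very light grey, so the three groups stay distinguishable
  scale_fill_grey(start = 0.1, end = 0.9) +
  theme_minimal()
```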
But again, it definitely takes attention to detail to make sure you're picking colors that are distinguishable enough for that conversion to really work well. You also have to keep in mind the type of chart you're making. This works really well for a bar chart where, obviously, the bars themselves take up a lot of area on the overall canvas. Whereas for the dot plots, your eyes have to squint a little bit at the size of those dots, and it may not be so obvious to distinguish the colors even if they look really good in the bar chart. So you've got to think about the type of plot you're making; that should inform, alongside having the monochrome framework in mind, which colors you want to use for that conversion.
But one thing to open your eyes about is that you don't always have to stick with just colors. When you're in the monochrome mindset, you can take advantage of other features in the visualization to really make your visualizations pop. And I think, Mike, she has some really great advice on the different types of patterns you can use in these visualizations.
[00:09:56] Mike Thomas:
Right? She does. And this took me back a little bit to R 101, probably using base plot. And I don't know if this is the case for everyone, or for a lot of maybe the older folks listening, but I feel like my introductory R plotting classes were filled with using shapes as points. I remember that quite fondly, and I have, honestly, Eric, forgotten that that's even possible. I don't think I've used a shape, and this is terrible, I'm so sorry, Nicola, I don't think I've used a shape in a plot in a long, long time in R. But it's useful to remember that that's even something that's possible to do.
And Nicola mentions two different packages here that I think can be really handy, especially as we think about the monochrome world: the ggpattern package and the fillpattern package. The ggpattern package allows you to leverage texture and shapes within your ggplot2 plots specifically. And again, this is audio, so it's hard to do, but it looks like, in a bar chart example, you can have one section of the bar with polka dots in it, another section of the bar, representing a different class in a discrete variable, with diagonal lines through it, and another one that looks like plaid. So it's an interesting approach to getting away from leveraging color to represent different classes in a discrete variable and actually representing them with patterns. It's really creative to me and really interesting, and honestly, it does make it very easy to discern the different categories.
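Here is a hedged sketch of that ggpattern idea, filling bars with textures instead of colours; it assumes the ggpattern package is installed and uses a handful of its built-in pattern names, with mpg's drive type standing in for the categories in Nicola's example.

```r
library(ggplot2)
library(ggpattern)

counts <- as.data.frame(table(drv = mpg$drv))

ggplot(counts, aes(x = drv, y = Freq)) +
  geom_col_pattern(
    aes(pattern = drv),
    fill = "white", colour = "black",
    pattern_fill = "black", pattern_colour = "black"
  ) +
  # stripes, crosshatching, and circles replace colour as the group encoding
  scale_pattern_manual(values = c("stripe", "crosshatch", "circle")) +
  theme_minimal()
```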
I think maybe for folks who aren't traditionally used to living in the monochrome world, there's just one thing that you may have to watch out for. If I was giving this as a deliverable to a client or something like that, I think they might be distracted a little bit by the patterns, just because it does look a little bit old school. But I think if you work with them to communicate the reasons why you did so, and they understand that the deliverable should be monochrome for reasons x, y, and z, then this is a fantastic approach to consider as well, just to make it really easy for the eye to tease apart the different categories that are at play here.
And I think the fillpattern package allows you to do the same thing, but it will work with base R graphics, whereas the ggpattern package is very specific to ggplot2. You know, one of the other considerations that Nicola has as a theme throughout this blog post is that as you work with monochrome palettes, it may come to light that the plot you're using, whether it be a scatter plot or a bar chart or whatnot, may actually not be the best way to represent that data in the monochrome world, and you might want to consider switching to a different type of plot. I think in the diverging palette example, where it's very difficult to represent a diverging gradient in a monochrome world, she recommended potentially switching to a lollipop chart so that you can measure the magnitude of change in a different way that would allow a monochrome legend to make a whole lot more sense to the end user. Another gotcha that she mentions to watch out for is the case of missing data. She uses the example of choropleth maps, where, Eric, I'm sure you've seen this before, perhaps with states in the US: if we have a choropleth map where each state has its own color, sometimes if a state doesn't report data, we'll show that state as gray as opposed to all of the other states having a particular color. And if you're trying to represent missing data, obviously, in the monochrome world, gray is not going to work, right? Because, exactly, your color scales are all different shades of gray. And I think that's another case where you're going to have to think creatively about different ways to showcase the data, or make the point that you're trying to make, maybe with a different chart where it makes more sense to leverage your monochrome color palette in a more effective way. So a lot of fantastic tips here, and some links at the end of the blog post as well. I appreciate Nicola not just walking through the examples of how to apply this, but also really taking the time to critically think about all the considerations that you have to take into account when you leverage these types of approaches, really thinking about it holistically.
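For the lollipop alternative Mike mentions for otherwise-diverging data, here is a minimal sketch with made-up numbers: the segment carries direction and magnitude, so a single grey is enough.

```r
library(ggplot2)

# Hypothetical data for illustration only
change <- data.frame(
  region = c("North", "South", "East", "West"),
  delta  = c(2.3, -1.1, 0.4, -3.0)
)
change$region <- reorder(change$region, change$delta)

ggplot(change, aes(x = region, y = delta)) +
  geom_segment(aes(xend = region, yend = 0), colour = "grey40") +
  geom_point(size = 3, colour = "grey10") +
  coord_flip() +
  labs(x = NULL, y = "Change from baseline") +
  theme_minimal()
```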
[00:15:02] Eric Nantz:
Yeah. And, as always, I like to, or maybe don't like to, date myself so much on this very podcast. The last time I did patterns like we're seeing in the bar chart example here was for my dissertation way back in 2008, folks, because we thought, well, okay, it's gonna be published at some point. Spoiler: I never did. But nonetheless, for the requirements of grad school, we did have to do monochrome prints. And I remember for these bars, I was like, oh no, how the heck am I gonna distinguish these different, you know, disease types in the competing risk output? Oh, patterns.
So, yeah, I got the retro vibes when I looked at the example Nicola put together here. And overall, yeah, excellent advice here. You go through the full spectrum of what you might try first, all the way to the real principles that she feels we should all have in mind when we're in this, you know, tunnel vision approach to monochrome. Other techniques that she mentions can help too: if you have the ability to facet your plots, that can help the differences pop out even further, in the case of, say, the dot plots we were seeing in the examples, and annotations as well. If you've got room to put labels above the bars to distinguish those different categories, go for it. Obviously, you have to balance how busy it's gonna look versus how presentable it is without it. But in the end, you've got ways to account for these potential ambiguities.
And she does link to additional resources that she used to build this blog post, so there's lots of great reading if you find yourself in this space. But I think overall, it's another great testament to the power that we have in visualizations in R to think of multiple perspectives and also think about accessibility at the same time. So a really valuable post that's going in my bookmarks for visualization to take a look at in the future, because you never know if I ever get back into the publishing world again. Who the heck knows what that publisher is gonna require from us?
[00:17:11] Mike Thomas:
Totally. And I feel like Nicola is always covering almost forgotten topics like this. So huge props for tackling this one.
[00:17:34] Eric Nantz:
So as we were just talking about, things can get a little messy when you're dealing with certain types of visualization palettes unless you have the right frame of mind. Well, you know what else can be messy? Data. Because in the real world, we don't get those great textbook examples, right, where everything's just neatly tidied, with, like, 20 observations, 20 rows, no missing values, and everything's rectangular and annotated correctly. No, no, no. In the real world, we have to deal with a lot of interesting, sometimes downright confusing formats of how these data are populated.
Our next highlight is taking us through a journey of how we can do this processing applied to real world data, coming from census data for Scotland, to be exact. So this post is authored by John MacKintosh. John has been around the R community for many, many years; I've seen his name quite a bit. He is currently a data specialist and package developer at NHS Highland. And the motivation for this post is that he and his team were working with census data from 2022 from Scotland.
And they do, to their credit, give you a way to download the data, packaged up as CSV files in a large zip archive, so you can grab that. He says that once you extract it, you've got 71 files with around 46,000 rows and a highly variable number of columns. So already these files may have some inconsistencies off the bat. But wait, there's more in terms of how these are formatted. So buckle up, because this may give you some flashbacks to any messy preprocessing days that you've had in your data science journey. First, there are three rows at the top of these CSVs that have what you might call metadata about the data files, but they're useless for your analysis. So, a, you don't want those to be imported; you wanna ignore them.
But then at the end of the files, there are about eight or so rows that contain additional text or metadata that could also be discarded. So your data is kind of sandwiched between these messy rows. But then, once you wipe those out, the main files will have headers in multiple places. Yikes. So those have to be combined somehow, because they might be telling you different information about the columns. And then, wait for it, there may be different types of delimiters being used in those first few rows. Oh goodness. I'm already getting triggered just reading this. But as John says, if you only had a couple of files, sure, you could manually account for that. But when you're talking about the volume of files in this download, of course you're gonna have to take a programmatic approach and cross your fingers that you're able to figure this out.
So he walks us through his journey in trying to import these. His first step is something that I would have tried as well: two reads, or imports, of the data per file. You first read it in as, like, a temp data frame or a temp file, and then on the second read he was going really low level, using the scan function to try and isolate the actual data area between those junky rows at the top and the bottom. That way he could track where the actual data begins and then try to figure out where the actual header is in that chunk.
He leveraged the vroom package, v-r-o-o-m, which I believe was authored by Jim Hester years ago and gives you great performance when importing textual data files, because it does have a parameter for skipping certain rows and another for suppressing the need to get column names. So he thought, well, there we go. But unfortunately, on top of those parameters, he couldn't figure out a good way to know how many rows to skip without doing that prior scan. So he was hoping vroom could do it all. Not quite.
But then he went back to data.table, which has been highlighted quite strongly in previous episodes, and which comes with a function called fread that already gets a lot of praise from the community for being a highly performant way to import large textual data files like CSVs. And sure enough, you can set header to false, and then data.table was, in his opinion, intelligent enough to strip out those junk rows at the beginning, but he was still left with the multiple header issue. And then he's got some code snippets where he tries to figure out the actual data area, that rectangular area in between the junk headers, with a little bit of grep magic inside data.table calls. And then, to account for the number of header rows, he uses the tail function to strip out those extra rows.
Then, to combine the header rows together, he builds an intelligent vector that consolidates their information, which can then be used as the column names. But then it gets even more interesting. He has to pivot from the wide format to the long format, grab the values themselves, and make sure that if there were any messy values, say hyphens or other weird characters, they were stripped out so the column could become numeric, because I believe it was mostly numeric data coming from this census.
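Here is a simplified, hedged sketch of the general two-pass pattern John describes (his actual code, which also handles the multiple-header problem, is in his GitHub repo): read the file as plain text, slice out the rectangular block between the junk rows, re-parse it with data.table, then melt to long format and coerce the values to numeric. The row offsets and the cleaning regex are illustrative assumptions, not his exact choices.

```r
library(data.table)

read_census_file <- function(path) {
  # First pass: pull in every line as plain text
  # (John used scan(); readLines() is used here for simplicity)
  raw <- readLines(path, warn = FALSE)

  # Assume 3 metadata rows at the top and roughly 8 footnote rows at the bottom,
  # as described in the post
  body <- raw[4:(length(raw) - 8)]

  # Second pass: parse just the rectangular block
  dt <- fread(text = paste(body, collapse = "\n"), header = TRUE)

  # Pivot wide to long, then strip messy characters (hyphens etc.) so the
  # values become numeric
  long <- melt(dt, id.vars = 1, variable.name = "category", value.name = "value")
  long[, value := as.numeric(gsub("[^0-9.]", "", value))]
  long[]
}
```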
He also made great use of data.table's set functions, with other snippets of code in the post. And yes, there is a GitHub repo that has all of this assembled, which I ended up reading before the recording because I couldn't believe just how complex this was. This is not for the faint of heart, yet data.table was quite valuable for doing it. But he wanted to take this even further and optimize the format away from CSV once all this messy stuff was accounted for. And this is gonna make my data processing mind light up here: he wanted to convert this to Parquet files and throw those into DuckDB, because why not? This is a perfect use case for something like that, so you can take advantage of a more efficient file format and be able to process just what you need.
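A minimal sketch of that final step: write the cleaned table to Parquet with arrow and register it in DuckDB. The file names here are placeholders, and DuckDB's ability to query Parquet files directly makes the load a one-liner.

```r
library(arrow)
library(DBI)
library(duckdb)

# 'clean' stands in for the tidied table from the previous step
clean <- data.frame(area = c("A", "B"), count = c(10, 20))

write_parquet(clean, "census_clean.parquet")

con <- dbConnect(duckdb(), dbdir = "census.duckdb")
# DuckDB can read Parquet files directly in SQL
dbExecute(con, "CREATE TABLE census AS SELECT * FROM 'census_clean.parquet'")
dbGetQuery(con, "SELECT area, sum(count) AS total FROM census GROUP BY area")
dbDisconnect(con, shutdown = TRUE)
```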
So there are lots of benefits to that approach, but in the end, he wraps the post up with some practical tips if you find yourself in this situation. One of which is: don't do what I just did when I first read these data files, dreading any code I'd have to write. Take it step by step, folks. It may seem insurmountable when you look at one of these as a whole, but if you take the approach of, okay, how do I deal with these headers? How do I strip out the junk at the end and then figure out that data area? Then you're breaking it up into components, right? You can't just boil the ocean all at once, as I say. And that leads to his second point.
You may think that you can just throw all of this into a map call, where it processes each data file in an iterative fashion, but it may not always work the first time. So again, really isolate a few different use cases and then scale up after that. And speaking of purrr itself, an underrated function that is great in these situations, when you can't fully expect reliability from the function you're calling, is the safely function. You can use safely to wrap the utility function that does your processing or importing, and that way it doesn't crash the rest of the script. You can then parse after the fact what the error was, or just ignore that stuff altogether if you have other means to account for it.
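A minimal sketch of that safely pattern, wrapping the hypothetical read_census_file() helper from the earlier sketch so one bad file doesn't take down the whole loop.

```r
library(purrr)

safe_read <- safely(read_census_file)

files   <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
results <- map(files, safe_read)

# Each element holds $result and $error; separate the successes from the failures
ok     <- results |> map("result") |> compact()
errors <- results |> map("error")  |> compact()
length(ok)
```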
And then also, the base R installation itself has a lot of great string functions. I admit I grab stringr and stringi every time I do string processing, but if you're really thinking about performance, or minimizing your dependency footprint, it may be worth the investment of time to learn about the grep and gsub calls that come with base R. Yeah, the syntax can be a little hard to get at first, but if you practice enough, I think you'll get the hang of it. And again, John's got a great repository set up where you can see just how he uses those base string functions in his code.
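A quick sketch of those base R string tools on some made-up census-style values: grepl() for matching and gsub() for substitution.

```r
x <- c("1,234", "56-", "-", "789*", "n/a")

# Which entries contain at least one digit?
grepl("[0-9]", x)

# Keep only digits and decimal points, then coerce; empty strings become NA
as.numeric(gsub("[^0-9.]", "", x))
```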
And apparently, he says after the fact of writing this that there is a tool, I believe it comes in either Excel or other Microsoft products, called Power Query. I've never heard of this before, but apparently it can help with these messy imports. Do you know anything about this, Mike? Because I haven't heard about it. You're lucky if you've never heard of Power Query, Eric. Okay, great. I'll keep it that way. The day job has not forced me to use it, and I don't plan to. But nonetheless, he says that could have made things easier. I admit, I don't know if I wanna rely on a proprietary tool to make that easier. I think if you can script it out, future you and future reproducibility will thank you for it, but it's good to know that there are alternatives in this space. All in all, a good reminder that, a, the real world is never as perfect as the textbooks would like you to believe with these data formats.
And b, you can really augment a lot of great functionality from data.table with some of the base R string processing functions. And if you take it step by step, you can get to where you need to be to take advantage of the fancy stuff, like DuckDB and Parquet, down the road. So a very enlightening post, and hopefully I never have to encounter anything as messy as what these census data files presented to John here.
[00:28:16] Mike Thomas:
No, it's a big effort to undertake something like this. I know because I've done it before, and I'm sure a lot of folks listening have done the same thing, because the way that data is published sometimes, publicly available data specifically, can be crazy. And this is not an exception. I'm sure that all of the folks who are interested in working with the Scotland census data will hopefully see this blog post and catch wind of the location where John has landed the data in DuckDB, all clean and easy to pick up. What a great use case. And, yeah, unfortunately, it seems like this particular project had a little bit of everything in it in terms of what he was up against. I really appreciate the top tips there, like the use of purrr's safely to make sure that, if you are looping through something, your loop can continue or take an action if it runs into an error.
I'll be honest, I'm fairly guilty of using purrr's safely inside, like, an if statement, as opposed to using probably the more appropriate tryCatch type of function. But it's just so easy, unfortunately. I was actually just doing this earlier this week, doing a little data archiving, a data rescue, for some US government program data for a client, data we weren't sure was gonna continue to be around or not. I was downloading it and sticking it in an S3 bucket. And some of these datasets that the data dictionary said were available were actually not there at all when you went to the URL. So purrr's safely saved me quite a bit and allowed me to loop through everything without having to change my iterator, if you will.
And, you know, the last tip there, that base string functions are very useful and overlooked, I couldn't agree with more. For those of us who have to clean up messy data, string manipulation, you know, grep and regular expressions and what we get from the stringr and stringi packages, although I can't say I use stringi too much, since I think a lot of that functionality has been mapped into stringr, are absolute lifesavers. Sometimes it's tricky to get that regular expression pattern just right.
I'll be honest, ChatGPT has helped me expedite that process quite a bit. So that would be my tip if you're struggling with regular expressions at all: try looking there first, because it might take care of everything you need without you having to figure out wildcards and placeholders and lengths of characters and all sorts of crazy stuff like that that happens in regular expressions, where the syntax looks really weird but the power is absolutely incredible. And when it all works, it is so satisfying. So I think a big thanks to John for his efforts here and for documenting them on this particular project in a great blog post, because I think it's something we can all relate to.
[00:31:26] Eric Nantz:
Yeah, that reminds me that my very first use ever of ChatGPT was indeed for regex help, because I was like, I could do the whole Stack Overflow thing, but wait a minute, all these people are talking about that. Let's give it a shot. And yes, in that case, it worked immensely well. But, yeah, I think in the current climate you may find yourself in a situation where you have to grab this data sooner rather than later from sources, so it never hurts to have these techniques available to you. Now, getting back to some of the work I'm doing at the day job, we're working with a vendor who is giving us CSV files of certain event type data.
We've given them the requirements for how we want this data to be formatted. Guess what? They don't always follow that, so we've had to build an internal package to account for those things. But we are hoping they give us API access to the raw data, so then we can have a more consistent, you might say reliable, pattern for how the data is gonna be represented, because I feel more comfortable handling some JSON coming back from an API for certain datasets than a cryptic CSV that may or may not have a header, and may or may not even have the right columns spelled correctly. Yes, this happens when you have manual effort from people copying from one system to a CSV that goes through some stupid web portal, and then we have to be the ones who consume it. I'm not bitter at all, but it happens, folks. So if you get the chance to leverage an API, take advantage of it if you can.
[00:33:03] Mike Thomas:
Absolutely.
[00:33:15] Eric Nantz:
Well, in our last highlight here we are gonna call back to an initiative and a huge development in the world of Shiny that has both Mike and me absolutely giddy about what's possible in this new world we find ourselves in, of taking advantage of WebAssembly with R and Shiny itself. This last post comes to us from a fellow curator of R Weekly, Colin Fay, who, of course, is a brilliant developer and data scientist, the author of one of our favorite R packages in the entire world, golem, as well as other great innovations in the world of Shiny. He wrote a post on the ThinkR blog about the mobile app they released late last year or early this year called Rlinguo, which took the Shiny community by storm in a great way, and this post takes a step back to ask why they actually built it. To give a recap, first check out the episode from a while back when we talked about this great effort in detail, but in very quick summary here, Rlinguo is an actual app that you can install on your mobile device via the Play Store or, for iOS, the Apple App Store.
And you can, in essence, take a little quiz about your knowledge of R in this very responsive, greatly themed, installable application that runs completely self contained on your mobile device and wraps R under the hood via webR and WebAssembly. I am super excited about this. Those of you who have listened to this show for a bit know that I've been on a journey with WebAssembly and some very important external collaboration, so anytime I get to see WebAssembly in the wild, I am all for it. But what Colin talks about is, again, the big picture here.
Why is this so important to the community at large that wants to take advantage of R on a mobile device? The big takeaway is that having that capability to be mobile and have the power of R at your disposal is a huge benefit across many different situations. So he walks through a few of these, and I can relate to each of them in different ways. One of which is, and we were talking about data earlier, what if you're in the field? What if you're in the trenches to grab this data and you need a way to record it on the spot? Maybe you're on location somewhere.
Who knows? You may be on a mountain somewhere. You may be in the rainforest. Who knows? But you probably will not have reliable Wi-Fi or Internet access in these remote locations, right? Yet having a self contained app on your device that can help you track that data, maybe leveraging R to do some processing or some storage of it, is absolutely a massive benefit, because, again, you can run this in a completely offline mode. You improve your efficiency and you bring the data collection closer to your actual end product. Really, really helpful.
Number two, it's a great way to learn, again, in an offline fashion. Think of when you and I were in school, Mike. Wouldn't it have been great if we'd had the technology we have now back then? You know, we had to read textbooks, right? We had to take notes and hope that we could run things on maybe an old Windows installation or something like that and hope everything just worked. Or, in the case of my grad school, SSH into a server without knowing what the heck R even was at the time. But imagine having this on a mobile device, where you can learn about a key concept, maybe the central limit theorem or maybe some other very important statistical concept.
And you can learn this wherever you are and explore it, without having to be at your computer or have a textbook open to do it. So it can be an interactive learning device, which, again, is very similar to what they did with this quiz app. It was completely interactive, but completely offline as well; it had everything self contained. A really, really novel use case. Education is already taking advantage of WebAssembly, and there are already a lot of resources. We speak highly about the quarto-live extension, where you can embed WebAssembly powered apps into a Quarto document.
George Stagg and the Quarto team are doing immense work in this space, with this reimagination of the learnr package as a new way to leverage Quarto. There's lots of great potential here in the mobile space as well with webR and WebAssembly. Then, when you think about other industries where you need real time feedback really quickly, Colin calls out a couple of other use cases. One of which is having real time quality control when you're in the manufacturing space. Maybe you need to inspect something that's going through a production line, enter some feedback, and check everything's working. You can look at quality control metrics on the spot based on some inputs you give it and very quickly detect that something's going wrong, which is hugely important in manufacturing when things are being produced at scale.
And then, rounding out the post, there are other considerations such as managing your supply chain and making sure you're optimizing it based on the logistics you're dealing with. Maybe you have to run some models on the spot based on certain parameters, and that might change the way you send products out or how you distribute to warehouses or whatnot. That could be quite helpful. And going back to being in the field, so to speak, in the healthcare industry, maybe you're helping out with a huge unmet medical need in, say, a clinic somewhere remote in Africa or some other region, and you need to input patient characteristics and get some kind of score out to determine the next treatment plan. So in the world of diagnostic type evaluations, this could play a huge role.
Who knows how long that will take because, obviously, I come from life sciences, and things don't exactly move at a breakneck pace, but I do see this as having potential for those in the field to take advantage of R to help with some of that evaluation and processing. So you now have at your disposal a way to run R on your mobile device in all sorts of different use cases. I don't think Colin even begins to cover all the different possibilities here. But what ThinkR has proven is that, with the right initiative and with the tools available, you can enter this space and probably cause some really innovative breakthroughs with R at the back end. Am I here for it? Oh, abso freaking lutely I am. This is awesome.
[00:40:45] Mike Thomas:
It is awesome. It feels like this Rlinguo project has a lot of gasoline on it, and I think as soon as somebody drops a match, this is going to catch fire throughout the whole R ecosystem. I think it's just new at this point, and as it continues to gain more exposure, I see it exponentially taking off pretty quickly. It's an absolutely incredible breakthrough. As you were talking about back when we were in school, having to read textbooks and things like that, I know, and I don't think it's a big secret here, that at least in the US a lot of teachers in high school, and probably college as well, struggle with kids being on their phones all the time, right, and being distracted by that. So if you can actually bring the learning to the phone, then they can sit there with their phones and learn at the same time, and you're putting the education in the place where most of their attention is. So maybe this is practically helpful in that respect. I think that's pretty interesting. I also know that, in more professional contexts in manufacturing settings and healthcare, as you were saying, there are actually companies out there that make these almost cell phone looking devices to monitor sensor data, you know, really just little handheld data analysis tools that are either smaller versions of an iPad or customized pieces of hardware.
And I think we might just be able to take the functionality of those pieces of hardware and leverage, you know, your smartphone by using this Rlinguo approach to visualize that data. We have the power of WebAssembly here. We have the power of DuckDB Wasm. And I think all of that number crunching, as it gets better and better, and it already is fantastic, can probably perform just as well on a mobile device, a smartphone if you will, as opposed to some of these customized pieces of technology that I think are getting a little obsolete and outdated compared to what we have right now. And you don't need to find somebody who's a Swift developer to do it. This is my favorite thing of all time, because we have clients who ask, hey, could this be on a mobile app? And I was like, yeah, well, we're gonna have to go hire somebody who knows how to write iOS; I think the language is called Swift, if I'm not mistaken.
Yep. And, oh, and then Android too. Who knows what language that is? Good luck with that SDK, folks. I'm just saying. Exactly. So if we can avoid that, it's absolutely incredible to use a programming language that's a little more literate, like R, to do our work. I think it's gonna be absolutely incredible. So there are a lot of potential benefits here. I see this as being a huge breakthrough, and it's really only a matter of time before it absolutely takes off.
[00:43:51] Eric Nantz:
Yeah. Again, when I installed Rlinguo, even in its preview mode when Colin gave me the heads up that this was coming out, my first takeaway was that I could not tell that this was an R thing, right? And not just the look of it, which, again, we have great tools in Shiny in general to make things not look like a Shiny app. The performance had no lag whatsoever. It literally looked and performed like it was built by a professional vendor that's been doing mobile development for who knows how many years. Who knows, it could have been in Swift or, you know, Objective C or whatever languages are being used there, or Java itself in the SDK. So this just enables so many possibilities. I'm still wrapping my head around it. I mean, I'm taking baby steps with WebAssembly. The fact that we were able to submit a WebAssembly powered Shiny app to the FDA is still mind boggling to me, but that's just the tip of the iceberg for where this can take us in the world of life sciences, and also, as we talked about here, many, many different industries too.
The one thing I just wanna make sure people realize when they're exploring WebAssembly for the first time, especially with the power of webR and shinylive, and again, this is more for awareness, I'm not trying to be a Debbie Downer here: if you do interact with online resources via an API call, that's probably not the best idea, because in theory, when you bring all this stuff down, all that authentication stuff is in the manifest, if you will, the bundle of code that is being translated. So that's just something to consider if you're gonna build something like this that does in fact interact with an API layer. But like I said, with Colin's use cases here, these are being built in a self contained way that has no reliance on online access to actually power the machinery behind it, so you're good to go there. I just wanted to make sure I throw that out there, because I often get people asking me, why isn't that thing I'm building in WebAssembly?
I would love to, but if I have to interact with an online service, yeah, that's one little gotcha. But maybe that gets better someday. Who knows?
[00:46:05] Mike Thomas:
Yeah. It's a great consideration.
[00:46:07] Eric Nantz:
But all in all, still very inspiring. And also, to be honest, you should be inspired by almost everything in this issue. Jon has curated an excellent issue here; he always does a fantastic job. So we'll take a minute, or a couple of minutes I should say, for our additional finds here. And speaking of huge improvements in the ecosystem, I do have a soft spot, because of my day job and my research areas and my career, for high performance computing with R, and for one of the very solid foundations in that space that has come to light in the last few years.
The mirai package, authored by Charlie Gao, has had a substantial update in the recent version 2.0 release. Here's to Charlie and his team of developers. He has a great blog post, and I'll mention a couple of the additional enhancements here, especially if you find yourself leveraging cloud resources as your high performance computing backend, talking AWS Batch or other resources. He's got a much easier way to launch all those background processes via a more clever use of SSH under the hood, and, by proxy, the protocol for passing connection information back and forth has changed from a WebSocket layer to TCP, which is basically a lower level way to communicate with these online resources. It's faster and, in his words, even more reliable.
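For flavour, here is a heavily hedged sketch of spinning up remote mirai daemons over SSH, roughly the kind of setup those launcher improvements target; the host names are placeholders and the exact arguments may differ by version, so check the mirai documentation before relying on this.

```r
library(mirai)

# Launch daemons on two remote machines over SSH, dialling back to this host
daemons(
  n = 2,
  url = host_url(),
  remote = ssh_config(remotes = c("ssh://node1", "ssh://node2"))
)

# Send a task to whichever daemon is free, then wait for and collect the result
m <- mirai(sum(rnorm(1e6)))
call_mirai(m)
m$data

daemons(0)  # tear everything down when finished
```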
And the other one that caught my eye is for when we're leveraging this within a Shiny app. With what Charlie has done in collaboration with Joe Cheng to bring mirai integration to the new ExtendedTask paradigm in Shiny itself, you now have a much more elegant way to cancel that workflow in a Shiny context if you decide, oh, wait a minute, that's not what I wanted to run, get me out of here. There's a great way to implement that cancelling functionality, and he's linked to, I believe, a vignette article that talks about that integration in great detail, along with some great interaction with the purrr package as well, because purrr, in its development version, is now providing more parallel mapping capabilities, and mirai can be a great back end to power that, not just, say, the future package like before. So, great updates in version 2.0 of mirai, and I'm super excited for what Charlie has in store for future releases. It's a wonderful package.
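Here is a hedged, generic sketch of the ExtendedTask plus mirai pattern being described (not the vignette's exact example): the long computation runs in a mirai daemon so the Shiny session stays responsive, and the task button handles the busy state.

```r
library(shiny)
library(bslib)
library(mirai)

daemons(2)  # background processes that will run the tasks

ui <- page_fluid(
  input_task_button("go", "Run long job"),
  verbatimTextOutput("result")
)

server <- function(input, output, session) {
  # The function passed to ExtendedTask returns a mirai, which behaves as a promise
  task <- ExtendedTask$new(function(n) {
    mirai({ Sys.sleep(5); sum(rnorm(n)) }, n = n)
  }) |> bind_task_button("go")

  observeEvent(input$go, task$invoke(1e6))

  output$result <- renderPrint(task$result())
}

shinyApp(ui, server)
```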
[00:48:52] Mike Thomas:
Yes, absolutely. All that asynchronous possibility, I think, is really incredible in terms of what we're able to push the envelope on. So my takeaway here is a blog post on the Epiverse site, authored by James Mba Azam, Hugo Gruson, and Sebastian Funk, and the title is Key Considerations for Retiring or Superseding an R Package. I believe Epiverse has a suite of different R packages that they help maintain. I also saw a recent post by Hadley Wickham, maybe on Bluesky or somewhere like that, where he is authoring sort of a history of the tidyverse, and I'm sure part of that history will include things like ggplot versus ggplot2, maybe the reshape2 package moving to dplyr and tidyr, and, you know, the sort of life cycles that these packages go through, where you're hoping that users move to the new version, but you don't want to fully take away the old version, even though it's pretty legacy and not maintained, because perhaps folks have legacy workflows that leverage the old versions of your packages. So there have got to be a lot of decision points you need to make in order to try to accommodate as many people as possible, and I think this blog post is a really nice reflection on that, which is why I wanted to highlight it.
[00:50:21] Eric Nantz:
Yeah, this is a very, very important post, especially if you find yourself developing a lot of packages and learning so much along the way that you wanna make sure, a, you have a way to take advantage of what you learned, but also, b, you don't forget those who were early adopters of your previous packages. I remember my colleague, Will Landau, wrestling through a lot of this too as he transitioned from his drake package over to targets, because targets, in his opinion, contains a lot of what he learned in the process of developing drake and what he made better for the targets ecosystem. So I know he's thought about a lot of these principles too. Highly recommended reading, and I'll be watching Hadley's Bluesky posts on that too, hoping he comes out with an interesting article, or whatever knowledge gets shared there, alongside this great post from the Epiverse team. A really, really solid find. And, again, there are lots of solid finds in the rest of the issue too, so we invite you to check it out. It is linked directly in the show notes, as always.
And, also, you can find the full gamut of updated packages, new blog posts, new tutorials, and upcoming events and whatnot. So lots of great things to sink your reading chops into. And we love hearing from you as far as helping us with the project as a whole. R Weekly is a community project through and through. No corporate sponsor overlords here. We are driven by the efforts of the community, and even yours truly will have an issue to curate next week. I'm building a completely over engineered Shiny app to manage our scheduling paradigm that I hope to talk about in the near future, but I'm doing it all with late night hacking, if you will, because it's not my day job, folks. But where we can use your help is in finding those great resources. If you've seen one online, whether you authored it or found someone else who did, we're just a pull request away, because everything's open on GitHub, folks. Just head to rweekly.org.
Open the pull request tab in the upper right corner, and you'll get taken to the template where you can simply fill that out for our curator of the week, which, if you do this now, will be yours truly. We'll be glad to get that resource into the next issue. But we also love hearing from you online as well. We are on the social medias. I am on Bluesky at rpodcast dot bsky dot social, or something like that. Also, I'm on Mastodon at rpodcast at podcastindex dot social, and I'm on LinkedIn, causing all sorts of fun stuff there. Search my name and you'll find me there. Mike, where can they find you? Primarily on Bluesky these days, in terms of social media,
[00:53:03] Mike Thomas:
non-LinkedIn, at mike dash thomas dot b s k y dot social, or on LinkedIn if you search Ketchbrook Analytics, k e t c h b r o o k. You can find out what we are up to, and we are still on the hunt for a DevOps engineer, for anybody interested out there who knows a little bit of Terraform, Docker, Kubernetes, and Azure. That's the stack.
[00:53:28] Eric Nantz:
Yeah, I know there are a lot of great people out there working with that stack, and even now I'm trying to educate myself on some of those tools, and it is a brand new world to me. So having that kind of expertise is always helpful. If you're interested, get a hold of Mike; he'll be glad to talk to you. But with that, we will close up shop on episode 195 of R Weekly Highlights. We thank you so much for joining us, and we'll be back with another episode of R Weekly Highlights next week.