Recently I was reading a post from 2020 on the RStudio Blog, when I followed a link in the post only to find…
Of course, it was easy to find the intended page with Google, but it made me curious:
How many HTTP 404 (Page Not Found) errors like this exist in the RStudio Blog?
Which links are broken?
Could these broken links be easily fixed?
It turns out we can get at these questions pretty quickly with R, especially if we break the overall mini-project into even smaller tasks such as:
- Get all of the blog posts on the RStudio Blog
- Get all of the links out of each blog post
- Test out all of the links
Getting All Blog Posts
Clicking around in the blog reveals 38 pages of blog posts ranging from early 2011 to as recently as December 2021 (Happy belated 10th birthday RStudio blog!).
Let’s see if we can harvest or “rvest” all of the links.
First, let’s build the 38 urls we need to retrieve links from.
n_pages <- 38
url_bloghome <- 'https://www.rstudio.com/blog' # This is page 1
url_blogpages <- c()
for (i in 1:n_pages){
current_url <- url_bloghome
if (i > 1){
current_url <- file.path(url_bloghome, 'page', i)
}
url_blogpages <- c(url_blogpages, current_url)
}
str(url_blogpages)
## chr [1:38] "https://www.rstudio.com/blog" ...
head(url_blogpages)
## [1] "https://www.rstudio.com/blog" "https://www.rstudio.com/blog/page/2"
## [3] "https://www.rstudio.com/blog/page/3" "https://www.rstudio.com/blog/page/4"
## [5] "https://www.rstudio.com/blog/page/5" "https://www.rstudio.com/blog/page/6"
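As an aside, the same vector can be built without a loop. This is only a sketch of an alternative under the same assumptions as above (38 pages, page 1 at the blog home, later pages at /page/<i>); it is not what produced the output shown.

```r
n_pages <- 38
url_bloghome <- 'https://www.rstudio.com/blog'  # page 1 has no /page/ suffix

# Pages 2..n_pages follow the pattern <home>/page/<i>.
# seq_len(n_pages)[-1] is empty when n_pages is 1, and sprintf() then
# returns a zero-length vector, so only the home page remains.
url_blogpages <- c(url_bloghome,
                   sprintf('%s/page/%d', url_bloghome, seq_len(n_pages)[-1]))
```

Growing a vector inside a loop is fine at this size; the vectorized form is mostly a matter of taste here.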
Now, let’s pull all of the blog post links out of these 38 urls.
library(rvest)
get_blogposts <- function(url){
read_html(url) %>%
html_nodes('.pt-3') %>%
html_nodes('a') %>%
html_attr('href')
}
blogposts <- unname(unlist(sapply(url_blogpages, FUN=get_blogposts)))
str(blogposts)
## chr [1:562] "https://www.rstudio.com/blog/three-ways-to-program-in-python-with-rstudio/" ...
head(blogposts, 25)
## [1] "https://www.rstudio.com/blog/three-ways-to-program-in-python-with-rstudio/"
## [2] "https://www.rstudio.com/blog/rstudio-community-monthly-events-december-2021/"
## [3] "https://www.rstudio.com/blog/r-markdown-tips-tricks-1-rstudio-ide/"
## [4] "https://www.rstudio.com/blog/announcing-the-rstudio-blog-s-new-vision-and-design/"
## [5] "https://www.rstudio.com/blog/augment-tableau-with-r-python/"
## [6] "https://www.rstudio.com/blog/building-code-movies-with-flipbookr/"
## [7] "https://www.rstudio.com/blog/rstudio-community-monthly-events-november-2021/"
## [8] "https://www.rstudio.com/blog/announcing-rstudio-on-amazon-sagemaker/"
## [9] "https://www.rstudio.com/blog/how-the-clusterbuster-shiny-app-helps-battle-covid-19-in-the-netherlands/"
## [10] "https://www.rstudio.com/blog/announcing-the-2021-rstudio-communications-survey/"
## [11] "https://www.rstudio.com/blog/rstudio-at-r-pharma-2021/"
## [12] "https://www.rstudio.com/blog/how-data-scientists-and-security-teams-can-work-together/"
## [13] "https://www.rstudio.com/blog/pro-drivers-2021-10-0-release/"
## [14] "https://www.rstudio.com/blog/the-inspire-u2-program-student-reflections/"
## [15] "https://www.rstudio.com/blog/embedding-shiny-apps-in-tableau-dashboards-using-shinytableau/"
## [16] "https://www.rstudio.com/blog/the-inspire-u2-program/"
## [17] "https://www.rstudio.com/blog/why-your-ds-team-might-need-a-shiny-deployment-engineer/"
## [18] "https://www.rstudio.com/blog/rstudio-connect-2021-09-0-tableau-analytics-extensions/"
## [19] "https://www.rstudio.com/blog/teaching-data-science-with-rstudio-cloud/"
## [20] "https://www.rstudio.com/blog/pins-1-0-0/"
## [21] "https://www.rstudio.com/blog/rstudio-table-contest-2021/"
## [22] "https://www.rstudio.com/blog/how-to-use-shinymatrix-and-plotly-graphs/"
## [23] "https://www.rstudio.com/blog/rstudio-2021.09.0-update-whats-new/"
## [24] "https://www.rstudio.com/blog/what-s-new-on-rstudio-cloud-september-2021/"
## [25] "https://www.rstudio.com/blog/curating-for-wearerladies-on-twitter/"
Check that out, we have 562 blog post urls now!
Get Links from the Blog Posts
Let’s pull all of the links from all of those blog posts.
extract_links <- function(blogpost){
read_html(blogpost) %>%
html_nodes('a') %>%
html_attr('href') %>%
unique() %>%
sort()
}
links <- sapply(blogposts, extract_links)
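With several hundred pages to download, a single failed request would stop the whole sapply() call. One possible safeguard, sketched below, is to wrap the extractor so that an error yields an empty result. The names or_empty and toy_extractor are hypothetical and exist only for this illustration; toy_extractor stands in for extract_links so the sketch runs offline.

```r
# Hypothetical helper: wrap any extractor so that an error (for example a
# failed download) yields an empty character vector instead of stopping.
or_empty <- function(extractor) {
  function(url) {
    tryCatch(extractor(url), error = function(e) character(0))
  }
}

# Stand-in extractor for illustration only; it fails on URLs containing 'bad'.
toy_extractor <- function(url) {
  if (grepl('bad', url, fixed = TRUE)) stop('could not read page')
  c('/a/', 'https://example.com/')
}

pages <- c('https://example.com/good', 'https://example.com/bad')

# lapply() always returns a list, whereas sapply() may simplify to a matrix
# if every page happens to yield the same number of links.
toy_links <- lapply(pages, or_empty(toy_extractor))
names(toy_links) <- pages
```

In the real run, the equivalent call would presumably be lapply(blogposts, or_empty(extract_links)), followed by names(links) <- blogposts.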
Clean the Links
Now let’s get a unique list of links so we can check for the 404 errors.
unique_links <- sort(unique(unname(unlist(links))))
str(unique_links)
## chr [1:4270] "" " https://blog.rstudio.com/2020/01/29/rstudio-pbc/" ...
head(unique_links, 10)
## [1] ""
## [2] " https://blog.rstudio.com/2020/01/29/rstudio-pbc/"
## [3] " https://rladies.org/about-us/help/"
## [4] "./handoff3.jpeg"
## [5] "./tableau-extract3.png"
## [6] "/"
## [7] "/2014/06/18/r-markdown-v2/"
## [8] "/2014/06/19/interactive-documents-an-incredibly-easy-way-to-use-shiny/"
## [9] "/2015/06/24/dt-an-r-interface-to-the-datatables-library/"
## [10] "/2016/12/02/announcing-bookdown/"
tail(unique_links, 10)
## [1] "mailto:info@epi-interactive.com"
## [2] "mailto:info@rstudio.com"
## [3] "mailto:josh@rstudio.com"
## [4] "mailto:life-sciences-healthcare@rstudio.com"
## [5] "mailto:sales@rstudio.com"
## [6] "mailto:shinyapps-support@rstudio.com"
## [7] "mailto:support@rstudio.com"
## [8] "mailto:training@rstudio.com"
## [9] "r4ds.io/join"
## [10] "sendmail:%20training@rstudio.com"
It looks like we have both absolute and relative urls, so let’s clean this up a bit. To keep things simple for this analysis, we’ll assume:
- All absolute links start with http
- All relative links start with /
- We will ignore all other links (such as mailto:...)
absolute_links <- unique_links[substring(unique_links, 1, 4) == 'http']
relative_links <- unique_links[substring(unique_links, 1, 1) == '/']
clean_links <- c(paste0('https://www.rstudio.com', relative_links), absolute_links)
str(clean_links)
## chr [1:4137] "https://www.rstudio.com/" ...
head(clean_links)
## [1] "https://www.rstudio.com/"
## [2] "https://www.rstudio.com/2014/06/18/r-markdown-v2/"
## [3] "https://www.rstudio.com/2014/06/19/interactive-documents-an-incredibly-easy-way-to-use-shiny/"
## [4] "https://www.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/"
## [5] "https://www.rstudio.com/2016/12/02/announcing-bookdown/"
## [6] "https://www.rstudio.com/2017/06/26/bigrquery-0-4-0/"
tail(clean_links)
## [1] "https://youtu.be/sB8CYGlPN0o?t=158"
## [2] "https://youtu.be/t25Lbi5D6kg"
## [3] "https://youtu.be/Y2zoRCXgPwk"
## [4] "https://youtu.be/yb_mBJz3iSc"
## [5] "https://yutani.rbind.io/post/2017-10-25-blogdown-custom/"
## [6] "https://zotero.org"
It looks like we’ve dropped fewer than 150 links (4270 down to 4137). That seems reasonable.
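A few of the dropped links are ordinary urls with a stray leading space (see the second and third entries of unique_links above). If we wanted to keep those, one option is to trim whitespace before classifying. The sketch below uses a small toy vector modeled on the entries shown earlier, not the real data, and the variable names are made up for the example.

```r
# Toy input modeled on the kinds of entries seen in unique_links
raw_links <- c('', ' https://rladies.org/about-us/help/',
               '/2016/12/02/announcing-bookdown/',
               'mailto:info@rstudio.com', 'https://zotero.org')

trimmed <- unique(trimws(raw_links))   # drop stray leading/trailing spaces

# Same assumptions as above: absolute links start with http, relative with /
is_absolute <- startsWith(trimmed, 'http')
is_relative <- startsWith(trimmed, '/')

clean_toy <- c(paste0('https://www.rstudio.com', trimmed[is_relative]),
               trimmed[is_absolute])
```

Everything that is neither absolute nor relative (the empty string and the mailto: link here) is still ignored.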
Let’s limit our scope even further to only links to RStudio.com web pages. Why? Two motivating reasons:
- These are likely easily fixed by RStudio or the R community. By contrast, if companyabc.com existed five years ago and no longer exists, there is not really much to do about it.
- This is meant to be a fun mini-project and not an all-encompassing exercise. (More generally, without scope limits, things intended to take hours can easily take days.)
clean_rstudio_links <- clean_links[grepl('rstudio.com', clean_links, fixed=TRUE)]
str(clean_rstudio_links)
## chr [1:1570] "https://www.rstudio.com/" ...
This still leaves us with over 1500 links to check :-) even with our limited scope. Let’s see if we can quickly categorize these links a bit, perhaps by whatever comes right before rstudio.com.
get_prefix <- function(link, term='rstudio'){
dotsplit <- unlist(strsplit(link, '\\.'))
rsindex <- grep(term, dotsplit)
dotsplit[rsindex - 1]
}
clean_rstudio_links2 <- unlist(sapply(clean_rstudio_links, get_prefix))
table(clean_rstudio_links2)
## clean_rstudio_links2
## http://blogs http://community http://cran
## 1 4 14
## http://cran-logs http://db http://docs
## 1 2 64
## http://education http://ggvis http://glimmer
## 1 1 1
## http://pins http://rmarkdown http://rviews
## 7 1 2
## http://shiny http://spark http://support
## 9 1 2
## http://www https://blog https://blogs
## 1 159 5
## https://colorado https://community https://cran
## 4 168 7
## https://db https://doc https://docs
## 2 2 156
## https://education https://environments https://global
## 2 3 2
## https://gt https://keras https://packagemanager
## 16 32 5
## https://pins https://pkgs https://resources
## 16 25 13
## https://rmarkdown https://rstudio https://rviews
## 34 34 8
## https://shiny https://solutions https://spark
## 93 33 24
## https://support https://swag https://tensorflow
## 43 1 23
## https://TensorFlow https://www rstudio
## 1 481 160
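The table above counts http and https separately, and also splits on case (tensorflow vs TensorFlow). If we only care about the host, a rough alternative is to strip the scheme and path and lower-case what is left, since host names are case-insensitive. This is a sketch with a hypothetical get_host helper and toy urls, not a full url parser.

```r
# Hypothetical helper: reduce a url to its lower-cased host name.
get_host <- function(link) {
  host <- sub('^[a-zA-Z]+://', '', link)  # strip the scheme (http://, https://)
  host <- sub('[/?#].*$', '', host)       # strip path, query string, fragment
  tolower(host)
}

# Toy urls for illustration only
toy <- c('https://TensorFlow.rstudio.com/guide/',
         'http://rstudio.com/training/on-site.html',
         'https://www.rstudio.com/blog/rstudio-pbc/',
         'https://rstudio.com')

table(get_host(toy))
```

Applied to clean_rstudio_links, this would presumably give one row per host rather than one per scheme and prefix combination.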
Finally, let’s remove a few of these that will break my checking function below (or would require me to add error handling). Specifically, the 2 links that start with https://doc.rstudio.com/ are non-starters, as those take me to a “Can’t Find the Server” error. So, yes, those are broken links, but not 404 errors, and we are going to manually remove them for this analysis.
clean_rstudio_links3 <- clean_rstudio_links[!grepl('doc.rstudio.com', clean_rstudio_links, fixed=TRUE)]
str(clean_rstudio_links3)
## chr [1:1568] "https://www.rstudio.com/" ...
Check the Links
library(httr)
link_status <- unlist(sapply(clean_rstudio_links3, function(x)
unname(GET(x)['status_code'])))
str(link_status)
## Named int [1:1568] 200 404 404 404 404 404 404 404 404 404 ...
## - attr(*, "names")= chr [1:1568] "https://www.rstudio.com/" "https://www.rstudio.com/2014/06/18/r-markdown-v2/" "https://www.rstudio.com/2014/06/19/interactive-documents-an-incredibly-easy-way-to-use-shiny/" "https://www.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/" ...
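The manual removal of the doc.rstudio.com links could be avoided with a little error handling. Below is a minimal sketch, not the code used for the results in this post. The helper safe_status is hypothetical; its fetch argument defaults to httr::GET, and a stub is swapped in here so the example runs without network access.

```r
# Hypothetical helper: return the HTTP status code, or NA if the request
# itself fails (for example when the host cannot be resolved).
safe_status <- function(url, fetch = httr::GET) {
  tryCatch(
    as.integer(fetch(url)$status_code),
    error = function(e) NA_integer_
  )
}

# Offline stub standing in for real requests (illustration only)
stub_fetch <- function(url) {
  if (grepl('doc.rstudio.com', url, fixed = TRUE)) stop('could not resolve host')
  list(status_code = 404L)
}

urls <- c('https://doc.rstudio.com/',
          'https://www.rstudio.com/2014/06/18/r-markdown-v2/')

# vapply() guarantees one integer per url and keeps the urls as names
status <- vapply(urls, safe_status, integer(1), fetch = stub_fetch)
```

With the default fetch, links that cannot be reached at all would show up as NA and could be reported separately from the 404s.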
How many HTTP 404 errors did we find? 117
links404 <- names(link_status[link_status == 404])
str(links404)
## chr [1:117] "https://www.rstudio.com/2014/06/18/r-markdown-v2/" ...
head(links404)
## [1] "https://www.rstudio.com/2014/06/18/r-markdown-v2/"
## [2] "https://www.rstudio.com/2014/06/19/interactive-documents-an-incredibly-easy-way-to-use-shiny/"
## [3] "https://www.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/"
## [4] "https://www.rstudio.com/2016/12/02/announcing-bookdown/"
## [5] "https://www.rstudio.com/2017/06/26/bigrquery-0-4-0/"
## [6] "https://www.rstudio.com/2017/09/11/announcing-blogdown/"
tail(links404)
## [1] "https://www.rstudio.com/resources/webinars/introducing-notebooks-with-r-markdown/"
## [2] "https://www.rstudio.com/resources/webinars/shiny-developer-conference/"
## [3] "https://www.rstudio.com/rstudio/download/preview/"
## [4] "https://www.rstudio.com/workshops/applied-machine-learning/"
## [5] "https://www.rstudio.com/workshops/extending-the-tidyverse/"
## [6] "https://www.rstudio.com/workshops/what-they-forgot-to-teach-you-about-r/"
Sanity Check
Let’s do a quick manual check here on at least one of these broken links. How about the first one? (https://www.rstudio.com/2014/06/18/r-markdown-v2/).
To truly sanity check this, we would want to find the original blog post containing the link and try clicking the link in the post.
So which blog post contained this link?
result <- sapply(links, function(x) '/2014/06/18/r-markdown-v2/' %in% x)
result[result]
## https://www.rstudio.com/blog/introducing-ggvis/
## TRUE
And indeed if we navigate to https://www.rstudio.com/blog/introducing-ggvis/ and try clicking on the R Markdown v2 link in the blog post, we are taken to the Page Not Found error page.
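The same lookup can be wrapped in a small function if we want to trace several broken links back to their posts. This is a sketch: find_posts is a hypothetical helper, and toy_links is a made-up stand-in for the links list built earlier (post url mapped to the links found in that post).

```r
# Hypothetical helper: given a link exactly as it appears in the page source,
# return the posts whose link vectors contain it.
find_posts <- function(target, links) {
  names(links)[vapply(links, function(x) target %in% x, logical(1))]
}

# Toy stand-in for the real `links` list (illustration only)
toy_links <- list(
  'https://www.rstudio.com/blog/post-a/' = c('/2014/06/18/r-markdown-v2/',
                                             'https://zotero.org'),
  'https://www.rstudio.com/blog/post-b/' = c('https://zotero.org')
)

find_posts('/2014/06/18/r-markdown-v2/', toy_links)
```

Note that the target has to be given in its original form (relative here), since that is how it is stored in links.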
Concluding Remarks
To conclude, let’s circle back to our original questions.
- How many HTTP 404 (Page Not Found) errors like this exist in the RStudio Blog?
We ended up limiting scope to only rstudio.com links and found that 117 of over 1500 unique links are currently returning a 404 error. Furthermore, there are two links to doc.rstudio.com that return a Can't Find the Server error.
- Which links are broken?
A full list of the broken links is printed in the appendix (sorted alphabetically).
- Could these broken links be easily fixed?
The RStudio blog does not appear to be open source, thus we cannot create a PR to fix the links. However, many of the broken links seem to follow certain patterns, mostly related to the absence or presence of dates, that could hopefully be easily fixed. Consider these two examples.
EXAMPLE 1
This blog post, after redirects, is linked to https://www.rstudio.com/blog/driving-real-lasting-value-with-serious-data-science/ but the correct link should have a yyyy-mm-dd in the slug: https://www.rstudio.com/blog/2020-05-19-driving-real-lasting-value-with-serious-data-science/
EXAMPLE 2
This other blog post is linked to https://www.rstudio.com/2014/06/18/r-markdown-v2/ and in this case the /yyyy/mm/dd/ folders need to be replaced with simply /blog/. The correct link should be https://www.rstudio.com/blog/r-markdown-v2/.
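The pattern in Example 2 is regular enough to sketch as a one-line rewrite. The helper suggest_blog_url below is hypothetical, and its output is only a candidate: whether the rewritten url actually resolves would still need to be checked link by link.

```r
# Hypothetical helper: swap a leading /yyyy/mm/dd/ path on www.rstudio.com
# for /blog/. Urls that do not match the pattern are returned unchanged.
suggest_blog_url <- function(url) {
  sub('^(https://www\\.rstudio\\.com)/[0-9]{4}/[0-9]{2}/[0-9]{2}/',
      '\\1/blog/', url)
}

suggest_blog_url('https://www.rstudio.com/2014/06/18/r-markdown-v2/')
## [1] "https://www.rstudio.com/blog/r-markdown-v2/"
```

Example 1 would need a different rule (moving the date into the slug), so a single rewrite is unlikely to cover every broken link.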
In other words, AFAIK these links cannot be fixed by the community (i.e. me via a PR), but someone with access could presumably fix this up with a little effort. Also, at this point, we’d likely want someone with domain knowledge of the RStudio Blog who could combine these results with their knowledge to determine next steps.
Thank you for reading, I hope you have enjoyed this analysis!
Appendix
Here is the full vector of 404 error links.
links404
## [1] "https://www.rstudio.com/2014/06/18/r-markdown-v2/"
## [2] "https://www.rstudio.com/2014/06/19/interactive-documents-an-incredibly-easy-way-to-use-shiny/"
## [3] "https://www.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/"
## [4] "https://www.rstudio.com/2016/12/02/announcing-bookdown/"
## [5] "https://www.rstudio.com/2017/06/26/bigrquery-0-4-0/"
## [6] "https://www.rstudio.com/2017/09/11/announcing-blogdown/"
## [7] "https://www.rstudio.com/2017/09/13/rstudio-v1.1-the-little-things/"
## [8] "https://www.rstudio.com/2018/09/19/radix-for-r-markdown/"
## [9] "https://www.rstudio.com/2018/11/19/rstudio-1-2-preview-the-little-things/"
## [10] "https://www.rstudio.com/2019/01/17/announcing-rstudio-connect-1-7-0/"
## [11] "https://www.rstudio.com/2020/03/17/rstudio-1-3-the-little-things/"
## [12] "https://www.rstudio.com/2020/07/17/rstudio-global-2021"
## [13] "https://www.rstudio.com/2020/07/17/rstudio-global-call-for-talks"
## [14] "https://www.rstudio.com/2020/09/30/rstudio-v1-4-preview-visual-markdown-editing/"
## [15] "https://www.rstudio.com/2020/11/09/rstudio-1-4-preview-citations/"
## [16] "https://www.rstudio.com/2020/12/07/distill/"
## [17] "https://www.rstudio.com/2021/01/18/blogdown-v1.0/"
## [18] "https://www.rstudio.com/2021/02/04/rstudio-cloud1/"
## [19] "https://www.rstudio.com/2021/06/02/announcing-rstudio-workbench/"
## [20] "https://www.rstudio.com/2021/06/02/rstudio-workbench-vscode-sessions/"
## [21] "https://www.rstudio.com/2021/06/24/winners-of-the-3rd-annual-shiny-contest/"
## [22] "https://www.rstudio.com/s/photos/brian-mcgowan-tomorrowland?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"
## [23] "https://www.rstudio.com/s/photos/match?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"
## [24] "http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html"
## [25] "http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html"
## [26] "http://cran.rstudio.com/web/packages/shiny/NEWS"
## [27] "http://docs.rstudio.com/connect/1.4.5/admin/user-management.html#user-roles"
## [28] "http://pins.rstudio.com/articles/advanced-versions.html"
## [29] "http://pins.rstudio.com/articles/boards-azure.html"
## [30] "http://pins.rstudio.com/articles/boards-dospace.html"
## [31] "http://pins.rstudio.com/articles/boards-gcloud.html"
## [32] "http://pins.rstudio.com/articles/boards-kaggle.html"
## [33] "http://pins.rstudio.com/articles/boards-rsconnect.html"
## [34] "http://pins.rstudio.com/articles/boards-s3.html"
## [35] "http://rstudio.com/training/curriculum/advanced-r-programming.html"
## [36] "http://rstudio.com/training/curriculum/effective-data-visualization.html"
## [37] "http://rstudio.com/training/curriculum/package-development.html"
## [38] "http://rstudio.com/training/curriculum/reports-and-reproducible-research.html"
## [39] "http://rstudio.com/training/on-site.html"
## [40] "http://rstudio.com/training/philosophy.html"
## [41] "http://rstudio.com/training/public-courses.html"
## [42] "http://rstudio.com/training/trainers.html"
## [43] "https://blog.rstudio.com/2019/01/18/summer-internships-2019/"
## [44] "https://blog.rstudio.com/2019/02/28/rstudio-instructor-training/"
## [45] "https://blog.rstudio.com/2019/05/21/rstudio-instructor-training-updates/"
## [46] "https://blog.rstudio.com/2020/05/19/driving-real-lasting-value-with-serious-data-science/"
## [47] "https://blog.rstudio.com/2020/07/09/why-you-need-a-world-class-ide-to-do-serious-data-science/"
## [48] "https://blog.rstudio.com/2020/09/15/announcing-the-2020-rstudio-table-contest/"
## [49] "https://blog.rstudio.com/2020/12/07/package-manager-1-2-0/"
## [50] "https://blog.rstudio.com/tags/bi-tools/"
## [51] "https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html"
## [52] "https://docs.rstudio.com/connect/1.6.11/admin/python.html"
## [53] "https://docs.rstudio.com/connect/admin/authentication.html#authentication-oauth2"
## [54] "https://docs.rstudio.com/connect/admin/authentication.html#authentication-saml"
## [55] "https://docs.rstudio.com/connect/admin/authentication.html#change-auth-provider"
## [56] "https://docs.rstudio.com/connect/admin/cli.html#cli-usermanager"
## [57] "https://docs.rstudio.com/connect/admin/getting-started.html#need-help"
## [58] "https://pins.rstudio.com/articles/boards-azure.html"
## [59] "https://pins.rstudio.com/articles/boards-gcloud.html"
## [60] "https://pins.rstudio.com/articles/boards-kaggle.html"
## [61] "https://pins.rstudio.com/articles/boards-rsconnect.html"
## [62] "https://pins.rstudio.com/articles/boards-s3.html"
## [63] "https://pins.rstudio.com/articles/boards-websites.html"
## [64] "https://pins.rstudio.com/articles/pins-rstudio.html"
## [65] "https://resources.rstudio.com/rstudio-conf-2020/value-in-data-science-beyond-models-in-production-eduardo-arino-de-la-rubia"
## [66] "https://rmarkdown.rstudio.com/flexdashboard/examples.html"
## [67] "https://rmarkdown.rstudio.com/flexdashboard/layouts.html"
## [68] "https://rmarkdown.rstudio.com/flexdashboard/shiny.html"
## [69] "https://rmarkdown.rstudio.com/flexdashboard/using.html"
## [70] "https://rmarkdown.rstudio.com/flexdashboard/using.html#storyboards"
## [71] "https://rstudio.com/resources/rstudioconf-2020/making-the-shiny-contest/"
## [72] "https://rstudio.com/resources/rstudioconf-2020/value-in-data-science-beyond-models-in-production/"
## [73] "https://rstudio.com/resources/rstudioglobal-2021/,"
## [74] "https://shiny.rstudio.com/articles/single-file.html"
## [75] "https://shiny.rstudio.com/articles/upgrade-0.14.html#full-changelog"
## [76] "https://shiny.rstudio.com/gallery/widgets-gallery.html"
## [77] "https://shiny.rstudio.com/reference/shiny/latest/removeUI.html"
## [78] "https://shiny.rstudio.com/reference/shiny/latest/showReactLog.html"
## [79] "https://solutions.rstudio.com/2019/12/30/rstudio-connect-custom-emails-with-blastula/"
## [80] "https://solutions.rstudio.com/data-science-admin/deploy/apis/"
## [81] "https://solutions.rstudio.com/deploy/overview/"
## [82] "https://solutions.rstudio.com/deploy/promote/"
## [83] "https://solutions.rstudio.com/examples/jobs-overview/"
## [84] "https://solutions.rstudio.com/examples/rest-apis-overview/#log-details-about-api-requests-and-responses"
## [85] "https://solutions.rstudio.com/examples/rsc-apis/acl-audit-report"
## [86] "https://solutions.rstudio.com/examples/rsc-apis/basic-audit-report"
## [87] "https://solutions.rstudio.com/examples/rsc-apis/tag-audit-report"
## [88] "https://solutions.rstudio.com/examples/rsc-apis/vanity-audit-report"
## [89] "https://solutions.rstudio.com/examples/rsc-server-api-overview/"
## [90] "https://solutions.rstudio.com/launcher/kubernetes/"
## [91] "https://solutions.rstudio.com/launcher/kubernetes/#want-to-learn-more-about-rstudio-server-pro-and-kubernetes"
## [92] "https://solutions.rstudio.com/production/integrations/"
## [93] "https://spark.rstudio.com/articles/guides-distributed-r.html"
## [94] "https://spark.rstudio.com/deployment_examples.html"
## [95] "https://spark.rstudio.com/h2o.html"
## [96] "https://spark.rstudio.com/images/sparklyr-cheatsheet.pdf"
## [97] "https://spark.rstudio.com/mllib.html"
## [98] "https://tensorflow.rstudio.com/gallery/"
## [99] "https://tensorflow.rstudio.com/learn/examples.html"
## [100] "https://www.rstudio.com/about/news-events/"
## [101] "https://www.rstudio.com/conference/rstudioconf-tickets/"
## [102] "https://www.rstudio.com/ide/docs/authoring/using_markdown.html"
## [103] "https://www.rstudio.com/ide/docs/release_notes_v0.97.html"
## [104] "https://www.rstudio.com/ide/docs/release_notes_v0.98.html"
## [105] "https://www.rstudio.com/ide/download/server-pro-evaluation.html"
## [106] "https://www.rstudio.com/products/rstudio-server-pro2/"
## [107] "https://www.rstudio.com/resources/videos/debugging-techniques/"
## [108] "https://www.rstudio.com/resources/videos/plumbing-apis-with-plumber/"
## [109] "https://www.rstudio.com/resources/videos/scaling-shiny-apps-with-async-programming-june-2018/"
## [110] "https://www.rstudio.com/resources/videos/scaling-shiny/"
## [111] "https://www.rstudio.com/resources/webinars/introducing-an-r-interface-for-apache-spark/"
## [112] "https://www.rstudio.com/resources/webinars/introducing-notebooks-with-r-markdown/"
## [113] "https://www.rstudio.com/resources/webinars/shiny-developer-conference/"
## [114] "https://www.rstudio.com/rstudio/download/preview/"
## [115] "https://www.rstudio.com/workshops/applied-machine-learning/"
## [116] "https://www.rstudio.com/workshops/extending-the-tidyverse/"
## [117] "https://www.rstudio.com/workshops/what-they-forgot-to-teach-you-about-r/"