What are the most popular keywords of CRAN packages?

2014-07-25 Kun Ren

r scraping

Original post: https://renkun-ken.github.io/blog/2014/07/25/what-are-the-most-popular-keywords-of-cran-packages.html


(This article is adapted to the latest version of the rvest package.)

A large proportion of R's power is attributable to the enormous number of extension packages, many of which are published on CRAN.

These packages cover a wide range of fields. In this post, I'll show you how to use R to scrape the titles of all CRAN packages from the web page and find out which keywords are the most popular.

To minimize effort, we do our best to avoid reinventing the wheel and get an answer as quickly as possible, using only existing packages to do all the work.

Here is the toolbox we will use for this task:

  • rvest: Scrape content from a web page by selector
  • rlist: Quickly map and filter lists in a functional style
  • pipeR: Pipe all operations at high performance
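
All three packages are available from CRAN. If any of them are missing from your library, a one-time install (assuming a CRAN mirror is configured) is all it takes:

install.packages(c("rvest", "rlist", "pipeR"))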

First, we equip our R environment with these tools.

library(rvest)
library(rlist)
library(pipeR)

Then we download and parse the web page.

url <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
page <- read_html(url)

Now page is a parsed HTML document that is well structured and ready to query. Note that we need the text in the third column of the table. Here we use XPath to locate the information we want; you could use a CSS selector to do the same job, as the sketch below shows.
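
For illustration, here is a rough CSS-selector equivalent. This is a minimal sketch: the tr td:nth-child(3) selector assumes each row stores the title in its third cell, and since CSS cannot address text nodes directly, we extract the text from the cell itself.

# hypothetical CSS-selector alternative to the XPath query
titles <- page %>>%
  html_nodes("tr td:nth-child(3)") %>>%
  html_text(trim = TRUE)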

The following code is written in fluent style with a pipeline.

words <- page %>>%
  # select the text nodes in the 3rd column of the table
  html_nodes(xpath = "//tr//td[3]//text()") %>>%
  list.map( # map each node to ...
    # 1. get the text of the XML node
      xml2::xml_text(.) %>>%
      # 2. split the text by non-letter characters
      strsplit("[^a-zA-Z]") %>>%
      # 3. put everything together in a vector
      unlist(use.names = FALSE) %>>%
      # 4. lower-case all words
      tolower %>>%
      # 5. keep only words with more than 3 letters, which tend to be meaningful
      list.filter(nchar(.) > 3L)) %>>%
  # put everything in a large character vector
  unlist %>>%
  # create a table of word counts
  table %>>%
  # sort the table in decreasing order
  sort(decreasing = TRUE) %>>%
  # take the first 100 elements
  head(100) %>>%
  # print the results
  print

          data       analysis         models           with      functions 
           864            718            484            404            371 
       package     regression     estimation          model          based 
           336            308            273            249            238 
         using          tools           from       bayesian         linear 
           235            225            194            173            169 
       methods           time      interface   multivariate    statistical 
           169            168            160            133            124 
          test    generalized     clustering          tests         series 
           114            112            105            105            104 
     inference     statistics         random   distribution      selection 
           101            101            100             97             96 
      modeling        spatial      algorithm       multiple     simulation 
            89             89             87             87             82 
         mixed         method     likelihood  distributions      modelling 
            81             78             77             76             73 
       network           sets classification        mixture       sampling 
            72             70             68             67             64 
       effects         robust         sparse       survival       variable 
            63             63             60             60             60 
          high        fitting           gene       function   optimization 
            58             57             57             56             56 
     graphical        testing       networks          files  nonparametric 
            55             55             54             52             52 
         plots         sample    dimensional        genetic          multi 
            52             52             51             51             51 
     utilities  visualization implementation        density         matrix 
            51             51             50             49             49 
  hierarchical          lasso       learning         markov    correlation 
            48             48             48             48             47 
       dynamic           plot     prediction       censored           meta 
            47             47             47             46             46 
      datasets       gaussian       response       adaptive    association 
            45             45             45             44             44 
        binary         design          least         normal         system 
            44             44             43             43             43 
          fast     functional          point       analyses     confidence 
            42             42             42             41             41 
   experiments       graphics        objects     population        process 
            41             41             41             41             41 

The work is done in 12 lines of code and in only a little more than 2 seconds!
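
As a quick sanity check on the result, you could visualize the most frequent words with base graphics. A minimal sketch, assuming the words table computed above:

# plot the 20 most frequent title words; las = 2 makes the labels vertical
barplot(head(words, 20), las = 2)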

If you want to know more about these packages, please visit their project pages. I hope you can do even more amazing things in your own work.