rlist: a new package for working with list objects in R

2014-06-26 Kun Ren

Original link: https://renkun-ken.github.io/blog/2014/06/26/rlist-a-new-package-for-working-with-list-objects-in-r.html


In recent years, non-relational data have attracted more and more attention. Roughly speaking, any dataset that is hard to fit into a rectangular table of rows and columns is a non-relational dataset.

The following data is a very simple non-relational dataset in JSON format. It contains information about three programmers, each of whom has a name, an age, some interests, and a list of programming languages with the number of years used.

{
    "p1" : {
        "name" : "Ken",
        "age" : 24,
        "interest" : [
            "reading",
            "music",
            "movies"
        ],
        "lang" : {
            "r" : 2,
            "csharp" : 4,
            "python" : 3
        }
    },
    "p2" : {
        "name" : "James",
        "age" : 25,
        "interest" : [
            "sports",
            "music"
        ],
        "lang" : {
            "r" : 3,
            "java" : 2,
            "cpp" : 5
        }
    },
    "p3" : {
        "name" : "Penny",
        "age" : 24,
        "interest" : [
            "movies",
            "reading"
        ],
        "lang" : {
            "r" : 1,
            "cpp" : 4,
            "python" : 2
        }
    }
}

It takes effort to fit such a dataset into several relational data tables. If we really need to do so, we may create a table of names and ages, a table of interests, and a table of languages, and use some relations to represent how the records in each table correspond to each other.
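To make the cost of that relational layout concrete, here is a minimal base R sketch of the three tables described above. The table names, the `id` key column, and the `years` column are assumptions chosen for illustration; they are not part of the original dataset.

```r
# Hypothetical relational layout of the three programmers (base R only).
# One table per kind of record, linked by an assumed key column `id`.
people <- data.frame(
  id   = c("p1", "p2", "p3"),
  name = c("Ken", "James", "Penny"),
  age  = c(24, 25, 24),
  stringsAsFactors = FALSE
)

interests <- data.frame(
  id       = c("p1", "p1", "p1", "p2", "p2", "p3", "p3"),
  interest = c("reading", "music", "movies", "sports", "music",
               "movies", "reading"),
  stringsAsFactors = FALSE
)

languages <- data.frame(
  id    = c("p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3", "p3"),
  lang  = c("r", "csharp", "python", "r", "java", "cpp",
            "r", "cpp", "python"),
  years = c(2, 4, 3, 3, 2, 5, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Even a simple question such as "who has used R for at least 2 years?"
# already requires joining two tables on the key.
r_rows  <- languages[languages$lang == "r" & languages$years >= 2, ]
r_users <- merge(people, r_rows, by = "id")
r_users$name
# [1] "Ken"   "James"
```

Three tables and a join are needed for data that fits naturally into a single nested structure.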

One of the most popular solutions for processing non-relational data structures is MongoDB, which stores such data in JSON/BSON format and uses a similar syntax to query it.

In R, a list object is powerful enough to represent a wide range of non-relational datasets like this one. Over the past month, I have been working on a new package called rlist, hosted on GitHub. It is a set of tools for working with list objects.

This package has two main goals:

  • Make it easier to work with list objects used to store data in more flexible structures than data frames.
  • Perform a wide range of functions on non-relational data using list constructs.

Installation

You can install this package from CRAN with

install.packages("rlist")

or install the latest development version from GitHub with

devtools::install_github("renkun-ken/rlist")

Functions

The package provides a wide range of functions to work with list objects. Suppose we work with the developers dataset we just mentioned.

library(rlist)
devs <-
  list(
    p1=list(name="Ken",age=24,
      interest=c("reading","music","movies"),
      lang=list(r=2,csharp=4,python=3)),
    p2=list(name="James",age=25,
      interest=c("sports","music"),
      lang=list(r=3,java=2,cpp=5)),
    p3=list(name="Penny",age=24,
      interest=c("movies","reading"),
      lang=list(r=1,cpp=4,python=2)))

Filtering

Filter members whose age is no less than 25 by calling list.filter.

str(list.filter(devs,age >= 25))
List of 1
 $ p2:List of 4
  ..$ name    : chr "James"
  ..$ age     : num 25
  ..$ interest: chr [1:2] "sports" "music"
  ..$ lang    :List of 3
  .. ..$ r   : num 3
  .. ..$ java: num 2
  .. ..$ cpp : num 5
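
For comparison, the same filter can be sketched in base R with Filter and an explicit function. This is only a rough counterpart to show what the expression age >= 25 is evaluated against; it is not how rlist is implemented.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

# Base R counterpart of list.filter(devs, age >= 25):
# the condition is written as a function of the member `p`.
older <- Filter(function(p) p$age >= 25, devs)
names(older)
# [1] "p2"
```

With list.filter, the fields of each member can be referred to directly, so the wrapper function and the p$ prefix are not needed.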

Mapping

Get the name of each person by calling list.map, which evaluates an expression for each member.

list.map(devs, name)
$p1
[1] "Ken"

$p2
[1] "James"

$p3
[1] "Penny"

Get the programming language each person has been using for the longest time by calling list.map.

list.map(devs, sort(unlist(lang), decreasing = TRUE)[1])
$p1
csharp 
     4 

$p2
cpp 
  5 

$p3
cpp 
  4 
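
Both mappings have a rough base R counterpart in lapply. The sketch below is for comparison only and uses an explicit function of the member p in place of the bare expression.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

# Counterpart of list.map(devs, name)
names_only <- lapply(devs, function(p) p$name)

# Counterpart of list.map(devs, sort(unlist(lang), decreasing = TRUE)[1])
top_lang <- lapply(devs, function(p) {
  sort(unlist(p$lang), decreasing = TRUE)[1]
})
names(top_lang$p1)
# [1] "csharp"
```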

Selecting

Select the name and age of each member by calling list.select.

str(list.select(devs,name,age))
List of 3
 $ p1:List of 2
  ..$ name: chr "Ken"
  ..$ age : num 24
 $ p2:List of 2
  ..$ name: chr "James"
  ..$ age : num 25
 $ p3:List of 2
  ..$ name: chr "Penny"
  ..$ age : num 24

Select the name and compute the range of the number of years each member has used their programming languages.

str(list.select(devs,name,score.range=range(unlist(lang))))
List of 3
 $ p1:List of 2
  ..$ name       : chr "Ken"
  ..$ score.range: num [1:2] 2 4
 $ p2:List of 2
  ..$ name       : chr "James"
  ..$ score.range: num [1:2] 2 5
 $ p3:List of 2
  ..$ name       : chr "Penny"
  ..$ score.range: num [1:2] 1 4

Grouping

Build a list of sublists, each representing an age group, by calling list.group.

str(list.group(devs,age))
List of 2
 $ 24:List of 2
  ..$ p1:List of 4
  .. ..$ name    : chr "Ken"
  .. ..$ age     : num 24
  .. ..$ interest: chr [1:3] "reading" "music" "movies"
  .. ..$ lang    :List of 3
  .. .. ..$ r     : num 2
  .. .. ..$ csharp: num 4
  .. .. ..$ python: num 3
  ..$ p3:List of 4
  .. ..$ name    : chr "Penny"
  .. ..$ age     : num 24
  .. ..$ interest: chr [1:2] "movies" "reading"
  .. ..$ lang    :List of 3
  .. .. ..$ r     : num 1
  .. .. ..$ cpp   : num 4
  .. .. ..$ python: num 2
 $ 25:List of 1
  ..$ p2:List of 4
  .. ..$ name    : chr "James"
  .. ..$ age     : num 25
  .. ..$ interest: chr [1:2] "sports" "music"
  .. ..$ lang    :List of 3
  .. .. ..$ r   : num 3
  .. .. ..$ java: num 2
  .. .. ..$ cpp : num 5
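
A comparable grouping can be sketched in base R by extracting the key into a vector and passing it to split. This is a rough counterpart for comparison, not rlist's implementation; the helper vector ages is introduced here for illustration.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

# Extract the grouping key of each member, then split the list by it.
ages   <- vapply(devs, function(p) p$age, numeric(1))
by_age <- split(devs, ages)

names(by_age)
# [1] "24" "25"
names(by_age[["24"]])
# [1] "p1" "p3"
```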

Sorting

Sort the developers by the number of interests in descending order, then by the number of years they have been using R in descending order by calling list.sort.

str(list.sort(devs,desc(length(interest)),desc(lang$r)))
List of 3
 $ p1:List of 4
  ..$ name    : chr "Ken"
  ..$ age     : num 24
  ..$ interest: chr [1:3] "reading" "music" "movies"
  ..$ lang    :List of 3
  .. ..$ r     : num 2
  .. ..$ csharp: num 4
  .. ..$ python: num 3
 $ p2:List of 4
  ..$ name    : chr "James"
  ..$ age     : num 25
  ..$ interest: chr [1:2] "sports" "music"
  ..$ lang    :List of 3
  .. ..$ r   : num 3
  .. ..$ java: num 2
  .. ..$ cpp : num 5
 $ p3:List of 4
  ..$ name    : chr "Penny"
  ..$ age     : num 24
  ..$ interest: chr [1:2] "movies" "reading"
  ..$ lang    :List of 3
  .. ..$ r     : num 1
  .. ..$ cpp   : num 4
  .. ..$ python: num 2
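
The same two-key ordering can be sketched in base R with order. The sort keys are extracted into vectors first, and negating them gives descending order. The helper names n_interest and r_years are assumptions for illustration.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

# One sort key per member.
n_interest <- vapply(devs, function(p) length(p$interest), integer(1))
r_years    <- vapply(devs, function(p) p$lang$r, numeric(1))

# Primary key: number of interests (descending);
# ties are broken by years of R (descending).
sorted <- devs[order(-n_interest, -r_years)]
names(sorted)
# [1] "p1" "p2" "p3"
```

James and Penny both have two interests, so the second key decides their order: James has used R for 3 years and Penny for 1.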

Updating

Use list.update to update the list by removing the age and lang fields and adding the number of languages each member uses as nlang.

str(list.update(devs,age=NULL,lang=NULL,nlang=length(lang)))
List of 3
 $ p1:List of 3
  ..$ name    : chr "Ken"
  ..$ interest: chr [1:3] "reading" "music" "movies"
  ..$ nlang   : int 3
 $ p2:List of 3
  ..$ name    : chr "James"
  ..$ interest: chr [1:2] "sports" "music"
  ..$ nlang   : int 3
 $ p3:List of 3
  ..$ name    : chr "Penny"
  ..$ interest: chr [1:2] "movies" "reading"
  ..$ nlang   : int 3
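
A rough base R counterpart of this update is an lapply that modifies and returns each member. In this sketch nlang has to be computed before lang is removed, because the statements run one after another.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

updated <- lapply(devs, function(p) {
  p$nlang <- length(p$lang)  # count languages while lang still exists
  p$age   <- NULL            # assigning NULL drops a field
  p$lang  <- NULL
  p
})

names(updated$p1)
# [1] "name"     "interest" "nlang"
```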

More functions

Many more functions are provided than these examples show. Please read the package documentation for the full list.

Working with pipeline

Pipeline operators can greatly improve the readability of code, especially when a chain of commands is executed. The pipeR package is recommended for use with this package.

The following code selects the developers whose age is no more than 24 and creates a data frame in which they are sorted by the number of years using R in descending order. Each row gives the name, the years of using R, and the longest time the developer has used any language they know.

library(pipeR)
devs %>>%
  list.filter(age <= 24) %>>%
  list.sort(desc(lang$r)) %>>%
  list.map(data.frame(name=name,r=lang$r,
    longest=max(unlist(lang)))) %>>%
  list.rbind
    name r longest
p1   Ken 2       4
p3 Penny 1       4
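
To show what the pipeline saves, here is a sketch of the same steps in base R without any pipe operator. Each step needs an intermediate variable or a nested call. The variable names young, rows, and result are assumptions for illustration.

```r
# The developers dataset from above, repeated so the snippet runs on its own.
devs <- list(
  p1 = list(name = "Ken", age = 24,
            interest = c("reading", "music", "movies"),
            lang = list(r = 2, csharp = 4, python = 3)),
  p2 = list(name = "James", age = 25,
            interest = c("sports", "music"),
            lang = list(r = 3, java = 2, cpp = 5)),
  p3 = list(name = "Penny", age = 24,
            interest = c("movies", "reading"),
            lang = list(r = 1, cpp = 4, python = 2)))

# Step 1: keep developers aged 24 or younger.
young <- Filter(function(p) p$age <= 24, devs)

# Step 2: sort by years of R, descending.
young <- young[order(-vapply(young, function(p) p$lang$r, numeric(1)))]

# Step 3: build a one-row data frame per developer.
rows <- lapply(young, function(p) {
  data.frame(name = p$name, r = p$lang$r,
             longest = max(unlist(p$lang)),
             stringsAsFactors = FALSE)
})

# Step 4: stack the rows into a single data frame.
result <- do.call(rbind, rows)
result$name
# [1] "Ken"   "Penny"
```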

Lambda expression

Most functions in this package support lambda expressions like x ~ f(x), where x refers to the list member itself. If no lambda symbol is given, . represents the member by default.

nums <- list(a=c(1,2,3),b=c(2,3,4),c=c(3,4,5))
nums %>>%
  list.map(data.frame(min=min(.),max=max(.))) %>>%
  list.rbind
  min max
a   1   3
b   2   4
c   3   5
nums %>>%
  list.map(x ~ sum(x))
$a
[1] 6

$b
[1] 9

$c
[1] 12
nums %>>%
  list.filter(x ~ mean(x)>=3)
$b
[1] 2 3 4

$c
[1] 3 4 5
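
For readers more familiar with base R, the lambda form x ~ f(x) plays roughly the same role as an anonymous function(x) f(x). The following sketch reproduces the last two results with lapply and Filter; it is a comparison, not a description of how rlist evaluates lambda expressions.

```r
nums <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(3, 4, 5))

# Counterpart of list.map(x ~ sum(x))
sums <- lapply(nums, function(x) sum(x))
unlist(sums)
#  a  b  c
#  6  9 12

# Counterpart of list.filter(x ~ mean(x) >= 3)
big <- Filter(function(x) mean(x) >= 3, nums)
names(big)
# [1] "b" "c"
```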

Conclusion

The rlist package can be used to work with list objects in a flexible and streamlined manner. It works together with many other packages, such as pipeR, plyr, and dplyr.