R (programming language)
R is a programming language for statistical computing and data visualization. It has been widely adopted in the fields of data mining, bioinformatics, data analysis, and data science.
The core R language is extended by a large number of software packages, which contain reusable code, documentation, and sample data. Some of the most popular R packages are in the tidyverse collection, which enhances functionality for visualizing, transforming, and modelling data, and makes programming easier.
R is free and open-source software distributed under the GNU General Public License. The language is implemented primarily in C, Fortran, and R itself. Precompiled executables are available for the major operating systems.
Its core is an interpreted language with a native command line interface. In addition, multiple third-party applications are available as graphical user interfaces; such applications include RStudio and Jupyter, as well as Termux and Google Colab for mobile devices.
History
R was started by professors Ross Ihaka and Robert Gentleman as a programming language to teach introductory statistics at the University of Auckland. The language was inspired by the S programming language, with most S programs able to run unaltered in R. The language was also inspired by Scheme's lexical scoping, allowing for local variables.
The name of the language, R, reflects both its status as a successor to S and the shared first letter of its authors' names, Ross and Robert. In August 1993, Ihaka and Gentleman posted a binary file of R on StatLib, a data archive website. At the same time, they announced the posting on the s-news mailing list. On 5 December 1997, R became a GNU project when version 0.60 was released. On 29 February 2000, the 1.0 version was released.
Packages
R packages are collections of functions, documentation, and data that expand R. For example, packages can add reporting features and support for various statistical techniques. Ease of package installation and use has contributed to the language's adoption in data science.
Immediately available when starting R after installation, base packages provide the fundamental and necessary syntax and commands for programming, computing, graphics production, basic arithmetic, and statistical functionality.
An example is the tidyverse collection of R packages, which bundles several subsidiary packages to provide a common API. The collection specializes in tasks related to accessing and processing "tidy data", which are data contained in a two-dimensional table with a single row for each observation and a single column for each variable.
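The tidy layout described above can be sketched in base R alone, without loading any tidyverse package. The data values and the names `wide` and `tidy` below are made up for illustration.

```r
# Hypothetical measurements: one row per subject, one column per year.
# This layout is not tidy, because the variable "year" is hidden in column names.
wide <- data.frame(
  subject = c("s1", "s2"),
  y2020   = c(5.1, 6.3),
  y2021   = c(5.4, 6.0)
)

# Tidy layout: one row per observation (a subject-year pair)
# and one column per variable (subject, year, value).
tidy <- data.frame(
  subject = rep(wide$subject, times = 2),
  year    = rep(c(2020, 2021), each = nrow(wide)),
  value   = c(wide$y2020, wide$y2021)
)

print(tidy)
```

In practice, tidyverse packages provide dedicated functions for this kind of reshaping; the manual construction here is only meant to show what the target shape looks like.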
A package needs to be installed only once. For example, to install the tidyverse collection:
> install.packages("tidyverse")
To load the functions, data, and documentation of a package, one calls the library function. To load the tidyverse collection, one can execute the following code:
> # The package name can be enclosed in quotes
> library("tidyverse")
> # But the package name can also be used without quotes
> library(tidyverse)
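Because installation happens once while loading happens in every session, scripts often check for a package before installing it. The following is a minimal sketch of that pattern; `load_or_install` is a hypothetical helper name, not a function provided by R, and the installation branch assumes a configured CRAN mirror and network access.

```r
# Hypothetical helper: attach a package, installing it first only if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not yet installed
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call does not trigger an installation.
load_or_install("stats")
```

The same call with "tidyverse" as the argument would install the collection on first use and simply attach it afterwards.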
The Comprehensive R Archive Network (CRAN) was founded in 1997 by Kurt Hornik and Friedrich Leisch to host R's source code, executable files, documentation, and user-created packages. CRAN's name and scope mimic the Comprehensive TeX Archive Network and the Comprehensive Perl Archive Network. CRAN originally had only three mirror sites and twelve contributed packages. It has since grown to 90 mirrors and 22,390 contributed packages. Packages are also available in repositories such as R-Forge, Omegahat, and GitHub.
To provide guidance on the CRAN web site, its Task Views area lists packages that are relevant for specific topics; sample topics include causal inference, finance, genetics, high-performance computing, machine learning, medical imaging, meta-analysis, social sciences, and spatial statistics.
The Bioconductor project provides packages for genomic data analysis, complementary DNA, microarray, and high-throughput sequencing methods.
Community
There are three main groups that help support R software development:
- The R Core Team was founded in 1997 to maintain the R source code.
- The R Foundation for Statistical Computing was founded in April 2003 to provide financial support.
- The R Consortium is a Linux Foundation project to develop R infrastructure.
The R community hosts many conferences and in-person meetups. These groups include:
- UseR!: an annual international R user conference
- Directions in Statistical Computing
- R-Ladies: an organization to promote gender diversity in the R community
- SatRdays: R-focused conferences held on Saturdays
- Data Science & AI Conferences
- posit::conf
On social media, the hashtag #rstats can be used to follow new developments in the R community.
Examples
Hello, World!
The following is a "Hello, World!" program:
> print("Hello, World!")
[1] "Hello, World!"
An alternative version uses the cat function:
> cat("Hello, World!\n")
Hello, World!
Basic syntax
The following examples illustrate the basic syntax of the language and use of the command-line interface.
In R, the generally preferred assignment operator is an arrow made from two characters, <-, although = can be used in some cases.
> x <- 1:6 # Create a numeric vector in the current environment
> y <- x^2 # Similarly, create a vector based on the values in x.
> y # Print the vector's contents.
[1]  1  4  9 16 25 36
> z <- x + y # Create a new vector that is the sum of x and y
> z # Print the contents of z.
[1]  2  6 12 20 30 42
> z_matrix <- matrix(z, nrow = 3) # Create a new matrix that transforms the vector z into a 3x2 matrix object
> z_matrix
     [,1] [,2]
[1,]    2   20
[2,]    6   30
[3,]   12   42
> 2 * t(z_matrix) - 2 # Transpose the matrix; multiply every element by 2; subtract 2 from each element in the matrix; and then return the results to the terminal.
     [,1] [,2] [,3]
[1,]    2   10   22
[2,]   38   58   82
> new_df <- data.frame(t(z_matrix), row.names = c("A", "B")) # Create a new dataframe object that contains the data from a transposed z_matrix, with row names 'A' and 'B'
> names(new_df) <- c("X", "Y", "Z") # Set the column names of the new_df dataframe as X, Y, and Z.
> new_df # Print the current results.
   X  Y  Z
A  2  6 12
B 20 30 42
> new_df$Z # Output the Z column
[1] 12 42
> all(new_df$Z == new_df[["Z"]]) && all(new_df[[3]] == new_df$Z) # The dataframe column Z can be accessed using the syntax $Z, [["Z"]], or [[3]], and the values are the same.
[1] TRUE
> attributes(new_df) # Print information about attributes of the new_df dataframe
$names
[1] "X" "Y" "Z"

$row.names
[1] "A" "B"

$class
[1] "data.frame"

> attributes(new_df)$row.names <- c("one", "two") # Access and then change the row.names attribute; this can also be done using the rownames function
> new_df
     X  Y  Z
one  2  6 12
two 20 30 42
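The remark at the start of this section that = can be used "in some cases" is easiest to see inside a function call, where the two operators behave differently. The sketch below uses a hypothetical helper named `square`; it is written for illustration only.

```r
square <- function(n) n^2  # hypothetical helper used only for this illustration

# Inside a call, `=` binds a value to the argument named n.
# No variable called n is created in the calling environment.
a <- square(n = 4)
n_exists_after_equals <- exists("n", inherits = FALSE)

# Inside a call, `<-` first assigns 4 to a variable n in the calling
# environment, and the assigned value is then passed as the argument.
b <- square(n <- 4)
n_exists_after_arrow <- exists("n", inherits = FALSE)

print(c(a = a, b = b))
print(c(after_equals = n_exists_after_equals, after_arrow = n_exists_after_arrow))
```

At the top level of a script, `x = 1` and `x <- 1` both create a variable, which is why the difference tends to matter mainly within calls.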
Structure of a function
R lets users define functions, which package new functionality for reuse. Objects created within the body of the function remain accessible only from within the function, and any data type may be returned. In R, almost all functions and all user-defined functions are closures.
The following is an example of creating a function to perform an arithmetic calculation:
> # The function, named f, returns a linear combination of x and y.
> f <- function(x, y) {
+   z <- 3 * x + 4 * y
+   return(z) # An explicit return is optional here
+ }
> # As an alternative, the last statement executed in a function is returned implicitly.
> f <- function(x, y) 3 * x + 4 * y
The following is some output from using the function defined above:
> f(1, 2) # 3 * 1 + 4 * 2 = 3 + 8
[1] 11
> f(c(1, 2, 3), c(5, 3, 4)) # Element-wise calculation
[1] 23 18 25
> f(1:3, 4) # Equivalent to f(c(1, 2, 3), c(4, 4, 4))
[1] 19 22 25
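The statement that user-defined functions are closures can be illustrated with a function that returns another function. This is a minimal sketch; `make_counter` and the variable names are hypothetical and chosen for illustration.

```r
# make_counter returns a function that remembers the environment it was created in.
make_counter <- function() {
  count <- 0             # local to this particular call of make_counter
  function() {
    count <<- count + 1  # `<<-` updates count in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()  # has its own, independent count

first  <- counter_a()
second <- counter_a()
other  <- counter_b()

print(c(first = first, second = second, other = other))
```

Each counter keeps its own `count`, which is not visible from outside the function; this is the lexical scoping that R borrowed from Scheme.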
It is possible to define functions to be used as infix operators by using the special syntax `%name%`, where "name" is the function variable name:
> `%sumx2y2%` <- function(e1, e2) {e1 ^ 2 + e2 ^ 2}
> 1:3 %sumx2y2% -(1:3)
[1]  2  8 18
Since R version 4.1.0, functions can be written in a short notation, which is useful for passing anonymous functions to higher-order functions:
> sapply(1:5, \(i) i^2) # here \(i) is the same as function(i)
[1]  1  4  9 16 25
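As a further sketch, the short notation can be combined with other higher-order functions from base R, such as Filter and Reduce. The variable names below are illustrative, and the code assumes R 4.1.0 or later.

```r
# Short notation and the function keyword define the same anonymous function.
squares_short <- sapply(1:5, \(i) i^2)
squares_long  <- sapply(1:5, function(i) i^2)

# Filter keeps the elements for which the predicate returns TRUE.
even_squares <- Filter(\(v) v %% 2 == 0, squares_short)

# Reduce folds a two-argument function over the vector, here summing it.
total <- Reduce(\(acc, v) acc + v, squares_short)

print(identical(squares_short, squares_long))
print(even_squares)
print(total)
```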
Native pipe operator
In R version 4.1.0, a native pipe operator, |>, was introduced. This operator allows users to chain functions together, rather than using nested function calls.
> nrow(subset(mtcars, cyl == 4)) # Nested without the pipe character
[1] 11
> mtcars |> subset(cyl == 4) |> nrow() # Using the pipe character
[1] 11
An alternative to nested functions is the use of intermediate objects, rather than the pipe operator:
> mtcars_subset_rows <- subset(mtcars, cyl == 4)
> num_mtcars_subset <- nrow(mtcars_subset_rows)
> num_mtcars_subset
[1] 11
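The native pipe is a syntax transformation: the parser rewrites a piped expression into the corresponding nested call before anything is evaluated. The sketch below checks this by comparing unevaluated expressions; the variable names are illustrative, and the code assumes R 4.1.0 or later.

```r
# quote() captures an expression without evaluating it.
piped  <- quote(mtcars |> subset(cyl == 4) |> nrow())
nested <- quote(nrow(subset(mtcars, cyl == 4)))

# Both forms are parsed into the same call object.
same_call <- identical(piped, nested)
print(piped)
print(same_call)

# Evaluating either expression gives the same row count.
count_piped  <- eval(piped)
count_nested <- eval(nested)
print(c(count_piped, count_nested))
```

Because the rewriting happens at parse time, the right-hand side of |> has to be written as a function call, such as nrow(), rather than a bare function name.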
Object-oriented programming
The R language has native support for object-oriented programming. There are two native frameworks, the so-called S3 and S4 systems. The former, being more informal, supports single dispatch on the first argument, and objects are assigned to a class simply by setting a "class" attribute in each object. The latter is a system like the Common Lisp Object System, with formal classes and generic methods, which supports multiple dispatch and multiple inheritance.
In the example below, summary is a generic function that dispatches to different methods depending on whether its argument is a character vector or a factor:
> data <- c("a", "b", "c", "a", NA)
> summary(data)
   Length     Class      Mode
        5 character character
> summary(as.factor(data))
   a    b    c NA's
   2    1    1    1
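The difference between the two systems can be sketched with user-defined classes. All class, generic, and method names below (`describe`, `circle`, `Circle`, `Square`, `fits_inside`) are hypothetical and exist only for this illustration.

```r
library(methods)  # provides the S4 system; usually attached by default

# S3: a class is only an attribute, and dispatch uses the first argument.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "some object"
describe.circle  <- function(x, ...) paste("circle with radius", x$r)

shape <- structure(list(r = 2), class = "circle")
s3_result <- describe(shape)

# S4: classes are declared formally, and dispatch can use several arguments.
setClass("Circle", slots = c(r = "numeric"))
setClass("Square", slots = c(side = "numeric"))

setGeneric("fits_inside",
           function(inner, outer) standardGeneric("fits_inside"))

# The method is selected from the classes of both arguments (multiple dispatch).
setMethod("fits_inside", signature("Circle", "Square"),
          function(inner, outer) 2 * inner@r <= outer@side)
setMethod("fits_inside", signature("Square", "Circle"),
          function(inner, outer) inner@side * sqrt(2) <= 2 * outer@r)

s4_result <- fits_inside(new("Circle", r = 1), new("Square", side = 3))

print(s3_result)
print(s4_result)
```

The S3 half needs no declarations beyond naming the methods `generic.class`, whereas the S4 half checks slot types when `new` is called; that trade-off between informality and validation is the main practical difference between the two.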