R & Python Basic

More details on R: R for Data Science (2e) and Advanced R

More details on Python: Automate the Boring Stuff with Python and W3Schools Python

This article serves as a brief introduction to the fundamental coding aspects of both R and Python. It provides a first impression of these scripting languages. For a more comprehensive understanding and in-depth techniques related to both languages, you are encouraged to explore the resources listed above. The content here is primarily a condensed compilation of information from those sources, aimed at facilitating a comparison between R and Python.

Data and Functions are the two essential components of every programming language, especially in the context of data science and data processing. They can be likened to nouns and verbs in natural languages. Data describes information, while Functions define actions for manipulating that data.

This article is divided into two main sections: Data (Section 1) and Coding (Section 2).

In the Data section, we will explore:

  1. Basic data types and structures: data types such as numbers, characters, and booleans, and data structures such as lists and data frames.
  2. Fundamentals of CRUD (Create, Read, Update, Delete) operations.

In the Coding section, we will delve into three key aspects:

  1. Fundamental mathematics.
  2. Control flow, including decision-making (choices) and looping.
  3. Creating and invoking functions.

The above five elements can be considered as the most fundamental elements of every scripting language. Additionally, we will explore object creation and naming in a section called ‘New Objects’ (Section 3). Objects can encompass functions and variables, further enriching our understanding of scripting.

This article will provide a solid introduction to the core concepts in programming, laying the groundwork for further exploration in both R and Python.

Overview:

1 Data

In the data section, we will explore various aspects of data, including:

  • Understanding basic data types and structures: We’ll delve into how data is stored and organized, laying the foundation for data manipulation.

  • Mastering indexing and subsetting: We’ll investigate indexing methods across different programming languages and learn how to extract subsets from various data structures.

  • Navigating CRUD operations: We’ll cover the fundamentals of CRUD (Create, Read, Update, Delete) operations, essential for data manipulation and management, among other topics.

1.1 Datatypes & Structure

In programming, the concept of datatypes is fundamental. It forms the basis for how we handle and manipulate information in software. The most basic data types, such as integers, numerics, booleans, characters, and bytes, are supported by almost all programming languages. Additionally, there are more complex data types built upon these basics, like strings, which are sequences of characters, and dates, which are often stored internally as numeric values.

Data structures are equally important, as they determine the organization of data, whether it involves the same data types in multiple dimensions or combinations of different types. Data types and structures are intertwined, serving as the cornerstone for our programming endeavors.

Variables play a pivotal role in storing data of different types. The choice of data type and structure is critical, as different types and structures enable various operations and functionalities. Therefore, understanding data types and structures is paramount before embarking on data manipulation tasks.

1.1.1 Datatypes

A data type of a variable specifies the type of data that is stored inside that variable. In this context, we will only discuss atomic variables, which represent the fundamental data types. R has six basic atomic data types, most of which have a direct counterpart in Python:

  1. Logical (boolean data type)
    • can only have two values: TRUE and FALSE
  2. Numeric (double, float)
    • represents all real numbers, with or without decimal values
  3. Integer
    • represents whole numbers, without decimal points
  4. Complex
    • represents complex numbers with a real and an imaginary part
  5. Character (string)
    • represents character or string values
  6. Raw (bytes)
    • stores values as raw bytes

In R, variables do not require explicit declaration with a particular data type. Instead, R is dynamically typed, allowing variables to adapt to the data they contain. You can use the following techniques to work with data types in R:

  • Checking Data Types: To determine the data type of a variable, you can use the class() function.

  • Type Conversion: When needed, you can change the data type of a variable using R’s conversion functions, typically prefixed with as. (for example, as.numeric()).

R’s flexibility in data type handling simplifies programming tasks and allows for efficient data manipulation without the need for explicit type declarations.

# Numeric
x <- 10.5
class(x)
[1] "numeric"
# Integer
x <- 1000L
class(x)
[1] "integer"
# Complex
x <- 9i + 3
class(x)
[1] "complex"
# Character/String
x <- "R is exciting"
class(x)
[1] "character"
# Logical/Boolean
x <- TRUE
class(x)
[1] "logical"
# Convert
y <- as.numeric(x)
class(y)
[1] "numeric"
# Raw (bytes)
x <- charToRaw("A")
x
[1] 41
class(x)
[1] "raw"

In Python, variables also do not require explicit declaration with a particular data type. Python is dynamically typed, allowing variables to adapt to the data they contain. You can use the following techniques to work with data types in Python:

  • Checking Data Types: To determine the data type of a variable, you can use the type() function. It allows you to inspect the current data type of a variable.

  • Type Conversion: When needed, you can change the data type of a variable in Python using various conversion functions, like float().

Python’s flexibility in data type handling simplifies programming tasks and allows for efficient data manipulation without the need for explicit type declarations.

# Numeric
x = 10.5
print(type(x))
<class 'float'>
# Integer
x = 1000
print(type(x))
<class 'int'>
# Complex
x = 9j + 3
print(type(x))
<class 'complex'>
# Character/String
x = "Python is exciting"
print(type(x))
<class 'str'>
# Logical/Boolean
x = True
print(type(x))
<class 'bool'>
# Convert to Numeric
y = float(x)
print(type(y))
<class 'float'>
# Raw (bytes)
x = b'A'
print(x)
b'A'
print(type(x))
<class 'bytes'>
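
As a minimal sketch of type checking and conversion in Python, the following uses only the built-in constructors int(), float(), str(), and bool() together with isinstance(). The variable names are chosen for illustration only. One difference worth noting is that a failed conversion in Python raises an exception, whereas R's as.numeric() returns NA with a warning.

```python
# Convert between basic types with the built-in constructors
n = int("42")      # str -> int
f = float("3.5")   # str -> float
s = str(1000)      # int -> str
b = bool(0)        # int -> bool (0 is False, any other number is True)
t = int(10.9)      # float -> int truncates toward zero

print(n, f, s, b, t)

# isinstance() checks a value against one type or a tuple of types
print(isinstance(n, int))
print(isinstance(f, (int, float)))

# A failed conversion raises ValueError instead of returning a missing value
try:
    int("abc")
except ValueError as err:
    conversion_error = str(err)
    print("conversion failed:", conversion_error)
```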

1.1.2 Data Structure

Comparatively, data structures between R and Python tend to exhibit more differences than their data types. However, by incorporating additional libraries like NumPy and pandas, we can access shared data structures which play a vital role in the field of data science.

  1. Vector: A set of multiple values (items)
    • Contains items of the same data type or structure
    • Indexed: Allows you to get and change items using indices
    • Allows duplicates
    • Changeable: You can modify, add, and remove items after creation
  2. Array: A multi-dimensional extension of a vector
    • Matrix: two dimensions
  3. List: A set of multiple values (items)
    • Contains items of different data types or structures
    • Indexed: Allows you to get and change items using indices
    • Allows duplicates
    • Changeable: You can modify, add, and remove items after creation
  4. Table (Data Frame): Tabular data structure
    • Two-dimensional objects with rows and columns
    • Different columns may have different data types
    • All values within one column share the same data type

The structure of an R variable can be checked with str() (short for structure):

# Create a vector
vct_Test <- c(1,5,7)
# View the structure
str(vct_Test)
 num [1:3] 1 5 7
# Create an array
ary_Test <- array(1:24, c(2,3,4))
# View the structure
str(ary_Test)
 int [1:2, 1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
# Create a matrix
mat_Test <- matrix(1:24, 6, 4)
mat_Test
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24
# View the structure
str(mat_Test)
 int [1:6, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
# Create a list
lst_Test <- list(c(1,3,5), "abc", FALSE)
# View the structure
str(lst_Test)
List of 3
 $ : num [1:3] 1 3 5
 $ : chr "abc"
 $ : logi FALSE
# Create a table (data frame)
df_Test <- data.frame(name = c("Bob", "Tom"), age = c(12, 13))
df_Test
  name age
1  Bob  12
2  Tom  13
# View the structure
str(df_Test)
'data.frame':   2 obs. of  2 variables:
 $ name: chr  "Bob" "Tom"
 $ age : num  12 13

In Python, the structure of a variable is treated as the data type, and you can confirm it using the type() function.

It’s important to note that some of the most commonly used data structures, such as arrays and data frames (tables), are not part of the core Python language itself. Instead, they are provided by two popular libraries: numpy and pandas.

import numpy as np
import pandas as pd

# Create a vector (list in Python)
vct_Test = [1, 5, 7]
# View the structure
print(type(vct_Test))
<class 'list'>
# Create a 3D array (NumPy ndarray)
ary_Test = np.arange(1, 25).reshape((2, 3, 4))
# View the structure
print(type(ary_Test))
<class 'numpy.ndarray'>
# Create a matrix (NumPy ndarray)
mat_Test = np.arange(1, 25).reshape((6, 4))
print(type(mat_Test))
<class 'numpy.ndarray'>
# Create a list
lst_Test = [[1, 3, 5], "abc", False]
# View the structure
print(type(lst_Test))
<class 'list'>
# Create a table (pandas DataFrame)
df_Test = pd.DataFrame({"name": ["Bob", "Tom"], "age": [12, 13]})
print(type(df_Test))
<class 'pandas.core.frame.DataFrame'>
print(df_Test)
  name  age
0  Bob   12
1  Tom   13

Python offers several built-in data structures of its own, including:

  1. Tuples: Tuples are ordered collections of elements, similar to lists, but unlike lists, they are immutable, meaning their elements cannot be changed after creation. Tuples are often used to represent fixed collections of items.

  2. Sets: Sets are unordered collections of unique elements. They are valuable for operations that require uniqueness, such as finding unique values in a dataset or performing set-based operations like unions and intersections.

  3. Dictionaries: Dictionaries, also known as dicts, are collections of key-value pairs. They are used to store data in a structured and efficient manner, allowing quick access to values using their associated keys.

While these data structures may not be as commonly used in data manipulation and calculations as arrays and data frames, they have unique features and use cases that can be valuable in various programming scenarios.
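
The three structures described above can be sketched in a few lines. The names point, a, b, and ages are hypothetical and serve only as illustration.

```python
# Tuple: ordered and immutable
point = (3, 4)
try:
    point[0] = 10
except TypeError:
    print("tuples cannot be modified")

# Set: unordered collection of unique elements
a = {1, 2, 2, 3}
b = {3, 4}
print(a)       # the duplicate 2 is stored only once
print(a | b)   # union
print(a & b)   # intersection

# Dictionary: key-value pairs with access by key
ages = {"Bob": 12, "Tom": 13}
ages["Ann"] = 11          # add a new key-value pair
print(ages["Tom"])
print(list(ages.keys()))
```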

1.2 Index & subset

Subsetting plays a crucial role in data manipulation: it allows you to extract specific subsets of data based on positions, conditions, or filters.

More Details in Advanced R: 4 Subsetting.

R’s subsetting operators are fast and powerful. Mastering them allows you to succinctly perform complex operations in a way that few other languages can match. Subsetting in R is easy to learn but hard to master because you need to internalise a number of interrelated concepts:

  • There are six ways to subset atomic vectors.

  • There are three subsetting operators, [[, [, and $.

  • Subsetting operators interact differently with different vector types (e.g., atomic vectors, lists, factors, matrices, and data frames).

Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in.

Tip

In Python, indexing starts from 0, not 1.

1.2.1 Vector

  • Positive integers return elements at the specified positions:
x <- c(2.1, 4.2, 3.3, 5.4)

# One value
x[1]
[1] 2.1
# More values
x[c(1:2, 4)]
[1] 2.1 4.2 5.4
# Duplicate indices will duplicate values
x[c(1, 1)]
[1] 2.1 2.1
# Real numbers are silently truncated to integers
x[c(2.1, 2.9)]
[1] 4.2 4.2
  • Negative integers exclude elements at the specified positions:
# Exclude elements
x[-c(3, 1)]
[1] 4.2 5.4
NOTE

You can’t mix positive and negative integers in a single subset:

x[c(-1, 2)]
Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
  • In Python (NumPy), non-negative integers return elements at the specified positions, counted from 0:
import numpy as np
import pandas as pd

# Create a NumPy array
x = np.array([2.1, 4.2, 3.3, 5.4])

# One value
print(x[0])
2.1
# More values
print(x[np.array([0, 1, 3])])
[2.1 4.2 5.4]
# Duplicate indices will duplicate values
print(x[np.array([0, 0])])
[2.1 2.1]
  • Negative integers index from the end of the array, unlike in R, where they exclude elements:
# One value
print(x[-1])
5.4
# More values
print(x[-np.array([1, 3])])
[5.4 4.2]
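
Because negative indices in Python count from the end rather than exclude elements, excluding positions needs a different tool. The sketch below shows one possible approach with np.delete(), along with a boolean mask and a slice, which correspond to further ways of subsetting a vector. It reuses the array x from above.

```python
import numpy as np

x = np.array([2.1, 4.2, 3.3, 5.4])

# Boolean mask: keep the elements that satisfy a condition
mask = x > 3
print(x[mask])

# Exclude positions 2 and 0 (R: x[-c(3, 1)]); np.delete returns a new array
excluded = np.delete(x, [2, 0])
print(excluded)

# A slice selects a contiguous range; the end position is not included
print(x[1:3])
```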

1.2.2 Matrices and arrays

The most common way of subsetting matrices (2D) and arrays (>2D) is a simple generalisation of 1D subsetting: supply a 1D index for each dimension, separated by a comma. Blank subsetting is now useful because it lets you keep all rows or all columns.

# Create a matrix
a2 <- matrix(1:9, nrow = 3)
# Name the columns
colnames(a2) <- c("A", "B", "C")
# Access a specific element using column name
a2[1, "A"]
A 
1 
# Select specific rows with all columns
a2[1:2, ]
     A B C
[1,] 1 4 7
[2,] 2 5 8
# Exclude column 2 (a row index of 0 selects no rows)
a2[0, -2]
     A C
# Create a 3D array
a3 <- array(1:24, c(2,3,4))
# Access a specific element(s), in different dimensions
a3[1,2,2]
[1] 9
a3[1,2,]
[1]  3  9 15 21
a3[1,,]
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    3    9   15   21
[3,]    5   11   17   23

In Python, the : symbol is used to indicate all elements of a particular dimension or slice. It allows you to select or reference all items along that dimension in a sequence, array, or data structure.

import numpy as np

# Create a NumPy matrix
a2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# NumPy arrays have no column names, so keep them in a separate list
colnames = ["A", "B", "C"]

# Access a specific element using column name
print(a2[0, colnames.index("A")])
1
# Select the first two rows
print(a2[0:2, :])
[[1 2 3]
 [4 5 6]]
# Create a NumPy 3D array
a3 = np.arange(1, 25).reshape((2, 3, 4))

# Access a specific element in the 3D array
print(a3[0, 1, 1])
6
print(a3[0, 1, :])
[5 6 7 8]
print(a3[0, :, :])
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
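
The R and Python results above differ (for example 9 versus 6 for the same element) because R fills arrays column by column while NumPy fills them row by row by default. As a sketch, the order="F" argument of reshape() reproduces R's column-major layout. The names m_c, m_f, and a3_f are hypothetical.

```python
import numpy as np

# Row-major fill (NumPy default, order="C")
m_c = np.arange(1, 10).reshape((3, 3))
# Column-major fill (order="F"), matching R's matrix(1:9, nrow = 3)
m_f = np.arange(1, 10).reshape((3, 3), order="F")

print(m_c[0, :])  # first row with row-major fill
print(m_f[0, :])  # first row with column-major fill

# The same applies to the 3D array from the R example
a3_f = np.arange(1, 25).reshape((2, 3, 4), order="F")
print(a3_f[0, 1, 1])  # corresponds to R's a3[1, 2, 2]
print(a3_f[0, 1, :])  # corresponds to R's a3[1, 2, ]
```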

1.2.3 Data frames

Data frames have the characteristics of both lists and matrices:

  • When subsetting with a single index, they behave like lists and index the columns, so df[1:2] selects the first two columns.

  • When subsetting with two indices, they behave like matrices, so df[1:3, ] selects the first three rows (and all the columns).

# Create a DataFrame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# Select rows
df[df$x == 2, ]
  x y z
2 2 2 b
df[c(1, 3), ]
  x y z
1 1 3 a
3 3 1 c
# There are two ways to select columns from a data frame
# Like a list
df[c("x", "z")]
  x z
1 1 a
2 2 b
3 3 c
# Like a matrix
df[, c("x", "z")]
  x z
1 1 a
2 2 b
3 3 c
# There's an important difference if you select a single 
# column: matrix subsetting simplifies by default, list 
# subsetting does not.
str(df["x"])
'data.frame':   3 obs. of  1 variable:
 $ x: int  1 2 3
str(df[, "x"])
 int [1:3] 1 2 3

More details about pandas.Series.iloc and pandas.Series.loc can be found in the pandas documentation.

  • loc gets rows (and/or columns) with particular labels.

  • iloc gets rows (and/or columns) at integer locations.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'x': range(1, 4), 'y': range(3, 0, -1), 'z': list('abc')})

# Select rows
print(df[df['x'] == 2])
   x  y  z
1  2  2  b
print(df.iloc[[0, 2]])
   x  y  z
0  1  3  a
2  3  1  c
# Select columns
print(df[['x', 'z']])
   x  z
0  1  a
1  2  b
2  3  c
# Select columns with loc (all rows, labelled columns)
print(df.loc[:, ['x', 'z']])
   x  z
0  1  a
1  2  b
2  3  c
# Select a single column as a Series (simplifies by default)
print(df['x'])
0    1
1    2
2    3
Name: x, dtype: int64
# Select a single column as a DataFrame (does not simplify)
print(df[['x']])
   x
0  1
1  2
2  3
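
The difference between loc and iloc becomes visible once the row labels are not simply 0, 1, 2. The sketch below assumes hypothetical row labels 10, 20, and 30, which do not appear in the examples above.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [3, 2, 1], "z": ["a", "b", "c"]},
    index=[10, 20, 30],  # hypothetical row labels
)

# Same cell, addressed by label and by integer position
by_label = df.loc[20, "z"]
by_position = df.iloc[1, 2]
print(by_label, by_position)

# loc slices include the end label, iloc slices exclude the end position
rows_loc = df.loc[10:20]   # two rows (labels 10 and 20)
rows_iloc = df.iloc[0:1]   # one row (position 0)
print(rows_loc)
print(rows_iloc)
```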

1.2.4 List

There are two other subsetting operators: [[ and $. [[ is used for extracting single items, while x$y is a useful shorthand for x[["y"]].

[[ is most important when working with lists because subsetting a list with [ always returns a smaller list. To help make this easier to understand we can use a metaphor: if list x is a train carrying objects, then x[[5]] is the object in car 5, while x[4:6] is a train of cars 4 to 6.

Because [[ can return only a single item, you must use it with either a single positive integer or a single string.

x <- list(a = 1:3, b = "a", d = 4:6)

# Get the subset 
x[1]
$a
[1] 1 2 3
str(x[1])
List of 1
 $ a: int [1:3] 1 2 3
x[1:2]
$a
[1] 1 2 3

$b
[1] "a"
# Get the element
x[[1]]
[1] 1 2 3
str(x[[1]])
 int [1:3] 1 2 3
# with Label
x$a
[1] 1 2 3
x[["a"]]
[1] 1 2 3

Python lists have no named elements. You can access individual elements by position and extract a sub-list with a slice, but you cannot select elements by name.

# Create a Python list with nested lists
x = [list(range(1, 4)), "a", list(range(4, 7))]

# Get the subset (a slice returns a list)
print(x[0:1])
[[1, 2, 3]]
# Get the element using list indexing
print(x[0])
[1, 2, 3]
print(type(x[0]))
<class 'list'>

However, dictionaries in Python excel in this regard, as they allow you to assign and access elements using user-defined keys, providing a more efficient way to work with named elements and subsets of data.

# Create a dictionary with labels
x = {"a": list(range(1, 4)), "b": "a", "d": list(range(4, 7))}


# Get the element using dictionary indexing
print(x["a"])
[1, 2, 3]
# Access an element with get() (returns None if the key is missing)
print(x.get("a"))
[1, 2, 3]
print(type(x["a"]))
<class 'list'>
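
A named subset, comparable in spirit to R's x[c("a", "b")], is not built into dictionaries as an operator. One common way to sketch it is a dict comprehension, shown below with a hypothetical list of keys.

```python
x = {"a": [1, 2, 3], "b": "a", "d": [4, 5, 6]}

# Subset by key with a dict comprehension
keys = ["a", "b"]  # hypothetical selection of keys
subset = {k: x[k] for k in keys}
print(subset)

# For comparison: a plain list is subset by position with a slice
lst = [[1, 2, 3], "a", [4, 5, 6]]
print(lst[0:2])
```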

1.3 Data CRUD

Data manipulation is the art and science of transforming raw data into a more structured and useful format for analysis, interpretation, and decision-making. It’s a fundamental process in data science, analytics, and database management.

Operations for creating and managing persistent data elements can be summarized as CRUD:

  1. Create (Add): The creation of new data elements or records.

  2. Read: The retrieval and access of existing data elements for analysis or presentation.

  3. Update: The modification or editing of data elements to reflect changes or corrections.

  4. Delete: The removal or elimination of data elements that are no longer needed or relevant.

Combining CRUD operations with subsetting provides a powerful toolkit for working with data, ensuring its accuracy, relevance, and utility in various applications, from database management to data analysis.

1.3.1 Create & Add

Most of the data we work with is loaded from external data sources or files. This process will be discussed in detail in the article titled Data Load.

In this section, we will focus on the fundamental aspects of creating and adding data, which may have already been mentioned several times in the preceding text.

Creating new objects in R is commonly done using the assignment operator <-.

When it comes to vectors or lists, there are two primary methods to append new elements:

  • c(): allows you to combine the original vector with a new vector or element, effectively extending the vector.

  • append(): enables you to append a new vector or element at a specific location within the original vector.

# Atomic value
a <- 1 / 200 * 30

# vector
x_v <- c(2.1, 4.2, 3.3, 5.4)
# List
x_l <- list(a = 1:3, b = "a", d = 4:6)
# add new elements
c(x_v, c(-1,-5.6))
[1]  2.1  4.2  3.3  5.4 -1.0 -5.6
c(x_l, list(e = c(TRUE, FALSE)))
$a
[1] 1 2 3

$b
[1] "a"

$d
[1] 4 5 6

$e
[1]  TRUE FALSE
# Append after the 2nd element
append(x_v, c(-1,-5.6), 2)
[1]  2.1  4.2 -1.0 -5.6  3.3  5.4
append(x_l, list(e = c(TRUE, FALSE)), 2)
$a
[1] 1 2 3

$b
[1] "a"

$e
[1]  TRUE FALSE

$d
[1] 4 5 6

When working with 2D matrices or data frames in R, you can use the following functions to add new elements in the row or column dimensions:

  • cbind(): to combine data frames or matrices by adding new columns.

  • rbind(): to combine data frames or matrices by adding new rows.

# Create a matrix
x_m <- matrix(1:9, nrow = 3)
# data frame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
# append in column dimension
cbind(x_m, -1:-3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   -1
[2,]    2    5    8   -2
[3,]    3    6    9   -3
cbind(df, k = -1:-3)
  x y z  k
1 1 3 a -1
2 2 2 b -2
3 3 1 c -3
# append in row dimension
rbind(x_m, -1:-3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]   -1   -2   -3
rbind(df, list(-1, -2, "z")) # try with rbind(df, c(-1, -2, "z"))
   x  y z
1  1  3 a
2  2  2 b
3  3  1 c
4 -1 -2 z

Additionally, for both lists and data frames in R, you can use $ together with the assignment operator <- to add new elements:

# Data frame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
cbind(df, k = -1:-3)
  x y z  k
1 1 3 a -1
2 2 2 b -2
3 3 1 c -3
df$k <- -1:-3 # same as df[['k']] <- -1:-3
df
  x y z  k
1 1 3 a -1
2 2 2 b -2
3 3 1 c -3
# List
x_l <- list(a = 1:3, b = "a", d = 4:6)
c(x_l, list(e = c(TRUE, FALSE)))
$a
[1] 1 2 3

$b
[1] "a"

$d
[1] 4 5 6

$e
[1]  TRUE FALSE
x_l$e <- c(TRUE, FALSE) # same as x_l[['e']] <- c(TRUE, FALSE)
x_l
$a
[1] 1 2 3

$b
[1] "a"

$d
[1] 4 5 6

$e
[1]  TRUE FALSE

Creating new objects in Python is done with the assignment operator =. When it comes to adding elements to a list, there are three primary methods to consider:

  • append(): add a single element to the end of a list.

  • insert(): add an element at a specific position within a list.

  • extend() (equivalent to +=): appends the elements of an iterable (e.g., another list) to the end of an existing list. The + operator also concatenates two lists, but it returns a new list instead of modifying the existing one.

# Atomic element
a = 1 / 200 * 30
b = a + 1
print(a)
0.15
print(b)
1.15
# List
x = [2.1, 4.2, 3.3, 5.4]

# Append one element
x.append(-1)
print(x)
[2.1, 4.2, 3.3, 5.4, -1]
# Insert one element
x.insert(3, -5.6)
print(x)
[2.1, 4.2, 3.3, -5.6, 5.4, -1]
# Extend with new list
x.extend([6.7, 7.9])
print(x)
[2.1, 4.2, 3.3, -5.6, 5.4, -1, 6.7, 7.9]

When working with NumPy arrays in Python, two functions add elements. Both return a new array and leave the original unchanged:

  • np.append(): adds an element or another NumPy array to the end.

  • np.insert(): inserts an element or another NumPy array at a specific position within the original array.

import numpy as np

# Create a NumPy array
x_a = np.array([2.1, 4.2, 3.3, 5.4])

print(np.append(x_a, -1))
[ 2.1  4.2  3.3  5.4 -1. ]
print(np.append(x_a, np.array([6.7, 7.9])))
[2.1 4.2 3.3 5.4 6.7 7.9]
print(np.insert(x_a, 3, -5.6))
[ 2.1  4.2  3.3 -5.6  5.4]
print(np.insert(x_a, 3, np.array([6.7, 7.9])))
[2.1 4.2 3.3 6.7 7.9 5.4]
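
For two-dimensional data, the R functions cbind() and rbind() shown earlier have rough Python counterparts. The following is a sketch of one possible mapping, using np.column_stack(), np.vstack(), DataFrame.assign(), and pd.concat(). The variable names are hypothetical, and other approaches would work as well.

```python
import numpy as np
import pandas as pd

# Same values as R's matrix(1:9, nrow = 3), filled column by column
x_m = np.arange(1, 10).reshape((3, 3), order="F")
new = np.array([-1, -2, -3])

m_cols = np.column_stack([x_m, new])  # like cbind(): adds a column
m_rows = np.vstack([x_m, new])        # like rbind(): adds a row
print(m_cols)
print(m_rows)

df = pd.DataFrame({"x": [1, 2, 3], "y": [3, 2, 1], "z": ["a", "b", "c"]})

# Add a column; assign() returns a copy and leaves df unchanged
df_cols = df.assign(k=[-1, -2, -3])

# Add a row by concatenating a one-row data frame
new_row = pd.DataFrame({"x": [-1], "y": [-2], "z": ["z"]})
df_rows = pd.concat([df, new_row], ignore_index=True)
print(df_cols)
print(df_rows)
```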

1.3.2 Read

The read process is essentially a form of subsetting, where you access specific elements or subsets of data using their indexes. The crucial aspect of this operation is how to obtain and utilize these indexes effectively.

# Create a DataFrame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# Access using integer index 
df[1,2]
[1] 3
# Access using names index
df[,"z"]
[1] "a" "b" "c"
df$z
[1] "a" "b" "c"
# Access with a value condition
idx <- which(df$x > 1)
df[idx,]
  x y z
2 2 2 b
3 3 1 c
df[idx, "z"]
[1] "b" "c"
idx <- which(df$z == "a")
df[idx,]
  x y z
1 1 3 a
df[idx, 1:2]
  x y
1 1 3
import pandas as pd

# Create a pandas DataFrame
df = pd.DataFrame({'x': range(1, 4), 'y': range(3, 0, -1), 'z': list('abc')})

# Access using integer index (iloc)
print(df.iloc[0, 1])
3
# Access using column label
print(df['z'])
0    a
1    b
2    c
Name: z, dtype: object
print(df.z)
0    a
1    b
2    c
Name: z, dtype: object
# Access with a value condition
idx = df['x'] > 1
print(df[idx])
   x  y  z
1  2  2  b
2  3  1  c
print(df[df['z'] == 'a'])
   x  y  z
0  1  3  a
print(df[df['z'] == 'a'][['x', 'y']])
   x  y
0  1  3
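
The R pattern df[idx, "z"], which combines a row condition with a column selection, can be sketched in pandas with loc. When integer positions are needed, as returned by which() in R, np.flatnonzero() is one possible counterpart; note that its positions are 0-based.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(1, 4), "y": range(3, 0, -1), "z": list("abc")})

# Row condition and column label in one step (R: df[idx, "z"])
z_vals = df.loc[df["x"] > 1, "z"]
print(z_vals.tolist())

# Integer positions of the matching rows (R: which(df$x > 1), but 0-based)
idx = np.flatnonzero((df["x"] > 1).to_numpy())
print(idx)

# Use the positions with iloc to read the first two columns
print(df.iloc[idx, 0:2])
```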

1.3.3 Update

The update operation builds upon the principles of reading. It involves replacing an existing value with a new one, but with certain constraints: the new value should have the same size and structure as the original value, and usually the same data type. This helps preserve data consistency and integrity when modifying data elements.

It’s important to note that the concept of ‘data type’ isn’t always rigid. There are cases where data types can change, particularly when replacing entire columns in a data frame, for instance. While data types typically define the expected format and behavior of data, specific operations and transformations may lead to changes in data types to accommodate new values or structures.

# Create a DataFrame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df
  x y z
1 1 3 a
2 2 2 b
3 3 1 c
# Update using integer index 
df[1,2] <- 0
df
  x y z
1 1 0 a
2 2 2 b
3 3 1 c
# Update using names index
df[2,"z"] <- "lk"
df
  x y  z
1 1 0  a
2 2 2 lk
3 3 1  c
# Update with a value condition
idx <- which(df$x > 1)
df[idx, "z"] <- "bg1"
df
  x y   z
1 1 0   a
2 2 2 bg1
3 3 1 bg1
idx <- which(df$z == "a")
df[idx,] <- list(-1, -5, "new_a") # list() keeps x and y numeric; c() would coerce them to character
df
   x  y     z
1 -1 -5 new_a
2  2  2   bg1
3  3  1   bg1
import pandas as pd

# Create a pandas DataFrame
df = pd.DataFrame({'x': range(1, 4), 'y': range(3, 0, -1), 'z': list('abc')})
print(df)
   x  y  z
0  1  3  a
1  2  2  b
2  3  1  c
# Update using integer index
df.iat[0, 1] = 0
print(df)
   x  y  z
0  1  0  a
1  2  2  b
2  3  1  c
# Update using column label and row index
df.at[1, 'z'] = "lk"
print(df)
   x  y   z
0  1  0   a
1  2  2  lk
2  3  1   c
# Update with a value condition
idx_x_gt_1 = df['x'] > 1
df.loc[idx_x_gt_1, 'z'] = "bg1"
print(df)
   x  y    z
0  1  0    a
1  2  2  bg1
2  3  1  bg1
idx_z_eq_a = df['z'] == 'a'
df.loc[idx_z_eq_a] = [-1, -5, "new_a"]
print(df)
   x  y      z
0 -1 -5  new_a
1  2  2    bg1
2  3  1    bg1
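
The remark above that data types can change when a whole column is replaced can be illustrated with a small pandas sketch. The data frame here is a reduced, hypothetical version of the one used in the examples.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "z": ["a", "b", "c"]})
dtype_before = df["x"].dtype  # an integer dtype
print(dtype_before)

# Replacing the whole column may change its dtype
df["x"] = [1.5, 2.5, 3.5]
dtype_after = df["x"].dtype   # now a float dtype
print(dtype_after)

# An explicit conversion makes the intended type visible;
# converting floats to int truncates toward zero
df["x"] = df["x"].astype(int)
print(df)
```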

1.3.4 Delete

Deletion in R can be accomplished relatively easily using methods like specifying negative integer indices or setting elements to NULL within a list. However, it’s essential to recognize that there are limitations to deletion operations. For instance, when dealing with multi-dimensional arrays, you cannot delete a single element in the same straightforward manner; instead, you can only delete entire sub-dimensions.

# Create a DataFrame
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df
  x y z
1 1 3 a
2 2 2 b
3 3 1 c
# Delete using negative integer index 
df[,-2]
  x z
1 1 a
2 2 b
3 3 c
df[-2,]
  x y z
1 1 3 a
3 3 1 c
# Setting elements to `NULL`
df$y <- NULL
df
  x z
1 1 a
2 2 b
3 3 c

In Python, use the .drop() method to delete rows or columns from a data frame. More details can be found in the pandas documentation.

import pandas as pd

df = pd.DataFrame({'x': range(1, 4), 'y': range(3, 0, -1), 'z': list('abc')})
print(df)
   x  y  z
0  1  3  a
1  2  2  b
2  3  1  c
# Drop columns
print(df.drop(['x', 'z'], axis=1))
   y
0  3
1  2
2  1
print(df.drop(columns=['x', 'y']))
   z
0  a
1  b
2  c
# Drop rows by index label
print(df.drop([0, 1]))
   x  y  z
2  3  1  c
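
Deletion from the other Python structures discussed in this article follows a similar idea. The sketch below shows del, remove(), and pop() for lists and dictionaries, and np.delete() for NumPy arrays. As with R's multi-dimensional arrays, np.delete() on a 2D array removes whole rows or columns rather than single cells. All variable names are hypothetical.

```python
import numpy as np

# List: delete by position, by value, or from the end
x = [2.1, 4.2, 3.3, 5.4]
del x[0]        # delete by position
x.remove(3.3)   # delete the first matching value
last = x.pop()  # remove and return the last element
print(x, last)

# Dictionary: delete by key
d = {"a": 1, "b": 2, "d": 3}
del d["b"]
removed = d.pop("d")  # remove the key and return its value
print(d, removed)

# NumPy array: np.delete returns a new array without the given row or column
a = np.arange(1, 10).reshape((3, 3))
a_no_row = np.delete(a, 1, axis=0)  # drop the second row
a_no_col = np.delete(a, 1, axis=1)  # drop the second column
print(a_no_row)
print(a_no_col)
```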

2 Coding

2.1 Math

  • Basic arithmetic: +, -, *, /
  • Exponent, Logarithm
  • Trigonometric functions
  • Linear algebra, Matrix multiplication
1 / 200 * 30
[1] 0.15
(59 + 73 - 2) / 3
[1] 43.33333
3^2
[1] 9
sin(pi / 2) # pi is a built-in constant in R
[1] 1
print(1 / 200 * 30)
0.15
print((59 + 73 - 2) / 3)
43.333333333333336
print(3**2)
9
import math
print(math.sin(math.pi/2))
1.0
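
The list above also mentions exponents, logarithms, and linear algebra. A minimal Python sketch of these, using the standard math module and NumPy, could look as follows. The matrices A and B are hypothetical examples.

```python
import math
import numpy as np

# Exponent and logarithm
print(math.exp(1))      # Euler's number e
print(math.log(math.e)) # natural logarithm
print(math.log10(100))  # logarithm to base 10
print(math.sqrt(16))

# Linear algebra with NumPy
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

print(A * B)             # element-wise product
print(A @ B)             # matrix product
print(A.T)               # transpose
print(np.linalg.det(A))  # determinant (floating point)
```

In R, the matrix product is written A %*% B, while A * B is element-wise in both languages.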

2.2 Control flow

There are two primary tools of control flow: choices and loops.

  • Choices, like if statements and switch() calls, allow you to run different code depending on the input.
  • Loops, like for and while, allow you to repeatedly run code, typically with changing options.

2.2.1 Choices

2.2.1.1 Basic If-Else

The basic form of an if statement in R is as follows:

if (condition) {
  true_action
}
if (condition) {
  true_action
} else {
  false_action
}

If condition is TRUE, true_action is evaluated; if condition is FALSE, the optional false_action is evaluated.

Typically the actions are compound statements contained within {}.

if returns a value, so you can also assign its result to a variable. Several conditions can be chained with else if:

a <- 6
b <- 8

if (b > a) {
  cat("b is greater than a\n")
} else if (a == b) {
  cat("a and b are equal\n")
} else {
  cat("a is greater than b\n")
}
b is greater than a
# if statements
if condition: 
  true_action
  
# if-else
if condition: 
  true_action 
else: 
  false_action


# if-elif-else
if condition1: 
  true_action1 
elif condition2: 
  true_action2 
else: 
  false_action
a = 6
b = 8
if b > a:
  print("b is greater than a")
elif a == b:
  print("a and b are equal")
else:
  print("a is greater than b")
b is greater than a

2.2.1.2 switch

Closely related to if is the switch()-statement. It’s a compact, special purpose equivalent that lets you replace code like:

x_option <- function(x) {
  if (x == "a") {
    "option 1"
  } else if (x == "b") {
    "option 2" 
  } else if (x == "c") {
    "option 3"
  } else {
    stop("Invalid `x` value")
  }
}

with the more succinct:

x_option <- function(x) {
  switch(x,
    a = "option 1",
    b = "option 2",
    c = "option 3",
    stop("Invalid `x` value")
  )
}
x_option("b")
[1] "option 2"

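The last component of a switch() should always throw an error, otherwise unmatched inputs will invisibly return NULL.

In Python, the match statement (available since Python 3.10) plays a similar role; a dictionary lookup is a common alternative: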

match subject:
    case <pattern_1>:
        <action_1>
    case <pattern_2>:
        <action_2>
    case <pattern_3>:
        <action_3>
    case _:
        <action_wildcard>
def x_option(x):
    options = {
        "a": "option 1",
        "b": "option 2",
        "c": "option 3"
    }
    return options.get(x, "Invalid `x` value")

print(x_option("b"))
option 2

2.2.1.3 Vectorised if

Given that if only works with a single TRUE or FALSE, you might wonder what to do if you have a vector of logical values. Handling vectors of values is the job of ifelse(): a vectorised function with test, yes, and no vectors (that will be recycled to the same length):

x <- 1:10
ifelse(x %% 5 == 0, "XXX", as.character(x))
 [1] "1"   "2"   "3"   "4"   "XXX" "6"   "7"   "8"   "9"   "XXX"
ifelse(x %% 2 == 0, "even", "odd")
 [1] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"

Note that missing values will be propagated into the output.

I recommend using ifelse() only when the yes and no vectors are the same type as it is otherwise hard to predict the output type. See https://vctrs.r-lib.org/articles/stability.html#ifelse for additional discussion.
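
Python's if likewise works on a single condition. For arrays, np.where() is one possible counterpart to ifelse(); the sketch below mirrors the two R calls above.

```python
import numpy as np

x = np.arange(1, 11)

# Replace multiples of 5, keep the other values as strings
labels = np.where(x % 5 == 0, "XXX", x.astype(str))
print(labels)

# Label each value as even or odd
parity = np.where(x % 2 == 0, "even", "odd")
print(parity)
```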

2.2.2 Loops

2.2.2.1 for-Loops

A for loop iterates over the items of a sequence. In Python, this can be a list, a tuple, a dictionary, a set, or a string. For each item in the sequence, perform_action is called once, with the value of item updated each time.

In R, for loops are used to iterate over items in a vector. They have the following basic form:

for (item in vector) perform_action
for (i in 1:3) {
  print(i)
}
[1] 1
[1] 2
[1] 3
for item in vector:
  perform_action
for i in range(1, 4):  # the end value 4 is excluded
  print(i)
1
2
3
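
Beyond range(), a Python for loop can iterate directly over the other sequence types mentioned above. The following sketch uses hypothetical data to show a list with enumerate(), a dictionary with items(), and a list comprehension as a compact form of a loop that builds a list.

```python
# Iterate over a list together with the position of each item
names = ["Bob", "Tom"]
for position, name in enumerate(names):
    print(position, name)

# Iterate over the key-value pairs of a dictionary
ages = {"Bob": 12, "Tom": 13}
for name, age in ages.items():
    print(name, age)

# A list comprehension builds a new list in one expression
squares = [i ** 2 for i in range(1, 4)]
print(squares)
```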

2.2.2.2 while-Loops

With the while loop we can execute a set of statements as long as a condition is TRUE:

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
i = 1
while i < 6:
  print(i)
  i += 1
1
2
3
4
5

2.2.2.3 Terminate

There are two ways to terminate a for loop early:

  • next (continue in Python) skips the rest of the current iteration.
  • break exits the entire for loop.
for (i in 1:10) {
  if (i < 3) 
    next

  print(i)
  
  if (i >= 5)
    break
}
[1] 3
[1] 4
[1] 5
for i in range(1, 11):  # 1 to 10, as in the R loop
    if i < 3:
        continue
    
    print(i)
    
    if i >= 5:
        break
3
4
5

2.3 Function

More details in Advanced R, Chapter 6.

A function is a block of code which only runs when it is called. In R, it can be broken down into three components:

  • The formals(), the list of arguments that control how you call the function.

  • The body(), the code inside the function.

  • The environment(), the data structure that determines how the function finds the values associated with the names.

While the formals and body are specified explicitly when you create a function, the environment is specified implicitly, based on where you defined the function. This location could be within another package or within the workspace (global environment).

The function environment always exists, but it is only printed when the function isn’t defined in the global environment.

fct_add <- function(x, y) {
  # A comment
  x + y
}

# Get the formal arguments
formals(fct_add)
$x


$y
# Get the function's source code (body)
body(fct_add)
{
    x + y
}
# Get the function's environment (here, the global environment)
environment(fct_add)
<environment: R_GlobalEnv>
def fct_add(x, y):
    # A comment
    return x + y

# Get the argument names (co_varnames also lists local variables, if any)
print(fct_add.__code__.co_varnames)
('x', 'y')
# Get the function's compiled bytecode (not the source text; inspect.getsource() returns the source when it is available)
print(fct_add.__code__.co_code)
b'\x97\x00|\x00|\x01z\x00\x00\x00S\x00'
# Get the function's global environment (module-level namespace)
print(fct_add.__globals__)
{'__name__': '__main__', '__doc__': None, '__package__': None, ..., 'math': <module 'math' (built-in)>, 'x_option': <function x_option at 0x000002577C79CCC0>, 'i': 5, 'fct_add': <function fct_add at 0x000002577C7A5260>}
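Python's standard inspect module offers a more readable view of the same information. The sketch below is one possible approach, not the only one; fct_total is a hypothetical variant of fct_add with a local variable, added to show that co_varnames lists locals as well as arguments.

```python
import inspect

def fct_total(x, y):
    total = x + y  # local variable, not a formal argument
    return total

# Formal arguments only, in order.
arg_names = list(inspect.signature(fct_total).parameters)

# co_varnames starts with the arguments and then lists local variables.
var_names = fct_total.__code__.co_varnames

# The function's global namespace is the namespace of the module that defines it.
defined_here = fct_total.__globals__ is globals()

print(arg_names)     # ['x', 'y']
print(var_names)     # ('x', 'y', 'total')
print(defined_here)  # True
```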

2.3.1 Call

Calling Syntax:

function_name(argument1 = value1, argument2 = value2, ...)

Try using seq(), which makes regular sequences of numbers:

seq(from = 1, to = 10)
 [1]  1  2  3  4  5  6  7  8  9 10

We often omit the names of the first several arguments in function calls, so we can rewrite this as follows:

seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

We can also check the arguments and other information with:

?seq

This opens the documentation for seq() in the “Help” window.

Calling Syntax:

function_name(argument1 = value1, argument2 = value2)
sequence = list(range(1, 11))
print(sequence)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
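Python accepts both positional and keyword arguments for ordinary functions, but the built-in range() takes positional arguments only, so range(start=1, stop=11) raises a TypeError. The sketch below defines a hypothetical helper seq() (not part of Python) that imitates R's inclusive seq() for positive integer steps and can be called either way. Its first parameter is named start because from is a Python keyword.

```python
def seq(start, stop, by=1):
    """Hypothetical helper: integers from start to stop inclusive, like R's seq()."""
    return list(range(start, stop + 1, by))

by_name = seq(start=1, stop=10)   # keyword arguments, as in seq(from = 1, to = 10)
by_position = seq(1, 10)          # positional arguments, as in seq(1, 10)
odd_numbers = seq(1, 10, by=2)    # positional and keyword arguments mixed

print(by_name)      # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(by_position)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(odd_numbers)  # [1, 3, 5, 7, 9]
```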

2.3.2 Define

Use the function() keyword:

my_add1 <- function(x) {
  x + 1
}

Calling the function my_add1:

my_add1(2)
[1] 3
Tip

In R, the return statement is not essential for a function to yield a value as its result. By default, R will return the result of the last command within the function as its output.

In Python a function is defined using the def keyword:

def my_add(x):
  return x + 1

Calling the function my_add:

print(my_add(2))
3
Important

In Python, the return statement is essential for a function to yield a value as its result. Without it, the function returns None.
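The difference from R's implicit return can be seen in a short sketch. Both function names below are illustrative.

```python
def add_without_return(x):
    x + 1  # the result is computed and then discarded

def add_with_return(x):
    return x + 1

print(add_without_return(2))  # None
print(add_with_return(2))     # 3
```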

3 Naming

3.1 Naming rules

  • must start with a letter (or a dot that is not followed by a number)
  • can only contain letters, numbers, underscores _, and dot .
  • case-sensitive (age, Age and AGE are three different variables)
  • cannot be any of the Reserved Words
    • TRUE FALSE
    • NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_ NA_character_
    • if else
    • for while repeat
    • next break
    • function
    • in
Legal

i_use_snake_case

otherPeopleUseCamelCase

some.people.use.periods

aFew.People_RENOUNCEconvention6

Illegal

_start_with_underscores

1_start_with_number

if

contain space

contain-other+character

More reserved words in:

help("reserved")
  • must start with a letter or the underscore character _
  • can only contain letters, numbers, and underscores _
  • case-sensitive (age, Age and AGE are three different variables)
  • cannot be any of the Python keywords (35 keywords in Python 3.8)
    • True False
    • None
    • if else elif
    • for while
    • try break continue finally
    • def
    • in and or not
    • return
Legal

i_use_snake_case

_start_with_underscores

otherPeopleUseCamelCase

aFew_People_RENOUNCEconvention6

Illegal

want.contain.dot

1_start_with_number

if

contain space

contain-other+character

More Keywords in:

help("keywords")
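The Python rules above can also be checked programmatically. The following sketch combines str.isidentifier() with the standard keyword module; is_valid_name is a hypothetical helper name. Note that isidentifier() also accepts non-ASCII letters, which goes slightly beyond the rules listed here.

```python
import keyword

def is_valid_name(name):
    """Return True if name is a syntactically valid, non-keyword identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = [
    "i_use_snake_case",
    "_start_with_underscores",
    "want.contain.dot",
    "1_start_with_number",
    "if",
    "contain space",
    "contain-other+character",
]

for name in candidates:
    print(f"{name!r}: {is_valid_name(name)}")

# The full keyword list for the running Python version.
print(keyword.kwlist)
```

Only the first two candidates pass. "if" is a valid identifier in form but is rejected because it is a keyword.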

3.2 Naming Conventions

  • Camel Case
    • Each word, except the first, starts with a capital letter:
    • myVariableName
  • Pascal Case
    • Each word starts with a capital letter:
    • MyVariableName
  • Snake Case
    • Each word is separated by an underscore character:
    • my_variable_name
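As an illustration of how the three conventions relate, the sketch below converts a snake case name into the other two. The helper names to_pascal and to_camel are hypothetical, and the code assumes the input is lower-case words separated by single underscores.

```python
def to_pascal(snake_name):
    """my_variable_name -> MyVariableName"""
    return "".join(word.capitalize() for word in snake_name.split("_"))

def to_camel(snake_name):
    """my_variable_name -> myVariableName"""
    pascal = to_pascal(snake_name)
    return pascal[:1].lower() + pascal[1:]

print(to_pascal("my_variable_name"))  # MyVariableName
print(to_camel("my_variable_name"))   # myVariableName
```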