A brief primer on scientific and mathematical notations

As I finished writing the final draft of my first first-author paper, survClust, there were a lot of other firsts! In my opinion, the methods section and a crisp conclusion and discussion were the hardest parts to write.

Below, I share the notes that came in handy while I was writing the methods section of my manuscript.

What is this?

Notes on how to describe a statistical methodology: some basic rules and notations that you should keep in mind.

Scientific notations

  • Random variables are usually written in uppercase roman letters: \(X,Y\), etc.

  • Probability density functions (pdfs) and probability mass functions (pmfs) are denoted by lowercase letters, e.g. \(f(x)\), or \(f_{X}(x)\).

  • Cumulative distribution functions (cdfs) are denoted by uppercase letters, e.g. \(F(x)\), or \(F_{X}(x)\).

Let's summarize the above three points with an example.

A random variable \(X\) has density \(f_{X}\) if, for any \(a \leq b\),

\[ Pr[a\leq X\leq b]=\int _{a}^{b}f_{X}(x)\,dx\]

Hence, if \(F_{X}\) is the cumulative distribution function of \(X\) then:

\[F_{X}(x)=\int _{-\infty }^{x}f_{X}(u)\,du,\]

and

\[f_{X}(x)={\frac {d}{dx}}F_{X}(x).\]
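
The three identities above can be checked numerically. The snippet below is a minimal sketch, assuming \(X\) is a standard normal random variable (my choice for illustration, not something the notation requires). The helper names `normal_pdf`, `normal_cdf` and `integrate` are made up for this example.

```python
import math

def normal_pdf(x):
    """Density f_X(x) of a standard normal X (assumed distribution for this sketch)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Distribution function F_X(x) of a standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

# Pr[a <= X <= b] computed two ways: integrating f_X, and as F_X(b) - F_X(a).
a, b = -1.0, 2.0
prob_from_pdf = integrate(normal_pdf, a, b)
prob_from_cdf = normal_cdf(b) - normal_cdf(a)

# f_X(x) = d/dx F_X(x), approximated at x0 by a central finite difference.
x0, h = 0.5, 1e-5
slope_of_cdf = (normal_cdf(x0 + h) - normal_cdf(x0 - h)) / (2.0 * h)

print(prob_from_pdf, prob_from_cdf)
print(slope_of_cdf, normal_pdf(x0))
```

Each printed pair should agree to several decimal places, which is the lowercase \(f_X\) and uppercase \(F_X\) relationship in action.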

Now, let's go over some quick statistical nitty-gritties:

  • Greek letters \(\theta, \beta\) are commonly used to denote unknown parameters.

  • Placing a hat, or caret, over a true parameter denotes an estimator of it, e.g., \(\widehat {\theta }\) is an estimator for \(\theta\).

  • Building on the above point, the sample mean, variance and correlation coefficient are denoted as \(\bar{x}, s^2, r\) respectively. On the other hand, population parameters are represented as follows: population mean \(\mu\), population variance \(\sigma^2\), and population correlation \(\rho\).
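
To make the sample versus population distinction concrete, here is a small sketch that computes \(\bar{x}\), \(s^2\) and \(r\) from data. The numbers are made up for illustration, and dividing by \(n - 1\) for the sample variance is an assumed convention (the usual unbiased one); some texts divide by \(n\) instead.

```python
import math

# Made-up paired observations (x_i, y_i), i = 1, ..., n.
x = [2.0, 4.0, 4.0, 5.0, 7.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0]
n = len(x)

# Sample means: estimates of the population means mu_x and mu_y.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variances with the n - 1 divisor (assumed convention): estimates of sigma^2.
s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s2_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Sample correlation coefficient r: an estimate of the population correlation rho.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov_xy / math.sqrt(s2_x * s2_y)

print(x_bar, s2_x, r)
```

In a manuscript, the quantities computed here would be reported with the Latin symbols, while the Greek symbols stay reserved for the unknown population values they estimate.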

Finally, most of the time you will need the following notational conventions while drafting the methods section of your manuscript:

  • Input or independent variables are denoted by \(X\), output or dependent variables are denoted by \(Y\), and qualitative outputs by \(G\).

  • If \(X\) is a vector, its components are denoted by subscripts, e.g. \(X_j\).

  • Observed values are written in lowercase; hence the \(i^{th}\) observed value of \(X\) is written as \(x_i\), where \(x_i\) is a scalar or vector.

  • Matrices are represented by bold uppercase letters; for example, a matrix \(\textbf{X}\) with dimensions \(N \times p\), i.e., a set of \(N\) input \(p\)-vectors. In general, vectors will not be bold, except when they have \(N\) components. Note that all vectors are assumed to be column vectors.

Let's break it down with an example -

Given a vector of inputs \(X^T = (X_1,X_2,...,X_p)\), we predict the output \(Y\) via a linear regression model:

\[\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_{j}\hat\beta_{j}\]

Or, including the constant \(1\) in \(X\) and \(\hat\beta_0\) in the coefficient vector \(\hat\beta\), we can write this in vector form as an inner product: \(\hat{Y} = X^T\hat\beta\). Note that \(X\) and \(Y\) are not bold here, since \(X\) is a \(p\)-vector and not an \(N\)-vector. To fit the model, we estimate \(\beta\) by the value that minimizes the Residual Sum of Squares (RSS):

\[RSS(\beta) = \sum_{i=1}^{N} (y_i - x_{i}^T\beta)^2\]

Or in matrix notation we can write it as,

\[RSS(\beta) = (\textbf{y} - \textbf{X}\beta)^T(\textbf{y} - \textbf{X}\beta)\]

where \(\textbf{X}\) is an \(N \times p\) matrix with each row an input vector, and \(\textbf{y}\) is an \(N\)-vector of the outputs. See how \(\textbf{y}\) is in bold in the above equation: it has \(N\) components.
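
As a quick sanity check that the summation form and the matrix form of the RSS describe the same quantity, here is a minimal sketch in plain Python. The toy data, the candidate \(\beta\), and the function names `rss_sum` and `rss_matrix` are all made up for illustration; the first column of the matrix is the constant \(1\), so that the first coefficient plays the role of the intercept.

```python
# Toy data: N = 4 observations, each row of X is one input vector x_i^T.
# The leading 1.0 in every row carries the intercept.
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 8.0]   # N-vector of observed outputs
beta = [1.0, 2.0]          # a candidate coefficient vector, not a fitted estimate

def rss_sum(X, y, beta):
    """Summation form: sum over i of (y_i - x_i^T beta)^2."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        inner = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))  # x_i^T beta
        total += (y_i - inner) ** 2
    return total

def rss_matrix(X, y, beta):
    """Matrix form: (y - X beta)^T (y - X beta)."""
    X_beta = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
    residual = [y_i - f_i for y_i, f_i in zip(y, X_beta)]  # the N-vector y - X beta
    return sum(r_i * r_i for r_i in residual)              # inner product with itself

print(rss_sum(X, y, beta), rss_matrix(X, y, beta))
```

Both functions return the same number for any \(\beta\). The lowercase `x_i` inside the loop is a \(p\)-vector for one observation, while `residual` has \(N\) components, which mirrors the bold versus non-bold convention described above.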

To practice, take one of your favorite papers and go over its methods section to iron out other key details!

Arshi Arora
Research Biostatistician

New Yorker from Jaipur, India. Cancer Genomics, pottery and biking.