# DataMatrix

## What is DataMatrix?

DataMatrix is a Python library for working with tabular data, i.e. spreadsheet-like data that consists of named columns and numbered rows. DataMatrix is a light-weight library that allows you to do many things in an intuitive way. (For more advanced functionality, you can use pandas.).

This tutorial is an abbreviated version of the following page:

## Basics

### Creating a DataMatrix

Create a new `DataMatrix` object. The `length` keyword specifies the number of rows. A newly created `DataMatrix` doesn't have any columns yet.

``````from datamatrix import DataMatrix

dm = DataMatrix(length=3)
print(dm)
``````

Output:

``````+---+
| # |
+---+
| 0 |
| 1 |
| 2 |
+---+
``````

### Working with columns

If you assign a value to a column that doesn't exist, the column will be automatically created. If you assign a single value to a column, this will set all cells to that value:

``````dm.col = 'This creates a new column'
print(dm)
``````

Output:

``````+---+---------------------------+
| # |            col            |
+---+---------------------------+
| 0 | This creates a new column |
| 1 | This creates a new column |
| 2 | This creates a new column |
+---+---------------------------+
``````

If you assign a sequence to a column, this will set the cells based on that sequence. This sequence must have the same length as the column (3 in this case).

``````dm.col = 'one', 'two', 'three'
print(dm)
``````

Output:

``````+---+-------+
| # |  col  |
+---+-------+
| 0 |  one  |
| 1 |  two  |
| 2 | three |
+---+-------+
``````

### Slicing and assigning to column cells

You can assign to one cell, to multiple cells at the same time, or to a slice of cells.

``````dm.col = ':-)'  # Assign to one cell
dm.col[0, 2] = ':P'  # Assign to two cells
dm.col[2:] = ':D'  # Assing to a slice of cells
print(dm)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  :P |
| 1 | :-) |
| 2 |  :D |
+---+-----+
``````

### Column properties

Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account.

``````dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)
``````

Output:

``````mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col
``````

### Iterating over rows, columns, and cells

By iterating directly over a `DataMatrix` object, you get successive `Row` objects. From a `Row` object, you can directly access cells. By iterating over a `Row` object, you get `column name, cell value` tuples.

``````dm.col = 'a', 'b', 'c'
for row in dm:
print(row.col)
for colname, cell in row:
print('{} = {}'.format(colname, cell))
``````

Output:

``````a
col = a
b
col = b
c
col = c
``````

By iterating over `DataMatrix.columns`, you get successive `(column_name, column)` tuples.

``````for colname, col in dm.columns:
print('{} = []'.format(colname, col))
``````

Output:

``````col = []
``````

By iterating over a column, you get successive cells:

``````for cell in dm.col:
print(cell)
``````

Output:

``````a
b
c
``````

The `column_names` property gives a sorted list of all column names (without the corresponding column objects):

``````print(dm.column_names)
``````

Output:

``````['col']
``````

### Element-wise column operations

You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the `+` operator, which then results in concatenation.

``````dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col * .5
dm.col3 = dm.col + 10
dm.col4 = dm.col - 10
dm.col5 = dm.col / 50
print(dm)
``````

Output:

``````+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 |  0  | 0.0  |  10  | -10  | 0.0  |
| 1 |  a  |  a   | a10  |  a   |  a   |
| 2 |  20 | 10.0 |  30  |  10  | 0.4  |
+---+-----+------+------+------+------+
``````

You can apply a function (or `lambda` expression) to all cells in a column simultaneously with the `@` operator.

``````def times_two_plus_one(x):
return x * 2 + 1

dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ times_two_plus_one
print(dm)
``````

Output:

``````+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 |  0  |  1   |
| 1 |  1  |  3   |
| 2 |  2  |  5   |
+---+-----+------+
``````

You can read and write files with functions from the `datamatrix.io` module. The main supported file types are `csv` and `xlsx`.

``````from datamatrix import io

dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
``````

## Splitting and selecting data

### Selecting

You can select by directly comparing columns to values. This returns a new `DataMatrix` object with only the selected rows.

``````dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 6 |  6  |
| 7 |  7  |
| 8 |  8  |
| 9 |  9  |
+---+-----+
``````

You can select by multiple criteria using the `|` (or), `&` (and), and `^` (xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because `|`, `&`, and `^` have priority over other operators.

``````dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  0  |
| 9 |  9  |
+---+-----+
``````

### Splitting

You can also split a DataMatrix into subsets, using the `split()` function from `datamatrix.operations` (usually imported as `ops`). To do this, specify a column with discrete values that indicate the groups into which you want to split. For example, consider this dataset from Moore, McCabe, & Craig (included as example data with JASP). This contains heart rate values for runners and non-runners (controls).

The first way to use `ops.split()` is by only specifying a column to split the data based on. In this case, the function returns a sequence of `value, subdatamatrix` tuples that you can iterate through with a `for` loop: (Note the alternative way to refer to columns that contain spaces or non-ascii characters: `dm['Heart Rate']`.)

``````from datamatrix import io
from datamatrix import operations as ops

for group, group_dm in ops.split(dm.Group):
print('Mean heart rate for {} is {} bpm'.format(
group,
group_dm['Heart Rate'].mean)
)
``````

Output:

``````Mean heart rate for Control is 139.0 bpm
Mean heart rate for Runners is 109.98 bpm
``````

The second way to use `ops.split()` is by also specifying the values that you want to split on. In this case, the function returns a sequence of `subdatamatrix` objects that match the order of the splitting values. Like so:

``````dm_runner, dm_control = ops.split(dm.Group, 'Runners', 'Control')
print('Mean heart rate for runners is {} bpm'.format(dm_runner['Heart Rate'].mean))
print('Mean heart rate for controls is {} bpm'.format(dm_control['Heart Rate'].mean))
``````

Output:

``````Mean heart rate for runners is 109.98 bpm
Mean heart rate for controls is 139.0 bpm
``````

## Exercises

### Analyzing movie ratings

Download this file and read it with DataMatrix. The file contains titles, years, and ratings for many movies. Select only movies that were released between 1990 and 2000. Then, for each year separately, print the mean rating for the movies released in that year.

View solution

### Removing poorly performing participants

This is a challenging exercise!

Download this dataset from Zhou et al. (2020) and read it with datamatrix. This dataset contains reaction times on a search task (`search_rt`) for different participants (`subject_nr`). Each row in this dataset corresponds to a single response, i.e. it is unaggregated data.

The goal of this exercise is to remove all data from those participants whose mean reaction time deviates more than two standard deviations from the grand mean reaction time. Here, both the standard deviation and the grand mean are derived from per-participant mean reaction times (i.e. a mean-of-means approach).

Hints:

• You first need to create a datamatrix with the mean reaction time for each participant.
• The function `datamatrix.operations.group()` groups a datamatrix based on a column, in this case `subject_nr`. The resulting datamatrix will contain a `search_rt` column for each subject, where each cell contains multiple values (i.e. multiple reaction times).
• You can use `datamatrix.series.reduce_()` to get the mean reaction time for each subject.

View solution