Working with tabular data using DataMatrix

Contents

What is DataMatrix?
Basics
Splitting and selecting data
- Selecting
- Splitting
Exercises
- Analyzing movie ratings
- Removing poorly performing participants

What is DataMatrix?

DataMatrix is a Python library for working with tabular data, i.e. spreadsheet-like data that consists of named columns and numbered rows. DataMatrix is a light-weight library that allows you to do many things in an intuitive way. (For more advanced functionality, you can use pandas.).

This tutorial is an abbreviated version of the following page:

https://pydatamatrix.eu/basic/

Basics

Creating a DataMatrix

Create a new DataMatrix object. The length keyword specifies the number of rows. A newly created DataMatrix doesn't have any columns yet.

from datamatrix import DataMatrix

dm = DataMatrix(length=3)
print(dm)

Output:

+---+
| # |
+---+
| 0 |
| 1 |
| 2 |
+---+

Working with columns

If you assign a value to a column that doesn't exist, the column will be automatically created. If you assign a single value to a column, this will set all cells to that value:

dm.col = 'This creates a new column'
print(dm)

Output:

+---+---------------------------+
| # |            col            |
+---+---------------------------+
| 0 | This creates a new column |
| 1 | This creates a new column |
| 2 | This creates a new column |
+---+---------------------------+

If you assign a sequence to a column, this will set the cells based on that sequence. This sequence must have the same length as the column (3 in this case).

dm.col = 'one', 'two', 'three'
print(dm)

Output:

+---+-------+
| # |  col  |
+---+-------+
| 0 |  one  |
| 1 |  two  |
| 2 | three |
+---+-------+

Slicing and assigning to column cells

You can assign to one cell, to multiple cells at the same time, or to a slice of cells.

dm.col[1] = ':-)'  # Assign to one cell
dm.col[0, 2] = ':P'  # Assign to two cells
dm.col[2:] = ':D'  # Assing to a slice of cells
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  :P |
| 1 | :-) |
| 2 |  :D |
+---+-----+

Column properties

Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account.

dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)

Output:

mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col

Iterating over rows, columns, and cells

By iterating directly over a DataMatrix object, you get successive Row objects. From a Row object, you can directly access cells. By iterating over a Row object, you get column name, cell value tuples.

dm.col = 'a', 'b', 'c'
for row in dm:
    print(row.col)
    for colname, cell in row:
        print('{} = {}'.format(colname, cell))

Output:

a
col = a
b
col = b
c
col = c

By iterating over DataMatrix.columns, you get successive (column_name, column) tuples.

for colname, col in dm.columns:
    print('{} = []'.format(colname, col))

Output:

col = []

By iterating over a column, you get successive cells:

for cell in dm.col:
    print(cell)

Output:

a
b
c

The column_names property gives a sorted list of all column names (without the corresponding column objects):

print(dm.column_names)

Output:

['col']

Element-wise column operations

You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the + operator, which then results in concatenation.

dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col * .5
dm.col3 = dm.col + 10
dm.col4 = dm.col - 10
dm.col5 = dm.col / 50
print(dm)

Output:

+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 |  0  | 0.0  |  10  | -10  | 0.0  |
| 1 |  a  |  a   | a10  |  a   |  a   |
| 2 |  20 | 10.0 |  30  |  10  | 0.4  |
+---+-----+------+------+------+------+

You can apply a function (or lambda expression) to all cells in a column simultaneously with the @ operator.

def times_two_plus_one(x):
    return x * 2 + 1

dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ times_two_plus_one
print(dm)

Output:

+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 |  0  |  1   |
| 1 |  1  |  3   |
| 2 |  2  |  5   |
+---+-----+------+

Reading and writing files

You can read and write files with functions from the datamatrix.io module. The main supported file types are csv and xlsx.

from datamatrix import io

dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')

Splitting and selecting data

Selecting

You can select by directly comparing columns to values. This returns a new DataMatrix object with only the selected rows.

dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 6 |  6  |
| 7 |  7  |
| 8 |  8  |
| 9 |  9  |
+---+-----+

You can select by multiple criteria using the | (or), & (and), and ^ (xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because |, &, and ^ have priority over other operators.

dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  0  |
| 9 |  9  |
+---+-----+

Splitting

You can also split a DataMatrix into subsets, using the split() function from datamatrix.operations (usually imported as ops). To do this, specify a column with discrete values that indicate the groups into which you want to split. For example, consider this dataset from Moore, McCabe, & Craig (included as example data with JASP). This contains heart rate values for runners and non-runners (controls).

The first way to use ops.split() is by only specifying a column to split the data based on. In this case, the function returns a sequence of value, subdatamatrix tuples that you can iterate through with a for loop: (Note the alternative way to refer to columns that contain spaces or non-ascii characters: dm['Heart Rate'].)

from datamatrix import io
from datamatrix import operations as ops

dm = io.readtxt('data/heartrate.csv')
for group, group_dm in ops.split(dm.Group):
    print('Mean heart rate for {} is {} bpm'.format(
        group,
        group_dm['Heart Rate'].mean)
    )

Output:

Mean heart rate for Control is 139.0 bpm
Mean heart rate for Runners is 109.98 bpm

The second way to use ops.split() is by also specifying the values that you want to split on. In this case, the function returns a sequence of subdatamatrix objects that match the order of the splitting values. Like so:

dm_runner, dm_control = ops.split(dm.Group, 'Runners', 'Control')
print('Mean heart rate for runners is {} bpm'.format(dm_runner['Heart Rate'].mean))
print('Mean heart rate for controls is {} bpm'.format(dm_control['Heart Rate'].mean))

Output:

Mean heart rate for runners is 109.98 bpm
Mean heart rate for controls is 139.0 bpm

Exercises

Analyzing movie ratings

Download this file and read it with DataMatrix. The file contains titles, years, and ratings for many movies. Select only movies that were released between 1990 and 2000. Then, for each year separately, print the mean rating for the movies released in that year.

View solution

Removing poorly performing participants

This is a challenging exercise!

Download this dataset from Zhou et al. (2020) and read it with datamatrix. This dataset contains reaction times on a search task (search_rt) for different participants (subject_nr). Each row in this dataset corresponds to a single response, i.e. it is unaggregated data.

The goal of this exercise is to remove all data from those participants whose mean reaction time deviates more than two standard deviations from the grand mean reaction time. Here, both the standard deviation and the grand mean are derived from per-participant mean reaction times (i.e. a mean-of-means approach).

Hints:

You first need to create a datamatrix with the mean reaction time for each participant.
The function datamatrix.operations.group() groups a datamatrix based on a column, in this case subject_nr. The resulting datamatrix will contain a search_rt column for each subject, where each cell contains multiple values (i.e. multiple reaction times).
You can use datamatrix.series.reduce_() to get the mean reaction time for each subject.

View solution

You're done with this section!

Continue with Plotting >>