Working with tabular data using DataMatrix
What is DataMatrix?
DataMatrix is a Python library for working with tabular data, i.e. spreadsheet-like data that consists of named columns and numbered rows. DataMatrix is a light-weight library that allows you to do many things in an intuitive way. (For more advanced functionality, you can use pandas.).
This tutorial is an abbreviated version of the following page:
Basics
Creating a DataMatrix
Create a new DataMatrix
object. The length
keyword specifies the number of rows. A newly created DataMatrix
doesn't have any columns yet.
from datamatrix import DataMatrix
dm = DataMatrix(length=3)
print(dm)
Output:
+---+
| # |
+---+
| 0 |
| 1 |
| 2 |
+---+
Working with columns
If you assign a value to a column that doesn't exist, the column will be automatically created. If you assign a single value to a column, this will set all cells to that value:
dm.col = 'This creates a new column'
print(dm)
Output:
+---+---------------------------+
| # | col |
+---+---------------------------+
| 0 | This creates a new column |
| 1 | This creates a new column |
| 2 | This creates a new column |
+---+---------------------------+
If you assign a sequence to a column, this will set the cells based on that sequence. This sequence must have the same length as the column (3 in this case).
dm.col = 'one', 'two', 'three'
print(dm)
Output:
+---+-------+
| # | col |
+---+-------+
| 0 | one |
| 1 | two |
| 2 | three |
+---+-------+
Slicing and assigning to column cells
You can assign to one cell, to multiple cells at the same time, or to a slice of cells.
dm.col[1] = ':-)' # Assign to one cell
dm.col[0, 2] = ':P' # Assign to two cells
dm.col[2:] = ':D' # Assing to a slice of cells
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | :P |
| 1 | :-) |
| 2 | :D |
+---+-----+
Column properties
Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account.
dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)
Output:
mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col
Iterating over rows, columns, and cells
By iterating directly over a DataMatrix
object, you get successive Row
objects. From a Row
object, you can directly access cells. By iterating over a Row
object, you get column name, cell value
tuples.
dm.col = 'a', 'b', 'c'
for row in dm:
print(row.col)
for colname, cell in row:
print('{} = {}'.format(colname, cell))
Output:
a
col = a
b
col = b
c
col = c
By iterating over DataMatrix.columns
, you get successive (column_name, column)
tuples.
for colname, col in dm.columns:
print('{} = []'.format(colname, col))
Output:
col = []
By iterating over a column, you get successive cells:
for cell in dm.col:
print(cell)
Output:
a
b
c
The column_names
property gives a sorted list of all column names (without the corresponding column objects):
print(dm.column_names)
Output:
['col']
Element-wise column operations
You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the +
operator, which then results in concatenation.
dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col * .5
dm.col3 = dm.col + 10
dm.col4 = dm.col - 10
dm.col5 = dm.col / 50
print(dm)
Output:
+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 | 0 | 0.0 | 10 | -10 | 0.0 |
| 1 | a | a | a10 | a | a |
| 2 | 20 | 10.0 | 30 | 10 | 0.4 |
+---+-----+------+------+------+------+
You can apply a function (or lambda
expression) to all cells in a column simultaneously with the @
operator.
def times_two_plus_one(x):
return x * 2 + 1
dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ times_two_plus_one
print(dm)
Output:
+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | 0 | 1 |
| 1 | 1 | 3 |
| 2 | 2 | 5 |
+---+-----+------+
Reading and writing files
You can read and write files with functions from the datamatrix.io
module. The main supported file types are csv
and xlsx
.
from datamatrix import io
dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')
Splitting and selecting data
Selecting
You can select by directly comparing columns to values. This returns a new DataMatrix
object with only the selected rows.
dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+---+-----+
You can select by multiple criteria using the |
(or), &
(and), and ^
(xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because |
, &
, and ^
have priority over other operators.
dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | 0 |
| 9 | 9 |
+---+-----+
Splitting
You can also split a DataMatrix into subsets, using the split()
function from datamatrix.operations
(usually imported as ops
). To do this, specify a column with discrete values that indicate the groups into which you want to split. For example, consider this dataset from Moore, McCabe, & Craig (included as example data with JASP). This contains heart rate values for runners and non-runners (controls).
The first way to use ops.split()
is by only specifying a column to split the data based on. In this case, the function returns a sequence of value, subdatamatrix
tuples that you can iterate through with a for
loop: (Note the alternative way to refer to columns that contain spaces or non-ascii characters: dm['Heart Rate']
.)
from datamatrix import io
from datamatrix import operations as ops
dm = io.readtxt('data/heartrate.csv')
for group, group_dm in ops.split(dm.Group):
print('Mean heart rate for {} is {} bpm'.format(
group,
group_dm['Heart Rate'].mean)
)
Output:
Mean heart rate for Control is 139.0 bpm
Mean heart rate for Runners is 109.98 bpm
The second way to use ops.split()
is by also specifying the values that you want to split on. In this case, the function returns a sequence of subdatamatrix
objects that match the order of the splitting values. Like so:
dm_runner, dm_control = ops.split(dm.Group, 'Runners', 'Control')
print('Mean heart rate for runners is {} bpm'.format(dm_runner['Heart Rate'].mean))
print('Mean heart rate for controls is {} bpm'.format(dm_control['Heart Rate'].mean))
Output:
Mean heart rate for runners is 109.98 bpm
Mean heart rate for controls is 139.0 bpm
Exercises
Analyzing movie ratings
Download this file and read it with DataMatrix. The file contains titles, years, and ratings for many movies. Select only movies that were released between 1990 and 2000. Then, for each year separately, print the mean rating for the movies released in that year.
Removing poorly performing participants
This is a challenging exercise!
Download this dataset from Zhou et al. (2020) and read it with datamatrix. This dataset contains reaction times on a search task (search_rt
) for different participants (subject_nr
). Each row in this dataset corresponds to a single response, i.e. it is unaggregated data.
The goal of this exercise is to remove all data from those participants whose mean reaction time deviates more than two standard deviations from the grand mean reaction time. Here, both the standard deviation and the grand mean are derived from per-participant mean reaction times (i.e. a mean-of-means approach).
Hints:
- You first need to create a datamatrix with the mean reaction time for each participant.
- The function
datamatrix.operations.group()
groups a datamatrix based on a column, in this casesubject_nr
. The resulting datamatrix will contain asearch_rt
column for each subject, where each cell contains multiple values (i.e. multiple reaction times). - You can use
datamatrix.series.reduce_()
to get the mean reaction time for each subject.
You're done with this section!
Continue with Plotting >>