Better file format detection; Update readme
nx10 committed May 15, 2024
1 parent 30f80df commit 6023155
Showing 9 changed files with 174 additions and 66 deletions.
20 changes: 20 additions & 0 deletions LICENSE
@@ -0,0 +1,20 @@
Copyright (c) 2024 Child Mind Institute

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
87 changes: 62 additions & 25 deletions README.md
@@ -1,12 +1,31 @@
> [!NOTE]
> `actfast` is currently in early development and is not yet ready for production use. Please open an issue if you encounter any problems or have any feature requests.
# `actfast` Fast actigraphy data reader

`actfast` is a minimal Python package for reading raw actigraphy data from various devices. It is written in Rust and Python, and is designed with performance and memory safety in mind.
`actfast` is a Python package for reading raw actigraphy data from various devices and manufacturers. It is designed to be lightweight, fast, and memory efficient, and is suitable for reading large datasets.

In preliminary benchmarks, `actfast` showed around 40x speedup compared to `pygt3x` for reading ActiGraph GT3X files.

## Supported devices

The package currently supports the following formats:

| Format | Manufacturer | Device | Implementation status |
| --- | --- | --- | --- |
| GT3X | ActiGraph | wGT3X-BT ||
| BIN | GENEActiv | GENEActiv ||
| CWA | Axivity | AX3, AX6 ||
| BIN | Genea | Genea ||
| BIN | Movisens | Movisens ||
| WAV | Axivity | Axivity | Use general-purpose WAV audio file reader |
| AGD/SQLite | ActiGraph | ActiGraph | Use general-purpose SQLite reader |
| AWD | Philips | Actiwatch | Use general-purpose CSV reader |
| MTN | Philips | Actiwatch | Use general-purpose XML reader |
| CSV | Any | Any | Use general-purpose CSV reader |
| XLS, XLSX, ODS | Any | Any | Use general-purpose Excel reader |

This package exclusively reads non-standard files that contain sensor data. It does not read CSV or other standard file formats used by various manufacturers. Use any general-purpose CSV reader to read these files. Because CSV files do not necessarily contain a unique header, we cannot identify them from the file contents.

The package is designed to be easily extensible to support new formats and devices. If you have a non-standard device format that is not supported yet, please open an issue and attach a sample file. We will do our best to add support for it.

## Installation

Install from PyPI via:
@@ -21,39 +40,57 @@ Or, install the latest development version from GitHub via:
pip install git+https://github.com/childmindresearch/actfast.git
```

## Hardware support
## Tested devices

This package has been tested with data captured by the following devices:
This package has been extensively tested with data captured by the following devices:

| Device | Firmware | API |
| --- | --- | --- |
| ActiGraph wGT3X-BT | `1.9.2` | `actfast.read_actigraph_gt3x(file)` |
| GENEActiv 1.2 | `Ver06.17 15June23` | `actfast.read_geneactiv_bin(file)` |
| Device | Firmware |
| --- | --- |
| ActiGraph wGT3X-BT | `1.9.2` |
| GENEActiv 1.2 | `Ver06.17 15June23` |

Similar devices might work, but have not been tested. Please open an issue and attach a sample file if you have a device that is not supported yet. We will do our best to add support for it.
Similar devices might work, but have not been tested. Please open an issue and attach a sample file if you encounter any issues.

## Usage

The package provides a single function, `read`, which reads a binary actigraphy file and returns a dictionary with the data.

```python
import actfast

subject1 = actfast.read_actigraph_gt3x("data/subject1.gt3x")
subject1 = actfast.read("data/subject1.gt3x")
```
If you are using Pandas and want a similar dataframe to what `pygt3x` offers, you can convert the data to a dataframe using the following code snippet:

The returned dictionary has the following structure:

```python
import pandas as pd
{
    "format": "Actigraph GT3X",  # file format, any of "Actigraph GT3X", "Axivity CWA", "GeneActiv BIN", "Genea BIN", "Unknown WAV", "Unknown SQLite"
    "metadata": {
        # device-specific key-value pairs of metadata (e.g., device model, firmware version)
    },
    "timeseries": {
        # device-specific key-value pairs of "timeseries name" -> {timeseries data}, e.g.:
        "high_frequency": {
            "datetime": # 1D int64 numpy array of timestamps in nanoseconds (Unix epoch time)
            # other data fields are various device-specific sensor data, e.g.:
            "acceleration": # 2D numpy array (n_samples x 3) of acceleration data (x, y, z)
            "light": # 1D numpy array of light data
            "temperature": # temperature data
            # ...
        },
        "low_frequency": {
            # similar structure as high_frequency
        }
    },
}
```

accel = subject1["timeseries"]["acceleration"]
## Architecture & usage considerations

df = pd.DataFrame.from_dict({
"Timestamp": accel["datetime"],
"X": accel["acceleration"][:, 0],
"Y": accel["acceleration"][:, 1],
"Z": accel["acceleration"][:, 2]
})
All supported formats seem to be constructed in a similar way: A header followed by a series of variable-length, variable-content records. While this stream of records is easy to write for the manufacturers, it is not ideal for vectorized operations. `actfast` collects data in linear buffers and then reshapes them into numpy arrays.

df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit='ns')
df = df.set_index("Timestamp")
```
Consider reading large datasets once and storing them in a more efficient format (e.g., Parquet, HDF5) for subsequent analysis. This will reduce the time spent reading files and the memory footprint of the data dramatically.
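
As a minimal sketch of the suggestion above, one timeseries group from the returned dictionary could be flattened into a pandas DataFrame before it is written to a columnar format. The helper name `timeseries_to_frame` is hypothetical and not part of `actfast`, and the sketch assumes that every field in a group has the same number of samples as `datetime`:

```python
import numpy as np
import pandas as pd


def timeseries_to_frame(group: dict) -> pd.DataFrame:
    """Flatten one timeseries group (hypothetical helper, not part of actfast)."""
    columns = {}
    for name, values in group.items():
        if name == "datetime":
            continue
        values = np.asarray(values)
        if values.ndim == 1:
            columns[name] = values
        else:
            # Split 2D fields such as acceleration (n_samples x 3)
            # into one column per axis: acceleration_0, acceleration_1, ...
            for axis in range(values.shape[1]):
                columns[f"{name}_{axis}"] = values[:, axis]
    # Timestamps are documented as int64 nanoseconds since the Unix epoch.
    index = pd.to_datetime(np.asarray(group["datetime"]), unit="ns")
    return pd.DataFrame(columns, index=pd.DatetimeIndex(index, name="datetime"))
```

With a result from `actfast.read`, something like `timeseries_to_frame(result["timeseries"]["high_frequency"]).to_parquet("subject1.parquet")` would then store the data once for later analysis. The group name `high_frequency` is only the README's example and varies by device, and `to_parquet` needs an installed Parquet engine such as `pyarrow`.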

## License

This package is licensed under the MIT License. See the [LICENSE](LICENSE) file for more information.
2 changes: 1 addition & 1 deletion src/actigraph/mod.rs
@@ -484,7 +484,7 @@ mod tests {

#[test]
fn test_actigraph_reader() {
let data = include_bytes!("../../test_data/actigraph.gt3x");
let data = include_bytes!("../../test_data/cmi/actigraph.gt3x");
let mut reader = ActigraphReader::new();
let mut metadata = HashMap::new();
let mut sensor_table = HashMap::new();
51 changes: 51 additions & 0 deletions src/file_format.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
/// File format identification
///
/// This module provides a function to identify the file format of a file based on magic numbers.
///
/// There are also a lot of CSV and other standard file formats used by various manufacturers.
/// These are *not* supported by this library.
/// This library is designed to read binary files that contain raw sensor data.
/// Use any general-purpose CSV reader to read these files.
/// Because CSV files do not necessarily contain a unique header, we cannot identify them from the file contents.
///
/// Examples of CSV and other standard file formats that are not supported by this library:
/// - Philips Actiwatch AWD:
///   Uses the AWD file extension, but is a CSV file with 7 lines of header followed
///   by [{time}" , "{activity counts}"\n"] in minute intervals.
///   (Discontinued 2023: https://www.camntech.com/actiwatch-discontinued/ )
/// - ActiGraph CSV:
///   Note that ActiGraph also has a binary format (GT3X) that *is supported* by this library.
/// - ActiGraph AGD:
///   These are SQLite databases.
/// - ActiWatch MTN:
///   These are XML files.
/// - Axivity WAV:
///   These are WAV audio files. First three channels are X, Y, Z accelerometer data,
///   the fourth is temperature.
/// - Misc. XLS, XLSX, ODS, etc.:
///   These are Microsoft Excel or Open Document Spreadsheets.
///
/// File formats supported by this library
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileFormat {
    ActigraphGt3x,
    AxivityCwa,
    GeneactivBin,
    GeneaBin,
    // TODO: MovisensBin, These are folders with a bunch of files in them (one called 'acc.bin').
    UnknownWav,
    UnknownSqlite,
}

/// Identify the file format of a file based on its magic number
pub fn identify(magic: &[u8; 4]) -> Option<FileFormat> {
    match magic {
        b"PK\x03\x04" => Some(FileFormat::ActigraphGt3x),
        b"Devi" => Some(FileFormat::GeneactivBin),
        [b'M', b'D', ..] => Some(FileFormat::AxivityCwa),
        b"GENE" => Some(FileFormat::GeneaBin),
        b"RIFF" => Some(FileFormat::UnknownWav),
        b"SQLi" => Some(FileFormat::UnknownSqlite),
        _ => None,
    }
}
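
For readers on the Python side, the same magic-number check can be sketched in a few lines, for instance to pre-sort a directory of files before reading them. This is an illustrative mirror of the match arms in `identify`, not code from the package: `sniff_format` and `MAGIC_PREFIXES` are hypothetical names, and unlike the Rust path (where `read_exact` fails on files shorter than four bytes) the sketch simply returns `None` for short files.

```python
from typing import Optional

# Prefixes mirror the match arms of the Rust `identify` function.
MAGIC_PREFIXES = [
    (b"PK\x03\x04", "Actigraph GT3X"),  # generic ZIP signature
    (b"Devi", "GeneActiv BIN"),
    (b"MD", "Axivity CWA"),  # only the first two bytes are checked
    (b"GENE", "Genea BIN"),
    (b"RIFF", "Unknown WAV"),
    (b"SQLi", "Unknown SQLite"),
]


def sniff_format(path: str) -> Optional[str]:
    """Return a format name based on the first four bytes, or None."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if len(magic) < 4:
        # Assumption of this sketch: treat files that are too short as unknown.
        return None
    for prefix, name in MAGIC_PREFIXES:
        if magic.startswith(prefix):
            return name
    return None
```

Because the GT3X check only matches the ZIP signature, any ZIP archive would be reported as "Actigraph GT3X"; the removed `guess_file_format` code carried a comment making the same point.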
2 changes: 1 addition & 1 deletion src/geneactiv/mod.rs
@@ -402,7 +402,7 @@ mod tests {
let mut reader = GeneActivReader::new();
let mut metadata = HashMap::new();
let mut sensor_table = HashMap::new();
let data = include_bytes!("../../test_data/geneactiv.bin");
let data = include_bytes!("../../test_data/cmi/geneactiv.bin");
assert!(reader
.read(
Cursor::new(data),
76 changes: 37 additions & 39 deletions src/lib.rs
@@ -1,3 +1,4 @@
mod file_format;
mod actigraph;
//mod axivity;
mod geneactiv;
@@ -11,33 +12,6 @@ use pyo3::types::PyDict;

use sensors::SensorsFormatReader;

enum FileFormat {
ActigraphGt3x,
GeneactivBin,
AxivityCwa,
}

fn guess_file_format(path: &str) -> std::io::Result<Option<FileFormat>> {
let file = std::fs::File::open(path)?;
let mut reader = std::io::BufReader::new(file);
let mut magic = [0; 4];
reader.read_exact(&mut magic)?;

Ok(
if magic[0] == 0x50 && magic[1] == 0x4b && magic[2] == 0x03 && magic[3] == 0x04 {
// this is the general zip magic number
// if we add another file format that uses zip, we need to check the contents
Some(FileFormat::ActigraphGt3x)
} else if magic[0] == 0x44 && magic[1] == 0x65 && magic[2] == 0x76 && magic[3] == 0x69 {
Some(FileFormat::GeneactivBin)
} else if magic[0] == 0x4d && magic[1] == 0x44 {
Some(FileFormat::AxivityCwa)
} else {
None
},
)
}

fn sensor_data_dyn_to_pyarray<'py, T>(
py: Python<'py>,
data: &[T],
@@ -61,8 +35,14 @@ where
}

#[pyfunction]
fn read(_py: Python, path: &str) -> PyResult<PyObject> {
let file_format = guess_file_format(path)?
fn read(_py: Python, path: std::path::PathBuf) -> PyResult<PyObject> {

let file = std::fs::File::open(&path)?;
let mut reader = std::io::BufReader::new(file);
let mut magic = [0; 4];
reader.read_exact(&mut magic)?;

let format_type = file_format::identify(&magic)
.ok_or(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Unknown file format",
))?;
@@ -147,31 +127,49 @@ fn read(_py: Python, path: &str) -> PyResult<PyObject> {
.unwrap();
};

let fname = std::path::Path::new(path);
let file = std::fs::File::open(fname)?;
let file = std::fs::File::open(&path)?;

match file_format {
FileFormat::ActigraphGt3x => {
match format_type {
file_format::FileFormat::ActigraphGt3x => {
actigraph::ActigraphReader::new()
.read(file, metadata_callback, sensor_table_callback)
.or(Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Failed to read file",
)))?;
}
FileFormat::GeneactivBin => {
file_format::FileFormat::GeneactivBin => {
geneactiv::GeneActivReader::new()
.read(file, metadata_callback, sensor_table_callback)
.or(Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Failed to read file",
)))?;
}
FileFormat::AxivityCwa => {}
file_format::FileFormat::UnknownWav => {
return Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Unsupported file format: WAV audio. Use a general purpose \
audio reader (such as Python standard library 'wave') to read these files.",
));
}
file_format::FileFormat::UnknownSqlite => {
return Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
"Unsupported file format: SQLite. Use a general purpose \
SQLite reader (such as Python standard library 'sqlite3') to read these files.",
));
}
_ => {
return Err(PyErr::new::<pyo3::exceptions::PyValueError, _>(
format!("Unimplemented file format: {:?}", format_type),
));
}
};

let format_str = match file_format {
FileFormat::ActigraphGt3x => "Actigraph GT3X",
FileFormat::GeneactivBin => "GeneActiv BIN",
FileFormat::AxivityCwa => "Axivity CWA",
let format_str = match format_type {
file_format::FileFormat::ActigraphGt3x => "Actigraph GT3X",
file_format::FileFormat::AxivityCwa => "Axivity CWA",
file_format::FileFormat::GeneactivBin => "GeneActiv BIN",
file_format::FileFormat::GeneaBin => "Genea BIN",
file_format::FileFormat::UnknownWav => "Unknown WAV",
file_format::FileFormat::UnknownSqlite => "Unknown SQLite",
};
dict.set_item("format", format_str)?;
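
From the caller's perspective, the error paths in this hunk all surface as Python `ValueError`s ("Unknown file format", "Unsupported file format: ...", "Unimplemented file format: ...", "Failed to read file"). The sketch below shows one way a batch script might separate readable files from rejected ones. `read_many` is a hypothetical helper, the reader is passed in as a parameter so the sketch does not depend on the package, and catching `OSError` for missing or unreadable files is an assumption based on how PyO3 usually converts `std::io::Error`.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def read_many(
    paths: Iterable[str],
    reader: Callable[[str], Dict[str, Any]],
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, str]]:
    """Read each path with `reader`; collect results and error messages."""
    results: Dict[str, Dict[str, Any]] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = reader(path)
        except (ValueError, OSError) as err:
            # ValueError: unknown, unsupported, or unreadable format.
            # OSError: assumed to cover file-system problems.
            failures[path] = str(err)
    return results, failures
```

In practice the call would look like `read_many(paths, actfast.read)`, with the second return value listing which files need a general-purpose WAV, SQLite, or CSV reader instead.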

2 changes: 2 additions & 0 deletions test_data/.gitignore
@@ -0,0 +1,2 @@
ggir
pyactigraphy
File renamed without changes.
File renamed without changes.
