How to load the itemified data into the TSAI models? #914

bcaogh · 2024-07-12T00:23:15Z

I am trying to apply the data preparation techniques from this tutorial notebook:
https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/00_How_to_efficiently_work_with_very_large_numpy_arrays.ipynb
The np.memmap approach works, but the itemify approach does not, and I have found no other resource that gives an example or discusses how to apply it. If someone can clarify how the output of itemify is connected to model training, that would be greatly appreciated.

Code used

import numpy as np
import pandas as pd
from tsai.all import *
from fastcore.foundation import L

# Example data preparation

df = pd.DataFrame({
    'device': np.repeat(np.arange(10), 100),
    'region': np.tile(np.repeat(['A', 'B'], 5), 100),
    'time': np.tile(np.arange(100), 10),
    'var_0': np.random.randn(1000),
    'var_1': np.random.randn(1000),
    'target': np.random.randint(0, 2, 1000)
})

# Determine the total number of windows

window_len = 5
stride = 1
n_windows = 0
for device in df['device'].unique():
    n_device_windows = (len(df[df['device'] == device]) - window_len) // stride + 1
    n_windows += n_device_windows
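
As a quick check of the window-count formula used above, here is a minimal pure-Python sketch. The helper names `count_windows` and `enumerate_window_starts` are hypothetical and not part of tsai. They only reproduce the arithmetic `(n_rows - window_len) // stride + 1` and compare it against an explicit enumeration of window start positions.

```python
def count_windows(n_rows: int, window_len: int, stride: int = 1) -> int:
    """Number of full windows of length window_len that fit in n_rows rows."""
    if n_rows < window_len:
        return 0
    return (n_rows - window_len) // stride + 1


def enumerate_window_starts(n_rows: int, window_len: int, stride: int = 1) -> list:
    """Explicit list of start indices whose window fits entirely in the data."""
    return [s for s in range(0, n_rows, stride) if s + window_len <= n_rows]


# Same sizes as the example DataFrame: 10 devices with 100 rows each.
per_device = count_windows(100, window_len=5, stride=1)
total = 10 * per_device

# The closed-form count agrees with the explicit enumeration.
assert per_device == len(enumerate_window_starts(100, 5, 1))
print(per_device, total)
```

With the sizes used in this post, the first dimension of the memory-mapped X array should therefore equal `total`.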

# Apply SlidingWindowPanel first on a small sample to get the shape of the resulting arrays

sample_df = df.iloc[:window_len + 2].copy()  # Create a copy to avoid SettingWithCopyError
sample_X, sample_y = SlidingWindowPanel(
    window_len=window_len,
    unique_id_cols=['device'],
    stride=stride,
    start=0,
    get_x=df.columns[3:5],
    get_y=['target'],
    horizon=0,
    seq_first=True,
    sort_by=['time'],
    ascending=True
)(sample_df)

# Verify the shapes

print("Sample X shape:", sample_X.shape)
print("Sample y shape:", sample_y.shape)

# Initialize memory-mapped files

X_shape = (n_windows, sample_X.shape[1], sample_X.shape[2])  # Adjust dimensions to (n_windows, features, steps)
y_shape = (n_windows,)  # Adjust dimensions for 1D y

import os

# Specify the full paths

X_memmap_path = os.path.abspath('C:/AIML/TSAI Study/X_data.memmap')
y_memmap_path = os.path.abspath('C:/AIML/TSAI Study/y_data.memmap')

# Remove any existing files to avoid conflicts

if os.path.exists(X_memmap_path):
    os.remove(X_memmap_path)
if os.path.exists(y_memmap_path):
    os.remove(y_memmap_path)

# Create memory-mapped files

X_memmap = np.memmap(X_memmap_path, dtype='float32', mode='w+', shape=X_shape)
y_memmap = np.memmap(y_memmap_path, dtype='float32', mode='w+', shape=y_shape)

# Process the DataFrame in chunks and write to memory-mapped files

chunk_size = 100  # Define an appropriate chunk size

# n_chunks = len(df) // chunk_size + 1

# the chunk size is fixed for each device so n_chunks is easy to calculate

n_chunks = len(df) // chunk_size

window_idx = 0

for i in range(n_chunks):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, len(df))
    df_chunk = df.iloc[start_idx:end_idx].copy()  # Create a copy to avoid SettingWithCopyError

    # Apply SlidingWindowPanel on the chunk
    X_chunk, y_chunk = SlidingWindowPanel(
        window_len=window_len,
        unique_id_cols=['device'],
        stride=stride,
        start=0,
        get_x=df.columns[3:5],
        get_y=['target'],
        horizon=0,
        seq_first=True,
        sort_by=['time'],
        ascending=True
    )(df_chunk)

    # Write to memory-mapped files
    n_chunk_windows = X_chunk.shape[0]
    X_memmap[window_idx:window_idx + n_chunk_windows] = X_chunk
    y_memmap[window_idx:window_idx + n_chunk_windows] = y_chunk
    window_idx += n_chunk_windows
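
The loop above relies on each 100-row chunk holding exactly one device, so that the per-chunk window counts add up to n_windows. Below is a minimal pure-Python sketch of that bookkeeping. It assumes rows are ordered by device, as in the example DataFrame, and that each chunk is windowed on its own, so windows that would straddle a chunk boundary are never produced. The helper `windows_in_chunks` is hypothetical and not a tsai function.

```python
def count_windows(n_rows, window_len, stride=1):
    """Number of full windows of length window_len that fit in n_rows rows."""
    return 0 if n_rows < window_len else (n_rows - window_len) // stride + 1


def windows_in_chunks(rows_per_device, n_devices, chunk_size, window_len, stride=1):
    """Total windows when every chunk is windowed independently, per device."""
    # One device id per row, ordered by device (assumption matching the example data).
    device_ids = [d for d in range(n_devices) for _ in range(rows_per_device)]
    total = 0
    for start in range(0, len(device_ids), chunk_size):
        chunk = device_ids[start:start + chunk_size]
        for d in set(chunk):
            total += count_windows(chunk.count(d), window_len, stride)
    return total


expected = 10 * count_windows(100, 5)  # windows computed over whole devices
aligned = windows_in_chunks(100, 10, chunk_size=100, window_len=5)
misaligned = windows_in_chunks(100, 10, chunk_size=64, window_len=5)
print(expected, aligned, misaligned)
```

When the chunk size matches the rows per device, the totals agree. With a chunk size that cuts devices in two, fewer windows are written than the memmap was allocated for, leaving unwritten rows at the end.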

# Flush changes to disk

X_memmap.flush()
y_memmap.flush()

# Read back the data using np.memmap

X_memmap = np.memmap(X_memmap_path, dtype='float32', mode='r', shape=X_shape)
y_memmap = np.memmap(y_memmap_path, dtype='float32', mode='r', shape=y_shape)

# Convert y_memmap to integers and then to strings

y_memmap = y_memmap.astype(int).astype(str)

# Verify the shapes again

print("X_memmap shape:", X_memmap.shape)
print("y_memmap shape:", y_memmap.shape)

splits = get_splits(y_memmap, valid_size=0.2, stratify=True, random_state=42)

# Create TSDatasets and TSDataLoaders

tfms = [None, [Categorize()]]
dsets = TSDatasets(X_memmap, y_memmap, tfms=tfms, splits=splits)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[64, 128], num_workers=0)

# Example of using TSAI with the DataLoaders

model = build_ts_model(InceptionTimePlus, dls=dls)
learn = Learner(dls, model, metrics=accuracy)
learn.fit_one_cycle(25, lr_max=1e-3)

The code works up to this point: the data loads into the model and the model trains.

The attempt to apply itemify below fails, however, and I cannot find any helpful information on this issue. There is no example of itemified data objects being passed to TSDatasets.

# Use itemify to handle large np.memmap arrays efficiently

def itemify(*x): return L(*x).zip()

X_items = itemify(X_memmap)
y_items = itemify(y_memmap)

splits = get_splits(y_items, valid_size=0.2, stratify=True, random_state=42)

# Create TSDatasets and TSDataLoaders

tfms = [None, [Categorize()]]
dsets = TSDatasets(X_items, y_items, tfms=tfms, splits=splits)

Traceback (most recent call last):

  Cell In[117], line 1
    dsets = TSDatasets(X_items, y_items, tfms=tfms, splits=splits)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tsai\data\core.py:450 in __init__
    X = to3d(X)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tsai\utils.py:172 in to3d
    if isinstance(o, (np.ndarray, pd.DataFrame)): return to3darray(o)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tsai\utils.py:151 in to3darray
    assert False, f'Please, review input dimensions {o.ndim}'

AssertionError: Please, review input dimensions 4

dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[64, 128], num_workers=0)

# Example of using TSAI with the DataLoaders

model = build_ts_model(InceptionTimePlus, dls=dls)
learn = Learner(dls, model, metrics=accuracy)
learn.fit_one_cycle(25, lr_max=1e-3)
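
One possible reading of the "input dimensions 4" error, offered as an assumption rather than a confirmed diagnosis: the itemify helper is defined as `L(*x).zip()`, which zips its arguments element by element. Called with X and y together, that would pair each sample with its label. Called with a single array, each item would instead become a 1-tuple wrapping one sample, which adds a dimension when the result is converted back to an array. The sketch below uses nested Python lists as stand-ins for the arrays. `itemify_like` and `depth` are hypothetical helpers, not the fastcore or tsai implementation.

```python
def itemify_like(*arrays):
    """Hypothetical stand-in for the helper: zip the given sequences element-wise."""
    return list(zip(*arrays))


def depth(obj):
    """Nesting depth of lists/tuples, a rough analogue of ndarray.ndim."""
    return 1 + depth(obj[0]) if isinstance(obj, (list, tuple)) else 0


# Stand-in data: 4 samples, 2 features, 5 steps, with one label per sample.
X = [[[0.0] * 5 for _ in range(2)] for _ in range(4)]
y = ['0', '1', '0', '1']

single = itemify_like(X)     # each item is a 1-tuple: (sample,)
paired = itemify_like(X, y)  # each item is a pair: (sample, label)

print(depth(X), depth(single))      # nesting before and after zipping one input
print(len(paired), len(paired[0]))  # number of items and fields per item
```

If this reading is right, the extra level of nesting in `single` corresponds to the fourth dimension reported by to3darray, while passing X_memmap and y_memmap directly (as in the working code earlier in this post) keeps X three-dimensional.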

Replies: 0 comments
