why shuffling data? #7

ochoch · 2019-04-20T07:56:07Z

Hello,
Nice and interesting work, I learned a lot.
While building the train and test datasets, why do you shuffle the data? I thought that with time series we should not shuffle the data.

data_utils.py

def split_dataset(dataset, ratio=None):
    size = dataset.size
    if ratio is None:
        ratio = _choose_optimal_train_ratio(size)

    mask = np.zeros(size, dtype=np.bool_)
    train_size = int(size * ratio)
    mask[:train_size] = True
    np.random.shuffle(mask)

    train_x = dataset.x[mask, :]
    train_y = dataset.y[mask]

    mask = np.invert(mask)
    test_x = dataset.x[mask, :]
    test_y = dataset.y[mask]

    return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,

maxim5 · 2019-04-20T09:19:26Z

Hi @ochoch I think you're right. At that time I thought it was a good idea to shuffle the data, but now I'd say it leads to overfitting and forward-looking bias.
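
For illustration, here is a minimal sketch of a chronological split that avoids the look-ahead problem. It is not the repository's code: the `DataSet` namedtuple is a hypothetical stand-in for the real class, the function name `split_dataset_chronological` and the default `ratio=0.8` are assumptions, and the rows are assumed to be sorted oldest to newest.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the repository's DataSet class; only .x and .y are assumed.
DataSet = namedtuple("DataSet", ["x", "y"])


def split_dataset_chronological(dataset, ratio=0.8):
    """Split so that every training row precedes every test row in time.

    Assumes the rows of dataset.x and dataset.y are sorted oldest to newest.
    """
    size = len(dataset.y)
    train_size = int(size * ratio)

    # Contiguous slices instead of a shuffled mask: no future rows end up in training.
    train = DataSet(dataset.x[:train_size], dataset.y[:train_size])
    test = DataSet(dataset.x[train_size:], dataset.y[train_size:])
    return train, test


if __name__ == "__main__":
    x = np.arange(20).reshape(10, 2)
    y = np.arange(10)
    train, test = split_dataset_chronological(DataSet(x, y))
    print("train rows:", train.y)
    print("test rows:", test.y)
```

With this kind of split the test set is always the most recent slice of the series, so the reported accuracy is closer to what forward testing would show.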

ochoch · 2019-04-24T13:49:40Z

Hi Maxim,

Thanks for your reply. I played a bit with your implementation and added a provider (FXCM) using fxcmpy (https://github.com/fxcm/RestAPI/tree/master/fxcmpy). In the end, since connecting to the FXCM servers is time-consuming and they do not deliver the last bar(!), I integrated your Python scripts with MT4. On each tick I write the latest data to a CSV file (this replaces the get_latest_data method) and load it into the raw_df dataframe with read_csv. Then I run predict.py, get the prediction for the next bar, and draw the result on a chart...

[image: image.png]

At this stage I am also calculating some accuracy figures, and to be honest it is quite hard to get tradable predictions. In forward testing I get roughly the following accuracy:

| TF  | High Accuracy (%) | Low Accuracy (%) |
|-----|-------------------|------------------|
| m15 | 57.25             | 56.29            |
| H4  | 56.25             | 63.55            |
| D1  | 65.63             | 57.29            |
| W1  | 52.08             | 58.33            |

Maybe we should add some additional features together with a feature selection algorithm. Any insights?

Regards,
och

maxim5 · 2019-05-16T13:09:50Z

Hi @ochoch sorry for the delay.

Unfortunately that's the way it is: there is so much noise and so little signal in financial data. If you are able to find a reliable signal more than 50% accurate, it's good enough and you can make money.
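
As a rough illustration of when an accuracy above 50% is enough, here is a back-of-the-envelope sketch of the expected profit per trade. The function name and the `win`, `loss` and `cost` parameters are assumptions for the example, and it treats every trade as an independent bet with a fixed payoff, which real markets do not guarantee.

```python
def expected_return_per_trade(accuracy, win=1.0, loss=1.0, cost=0.0):
    """Expected profit per trade for a directional signal.

    accuracy: probability that the predicted direction is correct.
    win, loss: average gain on a correct call and average loss on a wrong one.
    cost: round-trip trading cost (spread, commission) in the same units.
    """
    return accuracy * win - (1.0 - accuracy) * loss - cost


if __name__ == "__main__":
    # Symmetric payoff, no costs: the edge is simply 2 * accuracy - 1.
    print(expected_return_per_trade(0.57))
    # The same signal with a trading cost of 0.2 units per trade.
    print(expected_return_per_trade(0.57, cost=0.2))
```

Under these assumptions the edge per trade is `2 * accuracy - 1` minus costs, so a signal that is slightly better than a coin flip pays off only if wins are at least as large as losses and trading costs stay below that edge.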

In terms of features: that's the key question. All ML algorithms that make money boil down to features. I haven't worked much on crypto data since then. Do you have any ideas in mind?
