Train/Test leak in RandomNodeSplit? #9331

AdarshMJ · 2024-05-18T18:01:13Z

Hi @rusty1s, I am trying to perform node classification, and I apply the following transform to the Cora dataset:

transform2 = RandomNodeSplit(split="test_rest", num_splits=10)
data = transform2(data)
print(data)

If I then run:

import numpy as np

# Indices of nodes flagged as train / test
train_nodes = data.train_mask.nonzero(as_tuple=True)[0].cpu().numpy()
test_nodes = data.test_mask.nonzero(as_tuple=True)[0].cpu().numpy()
leakage_nodes = np.intersect1d(train_nodes, test_nodes)
if len(leakage_nodes) > 0:
    print(f"Warning: Found {len(leakage_nodes)} nodes in both the training and test sets.")
else:
    print("No leakage detected.")

I get:

Found 1045 nodes in both the training and test sets

Am I using the transform incorrectly?

rusty1s (Maintainer) · 2024-05-27T07:12:05Z

Don't you need to check each of the splits in isolation?

import numpy as np

from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import RandomNodeSplit

dataset = Planetoid('/tmp/Cora', name='Cora')
data = dataset[0]
transform2 = RandomNodeSplit(split="test_rest", num_splits=10)
data = transform2(data)

for i in range(10):
    train_nodes = data.train_mask[:, i].nonzero(as_tuple=True)[0].cpu().numpy()
    test_nodes = data.test_mask[:, i].nonzero(as_tuple=True)[0].cpu().numpy()

    leakage_nodes = np.intersect1d(train_nodes, test_nodes)
    if len(leakage_nodes) > 0:
        print(
            f"Warning: Found {len(leakage_nodes)} nodes in both the training and test sets."
        )
    else:
        print("No leakage detected.")

No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
No leakage detected.
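
To illustrate why the check in the question reports overlap while the per-split check does not, here is a minimal sketch on synthetic masks. It assumes, as the `[:, i]` indexing above implies, that with `num_splits > 1` the masks have shape `[num_nodes, num_splits]` with one column per split. The node counts and mask values are made up for illustration, and the validation set is ignored.

```python
import torch

# Synthetic stand-in for multi-split masks (values are illustrative only):
# shape [num_nodes, num_splits], one column per split.
num_nodes, num_splits = 6, 3
train_mask = torch.zeros(num_nodes, num_splits, dtype=torch.bool)
train_mask[[0, 1], 0] = True  # split 0 trains on nodes 0 and 1
train_mask[[2, 3], 1] = True  # split 1 trains on nodes 2 and 3
train_mask[[4, 5], 2] = True  # split 2 trains on nodes 4 and 5
test_mask = ~train_mask       # every other node is a test node in that split

# Check from the question: on a 2-D mask, nonzero(as_tuple=True)[0] returns
# the row index of every True entry, pooled over all splits.
train_rows = set(train_mask.nonzero(as_tuple=True)[0].tolist())
test_rows = set(test_mask.nonzero(as_tuple=True)[0].tolist())
pooled_overlap = len(train_rows & test_rows)

# Per-split check: count nodes that are both train and test in the same column.
per_split_overlap = (train_mask & test_mask).sum(dim=0)

print(pooled_overlap)              # 6
print(per_split_overlap.tolist())  # [0, 0, 0]
```

A node that is a training node in one split and a test node in another is counted by the pooled check, even though no single split contains it in both sets.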

1 reply

AdarshMJ May 27, 2024
Author

Yes that makes sense. Thank you :)
