This repository contains the source code and datasets for Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks, published at KDD 2022.
The code is written in Python 3.6. Before running, install the required packages with the following command (using a virtual environment is recommended):

```bash
pip3 install -r requirements.txt
```
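If you want the recommended virtual environment, a minimal setup with Python's built-in venv module could look like this (the environment name is arbitrary):

```bash
python3 -m venv heterformer-env
source heterformer-env/bin/activate
pip3 install -r requirements.txt
```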
Heterformer is a Transformer-based architecture (language model) for representation learning on heterogeneous text-rich (text-attributed) networks. It jointly encodes the text associated with each node and the heterogeneous network structure around it.
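To give a rough intuition of what "jointly encodes" means here, below is a conceptual sketch (not the code in this repository) of a Transformer encoder layer whose self-attention runs over a node's text tokens together with its neighbors' embeddings, treated as extra virtual tokens:

```python
import torch
import torch.nn as nn

class NeighborAugmentedEncoderLayer(nn.Module):
    """Conceptual sketch only, not the repository's implementation:
    self-attention runs over the center node's text tokens plus its
    neighbors' embeddings, injecting network context into text encoding."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )

    def forward(self, token_states, neighbor_embeds):
        # token_states:    (batch, seq_len, hidden) text token representations
        # neighbor_embeds: (batch, n_neigh, hidden) neighbor node embeddings
        mixed = torch.cat([token_states, neighbor_embeds], dim=1)
        attn_out, _ = self.attn(mixed, mixed, mixed)
        mixed = self.norm1(mixed + attn_out)
        mixed = self.norm2(mixed + self.ffn(mixed))
        # keep only the text-token positions for the next layer
        return mixed[:, : token_states.size(1)]

# Toy usage with random tensors
layer = NeighborAugmentedEncoderLayer()
out = layer(torch.randn(2, 32, 768), torch.randn(2, 5, 768))  # (2, 32, 768)
```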
- Download the raw data from DBLP, Twitter, and Goodreads.
- Data processing: run the cells in data/$dataset/data_processing.ipynb for the first step of data processing.
- Network sampling: run the cells in data/$dataset/sampling.ipynb for ego-network sampling and train/val/test data generation (a simplified sketch of the sampling idea follows this list).
- Pretraining data: run the cells in data/$dataset/generate_pretrain_data.ipynb to generate pretraining data for textless nodes.
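As a rough illustration of the ego-network sampling step (the real logic lives in data/$dataset/sampling.ipynb; the adjacency format and function below are hypothetical):

```python
import random

def sample_ego_network(adjacency, center, num_neighbors=5, seed=0):
    """Hypothetical sketch: for a center node, draw at most `num_neighbors`
    neighbors of each node type to form a fixed-size ego-network.
    adjacency maps node -> {neighbor_type: [neighbor ids]}."""
    rng = random.Random(seed)
    ego = {"center": center, "neighbors": {}}
    for ntype, neighbors in adjacency[center].items():
        ego["neighbors"][ntype] = rng.sample(neighbors, min(num_neighbors, len(neighbors)))
    return ego

# Toy usage: a paper node with author and venue neighbors
adj = {"p1": {"author": ["a1", "a2", "a3"], "venue": ["v1"]}}
print(sample_ego_network(adj, "p1", num_neighbors=2))
```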
- Pretrain textless node embeddings, taking the Goodreads dataset as an example:

```bash
cd pretrain/
bash run.sh
```

- Prepare the textless node embedding file for Heterformer training: run the cells in pretrain/transfer_embed.ipynb (a hypothetical sketch of this export step follows).
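The exact export format is defined by pretrain/transfer_embed.ipynb; purely as an illustration, dumping a pretrained embedding table to disk might look like the following (the checkpoint path, state-dict key, and output file are all hypothetical):

```python
import numpy as np
import torch

# Hypothetical sketch: load a pretraining checkpoint and save the
# textless-node embedding table; the real notebook defines the actual
# field names and output format.
state = torch.load("pretrain/ckpt/goodreads.pt", map_location="cpu")
embed = state["node_embedding.weight"].numpy()  # hypothetical key
np.save("data/goodreads/pretrain_embed/node_embed.npy", embed)
```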
- Heterformer training:

```bash
cd ..
python main.py --data_path data/$dataset --model_type Heterformer --pretrain_embed True --pretrain_dir data/$dataset/pretrain_embed
```

- Testing and inference with a trained checkpoint:

```bash
python main.py --data_path data/$dataset --model_type Heterformer --mode test --load_ckpt_name $load_ckpt_dir
python main.py --data_path data/$dataset --model_type Heterformer --mode infer --load 1 --load_ckpt_name $load_ckpt_dir
```
- Downstream task evaluation:

```bash
cd downstream/

# Node classification (transductive and inductive)
python classification.py --mode transductive --dataset $dataset --method Heterformer
python classification.py --mode inductive --dataset $dataset --method Heterformer

# Author classification
python author_classification.py --dataset $dataset --method Heterformer

# Node clustering
python clustering.py --mode transductive --dataset $dataset --method Heterformer

# Retrieval
python retrieval.py --method Heterformer
```
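For intuition, transductive node classification over frozen Heterformer embeddings amounts to fitting a simple classifier on top of them; below is a minimal sketch with scikit-learn (the file names and splits are hypothetical; the real logic lives in downstream/classification.py):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical inputs: inferred node embeddings, labels, and index splits
X = np.load("data/goodreads/heterformer_embed.npy")  # (num_nodes, dim)
y = np.load("data/goodreads/labels.npy")             # (num_nodes,)
train_idx = np.load("data/goodreads/train_idx.npy")
test_idx = np.load("data/goodreads/test_idx.npy")

clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
pred = clf.predict(X[test_idx])
print("Macro-F1:", f1_score(y[test_idx], pred, average="macro"))
```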
Please cite the following paper if you find the code helpful for your research.
```bibtex
@inproceedings{jin2022heterformer,
  title={Heterformer: Transformer-based deep node representation learning on heterogeneous text-rich networks},
  author={Jin, Bowen and Zhang, Yu and Zhu, Qi and Han, Jiawei},
  booktitle={Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages={1020--1031},
  year={2022}
}
```