A repository for generating stylized talking 3D faces and 2D videos. This is the official repository for the paper Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis (ACM MM 2021). The demo video can be viewed at: https://hcsi.cs.tsinghua.edu.cn/demo/MM21-HAOZHEWU.mp4.
conda create -n python36 python=3.6
conda activate python36
- Install the necessary packages with
pip install -r requirements.txt
- Download the pretrained DeepSpeech model from the Link, and then unzip it into the
./deepspeech
folder. - The following steps are the same as the setup instructions of Deep 3D Face Reconstruction:
- Download the Basel Face Model. Due to the license agreement of the Basel Face Model, you have to download the BFM09 model after submitting an application on its home page. After getting access to the BFM data, download "01_MorphableModel.mat" and put it into the ./deep_3drecon/BFM subfolder.
- Download the Expression Basis provided by Guo et al. You can find a link named "CoarseData" in the first row of the Introduction part of their repository. Download and unzip Coarse_Dataset.zip, then put "Exp_Pca.bin" into the ./deep_3drecon/BFM subfolder. The expression basis is constructed from FaceWarehouse data and transferred to the BFM topology.
- Download the pre-trained reconstruction network, unzip it, and put "FaceReconModel.pb" into the ./deep_3drecon/network subfolder.
- Download BFM_model_front.mat and put it into the ./deep_3drecon/BFM subfolder.
- Download the pretrained audio2motion model and put it into
./audio2motion/model
- Download the pretrained texture encoder and renderer, and put them into
./render/model
To run our demo, you need at least one GPU with 11 GB of GPU memory.
python demo.py --in_img [*.png] --in_audio [*.wav] --output_path [path]
We provide 10 example talking styles in style.npy. You can also calculate your own style codes with the following code, where exp is the 3DMM expression coefficient series and pose is the pose matrix reconstructed by Deep 3D Face Reconstruction. We usually calculate style codes from videos of 5-20 seconds.
import pickle as pkl
import numpy as np

def get_style_code(exp, pose):
    # Load dataset-level statistics used to normalize the style features.
    exp_mean_std = pkl.load(open("./data/ted_hd/exp_mean_std.pkl", 'rb'))
    exp_std_mean = exp_mean_std['s_m']
    exp_std_std = exp_mean_std['s_s']
    exp_diff_std_mean = exp_mean_std['d_s_m']
    exp_diff_std_std = exp_mean_std['d_s_s']
    pose_mean_std = pkl.load(open("./data/ted_hd/pose_mean_std.pkl", 'rb'))
    pose_diff_std_mean = pose_mean_std['d_s_m']
    pose_diff_std_std = pose_mean_std['d_s_s']
    # Frame-to-frame differences capture expression and pose dynamics.
    diff_exp = exp[:-1, :] - exp[1:, :]
    # Normalized standard deviation of the expression coefficients.
    exp_std = (np.std(exp, axis=0) - exp_std_mean) / exp_std_std
    # Normalized standard deviation of the expression velocity.
    diff_exp_std = (np.std(diff_exp, axis=0) - exp_diff_std_mean) / exp_diff_std_std
    diff_pose = pose[:-1, :] - pose[1:, :]
    # Normalized standard deviation of the pose velocity.
    diff_pose_std = (np.std(diff_pose, axis=0) - pose_diff_std_mean) / pose_diff_std_std
    # The style code concatenates the three statistics.
    return np.concatenate((exp_std, diff_exp_std, diff_pose_std))
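As a usage sketch (the .npy file names below are placeholders; in practice exp and pose come from running Deep 3D Face Reconstruction on a 5-20 second reference video):

```python
import numpy as np

# Placeholder inputs: per-frame expression coefficients and pose parameters
# reconstructed from a short reference video of the target style.
exp = np.load("reference_exp.npy")    # shape (T, exp_dim)
pose = np.load("reference_pose.npy")  # shape (T, pose_dim)

style_code = get_style_code(exp, pose)
np.save("my_style.npy", style_code)   # use this in place of an entry of style.npy
```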
Notice that the pose of each talking face is static in the current demo; you can control the pose of the face by modifying the coeff_array at line 93 of demo.py. The coeff_array has shape of
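As a rough sketch of such a modification (the coefficient layout below follows the standard 257-dimensional Deep 3D Face Reconstruction convention and is an assumption; verify the actual indices and shape of coeff_array in demo.py before use):

```python
import numpy as np

# Assumed 257-dim coefficient layout (verify against deep_3drecon):
# [0:80] identity, [80:144] expression, [144:224] texture,
# [224:227] rotation angles, [227:254] lighting, [254:257] translation.

def add_head_sway(coeff_array, amplitude=0.05, period=50):
    """Add a gentle sinusoidal yaw motion to an otherwise static head pose."""
    coeff_array = coeff_array.copy()
    t = np.arange(coeff_array.shape[0])
    coeff_array[:, 224] += amplitude * np.sin(2 * np.pi * t / period)  # assumed yaw index
    return coeff_array
```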
Our project organizes the files as follows:
├── README.md
├── data_process
├── deepspeech
├── face_alignment
├── deep_3drecon
├── render
├── audio2motion
The data_process folder contains the preprocessing code for several datasets.
We leverage the DeepSpeech project to extract audio-related features. Please download the pretrained DeepSpeech model from the Link. In deepspeech/evaluate.py, we implement the function get_prob, which takes an audio path as input and returns the latent DeepSpeech features. The latent DeepSpeech features have 50 frames per second, so they need to be aligned to the 25 fps videos in subsequent steps.
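A minimal sketch of this alignment, assuming the features come out as a NumPy array of shape (T, feature_dim) at 50 fps (adapt to the actual output of get_prob), is to average each pair of consecutive feature frames:

```python
import numpy as np

def align_deepspeech_to_video(ds_feats):
    """Downsample 50 fps DeepSpeech features to 25 fps by averaging frame pairs."""
    T = ds_feats.shape[0] // 2 * 2                           # drop a trailing odd frame
    paired = ds_feats[:T].reshape(-1, 2, ds_feats.shape[1])  # (T // 2, 2, feature_dim)
    return paired.mean(axis=1)                               # (T // 2, feature_dim) at 25 fps
```

Simple strided subsampling (ds_feats[::2]) would also work if averaging is not desired.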
We modify Face Alignment for data preprocessing. Unlike the original project, we restrict face alignment to detect only the largest face in each frame to speed up processing.
We modify Deep 3D Face Reconstruction for data preprocessing. We add a batch API, a UV-texture unwrapping API, and a UV coordinate image generation API in deep_3drecon/utils.py.
We implement our texture encoder and rendering model in the render folder. We also implement some other renderers such as Neural Voice Puppetry.
We implement our stylized audio-to-facial-motion model in the audio2motion folder.
We leverage lmdb to store the fragmented data. The data can be downloaded from the link; after downloading, run cat xa* > data.mdb to merge the chunks. You can obtain the train/test videos with the code below. We use the Ted-HD data to train the audio2motion model. We also provide the reconstructed 3D parameters and landmarks in the lmdb.
import lmdb

def test():
    lmdb_path = "./lmdb"
    # map_size is the maximum database size (1 TB); max_dbs enables named sub-databases.
    env = lmdb.open(lmdb_path, map_size=1099511627776, max_dbs=64)
    # Named sub-databases for the train/test splits: raw video, audio, and 5-point landmarks.
    train_video = env.open_db("train_video".encode())
    train_audio = env.open_db("train_audio".encode())
    train_lm5 = env.open_db("train_lm5".encode())
    test_video = env.open_db("test_video".encode())
    test_audio = env.open_db("test_audio".encode())
    test_lm5 = env.open_db("test_lm5".encode())
    with env.begin(write=False) as txn:
        # Entries are keyed by the sample index encoded as a string.
        video = txn.get(str(0).encode(), db=test_video)
        audio = txn.get(str(0).encode(), db=test_audio)
        # Dump the first test sample to disk.
        with open("test.mp4", "wb") as video_file:
            video_file.write(video)
        with open("test.wav", "wb") as audio_file:
            audio_file.write(audio)
        print(txn.stat(db=train_video))
        print(txn.stat(db=test_video))  # we can obtain the database size here
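To export every test clip instead of a single one, you could iterate over the test_video database with a cursor (a sketch extending the snippet above; key handling assumes the str(index) convention used there):

```python
import lmdb

def dump_all_test_videos(lmdb_path="./lmdb", out_dir="."):
    """Write every entry of the test_video database to <out_dir>/test_<key>.mp4."""
    env = lmdb.open(lmdb_path, map_size=1099511627776, max_dbs=64)
    test_video = env.open_db("test_video".encode())
    with env.begin(write=False) as txn:
        for key, value in txn.cursor(db=test_video):
            with open(f"{out_dir}/test_{key.decode()}.mp4", "wb") as f:
                f.write(value)
```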
For the training of the renderer, we do not provide the processed dataset due to the license of LRW.
@inproceedings{wu2021imitating,
title={Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis},
author={Wu, Haozhe and Jia, Jia and Wang, Haoyu and Dou, Yishun and Duan, Chao and Deng, Qingshan},
booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
pages={1478--1486},
year={2021}
}
- The current renderer is still buggy; there are noisy dots in the synthesized videos. We will fix this problem.
- We will optimize the rendering results for a particular person using video footage of only 2-3 seconds.
- We will blend the synthesized results with backgrounds.
- We will add controllable dynamic textures and light control.