The task is to identify speakers in voice recordings from the VoxCeleb2 dataset using transformer and conformer encoders.
Each voice recording is represented as a mel-spectrogram with 40 dimensions. Because the length of each voice recording varies, a fixed segment of 128x15ms is randomly extracted from each recording to serve as input data.
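
The random segment extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual data loader: it assumes that "128x15ms" means 128 consecutive frames, that a mel-spectrogram is stored as a sequence of 40-dimensional frames, and that recordings shorter than the segment length are returned unchanged. The name `random_segment` and its parameters are hypothetical.

```python
import random


def random_segment(mel, segment_len=128, rng=random):
    """Return a random window of `segment_len` consecutive frames.

    `mel` is a sequence of frames, each holding 40 mel values.
    `rng` is anything with a `randint` method (hypothetical parameter,
    added so the crop can be made reproducible).
    """
    if len(mel) <= segment_len:
        # Assumption: short recordings are passed through as-is;
        # padding them to a fixed length is another common option.
        return mel
    # randint is inclusive on both ends, so the last valid start
    # position is len(mel) - segment_len.
    start = rng.randint(0, len(mel) - segment_len)
    return mel[start:start + segment_len]
```

The same slicing works on a PyTorch tensor of shape `(frames, 40)`, since indexing with `mel[start:start + segment_len]` selects along the first dimension.
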
With a single transformer encoder layer, the accuracy rate reaches 53.94%.

When implementing two layers of the transformer encoder, the accuracy rate increases to 66.49%.

By implementing two layers of the transformer encoder and increasing the number of expected features in the input (d_model) from 80 to 256, the accuracy rate further improves to 72.95%.

Maintaining the above hyperparameters but increasing the number of heads from 4 to 64 results in an accuracy rate of 77.24%.
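
For reference, the transformer configuration described above (two layers, d_model of 256, 64 heads) might be wired up roughly as shown below. This is a sketch under assumptions, not the model used for the reported numbers: the linear pre-net, the mean pooling over frames, the feed-forward width, the dropout value, and the number of speakers (`n_speakers=600`) are all hypothetical choices. It uses only `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`, which by default expect input shaped `(frames, batch, d_model)`.

```python
import torch
import torch.nn as nn


class SpeakerClassifier(nn.Module):
    def __init__(self, n_mels=40, d_model=256, nhead=64,
                 num_layers=2, n_speakers=600):
        super().__init__()
        # Project 40-dim mel frames up to d_model (assumed pre-net).
        self.prenet = nn.Linear(n_mels, d_model)
        # d_model must be divisible by nhead; 256 / 64 gives 4 dims per head.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,  # assumed width
            dropout=0.1,                  # assumed value
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # n_speakers is a placeholder for the number of speaker classes.
        self.classifier = nn.Linear(d_model, n_speakers)

    def forward(self, mels):
        # mels: (batch, frames, n_mels)
        x = self.prenet(mels)      # (batch, frames, d_model)
        x = x.permute(1, 0, 2)     # (frames, batch, d_model)
        x = self.encoder(x)        # (frames, batch, d_model)
        x = x.mean(dim=0)          # average over frames: (batch, d_model)
        return self.classifier(x)  # (batch, n_speakers)
```

Calling `SpeakerClassifier()(torch.randn(2, 128, 40))` yields one row of speaker logits per recording in the batch.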

Increasing these hyperparameters further does not significantly improve the performance of the model.
After switching to a conformer encoder and tuning it to 32 heads and 3 layers in the conformer block, the accuracy rate rises to 91.84%.

Although the number of training steps increases from 70,000 to 400,000, the high accuracy on the validation data is not the result of overfitting: the accuracy on both the training data and the validation data increases during training. In other words, as the model fits the training data better, it also identifies speakers in the validation data more accurately. This indicates that the model is learning meaningful patterns and generalizing to unseen data.
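
The overfitting check described above can be expressed as a simple rule over the logged accuracy histories. The helper below is a hypothetical sketch, not part of the training code: it flags a run as likely overfitting when training accuracy kept rising over the last `patience` checkpoints while validation accuracy failed to beat its earlier best. The function name, the `patience` parameter, and the rule itself are assumptions.

```python
def looks_overfit(train_acc, val_acc, patience=3):
    """Heuristic overfitting check on per-checkpoint accuracy histories.

    Returns True when training accuracy improved over the last
    `patience` checkpoints but validation accuracy did not exceed
    its previous best in that window.
    """
    if len(train_acc) != len(val_acc):
        raise ValueError("histories must have the same length")
    if len(val_acc) <= patience:
        return False  # too few checkpoints to judge

    best_val_before = max(val_acc[:-patience])
    best_val_recent = max(val_acc[-patience:])
    train_improved = train_acc[-1] > train_acc[-patience - 1]
    return train_improved and best_val_recent <= best_val_before
```

A run in which both curves keep climbing, as reported here, would return `False` under this rule.
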
PyTorch 1.8.1
tqdm 4.65.0
I am grateful to the Center for Information Services and High Performance Computing [Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)] at TU Dresden for providing its facilities for high-throughput calculations. Some of the code snippets are taken from an assignment in the machine learning course taught by Prof. Hung-yi Lee at National Taiwan University.