This repository is the source code for the paper:
MICO: Selective Search with Mutual Information Co-training
In Proceedings of the International Conference on Computational Linguistics (COLING), 2022
Zhanyu Wang, Xiao Zhang, Hyokun Yun, Choon Hui Teo and Trishul Chilimbi
This is the package of Mutual Information Co-training (MICO) for end-to-end topic sharding. MICO uses BERT to generate sentence representations, and performs query routing and document assignment with these representations. The document assignment module in MICO outputs clusters of almost equal size, and the query routing module routes each query to the cluster containing most (if not all) of its relevant documents. MICO achieves very high performance for topic sharding.
This package can be tested through the example usage below.
You can save the commands below as a bash file and run it in the current folder. You can also find and run it in ./example/scripts/run_mico.sh. It takes less than 5 minutes to finish running.
The results will be saved in ./results/. In the folder example_pair_BERT-finetune_layer-1_CLS_TOKEN_maxlen64_bs64_lr-bert5e-6_lr2e-4_warmup1000_entropy5_seed1 for this example experiment, the final evaluation metrics are saved in metrics.json. The documents assigned to the clusters are saved as a dictionary in clustered_docs.json. The log files for training and evaluation are *.log, and the model is saved as *.pt. The folder ./log contains TensorBoard results for visualization.
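As a quick check on the saved outputs, the sketch below loads clustered_docs.json and reports how many documents each cluster received. It assumes the dictionary maps a cluster id to a list of documents; that layout and the helper names cluster_sizes and size_spread are assumptions for illustration and are not confirmed by the repository.

```python
import json
from pathlib import Path


def cluster_sizes(clustered_docs_path):
    """Return {cluster_id: number_of_documents} for a clustered_docs.json file.

    Assumes the file holds a dictionary whose values are lists of documents.
    """
    with Path(clustered_docs_path).open() as f:
        clusters = json.load(f)
    return {cluster_id: len(docs) for cluster_id, docs in clusters.items()}


def size_spread(sizes):
    """Return (smallest, largest) cluster size from a cluster_sizes() result."""
    counts = list(sizes.values())
    return min(counts), max(counts)
```

A small gap between the smallest and largest size would be consistent with the almost equal-sized clusters that the document assignment module is meant to produce.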
The dataset_name in the training command is set to example because we have an example dataset saved in ../example/data/example_dataset/. You can change train_folder_path and test_folder_path according to your needs.
During training, batch_size is per GPU. If the current batch_size works well on a machine with one GPU, it does not need to change when switching to a machine with more than one GPU (each with the same GPU memory). This is because we use DistributedDataParallel in PyTorch to support multi-GPU training: each GPU gets its own sub-process, which maintains its own dataloader and counts its own epoch number (hence people usually focus on the iteration number instead of the epoch number). On a 4-GPU machine, finishing one epoch in each process means training the model for 4 epochs in total. For a GPU with 16GB of memory, batch_size=64 is a good starting point.
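The accounting described above can be written out as a small calculation. This is only an illustration of the arithmetic; the function name and its arguments are hypothetical and not part of MICO.

```python
def ddp_training_totals(batch_size_per_gpu, num_gpus, epochs_per_process):
    """Summarize what one run processes when each GPU has its own process.

    Following the description above, every process iterates over the data
    with its own dataloader, so the samples seen per step and the number of
    passes over the data both scale with the number of GPUs.
    """
    return {
        "samples_per_step": batch_size_per_gpu * num_gpus,
        "total_epochs": epochs_per_process * num_gpus,
    }
```

For example, ddp_training_totals(64, 4, 1) describes the 4-GPU case: 256 samples per step across all processes and 4 epochs in total.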
During testing, we use DataParallel in PyTorch for better efficiency (we go through the dataset only once across all GPUs, far less work than with DistributedDataParallel), and batch_size is the total across all GPUs. For testing, you can usually set a much larger batch_size than the one used in training; for example, with four GPUs (each with 16GB of memory), we can use batch_size=2048. You can also test a trained model directly by setting --eval_only.
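Because DataParallel splits each input batch across the visible GPUs, the per-GPU load at test time is the total batch_size divided by the number of GPUs. The helper below is a hypothetical illustration of that arithmetic, not code from the repository.

```python
import math


def per_gpu_eval_batch(total_batch_size, num_gpus):
    """Upper bound on the samples one GPU receives per step under DataParallel."""
    return math.ceil(total_batch_size / num_gpus)
```

With batch_size=2048 on four GPUs, each GPU handles 512 samples per step, which is larger than the per-GPU training batch of 64 because no gradients need to be stored during evaluation.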
#!/bin/bash
dataset_name=example
train_folder_path=./example/data/${dataset_name}_train_csv/
test_folder_path=./example/data/${dataset_name}_test_csv/
batch_size=64
selected_layer_idx=-1
pooling_strategy=CLS_TOKEN
max_length=64
lr=2e-4
lr_bert=5e-6
entropy_weight=5
num_warmup_steps=1000
seed=1
model_path=./example/results/${dataset_name}_pair_BERT-finetune_layer${selected_layer_idx}\
_${pooling_strategy}\
_maxlen${max_length}\
_bs${batch_size}\
_lr-bert${lr_bert}\
_lr${lr}\
_warmup${num_warmup_steps}\
_entropy${entropy_weight}\
_seed${seed}/
python -u ./main.py \
--model_path=${model_path} \
--train_folder_path=${train_folder_path} \
--test_folder_path=${test_folder_path} \
--dim_input=768 \
--number_clusters=64 \
--dim_hidden=8 \
--num_layers_posterior=0 \
--batch_size=${batch_size} \
--lr=${lr} \
--num_warmup_steps=${num_warmup_steps} \
--lr_prior=0.1 \
--num_steps_prior=1 \
--init=0.0 \
--clip=1.0 \
--epochs=1 \
--log_interval=10 \
--check_val_test_interval=10000 \
--save_per_num_epoch=100 \
--num_bad_epochs=10 \
--seed=${seed} \
--entropy_weight=${entropy_weight} \
--num_workers=0 \
--cuda \
--lr_bert=${lr_bert} \
--max_length=${max_length} \
--pooling_strategy=${pooling_strategy} \
--selected_layer_idx=${selected_layer_idx}
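The results folder name is assembled from the hyperparameters at the top of the script, which makes it easy to locate a run afterwards. The sketch below rebuilds the same name in Python; the function run_folder_name is a hypothetical helper written for illustration and does not exist in the repository.

```python
def run_folder_name(dataset_name, selected_layer_idx, pooling_strategy,
                    max_length, batch_size, lr_bert, lr,
                    num_warmup_steps, entropy_weight, seed):
    """Rebuild the results folder name the same way the bash script does.

    Pass lr and lr_bert as strings (e.g. "2e-4") so they appear exactly as
    written in the script rather than as Python float representations.
    """
    return (
        f"{dataset_name}_pair_BERT-finetune_layer{selected_layer_idx}"
        f"_{pooling_strategy}_maxlen{max_length}_bs{batch_size}"
        f"_lr-bert{lr_bert}_lr{lr}_warmup{num_warmup_steps}"
        f"_entropy{entropy_weight}_seed{seed}"
    )
```

Called with the values from the script (example, -1, CLS_TOKEN, 64, 64, "5e-6", "2e-4", 1000, 5, 1), it returns the folder name of the example experiment.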
To visualize the curves of the metrics calculated during training and evaluation, please use TensorBoard (for PyTorch we use tensorboardX, which is installed by the setup commands).
The results of each experiment are saved in the folder specified by --model_path in the bash commands. That folder also contains log files in text format. After running the following command, you can open your browser and go to localhost:14095 to view the training results.
# start tensorboard
tensorboard --logdir=./results/ --port=14095 serve
Although we have adopted several techniques to reduce memory usage, you may still run into memory problems when running on large-scale datasets. You can try the memory profiling method below to estimate how much memory you will need to run MICO.
Some tips:
- Setting num_workers=0 is a good way to save memory, and it has almost no effect on training speed.
- Running MICO on more GPUs automatically creates more sub-processes, and each sub-process may consume a lot of memory. Therefore, memory usage increases linearly with the number of GPUs. If needed, you can set export CUDA_VISIBLE_DEVICES=0 to use only 1 GPU in training to save memory.
To use the memory profiling method below, please make sure that the Python package memory_profiler is installed. (If not, you can install it with pip install memory_profiler.) It tracks the memory usage of Python code. For more details, please see https://pypi.org/project/memory-profiler/.
To use it to track the memory usage, you can try the command below.
mprof run --interval=10 --multiprocess --include-children './your_bash_file.sh'
While the bash file is running, you can plot the memory usage over time with the command below. Please replace mprofile_***.dat with the name of the profile results you want to plot (the latest dat file will be used if no file is specified). The figure will be saved as memory_profile_result.png.
mprof plot -o memory_profile_result.png --backend agg mprofile_***.dat
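If you only need a single number rather than a plot, the profile file can also be read directly. The sketch below returns the peak value found on the MEM lines of a profile. It assumes each sample line has the form "MEM <MiB> <timestamp>", which is how mprof is understood to write its .dat files; please check this against your own profile before relying on it. The function name peak_memory_mib is hypothetical.

```python
def peak_memory_mib(dat_path):
    """Return the largest value on the MEM lines of an mprofile .dat file.

    Assumes each sample line looks like "MEM <MiB> <timestamp>". Other lines
    (e.g. CMDLINE or CHLD records) are ignored. Returns None if the file
    contains no MEM lines.
    """
    peak = None
    with open(dat_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "MEM":
                value = float(fields[1])
                peak = value if peak is None else max(peak, value)
    return peak
```

Comparing this peak with the memory available on the target machine gives a rough estimate of whether a larger dataset or more GPUs will fit.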
To set up a new EC2 machine to run the scripts, please use the commands below.
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
bash ./Anaconda3-2021.05-Linux-x86_64.sh
source ~/.bashrc
conda install pytorch=1.7.1 cudatoolkit=9.2 -c pytorch
pip install -r requirements.txt
pip install memory_profiler
After downloading the data, you can replace the two folders (for training and testing data) in ./example/data/ with the two large-scale datasets. Then, you can modify and run the script ./example/scripts/run_mico.sh.
This project is licensed under the Apache-2.0 License.