Git repo: Question Classifier
Pipeline: tokenization -> word embedding -> sentence vector -> classifier training
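The first three stages of the pipeline above can be sketched as follows. This is a toy illustration, not the code in src/: the embedding table, the "#UNK#" token, and the function names are hypothetical, and the sentence vector is assumed to be a plain bag-of-words average of the word vectors.

```python
# Toy embedding table; the values and the "#UNK#" token are made up for illustration.
EMBEDDINGS = {
    "what": [1.0, 0.0],
    "city": [0.0, 1.0],
    "#UNK#": [0.0, 0.0],
}

def tokenize(question):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return question.lower().split()

def embed(tokens, table=EMBEDDINGS):
    """Map each token to its vector, falling back to the unknown-word vector."""
    return [table.get(tok, table["#UNK#"]) for tok in tokens]

def bag_of_words_vector(vectors):
    """Average the word vectors element-wise to get one sentence vector."""
    if not vectors:
        raise ValueError("cannot build a sentence vector from an empty question")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

sentence_vector = bag_of_words_vector(embed(tokenize("What city")))
print(sentence_vector)  # [0.5, 0.5]
```

The resulting fixed-length vector is what a classifier would be trained on in the final stage.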
.
├── README.md
├── data
│   ├── config
│   │   └── xxx.config
│   ├── dev.txt
│   ├── glove.small.txt
│   ├── labels.txt
│   ├── raw_data.txt
│   ├── stopword.txt
│   ├── train.txt
│   ├── trec.txt
│   └── vocabulary.txt
├── document
│   ├── README.md
│   ├── document.md
│   └── document.pdf
└── src
    ├── classifier
    │   ├── __init__.py
    │   └── network.py
    ├── config.ini
    ├── config.py
    ├── dataloader.py
    ├── model.py
    ├── question_classifier.py
    ├── sentVect
    │   ├── __init__.py
    │   ├── bow.py
    │   ├── bow_bilstm.py
    │   └── mybilstm.py
    └── utils
        ├── __init__.py
        ├── file_preload.py
        └── preprocess.py
Commit message format: [your task]: what you did in this commit
e.g.: 'wordEmbedding: word2vec model initialize'
...
Development and testing environment: macOS 10.15.7, Anaconda Python 3.8, 8th-generation Intel Core i5 CPU, 16 GB RAM.
Training set: 5,500 labeled questions
Test set: TREC 10
mkdir data/models
cd src
Preprocess the data with the --preprocess flag. Preprocessing must be completed before training.
python3 question_classifier.py --preprocess --config [config-file-path]
Dev mode: hold out 10% of the training set as a validation set
python3 question_classifier.py --dev --config [config-file-path]
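A 10% hold-out like the one used in dev mode can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the shuffle, and the fixed seed are hypothetical and may not match how the repository actually performs the split.

```python
import random

def split_train_dev(lines, dev_fraction=0.1, seed=0):
    """Shuffle a copy of the lines and hold out dev_fraction of them for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in data with the same size as the 5,500-question training set.
questions = [f"question {i}" for i in range(5500)]
train, dev = split_train_dev(questions)
print(len(train), len(dev))  # 4950 550
```

Fixing the seed keeps the validation set identical across runs, which makes results from different configurations comparable.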
Training mode: train the model on the whole dataset
python3 question_classifier.py --train --config [config-file-path]
Test mode: load an existing model and evaluate it on the TREC 10 dataset
python3 question_classifier.py --test --config [config-file-path]
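For reference, the flags used in the commands above could be parsed with argparse roughly as follows. This is only a sketch: the flag names come from the commands, but treating the four modes as mutually exclusive and making --config required are assumptions, not confirmed behavior of question_classifier.py.

```python
import argparse

def build_parser():
    """Build a parser for the mode flags and the config path."""
    parser = argparse.ArgumentParser(prog="question_classifier.py")
    # Assumption: exactly one mode is selected per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    for flag in ("--preprocess", "--dev", "--train", "--test"):
        mode.add_argument(flag, action="store_true")
    parser.add_argument("--config", required=True, help="path to the config file")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# The path is the placeholder file name shown in the directory tree.
args = build_parser().parse_args(["--dev", "--config", "../data/config/xxx.config"])
print(args.dev, args.config)
```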