Usage

Quick Start

Predict a new network using a trained model

Pre-trained models can be downloaded from here. Candidate pairs should be in tab-separated (.tsv) format with no header, with one column each for [protein name 1] and [protein name 2]. A third [label] column is optional, which allows training or test data files to be used directly as input; the label is ignored and does not affect the predictions.

dscript predict --pairs [input data] --seqs [sequences, .fasta format] --model [model file]
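As a minimal sketch of the pairs format described above, the following Python snippet writes two-column, tab-separated rows with no header. The protein names and the `write_pairs` helper are hypothetical and only illustrate the layout; they are not part of D-SCRIPT.

```python
import csv
import io


def write_pairs(pairs, handle):
    """Write (name1, name2) pairs as tab-separated rows with no header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name1, name2 in pairs:
        writer.writerow([name1, name2])


# Hypothetical protein names, for illustration only.
candidate_pairs = [("protA", "protB"), ("protA", "protC")]

# An in-memory buffer stands in for a real .tsv file here.
buffer = io.StringIO()
write_pairs(candidate_pairs, buffer)
pairs_tsv = buffer.getvalue()
print(pairs_tsv, end="")
```

To produce a file on disk, the same helper can be called with a handle from `open("pairs.tsv", "w", newline="")`.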

Embed sequences with language model

Sequences should be in .fasta format.

dscript embed --seqs [sequences] --outfile [embedding file]
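Before embedding, it can help to confirm that every protein named in a pairs file has a record in the .fasta file. The sketch below assumes that a record's name is the first whitespace-delimited token of its header line; how D-SCRIPT itself matches names is not stated here, so treat that as an assumption. The sequences, names, and helper functions are hypothetical.

```python
def fasta_names(lines):
    """Return the set of record names found in FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # Assumption: the name is the first whitespace-delimited token.
                names.add(header.split()[0])
    return names


def missing_names(pairs, names):
    """Return protein names used in pairs that lack a FASTA record, sorted."""
    used = {name for pair in pairs for name in pair}
    return sorted(used - names)


# Hypothetical sequences and pairs, for illustration only.
fasta_text = """>protA example description
MKTAYIAKQR
>protB
MSLLTEVETY
"""
pairs = [("protA", "protB"), ("protA", "protC")]

names = fasta_names(fasta_text.splitlines())
missing = missing_names(pairs, names)
print(missing)
```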

Train and save a model

Training and validation data should be in tab-separated (.tsv) format with no header, and columns for [protein name 1], [protein name 2], [label].

dscript train --train [training data] --test [validation data] --embedding [embedding file] --save-prefix [prefix]
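One way to obtain separate training and validation files is to shuffle a single labeled list and split it. The sketch below is not part of D-SCRIPT: the 80/20 split, the fixed seed, the 0/1 labels, and the helper names are all assumptions chosen for illustration.

```python
import random


def split_pairs(rows, val_fraction=0.2, seed=0):
    """Shuffle labeled rows reproducibly and split into (train, validation)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]


def to_tsv(rows):
    """Format (name1, name2, label) rows as header-less tab-separated text."""
    return "".join(f"{a}\t{b}\t{label}\n" for a, b, label in rows)


# Hypothetical labeled pairs; labels are assumed to be 0 or 1.
labeled = [(f"prot{i}", f"prot{i + 1}", i % 2) for i in range(10)]

train_rows, val_rows = split_pairs(labeled)
train_tsv = to_tsv(train_rows)
val_tsv = to_tsv(val_rows)
print(len(train_rows), len(val_rows))
```

The two strings can then be written to the files passed as the training and validation data.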

Evaluate a trained model

dscript evaluate --model [model file] --test [test data] --embedding [embedding file] --outfile [result file]

Prediction

usage: dscript predict [-h] --pairs PAIRS --model MODEL [--seqs SEQS]
                    [--embeddings EMBEDDINGS] [-o OUTFILE] [-d DEVICE]
                    [--thresh THRESH]

Make new predictions with a pre-trained model. One of --seqs and --embeddings is required.

optional arguments:
-h, --help            show this help message and exit
--pairs PAIRS         Candidate protein pairs to predict
--model MODEL         Pretrained Model
--seqs SEQS           Protein sequences in .fasta format
--embeddings EMBEDDINGS
                        h5 file with embedded sequences
-o OUTFILE, --outfile OUTFILE
                        File for predictions
-d DEVICE, --device DEVICE
                        Compute device to use
--thresh THRESH       Positive prediction threshold - used to store contact
                        maps and predictions in a separate file. [default:
                        0.5]
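When predictions are launched from a script, the command line can be assembled from the flags listed above. The following is a hedged sketch rather than a D-SCRIPT API: `build_predict_command` and the file names are hypothetical, and the check only enforces that at least one of `--seqs` and `--embeddings` is given.

```python
import shlex


def build_predict_command(pairs, model, seqs=None, embeddings=None,
                          outfile=None, device=None, thresh=None):
    """Assemble an argument list for `dscript predict` from documented flags."""
    if seqs is None and embeddings is None:
        raise ValueError("one of seqs or embeddings is required")
    cmd = ["dscript", "predict", "--pairs", pairs, "--model", model]
    optional = [("--seqs", seqs), ("--embeddings", embeddings),
                ("--outfile", outfile), ("--device", device),
                ("--thresh", thresh)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_predict_command("pairs.tsv", "model.sav",
                            seqs="seqs.fasta", thresh=0.5)
command_line = shlex.join(cmd)
print(command_line)
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; `shlex.join` is used here only to display the command.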

Embedding

usage: dscript embed [-h] --seqs SEQS --outfile OUTFILE [-d DEVICE]

Generate new embeddings using pre-trained language model

optional arguments:
-h, --help            show this help message and exit
--seqs SEQS           Sequences to be embedded
--outfile OUTFILE     h5 file to write results
-d DEVICE, --device DEVICE
                        Compute device to use

Training

usage: dscript train [-h] --train TRAIN --test TEST --embedding EMBEDDING
                 [--no-augment] [--input-dim INPUT_DIM]
                 [--projection-dim PROJECTION_DIM] [--dropout-p DROPOUT_P]
                 [--hidden-dim HIDDEN_DIM] [--kernel-width KERNEL_WIDTH]
                 [--no-w] [--no-sigmoid] [--do-pool]
                 [--pool-width POOL_WIDTH] [--num-epochs NUM_EPOCHS]
                 [--batch-size BATCH_SIZE] [--weight-decay WEIGHT_DECAY]
                 [--lr LR] [--lambda INTERACTION_WEIGHT] [--topsy-turvy]
                 [--glider-weight GLIDER_WEIGHT]
                 [--glider-thresh GLIDER_THRESH] [-o OUTFILE]
                 [--save-prefix SAVE_PREFIX] [-d DEVICE]
                 [--checkpoint CHECKPOINT]

Train a new model.

optional arguments:
  -h, --help            show this help message and exit

Data:
  --train TRAIN         list of training pairs
  --test TEST           list of validation/testing pairs
  --embedding EMBEDDING
                        h5py path containing embedded sequences
  --no-augment          data is automatically augmented by adding (B A) for
                        all pairs (A B). Set this flag to not augment data

Projection Module:
  --input-dim INPUT_DIM
                        dimension of input language model embedding (per amino
                        acid) (default: 6165)
  --projection-dim PROJECTION_DIM
                        dimension of embedding projection layer (default: 100)
  --dropout-p DROPOUT_P
                        parameter p for embedding dropout layer (default: 0.5)

Contact Module:
  --hidden-dim HIDDEN_DIM
                        number of hidden units for comparison layer in contact
                        prediction (default: 50)
  --kernel-width KERNEL_WIDTH
                        width of convolutional filter for contact prediction
                        (default: 7)

Interaction Module:
  --no-w                don't use weight matrix in interaction prediction
                        model
  --no-sigmoid          don't use sigmoid activation at end of interaction
                        model
  --do-pool             use max pool layer in interaction prediction model
  --pool-width POOL_WIDTH
                        size of max-pool in interaction model (default: 9)

Training:
  --num-epochs NUM_EPOCHS
                        number of epochs (default: 10)
  --batch-size BATCH_SIZE
                        minibatch size (default: 25)
  --weight-decay WEIGHT_DECAY
                        L2 regularization (default: 0)
  --lr LR               learning rate (default: 0.001)
  --lambda INTERACTION_WEIGHT
                        weight on the similarity objective (default: 0.35)
  --topsy-turvy         run in Topsy-Turvy mode -- use top-down GLIDER scoring
                        to guide training (reference TBD)
  --glider-weight GLIDER_WEIGHT
                        weight on the GLIDER accuracy objective (default: 0.2)
  --glider-thresh GLIDER_THRESH
                        proportion of GLIDER scores treated as positive edges
                        (0 < gt < 1) (default: 0.925)

Output and Device:
  -o OUTFILE, --outfile OUTFILE
                        output file path (default: stdout)
  --save-prefix SAVE_PREFIX
                        path prefix for saving models
  -d DEVICE, --device DEVICE
                        compute device to use
  --checkpoint CHECKPOINT
                        checkpoint model to start training from
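The augmentation that `--no-augment` disables adds the reversed pair (B A) for every pair (A B). A minimal sketch of that idea follows; it assumes the reversed pair keeps the original label and that no de-duplication is performed, neither of which is confirmed by the help text. `augment_pairs` is a hypothetical helper, not D-SCRIPT's implementation.

```python
def augment_pairs(rows):
    """Return the rows followed by (B, A, label) for every (A, B, label)."""
    rows = list(rows)
    # Assumption: the reversed pair carries the same label as the original.
    return rows + [(b, a, label) for a, b, label in rows]


# Hypothetical labeled pairs, for illustration only.
training = [("protA", "protB", 1), ("protA", "protC", 0)]
augmented = augment_pairs(training)
for row in augmented:
    print(*row, sep="\t")
```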

Evaluation

usage: dscript evaluate [-h] --model MODEL --test TEST --embedding EMBEDDING
                        [-o OUTFILE] [-d DEVICE]

Evaluate a trained model

optional arguments:
-h, --help            show this help message and exit
--model MODEL         Trained prediction model
--test TEST           Test Data
--embedding EMBEDDING
                        h5 file with embedded sequences
-o OUTFILE, --outfile OUTFILE
                        Output file to write results
-d DEVICE, --device DEVICE
                        Compute device to use