API

dscript.alphabets

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]

Bases: object

From Bepler & Berger.

Parameters
  • chars (byte str) – List of characters in alphabet

  • encoding (np.ndarray) – Mapping of characters to numbers [default: None]

  • mask (bool) – Set encoding mask [default: False]

  • missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]

Decode numeric encoding to byte string of this alphabet

Parameters

x (np.ndarray) – Numeric encoding

Returns

Amino acid string

Return type

byte str

encode(x)[source]

Encode a byte string into alphabet indices

Parameters

x (byte str) – Amino acid string

Returns

Numeric encoding

Return type

np.ndarray

get_kmer(h, k)[source]

Retrieve the byte string of length k decoded from integer h

unpack(h, k)[source]

Unpack integer h into an array of this alphabet with length k
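The encode/decode round trip above can be sketched in a few lines. This is an illustrative pure-Python reimplementation, not the D-SCRIPT source (the real class stores the encoding as a NumPy array); `SimpleAlphabet` is a hypothetical name:

```python
# Minimal sketch of the encode/decode behavior of dscript.alphabets.Alphabet.
class SimpleAlphabet:
    def __init__(self, chars: bytes, missing: int = 255):
        self.chars = chars
        # Map all 256 byte values to `missing`, then overwrite positions
        # that belong to the alphabet with their indices.
        self.encoding = [missing] * 256
        for i, c in enumerate(chars):
            self.encoding[c] = i

    def encode(self, x: bytes) -> list:
        """Byte string -> list of alphabet indices."""
        return [self.encoding[c] for c in x]

    def decode(self, idx) -> bytes:
        """List of alphabet indices -> byte string."""
        return bytes(self.chars[i] for i in idx)

aa = SimpleAlphabet(b"ARNDCQEGHILKMFPSTWYVX")
idx = aa.encode(b"MSKGE")
assert aa.decode(idx) == b"MSKGE"
assert aa.encode(b"B") == [255]  # 'B' is outside the alphabet -> `missing`
```

A character outside the alphabet maps to the `missing` value rather than raising, mirroring the `missing=255` parameter documented above.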

class dscript.alphabets.SDM12(mask=False)[source]

Bases: dscript.alphabets.Alphabet

A D KER N TSQ YF LIVM C W H G P

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.

class dscript.alphabets.Uniprot21(mask=False)[source]

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta

dscript.fasta.count_bins(array, bins)[source]
dscript.fasta.parse(f, comment='#')[source]
dscript.fasta.parse_directory(directory, extension='.seq')[source]
dscript.fasta.write(nam, seq, f)[source]
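A hedged sketch of what `parse` does: read a FASTA stream and return parallel lists of names and sequences. The exact return types of the real function (byte vs. unicode strings, comment handling) should be checked against the source; `parse_fasta` here is an illustrative stand-in:

```python
import io

# Illustrative FASTA parser mirroring the shape of dscript.fasta.parse.
def parse_fasta(handle, comment="#"):
    names, sequences = [], []
    seq_parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(comment):
            continue  # skip blanks and comment lines
        if line.startswith(">"):
            # New record: flush the previous sequence, if any.
            if seq_parts:
                sequences.append("".join(seq_parts))
                seq_parts = []
            names.append(line[1:])
        else:
            seq_parts.append(line)  # sequences may span multiple lines
    if seq_parts:
        sequences.append("".join(seq_parts))
    return names, sequences

names, seqs = parse_fasta(io.StringIO(">prot1\nMSKGE\nELFT\n>prot2\nARND\n"))
assert names == ["prot1", "prot2"]
assert seqs == ["MSKGEELFT", "ARND"]
```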

dscript.language_model

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]

Embed all .fasta-format files in a directory using the pre-trained language model from Bepler & Berger.

Parameters
  • directory (str) – Input directory (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

  • extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters
  • fastaPath (str) – Input sequence file (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)[source]

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters
  • sequence (str) – Input sequence to be embedded

  • use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.pretrained

dscript.pretrained.get_pretrained(version='human_v2')[source]

Get pre-trained model object.

See the documentation for the most up-to-date list.

  • lm_v1 - Language model from Bepler & Berger.

  • human_v1 - Human trained model from D-SCRIPT manuscript.

  • human_v2 - Human trained model from Topsy-Turvy manuscript.

Default: human_v2

Parameters

version (str) – Version of pre-trained model to get

Returns

Pre-trained model

Return type

dscript.models.*

dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]

Download a pre-trained model if it does not already exist on the local device.

Parameters
  • version (str) – Version of trained model to download [default: human_v2]

  • verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for pre-trained language model

Return type

str

dscript.pretrained.get_state_dict_path(version: str) → str[source]
dscript.pretrained.retry(retry_count: int)[source]

dscript.glider

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]
dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and value a set {p1, p2, …} of the neighbors of p

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float
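A hedged sketch of a common-weighted score on the `edgedict`/`ndict` structures described above. The exact formula D-SCRIPT uses may differ; here the score is assumed to sum the edge weights from p and q to each common neighbor, and `cw_score` is an illustrative name:

```python
# Sketch of a common-weighted score between nodes p and q.
def cw_score(p, q, edgedict, ndict):
    def w(a, b):
        # Edges are undirected, so look the weight up under either key order.
        return edgedict.get((a, b), edgedict.get((b, a), 0.0))
    common = ndict[p] & ndict[q]  # neighbors shared by p and q
    return sum(w(p, r) + w(q, r) for r in common)

edgedict = {("a", "c"): 1.0, ("b", "c"): 2.0, ("a", "d"): 0.5}
ndict = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
assert cw_score("a", "b", edgedict, ndict) == 3.0  # only common neighbor is c
```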

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted normalized score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and value a set {p1, p2, …} of the neighbors of p

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_degree_vec(edgelist)[source]
dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]
dscript.glider.compute_l3_unweighted_mat(A)[source]
dscript.glider.compute_l3_weighted_mat(A)[source]
dscript.glider.compute_pinverse_diagonal(D)[source]
dscript.glider.create_edge_dict(edgelist)[source]

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters

edgelist (list) – A list with elements of form (p, q, w)

Returns

A dictionary with key (p, q) and value w.

Return type

dict

dscript.glider.create_neighborhood_dict(edgelist)[source]

Create a dictionary with each node as the key and the set of its neighbors as the value

Parameters

edgelist (list) – A list with elements of form (p, q, w)

Returns

A dictionary with key p and value a set {p1, p2, p3, …} of the neighbors of p

Return type

dict
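The two helpers above can be sketched on a tiny edge list. This is an illustrative reimplementation of their documented behavior, not the D-SCRIPT source:

```python
# Build the (p, q) -> w edge dictionary described for create_edge_dict.
def create_edge_dict(edgelist):
    return {(p, q): w for p, q, w in edgelist}

# Build the node -> set-of-neighbors dictionary described for
# create_neighborhood_dict; edges are treated as undirected.
def create_neighborhood_dict(edgelist):
    ndict = {}
    for p, q, _ in edgelist:
        ndict.setdefault(p, set()).add(q)
        ndict.setdefault(q, set()).add(p)
    return ndict

edges = [("a", "b", 1.0), ("b", "c", 0.5)]
assert create_edge_dict(edges) == {("a", "b"): 1.0, ("b", "c"): 0.5}
assert create_neighborhood_dict(edges)["b"] == {"a", "c"}
```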

dscript.glider.densify(edgelist, dim=None, directed=False)[source]

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters
  • edgelist (list) – Graph adjacency list

  • dim (int) – Number of nodes in the graph

  • directed (bool) – Whether the graph should be treated as directed

Returns

Graph as an adjacency matrix

Return type

np.ndarray

dscript.glider.get_dim(edgelist)[source]

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters

edgelist (list) – Graph adjacency list

Returns

Number of nodes in the graph

Return type

int
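`get_dim` and `densify` can be sketched together, assuming nodes are 0-based integer indices. The real `densify` returns an np.ndarray; plain nested lists are used here so the sketch is self-contained:

```python
# Number of nodes = largest node index in the edge list, plus one.
def get_dim(edgelist):
    return max(max(p, q) for p, q, _ in edgelist) + 1

# Turn the adjacency list into a dense adjacency matrix.
def densify(edgelist, dim=None, directed=False):
    n = dim if dim is not None else get_dim(edgelist)
    A = [[0.0] * n for _ in range(n)]
    for p, q, w in edgelist:
        A[p][q] = w
        if not directed:
            A[q][p] = w  # mirror the edge for undirected graphs
    return A

edges = [(0, 1, 1.0), (1, 2, 0.5)]
assert get_dim(edges) == 3
A = densify(edges)
assert A[1][0] == 1.0 and A[2][1] == 0.5
```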

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]

Return glide_mat and glide_map.

Parameters
  • pos_df (pd.DataFrame) – DataFrame of weighted edges

  • thres_p (float) – Threshold to treat an edge as positive

  • params (dict) – Parameters for GLIDE

Returns

glide_matrix and corresponding glide_map

Return type

tuple(np.ndarray, dict)

dscript.glider.glide_predict_links(edgelist, X, params={})[source]

Predicts the most likely links in a graph, given an embedding X of the graph. Returns a ranked list of (edge, distance) pairs, sorted from closest to furthest.

Parameters
  • edgelist – A list with elements of type (p, q, wt)

  • X – A nxk embedding matrix

  • params – A dictionary with entries:

{
    alpha => real number
    beta => real number
    delta => real number
    loc => String, can be cw for common weighted or l3 for L3 local scoring

    ### To enable ctypes, the following entries should be present ###
    ctypes_on => True (add this key only if ctypes is enabled; omit it otherwise)
    so_location => String, location of the .so dynamic library
}
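As a Python literal, a params dictionary following the entries described above might look like this. The numeric values are placeholder assumptions, not recommended defaults:

```python
# Illustrative GLIDE params dict; check the D-SCRIPT source for sensible values.
params = {
    "alpha": 0.1,    # real number (placeholder)
    "beta": 1000.0,  # real number (placeholder)
    "delta": 1.0,    # real number (placeholder)
    "loc": "cw",     # "cw" = common weighted, "l3" = L3 local scoring
    # To enable ctypes, additionally set (omit both keys otherwise):
    # "ctypes_on": True,
    # "so_location": "/path/to/library.so",
}
assert params["loc"] in {"cw", "l3"}
```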

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]

dscript.utils

class dscript.utils.PairedDataset(X0, X1, Y)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters
  • X0 – List of first item in the pair

  • X1 – List of second item in the pair

  • Y – List of labels

dscript.utils.RBF(D, sigma=None)[source]

Convert a distance matrix into a similarity matrix using the Radial Basis Function (RBF) kernel.

\(\mathrm{RBF}(x, x') = \exp\left(\frac{-(x - x')^{2}}{2\sigma^{2}}\right)\)

Parameters
  • D (np.ndarray) – Distance matrix

  • sigma (float) – Bandwidth of the RBF kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
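The conversion above can be sketched in pure Python (the real function operates on an np.ndarray), with sigma defaulting to the square root of the largest distance, as documented; `rbf` is an illustrative stand-in:

```python
import math

# Apply RBF(x, x') = exp(-d^2 / (2 * sigma^2)) entrywise to a distance matrix D,
# where each entry d of D already equals (x - x').
def rbf(D, sigma=None):
    if sigma is None:
        sigma = math.sqrt(max(max(row) for row in D))
    return [[math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row] for row in D]

D = [[0.0, 4.0], [4.0, 0.0]]
S = rbf(D)
assert S[0][0] == 1.0           # zero distance -> similarity 1
assert S[0][1] == math.exp(-2)  # sigma = sqrt(4) = 2, so exp(-16 / 8)
```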

dscript.utils.collate_paired_sequences(args)[source]

Collate function for PyTorch data loader.

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1)[source]

Load keys from an HDF5 file into memory

Parameters
  • file_path (str) – Path to hdf5 file

  • keys (list[str]) – List of keys to get

  • n_jobs (int) – Number of parallel jobs [default: -1]

Returns

Dictionary with keys and records in memory

Return type

dict

dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]

Write message m to file and/or standard output, optionally prefixed with a timestamp.