API

dscript.alphabets

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]

Bases: object

From Bepler & Berger.

Parameters
  • chars (byte str) – List of characters in alphabet

  • encoding (np.ndarray) – Mapping of characters to numbers [default: None]

  • mask (bool) – Set encoding mask [default: False]

  • missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]

Decode numeric encoding to byte string of this alphabet

Parameters

x (np.ndarray) – Numeric encoding

Returns

Amino acid string

Return type

byte str

encode(x)[source]

Encode a byte string into alphabet indices

Parameters

x (byte str) – Amino acid string

Returns

Numeric encoding

Return type

np.ndarray

get_kmer(h, k)[source]

Retrieve the byte string of length k decoded from integer h

unpack(h, k)[source]

Unpack integer h into an array of this alphabet with length k
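The encode, decode, unpack and get_kmer methods can be illustrated with a minimal pure-Python sketch. TinyAlphabet is a hypothetical stand-in written for this example, not the dscript class. It uses plain lists instead of np.ndarray, and the digit order in unpack (most significant first) is an assumption.

```python
class TinyAlphabet:
    """Hypothetical stand-in for an alphabet class; not the dscript implementation."""

    def __init__(self, chars: bytes, missing: int = 255):
        self.chars = chars
        self.missing = missing
        # 256-entry lookup table: byte value -> alphabet index, or `missing`.
        self.table = [missing] * 256
        for index, byte in enumerate(chars):
            self.table[byte] = index

    def __len__(self):
        return len(self.chars)

    def encode(self, x: bytes) -> list:
        # Map every byte of the input to its alphabet index.
        return [self.table[byte] for byte in x]

    def decode(self, indices) -> bytes:
        # Map alphabet indices back to the bytes they stand for.
        return bytes(self.chars[i] for i in indices)

    def unpack(self, h: int, k: int) -> list:
        # Read h as a k-digit number in base len(alphabet).
        # Assumption: the most significant digit comes first.
        digits = [0] * k
        for position in reversed(range(k)):
            h, digits[position] = divmod(h, len(self))
        return digits

    def get_kmer(self, h: int, k: int) -> bytes:
        return self.decode(self.unpack(h, k))


alphabet = TinyAlphabet(b"ACDE")
encoded = alphabet.encode(b"CAXE")  # "X" is outside the alphabet, so it maps to 255
decoded = alphabet.decode([2, 0, 3])
kmer = alphabet.get_kmer(9, 2)      # 9 = 2 * 4 + 1, i.e. indices [2, 1]
```

A fixed 256-entry table keeps encoding to one lookup per byte, which is why a single missing value can cover every character outside the alphabet.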

class dscript.alphabets.SDM12(mask=False)[source]

Bases: dscript.alphabets.Alphabet

A D KER N TSQ YF LIVM C W H G P

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.

class dscript.alphabets.Uniprot21(mask=False)[source]

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta

dscript.fasta.parse(f: str)[source]

Parse a FASTA file and return a tuple of sequence names and sequences.

Parameters

f (file-like object) – file-like object representing the FASTA file to parse.

Returns

A tuple containing:
  • list of str: Sequence names.

  • list of str: Sequences.

Return type

(list of str, list of str)
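As an illustration of the documented return shape, the following sketch parses FASTA text from a file-like object into two parallel lists. parse_fasta is a hypothetical helper written for this example. Whether dscript.fasta.parse expects text or bytes, and whether it trims the header line, is not stated here, so this version assumes text input and keeps the full header.

```python
import io


def parse_fasta(handle):
    """Hypothetical helper: return (names, sequences) from a text file-like object."""
    names, sequences = [], []
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if names:
                sequences.append("".join(chunks))
            names.append(line[1:])
            chunks = []
        elif names:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if names:
        sequences.append("".join(chunks))
    return names, sequences


example = io.StringIO(">P1 first\nMKT\nAYI\n>P2\nGAVL\n")
names, sequences = parse_fasta(example)
```

The two lists are index-aligned, so names[i] labels sequences[i].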

dscript.fasta.parse_dict(f: str)[source]

Parse a FASTA file and return a dictionary of sequences.

Parameters

f (str) – The FASTA file to parse (file-like object).

Returns

A dictionary where keys are sequence names and values are sequences.

Return type

dict

dscript.fasta.parse_directory(directory, extension='.seq')[source]

Parse all files in a directory with a specific extension and return their names and sequences.

Parameters
  • directory (str) – Directory containing the files to parse.

  • extension (str) – File extension to filter files (default is “.seq”).

Returns

A tuple containing:
  • list of str: Sequence names.

  • list of str: Sequences.

Return type

(list of str, list of str)

dscript.fasta.parse_from_list(f: str, names: list)[source]

Parse a FASTA file and return a dictionary of sequences for specified names.

Parameters
  • f (str) – The FASTA file to parse (file-like object).

  • names (list of str) – List of sequence names to extract from the FASTA file.

Returns

A dictionary where keys are sequence names and values are sequences.

Return type

dict

dscript.fasta.write(nam, seq, f)[source]

Write a set of sequences to a FASTA file.

Parameters
  • nam (list of str) – A list of keys (sequence names).

  • seq (list of str) – A list of sequences.

  • f (file-like object) – The file to write to.
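A minimal sketch of writing parallel name and sequence lists to a file-like object follows. write_fasta is a hypothetical helper written for this example. It assumes a text-mode handle and writes each sequence on a single line, neither of which is confirmed for dscript.fasta.write.

```python
import io


def write_fasta(names, sequences, handle):
    """Hypothetical helper: write one '>name' header and one sequence line per record."""
    for name, sequence in zip(names, sequences):
        handle.write(f">{name}\n{sequence}\n")


buffer = io.StringIO()
write_fasta(["P1", "P2"], ["MKTAYI", "GAVL"], buffer)
text = buffer.getvalue()
```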

dscript.foldseek

dscript.foldseek.get_3di_sequences(pdb_files: list)[source]

Extract 3Di sequences from PDB/mmCIF files using biotite.structure.alphabet.to_3di(atoms). Returns a dict {basename: SeqRecord}.

Currently, this function extracts a 3Di sequence only for the first chain in each PDB file. To extract multiple chains, you will need to modify this function. This keeps naming consistent with the rest of the D-SCRIPT training and inference scripts, which currently require PDB file names to match FASTA header names.

dscript.foldseek.get_foldseek_onehot(n0, size_n0, fold_record, fold_vocab)[source]

fold_record is a dictionary {ensembl_gene_name => foldseek_sequence}.

dscript.glider

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]

dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float
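The formula behind the score is not given here. The sketch below shows one plausible reading: for every neighbor shared by p and q, add the weights of the two edges that connect it to p and to q. Both common_weighted_score and that formula are assumptions for illustration, not the confirmed dscript behavior.

```python
def common_weighted_score(p, q, edgedict, ndict):
    """Hypothetical sketch: sum both edge weights over every shared neighbor."""

    def weight(a, b):
        # Assumption: each undirected edge is stored under one orientation only.
        return edgedict[(a, b)] if (a, b) in edgedict else edgedict[(b, a)]

    shared = ndict[p] & ndict[q]
    return sum(weight(p, r) + weight(q, r) for r in shared)


edgedict = {("a", "c"): 0.5, ("b", "c"): 0.25, ("a", "d"): 1.0}
ndict = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
score = common_weighted_score("a", "b", edgedict, ndict)  # only "c" is shared
```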

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted normalized score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_degree_vec(edgelist)[source]

dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]

dscript.glider.compute_l3_unweighted_mat(A)[source]

dscript.glider.compute_l3_weighted_mat(A)[source]

dscript.glider.compute_pinverse_diagonal(D)[source]

dscript.glider.create_edge_dict(edgelist)[source]

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters

edgelist (list) – list with elements of form (p, q, w)

Returns

A dictionary with key (p, q) and value w.

Return type

dict

dscript.glider.create_neighborhood_dict(edgelist)[source]

Creates a dictionary with nodes as keys and the set of neighboring nodes as values.

Parameters

edgelist (list) – A list with elements of form (p, q, w)

Returns

A dictionary with key p and value a set {p1, p2, p3, …}

Return type

dict
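The two dictionary shapes described above can be sketched in a few lines. build_edge_dict and build_neighborhood_dict are hypothetical names written for this example. Treating each edge as undirected in the neighborhood dictionary is an assumption.

```python
from collections import defaultdict


def build_edge_dict(edgelist):
    """Hypothetical helper: map each (p, q) pair to its weight w."""
    return {(p, q): w for p, q, w in edgelist}


def build_neighborhood_dict(edgelist):
    """Hypothetical helper: map each node to the set of its neighbors."""
    neighbors = defaultdict(set)
    for p, q, _ in edgelist:
        # Assumption: edges are undirected, so each endpoint lists the other.
        neighbors[p].add(q)
        neighbors[q].add(p)
    return dict(neighbors)


edgelist = [("a", "b", 0.9), ("a", "c", 0.4)]
edge_dict = build_edge_dict(edgelist)
neighborhood = build_neighborhood_dict(edgelist)
```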

dscript.glider.densify(edgelist, dim=None, directed=False)[source]

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters
  • edgelist (list) – Graph adjacency list

  • dim (int) – Number of nodes in the graph

  • directed (bool) – Whether the graph should be treated as directed

Returns

Graph as an adjacency matrix

Return type

np.ndarray

dscript.glider.get_dim(edgelist)[source]

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters

edgelist (list) – Graph adjacency list

Returns

Number of nodes in the graph

Return type

int
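To make the relationship between get_dim and densify concrete, here is a pure-Python sketch. count_nodes and to_adjacency are hypothetical names written for this example. The sketch assumes nodes are integer ids 0 to n-1 and edges have the form (p, q, w), and it returns nested lists rather than an np.ndarray.

```python
def count_nodes(edgelist):
    """Hypothetical helper: number of nodes, assuming integer ids 0..n-1."""
    return max(max(p, q) for p, q, _ in edgelist) + 1


def to_adjacency(edgelist, dim=None, directed=False):
    """Hypothetical helper: build an n x n weight matrix as nested lists."""
    n = count_nodes(edgelist) if dim is None else dim
    matrix = [[0.0] * n for _ in range(n)]
    for p, q, w in edgelist:
        matrix[p][q] = w
        if not directed:
            # Mirror the entry so the matrix is symmetric for undirected graphs.
            matrix[q][p] = w
    return matrix


edges = [(0, 1, 0.5), (1, 2, 2.0)]
adjacency = to_adjacency(edges)
directed_adjacency = to_adjacency(edges, directed=True)
```

Passing dim explicitly is useful when the graph contains isolated nodes whose ids never appear in the edge list.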

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]

Return glide_mat and glide_map.

Parameters
  • pos_df (pd.DataFrame) – Dataframe of weighted edges

  • thres_p (float) – Threshold to treat an edge as positive

  • params (dict) – Parameters for GLIDE

Returns

glide_matrix and corresponding glide_map

Return type

tuple(np.ndarray, dict)

Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.

Parameters
  • edgelist (list) – A list with elements of type (p, q, wt)

  • X (np.ndarray) – A nxk embedding matrix

  • params (dict) –

    A dictionary with entries:

    • alpha: real number

    • beta: real number

    • delta: real number

    • loc: String, can be cw for common weighted, l3 for l3 local scoring

    To enable ctypes, the following entries should be there:

    • ctypes_on: True (This key should only be added if ctypes is on)

    • so_location: String location of the .so dynamic library

  • thres_p (float) – Threshold percentile value

Returns

Glide matrix

Return type

np.ndarray
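A params dictionary with the entries listed above might look like the following. All numeric values are placeholders chosen for illustration, not recommended settings, and the so_location path is hypothetical.

```python
# Placeholder values only; tune alpha, beta and delta for your own graph.
glide_params = {
    "alpha": 1.0,
    "beta": 1000.0,
    "delta": 1.0,
    "loc": "cw",  # "cw" for common weighted, "l3" for L3 local scoring
}

# Add the ctypes entries only when ctypes is actually in use.
ctypes_params = dict(
    glide_params,
    ctypes_on=True,
    so_location="/path/to/libglide.so",  # hypothetical path to the .so library
)
```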

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]

dscript.language_model

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]

Embed all files in a directory in .fasta format using pre-trained language model from Bepler & Berger.

Parameters
  • directory (str) – Input directory (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

  • extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters
  • fastaPath (str) – Input sequence file (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)[source]

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters
  • sequence (str) – Input sequence to be embedded

  • use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.load_worker

dscript.loading

class dscript.loading.LoadingPool(file_path, n_jobs=-1, timeout=60)[source]

Bases: object

load(keys, progress=False)[source]

load_once(keys, progress=True)[source]

shutdown()[source]

dscript.pretrained

dscript.pretrained.get_pretrained(version='human_v2')[source]

Get pre-trained model object.

See the documentation for the most up-to-date list.

  • lm_v1 - Language model from Bepler & Berger.

  • human_v1 - Human trained model from D-SCRIPT manuscript.

  • human_v2 - Human trained model from Topsy-Turvy manuscript.

  • human_tt3d - Human trained model with FoldSeek sequence inputs

Default: human_v2

Parameters

version (str) – Version of pre-trained model to get

Returns

Pre-trained model

Return type

dscript.models.*

dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]

Download a pre-trained model if it does not already exist on the local device.

Parameters
  • version (str) – Version of trained model to download [default: human_v2]

  • verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for pre-trained language model

Return type

str

dscript.pretrained.get_state_dict_path(version: str) → str[source]

dscript.pretrained.retry(retry_count: int)[source]

dscript.utils

class dscript.utils.PairedDataset(X0, X1, Y)[source]

Bases: Generic[torch.utils.data.dataset._T_co]

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters
  • X0 – List of first item in the pair

  • X1 – List of second item in the pair

  • Y – List of labels
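The indexing contract of such a paired dataset can be sketched without PyTorch. PairedItems is a hypothetical stand-in written for this example. The real class builds on the PyTorch dataset interface, and whether it validates lengths or converts labels to tensors is not stated here.

```python
class PairedItems:
    """Hypothetical map-style dataset of (first item, second item, label) triples."""

    def __init__(self, X0, X1, Y):
        # Assumption: all three lists must be index-aligned and equally long.
        if not (len(X0) == len(X1) == len(Y)):
            raise ValueError("X0, X1 and Y must have the same length")
        self.X0, self.X1, self.Y = X0, X1, Y

    def __len__(self):
        return len(self.X0)

    def __getitem__(self, i):
        return self.X0[i], self.X1[i], self.Y[i]


dataset = PairedItems(["P1", "P2"], ["P3", "P4"], [1, 0])
second_pair = dataset[1]
```

An object with __len__ and __getitem__ is what a map-style data loader needs in order to draw samples by index.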

dscript.utils.RBF(D, sigma=None)[source]

Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.

\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)

Parameters
  • D (np.ndarray) – Distance matrix

  • sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
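The kernel formula can be worked through on a small matrix. The sketch below reads each entry of D as the distance between two points and applies the formula elementwise. rbf_similarity is a hypothetical name written for this example, and it uses nested lists instead of np.ndarray.

```python
import math


def rbf_similarity(D, sigma=None):
    """Hypothetical sketch: apply exp(-d^2 / (2 * sigma^2)) to every distance d in D."""
    if sigma is None:
        # Documented default bandwidth: square root of the largest distance.
        sigma = math.sqrt(max(max(row) for row in D))
    return [[math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row] for row in D]


D = [[0.0, 2.0], [2.0, 0.0]]
S_unit = rbf_similarity(D, sigma=1.0)  # off-diagonal: exp(-4 / 2) = exp(-2)
S_default = rbf_similarity(D)          # sigma = sqrt(2), off-diagonal: exp(-1)
```

A distance of zero always maps to a similarity of 1, and larger distances decay toward 0 at a rate set by sigma.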

dscript.utils.collate_paired_sequences(args)[source]

Collate function for PyTorch data loader.

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1, return_dict=True)[source]

Load keys from hdf5 file into memory

Parameters
  • file_path (str) – Path to hdf5 file

  • keys (iterable[str]) – List of keys to get

Returns

If return_dict is True, a mapping of keys (protein names) to pointers to embeddings; otherwise, a list of pointers in the same order as keys.

Return type

dict or list

dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]

Legacy log function that wraps loguru for backward compatibility.

Parameters
  • m (str) – Message to log

  • file (file handle or None) – File handle to write to (if None, uses stdout)

  • timestamped (bool) – Whether to include timestamp (handled by loguru)

  • print_also (bool) – Whether to also print to stdout when writing to file

dscript.utils.parse_device(device_arg, logFile)[source]

dscript.utils.setup_logger(log_file=None, also_stdout=False)[source]

Setup loguru logger for D-SCRIPT.

Parameters
  • log_file (file handle, str, or None) – File handle or path to write logs to

  • also_stdout (bool) – Whether to also log to stdout