RDKit
Open-source cheminformatics and machine learning.
RDKit::MHFPFingerprints::MHFPEncoder Class Reference

#include <MHFP.h>

Public Member Functions

 MHFPEncoder (unsigned int n_permutations=2048, unsigned int seed=42)
 Constructor.
 
std::vector< uint32_t > FromStringArray (const std::vector< std::string > &vec)
 Creates a MinHash from a vector of strings.
 
std::vector< uint32_t > FromArray (const std::vector< uint32_t > &vec)
 Creates a MinHash from a list of unsigned integers.
 
std::vector< std::string > CreateShingling (const ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 Creates a molecular shingling based on circular substructures.
 
std::vector< std::string > CreateShingling (const std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< uint32_t > Encode (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 Creates a MinHash vector from a molecule.
 
std::vector< std::vector< uint32_t > > Encode (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< uint32_t > Encode (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< std::vector< uint32_t > > Encode (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
ExplicitBitVect EncodeSECFP (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
 Creates a binary fingerprint based on circular sub-SMILES.
 
std::vector< ExplicitBitVect > EncodeSECFP (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
ExplicitBitVect EncodeSECFP (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< ExplicitBitVect > EncodeSECFP (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 

Static Public Member Functions

static double Distance (const std::vector< uint32_t > &a, const std::vector< uint32_t > &b)
 Calculates the Hamming distance between two MHFP fingerprints.
 

Detailed Description

Definition at line 44 of file MHFP.h.

Constructor & Destructor Documentation

◆ MHFPEncoder()

RDKit::MHFPFingerprints::MHFPEncoder::MHFPEncoder ( unsigned int  n_permutations = 2048,
unsigned int  seed = 42 
)

Constructor.

Constructs an MHFPEncoder.

The MHFPEncoder class is instantiated with a given number of permutations and a seed. Fingerprints / minhashes created with a different number of permutations or a different seed are not compatible with each other and should not be compared.

Parameters
n_permutations  the number of permutations used to create hash functions. This will be the dimensionality of the resulting vector. Default: 2048.
seed  a random seed. Default: 42.
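
The incompatibility between encoders can be illustrated with a minimal sketch. It assumes each permutation is defined by a pair of random coefficients drawn from a seeded generator; `make_permutations` is a hypothetical helper and does not reflect the generator or constants RDKit actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper, not part of RDKit: draws one (a, b) coefficient pair
// per permutation from a seeded generator. Each pair defines one hash function.
std::vector<std::pair<uint32_t, uint32_t>> make_permutations(
    unsigned int n_permutations, unsigned int seed) {
  assert(n_permutations > 0);
  std::mt19937 rng(seed);  // the same seed yields the same sequence of draws
  std::vector<std::pair<uint32_t, uint32_t>> perms;
  perms.reserve(n_permutations);
  for (unsigned int i = 0; i < n_permutations; ++i) {
    const uint32_t a = static_cast<uint32_t>(rng());
    const uint32_t b = static_cast<uint32_t>(rng());
    perms.emplace_back(a, b);
  }
  return perms;
}
```

Under this assumption, two encoders built with the same arguments share identical hash functions, while a different seed or a different count changes the functions (or the vector length), so position i of one fingerprint no longer corresponds to position i of the other.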

Member Function Documentation

◆ CreateShingling() [1/2]

std::vector< std::string > RDKit::MHFPFingerprints::MHFPEncoder::CreateShingling ( const ROMol &  mol,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

Creates a molecular shingling based on circular substructures.

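A molecular shingling is a vector of SMILES strings that are extracted from a molecule and represent it. This method extracts substructures centered at each atom of the molecule at each radius from min_radius up to radius. With the default min_radius of 1 and a radius of 3, a molecule with 10 atoms generates up to 10 * 3 = 30 atom-centered shingles; ring shingles are added on top of these when rings is true.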

Parameters
radius  the maximum radius of the substructure that is generated at each atom. Default: 3.
rings  whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C", "C1CCCCCC1" would be added to the shingling. Default: true.
isomeric  whether the SMILES added to the shingling are isomeric. Default: false.
kekulize  whether the SMILES added to the shingling are kekulized. Default: false. NOTE that this will throw an exception if the molecule cannot be kekulized.
min_radius  the minimum radius that is used to extract n-grams. Default: 1.
Returns
the shingling of a molecule.
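
The shingle count arithmetic above can be written out as a small sketch. `max_shingle_count` is a hypothetical helper; it assumes one shingle per atom for each radius in [min_radius, radius] plus one shingle per ring, and it gives an upper bound rather than the exact size of the returned vector.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of RDKit: upper bound on the number of
// shingles, assuming one per atom per radius in [min_radius, radius] and
// one per extracted ring.
std::size_t max_shingle_count(std::size_t n_atoms, unsigned char radius,
                              unsigned char min_radius, std::size_t n_rings) {
  assert(min_radius <= radius);
  const std::size_t n_radii = static_cast<std::size_t>(radius - min_radius) + 1;
  return n_atoms * n_radii + n_rings;
}
```

For the 10-atom example with radius 3 and min_radius 1, the bound is 30 without rings and 31 if a single ring is extracted.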

◆ CreateShingling() [2/2]

std::vector< std::string > RDKit::MHFPFingerprints::MHFPEncoder::CreateShingling ( const std::string &  smiles,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ Distance()

static double RDKit::MHFPFingerprints::MHFPEncoder::Distance ( const std::vector< uint32_t > &  a,
const std::vector< uint32_t > &  b 
)
inline static

Calculates the Hamming distance between two MHFP fingerprints.

Parameters
a  an MHFP fingerprint vector.
b  an MHFP fingerprint vector.
Returns
the Hamming distance between the two fingerprints.

Definition at line 239 of file MHFP.h.
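
As a rough illustration of a Hamming distance over MinHash vectors, the sketch below counts the positions at which two equal-length vectors differ. The function names are hypothetical, and dividing by the vector length is an assumption suggested by the double return type, not something this page confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the positions at which two equal-length fingerprints differ.
std::size_t hamming_count(const std::vector<uint32_t> &a,
                          const std::vector<uint32_t> &b) {
  assert(a.size() == b.size());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] != b[i]) {
      ++mismatches;
    }
  }
  return mismatches;
}

// Assumption: normalizes the count by the vector length, giving a value
// in [0, 1] where 0 means the fingerprints are identical.
double hamming_fraction(const std::vector<uint32_t> &a,
                        const std::vector<uint32_t> &b) {
  assert(!a.empty());
  return static_cast<double>(hamming_count(a, b)) /
         static_cast<double>(a.size());
}
```

Either form is only meaningful for fingerprints produced with the same number of permutations and the same seed.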

◆ Encode() [1/4]

std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::Encode ( ROMol &  mol,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

Creates a MinHash vector from a molecule.

This method is a wrapper around MHFPEncoder::CreateShingling and MHFPEncoder::FromStringArray. When a vector of molecules or SMILES is passed and RDKit was compiled with OpenMP, the encoding is parallelized, with a speedup of up to the number of cores.

Parameters
radius  the maximum radius of the substructure that is generated at each atom. Default: 3.
rings  whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C", "C1CCCCCC1" would be added to the shingling. Default: true.
isomeric  whether the SMILES added to the shingling are isomeric. Default: false.
kekulize  whether the SMILES added to the shingling are kekulized. Default: false. NOTE that this will throw an exception if the molecule cannot be kekulized.
min_radius  the minimum radius that is used to extract n-grams. Default: 1.
Returns
the MHFP fingerprint.
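
The batch behaviour described above can be sketched as an OpenMP loop in which each input is encoded independently. Both `encode_one` and `encode_all` are hypothetical stand-ins: `encode_one` only records the string length so that the example stays self-contained, and it does not compute a real fingerprint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder for the per-input work (stands in for encoding one SMILES).
std::vector<uint32_t> encode_one(const std::string &smiles) {
  return {static_cast<uint32_t>(smiles.size())};
}

// Encodes every input independently. Each iteration writes to its own
// slot, so the loop can run in parallel without locking.
std::vector<std::vector<uint32_t>> encode_all(
    const std::vector<std::string> &smiles) {
  std::vector<std::vector<uint32_t>> results(smiles.size());
  const long n = static_cast<long>(smiles.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (long i = 0; i < n; ++i) {
    const std::size_t idx = static_cast<std::size_t>(i);
    results[idx] = encode_one(smiles[idx]);
  }
  assert(results.size() == smiles.size());
  return results;
}
```

Without OpenMP the pragma is skipped and the loop runs sequentially, which matches the note that parallelization depends on how the library was compiled. Results keep the order of the inputs in both cases.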

◆ Encode() [2/4]

std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::Encode ( std::string &  smiles,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ Encode() [3/4]

std::vector< std::vector< uint32_t > > RDKit::MHFPFingerprints::MHFPEncoder::Encode ( std::vector< ROMol > &  mols,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ Encode() [4/4]

std::vector< std::vector< uint32_t > > RDKit::MHFPFingerprints::MHFPEncoder::Encode ( std::vector< std::string > &  smiles,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ EncodeSECFP() [1/4]

ExplicitBitVect RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP ( ROMol &  mol,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1,
size_t  length = 2048 
)

Creates a binary fingerprint based on circular sub-SMILES.

Creates a binary fingerprint similar to ECFP. However, instead of using a Morgan-style hashing, circular n-grams (sub-SMILES) are created, hashed directly and folded.

Parameters
radius  the maximum radius of the substructure that is generated at each atom. Default: 3.
rings  whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C", "C1CCCCCC1" would be added to the shingling. Default: true.
isomeric  whether the SMILES added to the shingling are isomeric. Default: false.
kekulize  whether the SMILES added to the shingling are kekulized. Default: false. NOTE that this will throw an exception if the molecule cannot be kekulized.
min_radius  the minimum radius that is used to extract n-grams. Default: 1.
length  the length into which the fingerprint is folded. Default: 2048.
Returns
the SECFP fingerprint.
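
The folding step can be illustrated with a short sketch that maps 32-bit hash values onto a fixed-length bit vector. `fold_hashes` is a hypothetical helper; folding by taking each hash modulo the length is an assumption about a common approach and is not confirmed by this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of RDKit: sets one bit per hash value by
// taking the hash modulo the target length. Distinct hashes can collide
// on the same bit.
std::vector<bool> fold_hashes(const std::vector<uint32_t> &hashes,
                              std::size_t length) {
  assert(length > 0);
  std::vector<bool> bits(length, false);
  for (const uint32_t h : hashes) {
    bits[h % length] = true;
  }
  return bits;
}
```

With a length of 2048, the hashes 5 and 2053 land on the same bit, which is the information loss that a shorter length trades for a smaller fingerprint.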

◆ EncodeSECFP() [2/4]

ExplicitBitVect RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP ( std::string &  smiles,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1,
size_t  length = 2048 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ EncodeSECFP() [3/4]

std::vector< ExplicitBitVect > RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP ( std::vector< ROMol > &  mols,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1,
size_t  length = 2048 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ EncodeSECFP() [4/4]

std::vector< ExplicitBitVect > RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP ( std::vector< std::string > &  smiles,
unsigned char  radius = 3,
bool  rings = true,
bool  isomeric = false,
bool  kekulize = false,
unsigned char  min_radius = 1,
size_t  length = 2048 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ FromArray()

std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::FromArray ( const std::vector< uint32_t > &  vec)

Creates a MinHash from a list of unsigned integers.

This method is exposed in order to enable advanced usage of this MHFP implementation such as MinHashing a sparse array generated by another fingerprint (e.g. Morgan / ECFP).

Parameters
vec  a vector containing unsigned integers.
Returns
the MinHash of the input.
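
For readers unfamiliar with MinHash, the sketch below shows the general idea on unsigned integers: apply each hash function to every input value and keep the minimum. `min_hash`, the linear hash form and the modulus are illustrative assumptions and do not reproduce the constants or hash functions RDKit uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch of MinHash: for every hash function
// h_i(x) = (a_i * x + b_i) mod p, keep the smallest value over the input.
std::vector<uint32_t> min_hash(const std::vector<uint32_t> &values,
                               const std::vector<uint32_t> &a,
                               const std::vector<uint32_t> &b) {
  assert(a.size() == b.size());
  // 2^31 - 1, a Mersenne prime, chosen here so results fit in 32 bits.
  const uint64_t prime = 2147483647ULL;
  std::vector<uint32_t> mins(a.size(), std::numeric_limits<uint32_t>::max());
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (const uint32_t x : values) {
      // 64-bit arithmetic avoids overflow of a_i * x + b_i.
      const uint64_t h = (static_cast<uint64_t>(a[i]) * x + b[i]) % prime;
      if (h < mins[i]) {
        mins[i] = static_cast<uint32_t>(h);
      }
    }
  }
  return mins;
}
```

The output has one entry per hash function and does not depend on the order of the input values, which is why a sparse set of feature identifiers from another fingerprint can be passed in directly.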

◆ FromStringArray()

std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::FromStringArray ( const std::vector< std::string > &  vec)

Creates a MinHash from a vector of strings.

This method is exposed in order to enable advanced usage of this MHFP implementation such as customizing the properties that are hashed in order to create an MHFP instance. In theory, any number of values that can be represented as strings can be minhashed. This method is called by MHFPEncoder::Encode.

Parameters
vec  a vector containing strings (e.g. the SMILES shingling of a molecule).
Returns
the MinHash of the input.
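
One way to picture the string variant is as a preliminary step that turns each string into a 32-bit integer, after which the integers can be minhashed. The sketch below uses 32-bit FNV-1a as a stand-in; `fnv1a_32` and `hash_strings` are hypothetical names, and the hash function used inside RDKit is not stated on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a over the bytes of a string (stand-in hash for this sketch).
uint32_t fnv1a_32(const std::string &s) {
  uint32_t hash = 2166136261u;  // FNV offset basis
  for (const unsigned char c : s) {
    hash ^= c;
    hash *= 16777619u;  // FNV prime; wraps modulo 2^32
  }
  return hash;
}

// Hypothetical helper: maps each string (e.g. one shingle) to an integer.
std::vector<uint32_t> hash_strings(const std::vector<std::string> &vec) {
  std::vector<uint32_t> hashes;
  hashes.reserve(vec.size());
  for (const std::string &s : vec) {
    hashes.push_back(fnv1a_32(s));
  }
  assert(hashes.size() == vec.size());
  return hashes;
}
```

Because any value that can be written as a string can be hashed this way, custom properties can be mixed into the input alongside SMILES shingles, as the description above suggests.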

The documentation for this class was generated from the following file: