Substructure Search & Fingerprints¶
Isomorphism, SMARTS queries, query building, and molecular fingerprints.
Substructure Check¶
from chython import smiles
benzene = smiles('c1ccccc1')
toluene = smiles('Cc1ccccc1')
# Operator-based
benzene < toluene # True: benzene is substructure of toluene
benzene <= toluene # True: substructure or equal
benzene < benzene # False: not strict substructure of itself
benzene <= benzene # True: equal
# Method-based
benzene.is_substructure(toluene) # True
benzene.is_equal(toluene) # False
Enumerating Matches¶
get_mapping() yields all substructure mappings as dicts {query_atom: target_atom}:
query = smiles('CC')
target = smiles('CCC')
# First match
mapping = next(query.get_mapping(target))
# All matches (automorphism_filter=True by default skips symmetric duplicates)
for mapping in query.get_mapping(target):
print(mapping)
# All symmetry-equivalent matches
for mapping in query.get_mapping(target, automorphism_filter=False):
print(mapping)
# Restrict search to specific atoms
for mapping in query.get_mapping(target, searching_scope=[1, 2]):
print(mapping)
SMARTS Queries¶
smarts() returns a QueryContainer with pattern-matching semantics:
from chython import smarts, smiles
# Carbonyl
query = smarts('[C]=[O]')
query <= smiles('CC(=O)O') # True
for mapping in query.get_mapping(smiles('CC(=O)O')):
print(mapping)
# Aromatic nitrogen
smarts('[N;a]') <= smiles('c1ccncc1') # True
# Element containment shortcut
'N' in smiles('c1ccncc1') # True
'Br' in smiles('c1ccncc1') # False
SMARTS Language¶
Chython’s SMARTS differs from RDKit/OpenBabel in several ways.
Atom Primitives¶
Standard:
#N- atomic number (#6for carbon)D- degree / neighbor count (D3)h- implicit hydrogen count (h1)r- ring size membership (r5,r6)!R- acyclic (not in any ring)a- aromaticA- any element (wildcard)Charge:
+,-,+2,-3Isotope:
[14C]Stereo:
@,@@
Chython extensions:
z- hybridization:z1= sp3,z2= sp2,z3= sp,z4= aromaticx- heteroatom neighbor count:x0= none,x2= twoM- any metal (d-element)
smarts('[C;z2;x0]') # sp2 carbon, no heteroatom neighbors
smarts('[O;D1;z1;x0][C;D3;x2;z2]=O') # carboxylic acid
smarts('[M]') # any metal
NOT Supported¶
Recursive SMARTS
$(...)Valence
vTotal connectivity
XRing count
Rwithout size (userwith explicit sizes or!R)Implicit AND
&(use;instead)
Logical Operators¶
;= AND between primitives:[C;D3;r6],= OR within same primitive type:[r5,r6],[C,N]
OR cannot mix different primitive types: [D1,h1] raises an error.
smarts('[C;r5,r6;a]') # aromatic C in 5- or 6-membered ring
smarts('[C,N]') # carbon or nitrogen
smarts('[C;!R]') # acyclic carbon
Bond Queries¶
-single,=double,#triple,:aromatic,~anyOR:
-,=(single or double)Negation:
!-(not single),!:(not aromatic)Ring bonds:
-;@(single and in ring),-;!@(single and not in ring)
Ring bond modifiers (@, !@) require an explicit bond order prefix.
smarts('[C]-;!@[C]') # non-ring single bond between two carbons
smarts('[C]-;@[C]') # single bond in a ring
smarts('[C]-,=[C]') # single or double bond
CXSMARTS Extensions¶
Radicals via CXSMARTS notation. Hybridization and heteroatom count use
chython’s z and x primitives instead of CXSMARTS atomProp:
# Radical on atom at position 1
q = smarts('[C] |^1:0|')
# Aromatic C in 5/6-ring, non-ring single bond to radical C
# with 0-1 hydrogens — use z (hybridization) and x (heteroatoms) primitives
q = smarts('[C;r5,r6;a]-;!@[C;h0,h1;z1,z2;x0] |^1:1|')
Query Building API¶
The recommended way to build queries is via SMARTS strings:
from chython import smarts, smiles
# Acyclic thia/oxa carbonyl with 3 neighbors
q = smarts('[C;D3;z2;!R;h0]=[O,S]')
q < smiles('CC(=O)O') # True (acid)
q < smiles('CC(=S)C') # True (thioketone)
q < smiles('CC=O') # False (aldehyde - only 2 neighbors)
For queries that cannot be expressed in SMARTS, build programmatically. Add atoms first, then set query constraints on the atom objects:
from chython import QueryContainer, smiles
from chython.containers.bonds import QueryBond
from chython.periodictable import ListElement
q = QueryContainer('')
q.add_atom('C')
a = q.atom(1)
a.neighbors = 3
a.hybridization = 2
a.ring_sizes = 0
a.implicit_hydrogens = 0
q.add_atom(ListElement(['O', 'S']), n=3)
q.add_bond(1, 3, 2)
q < smiles('CC(=O)O') # True (acid)
q < smiles('CC(=S)C') # True (thioketone)
# Ring-ring linker using QueryBond(order, in_ring)
q = QueryContainer('')
q.add_atom('C')
q.add_atom('C')
q.atom(1).ring_sizes = 6
q.atom(1).hybridization = 4
q.atom(2).ring_sizes = 6
q.atom(2).hybridization = 4
q.add_bond(1, 2, QueryBond(1, False)) # single bond, NOT in ring
q < smiles('c1ccc(cc1)-c1ccccc1') # True (biphenyl)
q < smiles('C1CC(=O)CC1') # False
Query atom constraint values can be single integers or tuples for OR logic:
a = q.atom(1)
a.neighbors = (2, 3) # 2 OR 3 neighbors
a.hybridization = 4 # aromatic only
a.ring_sizes = (5, 6) # in 5- or 6-membered ring
a.implicit_hydrogens = 0 # no implicit H
a.heteroatoms = 0 # no heteroatom neighbors
# ring_sizes = 0 means "not in any ring"
Automorphism¶
mol = smiles('c1ccccc1')
mol.is_automorphic() # True for benzene
for mapping in mol.get_automorphism_mapping():
print(mapping)
Maximum Common Substructure (MCS)¶
mol1 = smiles('c1ccccc1O') # phenol
mol2 = smiles('c1ccccc1N') # aniline
for mapping in mol1.get_mcs_mapping(mol2, limit=10000):
print(mapping) # {mol1_atom: mol2_atom}
break # first = largest MCS
Morgan Fingerprints¶
Similar to ECFP / RDKit Morgan fingerprints:
from chython import smiles
import numpy as np
mol = smiles('c1ccccc1O')
# Binary fingerprint as numpy array (shape: (1024,))
fp = mol.morgan_fingerprint(
min_radius=1,
max_radius=4,
length=1024,
number_active_bits=2,
)
# Bit indices only (more memory efficient)
bits = mol.morgan_bit_set(min_radius=1, max_radius=4, length=1024)
# Raw hashes (no folding) - useful for exact fragment matching
hashes = mol.morgan_hash_set(min_radius=1, max_radius=4)
Linear Fingerprints¶
Based on linear path fragments (similar to RDKit RDKFingerprint):
fp = mol.linear_fingerprint(
min_radius=1,
max_radius=4,
length=1024,
number_active_bits=2,
number_bit_pairs=4, # count-sensitive bits
)
bits = mol.linear_bit_set(min_radius=1, max_radius=4, length=1024)
hashes = mol.linear_hash_set(min_radius=1, max_radius=4)
Tanimoto Similarity¶
import numpy as np
mol1 = smiles('c1ccccc1O')
mol2 = smiles('c1ccccc1N')
fp1 = mol1.morgan_fingerprint()
fp2 = mol2.morgan_fingerprint()
# Via numpy
tanimoto = np.dot(fp1, fp2) / (fp1.sum() + fp2.sum() - np.dot(fp1, fp2))
# Via bit sets (faster)
bits1 = mol1.morgan_bit_set()
bits2 = mol2.morgan_bit_set()
tanimoto = len(bits1 & bits2) / len(bits1 | bits2)