Why Chemical Identifiers are the Unsung Heroes of Innovation
How do you tell if two chemicals are the same? Before the digital age, a chemist would examine a drawing of the structure or check the names and decide if they matched. But with databases of chemicals containing millions or billions of compounds, we need a computer to be able to match chemicals for us. This is where chemical identifiers come in.
A chemical identifier represents a chemical in a way that a computer can understand, so we can check if two inputs are the same. Here are some candidates that we might consider but that don’t get the job done because they’re not specific enough:
- Name: A given chemical can have many names. For example, benzene is also called benzol and cyclohexatriene. And that’s before we consider languages besides English.
- Molecular formula: One molecular formula can apply to many different chemicals. For example, the molecular formula C4H10 fits both butane and isobutane, but they are different molecules with different properties.
  [Images: butane, isobutane]
- Molecular weight: This is even less specific than molecular formula. For example, butane (C₄H₁₀) and propionaldehyde (C₃H₆O) both have a molecular weight of 58.1 grams per mole, but they are made of different atoms: propionaldehyde contains an oxygen atom and butane does not.
While these are not specific enough to be identifiers, they can serve as useful checks that the molecule found to be a match with an identifier is reasonable.
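As a rough illustration of such a check, here is a minimal sketch that computes a molecular weight from a molecular formula. The function names and the small table of rounded atomic weights are our own assumptions for this example, and the parser only handles simple formulas without parentheses or charges.

```python
import re

# Rounded standard atomic weights in g/mol; only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C4H10'."""
    counts = {}
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(digits) if digits else 1)
    return counts

def molecular_weight(formula):
    """Sum the atomic weights of all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in parse_formula(formula).items())

butane = molecular_weight("C4H10")
propionaldehyde = molecular_weight("C3H6O")

# Same weight to one decimal place, yet different formulas.
print(round(butane, 1), round(propionaldehyde, 1))
print(parse_formula("C4H10") == parse_formula("C3H6O"))
```

The two weights agree to one decimal place even though the formulas differ, which is why a matching weight can support a match but cannot establish one.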
Computer-friendly chemical identifiers
Chemists often recognize molecules by their molecular structure, a drawing of the atoms and their bonds. But that’s a two-dimensional representation (which might try to convey three-dimensional information), and a computer does better with a one-dimensional identifier such as a string of characters or digits.
Types of chemical identifiers
There are two broad categories of chemical identifiers.
Molecular structure (graph)
In a mathematical sense, we can think of a molecular structure as a graph where the atoms are nodes and the bonds are edges. So one type of chemical identifier reduces that graph to one dimension, called a line notation.
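To make the graph picture concrete, here is a small sketch using plain Python dictionaries. The data layout and the `degree_sequence` helper are assumptions made for this example, not a standard format. It shows that butane and isobutane, which share a molecular formula, are different graphs.

```python
# Each molecule is a graph: atoms are nodes (index -> element) and
# bonds are edges (pairs of atom indices). Hydrogens are left out for brevity.
butane = {
    "atoms": {0: "C", 1: "C", 2: "C", 3: "C"},
    "bonds": [(0, 1), (1, 2), (2, 3)],  # straight chain
}
isobutane = {
    "atoms": {0: "C", 1: "C", 2: "C", 3: "C"},
    "bonds": [(0, 1), (1, 2), (1, 3)],  # branched at atom 1
}

def degree_sequence(mol):
    """Sorted number of bonds per atom, a simple graph invariant."""
    degree = {index: 0 for index in mol["atoms"]}
    for a, b in mol["bonds"]:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.values())

print(degree_sequence(butane))     # [1, 1, 2, 2]
print(degree_sequence(isobutane))  # [1, 1, 1, 3]
```

A differing degree sequence proves two graphs are different; an identical one does not prove they are the same, so real tools compare the full graph.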
SMILES: Simplified Molecular Input Line Entry System
Created in the 1980s, SMILES represents the atoms and bonds in a molecule in a straightforward way. For example, the SMILES for acetaldehyde can be written as CC=O where
- C and O are the atomic symbols for carbon and oxygen, respectively
- = indicates a double bond between the second carbon and the oxygen
- The implied bond between the two carbon atoms (CC) is a single bond: If no bond type is given, it’s assumed to be a single bond
- Also implied are the hydrogen atoms. For example, the first carbon has only one explicit bond, so there are three implied hydrogen atoms on that carbon atom.
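The rules above can be sketched in a few lines of code. This is a toy parser written for this post, not a real SMILES reader: it assumes an unbranched chain made only of C, N, and O atoms with their usual valences, and it ignores rings, branches, charges, and aromaticity.

```python
# Usual valences for a few atoms (an illustrative subset).
VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def implicit_hydrogens(smiles):
    """Return [(element, implicit_H_count), ...] for an unbranched SMILES."""
    atoms = []   # element symbols, in order
    used = []    # total bond order already used by each atom
    pending = 1  # order of the bond to the next atom; single if not stated
    for ch in smiles:
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch in VALENCE:
            if atoms:
                # Bond this atom to the previous atom in the chain.
                used[-1] += pending
                used.append(pending)
            else:
                used.append(0)
            atoms.append(ch)
            pending = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    # Whatever valence is left over is filled with hydrogen atoms.
    return [(el, VALENCE[el] - u) for el, u in zip(atoms, used)]

print(implicit_hydrogens("CC=O"))  # [('C', 3), ('C', 1), ('O', 0)]
```

For acetaldehyde this gives three hydrogens on the first carbon, one on the second, and none on the oxygen, for a molecular formula of C2H4O.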
One limitation of SMILES is that it describes a molecule as a static graph, whereas in reality a molecule may rapidly interconvert to a different form via tautomerization. In other words, we can write a SMILES, but the atoms in the molecule may rearrange themselves to form a more energetically favorable structure, which will have a different SMILES. This means that the structure encoded in a database will not necessarily be the form in which the molecule mostly exists in reality. It also means that a database search by SMILES can miss the molecule: the SMILES you searched for may differ from the SMILES in the database even though the two describe tautomers and are thus likely functionally equivalent.
By default, SMILES do not include stereochemistry (the arrangement of atoms in 3D space), but isomeric SMILES can encode it.
InChI: International Chemical Identifier
The InChI format attempts to handle such subtleties by describing the molecule in multiple layers, including:
- Chemical formula
- Hydrogen atoms sub-layer
- Charge layer
- Stereochemical layer
The goal is to incorporate tautomerism into the InChI identifier so that all tautomer structures of a given molecule can be represented by one InChI identifier, though making InChI tautomer-invariant is still a work in progress.
To facilitate searching large databases, the InChI can be hashed into an InChIKey, a fixed-length string of 27 characters. If the database stores InChIKeys, you can convert your molecule’s InChI to an InChIKey and search on that, which avoids the inefficiency of matching the long InChI strings of large chemicals.
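Here is a minimal sketch of looking up a molecule by its key. The in-memory `database` dictionary and the `looks_like_inchikey` helper are hypothetical, and the key is precomputed: generating a real InChIKey from a structure requires the InChI software or a cheminformatics package, which this example does not attempt.

```python
import re

# General shape of a standard InChIKey: 14 letters, hyphen, 10 letters,
# hyphen, 1 letter, for 27 characters in total.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(text):
    """Cheap format check before querying; it does not validate the hash."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None

# Hypothetical database indexed by InChIKey (ethanol as the single entry).
database = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {
        "name": "ethanol",
        "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    },
}

query = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
if looks_like_inchikey(query):
    record = database.get(query)
    print(record["name"] if record else "not found")
```

Because the key is a hash, it cannot be turned back into a structure on its own, so a database typically stores the full InChI alongside it.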
Molecular structure (graph) advantages and disadvantages
Advantages
- Because the identifier encodes the molecular graph, we can use the identifier to create the molecule in code, and then extract desired properties of the molecule.
Disadvantages
- There may be many ways to represent a molecule in a given line notation, so we need to be careful to compare canonicalized (standardized) versions. Further, different cheminformatics packages produce different “canonical” identifiers (e.g. SMILES) for the same input, so you must canonicalize all identifiers with the same package.
- For large molecules, the identifier will be long, which is inconvenient for storing and searching.
- Multiple molecules can differ only subtly, for example only in stereochemistry. This adds complexity to molecular structure identifiers and makes it easy to confuse such similar molecules by visual inspection of their identifiers.
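The canonicalization point can be illustrated with a deliberately simplified sketch. Real canonicalization is a graph algorithm inside a cheminformatics package; the two functions below are hypothetical stand-ins that only handle unbranched chains, where the same molecule can be spelled forward or backward (ethanol as `CCO` or `OCC`).

```python
import re

# Atom and bond tokens for a toy subset of SMILES (no branches or rings).
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")

def tokenize(smiles):
    """Split an unbranched SMILES into tokens, rejecting anything else."""
    found = TOKEN.findall(smiles)
    if "".join(found) != smiles:
        raise ValueError(f"unsupported SMILES for this toy example: {smiles!r}")
    return found

def canonical_a(smiles):
    """Hypothetical 'package A': keep the alphabetically smaller spelling."""
    toks = tokenize(smiles)
    return min("".join(toks), "".join(reversed(toks)))

def canonical_b(smiles):
    """Hypothetical 'package B': keep the alphabetically larger spelling."""
    toks = tokenize(smiles)
    return max("".join(toks), "".join(reversed(toks)))

# Same rule on both inputs: the two spellings of ethanol match.
print(canonical_a("OCC") == canonical_a("CCO"))  # True
# Mixed rules: the same molecule appears to be two different ones.
print(canonical_a("OCC") == canonical_b("CCO"))  # False
```

Both rules are internally consistent, but their outputs disagree with each other, which is the reason to canonicalize every identifier with the same package before comparing.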
Serial number
With all the complications of line notations, you can understand why people might take a different approach. An alternative is a serial number. A serial number has no inherent chemical information; it’s a lookup number to something that does have chemical information. The serial numbers must be generated from a single source to ensure that the same serial number is not assigned to two different chemicals.
CAS Registry® Number
The CAS Registry Number is commonly used by manufacturers and buyers to identify chemicals. As the name implies, CAS (Chemical Abstracts Service) registers a substance by assigning it a number, such as 67-56-1 for methanol. There is not necessarily a single CAS Registry Number for a chemical, though: CAS may deprecate an existing number and make a different number the recommended value.
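Although a CAS Registry Number carries no chemical information, its last digit is a check digit, which lets software catch many typing errors before a lookup. The sketch below is a minimal validator; the function name is our own.

```python
import re

def is_valid_cas(cas):
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Multiply each body digit by its position counted from the right
    # (1, 2, 3, ...), sum the products, and take the remainder mod 10.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit

print(is_valid_cas("67-56-1"))   # True
print(is_valid_cas("67-56-2"))   # False: wrong check digit
```

A valid check digit only says the number is well formed. It does not say which chemical the number refers to, or whether the number is still the recommended one.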
PubChem CID
PubChem is a database run by the US National Institutes of Health. It assigns a CID (compound identifier) to each compound in its database, and the molecular structure for a CID is freely available on the PubChem website. The PubChem page for a compound often includes much additional information, such as chemical and physical properties, chemical vendors, and hazards.
Serial number advantages and disadvantages
Advantages
- Easier to match in a database.
Disadvantages
- One must consult a source outside of the identifier itself to get the chemical information.
- There is no indication from the serial number whether two chemicals are closely related, or even the same. For example, the Organisation for the Prohibition of Chemical Weapons would like to be able to tell if a given compound is sarin, a nerve agent. The CAS Registry Number for sarin, CC(C)OP(=O)(C)F, is 107-44-8. However, CAS Registry Number 1415799-56-2 is for sarin with a water molecule, CC(C)OP(=O)(C)F.O, which could presumably be either used as a nerve agent or converted into one. So simply comparing CAS Registry Numbers is not sufficient to match compounds that serve the same purpose.
- The chemical must be entered into the database by the maintainer for them to assign an identifier to it. So if you come up with a new chemical that’s not in the database, you will need to use an alternative identifier, at least until the maintainer adds that chemical to their database.
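The sarin example shows where a structure identifier helps. In SMILES, a `.` separates disconnected parts of a substance, so a short sketch can check whether sarin is one of the parts, something the two CAS Registry Numbers cannot reveal. The helper names are our own, and the plain string comparison assumes both SMILES were written (canonicalized) the same way.

```python
SARIN = "CC(C)OP(=O)(C)F"

def components(smiles):
    """Split a SMILES into its disconnected parts at each '.'."""
    return smiles.split(".")

def contains_component(smiles, target):
    """True if target is one of the parts, compared as plain strings."""
    return target in components(smiles)

# Structure identifiers: the relationship is visible.
print(contains_component("CC(C)OP(=O)(C)F.O", SARIN))  # True

# Serial numbers: the two entries simply look unrelated.
print("107-44-8" == "1415799-56-2")  # False
```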
A comparison of types of chemical identifiers
Type | Molecular structure | Serial number |
Examples | SMILES, InChI | CAS Registry Number, PubChem CID, Molport ID, Chemspace ID |
Closely related chemicals: methanol and ethanol | SMILES: CO; CCO | CAS: 67-56-1, 2597-43-5, 2143-68-2; 64-17-5, 2154-50-9, 2348-46-1 |
Encodes molecular structure | Yes | No: Is a tag applied after a molecular-structure identifier has been used to match chemicals. The number must be assigned by an authority. |
Uniqueness | Can be assured via canonicalization (must use the same cheminformatics package for all chemicals) | Difficult to ensure: Two serial numbers can be applied to the same (or a very similar) chemical at different times |
Length and complexity of identifier | Variable, complex | Fixed, simple |
Treatment of chemical complexity (stereochemistry, isotopes, etc.) | Shows it | Obscures it: hides the fact that two chemicals are closely related and may be indistinguishable for a given purpose |
Can be used to find one structure in another (substructure match) | Yes | No |
Can be used to find similar structures | Yes | No |
Uses of chemical identifiers
Chemical identifiers are often used to search databases. For example, if we have a molecule of interest, we might want to check PubChem to obtain its properties, or check vendor websites to tell if we can buy it from them. It might be required, or at least advantageous, to search a given database with its preferred identifier. So we may need or want to use that identifier, which may not be the identifier that we prefer or already have.
Chemical identifiers are also used to tell if two chemicals are the same. For example, if you are in charge of purchasing chemicals for your organization, and different people give you their lists of desired chemicals using different identifiers, you would want to match up those lists using a common identifier to check if multiple people are asking for the same chemical.
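The purchasing scenario can be sketched as follows. The requester names and the `TO_COMMON` lookup table are hypothetical; in practice the mapping to a common identifier would come from a database or a cheminformatics package rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical lookup: (identifier type, value) -> common identifier.
# Here the common identifier is one agreed-upon SMILES per chemical.
TO_COMMON = {
    ("name", "ethanol"): "CCO",
    ("cas", "64-17-5"): "CCO",
    ("smiles", "OCC"): "CCO",
    ("name", "butane"): "CCCC",
    ("cas", "106-97-8"): "CCCC",
}

requests = [
    ("Alice", "name", "Ethanol"),
    ("Bob", "cas", "64-17-5"),
    ("Carol", "cas", "106-97-8"),
    ("Dave", "smiles", "OCC"),
]

def group_requests(requests):
    """Group requesters by the common identifier of what they asked for."""
    groups = defaultdict(list)
    for person, id_type, value in requests:
        # Names are case-insensitive; SMILES and CAS numbers are not altered.
        key = (id_type, value.lower() if id_type == "name" else value)
        common = TO_COMMON.get(key, f"unmatched:{value}")
        groups[common].append(person)
    return dict(groups)

print(group_requests(requests))
```

Three of the four requests resolve to the same chemical even though no two of them used the same identifier.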
Summary of chemical identifiers
Chemical identifiers are a way to represent 2- or 3-dimensional chemical structures, which may be in flux due to tautomerization, in a format that a computer can work with. Chemical identifiers let us search databases to get information about chemicals. Molecular structure identifiers let us calculate molecular properties yet can be unwieldy and nuanced. Serial number identifiers are simpler in form yet do not reveal relationships between similar chemicals. Despite their challenges, both types are useful for representing and matching chemicals, and a given database may require a particular identifier to search it for chemicals.
By Dr. Jeremy Monat, Principal Scientific Developer