InChIKey is a new format directly derived from InChI. It has several features that distinguish it from InChI and make it attractive for slightly different purposes.
The following table summarizes the basic differences between InChI and InChIKey.
|Property||InChI||InChIKey|
|One string for molecule [1]||yes||yes|
|One molecule for string [2]||yes||no [3]|
1. Is there only one representation of a specific molecule, or are several representations possible?
2. Does one string always represent only one molecule, or are collisions possible?
3. Because the number of possible molecules is unlimited and the size of the InChIKey is limited, it is unavoidable that some distinct molecules will share the same InChIKey. On the other hand, no InChIKey collisions are known so far (no two structures with different InChIs and the same InChIKey have been found). More info can be found below.
4. This refers to a slightly vague quality that I call transfer safety, which describes how well the data survive a transfer from one place to another. In ideal conditions both formats are safe. However, as was pointed out earlier on the inchi-discuss mailing list, an InChI string might lose its integrity when transferred via email or inside a wiki page. This is mostly because the InChI string can be very long and the software might try to break it into separate lines. Because of the short, fixed length of InChIKey and the special care that was taken in designing it, InChIKey is much more robust under such conditions.
5. The checksum character was removed in the 1.02 final version of the InChI software.
InChIKey is a fixed-length format directly derived from InChI. It is based on a strong hash (SHA-256 algorithm) of an InChI string.
Because InChIKey is a hash, there is no guarantee that two distinct molecules will have different InChIKeys. At the time of writing of this article no such collisions are known, but in principle they are unavoidable. On the other hand, it is possible that the first collision will not be found in the next 100 or 1000 years. The nature of the hash algorithm also means that it is virtually impossible to deduce the original InChI from an InChIKey (hashes are designed especially for this purpose). The only practical way is a brute-force lookup: computing the InChIKeys of the InChIs of all known chemical compounds and comparing them with the key in question.
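To put the collision risk in perspective, the birthday approximation gives a rough estimate of how likely at least one collision is among a given number of structures. The sketch below is only illustrative: it treats the hash output as uniformly random, and the bit widths used (65 bits for the first block, 37 bits for the second) are an assumption that is not stated in this article.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: P ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # -expm1(x) evaluates 1 - exp(x) without losing precision for tiny x.
    return -math.expm1(exponent)


# Assumed bit widths (not taken from the article): 65 + 37 = 102 bits.
ASSUMED_BITS = 102

if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        p = collision_probability(n, ASSUMED_BITS)
        print(f"{n:.0e} structures: p ~ {p:.3e}")
```

Under these assumptions the probability stays very small even for collections far larger than today's chemical databases, which is consistent with no collision having been observed yet.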
The nature of InChIKey makes it ideal for database storage, especially for indexing purposes. On the other hand it cannot be used as the only format for chemical structure storage because it is not convertible to the original structure.
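A minimal sketch of this storage pattern, using Python's built-in `sqlite3` module: the full InChI is kept as the authoritative record, while the InChIKey serves as a short indexed lookup column. The table and column names are hypothetical, and the InChIKeys are hard-coded example values rather than computed here.

```python
import sqlite3

# Example records: (name, InChI, InChIKey). The keys are hard-coded;
# in practice they would be generated by the InChI software.
COMPOUNDS = [
    ("water", "InChI=1S/H2O/h1H2", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
    ("methane", "InChI=1S/CH4/h1H4", "VNWKTOKETHGBQD-UHFFFAOYSA-N"),
]


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with an index on the InChIKey column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compound ("
        "name TEXT, inchi TEXT NOT NULL, inchikey TEXT NOT NULL)"
    )
    # The index is deliberately not UNIQUE: collisions are possible in theory.
    conn.execute("CREATE INDEX idx_compound_inchikey ON compound (inchikey)")
    conn.executemany("INSERT INTO compound VALUES (?, ?, ?)", COMPOUNDS)
    return conn


def find_by_inchi(conn: sqlite3.Connection, inchi: str, inchikey: str) -> list:
    """Look up by the indexed key, then confirm against the full InChI."""
    rows = conn.execute(
        "SELECT name, inchi FROM compound WHERE inchikey = ?", (inchikey,)
    ).fetchall()
    return [name for name, stored_inchi in rows if stored_inchi == inchi]
```

Comparing the stored InChI after the indexed lookup keeps the search fast while guarding against a hypothetical key collision.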
InChIKey is also a very good format for online publishing in the form of metadata. Its short, fixed length and compact form make it far more likely that search engines will read and index it properly, which might not be true for long InChIs.
The 27-character InChIKey is made of three parts connected by hyphens. The first part is 14 characters long and is based on the connectivity and proton layers of an InChI string. The second part contains 10 characters: the first 8 are related to all other InChI layers (isotopes, stereochemistry, etc.), and the last two indicate whether the InChI is standard or non-standard and the version of InChI. The third part is one letter describing the (de)protonation layer of the original InChI.
The hash characters in the first and second parts of the InChIKey are based on a truncated SHA-256 hash of the corresponding InChI layers. Only uppercase ASCII letters are used for encoding the data, which ensures that indexing engines will not split the key and also avoids problems with case-insensitive systems.
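The layout described above can be checked and split with a short regular expression. This is a minimal sketch, assuming the 14-10-1 block layout (27 characters including the two hyphens); the function and field names are hypothetical, and the check is purely syntactic, so it cannot tell whether a key corresponds to a real InChI.

```python
import re
from typing import NamedTuple, Optional

# 14 letters, hyphen, 8 hash letters + flag + version, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    skeleton_hash: str  # first block: connectivity-related hash
    layers_hash: str    # hash of the remaining layers
    flag: str           # standard / non-standard indicator
    version: str        # InChI version character
    protonation: str    # (de)protonation indicator


def parse_inchikey(key: str) -> Optional[InChIKeyParts]:
    """Return the parts of a syntactically valid InChIKey, else None."""
    match = INCHIKEY_PATTERN.fullmatch(key)
    return InChIKeyParts(*match.groups()) if match else None


if __name__ == "__main__":
    # InChIKey of ethanol, used here only as example input.
    print(parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Using `fullmatch` rather than `match` rejects keys with trailing characters or line breaks, which is the kind of damage the fixed format is meant to make easy to detect.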