buildingSMART Forums

IfcGloballyUniqueIds spec description is incorrect - proposal to simplify

I am reading the specification, in particular how an IfcGloballyUniqueId is generated.

http://www.buildingsmart-tech.org/ifc/IFC4/Add2/html/schema/ifcutilityresource/lexical/ifcgloballyuniqueid.htm

I am curious because it’s a rather concise and simple description: generate a 128-bit number, and base64 encode it using the charset provided, and give a resultant 22 character string.

I have a few concerns.

  1. Firstly, given that it is a 128-bit number, and UUIDs are indeed 128-bit, it would make sense to use one of the UUID versions (although the IFC spec doesn’t specify anything about this). I guess I could equally well make one up in my head, say “1”, and call it a day. The spec doesn’t say if the Nil UUID should be treated specially, or if any particular UUID version is preferred for collision prevention.

  2. Regardless, my 128-bit number translates into a 24-character base64 string (128 = 21*6 + 2 remainder, so 22 characters of data), which means that it ends in two = padding characters. Given that the EXPRESS specification wants a 22-character string, it is probably because they want to truncate the == padding. This is not explicit in the definition; it is only a sensible assumption. It would be good to make this clearer in the spec.

  3. Finally, given that I have a 2-bit remainder, my last (i.e. 22nd) base64 character will be padded with 4 zero bits. The four resulting possibilities are 000000, 010000, 100000, and 110000. These translate to 0, 16, 32, and 48, and given the charset defined, are the characters 0, G, W, and m. Therefore, any IfcGloballyUniqueId must end in one of those four characters. Yet the examples in the spec show values that end in other characters. Testing an implementation such as IfcOpenShell with ifcopenshell.guid.new() also gives me values which don’t end in those characters.

  4. It seems odd that IFC doesn’t use the standard A-Za-z0-9 + 2 special characters base64 charset but instead defines its own.
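
The arithmetic in points 2 and 3 can be checked with a few lines of Python. This is only a sketch using the standard library; the names B64_CHARSET, IFC_CHARSET and last_char_options are made up for illustration, and the IFC charset is the one quoted from the spec.

```python
import base64
import uuid

# Standard base64 alphabet, and the alphabet given in the IFC spec
B64_CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
IFC_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"

# 16 bytes always encode to 24 base64 characters, the last two being '='
encoded = base64.b64encode(uuid.uuid4().bytes).decode("ascii")
print(len(encoded), encoded[-2:])  # 24 ==

# The 22nd character holds 2 data bits followed by 4 zero bits,
# so its 6-bit value can only be 0, 16, 32 or 48
last_char_options = [IFC_CHARSET[bits << 4] for bits in range(4)]
print(last_char_options)  # ['0', 'G', 'W', 'm']

# Same check against the standard alphabet, where those values are A, Q, g, w
print(encoded[21] in [B64_CHARSET[bits << 4] for bits in range(4)])  # True
```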

Am I misunderstanding the spec? Or are the examples / implementations wrong?

It also doesn’t help that GUID is a rather Microsoft-oriented term, it’ll be a little bit more politically correct to call it a UUID, eh :)

A little bit of investigation shows that this problem exists with a lot of implementations. Here’s a list that I have tested so far:

  • IfcOpenShell
  • FreeCAD
  • GeometryGym
  • ArchiCAD
  • Revit

For example, this is an IfcGloballyUniqueId string produced by Revit: 18mPNPiNXBUv50hwee2Yod. Because it ends in d, it doesn’t seem to be standard base64. If it were decoded as base64, then 18mPNPiNXBUv50hwee2Yod, 18mPNPiNXBUv50hwee2Yoe, 18mPNPiNXBUv50hwee2Yof, and so on (any last character in the set {WXYZabcdefghijkl}) would all resolve to the exact same 128-bit number, which increases the likelihood of collisions. This is somewhat contrary to the purpose of a UUID.
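
To see the collision argument concretely, here is a small sketch that reads a 22-character string the way plain base64 would: six bits per character, most significant first, with the last four bits discarded as padding. decode_as_plain_base64 is a hypothetical helper written for this post, not part of any IFC library.

```python
IFC_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"

def decode_as_plain_base64(guid):
    """Read 22 characters as 132 bits and drop the 4 trailing padding bits."""
    number = 0
    for char in guid:
        number = number * 64 + IFC_CHARSET.index(char)
    return (number >> 4).to_bytes(16, "big")

# The Revit example without its last character
stem = "18mPNPiNXBUv50hwee2Yo"

# All sixteen endings W..l share the same top two bits (10)
decoded = {decode_as_plain_base64(stem + last) for last in "WXYZabcdefghijkl"}
print(len(decoded))  # 1

# An ending of m has top bits 11, so it is a different 128-bit value
print(decode_as_plain_base64(stem + "m") in decoded)  # False
```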

I have written a short reference implementation in Python, which is platform independent.

import base64
import uuid

class IfcGloballyUniqueId:
    def __init__(self):
        # Standard base64 alphabet, and the alphabet defined in the IFC spec
        self.b64_charset = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
        self.ifc_charset = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'

    def generate(self):
        return self.encode_uuid_to_ifc_guid(uuid.uuid4().bytes)

    def encode_uuid_to_ifc_guid(self, uuid_bytes):
        # Standard base64 gives 24 characters; drop the trailing '==' padding
        b64 = base64.b64encode(uuid_bytes).decode('ascii')[0:22]
        return b64.translate(str.maketrans(self.b64_charset, self.ifc_charset))

    def decode_ifc_guid_to_uuid(self, guid):
        # Map back to the standard alphabet and restore the padding
        b64 = guid.translate(str.maketrans(self.ifc_charset, self.b64_charset))
        return uuid.UUID(bytes=base64.b64decode(b64 + '=='))

Here’s a little demo for generating a GUID, starting from some other agnostic UUID (say from a stored DB of BIM data), and going there and back again.

# Note how possible ending characters are 0, G, W and m
ifc_guid = IfcGloballyUniqueId()
print(ifc_guid.generate()) # 9Fs5IkRuIgIQsQaT_sdI00
some_uuid = uuid.uuid4()
some_ifc_guid = ifc_guid.encode_uuid_to_ifc_guid(some_uuid.bytes)
some_uuid_again = ifc_guid.decode_ifc_guid_to_uuid(some_ifc_guid)
print(some_uuid) # 316d658d-6db1-43c2-b9bf-c3f5104ac16b
print(some_ifc_guid) # CMrbZMsnGyAvlyFr44h1Qm
print(some_uuid_again) # 316d658d-6db1-43c2-b9bf-c3f5104ac16b

To resolve this fully, there are a few options:

  1. Implementers follow the spec as written and use standard base64 encoding. Unfortunately, this will break existing IFC middleware that relies on the existing GUID encoding.
  2. Simplify the specification, and then update implementations:
     • Use the standard b64 charset
     • Specify treatment of the null GUID (already questioned by @lassi.liflander and @jonm)
     • Specify UUID v4 to maximise randomness.
     • Maybe consider removing the b64 encoding? Just make it a UUID string.
  3. Change the spec completely and just say that the IFC GUID is a random 22-character string where characters are a subset of the character set. This is kinda bad because it doesn’t use the UUID standard, which is the whole point of preventing collisions and making things unique, and it also encourages developers to go rogue and use weak randomness algorithms which increase collisions.

My own recommendation is to do option 2, but keep the b64 encoding (to match format of older IFC files for legacy reasons).

Thoughts?

Right now in FreeCAD we create new UUIDs with:
import uuid, ifcopenshell
uid = ifcopenshell.guid.compress(uuid.uuid1().hex)
but that code is rather old. We could start using ifcopenshell.guid.new(), which would remove one bad implementation from the game… Or use @Moult 's code above. Any thoughts, @aothms?

Please do not use @Moult 's code: it will produce Ids that do not work in existing software. Look here to find out how to generate valid Ids:
http://www.buildingsmart-tech.org/implementation/get-started/ifc-guid/ifc-guid-summary
The point is that the two leftover bits that Moult wants encoded in the last digit are actually encoded in the first digit. So, a valid IfcGloballyUniqueId always starts with 0, 1, 2 or 3.

@trondholen, I have read the link you have posted, and it says nothing about encoding the last two bits in the first digit (or did I miss it?). Encoding the last two bits in the first digit is unfortunately not how base64 works, I believe. Here is some random independent online b64 bytes<->text encoder I found to demonstrate. My Python code also uses purely stdlib functions, which result in GUIDs encoded with the last bits in the last character.

However the link you posted does refer to code examples - the code examples do encode as you’ve described - this means that something was lost in translation: perhaps the code was incorrectly implemented and copied by everyone, or perhaps the spec was an incomplete description of the code? I’m not sure.

This leads to two possible solutions:

  1. Enforce the simpler, more stdlib approach (see option 2 in my previous post), maybe in IFC5?
  2. Decide to stick to a legacy implementation, and rewrite the spec to describe the actual GUID generation, encoding, and decoding requirements.

I would go with solution 2 then, change the documentation to stick to the legacy implementation.

If we are to change how GUIDs should be encoded in IFC5, then it should be changed to write them in the same way that is done everywhere except in STEP files, like this:
“DA85046C-CE8A-4226-A38D-732788C5C1E7”

Sounds good. Are you suggesting to modify the spec in the meantime for IFC4x, and then in the next IFC5 release, revise the description of IFC GUIDs to a canonical UUID representation, i.e. a fixed-length 36-character string of the form xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx?

I think this is a very clean and simple solution. This also prevents the issues with the non-standard base64 charset and complexity-adding padding truncation.
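
For reference, Python’s stdlib already reads and writes this canonical form, so no custom encoder would be needed. A small sketch using the example string quoted above (the variable name canonical is just for illustration):

```python
import uuid

# Example string quoted earlier in this thread
canonical = uuid.UUID("DA85046C-CE8A-4226-A38D-732788C5C1E7")

# The canonical text form is always 36 characters (32 hex digits + 4 hyphens)
print(len(str(canonical)))  # 36

# The "M" position carries the UUID version
print(canonical.version)  # 4

# The "N" position carries the variant
print(canonical.variant == uuid.RFC_4122)  # True
```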

In a separate discussion, a clearer definition of the purpose of a NULL GUID can be defined. Maybe we can also decide whether it is worth it to promote a specific UUID version?

@yorik the only difference between your code and ifcopenshell.guid.new() is the UUID version that is used: version 4 (purely random) instead of version 1 (deterministic MAC address + timestamp). That is in line with @Moult 's suggestion, although there are probably many valid use cases for the other versions, both derived and MAC-based. https://github.com/IfcOpenShell/IfcOpenShell/blob/master/src/ifcopenshell-python/ifcopenshell/guid.py#L56

The IfcOpenShell Python implementation for the guid encoding is here, the equivalent is also implemented in C++ somewhere: https://github.com/IfcOpenShell/IfcOpenShell/blob/master/src/ifcopenshell-python/ifcopenshell/guid.py#L38

The procedure is as follows, although I agree this should be written in the documentation and not just hidden in some sample implementations:

  • The first byte is encoded in the first two characters
  • then the remaining bytes are encoded in groups of 3, taking up 4 characters

In the first two characters there are indeed some unused bits, as 64 ^ 2 > 256 (12 bits of capacity for 8 bits of data). The remaining sequence uses every bit, as 64 ^ 4 = 256 ^ 3.
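
As a rough illustration of the procedure just described, here is a minimal Python sketch. It is written from the description above rather than copied from IfcOpenShell, so compress_legacy, expand_legacy and the two helpers are hypothetical names, and it assumes each group is written most significant digit first.

```python
import uuid

IFC_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"

def to_chars(value, length):
    """Write value as `length` base-64 digits, most significant first."""
    digits = []
    for _ in range(length):
        digits.append(IFC_CHARSET[value % 64])
        value //= 64
    return "".join(reversed(digits))

def to_int(chars):
    """Inverse of to_chars."""
    value = 0
    for char in chars:
        value = value * 64 + IFC_CHARSET.index(char)
    return value

def compress_legacy(uuid_bytes):
    """First byte -> 2 characters, then each 3-byte group -> 4 characters."""
    out = to_chars(uuid_bytes[0], 2)
    for i in range(1, 16, 3):
        out += to_chars(int.from_bytes(uuid_bytes[i:i + 3], "big"), 4)
    return out

def expand_legacy(guid):
    """2 characters -> first byte, then each 4-character group -> 3 bytes."""
    data = bytes([to_int(guid[:2])])
    for i in range(2, 22, 4):
        data += to_int(guid[i:i + 4]).to_bytes(3, "big")
    return data

original = uuid.uuid4().bytes
guid = compress_legacy(original)
print(len(guid))                        # 22
print(guid[0] in "0123")                # True
print(expand_legacy(guid) == original)  # True
```

Because the first byte is at most 255 = 3*64 + 63, the first character can only be 0, 1, 2 or 3, which matches the pattern mentioned earlier in the thread.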

I have also been under the impression that a valid IfcGloballyUniqueId always starts with 0, 1, 2 or 3.

Thanks for clarifying @aothms. Maybe that wording can be used in the spec. And yes, @lassi.liflander, if people follow the definition described by @aothms, the pattern does emerge, but this is not what is described in the spec.

To clear up the confusion, maybe we can just remove the b64 encoding step?

Please see this document. https://tools.ietf.org/html/draft-leach-uuids-guids-01
I believe that implementation is used by most of the vendors that have been around twenty years, or so.

Hey @lassi.liflander, you’ve linked to the UUID standard. I think the UUID generation is OK. The issue lies in the b64 encoding. The IFC spec describes a standard base64 encoding, but the actual code example uses a non-standard one (@aothms summarises it: the first byte is encoded on its own in the first two characters, so the unused bits end up at the front rather than at the end).

Getting the encoding right is important because if you do it in a different way to others then you can’t decode it, and you can’t correlate IDs.

The Python snippet I wrote (only using stdlib functions) demonstrates a standard base64 encoding as described in the spec. This is why @trondholen suggests to just scrap the whole encoding, as that’s what’s causing all the confusion.

Given that there’s been a renewed interest in the modernisation of IFC, perhaps this thread is worth resolving? The current state of affairs is a legacy implementation.