CycleServer’s datastore supports 128-bit values used to identify records uniquely. These identifiers are created automatically when storing records for types where AutoGenerateKey is true.
Although identifiers have a canonical string representation, they are not strings. They are a new type of ClassAd expression with the following syntax:
That is, an initial # character followed by any number of hexadecimal characters, plus a “-” and “.” Everything that matches this format will be parsed as an identifier, although in many cases it will not be a valid identifier and the parse will fail.
You can convert an identifier to a string with the string() ClassAd function, and convert a string to an identifier with the id() function. Identifiers support equality, “is”, “isnt”, and comparisons with themselves. Comparisons are done by comparing the 128 bits directly. Identifiers are not equal to values of another type, and operators (except for “is”, “isnt”, “===”, and “=!=”), return ”undefined” if compared to different types.
Identifiers can be generated in two ways:
- Automatically, when records of a certain type are stored
- On-demand, with the adstore.nextValueFromSequence() function
If a type is defined with AutoGenerateKey=true, then the key attribute will be given a unique identifier when it is stored. By default, it gets an ordered identifier<datastore-ouid>`, but you can change with the AutoGenerateKeyMethod attribute on the type:
- A standard random UUID
- An ordered, globally-unique identifier
- A locally unique autoincrementing number
You can get the next value from a sequence with the adstore.nextValueFromSequence(name) function. There are two predefined sequences (random and ordered), and a sequence created for each autogenerated type named type:NAME. Here is an example using the Datastore API.:
$ $CS_HOME/util/jython Jython 2.7.1b3 (, Feb 16 2016, 20:04:41) [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_77 Type "help", "copyright", "credits" or "license" for more information. >>> from application import adstore >>> ds = adstore.connectToRemoteDataStore('http://127.0.0.1:8080', 'user', 'pass') >>> adstore.nextValueFromSequence('random') #39e454b8-3f1d-476d-b49d-cda8e452f6c7 >>> adstore.nextValueFromSequence('ordered') #571eed18-0031-000000000002-1 >>> adstore.nextValueFromSequence('ordered') #571eed19-0041-000000000002-1
You can also create your own sequences by defining a record of type Sequence. You can then get the values from the sequence by supplying its name to the adstore.nextValueFromSequence() method.
Sequences may be configured to return the value as a string or an integer, if they have ValueType=string or ValueType=integer on the sequence. Strings do not contain the # character at the start. This works for autogenerated-key sequences too. (Note that integer will only work for autoincrement sequences.)
Identifiers in Detail
Note: The information in this section is very detailed, and can be skipped.
Kinds of Identifiers
There are three kinds of identifiers:
- UUIDs, which are the standard version 4 UUIDs generated by Java
- OUIDs (ordered unique identifiers), which are ordered by the time each is created
- LUIDs (locally unique identifiers), a simpler identifier for some purposes
The primary differences between a standard UUID and an OUID are the following:
- All bits (except for a few version bits) in UUIDs are random, and
thus there is no ordering between subsequent generated ids. OUIDs
are ordered by time overall.
- Because of the preceding, UUIDs always require the full 32 hex
digits to be identified (plus 4 hyphens). OUIDs are constructed to
group zero bits together, and use a format that allows leading zeros
to be omitted. This drops off several characters using a standard
formatting that includes some padding for alignment. Often the
number of digits needed to be typed in (such as when one person is
transcribing to another, or memorizing the id temporarily) can be
even less, since all leading zeros can be omitted.
All identifiers, regardless of their type, are compared simply as 128 bit values. No processing or extraction is needed to compare them for equality or ordering.
An OUID consists of 4 sections: the time, an offset, a node identifier, and a clock sequence. The time section occupies the highest bits, to provide the primary ordering of OUIDs. It provides per-second granularity. The offset count provides uniqueness within that second, by incrementing a counter for each id in the current second. The node identifier is a 48 bit random value assigned to each CycleServer instance. Together, the first three provide almost guaranteed uniqueness: time advances monotonically, and each node can generate a new, unique id every 60 ns without conflict.
The format of an OUID is a hex string divided into four sections:
- # A literal character, used to indicate this is an identifier (optional in some contexts)
- T (36 bits, 9 hex digits): The time since January 1, 1970 UTC in seconds.
- C (24 bits, 6 hex digits): A count of the specific id generated in the second given by T, up to 16 million
- V (4 bits, 1 hex digit): The lower three bits are the version (currently 1), to allow for future modifications without collision. The upper bit indicates whether the clock sequence is current or backfill.
- N (48 bits, 12 hex digits): A node identifier, which is 47 bits of randomly generated data. Note that the upper bit is always 0, so that these ids will never collide with the random UUIDs generated by CycleServer.
- R (16 bits, 4 hex digits): The clock sequence.
This lets the identifiers encode times up to year 4070, with a maximum of 16 million ids per second per node (which would allow us a theoretical rate of one id per 60 ns), across many nodes.
For formatted output, dropping all leading zeros is not ideal, because in general values won’t align. In order to balance the need for alignment against the need to drop overly redundant values, a shorter standard format is the following:
This allows times up to the year 2038, per-second counts up to 65535, and 16 separate clock jumps or backfills, without losing alignment. Higher values are allowed (up to the limits described earlier) but that will cause ids to be no longer aligned.
The format of UUIDs is similar (as encoded by this class, with a leading #):
where X can be any digit, and y is 8, 9, A, or B.
Note that OUIDs can be distinguished from UUIDs in several ways: first, there are 4 sections (3 hyphens), vs 5 sections (4 hyphens). Second, there will be fewer digits, or at least many 0s. Third, and least helpful for people but most helpful for computers, bit 63 will be 0 for OUIDs, and 1 for UUIDs.
Nodes are separate instances of CycleServer. Often this will correspond to a host machine, but one host may run more than one instance of CycleServer, and thus have more than one node. A node is assigned the first time an instance is started, and it retains the same value across reboots (as well as when it joins another cluster when CycleServer is distributed). Nodes are identified by a randomly generated 48-bit string (of which 47 bits are meaningful). This gives us 143 trillion unique node identifiers in theory, but < 0.00004% chance of collision with < 10000 nodes, and < .003% chance with 100000 nodes. It would be possible to use node identifiers with less significant bits and less chance of collision by getting them from the master server for a CS cluster, but that would remove the ability to import records from other datastore instances.
Clocks normally move strictly forward, which provides a good basis for per-node uniqueness. However, there are two cases where this is not true:
- Clocks can be set backwards (on purpose or accidentally), which
would violate the assumption that time advances monotonically
- It is sometimes desirable to bulk load (“backfill”) records that should
be dated in the past (so they sort correctly relative to other
entries). This is important for legacy data we currently have like
audit logs, as well as people writing import plugins that start by
loading in historical data stored elsewhere.
The clock sequence lets us handle these cases. First, the lowest bit of the clock sequence indicates whether this is a normal (ie, “current”) timestamp, or a backfill. The upper bits represent an integer that is incremented every time that we need to generate ids whose timestamps may collide with others generated. First, if we are generating backfill ids, we will generate a new clock sequence for each set of ids generated. (This implies a local stable storage to track this.) To avoid using too many clock sequences, we can track the min and max identifiers created from a backfill, and if this new set is outside the range, we just widen that range and keep the clock sequence; if it overlaps, we use the range for the new identifiers and generate a new clock sequence.
Non-monotonic clocks can be detected in a similar way, by treating current ids as backfilling from [now, infinity). That is, if the clock moves forward monotonically, at every instant we will be making a new, non-overlapping range of times. If it moves backwards, it will overlap the range so far; if it moves forward, it won’t. We track the union of all ranges of times generated.
This leads to the following algorithm: store the current clock sequence, plus the union of the range of ids generated for that clock sequence (separately for current and backfill identifiers). When we boot up, and whenever the clock changes to a new second, we make sure that it is larger than the max id. If it is, we stay at the current clock sequence and widen the range (which should be flushed to disk periodically, but not necessarily more often than the length of time it takes to reboot–say less than 10-15 seconds). If not, we increment the clock sequence and start over with a new range. The process is similar for backfills, except that we need a way to know how to get the “proper time” from each record being imported. That gives us a range for import, and if it overlaps the current range for backfills, we generate a new clock sequence and start over with the new range; if not, we use the current clock sequence and widen the existing range to accommodate the new range.
The reason that backfill gets a separate clock sequence is because non-monotonic clocks are rare, and thus the clock sequence for that will be low (usually 0, if it never happens). This helps keep most ids shorter, by maximizing the number of leading zeros per section. Backfill ids may get up to a large clock sequence, but normal ids will continue to have a small clock sequence.
This 16 bits allocated to the clock sequence gives 64 thousand unique values for both current and backfill ids, which should be plenty for non-monotonic clocks. For backfilling, it is still reasonable. In general, there will be one backfill at the start of gathering data for a system, which corresponds to one clock sequence at most (assuming it overlaps). This allows at least 64k separate plugins to run without conflict (more if the historical times do not overlap, but that is not likely for typical backfills). Even if that is not enough, there is the extra protection from the fact that different plugins will probably backfill records of a different type.
If the clock sequence hits the limit we can just let it wrap, with the hopes that all ids generated the last time it was 0 are all in the distant past or were used for different record types.
Locally Unique Identifiers
For test purposes and possibly for certain uses, there is a very simple format that creates ids that are only unique locally (in a given context), called LUIDs. These are simply incrementing numbers, from 1 up, and are represented with no dashes. Their primary advantage is maximal simplicity in presentation, for cases where you need to autogenerate an id, and show it to the end user so they can do something about it. For instance, submission ids in CycleServer fit this use case: CycleServer has to create a unique id, but the id is displayed directly on the jobs page so that end users can coordinate on what jobs are running (“submission 1495 has been running for 10 hours…can you check on it?”).
For this reason, they are represented in decimal, and they have the strict minimum of digits, and no hyphens. Typically they would be generated with a separate sequence for each type of object (for instance, users might get ids starting at 1, and groups would also get ids starting at 1). This is what makes them ”locally” unique.) However, generating them across hosts requires global coordination, typically to allocate a range of ids for a host from the master when it needs to create local identifiers. Luckily, since their advantage (few digits) will quickly disappear if they are used to generate large numbers of ids, the slower speed is usually worth it when their simplicity is needed.
The upper 64 bits of LUIDs are 0, which distinguishes them from UOIDs (the version starts at 1), and UUIDs (there is a required 0100 bit pattern in the upper 64 bits). This means that LUIDs can be up to 64 bits, which is an extremely large range for the types of objects that should be identified with an LUID (it is a 20-digit number, in decimal).