Skip to end of metadata
Go to start of metadata

We plan to refactor the database schema for ContentHostingService in Sakai 2.4. Among the goals and objectives of that refactoring are:

  • Use arbitrary "long" integer 36-byte UUIDs as keys for relationships among tables used by CHS.
  • Totally eliminate XML.
  • Use SQL rather than Hibernate for all DB operations to support the CHS APIs (see To Hibernate Or Not To Hibernate).
  • Keep main entries for collections and resources in one table
  • Include all common properties/attributes in a main table (the CONTENT_ENTITY table) with a long integer key.
    • IN_COLLECTION (links to the containing collection)
    • ENTITY_PATH (the full path to the resource/collection within CHS)
    • ENTITY_ID (the SHA1 encoding of ENTITY_PATH)
    • ENTITY_UUID (generated by CHS; the primary key for this and other tables)
    • VERSION
    • DISPLAY_NAME (Use last "part" of ENTITY_PATH as display name instead of allowing users to partially "rename" a resource)
    • RESOURCE_TYPE (folder, fileUpload, textDocument, htmlDocument, etc)
    • CREATED_TIME
    • CREATED_BY
    • MODIFIED_TIME
    • MODIFIED_BY
    • ACCESS_MODE (grouped, site, inherited, public)
    • HIDDEN
    • RELEASE_DATE
    • RETRACT_DATE
    • HAS_PRIORITY_SORT
    • PRIORITY_SORT_ORDER
  • Include optional properties in a separate table as triples (the CONTENT_ENTITY_PROPERTIES table).
  • Include list of groups for GROUPED entities in separate table (the CONTENT_ENTITY_GROUPS table).
  • Include file-path or file-body for resources in separate table (the CONTENT_ENTITY_BODY_FILE and CONTENT_ENTITY_BODY_BINARY tables).
  • Include information about earlier versions in a separate table (the CONTENT_ENTITY_VERSIONS table). Will also need version tables for content body, probably)
  • Increase the size of ENTITY_ID ENTITY_PATH field from 255 to 1024 4096 (this is under discussion).
  • Primary key in all but CONTENT_ENTITY table is a BIGINT (long) 36-byte String, which is a UUID.
  • DB should assign those unique BIGINT keys in CONTENT_ENTITY table before we add rows to other tables.
  • Need to add indexes/constraints for common queries.
  • Promote Resource-type to field. Demote mime-type to property.

The first draft version of an autoddl file for HSQL is at "Draft of HSQL autoddl file". A revised version ("sakai_content_2_4_0.sql v2") is shown below, followed by some comments about what changed between these versions.

-----------------------------------------------------------------------------
-- CONTENT_ENTITY
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY 
(
	ENTITY_UUID VARCHAR (36) NOT NULL,
	VERSION SMALLINT,
	ENTITY_ID VARCHAR (255) NOT NULL,
	ENTITY_PATH VARCHAR (4096) NOT NULL,
	IN_COLLECTION VARCHAR (36),
	RESOURCE_TYPE VARCHAR (99),
	-- DISPLAY_NAME VARCHAR (1024),	
	CREATED_TIME TIMESTAMP,
	CREATED_BY VARCHAR (99),
	MODIFIED_TIME TIMESTAMP,
	MODIFIED_BY VARCHAR (99),
	ACCESS_MODE VARCHAR (16),
	HIDDEN BOOLEAN,
	RELEASE_DATE DATETIME (0),
	RETRACT_DATE DATETIME (0),
	HAS_PRIORITY_SORT BOOLEAN,
	PRIORITY_SORT_ORDER SMALLINT,
	CHANGES VARCHAR (1024),
	
	CONSTRAINT CONTENT_ENTITY_INDEX UNIQUE (ENTITY_ID)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_VERSIONS
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_VERSIONS 
(
	ENTITY_UUID VARCHAR (36) NOT NULL,
	VERSION SMALLINT,
	IN_COLLECTION VARCHAR (36),
	RESOURCE_TYPE VARCHAR (99),
	-- DISPLAY_NAME VARCHAR (1024),	
	-- CREATED_TIME TIMESTAMP,
	-- CREATED_BY VARCHAR (99),
	MODIFIED_TIME TIMESTAMP,
	MODIFIED_BY VARCHAR (99),
	ACCESS_MODE VARCHAR (16),
	HIDDEN BOOLEAN,
	RELEASE_DATE DATETIME (0),
	RETRACT_DATE DATETIME (0),
	HAS_PRIORITY_SORT BOOLEAN,
	PRIORITY_SORT_ORDER SMALLINT,
	CHANGES VARCHAR (1024),
	
	CONSTRAINT CONTENT_ENTITY_INDEX UNIQUE (ENTITY_ID)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_GROUPS
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_GROUPS 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	GROUP_ID VARCHAR (99)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_PROPERTIES
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_PROPERTIES 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	ORDER_IN_LIST TINYINT,
	PROPERTY_ID VARCHAR (255),
	PROPERTY_VALUE LONGVARCHAR
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_BODY_FILE
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_BODY_FILE 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	FILE_PATH VARCHAR (128)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_BODY_BINARY
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_BODY_BINARY
(
	ENTITY_KEY BIGINT NOT NULL,
	VERSION SMALLINT,
	BODY BINARY
);

Changes:

In this version, the BIGINT (64-bit integer) keys have been dropped, and UUIDs are being used as primary keys in the various tables.

The full path is included as the ENTITY_PATH. When adding a new entity, we will use SHA1 encoding to get a 64-byte hash of the ENTITY_PATH and that will be the ENTITY_ID. There is an infinitesimal chance of a collision for two strings that are not the same, so maybe each time we attempt to create or access an entity based on a freshly computed SHA1 encoding of the full path, we also check to see whether the entity already exists and (if so) whether the stored path is the same as the path we started from. If it's the same, we're fine. Otherwise, we can use some additional method to calculate a hash.

There's a new table for "version control" information. This is a placeholder for the possibility that we implement some version-control features in Sakai 2.4. This is likely to change, depending on what version-control features we need to support in the Resources tool and in the CHS API.

The DISPLAY_NAME is gone. Instead, we will use the last "part" of the ENTITY_PATH as a display name. This will avoid the possibility of a user renaming a resource without changing its id, which later leads to confusion. This leaves the question of what "display name" is used for the top-level collections within a site, since those use the site's id (a UUID) as the ENTITY_PATH.

Notes:

To retrieve an entity (resource or collection) requires getting all rows in all tables that have the same ENTITY_KEY. Generally, we will be getting it from the main table based on a search for the ENTITY_ID.

We need to anticipate the most common queries and support them with indexes or constraints to make them very efficient. We also need to support sorting in the query rather than sorting in memory after retrieval (which is why DISPLAY_NAME is included in the main table, for example, rather than treated as a property).

I think this might be an improvement over the current schema, but I don't think it helps with two very common queries:

  1. Any ContentEntity has a place somewhere in a tree/hierarchy, where the parent node is a collection that "contains" the child and all its siblings and all its children. Because of groups, to fully resolve permissions on any entity in this hierarchy, we need information about the entity and every collection that contains it (i.e. every node above it in the tree/hierarchy on a direct path back to the root). It would be very cool if we had a single query that retrieved all the records for an entity AND the collections that contain it (all the way up the hierarchy). Is it possible that a BIGINT value (or a combination of BIGINT values) could express containment in a meaningful way without limiting breadth or depth of the hierarchy?
  2. We need an efficient way to retrieve a collection and everything it contains. Right now we do this programmatically. Such a query would have to be used very judiciously. And maybe it should be paged.

Comments, suggestions welcome.

  • No labels

18 Comments

  1. One issue that needs to be addressed is the various names and identifiers that can be used in Sakai to refer to the same ContentEntity. In various places, we use entity-id's (which express a file-path from the CHS root to each item), UUID's (shorter fixed-length identifiers that are often shorter than the entity-id), and display-names (which are sometimes the same as the last nugget of the entity-id, but often different).

    I'm proposing to add another one – a "long" (integer) key that uniquely identifies all database entries related to a particular ContentEntity. The good news is that this new one would be confined entirely within the database impl of CHS and NEVER appear outside of that impl.

    So one question is whether we can achieve in our lifetimes (in Sakai 2.4, for example) something that many people have asked for repeatedly over the past several decades. That is that the "name" of a resource or collection correspond to the id used to retrieve it.

    1. There have been many JIRA tickets related to problems that occur because of the way names and id's are assigned for resources. Many of them are listed and summarized in REQ-56.

  2. Looks pretty good at first pass but I disagree with the following points:

    • Use SQL rather than Hibernate for all DB operations to support the CHS APIs.
      Is there a reason you are wanting to not use Hibernate? It should be able to match or exceed the performance of SQL in many cases and if you want to you can work with it using just HQL instead of SQL.
    • Increase the size of ENTITY_ID field from 255 to 1024 (this is under discussion).
      This would slow down searches and indexing on the entity key significantly. There should not be a reason that the entity key should need to be any longer than 255 chars since that allows for 26^255 (n^r) possible combinations assuming all letters are used. If we want to allow for dashes we can be conservative and say 26^200 which is 9.88*10^282 possibilities. If there is some issue with the generation of unique entity IDs then we might need to look at a better way to handle that.
    1. 2 more quick comments:
      1) No BIGINT in Oracle (Number(20) is equivalent most of the time though 20 is overkill IMO)

      2) All tables should have a primary key field which is a long called something like ID or PK. It should autoincrement and be used for FK references. This is good practice in general.

      1. 1) As I understand it, "BIGINT" is what HSQL calls a "long". When we do the Oracle impl, we'll use the Oracle equivalent.
        2) ENTITY_KEY is the primary key (sorry, forgot to label it as such), and we want it to be assigned by auto-increment as you suggest.

    2. The ENTITY_ID corresponds to the current COLLECTION_ID or RESOURCE_ID. It has the form of a file-path, where a new entity gets the id of the collection in which it's being created with the "local name" of the item appended. For example:

      /group/<site_id>/folder-1/folder-2/<new-item-name>

      When people use long names for folders and resources, these id's get pretty long quickly, limiting the depth of the hierarchy we can support.

      You're probably referring to the ENTITY_UUID, which was introduced in 2.1 or 2.2 as a way to have shorter id's for resources. Any proposal to substitute UUID's for the current entity-id's has to fully support the needs of WebDAV, JSR-170, etc.

      My thought is that we combine ENTITY_ID and DISPLAY_NAME and have it be a "local-name" (i.e. just the last part of the current entity-id) and limit its length. Then we support file-paths to ContentEntities for WebDAV, JSR-170, AccessServlet and other webapps, but the query for a particular entity identified by such a "file-path" is handled by a single query that starts at the root and uses the ENTITY_KEY, "local-name" and IN_COLLECTION fields to traverse the path to the requested item. Can that query be efficient?

      1. It's very unlikely we'll use more than 255 for the ENTITY_ID if we keep it as a unique identifier that is a full path to the entity. The reason is that MySQL only allows a total of 255 bytes/characters as an index value or primary key.

      2. This may be out of spec, but I'm very interested in seeing more "virtualized" paths and filenames rather than hard-coding it into the ENTITY_ID and wonder if this might be a good opportunity to look at this.

        Here are some cases where I've run into issues with the "the id is the absolute path" approach.

        1) site_id is gnarly when mounting WebDAV on Mac - when someone mounts Resources on a Mac using the the built-in WebDAV client, the mounted volume is the site_id/UUID. This is very confusing and meaningless to the end-users.

        2) Title and actual path can become out of sync - I've run into several instances in both WebDAV and the FCK file picker where a Resource was initially created as "Blue States.doc", for example. Later on, the user realizes it should have been "Red States.doc" and renames it thusly in the the UI. But when they access it via WebDAV or try to link to it in the FCK file picker, it still shows up as "Blue States.doc" as the RESOURCE_ID doesn't get changed in a rename operation.

        3) Dav access to drop boxes - when accessing drop boxes via webdav as an instructor, the student folders are indistinguishable as they are typically internal IDs (UUIDs) instead of the former behavior when the internal ID was the EID. Ideally, WebDAV could "virtually" present each student's folder with their display name as the web-based UI for the drop box does.

        4) Creating Resources programmatically - when importing content from another system, I ran into an issue where the path on the old system could not be replicated as it had gotten too "deep" - the RESOURCE_ID I was trying to create exceeded 255 characters. I had to truncate the file names to make the imported content fit.

        This could also have a potential benefit if the paths were stored in one table and the actual entity content/metadata were "flat". When a file needed to be referenced from the path table, it could join on the entity content table via a unique integer. With content no longer tied to a particular node in the file system, copies of the same file could be represented once in the database but surfaced as many times as needed - if it were copied, for example.

        I realize this is probably beyond the intended scope of this refactor, but if we're already refactoring, we may as well go all the way. (smile)

        1. I'll stick this comment in here to link the things together: REQ-56 links to a number of related issues identifying problems that users experience as a result of the multiple ways a Resource is identified.

          1. It looks like the most recent edit of the SQL drops the display name field altogether in favor of using the last part of the ENTITY_PATH. Good for consistency.

            Question - what happens when you have a folder containing sub-entities and you want to rename the folder?

            For example:

            /group/Week 1/
            /group/Week 1/discussion.doc
            /group/Week 1/classnotes.doc
            /group/Week 1/reading.pdf

            and "/group/Week 1/" gets renamed to "/group/Unit 1/"

            will the ENTITY_PATH for discussion.doc, classnotes.doc and reading.pdf get updated as well, or will we continue to have a name mismatch in these cases?

            1. If I recall correctly, we do have "move" and "copy" that are recursive for folders. The "move" operation may actually create the folder, iterate over "things" in the original folder creating copies in the new folder and then delete the originals. I think it should be implemented as a database operation that updates the records with new id's, but I don't thing it's implemented that way yet.

              In any event, you are right that it should do the right thing in this case.

  3. I am not intimately familiar with the data model, but is there an audit trail somewhere?  If not, one may want to consider factoring out malleable elements such as: MODIFIED_TIME TIMESTAMP
    MODIFIED_BY VARCHAR (99)
    ACCESS_MODE VARCHAR (16)
    HIDDEN BOOLEAN
    RELEASE_DATE DATETIME (0)
    RETRACT_DATE DATETIME (0)

    1. By audit trail, do you mean information about who mad what changes to a particular resource and when? If so, the answer is that CHS does not now save that info. At this point, all that is saved is the user-id and time of the most recent change(s). We are just starting work on versioning (see "Versioning of resources") and I assume that we will need further API and database changes to support versioning once we arrive at decisions about the requirements. I may be totally missing your point here, but I think you're suggesting that we might want a main index table for each entity and then a separate table with info about the history of changes to that resource. Is that right, Peter? That makes a lot of sense.

      Could you explain what makes the elements you mentioned "malleable" while others are not, Peter?

  4. What are the criteria for a property being in the CONTENT_ENTITY table? Is it that the CONTENT_ENTITY table will contain the properties that can be used in a database query to restrict search while optional properties will not be used directly in a database query?

    1. Good question. I was thinking that the properties that come into play for selecting and/or sorting items for any of the common queries used by the Resources tool, the AccessServlet, WebDAV or the filepicker should be in the CONTENT_ENTITY table. But as Peter's comment above seems to suggest, I may not have that quite right.

      Another distinction is that the properties in the CONTENT_ENTITY table are single-cardinality properties that are relevant to all ContentEntity objects. That's true even of RELEASE_DATE and RETRACT_DATE, where a null value indicates that RELEASE_DATE and/or RETRACT_DATE are not used for this entity.

      Groups (or the lack of groups) are relevant for ContentEntities, but groups is not a single-valued property, so it is in a separate table. In the future, some property that is dealt with in the CONTENT_ENTITY_PROPERTIES table may become so important that it becomes part of a very common query. In that case, I'd suggest promoting it to the CONTENT_ENTITY table.

      This question likely deserves further consideration.

  5. I might be missing something,

    but we appear to be sakign ENTITY_ID is unique, but then I look at the insert statements and see ENTITY_ID is the local name of the node eg '/' or 'group' rather than the full path..... which sort of suggests that there will only ever be one folder in the whole of sakai named group troughout the entire hierarchy.

    Thats not right, please tell me I am reading it incorrectly.

    Annother point.
    I beleive the refactor is to make CHS Storage fast.. so that things like WebDAV that hit Storage hard work faster..... but if we have to do lots of path resolution that is counter to the aim. I dont think that there can be n SQL statements per entity resolution where n is the depth of the path... ideally there would be 1 SQL statement to get the primary entity record form CONTENT_ENTITY.

    We can get round the 255 limit on Unique keys in MySQL by using a SHA1 hash of the full path, and then have an index on the full path, sored in VARCHAR2 on Oracle and Text in MYSQL, so that the full path of any one entity can be upto say 4K.... the URL limit for most GET requests.

    And annother one....
    If an numeric ID is used for FK references, then they become bound to the DB instance and require recoding on import/merge etc

    this has been discussed in emails, so I thought it made sense to record here for posterity.

    One discussed in email.
    When I did a prototype hierarchy service, I used Hibernate and found that I could generate deadlocks in mysql where a single table was updated with a text based PK, when I hit the table hard with high concurrency, lots of thread synced to act at the same time. NB, Im talking about deadlocks here not normal locks and Im talking about H3 on a single table so the order of update is not the problem. Its caused by the lock in the btree index for records that dont exist.

    I am not a Hibernate expert, but I could not get rid of the locks that hibernate was taking on non existant records. (by doing a "update ... where " or "select for update where ")

    When recoding in pure SQL the only thing that did not deadlock in MySQL was insert, fail, update pattern.

    1. Sorry. The insert statements are incorrect. I changed those when first considering the possibility of "hierarchical queries" (which turn out to be unique to Oracle, apparently) and forgot to change them back. They should show full paths.

      I will post a revised proposal tomorrow based on initial feedback from several people, including Ian.