
Introduction

For a long time we have used XML storage in blobs for many of the entities in Sakai. This has given us flexibility and the ability to add properties to entities without the need for schema changes. Provided we have added sufficient supporting columns for searching, sorting and aggregation, it is a relatively efficient way of storing and retrieving unstructured data from the database. However, where we have not provided those columns it causes problems, and the cost of parsing the XML to load the entity into memory is excessive and unacceptable. Some recent work in this area tries to address this.

Encoding

XML parsing may be great for interchange, but even the fastest XML parsers cannot compete with a binary byte[] serialization. The Type1 Block Encoding serializes the properties of an entity to a byte[] using the standard DataOutputStream. "Type1" refers to the format of the data block, and "Block" refers to the block structure within it.

The data block is encoded in a byte[] and starts with a byte sequence (BLOB_ID) that identifies the entity type of the block. The byte sequence is "CHSBCE" for content collections and "CHSBRE" for content resources, written as single-byte unicode characters. The intended convention is that the first three characters denote the service and the next three denote the entity.

The remainder of the data block is a byte[] containing the encoded values.

The data block is read and written with DataOutputStream and DataInputStream. The first int (readInt()) is the Type of the block; a parser should only attempt to read data blocks that match its type. The next int is the Block ID within that type. After that, each block is defined by the implementation of the reader or writer, although I have tried to share block IDs within the same service: CHS resources and CHS collections have different BLOB_IDs, but for the same Type ID (Type 1) the common blocks are encoded in the same way. The order of the blocks should be assumed to be arbitrary, and where entities are hierarchically structured, blocks may be nested. For instance, many entities have Properties, so ContentResources put their Properties into BLOCK4 (identifier value 13). This block is then processed by the ResourceProperties serializer class, which has its own internal TYPE_ID and BLOCK_IDs. This nesting of blocks enables us to replicate the hierarchical structure of XML without the parsing overhead.
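To make the wire layout concrete, here is a minimal sketch of how a Type1 reader might walk a data block with DataInputStream. It is illustrative only: the class name, the constants and the handling of the BLOB_ID as six raw bytes are assumptions made for this sketch, and the authoritative implementations are the serializer classes in content-impl.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class Type1BlockReaderSketch {

    private static final int TYPE1 = 1;      // Type identifier used by this format
    private static final int BLOCK1 = 10;    // id / resourceType / access / hidden
    private static final int BLOCK_END = 2;  // end-of-entity marker

    public void parse(byte[] blob) throws IOException {
        DataInputStream din = new DataInputStream(new ByteArrayInputStream(blob));

        // BLOB_ID: a short single-byte character sequence identifying the entity type,
        // e.g. "CHSBCE" for collections or "CHSBRE" for resources. Reading it as six
        // raw bytes is an assumption made for this sketch.
        byte[] blobId = new byte[6];
        din.readFully(blobId);

        // The first int is the Type of the block; refuse anything we do not understand.
        int type = din.readInt();
        if (type != TYPE1) {
            throw new IOException("Unexpected serialization type " + type);
        }

        // Blocks may appear in any order; dispatch on block IDs until the end marker.
        int blockId;
        while ((blockId = din.readInt()) != BLOCK_END) {
            switch (blockId) {
                case BLOCK1: {
                    String id = din.readUTF();
                    String resourceType = din.readUTF();
                    String access = din.readUTF();
                    boolean hidden = din.readBoolean();
                    // ... populate the entity with the decoded fields ...
                    break;
                }
                default:
                    // The real implementation handles every known block explicitly.
                    throw new IOException("Unknown block " + blockId);
            }
        }
    }
}

The key point is the dispatch loop: blocks are identified by their int IDs and read until the end marker, so block ordering stays arbitrary and new blocks can be added cheaply.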

Content Collection

| Block        | Content                     | Encoding                     |
|--------------|-----------------------------|------------------------------|
| BLOB_ID      | CHSBCE                      | unicode single byte sequence |
| TYPEID(1)    | Type Identifier = 1         | Int                          |
| BLOCK1(10)   | Block Identifier value = 10 | Int                          |
|              | id                          | UTFString                    |
|              | resourceType                | UTFString                    |
|              | access                      | UTFString                    |
|              | hidden                      | boolean                      |
| BLOCK2(11)   | Block Identifier value = 11 | Int                          |
|              | releaseDate                 | Long                         |
|              | retractDate                 | Long                         |
| BLOCK3(12)   | Block Identifier value = 12 | Int                          |
|              | number of Groups            | Int                          |
|              | array of groups             | array of UTFString           |
| BLOCK4(13)   | Block Identifier value = 13 | Int                          |
|              | Properties Block            | Nested Block                 |
| BLOCK_END(2) | End marker value = 2        | Int                          |
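Read as a writer, the Content Collection layout might look roughly like the sketch below. The method signature is illustrative, and treating the nested properties block as a pre-encoded byte[] is an assumption made for this sketch; the real ResourceProperties serializer may well write its nested blocks directly to the same stream.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

public class CollectionType1WriterSketch {

    public byte[] serialize(String id, String resourceType, String access, boolean hidden,
            long releaseDate, long retractDate, List<String> groups,
            byte[] propertiesBlock) throws IOException {

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(bos);

        dout.write("CHSBCE".getBytes("UTF-8")); // BLOB_ID for content collections
        dout.writeInt(1);                       // TYPEID(1)

        dout.writeInt(10);                      // BLOCK1(10): core fields
        dout.writeUTF(id);
        dout.writeUTF(resourceType);
        dout.writeUTF(access);
        dout.writeBoolean(hidden);

        dout.writeInt(11);                      // BLOCK2(11): availability dates
        dout.writeLong(releaseDate);
        dout.writeLong(retractDate);

        dout.writeInt(12);                      // BLOCK3(12): groups
        dout.writeInt(groups.size());
        for (String group : groups) {
            dout.writeUTF(group);
        }

        dout.writeInt(13);                      // BLOCK4(13): nested properties block,
        dout.write(propertiesBlock);            // assumed here to be pre-encoded bytes

        dout.writeInt(2);                       // BLOCK_END(2)
        dout.flush();
        return bos.toByteArray();
    }
}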

Content Resource

| Block        | Content                               | Encoding                     |
|--------------|---------------------------------------|------------------------------|
| BLOB_ID      | CHSBRE                                | unicode single byte sequence |
| TYPEID(1)    | Type Identifier = 1                   | Int                          |
| BLOCK1(10)   | Block Identifier value = 10           | Int                          |
|              | id                                    | UTFString                    |
|              | resourceType                          | UTFString                    |
|              | access                                | UTFString                    |
|              | hidden                                | boolean                      |
| BLOCK2(11)   | Block Identifier value = 11           | Int                          |
|              | releaseDate                           | Long                         |
|              | retractDate                           | Long                         |
| BLOCK3(12)   | Block Identifier value = 12           | Int                          |
|              | number of elements in following array | Int                          |
|              | array of groups                       | array of UTFString           |
| BLOCK4(13)   | Block Identifier value = 13           | Int                          |
|              | Properties Block                      | Nested Block                 |
| BLOCK5(14)   | Block Identifier value = 14           | Int                          |
|              | contentType                           | UTFString                    |
|              | contentLength                         | UTFString                    |
|              | filePath                              | UTFString                    |
| BLOCK6(15)   | Block Identifier value = 15           | Int                          |
|              | length of following byte array        | Int                          |
|              | byte array representing body          | array of byte                |
| BLOCK_END(2) | End marker value = 2                  | Int                          |

ResourceProperties

| Block        | Content                                               | Encoding                                                   |
|--------------|-------------------------------------------------------|------------------------------------------------------------|
| BLOB_ID      | N/A                                                   | Properties don't exist except as a sub-block of an entity  |
| TYPEID(1)    | Type Identifier = 1                                   | Int                                                        |
| BLOCK1(100)  | Block Identifier value = 100                          | Int                                                        |
|              | Number of property blocks (BLOCK2 & BLOCK3) to follow | Int                                                        |
| BLOCK2(101)  | Block Identifier value = 101                          | Int                                                        |
|              | Name of the Property                                  | UTFString                                                  |
|              | Value of the Property                                 | UTFString                                                  |
| BLOCK3(102)  | Block Identifier value = 102                          | Int                                                        |
|              | Name of the Property                                  | UTFString                                                  |
|              | Size of the following Array                           | Int                                                        |
|              | Array of Values                                       | array of UTFString                                         |
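The nested properties block can be sketched the same way. The method and parameter names below are illustrative; the block identifiers and field order follow the table above, and the leading count in BLOCK1(100) is what lets the reader know how many property blocks follow without needing an end marker.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;

public class ResourcePropertiesType1Sketch {

    public byte[] serialize(Map<String, String> singleValued,
            Map<String, List<String>> multiValued) throws IOException {

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(bos);

        // No BLOB_ID: properties only exist as a nested block inside an entity.
        dout.writeInt(1);                                        // TYPEID(1)
        dout.writeInt(100);                                      // BLOCK1(100)
        dout.writeInt(singleValued.size() + multiValued.size()); // property blocks to follow

        for (Map.Entry<String, String> e : singleValued.entrySet()) {
            dout.writeInt(101);                                  // BLOCK2(101): single value
            dout.writeUTF(e.getKey());
            dout.writeUTF(e.getValue());
        }
        for (Map.Entry<String, List<String>> e : multiValued.entrySet()) {
            dout.writeInt(102);                                  // BLOCK3(102): value list
            dout.writeUTF(e.getKey());
            dout.writeInt(e.getValue().size());
            for (String value : e.getValue()) {
                dout.writeUTF(value);
            }
        }
        dout.flush();
        return bos.toByteArray();
    }
}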

Storage

DataOutputStream produces a byte[]. We did try to store this in a char[] by unsigning the byte[], but this proved problematic because it relies on a database connection that does not corrupt characters in the range 128-255 (i.e. pure UTF-8). There are production systems in existence that are not using UTF-8, hence we have switched to placing the byte[] into a new BLOB column, BINARY_ENTITY.
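As background to that decision, here is a rough sketch of what "unsigning" a byte[] into a char[] looks like. It is an assumption about the exact scheme that was tried, but it shows why the round trip only survives if every code point from 0 to 255 passes through the text column, driver and connection encoding untouched.

public class BytePackingSketch {

    // Pack each byte into a char in the range 0-255.
    static char[] pack(byte[] data) {
        char[] out = new char[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (char) (data[i] & 0xFF); // "unsign" the byte
        }
        return out;
    }

    // The reverse trip is only lossless if no character between 128 and 255
    // was translated or dropped by the driver, connection or column encoding.
    static byte[] unpack(char[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) data[i];
        }
        return out;
    }
}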

Performance

Performance is the only reason we are doing this. If we look at the overhead of parsing and serializing on its own, excluding the cost of storing the input and output data, we see the following:

| Technique         | Memory Overhead per Entity | CPU Time per Entity |
|-------------------|----------------------------|---------------------|
| XML DOM Parse     | 389 KB                     | 3.5 ms              |
| XML DOM Serialize | 79 KB                      | 3.1 ms              |
| SAX Parse         | 48 KB                      | 1.2 ms              |
| Binary Parse      | 23 bytes                   | 16 µs               |
| Binary Serialize  | 26 bytes                   | 18 µs               |

The memory consumption in these tests was calculated using Runtime.freeMemory() before and after parsing/serializing 16,000 entities. It looks like the JVM reuses most of the code and memory footprint in DataOutputStream and DataInputStream, so that memory requirement is shared over all 16,000 entities, hence the low overhead.
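A minimal sketch of that kind of measurement harness is shown below. The Parser interface and method names are placeholders invented for illustration; the reported figures came from the real Sakai test code, not from this sketch, and free-memory deltas of this sort are only indicative because garbage collection can run mid-loop.

public class SerializationBenchmarkSketch {

    interface Parser {
        void parse(byte[] blob) throws Exception;
    }

    static void measure(Parser parser, byte[] blob, int iterations) throws Exception {
        Runtime rt = Runtime.getRuntime();
        System.gc();                          // encourage a stable baseline
        long freeBefore = rt.freeMemory();
        long timeBefore = System.nanoTime();

        for (int i = 0; i < iterations; i++) {
            parser.parse(blob);               // e.g. 16,000 iterations, as in the tests above
        }

        long timeAfter = System.nanoTime();
        long freeAfter = rt.freeMemory();

        System.out.println("memory per entity ~ " + (freeBefore - freeAfter) / iterations + " bytes");
        System.out.println("time per entity   ~ " + (timeAfter - timeBefore) / (iterations * 1000L) + " us");
    }
}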

Inside Sakai, measuring the cost of parsing and serializing entities to and from memory in the BaseDbDoubleStorage and BaseDbSingleStorage (and associated) classes, and averaging over 100 operations, we see the real cost to Sakai:

| Technique         | Memory Overhead per Entity | CPU Time per Entity |
|-------------------|----------------------------|---------------------|
| XML DOM Parse     | 402 KB                     | 3.5 ms              |
| XML DOM Serialize | 82 KB                      | 3.1 ms              |
| SAX Parse         | 52 KB                      | 1.2 ms              |
| Binary Parse      | 4 KB                       | 62 µs               |
| Binary Serialize  | 5 KB                       | 65 µs               |

In these tests there was a variation of about ±10% in both the memory and CPU time.

Testing and Conversion

A new BLOB column in the entity storage table, BINARY_ENTITY, is used for storing the binary entity. When Content Hosting starts, it checks for this column and the XML column and acts accordingly. If it finds only the XML column, it uses the default SAX/DOM BaseDbSingleStorage. If it finds both the XML column and the BINARY_ENTITY column, it uses a new StorageUser class, BaseDbDualSingleStorage, which reads from whichever column is not null and writes to the BINARY_ENTITY column. On writing to the BINARY_ENTITY column it sets the XML column to NULL in the same update, which indicates that the record has been migrated. Once all the XML values are NULL, that column can be removed from the table in question. If it finds only the BINARY_ENTITY column, it uses the BaseDbBinarySingleStorage StorageUser class.

The behaviour of BaseDbDualSingleStorage can be modified by the migrateData property. If true (the default), it will write to the BINARY_ENTITY column. If false, it will check that there are no non-null entries in the BINARY_ENTITY column and write to the XML column, disabling data migration. When the migrateData property is false, it will refuse to start up if there are any non-null entries in the BINARY_ENTITY column, to avoid losing data.
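If the property is wired through sakai.properties in the usual property@bean form, disabling migration might look like the sketch below. The target bean id here is a guess for illustration only and should be checked against the content-impl component configuration.

# Hypothetical example only - verify the target bean id against the content-impl
# component configuration before relying on it.
migrateData@org.sakaiproject.content.api.ContentHostingService=false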

There is a battery of tests within Content that verify that the serializations work and do not lose information. In addition there is a conversion utility. The implementation in Content Hosting can read and write both XML and binary and is configured by default to write binary content to the database.

Offline Conversion

There is an offline conversion utility that enables conversion of the database from the command line on a separate machine. This utility is designed to run with the Sakai app servers live and does not require them to be offline. It assumes that the BINARY_ENTITY columns are already present in the schema. When it runs, it builds a table containing items that have not been converted and works through that list in batches of 100, checkpointing after each 100 conversions. It might be possible to run this utility on a number of machines in parallel, but this has not been tested and is probably not recommended.

To run it, you need to create a configuration file with the database connection details, and you may need to edit the SQL in that file to suit your target database. A copy of the default file can be found at https://source.sakaiproject.org/svn/content/trunk/content-impl/impl/src/java/org/sakaiproject/content/impl/serialize/impl/conversion/upgradeschema.config and is reproduced here:

# UpgradeSchema Control File (Default)
# Conversion needs the database connection details
dbDriver=com.mysql.jdbc.Driver
dbURL=jdbc:mysql://127.0.0.1:3306/sakai22?useUnicode=true&characterEncoding=UTF-8
dbUser=sakai22
dbPass=sakai22

convert.0=FileSizeResourcesConversion
convert.0.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.FileSizeResourcesConversionHandler
convert.0.create.migrate.table=create table content_res_fsregister ( id varchar(1024), status varchar(99) )
convert.0.drop.migrate.table=drop table content_res_fsregister
convert.0.check.migrate.table=select count(*) from content_res_fsregister where status <> 'done'
convert.0.select.next.batch=select id from content_res_fsregister where status = 'pending' limit 100
convert.0.complete.next.batch=update content_res_fsregister set status = 'done' where id = ?
convert.0.mark.next.batch=update content_res_fsregister set status = 'locked' where id = ?
convert.0.populate.migrate.table=insert into content_res_fsregister (id,status) select RESOURCE_ID, 'pending' from CONTENT_RESOURCE
convert.0.select.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.0.select.validate.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.0.update.record=update CONTENT_RESOURCE set CONTEXT = ?, FILE_SIZE = ? where RESOURCE_ID = ? 

convert.1=Type1BlobCollectionConversion
convert.1.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.Type1BlobCollectionConversionHandler
convert.1.create.migrate.table=create table content_col_t1register ( id varchar(1024), status varchar(99) )
convert.1.drop.migrate.table=drop table content_col_t1register
convert.1.check.migrate.table=select count(*) from content_col_t1register  where status <> 'done'
convert.1.select.next.batch=select id from content_col_t1register where status = 'pending' limit 100
convert.1.complete.next.batch=update content_col_t1register set status = 'done' where id = ?
convert.1.mark.next.batch=update content_col_t1register set status = 'locked' where id = ?
convert.1.populate.migrate.table=insert into content_col_t1register (id,status) select COLLECTION_ID, 'pending' from CONTENT_COLLECTION where BINARY_ENTITY IS NULL
convert.1.select.record=select XML from CONTENT_COLLECTION where COLLECTION_ID = ?
convert.1.select.validate.record=select BINARY_ENTITY from CONTENT_COLLECTION where COLLECTION_ID = ?
convert.1.update.record=update CONTENT_COLLECTION set XML = NULL, BINARY_ENTITY = ?  where COLLECTION_ID = ?

convert.2=Type1BlobResourceConversion
convert.2.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.Type1BlobResourcesConversionHandler
convert.2.create.migrate.table=create table content_res_t1register ( id varchar(1024), status varchar(99) )
convert.2.drop.migrate.table=drop table content_res_t1register
convert.2.check.migrate.table=select count(*) from content_res_t1register  where status <> 'done'
convert.2.select.next.batch=select id from content_res_t1register where status = 'pending' limit 100
convert.2.complete.next.batch=update content_res_t1register set status = 'done' where id = ?
convert.2.mark.next.batch=update content_res_t1register set status = 'locked' where id = ?
convert.2.populate.migrate.table=insert into content_res_t1register (id,status) select RESOURCE_ID, 'pending' from CONTENT_RESOURCE where BINARY_ENTITY is NULL
convert.2.select.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.2.select.validate.record=select BINARY_ENTITY from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.2.update.record=update CONTENT_RESOURCE set CONTEXT = ?, FILE_SIZE = ?, XML = NULL, BINARY_ENTITY = ? where RESOURCE_ID = ? 

There is a command-line runner, currently a bash script, that runs the conversion utility. It assumes that your Maven repository contains an M2 build of Sakai at ~/.m2/repository.

To run:

cd SAKAI_SOURCE/content
# Run maven to check that we are up to date
mvn -o clean install
runconversion.sh myconversion.config

The script is in SVN, but at the time of writing it is:

#!/bin/sh
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-logging/commons-logging/1.0.4/commons-logging-1.0.4.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-dbcp/commons-dbcp/1.2.2/commons-dbcp-1.2.2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-pool/commons-pool/1.3/commons-pool-1.3.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/mysql/mysql-connector-java/3.1.11/mysql-connector-java-3.1.11.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-util-api/M2/sakai-util-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-entity-api/M2/sakai-entity-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-entity-util/M2/sakai-entity-util-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-content-api/M2/sakai-content-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-content-impl/M2/sakai-content-impl-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-util-log/M2/sakai-util-log-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/log4j/log4j/1.2.9/log4j-1.2.9.jar"

java $JAVA_OPTS  \
      -classpath "$CLASSPATH" \
      org.sakaiproject.content.impl.serialize.impl.conversion.UpgradeSchema "$@" 

This conversion utility starts a very small part of the Sakai framework to do its work.
