
An important feature of the DataForge framework is the ability to work with meta-data. This documentation is automatically generated by a grain script from markdown and HTML pieces. Such pieces are called shards. A shard consists of data in one of the supported formats and a meta-data header. The ordering of shards is automatically inferred from the meta-data by the script.

The meta-data of each shard includes the version of this precise shard, the date of its last update, its label for reference, and ordering information.


The main and most distinguishing feature of the DataForge framework is the concept of data analysis as a meta-data processor. First of all, one needs to define the main terms:

The fundamental point of the whole DataForge philosophy is that every external input could be either data or meta-data. This point could seem dull at first glance, but in fact it has a number of very important consequences:


Meta structure

The Meta object is a tree-like structure which can contain other meta objects as branches (called elements) and Value objects as leaves. Both Values and Meta elements are organized in String-keyed maps, and each map entry is a list of the appropriate type. Requesting a single Value or Meta element returns the first element of this list.


Note that such lists are always immutable. Trying to change one may cause an error.
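
To make this structure concrete, here is a minimal sketch in Java; the class and its method names are illustrative assumptions and do not reproduce the actual Meta API.

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.toUnmodifiableMap;

/** Illustrative sketch only; the real Meta API differs and these names are assumptions. */
final class MetaSketch {
    private final Map<String, List<MetaSketch>> elements; // branches
    private final Map<String, List<Object>> values;       // leaves

    MetaSketch(Map<String, List<MetaSketch>> elements, Map<String, List<Object>> values) {
        // Copy into unmodifiable maps of unmodifiable lists, so any attempt
        // to change a stored list fails, as noted above.
        this.elements = elements.entrySet().stream()
                .collect(toUnmodifiableMap(Map.Entry::getKey, e -> List.copyOf(e.getValue())));
        this.values = values.entrySet().stream()
                .collect(toUnmodifiableMap(Map.Entry::getKey, e -> List.copyOf(e.getValue())));
    }

    /** Requesting a single element returns the first entry of its list. */
    MetaSketch getElement(String name) {
        return elements.get(name).get(0);
    }

    Object getValue(String name) {
        return values.get(name).get(0);
    }
}
```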


While meta itself does not have any write methods and is considered immutable, some of its extensions do have methods that can change the meta structure. One should be careful not to use mutable meta elements when an immutable one is needed.

In order to edit meta conveniently, there is the MetaBuilder class.
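
A builder typically accumulates values in a mutable container and produces an immutable result. The toy builder below only illustrates this pattern; its method names are assumptions and do not match the actual MetaBuilder signatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy builder showing the general pattern; it is not the actual MetaBuilder API.
final class MetaSketchBuilder {
    private final Map<String, Object> values = new LinkedHashMap<>();

    MetaSketchBuilder setValue(String name, Object value) {
        values.put(name, value);
        return this; // fluent chaining, as builder APIs usually allow
    }

    Map<String, Object> build() {
        return Map.copyOf(values); // the result is immutable again
    }
}

// Usage:
// Map<String, Object> meta = new MetaSketchBuilder()
//         .setValue("child_name.value_name", 42)
//         .build();
```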

The naming of meta elements and values follows the basic DataForge naming and navigation convention, meaning that elements and values could be addressed like child_name.grand_child_name.value_name. One can even use numbers as queries in such paths, like child_name.grand_child_name[3].value_name.


An important part of working with meta is composition. Let us consider two use cases:

  1. There is data supplied with meta and one needs a modified version of that meta.
  2. There is a list of data with the same meta and one needs to change the meta only for one data piece.

DataForge provides instruments to easily modify meta (or, since meta is immutable, create a modified instance), but this is not a good solution. A much better way is to use an instrument called Laminate. A Laminate implements the Meta specification, but stores not one tree of values, but multiple layers of meta. When one requests a value or meta from a Laminate, it automatically forwards the request to the first meta layer. If the required node or value is not found in the first layer, the request is forwarded to the second layer, and so on. A Laminate also contains a meta descriptor, which could be used for default values. Laminate functionality also allows using information from all of its layers, for example joining lists of nodes instead of replacing them. Laminate layers could also be merged together to create a classical meta and increase performance; of course, in this case special features like the custom use of all layers are lost.
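
The layering principle could be sketched as follows; the class below is an illustration only and does not reproduce the actual Laminate API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of the layering principle only; the real Laminate API differs.
final class LayeredMeta {
    private final List<Map<String, Object>> layers; // first layer wins

    LayeredMeta(List<Map<String, Object>> layers) {
        this.layers = List.copyOf(layers);
    }

    // Ask the first layer; if the key is missing, fall through to the next one.
    Optional<Object> getValue(String name) {
        for (Map<String, Object> layer : layers) {
            if (layer.containsKey(name)) {
                return Optional.of(layer.get(name));
            }
        }
        return Optional.empty(); // a descriptor with defaults could be consulted here
    }
}

// Use case 1 from above: an override layer on top of the initial meta (values made up).
// LayeredMeta laminate = new LayeredMeta(List.of(
//         Map.<String, Object>of("binning", 32),                 // override layer (first)
//         Map.<String, Object>of("binning", 16, "xLow", 400)));  // initial meta (second)
// laminate.getValue("binning"); // -> 32, taken from the override layer
// laminate.getValue("xLow");    // -> 400, falls through to the second layer
```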

Using a Laminate in case 1 looks like this: if one needs just to change or add some value, one creates a Laminate with the initial meta as the second layer and an override layer containing only the values to be changed as the first.

For case 2 the solution is even simpler: DataNode structures automatically use laminates to define meta for specific data pieces. So one needs just to define the meta for the specific data, and it will be automatically layered with the node meta (or multiple meta elements if the data node structure has many levels).

The typical usage of such layering could be demonstrated on Actions (the push data flow). An Action has three arguments: Context, DataNode and Meta. The actual configuration for a specific Data is defined as a laminate containing the data meta layer, the node meta layer(s) and the action meta layer, meaning that the actual action meta is used only if the appropriate positions are not defined in the data meta (the data knows best how it should be analyzed).


The Configuration is a very important extension of the basic Meta class. It is essentially a mutable meta which incorporates external observers. It is also important that while a simple Meta knows its children but could be attached freely to any ancestor, a Configuration has one designated ancestor that is notified when the configuration is changed.


Note that putting elements or values into a configuration follows the same naming convention as getting them from it, meaning that putting to some_name.something will put into the node some_name if it exists; otherwise, a new node is created.
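
The observer behaviour described above could be sketched like this; the listener type and method names are assumptions, not the actual Configuration API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of the observer idea only; not the actual Configuration API.
final class ObservableConfig {
    private final Map<String, Object> values = new LinkedHashMap<>();
    private final List<BiConsumer<String, Object>> observers = new ArrayList<>();

    void addObserver(BiConsumer<String, Object> observer) {
        observers.add(observer);
    }

    // A mutation notifies every registered observer (in the real framework the
    // designated ancestor configuration would be among the notified parties).
    void setValue(String name, Object value) {
        values.put(name, value);
        observers.forEach(o -> o.accept(name, value));
    }

    Object getValue(String name) {
        return values.get(name);
    }
}
```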



The Meta structure supports so-called hidden values and nodes. These values and nodes exist in the meta and could be requested either by the user or by the framework, but are not shown by a casual listing. They are considered system values and nodes and in most cases are not intended to be defined by the user.

Hidden value and node names start with the symbol @, like @node. Not all meta-data representations allow this symbol, so it could be replaced by the escape sequence _at_ in text formats (<_at_node/> in XML).
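
The escaping rule could be expressed as a tiny helper; this is just a sketch, not a utility provided by the framework.

```java
// Conversion between the in-memory hidden-name form and the escaped text form
// described above; a sketch, not the framework's own utility.
final class HiddenNames {
    static String escape(String name) {
        return name.startsWith("@") ? "_at_" + name.substring(1) : name;
    }

    static String unescape(String name) {
        return name.startsWith("_at_") ? "@" + name.substring("_at_".length()) : name;
    }
}
// HiddenNames.escape("@node")      -> "_at_node"
// HiddenNames.unescape("_at_node") -> "@node"
```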



An important part of the DataForge architecture is context encapsulation. One of the major problems hindering the development of parallel applications is the existence of mutable global states and environment variables. One possible solution to this problem is to run each process in its own personal sandbox carrying copies of all required values; another is to make all environment values immutable. Another problem in a modular system is the dynamic loading of modules and the changing of their states.

In DataForge all these problems are solved by the Context object (the idea is inspired by Android platform contexts). A Context holds all global states (called context properties), but could in fact be different for different processes. Most complicated actions require a context as a parameter. A Context not only stores values, but also works as a base for the plugin system and could be used as a dependency injection base. A Context does not store copies of global values and plugins; instead, one context could inherit from another. When some code requests a value or plugin from the context, the framework checks whether this context contains the required feature. If it is present, it is returned. If not, and the context has a parent, the parent is requested for the same feature.
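
The inheritance lookup could be sketched as follows; the class and method names are assumptions and do not reproduce the actual Context API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the inheritance lookup only; the real Context API differs.
final class ContextSketch {
    private final String name;
    private final ContextSketch parent; // null only for the root ("Global")
    private final Map<String, Object> properties = new LinkedHashMap<>();

    ContextSketch(String name, ContextSketch parent) {
        this.name = name;
        this.parent = parent;
    }

    String getName() {
        return name; // each context has its own unique name
    }

    void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    // Check this context first; if the property is absent and there is a parent,
    // delegate the request to it.
    Optional<Object> getProperty(String key) {
        if (properties.containsKey(key)) {
            return Optional.of(properties.get(key));
        }
        return parent == null ? Optional.empty() : parent.getProperty(key);
    }
}
```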

There is only one context that is allowed not to have a parent. It is a singleton called Global. Global inherits its values directly from the system environment. It is possible to load plugins directly into Global (in this case they will be available for all active contexts), though this is discouraged in large projects. Each context has its own unique name used for distinction and logging.

The runtime immutability of a context is supported via context locks. Any runtime object could request to lock a context and forbid any changes to it. After critical runtime actions are done, the object can unlock the context again. A context supports an unlimited number of simultaneous locks, so it will not become mutable until all locking objects have released it. If one needs to change something in the context while it is locked, the only way to do so is to create an unlocked child context (fork it) and work with that.
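
The lock counting could be sketched like this; an illustration only, not the framework's own implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the lock-counting idea; names are assumptions.
final class ContextLock {
    private final AtomicInteger locks = new AtomicInteger();

    void lock()   { locks.incrementAndGet(); }
    void unlock() { locks.decrementAndGet(); }

    // The owning context stays immutable while at least one lock is held.
    boolean isLocked() { return locks.get() > 0; }

    void checkMutable() {
        if (isLocked()) {
            throw new IllegalStateException(
                    "Context is locked; fork a child context to make changes");
        }
    }
}
```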


The plugin system allows one to dynamically adjust which modules are used in a specific computation. A Plugin is an object that could be loaded into a context. Plugins fully support the context inheritance system, meaning that if a requested plugin is not found in the current context, the request is moved up to the parent context. It is possible to store different instances of the same plugin in a child and a parent context; in this case the plugin of the actual context is used (some plugins are able to sense the same plugin in the parent context and use it).

A context can provide plugins either by type (class) or by a string tag consisting of the plugin group, name and version (using the Gradle-like notation <group>:<name>:<version>).
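
Parsing such a tag is straightforward; the record below is an illustrative sketch, not the actual plugin tag class.

```java
// Parsing of the <group>:<name>:<version> notation mentioned above;
// the record and its field names are illustrative assumptions.
record PluginTag(String group, String name, String version) {

    static PluginTag parse(String tag) {
        String[] parts = tag.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected <group>:<name>:<version>, got: " + tag);
        }
        return new PluginTag(parts[0], parts[1], parts[2]);
    }
}
// Example (tag value made up):
// PluginTag.parse("my.group:storage:1.0") -> PluginTag[group=my.group, name=storage, version=1.0]
```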


Note: It is possible to have different plugins implementing the same type in a context. In this case resolving a plugin by type becomes ambiguous. The framework will throw an exception if two or more plugins satisfy the plugin resolution criteria in the same context.


A plugin has mutable state, but it is automatically locked alongside its owning context. Plugin resolution by default uses the Java SPI tool, but it is possible to implement a plugin repository and use it to load plugins.


Note: While most modules provide their own plugins, there is no rule that a module has exactly one plugin. Some modules could export a number of plugins, while others could export none.



Providers

Navigation inside the DataForge object hierarchy is done via the Provider interface. A provider can give access to one of many targets. A target is a string that designates a specific type of provided object (a type in the user's understanding, not in the programming-language sense). For a given target, the provider resolves a name and returns the needed object if it is present (provided).

Names

The name itself could be a plain string or consist of a number of name tokens separated by the . symbol. Multiple name tokens are usually used to describe a name inside tree-like structures. Each name token could also have a query part enclosed in square brackets: []. A name could look like this:

token1.token2[3].token3[key = value]

Paths

A target and a name could be combined into a single string using path notation: <target>::<name>, where :: is the target-separating symbol. If the provided object is itself a provider, then one can use chain path notation to access objects provided by it. A chain path consists of path segments separated by the / symbol. Each segment is a fully qualified path. The resolution of a chain path begins with the first segment and moves forward until the path is fully resolved. If some of the path names are not provided or the objects themselves are not providers, the resolution fails.

Path notation

Default targets

In order to simplify paths, providers could define default targets. If no target is provided (either the path string does not contain :: or it starts with ::), the default target is used. A provider can also define a default chain target, meaning that this target will be used by default in the next path segment.

For example, Meta itself is a provider and has two targets: meta and value. The default target for Meta is meta, but the default chain target is value, meaning that the chain path child.grandchild/key is equivalent to meta::child.grandchild/value::key and points to the value key of the node child.grandchild.
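
The resolution of a single path segment with a default target could be sketched as follows; an illustration only, not the actual parser.

```java
// Sketch of splitting a path segment into target and name, applying a default
// target when none is given; class and method names are assumptions.
final class PathSegment {
    final String target;
    final String name;

    private PathSegment(String target, String name) {
        this.target = target;
        this.name = name;
    }

    static PathSegment parse(String segment, String defaultTarget) {
        int idx = segment.indexOf("::");
        if (idx < 0) {
            return new PathSegment(defaultTarget, segment); // no target given
        }
        String target = segment.substring(0, idx);
        String name = segment.substring(idx + 2);
        return new PathSegment(target.isEmpty() ? defaultTarget : target, name); // "::name" form
    }
}
// For Meta: "child.grandchild/key" splits on '/' into two segments;
// PathSegment.parse("child.grandchild", "meta") -> target "meta",  name "child.grandchild"
// PathSegment.parse("key", "value")             -> target "value", name "key"
// which together point at the same value as "meta::child.grandchild/value::key".
```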


Note: The same result in the case of Meta could be achieved in some additional ways:

Still, the path child.grandchild/key is the preferred way to access values in a meta.


Restrictions and recommendations

Due to the chain path notation, there are some restrictions on the symbols available in names:

Implementation

Under construction...

This section is under construction...


One of the most important features of the DataForge framework is data flow control. Any


Actions (push data flow model)


Tasks (pull data flow model)


DataForge functionality is largely based on metadata exchange, and therefore the main medium for messages between different parts of the system is the Meta object and its derivatives. But sometimes one needs to transfer not only metadata but also some binary or object data. For this, DataForge supports an Envelope entity, which contains both a meta block and a binary data block. An Envelope could be automatically serialized to or from a byte stream.

DataForge supports an extensible list of Envelope encoding methods. A specific method is defined by so-called encoding properties: a list of key-value pairs that define the specific way the meta and data are encoded. The meta-data itself could also have different encodings. Out of the box, the DataForge server supports two envelope formats and three meta formats.


The default envelope format is designed for the storage of binary data or for transferring data via a byte stream. The structure of this format is the following:

Meta encoding

The DataForge server supports the following metadata encoding types:

  1. XML encoding. The full name for this encoding is XML, the tag code is XM.
  2. JSON encoding (currently supported only with storage module attached). The full name is JSON, the tag code is JS.
  3. Binary encoding. DataForge's own binary meta representation. The full name is binary, the tag code is BI.

To avoid confusion: all full names are case-insensitive. All meta is supposed to always use the UTF-8 character encoding.


The tagless format is designed to store textual data without the need for a binary block at the beginning. It is not recommended for binary data or for streaming. The structure of this format is the following (a minimal reading sketch is given after the list):

  1. The header line, #~DFTL~#. This line is used only to identify a DataForge envelope. The standard reader is also configured to skip any lines starting with # before this line, so the format is compatible with shebangs. All header lines in the tagless format must have at least a newline \n character after them (the DOS/Windows newline \r\n is also supported).

  2. Properties. Properties are defined in textual form. Each property is defined on its own line in the following way:

    #? <property key> : <property value>; <new line>
    

    Any whitespace before the beginning of <property value> is ignored. The ; symbol is optional, but everything after it is ignored. Every property must be on a separate line. The end of a line is defined by the \n character, so both Windows and Linux line endings are valid. Properties are accepted both by their textual representation and by tag code.

  3. Meta block start. The meta block start string is defined by the metaSeparator property. The default value is #~META~#. The meta block start could be omitted if the meta is empty.

  4. Meta block. Everything between the meta block start and the data block start (or the end of file) is treated as meta. The metaLength property is ignored. It is not recommended to use binary meta encoding in this format.

  5. Data block start. The data block start string is defined by the dataSeparator property. The default value is #~DATA~#. The data block start could be omitted if the data is empty.

  6. Data block. The data itself. If the dataLength property is defined, it is used to trim the remaining bytes in the stream or file. Otherwise, the end of the stream or file defines the end of the data.
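
A minimal sketch of reading the property section of such a header is given below; it follows the rules listed above, but it is not the framework's own reader and all names in it are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: skip lines before the #~DFTL~# header, then collect
// "#? key: value;" properties until something else (e.g. the meta separator) appears.
final class TaglessHeaderReader {

    static Map<String, String> readProperties(BufferedReader reader) throws IOException {
        Map<String, String> properties = new LinkedHashMap<>();
        String line;
        boolean headerSeen = false;
        while ((line = reader.readLine()) != null) {
            if (!headerSeen) {
                if (line.startsWith("#~DFTL~#")) {
                    headerSeen = true;     // the header line itself
                }                          // lines before the header (e.g. a shebang) are skipped
                continue;
            }
            if (line.startsWith("#?")) {
                String body = line.substring(2);
                int semicolon = body.indexOf(';');
                if (semicolon >= 0) {
                    body = body.substring(0, semicolon); // everything after ';' is ignored
                }
                int colon = body.indexOf(':');
                if (colon < 0) {
                    continue;              // malformed property line, skipped in this sketch
                }
                properties.put(body.substring(0, colon).trim(),
                        body.substring(colon + 1).trim());
            } else {
                break;                     // meta block start (or anything else) ends the properties
            }
        }
        return properties;
    }
}
```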


Interpretation of envelope contents is basically left to the user, but for convenience there is a convention for an envelope description encoded inside the meta block.

The envelope description is placed into the hidden @envelope meta node. The description could contain the following values:

Both the envelope type and the data type are supposed to be presented in reversed Internet domain name notation, like Java packages.


The Storage plugin defines an interface between DataForge and different data storage systems, such as databases, remote servers, or other means to save and load data.

The main object in the storage system is called Storage. It represents a connection to some data-storing back-end. In terms of SQL databases (which are not used by DataForge by default) it is the equivalent of a database. A Storage could provide different Loaders. A Loader governs direct data pushing and pulling. In terms of SQL, it is the equivalent of a table.


Note: The DataForge storage system is designed to be used with experimental data, and therefore loaders are optimized to put data online and then analyze it. Operations that modify existing data are not supported by the basic loaders.


The storage system is hierarchical: each storage could have any number of child storages, so it is basically a tree. Each child storage has a reference to its parent. A storage without a parent is called a root storage. The system could support any number of root storages at a time using the storage context plugin.

By default, the DataForge storage module supports the following loader types:

  1. PointLoader. The direct equivalent of an SQL table. It can push or pull DataPoint objects. A PointLoader contains information about the DataPoint DataFormat. It is assumed that this format is just a minimum requirement for DataPoint pushing, but an implementation can simply cut all fields that are not contained in the loader format.
  2. EventLoader. Can push DataForge events.
  3. StateLoader. The only loader that allows changing data. It holds a set of key-value pairs. Each subsequent push overrides the appropriate state (see the sketch after this list).
  4. BinaryLoader. A named set of fragment objects. The type of these objects is defined by a generic parameter, and the API does not define the format or procedure to read or write these objects.
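
The StateLoader behaviour from item 3 could be sketched as a simple key-value store where a push replaces the previous value; an illustration only, not the actual loader API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the StateLoader behaviour described above; names are assumptions.
final class StateStoreSketch {
    private final Map<String, Object> states = new ConcurrentHashMap<>();

    void push(String key, Object value) {
        states.put(key, value); // a subsequent push simply replaces the old value
    }

    Object pull(String key) {
        return states.get(key);
    }
}
```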

Loaders, as well as Storages, implement the Responder interface and could accept requests in the form of envelopes.


The FileStorage is the default implementation of the storage API.

Under construction...

This section is under construction...


The DataForge control subsystem defines a general API for data acquisition processes. It could be used to issue commands to devices, read data, and communicate with the storage system or other devices.

The center of the control API is the Device class. A device has the following important features:


DataForge also supports a number of language extensions inside the JVM.


Grind is a thin overlay layer for the DataForge framework written in Groovy, a dynamic Java dialect. Groovy is a great language for writing fast dynamic scripts with Java interoperability. It is also used to work with interactive environments like Beaker.

The GRIND module contains some basic extensions and operator overloading for DataForge, as well as a DSL for building Meta. Separate modules provide an interactive console based on GroovyShell and some mathematical extensions.


Kotlin is one of the best efforts to make a "better Java". The language is probably not the best way to write a complex architecture like the one used in DataForge, but it definitely should be used in end-user applications. KODEX contains some basic extensions of DataForge classes as well as operator overrides for values and metas. It is also planned to include a JavaFX extension library based on the great tornadofx.