A shallow dive into DataScript internals
This is an overview of the DataScript code base. Without going into much detail, it paints an overall picture of how the code is structured, what parts it’s built of, and what purpose they serve. If you’re interested in studying DataScript sources, this is a great place to start.
For those already using DataScript, this post may help you get a better understanding of the machinery behind the public APIs and make better use of them.
Datoms
The minimal piece of information in DataScript is a Datom. It’s defined in datascript.core as

(defrecord Datom [e a v tx added])

where [e a v] is entity, attribute and value, tx is a transaction id (integer), and added is a boolean flag to differentiate additions from retractions.

Datom has (-hash) and (-equiv) methods overloaded so that only [e a v] take part in comparisons; tx and added are ignored.
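One consequence, illustratively (->Datom is the positional constructor generated by defrecord):

;; equal despite different tx and added, because only [e a v] matter
(= (->Datom 1 :name "Ivan" 1000 true)
   (->Datom 1 :name "Ivan" 2000 false)) ;=> true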
DB
A database is just an immutable, persistent collection of datoms. It’s very much like a built-in ClojureScript data structure, e.g. a set. The main operations supported by a DB are:
- Add a datom (similar to conj)
- Retract a datom (similar to disj)
- Search for a range of datoms
Both addition and retraction are implemented in datascript.core/with-datom. Searching is implemented as part of the ISearch protocol.
A DataScript DB contains only currently relevant datoms. There’s no history tracking at the DB level. When a datom is removed from a DB, there’s no trace of it anywhere. Retracted means gone.
Internally the DB contains three different indexes to help with various search patterns. Each index is a sorted set of datoms (it’s literally a set). They’re named after their sort order: EAVT, AEVT and AVET. Every index contains all datoms of the current DB, meaning all three indexes are the same set sorted differently. There’s no other datom storage inside the DB besides the indexes.
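For example, each lookup pattern maps naturally onto one of the indexes. A sketch using the public d/datoms index-access API (attribute names made up; d aliases the datascript namespace, as in the examples later in this post):

(d/datoms db :eavt 1)              ;; all datoms of entity 1 (EAVT)
(d/datoms db :aevt :person/name)   ;; all :person/name datoms (AEVT)
(d/datoms db :avet :person/age 15) ;; entities whose :person/age is 15 (AVET)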
DB is defined in datascript.core as

(defrecord DB [schema
               eavt aevt avet
               max-eid
               max-tx
               rschema])
Besides the indexes, there’s only the schema (just a regular map) and some internal bookkeeping: the max seen entity id, the latest transaction id and a reverse value-to-key schema map.
BTSet
Each DataScript index is stored as datascript.btset
. It’s an immutable, persistent implementation of B+ tree data structure. It’s a proper ClojureScript collection, with all protocols of sorted-set
implemented.
BTSet was needed because DataScript does a lot of range scans over indexes. Thanks to the use of contiguous JS arrays as tree nodes, range scans over BTSet are ~3 times faster than over the built-in sorted-set, which is a red-black binary tree.
BTSet uses generative testing to validate its correctness: the same operations are performed on both BTSet and sorted-set, and the results are compared for equality (see stresstest-... in datascript.test.btset).
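The shape of such a test, roughly (a simplified sketch, not the actual test code; assumes the zero-argument btset constructor in datascript.btset):

(require '[datascript.btset :as btset])

;; Apply the same random additions and removals to a BTSet and a
;; built-in sorted-set, then check both contain the same elements.
(defn stress-sketch [n]
  (loop [bt (btset/btset), ss (sorted-set), i 0]
    (if (< i n)
      (let [v  (rand-int 1000)
            op (if (< (rand) 0.7) conj disj)]
        (recur (op bt v) (op ss v) (inc i)))
      (assert (= (vec bt) (vec ss))))))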
Adding data to DB
DataScript has a lot of conventions about how to format data before adding it to the DB. Basically, you can provide a vector of the form [op e a v] or a map {:db/id e, (a v)+}, with a number of conveniences on top: :db.fn/* shortcuts, referencing the current transaction id, using nested maps for components, understanding different attribute arities, and reversing reverse references.
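For illustration, a few of these forms side by side (attribute names made up; -1 is a tempid that gets resolved during the transaction, and the reverse-reference form assumes :person/friend is declared as a ref in the schema):

;; op-vector form
(d/transact! conn [[:db/add -1 :person/name "Ivan"]])
;; equivalent entity-map form
(d/transact! conn [{:db/id -1, :person/name "Ivan"}])
;; reverse reference: asserts [2 :person/friend <new entity>]
(d/transact! conn [{:db/id -1, :person/_friend 2}])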
The biggest part of datascript.core, starting from (defrecord TxReport), is all about solving these problems. The main work happens in transact-tx-data, which uses recursion to reduce the complexity of transaction data step by step, until it’s reduced to a set of plain datoms:

entity map → op vector[s] → id resolution → datom[s]
The transact-tx-data loop also builds a TxReport record along the way:

(defrecord TxReport [db-before
                     db-after
                     tx-data
                     tempids
                     tx-meta])
TxReport is a log of everything that happened during the transaction. It stores the temporary entity ids resolution table (tempids) and the raw datoms which were used to modify the DB (tx-data). Given a TxReport and db-before, it’s trivial to replay the transaction and calculate db-after:

db-before + tx-data = db-after
tx-data is the end result of transaction data simplification, with all temporary and implicit ids allocated to actual entity ids, all maps expanded into vector forms, and all vectors converted to Datoms.
TxReport is also the only place where you can see datoms with added == false, marking datoms which were retracted.
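That replay property can be stated as code (a conceptual sketch; it assumes the public d/with function, which applies tx-data to a DB value and returns a TxReport):

(defn replay-sketch [report]
  ;; convert each datom back into an op vector and apply to db-before
  (:db-after
    (d/with (:db-before report)
            (for [d (:tx-data report)]
              [(if (:added d) :db/add :db/retract) (:e d) (:a d) (:v d)]))))

;; (= (:db-after report) (replay-sketch report)) ;=> true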
Entities
An entity is just a convenient interface to the EAVT index. Basically this:
(:person/name (d/entity db 1))
is translated to this:
(first (d/datoms db :eavt 1 :person/name))
with some per-entity caching. Most of the datascript.impl.entity namespace is spent on implementing protocols to make entities look like normal ClojureScript maps.
Queries
The query engine implements Datalog over DBs and collections. It is the most complicated part of DataScript so far.
The query engine operates on relations. A relation is just a collection of tuples. E.g. this:

:where [?e :name ?v]

will be resolved, by querying the DB, into the following relation:

R1 = {:symbols {?e 0 ?v 1}
      :tuples  [#js [0 "Ivan"], #js [5 "Oleg"], ...]}
During query execution, :where is processed clause by clause. Relations which share no common variables are kept separate in a context, building a collection of relations. If :where looks like this:

:where [?e :name ?v]
       [?e2 :name ?v2]

then DataScript will create two separate relations:

R1 = {:symbols {?e 0 ?v 1}
      :tuples  [#js [0 "Ivan"], #js [5 "Oleg"], ...]}
R2 = {:symbols {?e2 0 ?v2 1}
      :tuples  [#js [0 "Ivan"], #js [5 "Oleg"], ...]}
Now what happens if we introduce a third clause that implicitly joins the first two relations by referring to the same variables? Let’s take [?e :friend ?e2]:

R3 = {:symbols {?e 0 ?e2 1}
      :tuples  [#js [0 5], #js [5 0], ...]}
Now the context contains three relations, but they are not truly independent: they share common variables. This is forbidden, so at this point they will be reduced to a single relation. DataScript needs to hash-join the latest relation (R3) with whatever conflicting relations there are. In our case, R3 has an intersection with both R1 (by ?e) and R2 (by ?e2). DataScript will first take a cartesian product of R1 and R2, thus getting a single relation to join with, then hash-join the result with R3 by (?e ?e2):

R4 = (R1 × R2) hash-join R3
Then R4 will replace R1, R2 and R3 in the context.
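The join step itself is straightforward. A minimal sketch, representing tuples as {var value} maps for clarity instead of the flat JS arrays DataScript actually uses (this is not the real datascript.query code):

(require '[clojure.set :as set])

(defn hash-join-sketch [rel1 rel2]
  (let [shared (set/intersection (set (keys (first rel1)))
                                 (set (keys (first rel2))))
        key-of #(select-keys % shared)   ;; values of the shared vars
        by-key (group-by key-of rel1)]   ;; hash phase over rel1
    (for [t2 rel2                        ;; probe phase over rel2
          t1 (by-key (key-of t2))]
      (merge t1 t2))))

;; (hash-join-sketch [{'?e 0, '?v "Ivan"} {'?e 5, '?v "Oleg"}]
;;                   [{'?e 0, '?e2 5}])
;; => ({?e 0, ?v "Ivan", ?e2 5})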
Full algorithm:
1. Parse the query (datascript.parser) and cache the parse result (planned for 0.10.0). Most of the query validation happens at the parsing stage.
2. Scan sources (the :in clause) and transform them into relations. This creates the initial set of relations which populate the context. Constants are also resolved to relations, e.g. :in ?a → {:symbols {?a 0}, :tuples [#js [val]]}
3. For each clause in :where, do this:
   3a. Try to resolve as many variables as possible using the context so far and substitute them into the clause (planned for 0.10.0).
   3b. Execute the clause and get the result as a relation. If the clause is a rule call, there’s a special inner loop for rule resolution here, but the result is still a relation.
   3c. If the resulting relation has an intersection by variables with any existing relation, hash-join them together, fusing them into one big relation.
   3d. Put the resulting relation into the context.
   3e. Move on to the next clause.
4. Build the result set based on the :find and :with vars list: take a cartesian product of all relevant relations, keep just the vars that matter, and collect them into a set.
5. If there’s some aggregation happening, do a group-by and run the aggregation functions.
6. If pull() is used, call the Pull API for each entity.
7. If find specifications are used, do post-processing: unwrap inner tuples, take the first element from a set, etc.
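To make these steps concrete, here is a query that exercises most of them, using the data from the relations example above (assumed schema; given the tuples shown earlier it would return #{["Oleg"]}):

(d/q '[:find ?v2
       :in $ ?v
       :where [?e  :name ?v]      ;; resolves to R1
              [?e2 :name ?v2]     ;; resolves to R2
              [?e  :friend ?e2]]  ;; R3, triggers the joins from above
     db "Ivan")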
The query engine is implemented in datascript.query and uses the parser from datascript.parser.
Pull API
The Pull API is implemented in datascript.pull-api by David Thomas Hume. It walks all entity ids; for each entity it walks all attributes from the pull pattern, and for each recursive attribute it calls itself. It’s a recursive algorithm implemented as a state machine that emulates a stack, so it can handle unlimited recursion depth.
There’s also datascript.pull-parser, where pull patterns are parsed and validated.
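For example (attribute names made up; :friend must be a ref attribute, and ... in the pattern is the standard marker for unlimited recursion):

(d/pull db '[:name {:friend ...}] 1)
;; => {:name "Ivan", :friend [{:name "Oleg", :friend [...]}]}
;; (result shape illustrative)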
During the implementation of the DataScript pull pattern parser and query parser, we helped the Datomic team fix a couple of inaccuracies in their query grammar. This is why having alternative implementations is usually a good thing.
Filtered DBs
Given a database D and a predicate P: datom, DB → boolean, you can construct a special view F = (datascript/filter D P) that works like a regular database, but for all purposes (query, entity, pull, index access) looks as if it only contains datoms that satisfy P.
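For example, a view that hides a sensitive attribute (a sketch; the predicate receives the DB and a datom):

(def public-db
  (d/filter db (fn [_db datom] (not= (:a datom) :person/ssn))))

;; public-db can be passed anywhere a regular db is expected:
(d/q '[:find ?ssn :where [_ :person/ssn ?ssn]] public-db) ;=> #{}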
Filtered databases are implemented in datascript.core by extending just a couple of protocols: IDB, ISearch and IIndexAccess, mostly by adding an additional filter post-processing step.
For advanced uses this means it’s quite easy to pretend to be a database: just implement these protocols and the rest of the DataScript code will work as expected.
Database mutation
Everything we’ve discussed so far works on top of immutable, persistent DB objects. There’s also a small piece of code to handle database mutations (as of 0.9.0, it’s literally 25 lines).
datascript/create-conn returns a new database object wrapped in an atom. There’s also listen! machinery for listening to DB changes. It was added because regular atom watchers only receive the value-before and value-after, while database changes carry more useful information: the list of datoms added and the temporary ids resolution table.
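Typical usage (a sketch; the listener receives the full TxReport described earlier):

(def conn (d/create-conn {})) ;; an atom holding an empty DB value

(d/listen! conn :log
  (fn [tx-report]
    (println "datoms:"  (:tx-data tx-report))
    (println "tempids:" (:tempids tx-report))))

(d/transact! conn [{:db/id -1, :name "Ivan"}])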
A DataScript “connection” is not a connection in any sense. I borrowed the term from Datomic, but now I’m thinking about changing it to db-ref, or something better, if only I could think of anything.
JavaScript interop
Almost from the beginning, DataScript has supported a JavaScript interface. You include a pre-compiled JS file (270k raw / ~60k gzipped as of 0.9.0) and call datascript.empty_db(), datascript.query() and so on.
This was made possible by the datascript.js facade: it converts data coming from JS into ClojureScript data structures, calls the proper datascript.* APIs, converts the results into JS form and returns them back to the JS caller. Not the most efficient way to do things, but the only manageable one short of writing a JS-native evil twin of DataScript. Even with this approach, some things have leaked into the main DataScript code: it has to take into account that attributes can be strings, for example. But for the most part JS interop is isolated.
And the JS interop is full-featured: everything you can do from CLJS, you can do from JS. Maybe not as battle-tested, but nonetheless. See test/js/tests.html for how DataScript can be used from JS.
Tests
DataScript does acceptance testing almost exclusively. Everything in datascript.test.* works with the top-level interfaces of the datascript and datascript.core namespaces. The exceptions are datascript.test.btset (generative tests of the BTSet implementation) and datascript.test.perf (somewhat ad-hoc and messy performance tests of various parts of DataScript).
Having just acceptance tests works because the semantics and APIs are already defined and documented by the Datomic team, whom DataScript follows. The APIs do not change often, so the extensive test suite (2.8k LOC in test/ for 3k LOC in src/) is not dead weight and requires little maintenance.
The tests have allowed me to change the underlying implementation without rewriting a single test: I’ve already completely changed the index implementation and inner DB structure once, and replaced the query engine twice so far. That gives tremendous confidence to pull off such refactorings without breaking anyone else’s code.
Conclusion
I didn’t know what to write in the conclusion, although I felt every post needs one. So here it is: have fun, enjoy your day.