03 Mar 2018
Since the last post, I’ve started working on
an importer to load data from the existing Access database. Work to date is on
GitHub.
In the current domain model, there is a single aggregate root, the Client.
The importer is written as a command line application which interacts directly
with the domain, assuming an empty database (I might get to incremental imports
in the future). At a high level, the importer currently:
- Creates the clients
- Adds any existing ‘notes’ about the client
  - Notes are freeform text about a client, unrelated to any particular patient or transaction
  - The existing application is a little limited in what can be entered into the main form, so notes have been used to make up the slack (e.g. in the existing data, there are numerous clients which have an email address or fax number in the notes field, as there is no first class input for these values)
- Adds home and mobile phone numbers
- Adds the ‘most common travel distance’ as a note
These steps are visible in the implementation of the importer:
override fun run(vararg args: String?) {
    val rows = accessDb.clientTableRows
    val newClientIds: Map<String, UUID> = createClients(
        rows,
        accessDb::postCodeFor,
        accessDb::stateFor,
        commandGateway
    )
    allOf(
        addClientNotes(rows, newClientIds, commandGateway),
        addPhoneNumbers(rows, newClientIds, commandGateway),
        addMostCommonDistance(rows, newClientIds, commandGateway)
    ).get()
}
First, clients are created, producing a Map of the old client IDs to the new client IDs. Once all clients have been created, all the other updates are applied (potentially concurrently).
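As a rough illustration of the shape of those helpers, here is a minimal sketch of a createClients-style function and one follow-up step. The row representation, command classes and column names are hypothetical stand-ins (and the real createClients also takes the postcode/state lookups shown above), so treat this as an illustration rather than the actual importer code:

import org.axonframework.commandhandling.gateway.CommandGateway
import java.util.UUID
import java.util.concurrent.CompletableFuture

// Hypothetical commands, for illustration only
data class CreateClientCommand(val clientId: UUID, val name: String)
data class AddClientNoteCommand(val clientId: UUID, val note: String)

// Create every client synchronously, returning a map of legacy ID -> new aggregate ID
fun createClients(rows: List<Map<String, String>>, commandGateway: CommandGateway): Map<String, UUID> =
    rows.associate { row ->
        val newId = UUID.randomUUID()
        // sendAndWait blocks until the command has been handled, so the client
        // exists before any follow-up commands reference it
        commandGateway.sendAndWait<Any?>(CreateClientCommand(newId, row.getValue("Name")))
        row.getValue("ClientID") to newId
    }

// Follow-up steps send a command per row and can be combined with CompletableFuture.allOf
fun addClientNotes(
    rows: List<Map<String, String>>,
    newClientIds: Map<String, UUID>,
    commandGateway: CommandGateway
): CompletableFuture<Void> {
    val futures = rows.mapNotNull { row ->
        val note = row["Notes"]?.takeIf { it.isNotBlank() } ?: return@mapNotNull null
        val clientId = newClientIds.getValue(row.getValue("ClientID"))
        commandGateway.send<Any>(AddClientNoteCommand(clientId, note))
    }
    return CompletableFuture.allOf(*futures.toTypedArray())
}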
Without giving away too much information about the existing data, the existing number of clients is of order of magnitude 3 (i.e. in the thousands), and the total number of events generated with the current importer implementation is at most 5x the number of clients (one ClientMigratedEvent, up to two ClientNoteAddedEvents and up to two ClientPhoneNumberAddedEvents).
My first pass at the importer was taking around 80 seconds to import everything
into a PostgreSQL database. I know that premature optimization is the root of
all evil, and that I don’t have anything resembling a
working product at the moment, but this seemed far too high. Also, it was
impacting my ability to iterate quickly with ‘production’ data, which is enough
of a reason to look for improvements.
After looking at the generated schema and doing some sampling with
VisualVM, I decided there were three options to investigate:
- Asynchronous processing of commands
- Serialisation format changes
- Generated schema changes
In order to compare a full run of the importer pre- and post-optimisation, I want to be able to toggle the optimisations on/off from the command line. The following script has the toggle properties in place, and in the sections below I will use Spring configuration management to read these properties.
PASSWORD=$(uuidgen)
docker stop vetted-postgres ; docker rm vetted-postgres
docker run \
  --publish 5432:5432/tcp \
  --name vetted-postgres \
  --env POSTGRES_PASSWORD=$PASSWORD \
  --detach \
  postgres
./gradlew build
java \
  -jar importer/build/libs/vetted-importer-0.0.1-SNAPSHOT.jar \
  --axon.use-async-command-bus=false \
  --axon.use-cbor-serializer=false \
  --spring.jpa.database-platform=org.hibernate.dialect.PostgreSQL95Dialect \
  --spring.datasource.password=$PASSWORD
The script above will allow me to evaluate the impact of any changes I make in
a repeatable fashion.
Option 1 - Asynchronous processing of commands
I’m using the Axon framework, which handles a lot of the plumbing of
building an application based on DDD & CQRS principles. By default when using
the Spring auto-configuration, a SimpleCommandBus is used, which processes commands on the calling thread.
I added some configuration to use an AsynchronousCommandBus with a configurable number of threads:
@Bean
@ConditionalOnProperty(
    value = ["axon.use-async-command-bus"],
    matchIfMissing = true
)
fun bus(
    transactionManager: TransactionManager,
    @Value("\${axon.command-bus.executor.pool-size}") poolSize: Int
): CommandBus {
    val bus = AsynchronousCommandBus(
        Executors.newFixedThreadPool(poolSize)
    )
    val tmi = TransactionManagingInterceptor(transactionManager)
    bus.registerHandlerInterceptor(tmi)
    return bus
}
I initially tried this configuration out with a pool size of 10. This reduced the import time to around 30 seconds, which is an improvement on 80 seconds but well short of the order-of-magnitude improvement that should be possible. This led me to believe that there was either contention somewhere else, or that some of the constant factors are just too high at the moment.
Option 2 - Serialisation format changes
By default, Axon uses XStream to serialise events, which produces an XML representation. XML is quite verbose, and the Axon documentation even suggests using a different serializer.
Overriding the serializer is thankfully quite easy:
@Primary
@Bean
@ConditionalOnProperty(
    value = ["axon.use-cbor-serializer"],
    matchIfMissing = true
)
fun serializer(): Serializer {
    val objectMapper = ObjectMapper(CBORFactory())
    objectMapper.findAndRegisterModules()
    objectMapper.setSerializationInclusion(NON_ABSENT)
    return JacksonSerializer(objectMapper)
}
I opted for using Jackson with a ‘Concise Binary Object Representation’ (CBOR) JsonFactory. This resulted in a ~70% reduction in the size of the serialised payload for most events. With XML:
postgres=# select avg(length(loread(lo_open(payload::int, x'40000'::int), x'40000'::int))) from domain_event_entry;
avg
--------------
433.69003053
and with CBOR:
avg
--------------
111.54379774
This didn’t have a huge impact on the run time of the importer, but is still a
worthwhile optimisation.
Option 3 - Generated schema changes
You may have noticed in the SQL statements above that the current schema is
using the PostgreSQL large objects functionality. From the
PostgreSQL docs:
PostgreSQL has a large object facility, which provides stream-style access
to user data that is stored in a special large-object structure. Streaming
access is useful when working with data values that are too large to
manipulate conveniently as a whole.
If we inspect the schema that’s being generated:
postgres=# \d domain_event_entry
Table "public.domain_event_entry"
Column | Type | Nullable | Default
------------------+------+----------+---------
meta_data | oid | |
payload | oid | not null |
...
The oid type here is an object identifier - a reference to a large object which is stored externally from the table. The events we’re writing are small enough that the overhead of reading them as separate streams is hurting performance rather than helping.
At least two people have had the same issue when using Axon with PostgreSQL, as
evidenced by the questions on Google Groups and
StackOverflow. The suggestion to customise the PostgreSQL
dialect used by Hibernate seems to work, and further reduced the runtime to
around 8 seconds.
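For reference, the customised dialect might look something like the following sketch. It assumes Hibernate 5.2-style APIs, and the class name is my own invention; the idea is to map BLOB columns to bytea so payloads are stored inline in the row rather than as large objects:

import java.sql.Types
import org.hibernate.dialect.PostgreSQL95Dialect
import org.hibernate.type.descriptor.sql.BinaryTypeDescriptor
import org.hibernate.type.descriptor.sql.SqlTypeDescriptor

// Store BLOB columns inline as bytea rather than as references to large objects
class NoLargeObjectPostgresDialect : PostgreSQL95Dialect() {

    init {
        registerColumnType(Types.BLOB, "bytea")
    }

    override fun remapSqlTypeDescriptor(sqlTypeDescriptor: SqlTypeDescriptor): SqlTypeDescriptor =
        if (sqlTypeDescriptor.sqlType == Types.BLOB) {
            // Read and write the payload as plain binary instead of via the large-object API
            BinaryTypeDescriptor.INSTANCE
        } else {
            super.remapSqlTypeDescriptor(sqlTypeDescriptor)
        }
}

A class like this would then be supplied as the spring.jpa.database-platform property in place of org.hibernate.dialect.PostgreSQL95Dialect in the script above.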
Conclusion
Based on my very rough benchmarking, the three changes above have reduced the
run time of the importer from around 80 seconds to 8 seconds. The code
is all at the link above, and the optimisations are on by default.
There is surely more that can be done to improve performance, but that’s fast
enough for now!
30 Jan 2018
See Vetted - a new project for some background.
A task I’m going to have to tackle sooner or later is importing data from the
existing Access database. As I’m going to try my hand at event sourcing, this
raises an interesting question:
Given an existing application with data stored in a relational database & no notion of events, how do you go about importing data into an event sourced system?
Options
At a high level, the initial options seem to be:
- Try to reverse engineer/map the existing state to events
- Have some sort of migration event in your domain model (e.g. FooMigrated) which acts as a snapshot
- Run everything through your new API as commands, and allow the API
implementation to take care of creating the relevant events like normal
‘Recreating’ domain events
Option one above would be nice, but seems impractical at best and is more likely impossible. For every domain object (client, patient, payment, invoice, vaccination etc) I’d need to try to reverse engineer the real-world happenings that transitioned the object into its current state.
A ‘migration event’ as a snapshot
When a colleague originally suggested this, it conflicted with my understanding of the term ‘snapshot’. To me, a ‘snapshot’ has always been about collapsing an event stream into a single event for performance reasons. When using this kind of snapshot, the original stream of events is still available.
The second kind of snapshot (which I didn’t see immediately) is a snapshot which is used as base data. When using a snapshot as base data, the collapsed state of the aggregate at the time the snapshot was taken is all the information you have about the history of the aggregate.
It could also be argued that the migration is a meaningful domain event in its own right, and should be captured explicitly. A CustomerMigratedEvent could result in the creation of a new customer aggregate root in the same way that a CustomerRegisteredEvent does.
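To make that concrete, the following is a minimal sketch of what such an aggregate might look like, assuming Axon 3-style annotations; the command and event classes here are hypothetical and purely for illustration:

import org.axonframework.commandhandling.CommandHandler
import org.axonframework.commandhandling.TargetAggregateIdentifier
import org.axonframework.commandhandling.model.AggregateIdentifier
import org.axonframework.commandhandling.model.AggregateLifecycle
import org.axonframework.eventsourcing.EventSourcingHandler
import java.util.UUID

// Hypothetical commands and events, for illustration only
data class RegisterCustomerCommand(@TargetAggregateIdentifier val customerId: UUID, val name: String)
data class MigrateCustomerCommand(@TargetAggregateIdentifier val customerId: UUID, val name: String, val legacyId: String)
data class CustomerRegisteredEvent(val customerId: UUID, val name: String)
data class CustomerMigratedEvent(val customerId: UUID, val name: String, val legacyId: String)

class Customer() {

    @AggregateIdentifier
    private lateinit var customerId: UUID

    // The 'normal' creation path
    @CommandHandler
    constructor(command: RegisterCustomerCommand) : this() {
        AggregateLifecycle.apply(CustomerRegisteredEvent(command.customerId, command.name))
    }

    // The migration path: a different event, but the same outcome - a new aggregate
    @CommandHandler
    constructor(command: MigrateCustomerCommand) : this() {
        AggregateLifecycle.apply(CustomerMigratedEvent(command.customerId, command.name, command.legacyId))
    }

    @EventSourcingHandler
    fun on(event: CustomerRegisteredEvent) {
        customerId = event.customerId
    }

    @EventSourcingHandler
    fun on(event: CustomerMigratedEvent) {
        customerId = event.customerId
    }
}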
Run all existing data through the new API
It should be possible to write a script that reads data from the existing
database, creates commands and posts those to the appropriate API. The relevant
events would ultimately be created off the back of processing the commands, so
all ‘legacy’ data should look exactly the same as anything created going
forward.
The outcome is probably close to option one above, but with less manual work.
Next steps
So far I’ve been spending a lot of time on the technical concepts & design of
an event sourced system, without doing much on the implementation side.
It’s hard to build a useful conceptual domain model without considering
implementation issues, so I think it’s time I stopped debating concepts and
wrote some code.
I’m planning to explore a little and gain an understanding of how building and
executing commands would differ in practice from the ‘import event’ option
above.
Acknowledgements
Thanks to Roman Safronov, Chris Rowe, Martin Fowler, Mariano Giuffrida, Jim
Barritt and Nouman Memon for taking the time to reply and/or chat about event
sourcing! Any good ideas are theirs, and errors are mine.
05 Nov 2017
See Vetted - a new project for some background.
I’m hoping to start pushing some code soon. Before I do, it’s a good
opportunity to do some reading into a topic that I am more ignorant of than I
should be: software licensing. The following is what I’ve learnt so far.
Disclaimer: I am not a lawyer and this may be totally incorrect, and as such
should not be used as the basis for any decision ever.
Once code is released as open source software, most common
OSS licenses (GPL, BSD, MIT, Apache etc) do not allow for the revocation of
rights granted under the license. This is a good thing. Imagine having to be
prepared for any open source library/framework you’re currently using to become proprietary software with no warning. Such uncertainty would severely limit the
utility of open source software.
It is possible for the current copyright owners to relicense their creations.
So, theoretically, any OSS software can be relicensed if all copyright owners
agree. The important part is that this relicensing does not revoke the rights
assigned under the previous license. So if I’ve released some software as open
source software, I can decide a year later to relicense it and create a
commercial version, with the following caveats:
- I still own the copyright for the entire project
- I need to have been the sole contributor to the project, or to have
ensured that contributors have assigned copyright to me for their work
- The existing rights assigned under the OSS license remain in place
- If the license permits, anyone can fork the project at this point
and develop/use their own version
Given what I’ve learned above, I’m planning to license the project under an OSS
license, but I won’t accept any contributions until I’ve got some kind of
Contributor License Agreement (CLA) in place.
This is likely the first of many posts where it might seem I’m researching a
topic and deliberating a little excessively, given I have no working software
or even a particularly interesting idea. It’s fairly premature to assume that
there are going to be any contributors to this project other than myself. I
don’t believe veterinary practice management software is so exciting that I am
going to be swamped with contributions. However, as I mentioned
earlier, this whole project is mainly a learning
opportunity for me.
05 Nov 2017
When I was in high school, I created a fairly basic application for managing a
small veterinary practice. It’s written in Microsoft Access and is used by my
parents to manage their mobile veterinary business.
I’m toying with rebuilding it as a web application. For its current users, the
main benefits of this would be:
- client contact details could be made available on a mobile device;
- there will be no more (or at least fewer) issues with concurrent
modification (merging two Access database files that have been modified
independently on different computers because Dropbox didn’t sync is no fun);
and
- it will be accessible anywhere with internet access, so that my dad could do
accounts when he is away from home and has some downtime.
For me, it would mainly be a learning opportunity.
Naming things is not my strong suit, so I’m going with ‘Vetted’ for now.
I’m going to try to do the following throughout the project:
- Apply domain driven design rigorously
- Apply functional programming principles
- Document failures and successes
- Document my design heuristics
- Develop in the open
- Deploy continuously to
production somewhere
- Focus on adding the most valuable parts first (e.g. make phone numbers
available online) & delivering vertical slices
I’m thinking that the basic architecture for now will be:
- Single page application
- Elm frontend
- Kotlin backend
- Maybe some kind of event sourced data store, as I’d like to see how badly I
can shoot my foot off
While I want to build something functional, I also want to learn about a few
techniques/patterns/tools that would be applicable on larger projects, so I might
be making some choices which seem strange. I’ll try to call these out as they
happen. I need to remember that I am not Google.
I’m hoping to publish posts more regularly about this project, so keep an eye
out if you’re interested!
Notes
Design Heuristics
I attended the excellent Explore DDD conference this
year and one of my favourite talks was Cultivating Your Design
Heuristics by Rebecca Wirfs-Brock.
As defined in the talk, a heuristic is:
anything that provides a plausible aid (not a guaranteed aid) or direction in
defining a solution but is ultimately irrelevant to the final product
Rebecca encourages everyone to consciously document and cultivate heuristics,
learn others’ heuristics & discuss them and ultimately adapt (or wholesale
replace) your own heuristics when appropriate.
It’s a great talk, and I’m going to try to document the heuristics that I find
myself using while working on this project.
Develop in the open
I’m going to be writing about what I’m building, and the code will be available
on GitHub. However, I haven’t yet determined how to license the project. My
basic requirement is that I retain copyright and can relicense the project if
that’s ever required.
Look out for a post on this in the near future.
31 Oct 2017
Recently I found that opening a new bash
session (e.g. when opening a new
terminal window) was getting a bit slow on my machine. I take reasonable care
to make sure my dotfiles don’t get too crufty, and I keep them all in version
control.
The following is a walk through of how I went about debugging the issue.
So, how does one go about profiling what bash
is doing when starting a login
shell/interactive shell?
My initial thought was to use some kind of system call tracing to see what
files were being opened/executed. dtrace
exists on OS X, so let’s try that.
Sadly, the output isn’t overly useful due to System Integrity
Protection. I don’t want to boot
into recovery mode, so what are our options?
I regularly add set -o xtrace to my bash scripts … would the same thing work for my .bashrc? I added the line, and executed bash:
+ source /Users/mnewman/.bash_profile
++ export PATH=/Users/mnewman/bin:/Users/mnewman/perl5/bin:/Users/mnewman/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/mnewman/.rvm/bin
++ PATH=/Users/mnewman/bin:/Users/mnewman/perl5/bin:/Users/mnewman/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/mnewman/.rvm/bin
++ for file in ~/.{path,bash_prompt,exports,aliases,functions,extra}
++ '[' -r /Users/mnewman/.path ']'
++ for file in ~/.{path,bash_prompt,exports,aliases,functions,extra}
++ '[' -r /Users/mnewman/.bash_prompt ']'
...
It looks like that works (the above is showing the start of my .bash_profile, which is sourced from .bashrc). There is a lot of output there though, and we still don’t have any timing information. A little searching for variants of ‘bash add timestamp to each line’ led me to an SO answer recommending ts. Looking at the manual page for ts:
$ man ts
NAME
ts - timestamp input
SYNOPSIS
ts [-r] [-i | -s] [format]
DESCRIPTION
ts adds a timestamp to the beginning of each line of input.
The optional format parameter controls how the timestamp is formatted, as used by strftime(3). The default format is "%b %d %H:%M:%S". In addition to the regular strftime
conversion specifications, "%.S" and "%.s" are like "%S" and "%s", but provide subsecond resolution (ie, "30.00001" and "1301682593.00001").
If the -r switch is passed, it instead converts existing timestamps in the input to relative times, such as "15m5s ago". Many common timestamp formats are supported. Note that
the Time::Duration and Date::Parse perl modules are required for this mode to work. Currently, converting localized dates is not supported.
If both -r and a format is passed, the existing timestamps are converted to the specified format.
If the -i or -s switch is passed, ts timestamps incrementally instead. In case of -i, every timestamp will be the time elapsed since the last timestamp. In case of -s, the time
elapsed since start of the program is used. The default format changes to "%H:%M:%S", and "%.S" and "%.s" can be used as well.
So far so good, it looks like we could use ts -i and get the duration of every command! I’d like to try this out, but how can we redirect the xtrace output to ts?
Some further Googling led me to this SO answer, which suggests using the BASH_XTRACEFD variable to tell bash where to write its xtrace output. After some trial and error, I added a few lines to my .bashrc:
# open file descriptor 5 such that anything written to /dev/fd/5
# is piped through ts and then to /tmp/timestamps
exec 5> >(ts -i "%.s" >> /tmp/timestamps)
# https://www.gnu.org/software/bash/manual/html_node/Bash-Variables.html
export BASH_XTRACEFD="5"
# Enable tracing
set -x
# Source my .bash_profile script, as usual
[ -n "$PS1" ] && source ~/.bash_profile;
Upon restarting bash, this produces (a lot of) output in /tmp/timestamps, and each line contains an incremental timestamp, like so:
0.000046 ++ which brew
0.003437 +++ brew --prefix
0.025518 ++ '[' -f /usr/local/share/bash-completion/bash_completion ']'
0.000741 +++ brew --prefix
These particular lines tell me that a brew --prefix command executed and took 20ms.
With output like the above, I had enough info to track down a couple of slow-loading scripts (like sourcing nvm.sh) and remove them from my .bashrc/.bash_profile.