The three stages of *my* database choice

Originally posted on: http://geekswithblogs.net/cskardon/archive/2015/12/03/the-three-stages-of-my-database-choice.aspx

Prologue

I write and run a website called Tournr, a site to help people run competitions: it helps them organise and register for competitions, keeps the scores online, and takes some of the pain out of competition management. I began it after a badly run competition I attended (and ahem ran) a few years ago – I wanted to make things better for first-timers (like myself) and old hands alike. This post is about the database decisions and pains I’ve been through to get where I currently am. It’s long, so the TL;DR: I changed my DB.

Chapter 1 – SQL Server

I’m a .NET developer through and through – no bad thing, but it does tend to lead you down a certain train of thought – namely the Microsoft stack (WISA – Windows, IIS, SQL Server, ASP.NET). Personally, I don’t have the time – well, more the inclination – to learn a new language when I’m comfortable in .NET and it does what I want it to do, so I created my first version.

[Screenshot: the first version of Tournr]

I was also predominantly a desktop developer, and this was my first real foray into the world of web development, so the styling and colour choices – in a word – sucked. More importantly, the backend was progressing slowly. In the early stages of any project changes occur rapidly: some ideas which seem great begin to lose their shine after a week, or when someone else hears them and says ‘no, just no’.

So Tournr was based on SQL Server, using the Entity Framework as its ORM – again, standard practice. I started to get fed up with writing migration scripts. I’m more your Swiss Army Knife developer, good at a lot of things but not super-specialized-amazeballs at one thing in particular – a generalist if you will – and I found the time spent migrating my database structure, writing SQL etc. was delaying me from actually writing features. I know people who can reel off SQL easily and are super comfortable with it, and I’m OK – I can write queries for creating/deleting/joining etc. – but not as efficiently as others.

Chapter 2 – RavenDB

Skip along six months: I’d been playing with RavenDB at my workplace and thought it looked like it might be a good fit for Tournr. So I took a month or so to convert Tournr to use Raven instead of SQL Server, and man alive! That was one of my best ever decisions. I felt freer in terms of development than I had for ages: instead of working out how my classes would fit, and whether I needed yet another lookup table, I could write my classes and just Save. Literally. A little note here: Raven has first-class .NET integration and is very easy to use.

I procrastinated for a while after the initial conversion and finally got Tournr released using RavenHQ for hosting the DB and life was good – including a new Logo.

[Image: the new Tournr logo]

I could add new features relatively easily. Over time I found myself adding things to my class structures to make the queries simpler, and ended up introducing a little redundancy. As an example, I would have a list of Competitors in a Class (not a code class, but a competition class – like Junior or Women’s, for example), and if a competitor was registered in two Classes they would in essence be copied into both, so my Tournament would have two Classes with the same Competitor in each. I won’t bore you with the details, but this kind of duplication started to creep in more and more.
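
To give a rough idea of the shape – the class and property names here are illustrative rather than the exact Tournr code – it looked something like this:

using System.Collections.Generic;

public class Competitor
{
    public string Name { get; set; }
}

public class CompetitionClass
{
    public string Name { get; set; }                  // e.g. "Junior", "Women's"
    public List<Competitor> Competitors { get; set; } // the same Competitor gets copied into each Class they enter
}

public class Tournament
{
    public string Id { get; set; }
    public List<CompetitionClass> Classes { get; set; }
}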

Brief interlude

I’m aware that anytime you write something about how you struggled with <technology>, the developers and users who love it and are passionate about it will think you’re:

a) doing it wrong
b) someone who doesn’t understand the `<technology>`
c) vindictive because something went wrong
d) insert your own reason here!

It’s natural: people make decisions which they get invested in, and they want those decisions to be positively reinforced. If you read something saying ‘Oh, I left <technology> because it was <insert your own negative phrase here>’, it’s like they’ve slapped you and said you’ve made the wrong choice.

So, to those people: it was just that way for me; it’s not a personal attack on you, or Raven, or indeed SQL Server.

I was talking with my partner about a new feature I wanted to add, and as we talked about it the structure started to become apparent; she drew a circle with lines going into it. I made a glib statement somewhere along the lines of “the problem is that what you’ve drawn there is basically a graph, it’s a bit more complex than that”. To which she responded: “Why don’t you use the graph db?”.

I had no good answer. I’d been using Neo4j for a good few years, so it’s not like I didn’t get it. Obviously it’s a big decision – switching from one DB to another is never a small thing, let alone from one type (document) to another (graph). Sure, I’d done it in the past, from relational to document, but at that point *no-one* was using it, so it only affected me. This time I’d have users and Tournaments.

Now, Tournr isn’t used by many people at the moment, which is a blessing and a curse – the curse being that I’d love it to be used by more people 🙂 The blessing is that I can monitor it very closely and keep tabs on how the conversion has gone. Hooking in things like RayGun means I get near-instant notification of any error, and combined with a quick code turnaround I can respond very quickly.

Long and short of it: I thought ‘<expletive> it!’ and set to work…

Before jumping in, let’s look at the positives and negatives of using Raven:

Positives:
  • Extremely fast to get up and running (I think it’s fair to say without Raven Tournr would not have been launched when it was)
  • Fits into C# / .NET code very well
Negatives:
  • You really need to buy into Ayende’s view of how to use the database; this isn’t a bad thing in itself, but it does restrict your own designs.

Chapter 3 – Neo4j

At the point you take the plunge it’s important to get a quick win, even if (as it turns out) it’s superficial and full of lies and more LIES! I’m going to give a bit of an overview of Tournr’s structure, without going super deep – you don’t need to know that. Tournr was initially an ASP.NET MVC3 application, which was migrated to MVC5; along the way it stuck with the ASP.NET Membership system, using first the Entity Framework version and then a custom-rolled RavenDB-based version.

Whilst doing this conversion, the *only* thing I allowed myself to do aside from the DB change was update the Membership to use ASP.NET Identity – and that was for two reasons:

1. There wasn’t a Neo4j based Membership codebase that I could see – so I’d have had to roll my own, and
2. There is a Neo4j Identity implementation (which I *have* helped roll).

Membership

Long story short – I added the Neo4j.Aspnet.Identity NuGet package to my web project and switched out the Raven Membership stuff; this involved adding some identity code, setting up OWIN and other such-ness. The real surprise was that this worked. No problems at all – this was the quick win. I thought to myself: this is something that is not impossible.
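
For the curious, the wiring ends up looking roughly like the standard ASP.NET Identity OWIN setup – this is only a sketch, and ‘Neo4jUserStore’ and ‘ApplicationUser’ are placeholder names for whatever the Neo4j.Aspnet.Identity package and your own app actually provide, so check the package rather than copying this verbatim:

using System;
using Microsoft.AspNet.Identity;
using Microsoft.Owin;
using Microsoft.Owin.Security.Cookies;
using Neo4jClient;
using Owin;

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // Connect to Neo4j over its REST endpoint (the 2.x-era default).
        var graphClient = new GraphClient(new Uri("http://localhost:7474/db/data"));
        graphClient.Connect();

        // Hand ASP.NET Identity a graph-backed user store.
        // 'Neo4jUserStore' is a placeholder - use the store class the package provides.
        app.CreatePerOwinContext(() =>
            new UserManager<ApplicationUser>(new Neo4jUserStore<ApplicationUser>(graphClient)));

        // Standard cookie authentication, exactly as in the MVC5 templates.
        app.UseCookieAuthentication(new CookieAuthenticationOptions
        {
            AuthenticationType = DefaultAuthenticationTypes.ApplicationCookie,
            LoginPath = new PathString("/Account/Login")
        });
    }
}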

Conversion – The rest

What? Membership and ‘The rest’ – it’s not exactly partitioning the post, is it Chris? Well, no, and the reason is this: when I switched the membership, it compiled, started and let me log in, register etc. Obviously I couldn’t load any tournaments – or rather I could, but I couldn’t tie the user accounts to them. When I switched the pulling of Tournaments etc., all bets were off.

I like to go cold turkey. I removed the RavenDB NuGet package from the project and winced at the hundreds of red squiggles and build errors. All that could be done from this point was a methodical, step-by-step process of going through the controllers, replacing calls to Raven with calls to my new DB access classes. Anyhews, that aside – I ended up with an interface with a bucketload of methods.

Model 1

Woah there! You’re thinking: I think you missed a step there – what about the data model design? Yes, you’re of course right. Prior to my conversion I had drawn out a model, which we’ll call Model 1. This was (as you can probably guess from the name) wrong. But that didn’t stop me, and that’s partly down to my personality – if I’m not doing something I find it easy to get bored and then spend time reading the interwebs. Also, I know I’m going to find out some stuff that will change the model, so there’s no point in sticking too rigidly to it.

In this model I’d separated out a lot of things into individual nodes. For example, a User has a set of properties which are grouped together in a class representing personal registration details – things like country flag etc. – and I had the model:

(User)-[:HAS_PERSONAL_DETAILS]->(PersonalDetails)

So I wrote a chunk of code around that.
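
For illustration (the property names and the exact shape of my real code differ), creating that pattern through Neo4jClient looks roughly like this – with the complex PersonalDetails kept off the User node itself and passed as its own parameter:

// Model 1: two nodes plus a relationship per user.
graphClient.Cypher
    .Create("(u:User {user})-[:HAS_PERSONAL_DETAILS]->(p:PersonalDetails {details})")
    .WithParam("user", user)                    // simple properties only
    .WithParam("details", user.PersonalDetails) // the complex bit becomes its own node
    .ExecuteWithoutResults();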

Something you will find is that Neo4j doesn’t store complex types as node properties – simple arrays of simple types are cool; dictionaries and nested objects are out. So you can quite easily separate things out into individual nodes like the above, and for the first cut – well – that’s the route I took.
So I plugged away until I hit some of the bigger classes. This is where Raven had given me an easy run – Oh hey! You want to store all those nested classes? NO PROBLEM! That is awesomely powerful, and gives super super fast development times. Neo4j: not so forgiving. So, taking ‘Model 1’ as the basis, I started to pick out the complex objects. Then: EPIPHANY.

Model 2 – The epiphany

In my view, for complex types which really are part of a Tournament or indeed a User – and in particular things I wasn’t going to search by – why create a new node? The trade-off: bigger nodes, but fewer of them. Queries (or Cyphers) become a bit simpler, but you can’t query as easily against the complex embedded types.

Maybe I needed an in-between – where some complex types *were* nodes, and some were just serialized with the main object. Weird. A _middle ground_ – can you have that in development?

So Model 2 takes Model 1 and combines some of the types which really didn’t need to be separate nodes. So Personal Details moved into the User, as I had no need to query on the data in there (and if I _do_ need to at a later date, well – I can add it then).
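
Purely to illustrate the idea (again, not the literal Tournr classes), Model 2 pulls the details back onto the User:

public class PersonalDetails
{
    public string Country { get; set; }
}

public class User
{
    public string Id { get; set; }
    public string Email { get; set; }

    // Model 1: this lived on its own (:PersonalDetails) node behind a
    // HAS_PERSONAL_DETAILS relationship.
    // Model 2: it lives on the User node - either flattened into simple
    // properties, or serialized as a single string via a custom converter
    // (see the note below).
    public PersonalDetails PersonalDetails { get; set; }
}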

Special note for .NET devs – if you try to put a CREATE into Neo4j with a type that has a complex type for a property, Neo4j will b0rk at you. To get around this you’ll need to provide a custom JsonConverter to the Neo4jClient (obvs if you’re not using Neo4jClient this is totally irrelevant to you). There are examples of this on StackOverflow – and I imagine I’ll write some more on it later – probably try to update the Neo4jClient Wiki as well!
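
As a flavour of what such a converter can look like, here’s a minimal sketch using Newtonsoft.Json – the type names are illustrative, and the exact registration with the GraphClient is covered in those StackOverflow examples:

using System;
using Newtonsoft.Json;

// Serializes the complex PersonalDetails type to a plain JSON string on the
// way into the node, and back into an object on the way out.
public class PersonalDetailsConverter : JsonConverter
{
    public override bool CanConvert(Type objectType)
    {
        return objectType == typeof(PersonalDetails);
    }

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
    {
        writer.WriteValue(JsonConvert.SerializeObject(value));
    }

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
    {
        var json = reader.Value as string;
        return json == null ? null : JsonConvert.DeserializeObject<PersonalDetails>(json);
    }
}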

Now, so far I imagine there are Top-Devs (TM)(R) slapping their foreheads over the general lack of planning. Well, hold onto your pants – let’s enter the heady world of TESTING.

I know what TDD is, I know what BDD is, and I’m pretty certain I know what DDD is – but for Tournr I don’t really practice them. A few reasons – and I don’t really want to get into some sort of standards war here, but in a nutshell – Tournr wouldn’t be live if I’d tested the heck out of it. In the areas that matter – the important calculations etc. – I have tests, but for some things I just don’t. Quick note for potential hirers: I do write tests professionally, use NCrunch etc., but this is very much a personal project, I take all the heat for it, and it’s a risk I’m willing to take at the moment.

So, once I’d been through the controllers and got it all compiling, I started testing my codebase. Funny thing – when you write a lot of code which for the majority of the time *doesn’t compile*, issues do creep in. Mostly (in this case) it was related to complex types I’d missed, or a missing closing brace in the Cypher.

Cypher

I’m not going to go into this very deeply either, but Cypher is amazeballs. Think of it as the SQL of the graph DB world (well, the Neo4j world – actually not any more; you can tell this post has been in the drafts for a while – check out OpenCypher). It’s clear, concise, and yes – like SQL – you can go wrong. You might think that you don’t want to learn Yet Another Language when you already know SQL, so why not use something like OrientDB with its SQL-like syntax? But think about it another way: you use SQL to interact with a relational DB, with tables, foreign keys etc. You perform joins between tables – carrying that over to a graph DB would be quite a mental leap, and it confuses matters: you end up with the same keyword meaning different things for different databases, and you could end up writing a ‘select’ statement in your code against both DB types. With Cypher the language is tailored to the DB, and as such describes your queries from a node/relationship point of view, not a tables point of view.
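
As a tiny taster (the names are illustrative, and the HAS_CLASS relationship is one you’ll meet properly further down), ‘give me the Classes in a Tournament’ reads as a pattern rather than a join – and through Neo4jClient it looks something like:

// Cypher: MATCH (t:Tournament)-[:HAS_CLASS]->(c:Class) WHERE t.Id = {id} RETURN c
var classes = graphClient.Cypher
    .Match("(t:Tournament)-[:HAS_CLASS]->(c:Class)")
    .Where((Tournament t) => t.Id == tournamentId)
    .Return(c => c.As<CompetitionClass>())
    .Results;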

The changes I made mainly involved adding attributes like ‘JsonIgnore’ to my classes to prevent Neo4j serializing them (or attempting to), partly because it meant I could get development up and running faster, but also with migration in mind. One of the problems with the conversion (indeed *any* conversion) is keeping existing data, and that means translation. Raven stores documents keyed by the type – so if I store a ‘Tournament’, it is stored as a Tournament, and when I query I bring back a Tournament. Ah, but I’ve just JsonIgnored my properties – so what I bring back is missing things.
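
In class terms that’s nothing more exotic than this (the property and type names are invented purely for illustration):

using Newtonsoft.Json;

public class ScoringOptions
{
    public int RoundsToCount { get; set; }
}

public class Tournament
{
    public long Id { get; set; }
    public string Name { get; set; }

    [JsonIgnore] // skipped by the Neo4j serialization...
    public ScoringOptions Scoring { get; set; } // ...so anything read back through this class loses it
}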

Migration

Obviously I have data in database A and I want it in database B – how do we achieve that? Bearing in mind I don’t want users to have to change their passwords, or be unable to log in. Luckily I store passwords as plain text – HA! Not really. In practical terms, switching to the Identity model has changed the way the passwords are hashed, and as a consequence there is nothing I can do :/ Existing users have to reset their passwords – now, this is BAD. How do you write an email like that? ‘Hi, I decided unilaterally to change the backend – now you need to reset your password – sucks to be you!’ Of course not. A more diplomatic approach is needed – specifically, the migration should only take place in the quietest possible period, which is once again a bonus of the ‘not used much’ scenario I find myself in.

All the other migration requirements are relatively simple: of course I have to split out the bits that need to be split out, create relationships etc., but none of that affects the users.

The biggest headache, I thought, would be getting stuff out of Raven and putting it into Neo4j. Take a Tournament, for example: in it I had a List of Class, which in the Neo4j world is now represented as (Tournament)-[:HAS_CLASS]->(Class), so in the codebase for the Neo4j version I removed the ‘Classes’ property. But now I can’t deserialize from Raven, as Tournament no longer has Classes.

This is where judicious use of Source Control (which we’re *all* using right?????) comes into play. Obviously at this point I’ve done a shed load of checkins – on a different branch – ready for the big ol’ merge, so it’s relatively easy to browse the last checkin before the branch and copy the Tournament class from there.

If I just whack in the class, the compiler will throw a wobbly, not to mention the Neo4j and Raven code will be unsure of which Tournament I mean.

So, let’s rename it to ‘RavenTournament’ (cleeeever). But coming back to the point made a while ago, Raven can’t deserialize into RavenTournament as it’s looking for Tournament… oh, but wait. It can. Of course it can, and simply too. The standard query from Raven’s point of view would be:

session.Query<Tournament>()

to get all the Tournaments. If I switch to:

session.Query<RavenTournament>()

it will b0rk, but, if I add:

session.Query<Tournament>().ProjectFromIndexFieldsInto<RavenTournament>()

I hit the motherlode: property-wise RavenTournament is the same as Tournament was pre-Neo4j changes, and Raven can now deserialize.
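
Put together, the guts of the one-off migration console looked something along these lines – heavily simplified, with illustrative names, assuming the Raven document store and the GraphClient are already set up, and glossing over the fact that the real thing had to page through Raven’s default result limits:

using (var session = ravenStore.OpenSession())
{
    // Pull the old documents out of Raven via the RavenTournament trick above.
    var oldTournaments = session.Query<Tournament>()
        .ProjectFromIndexFieldsInto<RavenTournament>()
        .ToList();

    foreach (var old in oldTournaments)
    {
        // Create the Tournament node itself (simple properties only).
        graphClient.Cypher
            .Create("(t:Tournament {t})")
            .WithParam("t", new Tournament { Id = old.Id, Name = old.Name })
            .ExecuteWithoutResults();

        // Re-hang each embedded Class off the Tournament as its own node.
        foreach (var competitionClass in old.Classes)
        {
            graphClient.Cypher
                .Match("(t:Tournament)")
                .Where((Tournament t) => t.Id == old.Id)
                .Create("(t)-[:HAS_CLASS]->(c:Class {c})")
                .WithParam("c", competitionClass)
                .ExecuteWithoutResults();
        }
    }
}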

A little word about IDs

By default Raven uses IDs in the format <type>/long, so a Tournament might be Tournament/201. You can (and I did) change this – for example, I used ‘-‘ as the splitter (Tournament-201), and actually for Tournament I just used a long. I can’t really change the IDs, or rather I don’t want to: doing so means that existing links to tournaments are made invalid. Of course I could add some sort of mapping code, but that seems like more effort than I should need to put in. The other wrinkle is that Neo4j isn’t going to hand out those IDs for me the way Raven did, so I need something to keep generating unique longs for new Tournaments. So, Tatham to the rescue (this is Tatham Oddie of Neo4jClient fame) with SnowMaker – an Azure Storage based ID generator. I won’t go into the how’s-your-fathers about how I’ve used it – it’s a pretty simple concept that you can look up and enjoy. Needless to say, it’s made the conversion work.
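
For completeness, the shape of the SnowMaker usage is roughly this (a sketch from memory – the connection string, container and scope names are illustrative, so check the SnowMaker readme for the exact API):

using Microsoft.WindowsAzure.Storage;
using SnowMaker;

var account = CloudStorageAccount.Parse(azureConnectionString);
var generator = new UniqueIdGenerator(new BlobOptimisticDataStore(account, "unique-ids"))
{
    BatchSize = 20 // hand out IDs in blocks, to avoid a storage round-trip per ID
};

long newTournamentId = generator.NextId("Tournament"); // unique longs, scoped per entity name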

Epilogue – Post Conversion Analysis

So, code-wise, am I in better shape after the conversion – was it worth it? I think so, but thinking isn’t the same as knowing, so let’s fire up some analysis with the excellent NDepend. First we’ve got to define the baseline, and in this case we’re going to set that as the last RavenDB-based version of the codebase (comparing to the SQL version would be pointless, as too much has changed in between), and then define the codebase to compare it to – well, that’s the most recent Neo4j-based version (actually it’s a version I’m currently working on, so it includes some new features not in the Raven one – think of it as ‘Raven version +’).

The first and most basic stats come from the initial dashboard –

[Screenshot: the NDepend dashboard comparison]

Here I can see general metrics: things like LOC and complexity have gone down – generally a good thing – but the number of types has increased a lot.

Less code but more types? Query-wise, with Raven you can just pull out the same objects you insert; with Neo4j I’ve found myself using more intermediaries, which is fine and part of the process. In fairness, as time has gone on I’ve realised a few of these aren’t used as much as I thought, and I can group them better, so if I stopped development now and just maintained, I’d expect the number to drop. In practical terms, does it matter? Probably not – a few lines of code here and there might make it more maintainable – but it’s worth thinking about the ramifications of switching to a less commonly used DB (Raven or Neo4j): I could have a 50% drop in code size, but the technology still requires more of a leap to get used to than a basic relational DB implementation.
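
By ‘intermediaries’ I mean things like this – a hedged, illustrative example rather than real Tournr code – where the query shape doesn’t line up with a single domain class, so a small projection sits in between:

// The result is a Tournament plus a count, which no single domain class represents.
var summaries = graphClient.Cypher
    .Match("(t:Tournament)-[:HAS_CLASS]->(c:Class)")
    .Return((t, c) => new
    {
        Tournament = t.As<Tournament>(),
        ClassCount = c.Count()
    })
    .Results;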

What about new dependencies? What have I added, what have I removed?

One of the great things about NDepend (among many) is the ability to write CQLinq – a LINQ-style query language for your code base. So, third-party types that are used now and weren’t before:

from t in ThirdParty.Types where t.IsUsedRecently()
select new {
t,
t.Methods,
t.Fields,
t.TypesUsingMe
}

gives us:

[Screenshot: the third-party types now in use that weren’t before]

And types which were used before and now aren’t:

from t in codeBase.OlderVersion().Types where t.IsNotUsedAnymore()
select new {
t,
t.Methods,
t.Fields,
TypesThatUsedMe = t.TypesUsingMe
}

[Screenshot: the types used before that aren’t any more]

There are more metrics – NDepend has a lot of things to look at – but I’m wary of making this post overly long, and I neglected to set my baseline properly to show the trend charts (bad me). Ongoing, though, I’m keeping track of my quality to ensure it doesn’t take a dive. (By the by – Patrick of NDepend has given me a copy of NDepend, and you should know that. It is genuinely useful though – but do you know if I’m saying that, or am I a suck-up lackey???)

Things I’ve Learned

  1. First and foremost is that taking on a conversion project of something you have built from scratch is totally doable – it’s hard, and quite frankly dispiriting to have your codebase not compile for a couple of days, and then spend days fixing up the stuff that’s b0rked.
  2. You can spend a long time thinking about doing something – sometimes you just have to do it – and it’s ok to be wrong.
  3. Don’t be afraid to throw stuff away: if no-one is using something, delete it; if your model is wrong, redo it.
  4. Fiddler is your friend with Neo4j in Async mode.
  5. I used shortcuts – the migration was a one-off, the code is a quick console app that does it in a dodgy but successful way, and it had ZERO tests. That’s right, ZERO.