5 Relational (TE 1.0) vs. NoSQL (TE 2.0)
Introduction: Relational Is No Longer the Default MO
Whereas the coding (programming) side of system development has witnessed more or less constant innovation over the years, the data management side of things has long been dominated by the relational model, originally developed by E.F. Codd in the 1970s while at IBM (Codd 1970, 1982). One (imperfect) indicator of the dominance and penetration of the relational model in business IT systems is the share of coverage the relational vs. NoSQL database technologies receive in (Business) Information System Analysis and Design textbooks. Table 1 contains an inventory of a few of those books, namely the ones sitting on the shelves of one of us. Not a scientific sample, but telling nonetheless.
Textbook | Pages relational | Pages NoSQL |
---|---|---|
Valacich, J.S., George, J.F., Hoffer, J.A. (2015) Essentials of Systems Analysis and Design. Pearson. | Chapter 9: 45 pp. | 0 pp. |
Satzinger, J., Jackson, R., Burd, S. (2016) Systems Analysis and Design in a Changing World. CENGAGE Learning. | Chapter 9 : 35 pp. | 0 pp. |
Tilley, S., Rosenblatt, H. (2017) Systems Analysis and Design. CENGAGE Learning. | Chapter 9: 45 pp. | 0 pp. |
Dennis, A., Wixom, B.H., Roth, R.A. (2014) System Analysis and Design. 6th ed. John Wiley & Sons. | Chapter 6 & 11: 40 pp. | 1 p. |
Dennis, A., Wixom, B.H., Tegarden, D. (2012) System Analysis & Design with UML Version 2.0. John Wiley & Sons. | Chapter 9: 40 pp. | 2 pp. |
Dennis, A., Wixom, B.H., Tegarden, D. (2015) System Analysis & Design. An Object-Oriented Approach with UML. 5th ed. John Wiley & Sons. | Chapter 9: 22 pp. | 2 pp. |
Coronel, C., Morris, S. (2016) Database Systems. Design, Implementation and Management. Cengage Learning. | 700+ pp. | 11 pp. |
To be clear, pointing out the almost total absence of NoSQL coverage in these texts is not meant as a critique of these texts. The relational database remains a dominant work horse of transaction processing and hence, remains at the heart of business computing. But whereas for a very long time it was essentially the only commonly available viable option for all of business computing—transaction processing or not—new, equally viable and commonly available non-relational alternatives for non-transaction processing are making quick inroads.
Of the six texts listed in Table 1, only Dennis, Wixom & Tegarden (2012, 2015) and Coronel & Morris give explicit attention to these non-relational or NoSQL options. Dennis et al. recognize three types: key-value stores, document stores and columnar stores. Key-value stores are databases which function like lookup tables where values to be looked up are indexed by a key as indicated by Figure 1. As the Wikipedia page on key-value stores shows, quite a few implementations are available these days. One of the better-known ones is the open-source implementation Riak. Columnar (or column-oriented) databases are databases which transpose the rows and columns of relational databases. Rows become columns and vice versa. This type of representation is meant to imply faster processing for what in the relational world would be column-oriented operations, which in the columnar database become row-oriented operations. Wikipedia contains a list of implementations of this type of database.
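The difference between these two models can be sketched in a few lines of Python. These are toy, in-memory illustrations only; real key-value and columnar stores add persistence, distribution, indexing and much more, and all keys and values below are invented:

```python
# A key-value store works like a lookup table: values indexed by a key.
kv_store = {
    "user:42": '{"name": "Ada", "role": "admin"}',
    "user:43": '{"name": "Grace", "role": "editor"}',
}
print(kv_store["user:42"])  # retrieval is a single key lookup

# A columnar database transposes rows and columns: rather than keeping
# each row together, it keeps each column's values together.
rows = [                       # row-oriented (relational) layout
    {"id": 1, "qty": 100},
    {"id": 2, "qty": 200},
    {"id": 3, "qty": 300},
]
columns = {                    # column-oriented layout of the same data
    "id": [1, 2, 3],
    "qty": [100, 200, 300],
}
# A column-oriented operation (here: an aggregate over qty) touches one
# contiguous list in the columnar layout instead of every row.
print(sum(columns["qty"]))  # 600
```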
Which brings us to the document store, a type of NoSQL database which is most relevant for us because that is what TE 2.0 uses. Document stores, aka Document Management Systems, have quite a history. Already in the 1980s several systems were commercially available to manage an organization’s paper documents, followed by systems for managing electronic documents. However, whereas these systems typically enforced their own binary representation on documents or stored the documents in their native formats and then kept their metadata for sorting and searching, many modern NoSQL document stores store documents as text in a standard format such as XML or JSON. RavenDB and MongoDB, for example, store documents in JSON format. The fact that there typically are no constraints on the data other than these text-formatting ones is important because it implies that as long as a text complies with the format, it is considered a valid ‘document’ regardless of its information content. Hence, from the document store’s perspective, the JSON text `{"foo": "bar"}` is as much a ‘document’ as a complex JSON structure representing a TeachEngineering lesson.
This notion of a document as a structured text shares characteristics with the notion of classes and objects in object-oriented programming (OOP) but also with tables in a relational database. But whereas OOP objects typically live in program memory and are stored in CPU-specific or byte-compiled binary formats, NoSQL documents exist as text. And whereas in relational databases we often distribute the information of an entity or object over multiple tables in order to satisfy normal form, in NoSQL document stores —as in key-value stores— we often duplicate data and neither maintain nor worry about the database integrity constraints so familiar from relational databases.
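A small Python sketch may help make this ‘document as text’ idea concrete. The lesson-like document below is invented for illustration; it is not TeachEngineering’s actual format:

```python
import json

# Both of these strings are equally valid 'documents' to a JSON document
# store; the store checks the format, not the information content.
trivial = '{"foo": "bar"}'
lesson = '''{
  "title": "Heat Flow",
  "summary": "An activity about conduction.",
  "grades": {"target": 7, "lowerbound": 6, "upperbound": 8},
  "keywords": ["heat", "energy", "conduction"]
}'''

# Unlike an OOP object (an in-memory, binary structure), a document
# exists as text: parsing turns it into an object and serializing
# turns it back, with no loss.
obj = json.loads(lesson)
round_tripped = json.loads(json.dumps(obj))
assert round_tripped == obj
print(obj["grades"]["target"])  # 7
```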
So, What Gives?
One might ask why these NoSQL data stores came to the fore in the first place. What was so wrong with the relational model that made NoSQL alternatives attractive? And if the NoSQL databases are indeed ‘no SQL,’[1] how does one interact with them?
Let us first say that on the issue of which of these alternatives is better, the dust has by no means settled. The web is rife with articles, statements and blog posts which offer some angle, often highly technical, to sway the argument one way or the other. Yet a few generally accepted assessments can be made (refer to Buckler (2015) for a similar comparison).
- There are few if any functions which can be fulfilled by one that cannot be fulfilled by the other. Most if not everything that can be accomplished with a NoSQL database can be accomplished with a relational database and vice versa. However…
- Modern, industry-strength relational databases are very good at providing so-called ‘data consistency;’ i.e., the property that everyone looking at the data sees the same thing. Whereas such consistency is particularly important in transaction processing —a bank balance should not be different when looked at by different actors from different locations— consistency is less important in non-transaction data processing. A good example would be pattern-finding applications such as those used in business intelligence or business analytics. A pattern is not stable—and hence, not a real pattern—if a single observation changes it. For example, when analyzing the 30-day log of our web server, any single request taken from the log should not make a difference. And if it does, we did not have a stable pattern anyway.
NoSQL databases are not always equipped with the same consistency-enforcing mechanisms as the relational ones, such as locking and concurrency control. For example, in the NoSQL database RavenDB, operations on individual documents are immediately consistent. However, queries in RavenDB at best guarantee ‘eventual consistency;’ i.e., the notion that, on average or eventually, different observers see the same thing. Although eventual consistency is insufficient for transaction processing, it is often sufficient in non-transactional contexts.
- Because, unlike relational databases, NoSQL databases are not encumbered with 40 years of engineering investments, their licenses are a lot less expensive, especially in cases where lots of data must be distributed over several or lots of different machines and different locations. Whereas that does not help when consistency is needed, it can weigh significantly if eventual consistency is good enough.
- Relational databases enforce relational integrity (read: primary and foreign key constraints). NoSQL databases do not. But what if we are willing to sacrifice some automatic integrity checking for simpler application code or faster running programs? As we mentioned in the previous chapter, the TE 2.0 JSON representation of documents contains a lot(!) of duplicated data when compared with the TE 1.0 representation, and yet, the 2.0 system runs faster and the code is less complicated. Of course, making a correction to all that duplicated data would require more work, but if these corrections are rare and can be accomplished without service interruption…
- NoSQL databases tend to be lightweight; i.e., no complicated and distributed client-server systems must be set up (hence, their cost advantage). However, this does not imply that complicated systems cannot be built with these systems. On the contrary, NoSQL databases are often used in highly distributed systems with multiple levels of data duplication, and complex algorithms must often be devised to retrieve information from these various locations to then be locally aggregated.
- Relational databases have predefined schemas meaning that only specific types of data can be stored in predefined tables. NoSQL databases do not have this constraint. Referring to the JSON databases such as RavenDB and MongoDB again, one is free to store the `{"foo": "bar"}` document alongside a complex TeachEngineering lesson. One can easily conceive of some (dis)advantages either way.
- So-called joins; i.e., querying data across multiple tables in a single SQL query, are the bread and butter of relational databases. NoSQL databases do not have joins. Of course, not having joins while having a database in normal form would mean a lot of extra programming, but since in a NoSQL database we are more than welcome to ignore normal form, not having joins does not have to be a problem.[2]
- Relational databases traditionally scale vertically by just adding more CPUs, memory and storage space. NoSQL databases scale horizontally by adding more machines.
- Querying. Whereas SQL is the universal language for interacting with relational databases, no such language is available for NoSQL databases. Since NoSQL is not a standard, there is no standard querying protocol. Therefore, most NoSQL implementations have their own syntax and API. However, because of SQL’s installed base and popularity, some NoSQL databases offer SQL-like querying protocols. Moreover, since many software applications —databases or not— offer some sort of data retrieval mechanism, efforts to develop supra or ‘über’ query languages are underway. We have already seen one of these in the previous chapter, namely Yahoo Query Language (YQL), a SQL-like language which can be used to query across Yahoo services. Similarly, Microsoft has developed its Language INtegrated Query (LINQ), a SQL-like language for programmatic querying across its .NET products.
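The joins point above can be made concrete with a small Python sketch contrasting a relational join with a denormalized document. SQLite stands in for a relational database here, and the activity titles are invented for illustration:

```python
import json
import sqlite3

# Relational: lessons and activities in separate tables (normal form),
# linked by a foreign key; listing them together requires a join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE lesson (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE activity (id INTEGER PRIMARY KEY,
        lesson_id INTEGER REFERENCES lesson(id), title TEXT);
    INSERT INTO lesson VALUES (1, 'Visual Art and Writing');
    INSERT INTO activity VALUES (10, 1, 'What is Heat?');
    INSERT INTO activity VALUES (11, 1, 'Keep it Cool');
""")
joined = db.execute("""
    SELECT lesson.title, activity.title
    FROM lesson JOIN activity ON activity.lesson_id = lesson.id
    ORDER BY activity.id
""").fetchall()

# Document store: the same information embedded in a single denormalized
# document; no join needed, at the price of duplicating lesson data if
# the activities are also stored as documents of their own.
doc = json.loads("""{
  "title": "Visual Art and Writing",
  "activities": [{"title": "What is Heat?"}, {"title": "Keep it Cool"}]
}""")
embedded = [(doc["title"], a["title"]) for a in doc["activities"]]
assert embedded == joined  # same answer, no join
```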
We also must stress that as relational and NoSQL technologies evolve, the lines between them are blurring. For instance, PostgreSQL, a mature and popular open-source relational database supports JSON as a first-class data type. This affords developers the option of using both relational and NoSQL patterns where appropriate in the same application without needing to select two separate database management systems.
TE 1.0: XML/Relational
In this section we discuss the XML/relational architecture used in TE 1.0. If you just want to learn about the TE 2.0 JSON/NoSQL version, feel free to skip this section. However, this section contains some concepts and ideas which you might find interesting, and it supports a better understanding of the distinctions between the relational and NoSQL approaches.
When we designed TE 1.0 in 2002, NoSQL databases were just a research topic, and using a relational database as our main facility for storing and retrieving data was pretty much a foregone conclusion. What was nice at the time too was the availability of MySQL, a freely available and open-source relational database.
We had also decided to use XML as the means for storing TeachEngineering content; not so much for fast querying and searching, but as a means to specify document structure and to manage document validity (refer to Chapter 3 for details on XML and validity checking). Consequently, we needed a way to index the content of the documents as written by the curriculum authors into the MySQL relational database.
At the time that we were designing this indexing mechanism, however, the structure of the TeachEngineering documents was still quite fluid. Although we had settled on the main document types (curricular units, lessons and activities) and most of their mandatory components such as title, summary, standard alignments, etc., we fully expected that components would change or be added in the first few years of operations. Moreover, we considered it quite likely that in the future entirely new document types might have to be added to the collection.[3] As mentioned above, however, relational databases follow a schema and once that schema is implemented in a table structure and their associated integrity constraints set up, and once tables fill with data and application codes are written, database schema changes become quite burdensome. This then created a problem. On the one hand we knew that a relational database would work fine for storing and retrieving document data, yet the documents’ structure would remain subject to changes in the foreseeable future and relational databases are not very flexible in accommodating these changes.
After some consideration, it was decided to implement an auto-generating database schema; i.e., a basic database schema which, augmented with some basic application code, could auto-generate the full schema (Reitsma et al., 2005). By this auto-generation concept we mean that database construction; i.e., the actual generation of the tables, integrity constraints and indexes, occurs in two steps:
- The first step consists of a traditional, hardwired schema of meta (master) tables which are populated with instructions on how the actual document database must be constructed.
- In a second step, a program extracts the instructions from the meta tables, builds the data tables accordingly and then processes the documents, one by one, to index them into those data tables.
This approach, although a little tricky to set up, has the advantage that the entire document database schema, other than a small set of never-changing meta tables, is implemented by a program and requires no human intervention; i.e., no SQL scripts for schema generation have to be written, modified or run. Better still, when structural changes to the database are needed, all we have to do is change a few entries in the meta tables and let the program generate a brand-new database while the existing production database and system remain operational. When the application code that relies on the new database structure is ready, we simply release both that code and the new database, and business continues uninterrupted with a new database and indexing structure in place.
Data Tables
The following discusses a few of the data tables in the TE 1.0 database:
- Although the various document types have certain characteristics in common; e.g., they all have a title and a summary, the differences between them gave sufficient reason to set up document-type specific data tables. Hence, we have a table which stores activity data, one which stores lesson data, one which stores sprinkle data, etc. However, since we frequently must retrieve data across document types; e.g., ‘list all documents with target grade level x,’ it can be beneficial to have a copy of the data common to these document types in its own table. This, of course, implies violating normal form, but such denormalization can pay nice dividends in programming productivity as well as code execution speed.
- K-12 standards have a table of their own.
- Since the relationship between TeachEngineering documents and K-12 educational standards is a many-to-many one, a document-standard associative table is used to store those relationships. Foreign key constraints point to columns in the referenced tables.
- Since the TeachEngineering collection consists of several hierarchies of documents; for instance, a curricular unit document referring to several lessons and each lesson referring to several activities (see Chapter 1), we must keep track of this hierarchy if we want to answer questions such as ‘list all the lessons and activities of unit x.’ Hence, we keep a table which stores these parent-child relationships.
- TeachEngineering documents contain links to other TeachEngineering documents and support materials as well as links to other pages on the web. In order to keep track of what we are pointing to and of the status of those links, we store all of these document-link relationships in a separate table.
- Several auxiliary tables for keeping track of registered TeachEngineering users, curriculum reviews, keywords and vocabulary terms, etc. are kept as well.
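As a rough sketch of how a parent-child table such as the one above can answer hierarchy questions, consider the following. SQLite stands in for MySQL, all table, column and document ids are invented for illustration, and the recursive query is merely one way to walk the hierarchy, not necessarily how TE 1.0 did it:

```python
import sqlite3

# The document hierarchy (unit -> lessons -> activities) kept in a
# single parent-child table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE doc (id TEXT PRIMARY KEY, doctype TEXT, title TEXT);
    CREATE TABLE parent_child (
        parent_id TEXT REFERENCES doc(id),
        child_id  TEXT REFERENCES doc(id));
    INSERT INTO doc VALUES
        ('u1', 'unit',     'Energy Systems'),
        ('l1', 'lesson',   'Visual Art and Writing'),
        ('a1', 'activity', 'What is Heat?');
    INSERT INTO parent_child VALUES ('u1', 'l1'), ('l1', 'a1');
""")

# 'List all the lessons and activities of unit u1': walk the hierarchy
# by following parent-child links recursively.
descendants = db.execute("""
    WITH RECURSIVE tree(id) AS (
        SELECT child_id FROM parent_child WHERE parent_id = 'u1'
        UNION ALL
        SELECT pc.child_id FROM parent_child pc
            JOIN tree t ON pc.parent_id = t.id)
    SELECT doc.doctype, doc.title FROM tree JOIN doc ON doc.id = tree.id
""").fetchall()
print(descendants)  # the lesson and, via the lesson, the activity
```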
Meta tables
To facilitate automated schema generation, two meta tables, Relation and Types, were defined.
Relation table (definition)

Field | Type | Null | Key | Default
---|---|---|---|---
id | int(10) unsigned | NO | PRI | NULL
groupname | varchar(100) | NO | |
component | varchar(100) | NO | |
Relation table (sample records)

id | groupname | component
---|---|---
22 | activity | cost_amount
6 | activity | edu_std
72 | activity | engineering_connection
18 | activity | grade_target
21 | activity | keywords
5 | activity | summary
70 | activity | time_in_minutes
4 | activity | title
52 | child_document | link_text
50 | child_document | link_type
49 | child_document | link_url
46 | vocabulary | vocab_definition
45 | vocabulary | vocab_word
Types table (definition)

Field | Type | Null | Key | Default
---|---|---|---|---
id | int(10) unsigned | NO | PRI | NULL
name | varchar(100) | NO | |
expression | varchar(250) | NO | |
cast | enum('string','number','group','root') | NO | | string
nullable | enum('yes','no') | YES | | NULL
datatype | varchar(50) | YES | | NULL
Types table (sample records)

id | name | expression | cast | nullable | datatype
---|---|---|---|---|---
19 | child_document | /child_documents/link | group | NULL | NULL
24 | cost_amount | /activity_cost/@amount | number | yes | float
25 | cost_unit | /activity_cost/@unit | string | yes | varchar(100)
6 | edu_std | /edu_standards/edu_standard | group | NULL | NULL
1 | edu_std_id | /@identifier | string | yes | varchar(8)
33 | engineering_category | /engineering_category_TYPE/@category | string | yes | varchar(250)
14 | grade_lowerbound | /grade/@lowerbound | number | yes | int(10)
13 | grade_target | /grade/@target | number | no | int(10)
15 | grade_upperbound | /grade/@upperbound | number | yes | int(10)
26 | keywords | /keywords/keyword | group | NULL | NULL
27 | keyword | | string | yes | varchar(250)
Records in the Relation table declare the nesting (hierarchy) of components. For instance, an activity has a title, a summary, etc. Similarly, any reference to a child document has a link_text, a link_type and a link_url. Note that this information is similar to that contained in the documents’ XML Schema (XSD). Essentially, the Relation table declares all the needed data tables (groupname) and those tables’ columns (component).
The Types table in its turn declares for each component its datatype, nullability as well as an XPath expression to be used to extract its content from the XML document. For example, the cost associated with an activity (cost_amount) is a float, can be null (yes) and can be extracted from the XML document with the XPath expression /activity_cost/@amount. Similarly, the grade_target of a document is an int(10), cannot be null (no) and can be extracted with the XPath expression /grade/@target. Note that array-like datatypes such as edu_standards are not a column in a table. Instead, they represent a list of individual standards just as keywords is a list of individual keywords. They have an XPath expression but no datatype.[4]
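This use of XPath expressions can be mimicked with a small Python sketch. It uses Python’s xml.etree with a simplified, hand-rolled evaluator for the two expression forms shown above; the sample XML document is invented and much abbreviated:

```python
import xml.etree.ElementTree as ET

# A much-abbreviated, invented activity document:
doc = ET.fromstring("""
<activity>
  <grade target="7" lowerbound="6" upperbound="8"/>
  <activity_cost amount="0.5" unit="dollars"/>
  <keywords><keyword>heat</keyword><keyword>energy</keyword></keywords>
</activity>""")

def extract(root, xpath):
    """Evaluate the two expression forms stored in the Types table:
    '/element/@attribute' for single values and '/element/element'
    for groups (a simplified evaluator, not full XPath)."""
    if "/@" in xpath:                      # attribute reference
        path, attr = xpath.rsplit("/@", 1)
        node = root.find(path.lstrip("/"))
        return node.get(attr) if node is not None else None
    return [n.text for n in root.findall(xpath.lstrip("/"))]

print(extract(doc, "/grade/@target"))          # '7'
print(extract(doc, "/activity_cost/@amount"))  # '0.5'
print(extract(doc, "/keywords/keyword"))       # ['heat', 'energy']
```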
Between the Relation and the Types tables, the entire data schema of the database is declared and, as a side effect, for each column in any of the tables we have the XPath expression to find its document-specific value. Hence, it becomes relatively straightforward to write a program which reads the content from these two tables, formulates and runs the associated SQL `create table` statements to generate the tables, uses the XPath expressions to extract the values to be inserted into these tables from the XML documents, and then uses SQL `insert` statements to populate the tables.
What is particularly nice about this approach is that all that is needed to integrate a brand new document type into the collection is to add a few rows to the Types and Relation tables. The auto-generation process takes care of the rest. Hence, when in 2014 a new so-called sprinkle document type was introduced —essentially an abbreviated activity— all that had to be done was to add the following rows to the Types and Relation tables:
Records added to the Relation table to store sprinkle document data

id | groupname | component
---|---|---
84 | sprinkle | title
86 | sprinkle | total_time
87 | sprinkle | sprinkle_groupsize
88 | sprinkle | total_time_unit
89 | sprinkle | grade_target
90 | sprinkle | grade_lowerbound
91 | sprinkle | grade_upperbound
93 | sprinkle | sprinkle_cost_amount
96 | sprinkle | link
97 | sprinkle | time_in_minutes
98 | sprinkle | engineering_connection
103 | sprinkle | summary
104 | sprinkle | dependency
105 | sprinkle | translation
Records added to the Types table for storing sprinkle document data

id | name | expression | cast | nullable | datatype
---|---|---|---|---|---
34 | sprinkle | /sprinkle | root | NULL | NULL
35 | sprinkle_groupsize | /sprinkle_groupsize | number | yes | int(10)
36 | sprinkle_cost_amount | /sprinkle_cost/@amount | number | yes | float
Adding these few records resulted in the automatic generation of a sprinkle table with the requisite columns, their declared datatypes and nullabilities. Extraction of sprinkle information from the sprinkle XML documents to be stored in the tables was done automatically through the associated XPath expressions. Not a single manual SQL `create table` or `alter table` statement had to be issued, and no changes to the program which generates and populates the database had to be made.
TE 2.0: JSON/NoSQL
The process described in the previous section worked fine for almost 13 years, during which time it accommodated numerous changes to the documents’ structure such as the addition of the sprinkle document type. During those years the collection grew from a few hundred to over 1,500 documents. Yet when the decision was made to rebuild the system, the architectural choice of having a separate store of XML-based documents to be indexed into a relational database was questioned in light of the newly available NoSQL document stores. Why not, instead of having two more or less independent representations of the documents (XML and relational database), have just one, namely the document database; i.e., a database which houses the documents themselves? If that database of documents can be flexibly and efficiently searched, it would eliminate a lot of backend software needed to keep the document repository and the database in sync with each other. Better still, when rendering documents in a web browser one would not have to retrieve data from both sources and stitch it all together anymore. Instead, it could just come from a single source.
To illustrate the latter point, consider the way a document was rendered in TE 1.0. Figure 2 shows a section of an activity on heat flow. In the Related Curriculum box the activity lists its parent, the Visual Art and Writing lesson. This ‘parental’ information, however, is not stored on the activity itself because in TeachEngineering, documents declare their descendants but not their parents. Hence, whereas in TE 1.0 most of the rendered information came directly from the XML document, the document’s parent-child information had to be retrieved from the database. Two independent sources of information had to be independently queried and retrieved, each across networks and different computers, only to then be stitched together into a single HTML Web page.
We could have, in TE 1.0, stored not just the typical lookup information of documents in the database; e.g., title, target grade, required time, summary, keywords, etc., but also the entire text of documents. Had we done that we would have only had to access a single source of information for rendering. Except… we did not. In hindsight, perhaps, we should have?
Fast forward to 2015, TE 2.0 and the availability of NoSQL document databases such as MongoDB and RavenDB. Now, we no longer have to separate document content from document searching since the basic data of these databases are the documents themselves. Hence, we have everything we want to know about these documents in a single store. In addition, these databases, partly because their data structures are so simple (no multi-table normalized data structures and no referential constraints), are really fast!
Earlier in this chapter, we discussed how RavenDB provides eventual consistency for document queries. Clearly, consistency is something that requires careful consideration when designing an application. In a highly transactional application, receiving out-of-date data from queries could be quite problematic. For instance, a query on a bank account or credit card balance or on the number of available seats on a future airplane ride must be correct at all times. But for a system such as TeachEngineering, eventual consistency is just fine. Curriculum documents do not change that often, and even if they do, it is perfectly acceptable if queries return a previous version for a (limited) period of time. Similarly, when a new document is made available it would be perfectly acceptable if that document is not immediately available everywhere in the world. Moreover, due to the relatively small number of documents stored in TE 2.0’s RavenDB database, the ‘eventual’ in ‘eventual consistency’ means in practice that queries return up-to-date results within a matter of seconds after a change is made.
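The distinction between immediately consistent document loads and eventually consistent index-based queries can be caricatured in a few lines of Python. This is a toy model; RavenDB’s actual indexing machinery is far more sophisticated, but the contrast is the same: loads by id read the authoritative store, while queries go through an index that lags behind:

```python
class ToyDocumentStore:
    def __init__(self):
        self.docs = {}    # authoritative storage, keyed by document id
        self.index = {}   # title -> id, rebuilt asynchronously

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc   # visible to load() right away; the
                                  # index is deliberately NOT updated here

    def load(self, doc_id):       # immediately consistent
        return self.docs[doc_id]

    def query_by_title(self, title):  # possibly stale
        return self.index.get(title)

    def rebuild_index(self):      # the 'eventual' part, normally done
        self.index = {doc["title"]: doc_id   # by a background process
                      for doc_id, doc in self.docs.items()}

store = ToyDocumentStore()
store.put("lessons/1", {"title": "Heat Flow"})
print(store.load("lessons/1"))            # up to date immediately
print(store.query_by_title("Heat Flow"))  # None: index not yet rebuilt
store.rebuild_index()
print(store.query_by_title("Heat Flow"))  # now finds 'lessons/1'
```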
Some Practice with two JSON Document Stores: RavenDB and MongoDB
In the remainder of this chapter we work through some practice examples running against two well-known JSON document stores: RavenDB and MongoDB. We will run the same examples against both databases so that you acquire a feeling for how interacting with them differs, yet how the concepts behind them in terms of non-relational JSON storage are quite similar if not identical. For both document stores we first use a very simple example of four very simple JSON documents. The example (swatches) is taken from MongoDB’s documentation pages. Next, we run a more realistic (but still very small) example of six TeachEngineering so-called ‘sprinkle’ documents. Sprinkles are abbreviated versions of TeachEngineering activities.
Exercise 5.1: A Little RavenDB
One of the freely available JSON document databases is RavenDB. RavenDB was written for .NET. At the time of this writing RavenDB 4 is available. It supports multiple platforms/operating systems.
We will install a Windows version of RavenDB locally and communicate with it using C#.
- Download the RavenDB installer from https://ravendb.net/downloads.
- Unzip the RavenDB-4….zip file to an appropriate place on your system; e.g., c:\temp.
- Right-click the file run.ps1 and select the Run with PowerShell option from the menu.
- You will be prompted from the PowerShell window to change the execution policy. Answer by typing ‘A’.
- The installation script loads a license agreement page into your browser. Please read it and Accept it.
- Another page will be loaded into your browser. It offers a choice between a secure and unsecure installation. Choose unsecure.
- The next page allows you to specify the port on which the Raven DB server will be listening. The default is port 8080. Select Next.
- RavenDB will next be installed and its server started.
- We can communicate with the server in several ways. We will use HTTP at port 8080 on the local machine: http://127.0.0.1:8080.
For our example, take a look at the following JSON array containing four swatch documents[5]. Each document represents a fabric swatch with the following characteristics:
- name
- qty (quantity)
- tags, each having:
- string describing the material; e.g., cotton or wool
- size containing:
- h (height)
- w (width)
- uom (unit of measurement)
```json
[
  { "name": "cotton_swatch", "qty": 100, "tags": ["cotton"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "name": "wool_swatch", "qty": 200, "tags": ["wool"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "name": "linen_swatch", "qty": 300, "tags": ["linen"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "name": "cotton_swatch", "qty": 100, "tags": ["cotton"],
    "size": { "h": 50, "w": 50, "uom": "cm" } }
]
```
These are the four ‘documents’ which we will programmatically store and query in RavenDB.
We will be using Visual Studio and the C# language.[6][7] Our program will communicate with RavenDB over HTTP.
For this example we will be using the standard (de)serialization approach; i.e., we will hold the documents in objects of the `Swatch` class in our C# program and then programmatically store them in RavenDB. RavenDB will store them as JSON strings (documents). After they are stored we will run a few queries against the database.
In order to hold the swatches as objects, we need a class definition for them. We can, of course, figure this out by ourselves, but let us use the services of others and have http://json2csharp.com/ do it for us. Simply enter the above JSON array and click Generate. The result:
```csharp
public class Size
{
    public int h { get; set; }     // we will use a 'decimal' instead
    public double w { get; set; }  // we will use a 'decimal' instead
    public string uom { get; set; }
}

public class RootObject           // we will use 'Swatch' instead
{
    public string name { get; set; }
    public int qty { get; set; }
    public List<string> tags { get; set; }
    public Size size { get; set; }
}
```
With these class definitions we can write our Raven program. However, in order for our program to interact with RavenDB, we must first install the required RavenDB Client on our machine:
- Project –> Manage NuGet Packages…
- Find RavenDB.Client
- Click Install
- Click OK
Here is the (console) C# program:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Operations;
using Raven.Client.Exceptions;
using Raven.Client.Exceptions.Database;
using Raven.Client.ServerWide;
using Raven.Client.ServerWide.Operations;

namespace RavenDB
{
    //Define the Swatch class
    public class Swatch
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public int Qty { get; set; }
        public List<string> Tags { get; set; }
        public Size Size { get; set; }
    }

    //Define the Size class (every Swatch has a Size)
    public class Size
    {
        public decimal W { get; set; }
        public decimal H { get; set; }
        public string Uom { get; set; }
    }

    public class SwatchIndex : AbstractIndexCreationTask<Swatch>
    {
        /* Method for creating the index. The index is RavenDB's catalog
           of items it uses to conduct retrievals/searches */
        public SwatchIndex()
        {
            Map = items => from item in items
                           select new
                           {
                               item.Name,
                               item.Qty,
                               item.Tags,
                               item.Size.H,
                               item.Size.W,
                               item.Size.Uom
                           };
        }
    }

    //Program which interacts with RavenDB
    class Program
    {
        //inputs or retrieves JSON data from Raven
        static void Main(string[] args)
        {
            //Create a list of Swatches
            var Items = new List<Swatch>
            {
                new Swatch { Id = "items/1", Name = "cotton_item", Qty = 100,
                    Tags = new List<string> { "cotton" },
                    Size = new Size { H = 28, W = 35.5m, Uom = "cm" } },
                new Swatch { Id = "items/2", Name = "wool_item", Qty = 200,
                    Tags = new List<string> { "wool" },
                    Size = new Size { H = 28, W = 35.5m, Uom = "cm" } },
                new Swatch { Id = "items/3", Name = "linen_item", Qty = 300,
                    Tags = new List<string> { "linen" },
                    Size = new Size { H = 28, W = 35.5m, Uom = "cm" } },
                new Swatch { Id = "items/4", Name = "cotton_item", Qty = 100,
                    Tags = new List<string> { "cotton" },
                    Size = new Size { H = 50, W = 50, Uom = "cm" } }
            };

            string databaseName = "foo";

            /* We will communicate with Raven over HTTP at port
               127.0.0.1:8080. CHANGE THIS IN THE LINE BELOW DEPENDING
               ON THE PORT ON WHICH RAVENDB IS LISTENING. */
            DocumentStore store = new DocumentStore
            {
                Urls = new[] { "http://localhost:8080" },
                Database = databaseName
            };
            store.Initialize();

            //Create the database if it doesn't already exist
            CreateDatabaseIfNotExists(store, databaseName);

            //Run index creation
            new SwatchIndex().Execute(store);

            //Open the store's session
            using (var session = store.OpenSession())
            {
                //Loop through the Items
                foreach (var item in Items)
                {
                    //Store the item in the session
                    session.Store(item);
                }
                //Save all pending changes
                session.SaveChanges();

                /* Query for all swatches named "cotton_item" and store
                   them in a List */
                var cottonSwatches = session
                    .Query<Swatch, SwatchIndex>()
                    .Where(x => x.Name == "cotton_item")
                    .ToList();
                Console.WriteLine("Swatches with name 'cotton_item':");
                //Write them out
                foreach (var swatch in cottonSwatches)
                {
                    Console.WriteLine(swatch.Id);
                }

                /* Query for all swatches with a Quantity > 100 and store
                   them in a List */
                var highQuantitySwatches = session
                    .Query<Swatch, SwatchIndex>()
                    .Where(x => x.Qty > 100)
                    .ToList();
                Console.WriteLine("Swatches with quantity > 100:");
                //Write them out
                foreach (var swatch in highQuantitySwatches)
                {
                    Console.WriteLine(swatch.Id);
                }

                Console.WriteLine("Hit a key");
                Console.ReadKey();
            }
        }

        //Creates the database if it does not exist
        public static void CreateDatabaseIfNotExists(
            DocumentStore store, string database)
        {
            database = store.Database;
            try
            {
                store.Maintenance.ForDatabase(database)
                    .Send(new GetStatisticsOperation());
                // The database already exists
            }
            catch (DatabaseDoesNotExistException)
            {
                try
                {
                    //we do not have the database yet; create it
                    store.Maintenance.Server
                        .Send(new CreateDatabaseOperation(
                            new DatabaseRecord(database)));
                }
                catch (ConcurrencyException)
                {
                    System.Console.WriteLine("Error... creating database");
                    System.Environment.Exit(1);
                }
            }
        }
    }
}
```
So how does all this work?

- We first define the classes `Swatch` and `Size` in accordance with the class structures returned from http://json2csharp.com/.
- We then define an index for swatches: `SwatchIndex`. In RavenDB an index specifies the associations of the properties of documents with those documents so that at some point in the future we can query them. For instance, if we ever want to find documents that have a certain Name or Qty, we must build an index for those properties. Internally, RavenDB uses Lucene.Net, an open-source search engine library, to do its indexing and querying.
- We then load up a list of `Swatch` objects (called `Items`).
- Each of the `Swatch` objects in the `Items` list is then sent to RavenDB for storage.
  - Note that we do not(!) send the `Swatch` objects as JSON structures to RavenDB. We could do that too, but in this case we let RavenDB figure out how to make JSON structures from the `Swatch` objects and store them.

    ```csharp
    foreach (var item in Items) // Loop through the Items
    {
        // Store the item in the session
        session.Store(item);
    }
    ```

- After storing the `Swatch` objects in RavenDB, we formulate a few queries and send them to RavenDB for retrieval. The results are printed to the console. The queries take a form which may look a little odd:

  ```csharp
  var cottonSwatches = session
      .Query<Swatch, SwatchIndex>()
      .Where(x => x.Name == "cotton_item")
      .ToList();
  ```

  and

  ```csharp
  var highQuantitySwatches = session
      .Query<Swatch, SwatchIndex>()
      .Where(x => x.Qty > 100)
      .ToList();
  ```
These queries use RavenDB’s Query() method, which supports .NET’s Language INtegrated Query or LINQ. LINQ has a conceptual similarity with SQL (select … from … where …), but it is, of course, not at all relational. Instead, it is a general-purpose language for querying a great variety of data structures. In Microsoft’s own words: “.NET Language-Integrated Query defines a set of general purpose standard query operators that allow traversal, filter, and projection operations to be expressed in a direct yet declarative way in any .NET-based programming language.”
Note, by the way, that when you run this program several times in a row, the number of swatches stored in RavenDB remains the same. This is because we explicitly specify a value for the Id property of each swatch. By convention, RavenDB treats the Id property as the unique identifier for each document (similar to a primary key in a relational database table). Hence, each time you run the program the existing documents are overwritten. Alternatively, if the Id property was not specified for each document, RavenDB would automatically generate one. Each subsequent run of the program would then add new versions of the documents to the database, each with a unique value for the Id property.
Exercise 5.2: A Little MongoDB
Another NoSQL/JSON document database we can practice with is the open-source MongoDB.
We will install the MongoDB 4.x (Community) Server locally. We assume a Windows machine, but installs for other platforms are available as well.
- Download the On-Premises, MongoDB Community Server version (as msi file) from https://www.mongodb.com/try#community
- Run the downloaded installation script:
- Accept the license.
- Select the Complete (not the Custom) version.
- Select Run service as a Network Service user.
- Uncheck the Install Compass option.
- Finish the install.
- Find the folder/directory where MongoDB has been installed (typically c:\Program Files\MongoDB) and navigate to its \Server\<version_no>\bin folder where the executables mongod.exe and mongo.exe are stored:
- mongod.exe runs the server instance.
- mongo.exe runs a Mongo command shell through which we can send commands to mongod.
- Use the Windows Task Manager to see if the process MongoDB Database Server (mongod.exe) is running. If not, start mongod.exe.
- By default, MongoDB relies on being able to store data in the c:\data\db folder. Either create this folder using the Windows File Browser, or pop up a Windows command line and run the command:
mkdir c:\data\db
We will communicate with MongoDB in two modes: first, by sending it commands directly using its command shell and then programmatically using C#.
Exercise 5.3: MongoDB Command Shell
- Run mongo.exe to start a MongoDB command shell. Enter all commands mentioned below in this command shell.
As in RavenDB, MongoDB structures its contents hierarchically in databases, collections within databases and JSON records (called ‘documents’) within collections.
- Start a new database foo and make it the current database:
use foo
- Create a collection fooCollection in database foo:
db.createCollection("fooCollection")
Now that we have a collection, we can add documents to it. We will use the swatch documents again (they came directly from MongoDB’s own documentation pages).
Add four documents to fooCollection:
```javascript
db.fooCollection.insertOne( { "swatch": "cotton_swatch", "qty": 100, "tags": ["cotton"], "size": { "h": 28, "w": 35.5, "uom": "cm" } } )
db.fooCollection.insertOne( { "swatch": "wool_swatch", "qty": 200, "tags": ["wool"], "size": { "h": 28, "w": 35.5, "uom": "cm" } } )
db.fooCollection.insertOne( { "swatch": "linen_swatch", "qty": 300, "tags": ["linen"], "size": { "h": 28, "w": 35.5, "uom": "cm" } } )
db.fooCollection.insertOne( { "swatch": "cotton_swatch", "qty": 100, "tags": ["cotton"], "size": { "h": 50, "w": 50, "uom": "cm" } } )
```
Now let us start querying foo.fooCollection:
Let us see what we have in the collection. Since we pass no criteria to the find() method, all documents in fooCollection are returned:
db.fooCollection.find()
Which swatches made of cotton do we have?
db.fooCollection.find( { "tags": "cotton" } )
Of which swatches do we have more than 100?
db.fooCollection.find( { "qty": { $gt: 100 } } )
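The matching each of these find() calls performs can be sketched in plain JavaScript (the mongo shell itself is a JavaScript interpreter) by filtering the same four documents held in an in-memory array. This is a conceptual sketch of what the operators select, not of how MongoDB evaluates queries internally:

```javascript
// The four swatch documents inserted with insertOne() above.
const fooCollection = [
  { swatch: "cotton_swatch", qty: 100, tags: ["cotton"],
    size: { h: 28, w: 35.5, uom: "cm" } },
  { swatch: "wool_swatch", qty: 200, tags: ["wool"],
    size: { h: 28, w: 35.5, uom: "cm" } },
  { swatch: "linen_swatch", qty: 300, tags: ["linen"],
    size: { h: 28, w: 35.5, uom: "cm" } },
  { swatch: "cotton_swatch", qty: 100, tags: ["cotton"],
    size: { h: 50, w: 50, uom: "cm" } }
];

// find() with no criteria: every document matches.
const all = fooCollection;

// find({ "tags": "cotton" }): matching a scalar value against an
// array-valued field succeeds when the array contains that value.
const cotton = fooCollection.filter(doc => doc.tags.includes("cotton"));

// find({ "qty": { $gt: 100 } }): $gt keeps documents whose field
// value is strictly greater than the operand.
const highQty = fooCollection.filter(doc => doc.qty > 100);
```

Note in particular the array-field behavior: the query `{ "tags": "cotton" }` does not demand that `tags` *equal* "cotton", only that it *contain* it.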
Exercise 5.4: C# – MongoDB Programmatic Interaction
Now that we have played with command-driven MongoDB, we can try our hand at having a program issue the commands for us. We will use C# as our language, but MongoDB can be programmatically accessed with other languages as well.
A note before we start. When you Google for C# examples of MongoDB interactions, you will find a lot of code with the async and await keywords. Such code is typically written for applications which process MongoDB queries asynchronously. By ‘asynchronous’ we mean that these applications do not sequentially queue up commands, executing them in order and waiting for one to come back before the next one is sent. Instead, they send commands to MongoDB whenever they need to and process the replies in whatever order they come back. This approach makes a lot of sense in production environments where lots of queries must be issued and different queries demand different amounts of time to complete. However, for our simple examples, we will just use the more traditional synchronous communication: we issue a single command and wait until it returns before we send the next one.
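The difference between the two styles can be sketched without MongoDB at all. In the snippet below (plain Node.js; `fakeQuery` and its timings are invented for illustration), the synchronous pattern waits for each reply before issuing the next command, while the asynchronous pattern issues both commands at once and handles replies in whatever order they arrive:

```javascript
// fakeQuery simulates a database call that takes `ms` milliseconds.
function fakeQuery(name, ms) {
  return new Promise(resolve => setTimeout(() => resolve(name), ms));
}

// Synchronous style: await each command before issuing the next one;
// replies therefore come back in the order the commands were issued.
async function synchronousStyle() {
  const replies = [];
  replies.push(await fakeQuery("slow", 50)); // wait for this reply first
  replies.push(await fakeQuery("fast", 10)); // only then issue the next
  return replies;                            // ["slow", "fast"]
}

// Asynchronous style: issue both commands at once and record the
// replies as they arrive; the fast reply comes back first.
async function asynchronousStyle() {
  const replies = [];
  const pending = [fakeQuery("slow", 50), fakeQuery("fast", 10)]
    .map(p => p.then(name => replies.push(name)));
  await Promise.all(pending);                // wait for both to finish
  return replies;                            // ["fast", "slow"]
}
```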
Instead of hardcoding the JSON documents in our code, we will pick them up from a file. Store the following JSON in the file c:\temp\swatches.json:
```json
[
  { "swatch": "cotton_swatch", "qty": 100, "tags": ["cotton"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "swatch": "wool_swatch", "qty": 200, "tags": ["wool"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "swatch": "linen_swatch", "qty": 300, "tags": ["linen"],
    "size": { "h": 28, "w": 35.5, "uom": "cm" } },
  { "swatch": "cotton_swatch", "qty": 100, "tags": ["cotton"],
    "size": { "h": 50, "w": 50, "uom": "cm" } }
]
```
To validate the syntax of the above JSON segment, check it with an on-line JSON parser such as http://jsonparseronline.com.
We once again will be using Visual Studio and a C# console app to run these examples. However, in order for our program to interact with MongoDB, we must first install the required MongoDB and Json.Net packages:
- Project –> Manage NuGet Packages
- Find and select Newtonsoft.Json, MongoDB.Driver and MongoDB.Bson
- Click Install.
- Click OK.
Here is the (console) C# program for inserting the four documents stored in the swatches.json file into the `swatchesCollection`:
```csharp
using System;
using System.IO;
using MongoDB.Bson;
using MongoDB.Driver;
using Newtonsoft.Json.Linq;

namespace mongodb_c_sharp
{
    class Program
    {
        /* This code assumes that the swatches.json file is stored at
           C:\temp\swatches.json. If you stored it elsewhere, just
           change the jsonFilePath string below */
        const string jsonFilePath = "c:\\temp\\swatches.json";

        // We will use the "swatches" database
        const string dbName = "swatches";

        // We will use the "swatchesCollection" collection
        const string collectionName = "swatchesCollection";

        static void Main(string[] args)
        {
            // Open a connection to MongoDB. We are assuming localhost
            string connectString = "mongodb://localhost";
            MongoClient client = null;
            try
            {
                client = new MongoClient(connectString);
            }
            catch
            {
                Console.WriteLine("Error connecting to MongoDB...");
                Console.WriteLine("Is mongod.exe running?");
                Environment.Exit(1); // Exit on failed connection
            }

            /* Attach to the swatches database
               (create it if it does not yet exist) */
            IMongoDatabase database = client.GetDatabase(dbName);

            // Get the swatchesCollection collection
            var collection =
                database.GetCollection<BsonDocument>(collectionName);

            // Read the JSON file into a string
            StreamReader reader = new StreamReader(jsonFilePath);
            string json_str = reader.ReadToEnd();

            // Load the JSON into a JArray
            JArray arr = JArray.Parse(json_str);

            /* Loop through the JArray and insert each of its
               documents into the database */
            foreach (JObject obj in arr)
            {
                BsonDocument doc = BsonDocument.Parse(obj.ToString());
                collection.InsertOne(doc);
            }
        } // end of Main()
    } // end of Program class
} // end of namespace
```
If you still have your MongoDB command shell open (if not, just start it again — see previous section on how to do this), let us see what the program accomplished:
```javascript
use swatches
db.swatchesCollection.find()
```
Next, we run the program which queries `swatchesCollection` for those swatches of which we have more than 100 (qty > 100). Here is the (console) program:
```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

namespace mongodb_c_sharp
{
    class Program
    {
        // We will use the "swatches" database
        const string dbName = "swatches";

        // We will use the "swatchesCollection" collection
        const string collectionName = "swatchesCollection";

        static void Main(string[] args)
        {
            // Open a connection to MongoDB. We are assuming localhost
            string connectString = "mongodb://localhost";
            MongoClient client = null;
            try
            {
                client = new MongoClient(connectString);
            }
            catch
            {
                Console.WriteLine("Error connecting to MongoDB...");
                Console.WriteLine("Is mongod.exe running?");
                Environment.Exit(1); // Exit on failed connection
            }

            // Attach to the swatches database
            IMongoDatabase database = client.GetDatabase(dbName);

            // Get the swatchesCollection collection
            var collection =
                database.GetCollection<BsonDocument>(collectionName);

            // Find swatches with "qty" > 100
            // First, build a filter
            var filter = Builders<BsonDocument>.Filter.Gt("qty", 100);

            // Then apply the filter to querying the collection
            var resultList = collection.Find(filter).ToList();

            // Write the results to the console
            foreach (var result in resultList)
                Console.WriteLine(result);

            Console.WriteLine("Hit a key");
            Console.ReadKey();
        } // end of Main()
    } // end of Program class
} // end of namespace
```
Now, while still in Mongo’s command shell, query the swatches database again to see if its collection of swatches is still available. Of course it still is, but it is nice to see anyway.
Exercise 5.5: C# – MongoDB Programmatic Interaction: TeachEngineering Example – Sprinkles
Now that we have run through a basic (swatches) example, we can apply the same approach to a small sample of more complex but conceptually identical TE 2.0 JSON documents. Sprinkles are abbreviated versions of TeachEngineering activities. Our sample contains six TE 2.0 sprinkle documents and is stored at http://faculty.bus.oregonstate.edu/reitsma/sprinkles.json.
Download the content of http://faculty.bus.oregonstate.edu/reitsma/sprinkles.json and store it in a file called c:\temp\sprinkles.json.[8] Please take a moment to study the content of the file. Notice that it is a JSON array containing six elements; i.e., six sprinkles. Also notice that although the elements are all sprinkle documents and they all have the same structure, their elements are not always specified in the same order. For instance, whereas the third and following documents start with a Header element followed by a Dependencies element, the first two documents have their Header specified elsewhere.
First, just like in the swatches case, we will programmatically import the data. We use the exact same program as we used for loading the swatches data, except for some changes to the following constants:
- Set the `const dbName` to `sprinkles`.[9]
- Set the `const collectionName` to `sprinklesCollection`.
- Set the `const jsonFilePath` to `c:\\temp\\sprinkles.json`.
Assuming that your MongoDB server (mongod.exe) is running, running the program will load the six sprinkles into the sprinklesCollection.
In mongo.exe, check to see if the data made it over:
```javascript
use sprinkles
db.sprinklesCollection.find()
```
Now the reverse: programmatically finding the number of sprinkles for which the required time is >= 50 minutes (Time.TotalMinutes >= 50). Again, we use pretty much the same program as we used for retrieving the swatches data, except for a few simple changes:

- Set the `const dbName` to `sprinkles`.
- Set the `const collectionName` to `sprinklesCollection`.
- Replace the query filter with the following:

  ```csharp
  var filter = Builders<BsonDocument>.Filter
      .Gte("Time.TotalMinutes", 50);
  ```

- Remove (comment out) the `foreach()` loop and replace it with the following:

  ```csharp
  Console.WriteLine(resultList.Count);
  ```

The result of running the program should be 5 (you can check this manually against the sprinkles.json file).
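The `Gte` filter's dot notation (`"Time.TotalMinutes"`) is MongoDB's way of reaching into nested sub-documents. What it selects can be sketched in plain JavaScript over a few hypothetical, abbreviated sprinkle-like documents (the titles and minute values below are invented for illustration, not taken from sprinkles.json):

```javascript
// Hypothetical, abbreviated stand-ins for sprinkle documents; only
// the nested Time.TotalMinutes field matters for this query.
const sprinkles = [
  { Title: "sprinkle_a", Time: { TotalMinutes: 45 } },
  { Title: "sprinkle_b", Time: { TotalMinutes: 50 } },
  { Title: "sprinkle_c", Time: { TotalMinutes: 90 } }
];

// "Time.TotalMinutes" >= 50: the dot notation walks into the nested
// Time sub-document, just as Builders.Filter.Gte("Time.TotalMinutes", 50)
// does in the C# program.
const longEnough =
  sprinkles.filter(doc => doc.Time.TotalMinutes >= 50);
// Here longEnough holds sprinkle_b and sprinkle_c.
```

The equivalent mongo shell query would be `db.sprinklesCollection.find( { "Time.TotalMinutes": { $gte: 50 } } )`.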
Exercise 5.6: C# – RavenDB Programmatic Interaction: TeachEngineering Example — Sprinkles
Let us now repeat the sprinkle example we just ran against MongoDB, but this time run it against RavenDB. When you look through the program you will see that unlike in our earlier (swatches) example, this time we do not generate a search index prior to searching. Instead, we follow the MongoDB example above and search through the documents in real time without such an index. Of course, in a production environment, generating the indexes ahead of time would save (significant) search time, but here we have tried to keep the MongoDB and RavenDB examples similar.
```csharp
using System;
using System.IO;
// Do not forget to add the RavenDB NuGet package!
using Raven.Client.Document;
using Raven.Json.Linq;

namespace RavenDB
{
    /* Program which interacts with RavenDB using a few
       TeachEngineering 'sprinkle' documents */
    class Program
    {
        // Inputs or retrieves JSON data from Raven
        static void Main(string[] args)
        {
            /* This code assumes that the sprinkles.json file is stored
               at C:\temp\sprinkles.json. If you stored it elsewhere,
               change the jsonFilePath string below */
            const string jsonFilePath = "C:\\temp\\sprinkles.json";

            // Make a new document store
            var store = new DocumentStore
            {
                // We will communicate with Raven over HTTP at localhost:8080
                Url = "http://localhost:8080",
                // We will communicate with database 'sprinklesCollection'
                DefaultDatabase = "sprinklesCollection"
            };
            store.Initialize(); // Initialize the store

            /* This next statement is not appropriate for production use,
               but is used here to ensure that the index is up to date
               before returning query results */
            store.Conventions.DefaultQueryingConsistency =
                ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;

            // Read the JSON file into a string
            StreamReader reader = new StreamReader(jsonFilePath);
            string json_str = reader.ReadToEnd();

            // Load the JSON into a JArray
            var arr = RavenJArray.Parse(json_str);

            /* Loop through the JArray and insert each of its
               documents into the database */
            foreach (RavenJObject obj in arr)
            {
                store.DatabaseCommands.Put(null, null, obj, null);
            }

            /* Query for sprinkles greater than or equal to 50 minutes in
               length. Note that unlike in the swatches example, we do not
               create an explicit index ahead of time. RavenDB will
               automatically create an index for us. This works for ad-hoc
               queries, but in a typical application it is best to create
               the indexes ahead of time. */
            using (var session = store.OpenSession())
            {
                var results = session.Advanced.DocumentQuery<object>()
                    .WhereGreaterThanOrEqual("Time.TotalMinutes", 50);
                Console.WriteLine(results.Count());
            }

            Console.WriteLine("Hit a key");
            Console.ReadKey();
        }
    }
}
```
Once again, running this program should result in a search result of ‘5.’
Summary and Conclusion
In this chapter we took a look at how TeachEngineering evolved from a system with a relational database at its core and XML representing documents (TE 1.0), to a system running off of a NoSQL database with JSON representing documents (TE 2.0). Whereas the relational/XML model worked fine during the 12 years of TE 1.0, the new NoSQL/JSON alternative provides the advantage that JSON is used as both the document and the database format. This unification of representation significantly reduces system complexity from a software engineering point of view.
We also argued that whereas the property of ‘consistency,’ which is well entrenched in industrial-strength relational databases, is of crucial importance in transactional settings, ‘eventual consistency’ is plenty good for an application such as TeachEngineering. As such, we can forgo much of the concurrency control facilities built into relational databases and instead rely on a less advanced but also much less expensive and easier-to-maintain NoSQL database.
Similarly, since the data footprint for a system such as TeachEngineering is quite small (the current size of the RavenDB database is a mere 170 MB), replicating some data in a controlled and managed way is quite acceptable. This again implies that we can forgo the mechanisms for adhering to and maintaining strict normal form, and that in turn implies that we do not need sophisticated data merge and search methods such as relational table joins.
We want to be clear that none of the NoSQL material we have discussed and practiced here detracts from the value and utility of relational databases. For systems which require ‘consistency,’ relational databases with their sophisticated built-in concurrency controls continue to be a good and often the best choice. But if ‘eventual consistency’ is good enough and if one has data governance policies and practices in place which prevent the inconsistent representation of data, then one should at least consider a modern NoSQL database as a possible candidate for storing data.
References
Buckler, C. (2015) SQL vs. NoSQL: The Differences. Available: https://www.sitepoint.com/sql-vs-nosql-differences/. Accessed: 12/2016.
Codd, E. F. (1970) A Relational Model of Data for Large Shared Data Banks. Communications of the ACM. 13. 377-387.
Codd, E. F. (1982) Relational database: A Practical Foundation for Productivity. Communications of the ACM. 25. 109-117.
Dennis, A., Wixom, B.H., Roth, R.A. (2014) System Analysis and Design. 5th ed. John Wiley & Sons.
Dennis, A., Wixom, B.H., Tegarden, D. (2012) System Analysis & Design with UML Version 2.0. John Wiley & Sons.
Dennis, A., Wixom, B.H., Tegarden, D. (2015) System Analysis & Design. An Object-Oriented Approach with UML. 5th ed. John Wiley & Sons.
Reitsma, R., Whitehead, B., Suryadevara, V. (2005) Digital Libraries and XML-Relational Data Binding. Dr. Dobbs Journal. 371. 42-47.
Satzinger, J., Jackson, R., Burd, S. (2016) Systems Analysis and Design in a Changing World. CENGAGE Learning.
Tilley, S., Rosenblatt, H. (2017) Systems Analysis and Design. CENGAGE Learning.
Valacich, J.S., George, J.F., Hofer, J.A. (2015) Essentials of Systems Analysis and Design. Pearson.
- Some interpret the term NoSQL as ‘No SQL.’ Others interpret it as ‘Not Only SQL.’ ↵
- Although standard instructional texts on relational database design emphasize design for normal form, practitioners frequently denormalize (parts of) a relational database for the same reason; namely to write easier and/or faster code. ↵
- Although much later than originally expected, in 2014 a brand-new document type, the so-called ‘sprinkle’ was indeed added. ↵
- XPath is a language for extracting content from XML pages. ↵
- Swatch: a sample or example item; typically a sample piece of fabric or other type of material. ↵
- Visual Studio is Microsoft’s .NET Integrated Development Environment (IDE). Although it is available in several configurations, for these examples we will be using the freely available Community Edition. ↵
- There is a public RavenDB playground available for testing and experimentation. It can be accessed at http://live-test.ravendb.net/. For more information, visit https://ravendb.net/docs/article-page/3.5/csharp/start/playground-server. ↵
- We could, of course, write code which retrieves the JSON over HTTP (from the given URL) rather than first storing the JSON on the local file system. However, for the sake of being able to use pretty much the exact same code as in the previous examples, we will read the JSON from a local file and insert it into the database. ↵
- We can, of course, use the swatches database to store sprinkle data, but in order to keep the examples independent of each other, we use a new database. ↵