
History of Databases

In "Lecture 39 - Chapter 3: Databases" I offered to tell a bit about why SQL came into existence, and what modern relational databases have to offer. Here is my very short intro to what I think are important bits of real-world databases.

First, a little bit on the history of relational databases. Edgar F. Codd introduced the relational model in his paper "A Relational Model of Data for Large Shared Data Banks" (written as an IBM research report in 1969 and published in 1970), a neat formalism for describing tables of data:

http://www.morganslibrary.net/files/codd-1970.pdf

By the way, here is a nice obituary about Codd, likely written by one of his colleagues, now part of IBM's "history of progress" thread:

https://www-03.ibm.com/ibm/history/exhibits/builders/builders_codd.html

Ooh, and also notice the other famous names in the "The Builders details" menu on the right hand side of that page! But I digress...

The references in Codd's 1970 paper mention some of the most widely used or important structured data storage systems of the era ("popular" might be a strangely unfitting term here). One notable example is Charles Bachman's "Integrated Data Store" (IDS), which he designed in the early 1960s and which earned Bachman his Turing Award in 1973:

http://amturing.acm.org/award_winners/bachman_1896680.cfm

Soon thereafter, Raymond F. Boyce and Donald D. Chamberlin, building on Codd's relational model, began working on the "Structured English Query Language", or SEQUEL, the predecessor of SQL. The first commercially available implementation came in 1979 from what would later be known as Oracle, and IBM strove to put what came out of its System R prototype to work everywhere. SQL was standardized in 1986.

But why did they need to invent SQL in the first place? What was the drive behind it?

First came data structures with O(log n) complexity for lookup and insertion (e.g. AVL or red-black trees). Then people came up with variants that aimed to maximize density, or to minimize the number of disk accesses or seeks (e.g. radix trees, B-trees). Now that there was efficient indexing, people could start building humongous databases; you merely had to combine queries over the indices in a sensible way...
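
To make that concrete, here is a minimal sketch (invented data; a sorted array searched with Python's standard-library bisect module stands in for a B-tree index, giving O(log n) lookups):

```python
# A sorted-array "index" as a crude stand-in for a B-tree: (key, row_id)
# pairs kept in key order, searched with binary search.
import bisect

index = sorted([(42, 0), (7, 1), (99, 2), (7, 3), (13, 4)])

def lookup(key):
    """Return all row ids stored under `key` in O(log n) plus the matches."""
    i = bisect.bisect_left(index, (key,))
    rows = []
    while i < len(index) and index[i][0] == key:
        rows.append(index[i][1])
        i += 1
    return rows

print(lookup(7))   # -> [1, 3]
print(lookup(50))  # -> []
```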

To achieve this, the data operator (a person) would write a query plan, for example in the CODASYL Data Manipulation Language, which told the database management system how to combine the various indices, hopefully in an efficient way. This was rather tedious, error-prone, and sometimes very hard to debug or extend. See here for a quick impression:

https://en.wikipedia.org/wiki/Navigational_Database

Or read this nice long interview with Donald D. Chamberlin (or just peek at page 8):

https://conservancy.umn.edu/handle/11299/107215

If you're as ancient as I am, you might remember dBase for CP/M, Apple II, or DOS. Or maybe FoxBase+. Or Clipper, which wrapped a full programming language around that kind of database.

However, the solution to this mess was to keep the underlying technology and merely put a better language on top. SQL lets you write rather concise, declarative descriptions that the database compiles into query plans. Modern databases like MySQL or PostgreSQL will still let you look at the generated query plans!
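
To get a feel for this, here is a minimal sketch (invented table and index names; SQLite, via Python's standard library, stands in for MySQL or PostgreSQL, which expose the same idea through their EXPLAIN statements):

```python
# Ask the database for its generated query plan: a lookup on an unindexed
# column falls back to a full table scan, while the indexed column can use
# the B-tree index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
con.execute("CREATE INDEX idx_users_city ON users (city)")

for query in ("SELECT * FROM users WHERE email = 'a@example.org'",
              "SELECT * FROM users WHERE city = 'Berlin'"):
    print(query)
    for row in con.execute("EXPLAIN QUERY PLAN " + query):
        print("   ", row)
```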

But nowadays we want even more from a database management system. I won't get into distributed, geographical (geometric), or graph databases; I mean transactional qualities: we want to make sure that we never lose any data! The relevant keywords here are atomicity, consistency, isolation, and durability (ACID).

https://en.wikipedia.org/wiki/ACID
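
And here is a tiny sketch of the "A" in ACID (invented accounts; SQLite stands in for a full RDBMS): either the whole transfer commits, or none of it does.

```python
# Atomicity: a failing statement aborts the whole transaction, so the
# earlier credit to bob is rolled back as well.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY,"
            " balance INTEGER NOT NULL CHECK (balance >= 0))")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
con.commit()

try:
    with con:  # one transaction: commits on success, rolls back on any error
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # The next statement violates the CHECK constraint (alice only has 100)...
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    print("transfer aborted, nothing was written")

print(con.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 20)]  -- bob's credit was rolled back too
```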

Getting this right is immensely difficult, especially in light of the complexity of filesystem drivers (heck, scheduling and other properties of the operating system), as well as the underlying hardware. All of the NoSQL systems might still be too young to excel at this (sorry for the pun), and many of the smaller SQL implementations might also come with compromises.

By the way, the whole problem area is close to the issue of distributed databases: how do you build a perfectly reliable one? Spoiler: you can't; see the CAP theorem:

https://en.wikipedia.org/wiki/Cap_theorem

This is what I have for you today. One could easily fill books with this stuff, and I'm certain some authors did. I would love to hear about good books, or more on the history of modern RDBMS!

Some more links:

https://en.wikipedia.org/wiki/SQL

https://en.wikipedia.org/wiki/Relational_database_management_system

https://en.wikipedia.org/wiki/Integrated_Data_Store

https://en.wikipedia.org/wiki/Edgar_F._Codd

https://en.wikipedia.org/wiki/Charles_Bachman

He is also known for Bachman diagrams:

https://en.wikipedia.org/wiki/Bachman_diagram

https://en.wikipedia.org/wiki/CODASYL

https://en.wikipedia.org/wiki/Donald_D._Chamberlin

https://en.wikipedia.org/wiki/Raymond_F._Boyce

Comments

  • 1.

    This is a very nice write-up, Robert!

    Thank you for writing this.

  • 2.
    edited June 6

    I'm glad you like it, Matthew, thanks for your feedback!

  • 3.

    A few side notes:

    • One of the historic rivals to SQL is Datalog (https://en.wikipedia.org/wiki/Datalog). SQL was released in 1974 and researchers started having workshops about Datalog in 1977.

      Datalog is based on Prolog (https://en.wikipedia.org/wiki/Prolog). Prolog is a programming language based on the logic of Horn clauses (https://en.wikipedia.org/wiki/Horn_clause).

      Datalog proponents used to critique SQL for not supporting recursive queries. However, recursive queries were introduced in SQL:1999 (https://en.wikipedia.org/wiki/SQL:1999); a small sketch follows this list. The two query languages are thought to have the same power now.

      In industry, there are a few people still using Datalog. There's Datomic (https://www.datomic.com/), which is a product by Cognitect, the company behind Clojure (https://en.wikipedia.org/wiki/Clojure). Datomic wraps Postgres and conventional SQL solutions.

      There is also Cascalog (http://cascalog.org/). Cascalog uses Apache Hadoop to drive queries over large data sets. It hasn't been in development for a couple of years, so I think the industry has moved on.

    • Postgres in particular is strongly consistent as far as the CAP theorem is concerned. Kyle Kingsbury has done some excellent stress testing of Postgres on his Jepsen blog (https://aphyr.com/posts/282-jepsen-postgres).

    • Modern databases support a wide variety of data structures. One in particular that is useful for geographic information systems (GIS) is the R-tree (https://en.wikipedia.org/wiki/R-tree). These have very good lookup characteristics for nearest-neighbor searches. You can use R-trees in PostgreSQL and MySQL.
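
      As a small sketch of the recursive queries mentioned in the first bullet (an invented example; SQLite stands in for a full SQL:1999 implementation):

```python
# SQL:1999-style recursive query, written as a WITH RECURSIVE common table
# expression: walk an org chart by following the 'boss' column.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, boss INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "carol", None), (2, "dave", 1), (3, "erin", 2), (4, "frank", 2)])

rows = con.execute("""
    WITH RECURSIVE chain(id, name) AS (
        SELECT id, name FROM employees WHERE name = 'carol'
        UNION ALL
        SELECT e.id, e.name FROM employees e JOIN chain c ON e.boss = c.id
    )
    SELECT name FROM chain
""").fetchall()
print(rows)  # everyone below carol: carol, dave, erin, frank
```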

  • 4.
    edited June 7

    After reading Uber's benchmarks for Postgres vs MySQL (https://eng.uber.com/mysql-migration/) I swapped back.

  • 5.

    Thanks for adding interesting stuff about Datalog, Matthew! And you found out the year of SQL's release! Nice! So they got there within about a year!

    I can't help it, I have to say that, its achievements aside, I find SQL quite an awful-looking language. There's also a conceptual fault line between the application and the database, right where SQL is located. Stored procedures, for example, certainly sound nice, but they're neither as useful nor as much fun as one might hope.

    On the other hand, Datalog is much more beautiful. I'm not sure if that's enough to heal the rupture, though. When coding in ML, I'd clearly like to extend my types into the database...

  • 6.
    edited June 6

    I'm with you, Jim! I really like MySQL's InnoDB storage engine. The performance is nice, and the stability is totally worth the smaller feature set compared to Postgres'. I never actually understood why anyone would want to have more stuff on the other side.

    Folks, if you want to get a feel for databases, go and read blog posts like the one Jim linked!

  • 7.

    Hi @Robert, I'm surprised that you don't find stored procedures useful. They are very useful because they modularise SQL spaghetti into easily reusable and more easily debuggable subroutines.

  • 8.

    Here is a puzzle to think about. How to express all of the SQL semantics categorically? Ask @RyanWisnesky.

  • 9.
    edited June 7

    Well, stored procedures have some limitations. For example, one cannot use variables as universally as one might like, although MySQL may have lifted some of these restrictions since I last checked. They also cannot nicely parametrize variations of the statement itself, or help break up very large statements into smaller pieces.

    Which is of course to be expected for constructed statements, like for search functionality. If it were only a single search function, it might be okay to make an exception, but otherwise you might end up having two very different places to look for SQL code. By the way, they're also useless for statements that aren't about data manipulation. So either you use a very simplistic mechanism to establish your schema, or you will need to add a way to execute non-stored statements anyway, in the end doubling the code required to interact with the DBMS process.

    By the way, stored procedures aren't as popular as one might think, which means that there are many bindings for SQL APIs that don't support them at all! And, last but not least, if you code in ML, you might want the compiler to type-check your SQL statements as well, which doesn't make much sense if it is optional, since one feature of stored procedures is that you may change them without rebuilding your application...

    So, while I like the idea a lot, I have seriously regretted each and every attempt to use them in my own projects, and I have seen lots of code whose only purpose was to work around their limitations.

  • 10.

    Hi Fred, I divide categorical approaches to SQL into two groups. The first are 'programming language theory'-centric approaches, where you translate SQL into a lambda calculus containing either monad comprehensions, folds/structural recursion, or a combination of both. This 'mainstream, traditional' theory provides the foundation for most 'language-integrated query' systems today (e.g., MS LINQ). A good overview is my PhD thesis: http://wisnesky.net/dissertation.pdf . The other group is 'everything else', and there are many one-off approaches and some large bodies of work. Of particular interest are the papers centered around Rosebrugh and Wood, e.g., http://www.mta.ca/~rrosebru/articles/rdic.pdf .

  • 11.

    Hi Robert, thanks for alerting me to the 'mentions broken because of spaces issue'. Regarding your question, "I understand that categorial representations for SQL might help getting types to work to help get better programs in the end. I would love to hear a bit about motivations and lessons from applying these more modern approaches in practice, and I think the others might have more questions.", David and I and others examined this question in the paper 'Algebraic Databases': https://arxiv.org/pdf/1602.03501.pdf and from a purely syntactic view in 'QINL: co-LINQ, or Query Integrated Languages': https://arxiv.org/pdf/1511.06459.pdf . Basically, you have the same schemas-as-categories idea as usual, but you allow edges between the nodes in your category presentation that correspond to functions on types, e.g., reverse : String -> String or not : Bool -> Bool (and higher-arity generalizations thereof). In practice, rather than representing these functions extensionally, as infinite tables, you represent them intensionally as e.g., the SQL function 'reverse'.

  • 12.
    edited June 19

    From Petri nets and flow-charts to UML:

    https://modeling-languages.com/wp-content/uploads/2014/04/historyModelingLanguages.jpg

    https://modeling-languages.com/history-modeling-languages-one-picture-j-p-tolvanen/
  • 13.

    @Robert Thanks for the explanation. I've never had a problem because I've only ever used them, or wanted to use them, for simple data search or update. I'm not surprised that modifying them without recompilation causes problems.

  • 14.
    edited August 7

    One of the areas where I feel somewhat 'underserved' by research as well as open source & commercial offerings is the notion of time in the data evolution model. Datomic has taken a stab at it, and somewhat emulates what are known as 'persistent data structures' [1], that is, data structures that preserve the state of the data at a given time.

    In my professional career, I have designed several (enterprise-internal) implementations of this concept using basically hand-designed notions of time evolution, state management, etc.

    I could not find (perhaps I have not looked hard enough) an easy-to-explain / easy-to-build relational algebra across separately evolving (evolving in time) data, telling me which relational operations are valid vs invalid in those models.

    These kinds of problems arise very frequently and in many domains.

    For example, take financial services.
    There are counterparties (the parties to the contract), there is the derivative contract (which describes the underlying behavior plus the base instruments), then there are market events, and then there are credit events and the creditworthiness associated with the parties.

    All these legs (parts) evolve independently in time. So my typical data models included a bi-temporal setup (where both system time and business time are included) for each.
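
    A rough sketch of such a bi-temporal layout (an invented example, using SQLite; the table, columns, and dates are made up): every row carries a business-time interval and a system-time interval, and an "as of" query constrains both.

```python
# Bi-temporal rows: valid_* is business time (when the fact holds in the
# world), recorded_* is system time (when the database learned about it).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE contract_leg (
        contract_id   INTEGER,
        payload       TEXT,
        valid_from    TEXT, valid_to    TEXT,   -- business time
        recorded_from TEXT, recorded_to TEXT    -- system time
    )
""")
con.execute("INSERT INTO contract_leg VALUES "
            "(1, 'rate=2%', '2016-01-01', '2017-01-01', '2016-01-05', '9999-12-31')")

# State of contract 1 as of business date B, as known to the system on date S.
B, S = "2016-06-30", "2016-02-01"
rows = con.execute("""
    SELECT payload FROM contract_leg
    WHERE contract_id = 1
      AND valid_from    <= ? AND ? < valid_to
      AND recorded_from <= ? AND ? < recorded_to
""", (B, B, S, S)).fetchall()
print(rows)  # -> [('rate=2%',)]
```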

    This way, I could build domain-specific query language(s) around these time-aware evolution legs. And those languages would constrain the users/client applications: what they can ask, what parameters (especially time parameters) they have to pass in, etc.

    However, I found it difficult to evaluate and pre-build constraints that would prevent submitting 'invalid' queries -- where, say, a client would ask my system for data (state) based on a relational query, but some parts of the relation did not exist at that time (or there were multiple changes to one of the relations within the assumed interval, and therefore it was ambiguous what exactly they were asking about).

    In other words, what I could not figure out, at least at that time, was a constraints model around the relations in separately evolving 'subsets' or 'sub-states' of the overall data.

    This is probably similar to trying to relate entries from separately evolving blockchains / distributed ledgers.

    There are tools (even built into databases) that make time arithmetic easier (e.g. Allen's interval algebra is built into several database systems) -- but that's really just a form of 'interval algebra' and not really a foundation (or technical implementation) of a time-centric relational algebra.

    [1] https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-854j-advanced-algorithms-fall-2005/lecture-notes/persistent.pdf
