Thursday, May 21, 2009

EAV FAIL



This illustrates (by counter-example) an important characteristic of a normalized database: each logical "type" of attribute belongs in a separate column.

Just because three values happen to be numeric doesn't mean it makes sense to SUM() them together. But if dissimilar attributes are stored in the same column, it's tempting to treat them as compatible in this way.

This also shows a fallacy of the Entity-Attribute-Value antipattern. In this design, all attribute values are stored in a single column.

CREATE TABLE EntityAttributeValue (
entity        VARCHAR(20) NOT NULL,
attribute     VARCHAR(20) NOT NULL,
value         VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

INSERT INTO EntityAttributeValue (entity, attribute, value)
VALUES
('New Cuyama', 'Population',          '562'),
('New Cuyama', 'Ft. above sea level', '2150'),
('New Cuyama', 'Established',         '1951'),

SELECT SUM(value) FROM EntityAttributeValue
WHERE entity = 'New Cuyama';


The Entity-Attribute-Value design does not support or conform to rules of database normalization.

update: To be clear, the proper way to design a database is to put different attributes in different columns. Use column names, not strings, to identify the attributes.

CREATE TABLE Cities (
 city_id          SERIAL PRIMARY KEY,
 city_name        VARCHAR(100) NOT NULL,
 population       INT UNSIGNED NOT NULL,
 feet_altitude    SMALLINT UNSIGNED NOT NULL,
 year_established SMALLINT UNSIGNED NOT NULL
);

79 comments:

Wenbert Del Rosario said...

what would be the correct way to structure your database using your example?

At first glance, I would do something like this:

Table: Entity
Fields: id, entity

Table: Attributes
Fields: id, attr

Table: EntityAttributes
Fields: id, entity_id, attr_id, value

But then again, I was in a hurry or being lazy in a not-so-important project, I'd follow the Entity-Attribute-Value design. Because querying that kind of database is easier. Less table joins, etc.

Roland Bouman said...

@Wenbert Del Rosario

" Because querying that kind of database is easier. Less table joins, etc"

You gotto be kidding...seriously, the main characteristic of EAV schemas is that they have a "store everything, query nothing" property. You can store what ever you like, because it's so generic. Because it's so generic, you can't query it, because nothing really ever means something.

And talking about joins...what do you think the query would look like to get one row for Cuyama showing Population, ft above sea level and Established?

I have used EAV too, but only in an isolated case where an extensible storage seemed to best fit the bill. We have not, and will not, solve the problem of querying that database. If we eed reports, we transform graphs of interest to regular table structures and query those)

Bill Karwin said...

@Wenbert: In your example, you *still* have all three values occupying the same column (EntityAttributes.value).

The point is that in a well-designed table, you shouldn't put different types of things into the same column.

The correct table would be:

Cities (city_id, city_name, population, feet_altitude, year_established)

Different attributes in different columns. Use column names, not strings, to identify the attributes.

Wenbert Del Rosario said...

@Bill Karwin Thanks. That cleared up things for me. :-)

Willo van der Merwe said...

Hi Bill,

As you know I'm pro EAV, but this article really had me in stitches... :D

Willo

Bill Karwin said...

@Willo: Yeah, I realize it's possible to make a functioning system with EAV.

But that's like saying one can build a house of bricks without using mortar. You can make it stand if you're careful enough, but you better not lean against it! ;-)

dbscience said...

I generally don't like this pattern either as it doesn't scale, but what rule of normalization does it violate?

Assuming the entity and attribute are immutable (and if they aren't, they can be by using entity and attribute ids), this doesn't violate the core rule of normalization that nothing is duplicated. There are no insertion, update, and deletion anomalies with this data model.

For example:

CREATE TABLE EntityAttributeValue (
entityId Int NOT NULL,
attributeId Int NOT NULL,
value VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

Which implies associated entity and attribute tables.

Plus, just because a column isn't additive doesn't mean a table isn't normalized. Think of an international financial system where the amounts are stated in different currencies and the amounts can't just be totaled.

Again, to be clear, I agree this is a bad general model, but it is normalized to the key, the whole key, and nothing but the key form.

Bill Karwin said...

@dbscience: The brief definition of 1NF is that the table faithfully represents a relation, and has no repeating groups.

What does it mean to faithfully represent a relation? One of the criteria is that each column & row intersection has exactly one value in the appropriate type.

What is the appropriate type? Well, in an EAV table the type varies per row, depending on what's in `attribute_id`. Though we may use a large varchar as a sort of universal type, the domain of permitted values is different for each attribute.

You can't do that with an attribute in a proper relation. The attribute must have one type (or domain) which applies on all rows. So the EAV table isn't a proper relation, therefore EAV fails to meet the criteria of 1NF.

The meaning of any given value in an EAV table also varies depending on what attribute the row represents. "1968" means something different as population, altitude, or year established. The attribute id is *part* of the key of the EAV table, not the whole key. Therefore the EAV table also violates 2NF.

Finally, though the EAV is implemented as a table, it also serves as a meta-model for another table.

A relation has a header, that names a set of attributes, followed by rows, where there must be a value for each of the attributes in the header.

If each entity can supplement its attributes at will, or omit some of its attributes (by not storing a row in the EAV table for a given attribute), this fails to be a relation.

So the actual EAV table is not a relation, because the domain and meaning of the `value` are variable, and the implicit table that EAV is modeling is not a relation because any entity is allowed to have more or fewer attributes than those named in the header.

dbscience said...

By this logic this table isn't normalized.

create table Order
(orderId int,
currencyId int,
amount number)

Because the amount column meaning depends on the value of the currencyId column. For example, 100 could mean 100 dollars or 100 yen depending on the currencyId value.

Bill Karwin said...

@dbscience: Yes, that seems to be correct. Unless `currencyId` is part of a compound primary key with `orderId`, the table you describe isn't in 3NF (it violates "nothing but the key").

Likewise, an Order that has columns
UnitPrice, Quantity, TotalPrice is also not in normal form. The TotalPrice depends on the other two columns, which are not part of the table's key.

dbscience said...

My point is the order table doesn't have any insert, delete or update anomalies, the ultimate goal of normalization, and is therefore normalized. The amount column depends on the orderId key for its value if not its meaning, which is how I interprect normalization.

Like everything complicated, there are multiple ways to interprect something and I suspect we aren't going to convince each other of the other's viewpoint.

But I do agree that the eav pattern is a bad general concept.

Willo van der Merwe said...

@Bill
I disagree. I believe that EAVs do have a place and can accelerate development and implementation enormously. It's also brilliant in prototyping.

But it needs to be implemented carefully and not used a generic stop-gap.

EAVs need a specific structure and I tend to use systems as dbscience describes: with a separate attribute table.

I like to think of EAVs in the same way as pointers in C. You have to be careful, but the benefits can be enormous.

From my experience EAVs are completely different from from conventional database modeling and one should not attempt to apply traditional relational database theories to it. It's closer to Object Orientated database systems.

Bill Karwin said...

@Willo: Certainly EAV and other database designs have their place. I'm not saying the relational paradigm is the only one we should use. There are problems it doesn't solve easily.

SQL is a language intended to work on relational databases. If you need to manage data in a non-relational way, that's fine. But trying to manage a non-relational data structure with technology designed for relational data is bound to be awkward. For example, you can't model constraints like NOT NULL in EAV.

Hence the need to supplement SQL with code in the application layer, to make sure the data doesn't fall apart. EAV frameworks like yours can provide the mortar for this brick wall.

I think EAV is also like the Semantic Web paradigm, where any entity can have variable attributes, and you can query what attributes an entity has. Thus data and metadata are interchangeable from the DBMS perspective. A a new query language, SPARQL, was invented for this.

Willo van der Merwe said...

@Bill
Yes, certainly, but due to the lack of tools that will efficiently retrieve and filter large amounts of data, and given the ease of installation, and the shear proliferation of SQL server installations out there, using a traditional relational database engine as a storage mechanism makes sense.

So that is what it becomes, just a data store. If XML parsers were fast enough it would probably make more sense to store your EAV data in XML.

Even though RDF is very close to EAV it is not the same. There definitely is a relationship there, but they solve different problems. I had a long hard look at the RDF specification and one of the major differences between EAV and RDF is that, typically, in EAV you have a clear separation between schema and data, while in RDF schema is part of the data.

I have been looking at SPARQL and how it can be used for EAVs.

pompomru said...

If EAV is antiptter then how we can solve a problem when entities should be configured from admin panel?

Bill Karwin said...

@pompomru: Adding attributes should be done by adding columns to the table.

Chris said...

Your example is an over-generalization. The purpose of the EAV pattern is to efficiently store sparse data in a RDMS. If you don't have sparse data, then EAV is unsuitable. However, if your data contains millions of features (i.e. "columns") but each sample (i.e. "row") only has a dozen non-null values, then its extremely poor database design NOT to use EAV.

Bill Karwin said...

@Chris: Point taken, using EAV as sparse data storage is not too different from a simple many-to-many table. If a given attribute is applicable to the entity, store a row. If not applicable, store no row.

Then you can add supportive attributes to the many-to-many relationship. That's where we get the `value` column.

But many people use EAV for something other than sparse data. They use it to represent a non-sparse table when they haven't decided on the attribute columns, or when they want to make it user-extensible.

The difference that gets people into trouble is when you try to treat rows with distinct `attribute` values as if they were separate columns in a conventional table.

Prakash M. Nadkarni said...

Your points are semi-valid - but to call EAV an "anti-pattern" is an extreme viewpoint. The fact is that the EAV approach is to be used, like any advanced modeling technique, only in EXCEPTIONAL circumstances where the attributes are sparse (for a given entity) and very numerous (across all entities). Obviously, it's such a pain to work with that you don't employ it when the components of your data model are mostly stable and well-known.

Further, an EAV design is unworkable without a supporting metadata infrastructure to ensure validity of the EAV tables' contents. The metadata tables have to employ 3NF plus features like declarative referential integrity and constraints maximally. If not, the 2nd law of thermodynamics takes over.

Prakash M. Nadkarni said...

Follow-up: Oracle Corporation sells a product called Oracle Clinical for clinical trials data management and has recently spent $700M to acquire Phase Forward, the market leader in Clinical Trials software. Both products implement an EAV subschema for their clinical data, though the majority of the tables are in 3NF.

Is it just possible that the vendor of an RDBMS and a (fairly large) niche market leader know something about the judicious use of EAV that you haven't considered?

Bill Karwin said...

@Prakash: The EAV model dates back to medical records systems in the early 1970's, when relational databases were still just research and no implementations existed.

So it's not surprising that the Phase Forward product uses a similar design. It might even have a direct lineage to one of those early medical systems.

As for Oracle's acquisition of Phase Forward, I'm sure they chose that company for its business potential, not for its use of relationally correct design.

The fact is that EAV is not a relational design, but as you point out, there are exceptional circumstances that make EAV the only option, given other requirements for a project.

Luc said...

At a logical level are the system tables of a DBMS not EAV-based? When we create new tables are their definitions not stored using some variation of that model? Physically there may tend to be new files for new tables but not necessarily and only for the sake of performance. DBMSes now also offer various partitioning options to remove the problems of many rows in a table. If you focus on the fact that it is a metadata schema, I can't see why you think EAV model is not normalised. The fact that a typical query on the application-data is more complicated is not a big disadvantage and does not signify that it isn't normalised. The values stored in the values table, from a metadata point of view are all of type 'mixed'. Does that violate any rule of normalisation?

Bill Karwin said...

@Luc: Thanks for your comment. In the relational model, there is no such thing as a "mixed" data type. Every column must have a single data type. You must be able to compare every value in that column to another value in that column for equality or inequality. In other words, every value in a given column comes from a single domain.

When you use EAV, values in the "value" column come from all sorts of different domains. You can't compare a value '1234' to another value '1234' and say whether they represent the same element from the same domain. You can't say whether those two values are equal even though they look identical.

The meaning of domains and the fact that each column in a table has exactly one domain is fundamental to being relational. EAV breaks this rule.

Can you give an example of where system tables do something similar? That is, where two values in a given column can look identical yet be logically different and incomparable?

Luc said...

@BillK: I was perhaps using the term 'mixed' too loosely; from the viewpoint of the metadata it is just one logical type...'Value'. I don't have a better general word for it. In RDF terms it is 'object' (the other two being 'subject' and 'predicate'. Also, we should really distinguish between the normalised entities (logical data-stores) and the eventual physical tables, files, XML documents, binary-files, etc. which actually get used to implement the normalised model. Using a non-specialised DBMS to implement an EAV schema is likely to be quite in-efficient because the DBMS is itself holding metatdata about the the metadata which is the EAV; so the EAV is adding another layer of abstraction above the application-data-schema (which can also, of course be an inplementation of a fully normalised ER model). By the way (I can almost see you shaking your head already :-) ), I don't agree if you believe that an EAV ER model can't be normalised. EAV-based or schema-less databases which store these 'mixed' values by explicit logical design and may be more efficient, include Amazon SimpleDB, Google AppEngine datastore, Microsoft SQL Data Services and in-principle at least, RDF datastores in general. All this is just my opinion of course. It's not like I've spent time ever researching it, but I have used systems which run on this type of schema and they do a very good job. The schema is beautiful. The only 'weakness' is in speed of retrieving data from complex queries. But there are ways to mitigate against that problem. The huge benefits to application development are too good to ignore for most business solutions.

Bill Karwin said...

@Luc: Thanks for your elaboration. I worked for a company that developed Seamark Navigator, a faceted search engine, sort of along the same lines as Jena. So I'm familiar with RDF.

You're right that EAV has two aspects, one the literal SQL table where attributes are stored one per row, and the abstract table it's meant to model, which has variable attributes per row. This layer of abstraction is called the Inner-Platform Effect (http://en.wikipedia.org/wiki/Inner-platform_effect).

It's true that data with variable attributes is useful and necessary for some applications. But my point is that data in this form is not relational.

So it's not surprising that representing non-relational data with a relational database must employ a level of abstraction, resulting in the Inner-Platform Effect.

Luc said...

@BillK: hmm, did you write that wiki article? Almost like running an emulator? I'm still not convinced EAV is not relational or 3-NF. e.g. its data is always mutually accurate.

Bill Karwin said...

@Luc: I recommend a book "SQL and Relational Theory" by C. J. Date, who is the world's leading authority on the relational model. Read it and I think you will be more likely to agree that types are fundamental to relations.

Luc said...

lol... I'm quite sure I owned a copy of that book while at Uni. Tuples etc. etc. OK... I'll re-scan it, just for you :-)

Luc said...

@BillK: by the way, if the 'value' is stored as an id (foreign key) pointing to the actual value (stored in another table dedicated to that datatype), would that not satisfy your objection while still being an EAV model?

Bill Karwin said...

@Luc: The book I mentioned is from 2009. But no doubt it contains similar content to Date's earlier books.

Re the "id" column, no, it makes no difference. You're still trying to store values of a different type in the same column in the EAV table. An extra layer of indirection doesn't help.

By "different type" I'm not talking about SQL data types. I'm talking about the logical domain of those values. You can use one domain per column, full stop.

Also, to do what you describe, that id would have to reference different tables depending on the attribute type. This is called polymorphic associations and it's also a non-relational design.

The fact that you can't declare a foreign key constraint to enforce polymorphic associations should be a clue that it doesn't fit the relational model.

Luc said...

@BillK: Yes, I'm remembering his and Codds much older books... Date was easier to read I think. OK, I think Codd made a distinction between Domain and Type but Date views them as synonymous, however sticking with the term domain rather than data-type... that domain, for the EAV system is a special or user-defined one anyway (the 'attribute-value' domain). It says that here is a Value, how to use that value (outside of the EAV system) is explained by what attribute of what entity it belongs to. And those value can be very complex according to the context in which they are used (atomicity is dependant upon usage, yes?). In EAV relational algebra works fine for manipulating the system. I doubt that would be the case if it were not a relational (and normalised) system.

Bill Karwin said...

@Luc: Consider that you can't enforce UNIQUE or FOREIGN KEY constraints on the values of a given attribute. You can't because all the entity's values are stored in the same column. You can't place any constraints for one attribute because they would affect all attributes.

This is the consequences of trying to store multiple domains in a single column.

You also can't enforce NOT NULL, because you can't make a constraint that a row must exist for each given entity-attribute pair.

Joe Celko explains it also:
http://www.simple-talk.com/sql/t-sql-programming/avoiding-the-eav-of-destruction/

CmdrAndreyasn said...

I'll sum EAV up in a simple statement. Magento Performance. Gets better in a negative direction with every major update as the number of tables increases. Death by a thousand joins.

Ron said...

Yes. This is why Magento is struggling with scalability. Just ask their dev team how many hours they spent on inventing their own 'full page caching' ...

They are trying hard to move away from the EAV model. In fact I heard that the Enterprise version is already moving away from the EAV model for the order history tables..

Bill Karwin said...

@Ron: Yeah, I thought Magento was a non-starter as soon as I took a look at the code and realized they were another case like "Vision" in the Bad CaRMa article.

If they need that degree of extensibility, they should switch to a document-oriented database.

Willo van der Merwe said...

Again, sorry guys.

Yes, I agree. Implementing EAV in a conventional SQL database is a challenge and should not be taken lightly. There are much better tools out there that'll fit EAV like a glove.

It doesn't mean that EAV is an anti-pattern. Hell, the Universal Data Model is effectively an EAV.

All it means is: choose your supporting technology carefully.

Alejandro said...

I've been following the thread for a bit. Really interesting stuff considering that this question has irritated me for some time.

I don't want to take the conversation off track, but what solutions might one pose to the dilemma of disjoint subtypes. It seems like EAV fits here, but is bound to degrade over time and impose scalability constraints (which is why I view it as an antipattern).

I'm wondering what possible solutions exist in terms of RDMS if there are any at all and how can object/graph databases address the problem.

Tom Anderson said...

In regards to the fallacy of EAV: In a properly designed EAV database the decision comes to reference attributes to attribute sets through a many to many or one to many relationship. I believe 98% of developers choose one to many completely validating your point. However, in a one to many relationship columns are not treated as the same data since all attributes must be unique to each attribute set.

You also touch on values. Magento made a wise choice to use five values tables for each entity. I've found their example overly complicated and well founded.

Bruce said...

Every discussion I see on the evils of EAV uses a brain-dead example (population, elevation of a city), and assumes we are going to try to use some kind of two-table solution for the entire domain.

NOT ONE TIME have I ever seen someone who throws rocks at this solution recommend how to solve the underlying problem. Which is, in a nutshell, using a concrete example:

How would you implement a product configurator? To use my own real-world example, what would you suggest the FeatureDescription table looks like when it's describing the features of a 4D Ultrasound system, but also the features of an aluminum fishing boat?

I can tell you from experience; I can log into the exact same web application (with a different subdomain) and configure both of these things - with all the correct rules guiding each one. Was this application easy to write? No. And in fact the database schema consists of almost 100 tables that exist to support customer records, sales, work orders, etc. But none of it would have been possible without EAVs for the product configuration data. Period.

Until I see some response that goes beyond the ridiculous suggestions from the Usual Suspect that I have 1000 different DB schemas for every possible product/feature domain, I'm simply ignoring those that whack on EAV and never offer an alternative.

Bill Karwin said...

Hi @Bruce, I feel your frustration. I have asked relational database gurus how they would solve problems like that which you describe, and I seldom get a good answer -- or any response at all.

Suppose you have to manage variable attributes per subtype, and you have a fixed number of subtypes, you can use Class Table Inheritance (http://martinfowler.com/eaaCatalog/classTableInheritance.html) or Concrete Table Inheritance (http://martinfowler.com/eaaCatalog/concreteTableInheritance.html)

But sometimes that's not practical, because you have too many subtypes or subtypes that must be created at runtime without modifying the schema. Or you really have no subtypes at all, you allow every entity to have distinct attributes.

The best compromise I've seen to solve that problem in a relational database is to encode the variable attributes in a BLOB, using your choice of XML, JSON, etc. Then for each attribute, create an inverted index table to point back to rows where a given value in that attribute occurs.

This solution was written about in 2009 in "How FriendFeed uses MySQL to store Schema-less Data" (http://bret.appspot.com/entry/how-friendfeed-uses-mysql).

The other answer is: go ahead and store schema-less database, just don't do it in a relational database. Relational and schemaless are completely in opposition to each other.

So you can use Solr, CouchDB, MongoDB, Hadoop, whatever non-relational solution you find meets your needs for non-relational data. They have specialized ways of storing that data, and specialized ways of querying it.

But SQL is for relational data. Using EAV in a relational database and trying to query it with SQL is not the answer.

Bruce said...

I'm not sure what is gained by using two different storage solutions. If the domain objects maintain the typing going in and out of the database (e.g., "Fabric Color" is a text attribute with choices of "Red, Green, Yellow"; "Additional Trolling Motor" is a boolean attribute) - why should this data be "query-friendly", per se?

Sure, maybe I won't be (easily) able to answer the question "how many Red Model XYZs did we sell last year that had the optional Hammeranger" - but I wouldn't be able to in any case with BLOB data, so it's a moot point.

At some point, I'm going to have to pass in known parameters - "get me all the used features between date X and date Y" - and then use a custom query engine that works on in-memory domain objects. How I populate them is immaterial.

Smart EAVs don't have only three columns - they have lots of columns that are universally applicable. Using my example, perhaps a "Phase-In date" and "Phase-Out date". It would be silly to store these kinds of things in EAV columns. This allows reasonable limits on the query and then the in-memory query engine can take over. Implemented correctly, the user never even needs to see the difference between the baked-in attributes and the dynamic ones.

Ad-hoc queries become less practical, of course - but that's a trade-off that I simply have to live with in this problem domain.

Bill Karwin said...

@Bruce: Did you read the article about FriendFeed I linked to?

Bruce said...

I did read it, thanks. A couple of times actually. If I understand it correctly, their BLOB data is a pickled Python dictionary that has some sort of key values in it (along with other arbitrary data).

So, the primary ID of this blob data is extracted and placed as the primary key in the ENTITY table, then all other key values are stored into separate index table(s) that reference back to the entity table.

For example, if we had a "part_id" in the BLOB, we would create an index table with "part_id, entity_id" as the composite primary key, and this would point back to the correct row in the ENTITY table.

I made sense of it, but I don't see how it solves the query problem. Maybe I'm missing something. Using their example, how could we write a query like "select all ENTITY where TITLE contains 'we just launched'"?

It seems they solved a performance problem, but I don't see how it's any better than an EAV for querying. Maybe I'm missing something and you can comment. Thanks.

Bill Karwin said...

@Bruce: The difference from a relational perspective is that columns are stored in columns, not rows.

> how could we write a query like "select all ENTITY where TITLE contains 'we just launched'"

SELECT e.entity_id, e.attribute_xml
FROM Entities e
JOIN Titles t ON e.entity_id = t.entity_id
WHERE t.title CONTAINS 'we just launched'

Then you fetch the blob of XML and parse it in your application, to discover all the other attributes of that entity.

But the idea of the Titles table is that if it's in sync with your XML blobs, it points correctly to rows where the value you're looking for occurs in the blob.

So you can have a proper table whose metadata names the "title" attribute and uses the correct data type. There's no way you're going to store in that same table the postal code or the birthdate or the password of the same entity. Each attribute goes in its own separate table.

It's better than EAV because in EAV you need to do a join per attribute you want to fetch, and if you add an attribute, you need to re-code all your queries.

Whereas in inverted index, you do a join per attribute that you need to filter on, so if you add an attribute to the XML, you fetch it automatically with the rest of the blob, with no need for code changes.

(ps: CONTAINS is not a standard SQL operator, it exists in Microsoft SQL Server iirc.)

Bruce said...

Bill:

Yeah the CONTAINS thing is probably because I'm used to using my in-memory search parser language instead of SQL. =c)

I checked out Anchor Modeling (basically a separate table for each attribute, no updates, no nulls, and use views for querying). It seemed a little extreme to me for an OLTP scenario.

The reverse index makes sense for RI where you reference another entity. The parent entity can safely do a cascade delete, and the referenced entity can have a where-used that is fairly easy to implement, although it can't allow a cascade delete. I tried it for image references and it worked well.

It still doesn't solve the problem that EAVs do, however, which is dynamic data without schema changes.

Bill Karwin said...

@Bruce: "It still doesn't solve the problem that EAVs do, however, which is dynamic data without schema changes."

Yes it does. You're free to put any dynamic data you want to into the XML blob, including new attributes at will. No schema changes needed.

But this means you have to run a table-scan and row-by-row XML parsing if you want to search for a given value in one of those new attributes.

If you want to make it efficient to search for a value in a particular attribute, then you have to create a new attribute table.

This is no different from document-oriented NoSQL databases e.g. CouchDB and MongoDB. They allow you to do ad hoc map/reduce queries, but it searches the hard way, over the whole data collection. If you want to run that search repeatedly, it pays to create an index, which is a DDL operation just like it is in SQL.

Or you could put the data into something like Solr, which allows documents to have distinct sets of attributes, and it implicitly indexes every attribute.

Deroude said...

Hello gentlement - may I first state that I have found your dialog most instructive. I would have been the most dedicated adversary of EAV models (except for obvious practical uses) - that is unless I had met with a dilemma of my own involving it. I need to be able to add dynamically any number of "fields" to a document, each field having a different data type. Since the data types are limited (say 5), an obvious solution would be to just add 5 nullable columns. But it's still wasteful (even though it is a recommended way to model class inheritance). Can you think of any other way?

Bill Karwin said...

@Deroude, InnoDB doesn't store column values when they are NULL, so I don't think adding five new columns is so wasteful.

But it's very likely that you'll need to add more than five.

You should read "How FriendFeed uses MySQL to store schema-less data" (http://bret.appspot.com/entry/how-friendfeed-uses-mysql). This is a compromise design that allows you to add dynamically any number of "document fields" to a row, without changing the schema, and without breaking your relational model as badly as EAV does.

James Kessler said...

Wow, this blog post is epic fail in itself.

Bill Karwin said...

James Kessler, negativity is not constructive. If you are only going to take pot shots, please keep them to yourself. If you're willing to elaborate and describe why you say that, please do.

sharifi14 said...

Hi Bill

Interesting discussion. Your advice and comments on EAV patterns around the web are always useful reading.

My problem seems to be an issue that others have too (I've read some of your comments on Stack Overflow). Here's my situation:

I want to create a database of around 1000 products. However products will have different attributes depending on their category, i.e. Computers may have CPU speed, hard drive storage, SKU; whereas Books may have author, publisher, pages, ISBN.

I can see three possible methods:

1/ Use EAV pattern, i.e. have one Attribute table and one Value table, something similar to this: http://i43.tinypic.com/2j2defp.png. This method is quick to set up and quite easy to maintain too, but as you mentioned, 'The point is that in a well-designed table, you shouldn't put different types of things into the same column.'

2/ Use a single table called Products and have lots of columns for all possible attributes. However there would be many NULL values in this table, as a Book isn't going to have a value for CPU speed for example!

3/ Create a table for each Category, i.e category_computers, category_books. These Category tables contain all the relevant attributes, so we won't get lots of NULL values. The problem with this is that it may result in duplicate columns across multiple tables, and also we will end up with loads of different tables!

Which of these three methods would you recommend? If none, what would you use as an alternative?

Bill Karwin said...

Hi sharifi14, thanks for you question.

Your option 2 is called Single Table Inheritance. Its downside is that if you have a lot of NULL columns, and if you ever add a new category you have to add more columns.

Your option 3 is called Concrete Table Inheritance. You end up with N tables for N categories, and you have the problem of duplicating columns that are common to all categories.

Another option is Class Table Inheritance
http://martinfowler.com/eaaCatalog/classTableInheritance.html
You'd get N+1 tables for N categories; the extra table has columns that are common to all categories.

Finally, Serialized LOB
http://martinfowler.com/eaaCatalog/serializedLOB.html
Where you have one table, and columns that are common get conventional columns. Other columns that are custom or category-specific are serialized together as XML or JSON or something and stored in a BLOB. This makes it inconvenient to query or update individual custom fields without fetching the whole blob, but it only requires one table.

You can index the Serialized LOB pattern with a technique described here:
http://backchannel.org/blog/friendfeed-schemaless-mysql

Jon said...

@Bill Informative post. Like some posters before me, I happen to have a problem with growing schema. What do you think about the following approach? Products -> Products_PC, Products -> Products_Apparel, Products -> Products_Vehicle.

Basically, Products_PC will consist of all columns relevant to PC and will inherit other attributes from Products (ie price, cost, etc). As client requires more products, I shall get them to engage our team again to further customize the schema. Will this be a good approach?

Bill Karwin said...

@Jon, Yes, you're basically describing the pattern that Martin Fowler calls Class Table Inheritance.

It can be an effective alternative to EAV, as long as you can take the time to create a new table every time you add a new product type.

But if you absolutely must support creation of new subtypes without schema changes (even the creation of a new empty table), it may not be flexible enough.

I recommend reading the book Patterns of Enterprise Application Architecture.
http://martinfowler.com/books/eaa.html

Jon said...

@Bill
Thanks. I'm not much of a database guru, so bear with my ignorance, if any. I'll go through your link. Much thanks.

DragonMan said...

Your missing the point of EAV. While denormalization is not in most cases proper for relational databases, it is necessary for datawarehousing. EAV is useful for the organization of heterogeneous data in which you might not know what the attributes will be. Contrary to the Cities example table you provided.

Bill Karwin said...

Hi DragonMan, I see your point, but typically in data warehousing you *do* know what attributes will be.

A significant part of Ralph Kimball's books on data warehousing make the point that you *must* know what facts you need to store before you can design a DW schema for them.

The requirement to accommodate heterogeneous data is a challenge for any project. There are a number of enterprise design patterns designed to handle them:
- Single Table Inheritance
- Concrete Table Inheritance
- Class Table Inheritance
- Serialized LOB, optionally supported by inverted indexes.

Lastly, there are non-relational data stores that can handle data with variable attributes.

Alexandru Ilie said...

Excellent post and excellent explanations Bill. The EAV problem couldn't be explained better. Appart from other problems, as someone in the comments pointed, that's why, for example, Magento is so unbelievable slow. 400+ tables for a simple shopping cart. The database is horrible complex and the queries... man, you should see the queries, they are hilarious.

Keep up the good work Bill. You have some great articles! Cheers

Matt C said...

Still one of the best EAV threads ever. I'm trying to get a feel for all the Magento forks that are out there in the world, and posted some EAV thoughts here: http://mwic.org/wp/?p=51

I'd be honored to hear anyone's comments on this --especially, of course, Bill's

Bill Karwin said...

Matt C, Thanks for the kind words! I've posted a reply comment on your blog.

HEALIS said...

Hi, I really enjoyed reading the pros and cons on EAV, especially your strong truthfull arguments Bill. What I would like to add here for your readers is the fact that Entity Attribute and Value are simply abstractions that we made in order to cope with complexity of real problems, the same is true for RDF and even ER and EER I dear to say. It is a simplistic model. In practice you discover that attributes are in fact entities and values can also be names for other entities. In that scenario relations may exist among entities and attributes and values. I am looking into ways of modeling that complex scenario with other alternative models as a better candidate for WEB 3.0 than RDF (see Neurorganon NULO)

Bill Karwin said...

Hi HEALIS, thanks for your comment.

I agree that there are complex real-world problems that are hard to map into the relational model. But my view is that using a relational database for such data is a misapplication of that model.

In a relational model, any attribute can have a lookup table that enumerates the set of values. The lookup table may have more columns that serve to annotate the discrete values within that attribute.

I'm not a fan of Semantic Web; I think it's an impractical abstraction. There's no way to optimize a dynamic data model in the way we do with a relatively static relational schema. The only other option to avoid scanning your full data store on every query is a navigational / graph database model.

A couple of products in this vein I've heard of (but I haven't used them) are Neo4j and OrientDB.

Mike Parkin said...

Hi Guys,

What a great selection of comments and answers, I've been thinking about this issue a lot over the last few years and all of the things I have thought about are raised here.

We currently use a very simple Serialised LOB approach and are now looking at indexing options. We will be looking into the other design patterns and Elastic search that are mentioned. Will come back when we have made our decisions (perhaps a blog post or something).

Thanks very much!

-------------

One addition I wanted to make is that it is possible to enforce unique using EAV - magento does this as follows:

There are two tables, one contains the entity *definition* and the other the attributes for those entities.

Entities
------
entity_id
entity_name
entity_table

Attributes
--------
attribute_id
entity_id
attribute_name
attribute_type (varchar, text, int)

Then for each entity a concrete table is created, one for the entity itself and then one table for each attribute type - for example customer:

customer_entity
-------------
customer_id
created_at
updated_at

customer_entity_varchar
customer_entity_int
customer_entity_text
customer_entity_....etc
-----------------
customer_id
attribute_id
value

Let's say you wanted to add 'email' as a unique attribute for customers. It would now be possible using this approach to say that value on customer_entity_text must be unique (using a composite index against the attribute_id and value)

Bill Karwin said...

Hi Mike, thanks for your thoughts and comment!

Regarding the Magento EAV schema, how could email be unique if the customer entity has two text attributes, one that should be unique (email) and one that shouldn't be unique (postal address)? Wouldn't both attributes of the same type text be stored in the same `customer_entity_text` table with different attribute_ids?

Mike Parkin said...

Hi Bill - you're absolutely correct - that wouldn't be possible! :-)

Mike Parkin said...

Hi Bill,

We have a new member of staff and because of this are re-visiting the love and joy that is EAV.

Whilst we were having a discussion, we realised that it is actually possible to have unique constraints, it would just require extra tables.

You would need to have a table for unique and non-unique:

customer_entity_varchar
customer_entity_int
customer_entity_text
customer_entity_unique_varchar
customer_entity_unique_int
customer_entity_unique_text

This way you can keep unique and non-unique attributes.

An obvious real-world problem we can see with this approach though is that updating the 'unique' constraint on an entity would be quite an expensive operation (copying data between tables).

:)

Bill Karwin said...

Hi Mike,

What if your entity requires the combination of two attributes to be unique? For example, how would you model this table:

CREATE TABLE Playoffs (
game_id INT PRIMARY KEY,
team1 INT NOT NULL,
team2 INT NOT NULL
UNIQUE KEY (team1, team2)
);

Sam Scott said...

Just a thought on enforcing uniqueness in EAV for some attributes and not others:

Could you not just add a unique_key column with a trigger that looks up the attribute in the metadata, determines whether or not it is unique, and then either adds an incrementing number (not unique), or a constant 1 (unique)? You could then put a unique index across the columns attribute, value, unique_key. Whenever unique_key = 1, you have then effectively enforced unique values for that particular attribute; where unique_key increments, you can re-use the same value...

Does this make sense?

(You can't use this to enforce a unique constraint on a combination of two attributes, but you should only really be using constraints like that when you know that both attributes will definitely be present - in which case you could put them into standard columns, rather than into the dynamic EAV section of your schema.)

Bill Karwin said...

Hi Sam, thanks for your idea.

Yes, that would support uniqueness for some attributes but not others. But it just adds evidence that EAV is an example of the Inner-Platform Effect.

One definition of the Inner-Platform Effect is a result of designing a system to be so customizable that it ends becoming a poor replica of the platform it was designed with.

ronnie said...

I'm making a product review website, wherein, the columns for various varieties of products are 15-20 in number. They could increase too, since we'll be taking feedback from users as to what else attribute that describes the product to add as a column in the corresponding product category's table.

my question is, does this EAV table good for my need?
Say, for CARS, i would have "power", "torque", "handling" as hard coded columns. But in the other model, I'll have 3 columns - car_model, attribute_name, and attrib_value. Then in that table I could insert power, torque, handling as data in the attribute_name column, and their corresponding data in the same row under the attrib_value column.

Which sounds better>

Bill Karwin said...

Hi Ronnie,

I would choose the "Serialized LOB" pattern for your case. If you have numerous optional attributes of a product, and users may define new attributes, then you app should serialize all the product-specific attributes (think JSON or something) and store them in a BLOB or TEXT column.

See also my recent presentation Extensible Data Modeling with MySQL

NWest said...

Bill,

How does one go about convincing "agile" developers that EAV is a poor choice, given that they see it as a way to speed up their development time. They complain that there is a lot of code they have to write if they have to define each entity in their code... Is there any way to convince them that if they have a table with the name of "account_attributes" they are doing it wrong? Columns *are* attributes.

Bill Karwin said...

Hi @NWest, thanks for the question.

It might be worth pointing out to these developers that when they use EAV, they have to write a lot more custom application code to enforce data integrity.

If they have multiple applications using the same database, they have to implement similar data integrity code in each app (some projects even have multiple apps in multiple languages accessing the same data). That violates the agile principle of "don't repeat yourself."

The database enforces data integrity constraints with less code, since SQL is a declarative language, and enforce these rules consistently for all apps that use that database. This should be more appealing to agile development buffs.

Bill Karwin said...

Data integrity constraints include FOREIGN KEYs, but also UNIQUE keys and even NOT NULL. There's no way to enforce NOT NULL in EAV, that is, to make a specific attribute mandatory.

Cassiano Rabelo said...

Hello Bill.
Great article!

Unfortunately knowing what I shouldn't do, doesn't mean I know what to do :)

I need to create a database to store user generated surveys. Something very similar to Google Forms.
I'm completely lost on how to approach this problem. How would you design a scheme for something like this?
Thanks a lot!

Bill Karwin said...

Hi Cassiano,

It depends a lot on how you will use the data you store.

Check out my presentation on Extensible Data Modeling, which covers the pros and cons of several solutions.

For example, you could use Serialized LOB and just store a JSON encoding of the whole form. That might be sufficient, if the only thing the database does is store and retrieve the whole form data.

But if you need to search it, for example to count how many people gave a specific answer, then you should have some way of indexing the form fields for searching.

My presentation gives several alternatives to EAV, which can be useful in different circumstances.

Cassiano Rabelo said...

Thanks Bill! Very helpful presentation! Cheers!

Jaimie Sirovich said...

@Bill

Worth a note that in addition to hot table changes, MariaDB has added a native Serialized LOB support. PostgreSQL also has H-Store, which is a LOB + supports indexing via GiST. And I think you're talking about MSSQL's sparse columns when you speak of variable attributes. Very cool feature, actually.

And I despise the concrete inheritance idea but Maria has CONNECT, which could help here. PostgreSQL also has native support for class table inheritance, no?

CREATE TABLE capitals (
state char(2)
) INHERITS (cities);

RE: enterprise design patterns designed to handle them:
- Single Table Inheritance
- Concrete Table Inheritance
- Class Table Inheritance
- Serialized LOB, optionally supported by inverted indexes.
- Non-relational data stores that can handle data with variable attributes.

Jaimie Sirovich said...

Ah, I misread your last one and didn't realize you meant document stores. And you mentioned MariaDB's serialized LOB feature in your slide deck. PostreSQL never gets any love for its version with GiST indexing, though :) You can skip the hacky inverted index table. GiST does it for you generically.

Slide 69 is gold.

MS-SQL's compromise is actually interesting. I think MS-SQL allows for hot schema changes as well, so it's actually viable to just slap on a column and make it sparse if you know it's specialized. Then simply make your application dynamically add columns with a prefix or something to prevent collisions with built-ins.

What do you think of that?