entity-attribute-value design in PostgreSQL – don’t do it! – CYBERTEC

good (?) reasons to use an entity-attribute-value design — © Laurenz Albe 2021

Customers have often asked me what I think of “Entity-attribute-value” (EAV) design. So I thought it would be a good idea to lay down my opinion in writing.

What is entity-attribute-value design?

Customers have often asked me what I think of “Entity-attribute-value” (EAV) design. So I thought it would be a good idea to lay down my opinion in writing.

The idea is not to create a table for each entity in the application. Rather, you store each attribute as a separate entry in an attribute table:

CREATE TABLE objects (
objectid bigint PRIMARY KEY
/* other object-level properties */
);

CREATE TABLE attstring (
objectid bigint
REFERENCES objects ON DELETE CASCADE NOT NULL,
attname text NOT NULL,
attval text,
PRIMARY KEY (objectid, attname)
);

CREATE TABLE attint (
objectid bigint
REFERENCES objects ON DELETE CASCADE NOT NULL,
attname text NOT NULL,
attval integer,
PRIMARY KEY (objectid, attname)
);

/* more tables for other data types */

The name of the model is derived from the “att...” tables, which have the three columns: “entity ID”, “attribute name” and “value”.

There are several variations of the basic theme, among them:

omit the objects table
add additional tables that define “object types”, so that each type can only have certain attributes

Why would anybody consider an entity-attribute-value design?

The principal argument I hear in support of the EAV design is flexibility. You can create new entity types without having to create a database table. Taken to the extreme, each entity can have different attributes.

I suspect that another reason for people to consider such a data model is that they are more familiar with key-value stores than with relational databases.

Performance considerations of entity-attribute-value design

In my opinion, EAV database design is the worst possible design when it comes to performance. You will never get good database performance with such a data model.

The only use cases where EAV shines are when it is used as a key-value store.

`INSERT`

Inserting an entity will look like this:

INSERT INTO objects (objectid) VALUES (42);

INSERT INTO attstring (objectid, attname, attval)
VALUES (42, ‘name’, ‘myobject’);

INSERT INTO attint (objectid, attname, attval)
VALUES (42, ‘start’, 100),
(42, ‘end’, 1000);

That means that we insert four rows into three tables and have four index modifications. Also, the three statements will require three client-server round trips. You can save on the round trips by using CTEs to turn that into a single statement, or by using the new pipeline mode of libpq. Still, it will be much more expensive than inserting a single table row.

`DELETE`

If you use cascading delete, you could do that in a single statement:

DELETE FROM objects WHERE objectid = 42;

Still, you will end up deleting four table rows and modifying four indexes. That’s much more work than deleting a single table row.

`UPDATE`

A single-column update could actually be faster with the EAV design, because only one small table row is modified:

UPDATE attint

SET attval = 2000

WHERE objectid = 42 AND attname = ‘end’;

But if you have to modify several columns, you will need to run several UPDATE statements. That will be slower than if you only had to modify a single (albeit bigger) table row.

`SELECT`

Querying the attributes of an entity requires a join:

SELECT as.attval AS "name",
       ai1.attval AS "start",
       ai2.attval AS "end"
FROM objects AS o
   LEFT JOIN attstring AS as USING (objectid)
   LEFT JOIN attint AS ai1 USING (objectid)
   LEFT JOIN attint AS ai2 USING (objectid)
WHERE objectid = 42
  AND as.attname = 'name'
  AND ai1.attname = 'start'
  AND ai2.attname = 'end';

Alternatively, you could run three separate queries, one for each attribute. No matter how you do it, it will be less efficient than a single-row SELECT from a single table.

Single-column aggregates

As an example for a query that might be faster with the EAV model, consider a query that aggregates data from a single column:

SELECT sum(attval) AS total
FROM othertab
   JOIN attint USING (objectid)
WHERE othertab.col = 'x'
  AND attint.attname = 'attendants';

With a covering index on attint(objectid, attname) INCLUDE (attval), this could be quite a bit faster than aggregating a column from a wider table.

More complicated queries

After these examples, it is clear that writing more complicated queries will be a pain with the EAV design. Imagine a simple join:

SELECT e1a1.attval AS person_name,
       e1a2.attval AS person_id,
       e2a1.attval AS address_street,
       e2a2.attval AS address_city
FROM attint AS e1a2
   JOIN attstring AS e1a1
      ON e1a2.objectid = e1a1.objectid
   LEFT JOIN attint AS e2a0
      ON e1a2.attval = e2a0.attval
   LEFT JOIN attstring AS e2a1
      ON e2a0.objectid = e2a1.objectid
   LEFT JOIN attstring AS e2a2
      ON e2a0.objectid = e2a2.objectid
WHERE e1a1.attname = 'name'
  AND e1a2.attname = 'persnr'
  AND e2a0.attname = 'persnr'
  AND e2a1.attname = 'street'
  AND e2a2.attname = 'city';

If you think that this query is hard to read, I agree with you. In a normal relational data model, the same operation could look like this:

SELECT person.name AS person_name,
       persnr AS person_id
       address.street,
       address.city
FROM person
   LEFT JOIN address
      USING (persnr);

You can guess which query will perform better.

But we need an entity-attribute-value design for flexibility!

Relational data models are not famous for their flexibility. After all, that is the drive behind the NoSQL movement. However, there are good ways to deal with variable entities.

Creating tables on the fly

Nothing keeps you from running statements like CREATE TABLE and CREATE INDEX from your application. So if there is a limited number of entity types, and each type has a certain number of attributes, you can easily model that with a traditional relational model.

Certain problems remain:

A data model that grows on the fly may not end up being well-designed. But that’s not different in the entity-attribute-value design.
If the application has to create tables, it needs permission to do so. But today, when many applications create their own database tables anyway, few people will worry about that.

Creating tables on the fly will only work well if the set of attributes for each entity is well-defined. If that is not the case, we need a different approach.

Using JSON for a flexible data model

PostgreSQL has extensive JSON support that can be used to model entities with a variable number of attributes.

For that, you model the important and frequently occurring attributes as normal table columns. Then you add an additional column of type jsonb with a GIN index on it. This column contains the “rare attributes” of the entity as key-value pairs.

When using a model like this, you should take care that attributes

used in joins
on which you need a database constraint
that you want to use in a WHERE condition with an operator different from “=”

are modeled as regular table columns.

Conclusion

Avoid entity-attribute-value designs in your relational database. EAV causes bad performance, and there are other ways to have a flexible data model in PostgreSQL.

Source: Laurenz Albe @ cybertec-postgresql.com

What is entity-attribute-value design?

Why would anybody consider an entity-attribute-value design?

Performance considerations of entity-attribute-value design

INSERT

DELETE

UPDATE

SELECT