Feedback on Denormalized Model vs Normalized Model

I’m working on the backend of a Mobile Application that allows users to create public or private Parties and upload Video Stories for specific Parties.
I have previously used Scylla but I hit some roadblocks, like TTL expiration not being reported in CDC and no geo search support. This pushed me towards finding out whether Aerospike is suitable for our Data Model.

I have sketched out our Data Model once Denormalized and once Normalized and would appreciate some feedback regarding which Model better fits Aerospike with our Access patterns. Or there might be a third Model that fits better than the two below.

Access Patterns:

CRUD on Parties and Stories through their id.
Get all Parties of a User (Ordered by latest first).
Get all Stories of a User (Ordered by latest last).
Get all Stories of a Party (Ordered by latest last).

Denormalized

Party {
  id: string,
  user_id: string,
  position: []float,
  title: text,
  is_public: boolean,
  address: {
    street_address: string,
    postal_code: string,
    state: string,
    country: string,
  },
  created_at: string,
  stories: [
    {
      id: string,
      user_id: string,
      position: []float,
      url: text,
      tagged_friends: []string,
      created_at: string,
   }
  ]
}

Normalized

Party {
  id: string,
  user_id: string,
  position: []float,
  title: text,
  is_public: boolean,
  address: {
    street_address: string,
    postal_code: string,
    state: string,
    country: string,
  },
  stories: []string, (list of story ids)
}

Story {
  id: string,
  user_id: string,
  position: []float,
  url: text,
  tagged_friends: []string,
  created_at: string,
}

To achieve getting a User’s Parties/Stories while still being able to access a Party/Story by its id, I think we would have to use the Normalized Model with a Secondary Index on user_id.
But I don’t know if ordering with a Secondary Index is possible, since we want to return a User’s latest Party first and their latest Story last.
We of course also want to be able to paginate the requests when getting a User’s Stories/Parties.

A denormalized schema with composite ids to allow direct access to the respective objects may be one solution. A party_id has the user_id in it as a prefix, and a story_id has the party_id in it as a prefix. With this, you can:

  • Store a user’s parties (and associated stories) in a user-specific record as a nested map.
  • User parties can be stored as a Map {party_id → party_object}, and a party’s stories as a Map {story_id → story_object}. This scheme allows direct access to a party/story object through its id, enabling CRUD operations by id. Retrieval of all parties by user and all stories by party should be straightforward.
  • Store a user’s story ids in a separate bin (or record) as a List. Story create/delete will need to update this list. Retrieval of user stories can be done using a batch retrieval (or multiple gets) with the user’s story ids.
  • For the desired retrieval “rank” order as well as storage efficiency, party_object and story_object can be represented as a List that stores the fields of the object in a specific order. Map/List operations allow range retrieval for pagination.
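The composite-id idea above can be sketched in plain Python (no Aerospike API involved; ids, field order, and the nested-map layout here are illustrative assumptions, not a prescribed format):

```python
import uuid

def new_party_id(user_id: str) -> str:
    # user_id prefix lets us recover the owning user record from the id alone
    return f"{user_id}:{uuid.uuid4().hex}"

def new_story_id(party_id: str) -> str:
    # party_id prefix (which itself embeds the user_id) locates the party
    return f"{party_id}:{uuid.uuid4().hex}"

# One record per user: parties as {party_id -> party_object}, where a
# party_object is a List with fields in a fixed order, the last element
# being its {story_id -> story_object} map.
user_record = {"parties": {}}

party_id = new_party_id("user-1")
# [position, title, is_public, address, created_at, stories-map]
user_record["parties"][party_id] = [
    [52.5, 13.4], "Rooftop", True, {}, "2023-01-01T20:00:00Z", {},
]

story_id = new_story_id(party_id)
stories = user_record["parties"][party_id][5]
# [position, url, tagged_friends, created_at]
stories[story_id] = [[52.5, 13.4], "https://cdn.example/clip.mp4", [], "2023-01-01T21:00:00Z"]

# Direct access by id: split the composite id to find the user record first.
owner = story_id.split(":")[0]
assert owner == "user-1"
```

In the real thing, the nested maps would be key-ordered Aerospike CDT maps so that range operations can serve the pagination.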

The normalized solution may appear simpler, and you can use secondary indexes on user_id (and even party_id), but it seems sorting and pagination must be handled on the client side.
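What client-side sort + paginate would look like, as a minimal sketch (the inline records stand in for a secondary-index query result; in a real app they would come from the client driver):

```python
# Records as returned (unordered) by a secondary-index query on user_id.
records = [
    {"id": "p1", "created_at": "2023-01-01T10:00:00Z"},
    {"id": "p2", "created_at": "2023-03-05T10:00:00Z"},
    {"id": "p3", "created_at": "2023-02-11T10:00:00Z"},
]

def page(records, page_num, page_size, newest_first=True):
    # ISO-8601 timestamps compare correctly as plain strings
    ordered = sorted(records, key=lambda r: r["created_at"], reverse=newest_first)
    start = page_num * page_size
    return ordered[start:start + page_size]

parties_page_1 = page(records, 0, 2)                      # latest Party first
stories_page_1 = page(records, 0, 2, newest_first=False)  # latest Story last
```

The catch is that every page requires fetching (or at least scanning) the full result set for the user before sorting, which is exactly what the denormalized ordered-map layout avoids.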

Nothing about this seems like a fit for a key-value DB. What is your reason for not using a relational DB for clearly relational data?

I wanted to have TTL support, since Stories are time-based, mostly 24 hours.
Geo searching is also used in our main feature: showing Parties on a map around the user’s position. So I previously used Scylla, which allowed me to model the different access patterns (through composite keys & materialized views), but as I described, I hit some roadblocks like TTL not being reported in CDC & no geo searching.
I’m interested in the TTL expiration since I want to replicate the Parties in a search engine (Vespa) for personalization/recommendation, and when a Party is deleted, either manually or through TTL, this also needs to be reflected there.
So the reasons for not choosing a Relational Database are:

  1. No TTL/geo search support (except PostGIS, but I’ve heard it’s slow).
  2. Hard to scale.
  3. I have a few other features, like friend relations and story inboxes, which are a perfect fit for key/value DBs, so for ease of deployment I tried putting all the data in one DB cluster.

But if you have some recommendations for solving Geo Searching and TTL with a relational Database, I would of course choose Aerospike only for the fitting use cases and move the other data model to a Relational DB.

I generally warn against saying “I have a few super cool use cases so I’m redoing my entire data model to be compatible with this DB that does that thing”. Instead, put all your data in a normalized DB. Write only the parts of that data that are useful into Aerospike, in a format that is most useful for Aerospike, with enough info to show your map or other “realtime” stuff plus the key from the relational DB. Then on click or interaction go back to the relational DB. Basically, you would write a key into Aerospike with bins for locations. Friend sets are fine, but even then, will you ever show 100000 friends? Or 10? Query Aerospike for just the relation list. Query the relational DB for the detail. With paging you’re only ever asking for 10-25 at a time. The extra step is likely not meaningful.
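The pattern being suggested, sketched with sqlite3 standing in for the relational DB and a plain dict standing in for the Aerospike side (table, record, and function names are made up for illustration):

```python
import sqlite3

# Stand-in for Aerospike: one record per user holding only the friend id
# list (the cheap KV read). Detail rows live in the relational DB.
as_store = {("friends", "user-1"): {"ids": ["u2", "u3", "u4"]}}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [("u2", "Ada"), ("u3", "Grace"), ("u4", "Linus")])

def friends_page(user_id, offset, limit):
    # 1) page through the id list held in the KV store
    ids = as_store[("friends", user_id)]["ids"][offset:offset + limit]
    # 2) fetch detail for just that page from the relational DB
    placeholders = ",".join("?" * len(ids))
    query = f"SELECT id, name FROM users WHERE id IN ({placeholders})"
    return db.execute(query, ids).fetchall()

# Counting never needs the detail: count the IDs in the KV record alone.
friend_count = len(as_store[("friends", "user-1")]["ids"])
```

With page sizes of 10-25, the second query is a tiny `IN (...)` lookup by primary key, which is why the extra hop is usually not meaningful.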

If you ever say “but I need to show a user X thousands of results from Aerospike and join them into the relational DB” then that’s a smell and you’re likely doing something else wrong.

If you need to count friends, count the IDs in AS. Counts don’t need the detail; counting the IDs is fine.

@JonasLevin, agree with your conclusions.

It is worth emphasizing that Aerospike:

  • Natively supports the hierarchical (document) data model,
  • Has TTL and Geosearch capabilities,
  • Has CDT operations for TTL-based deletion of stories, which can be piggybacked onto a multi-op request for another Party or Story operation that you will need to perform anyway (that is, no additional TTL maintenance overhead for non-record objects),
  • Eliminates the two-database “solution”: pick one database that has the right model, scale, and functionality. All the best.
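To make the piggybacked-expiry point concrete, here is a pure-Python simulation of the idea (in Aerospike this would be a CDT map remove-by-range bundled into the same multi-op request; the function and field names are illustrative):

```python
import time

def touch_party(party, now=None, ttl_seconds=24 * 3600):
    # Whenever a party record is read or written anyway, drop stories whose
    # created_at has passed the TTL in the same operation, instead of running
    # a separate maintenance pass over all records.
    now = time.time() if now is None else now
    cutoff = now - ttl_seconds
    party["stories"] = {sid: s for sid, s in party["stories"].items()
                        if s["created_at"] >= cutoff}
    return party
```

Because every read/write of a party flows through the same record, expired stories get swept out lazily at no extra request cost.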