Why should I not put all my data in one CosmosDB collection?

The problem
I have discovered that Cosmos DB is priced quite aggressively and can become expensive when used with many data types.
I would think that a good structure would be to put each data type I have in its own collection, almost like tables in a database (though not quite).
However, each collection costs at least 24 USD per month. That is if I choose "Fixed", which limits me to 10 GB and is NOT scalable. Hardly the point of Cosmos DB, so I would rather choose "Unlimited". However, here the price is at least 60 USD per month.
60 USD per month per data type.
This includes 1000 RU/s, but on top of this I have to pay more for consumption.
This might be OK if I have a few data types, but for a fully fledged business application with 30 data types (not at all uncommon), it becomes at least 1800 USD per month. As a starting price. When I have no data yet.
The question
The structure of the data in the collection is not strict. I can store different types of documents in the same collection.
When using an "Unlimited" collection, I can use partition keys to partition my data and ensure scalability.
However, why do I not just include the data type in the partition key?
Then the partition key becomes something like:
[customer-id]-[data-type]-[actual-partition-value, like 'state']
With one swift move, my minimum cost becomes 60 USD and the rest is based on consumption. Presumably, partition keys ensure satisfactory performance regardless of the data volume. So what am I missing? Is there some problem with this approach?
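To make this concrete, here is a rough sketch of what that layout could look like, using the azure-cosmos Python SDK. The account URL, key, container name and property names are placeholders, not anything prescribed by Cosmos DB:

    from azure.cosmos import CosmosClient, PartitionKey

    # Placeholder endpoint and key; one "Unlimited" container shared by every data type.
    client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
    database = client.create_database_if_not_exists("app")
    container = database.create_container_if_not_exists(
        id="shared",
        partition_key=PartitionKey(path="/partitionKey"),
        offer_throughput=1000,  # the 1000 RU/s minimum for the whole collection
    )

    # Each document carries the composite key: [customer-id]-[data-type]-[partition-value]
    container.upsert_item({
        "id": "order-001",
        "partitionKey": "customer42-order-WA",
        "type": "order",
        "state": "WA",
        "total": 99.50,
    })
    container.upsert_item({
        "id": "invoice-001",
        "partitionKey": "customer42-invoice-WA",
        "type": "invoice",
        "state": "WA",
        "amount": 99.50,
    })

    # A query that knows the full composite key stays inside one logical partition.
    orders = list(container.query_items(
        query="SELECT * FROM c WHERE c.type = 'order'",
        partition_key="customer42-order-WA",
    ))

Every data type lives in the same container; only the composite partition key value differs.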
@NickChapsas your comment sounds more like an answer, so maybe post it as one?
– Haspemulator
Jul 3 at 12:50
@Haspemulator Oh sorry, I wanted to write it as an answer but I clicked on the wrong area.
– Nick Chapsas
Jul 3 at 12:57
You can always start with a Fixed collection and migrate to an Unlimited collection once you're getting near the limits. Overprovisioning before you actually need it is, unsurprisingly, expensive.
– Imre Pühvel
Jul 4 at 13:11
1 Answer
No, there will be no problem per se.
It all boils down to whether you're fine with having 1000 RU/s, or more specifically a single throughput bottleneck, for your whole system.
In fact you can simplify this even more by making your document id the partition key. This guarantees the uniqueness of the document id and enables the maximum possible distribution and scale in Cosmos DB.
That's exactly how collection sharing works in Cosmonaut (disclaimer: I'm the creator of this project) and I have noticed no problems, even on systems with many different data types.
However, you have to keep in mind that even though you can scale this collection up and down, you still restrict your whole system to this one bottleneck. I would recommend that you don't just create one collection but rather 2 or 3 collections with shared entities in them. If this is done smartly and you group entities in a logical way, you can scale the throughput for specific parts of your system independently.
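As a rough illustration of that recommendation (this is not Cosmonaut's API, just a hand-written Python sketch with hypothetical names): partition on the document id and split entity types across two shared collections so their throughput can be scaled separately.

    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
    database = client.create_database_if_not_exists("app")

    # Two shared collections, grouped logically, each with its own throughput dial.
    transactional = database.create_container_if_not_exists(
        id="shared-transactional",               # orders, invoices, payments, ...
        partition_key=PartitionKey(path="/id"),  # the document id doubles as the partition key
        offer_throughput=2000,
    )
    reference = database.create_container_if_not_exists(
        id="shared-reference",                   # countries, settings, lookup data, ...
        partition_key=PartitionKey(path="/id"),
        offer_throughput=1000,
    )

    # Route each entity type to its collection; scale the two throughputs independently.
    CONTAINER_FOR_TYPE = {"order": transactional, "invoice": transactional, "country": reference}

    def save(entity_type: str, doc: dict) -> None:
        doc["type"] = entity_type
        CONTAINER_FOR_TYPE[entity_type].upsert_item(doc)

    save("order", {"id": "order-001", "customerId": "customer42", "total": 99.50})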
Is it really a bottleneck if we have different partitions? Let's say we have 10 documents in one partition. Is that faster than having 100 documents in 10 different partitions, if we just need the data from one partition?
– Niels Brinch
Jul 4 at 10:48
The throughput is collection-wide, not partition-wide. It doesn't matter how many partitions you have. If you are constantly pumping in and querying data, you might hit those 429 (rate limited) responses. Knowing the partition value only makes your queries faster and more cost-efficient.
– Nick Chapsas
Jul 4 at 11:26
"your document id to be the partition key" - beware, this would cause a inefficient fan-out queries across all partitions should you ever need a select by other than id.
– Imre Pühvel
Jul 4 at 13:08
@ImrePühvel Which will be a problem anyway if he goes with the [customer-id]-[data-type]-[actual-partition-value, like 'state'] approach.
– Nick Chapsas
Jul 4 at 13:09
Nick, throughput is OK, I will of course buy more RUs if I need more. Regarding the partition strategy that I propose, it would have thousands of documents in the same partition, as opposed to your proposed strategy where each document has its own partition. So it is not the same structure in terms of partitions.
– Niels Brinch
Jul 4 at 14:41
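To make the throughput and fan-out points from these comments concrete, here is a minimal sketch, again assuming the azure-cosmos Python SDK and one of the shared containers from the earlier sketches, partitioned on /id (all names are placeholders):

    # 'container' is one of the shared containers above, partitioned on /id.

    # Point read: the cheapest operation. With /id as the partition key the id and
    # partition key are the same value, so every logical partition holds one document.
    doc = container.read_item(item="order-001", partition_key="order-001")

    # Any lookup by another property is a cross-partition fan-out. It works, but it
    # touches every partition and draws on the same collection-wide RU/s budget, so
    # heavy use is what eventually produces 429 (rate limited) responses.
    orders = list(container.query_items(
        query="SELECT * FROM c WHERE c.customerId = @cid AND c.type = 'order'",
        parameters=[{"name": "@cid", "value": "customer42"}],
        enable_cross_partition_query=True,
    ))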