database greenhorn

hi my dears, I have an issue at work where we have to work with millions (~150 mln) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 mln rows the server becomes quite slow and wait/buffer time climbs above 7000 ms/sec, which is tearing down our complete setup of various microservices that read, write and delete from the tables continuously. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare each batch against the SQL table. The other culprit is that our server runs on an HDD, which at ~150 MB/s read and write is probably at its limit.

The question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

edit: I know that my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for every bit of feedback.

49 comments
    • spend time to generate/optimize your indexes.
    • faster storage/cpu/ram for your rdbms
    • get the data needed by specific services into the service itself; only fetch it from a central place if you have to (spinning up a new instance, another service changing the state of data you need, which is a warning sign in itself that your architecture is brittle...)
    • faster storage/cpu/ram
    • generate indexes
    • 2nd level cache shared between services
    • establish a faster datastore for often-requested data that's used by multiple services (that might be something like Redis, or another RDBMS on beefier hardware)
    • optimize queries
    • generate indexes
    • faster storage/cpu/ram
    • first of all, many thanks for the bullets. Good to have some guidance on where to start.

      2nd level cache shared between services

      I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping it in memory to work with, right? So we accept that the cached data is outdated to some extent?

      faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

      I was able to convince management to put money into a new server (SSD, thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

      generate indexes

      How would indexes help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint, or what is the difference compared to checking a given list of rows against an existing table (let's say on column id)?

      • I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping it in memory to work with, right? So we accept that the cached data is outdated to some extent?

        Correct, introducing caching can result in returning outdated data for a while, which is usually not a huge deal. Those caches can get tricky, but they should take pressure off your DB if your scenario is read-heavy, which is often the case. Research existing caching solutions before running ahead and implementing something from scratch, especially if you need a cache distributed between multiple instances of your service. In the Java world that would be something like Infinispan, but your ecosystem might offer better integration with other solutions.

        I was able to convince management to put money into a new server (SSD, thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

        Having management on board is great and the new hardware should help a lot. Migrating to another RDBMS sounds scary, but it's probably worth it if your organisation has more expertise with it.

        generate indexes

        They won't help you with your duplicates; they will help speed up your reads but could slow down writes. Building a good index is not trivial, but nothing is when it comes to performance tuning a database, it's tradeoff after tradeoff. The best way to handle identical rows of data is usually to not write them in the first place, but I don't know your system nor its history, maybe there is or was a good reason for its current state.

  • Indexes and pagination would be good starts

    • with pagination you mean splitting the query into chunks while comparing a given data set with a whole table?

      • yes? maybe, depending on what you mean.

        Let's say you're doing a job and that job will involve reading 1M records or something. Pagination means you grab N records at a time, say 1000, in multiple queries as the job progresses.

        Reading your post again to try and get context, it looks like you're identifying duplicates as part of a job.

        I don't know what you're using to determine a duplicate, if it's structural or not, but since you're running on HDDs, it might be faster to get that information into ram and then do the job in batches and update in batches. This will also allow you to do things like writing to the DB while doing CPU processing.

        BTW, your hard disks are going to be your bottleneck unless you're reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That'll blow any other suggestion I have out of the water.

        BUT! there are ways to help things out. I don't know what language you're working in. I'm a dotnet dev, so I can answer some things from that perspective.

        One thing you may want to do, especially if there's other traffic on this server:

        • use WITH (NOLOCK) so that you're not stopping other reads and writes on the tables you're looking at
        • use pagination, either with windowing or OFFSET/FETCH (LIMIT/OFFSET outside T-SQL), to grab only a certain number of records at a time - a sketch of such a query follows this list
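
        As a rough sketch, a paginated read in T-SQL could look like the query below. The table and column names are made up, and OFFSET/FETCH needs a deterministic ORDER BY so pages don't overlap:

            SELECT p.id, p.sku, p.payload
            FROM dbo.products p WITH (NOLOCK)    -- dirty reads are acceptable for this background job
            ORDER BY p.id                        -- stable ordering so batches don't overlap or skip rows
            OFFSET @offset ROWS
            FETCH NEXT @limit ROWS ONLY;

        Each call to readBatchFromDb in the sketch further down could run a query like this with an advancing @offset.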

        Use a HashSet (this can work if you have record types) or some other method of equality that's property based. Many Dictionary/HashSet types can take some kind of equality comparer.

        So, what you can do is asynchronously read from the disk into memory and start some kind of processing job. If this job also doesn't require the disk, you can do another read while you're processing. Don't do a write and a read at the same time since you're on HDDs.

        This might look something like:

        // first pass: paginated reads with an in-memory "seen" set,
        // overlapping the next read with processing of the current batch

        var offset = 0;
        var limit = 1000;

        var task = readBatchFromDb(offset, limit);
        var result = await task;

        // if you only care about the equality and not the data after use,
        // you can just store the hash codes instead of the full objects
        var seen = new HashSet<YourType>(new YourTypeEqualityComparer());

        while (!result.IsEmpty)
        {
            offset = advance(offset);
            task = readBatchFromDb(offset, limit);   // start the next read batch

            var dataToWork = result.Where(x => !seen.Contains(x)).ToList();   // don't rework anything we've already seen
            seen.UnionWith(result);

            var dataToWrite = doYourThing(dataToWork);

            // don't write while reading
            result = await task;

            await writeToDb(dataToWrite);   // so we never read and write at the same time
                                            // (lost optimization: no CPU work happens during this write)
        }

        // Let's say you can set up a read/write job queue to keep the disk busy

        abstract class IoJob
        {
            public sealed class ReadJob(int offset, int limit) : IoJob
            {
                public int Offset { get; } = offset;
                public int Limit { get; } = limit;
                public Task<Data> ReadTask { get; set; }
            }

            public sealed class WriteJob(Data data) : IoJob
            {
                public Data Data { get; } = data;
                public Task WriteTask { get; set; }
            }
        }

        // starts the IO for a job (if not already started) and stashes the task so it can be awaited later
        void executeJob(IoJob job)
        {
            switch (job)
            {
                case IoJob.ReadJob rj when rj.ReadTask == null:
                    rj.ReadTask = readBatchFromDb(rj.Offset, rj.Limit);
                    break;
                case IoJob.WriteJob wj when wj.WriteTask == null:
                    wj.WriteTask = writeToDb(wj.Data);
                    break;
            }
        }

        var jobs = new Queue<IoJob>();   // FIFO, not a Stack, so the work stays in order

        jobs.Enqueue(new IoJob.ReadJob(offset, limit));
        jobs.Enqueue(new IoJob.ReadJob(advance(offset), limit));   // get the second read ready to start

        var job = jobs.Dequeue();
        do
        {
            executeJob(job);                                    // make sure the current job's IO has started
            if (jobs.TryPeek(out var next)) executeJob(next);   // and kick off the next one so the disk stays busy

            if (job is IoJob.ReadJob rj)
            {
                var batch = await rj.ReadTask;
                if (batch.IsEmpty) continue;

                jobs.Enqueue(new IoJob.ReadJob(advance(rj.Offset), rj.Limit));

                var dataToWork = batch.Where(x => !seen.Contains(x)).ToList();   // or something to not rework any objects
                seen.UnionWith(batch);

                var dataToWrite = doYourThing(dataToWork);
                jobs.Enqueue(new IoJob.WriteJob(dataToWrite));
            }
            else if (job is IoJob.WriteJob wj)
            {
                await wj.WriteTask;
            }
        } while (jobs.TryDequeue(out job));
          
  • Lotta smarter people than me have already posted better answers in this thread, but this really stood out to me:

    The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare each batch against the SQL table.

    Why aren't you de-duping the table before processing? What's inserting these duplicates and why are they necessary to the table? If they serve no purpose, find out what's generating them and stop it, or write a pre-load script to clean it up before your core processing queries access that table. I'd start here - it sounds like what's really happening is that you've got a garbage query dumping dupes into your table and bloating your db.

  • To paraquote H. L. Mencken: For every problem, there is a solution that's cheap, fast, easy to implement -- and wrong.

    Silver bullets and magic wands don't really exist, I'm afraid. There are ample reasons for DBAs being well-paid people.

    There are basically three options: increase the hardware capabilities so they can handle the amount of data you want to deal with, decrease the amount of data so that the hardware you've got can handle it at the level of performance you want, or... live with the status quo.

    If throwing more hardware at the issue was an option, I presume you would just have done so. As for how to viably decrease the amount of data in your active set, well, that's hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series? If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage? Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?
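
    As one hedged illustration of that last idea in PostgreSQL (which comes up elsewhere in the thread; the table, column and path names here are invented): seldom-needed details can live in their own table on a tablespace that points at a separate physical drive.

        -- a tablespace on a second physical drive (hypothetical mount point)
        CREATE TABLESPACE bulk_storage LOCATION '/mnt/disk2/pgdata';

        -- the hot table keeps only what every request needs;
        -- rarely used details move out to the slower drive
        CREATE TABLE product_details (
            product_id  bigint PRIMARY KEY REFERENCES products (id),
            description text,
            raw_payload jsonb
        ) TABLESPACE bulk_storage;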

    • To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

      This can be the new slogan of our development. :')

      I have convinced management to switch to a modern server. In addition, we hope that refactoring our approach (no random reads, no dedupe passes over the whole table, etc.) will lead us somewhere.

      As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

      Actually, no. We are adding a layer of product processing to a system that is already in production and already handles multiple millions of products on a daily basis. Since we not only have to process the new/updated products but also have to catch up with processing the historical (older) products, it's a massive amount of products. We thought that since the order is not important we could use a random approach to catch up. But I see now that this is a major bottleneck in our design.

      If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

      so no. No narrowing.

      Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

      Also no, IMO. Since we don't want a product to be processed twice, we want to ensure deduplication - this requires knowledge of all already processed products, and therefore comparing with the whole table every time.

  • If you are new to something and want to learn, seek resources and educate yourself with them. Learning takes time, and there are no shortcuts.

    A hot DB should not run on HDDs. Slap some NVMe storage into that server if you can. If you can't, consider getting a new server and migrating to it.

    SQL Server can generate execution plans for you. For your queries, generate those, and see if you're doing any operations that involve iterating the entire table. You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.
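
    For example, in T-SQL you can ask for the plan of a suspect query like this (the SELECT is just a placeholder) and then hunt for Table Scan or Clustered Index Scan operators over the big tables; an Index Seek is what you want to see:

        SET SHOWPLAN_TEXT ON;   -- while this is ON, queries are not executed; their estimated plans are returned
        GO
        SELECT id FROM dbo.products WHERE sku = 'ABC-123';
        GO
        SET SHOWPLAN_TEXT OFF;
        GO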

    If you want to do some kind of dupe protection, let the DB do it for you. Create an index and a table constraint on the relevant columns. If the data is too complex for that, find a way to make it work, like generating and storing hashes, sorting lists/dicts, etc., just so that the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).
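
    A minimal sketch of that in PostgreSQL, assuming (purely for illustration) that a (sku, supplier) pair is what defines "the same product":

        -- let the DB enforce uniqueness on the duplicate-defining columns
        ALTER TABLE products
            ADD CONSTRAINT products_sku_supplier_uq UNIQUE (sku, supplier);

        -- batch inserts can then skip duplicates instead of aborting
        INSERT INTO products (sku, supplier, payload)
        VALUES ('ABC-123', 'acme', 'raw product data')
        ON CONFLICT (sku, supplier) DO NOTHING;

        -- if "duplicate" is defined by many or very wide columns, store a hash of them
        -- (computed in the application or via a generated column) and put the unique index on that instead

    The equivalent in SQL Server would be a UNIQUE constraint plus a filtered insert or MERGE, since there is no ON CONFLICT there.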

    For read-heavy workflows, consider whether caches or read replicas will benefit you.

    And finally back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don't take the time to educate yourself on it.

    • A hot DB should not run on HDDs. Slap some NVMe storage into that server if you can. If you can’t, consider getting a new server and migrating to it.

      Did this because of the convincing replies in this thread. Migrating to modern hardware and switching from SQL Server to PostgreSQL (because it's already used by the other system we work with, and there is know-how available in this domain).

      You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.

      But how can we then ensure that I am not adding/processing products which are already in the "final" table, when I have no knowledge about ALL the products which are in this final table?

      Create an index and a table constraint on the relevant columns. ... just so that the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).

      This is helpful and also what I experienced. At the peak of the period where the server was overloaded, the CPU load was pretty much zero - all the load was disk reads/writes, which was because of our poor query design/architecture.

      For read-heavy workflows, consider whether caches or read replicas will benefit you.

      Could you elaborate on what you mean by read replicas? Storage in memory?

      And finally back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don’t take the time to educate yourself on it.

      Yes, I will swallow the pill. But thanks to the replies here I now have many starting points.

      RTFM is nice - but starting with page 0 is overwhelming.

      • But how can we then ensure that I am not adding/processing products which are already in the "final" table, when I have no knowledge about ALL the products which are in this final table?

        Without knowledge about your schema, I don't know enough to answer this. However, the database doesn't need to scan all rows in a table to check if a value exists if you can build an index on the relevant columns. If your products have some unique ID (or tuple of columns), then you can usually build an index on those values, which means the DB builds what is basically a lookup table for those indexed columns.

        Without going into too much detail, you can think of an index as a way for a DB to make a "contains" (or "retrieve") operation drop from O(n) (check all rows) to some much faster speed like O(log n) for example. The tradeoff is that you need more space for the index now.

        This comes with an added benefit that uniqueness constraints can be easily enforced on indexed columns if needed. And yes, your PK is indexed by default.
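
        A small PostgreSQL sketch of that (table and column names invented): with the index in place, the existence check is an index lookup rather than a scan over all 150 mln rows, and EXPLAIN will show which one the planner picked.

            -- lookup structure for the duplicate check
            CREATE INDEX products_final_product_id_idx ON products_final (product_id);

            -- "is this product already in the final table?"
            SELECT EXISTS (
                SELECT 1 FROM products_final WHERE product_id = 123456
            );

            -- shows Index Only Scan / Index Scan vs. Seq Scan
            EXPLAIN SELECT 1 FROM products_final WHERE product_id = 123456;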

        Read more about indexes in Postgres's docs. It actually has pretty readable documentation in my experience. Or read a book on indexes, watch a video, etc. The concept is universal.

        Could you elaborate on what you mean by read replicas? Storage in memory?

        This highly depends on your needs. I'll link PG's docs on replication though.

        If you're migrating right now, I wouldn't think about this too much. Replicas basically are duplicates of your database hosted on different servers (ideally in different warehouses, or even different regions if possible). Replicas work together to stay in sync, but depending on the kind of replica and the kind of query, any replica may be able to handle an incoming query (rather than a single central database).

        If all you need are backups though, then replicas could be overkill. Either way, you definitely don't want prod data all stored in a single machine, usually. I would talk to your management about backup requirements and potentially availability/uptime requirements.

  • When detecting duplicates gets expensive, the secret is to process them anyway, but in a way that de-duplicates the result of processing them.

    Usually, that means writing the next processing step into a (new) table whose primary key contains every detail that could make a record a duplicate.

    Then, as all the records are processed, just let it overwrite that same record with each duplicate.

    The resulting table is a list of keys containing no duplicates.

    (Tip: This can be a good process to run overnight.)

    (Tip: be sure the job also marks each original record as processed/deduped, so the overnight job only ever has to look at new unprocessed/un-deduped records.)

    Then, we drive all future processing steps from that new de-duplicated table, joining back to only whichever of the duplicate records was processed last for the other record details. (Since they're duplicates anyway, we don't care which one wins, as long as only one does.)

    This tends to result in a first single pass through the full data set to create the de-duplicated list, and then a second pass through the de-duplicated list for all remaining steps. So roughly 2n processing time.

    (But the first n can be a long running background job, and the second n can be optimized by indexes supporting the needs of each future processing step.)
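
    A hedged PostgreSQL sketch of that flow (table names, key columns and the processed flag are all invented for illustration; in Postgres "just let it overwrite" is spelled ON CONFLICT ... DO UPDATE):

        -- the new table's key contains every detail that can make a record a duplicate
        CREATE TABLE deduped_products (
            sku          text        NOT NULL,
            supplier     text        NOT NULL,
            source_id    bigint      NOT NULL,               -- whichever duplicate won last
            processed_at timestamptz NOT NULL DEFAULT now(),
            PRIMARY KEY (sku, supplier)
        );

        -- overnight pass: one row per key survives; duplicates simply overwrite it
        INSERT INTO deduped_products (sku, supplier, source_id)
        SELECT DISTINCT ON (sku, supplier) sku, supplier, id
        FROM products
        WHERE deduped = false
        ORDER BY sku, supplier, id DESC                      -- "last one wins" within this batch
        ON CONFLICT (sku, supplier)
        DO UPDATE SET source_id = EXCLUDED.source_id, processed_at = now();

        -- mark the originals so the next run only looks at new, un-deduped records
        UPDATE products SET deduped = true WHERE deduped = false;

    Future processing steps then read from deduped_products and join back to products via source_id, as described above.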
