CSES Bookshop

https://cses.fi/problemset/task/1158

While I was solving this problem, I noticed that you could switch the order of the loops using the dynamic programming solution. However, when you iterate through all the weights first and then the books, you TLE. If you iterate through the books and then the weights, you get the correct answer. Why is this?

I read from this thread;

that it is because of memory. However, don’t both solutions use the same amount of memory? Also, as a side question to the first, in most dp problems, you can flip the loops as long as you have taken the necessary precautions. My question is when do we know when to use each ordering of the loops?

Could you send the code that TLE’s and the one that AC’s?

Here’s the one that AC’s

AC solution
#include <bits/stdc++.h>
using namespace std;
  
int main(){
    cin.tie(0)->sync_with_stdio(0);
    int n,x;
    cin >> n >> x;
    int dp[n+1][x+1], cost[n], pages[n];
    for(int i=0;i<n;i++)
        cin >> cost[i];
    for(int i=0;i<n;i++)
        cin >> pages[i];
    for(auto &i: dp)
        for(auto &j:i)
            j=0;
    for(int i=1;i<=n;i++){
        for(int j=1;j<=x;j++){
            dp[i][j]=dp[i-1][j];
            if(j-cost[i-1]>=0)
                dp[i][j]=max(dp[i][j],dp[i-1][j-cost[i-1]]+pages[i-1]);
        }
    }
    cout << dp[n][x];
}

Here’s the one that TLE’s

TLE solution
#include <bits/stdc++.h>
using namespace std;
 
int main(){
    cin.tie(0)->sync_with_stdio(0);
    int n,x;
    cin >> n >> x;
    int dp[n+1][x+1], cost[n], pages[n];
    for(int i=0;i<n;i++)
        cin >> cost[i];
    for(int i=0;i<n;i++)
        cin >> pages[i];
    for(auto &i: dp)
        for(auto &j:i)
            j=0;
    for(int j=1;j<=x;j++){
    /* for(int i=1;i<=n;i++){ */
        for(int i=1;i<=n;i++){
        /* for(int j=1;j<=x;j++){ */
            dp[i][j]=dp[i-1][j];
            if(j-cost[i-1]>=0)
                dp[i][j]=max(dp[i][j],dp[i-1][j-cost[i-1]]+pages[i-1]);
        }
    }
    cout << dp[n][x];
}

My best guess is it has something to do with this.

I still don’t understand because I flipped the rows and columns of the dp array as the blog post suggested. I still got TLE. Is there anything else that I am getting incorrect?

New TLE Solution
#include <bits/stdc++.h>
using namespace std;
 
int main(){
    cin.tie(0)->sync_with_stdio(0);
    int n,x;
    cin >> n >> x;
    int dp[x+1][n+1], cost[n], pages[n];
    for(int i=0;i<n;i++)
        cin >> cost[i];
    for(int i=0;i<n;i++)
        cin >> pages[i];
    for(auto &i: dp)
        for(auto &j:i)
            j=0;
    for(int j=1;j<=x;j++){
    /* for(int i=1;i<=n;i++){ */
        for(int i=1;i<=n;i++){
        /* for(int j=1;j<=x;j++){ */
            dp[j][i]=dp[j][i-1];
            if(j-cost[i-1]>=0)
                dp[j][i]=max(dp[j][i],dp[j-cost[i-1]][i-1]+pages[i-1]);
        }
    }
    cout << dp[x][n];
}

not the specific optimization in the blog, moreso that the loop order may affect cache hits.

Ah ok. Also, how do we know which ordering of the loop to use? In theory, both ways should work.

I believe the first solution makes \Theta(\frac{NX}{B}) L1 cache replacements, where B=64\text{ bytes} is the cache line size (proportional the number of consecutive array elements that you can read from main memory into the cache at the same time), while the second makes \Theta(NX) L1 cache replacements.

As the first solution processes consecutive j in order, it only needs to read in a new cache line after every \approx B values of j. Whereas the second would need to read a new cache line after every iteration of i because N is pretty large. As L1 cache is only ~64 KB large, you can’t fit N cache lines into memory at the same time; so if you access dp[1][j], dp[2][j], \ldots, dp[N][j], dp[1][j+1] in that order the cache line containing dp[1][j] will already have been evicted from L1 cache when you access dp[1][j+1].

EDIT: Actually N isn’t that large (I originally thought it was 10^5 instead of 10^3), but I think this reasoning still applies.

As for the third solution I think the number of cache misses is still \Theta(NX).

You can find some more information about analysis of cache misses here (might not be that beginner-friendly though).

Btw the “AC” solution doesn’t compile …

EDIT: Nvm it seems fine, maybe it was just a glitch on my end.

Thank you so much! BTW, I think the solution should compile for AC.