Results tables look objective. They're not.
Learn to read what they hide, not just what they show.
Datasets: standard benchmarks or cherry-picked? Dense or sparse? How old are they?
Baselines: are the strongest recent methods included? Are the implementations fair?
Metrics: do the authors report only the metrics where they win? Are the evaluation protocols valid?
Which component, when removed, hurts the most? That's where the real contribution is — and your research opportunity.
Find the weakest result first. Which dataset, metric, or backbone shows the smallest gap over the best baseline? That's where the method is most fragile.
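A minimal sketch of that search, assuming you have copied each row's best-baseline score and method score out of the results table. The setting names and numbers below are hypothetical placeholders, not taken from any paper.

```python
def relative_gain(method: float, best_baseline: float) -> float:
    """Relative improvement of the method over the best baseline, in percent."""
    return 100.0 * (method - best_baseline) / best_baseline


# Hypothetical placeholder rows: (setting, best baseline score, method score).
results = [
    ("dataset_a / Recall@5", 0.200, 0.250),
    ("dataset_b / Recall@5", 0.400, 0.404),
    ("dataset_a / NDCG@5", 0.100, 0.130),
]


def weakest_setting(rows):
    """Return the row with the smallest relative gain over the best baseline."""
    return min(rows, key=lambda row: relative_gain(row[2], row[1]))


gains = {name: relative_gain(method, base) for name, base, method in results}
weakest = weakest_setting(results)

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
print("Most fragile setting:", weakest[0])
```

Sorting by relative rather than absolute gain keeps settings with different score scales comparable.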
K-RagRec on LLaMA-3: the improvement on Recall@5 is only 2.5%, much weaker than the 27.8% on LLaMA-2. Why? The paper doesn't explain the gap clearly.
Evaluating on 1 positive + 19 random negatives is much easier than full ranking. Numbers look good but don't reflect real recommendation quality.
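To see how large the gap can be, here is a small simulation under a deliberately extreme assumption: a scorer that outputs pure noise. With 1 positive and 19 sampled negatives, the positive lands in the top 5 about 5/20 = 25% of the time by chance alone. Under full ranking over a hypothetical catalogue of 2,000 items, the chance is about 5/2000. The function names, catalogue size, and user count are illustrative choices.

```python
import random


def rank_of_positive(pos_score, negative_scores):
    """1-based rank of the positive item among the given negatives."""
    return 1 + sum(score > pos_score for score in negative_scores)


def hit_rates(num_users=1000, num_items=2000, num_sampled=19, k=5, seed=0):
    """Hit rate@k under full ranking vs. sampled-negative ranking."""
    rng = random.Random(seed)
    full_hits = 0
    sampled_hits = 0
    for _ in range(num_users):
        # Assumption: an uninformative model, so every score is random noise.
        pos = rng.random()
        negatives = [rng.random() for _ in range(num_items - 1)]
        sampled = rng.sample(negatives, num_sampled)
        full_hits += rank_of_positive(pos, negatives) <= k
        sampled_hits += rank_of_positive(pos, sampled) <= k
    return full_hits / num_users, sampled_hits / num_users


full_hr, sampled_hr = hit_rates()
print(f"HR@5, full ranking:        {full_hr:.3f}")
print(f"HR@5, 1 pos + 19 sampled:  {sampled_hr:.3f}")
```

Because the sampled negatives are a subset of all negatives, the positive's sampled rank can never be worse than its full rank, so the sampled metric is always at least as high.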
Old, dense datasets (MovieLens-1M dates from 2003) don't reflect modern recommender-system challenges. Results may not transfer.
If authors reimplement baselines themselves instead of using the original code, the baselines may be under-tuned and look weaker than they are. Check whether the paper cites the original implementation.
If a paper reports Recall@3 and Recall@5 but not NDCG, ask why. Different metrics tell different stories.
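A short sketch of why the two metrics can disagree, using the common binary-relevance definitions of Recall@K and NDCG@K. The two ranked lists are hypothetical: both models retrieve the single relevant item within the top 5, but at different positions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: rewards placing relevant items near the top."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"x"}
model_a = ["x", "a", "b", "c", "d"]  # relevant item at rank 1
model_b = ["a", "b", "c", "d", "x"]  # relevant item at rank 5

recall_a, recall_b = recall_at_k(model_a, relevant, 5), recall_at_k(model_b, relevant, 5)
ndcg_a, ndcg_b = ndcg_at_k(model_a, relevant, 5), ndcg_at_k(model_b, relevant, 5)

print(f"Recall@5: A={recall_a:.3f}  B={recall_b:.3f}")
print(f"NDCG@5:   A={ndcg_a:.3f}  B={ndcg_b:.3f}")
```

Recall@5 cannot tell the two models apart, while NDCG@5 penalizes model B for burying the relevant item at rank 5. A paper that reports only Recall may be hiding weak ranking order.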
An ablation removes one component at a time to show its contribution. Read it as:
The component with the biggest drop is what the paper is actually about — and your best target for extension.
Insight: GNN Encoder is the core contribution. Everything else is supporting infrastructure.
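The same reading can be done mechanically: subtract each ablated variant's score from the full model's score and sort by the drop. A minimal sketch with hypothetical component names and placeholder numbers, not the values from any paper's ablation table:

```python
# Hypothetical placeholder scores on a single metric; replace with the paper's table.
full_model = 0.400
ablations = {
    "w/o component_a": 0.390,
    "w/o component_b": 0.310,
    "w/o component_c": 0.380,
}


def rank_by_drop(full, variants):
    """Sort ablated variants by how much the score falls without that component."""
    drops = {name: full - score for name, score in variants.items()}
    return sorted(drops.items(), key=lambda pair: pair[1], reverse=True)


ranked = rank_by_drop(full_model, ablations)
for name, drop in ranked:
    print(f"{name}: -{drop:.3f} ({100 * drop / full_model:.1f}% of the full score)")
```

The variant at the top of the list marks the component the method depends on most.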
The last question is the most important. "What would break this method?" — your answer to that is the seed of your next research idea.
Methodology is the paper. Contributions list is the contract. Read gaps, not just results.
5 min → 30 min → 2 hrs. Stop when you have what you need. Most papers only need Pass 2.
Hidden assumptions, missing baselines, weak evaluation. Ablations reveal the real contribution.
Next: M3 · Documenting Findings — building your systematic knowledge base