  • umjunsik132 3 hours ago
    Hi HN, author here.

    I built FactorizedAttention - a new attention mechanism based on the GWO framework. Instead of the standard QK^T dot product, it uses factorized quadratic forms to model higher-order token interactions.

    Testing on GPT-2 small + LoRA fine-tuning (perplexity improvements):

    Math reasoning: 3.4%

    Competitive programming: 3.2%

    Python code: 1.9%

    The bigger gains on the reasoning-heavy tasks suggest the approach helps the model capture more complex token relationships. Still early stage (only GPT-2 small), but the results are encouraging. Happy to answer questions! Code + repro steps are in the repo, and a rough sketch of the mechanism is below.
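    To give a flavour of the idea, here is a heavily simplified PyTorch-style sketch of the mechanism described above. It is not the repo's exact code: the rank, the init, and the exact factorization here are illustrative only.

      import torch
      import torch.nn as nn

      class FactorizedAttentionScore(nn.Module):
          # Simplified sketch, not the exact repo code.
          def __init__(self, d_head, rank=8):
              super().__init__()
              # low-rank factors of the quadratic form W = U @ V^T (illustrative)
              self.U = nn.Parameter(torch.randn(d_head, rank) * 0.02)
              self.V = nn.Parameter(torch.randn(d_head, rank) * 0.02)
              # static, per-layer sigmoid gate
              self.gate = nn.Parameter(torch.zeros(1))

          def forward(self, q, k):
              # q, k: (batch, heads, seq, d_head)
              g = torch.sigmoid(self.gate)
              # structured components from the factorized quadratic form
              q_struct = q @ self.U @ self.V.T   # (B, H, T, d_head)
              k_struct = k @ self.V @ self.U.T
              # blend the structured signal back into the original q and k
              q_mix = q + g * q_struct
              k_mix = k + g * k_struct
              # scores: standard QK^T plus low-rank higher-order terms
              return (q_mix @ k_mix.transpose(-2, -1)) / q.size(-1) ** 0.5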

    • mynti 2 hours ago
      Cool idea! I had a look at the code and have been wondering about the sigmoid gating: it is used to add some of the q_struct and k_struct into the original key and query. But why is this gating independent of the input? I would have expected it to depend on the input, so that when the model sees something more complex it can pull in more of this information (or something along those lines). As far as I can tell it is just a fixed, learnable parameter per layer, or am I mistaken? What is the intuition behind this?
      • umjunsik132 2 hours ago
        For this initial version, I kept the gating static to keep the model as simple as possible while validating the core idea. Making the gate dynamic based on the input is a great suggestion for the next step, and I agree it could lead to better performance. I really appreciate the feedback.
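        Roughly what I have in mind for the input-dependent version (just a sketch, not in the repo yet; the module name, shapes, and the linear gate predictor are placeholders):

          import torch
          import torch.nn as nn

          class GateSketch(nn.Module):
              def __init__(self, d_model):
                  super().__init__()
                  # current version: one static, learnable scalar per layer
                  self.static_gate = nn.Parameter(torch.zeros(1))
                  # possible next step: predict the gate from the token itself
                  self.gate_proj = nn.Linear(d_model, 1)

              def forward(self, x, dynamic=False):
                  # x: (batch, seq, d_model)
                  if dynamic:
                      # per-token gate in (0, 1), conditioned on the input
                      return torch.sigmoid(self.gate_proj(x))  # (B, T, 1)
                  # current behaviour: one gate value shared across all tokens
                  return torch.sigmoid(self.static_gate)       # scalar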