Code is a mess, but gemm-optimization works on more systematic test cases, including Joseph's NAACL graph