
Conversation

@tkarna tkarna commented Nov 18, 2025

Implements #16.

  • Adds a Workload object and execution-engine utility functions that can execute it.
  • Adds two CPU examples that demonstrate the usage:
    • One allocates input data with NumPy; the other uses MLIR helper functions to allocate/fill/deallocate input memrefs.

@rolfmorel rolfmorel left a comment

Here's a partial pass through it. Will try to complete the first pass through tomorrow!

pass

@abstractmethod
def get_input_arrays(self, execution_engine) -> list:
Contributor

-> abc.Sequence[ir.Value]?

Contributor

If each Workload were explicitly associated with a Target (or more than one), i.e. had a Target member, and each Target kept track of its associated execution_engine, this signature would become simpler.

Contributor Author

Creating the execution engine requires a fully lowered payload IR. So keeping the engine in the Workload object complicates the re-use of this object, e.g., for different schedule parameters.

The flow is something like: ParametrizedWorkload(schedule with unknown params, initial payload IR, correctness test, target specific info, ...) -> ConcreteWorkload(lowered payload IR, target specific info, ...).

In this proposal the Workload object addresses the first part: it is a reusable, high-level description of the problem. Every time one asks for a payload or schedule module, a new module is generated (the payload module is lowered in place, so it cannot be reused; the schedule may depend on the given parameters). The latter part, lowering to a final payload IR and executing it, is handled via the execution helper functions. We could design this differently, e.g. by adding another "ConcreteWorkload" object that is associated with an execution engine (either in this PR or later). All ideas are much appreciated!
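
To make that two-stage flow concrete, a minimal Python sketch follows; all class and method names are illustrative and not the API proposed in this PR.

# Hypothetical sketch only; names and signatures are invented for illustration.
from abc import ABC, abstractmethod


class ParametrizedWorkload(ABC):
    """Reusable high-level problem description (schedule parameters unknown)."""

    @abstractmethod
    def payload_module(self):
        """Build a fresh payload module (lowered in place, so not reusable)."""

    @abstractmethod
    def schedule_module(self, **schedule_parameters):
        """Build a schedule module for the given parameters."""

    def concretize(self, **schedule_parameters) -> "ConcreteWorkload":
        payload = self.payload_module()
        schedule = self.schedule_module(**schedule_parameters)
        # The schedule would be applied and the payload lowered here; only
        # the result is suitable for creating an execution engine.
        return ConcreteWorkload(payload, schedule)


class ConcreteWorkload:
    """Lowered payload IR plus target-specific info; ready for execution."""

    def __init__(self, lowered_payload, schedule):
        self.lowered_payload = lowered_payload
        self.schedule = schedule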

@rolfmorel rolfmorel Dec 5, 2025

Thank you for explaining!

Indeed, there appears to be this "phase change" in interacting with Workloads: only after lowering do certain actions, like executing, become available. It would be nice if we modelled this in the code somehow. I agree that having separate classes would work (I don't have much of a preference for the mechanism at the moment; e.g. making methods take arguments of a type that is only produced by methods that must have run earlier is a fine mechanism as well).
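
As a rough illustration of that second mechanism (hypothetical names, not part of this PR), execution could require a value that only the lowering step produces:

# Hypothetical sketch; types and function names are invented for illustration.
from dataclasses import dataclass


@dataclass
class LoweredPayload:
    # Wraps a fully lowered payload module; only lower() constructs this.
    module: object


def lower(workload, schedule_parameters) -> LoweredPayload:
    # Apply the schedule and lower the payload IR, then wrap the result.
    ...


def execute(lowered: LoweredPayload, inputs):
    # The argument type makes it clear that lowering must have happened
    # before execution is possible.
    ...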

pass

@abstractmethod
def get_complexity(self) -> list:
Contributor

If the different "dimensions" of the returned thing are known, and have sensible names, I would go with a NamedTuple.

Contributor Author

Changed to tuple[int, int, int]
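
For reference, a minimal sketch of the NamedTuple alternative mentioned above; the field names here are invented for illustration and may not match the actual complexity dimensions.

from typing import NamedTuple


class Complexity(NamedTuple):
    flops: int          # floating-point operations per kernel invocation
    bytes_read: int     # bytes read from memory
    bytes_written: int  # bytes written to memory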

Comment on lines 141 to 143
if dump_kernel == "bufferized":
    transform.YieldOp()
    return schedule_module
Contributor

Do we want this in the committed version?

An alternative to consider is inserting transform.print %mod if some verbosity flag has been set.

Contributor Author

This PR has the dump_payload and dump_schedule arguments in the Workload class. As such we should demonstrate their use in the examples. We could also drop this feature for now, but personally I find this feature useful, e.g. in the xegpu matmul example.

Contributor

I agree this kind of functionality is very useful while developing! I am not sure if these arguments are necessary on the most general version of Workload. Though as above, let's iterate in-tree. That is, it's up to follow-up PRs to demonstrate that getting rid of such arguments still gives us all the functionality we care about.

@tkarna tkarna marked this pull request as ready for review December 4, 2025 14:41

@rolfmorel rolfmorel left a comment


Have left a couple more medium-level comments.

We are getting close to this being ready, IMO.

return execution_engine


def execute(
Contributor

Could we move execute and the other functions taking a Workload to lighthouse.workload (creating a folder if needed)?

To me these functions seem closely tied to the Workload interface (and as such have a bit of an opinionated way of interacting with schedules, e.g. how schedule_parameters are to be taken).

raise ValueError("Benchmark verification failed.")


def emit_benchmark_function(
Contributor

This function seems to be largely independent of the workload argument.

I would probably still prefer it in lighthouse.workload over lighthouse.utils though.

benchmark.func_op.attributes["llvm.emit_c_interface"] = ir.UnitAttr.get()


def benchmark(
Contributor

This feels like it is implementing a little "protocol" for how to interact with Workload objects. Given its close tie to Workload, we could add it as a method on Workload. We could even mark that method @final: https://peps.python.org/pep-0591/#id1

Same goes for execute, I think.

(Sorry if I am repeating myself from a couple weeks back.)
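
A minimal sketch of the PEP 591 idea, assuming benchmark were promoted to a method; names are illustrative, not the interface in this PR.

from abc import ABC
from typing import final


class Workload(ABC):
    @final
    def benchmark(self, execution_engine, nruns: int = 10):
        # A fixed "protocol" for driving any Workload; subclasses customize
        # behaviour through the abstract methods this calls, not by
        # overriding benchmark itself (type checkers flag such overrides).
        ...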

Contributor

For me personally, such protocol-implementing functions also help me understand the different parts of an interface, e.g. which purpose(s) they serve. Having them close to the definition of the interface could help people (like me).

Contributor Author

We could promote the benchmark and execute functions to the Workload class. For the time being I've added them as "examples" of common utility functions because we don't know if these two methods are actually sufficient for all use cases. If not, the developer can just implement their own version of the execute/benchmark function and use it instead (execute is probably always fairly similar, though).

Also semantically, the workload definition and the way we want to, say, benchmark it seem somewhat orthogonal to me.

print(schedule_module)
return payload_module

@abstractmethod
Contributor

If this is an abstractmethod, do we need a default implementation? If the default implementation is useful, i.e. subclasses would sometimes just use it as is, then we can remove @abstractmethod, I think.

Contributor Author

Yes, we do not need an implementation here; I've included the dummy code just for the comments. I can remove it.
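
For the record, an abstract method can keep its explanatory text without any dummy code by carrying only a docstring; the method name below is illustrative.

from abc import ABC, abstractmethod


class Workload(ABC):
    @abstractmethod
    def get_payload_module(self):
        """Return a freshly built payload module for this workload."""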

func.FuncOp("rtclock", ((), (f64_t,)), visibility="private")
# emit benchmark function
time_memref_t = ir.MemRefType.get((nruns,), f64_t)
args = payload_arguments + [time_memref_t]
Contributor

An alternative to modifying the signature is to add a memref.global @time_deltas : memref<10xf64> = dense<0.000000e+00> alongside the benchmark function and just load and store from/to there. In Python you should be able to use https://numpy.org/doc/stable/reference/routines.ctypeslib.html to get a numpy array backed by this global, so that you can do your means etc easily.
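
A rough sketch of the read-back side of this suggestion; whether the lowered @time_deltas global is visible as a data symbol, and the raw_lookup call used here, are assumptions rather than verified API.

import ctypes

import numpy as np

NRUNS = 10  # must match the size of the @time_deltas global

# execution_engine: an mlir.execution_engine.ExecutionEngine with the
# lowered module loaded (assumed to be in scope).
# Assumption: the engine can return the raw address of the global's storage.
addr = execution_engine.raw_lookup("time_deltas")
ctypes_buf = (ctypes.c_double * NRUNS).from_address(addr)

# NumPy view backed by the global buffer; no copy is made.
times = np.ctypeslib.as_array(ctypes_buf)
print(times.mean(), times.min())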

Contributor Author

So the benefit of using a global buffer is to avoid pre-allocating the buffer to the expected size with numpy? The global buffer will live as long as the execution engine is alive?

Contributor

To me the primary benefit is that the function can have the same signature as the benchmarked function. The time tracking becomes a side-effect of the function, meaning that, for their functional behaviour, we could interchange them.

I believe the global buffer would indeed live as long as the execution engine (i.e. as long as the shared object remains loaded). Might not be great if we were to decide we want to run parallel execution within the same process, though that doesn't sound great for benchmarking anyway.

Comment on lines +112 to +113
def payload(*args):
    A, B, C = args
Contributor

Suggested change:
-def payload(*args):
-    A, B, C = args
+def payload(A, B, C):

Super nit 😄

min_cst = arith.constant(f64_t, min)
max_cst = arith.constant(f64_t, max)
seed_cst = arith.constant(i32_t, seed)
linalg.fill_rng_2d(min_cst, max_cst, seed_cst, outs=[buffer])
Contributor

Nice! Didn't know this existed. Wonder why it's just "2d".


@abstractmethod
@contextmanager
def allocate_inputs(self, execution_engine: ExecutionEngine):

@rolfmorel rolfmorel Dec 5, 2025

It's debatable whether this method should be on the most generic version of Workload. That is, another conceptually simple way of implementing workloads is to represent the benchmark fully in IR, where the payload contains the kernel plus a main function that calls it and is also responsible for setting up the inputs. In that case this method is extraneous.

When we start adding example workloads like that, we can have a look at whether we want a more generic version of the Workload interface alongside more concrete ones for particular ways of implementing workloads (e.g. one where the kernel-calling logic is implemented as Python methods on Workload and another where the calling logic resides fully in IR).

Contributor Author

Yes, anticipating all the possible kinds of workloads is kind of hard. For the generic case, however, we do need to assume that the payload function takes input memrefs and that we need to be able to handle user-defined allocation and deallocation. For the latter, a context manager seems like a good choice for the generic interface, e.g. in the runner, as shown in this PR.

Then it's a question of defining a nice interface that would provide a context manager for the runner (even if a no-op) and avoid having the user write a lot of unnecessary boilerplate for the simple workloads. One option could be to add a default context manager that calls allocate/deallocate functions that are no-ops by default. The user would then only need to implement the alloc/dealloc functions if actually needed. In addition we would probably also need something like get_input_arrays to get the memrefs for the payload function. I had something like this in an earlier version.
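
A minimal sketch of that default-context-manager idea; the method names are placeholders, not the interface in this PR.

from abc import ABC
from contextlib import contextmanager


class Workload(ABC):
    def allocate(self, execution_engine) -> list:
        # No-op by default; override when MLIR-side allocation is needed.
        return []

    def deallocate(self, execution_engine, buffers) -> None:
        # No-op by default; override to free whatever allocate() created.
        pass

    @contextmanager
    def allocate_inputs(self, execution_engine):
        buffers = self.allocate(execution_engine)
        try:
            yield buffers
        finally:
            self.deallocate(execution_engine, buffers)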
