[Question] Using Reinforcement Learning to control the heating of a building (district heating)
In short, we are trying to use reinforcement learning to control the heating of a building (district heating), using the building's zone temperatures and the outdoor temperature as inputs. To avoid running the real building during training of the RL algorithm, we use a building simulation program as the environment.
The building simulation program has inputs:
Zone thermostat heating and cooling setpoint (C)
Hot water pump flow rate
Outputs from the building simulation program are:
Zone temperatures (C)
Outdoor temperature (C)
Hot water rate (kW)
The aim of the RL algorithm is to control the building's district heating use more efficiently than the current district heating control function. The primary goal is for the RL algorithm to peak-shave the district heating use.
We are using ClippedPPO as the agent via an RL framework. For comparison, we have one year of district heating data from the building we want to control. The building is modelled in the simulation program's format.
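To make the comparison against the measured data concrete, a summary like the following is one way to quantify peak shaving. This is an illustrative sketch only; the function name, the returned fields, and the sample data are placeholders, not something from our actual pipeline:

```python
import numpy as np

def peak_shaving_summary(rl_kw, measured_kw, dt_h=1.0):
    """Compare an RL-controlled heating-rate series against a measured one.

    rl_kw, measured_kw: heating rate series (kW), same sampling interval.
    dt_h: timestep length in hours, so rate * dt_h gives energy in kWh.
    Returns the peak of each series, the peak reduction, and total energy,
    which are the quantities of interest when judging peak shaving.
    """
    rl = np.asarray(rl_kw, dtype=float)
    ref = np.asarray(measured_kw, dtype=float)
    return {
        "peak_rl_kw": rl.max(),
        "peak_measured_kw": ref.max(),
        "peak_reduction_kw": ref.max() - rl.max(),
        "energy_rl_kwh": rl.sum() * dt_h,
        "energy_measured_kwh": ref.sum() * dt_h,
    }
```

A positive `peak_reduction_kw` with similar total energy would indicate the policy is shaving peaks rather than simply heating less overall.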
Action space of the RL-algorithm is:
Hot water pump flow rate
Zone heating and cooling temperature setpoints
Observation space of the RL-algorithm is:
Zone air temperature
Outdoor temperature, current and forecast (36 hours into future)
Heating rate of hot water
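The observation space above can be sketched as a flat vector assembled from the simulation outputs. The layout (1 zone temperature + 1 current outdoor temperature + 36 forecast values + 1 heating rate = 39 values) and the function name are illustrative assumptions, not necessarily the exact ordering we use:

```python
import numpy as np

def build_observation(zone_temp_c, outdoor_now_c, outdoor_forecast_c,
                      heating_rate_kw):
    """Assemble the flat observation vector passed to the agent.

    zone_temp_c: zone air temperature (C)
    outdoor_now_c: current outdoor temperature (C)
    outdoor_forecast_c: 36 hourly outdoor temperature forecast values (C)
    heating_rate_kw: hot water heating rate (kW)
    """
    forecast = np.asarray(outdoor_forecast_c, dtype=np.float32)
    assert forecast.shape == (36,), "expected a 36-hour forecast"
    return np.concatenate((
        np.array([zone_temp_c], dtype=np.float32),      # zone air temperature
        np.array([outdoor_now_c], dtype=np.float32),    # current outdoor temp
        forecast,                                       # 36 h forecast
        np.array([heating_rate_kw], dtype=np.float32),  # heating rate
    ))
```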
At each timestep the RL environment takes the outputs from the building simulation program and computes a penalty from the observation state, which is returned to the agent. The penalty is the sum of four parts, each with a coefficient that I have been tuning largely by trial and error. Some of the parts are, for example, -coeff1*heating_rate^2, -coeff2*heating_derivative and -coeff3*uncomfortable_temp (a large penalty when the indoor temperature is below 19 C).
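The three penalty terms named above can be sketched roughly as follows. The coefficient values, the function name, and the derivative formulation are assumptions for illustration (and the fourth term is omitted since it isn't specified):

```python
# Hypothetical coefficients -- in practice these are tuned by hand.
COEFF_HEAT = 1e-4      # weight on heating power squared (makes peaks expensive)
COEFF_DERIV = 1e-3     # weight on the rate of change of heating power
COEFF_COMFORT = 10.0   # weight on indoor-comfort violation
COMFORT_MIN_C = 19.0   # comfort limit: penalise zone temps below 19 C

def penalty(heating_rate_kw, prev_heating_rate_kw, zone_temp_c, dt_h=1.0):
    """Return the (negative) reward for one timestep.

    heating_rate_kw: current hot water heating rate (kW)
    prev_heating_rate_kw: heating rate at the previous timestep (kW)
    zone_temp_c: current zone air temperature (C)
    dt_h: timestep length in hours
    """
    # Quadratic term: high peaks cost disproportionately more than
    # the same energy delivered at a steady rate.
    p_heat = COEFF_HEAT * heating_rate_kw ** 2
    # Derivative term: discourages rapid ramping of the heating power.
    p_deriv = COEFF_DERIV * abs(heating_rate_kw - prev_heating_rate_kw) / dt_h
    # Comfort term: large penalty once the zone drops below the limit.
    discomfort = max(0.0, COMFORT_MIN_C - zone_temp_c)
    p_comfort = COEFF_COMFORT * discomfort
    return -(p_heat + p_deriv + p_comfort)
```

Because the heating term is quadratic, the agent should in principle prefer spreading heating out over time rather than producing peaks, provided the comfort coefficient does not dominate.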
The problem is that we are still seeing heating with high peaks that we want the RL algorithm to shave. If anyone has an idea of how to get this working, or can offer some insight on how to progress, it would be much appreciated.
The orange curve shows the RL-controlled hot water heating rate and the blue curve shows the real-world measured values for 2022: