[Question] On the subject of in-house large language models versus implementing frontier models

A recent US survey found that 39.4% of adults use generative AI for tasks both at work and outside of work, which highlights the rapidly growing dependence on these models. Anecdotally, I introduced my wife to ChatGPT four months ago, and today she consistently consults ChatGPT (GPT-4o) not only for work-related matters, but also for day-to-day tasks at home. In our household, googling is slowly becoming a thing of the past.

Over the last year and a half, I have heard a lot of chatter about regular businesses hiring teams of engineers to design in-house LLM applications. The arguments for building in-house LLMs are obvious: you control the architecture, the data, and the sensitive information of your business, rather than exposing that data to ‘black box’ models. A year ago, that seemed like a good tradeoff, but now that GPT-4 and other frontier models have been released, it seems to me that any regular business that continues to develop LLMs in-house will be left behind. Frontier models have advanced so quickly in complexity, data scale, and efficiency that matching this pace internally may no longer be feasible for most regular businesses.

Looking into the future, I am curious about the following:

Aside from the reasons stated above, are there other reasons why regular businesses should spend resources creating their own in-house LLMs?

Is there a way to identify, within an industry, which companies are partnering with the builders of frontier models and which are developing their own in-house models?

If one of the barriers to customizing and implementing a frontier LLM within a firm is data cleaning, as sarahconstantin mentions in this post, is there a business opportunity in becoming a data cleaner, i.e., the bridge between regular companies and frontier model builders?