Two interviews with the founder of DeepSeek

Link post

# The Madness of High-Flyer: The Approach to LLM by an AI Giant that Few See
暗涌Waves (2023-05-23 22:50)

Written by 于丽丽
Edited by 刘旌
Translated by Cosmia Nebula

High-Flyer is probably the most exotic among the swarming multitude of competitors in the battle of large models.

This is a game destined for the few, and while many startups are adjusting their direction or even retreating after the big players enter the game, this quantitative fund is alone in its march.

In May, High-Flyer named its new independent organization for making large models DeepSeek (深度求索) and emphasized that it would focus on making real human-level artificial intelligence. Their goal is not only to replicate ChatGPT, but also to research and discover mysteries of artificial general intelligence (AGI).

Not only that: in a field thought to depend heavily on scarce talent, High-Flyer is assembling a group of obsessive people and wielding what it considers its greatest weapon: the curiosity of a group of people.

Among quants, High-Flyer is a “top fund” that has surpassed 100 billion yuan in assets, but the attention it has drawn in this new AI wave is actually quite dramatic and unexpected.

When the shortage of high-performance GPU chips at domestic cloud vendors became the most direct factor limiting generative AI in China, Caijing Eleven (财经十一人) reported that no more than five companies in China held more than 10,000 GPUs. Besides a few headline big corps [like Baidu], they included a quantitative fund called High-Flyer. (It is commonly believed that 10,000 NVIDIA A100 chips are the bare minimum for training large models from scratch.)

In fact, this company has long been a hidden AI giant: founded in 2015, High-Flyer set up an AI company in 2019 and built its self-developed deep-learning training platform “Firefly I” (萤火一号) with a total investment of nearly 200 million yuan, carrying 1,100 GPUs. Two years later, the investment in “Firefly II” (萤火二号) increased to 1 billion yuan, equipped with about 10,000 NVIDIA A100 graphics cards.

This means that from the point of view of computing power alone, High-Flyer had a ticket to self-train ChatGPT earlier than those big corps.

Large models depend heavily on compute, algorithms and data, so it takes $50 million just to start and over $10 million to train once, and it is hard for companies worth less than $10 billion to keep up. Despite all the difficulties, High-Flyer is very optimistic. Founder Liang Wenfeng told us: “What we are sure of now is that since we want to do this and we have the ability to do this, we are one of the best candidates at this point in time.”

This bizarre optimism comes first from High-Flyer’s unique growth path.

Quantitative trading is an import from the U.S., so the founding teams of almost all of China’s top quantitative funds have, to some degree, worked at American or European hedge funds. The only exception is High-Flyer, which started with an entirely local team and grew up on its own.

In 2021, High-Flyer, which was founded only six years ago, reached a size of 100 billion yuan and was called one of the “Four Celestials of Quant”.

Having grown up as an outsider, High-Flyer has always played the spoiler. Several industry sources told us that High-Flyer, “whether in its R&D system, products, or sales, has always cut into the industry in a brand-new way.”

The founder of a leading quantitative fund believes that over the years High-Flyer has never “followed the conventional path” but instead “gone the way they want”, and even when that is deviant or controversial, “they dare to say it out loud and then do it their own way”.

As for the secret of High-Flyer’s growth, the firm internally attributes it to “choosing a group of inexperienced but high-potential people, and having an organizational structure and corporate culture that lets innovation happen”, which they believe will also be the secret that lets large-model startups compete with the big players.

And perhaps the crucial secret comes from High-Flyer’s founder, Liang Wenfeng.

When he was still studying AI at Zhejiang University, Liang Wenfeng was convinced that “AI will definitely change the world”, which in 2008 was still an unaccepted, obsessive belief.

After graduation, he didn’t go to a big corp to become a programmer like those around him. Instead, he holed up in a cheap rental in Chengdu, absorbing the repeated frustration of trying to break into one field after another, before finally breaking into finance, one of the most complex of them all, and founding High-Flyer.

Fun fact: In the early years, he had a similarly crazy friend who tried to recruit him to build flying machines, an endeavor considered “nonsense” [不靠谱], in a Shenzhen [urban village](https://en.wikipedia.org/wiki/Urban_village_(China)). This friend later built a $100-billion company called DJI.

Therefore, in addition to the inevitable topics of money, people and compute involved in making a large model, we also talked about what kind of organizational structure can make innovation happen, and how long the madness of people can last.

This is the first public interview with this seldom-seen “techno-otaku” founder after more than a decade in business.

Coincidentally, on April 11th, High-Flyer also quoted French New Wave director Truffaut’s advice to young filmmakers when announcing its large model: “Be insanely ambitious, and insanely sincere.” [I cannot confirm the quote. The Chinese translation is “务必要疯狂地怀抱雄心,且还要疯狂地真诚。”]

Below is the conversation:

## Part 1: Research and Exploration

> “Do the most important and difficult thing.”

“DarkWaves”: Not long ago, High-Flyer made an announcement that it would enter the field of large AI models. Why would a quant fund want to do such a thing?
Liang Wenfeng: Our large models are not directly related to quants or finance, so we’ve created a new company called DeepSeek to do this.
Many of High-Flyer’s original team members worked on AI. Back then, we tried a lot of fields before getting our big break in finance, which is complex enough. AGI is probably one of the hardest things we can do next, so for us it was a question of how, not why.

“DarkWaves”: Are you going to train a general large model yourself, or a model for a vertical field such as finance?
Liang Wenfeng: We’re going to do AGI. LLMs are probably a necessary step on the approach to AGI, and they already possess preliminary traits of AGI, so that’s where we’re going to start, and then we’ll do models with vision, and so on.

“DarkWaves”: Many startups have given up on being a company focused only on building Foundation Models since the entry of big corps.
Liang Wenfeng: We won’t prematurely focus on building applications on top of models. We will focus on large models.

“DarkWaves”: Many people think that it’s no longer a good time for startups to enter the field, since the big players have formed a consensus.
Liang Wenfeng: Right now, it seems that neither big corps nor startups can build a crushing technical advantage over rivals in a short time. With OpenAI leading the way, and with everything based on published papers and code, by next year at the latest both big corps and startups will have made their own LLMs.
Both big corps and startups have their own opportunities. Existing vertical scenarios are not in the hands of startups, and this stage is unfriendly to them. But because such scenarios are ultimately scattered, fragmented demands with specific, non-generic needs, they are in turn better suited to flexible startup organizations. In the long run, the threshold for large-model applications will keep dropping, and startups will have the chance to enter the field at any point over the next 20 years.
Our position is also clear: we won’t do verticals or applications, just research and exploration.

“DarkWaves”: Why “research and exploration”?
Liang Wenfeng: It’s driven by curiosity. From a distance, we want to test some conjectures. For example, we understand that the essence of human intelligence may be language, and human thinking may be a language process. You think you’re thinking, but you’re actually weaving language in your head. This means that human-like AGI may be born from LLMs.
Closer to home, GPT-4 still holds many mysteries. We will research them even as we replicate it.

“DarkWaves”: But research means paying a bigger cost.
Liang Wenfeng: Yes. If we settle for only replication, we can train it in a few tries, or even just finetune it, based on public papers or open source code, so the cost is very low. But for research, we need to do all kinds of experiments and comparisons, which requires more computing power and higher requirements for personnel, so the cost is higher.

“DarkWaves”: Where does the research funding come from?
Liang Wenfeng: High-Flyer, as one of our funders, has an adequate R&D budget. In addition, every year High-Flyer donates 100s millions of yuan to public welfare organizations, but we can change that, if needed.

“DarkWaves”: But you can’t even get a seat at the foundation-model table without $200–300 million. How can you sustain the investment?
Liang Wenfeng: We are also talking to various potential funders. From those conversations, I feel that many VCs have reservations about funding research: they need an exit and want products commercialized as soon as possible, and given that we prioritize research, it’s hard to get financing from VCs. But we have compute and a team of engineers, which amounts to holding half the chips already.

“DarkWaves”: What are your business model projections and assumptions?
Liang Wenfeng: What we are thinking now is that we can share most of our training results publicly, which can be combined with commercialization. We hope that more people, even a small app, can use the large models at low cost, instead of monopolizing the technology in the hands of a few people and companies.

“DarkWaves”: Some big corps will also offer services later on. What differentiates you?
Liang Wenfeng: The big players’ models may be bundled with their platforms or ecosystems, while we are completely free.

“DarkWaves”: In any case, it’s crazy for a commercial company to go into a kind of research with unlimited investment.
Liang Wenfeng: If we have to find a commercial reason, we probably can’t, because it’s not profitable.
From a commercial point of view, basic research has a very low return-on-investment ratio, and when OpenAI’s early investors put in their money, they didn’t think about the returns. They did it because they wanted it.
What we are sure of now is that since we want to do this and we have the ability to do this, we are one of the best candidates at this point in time.

## Part 2: The 10,000-GPU Reserve and its Cost

> An exciting thing may not be measured by money alone.

“DarkWaves”: GPUs are the scarce commodity of this ChatGPT startup wave, and you had the foresight to stockpile 10,000 of them back in 2021. Why?
Liang Wenfeng: Actually, it happened gradually: from the first card, to 100 cards in 2015, to 1,000 cards in 2019, and then to 10,000. Up to a few hundred cards, we were hosted in an Internet Data Center (IDC); once the scale grew, hosting could no longer meet our requirements, so we started building our own server room.
Many people would think that there is a secret business logic here, but in fact, it is mainly driven by curiosity.

“DarkWaves”: What kind of curiosity?
Liang Wenfeng: Curiosity about the boundaries of AI’s capabilities. To outsiders, ChatGPT was the great shock; to insiders, it was the shock of AlexNet in 2012 that opened a new era. AlexNet’s error rate was far lower than other models of the time, and it revived neural-network research that had lain dormant for decades. The specific techniques keep changing, but the combination of models + data + compute stays constant. Especially after OpenAI released GPT-3 in 2020, the direction was clear: a lot of compute would be needed. Yet even in 2021, when we invested in building Firefly II, most people still couldn’t understand it.

“DarkWaves”: So since 2012, you’ve been focusing on building up a reserve of compute?
Liang Wenfeng: For researchers, the thirst for compute is never-ending. After doing small-scale experiments, we always want to do larger-scale experiments. After that, we will also consciously deploy as much compute as possible.

“DarkWaves”: Many people assume you built this computing cluster because the quant fund would use machine learning for price prediction.
Liang Wenfeng: If we were doing purely quantitative investing, a few cards would be enough. We’ve done a lot of research beyond investing: we want to figure out what kind of paradigm can completely describe the entire financial market, whether there is a more concise way to express it, where the capability boundaries of different paradigms lie, whether those paradigms apply more broadly, and so on.

“DarkWaves”: But this is a money burner.
Liang Wenfeng: An exciting thing may not be measured by money alone. It’s like buying a piano for your family, firstly because you can afford it, and secondly because you have a group of people who are eager to play music on it.

“DarkWaves”: GPUs usually depreciate at about 20% per year.
Liang Wenfeng: We haven’t calculated it precisely, but it shouldn’t be that much. NVIDIA’s graphics cards are hard currency; even cards from many years ago are still in use. The old cards we retired fetched a decent price second-hand, so we didn’t lose much.

“DarkWaves”: Building a compute cluster means maintenance, labor, and even electricity are significant expenses.
Liang Wenfeng: Electricity and maintenance costs are actually very low, only about 1% of the hardware cost per year. Labor costs are not low, but they are an investment in the future, the company’s biggest asset. The people we pick tend to be relatively down-to-earth and curious, and they come here for the chance to do research.

“DarkWaves”: In 2021, High-Flyer was among the first companies in the Asia-Pacific region to get A100 cards. Why were you ahead of some of the cloud vendors?
Liang Wenfeng: We did pre-research, testing and planning for the new card very early. As for some of the cloud vendors, as far as I know, their earlier demand was scattered. It wasn’t until 2022, when autonomous driving created demand to rent machines for training, along with the ability to pay for it, that some cloud vendors built out the infrastructure. It’s hard for a big corp to do pure research or training; they are mostly driven by business demand.

“DarkWaves”: How would you look at the competitive landscape for large models?
Liang Wenfeng: The big players definitely have an advantage, but if they can’t apply it quickly, they won’t necessarily be able to sustain it, because they have a greater need to see results.
The top startups also have solid technology, but like the previous wave of AI startups, they all face the challenge of commercialization.

“DarkWaves”: Some people would think that a quant fund emphasizing AI is just fluffing stuff up, “blowing bubbles”, to attract attention for their other actual businesses.
Liang Wenfeng: But in fact, our quant fund doesn’t raise much more money from the public anymore.

“DarkWaves”: How do you tell AI believers from speculators?
Liang Wenfeng: Believers were here before and will still be here after. They’re more likely to buy cards in bulk or sign long-term deals with cloud vendors than to rent for short periods.

## Part 3: How to make innovation actually happen

> Innovation is often self-generated, not orchestrated or taught.

“DarkWaves”: How is the progress on the hiring of the DeepSeek team?
Liang Wenfeng: The initial team has already been assembled, and some people will be seconded from High-Flyer in the early stage due to lack of manpower. We started hiring at the end of last year when ChatGPT 3.5 became popular, but we still need more people to join us.

“DarkWaves”: Talent for large-model startups is also scarce, and some investors say that many suitable candidates may only be found in the AI labs of giants such as OpenAI and Facebook AI Research. Will you go overseas to poach such talent?
Liang Wenfeng: If you’re chasing short-term goals, it’s right to look for ready-made, experienced people. But if you look at the long term, experience is not so important; basic ability, creativity, and passion matter more. From that perspective, there are plenty of suitable candidates in China.

“DarkWaves”: Why is experience less important?
Liang Wenfeng: It’s not necessarily true that only someone who has done the thing before can do it. High-Flyer’s recruiting principle is to look at ability, not experience. Our core technical positions are mostly filled by fresh graduates and people one or two years out of school.

“DarkWaves”: Do you think experience is an obstacle in the innovation business?
Liang Wenfeng: When doing something, an experienced person will tell you, without thinking, that it should be done this way; an inexperienced person will puzzle over it again and again and then find a solution that fits the actual situation.

“DarkWaves”: High-Flyer went from a complete outsider with no finance genes to the top of the industry within a few years. Is this recruiting rule one of its secrets?
Liang Wenfeng: Our core team, myself included, started with no quantitative experience, which is quite unusual. I can’t say it’s the secret of our success, but it is part of High-Flyer’s culture. We don’t deliberately avoid experienced people; we simply look more at ability.
Take sales as an example. Our two main salespeople were both newcomers to this industry: one used to do foreign trade in German machinery, the other wrote backend code at a brokerage. They entered the industry with no experience, no resources, no track record.
And now we may be the only large private fund that relies mainly on direct sales. Direct sales means no middleman fees, so at the same scale and performance the margins are higher. Many firms have tried to imitate us; none have succeeded.

“DarkWaves”: Why did many try to imitate you but didn’t succeed?
Liang Wenfeng: Because that alone is not enough for innovation to happen. It needs to match the culture and management of the company.
In fact, in their first year they couldn’t get anything done; only in the second year did results start to come. But our assessment criteria differ from a normal company’s: we have no KPIs and no so-called “tasks”.

“DarkWaves”: What are your assessment criteria?
Liang Wenfeng: Unlike most companies, we don’t fixate on customers’ order volume. Our salespeople’s quotas and commissions aren’t fixed up front; instead, we encourage them to build their own circles, meet more people, and grow their influence.
That’s because we believe an honest salesperson whom customers trust may not win orders in the short term, but he can make customers feel he is someone dependable.

“DarkWaves”: After selecting the right person, what are the best ways to get him into shape?
Liang Wenfeng: Give him important things to do and don’t interfere. Let him figure things out and perform on his own.
In fact, a company’s DNA is hard to imitate. For example, when recruiting inexperienced people, how does the company judge their potential, and how does it let them grow after recruiting? None of these can be directly imitated.

“DarkWaves”: What do you think are the necessary conditions to build an innovative organization?
Liang Wenfeng: Our conclusion is that innovation requires as little intervention and management as possible, so that everyone has the freedom to play and the opportunity for trial and error. Innovation is often self-generated, not orchestrated or taught.

“DarkWaves”: It’s an unconventional management style. How do you make sure that a person is doing things efficiently and in the direction you want them to go?
Liang Wenfeng: We hire for aligned values, and then the culture keeps everyone on the same page. Of course, we don’t have a written corporate culture, because anything written down, again, gets in the way of innovation. More often it’s the example the manager sets: how you make decisions when a problem arises becomes a guideline.

“DarkWaves”: Do you think that in this wave of large model competition, the organizational structure of startups that is more suitable for innovation will be the breakthrough point to compete with the big players?
Liang Wenfeng: If you applied textbook methodology to project the fates of today’s startups doing what they do, none of them would survive.
But the market is changing. The real determining force is often not some ready-made rules and conditions, but an ability to adapt and adjust to change.
The organizational structures of many big companies can no longer respond or act quickly, and they easily let past experience and inertia become constraints. Under this new AI wave, a batch of new companies is bound to be born.

## Part 4: True Madness

> Innovation *is* expensive and inefficient, and sometimes comes with waste.

“DarkWaves”: What excites you most about doing something like this?
Liang Wenfeng: Figuring out if our conjecture is true, and if it is, it’s exciting.

“DarkWaves”: What are some of the musts you are looking for in this recruitment drive?
Liang Wenfeng: Passion and solid foundational skills. Nothing else matters that much.

“DarkWaves”: Are such people easy to find?
Liang Wenfeng: Their passion usually shows because they really want to do it, so these people are often seeking you at the same time.

“DarkWaves”: Large models can be an endless endeavor. Is the cost a concern for you?
Liang Wenfeng: Innovation *is* expensive and inefficient, and sometimes comes with waste. That’s why innovation can only happen when the economy is sufficiently developed. When you’re poor, or not in an innovation-driven industry, cost and efficiency are critical. Look at how OpenAI burned a lot of money before its big break.

“DarkWaves”: Do you think you are doing something mad?
Liang Wenfeng: I don’t know if it’s mad, but there are many things in this world that logic can’t explain. It’s like the many programmers who are also fanatical contributors to open source: exhausted at the end of the day, they still go and contribute code.

“DarkWaves”: There’s a kind of spiritual reward here.
Liang Wenfeng: It’s like hiking 50 kilometers: your whole body is wrecked, but your spirit is deeply satisfied.

“DarkWaves”: Do you think curiosity-driven madness can last forever?
Liang Wenfeng: Not everyone can stay mad their whole life, but most people, in their younger years, can throw themselves fully into something with no utilitarian purpose at all.

----

<https://mp.weixin.qq.com/s?__biz=Mzk0MDMyNDUxOQ==&mid=2247486864&idx=1&sn=dd80bd76dd937e363a5c61aa542e6d18>

----

# DeepSeek Uncovered: The Story of a More Extreme Chinese Techno-Idealism

暗涌Waves (2024-07-17 02:01)

Written by 于丽丽
Edited by 刘旌
Translated by Cosmia Nebula


Of China’s seven large-model startups [I don’t know which 7 they meant], DeepSeek (深度求索) is the quietest, yet it’s always remembered for the surprises it brings.

A year ago, the surprise was that a quantitative fund, High-Flyer, was the only company outside the big corps to stockpile 10,000 A100 chips; a year later, the surprise is that it has triggered China’s large-model price war.

In May, a month bombarded with AI news, DeepSeek leapt to fame with the open-source model `DeepSeek-V2`, which offered an unprecedented price/performance ratio: inference cost dropped to just 1 yuan per million tokens, about one-seventh that of Llama3-70B and one-seventieth that of GPT-4 Turbo.

DeepSeek was quickly crowned the “Pinduoduo of AI” [after an e-commerce giant famous for selling cheap products in bulk]. At the same time, ByteDance, Tencent, Baidu, Alibaba and other big corps were forced to cut prices. Thus began China’s large-model price war.

The smoke filling the air actually hides a fact: unlike the many big corps burning money to subsidize their price war, DeepSeek is already profitable.

Behind this is DeepSeek's all-round innovation in model architecture. It proposed a brand-new MLA (multi-head latent attention) architecture, which reduces the VRAM usage to 5%-13% of that of MHA (multi-head attention), the most commonly used architecture before it; at the same time, its original DeepSeekMoE sparse architecture also drives the compute cost down, all of which ultimately contributed to the drop in cost.
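To see roughly where the memory saving comes from, here is a minimal back-of-the-envelope sketch. The head count and latent size below are hypothetical illustrations, not DeepSeek's published configuration: MHA must cache full keys and values for every head at every position, while MLA caches only one small compressed latent per position, from which keys and values are re-projected at inference time.

```python
# Back-of-the-envelope KV-cache comparison, in elements per token per layer.
# All sizes are hypothetical and chosen only for illustration.

def mha_kv_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches keys AND values for every head."""
    return 2 * n_heads * head_dim

def mla_kv_cache_per_token(latent_dim: int) -> int:
    """MLA caches a single compressed latent vector per token instead."""
    return latent_dim

mha = mha_kv_cache_per_token(n_heads=128, head_dim=128)  # 32768 elements
mla = mla_kv_cache_per_token(latent_dim=2048)            # 2048 elements
print(f"MLA cache is {mla / mha:.1%} the size of MHA's")  # → 6.2%
```

The exact ratio depends on the head count and the chosen latent dimension; with these made-up numbers it lands inside the 5%-13% range quoted above.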

In Silicon Valley, DeepSeek is regarded as a “mysterious force from the East”, and [SemiAnalysis's chief analyst believes the DeepSeek V2 paper is “probably the best one this year in terms of information and details shared”](https://semianalysis.com/2024/05/07/openai-is-doomed-et-tu-microsoft/). Former OpenAI employee Andrew Carr found the paper “full of amazing wisdom” and applied its training setup to his own models. [unsourced claim] And Jack Clark, OpenAI's former head of policy and co-founder of Anthropic, thinks [“DeepSeek has managed to hire some of those inscrutable wizards who can deeply understand CUDA” and that “Made in China will be a thing for AI models, same as electric cars, drones, and other technologies”](https://jack-clark.net/2024/05/13/import-ai-372-gibberish-jailbreak-deepseeks-great-new-model-googles-soccer-playing-robots/).

This is rare in an AI wave whose story is largely driven by Silicon Valley. Several industry sources told us that the strong reaction stems from the innovation at the architectural level, an attempt rarely seen among domestic large model companies, or even among open-source foundation models worldwide. One AI researcher said that in the many years since the Attention architecture was proposed, it has almost never been successfully modified, let alone validated at scale on an actual large model [which is costly to train]. “It's an idea that gets squashed at the decision stage, because most people lack confidence.”

Another reason domestic large models had rarely attempted architectural innovation before is that few people dared to challenge the stereotype that America is better at 0-to-1 technological innovation, while China is better at 1-to-10 application innovation. Besides, such behavior is very unprofitable: the usual thinking is that the next generation of models will naturally appear in a few months anyway, so Chinese companies can simply follow the leader and focus on building good applications. Innovating on the model structure means there is no path to follow and many failures to go through, all costly in time and money.

DeepSeek is clearly going against the grain. Amid the clamor that “large model techniques are bound to converge” and “following is the smarter shortcut”, DeepSeek prizes the value accumulated in taking the long way around, and believes that Chinese large model entrepreneurs can join the global flood of technological innovation, not just application innovation.

Many of DeepSeek's choices are different. So far, among the 7 Chinese large model startups, it is the only one that has given up the “both models and applications” route and focused purely on research and technology, without building to-consumer products; it is also the only one that has not fully pursued commercialization, firmly choosing the open source route without even raising capital. These choices mean it is often forgotten at the table. But on the other end, it is often spread through the community by word of mouth from the users themselves.

How did DeepSeek come to be? We caught up with the seldom-seen founder of DeepSeek, Liang Wenfeng.

This gen-80s founder, who has been researching technology behind the scenes since the High-Flyer era, continues his low-profile style in the DeepSeek era, reading papers, writing code, and participating in group discussions every day, just like all other researchers.

And while many quantitative fund founders have overseas hedge fund resumes, mostly with physics or mathematics backgrounds, Liang Wenfeng's background is entirely domestic: he studied AI in the Department of Electronic Engineering at Zhejiang University.

Several industry insiders and DeepSeek researchers told us that Liang Wenfeng is a very rare person in China’s AI industry who has abilities in “strong infrastructure engineering, model research, and also resource mobilization”, and “can make accurate high-level judgments, and can also be stronger than a frontline researcher in the technical details”. He has a “terrifying ability to learn” and at the same time is “less like a boss and more like a geek”.

This is a particularly rare interview. In it, this techno-idealist offers a voice that is especially scarce in China's tech world: he is one of the few who puts “right and wrong” before “profit and loss”, who reminds us to see the inertia of the era, and who puts “original innovation” on the agenda.

A year ago, when DeepSeek had just entered the arena, we interviewed Liang Wenfeng: “The Madness of High-Flyer: The Approach to LLM by an AI Giant that Few See”. If “Be insanely ambitious, and insanely sincere” was then merely a beautiful slogan, one year later it has become a course of action.

Below is the conversation.

## Part 1: How was the first shot in the price war fired?

“DarkWaves”: After the release of DeepSeek V2 model, it quickly triggered a bloody price war for large models, and some people say you are a catfish in the industry.
Liang Wenfeng: We didn't mean to be the [proverbial catfish](https://zh.wikipedia.org/zh-hans/%E9%B2%B6%E9%B1%BC%E6%95%88%E5%BA%94). We just became one by accident.

“DarkWaves”: Were you surprised by this result?
Liang Wenfeng: Very. We didn't expect everyone to be so sensitive about price. We were just doing things at our own pace, then calculated our costs and set the price accordingly. Our principle is to neither subsidize nor take excessive profits, so the price sits slightly above cost.

“DarkWaves”: 5 days later, Zhipu AI followed, and then ByteDance, Alibaba, Baidu, Tencent, and the other big players.
Liang Wenfeng: Zhipu AI cut the price of an entry-level product; their models at the same level as ours were still very expensive. ByteDance was the first to genuinely follow: they dropped their flagship model to the same price as ours, which then triggered the other big players to cut prices one after another. Because the big corps' model costs are much higher than ours, we never expected anyone to lose money doing this; it ended up turning into the internet-era logic of burning money on subsidies.

“DarkWaves”: From an external perspective, the price cuts look like a userbase grab, which is what price wars in the Internet era usually are.
Liang Wenfeng: Grabbing users is not our main goal. We cut prices partly because, in exploring the structure of the next-generation model, our costs came down first, and partly because we feel that both APIs and AI should be universally affordable and accessible.

“DarkWaves”: Before this, most Chinese companies would just copy the current generation of Llama structure to start making applications from that point. Why did you start with making the model architecture?
Liang Wenfeng: If the goal is to make applications, then adopting the Llama structure and shipping a product quickly is a reasonable choice. But our destination is AGI, which means we need to research new model structures to realize stronger model capability with limited resources. This is part of the basic research required to scale up to larger models. Beyond the model structure, we have done a lot of other research, including how to construct data and how to make the model more human-like, all of which is reflected in the models we released. Besides, the Llama structure, in training efficiency and inference cost, is estimated to be about two generations behind the foreign state of the art.

“DarkWaves”: Where does this generation gap mainly come from?
Liang Wenfeng: First, there is a gap in training efficiency. We estimate that the best domestic level, compared with the best foreign level, may have a 2x gap in model structure and training dynamics; on this alone we have to burn twice the compute to reach the same performance. There may also be a 2x gap in data efficiency, that is, we need twice the training data and compute to reach the same performance. Together that is four times the compute. What we are trying to do is to keep closing these gaps.

“DarkWaves”: Most Chinese companies choose to work on both models and applications, so why is DeepSeek only doing research?
Liang Wenfeng: Because we think the most important thing now is to take part in the global wave of innovation. For many years, Chinese companies were used to others making the technological innovations while we took them over to monetize through applications, but this should not be taken for granted. In this wave, our starting point is not to seize the chance to make a quick profit, but to reach the frontier of technology and help drive the development of the whole ecosystem.

“DarkWaves”: The Internet and mobile Internet era has left most people with an inertial belief that the US is good at technological innovation and China is better at applications.
Liang Wenfeng: We believe that as the economy develops, China should gradually become a contributor rather than keep free-riding. For the last 30-odd years of the IT wave, we basically did not take part in real technological innovation. We have taken Moore's Law for granted, as if it simply falls from the sky and, while we lie flat at home, better hardware and software will arrive every 18 months. We have treated the AI Scaling Laws the same way.
But in fact, this was created generation after generation by the West-led technology community; it is only because we had not taken part in that process that we ignored its existence.

## Part 2: The real difference isn’t a year or two, it’s between originality and imitation.

“DarkWaves”: Why did DeepSeek V2 surprise so many people in Silicon Valley?
Liang Wenfeng: Among the vast number of innovations that happen every day in the US, this is quite an ordinary one. They were surprised because of where it came from: a Chinese company joining their game as a contributor of innovation. After all, most Chinese companies are used to following, not innovating.

“DarkWaves”: But even in the Chinese context, this choice is also extremely extravagant. Large models are a heavy investment game, and not all companies have the capital to just research and innovate, instead of thinking about commercialization first.
Liang Wenfeng: The cost of innovation is certainly not low, and the past inertia of yoinkism was also tied to China's circumstances at the time. [拿来主义 yoinkism: Literally “take-ism”. A humorous invention by Lu Xun. Roughly it means, “If you see a useful idea, just take it. Don't worry about where it came from or its political suggestions.”] But now, you can see that the sheer volume of China's economy, and the profits of big companies like ByteDance and Tencent, are high by global standards. What we lack in innovation is definitely not capital, but confidence, and the knowledge of how to organize a high density of talent for effective innovation.

“DarkWaves”: Why is it so easy for Chinese companies—including big companies that don’t lack money—to prioritize rapid commercialization?
Liang Wenfeng: Over the past 30 years, we have emphasized making money and neglected innovation. Innovation is not entirely business-driven, but also requires curiosity and creativity. We’re just bound by the inertia of the past, but it’s just a phase.

“DarkWaves”: But you’re really a commercial organization, not a public interest research institution, and choosing to innovate and then share it out through open source, where is that going to create a moat? Innovations like this MLA architecture in May will be quickly copied by others, right?
Liang Wenfeng: In the face of disruptive technologies, the moat formed by closed source is short-lived. Even if OpenAI is closed source, it won’t stop others from catching up. So we put the value on our team, our colleagues grow in the process, accumulate a lot of know-how, and form an organization and culture that can innovate, which is our moat.
In fact, nothing is lost through open source and published papers. For technologists, having others follow your work brings a great sense of accomplishment. Open source is more a cultural act than a commercial one. Giving is an extra honor, and a company that gives also gains a cultural attraction [to technologists].

“DarkWaves”: What do you think about market believers like Allen Zhu Xiaohu?
Liang Wenfeng: Allen Zhu Xiaohu is self-consistent, but his playbook is better suited to companies that make money fast. Look at the most profitable companies in the US: they are all high-tech companies built on long, deep accumulation.

“DarkWaves”: But it’s hard to form an absolute advantage in a large model, simply by being ahead in technology, so what’s the bigger thing you’re betting on?
Liang Wenfeng: What we see is that Chinese AI cannot stay a follower forever. We often say there is a gap of one or two years between Chinese and American AI, but the real gap is the difference between originality and imitation. If this does not change, China will always be a follower, so some exploration cannot be avoided.
NVIDIA's lead is not the effort of one company alone, but the result of the joint efforts of the entire Western technology community and industry. They can see the next generation of technology trends and hold the roadmap in their hands. China's AI development needs the same kind of ecosystem. Many domestic chips have failed to develop precisely because they lack a supporting technology community and have only second-hand information, so China necessarily needs people standing at the frontier of technology.

## Part 3: More investment doesn’t necessarily produce more innovation

“DarkWaves”: DeepSeek now has an air of idealism that OpenAI had in its early days, and it’s also open source. Will you go closed-source, as both OpenAI and Mistral have gone from open-source to closed-source?
Liang Wenfeng: We won’t go closed-source. We think it’s more important to have a strong technology ecosystem first.

“DarkWaves”: Do you have any funding plans? I’ve read media reports that High-Flyer has plans to spin off DeepSeek and list it on the stock market, and that AI startups in Silicon Valley will inevitably be tied to big companies in the end.
Liang Wenfeng: We don’t have any financing plan in the short term, the problem we are facing is never money, but the embargo on high-end chips.

“DarkWaves”: Many people think that doing AGI and doing quantitative trading are two completely different things. Quantitative trading can be done quietly, but AGI may need to be done in a high-profile way, with alliances, in order to increase the capital investment.
Liang Wenfeng: More investment doesn’t necessarily produce more innovation. Otherwise, the big players would have taken care of all the innovation.

“DarkWaves”: Is the reason you are not making applications right now that you don't have the DNA for operations?
Liang Wenfeng: We believe that the current stage is an explosion of technological innovation, not an explosion of applications. In the long run, we hope to form an ecosystem in which the industry directly uses our technology and outputs, and we are only responsible for the basic models and cutting-edge innovations, and then other companies will build to-business and to-consumer products on the basis of DeepSeek. If a complete upstream and downstream industrial ecosystem is formed, then there is no need for us to make applications ourselves. Of course, there is no obstacle for us to make applications if needed, but research and technological innovation will always be our first priority.

“DarkWaves”: But if one were to choose an API, why would they choose DeepSeek’s instead of one from the big players?
Liang Wenfeng: The world of the future is likely to be a specialized division of labor, and the underlying large models need to be continuously innovated, while the big players have their boundaries of competence, which are not necessarily suitable for this.

“DarkWaves”: But can technology really bridge the gap? You also said there is no absolute technical secret.
Liang Wenfeng: There are no secrets in technology, but it takes time and cost to start again. NVIDIA’s graphics cards, in theory, don’t have any technical secrets and are easy to copy, but it takes time to reorganize the team and catch up with the next generation of technology, so the actual moat is still very wide.

“DarkWaves”: The fact that ByteDance was the first to follow suit after your price cut suggests that they felt some kind of threat. What do you think of this new way for startups to compete with the big players?
Liang Wenfeng: Honestly we don’t really care. We did it as a side effect. Providing cloud services is not our main goal. Our goal is still to achieve AGI.
We haven't seen any “new solution” so far, but the big players don't have a clear advantage either. The big players have ready-made users, but their cash-flow businesses are also baggage, which makes them ripe for disruption.

“DarkWaves”: What do you think about the endgame of the 6 large model startups outside of DeepSeek?
Liang Wenfeng: Maybe 2 or 3 of them will survive. They are all still in the money-burning stage, so those with clear self-positioning and more refined operations have a better chance of surviving. Other companies may be transformed. That which has value will not vanish like morning mist; it will take a different form.

“DarkWaves”: In the High-Flyer era, your attitude toward competition was said to be like ‘I do what I do’ with little concern for side-by-side comparisons. What is the origin of your attitude towards competition?
Liang Wenfeng: I always think about whether something can make society run more efficiently, and whether you can find a good position in its industrial division of labor. As long as the end game is to make society more efficient, it is valid. A lot of the in-betweens are just passing trends, and too much attention on these is bound to blind you with details.

## Part 4: A group of young people doing something “inscrutable”.

“DarkWaves”: Jack Clark, former policy director of OpenAI and co-founder of Anthropic, said “DeepSeek has managed to hire some of those inscrutable wizards”, what kind of people made DeepSeek v2?
Liang Wenfeng: There are no inscrutable wizards; they are fresh graduates from top colleges and universities, PhD students in their fourth or fifth year, and young people who graduated only a few years ago.

“DarkWaves”: Many large model companies are obsessed with poaching people from overseas, and many think the top 50 talents in this field may not even be in Chinese companies. Where do your people come from?
Liang Wenfeng: No one on the V2 team came back from overseas; they are all local. The top 50 talents may not be in China, but perhaps we can train such people ourselves.

“DarkWaves”: How did this MLA innovation happen? I heard that the idea first came from the personal interest of a young researcher?
Liang Wenfeng: After summarizing some mainstream variation patterns of Attention architecture, he had a sudden idea to design an alternative. But it was a long process from idea to realization. We formed a team for this, and it took a few months to get it off the ground.

“DarkWaves”: This kind of inspiration has a lot to do with the fact that you’re a completely innovative organization. In the High-Flyer era, you rarely assigned goals or tasks from the top down, but with the uncertainty of cutting-edge exploration like AGI, is there more top-down management control?
Liang Wenfeng: DeepSeek is also all bottom-up. And we generally don’t assign the division of labor up-front. It’s a naturally emerging division of labor. Each person has his own unique growth experience and brings his own ideas, so we don’t need to push him. During the exploration process, he encounters problems and pulls people in to discuss them on his own. But when an idea shows potential, we also deploy resources from the top down.

“DarkWaves”: I’ve heard that DeepSeek is very flexible about mobilizing chips and people.
Liang Wenfeng: Each of us has no cap on the number of chips and people we can mobilize. If you have an idea, you can mobilize chips from the training cluster at any time without approval. And because there are no hierarchies or cross-departments, it’s also flexible to mobilize anyone, as long as the other person is also interested.

“DarkWaves”: A loose management style also depends on having selected a group of people driven by strong passion. I've heard you're very good at recruiting based on details, and can select people who don't excel by the conventional evaluation metrics.
Liang Wenfeng: Our selection criteria have always been love and curiosity, so a lot of people will have interesting and unique experiences. Many people are more interested in doing research than money.

“DarkWaves”: Transformer was born in Google’s AI Lab and ChatGPT was born in OpenAI. What do you think is the difference between the AI lab of a big company and a startup company in terms of the value of innovation?
Liang Wenfeng: Whether it’s Google Lab, OpenAI, or even the AI Labs of Chinese companies, they are all very valuable. The fact that OpenAI is the one that ended up doing ChatGPT is partly a historical accident.

“DarkWaves”: Is innovation largely serendipitous? I see that the row of conference rooms in the middle of your office area has doors on both sides that can be pushed open at will. Your coworkers say this is a way to “leave room for serendipity”. The Transformer was born from the kind of story where someone passing by hears about an idea, joins in, and eventually turns it into a universal framework.
[Note: For details on this, see [8 Google Employees Invented Modern AI. Here's the Inside Story | WIRED](https://web.archive.org/web/20240320101528/https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/)]
Liang Wenfeng: I think innovation is first and foremost a matter of belief. Why is Silicon Valley so innovative? When ChatGPT came out, all of China lacked confidence in doing cutting-edge innovations, from investors to big companies, they all thought that “the gap is too big, let’s do applications”. But innovation needs confidence first. This confidence is usually more evident in young people.

“DarkWaves”: But you didn't raise funds and you seldom make public announcements, so your visibility is certainly lower than that of companies actively raising capital. How do you make sure DeepSeek is the first choice for people who want to work on large models?
Liang Wenfeng: Because we are doing the hardest thing. The biggest attraction for top talent is definitely solving the hardest problems in the world. Actually, top talent is undervalued in China. Because there’s so little hardcore innovation at the societal level that they don’t have a chance to be recognized. The fact that we are doing the hardest things is attractive to them.

“DarkWaves”: OpenAI didn’t release the expected GPT-5, so many people think it’s a clear sign that technology is slowing down, and many people are starting to question the Scaling Law. What do you think?
Liang Wenfeng: We are optimistic. The industry's overall development still seems to be in line with expectations. OpenAI is not a god; it cannot stay at the front forever.

“DarkWaves”: How long do you think it will take for AGI to be realized? Before releasing DeepSeek V2, you released a model for code generation and math, and you also switched from a dense model to an MoE, so what are the coordinates of your AGI roadmap?
Liang Wenfeng: It could be 2, 5, or 10 years, but in any case it will be realized in our lifetime. As for the roadmap, even within our company there is no unified view. But we did place our chips on three bets: math and code, multimodality, and natural language itself. Math and code is a natural testing ground for AGI, somewhat like Go: a closed, verifiable system where a high level of intelligence might be achieved through self-learning alone. On the other hand, multimodality, and participating in the real world that humans learn from, may also be necessary for AGI. We remain open to all possibilities.

“DarkWaves”: What do you think the endgame of large models will look like?
Liang Wenfeng: There will be specialized companies that provide basic models and basic services, and there will be a long chain of specializations. More people will be there to meet the diversified needs of the whole society.

## Part 5: All the best-practices were produced by the previous generation

“DarkWaves”: There have been a lot of changes in China’s large model startups in the past year. For example, Wang Huiwen, who was very active at the beginning of last year, dropped out, and the companies that have joined since then have begun to show differentiation.
Liang Wenfeng: Wang Huiwen took on all the losses himself to let the others walk away. He made a choice that was most unfavorable to himself but good for everyone, so he is a very generous person, which I admire.
[Note: Wang Huiwen is a co-founder of Meituan, the food-delivery giant. After ChatGPT, he came out of retirement to start an AI company, “Light Years Beyond” (光年之外), with $50 million of his own starting capital, and later sold it to Meituan at no gain. Most of the invested money was returned to the investors and Meituan took on much of Light Years Beyond's debt, while Wang's $50 million was simply lost.]

“DarkWaves”: Where do you focus most of your energy now?
Liang Wenfeng: My main focus is on the next generation of large models. There are still a lot of unresolved issues.

“DarkWaves”: Several other large model startups are insisting on doing both research and applications; after all, technology does not confer a permanent lead, and it is also important to seize the time window to turn a technical advantage into a product. Does DeepSeek dare to focus on model research because the models' capabilities are still not enough?
Liang Wenfeng: All the best practices belong to the previous generation, and may not hold in the future. Using the business logic of the Internet to discuss AI's future profit model is like discussing General Electric and Coca-Cola when Ma Huateng [founder of Tencent] was starting out. It would be fighting the next war with the last war's generals.

“DarkWaves”: High-Flyer has already demonstrated a strong technology and innovation DNA, and its growth has been relatively smooth. Is this why you are optimistic?
Liang Wenfeng: High-Flyer has somewhat boosted our confidence in technology-driven innovation, but it hasn’t always been a straight path. We’ve gone through a long process of accumulation. What the outside world sees is the post-2015 part of High-Flyer, but we’ve actually been doing it for 16 years.

“DarkWaves”: Back to the topic about original style innovation. Now that the economy is trending down, and capital is entering the cold phase of the cycle, will it put more of a damper on original innovation?
Liang Wenfeng: I don't think so. The restructuring of China's industry will rely more on hardcore technology innovation. When many people realize that the fast money they made in the past probably came from the luck of the draw, they will be more willing to knuckle down and do real innovation.

“DarkWaves”: So you’re optimistic about this too?
Liang Wenfeng: I grew up in a fifth-tier city in Guangdong in the 1980s. My father was an elementary school teacher, and in the 90s, there were a lot of opportunities to make money in Guangdong, and many parents came to my house at that time, basically because they thought education was useless. But when I go back to look at it now, the ideas have all changed. Because money is not easy to make anymore, even the chance to drive a cab may be gone. It has changed in one generation.
There will be more and more hardcore innovation in the future. It may not yet be easily understood, because the whole society still needs to be educated by the facts. Once this society lets the hardcore innovators make a name for themselves, the groupthink will change. All we still need are some facts, and a process.

----

<https://​​mp.weixin.qq.com/​​s/​​r9zZaEgqAa_lml_fOEZmjg>

# 揭秘DeepSeek:一个更极致的中国技术理想主义故事
暗涌Waves 2024年07月17日 02:01

文 | 于丽丽
编辑 | 刘旌

中国的7家大模型创业公司中,DeepSeek(深度求索)最不声不响,但它又总能以出其不意的方式被人记住。

一年前,这种出其不意源自它背后的量化私募巨头幻方,是大厂外唯一一家储备万张A100芯片的公司,一年后,则来自它才是引发中国大模型价格战的源头。

在被AI连续轰炸的5月,DeepSeek一跃成名。起因是他们发布的一款名为DeepSeek V2的开源模型,提供了一种史无前例的性价比:推理成本被降到每百万token仅 1块钱,约等于Llama3 70B的七分之一,GPT-4 Turbo的七十分之一。

DeepSeek被迅速冠以“AI界拼多多”之称的同时,字节、腾讯、百度、阿里等大厂也按耐不住,纷纷降价。中国大模型价格战由此一触即发。

弥漫的硝烟其实掩盖了一个事实:与很多大厂烧钱补贴不同,DeepSeek是有利润的。

这背后,是DeepSeek对模型架构进行了全方位创新。它提出的一种崭新的MLA(一种新的多头潜在注意力机制)架构,把显存占用降到了过去最常用的MHA架构的5%-13%,同时,它独创的DeepSeekMoESparse结构,也把计算量降到极致,所有这些最终促成了成本的下降。

在硅谷,DeepSeek被称作“来自东方的神秘力量”。SemiAnalysis首席分析师认为,DeepSeek V2论文“可能是今年最好的一篇”。OpenAI前员工Andrew Carr认为论文“充满惊人智慧”,并将其训练设置应用于自己的模型。而OpenAI前政策主管、Anthropic联合创始人Jack Clark认为,DeepSeek“雇佣了一批高深莫测的奇才”,还认为中国制造的大模型,“将和无人机、电动汽车一样,成为不容忽视的力量。”

在基本由硅谷牵动故事进展的AI浪潮里,这是罕有的情形。多位行业人士告诉我们,这种强烈的反响源自架构层面的创新,是国产大模型公司乃至全球开源基座大模型都很罕见的尝试。一位AI研究者表示,Attention架构提出多年来,几乎未被成功改过,更遑论大规模验证。“这甚至是一个做决策时就会被掐断的念头,因为大部分人都缺乏信心。”

而另一方面,国产大模型之前很少涉足架构层面的创新,也是因为很少有人主动去击破那样一种成见:美国更擅长从0-1的技术创新,而中国更擅长从1-10的应用创新。何况这种行为非常不划算——新一代模型,过几个月自然有人做出来,中国公司只要跟随、做好应用即可。对模型结构进行创新,意味着没有路径可依,要经历很多失败,时间、经济成本都耗费巨大。

DeepSeek显然是逆行者。在一片认为大模型技术必然趋同,follow是更聪明捷径的喧哗声中,DeepSeek看重“弯路”中积累的价值,并认为中国的大模型创业者除应用创新外,也可以加入到全球技术创新的洪流中。

DeepSeek的很多抉择都与众不同。截至目前,7家中国大模型创业公司中,它是唯一一家放弃“既要又要”路线,至今专注在研究和技术,未做toC应用的公司,也是唯一一家未全面考虑商业化,坚定选择开源路线甚至都没融过资的公司。这些使得它经常被遗忘在牌桌之外,但在另一端,它又经常在社区被用户“自来水”式传播。

DeepSeek究竟是如何炼成的?我们为此访谈了甚少露面的DeepSeek创始人梁文锋。

这位从幻方时代,就在幕后潜心研究技术的80后创始人,在DeepSeek时代,依旧延续着他的低调作风,和所有研究员一样,每天“看论文,写代码,参与小组讨论”。

和很多量化基金创始人都有过海外对冲基金履历,多出身物理、数学等专业不同的是,梁文锋一直是本土背景,早年就读的也是浙江大学电子工程系人工智能方向。

多位行业人士和DeepSeek研究员告诉我们,梁文锋是当下中国AI界非常罕见的“兼具强大的infra工程能力和模型研究能力,又能调动资源”、“既可以从高处做精准判断,又可以在细节上强过一线研究员”的人,他拥有“令人恐怖的学习能力”,同时又“完全不像一个老板,而更像一个极客”。

这是一次尤为难得的访谈。访谈里,这位技术理想主义者,提供了目前中国科技界特别稀缺的一种声音:他是少有的把“是非观”置于“利害观”之前,并提醒我们看到时代惯性,把“原创式创新”提上日程的人。

一年前,DeepSeek刚下场时,我们初次访谈了梁文锋 :《疯狂的幻方:一家隐形AI巨头的大模型之路》 。如果说当时那句「务必要疯狂地怀抱雄心,且还要疯狂地真诚」还是一句美丽的口号,一年过去,它已经在成为一种行动。

以下为对话部分

## Part 1: 价格战第一枪是怎么打响的?

「暗涌」:DeepSeek V2模型发布后,迅速引发一场血雨腥风的大模型价格战,有人说你们是行业的一条鲶鱼。
梁文锋:我们不是有意成为一条鲶鱼,只是不小心成了一条鲶鱼。

「暗涌」:这个结果让你们意外吗?
梁文锋:非常意外。没想到价格让大家这么敏感。我们只是按照自己的步调来做事,然后核算成本定价。我们的原则是不贴钱,也不赚取暴利。这个价格也是在成本之上稍微有点利润。

「暗涌」:5天后智谱AI就跟进了,之后是字节、阿里、百度、腾讯等大厂。
梁文锋:智谱AI降的是一个入门级产品,和我们同级别的模型仍然收费很贵。字节是真正第一个跟进的。旗舰模型降到和我们一样的价格,然后触发了其它大厂纷纷降价。因为大厂的模型成本比我们高很多,所以我们没想到会有人亏钱做这件事,最后就变成了互联网时代的烧钱补贴的逻辑。

「暗涌」:外部看来,降价很像在抢用户,互联网时代的价格战通常如此。
梁文锋:抢用户并不是我们的主要目的。我们降价一方面是因为我们在探索下一代模型的结构中,成本先降下来了,另一方面也觉得无论API,还是AI,都应该是普惠的、人人可以用得起的东西。

「暗涌」:在这之前,大部分中国公司都会直接copy这一代的 Llama结构去做应用,为什么你们会从模型结构切入?
梁文锋:如果目标是做应用,那沿用 Llama结构,短平快上产品也是合理选择。但我们目的地是AGI,这意味着我们需要研究新的模型结构,在有限资源下,实现更强的模型能力。这是scale up到更大模型所需要做的基础研究之一。除了模型结构,我们还做了大量其他的研究,包括怎么构造数据,如何让模型更像人类等,这都体现在我们发布的模型里。另外,Llama的结构,在训练效率和推理成本上,和国外先进水平估计也已有两代差距。

「暗涌」:这种代差主要来自哪里?
梁文锋:首先训练效率有差距。我们估计,国内最好的水平和国外最好的相比,模型结构和训练动力学上可能有一倍的差距,光这一点我们要消耗两倍的算力才能达到同样效果。另外数据效率上可能也有一倍差距,也就是我们要消耗两倍的训练数据和算力,才能达到同样的效果。合起来就要多消耗4倍算力。我们要做的,正是不停地去缩小这些差距。

Waves: Most Chinese companies choose to do both models and applications. Why has DeepSeek, for now, chosen to do only research and exploration?
Liang Wenfeng: Because we believe the most important thing right now is to participate in the global wave of innovation. For many years, Chinese companies were accustomed to others doing the technological innovation while we took it over to monetize through applications, but this should not be taken for granted. In this wave, our starting point is not to seize the chance to make a quick profit, but to reach the technological frontier and drive the development of the whole ecosystem.

Waves: The habitual belief most people carry from the internet and mobile internet eras is that America is good at technological innovation, while China is better at applications.
Liang Wenfeng: We believe that as the economy develops, China must also gradually become a contributor rather than a perpetual free rider. In the IT wave of the past thirty-odd years, we basically did not participate in real technological innovation. We got used to Moore's Law falling from the sky, as if you could lie at home and better hardware and software would arrive every 18 months. Scaling Law is being treated the same way.
But in fact, these things were created generation after generation by a tirelessly working, Western-led technology community. It is only because we did not take part in that process that we have ignored its existence.

## Part 2: The real gap is not one or two years, but the gap between originality and imitation

Waves: Why did DeepSeek V2 surprise so many people in Silicon Valley?
Liang Wenfeng: Among the vast number of innovations happening in America every day, this is a very ordinary one. They were surprised because a Chinese company joined their game as a contributor of innovation. After all, most Chinese companies are used to following, not innovating.

Waves: But in the Chinese context that choice is also extravagant. Large models are a heavy-investment game; not every company has the capital to pursue only research and innovation without first considering commercialization.
Liang Wenfeng: The cost of innovation is certainly not low, and the old habit of taking what's available also had to do with China's circumstances in the past. But today, whether you look at the size of China's economy or the profits of giants like ByteDance and Tencent, they are not low by global standards. What we lack for innovation is definitely not capital; what we lack is confidence, and the knowledge of how to organize a high density of talent for effective innovation.

Waves: Why do Chinese companies, including big players that don't lack money, so readily make rapid commercialization their first priority?
Liang Wenfeng: For the past thirty years we have emphasized only making money and neglected innovation. Innovation is not driven entirely by commerce; it also needs curiosity and the desire to create. We are merely bound by the inertia of the past, but that too is only a phase.

Waves: But you are, after all, a commercial organization, not a nonprofit research institute. If you choose to innovate and then share it through open source, where do you build a moat? Won't the MLA architecture innovation from this May, for instance, soon be copied by others?
Liang Wenfeng: In the face of disruptive technology, a moat formed by closed source is temporary. Even OpenAI's closed-source approach cannot prevent others from catching up. So we anchor the value in our team: our colleagues grow through the process, accumulate a great deal of know-how, and form an organization and culture capable of innovation. That is our moat.
Open-sourcing and publishing papers really don't cost us anything. For technical people, being followed is deeply satisfying. In fact, open source is more a cultural act than a commercial one. Giving is actually an extra honor, and a company that does this also has a cultural appeal.

Waves: How do you view the market-faith camp represented by people like Zhu Xiaohu?
Liang Wenfeng: Zhu Xiaohu is internally consistent, but his playbook suits companies that make money fast. If you look at the most profitable companies in America, they are all high-tech companies that accumulated strength for a long time before taking off.

Waves: But in large models, pure technical leadership rarely yields an absolute advantage. What is the bigger thing you are betting on?
Liang Wenfeng: What we see is that Chinese AI cannot remain a follower forever. We often say Chinese AI is one or two years behind America, but the real gap is the gap between originality and imitation. If that doesn't change, China will always be a follower, so some exploration is simply unavoidable.

NVIDIA's lead is not the effort of one company alone, but the result of the combined efforts of the entire Western technology community and industry. They can see the next generation of technology trends and hold the roadmap in their hands. The development of Chinese AI needs the same kind of ecosystem. Many domestic chips have failed to take off precisely because they lack a supporting technology community and have only second-hand information, so China inevitably needs people standing at the technological frontier.

## Part 3: More investment does not necessarily produce more innovation

Waves: Today's DeepSeek has the idealistic air of early OpenAI, and it is open source. Will you go closed source later? Both OpenAI and Mistral moved from open to closed source.
Liang Wenfeng: We will not go closed source. We believe that building a strong technology ecosystem first matters more.

Waves: Do you have fundraising plans? There are media reports that High-Flyer plans to spin DeepSeek off for an independent listing, and in Silicon Valley, AI startups also end up inevitably tied to the big players.
Liang Wenfeng: No fundraising plans in the short term. The problem we face has never been money, but the embargo on high-end chips.

Waves: Many believe that pursuing AGI and running a quant fund are entirely different things: quant can be done quietly, but AGI may require playing big and loud, forming alliances, so that you can scale up your investment.
Liang Wenfeng: More investment does not necessarily produce more innovation. Otherwise the big players would have monopolized all the innovation already.

Waves: Is the reason you're not building applications now that you lack the genes for operations?
Liang Wenfeng: We believe the current stage is an explosion of technological innovation, not an explosion of applications. In the long run, we hope to form an ecosystem in which the industry directly uses our technology and output: we handle only the base models and frontier innovation, and other companies build toB and toC businesses on top of DeepSeek. If a complete industrial chain forms upstream and downstream, there is no need for us to build applications ourselves. Of course, nothing stops us from doing applications if needed, but research and technological innovation will always be our first priority.

Waves: But when choosing an API, why would someone choose DeepSeek over a big player?
Liang Wenfeng: The future world is very likely one of specialized division of labor. Base models require continuous innovation, and the big players have their capability boundaries; they are not necessarily well suited to it.

Waves: But can technology really open up a gap? You yourself have said there are no absolute technical secrets.
Liang Wenfeng: There are no secrets in technology, but a reset takes time and cost. NVIDIA's GPUs hold, in theory, no technical secrets and are easy to copy, but reorganizing a team and catching up with the next generation of technology both take time, so the actual moat is still wide.

Waves: After your price cut, ByteDance was the first to follow, which suggests they felt some kind of threat. How do you view new ways for startups to compete with big players?
Liang Wenfeng: Honestly, we don't much care about this; it was just something we did in passing. Providing cloud services is not our main goal. Our goal is still to achieve AGI.
For now I see no new playbook, but the big players have no clear upper hand either. The big players have ready-made users, but their cash-flow businesses are also their burden, and that makes them candidates for disruption at any moment.

Waves: How do you see the endgame for the six large-model startups other than DeepSeek?
Liang Wenfeng: Perhaps two or three will survive. All are still in the cash-burning stage, so those with a clear self-positioning and more refined operations have a better chance of surviving. The other companies may be reborn in a new form. Things of value will not vanish into thin air, but they will take a different shape.

Waves: In the High-Flyer era, your posture toward competition was described as "going your own way," rarely caring about horizontal comparison. What is the starting point of your thinking about competition?
Liang Wenfeng: What I often think about is whether something can make society run more efficiently, and whether you can find a position you are good at within its industrial division of labor. As long as the endgame makes society more efficient, it holds up. Much of what happens in between is just a phase, and paying too much attention to it is bound to leave you dazzled.

## Part 4: A group of young people doing "inscrutable" work

Waves: Jack Clark, former policy director at OpenAI and co-founder of Anthropic, said DeepSeek has hired "a bunch of inscrutable wizards." What kind of people built DeepSeek V2?
Liang Wenfeng: There are no inscrutable wizards. They are fresh graduates from top universities, interns in their fourth or fifth year of a PhD, and some young people only a few years out of school.

Waves: Many large-model companies are obsessed with poaching talent from overseas, and many believe the field's top 50 talents may not be at Chinese companies at all. Where do your people come from?
Liang Wenfeng: No one on the V2 model came back from overseas; they are all homegrown. The top 50 talents may not be in China, but perhaps we can forge such people ourselves.

Waves: How did the MLA innovation come about? We heard the idea originally arose from a young researcher's personal interest?
Liang Wenfeng: After summarizing some mainstream patterns in how the Attention architecture has evolved, he had a sudden inspiration to design an alternative. But from idea to implementation was a long road. We put a team on it, and it took several months to get it working.

Waves: The birth of this kind of divergent inspiration is closely tied to your fully innovation-oriented organizational structure. In the High-Flyer era you rarely assigned goals or tasks top-down. But with AGI, a frontier exploration full of uncertainty, have more management actions crept in?
Liang Wenfeng: DeepSeek is also entirely bottom-up. And we generally don't pre-assign roles; division of labor emerges naturally. Everyone has a unique growth history and comes with their own ideas, so there is no need to push them. When they hit a problem while exploring, they pull others in to discuss it. That said, when an idea shows potential, we do allocate resources top-down.

Waves: We hear DeepSeek is very flexible about mobilizing GPUs and people.
Liang Wenfeng: There is no cap on any individual's access to GPUs or people. If someone has an idea, they can use the training cluster's GPUs at any time without approval. And because there are no hierarchies or departmental walls, they can also flexibly pull in anyone, as long as the other person is interested too.

Waves: Such a loose management style also depends on having selected a group of people driven by intense passion. We hear you are good at hiring based on details, so that people who excel by non-traditional metrics get picked out.
Liang Wenfeng: Our hiring criteria have always been passion and curiosity, so many of our people have unusual and interesting backgrounds. For many of them, the hunger to do research far outweighs any concern for money.

Waves: The Transformer was born in Google's AI Lab, and ChatGPT at OpenAI. In your view, how does the value of a big company's AI Lab differ from that of a startup when it comes to producing innovation?
Liang Wenfeng: Whether it is Google's labs, OpenAI, or even the AI Labs of China's big players, they are all valuable. That it was OpenAI that got there in the end also carries an element of historical accident.

Waves: Is innovation largely a matter of accident, then? I noticed that the row of meeting rooms in the middle of your office has doors on both sides that can be pushed open at will. Your colleagues said this leaves a crack for serendipity. In the birth of the Transformer there is a story of someone passing by, overhearing the discussion, joining in, and ultimately helping turn it into a general-purpose framework.
Liang Wenfeng: I think innovation is first of all a question of belief. Why is Silicon Valley so innovative? First of all, daring. When ChatGPT came out, the whole domestic scene lacked confidence in frontier innovation; from investors to the big players, everyone felt the gap was too wide and that it was better to do applications. But innovation requires self-confidence first. That confidence is usually more visible in young people.

Waves: But you don't raise funding and rarely speak publicly, so in terms of public visibility you surely trail the companies actively fundraising. How do you ensure DeepSeek is the first choice for people working on large models?
Liang Wenfeng: Because we are doing the hardest thing. What attracts top talent most is, without question, the chance to solve the world's hardest problems. In fact, top talent in China is underestimated. Because hardcore innovation is so scarce at the societal level, they have had no opportunity to be recognized. We are doing the hardest thing, and that is what makes us attractive to them.

Waves: OpenAI's recent release did not bring the long-awaited GPT-5. Many feel the technology curve is clearly slowing, and many have begun to question Scaling Law. What do you think?
Liang Wenfeng: We lean optimistic; the whole industry seems to be tracking expectations. OpenAI is not a god and cannot stay out in front forever.

Waves: How long do you think it will take to achieve AGI? Before releasing DeepSeek V2 you released models for code generation and mathematics, and you switched from dense models to MoE. So what are the coordinates on your AGI roadmap?
Liang Wenfeng: It could be two years, five, or ten; in any case it will happen within our lifetime. As for the roadmap, there is no consensus even inside our company. But we have indeed bet on three directions. One is mathematics and code, the second is multimodality, and the third is natural language itself. Mathematics and code are natural testing grounds for AGI, somewhat like Go: a closed, verifiable system in which very high intelligence might be achievable through self-learning alone. On the other hand, multimodality, participating in and learning from the real human world, may also be necessary for AGI. We remain open to every possibility.

Waves: What do you think the endgame of large models will look like?
Liang Wenfeng: There will be specialized companies providing base models and base services, with a long chain of specialized division of labor. More people will build on top of them to meet society's diverse needs.

## Part 5: All playbooks are products of the previous generation

Waves: There have been many changes in China's large-model startup scene over the past year. For instance, Wang Huiwen, still very active at the start of last year, withdrew midway, and the companies that joined later have begun to differentiate.
Liang Wenfeng: Wang Huiwen absorbed all the losses himself and let everyone else walk away unscathed. He made a choice that was worst for himself but good for everyone, so he is a very decent person, and I admire that about him.

Waves: Where does most of your energy go these days?
Liang Wenfeng: Mainly into researching the next generation of large models. Many problems remain unsolved.

Waves: The other large-model startups all insist on pursuing both, since technology won't deliver a permanent lead and seizing the window to turn a technical edge into product also matters. Does DeepSeek dare to focus on model research because its model capability is still insufficient?
Liang Wenfeng: All playbooks are products of the previous generation and may not hold in the future. Discussing future AI profit models with the commercial logic of the internet era is like discussing General Electric and Coca-Cola back when Ma Huateng was starting out. It is most likely a case of carving a notch in the boat to find the sword dropped overboard.

Waves: High-Flyer always had strong technology and innovation genes, and its growth was fairly smooth. Is that why you lean optimistic?
Liang Wenfeng: High-Flyer, to some degree, strengthened our confidence in technology-driven innovation, but the road was not all smooth either. We went through a long process of accumulation. What outsiders see is the post-2015 part of High-Flyer, but in fact we have been at this for 16 years.

Waves: Back to the topic of original innovation. With the economy entering a downturn and capital in a cold cycle, will that further suppress original innovation?
Liang Wenfeng: Not necessarily, I think. The restructuring of China's industry will rely more on hardcore technological innovation. When many people realize that the quick money of the past probably came from the luck of the era, they will be more willing to bend down and do real innovation.

Waves: So you are optimistic about this too?
Liang Wenfeng: I grew up in the 1980s in a fifth-tier city in Guangdong. My father was a primary school teacher. In the 1990s, Guangdong offered plenty of ways to make money, and quite a few parents came to our house, basically to say that studying was useless. But looking back now, those views have all changed. Because money is no longer easy to make; even the chance to drive a taxi may be gone. Within a single generation, everything changed.
Going forward there will be more and more hardcore innovation. It may not be easily understood right now, because society as a whole still needs to be educated by facts. When this society lets hardcore innovators achieve fame and fortune, the collective mindset will change. We just need a pile of facts and a process.
