机器知道我们知道它知道什么吗(中)
双语精读
His recent work on large language models uses classic theory of mind tests that measure the ability of children to attribute false beliefs to other people.
他(科辛斯基博士)最近对于大型语言模型的研究使用了经典的心智理论测试,这些测试衡量儿童理解他人的错误想法的能力。
A famous example is the Sally-Anne test, in which a girl, Anne, moves a marble from a basket to a box when another girl, Sally, isn't looking.
其中一个著名的例子是萨莉-安测试,在这个测试中,一个名叫安的女孩把一个玻璃弹珠从篮子里放到盒子里,而另一个女孩萨莉并没有看到这一过程。
To know where Sally will look for the marble, researchers claimed, a viewer would have to exercise theory of mind, reasoning about Sally's perceptual evidence and belief formation: Sally didn't see Anne move the marble to the box, so she still believes it is where she last left it, in the basket.
研究人员称,要知道萨莉会在哪里找玻璃弹珠,观众必须运用心智理论,推理出萨莉感知到了什么证据并如何形成了她的看法:萨莉没有看到安把玻璃弹珠放到盒子里,所以她认为玻璃弹珠还在之前的地方,而她之前把玻璃弹珠放在篮子里。
Dr. Kosinski presented 10 large language models with 40 unique variations of these theory of mind tests -- descriptions of situations like the Sally-Anne test, in which a person (Sally) forms a false belief.
科辛斯基博士给10个大型语言模型做了40种不同的这类心智理论测试 -- 描述了类似萨莉-安测试的情境,在这种情境下,某个人(萨莉)形成了错误的看法。
Then he asked the models questions about those situations, prodding them to see whether they would attribute false beliefs to the characters involved and accurately predict their behavior.
然后,他向模型提出有关这些情境的问题,试探它们,看看它们是否会认为相关角色产生了错误看法,并准确预测他们的行为。
He found that GPT-3.5, released in November 2022, did so 90 percent of the time, and GPT-4, released in March 2023, did so 95 percent of the time. The conclusion? Machines have theory of mind.
科辛斯基博士发现,2022年11月发布的GPT-3.5在90%的情况下能做到这些,而2023年3月发布的GPT-4在95%的情况下能做到这些。所以结论是什么?就是机器有心智理论。
But soon after these results were released, Tomer Ullman, a psychologist at Harvard University, responded with a set of his own experiments, showing that small adjustments in the prompts could completely change the answers generated by even the most sophisticated large language models.
但这些结果公布后不久,哈佛大学心理学家托默·乌尔曼用他自己的一组实验做出了回应,表示略微调整一下提示词,就可以让哪怕是最复杂精密的大型语言模型完全改变其生成的答案。
If a container was described as transparent, the machines would fail to infer that someone could see into it.
如果说容器是透明的,机器就无法推断出人们可以看到容器里有什么这一情况。
The machines had difficulty taking into account the testimony of people in these situations, and sometimes couldn't distinguish between an object being inside a container and being on top of it.
在这些情境中,机器很难把人们的说法(证词)考虑进去,有时甚至无法区分物体是在容器内部还是在容器顶部。
Maarten Sap, a computer scientist at Carnegie Mellon University, fed more than 1,000 theory of mind tests into large language models and found that the most advanced transformers, like ChatGPT and GPT-4, passed only about 70 percent of the time. (In other words, they were 70 percent successful at attributing false beliefs to the people described in the test situations.)
卡内基梅隆大学的计算机科学家马尔滕·萨普在大型语言模型中输入了1000多项心智理论测试,发现最先进的变换器,如ChatGPT和GPT-4,只在大约70%的情况下通过了这些测试。(换言之,它们理解测试情境所描述的人的错误看法的成功率为70%。)
The discrepancy between his data and Dr. Kosinski's could come down to differences in the testing, but Dr. Sap said that even passing 95 percent of the time would not be evidence of real theory of mind.
这一数据与科辛斯基博士的数据之间的差异可能归结为所做的测试不同,但萨普博士表示,即使在95%的情况下通过测试也不能证明拥有真正的心智理论。
Machines usually fail in a patterned way, unable to engage in abstract reasoning and often making "spurious correlations," he said.
他说,机器未通过测试是有一种模式的,机器不能进行抽象推理,而且经常在事物之间做出"虚假的关联"。
词汇预习
psychologist [高考]
美[saɪˈkɑːlədʒɪst] | 英[saɪˈkɒlədʒɪst]
n.心理学研究者,心理学家
viewer [高考]
美[ˈvjuːər] | 英[ˈvjuːə(r)]
n. 观看者;电视观众;观察器
engage [高考]
美[ɪnˈɡeɪdʒ] | 英[ɪnˈɡeɪdʒ]
v. 答应,预定,使忙碌,雇佣,订婚,啮合;吸引住(注意力、兴趣)
accurately [高考]
美['ækjərətlɪ] | 英['ækjərətlɪ]
adv. 准确地;精确地
formation [高考]
美[fɔːrˈmeɪʃn] | 英[fɔːˈmeɪʃn]
n. 形成;队形;编队;构造;[地]地层
data [高考]
美[ˈdeɪtə] | 英[ˈdeɪtə]
n. 数据;资料
unique [高考]
美[juˈniːk] | 英[juˈniːk]
adj. 独特的;独一无二的;稀罕的
evidence [高考]
美[ˈevɪdəns] | 英[ˈevɪdəns]
n. 证据;证词;根据;迹象 v. 证明;证实
belief [高考]
美[bɪˈliːf] | 英[bɪˈliːf]
n. 信念;信仰;相信
theory [高考]
美[ˈθiːəri] | 英[ˈθɪəri]
n. 学说;理论;原理;意见
classic [高考]
美[ˈklæsɪk] | 英[ˈklæsɪk]
n. 杰作;古典作品;第一流艺术家 adj. 最优秀的;传统的;古典的
distinguish [高考]
美[dɪˈstɪŋɡwɪʃ] | 英[dɪˈstɪŋɡwɪʃ]
vt. 区别;辨认;使显著
transparent [高考]
美[trænsˈpærənt] | 英[trænsˈpærənt]
adj. 透明的;明显的;清晰的
container [高考]
美[kənˈteɪnər] | 英[kənˈteɪnə(r)]
n. 容器;集装箱
false [高考]
美[fɔːls] | 英[fɔːls]
adj. 假的;人造的;不真实的;错误的;虚伪的 adv. 欺骗地
abstract [高考]
美[ˈæbstrækt] | 英[ˈæbstrækt]
adj. 抽象(派)的,纯理论的 n. 摘要,概要; 抽象派艺术作品 v. 提取,抽取,摘取; 写摘要
marble [高考]
美[ˈmɑːrbl] | 英[ˈmɑːbl]
n. 大理石;弹子,弹珠 vt. 使有大理石的花纹
engage in [高考]
美[ɪn'gedʒ ɪn] | 英[ɪnˈɡeɪdʒ in]
从事; 参加
attribute [高考]
美[əˈtrɪbjuːt] | 英[əˈtrɪbjuːt]
vt. 把 ... 归于 n. 属性;标志;象征;特征
sophisticated [高考]
美[səˈfɪstɪkeɪtɪd] | 英[səˈfɪstɪkeɪtɪd]
adj. 老练的;精密的;复杂的;久经世故的
where [高考]
美[wer] | 英[weə(r)]
adv. 在哪里;在那个地方 conj. 在 ... 地方
would [高考]
美[wʊd , wəd] | 英[wʊd , wəd]
aux. 将;可能;大概;总会;愿意;will的过去式
fail in [四级]
美[fel ɪn] | 英[feil in]
v. 在 ... 上失败; 变弱
on top of [四级]
美[ɑːn tɑːp əv] | 英[ɒn tɒp ɒv]
在 ... 之上;停留在…之上;加之;控制住;对…了如指掌
come down to [四级]
美[kʌm daʊn tu] | 英[kʌm daʊn tu]
可归结为
see into [四级]
美[si ˈɪntu] | 英[si: ˈɪntuː]
看到 ... 的内部;调查; 了解 ... 的性质
advanced [四级]
美[ədˈvænst] | 英[ədˈvɑːnst]
adj. 先进的;高级的
involved [四级]
美[ɪnˈvɑlvd] | 英[ɪnˈvɒlvd]
adj. 涉及的;牵连的;复杂的;感情投入的;有密切关系的
testimony [六级]
美[ˈtestɪmoʊni] | 英[ˈtestɪməni]
n. 证明;证据
discrepancy [专四]
美[dɪsˈkrepənsi] | 英[dɪsˈkrepənsi]
n. 差异;不一致;分歧
spurious [专八]
美[ˈspjʊriəs] | 英[ˈspjʊəriəs]
adj. 假的;伪造的
perceptual [考研]
美[pərˈseptʃuəl] | 英[pəˈseptʃuəl]
adj. 感性的;知觉的
重点讲解
His recent work on large language models uses classic theory of mind tests that measure the ability of children to attribute false beliefs to other people.
他(科辛斯基博士)最近对于大型语言模型的研究使用了经典的心智理论测试,这些测试衡量儿童理解他人的错误想法的能力。
false adj.
1. 错误的;不符合事实的
【例】She gave false information to the insurance company.
她向保险公司提供了不真实的资料。
2. 虚假的;不符合实际的
【例】a false impression/hope/economy 错误的印象/虚假的希望/看似省钱,其实不划算
A famous example is the Sally-Anne test, in which a girl, Anne, moves a marble from a basket to a box when another girl, Sally, isn't looking.
其中一个著名的例子是萨莉-安测试,在这个测试中,一个名叫安的女孩把一个玻璃弹珠从篮子里放到盒子里,而另一个女孩萨莉并没有看到这一过程。
To know where Sally will look for the marble, researchers claimed, a viewer would have to exercise theory of mind, reasoning about Sally's perceptual evidence and belief formation: Sally didn't see Anne move the marble to the box, so she still believes it is where she last left it, in the basket.
研究人员称,要知道萨莉会在哪里找玻璃弹珠,观众必须运用心智理论,推理出萨莉感知到了什么证据并如何形成了她的看法:萨莉没有看到安把玻璃弹珠放到盒子里,所以她认为玻璃弹珠还在之前的地方,而她之前把玻璃弹珠放在篮子里。
exercise v. 运用;行使(权利、能力等)
【例】exercise one’s brain/influence/legal rights/self-control 开动脑筋/施加影响力/行使合法权利/运用自制力
perceptual adj. 感知上的;知觉上的
【例】The exercises emphasize both perceptual and logical skills.
这个练习既注重培养感知能力,也注重培养逻辑能力。
【辨析】conceptual adj. 概念上的;观念上的
【拓展】perceive v. 感觉到;察觉到;把……看作
Dr. Kosinski presented 10 large language models with 40 unique variations of these theory of mind tests -- descriptions of situations like the Sally-Anne test, in which a person (Sally) forms a false belief.
科辛斯基博士给10个大型语言模型做了40种不同的这类心智理论测试 -- 描述了类似萨莉-安测试的情境,在这种情境下,某个人(萨莉)形成了错误的看法。
variation n.
1. (数量、水平、程度等的)变化,变动
【例】A mother's ears can hear even the slightest variation in her baby's breathing.
宝宝的呼吸哪怕发生了最轻微的变化,妈妈的耳朵都可以听到。
2. 变种;变体;变化了的形式
【例】The films she makes are all variations on the same theme.
她拍的电影都是对同一个主题的不同演绎。
【辨析】variety n. 不同种类;多样性
Then he asked the models questions about those situations, prodding them to see whether they would attribute false beliefs to the characters involved and accurately predict their behavior.
然后,他向模型提出有关这些情境的问题,试探它们,看看它们是否会认为相关角色产生了错误看法,并准确预测他们的行为。
prod v. & n.
1. 作动词:戳;捅;催促
【例】He prodded at his food with a fork, but he didn't eat a mouthful.
他拿叉子戳弄着食物,但一口没吃。
【例】The movie prodded the audience into thinking about the status quo.
这部电影促使观众思考现状。
2. 作名词:戳;捅;催促
【例】give sb a prod 戳某人一下;催某人一下
He found that GPT-3.5, released in November 2022, did so 90 percent of the time, and GPT-4, released in March 2023, did so 95 percent of the time. The conclusion? Machines have theory of mind.
科辛斯基博士发现,2022年11月发布的GPT-3.5在90%的情况下能做到这些,而2023年3月发布的GPT-4在95%的情况下能做到这些。所以结论是什么?就是机器有心智理论。
But soon after these results were released, Tomer Ullman, a psychologist at Harvard University, responded with a set of his own experiments, showing that small adjustments in the prompts could completely change the answers generated by even the most sophisticated large language models.
但这些结果公布后不久,哈佛大学心理学家托默·乌尔曼用他自己的一组实验做出了回应,表示略微调整一下提示词,就可以让哪怕是最复杂精密的大型语言模型完全改变其生成的答案。
If a container was described as transparent, the machines would fail to infer that someone could see into it.
如果说容器是透明的,机器就无法推断出人们可以看到容器里有什么这一情况。
infer v. 推断;推论
【例】I inferred from her expression that she wanted to leave.
我从她的表情推断出她想离开。
【近义词】deduce v. 推理;演绎
【辨析】imply v. 暗示;含有……意思
【例】Her expression implied that she wanted to leave.
她的表情暗示她想离开。
The machines had difficulty taking into account the testimony of people in these situations, and sometimes couldn't distinguish between an object being inside a container and being on top of it.
在这些情境中,机器很难把人们的说法(证词)考虑进去,有时甚至无法区分物体是在容器内部还是在容器顶部。
testimony n.
1. 证词;口供
【例】give false testimony 作伪证/提供假口供
2. 证据;证明
【例】This increase in exports bears testimony to the successes of industry.
出口增长证明了工业的成功。
【拓展】testify v. 作证;证明
Maarten Sap, a computer scientist at Carnegie Mellon University, fed more than 1,000 theory of mind tests into large language models and found that the most advanced transformers, like ChatGPT and GPT-4, passed only about 70 percent of the time. (In other words, they were 70 percent successful at attributing false beliefs to the people described in the test situations.)
卡内基梅隆大学的计算机科学家马尔滕·萨普在大型语言模型中输入了1000多项心智理论测试,发现最先进的变换器,如ChatGPT和GPT-4,只在大约70%的情况下通过了这些测试。(换言之,它们理解测试情境所描述的人的错误看法的成功率为70%。)
advanced adj.
1. 先进的
【例】advanced technology 先进的技术
【反义词】backward 落后的
2. 高级的;高阶的;高等的
【例】advanced mathematics/an advanced class in high school 高数/高中的快班
The discrepancy between his data and Dr. Kosinski's could come down to differences in the testing, but Dr. Sap said that even passing 95 percent of the time would not be evidence of real theory of mind.
这一数据与科辛斯基博士的数据之间的差异可能归结为所做的测试不同,但萨普博士表示,即使在95%的情况下通过测试也不能证明拥有真正的心智理论。
discrepancy n. 差异;不一致
【例】The committee is reportedly unhappy about the discrepancy in numbers.
数据之间有出入,据说委员会对此很不高兴。
【辨析】disparity n. 不同;不等;差异
【例】the wide disparity between rich and poor/income disparity 贫富悬殊/收入差距
come down to 可归纳为;可归结为
【例】People talk about various reasons for the company's failure, but it all comes down to one thing: a lack of leadership.
人们讨论了公司倒闭的各种原因,但归结起来就是一个问题:群龙无首。
【近义词】boil down to 归结为;基本问题是
【例】What it all boils down to is a lack of communication.
一切问题归结起来就是缺乏沟通。
Machines usually fail in a patterned way, unable to engage in abstract reasoning and often making "spurious correlations," he said.
他说,机器未通过测试是有一种模式的,机器不能进行抽象推理,而且经常在事物之间做出"虚假的关联"。
课后练习
The Sally-Anne test measures the ability to ____.
A. observe the details in the surroundings
B. make judgement about daily situations
C. distinguish between facts and opinions
D. construct different thinking processes
Dr. Kosinski measured the theory of mind of machines by ____.
A. asking them to create variations of the Sally-Anne test
B. asking them to single out the original Sally-Anne test
C. testing their understanding of the characters’ views
D. testing their ways of dealing with similar situations
Tomer Ullman found that the machines ____.
A. failed to understand more complicated prompts
B. were unstable and changed their answers often
C. could be confused about too many details
D. tended to ignore implicit information
According to the text, Dr. Sap believed that ____.
A. Dr. Kosinski's tests were faulty and problematic
B. real theory of mind could not be measured by data
C. machines lacked some essential mental capacities
D. machines considered things in isolated ways