AI is learning to lie, scheme, and threaten its creators

Photo: © AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models -- AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

J.Barnes--TFWP