AI SREs, Chat With Your Infrastructure with Anyshift
E10

AI SREs, Chat With Your Infrastructure with Anyshift

I'd love to really have, like, morning summaries that are essentially read to me by either a British or Australian accent AI, because that's gotta be the killer feature.

Let's let's strip this.

I got a French accent over here.

I got a South African accent over here.

I want all the robots to have more
personality so that I, I recognize their

voice when we're in a group call with
nothing but me and the agents, and one

of them knows my infrastructure and one
of them knows my GitHub situation and

One of them's cooking my,
my breakfast downstairs.

I don't know what's gonna happen.

Welcome back to the
Agentic DevOps podcast.

I am your host, Bret Fisher. I'm excited to talk about anyshift.io,

the co-founders Roxane Fischer and

Stephane Jourdan are creating an AI SRE, and I think I predicted, somewhere around the beginning of the year, that 2025 was the year of agents, the year of Claude Code.

It was us figuring out how to move
beyond chatbots and LSP tab completion

of AI to actually having conversations
with AI and generating, through

agents, generating code, right?

And also generating YAML and
HCL and markdown and all the

things that we do in DevOps.

But this is the year that
we figure out context.

And Anyshift is a perfect example of that.

They're a year old, roughly a year old
startup that is dealing with the context

problem of infrastructure in that if
you want a current sense of the entire

infrastructure, and if you're gonna
shove that into the context of an AI,

you're gonna need to do some things.

You're gonna need to have read-only keys to your cloud, AWS, whatever.

You're gonna need to
have access to GitHub.

You might need access to monitoring
solutions, logging solutions.

Uh, you might need access to git
ops and deployment solutions.

And you need to gather
all that data together.

You need to create summaries and
memories and basically a bunch of

tokens that you're gonna have to give
the AI to help it understand what's

going on in infrastructure, because
AI is coming for infrastructure too.

It's coming for DevOps and
SREs and platform engineers.

And that's why we started this podcast a year ago: we kind of predicted that it was coming for developers, in code first, because that was sort of the nature of what the model companies were providing us; they were focusing on code. But now people are taking that and figuring out how do we shove the

context of operations into the context
window of these AI LLMs and get out some

usable data like troubleshooting, like
predicting different sorts of outages

that might potentially happen soon.

And like how to ensure that we're
fixing things so that errors

and issues don't happen again.

All the stuff that's the
concern of SRE and operators.

So we dive in deep. We talk a little bit about the features and the reason why Anyshift exists, but I was much more interested in how they're trying to solve the problems of context management and memory management: how do we avoid hallucinations, how do we protect ourselves from any sort of mistakes?

And we get deep into all
that in this episode.

So I am glad to have them on
the show and let's get into it.

Welcome to the show.

So on the right over there we've got
Roxane Fischer, no relation, the CEO

and co-founder of Anyshift, anyshift.io.

And there in the middle.

Stephane Jourdan (I'm horrible at my French accents; I'm gonna try to do better next time), who's the CTO.

So these two co-founded Anyshift.

Welcome to the show.

Thank you.

Yeah.

Roxane, so when did you all start this? When, how... what's the born-on date of Anyshift?

Ah, it's a long story.

Let's say that the beginnings were two years ago, when I met Stephane in a small coffee shop in Paris. We were both at the starting point where we wanted to create a new journey: Stephane with his production background and me with my AI background, on a deep tech problem to solve. And from day one, we knew we wanted to do something about data and context, even before the context engineering trend.

And that's how it started.

Nice.

Did you say coffee shop?

Very cute coffee shop in Paris.

So, Stephane, tell me, what's the elevator pitch? Like, let's say you're a platform engineer or SRE type, you've got to manage production.

You've gotta maintain
Kubernetes and the cloud stuff.

You're luckily not someone who's saddled with this as a part-time job amongst many other jobs.

Like you're a dedicated ops engineer.

What is Anyshift gonna do for me?

Yeah, so what I would say to this
person is that in 2026, our jobs did

not become easier and they actually
became like much more complex.

We are now managing probably many more services; everything, the architectures, became more complex.

We have many different clusters, autoscalers, we have serverless things, we now have agents making decisions for us.

So things became very, very complex.

Our teams did not become larger, probably the contrary, actually.

And so we really need that knowledge. We need to offload a lot of things to very capable agents or services. And luckily, we have a set of products that we can build: a lot of things around knowledge, solving our problems, and managing the complexity for us.

Yeah, a year ago, I think maybe it
was like a year and a half ago, I

started seeing companies showing up
at KubeCon or just in the operator

space talking about managing agents.

And I was naive and I was new to all of it, like a lot of us, but everybody was using this word "agents" and overloading it. And I still feel like it's used for talking about a dozen different unrelated things.

But at the time, I felt like there were these companies that were letting us host agents that we were gonna run in our infrastructure, run in production.

And maybe not managing production
with it, but just agents running

somewhere in, in our infrastructure.

And then around the same time, probably even up to three years ago, at KubeCon, we started hearing people talking about it, even in the keynotes. I felt like we were joking that the keynotes were all about AI when no one was actually using or running AI yet.

But they were really focused on the inference side and the model building and managing GPUs, the sort of running of AI infrastructure, and I didn't have a dog in that hunt.

Like I wasn't really doing any of that.

None of my customers or clients were
managing their own GPU infrastructure.

They were just sort of using the
SOTA models out there and all of the

APIs that everyone else provided.

And so I felt like it wasn't really until about a year ago that I started to see startups at the show that were saying: we're not here to help you manage the AI infrastructure.

We're here to help you use AI
to manage the infrastructure.

Which is a subtle difference of words,
but completely separate jobs, right?

What gave you the initial idea? Obviously, it's hard.

I feel like this is a hard problem
to solve, otherwise we wouldn't have

entire startups dedicated to it.

What is one of the biggest challenges, besides just running your AI in your local harness and having it fill out your HCL and YAML files and, you know, executing CLIs for you?

Like what's the hard part about it
seeing my infrastructure and making

intelligent decisions about that?

I really believe it'll be the context.

It took us one year, even a little bit more than that, to build this context graph that is underneath Anyshift, which is a reconciliation between dozens of different sources of data. Today, where we really specialize would be: how do we make one unified source of truth, through time and history, of your context? Context means cloud providers, Kubernetes clusters, codebases, infrastructure-as-code bases, monitoring hosts.

And when you need to do day-to-day tasks related to your production, you need to have the understanding of which configuration had an impact, and a blast radius, on the service or on this specific error: everything that happened through time. And so how do you manage to get these dependencies, these connections, through time?

Today there are a lot of AIOps solutions, and a lot of them are plugged into different sources of data.

What was really hard for us, and you can see this graph that we're building, is how do you make the connection between those different universes, as we call them. Here you can see AWS, GitHub, Kubernetes, to understand, when you make a change, what the blast radius is.

When you need to deal with production at scale, this context is booming. It's like tens of billions of context tokens.

You cannot put that in one single LLM.

And also, imagine you could put all this data in one context window: you would only get the correlation between data sources.

But when you need to deal with incidents or make a change, you don't need the correlation.

You need the causality.

And building this causality graph is what was really tough, and took some time to build.
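The correlation-versus-causality distinction is easier to see in miniature. Here is a hedged sketch, not Anyshift's actual implementation: a tiny directed graph whose nodes are resources from different "universes" (all node names below are invented for illustration), where the blast radius of a change is a simple reachability traversal.

```python
from collections import defaultdict

class ContextGraph:
    """Toy causality graph linking resources across 'universes'
    (e.g. GitHub -> Terraform -> AWS -> Kubernetes)."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> nodes a change can propagate to

    def add_dependency(self, cause, effect):
        """Record that a change to `cause` can propagate to `effect`."""
        self.edges[cause].add(effect)

    def blast_radius(self, node):
        """Everything transitively affected by a change to `node`."""
        seen, stack = set(), [node]
        while stack:
            for nxt in self.edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

g = ContextGraph()
# A PR changes a Terraform module, which manages an AWS security group,
# which a Kubernetes service depends on, which an alert watches.
g.add_dependency("github:pr-421", "terraform:module.network")
g.add_dependency("terraform:module.network", "aws:sg-0abc")
g.add_dependency("aws:sg-0abc", "k8s:svc/payments")
g.add_dependency("k8s:svc/payments", "alert:api-500s")
```

Walking the same structure backwards from an alert is the root-cause side of the idea: causality is directional, where a correlation between data sources is not.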

Yeah, I feel like right now we've got developers very much focused on the local harness, and in our Agentic DevOps Guild call this week, we were talking about what goes in the repo now, right? It seems like one of the early answers that we have is to just dump in a bunch of markdown: every bit of documentation, plans, intent, schemas, specs. It's all going in the repo with the code.

And I know that that's not really scalable for those of us on the ops side.

Like, we have things everywhere
and multiple systems.

MCP isn't gonna solve
all those issues for us.

And there isn't, like, a 10-million-token AI model that I'm aware of.

Like, we don't have infinite ability to just throw everything at an AI prompt and say, okay, now figure all this out.

Right?

And so I think the problem isn't as well defined for those of us on the ops side. There aren't as many people talking about solutions and architectures, about how they've solved this.

I expect that to probably be solved in the next year, where we're gonna see a ramp-up of a lot more content, of people telling stories.

But even right now, I feel like there aren't a lot of stories out there of people talking about how exactly they used AI to automate their infrastructure, or to accelerate their troubleshooting, or to, you know, reduce the number of PagerDuty midnight calls.

Right.

Like that kind of thing.

And I'm curious, since you've already got customers: what are some of the callouts, some of the big stories that you can talk about, in terms of how this helped teams or gave them an aha moment?

That's definitely... our customers find different value in the different features. One of the values is in gathering all this complex data very quickly, like in seconds.

There's this PagerDuty alert ringing, and sure, as a trained human, you are, I dunno, a senior SRE and you know all the things.

You've been there on the job for the past
six years, for example, and great for you.

Everything is easy for you.

And in, I dunno, 15 to 20 minutes, you can query the logs, look at the Datadog dashboards. You can query, you know, who created this pull request that changed this specific thing, et cetera.

Guess what?

It's not the reality out there.

We are not always senior engineers.

We don't know everything.

We probably had this specific deployment late at night on a Friday evening, and nobody knew about it.

And having all this complexity brought to you automatically, by knowing the changes, knowing the deployments, knowing the past errors that led to maybe a causal chain of problems. The slowly increasing rate of issues that was driving, in the end, to another issue that, I dunno, created this database downtime that in the end was the actual cause of your 500 errors on the API.

All of this is so complex today that having a system that really gathers all this information for you and presents it as something you can digest, really process as a human, means you can just act upon it and resolve that problem very quickly.

Yeah, the complexity.

So, how am I interfacing with this? This is a question I've been thinking about with a friend of mine, Viktor Farcic, who runs the DevOps Toolkit channel.

We were talking a couple weeks ago, and he has this theory, that I think I still agree with, that we're gonna be so used to talking with the chatbot in our harness. Like, we're essentially writing everything, whether it's code or YAML or, you know, documentation.

We're writing all that through our local harness, and that needs to become the primary window to everything,
rather than us having 20 different chatbots in 20 different siloed places, where one doesn't know about the other and there's no shared context or shared memories.

Do you see that as the future? Is this something where I can plug it into my local harness so it can interface? Or am I using this through Slack? I'm just kinda thinking: what are the different ways that I can interface with this system, as the human-to-AI interface, I guess?

Yeah.

Yeah.

I'm smiling a lot, so let me take this one, because I have this conversation very often. Today you can use Anyshift in various forms.

We have a CLI, we have an MCP,
a web app, and a Slack app.

And it really depends on your needs.

In the web app, you can plot some diagrams of your infrastructure. You can create reports about anomalies that proactively go fetch information for you. You can view the root cause analysis being performed live, a live graph of hypotheses with all the facts that we gathered, if you are a visual person.

But I need to say that I guess our power users really love us in the CLI and MCP.

People love Claude Code
and we love it too.

And so, how do we bring this context into your day-to-day workflow? As you were saying, what's kind of the future? Are we still in a person-to-agent interface?

Am I a person who is going to ask a question in the chatbot of one specific product, on a web app, or do I want my agent to actually interface with another agent, to get this context to perform a task?

And this is what we see at the
moment and how things evolve.

We have some customers who asked us for an agent-to-agent protocol: our agent, which has the entire context about the different deployments, commits, all the changes through time, this agent that has the context can decide how to trigger the Dynatrace agent, for instance, to be the master of other agents.

I don't know how fast this transition to completely automated agent-to-agent will happen, but there is some automation happening already, and so we need to be in different places, depending on different needs, to provide this context to the different agents.

So yeah, it's like meet you where you are.

If you're used to using the
website dashboard, if that's

your thing, you can do that.

If you're someone who wants to use a CLI locally... I guess I'd maybe create a skill or something that would know about the CLI, so that I could just tell my local harness, hey, go find this information on Anyshift.

Is that kind of the idea there with the CLI?

Exactly.

Yeah.

Yeah.

'Cause maybe we haven't stated that on the show before, but I feel like in 2026, a lot of what we were trying to use MCP for a year ago, you know, when it was the new hotness and everybody was MCP-ing everything, we had way too many MCP tools.

I think at some point I looked at my VS Code, where all the extensions are now shipping with MCP tools. I was tinkering around with Copilot, and I clicked on the MCP tools, and I think I had like 150 tools loaded in my context, none of which I actually put in myself or knew about, because they all came with the VS Code extensions.

And so then I had to basically uncheck them all, because I didn't realize they were all filling up my context window, essentially.

And now it feels like we're leaning towards: actually, the AI's really good with just bash. Just give it bash, tell it where some tools are, maybe CLI tools, and that's the way it can access things.

So I've completely shifted my workflow.

I don't know if that's what everybody's doing, but for a lot of things that I was trying to use MCP for before, I'm now just downloading the CLI tool and making sure that the AI knows it can use it when I'm asking it a question, rather than expecting it to use some MCP.
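The CLI-over-MCP pattern described here can be sketched as a thin allow-list wrapper: instead of loading one MCP tool per integration, the harness gets bash plus a short list of CLIs it may call. This is a minimal sketch under assumptions; the tool names below, including an `anyshift` CLI, are placeholders, not confirmed commands.

```python
import shlex
import subprocess

# Hypothetical allow-list of CLIs the harness may shell out to.
ALLOWED_CLIS = {"git", "kubectl", "anyshift"}

def run_tool(command: str, timeout: int = 30) -> str:
    """Run an allow-listed CLI command and return plain text,
    ready to drop into the agent's context window."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_CLIS:
        return f"refused: {argv[0] if argv else '(empty)'} is not an allowed tool"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else f"error: {result.stderr.strip()}"

# The agent can call e.g. run_tool("kubectl get pods -A"), but anything
# off the list is rejected before it ever reaches a shell.
```

The allow-list is the whole point of the design: the agent keeps the flexibility of bash, while the operator keeps a one-line inventory of what it can actually touch.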

We just recently had, I think, Google launch their Workspace CLI, so now I can have it access my email and my Google Docs, all through a CLI.

And that kind of becomes the default for me now. The first thing I do, if I'm looking at a tool, if I'm signing up for something like Anyshift, whether it's AI or not, is look for them to have a CLI, so I can give that to my harness to do the work, rather than trying to find out if they have an MCP.

Do you see that same trend with your customers and your part of the industry?

Yeah, yeah.

One of them... I have in mind one specific customer. He's supercharged his workflow; he has so much to do, many different products to manage.

So many problems to tackle
with such a small team.

So he supercharged himself with all
the different CLIs he could use.

He created skills very dedicated
to his way of working, et cetera.

And he's gathering every single bit of information, of context, that he can bring to his workstation through his Claude Code, so he can tackle more problems, projects, products, et cetera.

So that's really one case, and it's huge for him, on the laptop. And another use case we actually learned recently, from one of our largest customers: they built something around the Anyshift MCP so that all that knowledge, all that information, they use it to create something very specific for them.

So it's a super agent based on that source of information, and it's central to helping them fix problems earlier.

Because in the end, when you are managing products and systems, you really need to solve the problems as fast as possible.

And today it's possible, if you have the right information at the right moment. Just querying logs right now is probably not gonna help you. It's probably something that happened, like, five years ago, or five days ago, or five hours ago that was the culprit.

So let me make sure I understand. You're saying they built their own custom agent, running somewhere, and it has access to your MCP. Is that what you're saying?

And so that agent is pulling data
out on its own from your graph.

Yeah, and I think I read on your website that with your graph, you're doing this all read-only, right? Like, these have read-only credentials.

We're not really talking about this thing
going and tearing down my infrastructure

automatically for me, right?

Like, this is something that's really
meant for investigation and analysis

rather than command and control.

Is that the goal here with Anyshift?

Well, well, well...

Well, let's talk roadmap.

Actually, we were probably too cautious. I was probably the most cautious, between Roxane and I, and I was like, no, you can't do destructive things so early.

And actually, our customers are asking for it now. They really want it. They ask: okay, we have this specific runbook, we have this issue.

It's waking up, like, every other Tuesday at 3:00 AM, and we know that when this rings, we have to do this action, and we have no way of knowing when it's gonna ring, but it's gonna ring at 2:00 AM. If Any could just run this specific runbook, this action, right now, it would save one guy's night of sleep.

And so we are actually working on making Any do actions, because people are actually ready. Like, it's April 2026, and people are ready to have agents do actions on their production systems, under control, obviously.

But it's the same as when you hired someone new on your team: you want them to be ready as fast as possible, and with all that information you can do pretty accurate things already.

Yeah, actually, by the time this podcast comes out, the Mendral one with Sam Alba will be out, because we recorded that, I think, last week, and we had a similar conversation. I'm using their product to help manage my GitHub and automate some of the toil of GitHub management.

And they didn't have any... it wasn't taking actions yet, right? It was just incident reporting and analysis.

And then it would give you, sort of, the ability for it to do something, but you had to approve, right? You had to go in and read and make sure you wanted to do that thing, and then it would go do that thing.

And the thing it was doing, at most, was a PR. Like, it wasn't even committing to main or anything like that.
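That propose-then-approve flow is a common human-in-the-loop shape. A minimal sketch (all names invented for illustration, not Mendral's or Anyshift's actual API): the agent queues proposed actions, nothing runs until a human approves it, and everything that does run lands in an audit log.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    command: str
    approved: bool = False

class ApprovalGate:
    """Agents propose; humans approve; only approved actions execute."""

    def __init__(self):
        self.queue = []   # actions awaiting human review
        self.log = []     # audit trail of commands that actually ran

    def propose(self, description: str, command: str) -> ProposedAction:
        action = ProposedAction(description, command)
        self.queue.append(action)
        return action

    def approve(self, action: ProposedAction) -> None:
        action.approved = True

    def execute(self, runner) -> None:
        """Run approved actions via `runner`; leave the rest queued."""
        for action in [a for a in self.queue if a.approved]:
            runner(action.command)
            self.log.append(action.command)
            self.queue.remove(action)

gate = ApprovalGate()
restart = gate.propose("Restart stuck worker", "kubectl rollout restart deploy/worker")
gate.propose("Drop legacy table", "psql -c 'DROP TABLE legacy'")
gate.approve(restart)     # the human says yes to the safe action only

ran = []
gate.execute(ran.append)  # only the approved restart "runs"
```

Sliding toward YOLO mode then amounts to auto-approving progressively broader classes of actions while keeping the audit log, which is what makes the progressive-trust model discussed next auditable.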

And I wanted him to go faster.

I was like, when are we gonna have the ability for it to just auto-fix?

And I feel like... I'm imagining this as sort of a linear progression, a maturity model that's really just this line that I'm dragging: on one side it starts in the blue, and then as I drag it to the right, it goes redder and redder, until the end is basically YOLO mode.

And I feel like all of us over the last six months have gone through that similar process of learning Claude Code and some of the other agentic harnesses. I know enough people now that I know some who are very much in full safety mode; they still want Copilot to ask them about every command before it does it.

And then there's others that
are going full off on safety

and they just let it run.

I mean, there's obviously the
OpenClaw people that are just,

you know, a whole nother level.

But everyone is at their own comfort level, and I tend to find that, just like with a junior engineer, we all need to trust the model and the harness, and we're just gonna call that the AI agent. We need to trust it, and so we just need time with it. We just need to watch it not hallucinate for weeks or months on end before we're willing to give it more work.

And then we give it a little bit more, and then we maybe adjust our skills and our Claude file or whatever we're doing, and we get a little bit better, and we realize this thing hasn't made a mistake in a couple of weeks, other than, you know, little innocent mistakes that a human would make, that are not even really mistakes. They're just a different choice than what I would make.

And eventually we get to that point where
we're like, okay, I'm turning off safety.

I don't even care about sandboxing.

I trust this thing implicitly.

You know, go Terraform apply.

And I feel like that is eventually gonna happen to most or all of us, at least if we're allowed to by our organizations; obviously, certain organizations are gonna put a lot of restraint and restriction on that. But I feel like we're all somewhere on that path.

I need to just have a graph I put up on screen and basically allow people to choose: where are you?

I just wanted to speak about trust, because you're mentioning how you progressively trust the agent to do more and more actions. And this is something we really believe as well: you need to give the human in the loop the capability to understand exactly what the agent did, what type of queries this agent performed, to build this progressive trust, and later give the agents the capability of taking actions.

So, exactly like with a junior SRE joining your team, you can understand the progress and the type of capabilities this person has, to then allow a wider set of actions, pull requests being made, and then complete, out-of-the-loop kinds of capabilities.

Yeah.

Yeah.

To me, the auditing feels really important. Like, if I'm having a robot take actions on my behalf, I have lots of things I think I want, and I think that's all gonna completely change the minute I actually start using systems like Anyshift. Because what I perceive to be what I want, you know: I wanted to have read-only credentials until I manually granted write in the moment. And then I wanted to swap out tokens or keys so that it has the right PAT now.

And then, kinda like we have build and plan mode locally, I kind of want that in my infrastructure, where in certain moments, for certain activities, I'm gonna give it write. But for other ones, it doesn't have that ability, or doesn't even have access to the Kubernetes API. Or if it does, it's read-only.

Like, I'm just sort of imagining how
I'm gonna slowly onboard this thing.

One of the ideas I was talking with Sam about was that, if there are recurring types of incidents, where it's a similar type of problem, or maybe the tool it needs to fix something with is one I've already approved, right, given it permission to actually do something on, with write access, then it now knows that there are certain things it's allowed to do; we've established those permissions it can have. And then there are other things that it's still not ready for.

And maybe, as a team, we're not ready for it, maybe because we have crappy documentation, so it doesn't have good context.

It might be because we haven't given it full read access to that particular part of the infrastructure.

Maybe it's hybrid and, you know, there are missing components, so it tends to make the wrong decisions or hallucinate.

Do you see that spectrum happening in Anyshift? Is it gonna be this messy world where some things are this and some things are that, or is it just gonna be an on-and-off toggle? How do you see that sort of thing happening?

A great question. We really see Anyshift as a new member that you onboard within the company. You give this person some access, and depending on the access this person has, she or he will be able to debug an incident or perform a day-to-day task.

It's also super important to mention that when we speak about context, you need to have memory as well. So our agents all have self-reinforcing memory.

They will learn from past
patterns, past incidents.

All the data, everything that they have seen, very similarly to someone new on the team who begins and then evolves within the team.

It'll learn. Day one: oh, I thought this was an incident, because twice a day I have huge scaling events. And two days later, actually, I understand that this is normal behavior, because this is a pattern of the production of the company I'm working at.

That's the type of context that you need to have, and that our agents have access to, to be able to learn from connections that you only find at runtime.

We spoke about the graph
that we're building.

So this is like a time machine of your production: all the resources and dependencies, how things connect. But sometimes you don't get all these connections, all this data, from just integrations and raw hard data. Some of those connections can only be found at runtime, like how one service calls another one. We learn that as well: through the memory, at runtime, we will update the memory of our agents.

They will do it by themselves. And this is also part of the context we're building.

The resources, all the dependencies, all the changes that you need to have across dozens of fragmented tools: Datadog, AWS, Kubernetes, many others. How do you bring back this context in one unified view, to debug incidents and also to prevent some of them? Prevention is as important as just reacting to the fire.

I wanna ask about prevention, but before I do: the concept of memory is something that I actually spent a lot of time on in the last week, kind of understanding where the industry is going, even for local harnesses now. I think it was yesterday, actually, that Claude was explaining this to me.

I basically asked it to list all the harnesses it could find and whether or not they have built-in memory components, you know, or features inside them.
you know, or features inside them.

And it listed: Claude Code has it, and Copilot has it now. Where Claude Code stores it... I don't know about team accounts, but I know it stores it for you individually, and they have this new auto-dream thing that also enhances memories over time.

And for those of you not familiar with this stuff, I think we're all kind of centering around the same idea. We understand context, we understand skills and the agent files, and sort of these different parts of the puzzle.

But this memory, this thing we're labeling as memory, to me is about summarizing previous conversations or previous sessions with an agent, and storing those long-term so that they can basically be searched and injected dynamically.
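That summarize-store-search-inject loop can be sketched in a few lines. This is a toy, not how Claude Code or any particular harness implements memory: summaries are plain strings, and recall is naive keyword overlap where real systems would use embeddings and smarter summarization.

```python
class MemoryStore:
    """Toy long-term memory: store session summaries, recall by keyword overlap."""

    def __init__(self):
        self.memories = []

    def remember(self, summary: str) -> None:
        """Persist a summary distilled from a finished session."""
        self.memories.append(summary)

    def recall(self, query: str, k: int = 2) -> list:
        """Return the k summaries sharing the most words with the query,
        ready to inject into a new session's context."""
        q = set(query.lower().split())
        return sorted(self.memories,
                      key=lambda m: len(q & set(m.lower().split())),
                      reverse=True)[:k]

store = MemoryStore()
store.remember("scaling spike at noon is normal autoscaler behavior not an incident")
store.remember("reusable github actions owned by us do not need digest pinning")
store.remember("database failover runbook lives in the infra repo")

# A new session asking about the noon spike gets the relevant memory back.
relevant = store.recall("noon scaling spike incident", k=1)
```

The point of the pattern is exactly the tribal-knowledge problem: once the lesson is a searchable summary instead of a line buried in one session, every future session can find it.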

I think they're all doing something a little bit different in how exactly they bring memories into the current conversation.

And this is just, to me, a natural
extension of how we're going to take

this beyond the current session.

I've been writing a newsletter this week about a bad habit that I see myself and others doing, where we start a conversation with an agent and just continue that conversation rather than starting a new one, because there's so much information locked up in that session. It becomes almost like AI tribal knowledge at that point, because it's only in there.

My team doesn't have it.

My other AI models
certainly don't have it.

My other sessions don't even have
it, and it's only in that session.

And so it's like a precious little snowflake of a conversation, and I need to have almost a personal discipline to somehow get that into the agent's file, or documentation in the repo, or something else where future conversations can access those memories.

It turns out that maybe all the tools
are gonna manage this for us, and

it's all gonna be eventually a solved
problem where maybe there'll be central

memories for the team somewhere.

I'm not sure how it's all gonna shake out in terms of the harnesses, but it's cool that you're doing that.

I know that Mendral, with Sam, they're also doing the same thing, where incidents create memories, but also you can add in your own memories, which, I was telling him, is the tribal knowledge area.

This is where I'm just typing in, in
my case, that I specifically want GitHub

Actions reusable actions to not pin the
digest. This is very technical, but if

the calling action and the reusable
action are both managed by me,

I don't need to pin one to the other.

I just need to have the reusable
action pinning any actions it's using.

I know, if you're not using GitHub
Actions much, that just

sounded like gibberish.

But it was something where I have
a very specific workflow for me

and, and the people that I work
with, and that's how we do it.

But the linters and the
LLMs think that there's another

way they should be doing it.

And I need to give it very explicit
instructions that it maybe won't

be seeing, because not all these
tools are looking at a repo, right?

They're not all harnesses; these
SRE tools and these GitOps and

DevOps tools aren't necessarily
looking in the repo for instructions.

So where does it find all of
the lore of my infrastructure?

All of the preferences that we
made over time, the architecture

decisions that we made over time.

I'm assuming that you have a series
of plugins or a series of

integrations that aren't just looking
at your infrastructure today.

Are you also looking at
Linear, at Confluence, at Jira?

Like does it pull that kind
of information as well?

It's even better.

I will let Roxane be more specific
about the mechanism of the

memory itself; she is much more
qualified than me on this one.

But it is something that Anyshift does.

So, yeah, you're right.

If you give Annie access to your
Confluence, your Jira, et cetera,

she'll crunch all that data.

She will learn about all that
data, and she'll learn as well about

all the different chats you had.

And if you tell Annie, for
example, during a chat, oh, you

didn't know that this service was
actually connecting to Redis, then

she will remember that this service
is actually using Redis as a backend.

So maybe next time you want to deprecate
this Redis to switch to Valkey.

And you can ask Annie, oh, is it safe
to remove this Redis? No other

system I know of is using it.

And she can answer you, oh, but I know
about this specific service that

is still using it, because, I dunno,
your colleague told me that it was

still using it like two weeks ago.

And that's the type of things
that Annie has for the memory.

And she also does
something super cool.

She spends 24 hours a day exploring the
graph, exploring your infrastructures,

exploring your logs, exploring
your metrics, exploring new data,

exploring your new Jira tickets,
your Linear issues, et cetera.

And she builds her own memory, her own
feelings about, maybe not feelings, but

Feelings, like feeling sus.

Yeah, yeah.

But she creates some sense of things
happening, nonstop, 24 hours a day.

And she builds internal reports for
herself, and when she explores something,

let's say there's an incident and
she explores different hypotheses,

and during one exploration
she discovers something new,

something she didn't know.

She will automatically remember this.

And if this specific thing she
discovered during an investigation

was useless at that moment, but
maybe tomorrow you have an issue

and this is the root cause, then
in seconds she will remember it.

I cannot do this as a human.

Like, I cannot remember all the
crap I'm seeing every day

by exploring the logs and stuff.

So it's really not even
about tribal knowledge.

It's really about
exploring 24 hours a day.

So, I don't know, Roxane, if you
want to tell us more about the memory

feature we have, like how it works.

Yes, the memory is like still a
hot topic and it's hard to handle

because you need to understand like
what is useful and what is not.

Right.

You can't put every memory of
every situation all day long in,

in the context window, right?

Like you've gotta have some sort
of searching or, I'm guessing

it also summarizes them so that
they're smaller and they can fit in.

Exactly.

So we have different mechanisms.

We published it, if you want
to read it on our blog; it's

state-of-the-art work with Stanford,
and we're super proud of it: how our

agents actually learn and are able to
implement self-reinforcement learning

from the latest research papers.

And the way it works is that when
you see something,

let's try to compare it to a human,

some new data or new information,
you're first going to reflect.

So this is called the reflector:
on what you've just seen,

do you know it already?

And you're going to try also to
know is it something that is useful?

So this is the first phase that
the agent will do by itself.

And the second one, which is,
do I have some memory for that?

And where do I put it?

And so how do I update my
memory in a structured way?

Because very often today, memory
can just be Markdown files, with

information all over the
place, and you don't know where to

fetch it or what's the most important.

So this mechanism is how you
actually rank the information.

It'll be done automatically by the
agents, but how do you update and

fetch the right memory at the right time?
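That reflect-then-store loop might look something like this sketch (purely illustrative, with a toy usefulness heuristic; not Anyshift's actual mechanism): new information is first judged for novelty and usefulness, then filed under a topic with a rank so the right memory can be fetched later.

```python
# Illustrative reflect/store/fetch loop, loosely following the
# description above: a "reflector" decides whether new information
# is already known and whether it's useful, then the memory is
# stored with a rank and fetched by relevance. Not a real API.

memory_bank: dict[str, dict] = {}  # fact -> {"topic": ..., "rank": ...}

def reflect(fact: str) -> bool:
    """Phase 1: do we know this already, and is it worth keeping?"""
    already_known = fact in memory_bank
    useful = len(fact.split()) > 3        # toy usefulness heuristic
    return useful and not already_known

def store(fact: str, topic: str, rank: int) -> None:
    """Phase 2: where does it go, and how important is it?"""
    if reflect(fact):
        memory_bank[fact] = {"topic": topic, "rank": rank}

def fetch(topic: str, limit: int = 3) -> list[str]:
    """Return the highest-ranked memories for a topic."""
    hits = [(m["rank"], f) for f, m in memory_bank.items()
            if m["topic"] == topic]
    return [f for _, f in sorted(hits, reverse=True)[:limit]]

store("checkout service still uses Redis as a backend", "redis", rank=9)
store("ok", "redis", rank=1)                      # too short: rejected
store("nightly batch job also reads from Redis", "redis", rank=5)
```

The point of the ranking step is that `fetch` can return a small, ordered slice instead of dumping every memory into the context window.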

The second thing which is
tricky, is how do you handle

short term and long term memory?

You have short term like preferences,
things you have seen yesterday.

I want to remember it.

But you also have long-term memory,
like a very rare incident, a very rare

pattern, that you want to have somewhere
in your brain, in kind of a cold storage.

And if at some point it gets
really tricky, you want to go fast

to the long-term memory as well, this
cold one, to be able to find the right

context, and then to solve the incident
or to answer a tough question.

So that would be the kind
of principle in place.

I can go deeper if you want.

Yeah.

I was gonna say, that sounds like
the magic of a SaaS, right?

Like that, that there's a hard problem.

And this kinda reminds me of like
the things that people aren't talking

about is, okay, sure, I can give it
access to infrastructure and have

it go look at the Kubernetes API.

Like I can have any model go do that.

I can tell it, it has access to kubectl
and Terraform, and it can look at things

in that moment, and it might even be able
to look at APIs and live infrastructure.

But capturing the history of all
of our Jira tickets and all of

our PagerDuty outages and the
Slack channel conversations?

Like, when I think about the big
picture, we're really trying

to just replace engineers, right?

At the end of the day, not necessarily
to replace people, but we're trying

to have an additional
AI buddy engineer that's going to

remember all the things that I forgot.

And in this world where nobody keeps
the same job for more than a year

or two, it's really hard to maintain
all that team knowledge, that tribal

knowledge that isn't maybe documented.

And none of us really enjoy
handwriting documentation.

I speak; I use one of the Whisper apps.

I don't have to even type as much
of this stuff anymore when I'm just

sort of going off on a tangent.

But I have this theory that, if these things

are gonna eventually be someone who is
just as smart as anyone else in my team

in terms of knowing the current state
of things, the current plans of the

company, the current budget restrictions,
the current, you know, what do we just

talk about in the standup this week?

That at some point these AIs are
gonna have to be connected in, whether

it's A2A or something, or they're
gonna need to be in the meetings,

they're gonna need to have
the planning documents, they're
gonna need to know the new budget
restrictions that we have on cloud

compute or whatever in the company.

Like, they're gonna need to know
all these things so that they can

make these decisions in real time.

Like, oh, well, you know, we have a
capacity issue, so I'm gonna spin up

some new servers, but I also have a
budget concern, so I'm gonna use Arm,

you know, or whatever to stay cheaper.

Or I'm gonna reduce my capacity
over here to increase capacity

over there, 'cause this is optional
and I can slow that down for later.

'cause it's just job
management or whatever.

Like, it's gonna need a ton of
context about the business, because

that's when I walk into teams,
that's what's happening in real

time, is people are making decisions.

It might be documented
in an email or a Slack message or

in a document from a Zoom call, but
that's not in my infrastructure.

That's not in my Terraform plan.

So how does this agent, you know, maintain
the intelligence that another team

member would maintain in the real world?

So I'm guessing that you're
thinking that far out.

Maybe you're thinking like a couple
years in the future when all this stuff

is there, but you did mention already
earlier, we don't have to go back over

it again, but the whole idea of
agents talking to agents. Maybe

you have a meetings-and-email
agent or something,

where you're literally CCing some sort of
agent that's digesting emails, so that

the epic long email threads
or Slack threads are all being assessed

in some sort of agent over here, but
then that agent over there is managing

infrastructure and has the context of
real current status of infrastructure.

And the two have to meet in
some way, so that your decisions

are made with the same awareness
that a human would have that week.

I don't know if you're thinking of those
kind of hard problems that far out,

but that's something where I'm trying
to imagine how we're gonna get there.

I'm not the smart
one that's gonna figure it out, but

I'm just imagining that's the problem.

I'm so excited about that because it's
really like our vision, that context

when you speak about production, and
let's speak about only production,

it's not only like infrastructure data.

You also need to have
security integrations,

with, for instance, identity and
ownership, and business data as well.

And when you connect all of
this data together, you cannot

just like call different MCP
and APIs to get this context.

I cannot insist enough on the
fact that there's a difference

between correlation and causality.

When you need to make a decision,
either to understand what

is the root cause of an incident,
or to make a change and understand

the blast radius, you need to
have the causality of events.

What is the source
and what is the symptom?

And this data, if you just take
it as MCP or APIs, it's only

going to be like correlation
between different data sources.

And when we speak about context
in production, you need to have

the causality of events the whole
time, and that's the tough part.

And not only for Kubernetes,
AWS, and code bases, but also for

Slack, for Jira, for Confluence,
Okta, et cetera, everything that

makes your daily job
interesting in terms of context.

You need to gather it and connect
it as nodes and dependencies.

And this is the type of
context we're excited to build.
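The source-versus-symptom distinction can be sketched as a directed dependency graph (the service names and edges here are invented): correlation only says two things changed together, while walking cause-and-effect edges traces a symptom back to candidate sources.

```python
# Sketch of causality as a directed graph: edges point from a
# failing service down to the things it depends on. Tracing a
# symptom through the edges yields candidate root causes, which
# plain correlation of data sources can't give you. Names made up.

# "checkout-api depends on redis" means a redis failure can cause
# a checkout-api failure.
depends_on = {
    "checkout-api": ["redis", "auth-service"],
    "auth-service": ["okta-sync"],
    "redis": [],
    "okta-sync": [],
}

def root_causes(symptom: str) -> list[str]:
    """Walk dependencies from a failing service down to leaf causes."""
    deps = depends_on.get(symptom, [])
    if not deps:
        return [symptom]          # nothing below it: a candidate source
    found: list[str] = []
    for dep in deps:
        found.extend(root_causes(dep))
    return found

# A checkout outage traces back to the leaves it transitively depends on.
candidates = root_causes("checkout-api")
```

Two services alerting at the same time is correlation; an edge between them is the causal claim that lets you rank one as source and the other as symptom.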

It's not something that
you want to build yourself.

You want to be able to make some actions
or to make decisions based on that.

We really focus on the context,
on what this context looks like.

Yeah, so I guess at the end
of the day, it's Annie, right?

Annie is the name of the AI at Anyshift.

Annie's working on the causation
while I'm sleeping, hopefully.

Like looking at repetitive incidents.

Over a decade ago, I was
working with a platform that was kind of

like a Netflix platform, where they were
delivering video all around the world.

And I was managing it. This is
like pre-Terraform, so we were using

SaltStack, I think, and a lot of
CloudFormation, and we had a bug.

We were the ops team, so we weren't
writing the code, and the code, I think

it was PHP code, had a memory leak,
so we knew that about once a day a

certain series of servers all around the
world were gonna need to be restarted.

And we didn't really have full
connection failover, so it needed

to be done, ideally, in an off hour
for that part of the world.

And so we had to come
up with this whole plan.

And so for the longest time
it was just human toil.

It was whoever was on, whoever was
available in the team at that moment was

gonna be the one that was gonna have to
kick off a job to, for that part of the

world and those regions to recycle all
these servers and basically reboot them.

That was the strategy for like three
months, because we were waiting on this

supposed magic PHP fix
that was gonna fix this bug.

And this is the kind of thing where I
feel like, one, the AI should just be

detecting it, researching
the problem and finding

the fix much faster than a human would.

And two, I should just be able
to tell some SRE bot,

hey, this is the problem, automatically
create me a schedule for recycling

these servers until we can get the
fix that some other AI is gonna implement.

'cause maybe the developers have
their own, you know, their own Claude

and they're gonna, they're gonna
implement a fix eventually with that.

But I need this solution now, and
the solution means someone's got

to cron job this out worldwide
with all these different regions.

And that seems like a plausible job
that I just wanna give an AI, I do not

want to have to get up in the middle
of the night and reboot servers myself.
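The schedule in that story, rebooting each region during its own local off-hours, is straightforward to compute. Here's a sketch using Python's standard zoneinfo module, where the region list and the 03:00 local target are my own assumptions.

```python
# Sketch: pick a reboot hour in each region's local off-hours (03:00
# local here) and convert it to the UTC hour a cron scheduler would
# actually fire at. Regions and zones are illustrative.
from datetime import datetime
from zoneinfo import ZoneInfo

REGIONS = {
    "us-east-1": "America/New_York",
    "eu-west-1": "Europe/Dublin",
    "ap-northeast-1": "Asia/Tokyo",
}
LOCAL_REBOOT_HOUR = 3  # 03:00 local: quiet in that part of the world

def utc_reboot_hour(tz_name: str, on_date: datetime) -> int:
    """UTC hour at which 03:00 local time occurs in the given zone."""
    local = on_date.replace(
        hour=LOCAL_REBOOT_HOUR, minute=0, second=0, microsecond=0,
        tzinfo=ZoneInfo(tz_name),
    )
    return local.astimezone(ZoneInfo("UTC")).hour

# One entry per region: the UTC hour to schedule the reboot job at.
schedule = {
    region: utc_reboot_hour(tz, datetime(2024, 1, 15))
    for region, tz in REGIONS.items()
}
```

Note the date matters because of daylight saving time; a real scheduler would recompute this per day rather than hardcoding the UTC hours.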

And maybe I don't
have an automation system yet,

like Argo Workflows or something
that can automatically schedule and

spin things up and reboot things.

So, I'm excited to see how this progresses
in terms of your AI being able to do,

you know, do operations on my behalf.

And maybe it's deterministic, I think
you mentioned earlier, you were talking

about a script that one of your customers
needed to run, and that's also something

recently that I've been seeing more
people experiment with as essentially

using the AI, the non-deterministic,
crazy texting robot to write a

deterministic workflow that it executes.

And then, you know, I think
for those of us in ops, we are

scared of the non-determinism, because
we feel like that's just

letting a crazy bot loose in my
infrastructure to wreak havoc.

But if I can get the AI to write a
deterministic program or workflow or

something and then implement that,
I feel like that reduces

the risk of the AI, because the
AI is not deciding that today I'm

gonna do these steps out of order.

Because that might happen, but if
it can write the code or write a

deterministic workflow for me, that
feels like the right thing to do.

That feels like something
a human would do.
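That "AI writes it once, we run it deterministically" pattern might look like this sketch (the step names and runner are invented for illustration): the generated artifact is just an ordered list of steps, and the runner executes them in that fixed order every time, stopping at the first failure instead of improvising.

```python
# Sketch of the "AI authors, runner executes" pattern: the generated
# artifact is a fixed, ordered runbook, and the runner executes the
# steps strictly in that order, refusing to continue past a failed
# step. Step names are invented for illustration.

RUNBOOK = [  # imagine an agent generated this and a human reviewed it
    "drain-traffic",
    "reboot-servers",
    "health-check",
    "restore-traffic",
]

def run(runbook: list[str], execute) -> list[str]:
    """Execute steps strictly in order; stop at the first failure."""
    done = []
    for step in runbook:
        if not execute(step):
            break           # no creative recovery: fail loudly instead
        done.append(step)
    return done

# Simulated executor where the health check fails: later steps never run.
completed = run(RUNBOOK, execute=lambda step: step != "health-check")
```

The value is that the risky creativity happens once, at authoring time, where a human can review the runbook; execution is then boring and repeatable.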

And I don't know if you're seeing
those patterns, of people

using Anyshift to detect the problem,
and then they're not using

AI to automate or solve the problem
directly; they may be using AI to write

the deterministic fix or deterministic
workflow to fix that problem.

You gave that example earlier, but
I was just curious if you had any stories

on how that is continuing to happen,
or if that's your strategy, or how you

recommend people implement this.

Yeah.

We actually have.

We had a very nice story, maybe
two weeks ago, with one customer.

That customer got a report from Annie
saying, oh, I discovered on this

account a lot of EBS snapshots,
and it amounts to a lot of money.

And that, that customer didn't stop here.

He asked Annie, okay, gimme a plan.

Like, can we do cold storage?

Can you suggest something? You have access
to our Terraform stuff, so can you

just help us do it now?

And so it started with a report,
and it ended up, in a matter

of minutes, with a full-fledged
solution, a lot of money saved,

and a much better backup plan in place.

So it's really about
exchanging, talking with your

agent, and just giving that agent the
right context to take the proper decision

and help you in your already hard job.

Nice.

I feel like we could talk about this
forever, because I'm super

interested in the patterns and
the future of how we're gonna

interface with this AI, how we're gonna
keep this AI on the rails, how

we're gonna keep it safe in production.

I think there's a lot of people
that are concerned about that.

And I'm glad to get
this episode out, because I feel

like it's getting the word out
on how companies are implementing

these technologies in a reliable,
reproducible, safe way, so that we're not

constantly tearing down infrastructure.

You know, we all see the
Hacker News story of someone who

had the AI run terraform apply, and
it had decided to delete everything

before it ran it again, or whatever.

That was a recent one.

And we see those and I think sometimes
some of us were like, that's what you get.

That's what you get for
doing that with the AI.

Like, shame, shame on you,
you shouldn't have done that.

But the reality is this stuff is
coming and people are experimenting

now, like you said, you've got some
customers that are like, we want to

go full on, like, give us the right
access, give us the implementation.

So I think it's definitely coming and
it's exciting to talk about the different

models, not the AI models, but the
different workflow models of how we're

gonna do this together and safely, and how
these kind of tools are gonna integrate

with the rest of our infrastructure
and the rest of our AI agents.

What's, what's next?

Like, gimme, gimme a hot take.

What's, what's the next thing that
you're gonna, you're excited about?

If I take this one, I think we have
different excitements, Stephane and I.

Okay.

I would say, in terms of context,
how we make it grow even more.

We want to add much more
information, for the teams to have

everything at hand to perform a task.

We're going to add wikis,
Notion, as nodes in the

graph, and people are asking for it.

I'm excited about how big
we grow in terms of context.

Yeah.

A giant, giant graph.

Excited about the giant graph as well,
but I would like a more agentic

Annie. I want Annie to do things.

I want Annie to proactively do things.

I want AGI, I want to sleep at night.

So I want the, I want Annie to detect
problems, fix problems, and just

send me an email tomorrow morning.

That's the short-term roadmap.

I'm really excited about this.

Yeah.

I do like the idea of getting up
in the morning, I'm brushing my teeth,

and I've got my Annie app on my phone,
and we're having a voice chat, kinda

like I did this morning
with ChatGPT, where we're having a

conversation and I'm like, okay, what
happened overnight while I was sleeping?

Like, gimme, gimme, like, did we
have unusual spikes in traffic?

Did we have any sort of hiccups?

Was there any cloud outages
while I was sleeping?

You know, like we all wake up
and GitHub's not running today.

You know, that's the kind of
world that we live in right now.

And instead of me having to peruse
Hacker News or check status pages

of things, you know, everything
might be fine right now, but there

might have been a ton of stuff
that happened in the last 10 hours.

And instead of me having to read
infinite Slack messages and

channels that are just giving me alert
fatigue all day long, I'd love to

really have morning summaries
that are essentially read to

me by either a British or Australian
accent AI, because I gotta have

the killer feature.

Let's let's strip this.

I got a French accent over here.

I got a South African accent over here.

Like, I just want all the robots
to have more personality,

so that I recognize their voice when
we're in a group call with nothing but me

and the agents, and one of them knows my
infrastructure and one of them knows my

GitHub situation and, you know, whatever.

One of them's cooking
my breakfast downstairs.

I don't know what's gonna happen.

But it's exciting to see this
future unfolding in front of us.

And I think, the last thing here is
like, when is that a year from now?

is that five years from now?

Like where do you, can you even imagine
where Anyshift is gonna be in a year?

Do you even have a vision board of
where this is all going in a year?

I don't know how fast the transition
from agent to agent will happen.

For sure, today,
incident management and resolution

is still an unsolved problem.

It'll not be manual in the future.

Six months, a year
from now, it should be solved.

And how agents actually
perform actions in production

should be something that, I believe,
in six months, a year from

now, will be done, as
the models get so much better.

We made a bet from day one that
we would be the best at context

and that agents would improve.

And we have seen the leap of
progress, like Sonnet 4.6, for instance.

It's only gonna get better and
we are the best at providing this

causality chain between events.

And then our agent will be able
to be the best to solve it with

the latest models improving.

Yeah.

A new model, to me...

I mean, I feel like we're all
in the lull right now, waiting

for whoever's
gonna release their next version.

And it feels like, every time
a major SOTA model releases a

new version, it makes things that we
thought we needed to do in our tooling

irrelevant, because the model can
do that now, or, you know, the model's

been updated so that I don't need that.

I don't need for it to read the
documentation every time because now

it's been trained on that documentation.

A lot of the models
I think we're using now

were built in September last year.

So a lot of the tools I'm
using barely even existed then.

I mean, OpenClaw wasn't
even a thing back then.

Like, there's so many things.

Docker sandboxes didn't exist.

Like, we have so many new things
that, that models don't know about.

So I have to constantly point it
to websites, give it documentation.

So I'm excited for that future too.

And even though
I can't predict what's gonna happen

in six months, I feel like

it's all gonna be amazing,
and I'm here for it.

So that's why we have this podcast.

It's great to have you on the
podcast, and I'm looking

forward to what you all do next.

Thank you so much.

Thank you so much,

Alright.

You can find Anyshift at anyshift.io.

I'm assuming both of you, should
people follow you on LinkedIn?

I guess where are the socials
that people should find you all?

Twitter and LinkedIn.

yeah.

Twitter, LinkedIn.

Nice.

All right.

Ciao everybody.

Thanks for joining us, and I'll
see you in the next episode.

Creators and Guests

Bret Fisher
Host
Cloud native DevOps Dude. Course creator, YouTuber, Podcaster. Docker Captain and CNCF Ambassador. People person who spends too much time in front of a computer.

Beth Fisher
Producer
Producer of the DevOps and Docker Talk and Agentic DevOps podcasts. Assistant producer on Bret Fisher Live show on YouTube. Business and proposal writer by trade.

Cristi Cotovan
Editor
Video editor and educational content producer. Descript, Camtasia and Riverside coach.

Roxane Fischer
Guest
Co-Founder @ Anyshift.io | The context for AI in prod | Euro Seed 50