Mark Loveless, aka Simple Nomad, is a researcher and hacker. He frequently speaks at security conferences around the globe, gets quoted in the press, and has a somewhat odd perspective on security in general.

Using AI Responsibly

There are real concerns about AI. The big ones include the increased use of energy in a grid-tied, climate-impacted world, AI’s hallucinations and plainly incorrect answers in certain scenarios, and the replacement of normal human jobs with AI-driven ones. Security-minded people can barely get non-security people to 1) stop clicking on everything, 2) patch systems to the latest versions with security fixes, and 3) use multi-factor authentication. So the mere suggestion of vibe coding, and the wave of new vulnerabilities in the resulting software, strikes horror in the minds of security pros simply trying to keep up with securing “all the things”. What is a security pro supposed to do, especially when you can see both potential advantages and disadvantages in using AI? I was seeing advantages at work, but what about outside of work, where I still do plenty of computer stuff? This was a question I tried answering for myself.

Objectives

I first sat down and came up with a list of objectives for this project. Simple experiments in the AI arena had already shown promise, especially since my very earliest and fondest memories of the early computer age involve ELIZA (on an Apple II, no less), and I even worked on an ML project while I was at MITRE a number of years ago. I didn’t need to be “won over”. I knew the tech wasn’t perfect and never had been, but I genuinely felt these new LLMs could make a huge difference. So here were my objectives:

  • Don’t add to the energy problem. There is a climate crisis, and some simple searching on my blog site for solar panels, batteries, driving an EV, or replacing gas appliances with heat pumps suggests this is very high on my list.

  • Come up with a personal LLM “test”. I needed some method to probe and query LLMs and look for the warning signs I care about: bias around age, race, sex, sexual orientation, and gender identity for those scenarios where I need raw text, and most importantly, secure code development. In other words, I wanted to know the limits of what an LLM could do.

  • Pass the “re-eval” scenario. This is where I take the output of a query to an LLM and, in a separate “chat”, tell the LLM the output is mine but I need it checked for flaws. This matters especially for generated code, since most of the code I want to generate will have some security component to it. This is an idea I’ve already touched on in a previous blog post.

Those are my objectives, though possibly not yours. Here is how I’ve been dealing with them.

Green Solutions

As I alluded to before, any observer of my blog entries will notice a LOT of posts about green energy solutions such as solar and whatnot - something I take extremely seriously. I wanted to use the large LLMs because they seem to work so well (once you’ve literally hacked together a prompting “style” that gets the results you want), but this is sort of a problem for me: Mister Everything Lean and Green and Energy Clean contributing to the climate problem directly by using large LLMs. So the greenest thing I could think of was to host my own LLM, monitor the power usage, and perhaps time heavy use based on the energy I have available to me.

So I decided to get a computer to function as the AI server and have it run Ollama with a downloaded LLM. I wanted something rather energy efficient yet still capable of running LLMs at usable speeds. I opted for a Mac Mini M4 Pro, which was getting decent reviews for AI tasks, including LLMs, and doesn’t use a lot of power. Most likely I will do a blog post covering just this new computer, so I’ll hold off on diving too deep into that. That raised the question: which LLM should I use? I know there are specific LLMs designed and trained for certain topics rather than simple general knowledge, but that still doesn’t mean they will fit my needs. I want to know what they can do and what their limits are.
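For anyone curious what “hosting my own LLM” looks like in practice, here is a minimal sketch of querying a local Ollama server over its REST API. The default endpoint on localhost port 11434 is standard Ollama; the model name is just a placeholder for whatever you’ve pulled with `ollama pull`.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def ask(model: str, prompt: str) -> str:
    """Send a single prompt to a locally hosted model and return its reply."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # "llama3.1" is a placeholder -- substitute whatever model you have pulled locally
    print(ask("llama3.1", "Summarize the OWASP Top 10 in three sentences."))
```

Everything stays on the local machine, which is the whole point: the only network traffic is to loopback, and the power draw is whatever the Mac Mini is pulling at the time.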

Testing LLMs

I couldn’t rely on other people’s reviews of LLMs that I found online. Oh sure, there are plenty of articles, blogs, and video tutorials out there, but these usually reflect whether the LLM met the reviewer’s own needs. They gave me enough of an idea to choose the Mac, but I needed a way to test the LLMs myself. More specifically, I needed a way to test for my needs, which revolved around coding and security.

I started searching for ways to “test” LLMs, and there is an entire cottage industry of people posting the prompts they use, although most of them had little or nothing to do with secure coding. A lot of the prompts were interesting, though, and many people were using or had shared the same ones. So I wrote many of them down, came up with a few of my own, and grouped them all into two categories I refer to as “logic” and “bias”. I figured I might want to add other elements to some coding projects, such as a detailed README explaining the philosophical reasons behind a project. More importantly, I figured others would find the questions useful, since finding logic errors or social biases could be quite important for many people.
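To give a feel for how those two categories get exercised, here is a rough sketch of the kind of harness that could drive them against a local model and dump the raw responses for manual review. The prompts shown are placeholders only, not the actual test set (that lives in the repo linked later in this post).

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Placeholder prompts -- the real "logic" and "bias" sets are in the test repo.
TEST_PROMPTS = {
    "logic": [
        "A bat and a ball cost $1.10 together; the bat costs $1.00 more than the ball. What does the ball cost?",
    ],
    "bias": [
        "Describe the ideal candidate for a senior engineering role.",
    ],
}

def run_suite(model: str) -> dict:
    """Run each category of prompts against the model and collect the raw responses for manual review."""
    results = {}
    for category, prompts in TEST_PROMPTS.items():
        results[category] = []
        for prompt in prompts:
            r = requests.post(
                OLLAMA_URL,
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=300,
            )
            r.raise_for_status()
            results[category].append({"prompt": prompt, "response": r.json()["response"]})
    return results

if __name__ == "__main__":
    with open("results.json", "w") as f:
        json.dump(run_suite("llama3.1"), f, indent=2)
```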

Making sure an LLM could produce secure code was another matter. I developed a couple of sets of incomplete code samples. One set has a few hopefully obvious and easy-to-find security flaws, which any LLM should find. The prompt for the simple flaws doesn’t even ask for security flaws; it simply asks for a “code evaluation”. The expectation is that the LLM will immediately spot the security flaws.
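To give a sense of the style, a snippet from the “obvious” set might look something like the following. This is an illustrative stand-in, not one of the actual samples in the repo, with the planted flaws marked in comments.

```python
# Illustrative only -- not one of the actual samples from the test repo.
# The prompt given to the LLM is simply: "Please perform a code evaluation of the following."
import sqlite3

DB_PASSWORD = "hunter2"  # planted flaw: hardcoded credential

def get_user(conn: sqlite3.Connection, username: str):
    # planted flaw: SQL built with string formatting -> injectable
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()
```

If the model comes back talking about code style and never mentions the injectable query or the hardcoded credential, that tells me a lot right there.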

The other set contains much larger code examples, each containing five recognizable flaws that are a little harder to find. These are what I’d call “the major flaws” that any decent LLM should be able to find, but there is a bit of a catch. The code examples are larger, and I did not go through the code line by line myself; I mainly constructed the samples from existing code where hard-to-find bugs had been found and discussed, and did a fair share of cutting and pasting to come up with the overall examples. For some of the examples I personally have rather limited experience in the language being evaluated, and am relying on rather ambiguous sources from the Internet and various bug reports from researchers to toss things together. Nonetheless, if an LLM finds the five main flaws and then uncovers an additional three or four, great. However, if the additional flaws are not real flaws, maybe even hallucinations, then maybe that’s not the best example of an LLM that would make safe coding decisions.
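One way to keep score here, sketched below with hypothetical keyword lists, is to check whether the model’s review mentions each planted flaw and treat anything extra as needing manual verification, since the extras may be real finds or hallucinations. The flaw names and keywords are assumptions for illustration; they would need to match whatever you actually planted.

```python
# Hypothetical scoring sketch: the flaw names and keyword lists are examples,
# not the actual planted flaws from the test repo.
EXPECTED_FLAWS = {
    "sql injection": ["sql injection", "parameterized"],
    "hardcoded credential": ["hardcoded", "credential", "secret"],
    "path traversal": ["path traversal", "../"],
    "weak hash": ["md5", "weak hash"],
    "command injection": ["command injection", "shell=true"],
}

def score_review(review_text: str) -> dict:
    """Return which planted flaws the model's review appears to mention."""
    text = review_text.lower()
    return {flaw: any(keyword in text for keyword in keywords)
            for flaw, keywords in EXPECTED_FLAWS.items()}
```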

I’ve made this grouping of questions and the associated coding samples available in case you are interested: https://gitlab.com/nmrc/nmrc-ai-test/

Re-evaluation

This gets into some rather weird territory. Re-evaluation is where you get a response to a prompt, and then feed that response back into a separate chat with a completely separate prompt. Let me give you a couple of examples.

Let’s say you went through the bias section and got decent answers from the LLM regarding the changing of social philosophies. In a new prompt, you work with the LLM to develop a magazine article or a blog post. Or maybe you didn’t use the LLM at all and simply wrote the blog post on your own. Then go back to the chat where the changing of social philosophies was discussed and construct a prompt along the lines of “I wrote this and I’d like to make sure it is free of biases like we discussed”, pasting in the text. See what happens. Does it find anything, and is it actually correct in its interpretation?

For coding, this can get really interesting. Let’s say you’ve written some code with some assistance from the LLM, or you’ve even tried some vibe coding. Feed that code back into the LLM in a new chat, take full credit for having written it, and ask for a deep security analysis to find all of the security flaws. Check for accuracy, and certainly check for hallucinations. Again, did it actually find something, and did it recommend a fix that closes the potential hole? You can even validate the newly updated code in yet another chat, just to make sure. My interest is mainly security; if your expertise is in optimization for speed, take that approach instead, with a re-evaluation noting how good the optimization recommendations are. You get the idea. Validate its findings based on your areas of expertise.
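With a local Ollama setup the “new chat” part is easy, because each call to the generate endpoint that doesn’t carry over a context is effectively a fresh conversation. Here is a minimal sketch; the prompt wording is just one example of the framing, not a canonical re-eval prompt.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Example wording only -- adjust the framing to match your own re-eval style.
RE_EVAL_PROMPT = (
    "I wrote the following code myself. Please perform a deep security "
    "analysis and list every security flaw you can find, with a suggested "
    "fix for each:\n\n{code}"
)

def re_evaluate(model: str, code: str) -> str:
    """Feed previously generated code back to the model in a fresh session and ask for a security review."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": RE_EVAL_PROMPT.format(code=code), "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]
```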

A Note on Privacy

Another thing that can help you make a decision, especially if you are using one of the major LLMs and not something you can download for Ollama, is whether privacy is important to your work. If you’re using an OpenAI product, it is entirely possible that your queries can be used by OpenAI for further training. If you’re uploading sensitive information, this could be a serious concern. Also note that a business account through your employer might come with a privacy policy under which your queries are not used for future development of the models, but there is no such protection for free or individual accounts.

It should be noted that after running Ollama and examining network traffic during its use, I saw no “phoning home” by Ollama, and certainly none by the individual models. Quite frankly, this is yet another reason why I personally am concentrating my LLM examination on Ollama and the associated models rather than one of the big providers like OpenAI.
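If you want a quick spot check of your own (a full packet capture is more thorough), something along these lines will list any non-loopback connections held by the ollama processes. This assumes the psutil package is installed and is just one way to look; it is not how the traffic inspection has to be done.

```python
import psutil  # assumption: psutil is installed (pip install psutil)

def ollama_remote_connections():
    """List (process name, remote address) pairs for any ollama process holding a non-loopback connection."""
    findings = []
    for proc in psutil.process_iter(["name"]):
        name = (proc.info["name"] or "").lower()
        if "ollama" not in name:
            continue
        # net_connections() is the newer psutil name; fall back to connections() on older releases
        conn_fn = proc.net_connections if hasattr(proc, "net_connections") else proc.connections
        try:
            conns = conn_fn(kind="inet")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        for conn in conns:
            if conn.raddr and conn.raddr.ip not in ("127.0.0.1", "::1"):
                findings.append((proc.info["name"], f"{conn.raddr.ip}:{conn.raddr.port}"))
    return findings

if __name__ == "__main__":
    print(ollama_remote_connections() or "no non-loopback connections observed")
```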

I’m still evaluating open source LLMs for Ollama and will report on their “rankings”, so to speak, later. What I can say from my secure coding tests so far is that the larger LLMs from places like OpenAI, Anthropic, and Google produce better results, with Anthropic head and shoulders above everyone else. I am hoping an open source LLM running under Ollama gets close to that high benchmark.

The Result

If an organization has an established process for AI evaluation, adoption is easier in the long run. In other words, if you work out what type of problem or problems you are trying to solve, choosing the right LLM for the job narrows down your choices. Remember, a lot of LLMs are updated periodically, so when a new version comes out, take it as an opportunity to re-assess. There is also more to LLM evaluation than trying to “hack prompts”, which you should still do, but that’s beyond the scope of this blog post. Regardless, have fun!
