Analyzing Tokenizer Part 2: Omen + Tokenizer



“I have not failed. I've just found 10,000 ways that won't work”


Thomas Edison


Introduction:


This is a continuation of a deep dive into John the Ripper's new Tokenizer attack. Instructions on how to configure and run the original version of Tokenizer can be found [Here]. As a warning, those instructions need to be updated since a new version of Tokenizer has been released that makes it easier to configure. The first part of my analysis can be found [Here].


This is going to be a bit of a weird blog entry as this is a post about failure. Spoiler alert: If you are reading this post to learn how to crack passwords, just go ahead and skip it. My tests failed, my tools failed, and my understanding of my tools failed. A disappointing number of passwords were cracked in the creation of this write-up. I'll admit, I was very tempted to shelve this blog post. But I strongly believe that documenting failures is important. Often when reading blog posts you don't really see the messy process that is research. Stuff just doesn't work, error messages that are obvious in retrospect are missed, and tests don't always turn out the way you expect. So as you read this, understand that it's more a journal of troubleshooting research tests when they go wrong than documentation of what to do.


To put it another way, the main audience for this blog post is:



  • My future self. I'm really annoyed at myself for not better documenting some of my past work. I'll go into that in more detail later.

  • Password cracking tool developers. This post won't help you crack more passwords or develop better password security strategies. Hopefully it will help those who are developing the tools to help you crack those passwords though.

  • People who play Dwarf Fortress and like [Fun].


Question 1: Why are my Tokenizer's "first 25 guesses" different from Solar Designer's?



Background:

In response to my previous blog entry, Solar Designer wrote: "One thing that surprised me is that your top 25 for training on RockYou Full (including dupes, right?) is different from what I had posted in here at all (even if similar)." [Link].


That's a good question, and one that I had been wondering about as well. There are a couple of things that could be causing this, from the way our Linux shells handle character encoding, to the order of our training lists, to differences in the training lists themselves. Or it could be something totally different that I'm not imaginative enough to come up with yet. At a high level, it's probably not that big of a deal since our experiences running Tokenizer attacks seem roughly the same (Solar Designer has posted tests comparing it to Incremental mode, and they roughly match what I've been seeing). But this can be a useful rabbit hole to dive down since it can expose optimizations or environmental issues that could cause problems as more people start to use this tool. There's a big gulf between "it works on my machine" and "it's easy for anyone else to run".


Conclusion Up Front:


Tokenizer (and base-Incremental mode) seem resilient to the order of the passwords they are trained on, and setting  'export LC_CTYPE=C' did not seem to impact guess generation.


Bonus Finding:


When manually analyzing password guesses, DO NOT pipe the output of a password cracking session into "less". At least in my WSL Ubuntu shell, this seemed to add artifacts into the guesses I was creating, which gave me bad data. Note: this doesn't impact running actual password cracking sessions.


Instead, when using John the Ripper, make use of the "--max-candidates" option. Aka:




  • ./john --incremental=tokenize1 --external=untokenize1 --stdout --max-candidates=25



Discussion


This was an area where my analysis setup really let me down so I chased a lot of unproductive leads before I was able to find the ground truth. For my first test I sorted Rockyou1 and compared a Tokenizer attack trained on it to a Tokenizer attack trained on an unsorted Rockyou1 training set. Initially they appeared to generate different guesses. For example:



This led down an unproductive rabbit hole where I ended up generating a lot of different character sets for Incremental mode to try and track down what was causing the differences in guess generation. It wasn't until I got really frustrated and ran a "diff" on the different .chr files that Incremental mode uses and found they were EXACTLY THE SAME that I realized the problem might be in how I was displaying the guesses.
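For reference, since the .chr files are binary, a quick way to make that comparison (the file names here are just placeholders for whatever you called your generated charset files) is:

  • cmp sorted_rockyou1.chr unsorted_rockyou1.chr && echo "identical"

cmp only succeeds (and prints "identical") when the two files are byte-for-byte the same, so this confirms that the two training orders produced the exact same character set file.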


Still, I learned a few new things, and improved my testing process. So it wasn't a complete waste.


Question 2: How does the PCFG OMEN mode attack differ from the original Rub-SysSec version?


Background:

This question was inspired by a comment by @justpretending on the Hashcat Discord channel.



OMEN stands for Ordered Markov ENumerator and the original paper/implementation can be found [Here]. I became interested in it after it was presented at PasswordsCon where it was shown to be more effective than my PCFG attack and could generate guesses at speeds making it practical. That's certainly one way to get my attention! To better understand the OMEN attack I took the Rub-SysSec OMEN code and re-implemented it in Python. The standalone version of the python code (py-omen) is still available [Here]. Liking what I saw, I then replaced the existing Markov attack (based on JtR's --Markov mode) in the PCFG toolset with OMEN for the PCFG version 4 rewrite. 


That's a lot of words to say that while the different implementations weren't 1-to-1, I expected my version of OMEN to be "mostly" similar to the original Rub-SysSec version. But it appears there are differences, so let's look into them!


Challenges with Unsupported Tools:


The first challenge I ran into was getting the stand-alone version of py-omen to run. For example, I get the following error when I try to generate guesses with py-omen:



I vaguely remember having to update my ConfigParser calls in the PCFG toolset, so that error tracks. My guess is that if you ran py-omen with Python 3.6 it would work, but it looks like it isn't compatible with Python 3.12. While it is tempting to fix this bug now, as having py-omen working would be nice for doing more experimentation with OMEN, it's really outside the scope of this investigation and the error message brings me joy. Long story short, I'm going to defer that work until a later point.
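For anyone hitting a similar error with older Python code, the usual culprit is that Python 3.12 finally removed some long-deprecated configparser APIs (readfp() and SafeConfigParser). I haven't confirmed that this is the exact call py-omen trips over, but the typical fix looks something like this (a hypothetical snippet, not py-omen's actual source):

    import configparser

    # Pre-3.12 style code that now fails with an AttributeError on Python 3.12:
    #   parser = configparser.SafeConfigParser()
    #   parser.readfp(open("omen.cfg"))

    # Python 3.12 compatible equivalent:
    parser = configparser.ConfigParser()
    with open("omen.cfg") as config_file:
        parser.read_file(config_file)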



The important tool to run though is the C OMEN version developed by Rub-SysSec. The "make" build process worked without a hitch, but when I tried to run it the following error was displayed:



Looking into the open issues for the code I found one [Link] that highlighted that this problem occurs when you run it from an Ubuntu system. I verified this happens both with a WSL install of Ubuntu as well as a copy of Ubuntu running on bare metal. When I installed a Debian WSL environment though, I was able to get the original OMEN code to work.


Another challenge I ran into with the Rub-SysSec OMEN code was that by default it only generates 1 billion guesses and then stops. I "believe" you can override this using the "-e" endless flag, but I didn't figure this out until I had run my tests, so the following tests only display a 1 billion guess session vs. the 5 billion guess sessions I used in my previous blog posts.


Test 2) How does the original OMEN code perform compared to the PCFG OMEN code?


Test 2 Design:


After training the Rub-SysSec OMEN code on the RockYou1 training set (1 million random RockYou passwords), an attack will be run using a variation of the following command. Disclaimer: I didn't actually use the "-e" flag, but I'm including it here to make it easier the next time I need to copy/paste a command from this blog into a terminal.



  • ./enumNG -p -e | python3 ../Password_Research_Tools/checkpass.py -m 1000000000 -t ../../research/password_lists/rockyou/rockyou_32.txt -o ../../research/blog/tokenize/OG_omen_rock1_rock32_stats.txt




You'll notice I don't specify a specific training ruleset for enumNG since the Rub-SysSec code only supports one training set at a time (aka you need to retrain it every time you want to use a different ruleset).


As to the target sets, I'm going to run enumNG against both the RockYou32 test set (a different set of 1 million passwords from RockYou), and the LinkedIn password dump that I used in the previous blog posts.


Test 2 Results:





Test 2 Discussion:


While I expected to see differences between the PCFG OMEN and the Rub-SysSec OMEN, I was still surprised by how much they differed. I obviously made some improvements in the PCFG version of OMEN while totally forgetting what they were. As you can see from these tests, the original Rub-SysSec OMEN performs comparably to the new JtR Tokenize attack (the original OMEN did better on RockYou, but roughly the same or worse against LinkedIn).


The PCFG OMEN did much better though. These two attacks should be "mostly" the same! This difference in performance is like a grain of sand in my boot and I'd really like to better understand what makes them different. You'll notice both attacks have the "sawtooth" pattern though so there's certainly room for optimizing the underlying OMEN attack regardless of the implementation.


My first thought was that these two tools used different Markov orders (or NGrams). Both tools let you specify the NGram length via command line options during training, so having different default settings was a likely source of differences. Unfortunately, when looking at their settings, both the Rub-SysSec and the PCFG OMEN use a default of NGram=4 (the same as a 3rd order Markov chain). So that's ruled out.
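For readers who haven't dug into OMEN before, here is a very rough Python sketch of what NGram=4 training and "level" assignment looks like. This is only meant to illustrate the idea; it is not the actual Rub-SysSec or PCFG code, and it skips the Initial Probability (IP), length handling, and smoothing that the real implementations include:

    import math
    from collections import Counter, defaultdict

    def conditional_ngram_levels(passwords, n=4, max_level=10):
        """For each (n-1)-character prefix, count which characters follow it,
        then bin each conditional probability into an integer OMEN-style
        "level" using a logarithm. Level 0 = the most likely transitions."""
        transitions = defaultdict(Counter)
        for pw in passwords:
            for i in range(len(pw) - n + 1):
                prefix, next_char = pw[i:i + n - 1], pw[i + n - 1]
                transitions[prefix][next_char] += 1

        levels = {}
        for prefix, next_chars in transitions.items():
            total = sum(next_chars.values())
            for ch, count in next_chars.items():
                prob = count / total
                levels[(prefix, ch)] = min(max_level, int(round(-math.log10(prob))))
        return levels

    # Toy example: common transitions like "pas" -> "s" end up at low levels
    levels = conditional_ngram_levels(["password123", "password1", "iloveyou1"])

The level of a complete guess is then (roughly) the sum of the levels of the pieces it is built from, and OMEN enumerates every guess at level 0, then every guess at level 1, and so on. That per-level enumeration is also why the guess ordering later in this post looks so different from a "most probable guess first" attack.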


Another source of differences could be the alphabets each OMEN attack uses. The alphabet is the set of characters OMEN selects from when generating guesses. One change I made to the PCFG OMEN code was to allow for a variable number of characters in the alphabet based on the training data (as well as support for different character encodings such as UTF-8). You can see the differences between the two tools, which were both trained on the RockYou1 training data, below:



While the different alphabets are probably the cause of some of the differences, given that the "extra" characters in the PCFG OMEN are unlikely to be in many passwords, this doesn't explain the entire difference. My current theory is that the probability smoothing and how the PCFG toolset uses the "Initial Probability" (IP) may be the source of many of the other differences. Side note: Neither Rub-SysSec nor PCFG OMEN uses "Ending Probability" (EP) by default.
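As a quick illustration of what "letting the alphabet vary based on the training data" can mean, here is a minimal sketch of one way to do it (an illustration only, not the PCFG trainer's actual logic): keep any character that shows up at least a handful of times in the training passwords and drop the rest.

    from collections import Counter

    def build_alphabet(passwords, min_count=5):
        """Keep characters that appear at least min_count times in training;
        rarely seen characters are excluded from the alphabet entirely."""
        char_counts = Counter()
        for pw in passwords:
            char_counts.update(pw)
        return "".join(sorted(ch for ch, count in char_counts.items()
                              if count >= min_count))

A trained alphabet like this will pick up less common characters that actually occur in the training data (the "extra" characters mentioned above), while a fixed, hard-coded alphabet will never include them.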


What is very annoying though is I don't have any notes on what I did differently for smoothing, so to better understand this I need to dig back into the original Rub-SysSec version of OMEN as well as my own code. So this is something I need to research, but I'm going to defer most of that investigation to a later blog post.


TLDR: The Rub-SysSec and PCFG toolsets both use the OMEN attack, but there are implementation differences which cause them to behave very differently.


Question 3) Can the Tokenize approach be applied to the PCFG OMEN attack?


This test was brought up by Solar Designer on the john-users mailing list, and we discussed what this attack might look like [Here]. The proposed approach to test this can be summed up as follows:




  1. Use the tokenize.pl script to "modify" the training data with variable order Markov chains and create the "untokenize" JtR external mode, just like when running a normal tokenize attack with JtR's Incremental mode.

  2. Train a PCFG OMEN attack on this "modified" training data

  3. Use the PCFG OMEN attack to generate guesses, but before hashing/checking the password guesses, pipe the guesses into JtR --stdin mode to apply the --external=untokenize logic to convert the tokenize placeholder characters back to the multi-character strings.


This should be a good enough "smoke test" to see if there might be value in adding tokenize support to the main PCFG toolset.


Test 3 Training:


For this test, I'm finally updating my Tokenize code to the latest version in the John the Ripper Bleeding Jumbo GitHub repository. To make things easier to compare against previous runs, I'll be training the new version of Tok-OMEN on the RockYou1 1 million password training subset. Here are the commands I used:




  1. Run tokenize.pl to generate the sed script and the external mode code:



  • ./tokenize.pl rockyou_1.txt


  2. Convert the training file to make use of tokens by piping it through the sed command. Note: The new tokenizer sed script also converts the output to JtR pot format.


    • IMPORTANT: Remove the s/^/:/ from the end of the sed command so the results will be saved as a password-only file. This is because the PCFG trainer no longer has the ability to train from a potfile (long story: it was a buggy feature, and I need to revisit it as it is useful).

    • cat rockyou_1.txt | sed 'REALLY_LONG_SED_COMMAND' > tok_omen_rockyou1.training


  3. Also copy the External mode script generated by tokenize.pl into your john-local.conf file as:


    • [List.External:Untokenize_omen]


  4. Run the PCFG trainer on the tokenized training file.


    • -t = Training file. -r = Rule name. -c 0.0 = Only generate OMEN guesses

    • python3 trainer.py -t tok_omen_rockyou1.training -r tok_omen -c 0.0


This appeared to work correctly, as can be seen when I view the CP.level file (basically the 4-letter substrings OMEN uses for NGram=4) in Visual Studio Code:
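If you would rather sanity check the trained ruleset from a terminal instead of an editor, something like the following also works (the Rules/tok_omen path is my assumption about where the PCFG trainer saved the ruleset, so adjust it for your install):

  • ls Rules/tok_omen/Omen/

  • head Rules/tok_omen/Omen/CP.level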



Test 3 Design:


For a target set of passwords, I'm going to use the same RockYou32 1 million subset of passwords (different from the training passwords), and the LinkedIn password dump. This will allow me to directly compare this attack to the previous attacks I ran.


To test the attack and generate the first 25 guesses I used the following command. Please note: It is very important to pipe the result of the pcfg_guesser into JtR to make use of the associated untokenize_omen external mode.




  • python3 pcfg_guesser.py -r tok_omen | JohnTheRipper/run/john --stdin --external=untokenize_omen --stdout --max-candidates=25


The top 25 generated guesses were:




    1. ((Blank))

    2. k05icty

    3. butt

    4. 13babyg

    5. john

    6. 6543

    7. 5555

    8. b05ic1800ol1

    9. bigb

    10. 18chsadfg

    11. fu13cky

    12. p18chsasw

    13. blue

    14. h0708ok

    15. 13mommy

    16. s13coob

    17. budd

    18. augu

    19. 7777

    20. 6666

    21. sn13oop

    22. 07or765

    23. 13compu

    24. lu13cky

    25. mysg


Note: Since OMEN works using "levels" it doesn't generate the most probable guesses first. Instead it generates all guesses at a specific level first. So for level 1 there are 104 guesses that can be created with this training set. You can see the keyspace per level in the PCFG rules file under Omen/omen_keyspace.txt. Still, this looks weird, and [[Spoiler Alert

Matt Weir

Source: Security Boulevard
Source Link: https://securityboulevard.com/2024/12/analyzing-tokenizer-part-2-omen-tokenizer/


