Hacker News

Lexical differential highlighting instead of syntax highlighting (wordsandbuttons.online)

249 points | based2 posted 3 months ago | 70 Comments
70 Comments:
dwheeler said 3 months ago:

I understood the problem, but I found the page's explanation a little confusing at first. In particular, "lexical differential highlighting" misled me, because the word "differential" made me think that his algorithm was comparing lines or tokens in some way, and it doesn't do that.

Basically, this algorithm tokenizes the source code, and tries to color each token so that identical tokens have the same color, but similar-looking tokens have very different colors. When tokenizing it specially handles comments and quoted text.
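
A minimal Python sketch of that summary, not the page's actual code. The token pattern (`;` comments, single- or double-quoted strings) and the names `tokenize` and `color_for` are assumptions for illustration. Hashing guarantees that identical tokens get the same color; it only makes similar spellings uncorrelated, so it does not guarantee that look-alike tokens land far apart in color.

```python
import colorsys
import hashlib
import re

# Hypothetical token pattern: comments and quoted text are kept whole,
# everything else is split into word-like tokens.
TOKEN_RE = re.compile(r"""
    (?P<comment>;[^\n]*)                 # asm-style comment (assumption)
  | (?P<quoted>"[^"\n]*"|'[^'\n]*')      # quoted text, no escape handling
  | (?P<word>[A-Za-z_%$.][\w%$.]*|\d+)   # mnemonics, registers, numbers
""", re.VERBOSE)


def tokenize(source):
    """Return (kind, text) pairs for comments, quoted text and words."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


def color_for(token):
    """Map a token to a hex color derived from a hash of its exact text."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    hue = int.from_bytes(digest[:2], "big") / 65536.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.65, 0.85)
    return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))
```

Calling `color_for` on every `word` token and leaving `comment` and `quoted` tokens in a neutral color would give roughly the behavior described.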

That's an interesting approach to countering errors from "it's almost the same but I didn't notice they were different". I wonder: if I were trying to review source code that was malicious, maybe I could vary the color algorithm using a random source so that the source code writer couldn't make different tokens look similar in color. That might be an interesting countermeasure to some kinds of underhanded code.
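
One way the random-source idea could look, as a hedged Python sketch: key the color function with a secret chosen by the reviewer, so the code's author cannot predict which tokens will end up with similar colors. `make_colorizer` is a hypothetical helper, and using HMAC-SHA256 here is my assumption, not something the page does.

```python
import colorsys
import hashlib
import hmac
import secrets


def make_colorizer(key=None):
    """Return a token -> hex color function keyed by a per-review secret."""
    # A fresh random key per review session unless one is supplied.
    key = key if key is not None else secrets.token_bytes(16)

    def color_for(token):
        digest = hmac.new(key, token.encode("utf-8"), hashlib.sha256).digest()
        hue = int.from_bytes(digest[:2], "big") / 65536.0
        r, g, b = colorsys.hsv_to_rgb(hue, 0.65, 0.85)
        return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))

    return color_for
```

Within one session the mapping stays stable, so identical tokens still match; a new session reshuffles every color.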

saagarjha said 3 months ago:

Yeah, I thought this would do something like highlight all "mov" derivatives the same way and was somewhat surprised at the brevity of the code at the bottom…

shipof123 said 3 months ago:

That reminds me of something I read in Applied Cryptography when I was young, about how one could theoretically pass messages with “ \b” to generate infinite versions of “identical” text to cause collisions

kazinator said 3 months ago:

This idea is related to "rainbow parentheses" (e.g. for Lisp): different levels of parens just get arbitrary different colors. But matching parens are the same color, just like two occurrences of %ecx in the same line are the same.
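
For comparison, depth-based bracket coloring can be sketched in a few lines of Python. The palette and the function name `paren_colors` are arbitrary choices for illustration, and the sketch ignores brackets inside strings or comments.

```python
PALETTE = ["red", "orange", "green", "blue", "purple"]  # arbitrary palette
OPEN, CLOSE = "([{", ")]}"


def paren_colors(text):
    """Return (index, char, color) for each bracket, colored by nesting depth."""
    out, depth = [], 0
    for i, ch in enumerate(text):
        if ch in OPEN:
            out.append((i, ch, PALETTE[depth % len(PALETTE)]))
            depth += 1
        elif ch in CLOSE:
            # Step back out first so the closer gets its opener's color.
            depth = max(depth - 1, 0)
            out.append((i, ch, PALETTE[depth % len(PALETTE)]))
    return out
```

Matching brackets share a depth and therefore a color, while neighboring levels differ.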

human_banana said 3 months ago:

In emacs there's a package rainbow-delimiters-mode for parentheses, braces, brackets and what not, and rainbow-identifiers-mode which gives variable names unique colors.

andrepd said 3 months ago:

It's legitimately one of the best features of Excel. Does anybody know how I can achieve that in Sublime? The few options I found were subpar.

kaibee said 3 months ago:

Don't know about Sublime, but there's a plugin that does this for Visual Studio.

https://marketplace.visualstudio.com/items?itemName=TomasRes...

Probably not helpful to you, but maybe some other lurker.

neotek said 3 months ago:

I'd love to know the answer to this as well, it would be so useful.

fake-name said 3 months ago:

There's a sublime text package that does this for a bunch of different languages: https://github.com/vprimachenko/Sublime-Colorcoder

I'm not involved in any way, I just ran it for a while at one point.

synthc said 3 months ago:

There is also an emacs package that does something similar: https://github.com/jacksonrayhamilton/context-coloring

sprobertson said 3 months ago:
synthc said 3 months ago:

I think DrRacket also has something like this, but it shows lines between identical variables instead of using colors.

xvilka said 3 months ago:

Seems dead for many years already.

cjs_2 said 3 months ago:

How many updates per month are you expecting for a package like this?

xvilka said 3 months ago:

Multiple times a day, like radare2. Seriously, if there is no activity in 6 months - then the project is dead.

mikekchar said 3 months ago:

This is a lexical highlighter that tries to highlight similar, but different text differently. There's a point in time where there are no new features necessary.

radare2 is a portable reversing framework. I can't think of 2 projects more dissimilar. Perhaps you were thinking that the highlighter actually did something other than color text in an arbitrary way? Can you give an example of something that you would expect to change about it, especially at the rate of multiple times a day?

guessmyname said 3 months ago:

> There's a sublime text package that does this for a bunch of different languages

You don’t need a package for this, Sublime Text 3 already does this automatically [1].

[1] https://www.sublimetext.com/docs/3/color_schemes.html#hashed...

nh2 said 3 months ago:

How can I use it?

The simplest way seems to be to use the "Celeste" color scheme which implements this. Is this the only way? I'd like to use a dark theme, like the default Monokai.

guessmyname said 3 months ago:

Yes, “Celeste” is the only theme with support for semantic highlighting.

For dark mode, I use this project — https://github.com/cixtor/monnokay

fake-name said 3 months ago:

Well, neat!

I haven't used the plugin since the ST2 days, so I didn't realize it was no longer needed.

soulofmischief said 3 months ago:

Webstorm has an option for this and it makes things like dense enclosures or JSON actually parsable.

galaxyLogic said 3 months ago:

Which feature is that? I've been using WebStorm for some time and wishing for a feature that would highlight all matching parentheses (), [] and {}.

_virtu said 3 months ago:

- plugin: rainbow brackets

- preference: semantic highlighting

galaxyLogic said 3 months ago:

Thanks. I tried it but it did not quite do what I needed so I uninstalled it. (I'm afraid of plugins in general taking performance away.) It worked on JS files, but I have HTML documents containing (example) JavaScript etc. code. It seems it did not react to parentheses in them. Also, even in plain JS files you may have strings containing parentheses.

Standard WebStorm already highlights matching parentheses in JavaScript and does a good job at that.

soulofmischief said 3 months ago:

I don't use rainbow brackets, but I do use semantic highlighting. It's worth seeing if semantic highlighting would still be useful to you. It greatly helps scanning speed.

cylon13 said 3 months ago:

What made you decide to stop using it?

gpspake said 3 months ago:

I remember Doug Crockford mentioning the idea of scope based highlighting for JavaScript in a workshop years back and thinking it would be useful. Cool to see it pop back up here.

Edit: Here's a scope based js highlighting repo that cites Crockford as the inspiration but unfortunately he posted the linked description on Google+ so... uh... oops https://github.com/azz/vscode-levels

zokier said 3 months ago:

Complete tangent, but one thing that I've wondered about modernish asm mnemonics is how complex they are, and especially how much type information they encode in a semi-structured way. Taking the author's example of PMULHUW, the core operation is MUL(tiply), P for packed integers, H for high result, U for unsigned, and W for word sized (16 bit). I feel like there must be a better way to express the same thing that wouldn't lead to stuff looking like one word all caps alphabet soup. I don't know exactly what that would be; spelling out everything would probably make assembly way too verbose. So some sort of middle ground would be nice.

chc4 said 3 months ago:

> I feel like there must be a better way to express the same thing that wouldn't lead to stuff looking like one word all caps alphabet soup.

Yes, that's called a programming language :^)

Assembly is usually essentially a macro engine over the actual instructions you are emitting for your processor, and the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too. Heck, the author mentions specifically reading assembly too, so knowing what you're reading is 1:1 with the actual instruction stream is helpful, no matter how bad the official names are.

Actual programming languages just abstract away some complex instructions like SSE vectorizing (which have famously terrible names) to some high-level API and intrinsic functions. And you should too.

zokier said 3 months ago:

> the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too.

I don't see why that has to be the case; why must I use Intel-specified mnemonics instead of my own syntax? While not as radical, the AT&T vs Intel syntax split demonstrates that the vendor syntax is not the only option. As long as the syntax captures all the details of the instructions and is completely unambiguous, it should be perfectly interchangeable.

I specifically do not desire a higher level of abstraction because I want to maintain that 1:1 relation with the actual machine code. Heck, even Intel mnemonics do not truly have a 1:1 relation to machine code, because the instruction (encoding) can depend on operand types.

breck said 3 months ago:

I’ve done some experiments with tree languages that compile to ASM. I think it’s definitely the way forward.

okaleniuk said 3 months ago:

Actually, it would be interesting to experiment with coloring all the abbreviations separately. P, then MUL, then H, then U, then W (or UW altogether). Not sure if it works, but it's something worth trying.
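
A toy Python sketch of that split, covering only the packed-multiply family mentioned in this thread (pmulhw, pmulhuw) plus pmullw. The regular expression and the `split_mnemonic` name are assumptions for illustration; real x86 mnemonics are far less regular, so a general splitter would need a table rather than a pattern.

```python
import re

# Toy grammar: P + MUL + (H with optional U | L) + W. Nothing else is handled.
PMUL_RE = re.compile(r"^(p)(mul)(?:(h)(u?)|(l))(w)$", re.IGNORECASE)
MEANING = {
    "p": "packed", "mul": "multiply", "h": "high result",
    "l": "low result", "u": "unsigned", "w": "word (16 bit)",
}


def split_mnemonic(mnemonic):
    """Split a pmul* mnemonic into (part, meaning) pairs; pass others through."""
    m = PMUL_RE.match(mnemonic)
    if m is None:
        return [(mnemonic, None)]
    # Unmatched or empty groups (no 'u', no 'l') are dropped.
    return [(part, MEANING[part.lower()]) for part in m.groups() if part]
```

Each returned part could then be given its own color, so that the lone `u` in pmulhuw stands out against pmulhw.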

lifthrasiir said 3 months ago:

[1] was a similar idea where color is determined by the prefix, so for example `currentIndex` and `randomIndex` are distinguished from each other but `currentIndex` and `currentIdx` are not.
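
A rough Python sketch of prefix-keyed coloring, assuming the key is simply the first few characters of the identifier. The prefix length of 4 and the names `prefix_key` and `hue_for` are my assumptions, not details taken from [1].

```python
import hashlib


def prefix_key(identifier, length=4):
    """Hypothetical key: the first few characters, lowercased."""
    return identifier[:length].lower()


def hue_for(identifier, length=4):
    """Hue in degrees (0-359) chosen from a hash of the prefix only."""
    digest = hashlib.sha256(prefix_key(identifier, length).encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % 360
```

Here `currentIndex` and `currentIdx` share the key `curr` and so always get the same hue, while `randomIndex` is keyed on `rand` and is colored independently.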

I'm not sure about both because i) there are only a handful of mutually distinguishable colors ([1] does mention the same complication), and ii) we often want to highlight both the similarity and difference among identifiers and the cutoff is not clear. For i) we may want to leverage more formatting; for ii) I really don't have a good solution.

[1] https://medium.com/@evnbr/coding-in-color-3a6db2743a1e

css said 3 months ago:

Wow, this actually looks amazing for math (though it seems to be stripping out a lot of the code I pasted in): https://i.imgur.com/Iur9FgK.png

How difficult would it be to implement this as a VSCode extension?

petschge said 3 months ago:

This looks pretty good, but notice how it does not split "log(difference_squared" into two tokens. Adding '(' and ')' as delimiters should fix that.
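
The effect of the delimiter set can be shown with a small Python sketch. The default delimiters (space, tab, comma) are an assumption, since I have not checked which ones the page actually uses, and `split_tokens` is a hypothetical name.

```python
def split_tokens(line, delimiters=" \t,"):
    """Split a line into tokens at any of the given delimiter characters."""
    tokens, current = [], ""
    for ch in line:
        if ch in delimiters:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With the default set, `log(difference_squared)` stays one token; passing `" \t,()"` splits it into `log` and `difference_squared`. The sketch drops the delimiters themselves, whereas a highlighter would presumably keep them for display.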

css said 3 months ago:

Good point. That helps, but it still strips about half of the lines of my code out for some reason. Specifically, this part: https://i.imgur.com/L117fYm.png

BenFrantzDale said 3 months ago:

I love that visually I can find usages of, say, `alpha`.

I do wish it did some syntax highlighting, but one could easily imagine blending between this and conventional syntax highlighting.

panopticon said 3 months ago:

Tangential, but "Just as every other piece of code on Words and Buttons, it's properly unlicensed." reads like the code is literally unlicensed and not using the Unlicense license.

It's a little weird to me because unlicensed code is very different than the Unlicense license.

ChrisSD said 3 months ago:

And I'd add that CC0 is more "properly unlicensed" than the Unlicense is. Or at least more thoroughly so.

canadaduane said 3 months ago:

I think this is also called semantic coloring. Visual Studio Code has it on the roadmap to try this year: https://github.com/Microsoft/vscode/wiki/Roadmap#editor

sixplusone said 3 months ago:

No, semantic coloring is about the editor having deep knowledge about your code, this is about having very similar looking names or lexemes appear different. FTA:

> It's fine that mov doesn't look like eax, but I'd rather prefer pmulhw and pmulhuw to be shown as differently as possible.

jcelerier said 3 months ago:

KDevelop pioneered this a decade ago: https://zwabel.wordpress.com/2009/01/08/c-ide-evolution-from...

gmueckl said 3 months ago:

Eclipse has also had this for ages at this point. I don't remember when they introduced it, but when you can memorize the meanings of all the colors, it's great.

m0zg said 3 months ago:

I'm not a fan of this approach in general, but I am a fan of highlighting instructions from different subsets in different colors in asm, and perhaps differentiating the saturation by latency/throughput. I.e. a "heavy" instruction should probably be bright, urgent red, whereas loads, stores, adds, bit ops should probably be more muted.

IshKebab said 3 months ago:

Something like this is implemented in vscode-clangd. I used it for a bit but it's just too colourful. There are just colours everywhere and it's overwhelming. I went back to normal syntax highlighting.

KuhlMensch said 3 months ago:

Curious. I mean, it sounds like relying simply on contrast rather than on structure. I know our visual system is insane at contrast, and we, as humans, tend to group tokens as a shorthand.

What makes me immediately pause is when I reflect on reading JavaScript: how often do I scan past 3+ lines using colour as my "bridge"? As far as I can remember, not often. Maybe I've overestimated colour-to-lead-me-through-structure. Maybe it is often colour-to-give-me-token-rhythm. Curious.

I'll have to remember to load up CSS or a test suite (with lots of framework calls) using this approach.

SilkySailor said 3 months ago:

I really like this idea. I always wanted to try to take this to insane levels. For example, for large code bases have different images associated with different modules. So that your brain has more things to latch on to. e.g.: This function from the banana module is calling the teddy bear module. It seems a bit absurd since there is no correlation between the image and the module functionality but I still want to try it.

stochastimus said 3 months ago:

This is really cool. It kinda looks like rainbow salad, but who cares? For me at least, it is much easier to visually parse.

DarmokJalad1701 said 3 months ago:

Nice to see some MASM32 code in there in one of the examples. That's from a WIN32 app if I am not wrong.

Brings back memories.

FrancisNarwhal said 3 months ago:

Oh my god this would have saved my bacon two days ago. p_value_default is so visually similar to v_value_default that after sitting there with another developer trying to figure out the problem for 30 mins we rewrote the whole method.

Only the next day after the deadline pressure was gone did I spot the problem.

Avamander said 3 months ago:

I understand it in the case of assembly, but I don't think it'd work for something like Python better than existing syntax highlighting. So it's nice and I hope things like Radare or IDA adopt it where people even intentionally make syntax highlighting nearly impossible.

ggm said 3 months ago:

I encourage the original author to find a way to talk about assembly coding in the nuclear industry.

gcbw2 said 3 months ago:

what do you expect to be different from your run-of-the-mill maintenance of outdated industrial automation gig?

YeGoblynQueenne said 3 months ago:

At a guess, an increased probability of causing a criticality accident as a result of getting a program slightly wrong.

exDM69 said 3 months ago:

I'm assuming the "reading assembly" part is verifying compiler output matches what the programmer thinks and signing it off as a "blessed binary".

Some safety critical areas of software are done this way, in aerospace for example. But run-of-the-mill automation jobs aren't.

ggm said 3 months ago:

bit flips from surplus neutrons? TMR? Batshit crazy lack of process checks on 'what does this button do'

war stories.

actually, I encourage anyone in coding to share run-of-the-mill maintenance of outdated industrial automation, as a gig. I'd read that blog.

pcwalton said 3 months ago:

In this particular case, the highlighting is a clever workaround for the fact that x86 register naming conventions are awful. RISC architectures tend to number the registers, which makes things significantly easier to read.

m463 said 3 months ago:

Not code, but I'm surprised that email clients don't have better colorization from the get-go.

I think it would be the single best thing to help a huge amount of people.

gnuvince said 3 months ago:

There are too many colors in too many places. Everything is highlighted and nothing stands out.

galaxyLogic said 3 months ago:

I agree. Rather than rainbow the brackets I think a better solution is to highlight the matching brackets with a temporarily different color as user moves the cursor.

Or at least make it easy to turn the rainbows on and off.
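
The cursor-driven alternative boils down to finding the partner of the bracket under the cursor. A minimal Python sketch, assuming `pos` is a valid index into the text; `find_match` is a hypothetical helper, and it ignores brackets inside strings or comments, which a real editor would have to handle.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {v: k for k, v in PAIRS.items()}


def find_match(text, pos):
    """Return the index of the bracket matching text[pos], or None."""
    ch = text[pos]
    if ch in PAIRS:
        step, same, partner = 1, ch, PAIRS[ch]      # scan forward
    elif ch in CLOSERS:
        step, same, partner = -1, ch, CLOSERS[ch]   # scan backward
    else:
        return None
    depth, i = 0, pos
    while 0 <= i < len(text):
        if text[i] == same:
            depth += 1
        elif text[i] == partner:
            depth -= 1
            if depth == 0:
                return i
        i += step
    return None
```

An editor would call this on each cursor move and temporarily recolor the two positions, which keeps the rest of the buffer quiet.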

Insanity said 3 months ago:

which forces you to read everything individually and not miss something. I prefer less highlighting for this reason. I highlight a few keywords but other than that I don't highlight. I find it helps me _read_ the code rather than skim the code. (and for skimming, I'd grep through it most likely looking for something specific rather than trying to understand it.)

Analemma_ said 3 months ago:

> In 2013 I was working in nuclear power plant automation ... the job required reading a lot of assembly code.

Does anyone else find this terrifying? Nuclear power plant automation should be done in the safest of the safe languages. I would be alarmed at the thought of stuff like this being written in C, never mind in assembly!

holy_city said 3 months ago:

Not really. There are plenty of chips out there without even a C compiler. Some don't even support Turing Completeness. There's even more that were designed and installed before manufacturers started slapping C compilers together for their DSPs, FPGAs, and MCUs.

It would be weird to care about memory safety when your board doesn't even have a heap!

ARandomerDude said 3 months ago:

To me, it's less terrifying than a complete rewrite in a modern language. Modern languages are great. Rewrites are often littered with bugs.

pvg said 3 months ago:

Systems like that tend to be designed with different kinds of safeties. A mildly silly example - your typical Rails app doesn't have a watchdog timer, your toaster probably does.

okaleniuk said 3 months ago:

An excellent example!

sixplusone said 3 months ago:

Yes, he said reading assembly, not writing. Whatever they use, I'm glad that someone's having a glance at what the compiler spits out. He could also be talking about microcontrollers, and in an industrial setting PLCs wouldn't be unexpected.

splittingTimes said 3 months ago:

Does something like this exist for Java in Eclipse?