People Recognize Objects by Visualizing Their “Skeletons”(scientificamerican.com)
Imho the most important part of the article:
One concern with the study is that the authors generated the objects specifically from skeletons rather than deriving them from shapes, either natural or human-made, covered by skin, metal, or other materials that people encounter in their day-to-day life. “The shapes that they generated are directly related to the hypothesis they’re testing and the conclusions they’re drawing,” says James Elder, a professor of human and computer vision at York University in Toronto. “If we’re interested in how important skeletons are to shape and object perception, we can’t really answer that question by only looking at the perception of skeleton-generated shapes. Because obviously in a world of skeleton-generated shapes, skeletons are probably fairly important because that’s the way those shapes were made.”
I looked into the paper first and thought: yea well it's really not surprising, that the skeleton models are most predictive for the kind of objects they tested. Their skeleton really is all that defines them.
The only thing they tested and proved is: Skeleton models are predictive for human decision when recognizing objects made just from skeletons with little flesh and hardly any texture whatsoever.
Nevertheless I think skeleton models are a good thing for object recognition
> The only thing they tested and proved is: Skeleton models are predictive for human decision when recognizing objects made just from skeletons with little flesh and hardly any texture whatsoever.
Isn’t it an important result that humans are able to recognize when an object is made just from skeletons and optimize recognition to focus solely on the skeleton? That sounds pretty neat to me
Yes, that's neat, but it's very different to and far more limited than the generality the title and the rest of the article claim.
> Skeleton models are predictive for human decision when recognizing objects made just from skeletons with little flesh
Exactly. One minute reading. Conclusion: junk science.
Humans are much better at noise removal than computers. Many people can look at an object and see what's extraneous to the basic form--what's left is the skeleton. Computers, so far, don't have the context to do this, and instead try to recognize objects based on visual patterns, etc.
Perhaps "weighting" models, allowing algorithms to look for centers of gravity and mechanical behavior would help. Humans exist in a 3d world, but we also interact with a simplified 3d world.
We don't worry about the plastic bag in the street because we can feel how our car will respond. It's trivial. There's no "weight" attached to the object.
Weight and balance are incredibly important psychologically (see the burgeoning popularity of weighted blankets), and that's a thing that's missing for computers. Having a tangible sense of the world in our minds gives us a huge leg up when relating to it.
I'd like to argue that it's rather the extreme connection density and feedback loops that connect all these different concepts. Taken on their own, each of these models that the brain (and perhaps artificial neural networks) construct are weak predictors. This is compensated for by their sheer number and the plasticity of the feedback loops between them.
As you say, when a human observes a plastic bag, a vast number of different models and transformations aggregate their predictions in a highly nonlinear fashion:
The bag has a plasticky look, it seems to be flopping around, it is slightly see-through, it produces a certain sound that implies a hollow cavity etc... these primary observations are processed by the sensory neurons, which do a first pass filter to remove noise completely subconsciously. If they don't get enough feedback, perhaps it was just a mirage of a bag - not real - and you do a double take and realize it was just a play of shadows.
But let's assume first pass feedback confirms that it is likely a real sensation. The primary inputs are then confirmed by secondary predictions:
The bag is carried by the wind, implying low density, the sound it makes is common for empty thin plastic materials, it has a matte surface that lets through some amount of light, etc... the subconscious thus makes the conclusion that it is indeed probably made of thin plastic and therefore of low density and low hardness and therefore not a threat in terms of high velocity impact. Your swerve reflex thus doesn't kick in and you drive straight.
But this reasoning requires an in-depth model of the world. It isn't enough to just recognize the shape of a bag, because that could be a myriad of other things. Only by having a model and thus understanding of all these different aspects of reality can one make a prediction as robustly as a human. And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
>> But this reasoning requires an in-depth model of the world. It isn't enough to just recognize the shape of a bag, because that could be a myriad of other things. Only by having a model and thus understanding of all these different aspects of reality can one make a prediction as robustly as a human.
This is a great summary of why I think current deep-learning based methods will never lead to 'intelligence' that is good enough to e.g. navigate the real world like humans do. They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification.
>> And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
This observation I don't really follow, I would say the bar to match human reasoning abilities is extremely high for exactly the reasons you described yourself.
>> This observation I don't really follow, I would say the bar to match human reasoning abilities is extremely high for exactly the reasons you described yourself.
Sorry, I should've phrased it better. I was trying to imply that just matching human reasoning abilities is indeed an undertaking of incomprehensible complexity, _and_yet_ it is still highly error prone. I believe a system that replaces humans will be under close scrutiny and just being at par won't be enough.
>> This is a great summary of why I think current deep-learning based methods will never lead to 'intelligence' that is good enough to e.g. navigate the real world like humans do. They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification.
I'm not a neurologist or cutting edge ML researcher by any measure, but this is my viewpoint as well. The astounding amount of information and internal models, and the astounding complexity of these models in terms of connections and feedback loops (and their plasticity) implies to me that our current pedestrian attempts at AI are nowhere near what is required for GAI, let alone human level GAI.
It seems to me like a lot of hubris to suggest (as I've seen people do) that in just a couple of years we could get there. Currently we have not even a clue how consciousness arises. We have evidence that it is physically possible, but that's it.
The leading enterprise in the area, Google/YouTube, routinely fails to identify objects and sounds in videos.
My prediction is that what we have currently is a local optimum that expands our capabilities a lot, compared to what we had before, but in terms of genuine insight into human level AI, it will prove to be a dead end.
I'd love to be proven wrong though.
Regarding models of the world: isn't it conceivable that a computer could have a smaller, specialized, model of the world specific to its task?
A car could have a model of reality whose scope is only encompassed by the context of roads and driving. It is conceivable to me that a car could have an in-depth model of the "driving-world" that would allow it to make multi-sensory, tiered observations and predictions akin to human cognition.
> They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification
Deep learning is more than just imagenet classification or object detection.
There are many approaches that require more understanding, such as future video prediction, captioning, question answering, reinforcement learning requiring an implicitly learned model of how the environment works beyond mere appearances, image generation, structure extraction, anomaly detection, 3d reasoning, external memory, few/one/zero shot learning, meta-learning, etc etc.
The field is huge and whatever "obvious shortcomings of deep learning" non-specialists come up with after reading popular articles are probably being tackled already in many groups and have several lines of approaches and papers already.
> Computers, so far, don't have the context to do this
As someone who did their Ph.D thesis on the statistics of shape using models based on the medial axis (i.e., a skeleton), I would beg to differ.
Whether these models are as easy to apply (computationally and conceptually) as the currently in-vogue techniques is another question, but there is nothing magical here that computers are incapable of.
Sure, just the skeleton part. How about density, deformation, and reflectiveness? All at once? We can simulate these, so we can obviously detect them, but not yet.
Hah, I was going to point this out. We already use MAT (Medial Axis Transform) - quite heavily in 3D printing and object/space recognition.
I think one of the most important aspects of human vision that everyone seems to overlook is that it's active. We aren't just sitting in a dark room looking through a video feed our whole lives, we actually live in and interact with the world.
Our eyes are active in that they move freely and can focus at different distances. We also happen to have two of them and our brains have a model for how far apart they are. These two features (active focusing and binocular vision) give us incredible depth perception.
Our brains use this depth information to separate objects from the background, something a machine learning algorithm cannot do if you're just feeding it a billion photo labeled training set.
The brain also makes decisions very early and updates them as it has time to reconsider the data. We've all probably had cases where we saw a person sitting down, then realised it was just a jacket draped over a chair.
At least from my own personal experience, it's very biased too. It seems the more tired we are, the more likely we are to incorrectly recognise immobile objects as people or animals at a glance.
I know you meant it as a general statement, but I think it depends on the kind of noise, and obviously the type of signal. It's trivial for a computer to look past 'fixed pattern noise' to find the data in an audio-visual signal. For certain noisy signals, a compute device could amplify/scale the data (and then perform noise removal) to retrieve the signal, etc.
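As a rough illustration of why fixed-pattern noise is the easy case: if the pattern is identical in every frame, it can be estimated from a few frames captured with no signal and then subtracted. A minimal Python sketch, assuming a 1-D list of pixel values and a purely additive pattern (the names `estimate_pattern` and `subtract_pattern` are made up for this example, not from any library):

```python
def estimate_pattern(dark_frames):
    """Average several signal-free frames to estimate the per-pixel offset."""
    n = len(dark_frames)
    width = len(dark_frames[0])
    return [sum(frame[i] for frame in dark_frames) / n for i in range(width)]


def subtract_pattern(frame, pattern):
    """Remove the estimated fixed pattern from one captured frame."""
    return [value - offset for value, offset in zip(frame, pattern)]


# Hypothetical data: the sensor adds the same offsets to every frame.
pattern = [3, 0, 5, 1]
signal = [10, 20, 30, 40]
dark_frames = [pattern[:] for _ in range(3)]      # frames with no signal
captured = [s + p for s, p in zip(signal, pattern)]

recovered = subtract_pattern(captured, estimate_pattern(dark_frames))
```

Random (non-fixed) noise is a different matter, since there is no stable pattern to estimate in advance.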
This is regularization. Once that's solved, humans have no intrinsic advantage.
Direct link to the paper: https://www.nature.com/articles/s41598-019-45268-y
>Here we tested whether skeletal structures provide an important source of information for object recognition when compared with other models of vision. Our results showed that a model of skeletal similarity was most predictive of human object judgments when contrasted with models based on image-statistics or neural networks, as well as another model of structure based on coarse spatial relations. Moreover, we found that skeletal structures were a privileged source of information when compared to other properties thought to be important for shape perception, such as object contours and component parts. Thus, our results suggest that not only does the visual system show sensitivity to the skeletal structure of objects32,36,37, but also that perception and comparison of object skeletons may be crucial for successful object recognition.
I think it's telling that even young children are exceptionally good at object recognition, and if you ask them to draw an object, they'll typically give you a "skeleton" with basically no ability to reconstruct the textural components.
I think the real interesting question is: what is the internal representation of this skeleton? A graph? A forest of graphs? Some kind of field that's graph-like?
I think young children's drawing ability is more indicative of the type of tool we are giving them, they only have the ability to draw a fixed width line, how else would you represent a limb?
Well, it's not at all obvious that a line is the naive representation of a limb rather than a particularly intelligent encoding of it.
For example: CNNs, though pretty good at detecting limbs (and miscellaneous other things) have only a very limited ability to encode structural information in this way. An interesting open question in the field is what is the "right way" to encode this sparse, graph-like structural data (hence capsule networks).
The same instrument in the hands of a skilled artist would have no trouble using it to produce a convincing likeness of whatever they were drawing.
Absolutely, but that requires advanced fine motor control, understanding of how the instrument lays down color and what multiple layers of color look like when on top of each other, and so on.
The naive way to use the instrument is to run it over the area one or a few times. The simplest way to do that in terms of motor control (e.g. fewest turns) is to run it up and down the longest axis one or more times. That's exactly what a child does.
I'd say (out of experience) that people do not recognize objects by visualizing their skeletons, but they recognize objects by a generalization of their shape.
In case of recognizing other animals, the generalization takes the form of a 'tree' of objects connected via nodes, which is actually what a skeleton does to a body.
But that does not happen with other objects, i.e. cars. For cars, the generalization is that of a box with circles at the bottom (for the wheels).
It should also be noted that the details of objects are not really lost; they are remembered up to a certain degree, which allows us to tell a person with fat body parts from a person with thin body parts of the same height and otherwise same general appearance.
The degree of generalization is also responsible for our not being able to remember a new face that strongly resembles a face we already know, until we recognize some special attributes in the new face that the old face does not have. In this case, the degree of generalization is such that it does not allow us to immediately tell the old face apart from the new one.
I'd say that recognition works in a step like fashion:
-we first recognize a generic abstraction of the object at hand: if the object is inanimate or not.
-then we recognize in which category of the inanimate or living objects the object under recognition is (for example, is it a human? an animal? etc).
-then we recognize more details; is the person tall, fat or blond? for example.
-then we recall our connections to that person, resulting in choosing a response.
I don't have data to back the above up, it's all from intuition and personal experience, but that's how I think objects are recognized by brains.
This seems very obvious.
Machines are taught from flat images. How can they be expected to create 3D from this?
Humans learn from binocular vision, and from multiple angles as we move around an object, making it a lot easier to get an idea of its shape.
My daughter aged 18 months could already recognise abstract signs like the mother and baby or disabled sign just from knowing the real object. Which must say something about the way she stored the representations of them.
Are children who lose one eye at a very young age unable to see and identify 3D objects?
They don't see flat static images, but a continuous stream of input that changes view angle constantly as they and the object make subtle movements. Moreover, they can interact with things and gather more visual information where needed. (Anything too big to interact directly with is probably too far away for binocular depth perception to be of much use.) See a big list of monocular depth cues here: https://en.wikipedia.org/wiki/Depth_perception
Why not use two cameras for training AI then?
Because they’re using existing data. You need thousands, maybe millions of images to train an AI to recognise something well, and only recognise the right characteristics. No-one has the resources to go take all those photos themselves.
Anyone know of a visual recognition AI being trained also with depth data? Would be interested to see what difference it makes.
This relates to something else I noticed differently about my daughter learning. You can show her one photo of a lion, from one angle and she will recognise other lions later on, at different angles. I think she must have seen enough animals already from many angles to have generalised their shape and then be able to presume the new animal is similar and just see the new characteristics like a mane. Something very different is happening in Human brains!
You are right, and it would be interesting to quantify how much it could improve AI if datasets were binocular.
pictures uploaded to facebook and google were only taken by 1 camera :P
Note that the leading image classification algorithms are apparently trained to recognize texture more than shapes, because that's the easiest way to win at current benchmarks. But that can be fixed:
So what are the implications of this for topological data analysis as an alternative (or complementary) framework to the convolutional approaches in image analysis?
(I'm being loose with language, but a CNN is not an optimal "hole finder", while persistent homology is not optimal for telling different kinds of fish apart.)
I was looking at something along those lines recently "TopoResNet: A hybrid deep learning architecture and its application to skin lesion classification" on arxiv.
>people do not evaluate an object like a computer processing pixels, but based on an imagined internal skeleton
Well, maybe not how computers typically process pixels nowadays, but back in the old days of computer vision one technique for simplifying an image was skeletonization : https://en.wikipedia.org/wiki/Topological_skeleton
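For anyone curious what that old-school skeletonization looks like in practice, here is a minimal pure-Python sketch of Zhang-Suen thinning, one well-known thinning algorithm. It is not the method used in the paper and not a true medial axis transform. It assumes a binary image given as a list of lists of 0/1 with a border of zeros around the shape; `zhang_suen_thin` is just a name chosen for this example.

```python
def zhang_suen_thin(grid):
    """Thin a binary image (lists of 0/1, zero border) to a thin skeleton."""
    img = [row[:] for row in grid]          # work on a copy
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9: the 8 neighbours, clockwise, starting directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1],
                img[r + 1][c + 1], img[r + 1][c], img[r + 1][c - 1],
                img[r][c - 1], img[r - 1][c - 1]]

    def transitions(n):
        # Count 0 -> 1 steps when walking once around the neighbourhood.
        return sum(1 for a, b in zip(n, n[1:] + n[:1]) if a == 0 and b == 1)

    changed = True
    while changed:
        changed = False
        for step in (0, 1):                 # the two sub-iterations
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Delete all flagged pixels at once, after the full scan.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3-pixel-thick horizontal bar inside a 7x12 image.
bar = [[1 if 2 <= r <= 4 and 1 <= c <= 10 else 0 for c in range(12)]
       for r in range(7)]
skeleton = zhang_suen_thin(bar)
for row in skeleton:
    print("".join("#" if v else "." for v in row))
```

On this bar the result is a one-pixel-wide line along the middle row, which is the "stick" a skeleton model would work with.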
Yeah, now I'm wondering what topological skeletons might have in common with the abstract simplicial complexes generated by running a persistent homology algorithm on a point cloud.
I don't think the study can be used to draw the conclusions that the article is trying to draw. The study presents new objects which are derived from skeletons for people to learn and identify. IMO people learn differently in short term vs long term. Short term, we try to reduce the dimensionality of the input to things we can hold in working memory. In this study that would be the skeleton of the object. That doesn't mean that the pattern holds up for long term learning (which is mostly how we visually identify things, because we've seen them many times already). The main reason I bring this up is because it seems to be in direct contrast with studies which show the opposite (i.e. that humans do operate like machines in identifying objects). That study was done by comparing the brain regions which activated when the person was exposed to visual input and found a consistent location which was activated due to seeing a horizontal / vertical line.
"Do humans learn the same way as computers?" Computers learn whichever way humans program them to...
OK, rephrase the question to "Does the way we've programmed computers to learn happen to resemble the way that humans learn?"
Our models of human learning are crude at best. Some programs attempt to approximate those models. But it resembles how humans learn the same way a stuffed animal chicken resembles a T-Rex.
no it does not
This is what fascinates me about machine learning. You can train an algorithm to self-adjust in ways that a human doesn't have the capacity to understand, and it can then perform human-like tasks.
If we get good at allowing programs to generate programs that find new ways of learning, are they still behaving the way humans programmed them, or have they shifted to a law of nature that is fundamentally out of our control?
When it's all said and done, we decide if it was us, or a force beyond humans.
Programming that becomes more than just the sum of its parts. We can create systems or programs that can get beyond our "control" and function in unexpected ways.
This reminded me of a scene from the movie The Omen where the skeleton of Damien's mother, Maria Scianna, turns out to be a jackal's skeleton.
Reminds me a lot of the ancient idea of Platonic forms: https://en.wikipedia.org/wiki/Theory_of_Forms
Obviously not the same thing, but I think it's an interesting association.
Related problem (ImageNet-trained CNNs are biased towards texture)
This reminds me loosely of a paper I found on arxiv recently: "Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations"
Reminds me of Plato's theory of forms
I think that object recognition is hard because humans have much more data than computers. People see with two eyes which can focus on different distances, so our brain has 3D data to learn from. We later learn to recognise the same objects in pictures.
Computers usually start from flat pictures, and that trips the learning process.
I have zero data to back it up. Just my hunch :D
So children who are blind in one eye take longer to learn to recognize objects?
No, but they cannot see depth very well because they can't use stereoscopic vision (triangulation). However, there are other cues that are used to infer depth, such as occlusion (if one object partially covers another, then it is closer to you), perspective (if two objects that you know are similar in size appear different in size, the smaller one is farther away), etc.
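To make the triangulation part concrete: for an idealized pair of parallel pinhole cameras (or eyes), depth is focal length times baseline divided by disparity, Z = f * B / d. A small Python sketch under that assumption; the focal length of 800 px and the 0.065 m baseline (roughly the spacing between human eyes) are illustrative numbers, and `depth_from_disparity` is a hypothetical helper:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres for an idealized, rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def disparity_at_depth(focal_px, baseline_m, depth_m):
    """Inverse relation: how many pixels of disparity a given depth produces."""
    return focal_px * baseline_m / depth_m


FOCAL_PX = 800.0      # assumed focal length in pixels
BASELINE_M = 0.065    # assumed distance between the two viewpoints

near = depth_from_disparity(FOCAL_PX, BASELINE_M, 26.0)
far_disparity = disparity_at_depth(FOCAL_PX, BASELINE_M, 50.0)
print(f"26 px of disparity -> {near:.2f} m")
print(f"an object 50 m away -> {far_disparity:.2f} px of disparity")
```

Because disparity falls off as 1/Z, the stereo signal gets very small for distant objects, which is where the other cues have to take over.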
A friend of mine cannot see with one eye, and yet he is a painter. One thing I know he cannot do is drive a car.
> One thing I know he cannot do is drive a car
That's specific to your friend, not true in general. Lots of people drive with only one functional eye. At the visual distances involved, the depth perception provided by stereoscopic vision doesn't matter much. Especially with all the relative motion. My dad has been driving successfully for 65 years with only one working eye.
I drove for many years with amblyopia, so I didn't have much stereo depth perception. It did cause problems with parallel parking.
To be honest I don't know whether he is allowed to drive or not, he thinks he isn't allowed and never pursued it.
There's a cutoff of a 20/40 on at least one eye (corrected with class A restriction)
>> One thing I know he cannot do is drive a car.
You are allowed to drive a car if you only have one eye, though.
Besides - I don't really think I myself use the perception of depth from my stereo vision as much as I know that the cars on the road are all standard dimensions.
There aren't many objects on the road which have similar proportions but differ in size, and so could easily be mistaken for one another.
You have enough distance information by simply observing the visible width of the car in front of you.
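This is the known-size cue, and under a simple pinhole camera model it is one line of arithmetic: distance = focal length * real width / apparent width. A hedged Python sketch; the 1.8 m car width and 800 px focal length are assumptions picked for illustration, and `distance_from_width` is a made-up name:

```python
def distance_from_width(focal_px, real_width_m, apparent_width_px):
    """Pinhole-model distance to an object of known width: Z = f * W / w."""
    if apparent_width_px <= 0:
        raise ValueError("apparent width must be positive")
    return focal_px * real_width_m / apparent_width_px


FOCAL_PX = 800.0       # assumed focal length in pixels
CAR_WIDTH_M = 1.8      # assumed typical car width

# The same car seen at two apparent widths in the image.
close_car = distance_from_width(FOCAL_PX, CAR_WIDTH_M, 144.0)
far_car = distance_from_width(FOCAL_PX, CAR_WIDTH_M, 72.0)
print(f"144 px wide -> {close_car:.1f} m, 72 px wide -> {far_car:.1f} m")
```

Halving the apparent width doubles the estimated distance, and this works with a single eye or camera as long as the assumed real-world size is about right.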
My friend thinks he isn't allowed and that's based on his knowledge. He might as well be wrong.
All those other cues are present in images from cameras as well. The only one I can think of that typically isn't used much for computer vision is focus distance, but for objects far away I don't think that helps us much in object recognition, since all of the objects are in focus anyway.
There are no cues from parallax motion in a static image though.
Not necessarily longer but it would probably have an effect on precision.
This seems to play very well with some of MIT CSAIL's research in training robots to be able to manipulate objects they haven't seen before.
TL;DR the objects are grouped into categories which determine the "Key points" on the objects (similar to this 'skeleton') which the robot knows how to interact with in order to bring about the intended manipulation.
It also seems that people have a tendency to represent things in drawing as either bubbles or stick figures. Even in ancient times, such as the humans from this cave painting (Lascaux, I believe):
Or more recently:
This is because there is a metaphysical reality behind everything and humans instinctively recognize that even from a young age.
this is nothing but theory, and I am willing to bet any studies will be against this theory.