Saturday, July 6, 2019


Shape Bias in CNNs for Better Results, Due to the Wrong Texture Bias by Default

In their introduction, the authors of the paper below explain that it has been a common belief in the CNN community that ImageNet-trained neural networks develop a "shape bias" and store "shape representations". They propose the contrary view, that CNNs are texture-biased, and prove it with experiments:

IMAGENET-TRAINED CNNS ARE BIASED TOWARDS TEXTURE; INCREASING SHAPE BIAS IMPROVES ACCURACY AND ROBUSTNESS

To me, that texture bias has been obvious, and obviously wrong. CNNs recognize texture features and search for correlations between them; otherwise there wouldn't be adversarial hacks where changing a single pixel ruins recognition, they wouldn't need to be trained on so many examples, and they would recognize wireframe drawings/sketches as humans do, etc.
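
As a toy illustration, here is a minimal sketch (my addition; the model, the random input and the pixel position are arbitrary stand-ins) that probes a pretrained CNN's sensitivity to a single-pixel change. The actual one-pixel attack (Su et al.) searches for the pixel and color with differential evolution; this only checks one hand-picked perturbation:

    # Minimal probe of single-pixel sensitivity in a pretrained CNN.
    # Assumes torchvision is installed; the weights download on first run.
    import torch
    import torchvision.models as models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

    x = torch.rand(1, 3, 224, 224)   # random stand-in for a preprocessed image
    with torch.no_grad():
        base = model(x).argmax(dim=1)

    x_adv = x.clone()
    x_adv[0, :, 112, 112] = torch.tensor([1.0, 0.0, 0.0])  # alter one pixel
    with torch.no_grad():
        adv = model(x_adv).argmax(dim=1)

    print("prediction changed:", (base != adv).item())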

The "right" recognition would be robust if the system can do 3D-structure-and-light reconstruction ("reverse graphics"), at best incrementally, see: 


CapsNet, capsules, vision as 3D-reconstruction and re-rendering and mainstream approval of ideas and insights of Boris Kazachenko and Todor Arnaudov, Sunday, December 31, 2017


Colour Optical Illusions are the Effect of the 3D-Reconstruction and Compensation of the Light Source Coordinates and Light Intensity in an Assumed 2D Projection of a 3D Scene, 1.1.2012  

2012, discussions at AGI List:
AGI Digest: Chairs, Caricatures and Object Recognition as 3D Reconstruction


Developmental Approach to Machine Learning, Dec 2018
https://artificial-mind.blogspot.com/2018/12/developmental-approach-to-machine.html


News: Mathematics, Rendering, Art, Drawing, Painting, Visual, Generalizing, Music, Analyzing, Tuesday, September 25, 2012


[Topology, Vector Transformations, Adjacency/Connectedness...]


https://artificial-mind.blogspot.com/2012/09/news-mathematics-rendering-art-drawing.html


"...Vector transformations

In another "unpublished paper" from a few months ago, which will eventually turn into a digest one day (it's a published email discussion), I explained and shared some elegant, fundamental AGI operations/generalizations, which are based on simple visual 3D transformations.

"Everything" is a bunch of vector transformations and the core of the general intelligence are the simplest representations of those "visual" representations, which are really simple/basic/general. 

And "visual" in human terms actually means just:

Something that encompasses features and operations in 1D, 2D, 3D and 4D (video) vector (Euclidean) spaces, where the vectors in these dimensions usually have a dimensionality of up to 4 or 5, e.g. (Luma, R, G, B):

1D - luminance
2D - luminance + uniform 1D color space
3D/4D - luminance + split/"component" color space

+ Perspective projection, which is a vector transform; it can be represented as a multiplication of matrices. That is, the initial sources of visual data are of higher dimensionality than the stored representation: 3D is projected into 2D (a drawback of the way of sensing).

Also, of course, there is topology: humans work fine with blended and deformed images - curved spaces and curves, not just simple linear vectors. However, the topology is induced from the basic vector spaces; the simplest topological representation is just the adjacency of coordinates in a matrix.

The above may seem obvious, but the goal is precisely to make things as explicit as possible...."
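
To make the two quoted claims concrete, here is a minimal sketch (my addition, not from the quoted post): a pinhole perspective projection written as a matrix multiplication that maps a 3D point to 2D, and 4-adjacency of matrix coordinates as the simplest induced topology. The focal length and the sample point are arbitrary:

    import numpy as np

    # (1) Perspective projection as matrix multiplication (pinhole model).
    f = 1.0                                   # arbitrary focal length
    P = np.array([[f, 0, 0, 0],
                  [0, f, 0, 0],
                  [0, 0, 1, 0]])              # 3x4 projection matrix

    X = np.array([2.0, 1.0, 4.0, 1.0])        # 3D point (x, y, z), homogeneous form
    x_h = P @ X                               # homogeneous 2D image point
    x_img = x_h[:2] / x_h[2]                  # divide by depth: (f*x/z, f*y/z)
    print("2D projection:", x_img)            # -> [0.5  0.25]

    # (2) The simplest topology: 4-adjacency of coordinates in an H x W matrix.
    def neighbours4(r, c, H, W):
        """Coordinates directly adjacent to (r, c)."""
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        return [(r + dr, c + dc) for dr, dc in steps
                if 0 <= r + dr < H and 0 <= c + dc < W]

    print(neighbours4(0, 0, 3, 3))            # corner pixel -> [(1, 0), (0, 1)]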

+

Sunday, April 1, 2012


https://artificial-mind.blogspot.com/2012/04/jurgen-schmidhuber-talk-on-ted-creative.html
"Todor:  And it takes many months to get to 3D-vision and to increase resolution and develop 3D-reconstruction in the human brain. That adds ~86400 fold per day and 31,536,000 "cycles" per year.
What computing power is needed?

I don't think you need millions of the most powerful GPUs and CPUs at the moment to beat human vision; we'll beat it pretty soon. In my estimation, a lot of the higher-level intelligence is very low in complexity (behavior, decision making, language at the grammar/vocabulary level) and would need a tiny amount of MIPS, FLOPS and memory. It's the lowest levels that require vast computing power - 3D reconstruction from 2D, from one or many static or moving camera sources, transformations, rotations, trajectory computations, etc. - and those problems are practically being solved and implemented...."
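
For reference, the quoted figures are simply the seconds in a day and in a 365-day year, assuming one "cycle" per second:

    seconds_per_day = 24 * 60 * 60            # 86,400
    seconds_per_year = seconds_per_day * 365  # 31,536,000
    print(seconds_per_day, seconds_per_year)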
