2017-12-10  Dick Grune  <Gebruiker@GEBRUIKER-PC>

	* 8086lang.l: 8086 assembler front end added, courtesy of Yaoyao Lei
	(雷瑶瑶 <lyyfight122@gmail.com>) of the Beijing University Of Posts
	and Telecommunications.

	* Makefile (Generic Language module): the languages module specs for
	the by now 8 languages are tedious and repetitious. Replaced by a
	generic module and a recursive call of 'make'.

2017-12-06  Dick Grune  <dick@dickgrune.com>

	* text.c (init_nl_buff): Removal of nl_buff or pass 2 - Considerations
	For finding similarities we need the input files in token form; for
	displaying similarities we need them in character form. Ideally we
	would like to have both in memory so files need to be read once, but
	for mass comparisons that is infeasible. Also, runs may start in the
	middle of a line, and we want to display whole lines, so to display a
	line we often need characters from before the first token of a run.

	History.
	The original 1986 design condenses the input into a token array (pass
	1), using calls of yylex(). At the same time pass 1 fills an array
	'nl_buff' with the required start-of-line info, if available memory
	permits (this is 1986).
	This costs 1 reading of each file using yylex().

	Compare creates a list of 'runs' from the token array. A run
	identifies identical sequences of tokens by two 'chunks'. Each chunk
	identifies the sequence by a file name and two 'positions' in that
	file; the positions are counted in tokens. The two sequences
	identified by the two chunks are equal when viewed as tokens, but they
	can of course differ widely as characters. 'Compare' also connects all
	chunks referring to the same file in a linked list.

	Pass 2 finds the correct line numbers that go with the token positions
	in the chunks, as follows. For each file F it takes the list L that
	connects the chunks for F, sorts them, and fills in the line numbers
	by following L in parallel with either nl_buff or, if that does not
	exist, F itself.
	This costs 0 (if nl_buff exists) or 1 (if not) reading of each file
	using yylex(). 'nl_buff' is an optimization to prevent another read
	operation using yylex().

	Pass 3 now has the begin and end points of text sequences in terms of
	character positions in the files, and can easily print the similar
	portions of these files.
	This costs N readings of each file using getc(), where N is the number
	of chunks in which the file occurs.

	The present situation.
	In mass comparisons there is no room for nl_buff, but mass comparisons
	usually want percentages rather than the text segments themselves, so
	for percentage measurements the nl_buff mechanism gets turned off.
	This leaves us with the 'normal' users, who supply a few hundred files
	and get a few dozen runs. For these users it does not matter what we
	do on present computers: there is memory and reading speed enough.
	So nobody cares for nl_buff any more: it is dead wood. Question is, do
	we remove it? (Doing so is simple enough).

	But we can go further and remove pass 2 altogether: pass 3 could for each
	chunk in file F first obtain start-of-line info by reading F with
	yylex(), and then use this info to position and read F using getc().
	This would cost N reads with yylex() and N reads with getc(); it
	also would require new code in pass 3.

	In summary, removing nl_buff would cost little effort and make sim 1
	yylex() read slower for each file occurring in a run; and removing
	pass 2 would require substantial coding effort and slow down sim by
	N-1 yylex() reads.

2017-12-05  Dick Grune  <dick@dickgrune.com>

	* stream.c (Open_Stream):  lex_tk_cnt is initially set to 0 but is
	raised before the token is delivered, so effectively lex_tk_cnt starts
	at 1. Also the first token is stored in Token_Array[0].
	Now, in set_pos() in runs.c ps_tk_cnt is set from the parameter
	'start', which via set_chunk is set to 'i0 - txt0->tx_start'. This
	computes the start of the chunk as the distance of its first token to
	the start of the text segment, so for the first token this distance
	is 0. As a result ps_tk_cnt is 1 behind lex_tk_cnt. This requires
	funny code, for example in match_pos_list_of() in pass2.c where the
	code requires >= where > is obviously correct.
	This is harder to correct than the nl problem below, and such places
	in the code are marked with TK_CNT_HORROR.

2017-12-03  Dick Grune  <dick@dickgrune.com>

	* pass2.c (match_pos_list_of): ch_last.ps_nl_cnt was consistently 1 too
	high. This was caused by match_pos_list_of() taking the value of
	lex_nl_cnt only *after* the next EOL had been seen. This was corrected
	inside match_pos_list_of() for ch_last.ps_nl_cnt but not for
	ch_first.ps_nl_cnt. Corrected by tweaks in pass2.c and pass3.c .

2017-11-27  Dick Grune  <dick@dickgrune.com>

	* Since the amount of debugging code in some files made it difficult
	to get a good view of the algorithms, the debugging code was moved
	into *_db.i files, to be included with a simple #include.

	This did not work for db_print_nl_buff(), which is used in pass2.c but
	uses nl_buff, which is local to text.c. To be solved, or the problem
	might go away if and when the nl_buff system is removed.

2017-11-26  Dick Grune  <dick@dickgrune.com>

	* Makefile (ABS_CFS): cleaning up the interface between the *lang.l
	files and the rest of the program resulted in the removal of the
	language.[ch] module, leaving lang.[ch] as the only abstract
	module. See lang.h .

	(PROP_CFS): With these changes all traces of specific Algol
	properties have gone and "algollike.?" has been renamed "properties.?".

2017-03-19  Dick Grune  <dick@dickgrune.com>

	*  c++lang.l contributed by Evin Murphy (evin.murphy@ucdconnect.ie).

2016-08-07  Dick Grune  <dick@dickgrune.com>

	* compare.c (lcs): Finding the text into which i1 points was done by
	linear search. Replaced by binary search. This resulted in no direct
	improvement but should be beneficial for very large numbers of files.
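
	The lookup can be sketched as follows. This is a minimal illustration,
	not the code in compare.c: the struct layout, the field tx_limit and
	the function name text_index_of() are assumptions made for the
	example; only tx_start appears elsewhere in this file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified text descriptor: the tokens of one input
   file occupy positions [tx_start, tx_limit) in the token array. */
struct text {
    size_t tx_start;   /* index of the first token of this file */
    size_t tx_limit;   /* index one past its last token (assumed field) */
};

/* Return the index of the text that contains token position pos, or
   n_texts if no text contains it.  texts[] must be sorted by tx_start
   and the ranges must not overlap. */
size_t text_index_of(const struct text *texts, size_t n_texts, size_t pos)
{
    size_t lo = 0, hi = n_texts;   /* the answer, if any, is in [lo, hi) */

    assert(texts != NULL || n_texts == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pos < texts[mid].tx_start)
            hi = mid;              /* pos lies in an earlier text */
        else if (pos >= texts[mid].tx_limit)
            lo = mid + 1;          /* pos lies in a later text */
        else
            return mid;
    }
    return n_texts;
}
```

	Each lookup takes O(log n) steps instead of O(n), which only starts
	to matter when the number of files is very large.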

2016-08-05  Dick Grune

	* hash.c (make_forward_references_perfect): This routine now
	constructs perfect forward references. This allows the comparison code
	in compare.c (lcs) to skip the first Min_Run_Size tokens when
	comparing two chunks. This yields about 8% speed-up.

2016-08-03  Dick Grune

	* hash.c (make_forward_references_using_hash): A lot of work was being
	done to compute the hash value of a run, using sampling etc. A new
	scheme was implemented in which the hash value consists of all tokens
	in the run, each circularly left-shifted over a distance proportional
	to its position in the run. This value can be computed incrementally
	in constant time, by removing the oldest and adding the newest token
	at each position.
	Care had to be taken not to produce negative values for entities of
	type size_t.
	For moderately large comparisons (~ 10M tokens) this yields an
	over-all speed-up of about 20%.
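
	A minimal sketch of such an incrementally computable hash follows.
	It is an illustration under assumptions, not the code in hash.c: the
	entry does not say how the shifted tokens are combined, so XOR is
	assumed here, as are the 32-bit width and all function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate x left by k bits; k may be any value, including multiples of 32. */
static uint32_t rotl32(uint32_t x, size_t k)
{
    k %= 32;
    return k == 0 ? x : (x << k) | (x >> (32 - k));
}

/* Hash of tk[0..n-1] computed from scratch: the token at offset i ends
   up rotated left by n-1-i bits, and all tokens are combined by XOR. */
uint32_t hash_direct(const uint16_t *tk, size_t n)
{
    uint32_t h = 0;

    for (size_t i = 0; i < n; i++)
        h = rotl32(h, 1) ^ tk[i];
    return h;
}

/* Slide a window of n tokens one position to the right in constant
   time: rotate everything one bit further, cancel the oldest token
   (which would now be rotated over n bits) and add the newest one. */
uint32_t hash_slide(uint32_t h, uint16_t oldest, uint16_t newest, size_t n)
{
    assert(n > 0);
    return rotl32(h, 1) ^ rotl32(oldest, n) ^ newest;
}
```

	Because only rotations and XOR on an unsigned type are used, there is
	no subtraction that could wrap around to a pseudo-negative value.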

2016-05-29  Dick Grune

	* pass3.c: Surprisingly, very large runs of the program can run out of
	memory far into the computation, in spite of the fact that sim
	allocates almost all its memory while reading the input. However, the
	found runs of text are collected using Mallocked memory, for producing
	sorted output. In many cases one would rather have unsorted output
	than no output at all. Decided to implement the -u option, for
	unsorted output, thus not using any additional memory.

2016-05-13  Dick Grune

	* compare.c (lcs): the -X option works. It uses the linear algorithm
	to compare a file F to two segments of the text: the segment from the
	start of the text to the start of F, and from the end of F to the end
	of the text. Runs found are then reported in percentages.

2016-05-06  Dick Grune

	* compare.c: The results of the analysis in
	Similarity_Percentage_Computation.tex have been implemented, and under
	the -p option percentages are now computed consistently. In
	particular, the files in the directory Contributors/Debora_Weber-Wulff/
	yield the same relative percentages independent of the order they are
	entered in; see Pfubar_shuffle and Pslash_shuffle. Note: both parts of
	Pslash_shuffle were shuffled independently.

2016-04-10  Dick Grune

	* runs.c: aiso (arbitrary in sorted out) package replaced by an
	instantiation of the package sortlist.

2014-02-17  Dick Grune

	* sim.c (is_new_old_separator): MinGW sometimes (?) interprets the /
	as a command-line argument as a reference to the MinGW tree, which
	makes the / unusable as a separator. Even escaping it ("/") does not
	help. Added the | as a separator.
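
	A check that accepts both separators could look like the sketch
	below. The signature and return convention are assumptions; the
	actual routine in sim.c may differ.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the command-line argument arg is a separator, 0 otherwise.
   Both "/" and "|" are accepted, so MinGW users can fall back on "|". */
int is_new_old_separator(const char *arg)
{
    assert(arg != NULL);
    return strcmp(arg, "/") == 0 || strcmp(arg, "|") == 0;
}
```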

2014-01-26  Dick Grune

	*  %z from Marcus Brinkmann implemented by a routine size_t2string.
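
	A conversion routine of this kind can be sketched as follows. The
	signature and the use of a static buffer are assumptions made for
	the example, not a description of the actual size_t2string.

```c
#include <assert.h>
#include <stddef.h>

/* Convert n to decimal and return a pointer to the first digit.  The
   result lives in a static buffer and is overwritten by the next call. */
const char *size_t2string(size_t n)
{
    /* Three decimal digits per byte is a safe upper bound; +1 for NUL. */
    static char buf[3 * sizeof(size_t) + 1];
    char *p = &buf[sizeof buf - 1];

    *p = '\0';
    do {                        /* emit the digits right to left */
        assert(p > buf);
        *--p = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);
    return p;
}
```

	The result can then be printed with a plain %s, which avoids %z on
	C libraries that do not support it.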

2014-01-26  Dick Grune

	*  Sizes on my present machine (HP 6730b laptop):
	unsigned short int: 2
	unsigned int: 4
	unsigned long int: 4
	unsigned long long int: 8

2013-05-31  Dick Grune

	*  (hash.c) Better hash2() function.

2013-04-28  Dick Grune

	*  Marcus Brinkmann (marcus.brinkmann@ruhr-uni-bochum.de) supplied a
	64-bit version. %z does not work on MinGW C.

2012-11-28  Dick Grune

	* newargs.c (recursive_args): Liqun Chen (liqun.chen@hp.com)
	submitted a bug report noting that the separator / is expanded under
	the -R option. Corrected.

2012-09-30  Dick Grune

	* pass2.c (pass2_txt): Boyd Blackwell (Boyd.Blackwell@anu.edu.au)
	submitted a bug report in which the line numbers (and runs
	representations) were way off (75 lines). The input files were
	characterized by extremely long lines, hundreds of tokens (max. 521).

	After 2.5 days of debugging the cause was found: 1. the mapping
	from token positions to line numbers is stored as the difference of
	the token positions from one line to the next (see text.c); 2. these
	differences are stored in unsigned chars to save space; 3. the
	nl_buff mechanism is switched off when one of these unsigned chars
	overflows; and 4. since 521 tokens on one line overflowed this
	unsigned char, the nl_buff mechanism was shut off.

	Next, 1. when there is no nl_buff information in pass2, pass2 resorts
	to rereading the input file, calling yylex again; 2. since the
	preceding file had few runs to find line numbers for, the preceding
	file was not read to the end, and the rest remained in flex's buffer,
	so a portion of the preceding file seemed prefixed to the present
	file, adding 75 lines to it.

	Remedy: flushing flex's buffer explicitly in pass2_txt(); this is
	simpler than using flex's YY_BUFFER_STATE mechanism.

	Advice: get rid of the nl_buff mechanism; it is no longer relevant.
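
	The overflow that triggered this can be illustrated with the sketch
	below. It is a simplified model, not the code in text.c: the struct,
	its fields and nl_record() are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified newline buffer: one entry per input line,
   holding the number of tokens on that line. */
struct nl_info {
    unsigned char *tk_diff;  /* caller-supplied storage */
    size_t size;             /* capacity of tk_diff[] */
    size_t used;             /* number of lines recorded so far */
    int enabled;             /* 0 once the mechanism has been abandoned */
};

/* Record a line holding n_tokens tokens.  Returns 1 on success; returns
   0 and switches the buffer off for good if the count does not fit in
   an unsigned char or if the storage is full. */
int nl_record(struct nl_info *nl, size_t n_tokens)
{
    assert(nl != NULL);
    if (!nl->enabled)
        return 0;
    if (n_tokens > UCHAR_MAX || nl->used == nl->size) {
        nl->enabled = 0;     /* caller must fall back to rereading the file */
        return 0;
    }
    nl->tk_diff[nl->used++] = (unsigned char)n_tokens;
    return 1;
}
```

	With UCHAR_MAX equal to 255, a line of 521 tokens cannot be recorded,
	so the whole buffer is abandoned and pass 2 takes the rereading path.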

2012-06-09  Dick Grune

	* lang.h:
	The *lang.l files are unusual in two respects:
	1. they present two interfaces to the rest of the system:
	language.[ch], static data about the language, and lang.[ch], dynamic
	data about the input file's content;
	2. both interfaces come with multiple implementations, one for each
	*lang.l file; i.e., they are abstract.
	This has been sorted out with some difficulty.

2012-05-08  Dick Grune

	* Changed to 16-bit tokens, for better resolution for sim_text and
	on -F option, and for UTF-8 input.

	It was not worth while to save the 8-bit token code: on serious
	comparisons the increase in memory usage is about 10% (330 000 on a
	maximum allocation of 3 030 976 for comparing the sources of MCD2).

2009-03-11  Dick Grune  <dick@flits.few.vu.nl>

	* newargs.c: added -R option to follow directories recursively.
	  See recursive_args().

2008-09-22    <Dick@ACER>
	* added newargs.[ch], to supply file names from standard input,
	  for those compilers that do not have the @ facility. Implemented
	  without fixed limits.

2008-09-21    <Dick@ACER>
	* changed default format back to original, and inverted the
	  -v(erbose) option into a -T(erse) option.

2008-03-31  Dick Grune  <dick@flits.few.vu.nl>
	* *.l: the following are not universally recognized; removed.
		%option nounput
		%option never-interactive

2008-03-31    <Dick@ACER>
	Introduced aiso.* and Malloc.? as imported modules.

2007-11-21  Carlos Maziero <maziero@ppgia.pucpr.br>
	- output format modified in order to facilitate "grep" filtering
	- added option "-v" for a more verbose output
	- added option "-tN" to define a threshold %N (only similarities
	  over N% are shown)
	- fixed SEGV on writing to the output file
	- the file list can be informed through STDIN (one file per line,
	  accepts "/" marker); this is useful for compilers that lack the
	  @ facility

2007-08-23  Dick Grune  <dick@hydra.cs.vu.nl>
	LICENSE.txt added.

2006-11-27  Dick Grune  <dick@hydra.cs.vu.nl>
	Removal of setbuff() for compatibility.

2005-01-17  Dick Grune  <dick@blade014.cs.vu.nl>
	Corrections by Jerry James <james@eecs.ku.edu>; ANSIizing, etc.

2004-08-05  Dick Grune  <dick@blade014.cs.vu.nl>
	Finished the 'percentage' option.

08-Nov-2001	Dick Grune
	Begun to add a 'percentage' option, which will express the
	similarity between two files in percents.

27-Sep-2001	Dick Grune
	Split add_run() off from compare.c into add_run.c, to accommodate
	different add_run()s, for different types of processing.

27-Nov-1998	Dick Grune
	Installed a Miranda version supplied by Emma Norling (ejn@cs.mu.oz.au)

23-Feb-1998	Dick Grune
	Renamed text.l to textlang.l for uniformity and to make room for
	a possible module text.[ch].

	Isolated a module for handling the token array from buff.[ch] to
	tokenarray.[ch], and renamed buff.[ch] to text.[ch].

23-Feb-1998	Dick Grune
	There is probably not much point in abandoning the nl_buff list
	when running out of memory for TokenArray[]: each token costs 1
	byte for the token and 4 bytes for the entry in
	forward_references[], a total of 5 bytes.  There are about 3
	tokens to a line, together requiring 15 bytes, plus 1 byte in
	nl_buff yields 16 bytes.  So releasing nl_buff frees only 1/16 =
	6.25 % of memory.

	Since the code is a bother, I removed it.  Note that nl_buff is
	still abandoned when the number of tokens in a line does not fit
	in one unsigned char (but that is not very likely to happen).


21-Feb-1998	Dick Grune
	Printing got into an infinite loop when the last line of the
	input was not terminated by a newline AND contained tokens that
	were included in a matching run.
	This was due to a double bug: 1. the non-terminated line was not
	registered properly in NextTextTokenObtained() / CloseText(),
	and 2. the loop in pass 2 which sets the values of
	pos->ps_nl_cnt was terminated prematurely when the file turned
	out to be shorter than the list of pos-es indicated.
	Both bugs were corrected, the first by supplying an extra
	newline in CloseText() when one is found missing, and the second
	by rewriting the list-parallel loop in pass 2.

02-Feb-1998	Dick Grune
	Pascal does not differentiate between strings and characters
	(strings of one character); this difference has been removed
	from pascallang.l.

22-Jan-1998	Dick Grune
	Detection of non-ASCII characters added.  Since the lexical
	analyser itself generates non-ASCII characters, the test must occur
	earlier.  We could replace the input routine of lex by a
	checking routine, but with several lex-es going around, we want
	a more lex-independent solution.  To allow each language its own
	restrictions about non-ASCII characters, the check is
	implemented in the *lang.l files.

28-Nov-1997	Dick Grune
	Changed the name of the C similarity tester 'sim' to 'sim_c', for
	uniformity with sim_java, etc.

23-Nov-1997	Dick Grune
	Java version finished; checked by Matty Huntjens and crew.

24-Jun-1997	Dick Grune
	Started on a Java version, by copying the C version.

22-Jun-1997	Dick Grune
	Modern lexical analysers, among which flex, read the entire input into
	a buffer before they issue the first token.  As a result, ftell() no
	longer gives a usable indication of the position of a token in a file.
	This pulls the rug from under the nl_buff mechanism in buff.c, which
	is removed.  We lose a valuable optimization this way, but there just
	seems to be no way to keep it.

	Note that this has nothing to do with the problem in MS-DOS of
	character count and fseek position not being synchronized.  That
	problem was solved on June 14, 1991 (which see) and the code has
	been running OK since.

18-Jun-1997	Dick Grune
	The thought has occurred to use McCreight's linear longest common
	substring algorithm rather than the existing algorithm, which has a
	small quadratic component.  There are a couple of problems with this:
	1.	We need the longest >non-overlapping< common substring;
		McCreight provides just the longest.  It is not at all clear
		how to modify the algorithm.
	2.	Once we have found our LCS, we want to find the
		next-longest; it is far from obvious how to do that in
		McCreight's algorithm.
	3.	Once we have found our LCS, we want to take one of its
		copies out of the game, to suppress duplicate messages.
		Again, it is difficult to see how to do that, without
		redoing all the calculations.
	4.	McCreight's algorithm seems to require about two binary
		tree nodes per token, say 8 bytes, which is double what we
		use now.

17-Jun-1997	Dick Grune
	Did some experimenting with the hash function; it is still
	pretty bad: the simple-minded second sweep through
	forward_references easily removes another 80-99% of false hits.
	Next, a third sweep that does a full comparison will remove another
	large percentage.

	So I have left in the second sweep in all cases.

	There are a couple of questions here:
	1. Can we find a better hash function, or will we forever need a
		second sweep?
	2. Does it actually matter, or will we lose on more expensive
		hashing what we gain by having a better set of forward
		references in compare.c?


16-Jun-1997	Dick Grune
	Cleaned up sim.h and renamed aiso.[ch] to runs.[ch] since they
	are instantiations of the aiso module concerned with runs.
	Aiso.[spc|bdy] stays aiso.[spc|bdy], of course.

16-Jun-1997	Dick Grune
	Redid largest_function() in algollike.c.
	Corrected bug in CheckRun; it now always removes NonFinals from
	the end, even when it has first applied largest_function().

15-Jun-1997	Dick Grune
	Reorganized the layers around the input file.  There were and
	still are three layers: lang, stream and buff.

	Since the lex_X variables are hoisted unchanged through the levels
	lang, stream, and buff, to be used by pass1, pass2, etc., they
	have to be placed in a module of their own.

	The token-providing module 'lang' has three interfaces:
	-	lang.h, which provides access to the lowest-level token
			routines, to be used by the next level.
	-	lex.h, which provides the lex variables, to be used by
			all and sundry.
	-	language.h, which provides language-specific info about
			tokens, concerning their suitability as initial
			and final tokens, to be used by higher levels.

	This structure is not satisfactory, but it is also unreasonable
	to combine them in one interface.

	There is no single lang.c; rather it is represented by the
	various Xlang.c files generated from the Xlang.l files.

14-Jun-1997	Dick Grune
	Added a Makefile zip entry to parallel the shar entry.

13-Jun-1997	Dick Grune
	A number of simplifications, in view of better software and bigger
	machines:
	-	Removed good_realloc from hash.c; I don't think there are
		any bad reallocs left.
	-	Removed the option to run without forward_references.
		On a 16Mb machine this means you have at least 2M tokens;
		using a quadratic algorithm will take 4*10^6 sec. at an
		impossible rate of 1M actions/sec., which is some 50 days.
		Forget it.
	-	Renamed lang() to print_stream(), and incorporated it in sim.c
	-	Removed the MSDOS subdirectory mechanism in the Makefile.
	-	Removed the funny and sneaky double parameter expansion in
		the call of idf_in_list().

12-Jun-1997	Dick Grune
	Converted to ANSI C.  Removed cport.h.

09-Jan-1995	Dick Grune
	Decided not to do directories: they usually contain extraneous
	files and doing sim * is simple enough anyway.

09-Sep-1994	Dick Grune
	Added system.h to cater for the (few) differences between Unix and DOS.
	The #define int32 is also supplied there.

05-Sep-1994	Dick Grune
	Added many prototype declarations using cport.h.
	Added a depend entry to the Makefile.

31-Aug-1994	Dick Grune
	All these changes require a 32 bit integer; introduced a #define
	int32, set from the command line in the Makefile.

25-Aug-1994	Dick Grune
	It turned out that one of the most often called routines was .rem,
	from idf_hashed() in idf.c.  Moving the % out of the loop chafed off
	another 6% and reduced the time to 18.4 sec.

19-Aug-1994	Dick Grune
	With very large files (e.g., concatenated /usr/man/man1/*) the fixed
	built-in hash table size of 10639 is no longer satisfactory.  Hash.c
	now finds a prime about 8 times smaller than the text_size to use
	for hash table size; this achieves optimal speed-up without gobbling
	up too much memory.  Reduced the time for the above file from 30.2
	sec. to 19.6 sec.
	For checking, the same test was run with all hashing off; it took
	20h 27m 19s = 73639 sec.  But it worked.
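
	Choosing such a table size can be sketched as follows. The function
	names, the trial-division test and the floor of 3 for tiny inputs
	are assumptions; Hash.c may well pick its prime differently.

```c
#include <assert.h>
#include <stddef.h>

/* Trial division; adequate for a one-off computation at start-up. */
static int is_prime(size_t n)
{
    if (n < 2)
        return 0;
    for (size_t d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Return the largest prime not exceeding text_size / 8, with an
   assumed minimum of 3 for very small inputs. */
size_t hash_table_size(size_t text_size)
{
    size_t n = text_size / 8;

    if (n < 3)
        return 3;
    while (!is_prime(n))     /* terminates: 3 is prime */
        n--;
    assert(is_prime(n));
    return n;
}
```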

11-Aug-1994	Dick Grune
	For large values of MinRunSize (>1000) a large part of the time
	(>two-thirds) was spent in calculating the hash values for each
	position in the input, since the cost of this calculation was
	proportional to MinRunSize.  We now sample a maximum of 24 tokens
	from the input string to calculate the hash value, and avoid
	overflow.  On my workstation, this reduces the time for
		sim_text -r 1000 -n /usr/man/man1/*
	from 60 sec to 21 sec.
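
	Sampling can be sketched as below. Only the limit of 24 comes from
	this entry; the even spacing of the samples, the shift-and-add
	formula and the names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES 24

/* Hash the min_run_size tokens starting at tk[0], looking at no more
   than MAX_SAMPLES of them, spread evenly over the run.  With 8-bit
   tokens and at most 24 one-bit shifts the value stays below 2^32,
   so the accumulator cannot overflow. */
unsigned long hash_sampled(const unsigned char *tk, size_t min_run_size)
{
    size_t n_samples =
        min_run_size < MAX_SAMPLES ? min_run_size : MAX_SAMPLES;
    unsigned long h = 0;

    assert(tk != NULL || min_run_size == 0);
    for (size_t i = 0; i < n_samples; i++) {
        /* sample i is taken at a position in [0, min_run_size) */
        size_t pos = i * min_run_size / n_samples;

        h = (h << 1) + tk[pos];
    }
    return h;
}
```

	The cost per position is now bounded by a constant instead of being
	proportional to MinRunSize, at the price of more false matches for
	runs that differ only in unsampled tokens.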

30-Jun-1992	Dick Grune, room R4.40, tel. 5778
	There was an amazing bug in buff.c where NextTextToken() for pass 2
	omitted to set lex_token to EOL when retrieving newline info from
	nl_buff. Worked until now!?!

23-Sep-1991	Dick Grune
	Cport.h introduced, CONST and *.spc only.

17-Sep-1991	Dick Grune
	The position-sorting routine in pass2.c has been made into a
	separate generic module.

14-Jun-1991	Dick Grune (dick@cs.vu.nl) at dick.cs.vu.nl
	Replaced the determination of the input position through counting
	input characters by calls of ftell(); this is cleaner and the other
	method will never work on MSDOS.

30-May-1989	Dick Grune (dick) at dick
	Replaced the old top-100 module (which had been extended to top-10000
	already anyway) by the new aiso (arbitrary-in sorted-out) module.
	This caused a considerable speed-up on the Mod2 test bed:
		 %time  cumsecs  #call  ms/call  name
		  17.9    99.20   7209    13.76  _InsertTop
		   0.3     1.37   7209     0.19  _InsertAiso
	It turns out that malloc() is not a serious problem, so no special
	version for the aiso module is required.

23-May-1989	Dick Grune (dick) at dick
	No more uncommented comment at the end of preprocessor lines, to
	conform to ANSI C.

23-May-1989	Dick Grune (dick) at dick
	Added code in the X.l files to (silently) reject characters over 0200.
	This does not really help, since lex stops on null chars. Ah, well.

19-May-1989	Dick Grune (dick) at dick
	Made the token as handled by sim into an abstract data type, for
	aesthetic reasons. Sign extension is still a problem.

03-May-1989	Dick Grune (dick) at dick
	Optimized lcs() by first checking from the end if a sufficiently long
	run is present; if in fact only the first 12 tokens match, chances
	are good that you can reject the run right away by first testing
	the 20th token, then the 19th, and so on.
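
	The backward test can be sketched as follows; the function name and
	the 8-bit token type are assumptions, and the real lcs() does more
	than this.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the first min_run_size tokens of a[] and b[] are all
   equal, 0 otherwise.  The comparison runs from the last token down to
   the first, so two sequences that share only a short common prefix
   are rejected after very few comparisons. */
int run_is_long_enough(const unsigned char *a, const unsigned char *b,
                       size_t min_run_size)
{
    size_t i = min_run_size;

    assert(a != NULL && b != NULL);
    while (i > 0) {
        i--;
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```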

21-Apr-1989	Dick Grune (dick) at dick
	A run of sim_m2 finding 7209 similarities raised the question of
	the appropriateness of the linear sort in sort_pos(). Profiling
	showed that in this case sorting takes all of 7.5 % of the total
	time. Putting the word register in the right places in
	sort_pos() lowered this number to 4.6%.

20-Apr-1989	Dick Grune (dick) at dick
	Moved the test for MayBeStartOfRun() from compare.c (where it is
	done again and again) to hash.c, where its effect is incorporated in
	the forward reference chain.

14-Apr-1989	Dick Grune (dick) at dick
	Replaced elem_of() by bit tables, headers[] and trailers[], to be
	prefilled from Headers[] and Trailers[] by a call of
	InitLanguage(). This saves a few percents.

13-Apr-1989	Dick Grune (dick) at dick
	Implemented the -e and the -S option, by putting yet another loop
	in compare.c

13-Apr-1989	Dick Grune (dick) at dick
	The -- option (displaying the tokens) will now handle more than one
	file.

20-Jan-1989	Dick Grune (dick) at dick
	After the modification of 19-Dec-88, 12% of the time went into
	updating the positions in the chunks, as they were produced by the
	matching process. This matching process identifies runs (matches)
	by token position, which has to be recalculated to lseek positions
	and line numbers. To this end the files are read again, and for
	each line all positions found were checked to see if they applied
	to this line; this was an awfully stupid algorithm, but since much
	more time was spent elsewhere, it did not really matter. With all
	the saving below, however, it had risen to second position, after
	yylook() with 35%.

	The solution was to sort the positions in the same order in which
	they would be met by the reading of the files. The process is then
	linear. This required some extensive hacking in pass2.c
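
	The linear process can be sketched as below. This is a simplified
	model, not the code in pass2.c: struct pos, its fields and the
	per-line token counts are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified position: a 0-based token index within one
   file, and the 1-based line number to be filled in. */
struct pos {
    size_t tk_cnt;
    size_t nl_cnt;
};

/* Fill in the line numbers of pos[0..n_pos-1], which must be sorted by
   tk_cnt.  tokens_on_line[l] is the number of tokens on line l of the
   file.  The positions and the lines are walked in parallel, so the
   cost is O(n_lines + n_pos).  Returns the number of positions filled
   in, which is less than n_pos if the file turns out to be too short. */
size_t fill_line_numbers(struct pos *pos, size_t n_pos,
                         const size_t *tokens_on_line, size_t n_lines)
{
    size_t p = 0;       /* next position to be resolved */
    size_t seen = 0;    /* number of tokens on the lines handled so far */

    assert(pos != NULL || n_pos == 0);
    for (size_t line = 0; line < n_lines && p < n_pos; line++) {
        size_t limit = seen + tokens_on_line[line];

        while (p < n_pos && pos[p].tk_cnt < limit) {
            pos[p].nl_cnt = line + 1;
            p++;
        }
        seen = limit;
    }
    return p;
}
```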

06-Jan-1989	Dick Grune (dick) at dick
	The modification below did indeed save 25%. The newline information
	is now reduced to 2 shorts; 2 chars were not enough, since some
	lines are longer than 127 bytes, and a char and a short together
	take as much room as two shorts.

19-Dec-1988	Dick Grune (dick) at dick
	To avoid reading the files twice (which is still taking 25% of the
	time), the first pass will now collect newline information for the
	second pass in a buffer called nl_buff[].  This buffer, and the
	original token buffer now named TokenArray[], are managed by the file
	buff.c, which implements a layer between stream.h and pass?.c. This
	layer provides OpenText(), NextTextToken() and CloseText(), each
	with a parameter telling which pass it is.

06-Dec-1988	Dick Grune (dick) at dick
	As an introduction to removing the second pass altogether, the
	first and second scan were unified, i.e., their input is identical.
	This also means that the call sim -[12] has now been replaced by
	one call:  sim --.

23-Sep-1988	Dick Grune (dick) at dick
	Dynamic allocation of line buffers in pass 3.  This removes the
	restriction on the page width.

22-Sep-1988	Dick Grune (dick) at dick
	In order to give better messages on incorrect calls to sim, the
	whole option handling has been concentrated in a file option.c and
	separated from the options and their messages themselves. See sim.c

07-Sep-1988	Dick Grune (dick) at dick
	For long text sequences (say hundreds of thousands of tokens),
	the hashing is not really efficient any more since too many
	spurious matches occur.  Therefore, the forward reference table is
	scanned a second time, eliminating from any chain all references to
	runs that do not end in the same token.  For the UNIX manuals this
	reduced the number of matches from 91.9% to 1.9% (of which 0.06%
	were genuine).
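
	The second scan can be sketched as follows. The representation is
	an assumption: fwd[i] holds the next position with the same hash
	value as position i, with 0 meaning "none", and references exist
	only for positions at which a full run of min_run_size tokens fits.

```c
#include <assert.h>
#include <stddef.h>

/* For each position i, advance its forward reference along the chain
   until it reaches a position whose run ends in the same token as the
   run at i, or until the chain ends.  Positions are handled in
   increasing order, so the chain followed from i is still unmodified. */
void prune_forward_references(const unsigned char *tk, size_t n_tokens,
                              size_t *fwd, size_t min_run_size)
{
    assert(min_run_size > 0);
    for (size_t i = 0; i + min_run_size <= n_tokens; i++) {
        unsigned char last = tk[i + min_run_size - 1];
        size_t j = fwd[i];

        while (j != 0 && tk[j + min_run_size - 1] != last)
            j = fwd[j];
        fwd[i] = j;
    }
}
```

	The test costs one token comparison per reference followed, which
	is cheap compared to starting a full comparison on a spurious match.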

30-Aug-1988	Dick Grune (dick) at dick
	For compatibility, NextTop has been rewritten to yield true or
	false and to accept a pointer to a run as a parameter.

30-Aug-1988	Dick Grune (dick) at dick
	When trying to find line-number and lseek position to beginnings
	and ends of runs found, the whole set of runs was scanned for each
	line in each file.  Now only the runs belonging to that file are
	scanned; to this end another linked list has been braided through
	the data structures (tx_chunk).

30-Aug-1988	Dick Grune (dick) at dick
	The longest-common-substring algorithm was called much too often,
	mainly because the forward references made by hashing suffered from
	pollution.  If you have say 1000 tokens and a hash range of say
	10000, about 5 % of the hashings will be false matches, i.e. 50
	matches, which is quite a lot on a natural number of 2 to 3 matches.
	Improved by doing a second check in make_forw_ref().

12-Jun-1988	Dick Grune (dick) at dick
	Installed a Lisp version supplied by Gertjan Akkerman.

15-Jan-1988	Dick Grune (dick) at dick
	Added register declarations all over the place.

14-Jan-1988	Dick Grune (dick) at dick
	It is often useful to match a piece of code exactly, especially
	when function names (or, even more so, macro names) are involved.
	What one would want is having all the letters in the text array,
	but this is kind of hard, since each entry is one lexical item.
	This means that under the -F option each letter is a lex item, and
	normally each tag is a lex item; this requires two lex grammars in
	one program; no good.  So, on the -F flag we hash the identifier
	into one lex item, which is hopefully characteristic enough.  It
	works.

30-Sep-1987	Dick Grune (dick) at dick
	Some cosmetics.

31-Aug-1987	Dick Grune (dick) at dick
	Moved the whole thing to the SUN (while testing on a VAX and a
	MC68000)

16-Aug-1987	Dick Grune (dick) at dick
	The test program lang.c is no longer a main program, but rather a
	subroutine called in main() in sim.c, through the command line
	option -1 or -2.

23-Apr-1987	Dick Grune (dick) at tjalk
	Changed the name 'index' into 'elem_of', because of compatibility
	problems on different Unices. Added a declaration for it in
	the file algollike.c

10-Mar-1987	Dick Grune (dick) at tjalk
	Changed the printing of the header of a run so that:
	-	long file names will no longer be truncated
	-	the run length is displayed

27-Jan-1987	Dick Grune (dick) at tjalk
	Switched it right off again!  Getting them in textual order is
	still more unpleasant, since now you cannot find the important
	ones if there are more than a few runs.

27-Jan-1987	Dick Grune (dick) at tjalk
	Going to experiment with leaving out the sorting; just all the
	runs, in the order we meet them.  Should be as good or better.
	Comparisons of more than 100 runs are very rare anyway, so the
	fact that those over 100 are rejected is probably no great
	help.  Getting them in a funny order is a nuisance, however.  Down
	with featurism.  Just to be safe, present version saved as
	870127.SV

26-Dec-1986	Dick Grune (dick) at tjalk
	Names of overall parameters in params.h changed to more uniformity.

26-Dec-1986	Dick Grune (dick) at tjalk
	Since the top package and the instantiation system have grown
	apart so much, I have integrated the old top package into sim,
	i.e., done the instantiation by hand.  This removes top.g and
	top.p, and will save outsiders from wondering what is going on
	here.

23-Dec-1986	Dick Grune (dick) at tjalk
	Use setbuf to print unbuffered while reading the files (lex core
	dumps, other mishaps) and print buffered while printing the real
	output (for speed).

30-Nov-1986	Dick Grune (dick) at tjalk
	Various small changes in *lang.l:
		; ignored conditionally (!options['f'])
		new format for tokens in struct idf
		cosmetics: macro Layout, macro UnsafeComChar, no \n
			in character denotations, more than one char
			in a char denotation in Pascal, etc.

30-Nov-1986	Dick Grune (dick) at tjalk
	Added a Modula-2 version.

29-Nov-1986	Dick Grune (dick) at tjalk
	Restricting tokens to the ASCII95 character set is really too
	severe: some languages have many more reserved words (COBOL!).
	Corrected this by adding a couple of '&0377' in strategic places.
	Added a routine for printing the 8-bit beasties: show_token().

15-Aug-1986	Dick Grune (dick) at tjalk
	Since the ; is superfluous in both C and Pascal, it is now ignored
	by clang.l and pascallang.l

15-Aug-1986	Dick Grune (dick) at tjalk
	The code in CheckRun in Xlang.l was incorrect in that it used the
	wrong criterion for throwing away trailing garbage. I've taken
	CheckRun etc. out of the Xlang.l-s and turned them into a module
	"algollike.c".  Made a cleaner interface and avoided duplication of
	code.

02-Jul-1986	Dick Grune (dick) at tjalk
	Looking backwards in compare.c to see if we are in the middle of a
	run is an atavism. You can be and still be all right, e.g., if
	part of the run was rejected as not fitting for a function.
	Removed from compare.c.

10-Jun-1986	Dick Grune (dick) at tjalk
	The function hash_code() in hash.c could yield a negative value;
	corrected.

09-Jun-1986	Dick Grune (dick) at tjalk
	Changed the name of the file text.h to sim.h.  Sim.h is more
	appropriate and text.h sounds as if it belongs to text.l, with
	which it has no connection.

04-Jun-1986	Dick Grune (dick) at tjalk
	After having looked at a couple of hash functions and having done
	some calculations on the number of duplicates normally encountered
	in hash functions, I conclude that our function in hash.c is quite
	good.  Removed all the statistics-gathering stuff.

	Actually, hash_table[] is not the hash table at all; it is a
	forward reference table; likewise, the real hash table was called
	last[].  Renamed both.

	There is a way to keep the hash table local without putting it on
	the stack: use malloc().

02-Jun-1986	Dick Grune (dick) at tjalk
	Added a simple lex file for text: each word is condensed into a
	hash code which is mapped on the ASCII95 character set.  This
	turns out to be quite effective.

01-Jun-1986	Dick Grune (dick) at tjalk
	The macros cput(tk) and c_eol() both have a return in them, so any
	code after them may not be executed -> they have to be last in an
	entry.  But they weren't, in many places; I can't imagine why it
	all worked nevertheless.  They have been renamed return_tk(tk) and
	return_eol() and the entries have been restructured.

30-May-1986	Dick Grune (dick) at tjalk
	Moved the string and character entries in clang.l and pascallang.l
	to a place behind the comment entries, to avoid strings (and
	characters) being recognized inside comments.  I first thought
	this would not happen, but as Maarten pointed out, if both
	interpretations have the same length, lex will take the first
	entry. Now this will happen if the string occupies the whole line
	that would otherwise be taken as a comment.  In short,
	/*
	"hallo"
	*/
	would return ".

28-May-1986	Dick Grune (dick) at tjalk
	Added -d option, to display the output in diff(1) format (courtesy
	of Maarten van der Meulen).
	Rewrote the lexical parsing of comments (likewise courtesy Maarten
	van der Meulen).

20-May-1986	Dick Grune (dick) at tjalk
	Added a routine to convert identifiers to lower case in
	pascallang.l .

19-May-1986	Dick Grune (dick) at tjalk
	Added -a option, to quickly check antecedent of a file (courtesy
	of Maarten van der Meulen).

18-May-1986	Dick Grune (dick) at tjalk
	Brought everything under RCS/CVS.

18-Mar-1986	Dick Grune (dick) at tjalk
	Added modifications by Paul Bame (hp-lsd!paul@hp-labs) to have an
	option -w to set the page width.

21-Feb-1986	Dick Grune (dick) at tjalk
	Took array last[N_HASH] out of make_hash() in hash.c, due to stack
	overflow on the Gould (reported by George Walker
	tekig4!georgew@mcvax.uucp)

16-Feb-1986	Dick Grune (dick) at tjalk
	Corrected some subtractions that caused unsigned ints to turn
	pseudo-negative. (Reported by jaap@mcvax)

11-Jan-1986	Dick Grune (dick) at tjalk
	Touched up for distribution.

10-Jan-1986	Dick Grune (dick) at tjalk
	Fill_line was not called for empty lines, which caused them to be
	printed as repetitions of the previous line.

24-Dec-1985	Dick Grune (dick) at tjalk
	Reduced hash table to a single array of indices; it is used only
	in one place, which makes it very easy to make it (the hash table)
	optional.  General tune-up of everything.  This seems to be
	another stable "final" version.

14-Dec-1985	Dick Grune (dick) at tjalk
	Some experiments with hash formulas:
	h = (h OP CST) + *p++ OP CST yields	right	wrong
		* 96		- 32		205	562
		* 96		- 2		205	560
		* 96				205	560
		* 97				205	559
		<< 0				 66	3128
		<< 1				203	555
		<< 2				205	536
		<< 7				203	540
	Conclusion: it doesn't matter, unless you do it wrong.

01-Oct-1983	Dick Grune (dick) at vu44
	Oldest known files.

#	This file is part of the software similarity tester SIM.
#	Written by Dick Grune, Vrije Universiteit, Amsterdam.
#	$Id: ChangeLog,v 2.41 2017-12-10 16:39:57 dick Exp $
#
