A Quran for finding duplicates in Google Spreadsheet

Good day, dear readers.

Do you remember Dolly the sheep? While working with Google Docs, and especially with spreadsheets (the counterpart of MS Excel), I had to learn JavaScript to solve many unusual tasks such as special formatting, multiple conditions, or removing duplicates, because not everything can be done with the standard functions, and formulas have limitations that we will cover in the next issue. Today we present a primer on finding duplicate links and text in a Google Spreadsheet.


So let's start writing scripts for Google Spreadsheet. If you know several programming languages, even superficially, it will of course be easier for you. I started writing JavaScript without problems, although I had mostly worked with VBA, Visual Basic, C/C++, and C#. As they say, a little bit of everything. If your programming skills are shaky, welcome to our primer.



Introduction

To give you a clear example, we will study scripting by removing duplicates from a table in a Google Docs spreadsheet. We will write everything from scratch, so let's proceed.

Setting the task
The task: find the duplicates in a list of text entries and remove them, after first creating a backup copy of the list.
Create a new document and call it “Search for duplicates”; give the sheet the same name. We will look for duplicates in the first column. To make things easier, write the column names in the first row and freeze it. To freeze a row or column, drag the thick line until it moves down by one row or right by one column (Figure 1).

Figure 1. Freeze columns and rows.

Since we usually need several columns, and out of habit we usually delete everything unneeded, we add them back (columns B:D). If you have a new sheet, the columns are already there and need not be added, but deleting everything after them is welcome. Cleanliness is the key to order.

Create two more sheets: “working page”, where we will manipulate the data, and “Copy”, which holds a copy of the initial data in case the script misbehaves or data is lost (Figure 2).


Figure 2. Create 2 sheets.

Let's start writing code. In the top menu, choose Tools -> Script editor (Figure 3).

Figure 3. Open the script editor in Google Spreadsheet.

The script editor opens in a new window and offers options for creating a script (Figure 4).

Figure 4. The options for creating scripts in the script Editor.

If you select any of the items on the left (a script for Drive, Mail, and so on), you will see an introductory script with detailed explanations of how to work with that kind of script, but these are only demonstration scripts. We will choose “Empty project” and see just the beginning of every script:
function myFunction(){}

The default project name is “Untitled project”. I changed it by clicking on those words (the top line in the figure) and called it “Finding duplicates” (Figure 5). This way, when you have 10 or more scripts, you can tell them apart and find the one you need without effort.

Figure 5. “Empty project” with the standard first lines of code and the changed name.

We will write our script inside the curly brackets.

We will not go deep into JavaScript, but so that even a beginner can follow, I will comment in detail on what is happening as we write the code.
Before you start writing any script (task), draw up an algorithm of actions for yourself. To many this may seem like nonsense and a waste of time, but it is a very important stage of the work. With large projects it matters even more, because you can get lost even among three pine trees, and an algorithm solves that problem. This applies not only to JavaScript but to absolutely any language, and indeed to any activity. To illustrate, here is an example based on making a cup of tea (Figure 6).


Figure 6. The algorithm for making a cup of tea (example).

I hope you have brewed a cup :) and are reading on in a good mood.

Let's draw the algorithm for our duplicate-finding script (Figure 7); it is more detailed than the tea example.

Figure 7. The algorithm for the task of finding duplicates

A brief explanation: this is the final form of the algorithm, so it contains 2 blocks that may not be immediately clear to the average user.
Block 2 removes the content and comments from the pages, even though the pages have just been created and contain nothing. This block is needed when the script is reused, which is why we build it in. More about this below.
Block 5 creates and converts the array; this is discussed in detail below. For now I will only say that this array is the key element of the data processing.
Coding
Let's declare variables for our sheets so the program knows what to work with.

To declare a variable in the current scope, use the keyword var. A feature of JavaScript, shared with many other languages such as PHP, is that you do not need to declare the variable's type (unlike C#, where you declare a number as int, a string as string, and so on).
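To see what the absence of declared types means in practice, here is a minimal sketch in plain JavaScript. The variable names are made up for illustration and are not part of our script.

```javascript
// The same variable can hold values of different types over its lifetime.
var value = 42;
var typeBefore = typeof value;   // "number"

value = 'forty-two';             // no re-declaration or cast is needed
var typeAfter = typeof value;    // "string"
```
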
To describe lines of code inside the code itself, we will use comments: text that begins with two forward slashes “//”. In different programming languages the comment markers look different and consist of some sequence of symbols; here are some of them (see Table 1).


Table 1. Comparative table of comment tags in some languages

When you write code in an editor that supports syntax highlighting, for example Notepad++, the program text is highlighted in colors set for the chosen language. Comments, for example, are green by default for most languages, and code editors usually, though not always, let you set your own color scheme. In the Google script editor comments are brown.

Figure 8. An example of syntax highlighting.

Declare three variables for our sheets:
sheet_work_page: the variable for the sheet called “working page”. Here we put the text to search through, and after the run we get the result here.
sheet_find_dubles: the variable for the sheet called “Search for duplicates”, the sheet whose data is checked for duplicates.
sheet_copy: the variable for the sheet called “Copy”, which receives a copy of the data before processing, since you never know when you will need the original.

<source lang="javascript">
// Handles to the three sheets, looked up by name in the active spreadsheet.
var sheet_work_page = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('working page');
var sheet_find_dubles = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Search for duplicates');
var sheet_copy = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Copy');
</source>

We also need some working variables:
var i, Page_Last_Row, k, archurls;

Page_Last_Row: the number of the last row of the sheet the variable is assigned from.
archurls: an array for storing and iterating through the values.

Before each new run we need to remove the content and comments from the “working page” and “Copy” sheets, since our script works with comments. This keeps data left over from previous runs out of the final result. For example, if you first worked with 1000 rows and then switched to 20 rows without deleting the previous result, you end up with 1000 rows, not 20. You can also do this by hand: delete the contents with the Delete key and the comments through the matching context menu command. Simply pressing Delete leaves the comments in place.
You can also use the keyboard:
Go to the second row and press Shift+Space to select the entire row, then Shift+Ctrl+Down Arrow to extend the selection to the end of the sheet, and choose "Delete row" from the context menu. You get a pristine blank sheet with a header.
<img src="http://habrastorage.org/storage3/c53/628/51b/c5362851ba5f57dbdaa809c5f0d75dc7.jpg"/>
Figure 9. Correctly removed comments.
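The effect of leftover rows can be reproduced without a spreadsheet. The sketch below is only a model: a plain array stands in for a sheet column, and writeRows is a hypothetical helper, not a Google API call.

```javascript
// Model of writing new rows over a column that may still hold old data.
// sheetRows: what the column holds now; newRows: what this run writes;
// clearFirst: whether the column is emptied before writing.
function writeRows(sheetRows, newRows, clearFirst) {
  var result = clearFirst ? [] : sheetRows.slice();
  for (var r = 0; r < newRows.length; r++) {
    result[r] = newRows[r]; // overwrite row by row from the top
  }
  return result;
}

var previousRun = ['a', 'b', 'c', 'd', 'e'];                     // result of an earlier, longer run
var withoutClearing = writeRows(previousRun, ['x', 'y'], false); // old rows 3..5 survive
var withClearing = writeRows(previousRun, ['x', 'y'], true);     // only the new rows remain
```

Without clearing, the three old rows stay below the two new ones, which is exactly the situation the cleaning step is meant to prevent.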

To remove anything, we need to know the boundaries of the range to be cleaned. This is needed, for example, to leave the document header untouched, or to affect only part of the data, and programming languages in general require you to know the working range.
For us this is the cell range “A2:D<last row number>”.
The last row number is the number of rows in the sheet “Search for duplicates”.
The range starts at cell A2 because the first row, with cell A1, is frozen and holds the sheet header. Our script does not work with the frozen area. The range A2:D<last row number> can also be written as (2, 1, last row number - 1, 4), that is: start row, start column, number of rows, number of columns. The result is the same; you can read more about this here.
<a href="http://habrahabr.ru/post/157933/">Russian</a>
<a href="http://webhostingw.com/google-spreadsheet-formulas/">English</a>
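The correspondence between the two notations can be sketched as a small helper in plain JavaScript. toA1Notation and columnLetter are hypothetical names introduced here for illustration; the arguments follow the order start row, start column, number of rows, number of columns.

```javascript
// Convert a 1-based column number to its letter(s): 1 -> "A", 4 -> "D", 27 -> "AA".
function columnLetter(column) {
  var letters = '';
  while (column > 0) {
    var remainder = (column - 1) % 26;
    letters = String.fromCharCode(65 + remainder) + letters;
    column = Math.floor((column - 1) / 26);
  }
  return letters;
}

// Build the A1-style text for a range given as (row, column, numRows, numColumns).
function toA1Notation(row, column, numRows, numColumns) {
  var lastRow = row + numRows - 1;
  var lastColumn = column + numColumns - 1;
  return columnLetter(column) + row + ':' + columnLetter(lastColumn) + lastRow;
}

// A sheet whose last row is 20 has 19 data rows below the header.
var a1 = toA1Notation(2, 1, 19, 4); // "A2:D20"
```
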
Now let's write the code for everything said above about removing comments and content; it is only 4 lines.

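<source lang="javascript">
// Clear contents and comments below the header on the working and backup sheets.
// Note: the range ends at the last row of the source sheet, so anything below that row is left as is.
sheet_work_page.getRange("A2:D" + sheet_find_dubles.getLastRow()).clearContent();
sheet_work_page.getRange("A2:D" + sheet_find_dubles.getLastRow()).clearComment();
sheet_copy.getRange("A2:D" + sheet_find_dubles.getLastRow()).clearContent();
sheet_copy.getRange("A2:D" + sheet_find_dubles.getLastRow()).clearComment();
</source>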

A little more detail:
sheet_work_page: the variable name we assigned to the sheet “working page”.
getRange(first row, first column, number of rows, number of columns): the range for further actions; it can also be given in A1 notation, as in the code above.
clearContent(): deletes the contents of the cells in the range.
clearComment(): removes the comments from the cells in the range.

Create a variable datatocopy, read the range of data cells A2:D<last row number> from the sheet “Search for duplicates”, and copy it to “working page” and “Copy”.
The peculiarity of this operation is that the copied data can only be written into a range of exactly the same size.
<source lang="javascript">
// Read the data rows (everything below the header) from the source sheet.
var datatocopy = sheet_find_dubles.getRange("A2:D" + sheet_find_dubles.getLastRow()).getValues();
// Write them to the working sheet; the target range has the same number of rows and 4 columns.
sheet_work_page.getRange(2, 1, sheet_find_dubles.getLastRow() - 1, 4).setValues(datatocopy);
// Read the data again and write it to the backup sheet.
var datatocopy1 = sheet_find_dubles.getRange("A2:D" + sheet_find_dubles.getLastRow()).getValues();
sheet_copy.getRange(2, 1, sheet_find_dubles.getLastRow() - 1, 4).setValues(datatocopy1);
</source>
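Because the written data must match the target range exactly, it can help to check the sizes before writing. Here is a minimal sketch in plain JavaScript; dimensionsMatch is a hypothetical helper, and the sample array stands in for what getValues() returns (an array of rows, each an array of cells).

```javascript
// Return true when a 2D array of values has exactly numRows rows
// and every row has exactly numColumns cells.
function dimensionsMatch(values, numRows, numColumns) {
  if (values.length !== numRows) {
    return false;
  }
  return values.every(function (row) {
    return row.length === numColumns;
  });
}

var sample = [
  ['http://a.example', '', '', ''],
  ['http://b.example', '', '', '']
];
var fitsTwoByFour = dimensionsMatch(sample, 2, 4);   // true: 2 rows of 4 cells
var fitsThreeByFour = dimensionsMatch(sample, 3, 4); // false: one row short
```
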

The variable Page_Last_Row is assigned the number of the last row of the sheet “Search for duplicates”.
<source lang="javascript">
Page_Last_Row = sheet_find_dubles.getLastRow();
</source>

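To the previously declared array archurls we assign all the data from the first column of the sheet “Search for duplicates”.
<source lang="javascript">
// Rows 2..Page_Last_Row of column A, returned as a 2D array with one cell per row.
archurls = sheet_find_dubles.getRange(2, 1, Page_Last_Row - 1, 1).getValues();
</source>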

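Then convert the 2-dimensional array into a 1-dimensional one. This is necessary because the search below uses indexOf, which compares the searched value with the array's elements, and in a 2-dimensional array each element is a whole row rather than a cell value.
<source lang="javascript">
for (i = 0; i < archurls.length; i++) // convert the 2D array into a 1D array
{
  archurls[i] = archurls[i][0]; // keep only the single cell of each row
}
</source>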
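Why the conversion matters can be seen in a minimal sketch in plain JavaScript. The sample addresses are invented for illustration; the 2D array has the same shape as what getValues() returns for a single column.

```javascript
// One column read from a sheet: every row is itself an array with one cell.
var column2d = [['http://a.example'], ['http://b.example'], ['http://a.example']];

// Searching the 2D array for a cell value finds nothing,
// because each element is an array, not a string.
var missing = column2d.indexOf('http://b.example'); // -1

// After flattening, the same search finds the value.
var column1d = column2d.map(function (row) {
  return row[0];
});
var found = column1d.indexOf('http://b.example'); // 1
```
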

To better understand the mechanism and form a correct picture, look at the screenshot of how the data looks in the archurls array (Figure 8).

Figure 8. How the array with its values looks from the inside.

The variable i is used here as the number of the row being checked on the sheet “working page”. We start from the 2nd row, because the 1st holds the column names:
i = 2;

Now we move on to the main loop that finds the duplicates. Declare a while loop (a loop whose condition is checked before each pass) and go row by row through the values of the sheet “working page”, comparing them with the values of the sheet “Search for duplicates”.
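<source lang="javascript">
var delete_count = 0; // number of cleared duplicate rows; must be initialised before the loop
while (i <= sheet_work_page.getLastRow())
{
  // Index of the first occurrence of the current value in the source column (-1 if absent).
  k = archurls.indexOf(sheet_work_page.getRange(i, 1).getValue());
  if (k >= 0)
  {
    if ((k + 2) == i)
    {
      // First occurrence (array index k is sheet row k + 2): keep the row and leave a comment.
      sheet_work_page.getRange(i, 4).setComment('Found match of the string ' + (k + 1) + ' archive');
      i++;
      continue;
    }
    else
    {
      // The value already occurred in an earlier row: clear this row.
      sheet_work_page.getRange("A" + i + ":C" + i).clearContent();
      i++;
      delete_count++;
      continue;
    }
  }
  // Value not found in the array: move on so the loop cannot hang on this row.
  i++;
}
</source>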
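The logic of the loop can be tried out without a spreadsheet. The sketch below is a simplified model under my own assumptions: a plain array stands in for column A, a cleared cell is represented by an empty string, and markDuplicates is a hypothetical name.

```javascript
// Keep the first occurrence of every value and blank out later repeats.
// Returns the resulting rows and the number of blanked rows.
function markDuplicates(values) {
  var rows = [];
  var deleted = 0;
  for (var i = 0; i < values.length; i++) {
    var k = values.indexOf(values[i]); // index of the first occurrence
    if (k === i) {
      rows.push(values[i]);            // this row is the first occurrence
    } else {
      rows.push('');                   // repeat of an earlier row
      deleted++;
    }
  }
  return { rows: rows, deleted: deleted };
}

var outcome = markDuplicates(['a', 'b', 'a', 'c', 'b']);
// outcome.rows is ['a', 'b', '', 'c', ''] and outcome.deleted is 2
```
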

As a result, the working page holds a list that is 100% free of duplicates.
Please note that if link A contains Cyrillic characters and looks identical to link B written in Latin characters, the two links are treated as different.
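A minimal sketch of this pitfall in plain JavaScript; the two addresses are invented for illustration and differ only in the last letter, which is Latin in one and Cyrillic in the other.

```javascript
// The last letter looks the same on screen but is a different character.
var latin = 'http://example.com/c';           // Latin small letter c
var cyrillic = 'http://example.com/\u0441';   // Cyrillic small letter es

var sameText = latin === cyrillic;                             // false
var latinCode = latin.charCodeAt(latin.length - 1);            // 99
var cyrillicCode = cyrillic.charCodeAt(cyrillic.length - 1);   // 1089
```
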

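Sort the result, if needed.
Description: Sheet.sort(ncol, true/false), where true sorts from A to Z and false from Z to A.
<source lang="javascript">
sheet_work_page.sort(3, true); // sort the working sheet by column 3 (C), ascending
</source>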
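What the two arguments do can be modelled on a plain 2D array. This is only a sketch: sortRows is a hypothetical helper written for illustration, with a 1-based column number to mirror the call above; it says nothing about how the spreadsheet sorts internally.

```javascript
// Return a sorted copy of rows, ordered by the given 1-based column.
// ascending = true gives A to Z, ascending = false gives Z to A.
function sortRows(rows, column, ascending) {
  return rows.slice().sort(function (a, b) {
    var left = a[column - 1];
    var right = b[column - 1];
    if (left === right) {
      return 0;
    }
    var order = left < right ? -1 : 1;
    return ascending ? order : -order;
  });
}

var table = [['b', 2], ['a', 1], ['c', 3]];
var aToZ = sortRows(table, 1, true);  // rows starting with a, b, c
var zToA = sortRows(table, 1, false); // rows starting with c, b, a
```
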

So that we don't have to open the editor to run the script, we add a separate item to the menu bar.
The code for the menu function is simple.
<source lang="javascript">
function onOpen()
{
  // Runs when the spreadsheet is opened and adds a custom menu with one item.
  SpreadsheetApp.getActiveSpreadsheet().addMenu("find duplicates?", [{name: "Remove duplicates!", functionName: "check_duplicates_one_sheet"}]);
}
</source>

This adds a menu called “find duplicates?” with the item “Remove duplicates!” and assigns to the item the function we wrote above, check_duplicates_one_sheet (Figure 9).


Figure 9

You can open the document described in this article.
Addition:
“But what if I simply want to remove duplicates from column A, quickly and with no fuss?”, you ask.
Well, let's write a faster script.

<source lang="javascript">
function removeDuplicates() {
  // Our sheet.
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Express cleaning of duplicates for one column");
  // Get all the data from the sheet as a 2D array.
  var data = sheet.getDataRange().getValues();
  // New array for the rows we keep.
  var newdata = [];
  // Check each row against the rows already kept.
  for (var i = 0; i < data.length; i++) {
    var row = data[i];
    var duplicate = false;
    for (var j = 0; j < newdata.length; j++) {
      if (row[0] == newdata[j][0]) {
        duplicate = true;
        break;
      }
    }
    // Keep the row only if its first cell has not been seen yet.
    if (!duplicate) {
      newdata.push(row);
    }
  }
  // Clear the contents of the source sheet.
  sheet.clearContents();
  // Write the cleaned array back (newdata, spelled consistently with its declaration).
  sheet.getRange(1, 1, newdata.length, newdata[0].length).setValues(newdata);
  // Sort ascending by column A.
  sheet.sort(1, true);
}
</source>
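For large sheets, the inner loop over the kept rows can be replaced by a lookup table, so each row is checked only once. This is a sketch of that alternative technique, not the script from the article: uniqueByFirstColumn is a hypothetical helper that works on a plain 2D array such as the one getValues() returns, and unlike the == comparison above it treats the number 1 and the text "1" as different values.

```javascript
// Keep only the first row for every distinct value in the first column.
function uniqueByFirstColumn(rows) {
  var seen = {};   // lookup table of first-column values already kept
  var unique = [];
  for (var i = 0; i < rows.length; i++) {
    // Prefix the key with the type so that 1 and "1" do not collide.
    var key = typeof rows[i][0] + ':' + rows[i][0];
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true;
      unique.push(rows[i]);
    }
  }
  return unique;
}

var cleaned = uniqueByFirstColumn([['a', 1], ['b', 2], ['a', 3]]);
// cleaned is [['a', 1], ['b', 2]]
```
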


As a result, we have a quick script for removing duplicates.
Tested on 400,000 records.

Until we meet again! Your obedient servant.

You can also read our previous articles:
The Talmud by the formulas in Google SpreadSheet
Sending letters in Google Docs (Drive)
Stay tuned for our next releases.
Article based on information from habrahabr.ru
